dSalmon.trees

Indexing structures for fast stream data processing.

Classes

`MTree`([metric, metric_params, float_type, ...])	M-Tree efficient nearest-neighbor search in metric spaces.
`StatisticsTree`(window[, what, quantiles, ...])	Indexing structure for computing per-dimension statistics in a sliding window.

class dSalmon.trees.MTree(metric='euclidean', metric_params=None, float_type=<class 'numpy.float64'>, min_node_size=5, max_node_size=100, split_sampling=20, insert_jobs=2, query_jobs=-1, **kwargs)[source]

M-Tree efficient nearest-neighbor search in metric spaces.

A point within a tree can be accessed either via tree[k] using the point’s key k, or via tree.ix[i] using the point’s index i. Keys can be arbitrary integers and are returned by the insert, knn and neighbors functions. Indices are integers in the range 0…len(tree), sorted according to the points’ keys in ascending order.

Parameters

metric (string) – Which distance metric to use. Currently supported metrics include ‘chebyshev’, ‘cityblock’, ‘euclidean’ and ‘minkowsi’.
metric_params (dict) – Parameters passed to the metric. Minkowsi distance requires setting an integer p parameter.
float_type (np.float32 or np.float64) – Which floating point type to use for internal processing.
min_node_size (int) – The minimum number of children in tree nodes. Different parametrizations for min_node_size are guaranteed to return identical results.
max_node_size (int) – The maximum number of children in tree nodes. Different parametrizations for max_node_size are guaranteed to return identical results.
split_sampling (int) – The number of combinations to try when splitting a node. Different parametrizations for split_sampling are guaranteed to return identical results.
insert_jobs (int) – The number of additional threads to spawn for tree insertions. Since insertions can only partially be parallelized, using too many threads can harm performance.
query_jobs (int) – The number of threads to use for range- and knn-queries.

clear()[source]: Remove all points from the tree.

copy()[source]: Return a copy of the tree.

property dimension

get_points(keys)[source]

Retrieve points by key, skipping non-existent entries.

Parameters

keys (ndarray, shape (n_samples,)) – Keys for points to query as returned by insert().

Returns

points (ndarray, shape (n_samples, n_features)) – Coordinates of queried points or all-zero vectors if points were not found in the tree.
found (ndarray, shape (n_samples,)) – Whether the respective keys were found in the tree.

insert(data)[source]

Insert points and return indices.

Parameters: data (ndarray, shape (n_samples, n_features)) – The data to be inserted.
Returns: indices – The indices assigned to the newly inserted data points.
Return type: ndarray, shape (n_samples,)

itok(indices=None)[source]

Map indices to keys.

Parameters: indices (ndarray or slice, optional) – Indices or slice for which to return keys. If None, all keys are returned.
Returns: keys – The requested keys as numpy array.
Return type: ndarray

knn(data, k=1, sort=True, min_radius=0, max_radius=inf, reverse=False, extend_for_ties=False)[source]

Find the k nearest neighbors of points.

Parameters

data (ndarray, shape (n_samples, n_features)) – Points for which to perform a knn query.
k (int) – Number of nearest neighbors to consider.
sort (bool) – Whether the returned points should be sorted by distance.
min_radius (double) – Minimum distance for returned neighbor points.
max_radius (double) – Maximum distance for returned neighbor points.
reverse (bool) – If reverse == True, return the k most distant points instead of the k nearest neighbors.
extend_for_ties (bool) – Whether in the case of ties more than k points should be returned.

Returns

keys (ndarray, shape (n_total_neighbors,)) – Concatenation of keys of found neighbors.
distances (ndarray, shape (n_total_neighbors,)) – Concatenation of distances of neighbors to the respective query points.
lengths (ndarray, shape (n_samples,)) – The number of neighbors returned for each point, so that sum(length) == n_total_neighbors.

ktoi(keys)[source]

Map keys to indices.

Parameters: keys (ndarray) – Keys for which to return indices.
Returns: indices – The requested indices as numpy array.
Return type: ndarray

neighbors(data, radius)[source]

Return all points within a given radius.

Parameters

data (ndarray, shape (n_samples, n_features)) – Points for which the range query should be performed.
radius (double) – Radius for the search.

Returns

keys (ndarray, shape (n_total_neighbors,)) – Concatenation of keys of neighbors within radius.
distances (ndarray, shape (n_total_neighbors,)) – Concatenation of distances of neighbors to the respective query points.
lengths (ndarray, shape (n_samples,)) – The number of neighbors returned for each point, so that sum(length) == n_total_neighbors.

remove(keys)[source]

Remove points identified by keys, skipping non-existent entries.

Parameters: keys (ndarray, shape (n_samples,)) – Indices of the data points to be removed.
Returns: found – Boolean array indicating whether the removal was successful.
Return type: ndarray, shape (n_samples,)

class dSalmon.trees.StatisticsTree(window, what=[], quantiles=[], float_type=<class 'numpy.float64'>)[source]

Indexing structure for computing per-dimension statistics in a sliding window.

This implementation relies on an order statistic tree provided by Boost for achieving O(log(window)) time complexity for quantile computation.

Parameters

window (float) – Window length after which samples will be pruned.
what (list of strings, optional) – Which statistics to compute. Elements of what can be one of ‘sum’, ‘average’, ‘squares_sum’, ‘variance’, ‘min’, ‘max’ or ‘median’.
quantiles (list of floats, optional) – Quantile values to compute in addition to statistics in what. Elements should be floats in [0,1].
float_type (np.float32 or np.float64) – The floating point type to use for internal processing.

fit_query(X, times=None)[source]

Process next chunk of data.

Parameters

X (ndarray, shape (n_samples, n_features)) – The input data.
times (ndarray, shape (n_samples,), optional) – Timestamps for input data. If None, timestamps are linearly increased for each sample.

Returns

statistics (ndarray, shape (n_samples, n_statistics, n_features)) – The computed statistics. Statistics for row i are evaluated after adding row i to the sliding window. Here, n_statistics = len(what) + len(quantiles).
counts (ndarray, shape (n_samples)) – The lengths of the sliding window after processing each row of X.