dSalmon.trees
Indexing structures for fast stream data processing.
Classes
|
M-Tree efficient nearest-neighbor search in metric spaces. |
|
Indexing structure for computing per-dimension statistics in a sliding window. |
- class dSalmon.trees.MTree(metric='euclidean', metric_params=None, float_type=<class 'numpy.float64'>, min_node_size=5, max_node_size=100, split_sampling=20, insert_jobs=2, query_jobs=-1, **kwargs)[source]
M-Tree efficient nearest-neighbor search in metric spaces.
A point within a tree can be accessed either via tree[k] using the point’s key k, or via tree.ix[i] using the point’s index i. Keys can be arbitrary integers and are returned by the insert, knn and neighbors functions. Indices are integers in the range 0…len(tree), sorted according to the points’ keys in ascending order.
- Parameters
metric (string) – Which distance metric to use. Currently supported metrics include ‘chebyshev’, ‘cityblock’, ‘euclidean’ and ‘minkowsi’.
metric_params (dict) – Parameters passed to the metric. Minkowsi distance requires setting an integer p parameter.
float_type (np.float32 or np.float64) – Which floating point type to use for internal processing.
min_node_size (int) – The minimum number of children in tree nodes. Different parametrizations for min_node_size are guaranteed to return identical results.
max_node_size (int) – The maximum number of children in tree nodes. Different parametrizations for max_node_size are guaranteed to return identical results.
split_sampling (int) – The number of combinations to try when splitting a node. Different parametrizations for split_sampling are guaranteed to return identical results.
insert_jobs (int) – The number of additional threads to spawn for tree insertions. Since insertions can only partially be parallelized, using too many threads can harm performance.
query_jobs (int) – The number of threads to use for range- and knn-queries.
- property dimension
- get_points(keys)[source]
Retrieve points by key, skipping non-existent entries.
- Parameters
keys (ndarray, shape (n_samples,)) – Keys for points to query as returned by insert().
- Returns
points (ndarray, shape (n_samples, n_features)) – Coordinates of queried points or all-zero vectors if points were not found in the tree.
found (ndarray, shape (n_samples,)) – Whether the respective keys were found in the tree.
- insert(data)[source]
Insert points and return indices.
- Parameters
data (ndarray, shape (n_samples, n_features)) – The data to be inserted.
- Returns
indices – The indices assigned to the newly inserted data points.
- Return type
ndarray, shape (n_samples,)
- itok(indices=None)[source]
Map indices to keys.
- Parameters
indices (ndarray or slice, optional) – Indices or slice for which to return keys. If None, all keys are returned.
- Returns
keys – The requested keys as numpy array.
- Return type
ndarray
- knn(data, k=1, sort=True, min_radius=0, max_radius=inf, reverse=False, extend_for_ties=False)[source]
Find the k nearest neighbors of points.
- Parameters
data (ndarray, shape (n_samples, n_features)) – Points for which to perform a knn query.
k (int) – Number of nearest neighbors to consider.
sort (bool) – Whether the returned points should be sorted by distance.
min_radius (double) – Minimum distance for returned neighbor points.
max_radius (double) – Maximum distance for returned neighbor points.
reverse (bool) – If reverse == True, return the k most distant points instead of the k nearest neighbors.
extend_for_ties (bool) – Whether in the case of ties more than k points should be returned.
- Returns
keys (ndarray, shape (n_total_neighbors,)) – Concatenation of keys of found neighbors.
distances (ndarray, shape (n_total_neighbors,)) – Concatenation of distances of neighbors to the respective query points.
lengths (ndarray, shape (n_samples,)) – The number of neighbors returned for each point, so that sum(length) == n_total_neighbors.
- ktoi(keys)[source]
Map keys to indices.
- Parameters
keys (ndarray) – Keys for which to return indices.
- Returns
indices – The requested indices as numpy array.
- Return type
ndarray
- neighbors(data, radius)[source]
Return all points within a given radius.
- Parameters
data (ndarray, shape (n_samples, n_features)) – Points for which the range query should be performed.
radius (double) – Radius for the search.
- Returns
keys (ndarray, shape (n_total_neighbors,)) – Concatenation of keys of neighbors within radius.
distances (ndarray, shape (n_total_neighbors,)) – Concatenation of distances of neighbors to the respective query points.
lengths (ndarray, shape (n_samples,)) – The number of neighbors returned for each point, so that sum(length) == n_total_neighbors.
- class dSalmon.trees.StatisticsTree(window, what=[], quantiles=[], float_type=<class 'numpy.float64'>)[source]
Indexing structure for computing per-dimension statistics in a sliding window.
This implementation relies on an order statistic tree provided by Boost for achieving O(log(window)) time complexity for quantile computation.
- Parameters
window (float) – Window length after which samples will be pruned.
what (list of strings, optional) – Which statistics to compute. Elements of what can be one of ‘sum’, ‘average’, ‘squares_sum’, ‘variance’, ‘min’, ‘max’ or ‘median’.
quantiles (list of floats, optional) – Quantile values to compute in addition to statistics in what. Elements should be floats in [0,1].
float_type (np.float32 or np.float64) – The floating point type to use for internal processing.
- fit_query(X, times=None)[source]
Process next chunk of data.
- Parameters
X (ndarray, shape (n_samples, n_features)) – The input data.
times (ndarray, shape (n_samples,), optional) – Timestamps for input data. If None, timestamps are linearly increased for each sample.
- Returns
statistics (ndarray, shape (n_samples, n_statistics, n_features)) – The computed statistics. Statistics for row i are evaluated after adding row i to the sliding window. Here, n_statistics = len(what) + len(quantiles).
counts (ndarray, shape (n_samples)) – The lengths of the sliding window after processing each row of X.