dSalmon.trees

Indexing structures for fast stream data processing.

Classes

MTree([metric, metric_params, float_type, ...])

M-Tree efficient nearest-neighbor search in metric spaces.

StatisticsTree(window[, what, quantiles, ...])

Indexing structure for computing per-dimension statistics in a sliding window.

class dSalmon.trees.MTree(metric='euclidean', metric_params=None, float_type=<class 'numpy.float64'>, min_node_size=5, max_node_size=100, split_sampling=20, insert_jobs=2, query_jobs=-1, **kwargs)[source]

M-Tree efficient nearest-neighbor search in metric spaces.

A point within a tree can be accessed either via tree[k] using the point’s key k, or via tree.ix[i] using the point’s index i. Keys can be arbitrary integers and are returned by the insert, knn and neighbors functions. Indices are integers in the range 0…len(tree), sorted according to the points’ keys in ascending order.

Parameters
  • metric (string) – Which distance metric to use. Currently supported metrics include ‘chebyshev’, ‘cityblock’, ‘euclidean’ and ‘minkowsi’.

  • metric_params (dict) – Parameters passed to the metric. Minkowsi distance requires setting an integer p parameter.

  • float_type (np.float32 or np.float64) – Which floating point type to use for internal processing.

  • min_node_size (int) – The minimum number of children in tree nodes. Different parametrizations for min_node_size are guaranteed to return identical results.

  • max_node_size (int) – The maximum number of children in tree nodes. Different parametrizations for max_node_size are guaranteed to return identical results.

  • split_sampling (int) – The number of combinations to try when splitting a node. Different parametrizations for split_sampling are guaranteed to return identical results.

  • insert_jobs (int) – The number of additional threads to spawn for tree insertions. Since insertions can only partially be parallelized, using too many threads can harm performance.

  • query_jobs (int) – The number of threads to use for range- and knn-queries.

clear()[source]

Remove all points from the tree.

copy()[source]

Return a copy of the tree.

property dimension
get_points(keys)[source]

Retrieve points by key, skipping non-existent entries.

Parameters

keys (ndarray, shape (n_samples,)) – Keys for points to query as returned by insert().

Returns

  • points (ndarray, shape (n_samples, n_features)) – Coordinates of queried points or all-zero vectors if points were not found in the tree.

  • found (ndarray, shape (n_samples,)) – Whether the respective keys were found in the tree.

insert(data)[source]

Insert points and return indices.

Parameters

data (ndarray, shape (n_samples, n_features)) – The data to be inserted.

Returns

indices – The indices assigned to the newly inserted data points.

Return type

ndarray, shape (n_samples,)

itok(indices=None)[source]

Map indices to keys.

Parameters

indices (ndarray or slice, optional) – Indices or slice for which to return keys. If None, all keys are returned.

Returns

keys – The requested keys as numpy array.

Return type

ndarray

knn(data, k=1, sort=True, min_radius=0, max_radius=inf, reverse=False, extend_for_ties=False)[source]

Find the k nearest neighbors of points.

Parameters
  • data (ndarray, shape (n_samples, n_features)) – Points for which to perform a knn query.

  • k (int) – Number of nearest neighbors to consider.

  • sort (bool) – Whether the returned points should be sorted by distance.

  • min_radius (double) – Minimum distance for returned neighbor points.

  • max_radius (double) – Maximum distance for returned neighbor points.

  • reverse (bool) – If reverse == True, return the k most distant points instead of the k nearest neighbors.

  • extend_for_ties (bool) – Whether in the case of ties more than k points should be returned.

Returns

  • keys (ndarray, shape (n_total_neighbors,)) – Concatenation of keys of found neighbors.

  • distances (ndarray, shape (n_total_neighbors,)) – Concatenation of distances of neighbors to the respective query points.

  • lengths (ndarray, shape (n_samples,)) – The number of neighbors returned for each point, so that sum(length) == n_total_neighbors.

ktoi(keys)[source]

Map keys to indices.

Parameters

keys (ndarray) – Keys for which to return indices.

Returns

indices – The requested indices as numpy array.

Return type

ndarray

neighbors(data, radius)[source]

Return all points within a given radius.

Parameters
  • data (ndarray, shape (n_samples, n_features)) – Points for which the range query should be performed.

  • radius (double) – Radius for the search.

Returns

  • keys (ndarray, shape (n_total_neighbors,)) – Concatenation of keys of neighbors within radius.

  • distances (ndarray, shape (n_total_neighbors,)) – Concatenation of distances of neighbors to the respective query points.

  • lengths (ndarray, shape (n_samples,)) – The number of neighbors returned for each point, so that sum(length) == n_total_neighbors.

remove(keys)[source]

Remove points identified by keys, skipping non-existent entries.

Parameters

keys (ndarray, shape (n_samples,)) – Indices of the data points to be removed.

Returns

found – Boolean array indicating whether the removal was successful.

Return type

ndarray, shape (n_samples,)

class dSalmon.trees.StatisticsTree(window, what=[], quantiles=[], float_type=<class 'numpy.float64'>)[source]

Indexing structure for computing per-dimension statistics in a sliding window.

This implementation relies on an order statistic tree provided by Boost for achieving O(log(window)) time complexity for quantile computation.

Parameters
  • window (float) – Window length after which samples will be pruned.

  • what (list of strings, optional) – Which statistics to compute. Elements of what can be one of ‘sum’, ‘average’, ‘squares_sum’, ‘variance’, ‘min’, ‘max’ or ‘median’.

  • quantiles (list of floats, optional) – Quantile values to compute in addition to statistics in what. Elements should be floats in [0,1].

  • float_type (np.float32 or np.float64) – The floating point type to use for internal processing.

fit_query(X, times=None)[source]

Process next chunk of data.

Parameters
  • X (ndarray, shape (n_samples, n_features)) – The input data.

  • times (ndarray, shape (n_samples,), optional) – Timestamps for input data. If None, timestamps are linearly increased for each sample.

Returns

  • statistics (ndarray, shape (n_samples, n_statistics, n_features)) – The computed statistics. Statistics for row i are evaluated after adding row i to the sliding window. Here, n_statistics = len(what) + len(quantiles).

  • counts (ndarray, shape (n_samples)) – The lengths of the sliding window after processing each row of X.