dSalmon.outlier
Streaming outlier detection models.
Classes
HSTrees – Streaming Half-Space Trees [TTL11].
LODA – LODA [Pevny16].
OutlierDetector – Base class for outlier detectors.
RSHash – RS-Hash [SA16].
SDOstream – Streaming outlier detection based on Sparse Data Observers [HIZ20].
SWDBOR – Distance based outlier detection by radius.
SWKNN – Distance based outlier detection by k nearest neighbors.
SWLOF – Local Outlier Factor [BKNS00] within a sliding window.
SWRRCT – Robust Random Cut Forest [GMRS16].
xStream – xStream [MLA18].
- class dSalmon.outlier.HSTrees(window, n_estimators, max_depth, size_limit=None, float_type=<class 'numpy.float64'>, seed=0, n_jobs=-1)[source]
Streaming Half-Space Trees [TTL11].
- Parameters
window (float) – Window length after which samples will be pruned.
n_estimators (int) – The number of trees in the ensemble.
max_depth (int) – The depth of each individual tree.
size_limit (int, optional) – The maximum size of nodes to consider for outlier scoring. If None, defaults to 0.1*window, as described in the corresponding paper.
float_type (np.float32 or np.float64) – The floating point type to use for internal processing.
seed (int) – Random seed for tree construction.
n_jobs (int) – Number of threads to use for processing trees. Pass -1 to use as many jobs as there are CPU cores.
- fit(X, times=None)
Process next chunk of data without returning outlier scores.
- Parameters
X (ndarray, shape (n_samples, n_features)) – The input data.
times (ndarray, shape (n_samples,), optional) – Timestamps for input data. If None, timestamps are linearly increased for each sample.
- fit_predict(X, times=None)[source]
Process next chunk of data.
- Parameters
X (ndarray, shape (n_samples, n_features)) – The input data.
times (ndarray, shape (n_samples,), optional) – Timestamps for input data. If None, timestamps are linearly increased for each sample.
- Returns
y – Outlier scores for provided input data.
- Return type
ndarray, shape (n_samples,)
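How window and times interact can be pictured with a small pure-Python sketch. SlidingWindow and add_chunk are hypothetical names, not part of dSalmon, and both the default timestamps (1, 2, 3, ... continuing across chunks) and the expiry rule (a sample is dropped once its age reaches window) are assumptions made for illustration.

```python
from collections import deque


class SlidingWindow:
    """Toy sliding window keyed by timestamps (hypothetical helper, not dSalmon code)."""

    def __init__(self, window):
        self.window = window
        self.samples = deque()   # (timestamp, sample) pairs, oldest first
        self.last_time = 0       # timestamp of the most recent sample

    def add_chunk(self, X, times=None):
        if times is None:
            # Assumed default: keep counting 1, 2, 3, ... across chunks.
            times = [self.last_time + i + 1 for i in range(len(X))]
        for t, x in zip(times, X):
            self.samples.append((t, x))
            self.last_time = t
            # Assumed expiry rule: drop samples whose age has reached `window`.
            while self.samples and self.samples[0][0] <= t - self.window:
                self.samples.popleft()
        return [x for _, x in self.samples]


win = SlidingWindow(window=3)
win.add_chunk([[0.1], [0.2]])
kept = win.add_chunk([[0.3], [0.4], [0.5]])
```

After the second chunk only the three most recent samples remain, because the two samples from the first chunk have aged out of the window.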
- get_params(deep=True)
Return the algorithm parameters as a dictionary.
- Parameters
deep (bool, default=True) – Ignored. Only for compatibility with scikit-learn.
- Returns
params – Dictionary of parameters.
- Return type
dict
- set_params(**params)
Reset the model and set the parameters according to the supplied dictionary.
- Parameters
**params (dict) – Dictionary of parameters.
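The get_params/set_params pair follows the scikit-learn convention. The sketch below shows that convention on a stand-in class; ToyDetector, its parameters and its n_seen counter are hypothetical, and returning self from set_params is an assumption borrowed from scikit-learn rather than something this page states.

```python
class ToyDetector:
    """Hypothetical detector following the get_params/set_params convention."""

    def __init__(self, window, n_estimators=10):
        self.params = {"window": window, "n_estimators": n_estimators}
        self._reset()

    def _reset(self):
        # Discard everything learned so far.
        self.n_seen = 0

    def fit(self, X):
        self.n_seen += len(X)
        return self

    def get_params(self, deep=True):
        # `deep` is accepted for scikit-learn compatibility and ignored.
        return dict(self.params)

    def set_params(self, **params):
        unknown = set(params) - set(self.params)
        if unknown:
            raise ValueError(f"unknown parameters: {sorted(unknown)}")
        self.params.update(params)
        self._reset()   # changing parameters resets the model
        return self


det = ToyDetector(window=100).fit([[0.0], [1.0]])
det.set_params(window=50)
```

The point to notice is the reset: after set_params the toy model has forgotten the two samples it had processed.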
- class dSalmon.outlier.LODA(window, n_projections=None, n_bins=10, float_type=<class 'numpy.float64'>, seed=0, n_jobs=-1)[source]
LODA [Pevny16].
This detector performs outlier detection based on equi-depth histograms. If random projections are used, this corresponds to the LODA algorithm; otherwise, behaviour corresponds to a sliding-window adaptation of the HBOS [GD12] algorithm.
- Parameters
window (float) – Window length after which samples will be pruned.
n_projections (int, optional) – The number of random projections to use. If None, random projections are skipped.
n_bins (int) – The number of histogram bins.
float_type (np.float32 or np.float64) – The floating point type to use for internal processing.
seed (int) – Seed for random projections.
n_jobs (int) – Number of threads to use for processing. Pass -1 to use as many jobs as there are CPU cores.
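To make the term equi-depth histogram concrete, here is a minimal one-dimensional sketch without random projections. The function names, the nearest-rank quantile rule and the negative log-density score are illustrative assumptions, not dSalmon's implementation.

```python
import bisect
import math


def equi_depth_edges(values, n_bins):
    """Bin edges chosen so that each bin holds roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    # Edge i sits at the i/n_bins quantile (nearest rank, an illustrative choice).
    return [ordered[min(n - 1, (i * n) // n_bins)] for i in range(n_bins + 1)]


def histogram_score(value, edges):
    """Negative log-density of `value`; larger means more outlying."""
    n_bins = len(edges) - 1
    # Locate the bin and clamp values outside the observed range to the outer bins.
    idx = min(max(bisect.bisect_right(edges, value) - 1, 0), n_bins - 1)
    width = max(edges[idx + 1] - edges[idx], 1e-12)   # guard against zero-width bins
    density = (1.0 / n_bins) / width                  # each bin carries mass 1/n_bins
    return -math.log(density)


window = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 10.0]
edges = equi_depth_edges(window, n_bins=5)
dense_score = histogram_score(0.35, edges)
sparse_score = histogram_score(5.0, edges)
```

Because every bin holds the same share of the data, wide bins mean low density, so the value 5.0 in the stretched last bin scores higher than 0.35 in a narrow one. With projections, one such histogram would be kept per projected coordinate.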
- fit(X, times=None)[source]
Process next chunk of data without returning outlier scores.
- Parameters
X (ndarray, shape (n_samples, n_features)) – The input data.
times (ndarray, shape (n_samples,), optional) – Timestamps for input data. If None, timestamps are linearly increased for each sample.
- fit_predict(X, times=None)[source]
Process next chunk of data.
- Parameters
X (ndarray, shape (n_samples, n_features)) – The input data.
times (ndarray, shape (n_samples,), optional) – Timestamps for input data. If None, timestamps are linearly increased for each sample.
- Returns
y – Outlier scores for provided input data.
- Return type
ndarray, shape (n_samples,)
- get_params(deep=True)
Return the algorithm parameters as a dictionary.
- Parameters
deep (bool, default=True) – Ignored. Only for compatibility with scikit-learn.
- Returns
params – Dictionary of parameters.
- Return type
dict
- get_window()[source]
Return samples in the current window.
- Returns
data (ndarray, shape (n_samples, n_features)) – Samples in the current window. If n_projections is set, returns the projected data samples.
times (ndarray, shape (n_samples,)) – Expiry times of samples in the current window.
- set_params(**params)
Reset the model and set the parameters according to the supplied dictionary.
- Parameters
**params (dict) – Dictionary of parameters.
- class dSalmon.outlier.OutlierDetector[source]
Base class for outlier detectors.
- fit(X, times=None)[source]
Process next chunk of data without returning outlier scores.
- Parameters
X (ndarray, shape (n_samples, n_features)) – The input data.
times (ndarray, shape (n_samples,), optional) – Timestamps for input data. If None, timestamps are linearly increased for each sample.
- class dSalmon.outlier.RSHash(n_estimators, window, cms_w, cms_d, s_param=None, float_type=<class 'numpy.float64'>, seed=0, n_jobs=-1)[source]
RS-Hash [SA16].
This outlier detector assumes that features are normalized to a [0,1] range.
- Parameters
n_estimators (int) – Number of estimators in the ensemble.
window (float) – Window length after which samples will be pruned.
cms_w (int) – Number of hash functions per estimator for the count-min sketch.
cms_d (int) – Number of bins for the count-min sketch.
s_param (int, optional) – The s parameter of RS-Hash, which should be an estimate of the number of samples in a sliding window. If None, the value of window will be used for s_param, assuming that samples arrive with an inter-arrival time of 1.
float_type (np.float32 or np.float64) – The floating point type to use for internal processing.
seed (int) – Random seed to use.
n_jobs (int) – Number of threads to use for processing. Pass -1 to use as many jobs as there are CPU cores.
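The cms_w and cms_d parameters describe a count-min sketch. The following is a generic sketch of that data structure using this page's naming (cms_w hash functions, cms_d bins each); the CountMinSketch class and the salted SHA-256 hashing are assumptions for illustration and say nothing about how dSalmon hashes internally.

```python
import hashlib


class CountMinSketch:
    """Toy count-min sketch with cms_w hash functions and cms_d bins per function."""

    def __init__(self, cms_w, cms_d):
        self.cms_w = cms_w
        self.cms_d = cms_d
        self.table = [[0] * cms_d for _ in range(cms_w)]

    def _bins(self, key):
        # One bin index per hash function, derived from a salted SHA-256 digest.
        for row in range(self.cms_w):
            digest = hashlib.sha256(f"{row}:{key!r}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.cms_d

    def add(self, key, count=1):
        for row, col in self._bins(key):
            self.table[row][col] += count

    def estimate(self, key):
        # Collisions can only inflate counters, so the minimum is the tightest estimate.
        return min(self.table[row][col] for row, col in self._bins(key))


cms = CountMinSketch(cms_w=4, cms_d=64)
for key in ["a", "b", "a", "c", "a"]:
    cms.add(key)
```

Estimates never fall below the true count; more bins reduce collisions, and more hash functions reduce the chance that every counter for a key is inflated.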
- fit(X, times=None)
Process next chunk of data without returning outlier scores.
- Parameters
X (ndarray, shape (n_samples, n_features)) – The input data.
times (ndarray, shape (n_samples,), optional) – Timestamps for input data. If None, timestamps are linearly increased for each sample.
- fit_predict(X, times=None)[source]
Process next chunk of data. Data in X is assumed to be normalized to [0,1].
- Parameters
X (ndarray, shape (n_samples, n_features)) – The input data.
times (ndarray, shape (n_samples,), optional) – Timestamps for input data. If None, timestamps are linearly increased for each sample.
- Returns
y – Outlier scores for provided input data.
- Return type
ndarray, shape (n_samples,)
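Since the data is assumed to be normalized to [0,1], a preprocessing step along the following lines may be needed. This is a sketch under the assumption that per-feature bounds are known in advance (for example from domain knowledge or an initial sample); min_max_scale is a hypothetical helper, not a dSalmon function.

```python
def min_max_scale(X, lower, upper):
    """Scale each feature to [0, 1] using fixed per-feature bounds, clipping overshoots."""
    scaled = []
    for row in X:
        out = []
        for value, lo, hi in zip(row, lower, upper):
            span = hi - lo
            z = (value - lo) / span if span > 0 else 0.0
            out.append(min(1.0, max(0.0, z)))   # clip values outside the known range
        scaled.append(out)
    return scaled


# Assumed bounds: feature 0 lies in [0, 10], feature 1 in [10, 30].
X_scaled = min_max_scale(
    [[5.0, 20.0], [15.0, 10.0]], lower=[0.0, 10.0], upper=[10.0, 30.0]
)
```

Fixed bounds are used here because the true minimum and maximum of a stream are not known while it is still arriving.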
- get_params(deep=True)
Return the algorithm parameters as a dictionary.
- Parameters
deep (bool, default=True) – Ignored. Only for compatibility with scikit-learn.
- Returns
params – Dictionary of parameters.
- Return type
dict
- get_window()[source]
Return samples in the current window.
- Returns
data (ndarray, shape (n_samples, n_features)) – Samples in the current window.
times (ndarray, shape (n_samples,)) – Expiry times of samples in the current window.
- set_params(**params)
Reset the model and set the parameters according to the supplied dictionary.
- Parameters
**params (dict) – Dictionary of parameters.
- class dSalmon.outlier.SDOstream(k, T, qv=0.3, x=6, metric='euclidean', metric_params=None, float_type=<class 'numpy.float64'>, seed=0, return_sampling=False)[source]
Streaming outlier detection based on Sparse Data Observers [HIZ20].
- Parameters
k (int) – Number of observers to use.
T (int) – Characteristic time for the model. Increasing T makes the model adapt more slowly; decreasing T makes it adapt more quickly.
qv (float, optional (default=0.3)) – Ratio of unused observers due to model cleaning.
x (int (default=6)) – Number of nearest observers to consider for outlier scoring and model cleaning.
metric (string) – Which distance metric to use. Currently supported metrics include ‘chebyshev’, ‘cityblock’, ‘euclidean’ and ‘minkowski’.
metric_params (dict) – Parameters passed to the metric. Minkowski distance requires setting an integer p parameter.
float_type (np.float32 or np.float64) – The floating point type to use for internal processing.
seed (int (default=0)) – Random seed to use.
return_sampling (bool (default=False)) – Also return whether a data point was adopted as observer.
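For reference, the four supported metrics can be written out in a few lines of plain Python. The distance function below is a hypothetical helper that mirrors the metric and metric_params arguments; it shows the textbook definitions and is not dSalmon's code.

```python
def distance(a, b, metric="euclidean", metric_params=None):
    """Distance between two equal-length vectors under the four documented metrics."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if metric == "chebyshev":
        return max(diffs)                        # largest coordinate difference
    if metric == "cityblock":
        return sum(diffs)                        # sum of coordinate differences
    if metric == "euclidean":
        return sum(d * d for d in diffs) ** 0.5
    if metric == "minkowski":
        p = (metric_params or {})["p"]           # Minkowski needs an explicit p
        return sum(d ** p for d in diffs) ** (1.0 / p)
    raise ValueError(f"unsupported metric: {metric}")


a, b = [0.0, 0.0], [3.0, 4.0]
```

Minkowski distance with p=1 reduces to cityblock and with p=2 to euclidean, which is why only this metric needs an extra parameter.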
- fit(X, times=None)
Process next chunk of data without returning outlier scores.
- Parameters
X (ndarray, shape (n_samples, n_features)) – The input data.
times (ndarray, shape (n_samples,), optional) – Timestamps for input data. If None, timestamps are linearly increased for each sample.
- fit_predict(X, times=None)[source]
Process next chunk of data.
- Parameters
X (ndarray, shape (n_samples, n_features)) – The input data.
times (ndarray, shape (n_samples,), optional) – Timestamps for input data. If None, timestamps are linearly increased for each sample.
- Returns
y – Outlier scores for provided input data.
- Return type
ndarray, shape (n_samples,)
- get_observers(time=None)[source]
Return observer data.
- Returns
data (ndarray, shape (n_observers, n_features)) – Samples used as observers.
observations (ndarray, shape (n_observers,)) – Exponential moving average of observations.
av_observations (ndarray, shape (n_observers,)) – Exponential moving average of observations normalized according to the theoretical maximum.
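The relation between observations and av_observations can be illustrated with a time-decayed counter. The sketch assumes a fading factor of exp(-dt/T) and a normalization by the count an observer would reach if it were hit at every unit time step; both are assumptions about the general idea, and the function names are hypothetical.

```python
import math


def update_observations(observations, hit, dt, T):
    """Decay the running count over `dt` time units, then add the new observation."""
    fading = math.exp(-dt / T)          # assumed fading factor for elapsed time dt
    return observations * fading + (1.0 if hit else 0.0)


def max_observations(age, T):
    """Count reached if an observer had been hit at every unit step for `age` steps."""
    f = math.exp(-1.0 / T)
    return (1.0 - f ** age) / (1.0 - f)  # geometric series 1 + f + ... + f**(age - 1)


T = 50
obs = 0.0
for step in range(200):
    # The observer is hit on every second step.
    obs = update_observations(obs, hit=(step % 2 == 0), dt=1, T=T)
normalized = obs / max_observations(age=200, T=T)
```

An observer hit on every second step ends up with a normalized value near 0.5, which makes values comparable between old and recently adopted observers. A larger T makes the counter forget more slowly.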
- get_params(deep=True)
Return the algorithm parameters as a dictionary.
- Parameters
deep (bool, default=True) – Ignored. Only for compatibility with scikit-learn.
- Returns
params – Dictionary of parameters.
- Return type
dict
- set_params(**params)
Reset the model and set the parameters according to the supplied dictionary.
- Parameters
**params (dict) – Dictionary of parameters.
- class dSalmon.outlier.SWDBOR(window, radius, metric='euclidean', metric_params=None, float_type=<class 'numpy.float64'>, min_node_size=5, max_node_size=100, split_sampling=20)[source]
Distance based outlier detection by radius.
When a threshold is applied to the returned outlier scores to transform them into binary labels, results coincide with ExactStorm [AF07], AbstractC [YRW09] or the COD family [KGP+11].
- Parameters
window (float) – Window length after which samples will be pruned.
radius (float) – Radius for classification as neighbor.
metric (string) – Which distance metric to use. Currently supported metrics include ‘chebyshev’, ‘cityblock’, ‘euclidean’ and ‘minkowski’.
metric_params (dict) – Parameters passed to the metric. Minkowski distance requires setting an integer p parameter.
float_type (np.float32 or np.float64) – The floating point type to use for internal processing.
min_node_size (int, optional (default=5)) – Smallest possible size for M-Tree nodes. min_node_size is guaranteed to leave results unaffected.
max_node_size (int, optional (default=100)) – Largest possible size for M-Tree nodes. max_node_size is guaranteed to leave results unaffected.
split_sampling (int, optional (default=20)) – The number of key combinations to try when splitting M-Tree routing nodes. split_sampling is guaranteed to leave results unaffected.
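Because the M-Tree parameters do not affect results, the quantity behind radius-based detection can be shown with a brute-force sketch. The neighbor_counts helper, the inclusive distance test and the min_neighbors cutoff are assumptions for illustration; the actual score convention of SWDBOR is not stated on this page.

```python
import math


def neighbor_counts(window_samples, radius):
    """Count, for every sample, how many other samples lie within `radius` (brute force)."""
    counts = []
    for i, a in enumerate(window_samples):
        n = sum(
            1
            for j, b in enumerate(window_samples)
            if i != j and math.dist(a, b) <= radius
        )
        counts.append(n)
    return counts


window_samples = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]]
counts = neighbor_counts(window_samples, radius=0.5)
# Thresholding turns counts into binary labels; min_neighbors is a hypothetical cutoff.
min_neighbors = 1
is_outlier = [c < min_neighbors for c in counts]
```

An index structure such as an M-Tree answers the same range queries without comparing every pair of samples, which is why its tuning parameters change speed but not output.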
- fit(X, times=None)
Process next chunk of data without returning outlier scores.
- Parameters
X (ndarray, shape (n_samples, n_features)) – The input data.
times (ndarray, shape (n_samples,), optional) – Timestamps for input data. If None, timestamps are linearly increased for each sample.
- fit_predict(X, times=None)[source]
Process next chunk of data.
- Parameters
X (ndarray, shape (n_samples, n_features)) – The input data.
times (ndarray, shape (n_samples,), optional) – Timestamps for input data. If None, timestamps are linearly increased for each sample.
- Returns
y – Outlier scores for provided input data.
- Return type
ndarray, shape (n_samples,)
- get_params(deep=True)
Return the algorithm parameters as a dictionary.
- Parameters
deep (bool, default=True) – Ignored. Only for compatibility with scikit-learn.
- Returns
params – Dictionary of parameters.
- Return type
dict
- get_window()[source]
Return samples in the current window.
- Returns
data (ndarray, shape (n_samples, n_features)) – Samples in the current window.
times (ndarray, shape (n_samples,)) – Expiry times of samples in the current window.
neighbors (ndarray, shape (n_samples,)) – Number of neighbors of samples in the current window.
- set_params(**params)
Reset the model and set the parameters according to the supplied dictionary.
- Parameters
**params (dict) – Dictionary of parameters.
- class dSalmon.outlier.SWKNN(window, k, k_is_max=False, metric='euclidean', metric_params=None, float_type=<class 'numpy.float64'>, min_node_size=5, max_node_size=100, split_sampling=20)[source]
Distance based outlier detection by k nearest neighbors.
When a threshold is applied to the returned outlier scores to transform them into binary labels, results coincide with ExactStorm [AF07], AbstractC [YRW09] or the COD family [KGP+11].
- Parameters
window (float) – Window length after which samples will be pruned.
k (int) – Number of nearest neighbors to consider for outlier scoring.
k_is_max (bool (default=False)) – Whether scores should be returned for all neighbor values up to the provided k. Grid search for the optimal k can be performed by setting k_is_max=True.
metric (string) – Which distance metric to use. Currently supported metrics include ‘chebyshev’, ‘cityblock’, ‘euclidean’ and ‘minkowski’.
metric_params (dict) – Parameters passed to the metric. Minkowski distance requires setting an integer p parameter.
float_type (np.float32 or np.float64) – The floating point type to use for internal processing.
min_node_size (int, optional (default=5)) – Smallest possible size for M-Tree nodes. min_node_size is guaranteed to leave results unaffected.
max_node_size (int, optional (default=100)) – Largest possible size for M-Tree nodes. max_node_size is guaranteed to leave results unaffected.
split_sampling (int, optional (default=20)) – The number of key combinations to try when splitting M-Tree routing nodes. split_sampling is guaranteed to leave results unaffected.
- fit(X, times=None)[source]
Process next chunk of data without returning outlier scores.
- Parameters
X (ndarray, shape (n_samples, n_features)) – The input data.
times (ndarray, shape (n_samples,), optional) – Timestamps for input data. If None, timestamps are linearly increased for each sample.
- fit_predict(X, times=None)[source]
Process next chunk of data.
- Parameters
X (ndarray, shape (n_samples, n_features)) – The input data.
times (ndarray, shape (n_samples,), optional) – Timestamps for input data. If None, timestamps are linearly increased for each sample.
- Returns
y – Outlier scores for provided input data.
- Return type
ndarray, shape (n_samples,), or (n_samples, k) if k_is_max=True
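The two possible return shapes can be pictured with a brute-force sketch for a single query point. It assumes that the score is the distance to the k-th nearest neighbor and that missing neighbors are padded with infinity; knn_scores is a hypothetical helper and neither assumption is confirmed by this page.

```python
import math


def knn_scores(window_samples, query, k, k_is_max=False):
    """Distance from `query` to its k-th nearest neighbor, or to each of the 1st..k-th."""
    dists = sorted(math.dist(query, s) for s in window_samples)
    if len(dists) < k:
        # Too few neighbors in the window: pad with infinity (an illustrative choice).
        dists += [math.inf] * (k - len(dists))
    return dists[:k] if k_is_max else dists[k - 1]


window_samples = [[0.0], [1.0], [2.0], [4.0]]
single = knn_scores(window_samples, [0.5], k=3)
per_k = knn_scores(window_samples, [0.5], k=3, k_is_max=True)
```

With k_is_max=True a single pass yields one score per neighbor count from 1 to k, so a grid search over k does not require re-running the detector.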
- get_params(deep=True)
Return the algorithm parameters as a dictionary.
- Parameters
deep (bool, default=True) – Ignored. Only for compatibility with scikit-learn.
- Returns
params – Dictionary of parameters.
- Return type
dict
- get_window()[source]
Return samples in the current window.
- Returns
data (ndarray, shape (n_samples, n_features)) – Samples in the current window.
times (ndarray, shape (n_samples,)) – Expiry times of samples in the current window.
- set_params(**params)
Reset the model and set the parameters according to the supplied dictionary.
- Parameters
**params (dict) – Dictionary of parameters.
- class dSalmon.outlier.SWLOF(window, k, simplified=False, k_is_max=False, metric='euclidean', metric_params=None, float_type=<class 'numpy.float64'>, min_node_size=5, max_node_size=100, split_sampling=20)[source]
Local Outlier Factor [BKNS00] within a sliding window.
- Parameters
window (float) – Window length after which samples will be pruned.
k (int) – Number of nearest neighbors to consider for outlier scoring.
simplified (bool (default=False)) – Whether to use simplified LOF.
k_is_max (bool (default=False)) – Whether scores should be returned for all neighbor values up to the provided k. Grid search for the optimal k can be performed by setting k_is_max=True.
metric (string) – Which distance metric to use. Currently supported metrics include ‘chebyshev’, ‘cityblock’, ‘euclidean’ and ‘minkowski’.
metric_params (dict) – Parameters passed to the metric. Minkowski distance requires setting an integer p parameter.
float_type (np.float32 or np.float64) – The floating point type to use for internal processing.
min_node_size (int, optional (default=5)) – Smallest possible size for M-Tree nodes. min_node_size is guaranteed to leave results unaffected.
max_node_size (int, optional (default=100)) – Largest possible size for M-Tree nodes. max_node_size is guaranteed to leave results unaffected.
split_sampling (int, optional (default=20)) – The number of key combinations to try when splitting M-Tree routing nodes. split_sampling is guaranteed to leave results unaffected.
- fit(data, times=None)[source]
Process next chunk of data without returning outlier scores.
- Parameters
data (ndarray, shape (n_samples, n_features)) – The input data.
times (ndarray, shape (n_samples,), optional) – Timestamps for input data. If None, timestamps are linearly increased for each sample.
- fit_predict(X, times=None)[source]
Process next chunk of data.
- Parameters
X (ndarray, shape (n_samples, n_features)) – The input data.
times (ndarray, shape (n_samples,), optional) – Timestamps for input data. If None, timestamps are linearly increased for each sample.
- Returns
y – Outlier scores for provided input data.
- Return type
ndarray, shape (n_samples,), or (n_samples, k) if k_is_max=True
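As a reminder of what the scores mean, here is a brute-force sketch of the standard (non-simplified) Local Outlier Factor over a fixed set of samples. It ignores distance ties beyond the k-th neighbor, requires more than k samples, and assumes no duplicate points; lof_scores is a hypothetical helper, not the sliding-window implementation.

```python
import math


def lof_scores(samples, k):
    """Brute-force Local Outlier Factor for every sample (ties beyond k are ignored)."""
    n = len(samples)
    neighbors, k_dist = [], []
    for i in range(n):
        ranked = sorted(
            (math.dist(samples[i], samples[j]), j) for j in range(n) if j != i
        )[:k]
        neighbors.append([j for _, j in ranked])
        k_dist.append(ranked[-1][0])          # distance to the k-th nearest neighbor

    def reach_dist(i, j):
        # Reachability distance of sample i with respect to its neighbor j.
        return max(k_dist[j], math.dist(samples[i], samples[j]))

    # Local reachability density: inverse of the mean reachability distance.
    lrd = [
        1.0 / (sum(reach_dist(i, j) for j in neighbors[i]) / k)
        for i in range(n)
    ]
    # LOF: mean ratio between the neighbors' densities and the sample's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]


samples = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [6.0, 6.0]]
scores = lof_scores(samples, k=2)
```

Points whose density matches that of their neighbors score close to 1, while the isolated point at (6, 6) scores well above 1.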
- get_params(deep=True)
Return the algorithm parameters as a dictionary.
- Parameters
deep (bool, default=True) – Ignored. Only for compatibility with scikit-learn.
- Returns
params – Dictionary of parameters.
- Return type
dict
- get_window()[source]
Return samples in the current window.
- Returns
data (ndarray, shape (n_samples, n_features)) – Samples in the current window.
times (ndarray, shape (n_samples,)) – Expiry times of samples in the current window.
- set_params(**params)
Reset the model and set the parameters according to the supplied dictionary.
- Parameters
**params (dict) – Dictionary of parameters.
- class dSalmon.outlier.SWRRCT(window, n_estimators=10, float_type=<class 'numpy.float64'>, seed=0, n_jobs=-1)[source]
Robust Random Cut Forest [GMRS16].
- Parameters
window (float) – Window length after which samples will be pruned.
n_estimators (int) – Number of trees in the ensemble.
float_type (np.float32 or np.float64) – The floating point type to use for internal processing.
seed (int) – Random seed for tree construction.
n_jobs (int) – Number of threads to use for processing trees. Pass -1 to use as many jobs as there are CPU cores.
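The n_jobs convention can be sketched in plain Python as follows. resolve_n_jobs and score_ensemble are hypothetical helpers, the stand-in estimators are ordinary functions rather than trees, and averaging the per-estimator scores is an assumption made for illustration.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def resolve_n_jobs(n_jobs):
    """Map n_jobs=-1 to the number of CPU cores; pass positive values through."""
    if n_jobs == -1:
        return os.cpu_count() or 1      # cpu_count() may return None
    if n_jobs < 1:
        raise ValueError("n_jobs must be -1 or a positive integer")
    return n_jobs


def score_ensemble(estimators, x, n_jobs=-1):
    """Evaluate every estimator on x in a thread pool and average the results."""
    with ThreadPoolExecutor(max_workers=resolve_n_jobs(n_jobs)) as pool:
        scores = list(pool.map(lambda est: est(x), estimators))
    return sum(scores) / len(scores)


# Stand-in estimators: plain functions instead of trees.
estimators = [lambda x, w=w: w * x for w in (1.0, 2.0, 3.0)]
mean_score = score_ensemble(estimators, 2.0, n_jobs=2)
```

Ensemble members are independent of each other, which is what makes per-estimator threading straightforward. Pure-Python threads gain little for CPU-bound work, so this only illustrates the parameter's meaning.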
- fit(X, times=None)[source]
Process next chunk of data without returning outlier scores.
- Parameters
X (ndarray, shape (n_samples, n_features)) – The input data.
times (ndarray, shape (n_samples,), optional) – Timestamps for input data. If None, timestamps are linearly increased for each sample.
- fit_predict(X, times=None)[source]
Process next chunk of data.
- Parameters
X (ndarray, shape (n_samples, n_features)) – The input data.
times (ndarray, shape (n_samples,), optional) – Timestamps for input data. If None, timestamps are linearly increased for each sample.
- Returns
y – Outlier scores for provided input data.
- Return type
ndarray, shape (n_samples,)
- get_params(deep=True)
Return the algorithm parameters as a dictionary.
- Parameters
deep (bool, default=True) – Ignored. Only for compatibility with scikit-learn.
- Returns
params – Dictionary of parameters.
- Return type
dict
- get_window()[source]
Return samples in the current window.
- Returns
data (ndarray, shape (n_samples, n_features)) – Samples in the current window.
times (ndarray, shape (n_samples,)) – Expiry times of samples in the current window.
- set_params(**params)
Reset the model and set the parameters according to the supplied dictionary.
- Parameters
**params (dict) – Dictionary of parameters.
- class dSalmon.outlier.xStream(window, n_estimators, n_projections, depth, cms_w=5, cms_d=1000, float_type=<class 'numpy.float64'>, seed=0, n_jobs=-1)[source]
xStream [MLA18].
- Parameters
window (int) – Window length after which the current window will be switched to the reference window.
n_estimators (int) – The number of chains in the ensemble.
n_projections (int) – The number of StreamHash projections to use.
depth (int) – The length of each half-space chain.
cms_w (int) – Number of hash functions for the count-min sketches.
cms_d (int) – Number of bins for the count-min sketches.
float_type (np.float32 or np.float64) – The floating point type to use for internal processing.
seed (int) – Random seed to use.
n_jobs (int) – Number of threads to use for processing. Pass -1 to use as many jobs as there are CPU cores.
- fit(X, times=None)
Process next chunk of data without returning outlier scores.
- Parameters
X (ndarray, shape (n_samples, n_features)) – The input data.
times (ndarray, shape (n_samples,), optional) – Timestamps for input data. If None, timestamps are linearly increased for each sample.
- fit_predict(X, features=None)[source]
Process next chunk of data.
- Parameters
X (ndarray, shape (n_samples, n_features)) – The input data.
features (list, optional) – Feature names used for StreamHash. The repr() of list elements is used as basis for hashing, hence elements do not necessarily have to be strings. If None, range(n_features) is used as feature names.
- Returns
y – Outlier scores for provided input data.
- Return type
ndarray, shape (n_samples,)
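The role of feature names can be illustrated with a hash-based sparse projection. The sketch assumes weights drawn from {-c, 0, +c} with probabilities 1/6, 2/3 and 1/6 and c = sqrt(3 / n_projections), and it uses SHA-256 over repr(feature); these details and the function names are assumptions in the spirit of StreamHash, not dSalmon's actual hashing.

```python
import hashlib
import math


def streamhash_weight(feature, projection, n_projections):
    """Sparse random weight in {-c, 0, +c} derived from repr(feature)."""
    key = f"{projection}:{feature!r}".encode()
    # Map the digest to a uniform number in [0, 1).
    u = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") / 2**64
    c = math.sqrt(3.0 / n_projections)
    if u < 1 / 6:
        return -c
    if u < 1 / 3:
        return c
    return 0.0     # about two thirds of the weights vanish


def project(row, features, n_projections):
    """Project one sample onto n_projections hashed directions."""
    return [
        sum(streamhash_weight(f, p, n_projections) * v for f, v in zip(features, row))
        for p in range(n_projections)
    ]


row = [0.2, 1.5, -0.7]
by_index = project(row, features=range(len(row)), n_projections=4)
by_name = project(row, features=["temp", "load", "rpm"], n_projections=4)
```

Because the weights depend only on the feature name, no projection matrix has to be stored, and a feature keeps its weight even if its column position changes between chunks.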
- get_params(deep=True)
Return the algorithm parameters as a dictionary.
- Parameters
deep (bool, default=True) – Ignored. Only for compatibility with scikit-learn.
- Returns
params – Dictionary of parameters.
- Return type
dict
- set_initial_sample(data, features=None)[source]
Optionally set the initial sample used for estimating the range of projected features. If no initial sample is provided, ranges will be estimated from the first window data points. In this case, the first window data points are stored to construct the reference window as soon as range estimates are available.
- Parameters
data (ndarray, shape (n_samples, n_features)) – The initial sample.
features (list, optional) – Feature names used for StreamHash. The repr() of list elements is used as basis for hashing, hence elements do not necessarily have to be strings. If None, range(n_features) is used as feature names.
- set_params(**params)
Reset the model and set the parameters according to the supplied dictionary.
- Parameters
**params (dict) – Dictionary of parameters.