Isolation Forest Algorithm.
Return the anomaly score of each sample using the IsolationForest algorithm.
The IsolationForest 'isolates' observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature.
Since recursive partitioning can be represented by a tree structure, the number of splittings required to isolate a sample is equivalent to the path length from the root node to the terminating node.
This path length, averaged over a forest of such random trees, is a measure of normality and our decision function.
Random partitioning produces noticeably shorter paths for anomalies. Hence, when a forest of random trees collectively produces shorter path lengths for particular samples, they are highly likely to be anomalies.
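As an illustrative sketch (editorial example with toy data, not part of the
original description), lower ``score_samples`` values correspond to shorter
average path lengths, so an injected far-away point receives the lowest score:

>>> import numpy as np
>>> from sklearn.ensemble import IsolationForest
>>> rng = np.random.RandomState(42)
>>> X = np.concatenate([rng.normal(0, 1, (100, 1)), [[10.0]]])
>>> scores = IsolationForest(random_state=0).fit(X).score_samples(X)
>>> int(np.argmin(scores))  # the point at 10.0 (index 100) scores lowest
100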
Read more in the :ref:`User Guide <isolation_forest>`.
.. versionadded:: 0.18
Parameters
----------
n_estimators : int, optional (default=100)
    The number of base estimators in the ensemble.
max_samples : int or float, optional (default="auto")
    The number of samples to draw from X to train each base estimator.

    - If int, then draw `max_samples` samples.
    - If float, then draw `max_samples * X.shape[0]` samples.
    - If "auto", then `max_samples=min(256, n_samples)`.

    If max_samples is larger than the number of samples provided,
    all samples will be used for all trees (no sampling).
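    For illustration (editorial sketch, not part of the original
    docstring), the fitted ``max_samples_`` attribute shows how each
    setting is resolved:

    >>> from sklearn.ensemble import IsolationForest
    >>> X = [[float(i)] for i in range(1000)]
    >>> IsolationForest(n_estimators=10, max_samples='auto').fit(X).max_samples_
    256
    >>> IsolationForest(n_estimators=10, max_samples=0.5).fit(X).max_samples_
    500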
contamination : 'auto' or float, optional (default='auto')
    The amount of contamination of the data set, i.e. the proportion
    of outliers in the data set. Used when fitting to define the
    threshold on the scores of the samples.

    - If 'auto', the threshold is determined as in the original paper.
    - If float, the contamination should be in the range [0, 0.5].

    .. versionchanged:: 0.22
       The default value of ``contamination`` changed from 0.1
       to ``'auto'``.
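    As a rough editorial sketch of the effect (toy data): with a float
    ``contamination``, approximately that fraction of the training
    samples is flagged as an outlier by :meth:`predict`:

    >>> import numpy as np
    >>> from sklearn.ensemble import IsolationForest
    >>> X = np.random.RandomState(0).normal(size=(200, 2))
    >>> pred = IsolationForest(contamination=0.1, random_state=0).fit_predict(X)
    >>> int((pred == -1).sum())  # about 0.1 * 200 training samples
    20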
max_features : int or float, optional (default=1.0)
    The number of features to draw from X to train each base estimator.

    - If int, then draw `max_features` features.
    - If float, then draw `max_features * X.shape[1]` features.
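    Similarly (editorial sketch; ``estimators_features_`` is an
    attribute inherited from the bagging base class), each tree is
    trained on the resolved number of features:

    >>> import numpy as np
    >>> from sklearn.ensemble import IsolationForest
    >>> X = np.random.RandomState(0).normal(size=(100, 4))
    >>> clf = IsolationForest(n_estimators=10, max_features=0.5,
    ...                       random_state=0).fit(X)
    >>> len(clf.estimators_features_[0])  # int(0.5 * 4) features per tree
    2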
bootstrap : bool, optional (default=False)
    If True, individual trees are fit on random subsets of the training
    data sampled with replacement. If False, sampling without
    replacement is performed.
n_jobs : int or None, optional (default=None)
    The number of jobs to run in parallel for both :meth:`fit` and
    :meth:`predict`. ``None`` means 1 unless in a
    :obj:`joblib.parallel_backend` context. ``-1`` means using all
    processors. See :term:`Glossary <n_jobs>` for more details.
behaviour : str, default='deprecated'
    This parameter has no effect, is deprecated, and will be removed.

    .. versionadded:: 0.20
       ``behaviour`` is added in 0.20 for back-compatibility purposes.

    .. deprecated:: 0.20
       ``behaviour='old'`` is deprecated in 0.20 and will not be
       possible in 0.22.

    .. deprecated:: 0.22
       ``behaviour`` parameter is deprecated in 0.22 and removed in
       0.24.
random_state : int, RandomState instance or None, optional (default=None)
    If int, random_state is the seed used by the random number
    generator; if RandomState instance, random_state is the random
    number generator; if None, the random number generator is the
    RandomState instance used by `np.random`.
verbose : int, optional (default=0)
    Controls the verbosity of the tree building process.
warm_start : bool, optional (default=False)
    When set to ``True``, reuse the solution of the previous call to
    fit and add more estimators to the ensemble; otherwise, just fit a
    whole new forest. See :term:`the Glossary <warm_start>`.

    .. versionadded:: 0.21
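    A short editorial sketch of the warm-start pattern:

    >>> from sklearn.ensemble import IsolationForest
    >>> X = [[-1.1], [0.3], [0.5], [100]]
    >>> clf = IsolationForest(n_estimators=50, warm_start=True,
    ...                       random_state=0)
    >>> _ = clf.fit(X)                               # fit 50 trees
    >>> _ = clf.set_params(n_estimators=100).fit(X)  # add 50 more trees
    >>> len(clf.estimators_)
    100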
Attributes
----------
estimators_ : list of ExtraTreeRegressor
    The collection of fitted sub-estimators.
estimators_samples_ : list of arrays
    The subset of drawn samples (i.e., the in-bag samples) for each
    base estimator.
max_samples_ : integer
    The actual number of samples used to train each base estimator.
offset_ : float
    Offset used to define the decision function from the raw scores.
    We have the relation: ``decision_function = score_samples - offset_``.
    ``offset_`` is defined as follows. When the contamination parameter
    is set to "auto", the offset is equal to -0.5, as the scores of
    inliers are close to 0 and the scores of outliers are close to -1.
    When a contamination parameter different from "auto" is provided,
    the offset is defined in such a way that we obtain the expected
    number of outliers (samples with decision function < 0) in training.
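    The stated relation can be checked directly (editorial sketch with
    toy data):

    >>> import numpy as np
    >>> from sklearn.ensemble import IsolationForest
    >>> X = np.random.RandomState(0).normal(size=(100, 2))
    >>> clf = IsolationForest(random_state=0).fit(X)
    >>> clf.offset_  # contamination='auto' fixes the offset at -0.5
    -0.5
    >>> bool(np.allclose(clf.decision_function(X),
    ...                  clf.score_samples(X) - clf.offset_))
    True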
Notes
-----
The implementation is based on an ensemble of ExtraTreeRegressor. The
maximum depth of each tree is set to ``ceil(log_2(n))`` where
:math:`n` is the number of samples used to build the tree (see
(Liu et al., 2008) for more details).
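For instance, with the default ``max_samples="auto"`` and at least 256
training samples, each tree uses :math:`n = 256` samples and is grown to
depth at most ``ceil(log_2(256)) = 8`` (editorial arithmetic check):

>>> import numpy as np
>>> int(np.ceil(np.log2(256)))
8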
References
----------
.. [1] Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua. "Isolation
       forest." Data Mining, 2008. ICDM'08. Eighth IEEE International
       Conference on.

.. [2] Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua.
       "Isolation-based anomaly detection." ACM Transactions on
       Knowledge Discovery from Data (TKDD) 6.1 (2012): 3.
See Also
--------
sklearn.covariance.EllipticEnvelope : An object for detecting outliers
    in a Gaussian distributed dataset.
sklearn.svm.OneClassSVM : Unsupervised Outlier Detection. Estimate the
    support of a high-dimensional distribution. The implementation is
    based on libsvm.
sklearn.neighbors.LocalOutlierFactor : Unsupervised Outlier Detection
    using Local Outlier Factor (LOF).
Examples
--------
>>> from sklearn.ensemble import IsolationForest
>>> X = [[-1.1], [0.3], [0.5], [100]]
>>> clf = IsolationForest(random_state=0).fit(X)
>>> clf.predict([[0.1], [0], [90]])
array([ 1,  1, -1])