scedar.knn
scedar.knn.detection
class scedar.knn.detection.RareSampleDetection(sdm)
Bases: object
K nearest neighbor detection of rare samples
Performs rare sample detection in parallel, running each parameter combination in its own process. Because the procedure is iterative, the run for a single parameter combination is not itself parallelized.
Stores the results for further lookup.
Parameters: sdm (SampleDistanceMatrix or its subclass)
_sdm
Type: SampleDistanceMatrix
_res_lut
Lookup table of KNN rare sample detection results.
Type: dict
detect_rare_samples(k, d_cutoff, n_iter, nprocs=1, metric=None, use_pca=False, use_hnsw=False, index_params=None, query_params=None)
KNN rare sample detection with multiple parameter combinations.
Assuming that at least k samples in the dataset look similar to one another, samples with fewer than k similar neighbors may be rare. Rare samples can either be truly distinct from the general population or be caused by technical errors.
This procedure iteratively detects rare samples according to their k-th nearest neighbors. The samples most distant from their k-th nearest neighbors are detected first. The remaining samples are then screened with a progressively less stringent distance cutoff: the cutoff decreases linearly from the maximum distance to d_cutoff over n_iter iterations.
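The iterative procedure described above can be illustrated with a small pure-Python sketch. This is not scedar's implementation: the function names `linear_cutoffs` and `detect_rare_sketch` are hypothetical, the distance matrix is a plain list of lists, and the handling of the case where too few samples remain is an assumption made for the sketch.

```python
def linear_cutoffs(d_max, d_cutoff, n_iter):
    """Cutoffs decreasing linearly from d_max to d_cutoff in n_iter steps."""
    if n_iter == 1:
        return [d_cutoff]
    step = (d_max - d_cutoff) / (n_iter - 1)
    return [d_max - i * step for i in range(n_iter)]


def detect_rare_sketch(dist, k, d_cutoff, n_iter):
    """Return indices of samples kept as non-rare (hypothetical helper).

    dist is a symmetric distance matrix given as a list of lists.
    """
    d_max = max(max(row) for row in dist)
    kept = list(range(len(dist)))
    for cutoff in linear_cutoffs(d_max, d_cutoff, n_iter):
        if len(kept) <= k:
            # Assumption: with fewer than k possible neighbors left,
            # no remaining sample can have k similar neighbors.
            return []
        still_kept = []
        for i in kept:
            # Distances from sample i to the other samples still kept.
            neighbor_d = sorted(dist[i][j] for j in kept if j != i)
            kth_distance = neighbor_d[k - 1]
            # A sample whose k-th nearest neighbor is at least `cutoff`
            # away is called rare and dropped.
            if kth_distance < cutoff:
                still_kept.append(i)
        kept = still_kept
    return kept


# Four clustered points and one distant point on a line.
points = [0, 1, 2, 3, 100]
dist = [[abs(a - b) for b in points] for a in points]
non_rare = detect_rare_sketch(dist, k=2, d_cutoff=5, n_iter=3)
print(non_rare)  # [0, 1, 2, 3]
```

In this toy run the cutoffs are 100, 52.5, and 5; the distant point is dropped at the second cutoff, and the four clustered points pass every iteration.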
Parameters: - k (int list or scalar) – Number of nearest neighbors used to detect rare samples.
- d_cutoff (float list or scalar) – Minimum (>=) distance to be called rare. Samples at distances >= d_cutoff are considered distinct from each other.
- n_iter (int list or scalar) – Number of progressive KNN detection iterations to run on the dataset.
- metric ({'cosine', 'euclidean', None}) – If None, self._sdm._metric is used.
- use_pca (bool) – Whether to use PCA when computing nearest neighbors.
- use_hnsw (bool) – Whether to use a Hierarchical Navigable Small World (HNSW) graph to compute approximate nearest neighbors.
- index_params (dict) – Parameters used by HNSW in indexing.
  - efConstruction (int) – Default 100. A higher value improves the quality of the constructed graph and leads to higher search accuracy. However, it also leads to longer indexing times. The reasonable range of values is 100-2000.
  - M (int) – Default 5. A higher value leads to better recall and shorter retrieval times, at the expense of longer indexing time. The reasonable range of values is 5-100.
  - delaunay_type ({0, 1, 2, 3}) – Default 2. Pruning heuristic, which affects the trade-off between retrieval performance and indexing time. The default is usually quite good.
  - post ({0, 1, 2}) – Default 0. The amount and type of postprocessing applied to the constructed graph. 0 means no processing. 2 means more processing.
  - indexThreadQty (int) – Default self._nprocs. The number of threads used.
- query_params (dict) – Parameters used by HNSW in querying.
  - efSearch (int) – Default 100. A higher value improves recall at the expense of longer retrieval time. The reasonable range of values is 100-2000.
- nprocs (int) – Number of processes used to run all parameter tuples.
Returns: Indices of non-rare samples of each corresponding parameter tuple.
Return type: res_list
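As an illustration of the HNSW options, the dictionaries below spell out the documented default values. The variable `rsd` in the commented call is hypothetical and stands for a RareSampleDetection instance built from an existing SampleDistanceMatrix; the value 1 for indexThreadQty is an assumption, since the documented default is self._nprocs.

```python
# Indexing options, set to the defaults listed in the parameter descriptions.
index_params = {
    "efConstruction": 100,  # graph quality vs. indexing time
    "M": 5,                 # recall vs. indexing time
    "delaunay_type": 2,     # pruning heuristic
    "post": 0,              # no postprocessing of the graph
    "indexThreadQty": 1,    # assumption: documented default is self._nprocs
}

# Query options, set to the documented default.
query_params = {"efSearch": 100}

# Hypothetical call; `rsd` is assumed to be a RareSampleDetection instance.
# res_list = rsd.detect_rare_samples(
#     k=10, d_cutoff=1, n_iter=10, use_hnsw=True,
#     index_params=index_params, query_params=query_params)
```

Raising efConstruction or efSearch within the documented 100-2000 range trades longer indexing or retrieval time for accuracy.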
Notes
If parameters are provided as lists of equal length n, the n corresponding parameter tuples will be executed in parallel.
Example:
k = [10, 15, 20]
d_cutoff = [1, 2, 3]
n_iter = [10, 20, 30]
The (k, d_cutoff, n_iter) tuples (10, 1, 10), (15, 2, 20), and (20, 3, 30) will be tried in parallel with nprocs processes.
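The pairing in this example is element-wise rather than a Cartesian product. A minimal sketch of how the tuples line up, using only the standard library (the variable name `param_tuples` is introduced here for illustration and is not part of scedar):

```python
k = [10, 15, 20]
d_cutoff = [1, 2, 3]
n_iter = [10, 20, 30]

# Element-wise pairing: the i-th entries of each list form one tuple.
param_tuples = list(zip(k, d_cutoff, n_iter))
print(param_tuples)  # [(10, 1, 10), (15, 2, 20), (20, 3, 30)]
```

Three lists of length three therefore give three runs, not twenty-seven.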
scedar.knn.imputation
class scedar.knn.imputation.FeatureImputation(sdm)
Bases: object
Impute dropped-out features using a K nearest neighbors approach.
If the value of a feature is below min_present_val in a sample, and all of its KNNs have values above min_present_val, the value is replaced with the summary statistic (median by default) of the KNN values above the threshold.
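The rule can be sketched in plain Python for a single pass. This is an illustration under stated assumptions, not scedar's code: `impute_once` is a hypothetical helper, the neighbor lists are supplied directly instead of being computed, and the count threshold follows the n_do parameter of impute_features (with n_do equal to k it matches the "all KNNs" wording above).

```python
from statistics import median


def impute_once(values, knn_idx, n_do, min_present_val, statistic_fun=median):
    """One imputation pass over a samples-by-features table (hypothetical).

    values: list of per-sample feature lists.
    knn_idx: knn_idx[i] lists the neighbor indices of sample i.
    """
    imputed = [row[:] for row in values]  # leave the input untouched
    for i, row in enumerate(values):
        for f, v in enumerate(row):
            if v >= min_present_val:
                continue  # the feature is already present in this sample
            # Neighbor values of the same feature that count as present.
            present = [values[j][f] for j in knn_idx[i]
                       if values[j][f] >= min_present_val]
            # Impute only when enough neighbors have the feature present.
            if len(present) >= n_do:
                imputed[i][f] = statistic_fun(present)
    return imputed


values = [[0, 7], [6, 8], [8, 0], [10, 9]]
knn_idx = [[1, 2, 3], [0, 2, 3], [0, 1, 3], [0, 1, 2]]
result = impute_once(values, knn_idx, n_do=3, min_present_val=5)
print(result)  # [[8, 7], [6, 8], [8, 8], [10, 9]]
```

Both zero entries are replaced by 8, the median of the three neighbor values for that feature, while entries already at or above the threshold are left alone.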
_sdm
Type: SampleDistanceMatrix
_res_lut
Lookup table of the results. {(k, n_do, min_present_val, n_iter): (pu_sdm, pu_idc_arr, stats), …}
Type: dict
impute_features(k, n_do, min_present_val, n_iter, nprocs=1, statistic_fun=<function median>, metric=None, use_pca=False, use_hnsw=False, index_params=None, query_params=None, verbose=False)
Runs KNN imputation on multiple parameter sets in parallel.
Each parameter set is executed in its own process.
Parameters: - k (int) – Look at k nearest neighbors to decide whether to impute or not.
- n_do (int) – Minimum (>=) number of neighbors among the KNN with values above min_present_val for an entry to be called a dropout, so that imputation is performed.
- min_present_val (float) – Minimum (>=) value of a feature to be called present.
- n_iter (int) – The number of iterations to run.
- statistic_fun (callable) – The summary statistic used to correct gene dropouts. Default is median.
Returns: resl – List of results, [(pu_sdm, pu_idc_arr, stats), …].
- pu_sdm (SampleDistanceMatrix) – SampleDistanceMatrix after imputation.
- pu_idc_arr (array of shape (n_samples, n_features)) – Indicator matrix recording the iteration in which each entry was imputed.
- stats (str) – Statistics of the run.
Return type: list
Notes
If parameters are provided as lists of equal length n, the n corresponding parameter tuples will be executed in parallel.
Example
If k = [10, 15], n_do = [1, 2], min_present_val = [5, 6], and n_iter = [10, 20], the (k, n_do, min_present_val, n_iter) tuples (10, 1, 5, 10) and (15, 2, 6, 20) will be tried in parallel with nprocs processes.