scedar.knn¶

scedar.knn.detection¶

class scedar.knn.detection.RareSampleDetection(sdm)[source]¶

Bases: object

K nearest neighbor detection of rare samples

Performs rare sample detection in parallel, running each parameter combination in its own process. Because the procedure is iterative, parallelization within a single parameter combination is not implemented.

Stores the results for further lookup.

Parameters:	sdm (SampleDistanceMatrix or its subclass) –

_sdm¶

Type:	SampleDistanceMatrix

_res_lut¶

lookup table of KNN rare sample detection results

Type:	dict

detect_rare_samples(k, d_cutoff, n_iter, nprocs=1, metric=None, use_pca=False, use_hnsw=False, index_params=None, query_params=None)[source]¶

KNN rare sample detection with multiple parameter combinations

Assuming that at least k samples in the dataset look similar to one another, samples with fewer than k similar neighbors may be rare. Rare samples can either be truly distinct from the general population or be caused by technical errors.

This procedure iteratively detects rare samples according to the distances to their k-th nearest neighbors. The samples most distinct from their k-th nearest neighbors are detected first. The remaining samples are then tested against progressively less stringent distance cutoffs. The distance cutoff decreases linearly from the maximum distance to d_cutoff over n_iter iterations.
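The iterative procedure described above can be sketched with plain NumPy. This is an illustration under stated assumptions, not scedar's implementation: the function name `knn_rare_sample_sketch` is hypothetical, the input is a precomputed square distance matrix with a zero diagonal, and the handling of `n_iter == 1` and of fewer than k + 1 remaining samples is a guess at reasonable behavior.

```python
import numpy as np


def knn_rare_sample_sketch(dist, k, d_cutoff, n_iter):
    """Return indices of samples not flagged as rare (illustrative only)."""
    dist = np.asarray(dist, dtype=float)
    kept = np.arange(dist.shape[0])
    # Cutoffs fall linearly from the largest pairwise distance to d_cutoff.
    # Assumption: a single iteration uses d_cutoff directly.
    if n_iter > 1:
        cutoffs = np.linspace(dist.max(), d_cutoff, n_iter)
    else:
        cutoffs = np.array([float(d_cutoff)])
    for cutoff in cutoffs:
        if kept.size <= k:
            # Assumption: with k or fewer samples left, none of them can
            # have k neighbors, so all remaining samples count as rare.
            return np.array([], dtype=int)
        sub = dist[np.ix_(kept, kept)]
        # Column 0 of each sorted row is the self-distance (0), so column k
        # holds the distance to the k-th nearest neighbor.
        kth_nn_dist = np.sort(sub, axis=1)[:, k]
        # Samples at or beyond the cutoff are called rare and dropped.
        kept = kept[kth_nn_dist < cutoff]
    return kept


# Five clustered points and one far outlier on a line.
points = np.array([0.0, 0.1, 0.2, 0.3, 0.4, 10.0])
toy_dist = np.abs(points[:, None] - points[None, :])
non_rare = knn_rare_sample_sketch(toy_dist, k=2, d_cutoff=1, n_iter=3)
```

In this toy run the cutoffs are 10, 5.5 and 1; the outlier survives the first cutoff and is dropped at the second, while the clustered points pass all three.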

Parameters:

k (int list or scalar) – K nearest neighbors to detect rare samples.
d_cutoff (float list or scalar) – Samples whose distances are >= d_cutoff are considered distinct from each other. Minimum (>=) distance to be called rare.
n_iter (int list or scalar) – N progressive KNN detections on the dataset.
metric ({'cosine', 'euclidean', None}) – If None, self._sdm._metric is used.
use_pca (bool) – Use PCA for nearest neighbors or not.
use_hnsw (bool) – Use Hierarchical Navigable Small World graph to compute approximate nearest neighbors.
index_params (dict) – Parameters used by HNSW in indexing.
    efConstruction: int – Default 100. A higher value improves the quality of the constructed graph and leads to higher search accuracy, but also to longer indexing times. The reasonable range of values is 100-2000.
    M: int – Default 5. A higher value leads to better recall and shorter retrieval times, at the expense of longer indexing time. The reasonable range of values is 5-100.
    delaunay_type: {0, 1, 2, 3} – Default 2. Pruning heuristic, which affects the trade-off between retrieval performance and indexing time. The default is usually quite good.
    post: {0, 1, 2} – Default 0. The amount and type of postprocessing applied to the constructed graph. 0 means no processing. 2 means more processing.
    indexThreadQty: int – Default self._nprocs. The number of threads used.
query_params (dict) – Parameters used by HNSW in querying.
    efSearch: int – Default 100. A higher value improves recall at the expense of longer retrieval time. The reasonable range of values is 100-2000.
nprocs (int) – N processes to run all parameter tuples.
Returns:	Indices of non-rare samples of each corresponding parameter tuple.
Return type:	res_list

Notes

If parameters are provided as lists of equal length n, the n corresponding parameter tuples will be executed in parallel.

Example:

k = [10, 15, 20]

d_cutoff = [1, 2, 3]

n_iter = [10, 20, 30]

The (k, d_cutoff, n_iter) tuples (10, 1, 10), (15, 2, 20), and (20, 3, 30) will be tried in parallel with nprocs processes.
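The pairing of list parameters into per-run tuples can be sketched as follows. The helper `pair_params` is hypothetical and is not part of scedar. Repeating a scalar for every run is an assumption based on the "list or scalar" parameter types, not confirmed behavior.

```python
def pair_params(*params):
    """Pair scalars and equal-length lists into per-run parameter tuples."""
    lengths = {len(p) for p in params if isinstance(p, (list, tuple))}
    if len(lengths) > 1:
        raise ValueError("list parameters must have equal length")
    n_runs = lengths.pop() if lengths else 1
    # Assumption: a scalar is repeated for every run.
    columns = [list(p) if isinstance(p, (list, tuple)) else [p] * n_runs
               for p in params]
    # The i-th tuple takes the i-th entry of every parameter.
    return list(zip(*columns))


k = [10, 15, 20]
d_cutoff = [1, 2, 3]
n_iter = [10, 20, 30]
runs = pair_params(k, d_cutoff, n_iter)
```

Each tuple in `runs` corresponds to one independent detection run. The parameters are matched position by position rather than expanded into a full grid.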

scedar.knn.imputation¶

class scedar.knn.imputation.FeatureImputation(sdm)[source]¶

Bases: object

Impute dropped-out features using a K nearest neighbors approach

If the value of a feature is below min_present_val in a sample, and all of its KNNs have values above min_present_val, replace the value with the summary statistic (median by default) of the KNN values that are above the threshold.

_sdm¶

Type:	SampleDistanceMatrix

_res_lut¶

lookup table of the results. {(k, n_do, min_present_val, n_iter): (pu_sdm, pu_idc_arr, stats), …}

Type:	dict

impute_features(k, n_do, min_present_val, n_iter, nprocs=1, statistic_fun=<function median>, metric=None, use_pca=False, use_hnsw=False, index_params=None, query_params=None, verbose=False)[source]¶

Runs KNN imputation on multiple parameter sets in parallel.

Each parameter set will be executed in one process.

Parameters:

k (int) – Look at k nearest neighbors to decide whether to impute or not.
n_do (int) – Minimum (>=) number of neighbors among the KNN with values above min_present_val for a value to be called a dropout, so that imputation will be performed.
min_present_val (float) – Minimum (>=) value of a feature to be called present.
n_iter (int) – The number of iterations to run.
statistic_fun (callable) – The summary statistic used to correct gene dropouts. Default is median.

Returns:

resl – list of results, [(pu_sdm, pu_idc_arr, stats), …].

pu_sdm: SampleDistanceMatrix: SampleDistanceMatrix after imputation
pu_idc_arr: array of shape (n_samples, n_features): Indicator matrix recording the iteration in which each entry was imputed.
stats: str: Stats of the run.

Return type:

list

Notes

If parameters are provided as lists of equal length n, the n corresponding parameter tuples will be executed in parallel.

Example

If k = [10, 15], n_do = [1, 2], min_present_val = [5, 6], and n_iter = [10, 20], the (k, n_do, min_present_val, n_iter) tuples (10, 1, 5, 10) and (15, 2, 6, 20) will be tried in parallel with nprocs processes.
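A single imputation pass can be sketched with NumPy as below. This is an illustration under stated assumptions rather than scedar's implementation: the function name `knn_impute_once` is hypothetical, the input is a dense samples-by-features array plus a precomputed distance matrix, and the rule follows the n_do parameter description (at least n_do of the k neighbors have a present value).

```python
import numpy as np


def knn_impute_once(x, dist, k, n_do, min_present_val,
                    statistic_fun=np.median):
    """Run one KNN imputation pass and return a new array (illustrative)."""
    x = np.asarray(x, dtype=float)
    d = np.array(dist, dtype=float)  # copy, so the caller's matrix is kept
    # Force each sample to sort first in its own row, then skip it.
    np.fill_diagonal(d, -np.inf)
    knn_idx = np.argsort(d, axis=1)[:, 1:k + 1]
    imputed = x.copy()
    for i in range(x.shape[0]):
        neighbors = x[knn_idx[i]]             # shape (k, n_features)
        present = neighbors >= min_present_val
        # Only features that are absent in sample i are candidates.
        for j in np.flatnonzero(x[i] < min_present_val):
            if present[:, j].sum() >= n_do:
                # Summarize only the neighbor values that are present.
                imputed[i, j] = statistic_fun(neighbors[present[:, j], j])
    return imputed


toy_x = np.array([[5.0, 0.0],
                  [5.0, 6.0],
                  [6.0, 8.0]])
toy_dist = np.linalg.norm(toy_x[:, None, :] - toy_x[None, :, :], axis=2)
toy_imputed = knn_impute_once(toy_x, toy_dist, k=2, n_do=2,
                              min_present_val=1)
```

In this toy input, the second feature of the first sample is absent, both of its neighbors have it present (6 and 8), and the entry is replaced by their median. Running several iterations would repeat the pass on the updated values. Whether distances are recomputed between iterations is not stated in the text above, so the sketch leaves that out.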
