scedar.knn

scedar.knn.detection

class scedar.knn.detection.RareSampleDetection(sdm)[source]

Bases: object

K nearest neighbor detection of rare samples

Performs the rare sample detection procedure in parallel, with one process per parameter combination. Because the procedure is iterative, the run of a single parameter combination is not itself parallelized.

Stores the results for further lookup.

Parameters:sdm (SampleDistanceMatrix or its subclass) –
_sdm
Type:SampleDistanceMatrix
_res_lut

lookup table of KNN rare sample detection results

Type:dict
detect_rare_samples(k, d_cutoff, n_iter, nprocs=1, metric=None, use_pca=False, use_hnsw=False, index_params=None, query_params=None)[source]

KNN rare sample detection with multiple parameter combinations

Assuming that at least k samples in the dataset look similar to one another, samples with fewer than k similar neighbors may be rare. Rare samples can either be truly distinct from the general population or be caused by technical errors.

This procedure iteratively detects rare samples according to the distances to their k-th nearest neighbors. The samples most distant from their k-th nearest neighbors are detected first. The remaining samples are then screened with progressively less stringent distance cutoffs: the cutoff decreases linearly from the maximum distance to d_cutoff over n_iter iterations.
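The iterative procedure described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not scedar's implementation: the helper names `kth_nn_distance` and `detect_rare_samples_sketch` are hypothetical, the input is a precomputed square distance matrix given as nested lists, and a sample with fewer than k remaining neighbors is assumed to count as rare.

```python
def kth_nn_distance(dist, i, active, k):
    """Distance from sample i to its k-th nearest neighbor among active samples."""
    neighbors = sorted(dist[i][j] for j in active if j != i)
    # Assumption of this sketch: with fewer than k neighbors left,
    # the sample is treated as maximally isolated.
    return neighbors[k - 1] if len(neighbors) >= k else float("inf")


def detect_rare_samples_sketch(dist, k, d_cutoff, n_iter):
    """Return the indices of samples that are kept as non-rare."""
    d_max = max(max(row) for row in dist)
    active = list(range(len(dist)))
    for it in range(n_iter):
        # The cutoff moves linearly from d_max (first pass) to d_cutoff (last pass).
        if n_iter == 1:
            cutoff = d_cutoff
        else:
            cutoff = d_max + (d_cutoff - d_max) * it / (n_iter - 1)
        # A sample is called rare when its k-th NN distance is >= the cutoff.
        active = [i for i in active
                  if kth_nn_distance(dist, i, active, k) < cutoff]
    return active


# Toy data: four points close together on a line and one far away.
points = [0.0, 0.1, 0.2, 0.3, 5.0]
dist = [[abs(a - b) for b in points] for a in points]
non_rare = detect_rare_samples_sketch(dist, k=2, d_cutoff=1.0, n_iter=3)
```

In this toy run the isolated point survives the first pass, where the cutoff equals the maximum distance, and is removed on the second pass once the cutoff has dropped below its k-th nearest neighbor distance.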

Parameters:
  • k (int list or scalar) – K nearest neighbors to detect rare samples.
  • d_cutoff (float list or scalar) – Minimum (>=) distance for a sample to be called rare. Samples at distances >= d_cutoff are considered distinct from each other.
  • n_iter (int list or scalar) – Number of progressive KNN detection iterations to run on the dataset.
  • metric ({'cosine', 'euclidean', None}) – If None, self._sdm._metric is used.
  • use_pca (bool) – Use PCA for nearest neighbors or not.
  • use_hnsw (bool) – Use Hierarchical Navigable Small World graph to compute approximate nearest neighbor.
  • index_params (dict) –

    Parameters used by HNSW in indexing.

    efConstruction: int
    Default 100. Higher value improves the quality of a constructed graph and leads to higher accuracy of search. However this also leads to longer indexing times. The reasonable range of values is 100-2000.
    M: int
    Default 5. Higher value leads to better recall and shorter retrieval times, at the expense of longer indexing time. The reasonable range of values is 5-100.
    delaunay_type: {0, 1, 2, 3}
    Default 2. Pruning heuristic, which affects the trade-off between retrieval performance and indexing time. The default is usually quite good.
    post: {0, 1, 2}
    Default 0. The amount and type of postprocessing applied to the constructed graph. 0 means no processing. 2 means more processing.
    indexThreadQty: int
    Default self._nprocs. The number of threads used.
  • query_params (dict) –

    Parameters used by HNSW in querying.

    efSearch: int
    Default 100. Higher value improves recall at the expense of longer retrieval time. The reasonable range of values is 100-2000.
  • nprocs (int) – Number of processes used to run the parameter tuples.
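As an illustration of the HNSW options listed above, the two dictionaries might be written as follows. The keys are the ones documented here; the values are arbitrary choices inside the stated ranges, not recommendations, and the commented call assumes a hypothetical `rsd` instance of RareSampleDetection.

```python
# Indexing options: keys as documented above, values chosen only for illustration.
index_params = {
    "efConstruction": 200,  # documented reasonable range: 100-2000
    "M": 16,                # documented reasonable range: 5-100
    "delaunay_type": 2,     # documented default
    "post": 0,              # documented default, no postprocessing
    "indexThreadQty": 4,    # number of indexing threads
}

# Query options.
query_params = {"efSearch": 200}  # documented reasonable range: 100-2000

# Simple sanity checks against the documented ranges and choices.
params_ok = (
    100 <= index_params["efConstruction"] <= 2000
    and 5 <= index_params["M"] <= 100
    and index_params["delaunay_type"] in {0, 1, 2, 3}
    and index_params["post"] in {0, 1, 2}
    and 100 <= query_params["efSearch"] <= 2000
)

# Hypothetical usage, with `rsd` a RareSampleDetection instance:
# rsd.detect_rare_samples(k=10, d_cutoff=1.0, n_iter=10, use_hnsw=True,
#                         index_params=index_params, query_params=query_params)
```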
Returns:

Indices of non-rare samples of each corresponding parameter tuple.

Return type:

res_list

Notes

If parameters are provided as lists of equal length n, the n corresponding parameter tuples will be executed in parallel.

Example:

k = [10, 15, 20]

d_cutoff = [1, 2, 3]

n_iter = [10, 20, 30]

(k, d_cutoff, n_iter) tuples (10, 1, 10), (15, 2, 20), (20, 3, 30) will be tried in parallel with nprocs.
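The position-by-position pairing in this example can be sketched as below. The helper `make_param_tuples` is hypothetical and only shows how equal-length lists line up into tuples; wrapping a scalar into a one-element list is an assumption of this sketch, not a statement about how scedar treats scalars.

```python
def make_param_tuples(k, d_cutoff, n_iter):
    """Pair up parameter lists position by position."""
    # Assumption of this sketch: a scalar is wrapped into a one-element list.
    as_lists = [p if isinstance(p, (list, tuple)) else [p]
                for p in (k, d_cutoff, n_iter)]
    if len({len(p) for p in as_lists}) != 1:
        raise ValueError("parameter lists must have equal length")
    return list(zip(*as_lists))


param_tuples = make_param_tuples([10, 15, 20], [1, 2, 3], [10, 20, 30])
```

Each resulting tuple would then be handed to its own process, with at most nprocs of them running at once.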

scedar.knn.imputation

class scedar.knn.imputation.FeatureImputation(sdm)[source]

Bases: object

Impute dropped-out features using a K nearest neighbors approach

If the value of a feature is below min_present_val in a sample, and all of its KNNs have values above min_present_val, the value is replaced with a summary statistic (median by default) of the KNN values that are above the threshold.
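The replacement rule for a single entry can be sketched as follows. This is a simplified illustration rather than scedar's code: the function name `impute_value` is hypothetical, and it uses the n_do threshold documented for impute_features, so setting n_do equal to the number of neighbors corresponds to the "all KNNs" condition stated above.

```python
from statistics import median


def impute_value(value, neighbor_values, n_do, min_present_val,
                 statistic_fun=median):
    """Return the possibly imputed value of one feature in one sample."""
    if value >= min_present_val:
        return value  # the feature is already present, nothing to impute
    # Neighbor values that count as present (>= min_present_val).
    present = [v for v in neighbor_values if v >= min_present_val]
    if len(present) >= n_do:
        # Enough neighbors express the feature: treat the entry as a dropout.
        return statistic_fun(present)
    return value


imputed = impute_value(0, [6, 7, 0, 9], n_do=3, min_present_val=5)
unchanged = impute_value(0, [6, 0, 0, 9], n_do=3, min_present_val=5)
```

In the first call three of the four neighbors are present, so the entry is replaced by their median; in the second only two are present, so the entry is left as it is.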

_sdm
Type:SampleDistanceMatrix
_res_lut

lookup table of the results. {(k, n_do, min_present_val, n_iter): (pu_sdm, pu_idc_arr, stats), …}

Type:dict
impute_features(k, n_do, min_present_val, n_iter, nprocs=1, statistic_fun=<function median>, metric=None, use_pca=False, use_hnsw=False, index_params=None, query_params=None, verbose=False)[source]

Runs KNN imputation on multiple parameter sets in parallel.

Each parameter set will be executed in one process.

Parameters:
  • k (int) – Look at k nearest neighbors to decide whether to impute or not.
  • n_do (int) – Minimum (>=) number of neighbors among the KNN with values above min_present_val for an entry to be called a dropout, so that imputation will be performed.
  • min_present_val (float) – Minimum (>=) value of a feature for it to be called present.
  • n_iter (int) – The number of iterations to run.
  • statistic_fun (callable) – The summary statistic used to correct gene dropouts. Default is median.
Returns:

resl – list of results, [(pu_sdm, pu_idc_arr, stats), …].

pu_sdm: SampleDistanceMatrix

SampleDistanceMatrix after imputation

pu_idc_arr: array of shape (n_samples, n_features)

Indicator matrix recording the iteration in which each entry was imputed.

stats: str

Stats of the run.

Return type:

list

Notes

If parameters are provided as lists of equal length n, the n corresponding parameter tuples will be executed in parallel.

Example

If k = [10, 15], n_do = [1, 2], min_present_val = [5, 6], and n_iter = [10, 20], (k, n_do, min_present_val, n_iter) tuples (10, 1, 5, 10) and (15, 2, 6, 20) will be tried in parallel with nprocs.
