scedar.cluster
scedar.cluster.mirac
class scedar.cluster.mirac.MIRAC(x, d=None, metric='cosine', sids=None, fids=None, hac_tree=None, nprocs=1, cl_mdl_scale_factor=1, min_cl_n=25, encode_type='auto', mdl_method=None, min_split_mdl_red_ratio=0.2, soft_min_subtree_size=1, linkage='complete', optimal_ordering=False, dim_reduct_method=None, verbose=False)
Bases: object
MIRAC: MDL iteratively regularized agglomerative clustering.
Parameters:
- x (float array) – Data matrix.
- d (float array) – Distance matrix.
- metric (str) – Type of distance metric.
- sids (sid list) – List of sample ids.
- fids (fid list) – List of feature ids.
- hac_tree (HCTree) – Hierarchical tree built by agglomerative clustering, which MIRAC divides into clusters. If provided, the distance matrix is not used to build another tree.
- nprocs (int) – Number of processes used to run MIRAC in parallel.
- cl_mdl_scale_factor (float) – Scale factor of cluster overhead mdl.
- min_cl_n (int) – Minimum number of samples in a cluster.
- encode_type ({"auto", "data", "distance"}) – Type of values to encode. If "auto", encode data when n_features <= 100.
- mdl_method (mdl.Mdl) – If None, use ZeroIGKdeMdl for encoded values with >= 50% zeros, and use GKdeMdl otherwise.
- linkage (str) – Linkage type for generating the hierarchy.
- optimal_ordering (bool) – Whether to require a hierarchical clustering tree with optimal ordering. Default value is False.
- dim_reduct_method ({"PCA", "t-SNE", "UMAP", None}) – If None, no dimensionality reduction before clustering.
- verbose (bool) – Print stats for each iteration.
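The two default-selection rules documented above (for `encode_type="auto"` and `mdl_method=None`) can be sketched in plain Python. The helper names `resolve_encode_type` and `resolve_mdl_method_name` are hypothetical and are not part of scedar. The sketch returns the MDL class name as a string rather than an `mdl.Mdl` object, and it assumes that "auto" falls back to "distance" when there are more than 100 features, which the parameter description implies but does not state.

```python
def resolve_encode_type(encode_type, n_features):
    """Resolve "auto" following the documented rule (hypothetical helper)."""
    if encode_type not in ("auto", "data", "distance"):
        raise ValueError("encode_type must be 'auto', 'data', or 'distance'")
    if encode_type != "auto":
        return encode_type
    # Documented: encode data when n_features <= 100.
    # Assumption: otherwise encode distances.
    return "data" if n_features <= 100 else "distance"


def resolve_mdl_method_name(encoded_values):
    """Name the default MDL method for a flat sequence of encoded values."""
    values = list(encoded_values)
    if not values:
        raise ValueError("encoded_values must not be empty")
    zero_fraction = sum(1 for v in values if v == 0) / len(values)
    # Documented: ZeroIGKdeMdl for >= 50% zeros, GKdeMdl otherwise.
    return "ZeroIGKdeMdl" if zero_fraction >= 0.5 else "GKdeMdl"
```

An explicit "data" or "distance" setting is returned unchanged, so only "auto" depends on the number of features.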
Attributes:
- _sdm (SampleDistanceMatrix) – Data and distance matrices.
- _min_cl_n (int) – Stored parameter.
- _encode_type (str) – Encode type. If "auto" is provided, this attribute stores the determined encode type.
- _mdl_method (mdl.Mdl) – MDL method. If None is provided, this attribute stores the determined MDL method.
- labs (label list) – Labels of clustered samples, matched one-to-one with the samples from first to last.
- _hac_tree (eda.hct.HClustTree) – Root node of the hierarchical agglomerative clustering tree.
- _run_log (str) – String containing the log of the MIRAC run.
TODO:
- Dendrogram representation of the splitting process.
- Take HCTree as parameter. Computing it is non-trivial when n is large.
- Simplify splitting criteria.
dmat_heatmap(selected_labels=None, col_labels=None, transform=None, title=None, xlab=None, ylab=None, figsize=(10, 10), **kwargs)
labs
scedar.cluster.community
class scedar.cluster.community.Community(x, d=None, graph=None, metric='cosine', sids=None, fids=None, use_pdist=False, k=15, use_pca=True, use_hnsw=True, index_params=None, query_params=None, aff_scale=1, partition_method='RBConfigurationVertexPartition', resolution=1, random_state=None, n_iter=2, nprocs=1, verbose=False)
Bases: object
Community clustering
Parameters:
- x (float array) – Data matrix.
- d (float array) – Distance matrix.
- graph (igraph.Graph) – Must have a weight attribute that stores the affinity. If this argument is not None, the graph is used directly for community clustering.
- metric ({'cosine', 'euclidean'}) – Metric used for nearest neighbor computation.
- sids (sid list) – List of sample ids.
- fids (fid list) – List of feature ids.
- use_pdist (bool) – Whether to use the pairwise distance matrix. The pairwise distance matrix may be too large to save for datasets with a large number of cells.
- k (int) – The number of nearest neighbors.
- use_pca (bool) – Whether to use PCA for the nearest neighbor computation.
- use_hnsw (bool) – Whether to use a Hierarchical Navigable Small World graph to compute approximate nearest neighbors.
- index_params (dict) – Parameters used by HNSW in indexing.
  - efConstruction (int) – Default 100. A higher value improves the quality of the constructed graph and leads to higher search accuracy. However, it also leads to longer indexing times. The reasonable range of values is 100-2000.
  - M (int) – Default 5. A higher value leads to better recall and shorter retrieval times, at the expense of longer indexing time. The reasonable range of values is 5-100.
  - delaunay_type ({0, 1, 2, 3}) – Default 2. Pruning heuristic, which affects the trade-off between retrieval performance and indexing time. The default is usually quite good.
  - post ({0, 1, 2}) – Default 0. The amount and type of postprocessing applied to the constructed graph. 0 means no processing. 2 means more processing.
  - indexThreadQty (int) – Default self._nprocs. The number of threads used.
- query_params (dict) – Parameters used by HNSW in querying.
  - efSearch (int) – Default 100. A higher value improves recall at the expense of longer retrieval time. The reasonable range of values is 100-2000.
- aff_scale (float > 0) – Scaling factor used for converting distance to affinity. Affinity = (max(distance) - distance) * aff_scale.
- partition_method (str) – The following methods are implemented in the leidenalg package:
  - RBConfigurationVertexPartition: well-defined only for positive edge weights.
  - RBERVertexPartition: well-defined only for positive edge weights.
  - CPMVertexPartition: well-defined for both positive and negative edge weights.
  - SignificanceVertexPartition: well-defined only for unweighted graphs.
  - SurpriseVertexPartition: well-defined only for positive edge weights.
- resolution (float > 0) – Resolution used for community clustering. A higher value produces more clusters.
- random_state (int) – Random number generator seed used for community clustering.
- n_iter (int) – Number of iterations used for community clustering.
- nprocs (int > 0) – The number of processes/cores used for community clustering.
- verbose (bool) – Print progress or not.
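The `aff_scale` formula above, Affinity = (max(distance) - distance) * aff_scale, can be illustrated with a minimal plain-Python sketch. The function name `distances_to_affinities` is hypothetical and is not scedar's implementation. The sketch takes the maximum over whatever distances it is given; which set of distances scedar uses for that maximum is not stated here.

```python
def distances_to_affinities(distances, aff_scale=1):
    """Apply (max(distance) - distance) * aff_scale to each distance."""
    if aff_scale <= 0:
        raise ValueError("aff_scale must be > 0")
    distances = list(distances)
    if not distances:
        return []
    d_max = max(distances)
    # The largest distance maps to affinity 0; smaller distances map higher.
    return [(d_max - d) * aff_scale for d in distances]


edge_distances = [0.5, 1.0, 2.0]
affinities = distances_to_affinities(edge_distances, aff_scale=2)
# affinities == [3.0, 2.0, 0.0]
```

Under this formula, closer samples receive larger edge weights, and the most distant pair receives a weight of zero.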
Attributes:
- labs (label list) – Labels of clustered samples, matched one-to-one with the samples from first to last.
- _sdm (SampleDistanceMatrix) – Data and distance matrices.
- _graph (igraph.Graph) – Graph used for clustering.
- _la_res (leidenalg.VertexPartition) – Partition results computed by leidenalg.
- _k
- _use_pca
- _use_hnsw
- _index_params
- _query_params
- _aff_scale
labs
scedar.cluster.community_mirac
class scedar.cluster.community_mirac.CommunityMIRAC(x, d=None, sids=None, fids=None, nprocs=1, verbose=False)
Bases: object
CommunityMIRAC: Community + MIRAC clustering
Run community clustering with high resolution to get a large number of clusters. Then, run MIRAC on the community clusters.
Parameters:
- x (float array) – Data matrix.
- d (float array) – Distance matrix.
- sids (sid list) – List of sample ids.
- fids (fid list) – List of feature ids.
- nprocs (int > 0) – The number of processes/cores used for community clustering.
- verbose (bool) – Print progress or not.
Attributes:
- _x (float array) – Data matrix.
- _d (float array) – Distance matrix.
- _sids (sid list) – List of sample ids.
- _fids (fid list) – List of feature ids.
- _nprocs (int > 0) – The number of processes/cores used for community clustering.
- _verbose (bool) – Print progress or not.
- _cm_res (cluster.Community) – Community clustering result.
- _cm_clp_x (array) – Data array with samples collapsed by community clustering labels. Each row is the mean of all samples in one cluster.
- _mirac_res (cluster.MIRAC) – MIRAC clustering results on _cm_clp_x.
- labs (list) – List of labels.
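The collapsing step described for `_cm_clp_x` (one row per community cluster, holding the mean of that cluster's samples) can be sketched as follows. This is an illustration in plain Python, not scedar's code: `collapse_by_label` is a hypothetical name, and ordering the output rows by sorted label is an assumption.

```python
from collections import defaultdict


def collapse_by_label(x, labels):
    """Average the rows of x that share a label; one output row per label."""
    if len(x) != len(labels):
        raise ValueError("x and labels must have the same length")
    groups = defaultdict(list)
    for row, lab in zip(x, labels):
        groups[lab].append(row)
    # Assumption: output rows follow the sorted order of the labels.
    ordered_labels = sorted(groups)
    means = []
    for lab in ordered_labels:
        rows = groups[lab]
        # Column-wise mean over the rows in this cluster.
        means.append([sum(col) / len(rows) for col in zip(*rows)])
    return ordered_labels, means


x = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0]]
labels = [0, 0, 1]
cluster_ids, collapsed = collapse_by_label(x, labels)
# collapsed == [[2.0, 3.0], [10.0, 20.0]]
```

The collapsed array has far fewer rows than the original data when the community resolution is high, which is what makes a second clustering pass on it cheaper.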
labs
run(graph=None, metric='cosine', use_pdist=False, k=15, use_pca=True, use_hnsw=True, index_params=None, query_params=None, aff_scale=1, partition_method='RBConfigurationVertexPartition', resolution=100, random_state=None, n_iter=2, hac_tree=None, cl_mdl_scale_factor=1, min_cl_n=25, encode_type='auto', mdl_method=None, min_split_mdl_red_ratio=0.2, soft_min_subtree_size=1, linkage='complete', optimal_ordering=False, nprocs=None)
run_community(graph=None, metric='cosine', use_pdist=False, k=15, use_pca=True, use_hnsw=True, index_params=None, query_params=None, aff_scale=1, partition_method='RBConfigurationVertexPartition', resolution=100, random_state=None, n_iter=2, nprocs=None)
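Both signatures leave `index_params` and `query_params` as None by default. Assuming these arguments carry the same meaning as the ones documented for `Community`, the sketch below shows one way caller-supplied values could be layered over the documented HNSW defaults. The helper `merge_hnsw_params` and the two constant names are hypothetical; how scedar itself fills in missing keys is not described on this page.

```python
# Defaults as listed in the Community parameter documentation.
DOCUMENTED_INDEX_DEFAULTS = {
    "efConstruction": 100,
    "M": 5,
    "delaunay_type": 2,
    "post": 0,
}
DOCUMENTED_QUERY_DEFAULTS = {"efSearch": 100}


def merge_hnsw_params(index_params=None, query_params=None, nprocs=1):
    """Overlay user-supplied HNSW settings on the documented defaults."""
    # indexThreadQty is documented as defaulting to the number of processes.
    index = dict(DOCUMENTED_INDEX_DEFAULTS, indexThreadQty=nprocs)
    index.update(index_params or {})
    query = dict(DOCUMENTED_QUERY_DEFAULTS)
    query.update(query_params or {})
    return index, query


index, query = merge_hnsw_params(index_params={"M": 16}, nprocs=4)
# index["M"] == 16; every other key keeps its documented default.
```

Passing only the keys to be changed keeps a call short while leaving the remaining settings at their documented values.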