scedar.cluster¶

scedar.cluster.mirac¶

class scedar.cluster.mirac.MIRAC(x, d=None, metric='cosine', sids=None, fids=None, hac_tree=None, nprocs=1, cl_mdl_scale_factor=1, min_cl_n=25, encode_type='auto', mdl_method=None, min_split_mdl_red_ratio=0.2, soft_min_subtree_size=1, linkage='complete', optimal_ordering=False, dim_reduct_method=None, verbose=False)[source]¶

Bases: object

MIRAC: MDL iteratively regularized agglomerative clustering.

Parameters:

x (float array) – Data matrix.
d (float array) – Distance matrix.
metric (str) – Type of distance metric.
sids (sid list) – List of sample ids.
fids (fid list) – List of feature ids.
hac_tree (HCTree) – Hierarchical tree built by agglomerative clustering to divide in MIRAC. If provided, distance matrix will not be used for building another tree.
nprocs (int) – Number of processes used to run MIRAC in parallel.
cl_mdl_scale_factor (float) – Scale factor of the cluster overhead MDL.
min_cl_n (int) – Minimum number of samples in a cluster.
encode_type ({"auto", "data", "distance"}) – Type of values to encode. If “auto”, encode data when n_features <= 100.
mdl_method (mdl.Mdl) – If None, use ZeroIGKdeMdl for encoded values with >= 50% zeros, and use GKdeMdl otherwise.
linkage (str) – Linkage type for generating the hierarchy.
optimal_ordering (bool) – Whether to require a hierarchical clustering tree with optimal ordering. Default value is False.
dim_reduct_method ({"PCA", "t-SNE", "UMAP", None}) – If None, no dimensionality reduction before clustering.
verbose (bool) – Print stats for each iteration.
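
The two defaulting rules described for encode_type and mdl_method can be sketched as plain functions. This is only an illustration of the documented thresholds, not scedar's implementation: the function names are hypothetical, the fallback to "distance" when n_features > 100 is an assumption (the text above only states the "data" case), and the MDL classes are represented by their names as strings.

```python
def resolve_encode_type(encode_type, n_features):
    """Resolve "auto" with the documented rule: encode data when n_features <= 100."""
    if encode_type not in ("auto", "data", "distance"):
        raise ValueError(f"unknown encode_type: {encode_type!r}")
    if encode_type == "auto":
        # Assumption: anything wider than 100 features falls back to "distance".
        return "data" if n_features <= 100 else "distance"
    return encode_type


def default_mdl_method_name(encoded_values):
    """Name the default MDL method: ZeroIGKdeMdl for >= 50% zeros, GKdeMdl otherwise."""
    values = list(encoded_values)
    if not values:
        # Hypothetical guard; the documentation does not cover empty input.
        raise ValueError("encoded_values must not be empty")
    zero_fraction = sum(1 for v in values if v == 0) / len(values)
    return "ZeroIGKdeMdl" if zero_fraction >= 0.5 else "GKdeMdl"
```

Note that both thresholds are inclusive: exactly 100 features resolves to "data", and exactly 50% zeros selects ZeroIGKdeMdl.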

_sdm¶

Data and distance matrices.

Type:	SampleDistanceMatrix

_min_cl_n¶

Stored parameter.

Type:	int

_encode_type¶

Encode type. If “auto” provided, this attribute will store the determined encode type.

Type:	str

_mdl_method¶

Mdl method. If None is provided, this attribute will store the determined mdl method.

Type:	mdl.Mdl

labs¶

Labels of clustered samples, matched one-to-one to the samples from first to last.

Type:	label list

_hac_tree¶

Root node of the hierarchical agglomerative clustering tree.

Type:	eda.hct.HClustTree
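
As a rough picture of what the metric, linkage, and optimal_ordering parameters control, the sketch below builds a complete-linkage tree over cosine distances with SciPy. It assumes SciPy is available and is not a description of how scedar builds its own tree object; the random data is purely illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
x = rng.random((8, 5))  # 8 samples, 5 features (made-up data)

# Condensed pairwise cosine distances between samples.
condensed_d = pdist(x, metric="cosine")

# Complete-linkage agglomerative clustering, without and with optimal leaf ordering.
z = linkage(condensed_d, method="complete", optimal_ordering=False)
z_ordered = linkage(condensed_d, method="complete", optimal_ordering=True)

# Left-to-right order of the samples at the leaves of the tree.
leaf_order = leaves_list(z)
```

Optimal ordering only rearranges the left and right children of each merge, so both linkage matrices describe the same merges; it costs extra time, which is one reason a default of False is reasonable for large n.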

_run_log¶

String containing the log of the MIRAC run.

Type:	str

TODO¶

* Dendrogram representation of the splitting process.

* Take HCTree as parameter. Computing it is non-trivial when n is large.

* Simplify splitting criteria.

dmat_heatmap(selected_labels=None, col_labels=None, transform=None, title=None, xlab=None, ylab=None, figsize=(10, 10), **kwargs)[source]¶

labs

tune_parameters(cl_mdl_scale_factor=1, min_cl_n=25, min_split_mdl_red_ratio=0.2, soft_min_subtree_size=1, verbose=False)[source]¶

scedar.cluster.community¶

class scedar.cluster.community.Community(x, d=None, graph=None, metric='cosine', sids=None, fids=None, use_pdist=False, k=15, use_pca=True, use_hnsw=True, index_params=None, query_params=None, aff_scale=1, partition_method='RBConfigurationVertexPartition', resolution=1, random_state=None, n_iter=2, nprocs=1, verbose=False)[source]¶

Bases: object

Community clustering

Parameters:

x (float array) – Data matrix.
d (float array) – Distance matrix.
graph (igraph.Graph) – Need to have a weight attribute as affinity. If this argument is not None, the graph will directly be used for community clustering.
metric ({'cosine', 'euclidean'}) – Metric used for nearest neighbor computation.
sids (sid list) – List of sample ids.
fids (fid list) – List of feature ids.
use_pdist (bool) – Whether to use the pairwise distance matrix. The pairwise distance matrix may be too large to store for datasets with a large number of cells.
k (int) – The number of nearest neighbors.
use_pca (bool) – Use PCA for nearest neighbors or not.
use_hnsw (bool) – Use Hierarchical Navigable Small World graph to compute approximate nearest neighbor.
index_params (dict) –
Parameters used by HNSW in indexing.

efConstruction : int

Default 100. Higher value improves the quality of a constructed graph and leads to higher accuracy of search. However this also leads to longer indexing times. The reasonable range of values is 100-2000.

M : int

Default 5. Higher value leads to better recall and shorter retrieval times, at the expense of longer indexing time. The reasonable range of values is 5-100.

delaunay_type : {0, 1, 2, 3}

Default 2. Pruning heuristic, which affects the trade-off between retrieval performance and indexing time. The default is usually quite good.

post : {0, 1, 2}

Default 0. The amount and type of postprocessing applied to the constructed graph. 0 means no processing. 2 means more processing.

indexThreadQty : int

Default self._nprocs. The number of threads used.
query_params (dict) –
Parameters used by HNSW in querying.

efSearch : int

Default 100. Higher value improves recall at the expense of longer retrieval time. The reasonable range of values is 100-2000.
aff_scale (float > 0) – Scaling factor used for converting distance to affinity. Affinity = (max(distance) - distance) * aff_scale.
partition_method (str) –
Following methods are implemented in leidenalg package:
- RBConfigurationVertexPartition: well-defined only for positive edge weights.
- RBERVertexPartition: well-defined only for positive edge weights.
- CPMVertexPartition: well-defined for both positive and negative edge weights.
- SignificanceVertexPartition: well-defined only for unweighted graphs.
- SurpriseVertexPartition: well-defined only for positive edge weights.
resolution (float > 0) – Resolution used for community clustering. A higher value produces more clusters.
random_state (int) – Random number generator seed used for community clustering.
n_iter (int) – Number of iterations used for community clustering.
nprocs (int > 0) – The number of processes/cores used for community clustering.
verbose (bool) – Print progress or not.
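
The distance-to-affinity conversion given for aff_scale can be written out directly. The following is a minimal NumPy sketch of that formula only; the function name is hypothetical, and taking the maximum over whatever array is passed in is an assumption, since the text does not say which set of distances max(distance) ranges over.

```python
import numpy as np


def distance_to_affinity(distances, aff_scale=1.0):
    """Apply Affinity = (max(distance) - distance) * aff_scale elementwise."""
    if aff_scale <= 0:
        raise ValueError("aff_scale must be > 0")
    d = np.asarray(distances, dtype=float)
    # Assumption: the maximum is taken over the distances supplied here.
    return (d.max() - d) * aff_scale
```

Under this formula the largest distance always maps to an affinity of 0, and smaller distances map to larger positive weights, which matters for the partition methods that are well-defined only for positive edge weights.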

labs¶

Labels of clustered samples, matched one-to-one to the samples from first to last.

Type:	label list

_sdm¶

Data and distance matrices.

Type:	SampleDistanceMatrix

_graph¶

Graph used for clustering.

Type:	igraph.Graph

_la_res¶

Partition results computed by leidenalg.

Type:	leidenalg.VertexPartition

_k¶

_use_pca¶

_use_hnsw¶

_index_params¶

_query_params¶

_aff_scale¶

labs

scedar.cluster.community_mirac¶

class scedar.cluster.community_mirac.CommunityMIRAC(x, d=None, sids=None, fids=None, nprocs=1, verbose=False)[source]¶

Bases: object

CommunityMIRAC: Community + MIRAC clustering

Run community clustering with high resolution to get a large number of clusters. Then, run MIRAC on the community clusters.

Parameters:

x (float array) – Data matrix.
d (float array) – Distance matrix.
sids (sid list) – List of sample ids.
fids (fid list) – List of feature ids.
nprocs (int > 0) – The number of processes/cores used for community clustering.
verbose (bool) – Print progress or not.

_x¶

Data matrix.

Type:	float array

_d¶

Distance matrix.

Type:	float array

_sids¶

List of sample ids.

Type:	sid list

_fids¶

List of feature ids.

Type:	fid list

_nprocs¶

The number of processes/cores used for community clustering.

Type:	int > 0

_verbose¶

Print progress or not.

Type:	bool

_cm_res¶

Community clustering result.

Type:	cluster.Community

_cm_clp_x¶

Data array with samples collapsed by community clustering labels. For each cluster, the mean of all samples is a row in this array.

Type:	array
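
The collapsing step described for _cm_clp_x, one mean row per community cluster, can be sketched with NumPy as follows. The function name, the sorted order of the output rows, and the returned label list are assumptions for illustration; this is not the signature or behavior of scedar's own collapse_clusters.

```python
import numpy as np


def collapse_by_label(data_x, cluster_labs):
    """Return one row per cluster holding the mean of that cluster's samples."""
    x = np.asarray(data_x, dtype=float)
    labs = np.asarray(cluster_labs)
    if x.shape[0] != labs.shape[0]:
        raise ValueError("data_x and cluster_labs must have the same number of samples")
    # Assumption: output rows follow the sorted unique labels.
    uniq_labs = np.unique(labs)
    collapsed = np.vstack([x[labs == lab].mean(axis=0) for lab in uniq_labs])
    return collapsed, uniq_labs.tolist()
```

Collapsing first means the second-stage clustering works on as many rows as there are community clusters rather than on every sample, which keeps the hierarchical step small even for large datasets.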

_mirac_res¶

MIRAC clustering results on _cm_clp_x.

Type:	cluster.MIRAC

labs¶

List of labels.

Type:	list

static collapse_clusters(data_x, cluster_labs)[source]¶

labs

run(graph=None, metric='cosine', use_pdist=False, k=15, use_pca=True, use_hnsw=True, index_params=None, query_params=None, aff_scale=1, partition_method='RBConfigurationVertexPartition', resolution=100, random_state=None, n_iter=2, hac_tree=None, cl_mdl_scale_factor=1, min_cl_n=25, encode_type='auto', mdl_method=None, min_split_mdl_red_ratio=0.2, soft_min_subtree_size=1, linkage='complete', optimal_ordering=False, nprocs=None)[source]¶

run_community(graph=None, metric='cosine', use_pdist=False, k=15, use_pca=True, use_hnsw=True, index_params=None, query_params=None, aff_scale=1, partition_method='RBConfigurationVertexPartition', resolution=100, random_state=None, n_iter=2, nprocs=None)[source]¶

run_mirac(metric='cosine', hac_tree=None, cl_mdl_scale_factor=1, min_cl_n=25, encode_type='auto', mdl_method=None, min_split_mdl_red_ratio=0.2, soft_min_subtree_size=1, linkage='complete', optimal_ordering=False, dim_reduct_method=None, nprocs=None)[source]¶

tune_mirac(cl_mdl_scale_factor=1, min_cl_n=25, min_split_mdl_red_ratio=0.2, soft_min_subtree_size=1, verbose=False)[source]¶