scedar.cluster

scedar.cluster.mirac

class scedar.cluster.mirac.MIRAC(x, d=None, metric='cosine', sids=None, fids=None, hac_tree=None, nprocs=1, cl_mdl_scale_factor=1, min_cl_n=25, encode_type='auto', mdl_method=None, min_split_mdl_red_ratio=0.2, soft_min_subtree_size=1, linkage='complete', optimal_ordering=False, dim_reduct_method=None, verbose=False)

Bases: object

MIRAC: MDL iteratively regularized agglomerative clustering.

Parameters:
  • x (float array) – Data matrix.
  • d (float array) – Distance matrix.
  • metric (str) – Type of distance metric.
  • sids (sid list) – List of sample ids.
  • fids (fid list) – List of feature ids.
  • hac_tree (HCTree) – Hierarchical tree built by agglomerative clustering to divide in MIRAC. If provided, distance matrix will not be used for building another tree.
  • nprocs (int) – Number of processes used to run MIRAC in parallel.
  • cl_mdl_scale_factor (float) – Scale factor of cluster overhead mdl.
  • min_cl_n (int) – Minimum number of samples in a cluster.
  • encode_type ({"auto", "data", or "distance"}) – Type of values to encode. If “auto”, encode data when n_features <= 100.
  • mdl_method (mdl.Mdl) – If None, use ZeroIGKdeMdl for encoded values with >= 50% zeros, and use GKdeMdl otherwise.
  • linkage (str) – Linkage type for generating the hierarchy.
  • optimal_ordering (bool) – Whether to require a hierarchical clustering tree with optimal ordering. Default value is False.
  • dim_reduct_method ({"PCA", "t-SNE", "UMAP", None}) – If None, no dimensionality reduction before clustering.
  • verbose (bool) – Print stats for each iteration.
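
The documented defaults for encode_type and mdl_method read as simple selection rules. The sketch below restates them in plain Python. It is not scedar's implementation: the helper names resolve_encode_type and resolve_mdl_method_name are hypothetical, the fallback to "distance" when n_features > 100 is an assumption inferred from the two remaining options, and the second helper returns class names as strings rather than mdl.Mdl classes.

```python
def resolve_encode_type(encode_type, n_features):
    """Hypothetical helper restating the documented "auto" rule."""
    if encode_type not in ("auto", "data", "distance"):
        raise ValueError(f"unknown encode_type: {encode_type!r}")
    if encode_type == "auto":
        # Documented: encode data when n_features <= 100.
        # Assumed: encode distance otherwise.
        return "data" if n_features <= 100 else "distance"
    return encode_type


def resolve_mdl_method_name(encoded_values):
    """Hypothetical helper restating the documented mdl_method=None rule.

    Returns the name of the Mdl class as a string, for illustration only.
    """
    values = list(encoded_values)
    if not values:
        raise ValueError("encoded_values must not be empty")
    zero_fraction = sum(1 for v in values if v == 0) / len(values)
    # Documented: ZeroIGKdeMdl for >= 50% zeros, GKdeMdl otherwise.
    return "ZeroIGKdeMdl" if zero_fraction >= 0.5 else "GKdeMdl"
```

Under these assumptions, a data matrix with 100 features is encoded as data, and one with 101 features is encoded as distances.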
_sdm

Data and distance matrices.

Type:SampleDistanceMatrix
_min_cl_n

Stored parameter.

Type:int
_encode_type

Encode type. If “auto” provided, this attribute will store the determined encode type.

Type:str
_mdl_method

Mdl method. If None is provided, this attribute will store the determined mdl method.

Type:mdl.Mdl
labs

Labels of clustered samples, matched one-to-one with the samples in order from first to last.

Type:label list
_hac_tree

Root node of the hierarchical agglomerative clustering tree.

Type:eda.hct.HClustTree
_run_log

String containing the log of the MIRAC run.

Type:str
TODO
* Dendrogram representation of the splitting process.
* Take HCTree as parameter. Computing it is non-trivial when n is large.
* Simplify splitting criteria.
dmat_heatmap(selected_labels=None, col_labels=None, transform=None, title=None, xlab=None, ylab=None, figsize=(10, 10), **kwargs)
labs
tune_parameters(cl_mdl_scale_factor=1, min_cl_n=25, min_split_mdl_red_ratio=0.2, soft_min_subtree_size=1, verbose=False)

scedar.cluster.community

class scedar.cluster.community.Community(x, d=None, graph=None, metric='cosine', sids=None, fids=None, use_pdist=False, k=15, use_pca=True, use_hnsw=True, index_params=None, query_params=None, aff_scale=1, partition_method='RBConfigurationVertexPartition', resolution=1, random_state=None, n_iter=2, nprocs=1, verbose=False)

Bases: object

Community clustering

Parameters:
  • x (float array) – Data matrix.
  • d (float array) – Distance matrix.
  • graph (igraph.Graph) – Must have a weight attribute holding the affinity. If this argument is not None, the graph is used directly for community clustering.
  • metric ({'cosine', 'euclidean'}) – Metric used for nearest neighbor computation.
  • sids (sid list) – List of sample ids.
  • fids (fid list) – List of feature ids.
  • use_pdist (bool) – Whether to use the pairwise distance matrix. The pairwise distance matrix may be too large to store for datasets with a large number of cells.
  • k (int) – The number of nearest neighbors.
  • use_pca (bool) – Use PCA for nearest neighbors or not.
  • use_hnsw (bool) – Use a Hierarchical Navigable Small World graph to compute approximate nearest neighbors.
  • index_params (dict) –

    Parameters used by HNSW in indexing.

    efConstruction : int
    Default 100. A higher value improves the quality of the constructed graph and leads to higher search accuracy, at the cost of longer indexing times. The reasonable range of values is 100-2000.
    M : int
    Default 5. Higher value leads to better recall and shorter retrieval times, at the expense of longer indexing time. The reasonable range of values is 5-100.
    delaunay_type : {0, 1, 2, 3}
    Default 2. Pruning heuristic, which affects the trade-off between retrieval performance and indexing time. The default is usually quite good.
    post : {0, 1, 2}
    Default 0. The amount and type of postprocessing applied to the constructed graph. 0 means no processing. 2 means more processing.
    indexThreadQty : int
    Default self._nprocs. The number of threads used.
  • query_params (dict) –

    Parameters used by HNSW in querying.

    efSearch : int
    Default 100. Higher value improves recall at the expense of longer retrieval time. The reasonable range of values is 100-2000.
  • aff_scale (float > 0) – Scaling factor used for converting distance to affinity. Affinity = (max(distance) - distance) * aff_scale.
  • partition_method (str) –

    Following methods are implemented in leidenalg package:

    • RBConfigurationVertexPartition: only well-defined for positive edge weights.
    • RBERVertexPartition: well-defined only for positive edge weights.
    • CPMVertexPartition: well-defined for both positive and negative edge weights.
    • SignificanceVertexPartition: well-defined only for unweighted graphs.
    • SurpriseVertexPartition: well-defined only for positive edge weights.
  • resolution (float > 0) – Resolution used for community clustering. A higher value produces more clusters.
  • random_state (int) – Random number generator seed used for community clustering.
  • n_iter (int) – Number of iterations used for community clustering.
  • nprocs (int > 0) – The number of processes/cores used for community clustering.
  • verbose (bool) – Print progress or not.
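
The HNSW dictionaries and the aff_scale formula above can be illustrated with a short plain-Python sketch. The dictionaries only repeat the documented default values. The function distance_to_affinity is a hypothetical name, not a scedar function, and it assumes that max(distance) is taken over the whole collection of distances passed in; how scedar scopes the maximum is not stated here.

```python
# Illustrative dictionaries using the documented default values.
index_params = {
    "efConstruction": 100,  # reasonable range 100-2000
    "M": 5,                 # reasonable range 5-100
    "delaunay_type": 2,     # pruning heuristic
    "post": 0,              # no postprocessing
}
query_params = {"efSearch": 100}  # reasonable range 100-2000


def distance_to_affinity(distances, aff_scale=1.0):
    """Hypothetical helper: affinity = (max(distance) - distance) * aff_scale."""
    if aff_scale <= 0:
        raise ValueError("aff_scale must be > 0")
    distances = list(distances)
    if not distances:
        return []
    max_d = max(distances)  # assumed: maximum over all supplied distances
    return [(max_d - d) * aff_scale for d in distances]


affinities = distance_to_affinity([0.0, 0.25, 1.0], aff_scale=2)
```

With this formula the largest distance always maps to an affinity of 0, which may matter for partition methods that are only well-defined for positive edge weights.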
labs

Labels of clustered samples, matched one-to-one with the samples in order from first to last.

Type:label list
_sdm

Data and distance matrices.

Type:SampleDistanceMatrix
_graph

Graph used for clustering.

Type:igraph.Graph
_la_res

Partition results computed by leidenalg.

Type:leidenalg.VertexPartition
_k
_use_pca
_use_hnsw
_index_params
_query_params
_aff_scale
labs

scedar.cluster.community_mirac

class scedar.cluster.community_mirac.CommunityMIRAC(x, d=None, sids=None, fids=None, nprocs=1, verbose=False)

Bases: object

CommunityMIRAC: Community + MIRAC clustering

Run community clustering with high resolution to get a large number of clusters. Then, run MIRAC on the community clusters.

Parameters:
  • x (float array) – Data matrix.
  • d (float array) – Distance matrix.
  • sids (sid list) – List of sample ids.
  • fids (fid list) – List of feature ids.
  • nprocs (int > 0) – The number of processes/cores used for community clustering.
  • verbose (bool) – Print progress or not.
_x

Data matrix.

Type:float array
_d

Distance matrix.

Type:float array
_sids

List of sample ids.

Type:sid list
_fids

List of feature ids.

Type:fid list
_nprocs

The number of processes/cores used for community clustering.

Type:int > 0
_verbose

Print progress or not.

Type:bool
_cm_res

Community clustering result.

Type:cluster.Community
_cm_clp_x

Data array with samples collapsed by community clustering labels. For each cluster, the mean of all samples is a row in this array.

Type:array
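
The collapsing step described for _cm_clp_x, one mean row per community cluster, can be sketched in plain Python as follows. This is a minimal stand-in, not scedar's collapse_clusters: the function name collapse_by_cluster_mean is hypothetical, and emitting rows in sorted label order is an assumption.

```python
def collapse_by_cluster_mean(data_x, cluster_labs):
    """Hypothetical sketch: return one mean row per cluster label."""
    if len(data_x) != len(cluster_labs):
        raise ValueError("data_x and cluster_labs must have the same length")
    # Group sample rows by their cluster label.
    groups = {}
    for row, lab in zip(data_x, cluster_labs):
        groups.setdefault(lab, []).append(row)
    collapsed = []
    for lab in sorted(groups):  # assumed row order: sorted labels
        rows = groups[lab]
        # Column-wise mean over all samples in this cluster.
        collapsed.append([sum(col) / len(rows) for col in zip(*rows)])
    return collapsed


collapsed = collapse_by_cluster_mean(
    [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0]],
    [0, 0, 1],
)
```

Here the two samples labeled 0 are averaged into a single row, so three samples collapse to two rows that a second clustering stage could then operate on.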
_mirac_res

MIRAC clustering results on _cm_clp_x.

Type:cluster.MIRAC
labs

List of labels.

Type:list
static collapse_clusters(data_x, cluster_labs)
labs
run(graph=None, metric='cosine', use_pdist=False, k=15, use_pca=True, use_hnsw=True, index_params=None, query_params=None, aff_scale=1, partition_method='RBConfigurationVertexPartition', resolution=100, random_state=None, n_iter=2, hac_tree=None, cl_mdl_scale_factor=1, min_cl_n=25, encode_type='auto', mdl_method=None, min_split_mdl_red_ratio=0.2, soft_min_subtree_size=1, linkage='complete', optimal_ordering=False, nprocs=None)
run_community(graph=None, metric='cosine', use_pdist=False, k=15, use_pca=True, use_hnsw=True, index_params=None, query_params=None, aff_scale=1, partition_method='RBConfigurationVertexPartition', resolution=100, random_state=None, n_iter=2, nprocs=None)
run_mirac(metric='cosine', hac_tree=None, cl_mdl_scale_factor=1, min_cl_n=25, encode_type='auto', mdl_method=None, min_split_mdl_red_ratio=0.2, soft_min_subtree_size=1, linkage='complete', optimal_ordering=False, dim_reduct_method=None, nprocs=None)
tune_mirac(cl_mdl_scale_factor=1, min_cl_n=25, min_split_mdl_red_ratio=0.2, soft_min_subtree_size=1, verbose=False)