scedar.eda

scedar.eda.mdl

class scedar.eda.mdl.GKdeMdl(x, kde_bw_method='scott', dtype=dtype('float64'), copy=True)[source]

Bases: scedar.eda.mdl.Mdl

Use Gaussian kernel density estimation to compute mdl

Parameters:
  • x (1D np.number array) – data used to fit mdl
  • kde_bw_method – string KDE bandwidth estimation method being passed to scipy.stats.gaussian_kde. Options: * “scott”: Scott's rule of thumb. * “silverman”: Silverman's rule of thumb. * constant: the constant will be multiplied by x.std(ddof=1) internally, because scipy multiplies the bw_method value by the standard deviation. “Scipy weights its bandwidth by the covariance of the input data” [3]. * callable: scipy calls the function on self
  • dtype (np.dtype) – default to 64-bit float
  • copy (bool) – passed to np.array()
_x

data to fit

Type:1d float array
_n

number of elements in data

Type:int
_bw_method

bandwidth method

Type:str
_kde
Type:scipy kde
_logdens

log density

Type:1d float array
bandwidth
encode(qx, mdl_scale_factor=1)[source]

Encode query data using fitted KDE code

Parameters:
  • qx (1d float array) –
  • mdl_scale_factor (number) – multiply mdl by this number
Returns:

mdl

Return type:

float
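
A minimal usage sketch with hypothetical data; only the documented constructor, mdl property, and encode method are used:

import numpy as np
from scedar.eda.mdl import GKdeMdl

x = np.random.normal(size=100)          # hypothetical 1D data
m = GKdeMdl(x, kde_bw_method="scott")
print(m.mdl)                            # description length of x under the fitted KDE
print(m.encode(np.array([0.0, 1.5])))   # mdl of query values under the fitted code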

static gaussian_kde_logdens(x, bandwidth_method='scott', ret_kernel=False)[source]

Estimate the log density of input x with Gaussian kernel density estimation.

Parameters:
  • x (float array of shape (n_samples) or (n_samples, n_features)) – Data points for KDE estimation.
  • bandwidth_method (string) – KDE bandwidth estimation method being passed to scipy.stats.gaussian_kde.
kde
mdl
class scedar.eda.mdl.Mdl(x, dtype=dtype('float64'), copy=True)[source]

Bases: abc.ABC

Minimum description length abstract base class

Interface of various mdl schemas. Subclasses must implement mdl property
and encode method.
_x

data used to fit mdl

Type:1D np.number array
_n

number of points in x

Type:np.int
encode(x)[source]

Encode another 1D number array with fitted code

Parameters:x (1D np.number array) – data to encode

mdl
x
class scedar.eda.mdl.MultinomialMdl(x, dtype=dtype('float64'), copy=True)[source]

Bases: scedar.eda.mdl.Mdl

Encode discrete values using multinomial distribution

Parameters:
  • x (1D np.number array) – data used to fit mdl
  • dtype (np.dtype) – default to 64-bit float
  • copy (bool) – passed to np.array()

Note

When x has only one unique value, encode only the number of values.

encode(qx, use_adjescent_when_absent=False)[source]

Encode another 1D float array with fitted code

Parameters:
  • qx (1d float array) – query data
  • use_adjescent_when_absent (bool) – whether to use the adjacent value to compute query mdl. If not, uniform mdl is used. If two adjacent values have the same distance to the query value, choose the one with smaller mdl.
Returns:

qmdl (float)
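
A minimal usage sketch with hypothetical discrete values:

import numpy as np
from scedar.eda.mdl import MultinomialMdl

m = MultinomialMdl(np.array([0, 0, 1, 1, 1, 2]))
print(m.mdl)                       # description length of the fitted values
print(m.encode(np.array([1, 2])))  # qmdl of query values
# 5 is absent from the fitted values; fall back to the nearest fitted value
print(m.encode(np.array([5]), use_adjescent_when_absent=True))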

mdl
class scedar.eda.mdl.ZeroIGKdeMdl(x, kde_bw_method='scott', dtype=dtype('float64'), copy=True)[source]

Bases: scedar.eda.mdl.Mdl

Zero indicator Gaussian KDE MDL

Encode the 0s and non-0s using a Bernoulli distribution. Then, encode the non-0s using Gaussian KDE. Finally, one ternary value indicates whether x is all 0s, all non-0s, or a mixture of both.

Parameters:
  • x (1D np.number array) – data used to fit mdl
  • kde_bw_method – string KDE bandwidth estimation method being passed to scipy.stats.gaussian_kde. Options: * “scott”: Scott's rule of thumb. * “silverman”: Silverman's rule of thumb. * constant: the constant will be multiplied by x.std(ddof=1) internally, because scipy multiplies the bw_method value by the standard deviation. “Scipy weights its bandwidth by the covariance of the input data” [3]. * callable: scipy calls the function on self
  • dtype (np.dtype) – default to 64-bit float
  • copy (bool) – passed to np.array()

References

[1] https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html

[2] https://en.wikipedia.org/wiki/Kernel_density_estimation

[3] https://jakevdp.github.io/blog/2013/12/01/kernel-density-estimation/

[4] https://github.com/scipy/scipy/blob/v1.0.0/scipy/stats/kde.py#L42-L564
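
A minimal usage sketch of the scheme described above, with hypothetical sparse data:

import numpy as np
from scedar.eda.mdl import ZeroIGKdeMdl

x = np.array([0, 0, 0, 0.5, 1.2, 3.1])  # mixture of 0s and non-0s
m = ZeroIGKdeMdl(x)
# total mdl decomposes into the zero-indicator part and the KDE part
print(m.mdl, m.zi_mdl, m.kde_mdl)
print(m.x_nonzero)                       # non-zero values encoded by the KDE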

bandwidth
encode(qx)[source]

Encode qx

Parameters:qx (1d np number array) –
Returns:mdl (float)
kde
kde_mdl
mdl
x_nonzero
zi_mdl
class scedar.eda.mdl.ZeroIMdl(x, dtype=dtype('float64'), copy=True)[source]

Bases: scedar.eda.mdl.Mdl

Encode an indicator vector of 0s and non-0s

encode(qx)[source]

Encode another 1D number array with fitted code

Parameters:qx (1D np.number array) – data to encode

mdl
class scedar.eda.mdl.ZeroIMultinomialMdl(x, dtype=dtype('float64'), copy=True)[source]

Bases: scedar.eda.mdl.Mdl

encode(qx, use_adjescent_when_absent=False)[source]

Encode another 1D number array with fitted code

Parameters:
  • qx (1D np.number array) – data to encode
  • use_adjescent_when_absent (bool) – whether to use the adjacent value to compute query mdl; see MultinomialMdl.encode

mdl
scedar.eda.mdl.np_number_1d(x, dtype=dtype('float64'), copy=True)[source]

Convert x to 1d np number array

Parameters:
  • x (1d sequence of values convertible to np.number) –
  • dtype (np number type) – default to 64-bit float
  • copy (bool) – passed to np.array()
Returns:

xarr (1d np.number array)

Raises:

ValueError – If x is not convertible to the provided dtype, or is not 1D. If dtype is not a subdtype of np number.
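
For example (a sketch; exact error messages depend on the implementation):

import numpy as np
from scedar.eda.mdl import np_number_1d

np_number_1d([1, 2, 3])         # array([1., 2., 3.]), float64 by default
np_number_1d([[1, 2], [3, 4]])  # raises ValueError: input is not 1D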

scedar.eda.mtype

scedar.eda.mtype.check_is_valid_labs(labs)[source]
scedar.eda.mtype.check_is_valid_sfids(sfids)[source]
scedar.eda.mtype.is_uniq_np1darr(x)[source]

Test whether x is a 1D np array that only contains unique values.

scedar.eda.mtype.is_valid_full_cut_tree_mat(cmat)[source]

Validate scipy hierarchical clustering cut tree. The number of clusters should decrease from n to 1.

scedar.eda.mtype.is_valid_lab(lab)[source]
scedar.eda.mtype.is_valid_sfid(sfid)[source]

scedar.eda.plot

scedar.eda.plot.cluster_scatter(projection2d, labels=None, selected_labels=None, plot_different_markers=False, label_markers=None, shuffle_label_colors=False, gradient=None, xlim=None, ylim=None, title=None, xlab=None, ylab=None, figsize=(20, 20), add_legend=True, n_txt_per_cluster=3, alpha=1, s=0.5, random_state=None, **kwargs)[source]

Scatter plot for clustering illustration

Parameters:
  • projection2d (2 col numeric array) – (n, 2) matrix to plot
  • labels (list of labels) – labels of n samples
  • selected_labels (list of labels) – selected labels to plot
  • plot_different_markers (bool) – plot different markers for samples with different labels
  • label_markers (list of marker shapes) – passed to matplotlib plot
  • shuffle_label_colors (bool) – shuffle the colors of labels to avoid similar colors showing up in nearby clusters
  • gradient (list of number) – color gradient of n samples
  • title (str) –
  • xlab (str) – x axis label
  • ylab (str) – y axis label
  • figsize (tuple of two number) – (width, height)
  • add_legend (bool) –
  • n_txt_per_cluster (number) – the number of text labels to plot per cluster. Can be 0.
  • alpha (number) –
  • s (number) – size of the points
  • random_state (int) – random seed to shuffle features
  • **kwargs – passed to matplotlib plot
Returns:

matplotlib figure of the created scatter plot
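
A minimal usage sketch with hypothetical random data:

import numpy as np
from scedar.eda.plot import cluster_scatter

proj = np.random.normal(size=(50, 2))   # hypothetical (n, 2) projection
labs = [0] * 25 + [1] * 25              # hypothetical labels
fig = cluster_scatter(proj, labels=labs, title="demo",
                      xlab="dim1", ylab="dim2", figsize=(5, 5))
fig.savefig("demo.png")                 # the returned matplotlib figure can be saved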

scedar.eda.plot.heatmap(x, row_labels=None, col_labels=None, title=None, xlab=None, ylab=None, figsize=(20, 20), transform=None, shuffle_row_colors=False, shuffle_col_colors=False, random_state=None, row_label_order=None, col_label_order=None, **kwargs)[source]
scedar.eda.plot.hist_dens_plot(x, title=None, xlab=None, ylab=None, figsize=(5, 5), ax=None, **kwargs)[source]

Plot histogram and density plot of x.

scedar.eda.plot.labs_to_cmap(labels, return_lut=False, shuffle_colors=False, random_state=None)[source]
scedar.eda.plot.networkx_graph(ng, pos=None, alpha=0.05, figsize=(20, 20), gradient=None, labels=None, different_label_markers=True, node_size=30, node_with_labels=False, nx_draw_kwargs=None)[source]
scedar.eda.plot.regression_scatter(x, y, title=None, xlab=None, ylab=None, figsize=(5, 5), alpha=1, s=0.5, ax=None, **kwargs)[source]

Paired vector scatter plot.

scedar.eda.plot.swarm(x, labels=None, selected_labels=None, title=None, xlab=None, ylab=None, figsize=(10, 10), ax=None, **kwargs)[source]

scedar.eda.sdm

class scedar.eda.sdm.HClustTree(node, prev=None)[source]

Bases: object

Hierarchical clustering tree.

Implements simple tree operation routines. HCT is a binary unbalanced tree.

node

current node

Type:scipy.cluster.hierarchy.ClusterNode
prev

parent of current node

Type:HClustTree
bi_partition(soft_min_subtree_size=1, return_subtrees=False)[source]

soft_min_subtree_size: when the current tree size < 2 * soft_min_subtree_size, it is impossible to have a bipartition with a minimum subtree size bigger than soft_min_subtree_size. In this case, return the first partition.

When soft_min_subtree_size = 1, the behavior is the same as taking the first bipartition.

When curr size = 1, the first bipartition gives (1, 0). Because curr size < 2 * soft_min_subtree_size, it goes directly to return.

When curr size = 2, the first bipartition is guaranteed to give (1, 1), with the invariant that parent nodes of leaves always have 2 child nodes. This also goes directly to return.

When curr size >= 3, the first bipartition is guaranteed to give two subtrees with size >= 1, with the same invariant as in size = 2.
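
A small sketch of building a tree and inspecting its first bipartition, assuming hclust_tree returns the root HClustTree of the linkage built from a hypothetical distance matrix; only documented methods are used:

import numpy as np
from scedar.eda.sdm import HClustTree

dmat = np.array([[0, 1, 5, 6],
                 [1, 0, 5, 6],
                 [5, 5, 0, 1],
                 [6, 6, 1, 0]], dtype=float)
root = HClustTree.hclust_tree(dmat, linkage="complete")
print(root.count())                                  # 4 leaves in total
print(root.left_leaf_ids(), root.right_leaf_ids())   # the first bipartition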

static cluster_id_to_lab_list(cl_sid_list, sid_list)[source]

Convert nested clustered ID list into cluster label list.

For example, convert [[0, 1, 2], [3,4]] to [0, 0, 0, 1, 1] according to id_arr [0, 1, 2, 3, 4]

Parameters

cl_sid_list: list[list[id]]
Nested list with each sublist as a set of IDs from a cluster.
sid_list: list[id]
Flat list of sample IDs.
count()[source]
static hclust_linkage(dmat, linkage='complete', n_eval_rounds=None, is_euc_dist=False, optimal_ordering=False, verbose=False)[source]
static hclust_tree(dmat, linkage='complete', n_eval_rounds=None, is_euc_dist=False, optimal_ordering=False, verbose=False)[source]
static hct_from_lkg(hac_z)[source]
leaf_ids()[source]

Returns the list of leaf IDs from left to right

Returns:list of leaf IDs
left()[source]
left_count()[source]
left_leaf_ids()[source]
n_round_bipar_cnt(n)[source]
prev
right()[source]
right_count()[source]
right_leaf_ids()[source]
static sort_x_by_d(x, dmat=None, metric='cosine', linkage='auto', n_eval_rounds=None, optimal_ordering=False, nprocs=None, verbose=False)[source]
class scedar.eda.sdm.SampleDistanceMatrix(x, d=None, metric='cosine', use_pdist=True, sids=None, fids=None, nprocs=None)[source]

Bases: scedar.eda.sfm.SampleFeatureMatrix

SampleDistanceMatrix: data with pairwise distance matrix

Parameters:
  • x (ndarray or list) – data matrix (n_samples, n_features)
  • d (ndarray or list or None) – distance matrix (n_samples, n_samples) If is None, d will be computed with x, metric, and nprocs.
  • metric (string) – distance metric
  • use_pdist (boolean) – to use the pairwise distance matrix or not. The pairwise distance matrix may be too large to save for datasets with a large number of cells.
  • sids (homogeneous list of int or string) – sample ids. Should not contain duplicated elements.
  • fids (homogeneous list of int or string) – feature ids. Should not contain duplicated elements.
  • nprocs (int) – the number of processes for computing pairwise distance matrix
_x

data matrix (n_samples, n_features)

Type:ndarray
_d

distance matrix (n_samples, n_samples)

Type:ndarray
_metric

distance metric

Type:string
_sids

sample ids.

Type:ndarray
_fids

feature ids.

Type:ndarray
_tsne_lut

lookup table for previous tsne calculations. Each run has an indexed entry, {(param_str, index) : tsne_res}

Type:dict
_last_tsne

The last stored t-SNE results. If no t-SNE has been performed before, a run with default parameters will be performed.

Type:float array
_hnsw_index_lut
Type:{string_index_parameters: hnsw_index}
_last_k

The last k used for s_knns computation.

Type:int
_last_knns

The last computed s_knns.

Type:(knn_indices, knn_distances)
_knn_ng_lut

{(k, aff_scale): knn_graph}

Type:dict
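
A minimal construction sketch with hypothetical data:

import numpy as np
from scedar.eda.sdm import SampleDistanceMatrix

x = np.random.normal(size=(20, 5))      # hypothetical (n_samples, n_features)
sdm = SampleDistanceMatrix(x, metric="cosine")
print(sdm.d.shape)                      # (20, 20) pairwise distance matrix
knn_inds, knn_dists = sdm.s_knns(k=3)   # 3 nearest neighbors per sample
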
static correlation_pdist(x)[source]

Compute pairwise correlation pdist for x (n_samples, n_features).

Adapted from Waylon Flinn’s post on https://stackoverflow.com/a/20687984/4638182 .

Parameters:x (ndarray) – (n_samples, n_features)
Returns:d – Pairwise distance matrix, (n_samples, n_samples).
Return type:ndarray
static cosine_pdist(x)[source]

Compute pairwise cosine pdist for x (n_samples, n_features).

Adapted from Waylon Flinn’s post on https://stackoverflow.com/a/20687984/4638182 .

Cosine distance is undefined if one of the vectors contains only 0s.

Parameters:x (ndarray) – (n_samples, n_features)
Returns:d – Pairwise distance matrix, (n_samples, n_samples).
Return type:ndarray
d
get_tsne_kv(key)[source]

Get t-SNE results from the lookup table. Return None if non-existent.

Returns:res_tuple – (key, val) pair of tsne result.
Return type:tuple
id_x(selected_sids=None, selected_fids=None)[source]

Subset samples by (sample IDs, feature IDs).

Parameters:
  • selected_sids (id array) – ID array of selected samples. If is None, select all.
  • selected_fids (id array) – ID array of selected features. If is None, select all.
Returns:

subset

Return type:

SampleDistanceMatrix

ind_x(selected_s_inds=None, selected_f_inds=None)[source]

Subset samples by (sample indices, feature indices).

Parameters:
  • selected_s_inds (int array) – Index array of selected samples. If is None, select all.
  • selected_f_inds (int array) – Index array of selected features. If is None, select all.
Returns:

subset

Return type:

SampleDistanceMatrix

static knn_conn_mat_to_aff_graph(knn_conn_mat, aff_scale=1)[source]
metric
static num_correct_dist_mat(dmat, upper_bound=None)[source]
par_tsne(param_list, store_res=True, nprocs=1)[source]

Run t-SNE with multiple sets of parameters in parallel.

Parameters:
  • param_list (list of dict) – List of parameters being passed to t-SNE.
  • nprocs (int) – Number of processes.
Returns:

tsne_res_list – List of t-SNE results of corresponding parameter set.

Return type:

list of float arrays

Notes

Results cannot be stored during a parallel run, because race conditions may happen.

pca_feature_gradient_plot(fid, component_ind_pair=(0, 1), transform=None, labels=None, selected_labels=None, plot_different_markers=False, label_markers=None, shuffle_label_colors=False, xlim=None, ylim=None, title=None, xlab=None, ylab=None, figsize=(20, 20), add_legend=True, n_txt_per_cluster=3, alpha=1, s=0.5, random_state=None, **kwargs)[source]

Plot the last PCA projection with the provided gradient as color.

Parameters:
  • component_ind_pair (tuple of two ints) – Indices of the components to plot.
  • fid (feature id scalar) – ID of the feature to be used for gradient plot.
  • transform (callable) – Map transform on feature before plotting.
  • labels (label array) – Labels assigned to each point, (n_samples,).
  • selected_labels (label array) – Show gradient only for selected labels. Do not show non-selected.
pca_plot(component_ind_pair=(0, 1), gradient=None, labels=None, selected_labels=None, plot_different_markers=False, label_markers=None, shuffle_label_colors=False, xlim=None, ylim=None, title=None, xlab=None, ylab=None, figsize=(20, 20), add_legend=True, n_txt_per_cluster=3, alpha=1, s=0.5, random_state=None, **kwargs)[source]

Plot the PCA projection with the provided gradient as color. Gradient is None by default.

put_tsne(str_params, res)[source]

Put t-SNE results into the lookup table.

s_ith_nn_d(i)[source]

Computes the distances of the i-th nearest neighbor of all samples.

s_ith_nn_d_dist(i, xlab=None, ylab=None, title=None, figsize=(5, 5), ax=None, **kwargs)[source]

Plot the distances of the i-th nearest neighbor of all samples.

s_ith_nn_ind(i)[source]

Computes the sample indices of the i-th nearest neighbor of all samples.

s_knn_connectivity_matrix(k, metric=None, use_pca=False, use_hnsw=False, index_params=None, query_params=None, verbose=False)[source]

Computes the connectivity matrix of the KNN of samples. If an entry (i, j) has value 0, node i is not in node j's KNN. If an entry (i, j) has a value != 0, node i is in node j's KNN, and their distance is the entry value. If two NNs have distance equal to 0, the 0 will be replaced by -np.inf.

Parameters:
  • k (int) – The number of nearest neighbors.
  • metric ({'cosine', 'euclidean', None}) – If none, self._metric is used.
  • use_pca (bool) – Use PCA for nearest neighbors or not.
  • use_hnsw (bool) – Use Hierarchical Navigable Small World graph to compute approximate nearest neighbor.
  • index_params (dict) –

    Parameters used by HNSW in indexing.

    efConstruction: int
    Default 100. Higher value improves the quality of a constructed graph and leads to higher accuracy of search. However this also leads to longer indexing times. The reasonable range of values is 100-2000.
    M: int
    Default 5. Higher value leads to better recall and shorter retrieval times, at the expense of longer indexing time. The reasonable range of values is 5-100.
    delaunay_type: {0, 1, 2, 3}
    Default 2. Pruning heuristic, which affects the trade-off between retrieval performance and indexing time. The default is usually quite good.
    post: {0, 1, 2}
    Default 0. The amount and type of postprocessing applied to the constructed graph. 0 means no processing. 2 means more processing.
    indexThreadQty: int
    Default self._nprocs. The number of threads used.
  • query_params (dict) –

    Parameters used by HNSW in querying.

    efSearch: int
    Default 100. Higher value improves recall at the expense of longer retrieval time. The reasonable range of values is 100-2000.
Returns:

knn_conn_mat – (n_samples, n_samples) Non-zero entries are nearest neighbors (NNs). The values are distances. If two NNs have distance equal to 0, the 0 will be replaced by -np.inf.

Return type:

float array
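
A short sketch of reading the matrix under the semantics above, reusing a hypothetical sdm instance like the one in the class example:

import numpy as np

conn = sdm.s_knn_connectivity_matrix(k=3)
rows, cols = np.nonzero(conn)  # each non-zero (i, j) marks a KNN relation
# entry values are distances; exact 0 distances are stored as -np.inf (see above)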

s_knn_graph(k, gradient=None, labels=None, different_label_markers=True, aff_scale=1, iterations=2000, figsize=(20, 20), node_size=30, alpha=0.05, random_state=None, init_pos=None, node_with_labels=False, fa2_kwargs=None, nx_draw_kwargs=None)[source]

Draw the KNN graph of SampleDistanceMatrix. The graph layout uses forceatlas2 for its speed on large graphs.

Parameters:
  • k (int) –
  • gradient (float array) – (n_samples,) color gradient
  • labels (label list) – (n_samples,) labels
  • different_label_markers (bool) – whether plot different labels with different markers
  • aff_scale (float) – Affinity is calculated by (max(distance) - distance) * aff_scale
  • iterations (int) – ForceAtlas2 iterations
  • figsize ((float, float)) –
  • node_size (float) –
  • alpha (float) –
  • random_state (int) –
  • init_pos (float array) – Initial position of ForceAtlas2, (n_samples, 2).
  • node_with_labels (bool) –
  • fa2_kwargs (dict) –
  • nx_draw_kwargs (dict) –
Returns:

fig – KNN graph.

Return type:

matplotlib figure

s_knn_ind_lut(k, metric=None, use_pca=False, use_hnsw=False, index_params=None, query_params=None, verbose=False)[source]

Computes the lookup table mapping sample i to its KNN indices, i.e. {i : [1st_NN_ind, 2nd_NN_ind, …, kth_NN_ind], …}

s_knns(k, metric=None, use_pca=False, use_hnsw=False, index_params=None, query_params=None, verbose=False)[source]

Computes the k-nearest neighbors (KNNs) of samples.

Parameters:
  • k (int) – The number of nearest neighbors.
  • metric ({'cosine', 'euclidean', None}) – If none, self._metric is used.
  • use_pca (bool) – Use PCA for nearest neighbors or not.
  • use_hnsw (bool) – Use Hierarchical Navigable Small World graph to compute approximate nearest neighbor.
  • index_params (dict) –

    Parameters used by HNSW in indexing.

    efConstruction: int
    Default 100. Higher value improves the quality of a constructed graph and leads to higher accuracy of search. However this also leads to longer indexing times. The reasonable range of values is 100-2000.
    M: int
    Default 5. Higher value leads to better recall and shorter retrieval times, at the expense of longer indexing time. The reasonable range of values is 5-100.
    delaunay_type: {0, 1, 2, 3}
    Default 2. Pruning heuristic, which affects the trade-off between retrieval performance and indexing time. The default is usually quite good.
    post: {0, 1, 2}
    Default 0. The amount and type of postprocessing applied to the constructed graph. 0 means no processing. 2 means more processing.
    indexThreadQty: int
    Default self._nprocs. The number of threads used.
  • query_params (dict) –

    Parameters used by HNSW in querying.

    efSearch: int
    Default 100. Higher value improves recall at the expense of longer retrieval time. The reasonable range of values is 100-2000.
  • verbose (bool) –
Returns:

  • knn_indices (list of numpy arrays) – The i-th array is the KNN indices of the i-th sample.
  • knn_distances (list of numpy arrays) – The i-th array is the KNN distances of the i-th sample.

sort_features(fdist_metric='cosine', optimal_ordering=False)[source]
to_classified(labels)[source]

Convert to SingleLabelClassifiedSamples

Parameters:labels (list of labels) – sample labels.
Returns:SingleLabelClassifiedSamples
tsne(store_res=True, **kwargs)[source]

Run t-SNE on distance matrix.

Parameters:
  • store_res (bool) – Store the results in lookup table or not.
  • **kwargs – Keyword arguments passed to tsne computation.
Returns:

tsne_res – t-SNE projections, (n_samples, m dimensions).

Return type:

float array

tsne_feature_gradient_plot(fid, transform=None, labels=None, selected_labels=None, plot_different_markers=False, label_markers=None, shuffle_label_colors=False, xlim=None, ylim=None, title=None, xlab=None, ylab=None, figsize=(20, 20), add_legend=True, n_txt_per_cluster=3, alpha=1, s=0.5, random_state=None, **kwargs)[source]

Plot the last t-SNE projection with the provided gradient as color.

Parameters:
  • fid (feature id scalar) – ID of the feature to be used for gradient plot.
  • transform (callable) – Map transform on feature before plotting.
  • labels (label array) – Labels assigned to each point, (n_samples,).
  • selected_labels (label array) – Show gradient only for selected labels. Do not show non-selected.
tsne_lut
tsne_plot(gradient=None, labels=None, selected_labels=None, plot_different_markers=False, label_markers=None, shuffle_label_colors=False, xlim=None, ylim=None, title=None, xlab=None, ylab=None, figsize=(20, 20), add_legend=True, n_txt_per_cluster=3, alpha=1, s=0.5, random_state=None, **kwargs)[source]

Plot the last t-SNE projection with the provided gradient as color. Gradient is None by default.

umap(use_pca=True, n_neighbors=5, n_components=2, n_epochs=None, learning_rate=1.0, init='spectral', min_dist=0.1, spread=1.0, set_op_mix_ratio=1.0, local_connectivity=1.0, repulsion_strength=1.0, negative_sample_rate=5, transform_queue_size=4.0, a=None, b=None, random_state=None, metric_kwds=None, angular_rp_forest=False, target_n_neighbors=-1, target_metric='categorical', target_metric_kwds=None, target_weight=0.5, transform_seed=42, verbose=False)[source]
umap_feature_gradient_plot(fid, component_ind_pair=(0, 1), transform=None, labels=None, selected_labels=None, plot_different_markers=False, label_markers=None, shuffle_label_colors=False, xlim=None, ylim=None, title=None, xlab=None, ylab=None, figsize=(20, 20), add_legend=True, n_txt_per_cluster=3, alpha=1, s=0.5, random_state=None, **kwargs)[source]

Plot the last UMAP projection with the provided gradient as color.

Parameters:
  • component_ind_pair (tuple of two ints) – Indices of the components to plot.
  • fid (feature id scalar) – ID of the feature to be used for gradient plot.
  • transform (callable) – Map transform on feature before plotting.
  • labels (label array) – Labels assigned to each point, (n_samples,).
  • selected_labels (label array) – Show gradient only for selected labels. Do not show non-selected.
umap_plot(component_ind_pair=(0, 1), gradient=None, labels=None, selected_labels=None, plot_different_markers=False, label_markers=None, shuffle_label_colors=False, xlim=None, ylim=None, title=None, xlab=None, ylab=None, figsize=(20, 20), add_legend=True, n_txt_per_cluster=3, alpha=1, s=0.5, random_state=None, **kwargs)[source]

Plot the UMAP projection with the provided gradient as color. Gradient is None by default.

TODO: refactor plotting interface. Merge multiple plotting methods into one.

scedar.eda.sdm.tsne(x, n_components=2, perplexity=30.0, early_exaggeration=12.0, learning_rate=200.0, n_iter=1000, n_iter_without_progress=300, min_grad_norm=1e-07, metric='euclidean', init='random', verbose=0, random_state=None, method='barnes_hut', angle=0.5)[source]

scedar.eda.sfm

class scedar.eda.sfm.SampleFeatureMatrix(x, sids=None, fids=None)[source]

Bases: object

SampleFeatureMatrix is a (n_samples, n_features) matrix.

In this package, we are only interested in float features as measured expression levels.

Parameters:
  • x ({array-like, sparse matrix}) – data matrix (n_samples, n_features)
  • sids (homogeneous list of int or string) – sample ids. Should not contain duplicated elements.
  • fids (homogeneous list of int or string) – feature ids. Should not contain duplicated elements.
_x

data matrix (n_samples, n_features)

Type:{array-like, sparse matrix}
_is_sparse

whether the data matrix is sparse matrix or not

Type:boolean
_sids

sample ids.

Type:ndarray
_fids

feature ids.

Type:ndarray
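
A minimal usage sketch with hypothetical count data:

import numpy as np
from scedar.eda.sfm import SampleFeatureMatrix

x = np.random.poisson(1, size=(30, 10)).astype(float)  # hypothetical matrix
sfm = SampleFeatureMatrix(x, sids=list(range(30)), fids=list(range(10)))
print(sfm.s_sum().shape)   # (n_features,): per-feature sum over samples
print(sfm.f_sum().shape)   # (n_samples,): per-sample sum over features
sub = sfm.ind_x(selected_s_inds=np.arange(10))  # first 10 samples, all features
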
f_cv(f_cv_filter=None)[source]

For each sample, compute the coefficient of variation of all features.

Returns:xf – (filtered_n_samples,)
Return type:float array
f_cv_dist(f_cv_filter=None, xlab=None, ylab=None, title=None, figsize=(5, 5), ax=None, **kwargs)[source]

Plot the distribution of the feature coefficient of variation of each sample, (n_samples,).

f_gc(f_gc_filter=None)[source]

For each sample, compute the Gini coefficients of all features.

Returns:xf – (filtered_n_samples,)
Return type:float array
f_gc_dist(f_gc_filter=None, xlab=None, ylab=None, title=None, figsize=(5, 5), ax=None, **kwargs)[source]

Plot the distribution of the feature Gini coefficient of each sample, (n_samples,).

f_id_dist(f_id, sample_filter=None, xlab=None, ylab=None, title=None, figsize=(5, 5), ax=None, **kwargs)[source]
f_id_regression_scatter(xf_id, yf_id, sample_filter=None, xlab=None, ylab=None, title=None, **kwargs)[source]

Regression plot on two features with xf_id and yf_id.

Parameters:
  • xf_id (int) – Feature ID of x.
  • yf_id (int) – Feature ID of y.
  • sample_filter (bool array, or int array, or callable(x, y)) – If sample_filter is a bool / int array, directly select samples with it. If sample_filter is callable, it will be applied on each (x, y) value tuple.
  • xlab (str) –
  • ylab (str) –
  • title (str) –
f_id_to_ind(selected_fids)[source]

Convert a list of feature IDs into feature indices.

f_id_x_vec(f_id, sample_filter=None)[source]
f_ind_dist(f_ind, sample_filter=None, xlab=None, ylab=None, title=None, figsize=(5, 5), ax=None, **kwargs)[source]
f_ind_regression_scatter(xf_ind, yf_ind, sample_filter=None, xlab=None, ylab=None, title=None, **kwargs)[source]

Regression plot on two features with xf_ind and yf_ind.

Parameters:
  • xf_ind (int) – Feature index of x.
  • yf_ind (int) – Feature index of y.
  • sample_filter (bool array, or int array, or callable(x, y)) – If sample_filter is a bool / int array, directly select samples with it. If sample_filter is callable, it will be applied on each (x, y) value tuple.
  • xlab (str) –
  • ylab (str) –
  • title (str) –
f_ind_x_pair(xf_ind, yf_ind, sample_filter=None)[source]
f_ind_x_vec(f_ind, sample_filter=None, transform=None)[source]

Access the vector of a single feature across samples.

f_n_above_threshold(closed_threshold)[source]

For each sample, compute the number of features above a closed threshold.

f_n_above_threshold_dist(closed_threshold, xlab=None, ylab=None, title=None, figsize=(5, 5), ax=None, **kwargs)[source]

Plot the distribution of the number of above-threshold features of each sample, (n_samples,).

f_sum(f_sum_filter=None)[source]

For each sample, compute the sum of all features.

Returns:rowsum – (filtered_n_samples,)
Return type:float array
f_sum_dist(f_sum_filter=None, xlab=None, ylab=None, title=None, figsize=(5, 5), ax=None, **kwargs)[source]

Plot the distribution of the feature sum of each sample, (n_samples,).

fids
static filter_1d_inds(f, x)[source]
id_x(selected_sids=None, selected_fids=None)[source]

Subset samples by (sample IDs, feature IDs).

Parameters:
  • selected_sids (id array) – ID array of selected samples. If is None, select all.
  • selected_fids (id array) – ID array of selected features. If is None, select all.
Returns:

subset

Return type:

SampleFeatureMatrix

ind_x(selected_s_inds=None, selected_f_inds=None)[source]

Subset samples by (sample indices, feature indices).

Parameters:
  • selected_s_inds (int array) – Index array of selected samples. If is None, select all.
  • selected_f_inds (int array) – Index array of selected features. If is None, select all.
Returns:

subset

Return type:

SampleFeatureMatrix

s_cv(s_cv_filter=None)[source]

For each feature, compute the coefficient of variation of all samples.

Returns:xf – (n_features,)
Return type:float array
s_cv_dist(s_cv_filter=None, xlab=None, ylab=None, title=None, figsize=(5, 5), ax=None, **kwargs)[source]

Plot the distribution of the sample coefficient of variation of each feature, (n_features,).

s_gc(s_gc_filter=None)[source]

For each feature, compute the Gini coefficient of all samples.

Returns:xf – (n_features,)
Return type:float array
s_gc_dist(s_gc_filter=None, xlab=None, ylab=None, title=None, figsize=(5, 5), ax=None, **kwargs)[source]

Plot the distribution of the sample Gini coefficients of each feature, (n_features,).

s_id_dist(s_id, feature_filter=None, xlab=None, ylab=None, title=None, figsize=(5, 5), ax=None, **kwargs)[source]
s_id_regression_scatter(xs_id, ys_id, feature_filter=None, xlab=None, ylab=None, title=None, **kwargs)[source]

Regression plot on two samples with xs_id and ys_id.

Parameters:
  • xs_id (int) – Sample ID of x.
  • ys_id (int) – Sample ID of y.
  • feature_filter (bool array, or int array, or callable(x, y)) – If feature_filter is bool / int array, directly select features with it. If feature_filter is callable, it will be applied on each (x, y) value tuple.
  • xlab (str) –
  • ylab (str) –
  • title (str) –
s_id_to_ind(selected_sids)[source]

Convert a list of sample IDs into sample indices.

s_ind_dist(s_ind, feature_filter=None, xlab=None, ylab=None, title=None, figsize=(5, 5), ax=None, **kwargs)[source]
s_ind_regression_scatter(xs_ind, ys_ind, feature_filter=None, xlab=None, ylab=None, title=None, **kwargs)[source]

Regression plot on two samples with xs_ind and ys_ind.

Parameters:
  • xs_ind (int) – Sample index of x.
  • ys_ind (int) – Sample index of y.
  • feature_filter (bool array, or int array, or callable(x, y)) – If feature_filter is bool / int array, directly select features with it. If feature_filter is callable, it will be applied on each (x, y) value tuple.
  • xlab (str) –
  • ylab (str) –
  • title (str) –
s_ind_x_pair(xs_ind, ys_ind, feature_filter=None)[source]
s_ind_x_vec(s_ind, feature_filter=None)[source]

Access a single vector of a sample.

s_n_above_threshold(closed_threshold)[source]

For each feature, compute the number of samples above a closed threshold.

s_n_above_threshold_dist(closed_threshold, xlab=None, ylab=None, title=None, figsize=(5, 5), ax=None, **kwargs)[source]

Plot the distribution of the number of above-threshold samples of each feature, (n_features,).

s_sum(s_sum_filter=None)[source]

For each feature, compute the sum of all samples.

Returns:xf – (filtered_n_features,)
Return type:float array
s_sum_dist(s_sum_filter=None, xlab=None, ylab=None, title=None, figsize=(5, 5), ax=None, **kwargs)[source]

Plot the distribution of the sample sum of each feature, (n_features,).

sids
x

scedar.eda.slcs

class scedar.eda.slcs.MDLSingleLabelClassifiedSamples(x, labs, sids=None, fids=None, encode_type='data', mdl_method=<class 'scedar.eda.mdl.ZeroIGKdeMdl'>, d=None, metric='correlation', nprocs=None)[source]

Bases: scedar.eda.slcs.SingleLabelClassifiedSamples

MDLSingleLabelClassifiedSamples inherits SingleLabelClassifiedSamples to offer MDL operations.

Parameters:
  • x (2d number array) – data matrix
  • labs (list of str or int) – labels
  • sids (list of str or int) – sample ids
  • fids (list of str or int) – feature ids
  • encode_type ("auto", "data", or "distance") – Type of values to encode. If “auto”, encode data when n_features <= 100.
  • mdl_method (mdl.Mdl) – If None, use ZeroIGKdeMdl for encoded values with >= 50% zeros, and use GKdeMdl otherwise.
  • d (2d number array) – distance matrix
  • metric (str) – distance metric for scipy
  • nprocs (int) –
_mdl_method
Type:mdl.Mdl
class LabMdlResult(ulab_mdl_sum, ulab_s_inds, ulab_cnts, ulab_mdls, cluster_mdl)

Bases: tuple

cluster_mdl

Alias for field number 4

ulab_cnts

Alias for field number 2

ulab_mdl_sum

Alias for field number 0

ulab_mdls

Alias for field number 3

ulab_s_inds

Alias for field number 1

static compute_cluster_mdl(labs, cl_mdl_scale_factor=1)[source]

Additional MDL for encoding the cluster

  • labels are encoded by multinomial distribution
  • parameters are encoded by a 32-bit float: np.log(2**32) = 22.18070977791825
  • scaled by factor

TODO: formalize parameter mdl

encode(qx, col_summary_func=<built-in function sum>, non_zero_only=False, nprocs=1, verbose=False)[source]

Encode input array qx with the fitted code, without using labels.

Parameters:
  • qx (2d np number array) –
  • col_summary_func (callable) – function applied on column mdls
  • non_zero_only (bool) – whether to encode non-zero entries only
  • nprocs (int) –
  • verbose (bool) –
Returns:

mdl for encoding qx

Return type:

float

lab_mdl(cl_mdl_scale_factor=1, nprocs=1, verbose=False, ret_internal=False)[source]

Compute mdl of each feature after separating samples by labels

Parameters:
  • cl_mdl_scale_factor (float) – multiplies cluster related mdl by this number
  • nprocs (int) –
  • verbose (bool) – Not implemented
Returns:

mdl of matrix after separating samples by labels

Return type:

float

no_lab_mdl(nprocs=1, verbose=False)[source]

Compute mdl of each feature without separating samples by labels

Parameters:
  • nprocs (int) –
  • verbose (bool) – Not implemented
Returns:

mdl of matrix without separating samples by labels

Return type:

float
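
A sketch comparing labeled and unlabeled description lengths on hypothetical two-cluster data; per the docs above both calls return floats, and a lower labeled mdl suggests the labels compress the data well:

import numpy as np
from scedar.eda.slcs import MDLSingleLabelClassifiedSamples

x = np.vstack([np.random.normal(0, 1, (20, 5)),
               np.random.normal(5, 1, (20, 5))])  # hypothetical clusters
mslcs = MDLSingleLabelClassifiedSamples(x, labs=[0] * 20 + [1] * 20)
print(mslcs.no_lab_mdl())  # mdl without separating samples by labels
print(mslcs.lab_mdl())     # mdl after separating samples by labels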

static per_col_encoders(x, encode_type, mdl_method=<class 'scedar.eda.mdl.ZeroIGKdeMdl'>, nprocs=1, verbose=False)[source]

Compute mdl encoder for each column

Parameters:
  • x (2d number array) –
  • encode_type ("data" or "distance") –
  • mdl_method (mdl.Mdl) –
  • nprocs (int) –
  • verbose (bool) –
Returns:

list of column mdl encoders of x

class scedar.eda.slcs.SingleLabelClassifiedSamples(x, labs, sids=None, fids=None, d=None, metric='cosine', use_pdist=True, nprocs=None)[source]

Bases: scedar.eda.sdm.SampleDistanceMatrix

Data structure of single label classified samples

_x

(n_samples, n_features) data matrix.

Type:2D number array
_d

(n_samples, n_samples) distance matrix.

Type:2D number array
_labs

list of labels in the same type, int or str.

Type:list of labels
_fids

list of feature IDs in the same type, int or str.

Type:list of feature IDs
_sids

list of sample IDs in the same type, int or str.

Type:list of sample IDs
_metric

Distance metric.

Type:str

Note

If sorted by labels, the samples will be reordered, so that samples from left to right are grouped from one label to another.

cross_labs(q_slc_samples)[source]
dmat_heatmap(selected_labels=None, col_labels=None, transform=None, title=None, xlab=None, ylab=None, figsize=(10, 10), **kwargs)[source]

Plot distance matrix with rows colored by current labels.

feature_importance_across_labs(selected_labs, test_size=0.3, num_boost_round=10, nprocs=1, random_state=None, silent=1, xgb_params=None, num_bootstrap_round=0, bootstrap_size=None, shuffle_features=False)[source]

Use xgboost to determine the importance of features in differentiating samples with different labels.

Run cross validation on the dataset and obtain important features.

Parameters:
  • selected_labs (label list) – Labels to compare using xgboost.
  • test_size (float in range (0, 1)) – Ratio of samples to be used for testing
  • num_bootstrap_round (int) – Do num_bootstrap_round times of simple bootstrapping if num_bootstrap_round > 0.
  • bootstrap_size (int) – The number of samples for each bootstrapping run.
  • shuffle_features (bool) –
  • num_boost_round (int) – The number of rounds for xgboost training.
  • random_state (int) –
  • nprocs (int) –
  • xgb_params (dict) – Parameters for xgboost run. If None, default will be used. If provided, they will be directly used for xgbooster.
Returns:

  • feature_importance_list (list of feature importance of each run) – [(feature_id, mean of fscore across all bootstrapping rounds, standard deviation of fscore across all bootstrapping rounds, number of times used across all bootstrapping rounds), …]
  • bst_list (list of xgb Booster) – Fitted boosted tree models

Notes

If multiple features are highly correlated, they may not all show up in the resulting tree. You could try to reduce redundant features before comparing different clusters, or you could interpret the important features further after obtaining them.

For details about xgboost parameters, check the following links:

[1] https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/

[2] http://xgboost.readthedocs.io/en/latest/python/python_intro.html

[3] http://xgboost.readthedocs.io/en/latest/parameter.html

[4] https://xgboost.readthedocs.io/en/latest/how_to/param_tuning.html
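
A hedged usage sketch, assuming a hypothetical slcs instance with labels 0 and 1 and xgboost installed:

fi_list, bst_list = slcs.feature_importance_across_labs(
    selected_labs=[0, 1], test_size=0.3, num_boost_round=10, random_state=0)
# fi_list entries: (feature_id, mean fscore, std fscore, times used), per the Returns above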


feature_importance_distintuishing_labs(selected_labs, test_size=0.3, num_boost_round=10, nprocs=1, random_state=None, silent=1, xgb_params=None, num_bootstrap_round=0, bootstrap_size=None, shuffle_features=False)[source]

Use xgboost to compare selected labels and others.

feature_importance_each_lab(test_size=0.3, num_boost_round=10, nprocs=1, random_state=None, silent=1, xgb_params=None, num_bootstrap_round=0, bootstrap_size=None, shuffle_features=False)[source]

Use xgboost to compare each label with others. Experimental.

feature_swarm_plot(fid, transform=None, labels=None, selected_labels=None, title=None, xlab=None, ylab=None, figsize=(10, 10))[source]
filter_min_class_n(min_class_n)[source]
id_x(selected_sids=None, selected_fids=None)[source]

Subset samples by (sample IDs, feature IDs).

Parameters:
  • selected_sids (id array) – ID array of selected samples. If is None, select all.
  • selected_fids (id array) – ID array of selected features. If is None, select all.
Returns:

subset

Return type:

SingleLabelClassifiedSamples

ind_x(selected_s_inds=None, selected_f_inds=None)[source]

Subset samples by (sample indices, feature indices).

Parameters:
  • selected_s_inds (int array) – Index array of selected samples. If is None, select all.
  • selected_f_inds (int array) – Index array of selected features. If is None, select all.
Returns:

subset

Return type:

SingleLabelClassifiedSamples

lab_sorted_sids(ref_sid_order=None)[source]
lab_x(selected_labs)[source]
lab_x_bool_inds(selected_labs)[source]
labs
labs_to_sids(labs)[source]
merge_labels(orig_labs_to_merge, new_lab)[source]

Merge selected labels into a new label

Parameters:
  • orig_labs_to_merge (list of unique labels) – original labels to be merged into a new label
  • new_lab (label) – new label of the merged labels
Returns:

None

Note

Update labels in place.
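
For example, with a hypothetical slcs instance whose labels include 1 and 2:

slcs.merge_labels([1, 2], "merged")  # labels 1 and 2 become "merged", in place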

relabel(labels)[source]

Return a new SingleLabelClassifiedSamples with new labels.

static select_labs_bool_inds(ref_labs, selected_labs)[source]
sids_to_labs(sids)[source]
sort_by_labels()[source]

Return a copy with sorted sample indices by labels and distances.

tsne_feature_gradient_plot(fid, transform=None, labels=None, selected_labels=None, shuffle_label_colors=False, title=None, xlab=None, ylab=None, figsize=(20, 20), add_legend=True, n_txt_per_cluster=3, alpha=1, s=0.5, random_state=None, **kwargs)[source]

Plot the last t-SNE projection with the provided gradient as color.

Parameters:
  • fid (feature id scalar) – ID of the feature to be used for gradient plot.
  • transform (callable) – Map transform on feature before plotting.
tsne_plot(gradient=None, labels=None, selected_labels=None, shuffle_label_colors=False, title=None, xlab=None, ylab=None, figsize=(20, 20), add_legend=True, n_txt_per_cluster=3, alpha=1, s=0.5, random_state=None, **kwargs)[source]

Plot the last t-SNE projection with the provided gradient as color.

xmat_heatmap(selected_labels=None, selected_fids=None, col_labels=None, transform=None, title=None, xlab=None, ylab=None, figsize=(10, 10), **kwargs)[source]

Plot x as heatmap.

scedar.eda.stats

scedar.eda.stats.bidir_ReLU(x, start, end, lb=0, ub=1)[source]
scedar.eda.stats.gc1d(x)[source]

Compute Gini Index for 1D array.

References

[1] http://mathworld.wolfram.com/GiniCoefficient.html

[2] Damgaard, C. and Weiner, J. “Describing Inequality in Plant Size or Fecundity.” Ecology 81, 1139-1142, 2000.

[3] Dixon, P. M.; Weiner, J.; Mitchell-Olds, T.; and Woodley, R. “Bootstrapping the Gini Coefficient of Inequality.” Ecology 68, 1548-1551, 1987.

[4] Dixon, P. M.; Weiner, J.; Mitchell-Olds, T.; and Woodley, R. “Erratum to ‘Bootstrapping the Gini Coefficient of Inequality.’ ” Ecology 69, 1307, 1988.

[5] https://en.wikipedia.org/wiki/Gini_coefficient

[6] https://github.com/oliviaguest/gini/blob/master/gini.py
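
For reference, a minimal numpy sketch of the mean-absolute-difference formulation in [1] and [5]; this is illustrative and not necessarily scedar's exact estimator:

import numpy as np

def gini_sketch(x):
    # G = sum_ij |x_i - x_j| / (2 * n**2 * mean(x)), for non-negative x with mean > 0
    x = np.asarray(x, dtype=np.float64)
    mad = np.abs(x[:, None] - x[None, :]).mean()  # mean absolute difference
    return mad / (2 * x.mean())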

scedar.eda.stats.multiple_testing_correction(pvalues, correction_type='FDR')[source]

Consistent with R.

multiple_testing_correction([0.0, 0.01, 0.029, 0.03, 0.031, 0.05, 0.069, 0.07, 0.071, 0.09, 0.1])