scedar.eda

scedar.eda.mdl

class scedar.eda.mdl.GKdeMdl(x, kde_bw_method='scott', dtype=dtype('float64'), copy=True)[source]

Bases: scedar.eda.mdl.Mdl

Use Gaussian kernel density estimation to compute mdl

Parameters:
  • x (1D np.number array) – data used to fit mdl
  • kde_bw_method – string KDE bandwidth estimation method being passed to scipy.stats.gaussian_kde. Options: * “scott”: Scott's rule of thumb. * “silverman”: Silverman's rule of thumb. * constant: the constant will be multiplied by x.std(ddof=1) internally, because scipy multiplies the bw_method value by the standard deviation. “Scipy weights its bandwidth by the covariance of the input data” [3]. * callable: scipy calls the function on self
  • dtype (np.dtype) – default to 64-bit float
  • copy (bool) – passed to np.array()
_x

data to fit

Type:1d float array
_n

number of elements in data

Type:int
_bw_method

bandwidth method

Type:str
_kde
Type:scipy kde
_logdens

log density

Type:1d float array
bandwidth
encode(qx, mdl_scale_factor=1)[source]

Encode query data using fitted KDE code

Parameters:
  • qx (1d float array) –
  • mdl_scale_factor (number) – multiply mdl by this number
Returns:

mdl

Return type:

float
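
A minimal usage sketch with hypothetical data; only the documented constructor, mdl property, and encode method are used:

import numpy as np
from scedar.eda.mdl import GKdeMdl

x = np.random.normal(size=100)          # hypothetical 1D data
m = GKdeMdl(x, kde_bw_method="scott")
print(m.mdl)                            # description length of x under the fitted KDE
print(m.encode(np.array([0.0, 1.5])))   # mdl of query values under the fitted code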

static gaussian_kde_logdens(x, bandwidth_method='scott', ret_kernel=False)[source]

Estimate the log density of input x with Gaussian kernel density estimation.

Parameters:
  • x (float array of shape (n_samples) or (n_samples, n_features)) – Data points for KDE estimation.
  • bandwidth_method (string) – KDE bandwidth estimation method being passed to scipy.stats.gaussian_kde.
kde
mdl
class scedar.eda.mdl.Mdl(x, dtype=dtype('float64'), copy=True)[source]

Bases: abc.ABC

Minimum description length abstract base class

Interface of various mdl schemas. Subclasses must implement mdl property
and encode method.
_x

data used to fit mdl

Type:1D np.number array
_n

number of points in x

Type:np.int
encode(x)[source]

Encode another 1D number array with fitted code

Parameters:x (1D np.number array) – data to encode

mdl
x
class scedar.eda.mdl.MultinomialMdl(x, dtype=dtype('float64'), copy=True)[source]

Bases: scedar.eda.mdl.Mdl

Encode discrete values using multinomial distribution

Parameters:
  • x (1D np.number array) – data used to fit mdl
  • dtype (np.dtype) – default to 64-bit float
  • copy (bool) – passed to np.array()

Note

When x has only one unique value, encode only the number of values.

encode(qx, use_adjescent_when_absent=False)[source]

Encode another 1D float array with fitted code

Parameters:
  • qx (1d float array) – query data
  • use_adjescent_when_absent (bool) – whether to use the adjacent value to compute query mdl. If not, uniform mdl is used. If two adjacent values have the same distance to the query value, choose the one with smaller mdl.
Returns:

qmdl (float)
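
A minimal usage sketch with hypothetical discrete values:

import numpy as np
from scedar.eda.mdl import MultinomialMdl

m = MultinomialMdl(np.array([0, 0, 1, 1, 1, 2]))
print(m.mdl)                       # description length of the fitted values
print(m.encode(np.array([1, 2])))  # qmdl of query values
# 5 is absent from the fitted values; fall back to the nearest fitted value
print(m.encode(np.array([5]), use_adjescent_when_absent=True))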

mdl
class scedar.eda.mdl.ZeroIGKdeMdl(x, kde_bw_method='scott', dtype=dtype('float64'), copy=True)[source]

Bases: scedar.eda.mdl.Mdl

Zero indicator Gaussian KDE MDL

Encode the 0s and non-0s using a Bernoulli distribution. Then, encode the non-0s using Gaussian KDE. Finally, one ternary value indicates whether x is all 0s, all non-0s, or a mixture of both.

Parameters:
  • x (1D np.number array) – data used to fit mdl
  • kde_bw_method – string KDE bandwidth estimation method being passed to scipy.stats.gaussian_kde. Options: * “scott”: Scott's rule of thumb. * “silverman”: Silverman's rule of thumb. * constant: the constant will be multiplied by x.std(ddof=1) internally, because scipy multiplies the bw_method value by the standard deviation. “Scipy weights its bandwidth by the covariance of the input data” [3]. * callable: scipy calls the function on self
  • dtype (np.dtype) – default to 64-bit float
  • copy (bool) – passed to np.array()

References

[1] https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html

[2] https://en.wikipedia.org/wiki/Kernel_density_estimation

[3] https://jakevdp.github.io/blog/2013/12/01/kernel-density-estimation/

[4] https://github.com/scipy/scipy/blob/v1.0.0/scipy/stats/kde.py#L42-L564
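
A minimal usage sketch of the scheme described above, with hypothetical sparse data:

import numpy as np
from scedar.eda.mdl import ZeroIGKdeMdl

x = np.array([0, 0, 0, 0.5, 1.2, 3.1])  # mixture of 0s and non-0s
m = ZeroIGKdeMdl(x)
# total mdl decomposes into the zero-indicator part and the KDE part
print(m.mdl, m.zi_mdl, m.kde_mdl)
print(m.x_nonzero)                       # non-zero values encoded by the KDE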

bandwidth
encode(qx)[source]

Encode qx

Parameters:qx (1d np number array) –
Returns:mdl (float)
kde
kde_mdl
mdl
x_nonzero
zi_mdl
class scedar.eda.mdl.ZeroIMdl(x, dtype=dtype('float64'), copy=True)[source]

Bases: scedar.eda.mdl.Mdl

Encode an indicator vector of 0s and non-0s

encode(qx)[source]

Encode another 1D number array with fitted code

Parameters:qx (1D np.number array) – data to encode

mdl
class scedar.eda.mdl.ZeroIMultinomialMdl(x, dtype=dtype('float64'), copy=True)[source]

Bases: scedar.eda.mdl.Mdl

encode(qx, use_adjescent_when_absent=False)[source]

Encode another 1D number array with fitted code

Parameters:
  • qx (1D np.number array) – data to encode
  • use_adjescent_when_absent (bool) – whether to use the adjacent value to compute query mdl; see MultinomialMdl.encode

mdl
scedar.eda.mdl.np_number_1d(x, dtype=dtype('float64'), copy=True)[source]

Convert x to 1d np number array

Parameters:
  • x (1d sequence of values convertible to np.number) –
  • dtype (np number type) – default to 64-bit float
  • copy (bool) – passed to np.array()
Returns:

xarr (1d np.number array)

Raises:

ValueError – If x is not convertible to the provided dtype, or is not 1D. If dtype is not a subdtype of np number.
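
For example (a sketch; exact error messages depend on the implementation):

import numpy as np
from scedar.eda.mdl import np_number_1d

np_number_1d([1, 2, 3])         # array([1., 2., 3.]), float64 by default
np_number_1d([[1, 2], [3, 4]])  # raises ValueError: input is not 1D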

scedar.eda.mtype

scedar.eda.mtype.check_is_valid_labs(labs)[source]
scedar.eda.mtype.check_is_valid_sfids(sfids)[source]
scedar.eda.mtype.is_uniq_np1darr(x)[source]

Test whether x is a 1D np array that only contains unique values.

scedar.eda.mtype.is_valid_full_cut_tree_mat(cmat)[source]

Validate scipy hierarchical clustering cut tree. The number of clusters should decrease from n to 1.

scedar.eda.mtype.is_valid_lab(lab)[source]
scedar.eda.mtype.is_valid_sfid(sfid)[source]

scedar.eda.plot

scedar.eda.plot.cluster_scatter(projection2d, labels=None, selected_labels=None, plot_different_markers=False, label_markers=None, shuffle_label_colors=False, gradient=None, xlim=None, ylim=None, title=None, xlab=None, ylab=None, figsize=(20, 20), add_legend=True, n_txt_per_cluster=3, alpha=1, s=0.5, random_state=None, **kwargs)[source]

Scatter plot for clustering illustration

Parameters:
  • projection2d (2 col numeric array) – (n, 2) matrix to plot
  • labels (list of labels) – labels of n samples
  • selected_labels (list of labels) – selected labels to plot
  • plot_different_markers (bool) – plot different markers for samples with different labels
  • label_markers (list of marker shapes) – passed to matplotlib plot
  • shuffle_label_colors (bool) – shuffle the colors of labels to avoid similar colors showing up in nearby clusters
  • gradient (list of number) – color gradient of n samples
  • title (str) –
  • xlab (str) – x axis label
  • ylab (str) – y axis label
  • figsize (tuple of two number) – (width, height)
  • add_legend (bool) –
  • n_txt_per_cluster (number) – the number of text labels to plot per cluster. Can be 0.
  • alpha (number) –
  • s (number) – size of the points
  • random_state (int) – random seed to shuffle features
  • **kwargs – passed to matplotlib plot
Returns:

matplotlib figure of the created scatter plot
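
A minimal usage sketch with hypothetical random data:

import numpy as np
from scedar.eda.plot import cluster_scatter

proj = np.random.normal(size=(50, 2))   # hypothetical (n, 2) projection
labs = [0] * 25 + [1] * 25              # hypothetical labels
fig = cluster_scatter(proj, labels=labs, title="demo",
                      xlab="dim1", ylab="dim2", figsize=(5, 5))
fig.savefig("demo.png")                 # the returned matplotlib figure can be saved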

scedar.eda.plot.heatmap(x, row_labels=None, col_labels=None, title=None, xlab=None, ylab=None, figsize=(20, 20), transform=None, shuffle_row_colors=False, shuffle_col_colors=False, random_state=None, row_label_order=None, col_label_order=None, **kwargs)[source]
scedar.eda.plot.hist_dens_plot(x, title=None, xlab=None, ylab=None, figsize=(5, 5), ax=None, **kwargs)[source]

Plot histogram and density plot of x.

scedar.eda.plot.labs_to_cmap(labels, return_lut=False, shuffle_colors=False, random_state=None)[source]
scedar.eda.plot.networkx_graph(ng, pos=None, alpha=0.05, figsize=(20, 20), gradient=None, labels=None, different_label_markers=True, node_size=30, node_with_labels=False, nx_draw_kwargs=None)[source]
scedar.eda.plot.regression_scatter(x, y, title=None, xlab=None, ylab=None, figsize=(5, 5), alpha=1, s=0.5, ax=None, **kwargs)[source]

Paired vector scatter plot.

scedar.eda.plot.swarm(x, labels=None, selected_labels=None, title=None, xlab=None, ylab=None, figsize=(10, 10), ax=None, **kwargs)[source]

scedar.eda.sdm

class scedar.eda.sdm.HClustTree(node, prev=None)[source]

Bases: object

Hierarchical clustering tree.

Implements simple tree operation routines. HCT is a binary unbalanced tree.

node

current node

Type:scipy.cluster.hierarchy.ClusterNode
prev

parent of current node

Type:HClustTree
bi_partition(soft_min_subtree_size=1, return_subtrees=False)[source]

soft_min_subtree_size: when the current tree size < 2 * soft_min_subtree_size, it is impossible to have a bipartition with a minimum subtree size bigger than soft_min_subtree_size. In this case, return the first partition.

When soft_min_subtree_size = 1, the behavior is the same as taking the first bipartition.

When curr size = 1, the first bipartition gives (1, 0). Because curr size < 2 * soft_min_subtree_size, it goes directly to return.

When curr size = 2, the first bipartition is guaranteed to give (1, 1), with the invariant that parent nodes of leaves always have 2 child nodes. This also goes directly to return.

When curr size >= 3, the first bipartition is guaranteed to give two subtrees with size >= 1, with the same invariant as in size = 2.
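
A small sketch of building a tree and inspecting its first bipartition, assuming hclust_tree returns the root HClustTree of the linkage built from a hypothetical distance matrix; only documented methods are used:

import numpy as np
from scedar.eda.sdm import HClustTree

dmat = np.array([[0, 1, 5, 6],
                 [1, 0, 5, 6],
                 [5, 5, 0, 1],
                 [6, 6, 1, 0]], dtype=float)
root = HClustTree.hclust_tree(dmat, linkage="complete")
print(root.count())                                  # 4 leaves in total
print(root.left_leaf_ids(), root.right_leaf_ids())   # the first bipartition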

static cluster_id_to_lab_list(cl_sid_list, sid_list)[source]

Convert nested clustered ID list into cluster label list.

For example, convert [[0, 1, 2], [3,4]] to [0, 0, 0, 1, 1] according to id_arr [0, 1, 2, 3, 4]

Parameters

cl_sid_list: list[list[id]]
Nested list with each sublist as a set of IDs from a cluster.
sid_list: list[id]
Flat list of sample IDs.
count()[source]
static hclust_linkage(dmat, linkage='complete', n_eval_rounds=None, is_euc_dist=False, optimal_ordering=False, verbose=False)[source]
static hclust_tree(dmat, linkage='complete', n_eval_rounds=None, is_euc_dist=False, optimal_ordering=False, verbose=False)[source]
static hct_from_lkg(hac_z)[source]
leaf_ids()[source]

Returns the list of leaf IDs from left to right

Returns:list of leaf IDs
left()[source]
left_count()[source]
left_leaf_ids()[source]
n_round_bipar_cnt(n)[source]
prev
right()[source]
right_count()[source]
right_leaf_ids()[source]
static sort_x_by_d(x, dmat=None, metric='cosine', linkage='auto', n_eval_rounds=None, optimal_ordering=False, nprocs=None, verbose=False)[source]
class scedar.eda.sdm.SampleDistanceMatrix(x, d=None, metric='cosine', use_pdist=True, sids=None, fids=None, nprocs=None)[source]

Bases: scedar.eda.sfm.SampleFeatureMatrix

SampleDistanceMatrix: data with pairwise distance matrix

Parameters:
  • x (ndarray or list) – data matrix (n_samples, n_features)
  • d (ndarray or list or None) – distance matrix (n_samples, n_samples) If is None, d will be computed with x, metric, and nprocs.
  • metric (string) – distance metric
  • use_pdist (boolean) – to use the pairwise distance matrix or not. The pairwise distance matrix may be too large to save for datasets with a large number of cells.
  • sids (homogeneous list of int or string) – sample ids. Should not contain duplicated elements.
  • fids (homogeneous list of int or string) – feature ids. Should not contain duplicated elements.
  • nprocs (int) – the number of processes for computing pairwise distance matrix
_x

data matrix (n_samples, n_features)

Type:ndarray
_d

distance matrix (n_samples, n_samples)

Type:ndarray
_metric

distance metric

Type:string
_sids

sample ids.

Type:ndarray
_fids

feature ids.

Type:ndarray
_tsne_lut

lookup table for previous tsne calculations. Each run has an indexed entry, {(param_str, index) : tsne_res}

Type:dict
_last_tsne

The last stored t-SNE results. If no t-SNE has been performed before, a run with default parameters will be performed.

Type:float array
_hnsw_index_lut
Type:{string_index_parameters: hnsw_index}
_last_k

The last k used for s_knns computation.

Type:int
_last_knns

The last computed s_knns.

Type:(knn_indices, knn_distances)
_knn_ng_lut

{(k, aff_scale): knn_graph}

Type:dict
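
A minimal construction sketch with hypothetical data:

import numpy as np
from scedar.eda.sdm import SampleDistanceMatrix

x = np.random.normal(size=(20, 5))      # hypothetical (n_samples, n_features)
sdm = SampleDistanceMatrix(x, metric="cosine")
print(sdm.d.shape)                      # (20, 20) pairwise distance matrix
knn_inds, knn_dists = sdm.s_knns(k=3)   # 3 nearest neighbors per sample
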
static correlation_pdist(x)[source]

Compute pairwise correlation pdist for x (n_samples, n_features).

Adapted from Waylon Flinn’s post on https://stackoverflow.com/a/20687984/4638182 .

Parameters:x (ndarray) – (n_samples, n_features)
Returns:d – Pairwise distance matrix, (n_samples, n_samples).
Return type:ndarray
static cosine_pdist(x)[source]

Compute pairwise cosine pdist for x (n_samples, n_features).

Adapted from Waylon Flinn’s post on https://stackoverflow.com/a/20687984/4638182 .

Cosine distance is undefined if one of the vectors contains only 0s.

Parameters:x (ndarray) – (n_samples, n_features)
Returns:d – Pairwise distance matrix, (n_samples, n_samples).
Return type:ndarray
d
get_tsne_kv(key)[source]

Get t-SNE results from the lookup table. Return None if non-existent.

Returns:res_tuple – (key, val) pair of tsne result.
Return type:tuple
id_x(selected_sids=None, selected_fids=None)[source]

Subset samples by (sample IDs, feature IDs).

Parameters:
  • selected_sids (id array) – ID array of selected samples. If is None, select all.
  • selected_fids (id array) – ID array of selected features. If is None, select all.
Returns:

subset

Return type:

SampleDistanceMatrix

ind_x(selected_s_inds=None, selected_f_inds=None)[source]

Subset samples by (sample indices, feature indices).

Parameters:
  • selected_s_inds (int array) – Index array of selected samples. If is None, select all.
  • selected_f_inds (int array) – Index array of selected features. If is None, select all.
Returns:

subset

Return type:

SampleDistanceMatrix

static knn_conn_mat_to_aff_graph(knn_conn_mat, aff_scale=1)[source]
metric
static num_correct_dist_mat(dmat, upper_bound=None)[source]
par_tsne(param_list, store_res=True, nprocs=1)[source]

Run t-SNE with multiple sets of parameters in parallel.

Parameters:
  • param_list (list of dict) – List of parameters being passed to t-SNE.
  • nprocs (int) – Number of processes.
Returns:

tsne_res_list – List of t-SNE results of corresponding parameter set.

Return type:

list of float arrays

Notes

Results cannot be stored during a parallel run, because race conditions may happen.

pca_feature_gradient_plot(fid, component_ind_pair=(0, 1), transform=None, labels=None, selected_labels=None, plot_different_markers=False, label_markers=None, shuffle_label_colors=False, xlim=None, ylim=None, title=None, xlab=None, ylab=None, figsize=(20, 20), add_legend=True, n_txt_per_cluster=3, alpha=1, s=0.5, random_state=None, **kwargs)[source]

Plot the last PCA projection with the provided gradient as color.

Parameters:
  • component_ind_pair (tuple of two ints) – Indices of the components to plot.
  • fid (feature id scalar) – ID of the feature to be used for gradient plot.
  • transform (callable) – Map transform on feature before plotting.
  • labels (label array) – Labels assigned to each point, (n_samples,).
  • selected_labels (label array) – Show gradient only for selected labels. Do not show non-selected.
pca_plot(component_ind_pair=(0, 1), gradient=None, labels=None, selected_labels=None, plot_different_markers=False, label_markers=None, shuffle_label_colors=False, xlim=None, ylim=None, title=None, xlab=None, ylab=None, figsize=(20, 20), add_legend=True, n_txt_per_cluster=3, alpha=1, s=0.5, random_state=None, **kwargs)[source]

Plot the PCA projection with the provided gradient as color. Gradient is None by default.

put_tsne(str_params, res)[source]

Put t-SNE results into the lookup table.

s_ith_nn_d(i)[source]

Computes the distances of the i-th nearest neighbor of all samples.

s_ith_nn_d_dist(i, xlab=None, ylab=None, title=None, figsize=(5, 5), ax=None, **kwargs)[source]

Plot the distances of the i-th nearest neighbor of all samples.

s_ith_nn_ind(i)[source]

Computes the sample indices of the i-th nearest neighbor of all samples.

s_knn_connectivity_matrix(k, metric=None, use_pca=False, use_hnsw=False, index_params=None, query_params=None, verbose=False)[source]

Computes the connectivity matrix of the KNN of samples. If an entry (i, j) has value 0, node i is not in node j's KNN. If an entry (i, j) has a value != 0, node i is in node j's KNN, and their distance is the entry value. If two NNs have distance equal to 0, the 0 will be replaced by -np.inf.

Parameters:
  • k (int) – The number of nearest neighbors.
  • metric ({'cosine', 'euclidean', None}) – If none, self._metric is used.
  • use_pca (bool) – Use PCA for nearest neighbors or not.
  • use_hnsw (bool) – Use Hierarchical Navigable Small World graph to compute approximate nearest neighbor.
  • index_params (dict) –

    Parameters used by HNSW in indexing.

    efConstruction: int
    Default 100. Higher value improves the quality of a constructed graph and leads to higher accuracy of search. However this also leads to longer indexing times. The reasonable range of values is 100-2000.
    M: int
    Default 5. Higher value leads to better recall and shorter retrieval times, at the expense of longer indexing time. The reasonable range of values is 5-100.
    delaunay_type: {0, 1, 2, 3}
    Default 2. Pruning heuristic, which affects the trade-off between retrieval performance and indexing time. The default is usually quite good.
    post: {0, 1, 2}
    Default 0. The amount and type of postprocessing applied to the constructed graph. 0 means no processing. 2 means more processing.
    indexThreadQty: int
    Default self._nprocs. The number of threads used.
  • query_params (dict) –

    Parameters used by HNSW in querying.

    efSearch: int
    Default 100. Higher value improves recall at the expense of longer retrieval time. The reasonable range of values is 100-2000.
Returns:

knn_conn_mat – (n_samples, n_samples) Non-zero entries are nearest neighbors (NNs). The values are distances. If two NNs have distance equal to 0, the 0 will be replaced by -np.inf.

Return type:

float array
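
A short sketch of reading the matrix under the semantics above, reusing a hypothetical sdm instance like the one in the class example:

import numpy as np

conn = sdm.s_knn_connectivity_matrix(k=3)
rows, cols = np.nonzero(conn)  # each non-zero (i, j) marks a KNN relation
# entry values are distances; exact 0 distances are stored as -np.inf (see above)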

s_knn_graph(k, gradient=None, labels=None, different_label_markers=True, aff_scale=1, iterations=2000, figsize=(20, 20), node_size=30, alpha=0.05, random_state=None, init_pos=None, node_with_labels=False, fa2_kwargs=None, nx_draw_kwargs=None)[source]

Draw the KNN graph of SampleDistanceMatrix. The graph layout uses forceatlas2 for its speed on large graphs.

Parameters:
  • k (int) –
  • gradient (float array) – (n_samples,) color gradient
  • labels (label list) – (n_samples,) labels
  • different_label_markers (bool) – whether plot different labels with different markers
  • aff_scale (float) – Affinity is calculated by (max(distance) - distance) * aff_scale
  • iterations (int) – ForceAtlas2 iterations
  • figsize ((float, float)) –
  • node_size (float) –
  • alpha (float) –
  • random_state (int) –
  • init_pos (float array) – Initial position of ForceAtlas2, (n_samples, 2).
  • node_with_labels (bool) –
  • fa2_kwargs (dict) –
  • nx_draw_kwargs (dict) –
Returns:

fig – KNN graph.

Return type:

matplotlib figure

s_knn_ind_lut(k, metric=None, use_pca=False, use_hnsw=False, index_params=None, query_params=None, verbose=False)[source]

Computes the lookup table mapping sample i to its KNN indices, i.e. {i : [1st_NN_ind, 2nd_NN_ind, …, kth_NN_ind], …}

s_knns(k, metric=None, use_pca=False, use_hnsw=False, index_params=None, query_params=None, verbose=False)[source]

Computes the k-nearest neighbors (KNNs) of samples.

Parameters:
  • k (int) – The number of nearest neighbors.
  • metric ({'cosine', 'euclidean', None}) – If none, self._metric is used.
  • use_pca (bool) – Use PCA for nearest neighbors or not.
  • use_hnsw (bool) – Use Hierarchical Navigable Small World graph to compute approximate nearest neighbor.
  • index_params (dict) –

    Parameters used by HNSW in indexing.

    efConstruction: int
    Default 100. Higher value improves the quality of a constructed graph and leads to higher accuracy of search. However this also leads to longer indexing times. The reasonable range of values is 100-2000.
    M: int
    Default 5. Higher value leads to better recall and shorter retrieval times, at the expense of longer indexing time. The reasonable range of values is 5-100.
    delaunay_type: {0, 1, 2, 3}
    Default 2. Pruning heuristic, which affects the trade-off between retrieval performance and indexing time. The default is usually quite good.
    post: {0, 1, 2}
    Default 0. The amount and type of postprocessing applied to the constructed graph. 0 means no processing. 2 means more processing.
    indexThreadQty: int
    Default self._nprocs. The number of threads used.
  • query_params (dict) –

    Parameters used by HNSW in querying.

    efSearch: int
    Default 100. Higher value improves recall at the expense of longer retrieval time. The reasonable range of values is 100-2000.
  • verbose (bool) –
Returns:

  • knn_indices (list of numpy arrays) – The i-th array is the KNN indices of the i-th sample.
  • knn_distances (list of numpy arrays) – The i-th array is the KNN distances of the i-th sample.

sort_features(fdist_metric='cosine', optimal_ordering=False)[source]
to_classified(labels)[source]

Convert to SingleLabelClassifiedSamples

Parameters:labels (list of labels) – sample labels.
Returns:SingleLabelClassifiedSamples
tsne(store_res=True, **kwargs)[source]

Run t-SNE on distance matrix.

Parameters:
  • store_res (bool) – Store the results in lookup table or not.
  • **kwargs – Keyword arguments passed to tsne computation.
Returns:

tsne_res – t-SNE projections, (n_samples, m dimensions).

Return type:

float array

tsne_feature_gradient_plot(fid, transform=None, labels=None, selected_labels=None, plot_different_markers=False, label_markers=None, shuffle_label_colors=False, xlim=None, ylim=None, title=None, xlab=None, ylab=None, figsize=(20, 20), add_legend=True, n_txt_per_cluster=3, alpha=1, s=0.5, random_state=None, **kwargs)[source]

Plot the last t-SNE projection with the provided gradient as color.

Parameters:
  • fid (feature id scalar) – ID of the feature to be used for gradient plot.
  • transform (callable) – Map transform on feature before plotting.
  • labels (label array) – Labels assigned to each point, (n_samples,).
  • selected_labels (label array) – Show gradient only for selected labels. Do not show non-selected.
tsne_lut
tsne_plot(gradient=None, labels=None, selected_labels=None, plot_different_markers=False, label_markers=None, shuffle_label_colors=False, xlim=None, ylim=None, title=None, xlab=None, ylab=None, figsize=(20, 20), add_legend=True, n_txt_per_cluster=3, alpha=1, s=0.5, random_state=None, **kwargs)[source]

Plot the last t-SNE projection with the provided gradient as color. Gradient is None by default.

umap(use_pca=True, n_neighbors=5, n_components=2, n_epochs=None, learning_rate=1.0, init='spectral', min_dist=0.1, spread=1.0, set_op_mix_ratio=1.0, local_connectivity=1.0, repulsion_strength=1.0, negative_sample_rate=5, transform_queue_size=4.0, a=None, b=None, random_state=None, metric_kwds=None, angular_rp_forest=False, target_n_neighbors=-1, target_metric='categorical', target_metric_kwds=None, target_weight=0.5, transform_seed=42, verbose=False)[source]
umap_feature_gradient_plot(fid, component_ind_pair=(0, 1), transform=None, labels=None, selected_labels=None, plot_different_markers=False, label_markers=None, shuffle_label_colors=False, xlim=None, ylim=None, title=None, xlab=None, ylab=None, figsize=(20, 20), add_legend=True, n_txt_per_cluster=3, alpha=1, s=0.5, random_state=None, **kwargs)[source]

Plot the last UMAP projection with the provided gradient as color.

Parameters:
  • component_ind_pair (tuple of two ints) – Indices of the components to plot.
  • fid (feature id scalar) – ID of the feature to be used for gradient plot.
  • transform (callable) – Map transform on feature before plotting.
  • labels (label array) – Labels assigned to each point, (n_samples,).
  • selected_labels (label array) – Show gradient only for selected labels. Do not show non-selected.
umap_plot(component_ind_pair=(0, 1), gradient=None, labels=None, selected_labels=None, plot_different_markers=False, label_markers=None, shuffle_label_colors=False, xlim=None, ylim=None, title=None, xlab=None, ylab=None, figsize=(20, 20), add_legend=True, n_txt_per_cluster=3, alpha=1, s=0.5, random_state=None, **kwargs)[source]

Plot the UMAP projection with the provided gradient as color. Gradient is None by default.

TODO: refactor plotting interface. Merge multiple plotting methods into one.

scedar.eda.sdm.tsne(x, n_components=2, perplexity=30.0, early_exaggeration=12.0, learning_rate=200.0, n_iter=1000, n_iter_without_progress=300, min_grad_norm=1e-07, metric='euclidean', init='random', verbose=0, random_state=None, method='barnes_hut', angle=0.5)[source]

scedar.eda.sfm

class scedar.eda.sfm.SampleFeatureMatrix(x, sids=None, fids=None)[source]

Bases: object

SampleFeatureMatrix is a (n_samples, n_features) matrix.

In this package, we are only interested in float features as measured expression levels.

Parameters:
  • x ({array-like, sparse matrix}) – data matrix (n_samples, n_features)
  • sids (homogeneous list of int or string) – sample ids. Should not contain duplicated elements.
  • fids (homogeneous list of int or string) – feature ids. Should not contain duplicated elements.
_x

data matrix (n_samples, n_features)

Type:{array-like, sparse matrix}
_is_sparse

whether the data matrix is sparse matrix or not

Type:boolean
_sids

sample ids.

Type:ndarray
_fids

feature ids.

Type:ndarray
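
A minimal usage sketch with hypothetical count data:

import numpy as np
from scedar.eda.sfm import SampleFeatureMatrix

x = np.random.poisson(1, size=(30, 10)).astype(float)  # hypothetical matrix
sfm = SampleFeatureMatrix(x, sids=list(range(30)), fids=list(range(10)))
print(sfm.s_sum().shape)   # (n_features,): per-feature sum over samples
print(sfm.f_sum().shape)   # (n_samples,): per-sample sum over features
sub = sfm.ind_x(selected_s_inds=np.arange(10))  # first 10 samples, all features
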
f_cv(f_cv_filter=None)[source]

For each sample, compute the coefficient of variation of all features.

Returns:xf – (filtered_n_samples,)
Return type:float array
f_cv_dist(f_cv_filter=None, xlab=None, ylab=None, title=None, figsize=(5, 5), ax=None, **kwargs)[source]

Plot the distribution of the feature coefficient of variation of each sample, (n_samples,).

f_gc(f_gc_filter=None)[source]

For each sample, compute the Gini coefficients of all features.

Returns:xf – (filtered_n_samples,)
Return type:float array
f_gc_dist(f_gc_filter=None, xlab=None, ylab=None, title=None, figsize=(5, 5), ax=None, **kwargs)[source]

Plot the distribution of the feature Gini coefficient of each sample, (n_samples,).

f_id_dist(f_id, sample_filter=None, xlab=None, ylab=None, title=None, figsize=(5, 5), ax=None, **kwargs)[source]
f_id_regression_scatter(xf_id, yf_id, sample_filter=None, xlab=None, ylab=None, title=None, **kwargs)[source]

Regression plot on two features with xf_id and yf_id.

Parameters:
  • xf_id (int) – Feature ID of x.
  • yf_id (int) – Feature ID of y.
  • sample_filter (bool array, or int array, or callable(x, y)) – If sample_filter is a bool / int array, directly select samples with it. If sample_filter is callable, it will be applied on each (x, y) value tuple.
  • xlab (str) –
  • ylab (str) –
  • title (str) –
f_id_to_ind(selected_fids)[source]

Convert a list of feature IDs into feature indices.

f_id_x_vec(f_id, sample_filter=None)[source]
f_ind_dist(f_ind, sample_filter=None, xlab=None, ylab=None, title=None, figsize=(5, 5), ax=None, **kwargs)[source]
f_ind_regression_scatter(xf_ind, yf_ind, sample_filter=None, xlab=None, ylab=None, title=None, **kwargs)[source]

Regression plot on two features with xf_ind and yf_ind.

Parameters:
  • xf_ind (int) – Feature index of x.
  • yf_ind (int) – Feature index of y.
  • sample_filter (bool array, or int array, or callable(x, y)) – If sample_filter is a bool / int array, directly select samples with it. If sample_filter is callable, it will be applied on each (x, y) value tuple.
  • xlab (str) –
  • ylab (str) –
  • title (str) –
f_ind_x_pair(xf_ind, yf_ind, sample_filter=None)[source]
f_ind_x_vec(f_ind, sample_filter=None, transform=None)[source]

Access the vector of a single feature across samples.

f_n_above_threshold(closed_threshold)[source]

For each sample, compute the number of features above a closed threshold.

f_n_above_threshold_dist(closed_threshold, xlab=None, ylab=None, title=None, figsize=(5, 5), ax=None, **kwargs)[source]

Plot the distribution of the number of above-threshold features of each sample, (n_samples,).

f_sum(f_sum_filter=None)[source]

For each sample, compute the sum of all features.

Returns:rowsum – (filtered_n_samples,)
Return type:float array
f_sum_dist(f_sum_filter=None, xlab=None, ylab=None, title=None, figsize=(5, 5), ax=None, **kwargs)[source]

Plot the distribution of the feature sum of each sample, (n_samples,).

fids
static filter_1d_inds(f, x)[source]
id_x(selected_sids=None, selected_fids=None)[source]

Subset samples by (sample IDs, feature IDs).

Parameters:
  • selected_sids (id array) – ID array of selected samples. If is None, select all.
  • selected_fids (id array) – ID array of selected features. If is None, select all.
Returns:

subset

Return type:

SampleFeatureMatrix

ind_x(selected_s_inds=None, selected_f_inds=None)[source]

Subset samples by (sample indices, feature indices).

Parameters:
  • selected_s_inds (int array) – Index array of selected samples. If is None, select all.
  • selected_f_inds (int array) – Index array of selected features. If is None, select all.
Returns:

subset

Return type:

SampleFeatureMatrix

s_cv(s_cv_filter=None)[source]

For each feature, compute the coefficient of variation of all samples.

Returns:xf – (n_features,)
Return type:float array
s_cv_dist(s_cv_filter=None, xlab=None, ylab=None, title=None, figsize=(5, 5), ax=None, **kwargs)[source]

Plot the distribution of the sample coefficient of variation of each feature, (n_features,).

s_gc(s_gc_filter=None)[source]

For each feature, compute the Gini coefficient of all samples.

Returns:xf – (n_features,)
Return type:float array
s_gc_dist(s_gc_filter=None, xlab=None, ylab=None, title=None, figsize=(5, 5), ax=None, **kwargs)[source]

Plot the distribution of the sample Gini coefficients of each feature, (n_features,).

s_id_dist(s_id, feature_filter=None, xlab=None, ylab=None, title=None, figsize=(5, 5), ax=None, **kwargs)[source]
s_id_regression_scatter(xs_id, ys_id, feature_filter=None, xlab=None, ylab=None, title=None, **kwargs)[source]

Regression plot on two samples with xs_id and ys_id.

Parameters:
  • xs_id (int) – Sample ID of x.
  • ys_id (int) – Sample ID of y.
  • feature_filter (bool array, or int array, or callable(x, y)) – If feature_filter is bool / int array, directly select features with it. If feature_filter is callable, it will be applied on each (x, y) value tuple.
  • xlab (str) –
  • ylab (str) –
  • title (str) –
s_id_to_ind(selected_sids)[source]

Convert a list of sample IDs into sample indices.

s_ind_dist(s_ind, feature_filter=None, xlab=None, ylab=None, title=None, figsize=(5, 5), ax=None, **kwargs)[source]
s_ind_regression_scatter(xs_ind, ys_ind, feature_filter=None, xlab=None, ylab=None, title=None, **kwargs)[source]

Regression plot on two samples with xs_ind and ys_ind.

Parameters:
  • xs_ind (int) – Sample index of x.
  • ys_ind (int) – Sample index of y.
  • feature_filter (bool array, or int array, or callable(x, y)) – If feature_filter is bool / int array, directly select features with it. If feature_filter is callable, it will be applied on each (x, y) value tuple.
  • xlab (str) –
  • ylab (str) –
  • title (str) –
s_ind_x_pair(xs_ind, ys_ind, feature_filter=None)[source]
s_ind_x_vec(s_ind, feature_filter=None)[source]

Access a single vector of a sample.

s_n_above_threshold(closed_threshold)[source]

For each feature, compute the number of samples above a closed threshold.

s_n_above_threshold_dist(closed_threshold, xlab=None, ylab=None, title=None, figsize=(5, 5), ax=None, **kwargs)[source]

Plot the distribution of the number of above-threshold samples of each feature, (n_features,).

s_sum(s_sum_filter=None)[source]

For each feature, compute the sum of all samples.

Returns:xf – (filtered_n_features,)
Return type:float array
s_sum_dist(s_sum_filter=None, xlab=None, ylab=None, title=None, figsize=(5, 5), ax=None, **kwargs)[source]

Plot the distribution of the sample sum of each feature, (n_features,).

sids
x

scedar.eda.slcs

class scedar.eda.slcs.MDLSingleLabelClassifiedSamples(x, labs, sids=None, fids=None, encode_type='data', mdl_method=<class 'scedar.eda.mdl.ZeroIGKdeMdl'>, d=None, metric='correlation', nprocs=None)[source]

Bases: scedar.eda.slcs.SingleLabelClassifiedSamples

MDLSingleLabelClassifiedSamples inherits SingleLabelClassifiedSamples to offer MDL operations.

Parameters:
  • x (2d number array) – data matrix
  • labs (list of str or int) – labels
  • sids (list of str or int) – sample ids
  • fids (list of str or int) – feature ids
  • encode_type ("auto", "data", or "distance") – Type of values to encode. If “auto”, encode data when n_features <= 100.
  • mdl_method (mdl.Mdl) – If None, use ZeroIGKdeMdl for encoded values with >= 50% zeros, and use GKdeMdl otherwise.
  • d (2d number array) – distance matrix
  • metric (str) – distance metric for scipy
  • nprocs (int) –
_mdl_method
Type:mdl.Mdl
class LabMdlResult(ulab_mdl_sum, ulab_s_inds, ulab_cnts, ulab_mdls, cluster_mdl)

Bases: tuple

cluster_mdl

Alias for field number 4

ulab_cnts

Alias for field number 2

ulab_mdl_sum

Alias for field number 0

ulab_mdls

Alias for field number 3

ulab_s_inds

Alias for field number 1

static compute_cluster_mdl(labs, cl_mdl_scale_factor=1)[source]

Additional MDL for encoding the cluster

  • labels are encoded by multinomial distribution
  • parameters are encoded by a 32-bit float: np.log(2**32) = 22.18070977791825
  • scaled by factor

TODO: formalize parameter mdl

encode(qx, col_summary_func=<built-in function sum>, non_zero_only=False, nprocs=1, verbose=False)[source]

Encode input array qx with the fitted code, without using labels.

Parameters:
  • qx (2d np number array) –
  • col_summary_func (callable) – function applied on column mdls
  • non_zero_only (bool) – whether to encode non-zero entries only
  • nprocs (int) –
  • verbose (bool) –
Returns:

mdl for encoding qx

Return type:

float

lab_mdl(cl_mdl_scale_factor=1, nprocs=1, verbose=False, ret_internal=False)[source]

Compute mdl of each feature after separating samples by labels

Parameters:
  • cl_mdl_scale_factor (float) – multiplies cluster related mdl by this number
  • nprocs (int) –
  • verbose (bool) – Not implemented
Returns:

mdl of matrix after separating samples by labels

Return type:

float

no_lab_mdl(nprocs=1, verbose=False)[source]

Compute mdl of each feature without separating samples by labels

Parameters:
  • nprocs (int) –
  • verbose (bool) – Not implemented
Returns:

mdl of matrix without separating samples by labels

Return type:

float
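
A sketch comparing labeled and unlabeled description lengths on hypothetical two-cluster data; per the docs above both calls return floats, and a lower labeled mdl suggests the labels compress the data well:

import numpy as np
from scedar.eda.slcs import MDLSingleLabelClassifiedSamples

x = np.vstack([np.random.normal(0, 1, (20, 5)),
               np.random.normal(5, 1, (20, 5))])  # hypothetical clusters
mslcs = MDLSingleLabelClassifiedSamples(x, labs=[0] * 20 + [1] * 20)
print(mslcs.no_lab_mdl())  # mdl without separating samples by labels
print(mslcs.lab_mdl())     # mdl after separating samples by labels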

static per_col_encoders(x, encode_type, mdl_method=<class 'scedar.eda.mdl.ZeroIGKdeMdl'>, nprocs=1, verbose=False)[source]

Compute mdl encoder for each column

Parameters:
  • x (2d number array) –
  • encode_type ("data" or "distance") –
  • mdl_method (mdl.Mdl) –
  • nprocs (int) –
  • verbose (bool) –
Returns:

list of column mdl encoders of x

class scedar.eda.slcs.SingleLabelClassifiedSamples(x, labs, sids=None, fids=None, d=None, metric='cosine', use_pdist=True, nprocs=None)[source]

Bases: scedar.eda.sdm.SampleDistanceMatrix

Data structure of single label classified samples

_x

(n_samples, n_features) data matrix.

Type:2D number array
_d

(n_samples, n_samples) distance matrix.

Type:2D number array
_labs

list of labels in the same type, int or str.

Type:list of labels
_fids

list of feature IDs in the same type, int or str.

Type:list of feature IDs
_sids

list of sample IDs in the same type, int or str.

Type:list of sample IDs
_metric

Distance metric.

Type:str

Note

If sorted by labels, the samples will be reordered, so that samples from left to right are grouped from one label to another.

cross_labs(q_slc_samples)[source]
dmat_heatmap(selected_labels=None, col_labels=None, transform=None, title=None, xlab=None, ylab=None, figsize=(10, 10), **kwargs)[source]

Plot distance matrix with rows colored by current labels.

feature_importance_across_labs(selected_labs, test_size=0.3, num_boost_round=10, nprocs=1, random_state=None, silent=1, xgb_params=None, num_bootstrap_round=0, bootstrap_size=None, shuffle_features=False)[source]

Use xgboost to determine the importance of features in differentiating samples with different labels.

Run cross validation on the dataset and obtain important features.

Parameters:
  • selected_labs (label list) – Labels to compare using xgboost.
  • test_size (float in range (0, 1)) – Ratio of samples to be used for testing
  • num_bootstrap_round (int) – Do num_bootstrap_round times of simple bootstrapping if num_bootstrap_round > 0.
  • bootstrap_size (int) – The number of samples for each bootstrapping run.
  • shuffle_features (bool) –
  • num_boost_round (int) – The number of rounds for xgboost training.
  • random_state (int) –
  • nprocs (int) –
  • xgb_params (dict) – Parameters for xgboost run. If None, default will be used. If provided, they will be directly used for xgbooster.
Returns:

  • feature_importance_list (list of feature importance of each run) – [(feature_id, mean of fscore across all bootstrapping rounds, standard deviation of fscore across all bootstrapping rounds, number of times used across all bootstrapping rounds), …]
  • bst_list (list of xgb Booster) – Fitted boosted tree models

Notes

If multiple features are highly correlated, they may not all show up in the resulting tree. You could try to reduce redundant features before comparing different clusters, or you could interpret the important features further after obtaining them.

For details about xgboost parameters, check the following links:

[1] https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/

[2] http://xgboost.readthedocs.io/en/latest/python/python_intro.html

[3] http://xgboost.readthedocs.io/en/latest/parameter.html

[4] https://xgboost.readthedocs.io/en/latest/how_to/param_tuning.html
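
A hedged usage sketch, assuming a hypothetical slcs instance with labels 0 and 1 and xgboost installed:

fi_list, bst_list = slcs.feature_importance_across_labs(
    selected_labs=[0, 1], test_size=0.3, num_boost_round=10, random_state=0)
# fi_list entries: (feature_id, mean fscore, std fscore, times used), per the Returns above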


feature_importance_distintuishing_labs(selected_labs, test_size=0.3, num_boost_round=10, nprocs=1, random_state=None, silent=1, xgb_params=None, num_bootstrap_round=0, bootstrap_size=None, shuffle_features=False)[source]

Use xgboost to compare selected labels and others.

feature_importance_each_lab(test_size=0.3, num_boost_round=10, nprocs=1, random_state=None, silent=1, xgb_params=None, num_bootstrap_round=0, bootstrap_size=None, shuffle_features=False)[source]

Use xgboost to compare each label with others. Experimental.

feature_swarm_plot(fid, transform=None, labels=None, selected_labels=None, title=None, xlab=None, ylab=None, figsize=(10, 10))[source]
filter_min_class_n(min_class_n)[source]
id_x(selected_sids=None, selected_fids=None)[source]

Subset samples by (sample IDs, feature IDs).

Parameters:
  • selected_sids (id array) – ID array of selected samples. If is None, select all.
  • selected_fids (id array) – ID array of selected features. If is None, select all.
Returns:

subset

Return type:

SingleLabelClassifiedSamples

ind_x(selected_s_inds=None, selected_f_inds=None)[source]

Subset samples by (sample indices, feature indices).

Parameters:
  • selected_s_inds (int array) – Index array of selected samples. If is None, select all.
  • selected_f_inds (int array) – Index array of selected features. If is None, select all.
Returns:

subset

Return type:

SingleLabelClassifiedSamples

lab_sorted_sids(ref_sid_order=None)[source]
lab_x(selected_labs)[source]
lab_x_bool_inds(selected_labs)[source]
labs
labs_to_sids(labs)[source]
merge_labels(orig_labs_to_merge, new_lab)[source]

Merge selected labels into a new label

Parameters:
  • orig_labs_to_merge (list of unique labels) – original labels to be merged into a new label
  • new_lab (label) – new label of the merged labels
Returns:

None

Note

Update labels in place.
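
For example, with a hypothetical slcs instance whose labels include 1 and 2:

slcs.merge_labels([1, 2], "merged")  # labels 1 and 2 become "merged", in place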

relabel(labels)[source]

Return a new SingleLabelClassifiedSamples with new labels.

static select_labs_bool_inds(ref_labs, selected_labs)[source]
sids_to_labs(sids)[source]
sort_by_labels()[source]

Return a copy with sorted sample indices by labels and distances.

tsne_feature_gradient_plot(fid, transform=None, labels=None, selected_labels=None, shuffle_label_colors=False, title=None, xlab=None, ylab=None, figsize=(20, 20), add_legend=True, n_txt_per_cluster=3, alpha=1, s=0.5, random_state=None, **kwargs)[source]

Plot the last t-SNE projection with the provided gradient as color.

Parameters:
  • fid (feature id scalar) – ID of the feature to be used for gradient plot.
  • transform (callable) – Map transform on feature before plotting.
tsne_plot(gradient=None, labels=None, selected_labels=None, shuffle_label_colors=False, title=None, xlab=None, ylab=None, figsize=(20, 20), add_legend=True, n_txt_per_cluster=3, alpha=1, s=0.5, random_state=None, **kwargs)[source]

Plot the last t-SNE projection with the provided gradient as color.

xmat_heatmap(selected_labels=None, selected_fids=None, col_labels=None, transform=None, title=None, xlab=None, ylab=None, figsize=(10, 10), **kwargs)[source]

Plot x as heatmap.

scedar.eda.stats

scedar.eda.stats.bidir_ReLU(x, start, end, lb=0, ub=1)[source]
scedar.eda.stats.gc1d(x)[source]

Compute Gini Index for 1D array.

References

[1] http://mathworld.wolfram.com/GiniCoefficient.html

[2] Damgaard, C. and Weiner, J. “Describing Inequality in Plant Size or Fecundity.” Ecology 81, 1139-1142, 2000.

[3] Dixon, P. M.; Weiner, J.; Mitchell-Olds, T.; and Woodley, R. “Bootstrapping the Gini Coefficient of Inequality.” Ecology 68, 1548-1551, 1987.

[4] Dixon, P. M.; Weiner, J.; Mitchell-Olds, T.; and Woodley, R. “Erratum to ‘Bootstrapping the Gini Coefficient of Inequality.’ ” Ecology 69, 1307, 1988.

[5] https://en.wikipedia.org/wiki/Gini_coefficient

[6] https://github.com/oliviaguest/gini/blob/master/gini.py
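
For reference, a minimal numpy sketch of the mean-absolute-difference formulation in [1] and [5]; this is illustrative and not necessarily scedar's exact estimator:

import numpy as np

def gini_sketch(x):
    # G = sum_ij |x_i - x_j| / (2 * n**2 * mean(x)), for non-negative x with mean > 0
    x = np.asarray(x, dtype=np.float64)
    mad = np.abs(x[:, None] - x[None, :]).mean()  # mean absolute difference
    return mad / (2 * x.mean())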

scedar.eda.stats.multiple_testing_correction(pvalues, correction_type='FDR')[source]

Consistent with R.

multiple_testing_correction([0.0, 0.01, 0.029, 0.03, 0.031, 0.05, 0.069, 0.07, 0.071, 0.09, 0.1])