Other functions

The generic tools grouped here are used by the class Analysis.

Submodule commonFunctions

Common functions used in analysis pipeline

Functions:

`KeyInStore`(key, file)	Check if a key is present in an HDF store file
`KeysOfStore`(file)	Get the list of keys in an HDF store file
`adjustTexts1D`(texts, fig, ax[, w, …])
`autoDetect1dGroups`(se[, halfWindowSize, …])
`binomialEnrichmentProbability`(nx_obj, …[, …])	Takes in a networkx object and a list of enriched genes and calculates the binomial enrichment based on the number of enriched interactions.
`cleanListString`(c)
`clusterData`(data[, n_clusters, method, …])	Return cluster labels
`downloadFile`(url, saveDir[, saveName])	Download a file from internet
`getDistanceOfBatch`(args)	Calculate correlation distance of metric for given batch
`getGenesOfPeak`(se[, peak, heightCutoff, …])	Find peak region of greatest value
`getPanglaoDBAnnotationsSummaryDf`(dirName[, …])
`getPeaks`(se[, threshold, distance, …])	Find peak regions
`getROC`(data)	Calculate axes of ROC (false positive rates and true positive rates) using index as thresholds
`get_df_distance`(df[, metric, genes, …])	Calculate distance measurement
`get_mean_std_cov_ofDataframe`(df)	Calculate mean, standard deviation, and covariance
`makeBarplot`(labels, saveDir, saveName)
`metric_euclidean_missing`(u, v)	Metric of euclidean distance between two arrays, excluding missing points
`movingAverageCentered`(a, halfWindowSize[, …])	Function to smooth a 1d signal
`normSum1`(data)
`readRDataFile`(fullPath[, …])	Read R data file
`reduce`(vIn[, size])	Interpolate data to reduce the number of data points
`reindexSeries`(args)	Assists in reindexing Series
`silhouette`(data, n_clusters, cluster_labels, …)
`testNumbersOfClusters`(data[, text, n_min, …])	Test different number of clusters with KMeans, calculate ARI for each

cleanListString(c)

movingAverageCentered(a, halfWindowSize, looped=False)

Function to smooth a 1d signal

Parameters:

a: ndarray: Input data
halfWindowSize: int: Size of half-window for averaging
looped: boolean, Default False: Determines looped behaviour at the boundaries

Returns:

ndarray: Smoothed signal

Usage:

movingAverageCentered(a, halfWindowSize)
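
As an illustration of what centered smoothing with a `looped` option can look like, here is a minimal NumPy sketch. It is not the library's implementation: the name `moving_average_centered_sketch`, the window width of `2 * half_window_size + 1`, and the edge handling (shrinking the window when not looped, wrapping around when looped) are all assumptions.

```python
import numpy as np

def moving_average_centered_sketch(a, half_window_size, looped=False):
    """Hypothetical centered moving average; NOT the library's own code."""
    a = np.asarray(a, dtype=float)
    n = len(a)
    out = np.empty(n)
    for i in range(n):
        if looped:
            # Assumption: treat the signal as periodic and wrap indices around.
            idx = np.arange(i - half_window_size, i + half_window_size + 1) % n
            out[i] = a[idx].mean()
        else:
            # Assumption: truncate the window where it runs past either end.
            lo = max(0, i - half_window_size)
            hi = min(n, i + half_window_size + 1)
            out[i] = a[lo:hi].mean()
    return out

signal = np.array([3.0, 0.0, 0.0, 0.0, 0.0])
smoothed = moving_average_centered_sketch(signal, 1)
smoothed_looped = moving_average_centered_sketch(signal, 1, looped=True)
```

With the looped variant, the spike at the first position also contributes to the last position, because the window wraps around the boundary.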

get_mean_std_cov_ofDataframe(df)

Calculate mean, standard deviation, and covariance

Parameters:

df: pandas.DataFrame: Data with bootstrap experiment data

Returns:

pandas.DataFrame: DataFrame with columns expressing mean, standard deviation, and covariance of input columns

Usage:

get_mean_std_cov_ofDataframe(df)

getGenesOfPeak(se, peak=None, heightCutoff=0.5, maxDistance=None)

Find peak region of greatest value

Parameters:

se: Series: Normalized aggregated data
peak: ndarray, Default None: Indices of max values in data
heightCutoff: float, Default 0.5: Height/value considered to be in peak
maxDistance: int, Default None: Maximum distance away considered to be in peak

Returns:

Genes appearing in the peak

Usage:

getGenesOfPeak(se)

getPeaks(se, threshold=0.2, distance=50, prominence=0.05, returnAllInfo=False)

Find peak regions

Parameters:

se: Series: Normalized aggregated data
threshold: float, Default 0.2: Minimum value to be considered peak
distance: int, Default 50: Minimum horizontal distance between peaks

Returns:

Indices of peaks satisfying input conditions

Usage:

getPeaks(se)
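
The parameters above resemble those of `scipy.signal.find_peaks`. The sketch below assumes that correspondence, in particular that `threshold` plays the role of SciPy's `height` argument, which this page does not confirm. The synthetic series is made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy.signal import find_peaks

# Hypothetical normalized signal with two bumps, near positions 50 and 150.
x = np.linspace(0.0, 1.0, 201)
values = (np.exp(-((x - 0.25) / 0.03) ** 2)
          + 0.6 * np.exp(-((x - 0.75) / 0.03) ** 2))
se = pd.Series(values)

# Assumption: the documented defaults map onto these find_peaks arguments.
peak_indices, properties = find_peaks(
    se.values,
    height=0.2,       # minimum value to be considered a peak
    distance=50,      # minimum horizontal distance between peaks
    prominence=0.05,  # minimum prominence of a peak
)
```

`properties` holds the per-peak heights and prominences that SciPy computes alongside the indices.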

getDistanceOfBatch(args)

Calculate correlation distance of metric for given batch

Parameters:

args: tuple

Tuple that contains:

batch: str: Batch identifier
df_sample: pandas.DataFrame: Expression data
metric: str: Metric name (e.g. ‘correlation’)
genes: list or 1d numpy.array: Genes of interest
minSize: int: Minimum size of input pandas.DataFrame
cutoff: float: Cutoff for percent expression of input data

Returns:

tuple:

Results in form of a tuple:

pandas.Series: Series containing correlation distance
str: Batch identifier
pandas.Series: Series of genes

Usage:

getDistanceOfBatch(args)

reindexSeries(args)

Assists in reindexing Series

Parameters:

args: tuple

Tuple that contains:

se: pandas.Series: Series to perform reindexing
batch: str: Batch identifier
index: list or 1d numpy.array: List of genes

Returns:

tuple:

Results in form of a tuple:

pandas.Series: Reindexed pandas.Series
str: Batch identifier

Usage:

reindexSeries(args)

get_df_distance(df, metric='correlation', genes=[], analyzeBy='batch', minSize=10, groupBatches=True, pname=None, cutoff=0.05, nCPUs=4)

Calculate distance measurement

Parameters:

df: pandas.DataFrame: Expression data
metric: str, Default ‘correlation’: Metric name (e.g. ‘correlation’)
genes: list, Default []: Genes for analysis
analyzeBy: str, Default ‘batch’: Level to analyze data by (e.g. batches)
minSize: int, Default 10: Minimum size of input pandas.DataFrame
groupBatches: boolean, Default True: Whether to group batches or save per-batch distance measure
pname: Default None: Deprecated
cutoff: float, Default 0.05: Cutoff for percent expression of input data
nCPUs: int, Default 4: Number of CPUs to use for multiprocessing

Returns:

pandas.DataFrame: Distance measure

Usage:

get_df_distance(df)

reduce(vIn, size=100)

Interpolate data to reduce the number of data points

Parameters:

vIn: 1d vector: Data to resample
size: int, Default 100: New data size

Returns:

Resampled data

Usage:

reduce(vIn)
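
To make the idea of resampling by interpolation concrete, here is a minimal sketch using linear interpolation with `numpy.interp`. The name `reduce_sketch` is hypothetical, and the choice of linear interpolation on an evenly spaced grid is an assumption; the library may interpolate differently.

```python
import numpy as np

def reduce_sketch(v_in, size=100):
    """Hypothetical resampler: linear interpolation onto `size` points."""
    v_in = np.asarray(v_in, dtype=float)
    # Place the original and the new samples on the same [0, 1] axis.
    old_x = np.linspace(0.0, 1.0, len(v_in))
    new_x = np.linspace(0.0, 1.0, size)
    return np.interp(new_x, old_x, v_in)

v = np.arange(1001, dtype=float)      # 1001 points: 0, 1, ..., 1000
resampled = reduce_sketch(v, size=11)  # 11 points spanning the same range
```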

metric_euclidean_missing(u, v)

Metric of euclidean distance between two arrays, excluding missing points

Parameters:

u: 1d vector: Data array
v: 1d vector: Data array

Returns:

ndarray: Non-negative square root of the array, element-wise

Usage:

metric_euclidean_missing(u, v)
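
A Euclidean distance that excludes missing points can be sketched as follows. This assumes that "missing" means NaN and that a position is dropped when either array is missing there. The name `euclidean_missing_sketch` is hypothetical and the library's handling may differ.

```python
import numpy as np

def euclidean_missing_sketch(u, v):
    """Hypothetical metric: Euclidean distance over positions valid in both arrays."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    # Assumption: a point is "missing" if it is NaN in either input.
    valid = ~(np.isnan(u) | np.isnan(v))
    return np.sqrt(np.sum((u[valid] - v[valid]) ** 2))

u = np.array([0.0, np.nan, 3.0, 1.0])
v = np.array([4.0, 2.0, np.nan, 4.0])
d = euclidean_missing_sketch(u, v)  # only positions 0 and 3 contribute
```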

binomialEnrichmentProbability(nx_obj, enriched_genes, target_genes=False, background_genes=False, PCNpath='data/')

Takes in a networkx object and a list of enriched genes and calculates the binomial enrichment based on the number of enriched interactions. Uses the survival function (defined as 1 - cdf): scipy.stats.binom.sf(k, n, p, loc=0)

Parameters:

nx_obj: networkx.Graph or str: A networkx undirected graph or edge list file name.
enriched_genes: list: List of enriched genes.
target_genes: list or boolean, Default False: List of target_genes. Default use all genes in background.
background_genes: list or boolean, Default False: List of genes to use as background probability. Default use all genes in the network.

Returns:

pandas.DataFrame
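
The description above names `scipy.stats.binom.sf`. The following sketch only illustrates that call on made-up numbers (`n_interactions`, `k_enriched`, and `p_background` are hypothetical) and does not reproduce how the function derives them from the network. Note that `sf(k, n, p)` gives P(X > k), so the probability of at least k successes is `sf(k - 1, n, p)`.

```python
from scipy.stats import binom

# Hypothetical inputs, chosen only for illustration.
n_interactions = 10   # number of interactions of a gene (trials)
k_enriched = 3        # interactions that involve enriched genes (successes)
p_background = 0.1    # background probability of an enriched interaction

# Survival function: probability of strictly more than k successes.
p_more_than_k = binom.sf(k_enriched, n_interactions, p_background)

# Probability of k or more successes requires shifting k by one.
p_at_least_k = binom.sf(k_enriched - 1, n_interactions, p_background)

# The survival function agrees with 1 - cdf.
one_minus_cdf = 1.0 - binom.cdf(k_enriched, n_interactions, p_background)
```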

getROC(data)

Calculate axes of ROC (false positive rates and true positive rates) using index as thresholds

Parameters:

data: 1d vector: Input data

Returns:

np.array: Array holding false positive rates
np.array: Array holding true positive rates

Usage:

getROC(data)

clusterData(data, n_clusters=None, method='Spectral', random_state=None)

Return cluster labels

Parameters:

data: np.array: Array of shape (features, objects)
n_clusters: int, Default None: Number of desired clusters
method: str or int, Default ‘Spectral’: Method used to cluster the data. Options: 1 or ‘Agglomerative’, 2 or ‘Spectral’, 3 or ‘KMeans’
random_state: int, Default None: Seed used to make the random initialization reproducible. Methods 2 and 3 start from a random initial state unless random_state is specified

Returns:

list: Cluster assignment of each object

Usage:

clusterData(data)
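
The method names listed above match estimators in `sklearn.cluster`. The sketch below assumes that scikit-learn is the backend (not confirmed on this page) and shows only the KMeans option on synthetic data. Because `data` has shape (features, objects), it is transposed before fitting, since scikit-learn expects one row per object.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: 3 features, 10 objects forming two well-separated groups.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=0.1, size=(3, 5))
group_b = rng.normal(loc=5.0, scale=0.1, size=(3, 5))
data = np.hstack([group_a, group_b])  # shape (features, objects) = (3, 10)

# A fixed random_state makes the random initialization reproducible.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(data.T).tolist()  # one cluster label per object
```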

testNumbersOfClusters(data, text='', n_min=2, n_max=20, k=10)

Test different number of clusters with KMeans, calculate ARI for each

Parameters:

data: np.array: Array of shape (features, objects)
text: str, Default ‘’: String identifier
n_min: int, Default 2: Minimum number of clusters
n_max: int, Default 20: Maximum number of clusters
k: int, Default 10: Number of iterations for each number of clusters

Usage:

testNumbersOfClusters(data)

KeyInStore(key, file)

Check if a key is present in an HDF store file

Parameters:

key: str: The key to check
file: str: Path to HDF store file

Returns:

True or False: Whether key is in store or not

Usage:

KeyInStore(key, file)

KeysOfStore(file)

Get list of keys in HDF store file

Parameters:

file: str: Path to HDF store file

Returns:

list: Keys

Usage:

KeysOfStore(file)

downloadFile(url, saveDir, saveName=None)

Download a file from internet

Parameters:

url: str: URL to download from
saveDir: str: Path to save downloaded file to
saveName: str, Default None: New name for downloaded file

Returns:

None

Usage:

downloadFile(url, saveDir)

readRDataFile(fullPath, takeGeneSymbolOnly=True, saveToHDF=True, returnSizeOnly=False)

Read R data file

Parameters:

fullPath: str: Path to file
takeGeneSymbolOnly: boolean, Default True: Whether to save gene symbol only
saveToHDF: boolean, Default True: Whether to save data in HDF format
returnSizeOnly: boolean, Default False: Get data size and return, without reading the data itself

Returns:

None

Usage:

readRDataFile(path)

normSum1(data)

silhouette(data, n_clusters, cluster_labels, saveDir, saveName)

getPanglaoDBAnnotationsSummaryDf(dirName, saveToFile=True, printDf=False)

makeBarplot(labels, saveDir, saveName)

autoDetect1dGroups(se, halfWindowSize=50, gaussianWidth=15, **kwargs)

adjustTexts1D(texts, fig, ax, w='auto', direction='auto', maxIterations=1000, tolerance=0.02)