Other functions

The generic tools grouped here are used by the class Analysis.

Submodule commonFunctions

Common functions used in analysis pipeline

Functions:

KeyInStore(key, file)

Check if a key is present in an HDF store file

KeysOfStore(file)

Get list of keys in HDF store file

adjustTexts1D(texts, fig, ax[, w, …])

autoDetect1dGroups(se[, halfWindowSize, …])

binomialEnrichmentProbability(nx_obj, …[, …])

Takes a networkx object and a list of enriched genes and calculates the binomial enrichment based on the number of enriched interactions.

cleanListString(c)

clusterData(data[, n_clusters, method, …])

Return cluster labels

downloadFile(url, saveDir[, saveName])

Download a file from the internet

getDistanceOfBatch(args)

Calculate correlation distance of metric for given batch

getGenesOfPeak(se[, peak, heightCutoff, …])

Find peak region of greatest value

getPanglaoDBAnnotationsSummaryDf(dirName[, …])

getPeaks(se[, threshold, distance, …])

Find peak regions

getROC(data)

Calculate axes of ROC (false positive rates and true positive rates) using index as thresholds

get_df_distance(df[, metric, genes, …])

Calculate distance measurement

get_mean_std_cov_ofDataframe(df)

Calculate mean, standard deviation, and covariance

makeBarplot(labels, saveDir, saveName)

metric_euclidean_missing(u, v)

Metric of euclidean distance between two arrays, excluding missing points

movingAverageCentered(a, halfWindowSize[, …])

Function to smooth a 1d signal

normSum1(data)

readRDataFile(fullPath[, …])

Read R data file

reduce(vIn[, size])

Interpolate data to reduce the number of data points

reindexSeries(args)

Assists in reindexing Series

silhouette(data, n_clusters, cluster_labels, …)

testNumbersOfClusters(data[, text, n_min, …])

Test different number of clusters with KMeans, calculate ARI for each

cleanListString(c)
movingAverageCentered(a, halfWindowSize, looped=False)[source]

Function to smooth a 1d signal

Parameters:
a: ndarray

Input data

halfWindowSize: int

Size of half-window for averaging

looped: boolean, Default False

Determines looped behaviour at the boundaries

Returns:
ndarray

Smoothed signal

Usage:

movingAverageCentered(a, halfWindowSize)
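
As an illustration of what a centered moving average with a half-window does, here is a minimal sketch. The function name `moving_average_centered_sketch` is hypothetical, and the boundary handling (truncating the window when `looped` is False, wrapping around when it is True) is an assumption rather than the library's confirmed behaviour.

```python
import numpy as np


def moving_average_centered_sketch(a, half_window_size, looped=False):
    """Average each point with its neighbours within half_window_size.

    Hypothetical re-implementation for illustration only.
    """
    a = np.asarray(a, dtype=float)
    n = len(a)
    out = np.empty(n)
    for i in range(n):
        idx = np.arange(i - half_window_size, i + half_window_size + 1)
        if looped:
            # Assumption: treat the signal as periodic and wrap around.
            idx = idx % n
        else:
            # Assumption: shrink the window where it runs off the ends.
            idx = idx[(idx >= 0) & (idx < n)]
        out[i] = a[idx].mean()
    return out


spike = np.array([3.0, 0.0, 0.0, 0.0, 0.0])
smoothed_open = moving_average_centered_sketch(spike, 1)
smoothed_looped = moving_average_centered_sketch(spike, 1, looped=True)
```

With `looped=True` the spike at the first position also contributes to the last position, because the window wraps around the end of the array.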

get_mean_std_cov_ofDataframe(df)[source]

Calculate mean, standard deviation, and covariance

Parameters:
df: pandas.DataFrame

Data with bootstrap experiment data

Returns:
pandas.DataFrame:

DataFrame with columns expressing mean, standard deviation, and covariance of input columns

Usage:

get_mean_std_cov_ofDataframe(df)

getGenesOfPeak(se, peak=None, heightCutoff=0.5, maxDistance=None)[source]

Find peak region of greatest value

Parameters:
se: Series

Normalized aggregated data

peak: ndarray, Default None

Indices of max values in data

heightCutoff: float, Default 0.5

Height/value considered to be in peak

maxDistance: int, Default None

Maximum distance away considered to be in peak

Returns:

Genes appearing in the peak

Usage:

getGenesOfPeak(se)

getPeaks(se, threshold=0.2, distance=50, prominence=0.05, returnAllInfo=False)[source]

Find peak regions

Parameters:
se: Series

Normalized aggregated data

threshold: float, Default 0.2

Minimum value to be considered peak

distance: int, Default 50

Minimum horizontal distance between peaks

Returns:

Indices of peaks satisfying input conditions

Usage:

getPeaks(se)
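
The parameters above resemble those of `scipy.signal.find_peaks`. The sketch below assumes that `threshold` plays the role of SciPy's `height` argument (minimum peak value); this mapping and the name `get_peaks_sketch` are assumptions, not a statement about how the library implements `getPeaks`.

```python
import numpy as np
from scipy.signal import find_peaks


def get_peaks_sketch(values, threshold=0.2, distance=50, prominence=0.05):
    """Return indices of peaks at least `threshold` high (hypothetical helper)."""
    peaks, _ = find_peaks(
        values,
        height=threshold,       # assumed meaning of `threshold`
        distance=distance,      # minimum horizontal distance between peaks
        prominence=prominence,  # minimum prominence of a peak
    )
    return peaks


# Synthetic signal with two well-separated bumps centred at 60 and 200.
x = np.arange(300)
values = np.exp(-((x - 60) / 10.0) ** 2) + 0.6 * np.exp(-((x - 200) / 10.0) ** 2)
peaks = get_peaks_sketch(values)
```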

getDistanceOfBatch(args)[source]

Calculate correlation distance of metric for given batch

Parameters:
args: tuple
Tuple that contains:
batch: str

Batch identifier

df_sample: pandas.DataFrame

Expression data

metric: str

Metric name (e.g. ‘correlation’)

genes: list or 1d numpy.array

Genes of interest

minSize: int

Minimum size of input pandas.DataFrame

cutoff: float

Cutoff for percent expression of input data

Returns:
tuple:
Results in form of a tuple:
pandas.Series

Series containing correlation distance

str

Batch identifier

pandas.Series

Series of genes

Usage:

getDistanceOfBatch(args)

reindexSeries(args)[source]

Assists in reindexing Series

Parameters:
args: tuple
Tuple that contains:
se: pandas.Series

Series to perform reindexing

batch: str

Batch identifier

index: list or 1d numpy.array

List of genes

Returns:
tuple:
Results in form of a tuple:
pandas.Series

Reindexed pandas.Series

str

Batch identifier

Usage:

reindexSeries(args)

get_df_distance(df, metric='correlation', genes=[], analyzeBy='batch', minSize=10, groupBatches=True, pname=None, cutoff=0.05, nCPUs=4)[source]

Calculate distance measurement

Parameters:
df: pandas.DataFrame

Expression data

metric: str, Default ‘correlation’

Metric name (e.g. ‘correlation’)

genes: list, Default []

Genes for analysis

analyzeBy: str, Default ‘batch’

Level to analyze data by (e.g. batches)

minSize: int, Default 10

Minimum size of input pandas.DataFrame

groupBatches: boolean, Default True

Whether to group batches or save a per-batch distance measure

pname: Default None

Deprecated

cutoff: float, Default 0.05

Cutoff for percent expression of input data

nCPUs: int, Default 4

Number of CPUs to use for multiprocessing

Returns:
pandas.DataFrame

Distance measure

Usage:

get_df_distance(df)
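
For orientation, the 'correlation' metric is conventionally defined as one minus the Pearson correlation coefficient. The sketch below computes that quantity between the rows of a small matrix; it does not model batches, `minSize`, `cutoff`, or multiprocessing, and `correlation_distance_sketch` is a hypothetical name.

```python
import numpy as np


def correlation_distance_sketch(X):
    """Pairwise correlation distance (1 - Pearson r) between the rows of X."""
    return 1.0 - np.corrcoef(np.asarray(X, dtype=float))


# Three hypothetical genes measured in four cells.
expr = np.array([
    [1.0, 2.0, 3.0, 4.0],  # gene A
    [2.0, 4.0, 6.0, 8.0],  # gene B, perfectly correlated with A
    [4.0, 3.0, 2.0, 1.0],  # gene C, perfectly anti-correlated with A
])
dist = correlation_distance_sketch(expr)
```

Perfectly correlated rows end up at distance 0 and perfectly anti-correlated rows at distance 2.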

reduce(vIn, size=100)[source]

Interpolate data to reduce the number of data points

Parameters:
vIn: 1d vector

Data to resample

size: int, Default 100

New data size

Returns:

Resampled data

Usage:

reduce(vIn)
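
One way to resample a vector to a fixed number of points is linear interpolation on a normalized axis. The sketch below assumes linear interpolation via `numpy.interp`; the interpolation scheme actually used by `reduce` is not stated here, and `reduce_sketch` is a hypothetical name.

```python
import numpy as np


def reduce_sketch(v_in, size=100):
    """Resample v_in to `size` points by linear interpolation (assumption)."""
    v_in = np.asarray(v_in, dtype=float)
    old_x = np.linspace(0.0, 1.0, len(v_in))  # positions of the input samples
    new_x = np.linspace(0.0, 1.0, size)       # positions of the output samples
    return np.interp(new_x, old_x, v_in)


dense = np.linspace(0.0, 10.0, 1001)
coarse = reduce_sketch(dense, size=11)
```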

metric_euclidean_missing(u, v)[source]

Metric of euclidean distance between two arrays, excluding missing points

Parameters:
u: 1d vector

Data array

v: 1d vector

Data array

Returns:
ndarray

Non-negative square root of the array, element-wise

Usage:

metric_euclidean_missing(u, v)
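
A Euclidean distance that excludes missing points can be sketched as follows, assuming that "missing" means NaN in either array and that only positions valid in both arrays contribute. The name `euclidean_missing_sketch` is hypothetical.

```python
import numpy as np


def euclidean_missing_sketch(u, v):
    """Euclidean distance over positions where both u and v are not NaN."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    valid = ~(np.isnan(u) | np.isnan(v))  # assumption: NaN marks missing data
    return float(np.sqrt(np.sum((u[valid] - v[valid]) ** 2)))


u = np.array([0.0, np.nan, 3.0, 1.0])
v = np.array([4.0, 2.0, np.nan, 4.0])
d = euclidean_missing_sketch(u, v)  # only positions 0 and 3 are used
```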

binomialEnrichmentProbability(nx_obj, enriched_genes, target_genes=False, background_genes=False, PCNpath='data/')[source]

Takes a networkx object and a list of enriched genes and calculates the binomial enrichment based on the number of enriched interactions. Uses the survival function (also defined as 1 - cdf): scipy.stats.binom.sf(k, n, p, loc=0)

Parameters:
nx_obj: networkx.Graph or str

A networkx undirected graph or edge list file name.

enriched_genes: list

List of enriched genes.

target_genes: list or boolean, Default False

List of target_genes. Default use all genes in background.

background_genes: list or boolean, Default False

List of genes to use as background probability. Default use all genes in the network.

Returns:

pandas.DataFrame
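
To make the survival function concrete, the sketch below evaluates `scipy.stats.binom.sf` for one hypothetical gene and cross-checks it against the binomial tail summed by hand. The numbers are invented for illustration, and whether the library passes `k` or `k - 1` is not stated here.

```python
import math

from scipy.stats import binom

# Hypothetical numbers for a single gene.
n_interactions = 20  # total interactions of the gene
k_enriched = 6       # interactions with enriched genes
p_background = 0.1   # background probability of an enriched interaction

# sf(k) is P(X > k), i.e. 1 - cdf(k).
p_more_than_k = binom.sf(k_enriched, n_interactions, p_background)
# P(X >= k) is obtained by shifting k down by one.
p_at_least_k = binom.sf(k_enriched - 1, n_interactions, p_background)

# Same upper tail computed directly from the binomial probability mass function.
tail_by_hand = sum(
    math.comb(n_interactions, i)
    * p_background ** i
    * (1 - p_background) ** (n_interactions - i)
    for i in range(k_enriched + 1, n_interactions + 1)
)
```

A small value indicates that the observed number of enriched interactions is unlikely under the background probability.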

getROC(data)[source]

Calculate axes of ROC (false positive rates and true positive rates) using index as thresholds

Parameters:
data: 1d vector

Input data

Returns:
np.array

Array holding false positive rates

np.array

Array holding true positive rates

Usage:

getROC(data)
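
One reading of "using index as thresholds" is that the input is a ranked vector of binary labels and that thresholding at index i calls the first i items positive. The sketch below follows that reading; the interpretation and the name `roc_axes_sketch` are assumptions and may not match the library's implementation.

```python
import numpy as np


def roc_axes_sketch(labels):
    """ROC axes for a ranked 0/1 label vector, one point per index threshold."""
    labels = np.asarray(labels, dtype=bool)
    # Counts of true and false positives when the first i items are called positive.
    tp = np.concatenate([[0], np.cumsum(labels)])
    fp = np.concatenate([[0], np.cumsum(~labels)])
    tpr = tp / labels.sum()
    fpr = fp / (~labels).sum()
    return fpr, tpr


ranked = [1, 1, 0, 1, 0, 0]  # hypothetical ranking, best first
fpr, tpr = roc_axes_sketch(ranked)

# Area under the curve by the trapezoidal rule.
auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))
```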

clusterData(data, n_clusters=None, method='Spectral', random_state=None)[source]

Return cluster labels

Parameters:
data: np.array

Array of shape (features, objects)

n_clusters: int, Default None

Number of desired clusters

method: str or int, Default ‘Spectral’

Method used to cluster the data. Options: 1 or ‘Agglomerative’, 2 or ‘Spectral’, 3 or ‘KMeans’

random_state: int, Default None

Used to make the random behaviour deterministic. For methods 2 and 3 the initial state is random unless random_state is specified

Returns:
list

Cluster assignment of each object

Usage:

clusterData(data)
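
The method options listed above correspond by name to estimators in scikit-learn. The sketch below shows how such a dispatch could look, assuming scikit-learn is the backend; unlike the documented signature it requires `n_clusters` explicitly, and `cluster_data_sketch` is a hypothetical name.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans, SpectralClustering


def cluster_data_sketch(data, n_clusters, method="Spectral", random_state=None):
    """Cluster the columns (objects) of a (features, objects) array."""
    if method in (1, "Agglomerative"):
        model = AgglomerativeClustering(n_clusters=n_clusters)
    elif method in (2, "Spectral"):
        model = SpectralClustering(n_clusters=n_clusters, random_state=random_state)
    elif method in (3, "KMeans"):
        model = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    else:
        raise ValueError(f"Unknown method: {method!r}")
    # scikit-learn expects (objects, features), hence the transpose.
    return model.fit_predict(np.asarray(data).T).tolist()


# Two tight, well-separated groups of 10 objects with 2 features each.
rng = np.random.default_rng(0)
group_a = rng.normal(0.0, 0.1, size=(2, 10))
group_b = rng.normal(5.0, 0.1, size=(2, 10))
data = np.hstack([group_a, group_b])

labels = cluster_data_sketch(data, n_clusters=2, method="KMeans", random_state=0)
```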

testNumbersOfClusters(data, text='', n_min=2, n_max=20, k=10)[source]

Test different number of clusters with KMeans, calculate ARI for each

Parameters:
data: np.array

Array of shape (features, objects)

text: str, Default ‘’

String identifier

n_min: int, Default 2

Minimum number of clusters

n_max: int, Default 20

Maximum number of clusters

k: int, Default 10

Number of iterations for each number of clusters

Usage:

testNumbersOfClusters(data)

KeyInStore(key, file)[source]

Check if a key is present in an HDF store file

Parameters:
key: str

The key to check

file: str

Path to HDF store file

Returns:
True or False

Whether key is in store or not

Usage:

KeyInStore(key, file)

KeysOfStore(file)[source]

Get list of keys in HDF store file

Parameters:
file: str

Path to HDF store file

Returns:
list

Keys

Usage:

KeysOfStore(file)

downloadFile(url, saveDir, saveName=None)[source]

Download a file from the internet

Parameters:
url: str

URL to download from

saveDir: str

Path to save downloaded file to

saveName: str, Default None

New name for downloaded file

Returns:

None

Usage:

downloadFile(url, saveDir)
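
A download helper with this signature can be sketched with the standard library alone. The version below is an assumption-based illustration: it derives the file name from the URL when `save_name` is None, creates the target directory if needed, and returns the saved path for convenience, whereas the documented function returns None. The name `download_file_sketch` is hypothetical.

```python
import os
import urllib.request
from urllib.parse import urlparse


def download_file_sketch(url, save_dir, save_name=None):
    """Download `url` into `save_dir` and return the path of the saved file."""
    os.makedirs(save_dir, exist_ok=True)
    if save_name is None:
        # Assumption: fall back to the last path component of the URL.
        save_name = os.path.basename(urlparse(url).path)
    target = os.path.join(save_dir, save_name)
    urllib.request.urlretrieve(url, target)
    return target
```

Passing `save_name` overrides the name taken from the URL, which is useful when the URL ends in a query string or an opaque identifier.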

readRDataFile(fullPath, takeGeneSymbolOnly=True, saveToHDF=True, returnSizeOnly=False)[source]

Read R data file

Parameters:
fullPath: str

Path to file

takeGeneSymbolOnly: boolean, Default True

Whether to save the gene symbol only

saveToHDF: boolean, Default True

Whether to save data in HDF format

returnSizeOnly: boolean, Default False

Get data size and return, without reading the data itself

Returns:

None

Usage:

readRDataFile(path)

normSum1(data)[source]
silhouette(data, n_clusters, cluster_labels, saveDir, saveName)[source]
getPanglaoDBAnnotationsSummaryDf(dirName, saveToFile=True, printDf=False)[source]
makeBarplot(labels, saveDir, saveName)[source]
autoDetect1dGroups(se, halfWindowSize=50, gaussianWidth=15, **kwargs)[source]
adjustTexts1D(texts, fig, ax, w='auto', direction='auto', maxIterations=1000, tolerance=0.02)[source]