Other functions
The generic tools grouped here are used by the class Analysis.
Submodule commonFunctions
Common functions used in analysis pipeline
Functions:
KeyInStore | Check if a key is present in HDF store file
KeysOfStore | Get list of keys in HDF store file
binomialEnrichmentProbability | Takes in a networkx object and a list of enriched genes and calculates the binomial enrichment based on the number of enriched interactions
clusterData | Return cluster labels
downloadFile | Download a file from internet
getDistanceOfBatch | Calculate correlation distance of metric for given batch
getGenesOfPeak | Find peak region of greatest value
getPeaks | Find peak regions
getROC | Calculate axes of ROC (false positive rates and true positive rates) using index as thresholds
get_df_distance | Calculate distance measurement
get_mean_std_cov_ofDataframe | Calculate mean, standard deviation, and covariance
metric_euclidean_missing | Metric of euclidean distance between two arrays, excluding missing points
movingAverageCentered | Function to smooth a 1d signal
readRDataFile | Read R data file
reduce | Interpolate data to reduce the number of data points
reindexSeries | Assists in reindexing Series
testNumbersOfClusters | Test different number of clusters with KMeans, calculate ARI for each
-
cleanListString(c)
-
movingAverageCentered(a, halfWindowSize, looped=False)
Function to smooth a 1d signal
- Parameters:
- a: ndarray
Input data
- halfWindowSize: int
Size of half-window for averaging
- looped: boolean, Default False
Determines looped behaviour at the boundaries
- Returns:
- ndarray
Smoothed signal
- Usage:
movingAverageCentered(a, halfWindowSize)
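To illustrate what a centered moving average with an optional looped boundary can look like, here is a minimal sketch. It is not the library's implementation: the name moving_average_centered_sketch is hypothetical, and truncating the window at the edges when looped is False is an assumption, since the boundary handling is not described above.

```python
import numpy as np

def moving_average_centered_sketch(a, half_window_size, looped=False):
    """Average each point with up to half_window_size neighbours on each side."""
    a = np.asarray(a, dtype=float)
    n = len(a)
    out = np.empty(n)
    for i in range(n):
        if looped:
            # Wrap indices around the ends, treating the signal as periodic
            idx = np.arange(i - half_window_size, i + half_window_size + 1) % n
            out[i] = a[idx].mean()
        else:
            # Assumption: shrink the window where it would run off the signal
            lo = max(0, i - half_window_size)
            hi = min(n, i + half_window_size + 1)
            out[i] = a[lo:hi].mean()
    return out

signal = np.array([3.0, 0.0, 0.0, 0.0, 0.0])
open_ends = moving_average_centered_sketch(signal, 1)
periodic = moving_average_centered_sketch(signal, 1, looped=True)
```

With looped=True the last point also averages in the first point of the signal, so the two results differ only near the boundaries.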
-
get_mean_std_cov_ofDataframe(df)
Calculate mean, standard deviation, and covariance
- Parameters:
- df: pandas.DataFrame
Data with bootstrap experiment data
- Returns:
- pandas.DataFrame:
DataFrame with columns expressing mean, standard deviation, and covariance of input columns
- Usage:
get_mean_std_cov_ofDataframe(df)
-
getGenesOfPeak(se, peak=None, heightCutoff=0.5, maxDistance=None)
Find peak region of greatest value
- Parameters:
- se: Series
Normalized aggregated data
- peak: ndarray, Default None
Indices of max values in data
- heightCutoff: float, Default 0.5
Height/value considered to be in peak
- maxDistance: int, Default None
Maximum distance away considered to be in peak
- Returns:
Genes appearing in the peak
- Usage:
getGenesOfPeak(se)
-
getPeaks(se, threshold=0.2, distance=50, prominence=0.05, returnAllInfo=False)
Find peak regions
- Parameters:
- se: Series
Normalized aggregated data
- threshold: float, Default 0.2
Minimum value to be considered peak
- distance: int, Default 50
Minimum horizontal distance between peaks
- Returns:
Indices of peaks satisfying input conditions
- Usage:
getPeaks(se)
-
getDistanceOfBatch(args)
Calculate correlation distance of metric for given batch
- Parameters:
- args: tuple
- Tuple that contains:
- batch: str
Batch identifier
- df_sample: pandas.DataFrame
Expression data
- metric: str
Metric name (e.g. ‘correlation’)
- genes: list or 1d numpy.array
Genes of interest
- minSize: int
Minimum size of input pandas.DataFrame
- cutoff: float
Cutoff for percent expression of input data
- Returns:
- tuple:
- Results in form of a tuple:
- pandas.Series
Series containing correlation distance
- str
Batch identifier
- pandas.Series
Series of genes
- Usage:
getDistanceOfBatch
-
reindexSeries(args)
Assists in reindexing Series
- Parameters:
- args: tuple
- Tuple that contains:
- se: pandas.Series
Series to perform reindexing
- batch: str
Batch identifier
- index: list or 1d numpy.array
List of genes
- Returns:
- tuple:
- Results in form of a tuple:
- pandas.Series
Reindexed pandas.Series
- str
Batch identifier
- Usage:
reindexSeries
-
get_df_distance(df, metric='correlation', genes=[], analyzeBy='batch', minSize=10, groupBatches=True, pname=None, cutoff=0.05, nCPUs=4)
Calculate distance measurement
- Parameters:
- df: pandas.DataFrame
Expression data
- metric: str, Default ‘correlation’
Metric name (e.g. ‘correlation’)
- genes: list, Default []
Genes for analysis
- analyzeBy: str, Default ‘batch’
Level to analyze data by (e.g. batches)
- minSize: int, Default 10
Minimum size of input pandas.DataFrame
- groupBatches: boolean, Default True
Whether to group batches or save per-batch distance measure
- pname: Default None
Deprecated
- cutoff: float, Default 0.05
Cutoff for percent expression of input data
- nCPUs: int, Default 4
Number of CPUs to use for multiprocessing
- Returns:
- pandas.DataFrame
Distance measure
- Usage:
get_df_distance(df)
-
reduce(vIn, size=100)
Interpolate data to reduce the number of data points
- Parameters:
- vIn: 1d vector
Data to resample
- size: int, Default 100
New data size
- Returns:
Resampled data
- Usage:
reduce(vIn)
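As an illustration of resampling by interpolation, the sketch below maps a long vector onto a fixed number of evenly spaced points. It assumes linear interpolation via numpy.interp; the interpolation scheme used by reduce is not stated above, and the name reduce_sketch is hypothetical.

```python
import numpy as np

def reduce_sketch(v_in, size=100):
    """Resample a 1d vector to `size` evenly spaced points (linear interpolation)."""
    v_in = np.asarray(v_in, dtype=float)
    # Place the original and the new samples on the same [0, 1] axis
    old_x = np.linspace(0.0, 1.0, len(v_in))
    new_x = np.linspace(0.0, 1.0, size)
    return np.interp(new_x, old_x, v_in)

dense = np.linspace(0.0, 10.0, 1001)    # 1001 points
coarse = reduce_sketch(dense, size=11)  # 11 points spanning the same range
```

The first and last points of the input are preserved, since both grids share their end points.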
-
metric_euclidean_missing(u, v)
Metric of euclidean distance between two arrays, excluding missing points
- Parameters:
- u: 1d vector
Data array
- v: 1d vector
Data array
- Returns:
- ndarray
Non-negative square root of the array, element-wise
- Usage:
metric_euclidean_missing(u, v)
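A Euclidean distance that skips missing points can be sketched as follows. This assumes that missing values are encoded as NaN and that a position is dropped when either array is missing there; both are assumptions, and euclidean_missing_sketch is a hypothetical name rather than the library function.

```python
import numpy as np

def euclidean_missing_sketch(u, v):
    """Euclidean distance over positions where both u and v are present."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    # Assumption: NaN marks a missing point in either array
    valid = ~(np.isnan(u) | np.isnan(v))
    return float(np.sqrt(np.sum((u[valid] - v[valid]) ** 2)))

# The middle position is ignored because u is missing there
d = euclidean_missing_sketch([0.0, np.nan, 3.0], [4.0, 1.0, 0.0])
```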
-
binomialEnrichmentProbability(nx_obj, enriched_genes, target_genes=False, background_genes=False, PCNpath='data/')
Takes in a networkx object and a list of enriched genes and calculates the binomial enrichment based on the number of enriched interactions. Uses the survival function (also defined as 1 - cdf): scipy.stats.binom.sf(k, n, p, loc=0)
- Parameters:
- nx_obj: networkx.Graph or str
A networkx undirected graph or edge list file name.
- enriched_genes: list
List of enriched genes.
- target_genes: list or boolean, Default False
List of target_genes. Default use all genes in background.
- background_genes: list or boolean, Default False
List of genes to use as background probability. Default use all genes in the network.
- Returns:
pandas.DataFrame
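The survival function mentioned above, sf(k, n, p) = 1 - cdf(k) = P(X > k), can be written out with the standard library alone. The sketch below only shows that quantity; how binomialEnrichmentProbability chooses k, n and p for each gene is not described here, so the numbers are hypothetical.

```python
from math import comb

def binom_sf_sketch(k, n, p):
    """P(X > k) for X ~ Binomial(n, p), i.e. 1 - cdf(k)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1, n + 1))

# Hypothetical gene: 10 interaction partners, 6 of them enriched,
# with 20% of the background genes enriched.
n_partners, n_enriched, background_rate = 10, 6, 0.2
# sf(k - 1) gives P(X >= k), the chance of seeing at least 6 enriched partners
p_at_least = binom_sf_sketch(n_enriched - 1, n_partners, background_rate)
```

A small value indicates that the gene has more enriched partners than the background rate would suggest.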
-
getROC(data)
Calculate axes of ROC (false positive rates and true positive rates) using index as thresholds
- Parameters:
- data: 1d vector
Input data
- Returns:
- np.array
Array holding false positive rates
- np.array
Array holding true positive rates
- Usage:
getROC(data)
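One way to read "using index as thresholds" is sketched below: the input is taken to be a vector of binary labels already ordered from highest to lowest score, and every position serves as a cut, with everything up to that position predicted positive. That reading of the input is an assumption, and roc_axes_sketch is a hypothetical name, not the library function.

```python
import numpy as np

def roc_axes_sketch(labels):
    """FPR and TPR when each index of a ranked 0/1 label vector is used as a cut."""
    labels = np.asarray(labels, dtype=bool)
    # Cumulative counts of positives and negatives among the first i + 1 items
    tp = np.cumsum(labels)
    fp = np.cumsum(~labels)
    # Assumes at least one positive and one negative label are present
    tpr = tp / labels.sum()
    fpr = fp / (~labels).sum()
    return fpr, tpr

fpr, tpr = roc_axes_sketch([1, 1, 0, 1, 0])
```

Plotting tpr against fpr then gives the ROC curve for that ranking.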
-
clusterData(data, n_clusters=None, method='Spectral', random_state=None)
Return cluster labels
- Parameters:
- data: np.array
Array of shape (features, objects)
- n_clusters: int, Default None
Number of desired clusters
- method: str or int , Default ‘Spectral’
Method used to cluster the data. Options: 1 or ‘Agglomerative’, 2 or ‘Spectral’, 3 or ‘KMeans’
- random_state: int, Default None
Seed used to make the randomness deterministic. For methods 2 and 3 the initial state is random unless random_state is specified
- Returns:
- list
Cluster assignment of each object
- Usage:
clusterData(data)
-
testNumbersOfClusters(data, text='', n_min=2, n_max=20, k=10)
Test different number of clusters with KMeans, calculate ARI for each
- Parameters:
- data: np.array
Array of shape (features, objects)
- text: str, Default ‘’
String identifier
- n_min: int, Default 2
Minimum number of clusters
- n_max: int, Default 20
Maximum number of clusters
- k: int, Default 10
Number of iterations for each number of clusters
- Usage:
testNumbersOfClusters(data)
-
KeyInStore(key, file)
Check if a key is present in HDF store file
- Parameters:
- key: str
The key to check
- file: str
Path to HDF store file
- Returns:
- True or False
Whether key is in store or not
- Usage:
KeyInStore(key, file)
-
KeysOfStore(file)
Get list of keys in HDF store file
- Parameters:
- file: str
Path to HDF store file
- Returns:
- list
Keys
- Usage:
KeysOfStore(file)
-
downloadFile(url, saveDir, saveName=None)
Download a file from internet
- Parameters:
- url: str
URL to download from
- saveDir: str
Path to save downloaded file to
- saveName: str, Default None
New name for downloaded file
- Returns:
None
- Usage:
downloadFile(url, saveDir)
-
readRDataFile(fullPath, takeGeneSymbolOnly=True, saveToHDF=True, returnSizeOnly=False)
Read R data file
- Parameters:
- fullPath: str
Path to file
- takeGeneSymbolOnly: boolean, Default True
Whether to save gene symbol only
- saveToHDF: boolean, Default True
Whether to save data in HDF format
- returnSizeOnly: boolean, Default False
Get data size and return, without reading the data itself
- Returns:
None
- Usage:
readRDataFile(path)