Load raw counts and return the filtered and normalized data. Alternatively, the user can provide a numeric data frame of raw RNA-seq counts.

Usage

transcript_normalize_counts(
  tissue,
  min_cpm = 0.5,
  min_num_samples = 2,
  norm_method = "TMM",
  counts = NULL
)

Arguments

tissue

character, tissue abbreviation, one of MotrpacRatTraining6moData::TISSUE_ABBREV

min_cpm

double, retain genes with more than min_cpm counts per million in at least min_num_samples samples

min_num_samples

double, minimum number of samples in which a gene must have more than min_cpm counts per million to be retained

norm_method

character, one of c("TMM","TMMwsp","RLE","upperquartile","none"). "TMM" by default.

counts

optional user-supplied numeric data frame or matrix where row names are gene IDs and column names are sample identifiers

Value

data frame where row names are feature_ID and column names are viallabel

Details

Note that although this function uses the same code that generated the normalized RNA-seq data tables (MotrpacRatTraining6moData::TRNSCRPT_NORM_DATA) and the normalized RNA-seq data available through the MoTrPAC Data Hub, transcript_normalize_counts(tissue) yields slightly fewer genes than the corresponding MotrpacRatTraining6moData::TRNSCRPT_NORM_DATA object. Investigation suggests that this discrepancy stems from minor functional differences between the versions of edgeR::cpm() in use about 2.5 years apart. Find more details in this GitHub issue.
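
The filtering rule described by min_cpm and min_num_samples can be illustrated with a minimal base R sketch. This is not the package's implementation: the toy count matrix is hypothetical, and counts per million are computed here from raw column sums as a stand-in for edgeR::cpm(). The normalization step selected by norm_method is not shown.

```r
# Toy raw-count matrix (hypothetical values): rows are genes, columns are samples
counts <- matrix(
  c(  0,   0,   0,
     10,   0,  12,
    500, 450, 520,
      1,   0,   0),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("gene1", "gene2", "gene3", "gene4"), c("s1", "s2", "s3"))
)

min_cpm <- 0.5
min_num_samples <- 2

# Counts per million based on raw library sizes (column sums);
# assumed here as a simple stand-in for edgeR::cpm()
cpm <- t(t(counts) / colSums(counts)) * 1e6

# Retain genes with more than min_cpm CPM in at least min_num_samples samples
keep <- rowSums(cpm > min_cpm) >= min_num_samples
filtered <- counts[keep, , drop = FALSE]
```

In this toy example only gene2 and gene3 pass the filter: gene1 has no counts in any sample, and gene4 exceeds min_cpm in a single sample.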

Examples

norm_data = transcript_normalize_counts("LUNG")

# Simulate "user-supplied data"
counts = load_sample_data("LUNG", "TRNSCRPT", normalized=FALSE)
#> TRNSCRPT_LUNG_RAW_COUNTS
counts = df_to_numeric(counts)
norm_data = transcript_normalize_counts(counts = counts)