The Cancer Genome Atlas (TCGA) is a landmark cancer genomics program that sequenced and molecularly characterized primary cancer samples from over 11,000 cases.
TCGA provides RNA-seq profiles for these primary cancer samples. I use two R packages for data retrieval:
TCGA2STAT enables users to easily download TCGA data directly into a format ready for statistical analysis in the R environment.
TCGAbiolinks aims to: i) facilitate retrieval of GDC open-access data, ii) prepare the data using appropriate pre-processing strategies, iii) provide the means to carry out different standard analyses, and iv) make it easy to reproduce earlier research results.
Here is the complete code for retrieving RNA-seq RPKM values and read counts, along with clinical data, for TCGA prostate cancer (PRAD).
# To install TCGA2STAT from the package archive file obtained from the package's GitHub:
# install.packages("TCGA2STAT_1.0.tar.gz", repos = NULL, type = "source")
# if (!require("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")
# BiocManager::install("TCGAbiolinks")

library("TCGA2STAT")
options(timeout = 10000)
rnaseq.RPKM.PRAD <- getTCGA(disease = "PRAD", data.type = "RNASeq2", type = "RPKM", clinical = TRUE)
rnaseq.count.PRAD <- getTCGA(disease = "PRAD", data.type = "RNASeq2", type = "count", clinical = TRUE)
# methyl.PRAD <- getTCGA(disease = "PRAD", data.type = "Methylation", type = "27K", clinical = TRUE)

library("TCGAbiolinks")
query_PRAD <- GDCquery(project = "TCGA-PRAD",
                       data.category = "Clinical",
                       data.type = "Clinical Supplement",
                       data.format = "BCR Biotab")
GDCdownload(query_PRAD)
clinical.BCRtab.all_PRAD <- GDCprepare(query_PRAD)
names(clinical.BCRtab.all_PRAD)
When you run the code above, progress information is printed to the screen.
The read counts of the prostate cancer samples are stored in rnaseq.count.PRAD$dat, while the clinical data are stored in rnaseq.count.PRAD$clinical.
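To make the layout of these two elements more concrete, here is a minimal sketch of how the count matrix and the clinical table might be matched up. It runs on a small mock object rather than on a real download, so the barcodes, gene names, and the clinical column are all made up. It also assumes that the columns of $dat are full TCGA sample barcodes and that the rows of $clinical are 12-character patient IDs; check this against your own object before relying on it.

```r
# Mock object with the same two elements as described above.
# All values below are invented for illustration; this is NOT real TCGA data.
set.seed(1)
barcodes <- c("TCGA-XX-0001-01A-11R-0001-07",
              "TCGA-XX-0002-01A-11R-0001-07",
              "TCGA-XX-0002-11A-11R-0001-07")
mock <- list(
  dat = matrix(rpois(6, lambda = 100), nrow = 2,
               dimnames = list(c("GENE1", "GENE2"), barcodes)),
  clinical = matrix(c("61", "58"), ncol = 1,
                    dimnames = list(c("TCGA-XX-0001", "TCGA-XX-0002"),
                                    "yearstobirth"))
)

# The first 12 characters of a TCGA barcode identify the patient.
patient_id <- substr(colnames(mock$dat), 1, 12)

# Characters 14-15 encode the sample type
# (01 = primary solid tumor, 11 = solid tissue normal).
sample_type <- substr(colnames(mock$dat), 14, 15)

# Keep only primary tumor samples (genes in rows, samples in columns).
tumor_counts <- mock$dat[, sample_type == "01", drop = FALSE]

# Pull the clinical rows for the patients behind those tumor samples.
clin_for_tumor <- mock$clinical[patient_id[sample_type == "01"], , drop = FALSE]

dim(tumor_counts)
rownames(clin_for_tumor)
```

With a real download, the same steps would be applied to rnaseq.count.PRAD in place of mock. Filtering on the sample type code first is useful because a patient can contribute both a tumor and a normal sample.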