NoteForDataScience

Retrieve RNA-seq Profiles and Read Counts from TCGA

The Cancer Genome Atlas (TCGA) is a landmark cancer genomics program that sequenced and molecularly characterized primary cancer samples from over 11,000 cases.

TCGA provides RNA-seq profiles for these primary cancer samples. I use two R packages for data retrieval:

TCGA2STAT enables users to easily download TCGA data directly into a format ready for statistical analysis in the R environment.

TCGAbiolinks aims to: i) facilitate GDC open-access data retrieval, ii) prepare the data using appropriate pre-processing strategies, iii) provide the means to carry out different standard analyses, and iv) easily reproduce earlier research results.

Here is the complete code for retrieving RNA-seq profiles and read counts for TCGA prostate cancer (PRAD).

# Install TCGA2STAT from the package archive file obtained from the package's GitHub:
# install.packages("TCGA2STAT_1.0.tar.gz", repos = NULL, type = "source")
# Install TCGAbiolinks from Bioconductor:
# if (!require("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")
# BiocManager::install("TCGAbiolinks")

library("TCGA2STAT")
# Raise the download timeout (in seconds) so that large files can finish downloading
options(timeout = 10000)
# RNA-seq profiles (type = "RPKM") and read counts (type = "count"), with clinical data
rnaseq.RPKM.PRAD <- getTCGA(disease = "PRAD", data.type = "RNASeq2", type = "RPKM", clinical = TRUE)
rnaseq.count.PRAD <- getTCGA(disease = "PRAD", data.type = "RNASeq2", type = "count", clinical = TRUE)
# Optional: methylation data
# methyl.PRAD <- getTCGA(disease = "PRAD", data.type = "Methylation", type = "27K", clinical = TRUE)

library("TCGAbiolinks")
# Query, download, and load the clinical supplement (BCR Biotab files) from the GDC
query_PRAD <- GDCquery(project = "TCGA-PRAD",
                       data.category = "Clinical",
                       data.type = "Clinical Supplement",
                       data.format = "BCR Biotab")
GDCdownload(query_PRAD)
clinical.BCRtab.all_PRAD <- GDCprepare(query_PRAD)
# List the clinical tables that were loaded
names(clinical.BCRtab.all_PRAD)

When you run the code above, the packages print status information to the screen.

The read counts of the prostate cancer samples are stored in rnaseq.count.PRAD$dat, while the clinical data are stored in rnaseq.count.PRAD$clinical.
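As a rough sketch of how these two components might be used together, the example below builds a small stand-in list with the same two fields and links its samples to the clinical records. Everything in it is made up for illustration: the object `toy`, the gene names, the barcodes, and the clinical columns are hypothetical. It also assumes that `$dat` is a genes-by-samples matrix whose column names are full TCGA barcodes and that `$clinical` has one row per patient named by the 12-character patient ID; check this against your own download before relying on it.

```r
# Hypothetical stand-in for the list returned by getTCGA(); all values are made up.
# Assumption: $dat is genes x samples, $clinical is patients x covariates.
set.seed(1)
toy <- list(
  dat = matrix(
    rpois(12, lambda = 50), nrow = 3,
    dimnames = list(
      c("GENE_A", "GENE_B", "GENE_C"),
      c("TCGA-XX-0001-01A-11R-0000-07", "TCGA-XX-0002-01A-11R-0000-07",
        "TCGA-XX-0002-11A-11R-0000-07", "TCGA-XX-0003-01A-11R-0000-07")
    )
  ),
  clinical = matrix(
    c("61", "58", "70", "male", "male", "male"), nrow = 3,
    dimnames = list(
      c("TCGA-XX-0001", "TCGA-XX-0002", "TCGA-XX-0003"),
      c("yearstobirth", "gender")  # hypothetical covariate names
    )
  )
)

counts   <- toy$dat       # with real data: rnaseq.count.PRAD$dat
clinical <- toy$clinical  # with real data: rnaseq.count.PRAD$clinical

# Number of genes (rows) and samples (columns)
dim(counts)

# Characters 14-15 of a TCGA barcode encode the sample type:
# "01" = primary solid tumor, "11" = solid tissue normal
sample_type  <- substr(colnames(counts), 14, 15)
tumor_counts <- counts[, sample_type == "01", drop = FALSE]

# The first 12 characters of a barcode identify the patient;
# use them to line up clinical rows with the tumor columns
patient_id       <- substr(colnames(tumor_counts), 1, 12)
clinical_matched <- clinical[match(patient_id, rownames(clinical)), , drop = FALSE]
```

With the real download, replacing `toy` with `rnaseq.count.PRAD` should give a tumor-only count matrix and a clinical table in the same sample order, provided the barcode and row-name assumptions above hold.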
