Machine Learning Project in Oncology 6 – Web Scraping for Cancer Omics Data

Web scraping is the automated extraction of data from websites.

Why Scrape Cancer Omics Data?

Cancer omics data is scattered across various resources, and most of them do not provide direct access to the data through an API. This makes it impossible to fetch the data programmatically and leaves scraping the host websites as the remaining option.


Web Scraping Best Practices:

  • Never scrape more frequently than you need to.
  • Consider caching the content you scrape so that it’s only downloaded once.
  • Build pauses into your code using functions like time.sleep() to keep from overwhelming servers with too many requests too quickly.
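
A minimal sketch of these practices, caching pages on disk and pausing between downloads (the cache directory name and delay value are arbitrary choices, not part of the cases below):

import time
import hashlib
from pathlib import Path
import requests

CACHE_DIR = Path('scrape_cache')  # local cache directory (arbitrary name)
CACHE_DIR.mkdir(exist_ok=True)

def polite_get(url, delay=2.0):
    # Serve repeat requests from the disk cache so each page is downloaded only once
    cache_file = CACHE_DIR / (hashlib.md5(url.encode()).hexdigest() + '.html')
    if cache_file.exists():
        return cache_file.read_text()
    time.sleep(delay)  # pause before each real download to avoid overwhelming the server
    response = requests.get(url)
    response.raise_for_status()
    cache_file.write_text(response.text)
    return response.text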

Methods to Scrape Data:

1. Using Pandas to fetch tables
Pandas provides the method read_html() to fetch all tables from an HTML page, given its URL.

For instance, I tried to fetch the tissue source site codes table from the TCGA (The Cancer Genome Atlas) database.

import pandas as pd

url = 'https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/tissue-source-site-codes'
# read_html() returns a list of all tables found on the page
df_list = pd.read_html(url)
Tables_pd = df_list[1]  # the tissue source site codes table
Tables_pd.to_excel('Tissue_Source_Site_Codes_from_pandas.xls', index=False)

2. Using Beautiful Soup

Python has libraries to help us solve the scraping problem. Requests and Beautiful Soup work well together: the Requests library provides the mechanism to access a web page and get its HTML content, while Beautiful Soup lets us navigate the DOM and fetch a specific element/HTML tag. Both requests and beautifulsoup4 are easy to install via pip.
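
A minimal sketch of that handshake (the URL is just a placeholder):

import requests
from bs4 import BeautifulSoup

url = 'https://example.com'  # placeholder URL
response = requests.get(url)  # Requests fetches the raw HTML
soup = BeautifulSoup(response.content, 'html.parser')  # Beautiful Soup parses it
first_table = soup.find('table')  # navigate the DOM for a specific tag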

I recommend some great Beautiful Soup tutorials, such as:

  1. Tutorial: Web Scraping with Python Using Beautiful Soup
  2. Top 5 Beautiful Soup Functions That Will Make Your Life Easier

Case 1: I tried to fetch the tissue source site codes table from the TCGA (The Cancer Genome Atlas) database.

from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/tissue-source-site-codes'

# One method using BeautifulSoup: fetch the page and parse the HTML
request_html = requests.get(url)
soup = BeautifulSoup(request_html.content, 'html.parser')
table = soup.find_all('table')[1]  # the tissue source site codes table

# Walk the table rows and collect the text of each cell
output_rows = []
for table_row in table.find_all('tr'):
    columns = table_row.find_all(['td', 'th'])
    output_rows.append([column.text for column in columns])

# The first row holds the headers; the remaining rows are data
Table_df = pd.DataFrame(output_rows[1:], columns=output_rows[0])
Table_df.to_csv('Tissue_Source_Site_Codes.txt', index=False)

Case 2: I tried to pull down information on P53 in human, mouse, and rat from UniProt. The code is as follows:

################################################################################
# urllib is a package that collects several modules for working with URLs
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup
import pandas as pd
import re
###############################################
def get_uniprot(query='', query_type='PDB_ID'):
    # query_type must be: "PDB_ID" or "ACC"
    url = 'https://www.uniprot.org/uploadlists/'  # the web server used to retrieve the UniProt data
    params = {
        'from': query_type,
        'to': 'ACC',
        'format': 'txt',
        'query': query
    }

    # URL-encode the parameters and POST the request
    data = urllib.parse.urlencode(params)
    data = data.encode('ascii')
    request = urllib.request.Request(url, data)
    with urllib.request.urlopen(request) as response:
        res = response.read()
        # The response is the UniProt flat-text entry; split it into lines
        page = BeautifulSoup(res, 'html.parser').get_text()
        page = page.splitlines()
    return page
###############################################
prots = ['P53_HUMAN', 'P53_MOUSE', 'P53_RAT']
table = pd.DataFrame()
for index, entry in enumerate(prots):
    sizes = []
    pdbs = []
    functions = []
    process = []
    organism = []
    data = get_uniprot(query=entry, query_type='ACC')

    table.loc[index, 'Uniprot_entry'] = entry

    # Parse the UniProt flat-file lines of interest
    for line in data:
        if 'OS   ' in line:  # OS = organism species
            line = line.strip().replace('OS   ', '').replace('.', '')
            organism.append(line)
            table.loc[index, 'Organism'] = ", ".join(set(organism))
        if 'ID   ' in line:  # ID line carries the entry name and sequence length
            line = re.sub(r'ID\W+Reviewed;\W+', '', line.strip())
            sizes.append(line)
            table.loc[index, 'Sizes'] = ", ".join(set(sizes))

        if 'DR   PDB;' in line:  # cross-references to PDB structures
            line = line.strip().replace('DR   ', '').replace(';', '')
            pdbs.append(line.split()[1] + ':' + line.split()[3])
            table.loc[index, 'PDB:Resol'] = ", ".join(set(pdbs))

        if 'DR   GO; GO:' in line:  # Gene Ontology annotations
            line = line.strip().replace('DR   GO; GO:', '').replace(';', '').split(':')
            if 'F' in line[0]:  # F: molecular function
                functions.append(line[1])
                table.loc[index, 'GO_function'] = ", ".join(set(functions))
            else:  # P: biological process (C: cellular component also lands here)
                process.append(line[1])
                table.loc[index, 'GO_process'] = ", ".join(set(process))
################################################################################
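
After the loop, table holds one row per protein. A quick way to inspect and save the result (the output filename here is just illustrative):

print(table)
table.to_csv('P53_info_from_uniprot.txt', index=False)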

3. Web Scraping Alternative

If there’s no bulk download available, check to see whether the website has an application programming interface (API). An API lets software interact with a website’s data directly, rather than requesting the HTML.

Case 1: Functional annotation of a gene list using the DAVID API.

Linking Methods

https://david.ncifcrf.gov/api.jsp?type=xxxxx&ids=XXXXX,XXXXX,XXXXXX,&tool=xxxx&annot=xxxxx,xxxxxx,xxxxx,


  • type = one of the DAVID-recognized gene types
  • ids = a list of the user's gene IDs separated by ","
  • tool = one of the DAVID tool names
  • annot = a list of desired annotation categories separated by ","
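
As a sketch, such a request can be assembled and opened from Python. The parameter values below (ENTREZ_GENE_ID, chartReport, GOTERM_BP_DIRECT) and the gene IDs are illustrative assumptions only; check DAVID's documentation for the values your analysis needs:

import webbrowser

# Illustrative values only -- verify the type/tool/annot names against DAVID's docs
gene_ids = ['2919', '6347', '6348']  # hypothetical Entrez gene IDs
url = ('https://david.ncifcrf.gov/api.jsp'
       '?type=ENTREZ_GENE_ID'
       '&ids=' + ','.join(gene_ids) +
       '&tool=chartReport'
       '&annot=GOTERM_BP_DIRECT')
webbrowser.open(url)  # the API link opens the corresponding DAVID tool page in the browser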

Case 2: I tried to retrieve protein information for human P53 from EBI using its REST API.

import requests
import sys

# Query the EBI Proteins API for the human P53 accession (P04637)
requestURL = "https://www.ebi.ac.uk/proteins/api/proteins?offset=0&size=100&accession=P04637"
r = requests.get(requestURL, headers={"Accept": "application/json"})
if not r.ok:
    r.raise_for_status()  # raises an HTTPError describing the failure
    sys.exit()
json_data = r.json()  # the response is a JSON list of protein entries
print(json_data[0])
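
The result is a list of protein entries, and individual fields can be pulled out by key. The key paths below reflect my reading of the Proteins API JSON layout, so verify them against a live response:

entry = json_data[0]
print(entry['accession'])           # UniProt accession, e.g. 'P04637'
print(entry['id'])                  # entry name, e.g. 'P53_HUMAN'
print(entry['sequence']['length'])  # sequence length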
