Chapter 7 Translate human gmt file to another species using ensembl
for species where there is a not an existing effort to characterize genes through GO, a much richer gmt file will come from mapping the gmt file from a well defined species preferrably the species your experiment is trying to model (for example, if you are modelling a human disease in mouse using a human gmt might return much richer results but species specific changes might be lost )
7.1 Load in required libraries
#install required R and bioconductor packages
tryCatch(expr = { library("RCurl")},
error = function(e) {
install.packages("RCurl")},
finally = library("RCurl"))
#use library
#make sure biocManager is installed
tryCatch(expr = { library("BiocManager")},
error = function(e) {
install.packages("BiocManager")},
finally = library("BiocManager"))
tryCatch(expr = { library("biomaRt")},
error = function(e) {
BiocManager::install("biomaRt")},
finally = library("biomaRt"))
7.2 Download the latest pathway definition file
Only Human, Mouse, Rat, and Woodchuck gene set files are currently available on the baderlab downloads site. If you are working with a species other than human (and it is either rat,mouse or woodchuck) change the gmt_url below to the correct species. Check here to see all available species.
To create your own GMT file using Ensembl see Create GMT file from Ensembl
dest_gmt_file <- "" #params$dest_gmt_file
if(dest_gmt_file == ""){
gmt_url = "http://download.baderlab.org/EM_Genesets/current_release/Human/symbol/"
#list all the files on the server
filenames = getURL(gmt_url)
tc = textConnection(filenames)
contents = readLines(tc)
close(tc)
#get the gmt that has all the pathways and does not include terms
# inferred from electronic annotations(IEA)
#start with gmt file that has pathways only and GO Biological Process only.
rx = gregexpr("(?<=<a href=\")(.*.GOBP_AllPathways_noPFOCR_no_GO_iea.*.)(.gmt)(?=\">)",
contents, perl = TRUE)
gmt_file = unlist(regmatches(contents, rx))
dest_gmt_file <- file.path(params$working_dir,gmt_file )
#check if this gmt file already exists
if(!file.exists(dest_gmt_file)){
download.file(
paste(gmt_url,gmt_file,sep=""),
destfile=dest_gmt_file
)
}
}
Load in the originaing gmt file
7.3 Set up Biomart connection
Connect to Biomart
ensembl <- useEnsembl(biomart = "genes" , host = "asia.ensembl.org")
#ensembl <- useEnsembl("ensembl")
Figure out which dataset you want to use - for some species there might be a few datasets to choose from. Not all of the datasets have common namesa associated with them. For example, if you search for ‘yeast’ nothing will be returned but if you look for Saccharomyces or cerevisiae you will be able to find it.
## Attempting web service request:
## asia.ensembl.org:80/biomart/martservice?redirect=no&type=datasets&requestid=biomaRt&mart=ENSEMBL_MART_ENSEMBL
#get all the datasets that match our species definition
all_datasets[grep(all_datasets$description,
pattern=paste(params$new_species,sep=""),
ignore.case = TRUE),]
## dataset description version
## 68 fcatus_gene_ensembl Cat genes (Felis_catus_9.0) Felis_catus_9.0
Based on the above table define the dataset
#get all the datasets that match our species definition
base_dataset <- all_datasets$dataset[grep(all_datasets$description,
pattern=paste(params$new_species,sep=""),
ignore.case = TRUE)]
If you know the ensembl dataset that you want to use you can specify it in the parameters above or grab from the above table the dataset of the species that you are interested in.
7.4 Convert the species genes from human
Get the homologs in the species of interest of the human genes.
homologs <- getBM(attributes = c("external_gene_name",
"ensembl_gene_id",
"hsapiens_homolog_ensembl_gene",
"hsapiens_homolog_associated_gene_name"
),
filters=list(biotype='protein_coding'), mart=ensembl);
# get rid of rows that have an empty gene name for new species or for human
homologs <- homologs[which((homologs$external_gene_name != "") & (homologs$hsapiens_homolog_associated_gene_name != "")),]
Convert the human gmt file into the species of interest
names(originating_gmt$genesets) <- originating_gmt$geneset.names
temp_genesets <- lapply(originating_gmt$genesets,FUN=function(x){homologs$external_gene_name[which(homologs$hsapiens_homolog_associated_gene_name %in% unlist(x))]})
translated_genessets <- list(geneset_names=originating_gmt$geneset.names,
genesets = temp_genesets,
geneset.descriptions=originating_gmt$geneset.descriptions)
7.5 Format results into GMT file
Convert lists to a tab delimited string of gene names
translated_genessets$collapsed_genenames <- unlist(lapply(translated_genessets$genesets,
FUN=function(x){
paste(unlist(x),collapse = "\t")
}))
The format of the GMT file is described https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#GMT:Gene_Matrix_Transposed_file_format.28.2A.gmt.29 and consists of rows with the following
- Name
- Description
- tab delimited list of genes a part of this geneset
Write out the gmt file with genenames
gmt_file_genenames <- data.frame(name=translated_genessets$geneset_names,
description=translated_genessets$geneset.descriptions,
genes = translated_genessets$collapsed_genenames)
gmt_genenames_filename <- file.path(params$working_dir, paste(params$new_species,"convertedfrom",basename(dest_gmt_file),sep = "_"))
write.table(x = gmt_file_genenames,file = gmt_genenames_filename,
quote = FALSE,sep = "\t",row.names = FALSE,
col.names=TRUE)