Chapter 4 Run g:profiler from R
4.1 Initialize variables and libraries
Detailed instructions on how to run g:Profiler programmatically from R
The parameters are set manually here but if you want to run the script from the command line then you can update the notebook to pull the parameters from the command line given arguments by updating each variable below to pull the values from the paramters - for example:
- variable <- params$parameter_name
For more details see - defining and using parameters and Knitting with parameters
#all parameters are pulled from the defined parameters at the top of the notebook
#where to put all the generated files
# for example - "./generated_data/g_profiler"
working_dir <- params$working_dir
# where to find the data files needed to run the analysis
# for example = "./data"
data_dir <- params$data_dir
# File name containing the list of genes to be used for analysis
# fro example - "Supplementary_Table1_Cancer_drivers.txt"
genelist_file <- params$genelist_file
# default max size of the genesets for example - 250. For this example we
# will be varying this parameter
max_gs_size <- params$max_gs_size
# default min size of the genesets for example - 3
min_gs_size <- params$min_gs_size
#min intersection between your genelist and the geneset - for example 3
min_intersection <- params$min_intersection
# organism parameter used for g:profiler.
# First letter of first word in species name followed by
# the second word for example - hsapiens
organism <- params$organism
#set the gmt file you want to use if you don't want to use the latest gmt file.
# For example, if you set dest_gmt_file =="" the below script will automatically
# download the latest gmt file from baderlab webstie. If it is set then it
# will use the file specified.
dest_gmt_file = ""
#use library
tryCatch(expr = { library("gprofiler2")},
error = function(e) {
install.packages("gprofiler2")},
finally = library("gprofiler2"))
tryCatch(expr = { library("GSA")},
error = function(e) {
install.packages("GSA")},
finally = library("GSA"))
Create or set a directory to store all the generatd results
4.2 Load in Query set
Load in the set of genes that we will be running g:profiler with
#load in the file
current_genelist <- read.table(file =
file.path(data_dir, genelist_file),
header = FALSE,
sep = "\t", quote = "",
stringsAsFactors = FALSE)
query_set <- current_genelist$V1
With regards to pathway sets there are two options when using g:Profiler -
- Use the genesets that are supplied by g:Profiler
- Upload your own genesets.
The most common reasons for supplying your own genesets is the ability to use up to date annotations or in-house annotations that might not be available in the public sphere yet. One of the greatest features of g:Profiler is that it is updated on a regular basis and most of the previous versions are available online ont the gprofiler archive.
The gprofielr2 -g:Profiler R implementation is a wrapper for the web version. You require an internet connection to get enrichment results.
4.3 Run g:profiler with supplied genesets
For detailed descriptions of all the parameters that can be specified for the gost g:profiler function see -here
For this query we are specifying -
- query - the set of genes of interest, as loaded in from the Supplementary_Table1_Cancer_drivers.txt file.
- significant - set to FALSE because we want g:Profiler to return all the results not just the ones that it deems significant by its perdetermined threshold.
- ordered_query - set to TRUE because for this set of genes they are ordered in order of their significance
- correction_method - set to fdr. by default g:Profiler uses g:Scs
- organism - set to “hsapiens” for homo sapiens. Organism names are constructed by concatenating the first letter of the name and the family name (according to gprofiler2 documentation)
- source - the geneset source databases to use for the analysis. We recommend using GO biological process (GO:BP), WikiPathways (WP) and Reactome (Reac) but there are additional sources you can add (GO molecular function or cellular component(GO:MF, GO:CC), KEGG, transcription factors (TF), microRNA targets (MIRNA), corum complexes (CORUM), Human protein atlas (HPA),Human phenotype ontology (HP) )
gprofiler_results <- gost(query = query_set ,
significant=FALSE,
ordered_query = FALSE,
exclude_iea=TRUE,
correction_method = "fdr",
organism = organism,
source = c("REAC","WP","GO:BP"))
#get the gprofiler results table
enrichment_results <- gprofiler_results$result
enrichment_results[1:5,]
## query significant p_value term_size query_size intersection_size
## 1 query_1 TRUE 1.266713e-31 4791 117 99
## 2 query_1 TRUE 6.230929e-31 4911 117 99
## 3 query_1 TRUE 2.945299e-30 5297 117 101
## 4 query_1 TRUE 2.945299e-30 5313 117 101
## 5 query_1 TRUE 3.692367e-28 5757 117 102
## precision recall term_id source
## 1 0.8461538 0.02066374 GO:0051171 GO:BP
## 2 0.8461538 0.02015883 GO:0080090 GO:BP
## 3 0.8632479 0.01906740 GO:0060255 GO:BP
## 4 0.8632479 0.01900998 GO:0031323 GO:BP
## 5 0.8717949 0.01771756 GO:0019222 GO:BP
## term_name effective_domain_size
## 1 regulation of nitrogen compound metabolic process 16178
## 2 regulation of primary metabolic process 16178
## 3 regulation of macromolecule metabolic process 16178
## 4 regulation of cellular metabolic process 16178
## 5 regulation of metabolic process 16178
## source_order parents
## 1 13995 GO:0006807, GO:0019222
## 2 18320 GO:0019222, GO:0044238
## 3 14938 GO:0019222, GO:0043170
## 4 7372 GO:0019222, GO:0044237, GO:0050794
## 5 5789 GO:0008152, GO:0050789
4.4 Download and load g:profiler geneset file
In order to create a proper Generic enrichment results file we will need a copy of the gmt file used by g:Profiler. (also to create an Enrichment map).
Download the gmt file used for this analysis from g:profiler
#the link to the gmt file is static no matter what version
gprofiler_gmt_url <-
"https://biit.cs.ut.ee/gprofiler/static/gprofiler_full_hsapiens.name.gmt"
#get version info gprofiler as the gmt file is always associated with
# a specific version of g:profiler
gprofiler_version <- get_version_info(organism=organism)
gprofiler_gmt_filename <- file.path(working_dir,
paste("gprofiler_full", organism,
gprofiler_version$gprofiler_version,sep="_",
".name.gmt"))
if(!file.exists(gprofiler_gmt_filename)){
download.file(url = gprofiler_gmt_url,
destfile = gprofiler_gmt_filename)
}
To create a proper Generic enrichmentMap results file we need to include the list of genes that are associated with each geneset. To do that we need to know what genes are associated with each set and filter them by our query set. Load in the geneset definitions from the gmt file we just downloaded from g:profiler site.
#load in the g:profiler geneset file
capt_output <- capture.output(genesets_gprofiler <- GSA.read.gmt(
filename = gprofiler_gmt_filename))
names(genesets_gprofiler$genesets) <- genesets_gprofiler$geneset.names
For the next module the name of the gmt file is - gprofiler_full_hsapiens.name.gmt but it is important to preserve the database version so in the future when we revisit these results for publication or results verfication we have the exact version used. Instead of creating a copy of the file (which can be pretty large) create a symbolic link to the file with the generic name.
#file.exists does not work for a symbolic link on my computer for some reason
# list the files in the directory and check if the symbolic link is there
#if(file.exists(file.path(working_dir, "gprofiler_full_hsapiens.name.gmt"))){
if(length(grep(x = list.files(file.path(working_dir)),
pattern = "gprofiler_full_hsapiens.name.gmt",
fixed = TRUE) > 0 )){
file.remove(file.path(working_dir, "gprofiler_full_hsapiens.name.gmt"))
}
## [1] TRUE
## [1] TRUE
# Given:
# query_genes - genes used for enrichment analysis (or as query)
#
# returns - the genes that overlap with the query set and part of the given
# genesets
getGenesetGenes <- function(query_genes, subset_genesets){
genes <- lapply(subset_genesets,FUN=function(x){intersect(x,query_genes)})
# For each of the genes collapse to the comma separate text
genes_collapsed <- unlist(lapply(genes,FUN=function(x){
paste(x,collapse = ",")}))
genes_collapsed_df <- data.frame(term_id = names(genes),
genes = genes_collapsed,stringsAsFactors = FALSE)
return(genes_collapsed_df)
}
4.5 Filter results by geneset size
Filter the table to include just the columns that are required for the generic enrichment map file results GEM. Restrict the results to just the ones that have at least min_gs_size and less than max_gs_size terms and min_intersection size include only the term_id, term_name, p_value (and p_value again because the p_value is actually the corrected p-value. The output file does not contain the nominal p_value. For down stream analysis though it is expected to have both a p-value and a q-value so just duplicate the q-value as both p-value and q-value)
Vary the thresholds for max_gs_size just as we did in Module 2 lab -
min_gs_size = 3
max_gs_size = 10000
max_gs_size = 1000
max_gs_size = 250
# filer by params defined above
# by default we have set the max and min gs size to 250 and 3, respectively.
enrichment_results_mxgssize_250_min_3 <-
subset(enrichment_results,term_size >= min_gs_size &
term_size <= max_gs_size &
intersection_size >= min_intersection ,
select = c(term_id,term_name,p_value,p_value ))
enrichment_results_mxgssize_1000_min_3 <-
subset(enrichment_results,term_size >= min_gs_size &
term_size <= 1000 &
intersection_size >= min_intersection ,
select = c(term_id,term_name,p_value,p_value ))
enrichment_results_mxgssize_10000_min_3 <-
subset(enrichment_results,term_size >= min_gs_size &
term_size <= 10000 &
intersection_size >= min_intersection ,
select = c(term_id,term_name,p_value,p_value ))
4.6 Create an output file of the results - Generic enrichment Map file from g:profiler gmt
The file requires -
- name
- description
- p-value
- q-value
- phenotyp
- list of genes (overlap of query set and original geneset)
The list of genes needs to be calculated using the gmt file and original query set. For each geneset found in the result find the overlap between the set of genes that are a part of the geneset and the query set.
# Given:
# gprofiler_results - results form g_profiler R function (filtered by desired)
# parameters
# gs - genes associated with each geneset, loaded in from a gmt file.
#
# returns - the properly formatted GEM file results
#
createGEMformat <- function(results, gs, query_genes){
if(nrow(results) >0){
#add phenotype to the results
formatted_results <- cbind(results,1)
# Add the genes to the genesets
subset_genesets <- gs$genesets[
which(gs$geneset.names
%in% results$term_id)]
genes <- getGenesetGenes(query_genes, subset_genesets)
formatted_results <- merge(formatted_results,genes,by.x=1, by.y=1)
colnames(formatted_results) <- c("name","description","p-value",
"q-value","phenotype","genes")
}
return(formatted_results)
}
enrichment_results_mxgssize_10000_min_3_GEMfile <- createGEMformat(
enrichment_results_mxgssize_10000_min_3, genesets_gprofiler, query_set)
enrichment_results_mxgssize_1000_min_3_GEMfile <- createGEMformat(
enrichment_results_mxgssize_1000_min_3, genesets_gprofiler, query_set)
enrichment_results_mxgssize_250_min_3_GEMfile <- createGEMformat(
enrichment_results_mxgssize_250_min_3, genesets_gprofiler, query_set)
Output each of the above filtered files
#output the enrichment map file
write.table(enrichment_results_mxgssize_10000_min_3_GEMfile,
file = file.path(working_dir,
"gProfiler_hsapiens_lab2_results_GEM_termmin3_max10000.gem.txt"),
row.names = FALSE,
col.names = TRUE, sep="\t",
quote = FALSE)
#output the enrichment map file
write.table(enrichment_results_mxgssize_1000_min_3_GEMfile,
file = file.path(working_dir,
"gProfiler_hsapiens_lab2_results_GEM_termmin3_max1000.gem.txt"),
row.names = FALSE,
col.names = TRUE, sep="\t",
quote = FALSE)
#output the enrichment map file
write.table(enrichment_results_mxgssize_250_min_3_GEMfile,
file = file.path(working_dir,
"gProfiler_hsapiens_lab2_results_GEM_termmin3_max250.gem.txt"),
row.names = FALSE,
col.names = TRUE, sep="\t",
quote = FALSE)
4.8 Download and load Bader lab geneset file
Download the latest Bader lab genesets
if(dest_gmt_file == ""){
gmt_url = "http://download.baderlab.org/EM_Genesets/current_release/Human/symbol/"
#list all the files on the server
filenames = RCurl::getURL(gmt_url)
tc = textConnection(filenames)
contents = readLines(tc)
close(tc)
#get the gmt that has all the pathways and does not include
# terms inferred from electronic annotations(IEA)
#start with gmt file that has pathways only
rx = gregexpr("(?<=<a href=\")(.*.GOBP_AllPathways_noPFOCR_no_GO_iea.*.)(.gmt)(?=\">)",
contents, perl = TRUE)
gmt_file = unlist(regmatches(contents, rx))
dest_gmt_file <- file.path(working_dir,gmt_file)
if(!file.exists(dest_gmt_file)){
download.file(
paste(gmt_url,gmt_file,sep=""),
destfile=dest_gmt_file
)
}
}
In order to use our results down stream in the Enrichment map we need to generate results files that we can pass to Enrichment Map.
Load in the GMT file
4.9 Filter Bader lab geneset file
The g:Profiler interface only allows for filtering genesets by size only after the analysis is complete. After the analysis is complete means the filtering is happening after Multiple hypothesis testing. Filtering prior to the analysis will generate more robust results because we exclude the uninformative large genesets prior to testing changing the sets that multiple hypothesis filtering will get rid of.
Create multiple gmt files with different filtering thresholds - remove * genesets greater than 250 genes * geneset greater than 1000 genes * geneset greater than 10000 genes
# Filter geneset GSA object by specified gs size threshold
#
# Given -
# genesets - in GSA object
# gs_sizes - list of all the sizes of the genesets found in the genesets
# filter_threshold - value to filter the geneset by.
#
# returns - filtered genesets in GSA object
filter_genesets <- function(genesets, gs_sizes, filter_threshold) {
filtered_genesets <- genesets
filtered_genesets$genesets <- genesets$genesets[
which(gs_sizes<filter_threshold)]
filtered_genesets$geneset.names <- genesets$geneset.names[
which(gs_sizes<filter_threshold)]
filtered_genesets$geneset.descriptions <- genesets$geneset.descriptions[
which(gs_sizes<filter_threshold)]
return(filtered_genesets)
}
# You can not simply write a list of lists to a file in R. In order
# to output the new geneset file you need to convert it ot a data.frame
# To do this convert the list of genes to a tab delmiated list in one column
# of the dataframe.
# format to write out to a file.
#
# Given -
# genesets - in GSA object
# returns - formatted genesets as data frame
format_genesets <- function(genesets) {
collapsed_genesets <- data.frame(name=genesets$geneset.names,
description= genesets$geneset.description)
collapsed_genesets$genes <- unlist(lapply(genesets$genesets,
FUN=function(x){
paste(x,collapse = "\t")
}))
return(collapsed_genesets)
}
The format of the GMT file is described https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#GMT:Gene_Matrix_Transposed_file_format.28.2A.gmt.29 and consists of rows with the following
- Name
- Description
- tab delimited list of genes a part of this geneset
Write out the gmt file with genenames
#get the geneset sizes
gs_sizes_baderlab_sets <- lapply(genesets_baderlab_genesets$genesets,
FUN = function(x){
length(x)
})
# max 10,000
genesets_baderlab_genesets_max10000 <- filter_genesets(genesets_baderlab_genesets,
gs_sizes_baderlab_sets,
10000)
genesets_baderlab_genesets_max10000_filename <- gsub(x =dest_gmt_file,
pattern = "symbol" ,
replacement = "symbol_max10000"
)
if(!file.exists(genesets_baderlab_genesets_max10000_filename)){
write.table(x = format_genesets(genesets_baderlab_genesets_max10000),
file = genesets_baderlab_genesets_max10000_filename,
quote = FALSE,sep = "\t",row.names = FALSE,
col.names=TRUE)
}
#max gs size of 1,000
genesets_baderlab_genesets_max1000 <- filter_genesets(genesets_baderlab_genesets,
gs_sizes_baderlab_sets,
1000)
genesets_baderlab_genesets_max1000_filename <- gsub(x =dest_gmt_file,
pattern = "symbol" ,
replacement = "symbol_max1000"
)
if(!file.exists(genesets_baderlab_genesets_max1000_filename)){
write.table(x = format_genesets(genesets_baderlab_genesets_max1000),
file = genesets_baderlab_genesets_max1000_filename,
quote = FALSE,sep = "\t",row.names = FALSE,
col.names=TRUE)
}
#max gs size of 250
genesets_baderlab_genesets_max250 <- filter_genesets(genesets_baderlab_genesets,
gs_sizes_baderlab_sets,
250)
genesets_baderlab_genesets_max250_filename <- gsub(x =dest_gmt_file,
pattern = "symbol" ,
replacement = "symbol_max250"
)
if(!file.exists(genesets_baderlab_genesets_max250_filename)){
write.table(x = format_genesets(genesets_baderlab_genesets_max250),
file = genesets_baderlab_genesets_max250_filename,
quote = FALSE,sep = "\t",row.names = FALSE,
col.names=TRUE)
}
4.10 Upload the gmt files to gprofiler
In order to use your own genesets with g:Profiler you need to upload the the file to their server first. The function will return an ID that you need to specify in the organism parameter of the g:Profiler gost function call.
## Your custom annotations ID is gp__U4sx_h9hm_EtM.
## You can use this ID as an 'organism' name in all the related enrichment tests against this custom source.
## Just use: gost(my_genes, organism = 'gp__U4sx_h9hm_EtM')
## Your custom annotations ID is gp__kEYN_qEm9_mG4.
## You can use this ID as an 'organism' name in all the related enrichment tests against this custom source.
## Just use: gost(my_genes, organism = 'gp__kEYN_qEm9_mG4')
## Your custom annotations ID is gp__DNWX_dEd1_yYo.
## You can use this ID as an 'organism' name in all the related enrichment tests against this custom source.
## Just use: gost(my_genes, organism = 'gp__DNWX_dEd1_yYo')
For this query we are specifying -
- query - the set of genes of interest, as loaded in from the Supplementary_Table1_Cancer_drivers.txt file.
- significant - set to FALSE because we want g:Profiler to return all the results not just the ones that it deems significant by its perdetermined threshold.
- ordered_query - set to FALSE (but you can try setting it to true as well because for this set of genes they are ordered in order of their significance)
- correction_method - set to fdr. by default g:Profiler uses g:Scs
- organism - set to the custom_gmt ID ( for this run it is - gp__U4sx_h9hm_EtM) that we received when we uploaded our genetset file.
gprofiler_results_custom_max250 <- gost(query = query_set ,
significant=FALSE,
ordered_query = FALSE,
exclude_iea=TRUE,
correction_method = "fdr",
organism = custom_gmt_max250
)
## Detected custom GMT source request
gprofiler_results_custom_max1000 <- gost(query = query_set ,
significant=FALSE,
ordered_query = FALSE,
exclude_iea=TRUE,
correction_method = "fdr",
organism = custom_gmt_max1000
)
## Detected custom GMT source request
gprofiler_results_custom_max10000 <- gost(query = query_set ,
significant=FALSE,
ordered_query = FALSE,
exclude_iea=TRUE,
correction_method = "fdr",
organism = custom_gmt_max10000
)
## Detected custom GMT source request
#get the gprofiler results table
enrichment_results_customgmt_max250 <- gprofiler_results_custom_max250$result
enrichment_results_customgmt_max1000 <- gprofiler_results_custom_max1000$result
enrichment_results_customgmt_max10000 <- gprofiler_results_custom_max10000$result
enrichment_results_customgmt_max250[1:5,]
## query significant p_value term_size query_size intersection_size
## 1 query_1 TRUE 2.274768e-19 69 108 17
## 2 query_1 TRUE 1.758667e-18 64 108 16
## 3 query_1 TRUE 1.639494e-14 54 108 13
## 4 query_1 TRUE 5.029242e-14 98 108 15
## 5 query_1 TRUE 1.653203e-13 160 108 17
## precision recall
## 1 0.1574074 0.2463768
## 2 0.1481481 0.2500000
## 3 0.1203704 0.2407407
## 4 0.1388889 0.1530612
## 5 0.1574074 0.1062500
## term_id
## 1 GLIOBLASTOMA SIGNALING PATHWAYS%WIKIPATHWAYS_20240510%WP2261%HOMO SAPIENS
## 2 HEAD AND NECK SQUAMOUS CELL CARCINOMA%WIKIPATHWAYS_20240510%WP4674%HOMO SAPIENS
## 3 PATHWAYS AFFECTED IN ADENOID CYSTIC CARCINOMA%WIKIPATHWAYS_20240510%WP3651%HOMO SAPIENS
## 4 CELL CYCLE%WIKIPATHWAYS_20240510%WP179%HOMO SAPIENS
## 5 REGULATION OF CELL CYCLE G1/S PHASE TRANSITION%GOBP%GO:1902806
## source
## 1 Human_GOBP_AllPathways_noPFOCR_no_GO_iea_June_01_2024_symbol_max250
## 2 Human_GOBP_AllPathways_noPFOCR_no_GO_iea_June_01_2024_symbol_max250
## 3 Human_GOBP_AllPathways_noPFOCR_no_GO_iea_June_01_2024_symbol_max250
## 4 Human_GOBP_AllPathways_noPFOCR_no_GO_iea_June_01_2024_symbol_max250
## 5 Human_GOBP_AllPathways_noPFOCR_no_GO_iea_June_01_2024_symbol_max250
## term_name effective_domain_size
## 1 Glioblastoma signaling pathways 17067
## 2 Head and neck squamous cell carcinoma 17067
## 3 Pathways affected in adenoid cystic carcinoma 17067
## 4 Cell cycle 17067
## 5 regulation of cell cycle G1/S phase transition 17067
## source_order parents
## 1 5048 NULL
## 2 5575 NULL
## 3 5198 NULL
## 4 5170 NULL
## 5 18856 NULL
Filter the table to include just the columns that are required for the generic enrichment map file results GEM. Restrict the results to just the ones that have at least min_gs_size and less than max_gs_size terms and min_intersection size include only the term_id, term_name, p_value (and p_value again because the p_value is actually the corrected p-value. The output file does not contain the nominal p_value. For down stream analysis though it is expected to have both a p-value and a q-value so just duplicate the q-value as both p-value and q-value)
# filer by params defined above
enrichment_results_customgmt_max250 <- subset(enrichment_results_customgmt_max250,
term_size >= min_gs_size &
term_size <= max_gs_size &
intersection_size >= min_intersection ,
select = c(term_id,term_name,p_value,p_value ))
enrichment_results_customgmt_max1000 <- subset(enrichment_results_customgmt_max1000,
term_size >= min_gs_size &
term_size <= max_gs_size &
intersection_size >= min_intersection ,
select = c(term_id,term_name,p_value,p_value ))
enrichment_results_customgmt_max10000 <- subset(enrichment_results_customgmt_max10000,
term_size >= min_gs_size &
term_size <= max_gs_size &
intersection_size >= min_intersection ,
select = c(term_id,term_name,p_value,p_value ))
4.11 Create an output file of the results - Generic enrichment Map file from Baderlab gmt
Use the same function defined above but instead of passing the genesets from the g_profiler gmt file pass the geneset defitnions we loaded in from the Baderlab gmt file.
enrichment_results_customgmt_GEM_max250 <- createGEMformat(
enrichment_results_customgmt_max250,
genesets_baderlab_genesets_max250,
query_set)
#output the enrichment map file
write.table(enrichment_results_customgmt_GEM_max250,
file = file.path(
working_dir, "gProfiler_hsapiens_Baderlab_max250.gem.txt"),
row.names = FALSE,
col.names = TRUE, sep="\t",
quote = FALSE)
enrichment_results_customgmt_GEM_max1000 <- createGEMformat(
enrichment_results_customgmt_max1000,
genesets_baderlab_genesets_max1000,
query_set)
#output the enrichment map file
write.table(enrichment_results_customgmt_GEM_max1000,
file = file.path(
working_dir, "gProfiler_hsapiens_Baderlab_max1000.gem.txt"),
row.names = FALSE,
col.names = TRUE, sep="\t",
quote = FALSE)
enrichment_results_customgmt_GEM_max10000 <- createGEMformat(
enrichment_results_customgmt_max10000,
genesets_baderlab_genesets_max10000,
query_set)
#output the enrichment map file
write.table(enrichment_results_customgmt_GEM_max10000,
file = file.path(
working_dir, "gProfiler_hsapiens_Baderlab_max10000.gem.txt"),
row.names = FALSE,
col.names = TRUE, sep="\t",
quote = FALSE)