BCB420 - Computational Systems Biology

class: center, middle, inverse, title-slide

# BCB420 - Computational Systems Biology
## Lecture 6 - Annotation Dataset and Intro to Pathway analysis
### Ruth Isserlin
### 2020-02-23

---

## Before we start

Please fill this out:

[<font size=8>Mid course Feedback : https://forms.gle/maGA529V7pxvgBXC6</font>](https://forms.gle/maGA529V7pxvgBXC6)

---

## Features of Markdown

Three **must** features to include in your next assignment:
  * Table of contents
  * Code chunk specifications.
  * References using bibtex

For more information on Rmarkdown see the following useful resources:  
* See [Rmarkdown cheatsheet](https://rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf) for the basics  
* For more in depth on all the **great** features of markdown chekc out [this page](https://yihui.org/knitr/)

---
## Table of contents

.pull-left[
<img src=./images/img_lecture6/rmarkdown_toc.png>
]
.pull-right[
* TOC will be generated by taking all entries specified as a heading using "\#"
* Depending on what depth you specify this will only include the number of "\#" specified
]

---

## Code chunk specifications

.pull-left[
<img src=./images/img_lecture6/rmarkdown_code_chunks.png>
]
.pull-right[
* message = FALSE - to get rid of all those info messages that make the report un-readable.
* warning = FALSE - same as messages but it is important that you pay attention to these when you are iterating over your report.  Some warning may indicate that a function is not being used correctly.  You don't want these showing up in the final report but you do want to address all of them.
* echo and eval are both useful for reports where you don't want to show the code you are using.  If you are giving this report to a biologist or experimentalist they don't want to see the code that generated the data or figures.  
]

---

## References using bibtext

* Create a set of references that you use for your Rmarkdown notebook, extension .bib
  * To get the bibtex go to google scholar and search for your article. Click on the quotation marks associated with your article and click on Bibtex.  Copy it to .bib file. 
  * Example bibtex to put in your bib file:

```r
@article{cytoscape2003,
  title={Cytoscape: a software environment for integrated models of biomolecular interaction networks},
  author={Shannon, Paul and Markiel, Andrew and Ozier, Owen and Baliga, Nitin S and Wang, Jonathan T and Ramage, Daniel and Amin, Nada and Schwikowski, Benno and Ideker, Trey},
  journal={Genome research},
  volume={13},
  number={11},
  pages={2498--2504},
  year={2003},
  publisher={Cold Spring Harbor Lab}
}
```
  
---

## References using bibtex - step 2

Add your bibliography file to your Rmarkdown file

```r
---
title: "BCB420 - Computational Systems Biology"
subtitle: "Lecture 6 - Annotation Dataset and Intro to Pathway analysis"
author: "Ruth Isserlin"
date: "`r Sys.Date()`"
output:
  html_document:
    toc: true
    toc_depth: 2
*bibliography: lecture6.bib
---
```

---

## References using bibtex - step3

If I want to add the following sentence to add the reference use the @ sign followed by the  name you called the reference.

```r
  * [Cytoscape](www.cytoscape.org)[@cytoscape_2003] is a freely available open-source, cross platform network analysis software. 
```

--
In your markdown file the above will get rendered to look like:

* [Cytoscape](www.cytoscape.org)(Shannon et al. 2003) is a freely available open-source, cross platform network analysis software. 
  
And the following will appear at the end of your document:

```r
Shannon, Paul, Andrew Markiel, Owen Ozier, Nitin S Baliga, Jonathan T Wang, Daniel Ramage, Nada Amin, Benno Schwikowski, and Trey Ideker. 2003. “Cytoscape: A Software Environment for Integrated Models of Biomolecular Interaction Networks.” Genome Research 13 (11). Cold Spring Harbor Lab: 2498–2504.
```

---

## References using bibtex - controlling the format

* In order to change the style of your references you can specify a citation style.
* For a set of predefined styles see [here](https://github.com/citation-style-language/styles)

---

## Annotation Sources

[<font size=8>Fill in this short survey on your annotation source: https://forms.gle/shyJiddEKMk9qnU76</font>](https://forms.gle/shyJiddEKMk9qnU76)

---

## Annotation Sources

.pull-left[
<img src="images/img_lecture6/annotation_resources.png">
]
.pull-right[
* There are a lot of sources that try and collate a lot of the smaller databases
  * [Pathway commons](https://www.pathwaycommons.org/)
  * [MsigDb](https://www.gsea-msigdb.org/gsea/msigdb/collections.jsp)
  * [iRefIndex](https://irefindex.vib.be/wiki/index.php/iRefIndex)
  * [GeneCards](https://www.genecards.org/)
* There are hundreds of smaller databases!
]

---

## Information that is important to know about our annotation source

.pull-left[
* Where did it come from?
  * is it from a primary publication?
  * Is it curated by a consortium?
  * Is it computationally defined?
]
.pull-right[
* How is it released and how often is it released?
  * Is it a custom format? gmt format?
  * available for download from a website or ftp site?
  * queriable and accessibly through a REST interface?
* Is it versioned?
]

???

REST - REpresentational State Transfer

---

## GMT file format

* File format defined by the GSEA tool when it was originally released in 2003
* Each row represents a pathway or gene set
* Format is: pathway name <tab> pathway description <tab> tab delimited list of genes associated with this pathway.
* A lot of enrichment analysis tools can take this format in as input.

---

## Where are we?

So, what are **we** going to be covering during this course?<br>
![](./images/course_overview_flowchart.png)

---

## Recap from previous weeks

<font size=1>Janzen DM, Tiourin E, Salehi JA, Paik DY, Lu J, Pellegrini M, Memarzadeh S. [An
apoptosis-enhancing drug overcomes platinum resistance in a tumour-initiating
subpopulation of ovarian cancer](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4532886/). Nat Commun. 2015 Aug 3;6:7956. - Figure 1</font>

---

First things first, 
  * Load the data, filter it

```r
library(GEOquery)
sfiles = getGEOSuppFiles('GSE70072')

fnames = rownames(sfiles)

# there is only one supplemental file
ca125_exp = read.delim(fnames[1],header=TRUE,
                check.names = FALSE)

cpms = edgeR::cpm(ca125_exp[,3:22])
rownames(cpms) <- ca125_exp[,1]

# get rid of low counts
*keep = rowSums(cpms >1) >=3
ca125_exp_filtered = ca125_exp[keep,]

filtered_data_matrix <- as.matrix(ca125_exp_filtered[,3:22])
rownames(filtered_data_matrix) <- ca125_exp_filtered$ensembl75_id
```

---

* Define the groups

```r
#get the 2 and third token from the column names
samples <- data.frame(
        lapply(colnames(ca125_exp_filtered)[3:22], 
        FUN=function(x){
          unlist(strsplit(x, split = "\\."))[c(2,3)]}))
colnames(samples) <- colnames(ca125_exp_filtered)[3:22]
rownames(samples) <- c("patients","cell_type")
samples <- data.frame(t(samples))
```

```r
d = DGEList(counts=filtered_data_matrix, group=samples$cell_type)
```

* create the model

```r
model_design_pat <- model.matrix(
  ~ samples$patients + samples$cell_type)
```

---

* estimate the dispersion, 
* calculate normalization factors, 
* fit the model
* calculate differentail expression

```r
#estimate dispersion
d <- estimateDisp(d, model_design_pat)
#calculate normalization factors
d <- calcNormFactors(d)
#fit model
fit <- glmQLFit(d, model_design_pat)
#calculate differential expression
qlf.pos_vs_neg <- glmQLFTest(fit, coef='samples$cell_typeCA125+')
```

---
Get all the results

```r
qlf_output_hits <- topTags(qlf.pos_vs_neg,sort.by = "PValue",
                           n = nrow(filtered_data_matrix))
```

How many gene pass the threshold p-value < 0.05?

```r
length(which(qlf_output_hits$table$PValue < 0.05))
```

```
## [1] 1338
```

How many genes pass correction?
--

```r
length(which(qlf_output_hits$table$FDR < 0.05))
```

```
## [1] 0
```

---

Output our top hits

```r
kable(topTags(qlf.pos_vs_neg), type="html")
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:right;"> logFC </th>
   <th style="text-align:right;"> logCPM </th>
   <th style="text-align:right;"> F </th>
   <th style="text-align:right;"> PValue </th>
   <th style="text-align:right;"> FDR </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> ENSG00000182752 </td>
   <td style="text-align:right;"> 2.2978937 </td>
   <td style="text-align:right;"> 2.7763253 </td>
   <td style="text-align:right;"> 35.38410 </td>
   <td style="text-align:right;"> 0.0000416 </td>
   <td style="text-align:right;"> 0.5020363 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ENSG00000240864 </td>
   <td style="text-align:right;"> 5.0129489 </td>
   <td style="text-align:right;"> -0.7698429 </td>
   <td style="text-align:right;"> 55.78796 </td>
   <td style="text-align:right;"> 0.0000523 </td>
   <td style="text-align:right;"> 0.5020363 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ENSG00000102755 </td>
   <td style="text-align:right;"> 1.3912815 </td>
   <td style="text-align:right;"> 2.5717883 </td>
   <td style="text-align:right;"> 30.52553 </td>
   <td style="text-align:right;"> 0.0000857 </td>
   <td style="text-align:right;"> 0.5483107 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ENSG00000211625 </td>
   <td style="text-align:right;"> 2.8877608 </td>
   <td style="text-align:right;"> 1.8364482 </td>
   <td style="text-align:right;"> 34.28879 </td>
   <td style="text-align:right;"> 0.0001973 </td>
   <td style="text-align:right;"> 0.6697894 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ENSG00000188641 </td>
   <td style="text-align:right;"> 1.1109197 </td>
   <td style="text-align:right;"> 3.9514876 </td>
   <td style="text-align:right;"> 24.50111 </td>
   <td style="text-align:right;"> 0.0002383 </td>
   <td style="text-align:right;"> 0.6697894 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ENSG00000026508 </td>
   <td style="text-align:right;"> 1.1113542 </td>
   <td style="text-align:right;"> 8.4785833 </td>
   <td style="text-align:right;"> 23.75535 </td>
   <td style="text-align:right;"> 0.0002737 </td>
   <td style="text-align:right;"> 0.6697894 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ENSG00000085552 </td>
   <td style="text-align:right;"> -0.9281010 </td>
   <td style="text-align:right;"> 2.4534215 </td>
   <td style="text-align:right;"> 22.06649 </td>
   <td style="text-align:right;"> 0.0003784 </td>
   <td style="text-align:right;"> 0.6697894 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ENSG00000126878 </td>
   <td style="text-align:right;"> -0.6443631 </td>
   <td style="text-align:right;"> 6.6234342 </td>
   <td style="text-align:right;"> 22.00129 </td>
   <td style="text-align:right;"> 0.0003833 </td>
   <td style="text-align:right;"> 0.6697894 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ENSG00000168350 </td>
   <td style="text-align:right;"> -1.4574338 </td>
   <td style="text-align:right;"> 3.1352032 </td>
   <td style="text-align:right;"> 21.85648 </td>
   <td style="text-align:right;"> 0.0003945 </td>
   <td style="text-align:right;"> 0.6697894 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> ENSG00000150938 </td>
   <td style="text-align:right;"> 0.6396190 </td>
   <td style="text-align:right;"> 5.8837098 </td>
   <td style="text-align:right;"> 21.74652 </td>
   <td style="text-align:right;"> 0.0004032 </td>
   <td style="text-align:right;"> 0.6697894 </td>
  </tr>
</tbody>
</table>

</td>
   <td>

</td>
   <td>

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> x </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> samples$cell_typeCA125+ </td>
  </tr>
</tbody>
</table>

</td>
   <td>

</td>
  </tr>
</tbody>
</table>

---

How many genes are up regulated?

```r
length(which(qlf_output_hits$table$PValue < 0.05 
             & qlf_output_hits$table$logFC > 0))
```

```
## [1] 790
```

How many genes are down regulated?

```r
length(which(qlf_output_hits$table$PValue < 0.05 
             & qlf_output_hits$table$logFC < 0))
```

```
## [1] 548
```

---

Create **thresholded** lists of genes.

```r
#merge gene names with the top hits
qlf_output_hits_withgn <- merge(ca125_exp[,1:2],qlf_output_hits, by.x=1, by.y = 0)

qlf_output_hits_withgn[,"rank"] <- -log(qlf_output_hits_withgn$PValue,base =10) * sign(qlf_output_hits_withgn$logFC)
qlf_output_hits_withgn <- qlf_output_hits_withgn[order(qlf_output_hits_withgn$rank),]

upregulated_genes <- qlf_output_hits_withgn$gname[
  which(qlf_output_hits_withgn$PValue < 0.05 
             & qlf_output_hits_withgn$logFC > 0)]
downregulated_genes <- qlf_output_hits_withgn$gname[
  which(qlf_output_hits_withgn$PValue < 0.05 
             & qlf_output_hits_withgn$logFC < 0)]

write.table(x=upregulated_genes,
            file=file.path("data","ca125_upregulated_genes.txt"),sep = "\t",
            row.names = FALSE,col.names = FALSE,quote = FALSE)

write.table(x=downregulated_genes,
            file=file.path("data","ca125_downregulated_genes.txt"),sep = "\t",
            row.names = FALSE,col.names = FALSE,quote = FALSE)

write.table(x=data.frame(genename= qlf_output_hits_withgn$gname,F_stat= qlf_output_hits_withgn$rank),
            file=file.path("data","ca125_ranked_genelist.txt"),sep = "\t",
            row.names = FALSE,col.names = FALSE,quote = FALSE)
```

---

## Thresholded list?

* We have gone from a large datasets of 1000s of measurements to a list of differentially expressed genes.  Why do we look at a subset of this list?
* Aren't thresholds arbitrary?  Why define a threshold and limit our list to just that set?

Some reasons for starting with the thresholded list:
  * Where else can we start?
  * Get the low hanging fruit.  Strongest signal might be the most important
  * The strongest signals might be the best place to start. 
  * Easy.
  
---

## Is there an alternative to a Thresholded list?

### Thresholded list - genes with p-value < threshold
  * Asks the question - Are there any gene sets or pathways that are **enriched/over-represented** or **depleted/under-represented** in my list
  * Uses stastics like fisher exact test, hypergeometric test to assess this.
  * Tools like: [DAVID](https://david.ncifcrf.gov/),[g:profiler](https://biit.cs.ut.ee/gprofiler/),[ErichR](https://amp.pharm.mssm.edu/Enrichr/),[GREAT](http://great.stanford.edu/public/html/)

### non-thresholded list - ranked gene list
  * Asks the question - Are there any gene sets or pathways that are ranked surprisingly high or low in my ranked list.
  * Uses stastics like KS test, linear models
  * Tools and packages like:[GSEA - Gene set enrichment anlaysis](https://www.gsea-msigdb.org/gsea/index.jsp),[GSVA - Gene Set Variation Analysis](https://bioconductor.org/packages/release/bioc/html/GSVA.html)

---

## Pros and Cons of Thresholded list

.pull-left[
**Cons**
* No set value good for threshold - pvalue<0.05 is accepted but ultimately it is still arbitrary
* Weak signals are ignored (many weak signals can contribute to the whole)
* different results at different thresholds
]
.pull-right[
**Pros**
* No learning curve! Plug and play.
* Very accessible.
* Quick and fast results
* Endless of online tools available
* No requirement for a bioinformatician or computational biologist.
]

---
## Gene List Enrichment Analysis
  * sometimes called over-representation analysis

### What do we start with:
  * Gene list: eg: FUCA2, GCLC, NFYA, NIPAL3, SEMA3F, KRIT1
  * Gene sets or pathways: come from any one of the annotation databases that we mentioned at the beginning of the day today.

### What are we looking for:
  * Are any of the genelists surprisingly or statistically over-represented in our set of genes.
  
### What do we need in order to answer this question:
  * A way to associate a statistic with the above question
  * A way to correct for the multiple hypothesis that we are going to be asking.
  
---

<font size=2>Voisin V, Bader GD,  Pathway and network Analysis course - slide 12 [Bioinformatics.ca](https://bioinformatics.ca/workshops/2018-pathway-and-network-analysis-of-omics-data/#course-material)</font>

---

<font size=2>Voisin V, Bader GD,  Pathway and network Analysis course - slide 14 [Bioinformatics.ca](https://bioinformatics.ca/workshops/2018-pathway-and-network-analysis-of-omics-data/#course-material)</font>

---

<font size=2>Voisin V, Bader GD,  Pathway and network Analysis course - slide 16 [Bioinformatics.ca](https://bioinformatics.ca/workshops/2018-pathway-and-network-analysis-of-omics-data/#course-material)</font>

---

<font size=2>Voisin V, Bader GD,  Pathway and network Analysis course - slide 17 [Bioinformatics.ca](https://bioinformatics.ca/workshops/2018-pathway-and-network-analysis-of-omics-data/#course-material)</font>

---

<font size=2>Voisin V, Bader GD,  Pathway and network Analysis course - slide 18 [Bioinformatics.ca](https://bioinformatics.ca/workshops/2018-pathway-and-network-analysis-of-omics-data/#course-material)</font>

---

<font size=2>Voisin V, Bader GD,  Pathway and network Analysis course - slide 19 [Bioinformatics.ca](https://bioinformatics.ca/workshops/2018-pathway-and-network-analysis-of-omics-data/#course-material)</font>

---
## More about fisher exact test

* Video using M & M example with fisher exact test : https://www.youtube.com/watch?v=udyAvvaMjfM
* Great explanation on pathway commons site: https://www.pathwaycommons.org/guide/primers/statistics/fishers_exact_test/

---

<font size=2>Voisin V, Bader GD,  Pathway and network Analysis course - slide 21 [Bioinformatics.ca](https://bioinformatics.ca/workshops/2018-pathway-and-network-analysis-of-omics-data/#course-material)</font>

---

Check out this publication:
* Geistlinger L, Csaba G, Santarelli M, Ramos M, Schiffer L, Turaga N, Law C,
Davis S, Carey V, Morgan M, Zimmer R, Waldron L. Toward a gold standard for
benchmarking gene set enrichment analysis. Brief Bioinform. 2020 Feb 6 [PMID](https://www.ncbi.nlm.nih.gov/pubmed/32026945)

---

<font size=2>Geistlinger L, Csaba G, Santarelli M, Ramos M, Schiffer L, Turaga N, Law C,Davis S, Carey V, Morgan M, Zimmer R, Waldron L. Toward a gold standard for
benchmarking gene set enrichment analysis. Brief Bioinform. 2020 Feb 6 [PMID](https://www.ncbi.nlm.nih.gov/pubmed/32026945)</font>

---

## Assignment #2 
  * differentail gene expression and preliminary ORA 
  * <font size=5> Due March 3, 2020! @ 20:00 </font>

## What to hand in?
  * **html rendered RNotebook** - you should submit this through quercus
  * Make sure the notebook and all associated code is checked into your github repo as I will be pulling all the repos at the deadline and using them to compile your code. - Your checked in code must replicate the handed in notebook.
  * Document your work and your code directly in the notebook.
  * **Reference the paper associated with your data!**
  * **Introduce your paper and your data again** 
  * You are allowed to use helper functions or methods but make sure when you source those files the paths to them are relative and that they are checked into your repo as well.