Test for enrichment of gene functions using gene sets identified from high-throughput genomics data.
This pipeline uses GO annotation to categorize gene functions, and the desktop GUI software ErmineJ for enrichment analysis.
Step by step instructions
- Read the tutorial
The ErmineJ software is very well documented; see the tutorials at http://erminej.chibi.ubc.ca/help/tutorials/
Below, I outline the process as we commonly run it.
- Get ErmineJ software
Download the appropriate installer for your OS from http://erminej.chibi.ubc.ca/download/ and install.
- Assign GO annotation to your genes
You'll need a tab-delimited file describing GO annotation for your gene set.
You can prepare this file from an annotated FASTA file using the script GO2ErmineJ.pl, as follows:
GO2ErmineJ.pl input.fasta output.tab
This command takes a FASTA file annotated with gene name and GO terms on the definition line as input. Here is an example of the input format:
>Transcript1 Gene=Actin Match_Acc=ACG453 GO=GO:0000166,GO:0000168,GO:0000170
ACGTTTGGGCATATTCCATATTATGGGCACACCACATTTTACCACAGAGGAGGCCCCACACC
GGCACACCACATTTTACCACAGAGGAGGCCCCACACCTTTTTTTTTTTTAAAAAAAAAAAAA
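To make the input format concrete, here is a minimal Python sketch that pulls the sequence ID, gene name, and GO terms out of a definition line like the one above. It is not the GO2ErmineJ.pl script and makes no claim about how that script works; the function names `parse_definition_line` and `go_terms` are hypothetical, and the sketch assumes that fields are separated by whitespace and written as key=value with no spaces inside values.

```python
def parse_definition_line(line):
    """Split a FASTA definition line into (sequence ID, dict of key=value fields).

    Assumption: fields are whitespace-separated and values contain no spaces.
    """
    tokens = line.lstrip(">").split()
    seq_id = tokens[0]
    fields = {}
    for token in tokens[1:]:
        key, sep, value = token.partition("=")
        if sep:  # keep only tokens that actually contain "="
            fields[key] = value
    return seq_id, fields


def go_terms(fields):
    """Return the comma-separated GO terms as a list (empty if no GO field)."""
    return [term for term in fields.get("GO", "").split(",") if term]


header = ">Transcript1 Gene=Actin Match_Acc=ACG453 GO=GO:0000166,GO:0000168,GO:0000170"
seq_id, fields = parse_definition_line(header)
print(seq_id, fields["Gene"], go_terms(fields))
```

A quick check like this on a few definition lines can catch malformed headers before the conversion step.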
And here is an example of the output format. This is the format expected by ErmineJ, which is described at
- Assemble a table of gene scores
Make a tab-delimited file containing scores for your genes, representing the outcome of statistical tests. For expression analysis, these are typically either -log10(p-value), or log2(fold-difference). For other genomic analysis, these may be -log10(p-value), or other test statistics (e.g. measures of genetic differentiation).
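As a rough illustration of the two score types mentioned above, the following Python sketch computes -log10(p-value) and log2(fold-difference) and writes scores as tab-delimited text. The gene identifiers and p-values are made up, the helper names are hypothetical, and the two-column layout with a header row is an assumption; check the ErmineJ documentation for the exact layout it expects.

```python
import csv
import io
import math


def neg_log10_p(p_value, floor=1e-300):
    """Return -log10(p); p is floored so that p = 0 does not raise an error."""
    return -math.log10(max(p_value, floor))


def log2_fold(treatment, control):
    """Return log2 of the fold difference between two positive expression values."""
    return math.log2(treatment / control)


def write_scores(handle, scores, header=("Gene", "Score")):
    """Write identifier/score pairs as tab-delimited text (assumed layout)."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    for gene, score in scores.items():
        writer.writerow([gene, f"{score:.4f}"])


# Hypothetical p-values keyed by the same identifiers used in the annotation file.
p_values = {"Transcript1": 0.001, "Transcript2": 0.5}
scores = {gene: neg_log10_p(p) for gene, p in p_values.items()}

buffer = io.StringIO()
write_scores(buffer, scores)
table = buffer.getvalue()
print(table)
```

To write a real file, pass an open file handle (for example one opened with `open("scores.tab", "w", newline="")`) instead of the in-memory buffer. The identifiers in the score file need to match those in the annotation file.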
This input file will be formatted as a tab-delimited text file:
- Load your input files into ErmineJ
Run ErmineJ. This software has a wizard-style graphical user interface.
- On the start screen, the software asks for Gene Ontology XML and Gene Annotation files. The first time you run it, it will take a moment to find its local copy of the Gene Ontology XML file. After that it'll find it immediately.
- Load your gene annotation file (produced by GO2ErmineJ.pl, above) using the Browse button.
- Click Start.
- Choose an enrichment test
ErmineJ offers four different enrichment tests (in addition to a test of correlation among genes). These options are described in detail at http://erminej.chibi.ubc.ca/help/tutorials/ . Our practice is:
- Over-representation analysis (ORA) is conceptually the most straightforward. If you have clear-cut, discrete gene sets (e.g. on the X chromosome vs not on the X chromosome), ORA is appropriate. If your gene sets are based on a threshold (e.g. differentially expressed genes), we prefer one of the threshold-free tests, which by definition are not affected by the choice of threshold.
- If you have confidence in the absolute value of your gene scores (e.g. gene-specific sequence divergence values from whole genome resequencing), choose gene score resampling.
- If you have more confidence in the ranks of your scores than their absolute values (e.g. p-values from statistical tests), choose one of the rank-based, threshold-free approaches (precision-recall or receiver operating characteristic).
- Consider running more than one test to evaluate whether your results are robust to the choice of test or threshold.
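To show why ORA depends on the threshold, here is a minimal Python sketch of an over-representation test as an upper-tail hypergeometric probability. This is a generic textbook formulation, not ErmineJ's implementation; the function name `ora_p_value` and all counts are hypothetical.

```python
from math import comb


def ora_p_value(total_genes, genes_in_category, hits_total, hits_in_category):
    """Upper-tail hypergeometric probability P(X >= hits_in_category).

    total_genes       -- genes in the analysis
    genes_in_category -- genes annotated with the functional category
    hits_total        -- genes passing the threshold (the "hit" list)
    hits_in_category  -- hits that are also in the category
    """
    max_overlap = min(genes_in_category, hits_total)
    denominator = comb(total_genes, hits_total)
    tail = sum(
        comb(genes_in_category, k)
        * comb(total_genes - genes_in_category, hits_total - k)
        for k in range(hits_in_category, max_overlap + 1)
    )
    return tail / denominator


# Same hypothetical category (40 of 1000 genes) under two different thresholds:
strict = ora_p_value(1000, 40, hits_total=50, hits_in_category=6)
lenient = ora_p_value(1000, 40, hits_total=200, hits_in_category=12)
print(strict, lenient)
```

Both the hit list and the overlap change when the threshold moves, so the p-value for a category can change too. That is the motivation for the threshold-free tests described above.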
- Run the test
The wizard guides you through each of the tests. Here are a couple of notes intended to clarify details that may be confusing your first time through.
- ErmineJ appropriately uses the false discovery rate (FDR) to correct for multiple tests. One feature of this multiple test correction that may not be obvious at first is that if you run additional tests sequentially (e.g. first ORA, then gene score resampling), the p-values reported from your first test are adjusted again after the second test is run!
To avoid this unexpected behavior, just close ErmineJ completely after running each test and start again for the next test.
In the end, you are provided with a list of adjusted p-values, one for each functional category in your dataset. You should consider a functional category enriched if it has an adjusted p-value below 0.05 or 0.1, depending on your tolerance for false positives.
- On step 5, you will be asked about negative log adjustments to scores, and whether larger scores are better. This question refers to the raw scores in your input file. If you have raw p-values, check the "negative log" box and leave the "larger is better" box unchecked. If your input file has negative log adjusted p-values, leave the "negative log" box unchecked and check the "larger is better" box.
For other kinds of statistics, make your own decision but again remember that these boxes refer to the raw scores provided in your input file.
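To illustrate the FDR adjustment mentioned in the first note, here is a minimal Python sketch of the Benjamini-Hochberg procedure, a common FDR method. Whether ErmineJ's correction matches this in every detail is an assumption, and the p-values are made up. The sketch also shows that adjusted values depend on how many p-values are corrected together, which is one way results could shift when more tests enter the same correction; it is an illustration only, not a description of ErmineJ's internals.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw p-values for functional categories.
first_test = [0.01, 0.04]
second_test = [0.03, 0.5]

alone = benjamini_hochberg(first_test)
pooled = benjamini_hochberg(first_test + second_test)[: len(first_test)]
print(alone, pooled)

# Categories passing a chosen cutoff (0.05 here; 0.1 is more permissive).
enriched = [i for i, q in enumerate(alone) if q < 0.05]
print(enriched)
```

In this toy example the same two raw p-values receive larger adjusted values once two more p-values are corrected alongside them.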
Created 10:53 Dec 29, 2016 By: EliMeyer
Last updated 09:05 Sep 23, 2019 By: Admin