This protocol describes a pipeline for extracting SNPs from RNASeq data. Its written from the assumption that you've already processed your reads, mapped them against a reference, and filtered the alignments to remove ambiguous or weak mappings. The starting point for this pipeline is a filtered set of alignments in SAM format. Note that for this purpose, its useful to filter pretty stringently, allowing only a few mismatches per read.
The pipeline relies on samtools and bcftools software.
We've added a few tools to deal with challenges of SNP genotyping in RNASeq data: variation in coverage and allele frequencies. We've also added a few scripts for file conversions and analysis of genetic distance.
Step by step instructions
- Resample from highly expressed genes
The excessive coverage of a few highly expressed genes (>>1000x) leads to genotyping errors. To resolve this, set a maximum coverage and for any genes expressed at higher levels, randomly sample reads to achieve the target coverage. genes with lower coverage arent affected. Run this as e.g.:
Maxcov_SAM.pl input.sam 100 resampled.sam
to impose a maximum coverage of 100x.
- Convert SAM to sorted BAM
To convert a SAM file called "resampled.sam", run:
samtools view -b resampled.sam > input.bam
To sort the binary alignment file, run:
samtools sort input.bam -o input.sorted.bam
- Index the BAM file
To index the BAM file produced above, run:
samtools index input.sorted.bam input.sorted.bai
- Call genotypes from alignments
To determine genotypes from alignments, run:
samtools mpileup -I -B -uf reference.fasta input.sorted.bam > output.bcf
bcftools call -m -A output.bcf > output.vcf
Finally, call genotypes from nucleotide frequencies, allowing for variation in allele frequencies
NFGenotypesFromVCF.pl output.vcf 20 0.01 0.25 genotypes.tab
- Combine SNP genotypes
Here we use the same scripts as with 2bRAD genotypes. To combine genotype files called "genotype.tab" in a set of directories called "sample1" to "sample3", run:
CombineGenotypes.pl sample*/genotypes.tab > combined.tab
- Analysis of genetic distance
This is just one example of the kind of analysis you might explore in SNP data produced above. To estimate overall (transcriptome-wide) genetic distance among samples, run:
SNPDist.pl combined.tab a average_distance.txt
To estimate genetic distance for each gene (slower but potentially interesting), run:
SNPDist.pl combined.tab g gene_distance.txt
Created 07:17 Feb 19, 2017 By: EliMeyer
Last updated 04:52 Jun 26, 2019 By: EliMeyer