From Wiki | Meyer Lab
Jump to: navigation, search


This protocol describes a pipeline for extracting SNPs from RNASeq data. Its written from the assumption that you've already processed your reads, mapped them against a reference, and filtered the alignments to remove ambiguous or weak mappings. The starting point for this pipeline is a filtered set of alignments in SAM format. Note that for this purpose, its useful to filter pretty stringently, allowing only a few mismatches per read.

The pipeline relies on samtools and bcftools software.

We've added a few tools to deal with challenges of SNP genotyping in RNASeq data: variation in coverage and allele frequencies. We've also added a few scripts for file conversions and analysis of genetic distance.

Step by step instructions

  • Resample from highly expressed genes

The excessive coverage of a few highly expressed genes (>>1000x) leads to genotyping errors. To resolve this, set a maximum coverage and for any genes expressed at higher levels, randomly sample reads to achieve the target coverage. genes with lower coverage arent affected. Run this as e.g.:

Maxcov_SAM.pl input.sam 100 resampled.sam

to impose a maximum coverage of 100x.

  • Convert SAM to sorted BAM

To convert a SAM file called "resampled.sam", run:

samtools view -b resampled.sam > input.bam

To sort the binary alignment file, run:

samtools sort input.bam -o input.sorted.bam
  • Index the BAM file

To index the BAM file produced above, run:

samtools index input.sorted.bam input.sorted.bai
  • Call genotypes from alignments

To determine genotypes from alignments, run:

samtools mpileup -I -B -uf reference.fasta input.sorted.bam > output.bcf 


bcftools call -m -A  output.bcf > output.vcf

Finally, call genotypes from nucleotide frequencies, allowing for variation in allele frequencies

NFGenotypesFromVCF.pl output.vcf 20 0.01 0.25 genotypes.tab
  • Combine SNP genotypes

Here we use the same scripts as with 2bRAD genotypes. To combine genotype files called "genotype.tab" in a set of directories called "sample1" to "sample3", run:

CombineGenotypes.pl sample*/genotypes.tab > combined.tab
  • Analysis of genetic distance

This is just one example of the kind of analysis you might explore in SNP data produced above. To estimate overall (transcriptome-wide) genetic distance among samples, run:

SNPDist.pl combined.tab a average_distance.txt

To estimate genetic distance for each gene (slower but potentially interesting), run:

SNPDist.pl combined.tab g gene_distance.txt


Created 07:17 Feb 19, 2017 By: EliMeyer

Last updated 04:52 Jun 26, 2019 By: EliMeyer