From Wiki | Meyer Lab


This protocol describes a series of steps for processing high-throughput DNA sequences.

Step by step instructions

The following steps may be run with customized settings or omitted, depending on the analysis. The order below is recommended based on computational requirements: the easier, faster steps are run first, while the inputs are largest.

Notice that the filters are run in series (i.e. the output from each step is the input for the next).

  • Truncating reads

Preparation of sequencing libraries always introduces artificial DNA sequences, and in some cases these occupy a known region of the read. To remove these, we can truncate each read, keeping only a specified range of positions. For example, we could extract the 36-bp tag produced by 2bRAD genotyping as: input.fastq 1 36 trimmed.fastq

To remove the internal barcode (4 bp) and template-switching tag (4 bp) introduced by TagSeq (tag-based RNA-Seq), we could run: input.fastq 9 36 trimmed.fastq
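To make the truncation step concrete, here is a minimal Python sketch. The function name is ours, and the argument order (input file, first base to keep, last base to keep, output file) is our reading of the examples above; positions are 1-based and inclusive.

```python
# Minimal sketch of read truncation for a 4-line-per-record FASTQ file.
# Keeps positions start..end (1-based, inclusive) of each read, slicing
# the quality string to the same window so the record stays consistent.

def truncate_fastq(in_path, start, end, out_path):
    with open(in_path) as fin, open(out_path, "w") as fout:
        while True:
            header = fin.readline()
            if not header:          # end of file
                break
            seq = fin.readline().rstrip("\n")
            plus = fin.readline()
            qual = fin.readline().rstrip("\n")
            fout.write(header)
            fout.write(seq[start - 1:end] + "\n")
            fout.write(plus)
            fout.write(qual[start - 1:end] + "\n")
```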
  • Quality Filtering

All sequence datasets include some low-quality reads that should be excluded before analysis. The right thresholds are up for debate, but generally you'd like to require reasonably high quality scores at all positions for genotyping and assembly, and at most positions for gene expression analysis. To eliminate any read in which more than 10 bases have quality scores below 20, we can run: trimmed.fastq 20 10 hq.fastq
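In essence, this filter counts low-quality bases per read and discards reads exceeding the allowed number. A minimal Python sketch (the function name is ours, and Phred+33 quality encoding is assumed):

```python
# Minimal sketch of the quality filter: discard any read with more than
# maxlow bases whose Phred quality is below minq. Assumes Phred+33
# encoded quality strings (standard for modern Illumina data).

def qual_filter_fastq(in_path, minq, maxlow, out_path):
    with open(in_path) as fin, open(out_path, "w") as fout:
        while True:
            rec = [fin.readline() for _ in range(4)]
            if not rec[0]:          # end of file
                break
            qual = rec[3].rstrip("\n")
            n_low = sum(1 for c in qual if ord(c) - 33 < minq)
            if n_low <= maxlow:
                fout.writelines(rec)
```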
  • Homopolymer Repeat Filtering

Some sequence datasets include many homopolymer repeats (long stretches of a single nucleotide, e.g. poly-A tails). Generally these are a challenge for transcriptome assembly, to a lesser extent for gene expression analysis (RNA-Seq) and genome assembly, and not at all for 2bRAD.

To exclude all reads with homopolymer repeats more than 30 bases in length, we can run: hq.fastq 30 hrf.fastq
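This step amounts to scanning each read for a run of a single nucleotide longer than the threshold. A minimal Python sketch (function name is ours):

```python
import re

# Minimal sketch of the homopolymer-repeat filter: discard any read
# containing a run of a single nucleotide longer than maxrun bases.

def hr_filter_fastq(in_path, maxrun, out_path):
    # match a run of maxrun+1 or more of any one nucleotide
    run = re.compile(r"A{%d,}|C{%d,}|G{%d,}|T{%d,}" % ((maxrun + 1,) * 4))
    with open(in_path) as fin, open(out_path, "w") as fout:
        while True:
            rec = [fin.readline() for _ in range(4)]
            if not rec[0]:          # end of file
                break
            if not run.search(rec[1]):
                fout.writelines(rec)
```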
  • Adaptor Filtering

Finally, the most computationally intensive task. The artificial DNA sequences introduced during library preparation may occupy unknown portions of the read (including the entire read); removing these is probably the most important step in read processing.

To make the whole pipeline easier to run in parallel, we've switched from the old version to a newer kmer-based tool, BBDuk. This performs better in every respect and, most importantly, can be run directly on large sequence datasets without lots of tedious parallelizing.

To run this tool on a set of reads called hrf.fastq, eliminating any read that shares at least one 12-mer with the sequences in adaptors.fasta, we can run: bbduk.sh in=hrf.fastq ref=adaptors.fasta k=12 stats=stats.txt out=clean.fastq
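The idea behind kmer-based adaptor filtering can be illustrated with a short Python sketch: index every 12-mer in the adaptor sequences, then discard any read that shares one. This shows only the concept, not BBDuk itself, which is far faster and also handles reverse complements, mismatches, and trimming; the example adaptor in the test is a standard Illumina adaptor prefix.

```python
# Minimal sketch of kmer-based adaptor filtering: a read is discarded if
# it shares at least one k-mer with any adaptor sequence.

def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def adaptor_filter(reads, adaptors, k=12):
    """Keep only reads sharing no k-mer with the adaptor set."""
    bad = set()
    for a in adaptors:
        bad |= kmers(a, k)
    return [r for r in reads if not (kmers(r, k) & bad)]
```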
  • Running it all in one pipeline

All these steps can be directly combined into a single pipeline without any complicated splitting and recombining.

The following shell script combines all of the above steps into one pipeline. Make a similar file, chmod it, and submit it to your job scheduler to process each batch of reads.

# these steps are appropriate for tag-based RNA-Seq (assuming you start with 75-bp reads)
input.fastq 9 75 trimmed.fastq
trimmed.fastq 20 10 hq.fastq
hq.fastq 30 hrf.fastq
bbduk.sh in=hrf.fastq ref=adaptors.fasta k=12 stats=stats.txt out=clean.fastq

Here is a similar script for processing 2bRAD reads:

# these steps are appropriate for 2bRAD
input.fastq 1 36 trimmed.fastq
trimmed.fastq 20 10 hq.fastq
bbduk.sh in=hq.fastq ref=adaptors.fasta k=12 stats=stats.txt out=clean.fastq
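The logic of the 2bRAD pipeline can also be sketched end to end in Python, operating on in-memory (sequence, quality) pairs rather than FASTQ files. This is illustrative only: the function names are ours, Phred+33 encoding is assumed, and the thresholds mirror the script above (truncate to 1-36, Q20 with at most 10 low-quality bases, 12-mer adaptor matching).

```python
# Illustrative end-to-end sketch of the 2bRAD pipeline, chaining the
# three filters on in-memory (sequence, quality) pairs.

def truncate(reads, start, end):
    # keep positions start..end (1-based, inclusive)
    return [(s[start - 1:end], q[start - 1:end]) for s, q in reads]

def qual_filter(reads, minq, maxlow):
    # discard reads with more than maxlow bases below quality minq
    return [(s, q) for s, q in reads
            if sum(ord(c) - 33 < minq for c in q) <= maxlow]

def adaptor_filter(reads, adaptors, k=12):
    # discard reads sharing at least one k-mer with an adaptor
    bad = {a[i:i + k] for a in adaptors for i in range(len(a) - k + 1)}
    return [(s, q) for s, q in reads
            if not any(s[i:i + k] in bad for i in range(len(s) - k + 1))]

def process_2brad(reads, adaptors):
    reads = truncate(reads, 1, 36)
    reads = qual_filter(reads, 20, 10)
    return adaptor_filter(reads, adaptors)
```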


Created 20:34 Mar 29, 2017 By: EliMeyer

Last updated 09:17 Sep 23, 2019 By: Admin