Favia-parents/10-06-2016

From Wiki | Meyer Lab



Update: 13:17 Jun 10, 2016 By: EliMeyer

Last night I set up reference construction; today I mapped and genotyped.

  • de novo reference construction
    1. combined a subset of cleaned reads from all samples to produce the input, using this bash script
#!/bin/bash
# combine the first 2,000,000 lines (500,000 reads) of each sample's
# cleaned reads into a single input file for reference construction
# usage: thisscript <sample_list> <output_fastq>

LIST=$1
OUT=$2
exec < "$LIST"
: > "$OUT"      # start from an empty output file (touch alone would append on re-runs)
while read -r LINE
do
        echo "$LINE"
        head -n 2000000 ../"$LINE"/clean.fastq >> "$OUT"
done
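As a sanity check, the subsetting loop above can be exercised on mock data. This is just an illustration: the sample names, list file, and 4-line subset size below are invented stand-ins for the real sample directories and the 2,000,000-line subset, and the sketch reads `$LINE/clean.fastq` directly rather than `../$LINE/clean.fastq`.

```shell
#!/bin/bash
# Sketch of the subsetting loop on mock data. Sample names (sampleA,
# sampleB) and the 4-line subset are made-up stand-ins for the real
# sample directories and the 2,000,000-line subset.
set -e
WORK=$(mktemp -d)
cd "$WORK"
mkdir sampleA sampleB
# two mock FASTQ records (4 lines each) per sample
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n' > sampleA/clean.fastq
printf '@r3\nGGGG\n+\nIIII\n@r4\nCCCC\n+\nIIII\n' > sampleB/clean.fastq
printf 'sampleA\nsampleB\n' > samples.txt

OUT=combined.fastq
: > "$OUT"
while read -r LINE
do
        # take the first read (4 lines) from each sample
        head -n 4 "$LINE"/clean.fastq >> "$OUT"
done < samples.txt

wc -l < "$OUT"    # 2 samples x 4 lines = 8
```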

    2. Next I ran the reference construction pipeline using this bash script
#!/bin/bash

IN="combined.fastq"
FastqToFasta.pl $IN split.fasta split.qual ;
BuildRef.pl split.fasta split.qual reference.fasta

That was last night. It appears to have worked well -- ~71,000 tags in the reference, produced from ~14M reads.

  • mapping and genotyping
    1. to map, I used these two bash scripts. First, the mapping and genotyping script itself (trd_map.sh)
#!/bin/bash
# input/output workaround for processes with frequent read/write operations
# that compete for read/write bandwidth in the cluster
# this script writes output to a temporary remote directory (i.e. /data/*) on
# the compute node, then moves the output back to working directory
# and removes the temporary remote directory

# make a random 8-character string for a unique directory name
MYDIR=$(tr -dc 'a-zA-Z0-9' < /dev/urandom | fold -w 8 | head -n 1)
#echo $MYDIR

# make remote directory for temporary output (TRD)
TRD="/data/$MYDIR"
#echo $TRD
mkdir $TRD

# do some process to produce output in TRD
# this is where you paste your own code
DB="/nfs0/IB/Meyer_Lab/meyere/Projects/ffra/par_rad/reference/reference.fasta"
READS="clean.fastq"

cp $DB $TRD/db.fasta
cp $READS $TRD/clean.fastq
gmapper --qv-offset 33 -Q --strata -o 3 -N 1 $TRD/clean.fastq $TRD/db.fasta >$TRD/gmapper.sam
SAMFilter.pl $TRD/gmapper.sam 30 32 $TRD/filtered.sam $TRD/counts.tab
SAMBasecaller.pl $TRD/filtered.sam $TRD/db.fasta 3 $TRD/basecalls.tab
NFGenotyper.pl $TRD/basecalls.tab 0.01 0.25 5 > $TRD/genotypes_cov5.tab
NFGenotyper.pl $TRD/basecalls.tab 0.01 0.25 10 > $TRD/genotypes_cov10.tab
NFGenotyper.pl $TRD/basecalls.tab 0.01 0.25 20 > $TRD/genotypes_cov20.tab

# remove the staged input copies so only results are moved back
rm -rf $TRD/*.fast*
# move output back to local working directory
mv $TRD/* .

# remove temporary remote directory
rm -rf $TRD

    2. to submit jobs using that script, I called this script (run_map.sh)
#!/bin/bash
# submit one mapping/genotyping job per sample directory listed in $1

LIST=$1
exec < "$LIST"
while read -r LINE
do
        cd "$LINE"
        cp ../trd_map.sh .
        SGE_Batch -c './trd_map.sh' -r "map.$LINE" -P 1 -f 16G
        cd ../
done
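The scratch-directory pattern in trd_map.sh can be sketched in isolation. This is a hedged illustration, not the cluster script: `mktemp -d` stands in for the hand-rolled /dev/urandom name under /data (and avoids name collisions), and a gzip step stands in for the mapping/genotyping commands.

```shell
#!/bin/bash
# Sketch of the temporary-scratch pattern from trd_map.sh.
# mktemp -d replaces the hand-rolled random name under /data, and
# gzip is a placeholder for the real mapping/genotyping commands.
set -e
TRD=$(mktemp -d)

# stage input into the scratch directory
echo "example input" > input.txt
cp input.txt "$TRD"/

# run the "process" (placeholder for gmapper/SAMFilter.pl/etc.)
gzip -c "$TRD"/input.txt > "$TRD"/output.txt.gz

# drop staged inputs, move results back, remove the scratch directory
rm -f "$TRD"/input.txt
mv "$TRD"/* .
rm -rf "$TRD"
```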

Update: 13:29 Jun 10, 2016 By: EliMeyer

  • combining genotypes for each coverage level, e.g.
SGE_Batch -c "CombineGenotypes.pl */genotypes_cov10.tab >combined_cov10.tab" -r cat10 -Q
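Since the same combine command is run at each coverage threshold, the three submissions can be generated in one loop. A sketch: SGE_Batch and CombineGenotypes.pl only exist on the cluster, so this version just writes out the commands it would submit instead of submitting them.

```shell
#!/bin/bash
# Build the per-coverage CombineGenotypes.pl submission commands in a loop.
# Written to a file here rather than submitted, since SGE_Batch is cluster-only.
for COV in 5 10 20
do
        CMD="CombineGenotypes.pl */genotypes_cov${COV}.tab >combined_cov${COV}.tab"
        echo "SGE_Batch -c \"$CMD\" -r cat${COV} -Q"
done > submit_cmds.txt
cat submit_cmds.txt
```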

  • filtering genotypes
SGE_Batch -c "PolyFilter.pl combined_cov20.tab 2 y >snps20.tab" -r snps20 -Q
  • all subsequent filtering steps are fast enough that you can run them on the head node