Variant calling pipeline for amplicon-based sequencing of the SARS-CoV-2 viral genome

Overview of the analytic pipeline “Donner” for variant calling from amplicon-based sequencing of the SARS-CoV-2 viral genome

Created date: 2020/5/16
Update date: 2020/11/27
Kenjiro Kosaki,
Center for Medical Genetics, Keio University School of Medicine, Tokyo, Japan

The following article has been published in the Journal of Hospital Infection:

Clinical Utility of SARS-CoV-2 Whole Genome Sequencing in Deciphering Source of Infection

Toshiki Takenouchi, Yuka W. Iwasaki, Sei Harada, Hirotsugu Ishizu, Yoshifumi Uwamino, Shunsuke Uno, Asami Osada, Naoki Hasegawa, Mitsuru Murata, Toru Takebayashi, Koichi Fukunaga, Hideyuki Saya, Yuko Kitagawa, Masayuki Amagai, Haruhiko Siomi, Kenjiro Kosaki


System requirements

The system was tested on CentOS 6.3 and on Ubuntu 16.04 LTS and 18.04 LTS.
Conda may be helpful for installing the required packages.

Reference SARS-CoV-2 sequence

MN908947.3.fasta was used as the reference

A new coronavirus associated with human respiratory disease in China.

Raw data generation

PCR amplification using the ARTIC primer set version 3

Amplicon sequencing by Illumina MiSeq

Software components used in the pipeline

Resampling and quality control


Resample fastq data when depth is excessively high

SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation
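
The resampling step can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the target depth, the seed, and the helper names `sampling_proportion` and `seqkit_sample_cmd` are assumptions made for this example. It only builds a `seqkit sample` command line; for paired-end data the same seed would be used for both read files so that mates stay in sync.

```python
from typing import List


def sampling_proportion(mean_depth: float, target_depth: float) -> float:
    """Fraction of reads to keep so that mean depth drops to about target_depth."""
    if mean_depth <= 0 or target_depth <= 0:
        raise ValueError("depths must be positive")
    # Never upsample: keep everything when depth is already below the target.
    return min(1.0, target_depth / mean_depth)


def seqkit_sample_cmd(fastq_in: str, fastq_out: str,
                      proportion: float, seed: int = 11) -> List[str]:
    """Build a `seqkit sample` command (hypothetical helper, not part of SeqKit)."""
    return ["seqkit", "sample",
            "-p", f"{proportion:.4f}",   # proportion of reads to keep
            "-s", str(seed),             # fixed seed; reuse it for R1 and R2
            "-o", fastq_out,
            fastq_in]


# Example: an assumed mean depth of 20000x reduced to an assumed target of 2000x.
p = sampling_proportion(20000, 2000)
cmd_r1 = seqkit_sample_cmd("sample_R1.fastq.gz", "sample_R1.sub.fastq.gz", p)
cmd_r2 = seqkit_sample_cmd("sample_R2.fastq.gz", "sample_R2.sub.fastq.gz", p)
```

The resulting lists can be passed to `subprocess.run(cmd, check=True)` on a machine where SeqKit is installed.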

Trim adapters

fastp: an ultra-fast all-in-one FASTQ preprocessor.
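
A possible fastp invocation for paired-end MiSeq reads is sketched below. The output naming scheme, the thread count, and the helper name `fastp_cmd` are assumptions for illustration; the options actually used in the pipeline are not stated here.

```python
from typing import List


def fastp_cmd(r1: str, r2: str, out_prefix: str, threads: int = 4) -> List[str]:
    """Build a paired-end fastp command (hypothetical helper)."""
    return [
        "fastp",
        "-i", r1, "-I", r2,                          # input read 1 / read 2
        "-o", f"{out_prefix}_R1.trimmed.fastq.gz",   # output read 1
        "-O", f"{out_prefix}_R2.trimmed.fastq.gz",   # output read 2
        "--detect_adapter_for_pe",                   # auto-detect adapters for paired-end data
        "-w", str(threads),
        "-j", f"{out_prefix}.fastp.json",            # machine-readable QC report
        "-h", f"{out_prefix}.fastp.html",            # human-readable QC report
    ]


cmd = fastp_cmd("sample_R1.fastq.gz", "sample_R2.fastq.gz", "sample")
```

The JSON report written by fastp can be inspected afterwards to confirm read counts before and after trimming.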

Soft-clip PCR primer sequences

An amplicon-based sequencing framework for accurately measuring intrahost virus diversity using PrimalSeq and iVar.
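
Primer soft-clipping with iVar might look like the sketch below. Note that `ivar trim` operates on an aligned, coordinate-sorted BAM together with a BED file of primer positions, so the example assumes alignment output already exists. The file names and the helper `primer_trim_cmds` are hypothetical.

```python
from typing import List


def primer_trim_cmds(aligned_bam: str, primer_bed: str, out_prefix: str) -> List[List[str]]:
    """Commands to soft-clip primers, then re-sort and index the result."""
    trimmed_prefix = f"{out_prefix}.trimmed"          # ivar writes <prefix>.bam
    sorted_bam = f"{out_prefix}.trimmed.sorted.bam"
    return [
        ["ivar", "trim", "-i", aligned_bam, "-b", primer_bed, "-p", trimmed_prefix],
        ["samtools", "sort", "-o", sorted_bam, f"{trimmed_prefix}.bam"],
        ["samtools", "index", sorted_bam],
    ]


cmds = primer_trim_cmds("sample.sorted.bam", "primers_v3.bed", "sample")
```

Each command in the list would be run in order, for example with `subprocess.run(cmd, check=True)`.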

bwa version xx or above

Fast and accurate short read alignment with Burrows-Wheeler transform.
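
The alignment step could be wired up roughly as follows. This sketch assumes the `bwa mem` algorithm and pipes its SAM output directly into `samtools sort`; the subcommand, thread count, and function names are assumptions, not the documented settings of this pipeline.

```python
import subprocess
from typing import Dict, List


def alignment_cmds(reference: str, r1: str, r2: str,
                   out_bam: str, threads: int = 4) -> Dict[str, List[str]]:
    """Build the index, align, and sort commands (hypothetical helper)."""
    return {
        "index": ["bwa", "index", reference],
        "mem": ["bwa", "mem", "-t", str(threads), reference, r1, r2],
        "sort": ["samtools", "sort", "-o", out_bam, "-"],  # "-" reads SAM from stdin
    }


def run_alignment(cmds: Dict[str, List[str]]) -> None:
    """Run bwa index, then stream `bwa mem` output into `samtools sort`."""
    subprocess.run(cmds["index"], check=True)
    mem = subprocess.Popen(cmds["mem"], stdout=subprocess.PIPE)
    subprocess.run(cmds["sort"], stdin=mem.stdout, check=True)
    mem.stdout.close()
    if mem.wait() != 0:
        raise RuntimeError("bwa mem failed")


cmds = alignment_cmds("MN908947.3.fasta", "sample_R1.trimmed.fastq.gz",
                      "sample_R2.trimmed.fastq.gz", "sample.sorted.bam")
```

Calling `run_alignment(cmds)` requires bwa and samtools on the `PATH`; building the commands alone does not.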

Variant calling

samtools (version 1.9 or above)
bcftools (version 1.9 or above)

The Sequence Alignment/Map format and SAMtools.
See above
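
One common way to call variants with bcftools is the `mpileup | call` pipeline sketched below. The haploid ploidy setting, the raised maximum depth for amplicon data, and the helper name `variant_calling_pipeline` are assumptions for illustration; the exact options used in this pipeline are not listed here.

```python
import shlex


def variant_calling_pipeline(reference: str, bam: str, vcf_out: str,
                             max_depth: int = 10000) -> str:
    """Return a shell pipeline string for bcftools mpileup | bcftools call."""
    mpileup = ["bcftools", "mpileup",
               "-f", reference,          # indexed reference FASTA
               "-d", str(max_depth),     # assumed cap; amplicon depth is often very high
               "-Ou", bam]               # uncompressed BCF to stdout
    call = ["bcftools", "call",
            "-m", "-v",                  # multiallelic caller, variant sites only
            "--ploidy", "1",             # treat the viral genome as haploid
            "-Oz", "-o", vcf_out]        # bgzip-compressed VCF
    return shlex.join(mpileup) + " | " + shlex.join(call)


pipeline = variant_calling_pipeline("MN908947.3.fasta", "sample.bam", "sample.vcf.gz")
```

The string can be executed with `subprocess.run(pipeline, shell=True, check=True)` once the BAM file has been sorted and indexed.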



Predicted effects on translated protein

A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3.
Requires building a custom database for SARS-CoV-2
Custom GFF3 file for building the database
See the attached GFF3 file.
Binary database
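
Building a custom SnpEff database generally means placing the annotation and sequence under `data/<genome>/` and registering the genome in `snpEff.config`. The sketch below follows that layout; the genome identifier `MN908947.3`, the input file names, and the helper `prepare_snpeff_db` are assumptions for this example.

```python
import shutil
from pathlib import Path
from typing import List


def prepare_snpeff_db(snpeff_dir: str, genome_id: str,
                      gff3: str, fasta: str) -> List[str]:
    """Lay out files for a custom SnpEff genome and return the build command."""
    root = Path(snpeff_dir)
    genome_dir = root / "data" / genome_id
    genome_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy(gff3, genome_dir / "genes.gff")        # annotation file name expected by SnpEff
    shutil.copy(fasta, genome_dir / "sequences.fa")    # reference sequence
    # Register the genome in snpEff.config (appended; the label is arbitrary).
    with open(root / "snpEff.config", "a") as config:
        config.write(f"\n{genome_id}.genome : SARS-CoV-2\n")
    return ["java", "-jar", str(root / "snpEff.jar"), "build", "-gff3", "-v", genome_id]
```

For example, `prepare_snpeff_db("snpEff", "MN908947.3", "custom.gff3", "MN908947.3.fasta")` returns the build command to run next. Recent SnpEff releases may additionally require CDS or protein files for validation during the build.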


Annotate mutations, taking into account the ribosomal slippage event in ORF1ab

Wang K, Li M, Hakonarson H. ANNOVAR: Functional annotation of genetic variants from next-generation sequencing data. Nucleic Acids Research, 38:e164, 2010.

For ANNOVAR installation and instructions on analyzing mutations in SARS-CoV-2, please refer to the ANNOVAR documentation.
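
To illustrate why the slippage matters for annotation, the sketch below maps a genomic position to its ORF1ab codon. It is independent of how ANNOVAR implements this. The coordinates come from the GenBank annotation of the reference, `join(266..13468,13468..21555)`, and are stated here as an assumption; nucleotide 13468 is read twice because of the -1 frameshift, so every downstream codon is shifted by one base relative to a naive continuous reading.

```python
from typing import List, Tuple

# ORF1ab segments (1-based, inclusive); position 13468 appears in both.
ORF1AB_SEGMENTS = [(266, 13468), (13468, 21555)]


def orf1ab_codons(pos: int) -> List[Tuple[int, int]]:
    """Return (codon number, position within codon) pairs for a genomic position.

    Both values are 1-based. The slippage site yields two entries; positions
    outside ORF1ab yield an empty list.
    """
    hits = []
    cds_offset = 0  # CDS bases contributed by earlier segments
    for start, end in ORF1AB_SEGMENTS:
        if start <= pos <= end:
            idx = cds_offset + (pos - start)  # 0-based index within the joined CDS
            hits.append((idx // 3 + 1, idx % 3 + 1))
        cds_offset += end - start + 1
    return hits


print(orf1ab_codons(14408))  # [(4715, 2)]
print(orf1ab_codons(13468))  # [(4401, 3), (4402, 1)]
```

The first result places position 14408 at the second base of codon 4715, consistent with the commonly reported ORF1ab P4715L change for C14408T. A continuous reading from position 266 would put the same nucleotide at the first base of a codon and predict a different amino acid change.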



Pairwise sequence alignment of the sample sequence and the reference sequence based on the Smith-Waterman algorithm.
From the EMBOSS package
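
For readers unfamiliar with the algorithm, a minimal Smith-Waterman scoring routine is sketched below. It is a simplification for illustration only: it uses a linear gap penalty and arbitrary match/mismatch scores (assumptions of this example), whereas the EMBOSS implementation, the `water` program, uses a substitution matrix with affine gap penalties and also reports the alignment itself.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]  # first column is always 0 for local alignment
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: scores are floored at 0 so an alignment can restart anywhere.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


print(smith_waterman_score("GATTACA", "TTAC"))  # 8
```

Only two rows of the matrix are kept in memory, which is sufficient for the score but not for recovering the aligned sequences.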

