Variant calling pipeline for amplicon-based sequencing of the SARS-CoV-2 viral genome

Overview of the analytic pipeline “Donner” for variant calling from amplicon-based sequencing of the SARS-CoV-2 viral genome

Created date: 2020/5/16
Update date: 2020/11/27
Kenjiro Kosaki,
Center for Medical Genetics, Keio University School of Medicine, Tokyo, Japan

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


The following article has been published in the Journal of Hospital Infection.

Clinical Utility of SARS-CoV-2 Whole Genome Sequencing in Deciphering Source of Infection

Toshiki Takenouchi, Yuka W. Iwasaki, Sei Harada, Hirotsugu Ishizu, Yoshifumi Uwamino, Shunsuke Uno, Asami Osada, Naoki Hasegawa, Mitsuru Murata, Toru Takebayashi, Koichi Fukunaga, Hideyuki Saya, Yuko Kitagawa, Masayuki Amagai, Haruhiko Siomi, Kenjiro Kosaki


System requirements

The system was tested on CentOS 6.3 and on Ubuntu 16.04 LTS and 18.04 LTS.
Conda may be helpful for installing the required packages.

Reference SARS-CoV-2 sequence

MN908947.3.fasta was used as the reference.

Wu,F., Zhao,S., Yu,B., Chen,Y.M., Wang,W., Song,Z.G., Hu,Y.,Tao,Z.W., Tian,J.H., Pei,Y.Y., Yuan,M.L., Zhang,Y.L., Dai,F.H.,Liu,Y., Wang,Q.M., Zheng,J.J., Xu,L., Holmes,E.C. and Zhang,Y.Z.

A new coronavirus associated with human respiratory disease in China.
Nature 2020; 579:265-269

Raw data generation

PCR amplification using the ARTIC primer set version 3

Amplicon sequencing by Illumina MiSeq

Software components used in the pipeline

Resampling and quality control


Resample fastq data when depth is excessively high

Wei Shen, Shuai Le, Yan Li, and Fuquan Hu
SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation
PLoS One. 2016; 11: e0163962.
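
As a rough sketch of how the resampling step could be driven, the Python below estimates mean depth from the total number of sequenced bases and the 29,903-nt length of MN908947.3, then builds a `seqkit sample` command line. The target depth, the file names, and the function names are illustrative assumptions, not values taken from the pipeline.

```python
GENOME_LENGTH = 29903  # length of the MN908947.3 reference in nucleotides


def sampling_proportion(total_bases, target_depth, genome_length=GENOME_LENGTH):
    """Return the fraction of reads to keep so mean depth is about target_depth."""
    mean_depth = total_bases / genome_length
    if mean_depth <= target_depth:
        return 1.0  # depth is not excessive, keep every read
    return target_depth / mean_depth


def seqkit_sample_command(fastq_in, fastq_out, proportion, seed=11):
    """Build (but do not run) a `seqkit sample` command as an argument list."""
    return [
        "seqkit", "sample",
        "-p", f"{proportion:.4f}",  # proportion of reads to keep
        "-s", str(seed),            # random seed
        "-o", fastq_out,
        fastq_in,
    ]


# Hypothetical example: 2000x mean depth reduced to an assumed target of 1000x.
proportion = sampling_proportion(total_bases=29903 * 2000, target_depth=1000)
command = seqkit_sample_command("sample_R1.fastq.gz", "sample_R1.sub.fastq.gz", proportion)
```

The argument list can be passed to `subprocess.run(command, check=True)`. For paired-end data, the SeqKit documentation advises using the same seed for both read files so that the sampled pairs stay matched.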


Trim adapters

Chen S, Zhou Y, Chen Y, Gu J.
fastp: an ultra-fast all-in-one FASTQ preprocessor.
Bioinformatics. 2018;34:i884-i890.
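
A minimal sketch of an adapter-trimming call follows. It assumes paired-end MiSeq reads; the file names and the helper name are hypothetical, and the exact fastp options used by the pipeline are not stated here.

```python
def fastp_command(r1_in, r2_in, r1_out, r2_out, report_prefix):
    """Build (but do not run) a paired-end fastp command as an argument list."""
    return [
        "fastp",
        "-i", r1_in, "-I", r2_in,        # input read 1 / read 2
        "-o", r1_out, "-O", r2_out,      # trimmed output read 1 / read 2
        "--detect_adapter_for_pe",       # adapter auto-detection for paired-end data
        "-j", f"{report_prefix}.json",   # machine-readable QC report
        "-h", f"{report_prefix}.html",   # human-readable QC report
    ]


command = fastp_command(
    "sample_R1.fastq.gz", "sample_R2.fastq.gz",
    "sample_R1.trimmed.fastq.gz", "sample_R2.trimmed.fastq.gz",
    "sample.fastp",
)
```

The list can be run with `subprocess.run(command, check=True)` once fastp is on the `PATH`.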


Soft clip PCR primer sequences

Grubaugh ND, Gangavarapu K, Quick J, Matteson NL, De Jesus JG, Main BJ, Tan AL, Paul LM, Brackney DE, Grewal S, Gurfield N, Van Rompay KKA, Isern S, Michael SF, Coffey LL, Loman NJ, Andersen KG.
An amplicon-based sequencing framework for accurately measuring intrahost virus diversity using PrimalSeq and iVar.
Genome Biol. 2019 Jan 8;20:8.


bwa version xx or above

Li H, Durbin R.
Fast and accurate short read alignment with Burrows-Wheeler transform.
Bioinformatics. 2009;25:1754-60.
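
The sketch below shows one way the alignment and primer soft-clipping steps might be chained. Note that `ivar trim` works on an aligned, sorted BAM file, so in practice the soft clipping runs after `bwa mem`. The thread count, the output naming scheme, and the primer BED file name are assumptions for illustration only.

```python
import shlex


def alignment_commands(reference, fastq_r1, fastq_r2, primer_bed, sample, threads=4):
    """Return shell command strings for indexing, alignment, and primer soft clipping."""
    sorted_bam = f"{sample}.sorted.bam"
    trimmed_prefix = f"{sample}.primertrimmed"
    bwa_mem = ["bwa", "mem", "-t", str(threads), reference, fastq_r1, fastq_r2]
    sort_stdin = ["samtools", "sort", "-o", sorted_bam, "-"]
    return [
        shlex.join(["bwa", "index", reference]),
        # Align the reads and sort the alignments in a single pipe.
        shlex.join(bwa_mem) + " | " + shlex.join(sort_stdin),
        shlex.join(["samtools", "index", sorted_bam]),
        # Soft clip primer sequences listed in the BED file.
        shlex.join(["ivar", "trim", "-i", sorted_bam, "-b", primer_bed, "-p", trimmed_prefix]),
        # The ivar output is re-sorted and indexed before variant calling.
        shlex.join(["samtools", "sort", "-o", f"{trimmed_prefix}.sorted.bam", f"{trimmed_prefix}.bam"]),
        shlex.join(["samtools", "index", f"{trimmed_prefix}.sorted.bam"]),
    ]


commands = alignment_commands(
    "MN908947.3.fasta", "s_R1.fastq.gz", "s_R2.fastq.gz", "primers.bed", "s"
)
```

Each string can be executed in order with `subprocess.run(cmd, shell=True, check=True)`; `shlex.join` requires Python 3.8 or later.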

Variant calling

samtools (version 1.9 or above)
bcftools (version 1.9 or above)

Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R; 1000 Genome Project Data Processing Subgroup.
The Sequence Alignment/Map format and SAMtools.
Bioinformatics. 2009;25:2078-9.
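
A possible shape for the variant calling step is sketched below. The haploid setting (`--ploidy 1`) and the raised per-file depth cap are assumptions that seem reasonable for deep amplicon data from a viral genome; the pipeline's actual parameters are not listed in this overview, and the file names are hypothetical.

```python
import shlex


def variant_calling_command(reference, bam, vcf_out, max_depth=10000):
    """Return a shell pipeline string: bcftools mpileup piped into bcftools call."""
    mpileup = [
        "bcftools", "mpileup",
        "-Ou",                  # uncompressed BCF for piping
        "-d", str(max_depth),   # raise the per-file depth cap (assumed value)
        "-f", reference,
        bam,
    ]
    call = [
        "bcftools", "call",
        "-mv",                  # multiallelic caller, output variant sites only
        "--ploidy", "1",        # treat the sample as haploid (assumption)
        "-Oz", "-o", vcf_out,   # write a bgzip-compressed VCF
    ]
    return shlex.join(mpileup) + " | " + shlex.join(call)


pipeline = variant_calling_command(
    "MN908947.3.fasta", "s.primertrimmed.sorted.bam", "s.vcf.gz"
)
```

The resulting string can be run with `subprocess.run(pipeline, shell=True, check=True)`.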


See above



Predicted effects on translated protein

Cingolani P, Platts A, Wang le L, Coon M, Nguyen T, Wang L, Land SJ, Lu X, Ruden DM.
A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3.
Fly (Austin). 2012 ;6:80-92.
Requires building a custom database for SARS-CoV-2
Custom GFF3 file for building database
See the attached GFF3 file.
Binary database
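
Building a custom SnpEff database from a GFF3 file generally involves placing the annotation and the reference sequence under `data/<genome name>/`, registering the genome in `snpEff.config`, and running the `build` command. The sketch below follows that general layout; the genome name, the directory paths, and the helper name are hypothetical, and the sequence names in the FASTA and GFF3 files are assumed to match.

```python
import shutil
from pathlib import Path


def prepare_snpeff_database(snpeff_dir, genome_name, gff3_path, fasta_path):
    """Lay out the files SnpEff expects and return the build command (not run here)."""
    snpeff_dir = Path(snpeff_dir)
    data_dir = snpeff_dir / "data" / genome_name
    data_dir.mkdir(parents=True, exist_ok=True)

    # SnpEff looks for these fixed file names inside data/<genome_name>/.
    shutil.copyfile(gff3_path, data_dir / "genes.gff")
    shutil.copyfile(fasta_path, data_dir / "sequences.fa")

    # Register the genome in snpEff.config unless it is already listed.
    config_path = snpeff_dir / "snpEff.config"
    entry = f"{genome_name}.genome : SARS-CoV-2"
    existing = config_path.read_text() if config_path.exists() else ""
    if entry not in existing:
        with open(config_path, "a") as config:
            config.write(f"\n{entry}\n")

    return ["java", "-jar", str(snpeff_dir / "snpEff.jar"),
            "build", "-gff3", "-v", genome_name]
```

After the build completes, annotation would typically be run as `java -jar snpEff.jar <genome name> input.vcf > annotated.vcf`.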


Annotate mutations considering the ribosome slippage event in ORF1ab

Wang K, Li M, Hakonarson H.
ANNOVAR: functional annotation of genetic variants from next-generation sequencing data.
Nucleic Acids Res. 2010;38:e164.

For ANNOVAR installation and how to analyze mutations in SARS-CoV-2, please refer to the ANNOVAR documentation.



Pairwise alignment of the sample sequence and the reference sequence based on the Smith-Waterman algorithm.
From the EMBOSS package

Carver T, Bleasby A: The design of Jemboss: a graphical user interface to EMBOSS. Bioinformatics. 2003; 19: 1837-1843.
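
To illustrate the idea behind Smith-Waterman local alignment, here is a small self-contained Python sketch. It is deliberately simplified: it uses a linear gap penalty and arbitrary match and mismatch scores chosen for illustration, whereas the EMBOSS implementation uses a substitution matrix with separate gap opening and extension penalties. It is not a substitute for the EMBOSS program on genome-length sequences.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned a, aligned b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; cells are floored at 0 so an alignment can start anywhere.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


# Toy example: two short sequences differing by a single substitution.
best_score, aligned_sample, aligned_reference = smith_waterman("ACGTTGCA", "ACGTAGCA")
```

Positions where `aligned_sample` and `aligned_reference` differ correspond to substitutions, and `-` characters mark insertions or deletions relative to the other sequence.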
