Pattern recognition on read positioning in next generation sequencing
Date
2016
Authors
Byeon, Boseon
Kovalchuk, Igor
Publisher
Public Library of Science
Abstract
The usefulness and the utility of the next generation sequencing (NGS) technology are
based on the assumption that the DNA or cDNA cleavage required to generate short
sequence reads is random. Several previous reports suggest the existence of sequencing
bias of NGS reads. To address this question in greater detail, we analyze NGS data from
four organisms with different GC content, Plasmodium falciparum (19.39%), Arabidopsis
thaliana (36.03%), Homo sapiens (40.91%) and Streptomyces coelicolor (72.00%). Using
machine learning techniques, we recognize the pattern that the NGS read start is positioned
in the local region where the nucleotide distribution is dissimilar from the global nucleotide
distribution. We also demonstrate that the mono-nucleotide distribution underestimates
sequencing bias, and the recognized pattern is explained largely by the distribution of multi-nucleotides
(di-, tri-, and tetra-nucleotides) rather than mono-nucleotides. This implies that
the correction of sequencing bias needs to be performed on the basis of the multi-nucleotide
distribution. Providing companion software to quantify the effect of the recognized pattern
on read positioning, we exemplify that the bias correction based on the mono-nucleotide
distribution may not be sufficient to clean sequencing bias.
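
The abstract does not name the dissimilarity measure or the machine learning method, and the companion software is not reproduced here. As a rough illustration only, the sketch below compares the k-mer distribution in a window around a read start with the genome-wide k-mer distribution. The Jensen-Shannon divergence, the function names, and the `flank` window size are all assumptions chosen for this example, not the authors' method.

```python
from collections import Counter
from math import log2


def kmer_distribution(seq, k):
    """Relative frequency of every overlapping k-mer in seq."""
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two sparse distributions.

    Assumed measure for illustration; it is 0 for identical distributions
    and 1 for distributions with no k-mers in common.
    """
    keys = set(p) | set(q)
    m = {key: 0.5 * (p.get(key, 0.0) + q.get(key, 0.0)) for key in keys}

    def kl_to_mixture(a):
        return sum(a[key] * log2(a[key] / m[key]) for key in a if a[key] > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


def local_dissimilarity(genome, read_start, k, flank=10):
    """Divergence between the window around a read start and the whole genome.

    `flank` (bases taken on each side of the read start) is a hypothetical
    parameter, not a value taken from the paper.
    """
    lo = max(0, read_start - flank)
    hi = min(len(genome), read_start + flank)
    local = kmer_distribution(genome[lo:hi], k)
    global_ = kmer_distribution(genome, k)
    return js_divergence(local, global_)
```

Calling `local_dissimilarity(genome, pos, k)` with `k = 1` gives a mono-nucleotide score, while `k = 2`, `3`, or `4` gives the di-, tri-, and tetra-nucleotide analogues, so the same read start can be scored at each level and the results compared.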
Description
Sherpa Romeo green journal: open access
Keywords
Next generation sequencing, Plasmodium falciparum, Arabidopsis thaliana, Homo sapiens, Streptomyces coelicolor, Read positioning, Pattern recognition, Nucleotide distribution
Citation
Byeon, B., & Kovalchuk, I. (2016). Pattern recognition on read positioning in next generation sequencing. PLoS ONE, 11(6), e0157033. doi:10.1371/journal.pone.0157033