NSIT: Novel Sequence Identification Tool

Bibliographic Details
Main Author: Benjarath Pupacdi (635765) (author)
Other Authors: Asif Javed (236620) (author), Mohammed J. Zaki (257966) (author), Mathuros Ruchirawat (272859) (author)
Published: 2014
Subjects:
_version_ 1864513506676572160
author Benjarath Pupacdi (635765)
author2 Asif Javed (236620)
Mohammed J. Zaki (257966)
Mathuros Ruchirawat (272859)
author2_role author
author
author
author_facet Benjarath Pupacdi (635765)
Asif Javed (236620)
Mohammed J. Zaki (257966)
Mathuros Ruchirawat (272859)
author_role author
dc.creator.none.fl_str_mv Benjarath Pupacdi (635765)
Asif Javed (236620)
Mohammed J. Zaki (257966)
Mathuros Ruchirawat (272859)
dc.date.none.fl_str_mv 2014-09-29T06:00:00Z
dc.identifier.none.fl_str_mv 10.1371/journal.pone.0108011
dc.relation.none.fl_str_mv https://figshare.com/articles/journal_contribution/NSIT_Novel_Sequence_Identification_Tool/26860975
dc.rights.none.fl_str_mv CC BY 4.0
info:eu-repo/semantics/openAccess
dc.subject.none.fl_str_mv Biological sciences
Bioinformatics and computational biology
Genetics
Biomedical and clinical sciences
Clinical sciences
Sequence alignment
Human genomics
Chromosomes
Mammalian genomics
Chromosome mapping
Computer software
Epstein-Barr virus
Zebrafish
dc.title.none.fl_str_mv NSIT: Novel Sequence Identification Tool
dc.type.none.fl_str_mv Text
Journal contribution
info:eu-repo/semantics/publishedVersion
text
contribution to journal
description <p dir="ltr">Novel sequences are DNA sequences present in an individual's genome but absent in the human reference assembly. They are predicted to be biologically important, both individual- and population-specific, and consistent with the known human migration paths. Recent studies have shown that an average person harbors 2–5 Mb of such sequences and estimated that the human pan-genome contains as much as 19–40 Mb of novel sequences. To identify them in a de novo genome assembly, some existing sequence aligners have been used, but no computational method has been specifically proposed for this task. In this work, we developed <b>NSIT</b> (<b>N</b>ovel <b>S</b>equence <b>I</b>dentification <b>T</b>ool), a software tool that can accurately and efficiently identify novel sequences in an individual's de novo whole genome assembly. We identified and characterized 1.1 Mb, 1.2 Mb, and 1.0 Mb of novel sequences in NA18507 (African), YH (Asian), and NA12878 (European) de novo genome assemblies, respectively. Our results show very high concordance with the previous work using the respective reference assembly. In addition, our results using the latest human reference assembly suggest that the amount of novel sequences per individual may not be as high as previously reported. We additionally developed a graphical viewer for comparisons of novel sequence contents. The viewer also helped in identifying sequence contamination; we found 130 kb of Epstein-Barr virus sequence in the previously published NA18507 novel sequences as well as 287 kb of zebrafish repeats in the NA12878 de novo assembly. NSIT requires 2 GB of RAM and 1.5–2 hours on a commodity desktop. The program is applicable to input assemblies with varying contig/scaffold sizes, ranging from 100 bp to as large as 50 Mb. It works on both 32-bit and 64-bit systems and outperforms, by large margins, other fast sequence aligners previously applied to this task. To our knowledge, NSIT is the first software tool designed specifically for novel sequence identification in a de novo human genome assembly.</p><h2>Other Information</h2><p dir="ltr">Published in: PLOS ONE<br>License: <a href="http://creativecommons.org/licenses/by/4.0/" target="_blank">http://creativecommons.org/licenses/by/4.0/</a><br>See article on publisher's website: <a href="https://dx.doi.org/10.1371/journal.pone.0108011" rel="noreferrer" target="_blank">https://dx.doi.org/10.1371/journal.pone.0108011</a></p>
eu_rights_str_mv openAccess
id Manara2_092208a44d9961492975741a9a2b8956
identifier_str_mv 10.1371/journal.pone.0108011
network_acronym_str Manara2
network_name_str Manara2
oai_identifier_str oai:figshare.com:article/26860975
publishDate 2014
repository.mail.fl_str_mv
repository.name.fl_str_mv
repository_id_str
rights_invalid_str_mv CC BY 4.0
status_str publishedVersion
title NSIT: Novel Sequence Identification Tool
title_full NSIT: Novel Sequence Identification Tool
title_fullStr NSIT: Novel Sequence Identification Tool
title_full_unstemmed NSIT: Novel Sequence Identification Tool
title_short NSIT: Novel Sequence Identification Tool
title_sort NSIT: Novel Sequence Identification Tool
topic Biological sciences
Bioinformatics and computational biology
Genetics
Biomedical and clinical sciences
Clinical sciences
Sequence alignment
Human genomics
Chromosomes
Mammalian genomics
Chromosome mapping
Computer software
Epstein-Barr virus
Zebrafish