A multi-class discriminative motif finding algorithm for autosomal genomic data. (c2015)

Population genetics has advanced considerably in the past decade. New technologies, such as Next-Generation Genome Sequencing, can now provide huge amounts of data in little time. Big initiatives such as the International HapMap Project and the 1000 Genomes Project are making...

Bibliographic Details
Main Author:	Wehbe, Gioia Wahib (author)
Format:	masterThesis
Published:	2016
Subjects:	Population genetics; Population genetics > Computer simulation; Human genome; Lebanese American University > Dissertations; Dissertations, Academic
Online Access:	http://hdl.handle.net/10725/3493 https://doi.org/10.26756/th.2015.49

_version_	1864513461323563008
author	Wehbe, Gioia Wahib
author_facet	Wehbe, Gioia Wahib
author_role	author
dc.creator.none.fl_str_mv	Wehbe, Gioia Wahib
dc.date.none.fl_str_mv	12/21/2015 2016-04-06T05:32:08Z 2016-04-06T05:32:08Z 2016-04-06
dc.identifier.none.fl_str_mv	http://hdl.handle.net/10725/3493 https://doi.org/10.26756/th.2015.49
dc.language.none.fl_str_mv	en
dc.publisher.none.fl_str_mv	Lebanese American University
dc.rights.*.fl_str_mv	info:eu-repo/semantics/openAccess
dc.subject.none.fl_str_mv	Population genetics; Population genetics -- Computer simulation; Human genome; Lebanese American University -- Dissertations; Dissertations, Academic
dc.title.none.fl_str_mv	A multi-class discriminative motif finding algorithm for autosomal genomic data. (c2015)
dc.type.none.fl_str_mv	Thesis info:eu-repo/semantics/publishedVersion info:eu-repo/semantics/masterThesis
description	Population genetics has advanced considerably in the past decade. New technologies, such as Next-Generation Genome Sequencing, can now provide huge amounts of data in little time. Big initiatives such as the International HapMap Project and the 1000 Genomes Project are using these technologies to provide the scientific community with a detailed genetic reference from different populations. The challenge now is to develop fast and accurate computational methods to analyze this huge amount of data. Identifying genetic signatures that can distinguish between populations is currently a major concern. Much work has been done to analyze variation within the human genome, and more specifically at the Y-chromosome level, in order to better understand the evolution of the human species. However, learning about the variability of the complete human genome is essential in order to fully understand the genetic evolution of Homo sapiens. Unfortunately, the search for conserved regions on autosomal chromosomes is still in its infancy, as it has proven very difficult due to the high rate of recombination on these chromosomes. In addition, implementing feasible computational methods for such enormous data is itself a challenge. To tackle these obstacles, we have derived a new computational method to identify conserved regions of Single Nucleotide Polymorphisms (SNPs) on autosomal chromosomes that are differentiable across populations. Our algorithm first performs a feature selection step to define differentiable SNPs. Then, it searches for population-discriminative motifs, that is, differentiable sequences of SNPs, using Probabilistic Suffix Tree data structures. We initially tested the efficiency and performance of our method on several simulated datasets and then applied it to a real genomic dataset covering different populations from the Middle East and North Africa (MENA) region. Our method was able to identify the inserted motifs in the simulated data with a precision of 90% and a sensitivity of 80% on average. Additionally, it was able to identify several differentiable regions in the real dataset, on different chromosomes. Chromosomes 1, 3 and 6 had the highest numbers of differentiable motifs (9, 8 and 6 motifs, respectively). Our feature selection step outperformed SPLSDA, a state-of-the-art feature selection technique known for its speed, in both computation time and precision. Our method is the first to identify multi-class-specific 'regions' rather than random subsets of Single Nucleotide Polymorphisms on unphased genomic SNP data. These discriminative motifs can be further studied to understand their role at both the evolutionary and disease levels.
eu_rights_str_mv	openAccess
format	masterThesis
id	LAURepo_4d9af97eb0cd8f5528e356751e3fc5fd
language_invalid_str_mv	en
network_acronym_str	LAURepo
network_name_str	Lebanese American University repository
oai_identifier_str	oai:laur.lau.edu.lb:10725/3493
publishDate	2016
publisher.none.fl_str_mv	Lebanese American University
spelling	1 hard copy: xix, 156 leaves; col. ill.; 31 cm. Available at RNL. Includes bibliographical references (leaves 121-132). 2023-03-01T09:18:43Z
status_str	publishedVersion
title	A multi-class discriminative motif finding algorithm for autosomal genomic data. (c2015)
topic	Population genetics; Population genetics -- Computer simulation; Human genome; Lebanese American University -- Dissertations; Dissertations, Academic
url	http://hdl.handle.net/10725/3493 https://doi.org/10.26756/th.2015.49
