A multi-class discriminative motif finding algorithm for autosomal genomic data. (c2015)

Population genetics has advanced considerably over the past decade. New technologies, such as next-generation genome sequencing, can now produce huge amounts of data in little time. Large initiatives such as the International HapMap Project and the 1000 Genomes Project are making...

Bibliographic details
Main author: Wehbe, Gioia Wahib (author)
Format: masterThesis
Published: 2016
Subjects:
Online access: http://hdl.handle.net/10725/3493
https://doi.org/10.26756/th.2015.49
_version_ 1864513461323563008
author Wehbe, Gioia Wahib
author_facet Wehbe, Gioia Wahib
author_role author
dc.creator.none.fl_str_mv Wehbe, Gioia Wahib
dc.date.none.fl_str_mv 12/21/2015
2016-04-06T05:32:08Z
2016-04-06T05:32:08Z
2016-04-06
dc.identifier.none.fl_str_mv http://hdl.handle.net/10725/3493
https://doi.org/10.26756/th.2015.49
dc.language.none.fl_str_mv en
dc.publisher.none.fl_str_mv Lebanese American University
dc.rights.*.fl_str_mv info:eu-repo/semantics/openAccess
dc.subject.none.fl_str_mv Population genetics
Population genetics -- Computer simulation
Human genome
Lebanese American University -- Dissertations
Dissertations, Academic
dc.title.none.fl_str_mv A multi-class discriminative motif finding algorithm for autosomal genomic data. (c2015)
dc.type.none.fl_str_mv Thesis
info:eu-repo/semantics/publishedVersion
info:eu-repo/semantics/masterThesis
description Population genetics has advanced considerably over the past decade. New technologies, such as next-generation genome sequencing, can now produce huge amounts of data in little time. Large initiatives such as the International HapMap Project and the 1000 Genomes Project use these technologies to provide the scientific community with a detailed genetic reference from different populations. The challenge now is to develop fast and accurate computational methods to analyze this data. Identifying genetic signatures that distinguish between populations is a major concern. Much work has analyzed variation within the human genome, and more specifically on the Y chromosome, to better understand the evolution of the human species. However, studying the variability of the complete human genome is essential to fully understand the genetic evolution of Homo sapiens. Unfortunately, finding conserved regions on autosomal chromosomes is still in its infancy, as it has proven very difficult because of the high rate of recombination on these chromosomes. In addition, implementing feasible computational methods for such enormous data is itself a challenge. To tackle these obstacles, we derived a new computational method that identifies conserved regions of Single Nucleotide Polymorphisms (SNPs) on autosomal chromosomes that differ between populations. Our algorithm first performs a feature selection step to identify differentiable SNPs. It then searches for population-discriminative motifs, that is, differentiable sequences of SNPs, using Probabilistic Suffix Tree data structures. We first tested the efficiency and performance of our method on several simulated datasets and then applied it to a real genomic dataset covering different populations from the Middle East and North Africa (MENA) region. Our method identified the inserted motifs in the simulated data with a precision of 90% and a sensitivity of 80% on average. It also identified several differentiable regions in the real dataset, on different chromosomes; chromosomes 1, 3 and 6 had the highest numbers of differentiable motifs (9, 8 and 6 motifs respectively). Our feature selection step outperformed SPLSDA, a state-of-the-art feature selection technique known for its speed, in both computation time and precision. Our method is the first to identify multi-class-specific 'regions' rather than random subsets of SNPs on unphased genomic SNP data. These discriminative motifs can be studied further to understand their role at both the evolutionary and disease levels.
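The abstract does not give implementation details, so the following is only a minimal sketch of the general idea behind scoring a SNP window with suffix-context statistics; it is not the thesis's algorithm. It assumes genotypes are coded as strings over "0", "1", "2" (minor-allele counts), uses an unpruned table of context counts in place of a full Probabilistic Suffix Tree, and scores a window by a log-likelihood ratio between two populations. All function names, the coding scheme, and the pseudocount are illustrative assumptions.

```python
import math
from collections import defaultdict

ALPHABET = "012"  # assumed genotype coding: minor-allele count at each SNP


def build_context_counts(sequences, max_depth):
    """Count next-symbol occurrences after every context of length 0..max_depth."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i, symbol in enumerate(seq):
            for depth in range(min(max_depth, i) + 1):
                counts[seq[i - depth:i]][symbol] += 1
    return counts


def next_symbol_prob(counts, history, symbol, max_depth, pseudo=1.0):
    """Smoothed P(symbol | longest suffix of history seen in training)."""
    context = history[-max_depth:] if max_depth > 0 else ""
    # Back off to shorter suffixes until one was observed (or the empty context).
    while context and context not in counts:
        context = context[1:]
    table = counts.get(context, {})
    total = sum(table.values())
    return (table.get(symbol, 0) + pseudo) / (total + pseudo * len(ALPHABET))


def log_likelihood(counts, window, max_depth):
    """Log-probability of a window under one population's context counts."""
    return sum(
        math.log(next_symbol_prob(counts, window[:i], window[i], max_depth))
        for i in range(len(window))
    )


def discriminative_score(counts_a, counts_b, window, max_depth):
    """Positive when the window is more typical of population A than of B."""
    return log_likelihood(counts_a, window, max_depth) - log_likelihood(
        counts_b, window, max_depth
    )
```

In use, one would build a count table per population from the SNPs kept by feature selection, then slide a window along each chromosome and keep windows whose score is strongly positive or negative. A multi-class version could compare each population against the pooled remainder, which is again an assumption rather than something the record states.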
eu_rights_str_mv openAccess
format masterThesis
id LAURepo_4d9af97eb0cd8f5528e356751e3fc5fd
language_invalid_str_mv en
network_acronym_str LAURepo
network_name_str Lebanese American University repository
oai_identifier_str oai:laur.lau.edu.lb:10725/3493
publishDate 2016
publisher.none.fl_str_mv Lebanese American University
spelling 1 hard copy: xix, 156 leaves; col. ill.; 31 cm. available at RNL. Includes bibliographical references (leaves 121-132). oai:laur.lau.edu.lb:10725/3493 2023-03-01T09:18:43Z
status_str publishedVersion
title A multi-class discriminative motif finding algorithm for autosomal genomic data. (c2015)