Research Article

Fast and Efficient Hashing for Sequence Similarity Search using Substring Extraction in DNA Sequence Databases

by  Robinson Silvester. A, J. Cruz Antony, M. Pratheepa
journal cover
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 78 - Issue 9
Published: September 2013
Authors: Robinson Silvester. A, J. Cruz Antony, M. Pratheepa
10.5120/13516-1295
PDF

Robinson Silvester. A, J. Cruz Antony, M. Pratheepa . Fast and Efficient Hashing for Sequence Similarity Search using Substring Extraction in DNA Sequence Databases. International Journal of Computer Applications. 78, 9 (September 2013), 13-17. DOI=10.5120/13516-1295

                        @article{ 10.5120/13516-1295,
                        author  = { Robinson Silvester. A,J. Cruz Antony,M. Pratheepa },
                        title   = { Fast and Efficient Hashing for Sequence Similarity Search using Substring Extraction in DNA Sequence Databases },
                        journal = { International Journal of Computer Applications },
                        year    = { 2013 },
                        volume  = { 78 },
                        number  = { 9 },
                        pages   = { 13-17 },
                        doi     = { 10.5120/13516-1295 },
                        publisher = { Foundation of Computer Science (FCS), NY, USA }
                        }
                        %0 Journal Article
                        %D 2013
                        %A Robinson Silvester. A
                        %A J. Cruz Antony
                        %A M. Pratheepa
                        %T Fast and Efficient Hashing for Sequence Similarity Search using Substring Extraction in DNA Sequence Databases%T 
                        %J International Journal of Computer Applications
                        %V 78
                        %N 9
                        %P 13-17
                        %R 10.5120/13516-1295
                        %I Foundation of Computer Science (FCS), NY, USA
Abstract

Emergent interest in genomic research has resulted in the creation of huge biological sequence databases, however search and retrieval of relevant information from these databases takes a lot of processing time, when performed conventionally as size of databases containing DNA sequences is huge. Hence, providing an efficient searching mechanism is mandatory. In this paper we present an efficient search mechanism using Hashing techniques. Initially, the data is hashed and indexed according to different window sizes. During this process, we eliminate redundancies and only record patterns with distinct elements and provide them with corresponding hash values. During the search phase, the search string is checked for the size of the window and if it exceeds the maximum limit of 4, then it is divided. The first part is considered as the search string and the search is made. After the confirmation of the index, the strings that follow the current indexed string are matched with the search string and finally the confirmation is made. The simulation results show that the current methodology provides faster results, while occupying lesser memory.

References
  • Peter Snustard, Michael J. Simmons. GENETICS, Wiley 6th edition, www. wiley. com/go/global/snustad.
  • Jones, Neil C. and Pevzner, Pavel A. (2004) "An Introduction to Bioinformatics Algorithms. " Cambridge: The MIT Press. 148-226 and 311-337.
  • Petri Kalsi, Hannu Peltola, and Jorma Tarhio, "Comparison of exact string matching algorithms for biological sequences", BIRD, 417–426, 2008.
  • Simone Faro and Thierry Lecroq, "The exact string matching problem: a comprehensive experimental evaluation", CoRR, abs/1012. 2547, 2010.
  • Simone Faro and Thierry Lecroq "The exact online string matching problem: a review of the most recent results", ACM Computing Surveys, 45(2): to appear, 2013.
  • Gonzalo Navarro and Mathieu Raffinot, "Flexible pattern matching in strings - practical on-line search algorithms for texts and biological sequences", Cambridge University Press, 2002.
  • Simone Faro, Thierry Lecroq, "Fast Searching in Biological Sequences Using Multiple Hash Functions", Proceedings of the 2012 IEEE 12th International Conference on Bioinformatics & Bioengineering (BIBE), Larnaca, Cyprus, 11-13 November 2012
  • David Nassimi, Milind Joshi, Andrew Sohn,"H-PBS: A Hash-Based Scalable Technique for Parallel Bidirectional Search", 1063-6374/95 1995 IEEE
  • Zhou Bai Stefan C. Kremer, "Sequence Learning: Analysis and Solutions for Sparse Data in High Dimensional Spaces", 978-1-4673-1191-5/12/$31. 00 ©2012 IEEE
  • David Dittman, Taghi Khoshgoftaar, Randall Wald, and Amri Napolitano, "Similarity Analysis of Feature Ranking Techniques on Imbalanced DNA Microarray Datasets", 2012 IEEE International Conference on Bioinformatics and Biomedicine
  • V. Hari Prasad, Dr. P. Y. Kumar, Dr. D. Vasumathi, "Adaptive segmentation of DNA sequences using SBC Tehcnique:A novel Algorithm", ICCCNT'12, July 2012, Coimbatore, India
  • Gonzalo Navarro , Mathieu Raffinot , "Practical and flexible pattern matching over Ziv–Lempel compressed text", Journal of Discrete Algorithms 2 (2004) 347–371
  • Nur'Aini Binti Abdul Rashid, Rana Ghadban, Hazrina Yusof Hamdani,Atheer A-Abdulrazaq, "Enhanced CAFÉ Indexing Algorithm Using Hashing Function", 978-1-4244-6716-7/10/$26. 00, 2010 IEEE
  • Maryam Nuser, Izzat Alsmadi, "Evaluating Graphical and Statistical Techniques for Measuring Similarity in DNA Sequences", 978-1-4673-1550-0/12/$31. 00 ©2012 IEEE
Index Terms
Computer Science
Information Sciences
No index terms available.
Keywords

Hashing Sequence Similarity by Hashing Substring Extraction DNA Sequence

Powered by PhDFocusTM