Fast and Efficient Hashing for Sequence Similarity Search using Substring Extraction in DNA Sequence Databases

Robinson Silvester. A; J. Cruz Antony; M. Pratheepa

Research Article

Fast and Efficient Hashing for Sequence Similarity Search using Substring Extraction in DNA Sequence Databases

by Robinson Silvester. A, J. Cruz Antony, M. Pratheepa

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 78 - Issue 9

Published: September 2013

Authors: Robinson Silvester. A, J. Cruz Antony, M. Pratheepa

10.5120/13516-1295

PDF

Robinson Silvester. A, J. Cruz Antony, M. Pratheepa . Fast and Efficient Hashing for Sequence Similarity Search using Substring Extraction in DNA Sequence Databases. International Journal of Computer Applications. 78, 9 (September 2013), 13-17. DOI=10.5120/13516-1295

                        @article{ 10.5120/13516-1295,
                        author  = { Robinson Silvester. A,J. Cruz Antony,M. Pratheepa },
                        title   = { Fast and Efficient Hashing for Sequence Similarity Search using Substring Extraction in DNA Sequence Databases },
                        journal = { International Journal of Computer Applications },
                        year    = { 2013 },
                        volume  = { 78 },
                        number  = { 9 },
                        pages   = { 13-17 },
                        doi     = { 10.5120/13516-1295 },
                        publisher = { Foundation of Computer Science (FCS), NY, USA }
                        }

                        %0 Journal Article
                        %D 2013
                        %A Robinson Silvester. A
                        %A J. Cruz Antony
                        %A M. Pratheepa
                        %T Fast and Efficient Hashing for Sequence Similarity Search using Substring Extraction in DNA Sequence Databases%T 
                        %J International Journal of Computer Applications
                        %V 78
                        %N 9
                        %P 13-17
                        %R 10.5120/13516-1295
                        %I Foundation of Computer Science (FCS), NY, USA

Abstract

Emergent interest in genomic research has resulted in the creation of huge biological sequence databases, however search and retrieval of relevant information from these databases takes a lot of processing time, when performed conventionally as size of databases containing DNA sequences is huge. Hence, providing an efficient searching mechanism is mandatory. In this paper we present an efficient search mechanism using Hashing techniques. Initially, the data is hashed and indexed according to different window sizes. During this process, we eliminate redundancies and only record patterns with distinct elements and provide them with corresponding hash values. During the search phase, the search string is checked for the size of the window and if it exceeds the maximum limit of 4, then it is divided. The first part is considered as the search string and the search is made. After the confirmation of the index, the strings that follow the current indexed string are matched with the search string and finally the confirmation is made. The simulation results show that the current methodology provides faster results, while occupying lesser memory.

References

Peter Snustard, Michael J. Simmons. GENETICS, Wiley 6th edition, www. wiley. com/go/global/snustad.
Jones, Neil C. and Pevzner, Pavel A. (2004) "An Introduction to Bioinformatics Algorithms. " Cambridge: The MIT Press. 148-226 and 311-337.
Petri Kalsi, Hannu Peltola, and Jorma Tarhio, "Comparison of exact string matching algorithms for biological sequences", BIRD, 417–426, 2008.
Simone Faro and Thierry Lecroq, "The exact string matching problem: a comprehensive experimental evaluation", CoRR, abs/1012. 2547, 2010.
Simone Faro and Thierry Lecroq "The exact online string matching problem: a review of the most recent results", ACM Computing Surveys, 45(2): to appear, 2013.
Gonzalo Navarro and Mathieu Raffinot, "Flexible pattern matching in strings - practical on-line search algorithms for texts and biological sequences", Cambridge University Press, 2002.
Simone Faro, Thierry Lecroq, "Fast Searching in Biological Sequences Using Multiple Hash Functions", Proceedings of the 2012 IEEE 12th International Conference on Bioinformatics & Bioengineering (BIBE), Larnaca, Cyprus, 11-13 November 2012
David Nassimi, Milind Joshi, Andrew Sohn,"H-PBS: A Hash-Based Scalable Technique for Parallel Bidirectional Search", 1063-6374/95 1995 IEEE
Zhou Bai Stefan C. Kremer, "Sequence Learning: Analysis and Solutions for Sparse Data in High Dimensional Spaces", 978-1-4673-1191-5/12/$31. 00 ©2012 IEEE
David Dittman, Taghi Khoshgoftaar, Randall Wald, and Amri Napolitano, "Similarity Analysis of Feature Ranking Techniques on Imbalanced DNA Microarray Datasets", 2012 IEEE International Conference on Bioinformatics and Biomedicine
V. Hari Prasad, Dr. P. Y. Kumar, Dr. D. Vasumathi, "Adaptive segmentation of DNA sequences using SBC Tehcnique:A novel Algorithm", ICCCNT'12, July 2012, Coimbatore, India
Gonzalo Navarro , Mathieu Raffinot , "Practical and flexible pattern matching over Ziv–Lempel compressed text", Journal of Discrete Algorithms 2 (2004) 347–371
Nur'Aini Binti Abdul Rashid, Rana Ghadban, Hazrina Yusof Hamdani,Atheer A-Abdulrazaq, "Enhanced CAFÉ Indexing Algorithm Using Hashing Function", 978-1-4244-6716-7/10/$26. 00, 2010 IEEE
Maryam Nuser, Izzat Alsmadi, "Evaluating Graphical and Statistical Techniques for Measuring Similarity in DNA Sequences", 978-1-4673-1550-0/12/$31. 00 ©2012 IEEE

Index Terms

Computer Science

Information Sciences

No index terms available.

Keywords

Hashing Sequence Similarity by Hashing Substring Extraction DNA Sequence