Research Article

Author verification using a Graph-based Representation

by  Esteban Castillo, Ofelia Cervantes, Darnes Vilariño, David Báez
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 123 - Issue 14
Published: August 2015
DOI: 10.5120/ijca2015905654

Esteban Castillo, Ofelia Cervantes, Darnes Vilariño, David Báez. Author verification using a Graph-based Representation. International Journal of Computer Applications. 123, 14 (August 2015), 1-8. DOI=10.5120/ijca2015905654

@article{10.5120/ijca2015905654,
  author    = {Esteban Castillo and Ofelia Cervantes and Darnes Vilariño and David Báez},
  title     = {Author verification using a Graph-based Representation},
  journal   = {International Journal of Computer Applications},
  year      = {2015},
  volume    = {123},
  number    = {14},
  pages     = {1--8},
  doi       = {10.5120/ijca2015905654},
  publisher = {Foundation of Computer Science (FCS), NY, USA}
}
%0 Journal Article
%D 2015
%A Esteban Castillo
%A Ofelia Cervantes
%A Darnes Vilariño
%A David Báez
%T Author verification using a Graph-based Representation
%J International Journal of Computer Applications
%V 123
%N 14
%P 1-8
%R 10.5120/ijca2015905654
%I Foundation of Computer Science (FCS), NY, USA
Abstract

This paper presents a methodology for the authorship verification problem. The approach compares an unknown document against each of an author's known documents, using a graph representation that captures the syntactic sequence of each text together with a graph similarity measure. The unknown document is attributed to the same author if the majority of the comparisons surpass a predefined threshold. The best results, obtained on the CLEF PAN 2014 dataset, were 79% for Spanish and 68% for English, suggesting that the proposed methodology is a viable way to determine a document's authorship.
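
The decision rule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how the graph is built or which similarity measure is used, so this sketch assumes a graph whose edges link each word to the word that follows it, and it substitutes the Jaccard overlap of edge sets as the similarity measure. The function names and the default threshold are hypothetical.

```python
import re


def text_to_edges(text):
    """Build a simple sequence graph, stored as a set of directed edges.

    Assumption: each edge links a lowercased word to the word that follows it.
    The paper's syntactic-sequence graph may be built differently.
    """
    tokens = re.findall(r"\w+", text.lower())
    return set(zip(tokens, tokens[1:]))


def graph_similarity(edges_a, edges_b):
    """Jaccard overlap of two edge sets (a stand-in for the paper's measure)."""
    if not edges_a and not edges_b:
        return 0.0
    return len(edges_a & edges_b) / len(edges_a | edges_b)


def same_author(known_docs, unknown_doc, threshold=0.1):
    """Return True if most comparisons with the known documents exceed the threshold."""
    unknown_edges = text_to_edges(unknown_doc)
    votes = [
        graph_similarity(text_to_edges(doc), unknown_edges) > threshold
        for doc in known_docs
    ]
    # Majority rule: strictly more than half of the comparisons must pass.
    return sum(votes) > len(votes) / 2
```

Calling `same_author(known_docs, unknown_doc, threshold=0.5)` with a list of one author's texts returns a single yes/no decision. In practice the threshold would have to be tuned on training data for each language.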

Index Terms
Computer Science
Information Sciences
Keywords

Authorship Verification Syntactic Sequence Graph Graph Similarity
