Research Article

Progressive Sampling Algorithm with Rademacher Averages for Optimized Learning of Big Data: A Novel Approach

by Yathish Aradhya B. C., Y. P. Gowramma
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 175 - Number 15
Year of Publication: 2020
Authors: Yathish Aradhya B. C., Y. P. Gowramma
10.5120/ijca2020920652

Yathish Aradhya B. C., Y. P. Gowramma. Progressive Sampling Algorithm with Rademacher Averages for Optimized Learning of Big Data: A Novel Approach. International Journal of Computer Applications. 175, 15 (Aug 2020), 37-40. DOI=10.5120/ijca2020920652

@article{ 10.5120/ijca2020920652,
author = { Yathish Aradhya B. C. and Y. P. Gowramma },
title = { Progressive Sampling Algorithm with Rademacher Averages for Optimized Learning of Big Data: A Novel Approach },
journal = { International Journal of Computer Applications },
issue_date = { Aug 2020 },
volume = { 175 },
number = { 15 },
month = { Aug },
year = { 2020 },
issn = { 0975-8887 },
pages = { 37-40 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume175/number15/31532-2020920652/ },
doi = { 10.5120/ijca2020920652 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-07T00:25:09.455940+05:30
%A Yathish Aradhya B. C.
%A Y. P. Gowramma
%T Progressive Sampling Algorithm with Rademacher Averages for Optimized Learning of Big Data: A Novel Approach
%J International Journal of Computer Applications
%@ 0975-8887
%V 175
%N 15
%P 37-40
%D 2020
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Sampling Big Data for analytics is a tedious task. The Progressive Sampling Algorithm (PSA) is a primary tool adopted elsewhere to produce a minimal training data set for the learning algorithms used in Big Data analytics. PSA can be characterized by its underlying operations: initial sample size selection, sampling schedule, and stopping criterion. These operations are suggested along with the process flow of PSA in generating an adequate number of samples for the training data set. The training data set is a determining factor of training cost, computational cost, and learning model accuracy. The Rademacher Averages bound can be used to bound the sampling process. This paper suggests novel approaches to the underlying operations of PSA and identifies scope for a significant reduction in the cardinality of the training data set while retaining the learning model's accuracy within the Probably Approximately Correct (PAC) framework using Rademacher Averages bounds.
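
The three PSA operations named in the abstract (initial sample size, sampling schedule, stopping criterion) can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the geometric schedule, the Monte Carlo estimate of the empirical Rademacher average, the textbook-style bound 2R + 3*sqrt(ln(2/delta)/(2n)) for functions with values in [0, 1], the even split of delta across schedule steps, and all function and parameter names are assumptions made for illustration only.

```python
import math
import random


def geometric_schedule(n0, factor, n_max):
    """Return sample sizes n0, n0*factor, ... with n_max as the last size."""
    if factor <= 1:
        raise ValueError("factor must be greater than 1")
    sizes = []
    n = n0
    while n < n_max:
        sizes.append(n)
        n = int(math.ceil(n * factor))
    sizes.append(n_max)
    return sizes


def empirical_rademacher(values, trials, rng):
    """Monte Carlo estimate of the empirical Rademacher average.

    values[j][i] holds f_j(x_i) for a finite family of functions f_j.
    """
    n = len(values[0])
    total = 0.0
    for _ in range(trials):
        sigma = [rng.choice((-1, 1)) for _ in range(n)]
        total += max(sum(s * v for s, v in zip(sigma, row)) for row in values) / n
    return total / trials


def deviation_bound(rademacher, n, delta):
    """Assumed bound form for [0, 1]-valued functions (illustrative)."""
    return 2.0 * rademacher + 3.0 * math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def progressive_sample(data, functions, epsilon, delta,
                       n0=100, factor=2.0, trials=20, seed=0):
    """Grow the sample along the schedule until the bound is at most epsilon."""
    rng = random.Random(seed)
    sizes = geometric_schedule(n0, factor, len(data))
    step_delta = delta / len(sizes)  # union bound over the schedule steps
    for n in sizes:
        sample = rng.sample(data, n)  # fresh sample drawn without replacement
        values = [[f(x) for x in sample] for f in functions]
        r_hat = empirical_rademacher(values, trials, rng)
        bound = deviation_bound(r_hat, n, step_delta)
        if bound <= epsilon:  # stopping criterion
            break
    return sample, bound
```

Calling `progressive_sample(data, functions, epsilon=0.25, delta=0.1)` returns the first sample on the schedule whose bound is at most `epsilon`, or a sample of the full data set if no smaller size qualifies. The sketch draws a fresh sample at each step for simplicity; an incremental variant would extend the previous sample instead.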

Index Terms

Computer Science
Information Sciences

Keywords

Progressive Sampling Algorithm (PSA), VC-Dimension, Rademacher Averages, Big Data, Statistical Optima Size, Sample Convergence, Rademacher penalty, Bounds