Comparing SQL and NoSQL approaches for clustering over big data



Data mining is the process of discovering patterns in large datasets. As the volume of available information grows exponentially, new machine learning, statistical and other analytics techniques must be developed so that such analysis runs fast enough to be useful. In this study, techniques such as cluster analysis are applied to generated data to perform customer segmentation, and system performance is evaluated by measuring processing time. The data is generated with the Star Schema Benchmark (SSB). Our main goal is to find a scalable solution for running data mining over a decision support benchmark. Four systems are tested: single-node MySQL, MySQL Cluster, Apache Mahout and R. MySQL Cluster and Mahout, each distributed across four nodes, are used to compare the performance of k-means run in parallel. Single-node MySQL and R allow this parallel execution to be compared against methods running on a single machine, on both relational and non-relational systems.
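
To make the measurement concrete, the following is a minimal single-machine sketch of timing a k-means run for customer segmentation. It is not the paper's implementation: the synthetic `(total_revenue, order_count)` rows, the helper names `make_customers` and `kmeans`, the choice of `k=3` and the fixed iteration count are all assumptions made for illustration, and the data is not produced by SSB.

```python
import random
import time

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on a list of equal-length tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids: k sampled input points
    assignment = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment

def make_customers(n_per_group=200, seed=1):
    """Hypothetical synthetic data: (total_revenue, order_count) per customer."""
    rng = random.Random(seed)
    centers = [(100.0, 5.0), (500.0, 20.0), (900.0, 60.0)]
    return [
        (rng.gauss(cx, 20.0), rng.gauss(cy, 2.0))
        for cx, cy in centers
        for _ in range(n_per_group)
    ]

customers = make_customers()
start = time.perf_counter()  # measure only the clustering itself
centroids, assignment = kmeans(customers, k=3)
elapsed = time.perf_counter() - start
print(f"k-means over {len(customers)} rows took {elapsed:.4f} s")
```

Timing only the clustering call, and not data generation, mirrors the idea of evaluating each system by its processing time; a distributed run would additionally include data distribution and coordination costs that this sketch does not model.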


International Journal of Business Process Integration and Management (IJBPIM), Vol. 7, No. 4, pp. 335, 2015

