A comprehensive overview of open source big data platforms and frameworks



Big Data is the paradigm that represents the ability to analyze and cross-reference large amounts of data generated by computational systems and turn them into useful knowledge. This capability is one way organizations can meet the challenge of getting closer to their users. Managers, however, must first understand the Big Data concept and the business strategies inherent to its use. The large number of challenges to be addressed has produced an equally large number of proposed technical solutions, many of which merely overlap existing ones. Managers often confront these issues while their organizations compete for market share, without the resources to embrace both Big Data and other options that could provide competitive advantage. Organization owners and managers therefore need to be educated about deployed platforms so that they can understand the benefits achievable in the short term. In this paper we provide an overview of using Big Data with open source tools. We explain the Big Data concept, its potential value, and the organizational strategies that must be studied to determine which benefits an organization can gain from it. We analyze the strengths and drawbacks of five open source frameworks for distributed data programming (Hadoop, Spark, Storm, Flink and H2O) and seven open source platforms for Big Data analytics (Mahout, MOA, R Project, Vowpal Wabbit, Pegasus, GraphLab Create and MLlib). No single platform is a true one-size-fits-all solution, so this paper aims to help decision makers by providing as much information as possible and quantifying some of the tradeoffs.


International Journal of Big Data, Vol. 2, No. 3, pp. 15-33, 2015

