Big data is a concept that deals with data sets of extreme volume. Questions tend to relate to infrastructure, algorithms, statistics, and data structures.

This article is an effort to explore techniques used by developers of in-stream data processing systems, trace the connections of these techniques to massive batch processing and OLTP/OLAP databases, and discuss how one unified query engine can support in-stream, batch, and OLAP processing at the same time.

Here’s an overview of Spark, an open source framework for big data. With its exceptional performance characteristics, Spark is well-suited for use with machine learning systems. James McCaffrey shows how you can install and run it on a Windows machine.

Machine learning works spectacularly well, but mathematicians aren’t quite sure why.

Data science continues to generate excitement and yet real-world results can often disappoint business stakeholders. How can we mitigate risk and ensure results match expectations?

Data lakes are marketed as enterprise-wide data management platforms for analyzing data from disparate sources in its native format.

Often people use Hadoop and other so-called Big Data™ tools for real-world processing and analysis jobs that can be done faster with simpler tools and different techniques.

We seek ever more data for a good reason: it’s the commodity that fuels digital innovation. However, turning those huge data collections into actionable insight remains a difficult proposition. Organizations that find solutions to formidable data challenges will be better positioned to economically benefit from the fruits of digital innovation.