Big Data and Data Mining

Did you know that, by commonly cited estimates, humans have accumulated more data over the last couple of years than in all of previous history? This incredible amount of data is collected from all kinds of sources – sensors in cars and mobile phones, website usage statistics, shopping habits, product safety information and even things like historical weather data. Taken together, these huge and varied collections are what has become known as big data.

Why is big data important?

Big data is slowly changing the way companies and other organizations work. From marketing campaigns to product safety and risk analysis – we increasingly rely on information discovered during data analysis.

Over the last few years, large companies have accumulated so much information about their customers that, without proper tools to make sense of all that data, they are unable to optimize their business processes any further. From things like customer support and user experience to customer retention and the success of new product launches – most major business decisions are becoming more and more reliant on data analysis and discovery.

Working with big data differs from regular data mining in several ways:

  • Size. Numerous data sets are collected into a single database, which makes the combined data extremely difficult to store and curate efficiently.
  • Variety. Big data includes all kinds of data types – from sensor readings to plain text. All of this data must be stored and organized into a single data set that makes sense.
  • Analysis and retrieval. Analyzing extremely large and complex data sets is not a simple task. Fast and efficient analysis, search, sharing and visualization of information are all very difficult at this scale.
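
To make the "variety" point more concrete, here is a minimal sketch of how two very different inputs, a sensor reading and a piece of plain text, might be mapped onto one shared record format. The Record schema, the field names and the sample values are all invented for illustration; a real system would define its own.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Record:
    """Hypothetical common schema shared by every data source."""
    source: str
    timestamp: datetime
    kind: str      # "numeric" or "text"
    value: object  # float for sensor readings, str for free text


def from_sensor(reading: dict) -> Record:
    # Assumed input shape: {"sensor_id": ..., "epoch": ..., "celsius": ...}
    return Record(
        source=reading["sensor_id"],
        timestamp=datetime.fromtimestamp(reading["epoch"], tz=timezone.utc),
        kind="numeric",
        value=float(reading["celsius"]),
    )


def from_review(review: dict) -> Record:
    # Assumed input shape: {"user": ..., "posted": ISO 8601 string, "text": ...}
    return Record(
        source=review["user"],
        timestamp=datetime.fromisoformat(review["posted"]),
        kind="text",
        value=review["text"].strip(),
    )


# Two made-up inputs of different types end up in one sortable list.
records = [
    from_sensor({"sensor_id": "car-17", "epoch": 1700000000, "celsius": 21.5}),
    from_review({"user": "alice", "posted": "2023-11-14T09:30:00+00:00",
                 "text": " Works well. "}),
]
records.sort(key=lambda r: r.timestamp)
```

Once every source is converted to the same shape, the records can be stored, sorted and searched together, which is the "single data set that makes sense" described above.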

If using big data is such a challenging task, why bother? Why not analyze multiple smaller and less complex data sets separately? The answer is very simple: by combining data sets of different types and sizes, we can find patterns and other information which would otherwise have been impossible to see.
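
As a toy illustration of this idea, the sketch below joins two small, made-up data sets: daily sales figures and historical weather. Neither one says much on its own, but combining them by date exposes a pattern. The figures, the dates and the function name are all hypothetical, and a real big data system would do the same kind of join across far larger data sets.

```python
from statistics import mean

# Hypothetical data set 1: umbrellas sold per day (invented figures).
sales = {
    "2024-03-01": 12,
    "2024-03-02": 48,
    "2024-03-03": 15,
    "2024-03-04": 51,
    "2024-03-05": 9,
}

# Hypothetical data set 2: historical weather for the same dates.
weather = {
    "2024-03-01": "dry",
    "2024-03-02": "rain",
    "2024-03-03": "dry",
    "2024-03-04": "rain",
    "2024-03-05": "dry",
}


def average_sales_by_weather(sales, weather):
    """Join the two data sets on date and average the sales per weather type."""
    grouped = {}
    for date, units in sales.items():
        condition = weather.get(date)
        if condition is None:
            continue  # skip days that have no weather record
        grouped.setdefault(condition, []).append(units)
    return {condition: mean(values) for condition, values in grouped.items()}


result = average_sales_by_weather(sales, weather)
print(result)
```

With these sample numbers, rainy days average 49.5 units sold against 12 on dry days, a relationship that cannot be seen in either data set alone.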

The techniques used to analyze big data sets are very similar to those used in other data mining approaches. The only major difference is the machines performing the analysis: we need extremely fast and scalable systems to make big data analysis worthwhile. Results should be returned instantly or almost instantly; otherwise, in most cases, they are not going to be very useful.