Every information technology software vendor talked, a lot, about Big Data in 2012, and it doesn't seem to be a temporary thing. All the major players, like SAS, Teradata, Informatica, Oracle, IBM and many others, spare no effort in making as much noise as possible about Big Data.
Teradata claims that by 2020 the amount of digital information will reach 35 billion terabytes (35 zettabytes). That's 4.6 terabytes for every person projected to be alive in 2020. I have no problem believing that. The challenge Big Data wants to answer is how to do something useful with all that data.
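That per-person figure is easy to sanity-check with a quick back-of-envelope calculation (the 2020 world-population projection of roughly 7.6 billion is my assumption, not Teradata's):

```python
# Back-of-envelope check of the per-person data volume.
total_tb = 35e9            # 35 billion terabytes (35 zettabytes), the projection cited
population_2020 = 7.6e9    # assumed projected world population in 2020

per_person_tb = total_tb / population_2020
print(round(per_person_tb, 1))  # ~4.6 TB per person
```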
So is this just a lot of marketing mumbo jumbo or are we really on to something?
Big Data Definition
The good thing is that almost everyone agrees on when you can label your information challenge as “Big Data”:
- The data volume is massive and a challenge to handle, say Tera- and Petabytes.
- The change rate of that data is high, real-time, or near real-time.
- The data can have many different formats and/or sources.
- Analytics insight is required in a timely fashion.
So the challenge with Big Data is managing massive, highly volatile and diverse volumes of data intelligently, in an actionable time span.
An Ambitious Example
Let’s run ahead of ourselves and illustrate with a possible example. Suppose Walmart wants to improve the customer shopping experience and target customers on the fly with the right offering. Ten years ago you would have asked a marketing agency to perform a survey on a statistically viable but small set of respondents. The agency would present the results and Walmart would decide to, for example, introduce discounts on Friday or rearrange some products on the shelves for the next 6 months. Today you would do the same but using online data. But suppose you want to do this for each individual customer, while they are in the store, 24/7?
How would this go? Let’s assume Walmart knows the customer, has a way of tracking the customer’s behavior in the store and uses electronic shelf labels. The target is to highlight the right product, the right promotion or the right price, specifically mapped to that individual customer.
After some analysis you will end up with a massive amount of data that contains the history of a customer, the shelf layout of each shop, the price of each product, pictures and characteristics of products, the stock status of each product, demographics of the customer, the location of the customer in the store, maybe the Facebook data of the customer, weather conditions, the music played at the store, promotions and much, much more (maybe even facial recognition data that indicates the customer’s mood when he/she realizes all of this ;-)). New products and customers are added daily, and shelf layout and stock position vary constantly. The result is an enormous amount of data, potentially unstructured, that changes constantly.
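To make that heterogeneity concrete, a single in-store event might look something like the sketch below. All field names and values are purely illustrative, not any vendor's schema:

```python
# Illustrative (hypothetical) record combining the data sources named above.
# Every field name and value here is made up for the sketch.
event = {
    "customer_id": "C-102934",
    "timestamp": "2012-11-23T14:02:11Z",
    "location_in_store": {"aisle": 7, "shelf": "B3"},
    "purchase_history_ref": "hist/C-102934",   # pointer to a large history blob
    "demographics": {"age_band": "35-44", "postcode": "72716"},
    "shelf_layout_version": 418,               # layout changes constantly
    "nearby_stock": [{"sku": "SKU-5521", "price": 3.49, "in_stock": True}],
    "context": {"weather": "rain", "store_music": "playlist-12"},
    "social": None,        # e.g. Facebook data; often missing or free-form
}

# Only some of this is neatly structured; product images, voice and free text
# would sit alongside it as raw blobs, which is what makes the mix "unstructured".
print(sorted(event.keys()))
```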
Based on this you want to perform an advanced individual real-time market basket analysis thousands of times per minute. Challenge? Mhh.
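At its core, market basket analysis is simple: count how often items co-occur in baskets and derive support and confidence for rules like “customers who buy X also buy Y”. A minimal sketch on toy data (nothing like the Walmart-scale real-time version the scenario requires):

```python
# Toy transaction log; each set is one customer's basket.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "diapers"},
    {"beer", "chips"},
]

def support(items):
    """Fraction of baskets containing the given item set."""
    items = set(items)
    return sum(items <= b for b in baskets) / len(baskets)

def confidence(x, y):
    """Confidence of the rule x -> y, i.e. P(y in basket | x in basket)."""
    return support({x, y}) / support({x})

print(confidence("diapers", "beer"))  # 2 of the 3 diaper baskets also contain beer
```

At thousands of queries per minute over terabytes, this brute-force scan would be replaced by precomputed, continuously updated co-occurrence counts; the statistics themselves stay the same.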
I browsed around on the web and, truth be told, not many reliable sources can be found, but I estimate that the total chunk of data you would use in the above scenario on a continuous basis could well exceed 10 terabytes if you go all the way. What we do know is that Walmart handles more than 1 million customer transactions every hour, loaded into databases covering more than 2,500 terabytes.
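A rough feel for what that transaction rate alone accumulates, assuming (my assumption, for the sketch only) about 5 KB of structured data per transaction:

```python
transactions_per_hour = 1_000_000   # the publicly cited Walmart figure
bytes_per_transaction = 5_000       # assumption for this sketch: ~5 KB each

gb_per_hour = transactions_per_hour * bytes_per_transaction / 1e9
tb_per_year = gb_per_hour * 24 * 365 / 1000
print(gb_per_hour, round(tb_per_year, 1))  # ~5 GB/hour, ~44 TB/year
```

And that is before any of the unstructured sources from the scenario above, which dwarf structured transaction records.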
Solving these types of massive data challenges demands a serious amount of resources and kit.
- You need massive computing and storage power
- You need an intelligent engine that can work the data
- You need an interface to use the results
- You need the people to work the above
I’ve taken a look at how IBM solves this, and their offering is in essence a combination of hardware (like Netezza, which I like a lot), an information management system (various components of the InfoSphere suite) and analytics capabilities, which can be the sum of Cognos BI, TM1 and SPSS, and various other components. The other traditional vendors follow the same logic: take what you have, and adapt and bundle it to meet Big Data challenges.
So on the one hand it seems just a lot of marketing talk for things that already exist, but on the other hand you can argue that no single solution exists to tackle a Big Data challenge. Both are probably true, but this does not negate the fact that we are becoming part of a data-collecting society where the challenge to work this data is very real. Next to that, although vendors are bundling various software bits and pieces, it is also true that hardware and software capabilities have increased and matured in recent years. Think in-memory technologies, think solutions like Netezza, think cloud storage, etc.
The biggest challenge in these types of offerings will most likely not be the amount of data, nor the change rate or the immediate responses required, not even applying intelligent analytics to it, but the ability to integrate different sources and different formats.
Handling data like voice in different languages, or interpreting images, is underway, but it is nowhere near production-ready on a massive scale for a variety of applications.
Integrating different sources has technical solutions, but sources like popular social media sites pop up and gain massive popularity very quickly, which doesn’t make it any easier either.
But those are technical challenges; the biggest challenge for an organisation will be to define what benefits can be derived from using Big Data. It is obvious that Big Data will not come cheap, so a solid business case is crucial.
In the example above one can argue that Big Data is merely the mega version of the classic “beer and diapers” story. What has changed in those 20 years is that we now have the means to gather all that data and to exploit it intelligently on a massive scale. If the projections about data growth are true, and there is no reason to doubt them, the major challenge for, and after, Big Data will be intelligent and powerful analytics on non-relational information. And, as argued above, identifying the business cases that can benefit from a Big Data approach.