We see this question in different forms from many organizations struggling to tame their data. “Lake” and “warehouse” are terms that get tossed around as if everyone knows what they mean; sometimes they even seem interchangeable. The challenge is that they are different, and choosing what to invest in, and how to exploit it, is a key decision.
Let’s start with the Data Warehouse.
The basic challenge these days is that most organizations have more than one core system, and the reporting from those systems is generally weaker than anyone would like. That leads to lots of extracts – if you can get the data – and from there to lots of spreadsheets. And integration is a nightmare.
Even if the reporting is good, it’s limited or hard to interconnect with other systems. Not only do you want to know about sales and operations, but you want to connect them to other data, like your marketing efforts (and spend), which may live in a different system (or with an agency), and your website traffic (which may come from Google Analytics). When you have transaction data scattered across all of these systems, each new report means figuring out what can be shown and what can be connected – and once that’s done and you see the result, you ask a new question: how does this compare with a prior period?
That’s what a data warehouse is really for. The warehouse takes data from source systems (often nightly, but on any schedule, really) and “transforms” it into readily “analyzable” data. Maybe you really need to know sales over time, or sales by customer by period over time. Or maybe units by model type is the most important thing. Data warehouses are mostly custom built for a company, pre-transforming and integrating the data from many systems for a wide range of common (and sometimes not so common) questions. Built well, they speed you to the answers you need.
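To make the idea concrete, here is a minimal sketch of the kind of transform a warehouse load performs – rolling raw transactions up into “sales by customer by period.” The rows, field names, and figures are all made up for illustration; a real warehouse would do this with SQL or a dedicated ETL tool, not application code.

```python
from collections import defaultdict

# A few raw transaction rows as they might arrive from a source system.
# Names, dates, and amounts here are invented for illustration.
transactions = [
    {"customer": "Acme", "date": "2024-01-15", "amount": 1200.0},
    {"customer": "Acme", "date": "2024-02-03", "amount": 800.0},
    {"customer": "Brier", "date": "2024-01-20", "amount": 450.0},
]

def sales_by_customer_by_month(rows):
    """Pre-aggregate raw transactions into a warehouse-style summary table."""
    totals = defaultdict(float)
    for row in rows:
        period = row["date"][:7]  # "YYYY-MM" from "YYYY-MM-DD"
        totals[(row["customer"], period)] += row["amount"]
    return dict(totals)

print(sales_by_customer_by_month(transactions))
# {('Acme', '2024-01'): 1200.0, ('Acme', '2024-02'): 800.0, ('Brier', '2024-01'): 450.0}
```

The point is that the aggregation happens once, up front, on a schedule – so when someone asks “sales by customer last quarter,” the answer is already sitting in a table.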
So that makes sense – what’s a data lake?
Data lakes started out as one thing and turned out to be so useful they quickly morphed into use for a range of other similar purposes.
Originally, lakes were conceived as a way to capture “big data” – a lot of data arriving very fast, with limited options for what to do with it at the moment it arrives. Think hundreds or thousands of transactions per second. When you have that much data arriving that quickly, a common solution is to design a fast process to “store it away” and then schedule whatever next step you need.
Consider an example. Think about running oil wells across the country. Each well is equipped with sensors that transmit, say, 20 measurements every minute. (Measurements might be well flow, liquid temperature, current chemical composition, air temperature, etc.) The data is raw, and if you have a thousand wells out there, you get 20,000 measurements a minute. You stuff it away in a data lake – a simple database structure, so it’s fast – and store it until it’s needed. No transformation, no cleansing, no anything, really; just raw data.
In the case of oil wells, that data may be scanned once every minute by a program looking for anomalies – temperatures or flows that are outside expected parameters – which are then highlighted to operators. Otherwise it might sit there till someone has a need to go looking at it.
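A minimal sketch of that scan might look like the following. The metric names and operating limits are invented for illustration; a real system would read batches from the lake’s storage and run on a schedule.

```python
# Raw sensor readings as they might land in the lake: no cleansing,
# no transformation. Field names and values are illustrative assumptions.
readings = [
    {"well": "W-001", "metric": "liquid_temp", "value": 78.2},
    {"well": "W-002", "metric": "liquid_temp", "value": 131.5},
    {"well": "W-003", "metric": "flow_rate", "value": 42.0},
]

# Expected operating ranges per metric (made-up numbers).
LIMITS = {"liquid_temp": (40.0, 120.0), "flow_rate": (10.0, 90.0)}

def find_anomalies(rows):
    """Scan a batch of raw readings for values outside expected parameters."""
    flagged = []
    for row in rows:
        low, high = LIMITS[row["metric"]]
        if not (low <= row["value"] <= high):
            flagged.append(row)
    return flagged

for alert in find_anomalies(readings):
    print(f"ALERT {alert['well']}: {alert['metric']} = {alert['value']}")
# ALERT W-002: liquid_temp = 131.5
```

Notice that the lake itself does nothing clever – the value is in having the raw data sitting there, ready for whatever scan or question comes later.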
This was the original application of data lakes – repositories for inbound big data – which you receive now, store away and process later.
But this approach turned out to have a few other really useful applications. Like System Conversions for one. You have this old system you’ve used forever, and now you’re getting a new system because the old one doesn’t cut it anymore. But you have all this old data in the old system. What do you do with that? (Turns out 99% of the time you can’t actually convert it into the new system or if you can, it’s prohibitively expensive to do so.) Answer: you stick it all in a data lake, as is, and then connect it so you can report on it for as long as you need. This part of the data lake isn’t really being updated – it’s just an archive – making all the old data available.
There are lots of “store it away till needed” examples. But lakes usually hold pretty raw data, so using them for advanced reporting is often a challenge. So they sometimes feed nightly warehouse loads, in which the really useful data is pulled out for ready analysis.
So there you have it. Three points to keep it all sorted out:
- Collated, prepared, curated, integrated data that is known to be correct and frequently used goes into a warehouse, to facilitate advanced reporting and analysis.
- Data with minimal touch and very little, if any, transformation goes into a data lake, to be used or sorted out later.
- Are they totally different? Not really – but in the hands of a data professional the use cases are different, the benefits are different, and the time to results is different.
To discuss ways to get the lake or warehouse you need, and the results you need out of it, you can contact us here.