More precisely, it is a common pattern to split the analytics into three parts: 1) collect the data (using lots of adapters) into a "data lake", 2) filter and preprocess that data lake into a uniform data structure, whose shape is determined by the analysis goal rather than by the operational systems, 3) analyze that uniform data with statistical and other tools.
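Purely as an illustration, and not tied to any particular stack, a minimal Python sketch of those three parts might look like this (the `lake` directory, the `sales` table and all field names are made up for the example):

```python
import json
import sqlite3
from pathlib import Path

# 1) Collect: adapters copy raw exports from the operational systems into the
#    "data lake" (here just a directory of files, kept in their original structure).
LAKE = Path("lake")

# 2) Preprocess: parse the raw files into one uniform table shaped for the analysis.
def build_warehouse(db_path="warehouse.db"):
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales (day TEXT, product TEXT, amount REAL)")
    for raw_file in LAKE.glob("*.json"):
        for record in json.loads(raw_file.read_text()):
            # system-specific field names get normalized here
            con.execute("INSERT INTO sales VALUES (?, ?, ?)",
                        (record["date"], record["sku"], record["total"]))
    con.commit()
    return con

# 3) Analyze: statistics run only against the uniform warehouse table.
def top_products(con, n=10):
    return con.execute(
        "SELECT product, SUM(amount) AS total FROM sales "
        "GROUP BY product ORDER BY total DESC LIMIT ?", (n,)).fetchall()
```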
It usually makes no sense to combine those phases into a single overall tool. First, these are very different tasks, for which different specialized tools will evolve anyway. Second, you want to keep the intermediate results - for caching as well as for an audit trail and reproducibility of the results.
For example, you don't want the performance of the operational system to depend on how many analysis tools are hitting it at any point in time. Also, you don't want to work on a constantly changing dataset while fine-tuning your analysis methods.
Thanks for your input. The point of a BI tool is to allow flexible analysis of your data. What types of transformations are generally required for making this work without actually _doing_ the analysis during the warehousing step?
The data lake contains the datasets in their original structure - raw data, warts and all, as defined by the operational systems.
The data warehouse contains datasets in a unified (and usually simplified and reduced) form, defined by the needs of analysis tools.
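To make that contrast concrete (all field names invented for the example): two operational systems might export the same fact in quite different shapes, while the warehouse keeps a single unified form.

```python
# Raw records as they land in the data lake, each in its system's own shape:
crm_record  = {"CustNo": "A-17", "OrderTotal": "12,50", "TS": "03.02.2016"}
shop_record = {"customer_id": 17, "amount_cents": 1250, "ordered_at": "2016-02-03T10:15:00Z"}

# The same fact in the warehouse's unified, analysis-oriented form:
warehouse_row = {"customer_id": 17, "amount_eur": 12.50, "order_date": "2016-02-03"}
```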
If your analysis tool accesses the data lake directly, it will almost certainly contain "parsers" for various operational data formats. Also, it will perform those transformations over and over again, every time it runs. And multiple analysis scripts may contain multiple versions of those parsers. The idea is to separate these "parsers" out of the analysis step and to "cache" the cleaned-up intermediate result. That "cache of clean data" is usually called a "data warehouse", and since you can create good indexes on that data, multiple runs of your analysis tools get very fast access.
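A rough sketch of that separation, continuing the hypothetical `sales` table from above (SQLite is just a stand-in for whatever database you use): the parsing happens once when the warehouse is loaded, and indexes on the clean table keep every later analysis run fast.

```python
import sqlite3

con = sqlite3.connect("warehouse.db")

# The format-specific "parsers" ran once when the warehouse was loaded
# (see the build step sketched above), not in every analysis script.

# Index the cleaned, unified table so repeated analysis runs stay fast.
con.execute("CREATE INDEX IF NOT EXISTS idx_sales_product ON sales (product)")
con.execute("CREATE INDEX IF NOT EXISTS idx_sales_day ON sales (day)")
con.commit()

# Analysis scripts query the indexed intermediate result directly,
# instead of re-parsing the raw operational exports each time.
rows = con.execute("SELECT day, SUM(amount) FROM sales GROUP BY day").fetchall()
```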
got it. so the idea is generally people want to do sort-of 2nd derivative queries on the data, so it's best to get those first stats out of the way in the warehousing step
Yes, although I wouldn't describe this as "2nd derivative queries", but more like "put the code that you need anyway into two separate layers (tools) with clean boundaries and a persistent intermediate result".
However, you still want to copy & combine your data into a single database. The relevant Fowler patterns are:
http://martinfowler.com/bliki/DataLake.html
http://martinfowler.com/bliki/ReportingDatabase.html