More precisely, it is a common pattern to split the analytics into three parts: 1) collect the data (using lots of adapters) into a "data lake", 2) filter and preprocess that data lake into a uniform data structure, whose shape is determined by the analysis goal rather than by the operational systems, 3) analyze that uniform data with statistical and other tools.
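Purely as an illustration, and not tied to any particular stack, a minimal Python sketch of those three parts might look like this (the `lake` directory, the `sales` table and all field names are made up for the example):

```python
import json
import sqlite3
from pathlib import Path

# 1) Collect: adapters copy raw exports from the operational systems into the
#    "data lake" (here just a directory of files, kept in their original structure).
LAKE = Path("lake")

# 2) Preprocess: parse the raw files into one uniform table shaped for the analysis.
def build_warehouse(db_path="warehouse.db"):
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales (day TEXT, product TEXT, amount REAL)")
    for raw_file in LAKE.glob("*.json"):
        for record in json.loads(raw_file.read_text()):
            # system-specific field names get normalized here
            con.execute("INSERT INTO sales VALUES (?, ?, ?)",
                        (record["date"], record["sku"], record["total"]))
    con.commit()
    return con

# 3) Analyze: statistics run only against the uniform warehouse table.
def top_products(con, n=10):
    return con.execute(
        "SELECT product, SUM(amount) AS total FROM sales "
        "GROUP BY product ORDER BY total DESC LIMIT ?", (n,)).fetchall()
```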
It usually makes no sense to combine those phases into a single overall tool. First, these are very different tasks, for which different specialized tools will evolve anyway. Second, you want to keep the intermediate results - for caching as well as for an audit trail and reproducibility of the results.
For example, you don't want the performance of the operational system to depend on how many analysis tools are hitting it at any point in time. Also, you don't want to work on a constantly changing dataset while fine-tuning your analysis methods.
Thanks for your input. The point of a BI tool is to allow flexible analysis of your data. What types of transformations are generally required for making this work without actually _doing_ the analysis during the warehousing step?
The data lake contains the datasets in their original structure - raw data, warts and all, as defined by the operational systems.
The data warehouse contains datasets in a unified (and usually simplified and reduced) form, defined by the needs of analysis tools.
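To make that contrast concrete (all field names invented for the example): two operational systems might export the same fact in quite different shapes, while the warehouse keeps a single unified form.

```python
# Raw records as they land in the data lake, each in its system's own shape:
crm_record  = {"CustNo": "A-17", "OrderTotal": "12,50", "TS": "03.02.2016"}
shop_record = {"customer_id": 17, "amount_cents": 1250, "ordered_at": "2016-02-03T10:15:00Z"}

# The same fact in the warehouse's unified, analysis-oriented form:
warehouse_row = {"customer_id": 17, "amount_eur": 12.50, "order_date": "2016-02-03"}
```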
If your analysis tool accesses the data lake directly, it will almost certainly contain "parsers" for various operational data formats. Also, it will perform those transformations over and over again, every time it runs. And multiple analysis scripts may contain multiple versions of those parsers. The idea is to separate these "parsers" out of the analysis step and to "cache" the cleaned-up intermediate result. That "cache of clean data" is usually called a "data warehouse", and since you can create good indexes on that data, multiple runs of your analysis tools get very fast access.
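A rough sketch of that separation, continuing the hypothetical `sales` table from above (SQLite is just a stand-in for whatever database you use): the parsing happens once when the warehouse is loaded, and indexes on the clean table keep every later analysis run fast.

```python
import sqlite3

con = sqlite3.connect("warehouse.db")

# The format-specific "parsers" ran once when the warehouse was loaded
# (see the build step sketched above), not in every analysis script.

# Index the cleaned, unified table so repeated analysis runs stay fast.
con.execute("CREATE INDEX IF NOT EXISTS idx_sales_product ON sales (product)")
con.execute("CREATE INDEX IF NOT EXISTS idx_sales_day ON sales (day)")
con.commit()

# Analysis scripts query the indexed intermediate result directly,
# instead of re-parsing the raw operational exports each time.
rows = con.execute("SELECT day, SUM(amount) FROM sales GROUP BY day").fetchall()
```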
got it. so the idea is generally people want to do sort-of 2nd derivative queries on the data, so it's best to get those first stats out of the way in the warehousing step
Yes, although I wouldn't describe this as "2nd derivative queries", but more like "put the code that you need anyway into two separate layers (tools) with clean boundaries and a persistent intermediate result".
However, you still want to copy & combine your data into a single database. The relevant Fowler patterns are:
http://martinfowler.com/bliki/DataLake.html
http://martinfowler.com/bliki/ReportingDatabase.html