From the Warehouse, over the Swamp, to the Data Lake

Author: Nikolina Zeljko | Category: News | Posted: 05.06.2017
Writes: Marko Štajcer, Poslovna inteligencija

Data warehouse (DWH) is a well-known term, and everyone in touch with the IT world knows what it is. In a single sentence, a DWH is the central repository of data used within an organisation for reporting and analysis, and the data itself can be integrated from a variety of systems. When we talk about data warehouses, we talk about relational, structured data.

With the arrival of Hadoop and the new Big Data technologies, we gained the ability to analyse a much larger amount of data than a traditional DWH can handle, and to analyse not only relational but also unstructured data. New terms related to data storage, such as the Data Lake, have appeared, along with many variations: Managed Data Lake, Data Factory, Data Refinery…

By definition, a Data Lake is also a data repository, one that allows organisations to process and store all data, from internal or external sources, in its original format, without the need for prior structuring or formatting. Once the data is collected and stored in the Data Lake, it can be combined and made available to users in accordance with business needs.

At the beginning of the commercialisation of Hadoop, the announcements were spectacular: it went so far as to claim that Hadoop, and the Data Lakes built on top of it, would completely replace the DWH, thanks to the features mentioned above, significantly lower data storage prices, and the development of components such as Hive and Impala, which offer the functionality of a classical relational database. However, there is more than one reason why this has not happened. Constant change, with new versions of Hadoop and its components released almost monthly, is certainly the most important factor affecting users' perception of the platform's stability, and it gives the impression that the entire Hadoop ecosystem is still not mature enough for such use.
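The "store in original format, apply structure later" idea described above is often called schema-on-read. A minimal sketch in plain Python (a hypothetical example; the field names and sample records are invented for illustration):

```python
import json
import io

# In a Data Lake, raw records land untouched, in their original form.
# Here they are heterogeneous JSON lines: types vary, fields may be missing.
raw_events = io.StringIO(
    '{"user": "ana", "amount": "120.50", "currency": "EUR"}\n'
    '{"user": "ivan", "amount": 75, "note": "currency field missing"}\n'
)

def read_with_schema(stream):
    """Schema-on-read: coerce types and default missing fields only
    at the moment the data is consumed, not when it is stored."""
    for line in stream:
        rec = json.loads(line)
        yield {
            "user": str(rec["user"]),
            "amount": float(rec["amount"]),              # coerce str or int
            "currency": rec.get("currency", "UNKNOWN"),  # fill missing field
        }

rows = list(read_with_schema(raw_events))
# rows[0]["amount"] → 120.5; rows[1]["currency"] → "UNKNOWN"
```

A DWH, by contrast, enforces this schema on write: a record that does not fit the target tables is rejected or transformed before it is stored. Engines such as Hive and Impala apply essentially the same schema-on-read approach, only declaratively, with a table definition laid over raw files.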
Also, missing or limited metadata management capabilities, security constraints, and system administration and maintenance challenges are just some of the additional reasons why organisations will keep their financial and other important data in classical data warehouses.

Over time, the DWH has also evolved. The very idea of using MPP (massively parallel processing) technology is much older than Hadoop itself: Teradata developed this concept in the 1990s and, according to Gartner, is still one of the most important players in this segment, alongside IBM Netezza, Oracle Exadata and HPE Vertica. These systems are significantly more advanced than traditional data warehouses, allowing much larger amounts of data to be processed, responding to complex requirements in advanced analytics, and offering support for in-database data processing with exceptionally good performance.

However, the additional capabilities that Big Data technologies, Hadoop and NoSQL databases bring should not be neglected. The new platforms are complementary: they serve to manage, process, and analyse new types and forms of data that are not supported in standard DWH systems. Since the DWH is still the best platform for reporting and analytics, the new platforms do not replace it, but complement it. DWH optimisation, where the Data Lake is used for storage and processing of detailed data, SNA, customer experience, and monetisation of internal and external data are just some of the cases where the Data Lake architecture has its application.

If we take into account all the advantages and disadvantages presented by Big Data technology and manage the data correctly, we will create the prerequisites for the organisation to use its data efficiently and derive extra business value from it. Otherwise, if we ignore the above, instead of a clean and neat Data Lake we will get a swamp out of which we will hardly swim.