First, a generalized data architecture for the data warehouse was designed, based on the customer's requirements and the available data. The underlying data model follows the Data Vault 2.0 methodology.
Depending on the use case, the data was divided into three categories: gold, silver, and bronze.
Gold refers to data that undergoes strict quality control within the data warehouse. This data is meant to power analytical platforms and is suitable for direct consumption by end users.
Silver, on the other hand, denotes raw data that has already been cleaned and enriched with data from other sources. Silver data can therefore be accessed quickly by consulting teams and analysts, but can only be interpreted with the necessary domain expertise.
Bronze data is raw data collected from the data lake of the customer's systems and made available to data scientists. Based on bronze data, initial hypotheses and, later, concrete use cases (gold and silver) can be derived.
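The three layers build on one another: silver is derived from bronze by cleaning and enrichment, and gold is derived from silver by quality control. A minimal Python sketch of that flow, with record shapes and rules that are purely illustrative assumptions rather than the project's actual transformations:

```python
# Illustrative sketch of the bronze -> silver -> gold flow.
# Record shapes, field names, and rules are assumptions, not project code.
from typing import Optional


def to_silver(bronze_record: dict) -> dict:
    """Clean a raw bronze record and enrich it from another source."""
    cleaned = {
        k: v.strip() if isinstance(v, str) else v
        for k, v in bronze_record.items()
        if v is not None  # drop empty fields during cleaning
    }
    cleaned["currency"] = "EUR"  # enrichment from a hypothetical reference source
    return cleaned


def to_gold(silver_record: dict) -> Optional[dict]:
    """Promote a silver record to gold only if it passes quality control."""
    required = {"customer_id", "revenue", "currency"}
    if not required.issubset(silver_record):
        return None  # fails quality control, stays in the silver layer
    return silver_record


bronze = {"customer_id": " C-42 ", "revenue": 1200.0, "comment": None}
silver = to_silver(bronze)
gold = to_gold(silver)
```

Silver here is readable but needs context (why was `comment` dropped? where does `currency` come from?), while gold has passed an explicit check, which mirrors the division of the layers described above.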
Within the project, a serverless ETL pipeline was built using Apache Airflow and Google Dataflow. By using a scripting language for the ETL process, duplication of the Data Vault code (bridges, hubs, etc.) could largely be avoided. Our Data Vault 2.0 package for Airflow allows even complex links between tables to be created simply by configuring the DAG.
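The idea of deriving Data Vault loading logic from a DAG configuration can be sketched as follows. The configuration format and function names are hypothetical, chosen for illustration; they are not the actual API of the package:

```python
# Hypothetical sketch: deriving Airflow-style task dependencies for
# Data Vault hub and link loads from a plain configuration dict.
# The config layout and names are assumptions, not the package's real API.

dag_config = {
    "hubs": ["customer", "product"],
    "links": [
        {"name": "customer_product", "hubs": ["customer", "product"]},
    ],
}


def build_dependencies(config: dict) -> dict:
    """Return {task_id: [upstream_task_ids]}: a link load waits for its hubs."""
    deps = {f"load_hub_{hub}": [] for hub in config["hubs"]}
    for link in config["links"]:
        deps[f"load_link_{link['name']}"] = [
            f"load_hub_{hub}" for hub in link["hubs"]
        ]
    return deps


deps = build_dependencies(dag_config)
# In a real Airflow DAG, each key would become an operator and each upstream
# entry a `hub_task >> link_task` dependency, so new links only require a
# change to the configuration, not new loading code.
```

Because the loading pattern for hubs and links is always the same, only the configuration varies per table, which is how the code duplication mentioned above is avoided.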
With Google BigQuery as the data warehouse, a serverless database was used, minimizing administrative effort. At the same time, it enables extraordinarily fast queries at high parallelism.
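From the pipeline's point of view, querying BigQuery is just submitting standard SQL; BigQuery parallelizes execution itself, with no cluster to manage. A small sketch using the official `google-cloud-bigquery` client, where the dataset and table names are placeholders:

```python
# Sketch of querying BigQuery from Python. Dataset/table names are
# placeholders; credentials are taken from the environment as usual.

def build_revenue_query(dataset: str, table: str) -> str:
    """Assemble a standard-SQL aggregation over a gold-layer table."""
    return (
        f"SELECT customer_id, SUM(revenue) AS total_revenue "
        f"FROM `{dataset}.{table}` "
        f"GROUP BY customer_id"
    )


sql = build_revenue_query("gold", "fact_revenue")

# Submitting it (requires google-cloud-bigquery and GCP credentials):
#   from google.cloud import bigquery
#   client = bigquery.Client()
#   rows = client.query(sql).result()  # BigQuery runs this massively in parallel
```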