Analytical platforms and data architecture in Google Cloud

Our customer, a globally leading consulting firm, delivers analytical industry solutions among other services. The goal of the project was to develop a generalized data architecture for an analytical industry solution, including a data warehouse, and to implement the essential data processes (ETL processes) in the cloud.

Challenge

The availability of business-critical data has never been more crucial to decision makers. Today, however, huge amounts of data need to be analyzed and made available to co-workers, consultants and C-level executives in a very short time in order to enable data-driven decisions. Our customer processed raw data on a terabyte scale and wanted to build a suitable big data strategy for it, one able to scale with a high number of customers and consulting projects.

Procedure

First, a generalized data architecture for a data warehouse was created based on the customer’s requirements and the available data. A data model compliant with the Data Vault 2.0 methodology was used in the process.
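As an illustration, the following is a minimal sketch of what such a Data Vault 2.0 model can look like in BigQuery. The customer entity, dataset and column names are hypothetical stand-ins for the project’s actual model.

```python
from google.cloud import bigquery

client = bigquery.Client(project="example-project")  # hypothetical project id

# Core Data Vault 2.0 pattern: hubs hold business keys, satellites hold
# descriptive attributes, and links (not shown) model relationships.
DDL = """
CREATE TABLE IF NOT EXISTS raw_vault.hub_customer (
  customer_hk   STRING    NOT NULL,  -- hash key derived from the business key
  customer_bk   STRING    NOT NULL,  -- business key from the source system
  load_date     TIMESTAMP NOT NULL,
  record_source STRING    NOT NULL
);

CREATE TABLE IF NOT EXISTS raw_vault.sat_customer_details (
  customer_hk   STRING    NOT NULL,  -- references hub_customer
  load_date     TIMESTAMP NOT NULL,
  hash_diff     STRING    NOT NULL,  -- detects attribute changes over time
  name          STRING,
  country       STRING,
  record_source STRING    NOT NULL
);
"""

client.query(DDL).result()  # BigQuery scripting runs both statements in one job
```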

Depending on the use case, the data was divided into the categories gold, silver and bronze.

Gold refers to data that undergoes strict quality control within the data warehouse. This data serves the operation of analytical platforms and is suitable to be viewed by end users.

Silver, on the other hand, categorises raw data that has already been cleaned and enriched with data from different sources. Silver data can therefore be accessed quickly by consulting teams and analysts, but requires the relevant expertise to interpret.

Bronze data is raw data collected from the data lake of customer systems and then provided to data scientists. Based on bronze data, initial hypotheses and later use cases (gold and silver) can be derived.
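To make the layering concrete, here is a minimal sketch of how a bronze-to-silver promotion can look, assuming a hypothetical orders feed; the dataset and column names are illustrative, not the project’s real ones.

```python
from google.cloud import bigquery

client = bigquery.Client(project="example-project")  # hypothetical project id

# Promote bronze (raw) records to silver: deduplicate, cast types and
# drop obviously broken rows. Gold tables would add business rules and
# strict quality checks on top of this.
PROMOTE_TO_SILVER = """
CREATE OR REPLACE TABLE silver.orders AS
SELECT
  order_id,
  customer_id,
  SAFE_CAST(order_total AS NUMERIC) AS order_total,
  TIMESTAMP(order_ts)               AS order_ts
FROM bronze.orders
WHERE order_id IS NOT NULL
QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY order_ts DESC) = 1
"""

client.query(PROMOTE_TO_SILVER).result()
```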

Within the context of the project, a serverless ETL pipeline was created using Apache Airflow and Google Dataflow. By using a scripting language for the ETL process, duplication of the Data Vault code (bridges, hubs, etc.) could be avoided as far as possible. Our Data Vault 2.0 package for Airflow allows complex links between tables to be created simply by configuring the DAG.
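The package itself is internal, so the snippet below only sketches the underlying idea with standard Airflow operators: a declarative configuration is expanded into one loading task per Data Vault link. The link names, hub names and the load_link stored procedure are hypothetical.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

# Declarative description of the Data Vault links to load; in the real
# project, a configuration like this drives the DAG generation.
LINKS = {
    "link_customer_order": ("hub_customer", "hub_order"),
    "link_order_product": ("hub_order", "hub_product"),
}

with DAG(
    dag_id="data_vault_load",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    for link, (left_hub, right_hub) in LINKS.items():
        # One generated task per configured link; the SQL template is the
        # same for every link, so no Data Vault code is duplicated.
        BigQueryInsertJobOperator(
            task_id=f"load_{link}",
            configuration={
                "query": {
                    # Hypothetical stored procedure encapsulating the link load.
                    "query": f"CALL raw_vault.load_link('{link}', '{left_hub}', '{right_hub}')",
                    "useLegacySql": False,
                }
            },
        )
```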

With Google BigQuery as the data warehouse, a serverless database was adopted, minimizing administrative effort. At the same time, extraordinarily fast queries at high parallelism are made possible.
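For illustration, a typical analytical query against such a warehouse can be issued as follows; the gold.orders_mart table is a hypothetical example. BigQuery runs the query on managed, massively parallel workers without any cluster to provision or size.

```python
from google.cloud import bigquery

client = bigquery.Client(project="example-project")  # hypothetical project id

# An aggregate query over a gold-layer data mart; execution is fully
# serverless, so there is no infrastructure to administer.
QUERY = """
SELECT customer_segment,
       SUM(order_total) AS revenue
FROM gold.orders_mart
GROUP BY customer_segment
ORDER BY revenue DESC
"""

for row in client.query(QUERY).result():
    print(row.customer_segment, row.revenue)
```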

Results and customer value

A data architecture was successfully implemented in Google Cloud according to the requirements of the specific stakeholders. An ETL process was implemented that transferred raw data into the generalized Data Vault 2.0 schema and forwarded it to the data marts. This ETL process was based on Apache Airflow combined with Google Dataflow, as well as individual database queries in SQL.
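The orchestration step can be sketched as follows: Airflow starts a serverless Dataflow job that ingests raw files into the bronze layer. The sketch assumes the Google-provided GCS_Text_to_BigQuery batch template; the bucket, project and parameter values are placeholders rather than the project’s real configuration.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataflow import (
    DataflowTemplatedJobStartOperator,
)

with DAG(
    dag_id="raw_to_vault_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Airflow hands the heavy lifting of raw-data ingestion to a
    # serverless Dataflow job built from a Google-provided template.
    ingest_raw = DataflowTemplatedJobStartOperator(
        task_id="ingest_raw_files",
        project_id="example-project",
        location="europe-west1",
        template="gs://dataflow-templates/latest/GCS_Text_to_BigQuery",
        parameters={
            "inputFilePattern": "gs://example-bucket/raw/*.json",
            "outputTable": "example-project:bronze.raw_events",
            "bigQueryLoadingTemporaryDirectory": "gs://example-bucket/tmp",
            "JSONPath": "gs://example-bucket/schemas/raw_events.json",
            "javascriptTextTransformGcsPath": "gs://example-bucket/udf/transform.js",
            "javascriptTextTransformFunctionName": "transform",
        },
    )
```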

The data architecture was implemented with the help of Google BigQuery due to its outstanding performance. In contrast to other database solutions, BigQuery did not incur any initial licensing and setup costs for our customer.

With the help of the solution we implemented, our customer was now able to analyse business-critical data on a terabyte scale within only a few minutes and forward the results to stakeholders. Thanks to the cloud-native architecture, our customer did not need to make any initial investments in infrastructure and licensing; costs were saved through a pay-as-you-go model.

In close cooperation with the Google Cloud team, an IT architecture was created that perfectly suited our customer’s needs.