Sharing top billing on the list of data science capabilities, machine learning and artificial intelligence are not just buzzwords – many organizations are eager to adopt them. However, the often forgotten fundamental work necessary to make it happen – data literacy, collection, and infrastructure – must be accomplished prior to building intelligent data products.
If we look at the hierarchy of needs in data science implementations, we’ll see that the next step after gathering your data for analysis is data engineering. This discipline is not to be underestimated, as it enables effective data storing and reliable data flow while taking charge of the infrastructure.
Data science layers towards AI, Source: Monica Rogati
Data engineering is a set of operations aimed at creating interfaces and mechanisms for the flow and access of information. It takes dedicated specialists – data engineers – to maintain data so that it remains available and usable by others. In short, data engineers set up and operate the organization’s data infrastructure preparing it for further analysis by data analysts and scientists.
Explaining Data Engineering and Data Warehouse
To understand data engineering in simple terms, let’s turn to databases – collections of consistent and accessible information. Within a large organization, there are usually many different types of operations management software: ERP, CRM, production systems, and more. And so there are many different databases as well. As the number of data sources multiplies, having data scattered all over in various formats prevents the organization from seeing the full and clear picture of their business state. It’s necessary to figure out how to get sales data from its dedicated database talk with inventory records kept in a SQL server, for instance. This creates the necessity for integrating data in a unified storage system where data is collected, reformatted, and ready for use – a data warehouse. Now, data scientists and business intelligence (BI) engineers can connect to the warehouse, access the needed data in the needed format, and start yielding valuable insights from it.
The process of moving data from one system to another, be it a SaaS application, a data warehouse (DW), or just another database, is maintained by data engineers. A data architect, however, is responsible for building a DW – designing its structure, defining data sources, and choosing a unified data format.
The process of transporting data from sources into a warehouse
We’re going to elaborate on the details of the data flow process, explain the nuances of building a data warehouse, and describe the role of a data engineer.
Data warehouse
A data warehouse is a central repository where raw data is transformed and stored in query-able forms. Without DW, data scientists have to pull data straight from the production database and may wind up reporting different results to the same question or cause delays and even outages. Serving as an enterprise’s single source of truth, the data warehouse simplifies the organization’s reporting and analysis, decision making, and metrics forecasting.
Technically, a data warehouse is a relational database optimized for reading, aggregating, and querying large volumes of data. Surprisingly, DW isn’t a regular database. How so?
- First of all, they differ in terms of data structure. A regular database normalizes data excluding any data redundancies and separating related data into tables. This takes up a lot of computing resources, as a single query combines data from many tables. Contrarily, a DW uses simple queries with few tables to improve performance and analytics.
- Secondly, aimed at day-to-day transactions, databases don’t usually store historic data, while for warehouses, it’s their main purpose, as they collect data from multiple periods. DW simplifies a data analyst’s job, allowing for manipulating all data from a single interface and deriving analytics, visualizations, and statistics.
- Notably, a data warehouse doesn’t support as many concurrent users as a database, being designed for a small group of analysts and business users.
Data warehouse architecture
To structure a data warehouse, four basic components are combined.
Data warehouse storage. The foundation of data warehouse architecture is a database that stores all enterprise data allowing business users to access it for drawing valuable insights. Data architects usually decide between on-premises and cloud-hosted databases noting how the business can benefit from this or that solution. Although cloud environment is more cost-efficient, easier to scale up or down, and isn’t limited to a prescribed structure, it may lose to on-prem solution in terms of querying speed and security.
A data architect can also design collective storage for your data warehouse – multiple databases running in parallel. This will improve the warehouse’s scalability.
Metadata. Adding business context to data, metadata helps transform it into comprehensible knowledge. Metadata defines how data can be changed and processed. It contains information about any transformations or operations applied to source data while loading it into the data warehouse.
Data warehouse access tools. Designed to facilitate interactions with DW databases for business users, access tools need to be integrated with the warehouse. They have different functions, for example, query and reporting tools are used for generating business analysis reports. Another type of access tools – data mining tools – automate the process of finding patterns and correlations in large amounts of data based on advanced statistical modeling techniques.
Data warehouse management tools. Spanning the enterprise, data warehouse deals with a number of management and administrative operations. That’s why managing a DW requires a solution that can facilitate all these operations. Dedicated data warehouse management tools exist to accomplish this.
Building Data Warehouse: Understanding the Data Pipeline
While data warehouse concerns the storage of data, data pipeline ensures the consumption and handling of it. A Data pipeline is a sum of tools and processes for performing data integration.
It captures datasets from multiple sources and inserts them into some form of database, another tool or app, providing quick and reliable access to this combined data for the teams of data scientists, BI engineers, data analysts, etc.
Constructing data pipelines is the core responsibility of data engineering. It requires advanced programming skills to design a program for continuous and automated data exchange. As this process is quite complex, it’s viable for organizations whose products have found the market, to pursue further growth. That said, a data pipeline is commonly used for:
- moving data to the cloud or to a data warehouse
- wrangling the data into a single location for convenience in machine learning projects
- integrating data from various connected devices and systems in IoT
- copying databases into a cloud data warehouse
- bringing data to one place in BI for informed business decisions
Nevertheless, young companies and startups with low traffic will make better use of SQL scripts that will run as cron jobs against the production data.
Creating a data pipeline step by step
Pipeline infrastructure varies depending on the use case and scale. However, it always implements a set of ETL operations:
1. Extracting data from source databases
2. Transforming data to match a unified format for specific business purposes
3. Loading reformatted data to the data warehouse
ETL operations, Source: Alooma
1. Retrieving incoming data. At the start of the pipeline, we’re dealing with raw data from numerous separate sources. Data engineers write pieces of code – jobs – that run on a schedule extracting all the data gathered during a certain period.
2. Standardizing data. Data from disparate sources is often inconsistent. So, for efficient querying and analysis, it must be modified. Having data extracted, engineers execute another set of jobs that transforms it to meet the format requirements (e.g. units of measure, dates, attributes like color or size.) Data transformation is a critical function, as it significantly improves data discoverability and usability.
3. Saving data to a new destination. After bringing data into a usable state, engineers can load it to the destination that typically is a relational database management system (RDBMS), a data warehouse, or Hadoop. Each destination has its specific practices to follow for performance and reliability.
4. Maintaining alterations. Despite being automated, a data pipeline must be constantly maintained by data engineers: they repair failures, update the system by adding/deleting fields, or adjust the schema to the changing needs of the business.
Data pipeline challenges
Setting up secure and reliable data flow is a challenging task. There are so many things that can go wrong during data transportation: Data can be corrupted, hit bottlenecks causing latency, or data sources may conflict generating duplicate or incorrect data. Getting data into one place requires careful planning and testing to filter out junk data, eliminating duplicates and incompatible data types, to obfuscate sensitive information while not missing critical data.
Juan De Dios Santos, an experienced practitioner in the data industry, outlines two major pitfalls in building data pipelines:
- lacking relevant metrics
- underestimating data load
“The importance of a healthy and relevant metrics system is that it can inform us of the status and performance of each pipeline stage, while with underestimating the data load, I am referring to building the system in such a way that it won’t face any overload if the product experiences an unexpected surge of users,” elaborates Juan.
Beyond Data Warehousing: Big Data Engineering
Speaking about data engineering, we can’t ignore the big data concept. Grounded in the three Vs – volume, velocity, and variety – big data usually floods large technology companies like YouTube, Amazon, or Instagram. Big data engineering is about building massive reservoirs and highly scalable and fault-tolerant distributed systems able to inherently store and process data.
Big data architecture differs from the conventional data handling, as here we’re talking about such massive volumes of rapidly changing information streams that a data warehouse isn’t able to accommodate. The architecture that can handle such an amount of data is a data lake.
Data lake architecture
A Data lake is a vast pool for saving data in its native, unprocessed form. A data lake stands out for its high agility as it isn’t limited to a warehouse’s fixed configuration.
A data lake uses the ELT approach swapping transform and load operations in the standard ETL operations sequence. Supporting large storage and scalable computing, a data lake starts data loading immediately after extracting it. This allows for increasing volumes of data to be processed. A data lake is very convenient, for instance, when the purpose of data hasn’t been determined yet – since a data lake stores it and later processes it on-demand.
Hadoop platform – a hands-on example of a data lake
A popular open source example of a data lake platform is Hadoop. It’s a large-scale data processing framework based on Java. This software project is capable of structuring various big data types for further analysis. The platform allows for splitting data analysis jobs across various computers and processing them in parallel.
The Hadoop ecosystem consists of the following set of tools.
Hadoop Distributed File System (HDFS). Java-based big data storage system, HDFS includes two components: NameNode stores metadata while DataNode is responsible for actual data and performs operations according to NameNode.
MapReduce. It’s a framework for writing applications that process the data stored in HDFS. Parallel in nature, MapReduce programs are effective for performing big data analysis using multiple machines in the cluster.
YARN. As the operating system of Hadoop, YARN helps manage and monitor workloads.
YARN architecture, Source: Data Flair
Hive. A system for summarizing, querying and analyzing large datasets, Hive uses its own language – HQL- which is similar to SQL. HiveQL automatically translates SQL-like queries into MapReduce jobs for execution on Hadoop.
Pig. Having similar goals with Hive, Pig also has its own language – PigLatin. When to use Pig and when to use Hive is the question. Pig is a better option for programming purposes, while Hive is mainly used by data analysts for creating reports.
HBase. A NoSQL database built on top of HDFS that provides real-time access to read or write data.
There are many other components that empower Hadoop functionality: HCatalog, Avro, Thrift, Apache Drill, Mahout, Sqoop, Flume, Ambari, Zookeeper, Oozie. You can read about their specifications in Hadoop documentation.
Stepping up to Enterprise Data Hub (EDH)
Besides big data capabilities, data lakes also brought new challenges for governance and security, and the risk of turning into a data swamp – a collection of all kinds of data that is neither governable nor usable. To tackle these problems, a new data integration approach emerged – data hub – where data is physically moved and re-indexed into a new system.
In the data hub architecture, data from many operational and analytic sources is acquired through replication and/or publish-and-subscribe interfaces. As data changes occur, replication uses changed data capture (CDC) to continuously populate the hub, while publish-and-subscribe allows the hub to subscribe to messages published by data sources. EDH data-centric storage architecture enables executing applications where the data resides.
So what are the benefits of EDH over traditional data consolidation? Let’s have a closer look.
Easy connection of new data sources. Contrarily to DW that lacks flexibility, modern EDH can connect multiple systems on the fly, integrating the diverse data types.
Up-to-date data. Outdated data can be an issue with a DW, but EDH overcomes it, presenting fresh data ready for analysis right after capturing it.
Tools. EDH offers powerful tools for processing and analyzing data.
Rapid deployment. While DW system deploy can last months and even years, EDH deployment is a matter of days or weeks.
An EDH can be integrated with a DW or a data lake to streamline data processing and deal with the common challenges these architectures face.
The Role of Data Engineer
Now that we know what data engineering is concerned with, let’s delve into the role that specializes in creating software solutions around big data – a data engineer.
Juan De Dios Santos, a data engineer himself, defines this role in the following way: “In a multidisciplinary team that includes data scientists, BI engineers, and data engineers, the role of the data engineer is mostly to ensure the quality and availability of the data.” He also adds that a data engineer might collaborate with the others at the time of implementing or designing a data-related feature (or product) such an A/B test, the deployment of a machine learning model, and the refinement of an existing data source.
Skills and qualifications
Overlapping skills of the software engineer, data engineer, and data scientist, Source: Ryan Swanstrom
Software engineering background. Data engineers use programming languages to enable clean, reliable, and performative access to data and databases. Juan points out their ability to work with the complete cycle of software development including ideation, architecture design, deployment and DevOps, prototyping, testing, defining metrics, alerts, and monitoring systems. Data engineers are experienced programmers in at least Python or Scala/Java.
Data-related skills. “A data engineer should have knowledge of multiple kinds of databases (SQL and NoSQL), data platforms, concepts such as MapReduce, batch and stream processing, and even some basic theory of data itself, e.g. data types, and descriptive statistics,” underlines Juan.
Systems creation skills. Data engineers need to have experience with various data storage technologies and frameworks they can combine to build data pipelines.
Toolkit
Data engineering process involves using different data storage and manipulation tools together. So a data engineer should have a deep understanding of many data technologies to be able to choose the right ones for a certain job.
Tools for writing ETL pipelines:
Airflow. This Python-based workflow management system was initially developed by Airbnb to rearchitect their data pipelines. Migrating to Airflow, the company reduced their experimentation reporting framework (ERF) run-time from 24+ hours to about 45 minutes. Airflow’s key feature is automating scripts to perform tasks. Among the Airflow’s pros, Juan highlights its operators: “they allow us to execute bash commands, run a SQL query or even send an email”. Juan also stresses Airflow’s ability to send Slack notifications, complete and rich UI, and the overall maturity of the project. On the contrary, Juan dislikes that Airflow only allows for writing jobs in Python.
Cloud Dataflow. A cloud-based data processing service, Dataflow is aimed at large-scale data ingestion and low-latency processing through fast parallel execution of the analytics pipelines. Dataflow has a benefit over Airflow, as it supports multiple languages like Java, Python, SQL, and engines like Flink and Spark. It is also well maintained by Google Cloud. However, Juan warns that Dataflow’s high cost might be a disadvantage for some.
Kafka. From a messaging queue to a full-fledged event streaming platform, Apache Kafka
distributes data across multiple nodes for a highly available deployment within a single data center or across multiple availability zones. As an abstraction of a distributed commit log, it provides durable storage.
Other popular ETL and data solutions are the Stitch platform for rapidly moving data and Blendo, a tool for syncing data from various sources to a data warehouse.
Warehouse solutions. Widely used on-premise data warehouse tools include Teradata Data Warehouse, SAP Data Warehouse, IBM db2, and Oracle Exadata. Most popular cloud-based data warehouse solutions are Amazon Redshift and Google BigQuery.
Big data tools. Big data technologies that a data engineer should be able to utilize (or at least know of) are Hadoop, distributed file systems such as HDFS, search engines like Elasticsearch, ETL and data platforms: Apache Spark analytics engine for large-scale data processing, Apache Drill SQL query engine with big data execution capabilities, Apache Beam model and software development kit for constructing and running pipelines on distributed processing backends in parallel.
In closing: data engineer vs data scientist
It’s not rare that a data engineer is confused with data scientist. We asked Alexander Konduforov, a data scientist at AltexSoft, with over 10 years of experience, to comment on the difference between these two roles:
“Both data scientists and data engineers work with data but solve quite different tasks, have different skills and use different tools. Data engineers build and maintain massive data storage and apply engineering skills: programming languages, ETL techniques, knowledge of different data warehouses, and database languages. Whereas, data scientists clean and analyze this data, get valuable insights from it, implement models for forecasting and predictive analytics, and mostly apply their math and algorithmic skills, machine learning algorithms and tools.”
Alexander stresses that accessing data can be a difficult task for data scientists for several reasons:
- Vast data volumes require additional effort and specific engineering solutions to access and process it in a reasonable amount of time
- Data is usually stored in lots of different storages and formats. In this case, it makes sense first to clean it up taking dataset preparation measures, transform, merge and move to more structured storage, like a data warehouse. This is typically a task for data architects and engineers.
- Data storages have different APIs for accessing them. In this case, data scientists need data engineers to implement the most efficient and reliable pipeline of getting data for their purpose.
As we can see, working with data storages built by data engineers, data scientists become their “internal clients.” That’s where their collaboration takes place.
Source : https://www.altexsoft.com/