Written by Anna Geller |
We could start this article by sharing statistics showing how quickly the amount of data we generate every day is growing. We could delve into how Big Data and AI are rapidly changing so many areas of our lives. But we no longer have to explain it. Everybody intuitively knows it. So what’s missing? The right people with the right skill set.
Data Engineering is not equivalent to Software Engineering or Data Science
Nearly every company these days strives to become more data-driven. To accomplish that, many firms are hiring software engineers assuming that the challenge is purely on a technical level. But this assumption is overly simplistic and does not acknowledge the specialized skills required to handle data projects at a large scale.
Having worked on both data and software engineering projects, I believe that they require people with different backgrounds. Both roles are needed, and in a wider sense, they complement each other. In software engineering projects, we need programmers who know how to implement well-constructed abstractions and applications. In contrast, data engineering projects involve extracting and transforming vast amounts of data at scale, often in real time, and consolidating this data into a broader data ecosystem. Those two types of projects entail considerably different processes, tooling, and skill sets.
Code is important, but data is critical
There is no denying that code that processes your data must be thoroughly tested and well-engineered. However, software engineers often place more importance on code than on data. Code is a vehicle to provide value, but it doesn’t provide value by itself. In the end, all the tooling, refactoring, modularity, continuous deployment, and full test coverage don’t help much if the data is wrong, because then the code’s ultimate objective is not satisfied. This is not to say that those aspects are irrelevant in data projects. Quite the opposite: poorly written code can lead to many problems down the road, especially with respect to long-term maintainability. But data quality should have priority over purely technical aspects.
Testing both code and data
Software engineers pioneered unit testing and test-driven development. By leveraging automation, we can ensure that our code works as expected at all times. Still, for data projects, unit tests are not enough: without tests on the data itself, they cover only part of the equation. Imagine that your code is perfectly designed and does what it needs to do, but your data suddenly comes in corrupt, stale, or incomplete due to a failure in the source system that generates it. The unit tests will still pass, but the data tests will fail, thereby alerting you to a data quality issue. Additionally, unit tests are evaluated only upon a new build and deployment (for instance, when you commit and push changes to your code), while data tests run continuously in your data pipelines. Therefore, we need both. Unit tests protect against code and program logic failures, and data tests protect against ingesting bad data into our system. If you want to learn more about data testing, there is a great open-source Python library built for this purpose.
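To make the distinction concrete, here is a minimal sketch in plain Python. It does not use the library mentioned above; the function names (`to_cents`, `check_batch`) and the row fields (`amount`, `loaded_at`) are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta, timezone


def to_cents(amount: float) -> int:
    """Transformation logic: convert a currency amount to integer cents."""
    return round(amount * 100)


def test_to_cents() -> None:
    """Unit test: checks the code, independent of any production data."""
    assert to_cents(12.34) == 1234
    assert to_cents(0) == 0


def check_batch(rows: list[dict], now: datetime, max_age: timedelta) -> list[str]:
    """Data test: checks the data itself and is meant to run on every pipeline run."""
    if not rows:
        return ["batch is empty"]
    issues = []
    if any(row.get("amount") is None for row in rows):
        issues.append("missing amount")
    newest = max(row["loaded_at"] for row in rows)
    if now - newest > max_age:
        issues.append("data is stale")
    return issues


now = datetime(2024, 1, 2, tzinfo=timezone.utc)
fresh = [{"amount": 12.34, "loaded_at": now - timedelta(hours=1)}]
broken = [{"amount": None, "loaded_at": now - timedelta(days=3)}]

test_to_cents()  # the code is fine in both scenarios
print(check_batch(fresh, now, timedelta(days=1)))   # []
print(check_batch(broken, now, timedelta(days=1)))  # ['missing amount', 'data is stale']
```

The unit test passes regardless of which batch arrives, while the data check flags the second batch only. In a real pipeline, a non-empty list of issues would typically stop the load or trigger an alert.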
The outcome of a project
A deployment artifact in software engineering can be a package, an API, a website, a mobile app, or a container image. In contrast, data projects deal with different artifacts. The outcome of a data pipeline can be a table in a data warehouse, a parquet file in a data lake(house), an automated report, or a productionized data science model.
Keeping it simple
It’s not meant to be a negative statement, but software engineers can overcomplicate things at times when working on data projects. For instance, something that could be solved by a data engineer by means of a single SQL query may end up becoming a fully-fledged backend API with an ORM layer if the same problem is tackled by a software engineer. Building APIs is great (as is the API-first approach), but often we may be better served by implementing a simple solution first.
We often hear the advice to use “the right tool for the job”. But we also need to consider the right persona with the right skill set for the job. Following the same example, the API may be a better solution if the use case requires an audit trail of each request or different user access permissions, or if it needs to be connected to a front-end. But if you need to regularly move data from A to B at scale and bring it to the right format along the way, a data engineer (and SQL) will likely fit the task better.
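As a rough illustration of the "single SQL query" route, the sketch below uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for a real warehouse. The table and column names (`raw_orders`, `daily_revenue`, `amount_cents`) are made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_orders (ordered_at TEXT, amount_cents INTEGER);
    INSERT INTO raw_orders VALUES
        ('2024-01-01 09:15:00', 1250),
        ('2024-01-01 17:40:00', 750),
        ('2024-01-02 11:05:00', 3000);
    CREATE TABLE daily_revenue (order_date TEXT, revenue REAL);
""")

# One statement moves the data from A to B and reshapes it along the way:
# timestamps become dates, cents become currency units, rows are aggregated.
conn.execute("""
    INSERT INTO daily_revenue (order_date, revenue)
    SELECT date(ordered_at), SUM(amount_cents) / 100.0
    FROM raw_orders
    GROUP BY date(ordered_at)
""")
conn.commit()

daily_revenue = conn.execute(
    "SELECT order_date, revenue FROM daily_revenue ORDER BY order_date"
).fetchall()
print(daily_revenue)  # [('2024-01-01', 20.0), ('2024-01-02', 30.0)]
```

There is no ORM, no request handling, and no extra service to deploy here. If requirements such as audit trails or per-user permissions appear later, that is the point at which an API starts to earn its complexity.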
How to split tasks between a software engineer and a data engineer?
Imagine the following scenario: you need to build a custom analytical application that should display specific visualizations and needs to be embedded in a mobile and web application. The underlying data has to be consolidated since the charts require a combination of data from various source systems, and specific business logic needs to be applied when processing this data.
What’s a possible solution? You could assign a data engineer to the task of building a data mart in your cloud data warehouse. He or she would be responsible for extracting the source data, transforming it into a set of metrics and dimensions, and doing it all at scale and on schedule by building processes ensuring that data stays up to date.
In contrast, the software engineer can take care of building and deploying a responsive React application that is properly designed, secured, and tested, and that presents data from the data mart to end users. Ideally, you would also have an API layer between the data serving layer and the application presentation layer.
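The division of labor described in this scenario could look roughly like the sketch below. It is only an assumption-laden illustration: `sqlite3` stands in for the cloud data warehouse, and the names `build_data_mart`, `revenue_payload`, `web_sales`, `store_sales`, and `mart_revenue` are hypothetical.

```python
import json
import sqlite3


def build_data_mart(conn: sqlite3.Connection) -> None:
    """Data engineer's side: consolidate several sources into one mart table."""
    conn.executescript("""
        CREATE TABLE web_sales (region TEXT, amount REAL);
        CREATE TABLE store_sales (region TEXT, amount REAL);
        INSERT INTO web_sales VALUES ('EU', 100.0), ('US', 250.0);
        INSERT INTO store_sales VALUES ('EU', 50.0);
        CREATE TABLE mart_revenue AS
            SELECT region, SUM(amount) AS revenue
            FROM (SELECT * FROM web_sales UNION ALL SELECT * FROM store_sales)
            GROUP BY region;
    """)


def revenue_payload(conn: sqlite3.Connection) -> str:
    """API layer: the front end only ever sees this JSON, never the raw tables."""
    rows = conn.execute(
        "SELECT region, revenue FROM mart_revenue ORDER BY region"
    ).fetchall()
    return json.dumps([{"region": region, "revenue": revenue} for region, revenue in rows])


conn = sqlite3.connect(":memory:")
build_data_mart(conn)
print(revenue_payload(conn))
# [{"region": "EU", "revenue": 150.0}, {"region": "US", "revenue": 250.0}]
```

The mart table is the contract between the two roles. The data engineer can change how it is populated, and the software engineer can change how it is displayed, without either one needing to touch the other's code.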
Reducing friction in the development process
- Data engineers can build a semantic data model, as well as the underlying data pipelines that keep the data up to date.
- The same semantic data model can be used by software engineers implementing front-end applications for analytics and reporting.
If you want to try GoodData, there is a free Community Edition. You can install it with a single command that runs the entire platform in a single Docker container:
docker run --name gooddata -p 3000:3000 -p 5432:5432 \
  -e LICENSE_AND_PRIVACY_POLICY_ACCEPTED=YES gooddata/gooddata-cn-ce:latest
In the following article, you can find step-by-step instructions on how to use this Docker container to build Business Intelligence applications:
If you want to deploy it to your Kubernetes cluster, GoodData provides a Helm chart that includes all parts of the architecture in a single package (among others, the GoodData UI, a Postgres database, the analytical engine, and an intelligent caching layer using Redis).
In this article, we looked at cases in which a software engineer or a data engineer might be the better person for the task at hand. We looked at the differences between software engineering and data projects and at how we can reduce friction between them by using the API-based Business Intelligence platform provided by GoodData.
Thank you for reading!