Written by Anna Geller
Imagine that you’re asked to estimate how long a data project will take. How can you realistically assess its timeline? When we think about all the unknowns we need to take into account, the only correct answer seems to be “It depends!”. Nearly every data project has too many unknowns, which makes most estimates wrong.
1. It all boils down to data availability and quality
If you first need to get data from five different sources and establish a regular pipeline for each of them before even starting the actual project you were assigned, then a one-day task can easily extend to several days or weeks. Even if the data is already available but of poor quality, your project automatically extends beyond its planned duration.
2. Requirement engineering usually takes longer than we think
In most data projects, the requirements are almost never entirely clear. Regardless of how hard the product managers may try to document the scope and acceptance criteria, often you only know what the project really entails once you have started it, i.e., once you start talking to stakeholders and diving deeper into the problem that needs to be solved. That’s one of the reasons why agile development gained so much popularity.
Data projects inherently lend themselves to iteration. In the first phase, you may be examining the problem and talking to business stakeholders to learn which data and other requirements are needed to solve it. In the next phase, you gather data and build pipelines to ensure you have everything you need in your staging area. Then, you analyze it and see what transformations or additional data may be required for the problem at hand. Next, you think about data schemas and the processes that will consume that data, such as ML models or business logic. And finally, you can ask the stakeholders for their feedback on your initial prototype (the dataset, table, report, ML model, business logic, or API). Based on that feedback, you may even need to repeat the full cycle.
3. You found the “real” business problem
Sometimes you may be assigned to do project A, but after taking a deeper look into the actual data, you find out that the “real” problem that needs to be solved is something else. Perhaps your task was to combine data for a specific report, and you realize that a crucial piece of data required for it is not even collected, or that this data makes no sense.
Also, sometimes, when discussing one problem with business stakeholders, we encounter another, perhaps a much bigger one, or notice that there are other issues that must be tackled first in order to start working on what we initially should be doing… Real-life data projects are riddled with unknowns.
TL;DR: We can only get a realistic estimate once we have talked to the business stakeholders and looked at the data.
4. The amount and quality of available documentation
You have most likely encountered this situation at some point in your data career: you start working with SQL tables and notice mysterious column names that tell you absolutely nothing about the data stored there, and you are unsure about the units of measure due to the lack of any data dictionary. You start searching for an explanation in your company’s internal documentation, shared emails, or code comments. Sadly, you find nothing. You ask business users about this data. They tell you it was created by somebody who left the company, and nobody really knows the full story behind that data, even though it’s used in five different business reports and data pipelines.
Sounds like utter chaos? Perhaps it’s exaggerated, but I do think it’s not far from the reality of data management in many companies. **When you then have to build some data product with this data, what estimate will you give?** It’s hard to plan anything if you don’t even understand the column names and units of measure of that data.
5. Expecting best-case scenarios
When estimating projects, we often only account for what we think the work might take under “normal” circumstances. This means we plan for what we would typically expect if we encounter no issues or roadblocks. But if you think about it: how often have you come across issues or roadblocks along the way? Probably quite often.
There can always be broken dependencies, missing data, or other impediments that we must take into account. The only viable solution is to plan enough buffer for such unexpected events. Realistic planning is better than the overenthusiastic assumption that everything will follow the happy path.
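One simple way to bake such a buffer into an estimate is the classic three-point (PERT) technique: instead of planning only for the happy path, you combine an optimistic, a most-likely, and a pessimistic estimate. The sketch below is my own illustration (the numbers are hypothetical, not from the article):

```python
def pert_estimate(optimistic: float, most_likely: float, pessimistic: float) -> float:
    """Three-point (PERT) estimate: a weighted mean that keeps the most-likely
    value dominant but pulls the result toward the pessimistic tail."""
    return (optimistic + 4 * most_likely + pessimistic) / 6

# Hypothetical task: the happy path says 2 days, most likely 3,
# but a broken pipeline or missing data could stretch it to 10.
estimate = pert_estimate(optimistic=2, most_likely=3, pessimistic=10)
print(f"Planned duration: {estimate:.1f} days")  # noticeably above the happy-path guess
```

The point is not the exact formula but the habit: writing down a pessimistic scenario forces you to acknowledge the roadblocks you would otherwise ignore.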
6. Waiting for external resources
In many cases, your project can also be delayed by the time you spend waiting for DevOps resources to create new users, provide access to databases, or provision infrastructure. You also often need to wait for colleagues to hand over documentation, transfer their knowledge, or point you to the location where data is stored. Then you need to wait for others to review and approve your code. These are all essential factors that need to be taken into account when estimating the project duration.
7. Emergency, support & maintenance scenarios
You may have given an estimate of one week for a specific project. Meanwhile, a super-critical process breaks and you need to spend one or two days fixing it, which delays your project delivery. Similarly, when junior developers need support, you’ll obviously give them a hand regardless of your project deadlines. The same is true when a colleague is sick and you need to take over some of their tasks.
These factors are not specific to data projects, but they also need to be taken into account before committing to any strict project deadline.
8. The lack of process knowledge & resultant incorrect assumptions
Knowledge silos are probably just as dangerous as data silos. What may seem obvious to business users may not be obvious to data scientists and engineers.
Even within the data team, there may be manual processes, mysterious files shared via SFTP, undocumented mapping tables, and processes you will need to familiarize yourself with in order to complete the project you are currently working on, which, again, can affect the duration of your data project. Process knowledge becomes particularly important if your task is to automate something. As Bill Gates famously said:
“The first rule of any technology used in a business is that automation applied to an efficient operation will magnify the efficiency. The second is that automation applied to an inefficient operation will magnify the inefficiency.”
9. Conflicting interests about the scope between business and IT
Typically, business users want more features, and they want them NOW. From their perspective, their requests are of the highest priority. In contrast, IT wants standardization and easier maintenance.
This illustrates that many stakeholders (with potentially conflicting interests) need to agree on how something should be implemented before you can even start working on it, which makes it challenging to give a good estimate. If people agree on option A, the estimate could be two days. If they go for option B, it may turn into two weeks. And what if they don’t reach an agreement at all?
10. Dependencies in code, environments, and data
Finally, there is always the potential issue of dependencies that result in extra time needed to consolidate code and data across different projects and teams. For instance, you may have built an amazing authorization layer for your backend API, but when talking to other developers, it turns out that you need to standardize on a common Auth mechanism across different projects, and that DevOps requires Active Directory integration for it. You then need to consolidate your work with others and agree on a common interface that can be reused across different APIs.
How to reduce dependencies in your Business Intelligence technology stack?
Another type of challenge that we need to consider is the hand-off between data engineers, data analysts, and software engineers embedding analytics in custom business applications. In order to facilitate this process, you may want to use a platform that makes it easy to run your analytical applications close to your data. With the cloud-native GoodData platform, you can spin up an entire analytical BI stack in a single Docker container. Then, within your browser, you can connect to a data warehouse of your choice (Amazon Redshift, Snowflake, …) and start building visualizations. The platform makes it easy to combine your charts and KPIs into analytical dashboards that you can share with others in your organization. Additionally, you can embed your visualizations into any custom application by leveraging the developer-friendly React-based Accelerator Toolkit.
```shell
docker run --name gooddata -p 3000:3000 -p 5432:5432 \
  -e LICENSE_AND_PRIVACY_POLICY_ACCEPTED=YES gooddata/gooddata-cn-ce:latest
```
For a deep dive into the cloud-native BI platform and all its benefits, you can have a look at my previous article. And if you want to try it directly in a Kubernetes cluster, you can use the available Helm chart, which makes the deployment much easier.
In this article, we discussed different aspects that can impact the duration of your data projects. Estimates are important for prioritizing work across teams. Nonetheless, data projects are often riddled with so many unknowns that realistic estimates can only be given by explaining our assumptions, accounting for potential roadblocks and consolidation efforts, and planning enough buffer for unexpected events.
If you provide a larger-than-expected estimate for a project’s duration, it doesn’t mean that you are slow or lazy, but rather that you know from experience how long data projects really take. It’s better to overdeliver than to disappoint. Everything we create needs to be maintained, and that alone can take up a significant portion of our time.
Thank you for reading!