Introducing Last Mile ETL: One Tool for Better Data Transformation


Imagine you connect a database to analytics and the data is not in the shape you would like to have it. For example, coordinates should be separated by latitude and longitude, some values are in a different format or type, and some tables can even have a bad design. Normally, you would need to spend a lot of time context-switching between the ELT/ETL pipeline and analytics, where, in the transformation (T) phase you would need to edit code that transforms data from the application shape to the analytics shape. This experience is sort of normal in our industry but does it have to be? For this reason, we are introducing Last Mile ETL! With it, you can do everything in one tool which significantly improves the ability to iterate on and the speed of development, customization, and security. You may now ask how? Well, enough words, it’s time for an example. Let’s deep dive into it!
Exploratory Analytics With Last Mile ETL
What do I mean by "exploratory analytics"? Let’s say, we have just three tables with data, and we would like to explore its value or learn some facts based on this data. It means that instead of one concrete goal to achieve, we will primarily try to explore some value in this data! With defined exploratory analytics, here are three tables in a database (Airports, Country list, and GDP — Gross Domestic Product):

You can see that the coordinates are in one column called coordinates instead of latitude and longitude, or value in GDP the table is text rather than numerics. We can address these issues with the help of Last Mile ETL inside the analytics.
Let’s connect the database to the analytics (if you are not familiar with GoodData, I encourage you to check the documentation). The result is the following:
What you can see in the image above are datasets. We can convert a dataset to a so-called SQL dataset:
The SQL dataset gives us the possibility to write SQL queries that are executed directly in the database. Let’s just check what types of airports we have in the database:
It seems that the Airports table contains several types of airports such as heliports, or even closed airports. Let’s say that I am interested just in the medium and large airports — it’s not a problem at all. I don’t need to go to the ELT/ETL pipeline, I simply create a new SQL dataset called Airports transformed directly in the analytics, and I can do that with the following SQL code:
You can see that I can do it with quite a simple SQL query and the result is the following:
Unfortunately, there is one more problem we need to solve. The coordinates are not in the format we need for analytics. We need to break this column into longitude and latitude to render a geo chart. Let’s edit the Airports transformed dataset:
Now, with this little transformation, we can simply render a geo chart to see the all airports in the whole world:
As this was rather simple, let’s explore something more valuable — what is the correlation between GDP and the number of airports in a country?
First, we can just list the values of Airports and GDP tables:
The bad news is that the Airports table does not have the full country names (for example, Albania) but only iso codes of countries (for example, AL). On the other hand, the GDP table does not have iso codes but only has the full name of the country. Sadly, there is one more issue with the column value. You can see that the values contain commas and this is not the correct format for numbers. Let’s solve the mentioned problems. We can start with the Country list table where both the full names and the iso codes are found:
Now, we just need to create a new SQL dataset for the GDP where we join Country list to add the missing iso code (or as I named it — country code), and remove commas from the column value and convert it to numeric type:
Perfect, we have the GDP SQL dataset that we need! Also, to count the airports correctly, I will add an identifior to the Airport transformed SQL dataset:
Both of the SQL datasets now contain country codes, we can simply join them based on these country codes:
The result is as follows. I also removed the Country list dataset as it does not have any value for us right now:
Let’s move to the Analyze tab to find out what the correlation between the GDP and the number of airports in the country is. First of all, we need to calculate the number of airports in the country (for more information, check how to create metrics in GoodData):
The last step is to calculate the correlation between GDP and the number of airports in the country:
Everything is transformed and computed, the result is as follows:
Conclusion
Together we explored what Last Mile ETL is, and how you can benefit from it. In short, you can do everything within the analytics and, as a result, avoid the need for context-switching. On top of that, you can benefit from the security implications — that being, you do not need to give access to the database in order to make simple changes within the analytics. A further plus point is the ability to iterate, meaning that you can start with something simple, and then move to more complex results. Finally, thanks to GoodData's analytics-as-code approach you can easily version everything in Git and thus apply software engineering best-practices to the analytics.
Why not try our 30-day free trial?
Fully managed, API-first analytics platform. Get instant access — no installation or credit card required.
Get started