Written by Patrik Braborec |
You might have heard a lot about the benefits of Apache Arrow's columnar storage, and maybe you're already utilizing it for certain use cases. But have you considered leveraging Apache Arrow for analytical purposes? In this article, we will cover how GoodData uses Apache Arrow as a core component of our analytics platform, where it underpins our semantic model and high-performance caching.
GoodData is a robust analytics platform made up of various components. These components handle tasks such as computing, querying data sources, post-processing, and caching. Our primary goal is to efficiently serve results to our customers. Our results encompass metrics, visualizations, and dashboards accessible through multiple endpoints, including client applications and our APIs/SDKs. To keep the analytics platform functional, reliable, and responsive (even during high demand), our stack is built on Apache Arrow, which gives the platform a very strong foundation for high-performance caching. Furthermore, Apache Arrow opens up possibilities such as:
- Analytics on top of CSV or Parquet data.
- Performance improvements for different data source types (e.g. Snowflake, PostgreSQL, etc.) thanks to ADBC (Arrow Database Connectivity).
- Analytics on top of lake houses (e.g. Iceberg, Delta, etc.).
- SQL interfaces to the GoodData platform.
This article summarizes the architecture and benefits, and hopefully gives you some inspiration for using Apache Arrow in analytics yourself!
Why Build Analytics With Apache Arrow?
A key component of GoodData is the semantic layer: a layer that maps analytics objects (facts, metrics, datasets, etc.) to the physical model of the data from which the analytics are computed.
The semantic layer has huge benefits in abstracting analytics from physical models. In short, you maintain just one single source of truth, and a change in the physical model means just one change in the semantic layer, instead of changing numerous analytical objects (metrics, visualizations, etc.). Another benefit is that the semantic layer enhances “data” with descriptions and metadata — i.e., you can describe what a particular metric means. To make this work, a query engine must translate analytics requests against the semantic model into a physical plan (in simple terms, an SQL query sent to a data source).
The query engine is critical, but it in turn requires a component that carries out the physical part of the plan:
- Querying the data source(s).
- Post-processing the data (pivoting, sorting, merging, etc.).
- Caching or storing prepared results.
This is where Apache Arrow comes in. We chose to employ Apache Arrow for these tasks because it provides a columnar format optimized for computation and designed to minimize friction during data exchange between individual components. The Apache Arrow columnar format lets computational routines and execution engines scan and iterate over large chunks of data efficiently. The columnar format is ideal for analytical workloads rather than transactional use cases (e.g. CRUD operations).
Note: We tested the caching potential of Apache Arrow on hardware with the advanced in-memory Intel AVX-512 instruction set, and results were returned in just 2 milliseconds, which is orders of magnitude faster than even the speediest database.
Now, it is not just the optimized data format that stands out; Apache Arrow gives us some other great benefits too, including:
- I/O operations with the data.
- Converters to and from other formats (CSV and Parquet).
- Flight RPC — API blueprint and infrastructure to build data services.
All of this means that we can focus more on key features of our analytics platform and let Apache Arrow deal with all the low-level technical details. One of the key features is high-performance caching, discussed in the next section.
Main Feature: High-Performance Caching
Apache Arrow allows us to perform data and analytics operations through native Flight RPC. In short, this means:
- Retrieving data from the customer’s data source.
- Data processing (computations on top of caches, data manipulation, data derivation).
- Reading and writing of caches.
Think of this usage of Flight RPC as an interface to all data, one which hides all the complexity related to external data access, caching, and cache invalidation. Imagine a client (web application, or API) wants to retrieve data from the analytics platform; this client does not care whether the data is already computed and cached. How does it work? Flight RPC uses the Flight abstraction to represent data either by a path descriptor or a command descriptor. You can think of path descriptors as the “specification” of what data you want to retrieve, and command descriptors as the “specification” of how you want to retrieve it.
With this background, envision a basic setup where a service oversees the entire cache mechanism as follows. Flights described by path (what data to retrieve) are elementary materialized caches:
- The flight path serves as a unique identifier (the structured nature of the path allows for flexible categorization, i.e. raw-cache/
- Categorization allows you to selectively apply different caching strategies and policies.
The Flight described by command (how to retrieve data):
- The command descriptor contains the identifier of the data source and SQL to run.
As mentioned above, a client (web application, or API) does not care if the data is already computed or needs to be computed. The client is only interested in the result, and Flight RPC will abstract the client from that by the correct usage of path descriptor (what) and command descriptor (how).
The following diagram shows the implications of using Apache Arrow and Flight RPC. To reiterate: clients read/write caches via path descriptors (Flight paths), and do not care:
- Whether it is stored in memory, on disk, or whether it has to be loaded to caches from durable storage.
- How the cached data is managed, how the data moves between storage tiers, or when it is only available on durable storage.
We Won't Stop There
As we mentioned, Apache Arrow has the potential to be used in a number of different ways, and while our first task was to use it for caching, we won't stop there. For example, we currently use the Arrow format, but we plan to support other formats like CSV or Parquet, which has further implications. Our most immediate next step is employing DuckDB, which is known as a good tool for querying Parquet files. The second step is providing a “little” extension that will allow us (and also you, if you use Apache Arrow) to query all external data sources that use such formats.
To fully harness Apache Arrow capabilities, we also integrate with other open-source projects, like pandas (which works natively with Arrow Format from pandas version 2.0 onwards), or turbodbc (ODBC library that can create result sets in the Arrow Format).
What is your experience with Apache Arrow? If you have any, let’s discuss it in our community Slack!