Written by Jan Soubusta |
From SaaS to On-Prem With GoodData Cloud Native
GoodData has been providing a SaaS analytics platform for over a decade. This year, we decided to break the status quo and offer an on-premises platform — GoodData.CN — as well. There are a few key reasons why we made this decision:
- Customers do not have to move their data into the GoodData cloud
- Customers can get data for their insights directly from their data sources, in real time
Of course, the on-prem platform must integrate with various types of customer data sources. We started with support for PostgreSQL, and then followed with support for Snowflake, Redshift, BigQuery, and Vertica. We also adopted Apache Calcite for SQL query generation, and we plan to keep using it and to contribute complete SQL dialects for the most common data sources.
The Challenge: Data Source Overload
There are, however, countless data sources, and adding support for all of them ourselves would be very challenging. On top of that, many advanced use cases would also have to be implemented. So, after weighing our options, we started searching for a data source management solution that could expand our data source selection and provide the capabilities we needed.
We identified the following requirements for the desired solution:
- Must integrate with all the relevant data sources used by data engineers
- Ability to join data from multiple data sources in a single SQL query
- Allows a metric to analyze data from various data sources simultaneously
- Ability to cache data locally in formats optimized for analytics use cases (columnar)
- Ability to update caches incrementally (ideally)
- Low latency/high throughput when querying local caches
- Able to push down operations where integrated data sources can manage the operation more effectively — for example, push aggregation to a cloud data warehouse such as Snowflake
- Containerized, no SPOFs, and scales horizontally
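To make the join and push-down requirements above more concrete, here is a minimal sketch in Python. It is not GoodData.CN, Dremio, or Drill code: SQLite's `ATTACH DATABASE` stands in for a federation engine, and the two databases (`warehouse`, `crm`), their tables, and the sample rows are hypothetical. The inner subquery only shows the shape of an aggregation that a real engine could push down to the source holding the orders.

```python
import sqlite3


def build_sources():
    """Create two separate in-memory databases that stand in for two data sources."""
    conn = sqlite3.connect(":memory:")
    # Each ATTACH of ':memory:' creates its own independent database.
    conn.execute("ATTACH DATABASE ':memory:' AS warehouse")  # hypothetical source 1
    conn.execute("ATTACH DATABASE ':memory:' AS crm")        # hypothetical source 2
    conn.execute("CREATE TABLE warehouse.orders (customer_id INTEGER, amount REAL)")
    conn.execute("CREATE TABLE crm.customers (id INTEGER, region TEXT)")
    conn.executemany(
        "INSERT INTO warehouse.orders VALUES (?, ?)",
        [(1, 100.0), (1, 50.0), (2, 70.0)],
    )
    conn.executemany(
        "INSERT INTO crm.customers VALUES (?, ?)",
        [(1, "EMEA"), (2, "APAC")],
    )
    return conn


def revenue_by_region(conn):
    """Join both sources in a single SQL query.

    The subquery aggregates orders per customer before the join. In a real
    federation engine, this is the part that could be pushed down to the
    source that stores the orders, so only the totals cross the network.
    """
    sql = """
        SELECT c.region, SUM(o.total) AS revenue
        FROM (
            SELECT customer_id, SUM(amount) AS total
            FROM warehouse.orders
            GROUP BY customer_id
        ) AS o
        JOIN crm.customers AS c ON c.id = o.customer_id
        GROUP BY c.region
        ORDER BY c.region
    """
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    print(revenue_by_region(build_sources()))
```

A single metric such as "revenue by region" then reads from both sources at once, which is the behavior the requirements describe. Whether and how the aggregation is actually pushed down depends on the engine and the connector.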
Research Results: Two Above the Others
After shortlisting different solutions, we started working with Dremio, Drill, Druid, and Presto. Druid was quickly removed from the list due to its complexity. Soon after, we rejected Presto because it does not support adding, updating, or removing data sources without downtime (we are discussing this with the Presto community).
After a quick prototype of our platform on top of Drill and Dremio, we reached the following results:
Dremio provides significantly better performance than Drill.
- Dremio’s query engine is generally faster because of technologies like Arrow and Gandiva.
- Dremio enables the incremental fetching and caching of data from data sources to Dremio’s local columnar storage.
- Updates of the caches can be scheduled and, recently, even orchestrated to keep models (datasets) consistent.
Dremio supports most of the market-leading data sources.
Drill supports more data sources.
- Supported data sources include, for example, Kafka and HTTP API.
Both Dremio and Drill are based on Apache Calcite.
- Collaborating with Calcite for GoodData purposes can help Drill and Dremio, too.
Both Drill and Dremio provide free community editions that can be used in production.
- Dremio’s advanced performance-related features are available in the enterprise edition.
In the end, we decided to support both Apache Drill and Dremio. Apache Drill is aimed mainly at free community edition usage, while Dremio is suitable for all organizations, especially enterprises, thanks to its better performance and advanced fetching and caching capabilities.
Try It Yourself
Ready to test how cloud-native analytics works together with data federation? Download both the GoodData.CN Community Edition and Dremio Community Edition for free, and follow GoodData’s Dremio integration documentation to get started.