Why GoodData Decided to Integrate With Dremio

3 min read | Published Nov 15, 2021

As Field CTO at GoodData, Jan brings over a decade of experience in backend systems, databases, and cloud-native architectures. His background spans data warehouse operations, large-scale platform engineering, and applied generative AI. Jan has led multiple major long-term projects at GoodData, including the delivery of our data warehouse-as-a-service built on Vertica as well as the ground-up design and development of our next-generation analytics platform. Today, he splits his time between driving product innovation (particularly in the area of AI-assisted BI) and representing GoodData in the field through conferences, articles, and conversations with customers and partners.

From SaaS to On-Prem With GoodData Cloud Native

GoodData has been providing a SaaS analytics platform for over a decade. This year, we decided to break the status quo and offer an on-premises platform — GoodData.CN — as well. There are a few key reasons why we made this decision:

- Customers do not have to move their data into the GoodData cloud
- Customers can get data for their insights directly from their data sources, in real time

Of course, the on-prem platform must integrate with various types of customer data sources. We started with support for PostgreSQL, and then followed with support for Snowflake, Redshift, BigQuery, and Vertica. We also adopted Apache Calcite for SQL query generation, and we will continue to use it so that we can contribute complete SQL dialects for the most common data sources.

The Challenge: Data Source Overload

There are, however, countless different data sources, and adding support for all of them would be very challenging. In addition to this, there are many advanced use cases that would also have to be implemented. So, after weighing our options, we started to search for a data source management solution that could offer the required capabilities to expand our data source selection.

We identified the following requirements for the desired solution:

- Integrations
  - Must integrate with all the relevant data sources used by data engineers
- Ability to join data from multiple data sources in a single SQL query
  - Allows a metric to analyze data from various data sources simultaneously
- Performance
  - Ability to cache data locally in formats optimized for analytics use cases (columnar)
  - Ability to update caches incrementally (ideally)
  - Low latency/high throughput when querying local caches
  - Able to push down operations where integrated data sources can manage the operation more effectively (for example, push aggregation to a cloud data warehouse such as Snowflake)
- Cloud-native deployment
  - Containerized, no SPOFs, and scales horizontally
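
To make the "join data from multiple data sources in a single SQL query" requirement concrete, here is a minimal sketch. It is only an illustration: two in-memory SQLite databases attached to one connection stand in for two separate data sources, and the schema names (`crm`, `billing`), tables, and the `revenue_by_region` helper are hypothetical. A federation engine would expose real sources under its own naming, and its SQL dialect may differ.

```python
import sqlite3

# Assumption: two attached in-memory SQLite databases play the role of
# two independent data sources (for example, a CRM and a billing system).
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS crm")
conn.execute("ATTACH DATABASE ':memory:' AS billing")

conn.execute("CREATE TABLE crm.customers (id INTEGER PRIMARY KEY, region TEXT)")
conn.execute("CREATE TABLE billing.invoices (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO crm.customers VALUES (?, ?)",
    [(1, "EMEA"), (2, "APAC"), (3, "EMEA")],
)
conn.executemany(
    "INSERT INTO billing.invoices VALUES (?, ?)",
    [(1, 100.0), (2, 250.0), (3, 50.0), (1, 25.0)],
)


def revenue_by_region(connection):
    """Compute one metric (revenue) over tables that live in two sources."""
    query = """
        SELECT c.region, SUM(i.amount) AS revenue
        FROM crm.customers AS c
        JOIN billing.invoices AS i ON i.customer_id = c.id
        GROUP BY c.region
        ORDER BY c.region
    """
    return connection.execute(query).fetchall()


result = revenue_by_region(conn)
print(result)
```

The point of the sketch is that the metric is written once, as a single query, and the engine underneath is responsible for reaching into both sources.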

Research Results: Two Above the Others

After shortlisting different solutions, we started working with Dremio, Drill, Druid, and Presto. Druid was quickly removed from the list due to its complexity. Soon after, we rejected Presto because it does not support adding, updating, and removing data sources without downtime (we are discussing this with the Presto community).

After a quick prototype of our platform on top of Drill and Dremio, we reached the following results:

- Dremio provides significantly better performance than Drill.
  - Dremio’s query engine is generally faster because of technologies like Arrow and Gandiva.
  - Dremio enables the incremental fetching and caching of data from data sources to Dremio’s local columnar storage.
  - Updates of the caches can be scheduled and, recently, even orchestrated to keep models (datasets) consistent.
- Dremio supports most of the market leaders in the area of data sources.
  - The community can implement additional connectors easily using the Advanced Relational Pushdown (ARP) framework.
  - There is Dremio Hub with officially supported connectors and various GitHub repositories (e.g., for BigQuery).
- Drill supports more data sources.
  - Supported data sources include, for example, Kafka and HTTP API.
- Both Dremio and Drill are based on Apache Calcite.
  - Collaborating with Calcite for GoodData purposes can help Drill and Dremio, too.
- Both Drill and Dremio provide free community editions for production.
  - Dremio’s advanced performance-related features are available in the enterprise edition.
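
The idea of incrementally fetching data into a local cache can be sketched generically as follows. This is not Dremio's implementation or API. It is a simplified illustration that assumes an append-only source with a monotonically increasing `id` column, and the `IncrementalCache` class and its fields are hypothetical names introduced for the example.

```python
from dataclasses import dataclass, field


@dataclass
class IncrementalCache:
    """Local copy of a source table, refreshed using a high-water mark."""

    rows: list = field(default_factory=list)
    watermark: int = 0  # highest source id that is already cached

    def refresh(self, source_rows):
        """Append only rows newer than the watermark; return how many were fetched."""
        new_rows = [row for row in source_rows if row["id"] > self.watermark]
        self.rows.extend(new_rows)
        if new_rows:
            self.watermark = max(row["id"] for row in new_rows)
        return len(new_rows)


# Assumption: the source is append-only, so rows are never updated in place.
source = [{"id": 1, "amount": 100.0}, {"id": 2, "amount": 250.0}]
cache = IncrementalCache()

first = cache.refresh(source)   # initial load fetches every row
source.append({"id": 3, "amount": 50.0})
second = cache.refresh(source)  # later refresh fetches only the new row
print(first, second, cache.watermark)
```

Compared with reloading the whole table on every refresh, only the rows added since the last run cross the network, which is why incremental updates appear among the performance requirements above.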

In the end, we decided to support both Apache Drill and Dremio. Apache Drill is aimed mainly at users who want a free, community-driven option, while Dremio is suitable for all organizations, and especially for enterprises, thanks to its better performance and advanced fetching and caching capabilities.

Try It Yourself

Ready to test how cloud-native analytics works together with data federation? Download both the GoodData.CN Community Edition and Dremio Community Edition for free, and follow GoodData’s Dremio integration documentation to get started.

If you are interested in GoodData.CN, please contact us. Alternatively, sign up for a trial version of GoodData Cloud: https://www.gooddata.com/trial/