Improve Database Performance

The speed of GoodData, that is, how fast your analytics load, depends largely on the condition of your database, especially when you work with large amounts of data. Below are general tips to help you improve performance.

These are general suggestions. For tips specific to your database type, check the articles in the Create Data Sources section.

Database Technology

Use database engines that support columnar storage. With columnar compression, denormalizing the relational data model into wide flat fact tables is much cheaper than it would be in a row-oriented database.

In addition, some database engines provide special features supporting analytical workloads. If you work with large volumes of data, we recommend that you consider these engines.

Typically, the most suitable solution is a relational database. NoSQL and other database types can crunch extremely large volumes of data, but they usually do not provide the sub-second latencies that analytics use cases, such as executing a visualization ad hoc or embedding analytics elements, depend on.

Data Types

Avoid setting all columns to VARCHAR(2000). Though it simplifies loading data into the database (no rejections during the load), it negatively impacts the performance of subsequent data loads and queries.

Databases optimize processing significantly when data types are specified accurately. The database can then compress data efficiently and allocate only the resources necessary for processing.
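
For example, here is a minimal sketch in PostgreSQL-style SQL, using a hypothetical order_events fact table, that declares accurate types instead of blanket VARCHAR(2000) columns:

    -- Hypothetical fact table with accurate data types
    -- instead of declaring every column as VARCHAR(2000).
    CREATE TABLE order_events (
        order_id   BIGINT        NOT NULL,  -- integer key, cheap to compress and join
        order_date DATE          NOT NULL,  -- real date type enables date arithmetic
        status     VARCHAR(16)   NOT NULL,  -- short, bounded text
        quantity   INTEGER       NOT NULL,
        unit_price NUMERIC(12,2) NOT NULL   -- exact decimal for monetary values
    );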

Denormalization

As mentioned in the Database Technology section, we recommend using columnar engines and denormalizing into wide flat tables. This lets the database compress columns efficiently and allows queries to read only the columns that visualizations require (instead of reading all columns, as they would in a row-store database).

Some database engines provide special features that help with denormalization, for example, materialized views or flattened tables.
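
As an illustration, here is a minimal sketch in PostgreSQL-style SQL that flattens a hypothetical normalized orders/customers/products model into one wide flat fact table:

    -- Denormalize a normalized model into a single wide flat fact table
    -- (hypothetical tables and columns).
    CREATE TABLE fact_sales_flat AS
    SELECT
        o.order_id,
        o.order_date,
        c.customer_name,
        c.customer_region,
        p.product_name,
        p.product_category,
        o.quantity,
        o.quantity * p.unit_price AS revenue
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
    JOIN products p ON p.product_id = o.product_id;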

Clustered Engines

We recommend that you use database engines that can natively run in a cluster so they can scale horizontally. Horizontal scaling can, however, cause more complex queries to suffer from too much data being sent across the network.

Some database engines allow you to define a distribution key (a set of columns in a table) that determines how the data is distributed across the cluster. We recommend focusing on this aspect and using distribution keys in your wide flat tables. The keys should correspond to your analytical requirements (the columns used for aggregations or JOINs).

A single key may not be sufficient for all dynamic analytics requirements, so we recommend creating multiple versions of the table, each distributed in a different way. Ideally, those table versions should be refreshed automatically from the parent table (materialized views, projections).
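
For illustration, Amazon Redshift is one engine with this capability; the following sketch (hypothetical table and columns) distributes a wide flat table by a frequently joined key:

    -- Distribution key on a wide flat table (Amazon Redshift syntax,
    -- hypothetical columns); choose keys that match your JOINs and aggregations.
    CREATE TABLE fact_sales_flat (
        customer_id BIGINT,
        order_date  DATE,
        region      VARCHAR(32),
        revenue     NUMERIC(12,2)
    )
    DISTKEY (customer_id)   -- co-locates rows that are joined or aggregated by customer
    SORTKEY (order_date);   -- speeds up filtering by date ranges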

Partitioning

We recommend that you use partitioning to take advantage of so-called partition pruning. Partition your large fact tables by the appropriate level of date granularity. For example, visualizations that require only recent data are processed much faster because the database scans only the relevant partitions.
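
Here is a minimal sketch in PostgreSQL-style SQL that partitions a hypothetical fact_sales table by month:

    -- Partition a large fact table by month so that date-filtered queries
    -- scan only the matching partitions (hypothetical table).
    CREATE TABLE fact_sales (
        order_id   BIGINT,
        order_date DATE NOT NULL,
        revenue    NUMERIC(12,2)
    ) PARTITION BY RANGE (order_date);

    CREATE TABLE fact_sales_2025_01 PARTITION OF fact_sales
        FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
    CREATE TABLE fact_sales_2025_02 PARTITION OF fact_sales
        FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');

    -- Partition pruning: only fact_sales_2025_02 is scanned.
    SELECT SUM(revenue) FROM fact_sales WHERE order_date >= '2025-02-01';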

Materialized Views

Many database engines support materialized views. We recommend that you use them to perform expensive operations once in the database rather than repeatedly with each visualization execution (see the sketch after the following list).

Materialized views provide the following features:

  • JOIN and aggregation
  • Incremental updates (you do not need to invest in your own complicated implementation of incremental updates)
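
A minimal sketch in PostgreSQL-style SQL (hypothetical tables) that moves a JOIN and aggregation into a materialized view:

    -- Perform the JOIN and aggregation once in the database;
    -- visualizations then query the precomputed result (hypothetical tables).
    CREATE MATERIALIZED VIEW mv_revenue_by_region AS
    SELECT
        c.customer_region,
        DATE_TRUNC('month', o.order_date) AS order_month,
        SUM(o.quantity * o.unit_price)    AS revenue
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
    GROUP BY c.customer_region, DATE_TRUNC('month', o.order_date);

    -- PostgreSQL refreshes on demand; some engines refresh incrementally and automatically.
    REFRESH MATERIALIZED VIEW mv_revenue_by_region;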

Index

Row-store or hybrid databases provide indexes for query optimizations. Different types of indexes serve different purposes.

However, the cost of adding many indexes to serve all analytics requirements is typically too high. It may require a lot of additional storage, and data ingestion may slow down too much. To avoid this, index only the columns that are most frequently used for JOIN and aggregation operations.
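
For example, a minimal sketch in PostgreSQL-style SQL (hypothetical table and columns):

    -- Index only the columns most frequently used in JOINs and aggregations.
    CREATE INDEX idx_fact_sales_customer ON fact_sales (customer_id);  -- frequent JOIN key
    CREATE INDEX idx_fact_sales_date ON fact_sales (order_date);       -- frequent filter and GROUP BY column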

Data Source Routing

Data Source Routing lets you route analytics queries to different compute clusters that share the same storage and Logical Data Model. This supports horizontal scaling without changing dashboards, insights, or user workflows.

What It Solves

Data Source Routing helps when one compute cluster cannot handle both:

  • high-concurrency dashboards, and
  • heavy queries over large, detailed datasets.

With routing, you can separate workloads so long-running queries do not slow down everyday analytics.

Benefits

  • Better performance by sending heavy queries to dedicated compute
  • Workload isolation so long-running queries do not affect dashboards
  • Cost optimization by using powerful compute only when needed
  • Improved concurrency by distributing workloads across compute clusters

How Routing Works

Each dataset in the Logical Data Model is assigned a primary data source. A data source can also define an alternative data source.

When a query runs, GoodData selects a data source like this:

  1. Prefer primary data sources whenever possible.
  2. Use an alternative data source only when needed so the query can run on a single compute cluster that contains all required datasets.

Routing is automatic and transparent to end users.

Set Up Data Source Routing

  1. Create compatible data sources

    Create multiple data sources that point to compute clusters sharing:

    • the same underlying storage, and
    • the same database schema.

    Typical setup:

    • a smaller cluster for pre-aggregated workloads
    • a larger cluster for detailed transactional workloads (which also contains all objects from the smaller cluster)
    • shared dimensions available in both clusters
  2. Assign primary data sources to datasets

    In your LDM configuration (Analytics as Code), set the primary data source for each dataset. This determines the default routing target whenever that dataset is used.

  3. Link an alternative data source

    On the primary data source, set alternativeDataSourceId to the ID of the alternative data source.

    • This is a one-way relationship.
    • The primary data source is not used as a fallback for the alternative.

    Example:

    {
      "data": {
        "type": "dataSource",
        "id": "ds_primary",
        "attributes": {
          "name": "Primary compute",
          "type": "POSTGRESQL",
          "url": "jdbc:postgresql://host:1234/db",
          "schema": "analytics",
          "username": "user",
          "password": "*****",
          "alternativeDataSourceId": "ds_alternative"
        }
      }
    }
  4. Confirm dataset availability

    To avoid unsupported federated queries, make sure:

    • the alternative data source contains all objects from the primary data source
    • both data sources use the same database schema
    • aggregated datasets and their original datasets are in the same data source (do not place aggregates only in the alternative)

Routing Example

You have two data sources:

  • DS1 for pre-aggregated data
  • DS2 for detailed transactional data

Datasets:

  • D1 exists in DS1 and DS2
  • D2 exists only in DS2 and references D1
  • DS1 has alternativeDataSourceId set to DS2

Query behavior:

  • A report sliced only by attributes from D1 runs on DS1 (preferred).
  • A report using a metric from D2 runs on DS2, because DS1 does not contain D2.

Limitations

  • All objects in a primary data source must also exist in its alternative data source.
  • Primary and alternative data sources must use the same database schema.
  • Aggregated datasets must be in the same data source as their original datasets.
  • Circular alternative references are allowed, but routing may become unpredictable in those setups.