Improve Database Performance

The performance of GoodData, that is, how quickly visualizations are computed, strongly depends on the state of your database. When dealing with larger data volumes, review the main factors that may impact performance and see how you can improve it when working with large tables.

Database Technology

Use database engines that support columnar storage. With columnar compression, denormalizing the relational data model into wide flat fact tables is much cheaper.

In addition, some database engines provide special features supporting analytical workloads. If you work with large volumes of data, we recommend that you consider these engines.

Typically, the most suitable solution is a relational database. NoSQL and other database types can crunch extremely large volumes of data, but they usually do not provide the sub-second latencies that are important for analytics use cases such as executing a visualization ad hoc or embedding analytics elements.

Data Types

Avoid setting all columns to VARCHAR(2000). Though it simplifies loading data into the database (no rejections during the load), it negatively impacts the performance of subsequent data loads.

Databases optimize processing significantly when data types are specified accurately. The database can then compress data efficiently and allocate only the resources that are necessary for processing.
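
For illustration, here is a minimal sketch of the difference in generic SQL DDL. The order_fact tables and their columns are hypothetical, and exact type names vary by engine:

    -- Anti-pattern: every column declared as oversized text, which is hard
    -- to compress and forces the engine to over-allocate resources
    CREATE TABLE order_fact_untyped (
        order_id   VARCHAR(2000),
        order_date VARCHAR(2000),
        quantity   VARCHAR(2000),
        unit_price VARCHAR(2000)
    );

    -- Accurate data types: the engine can compress columns efficiently and
    -- allocate only the resources each value actually needs
    CREATE TABLE order_fact (
        order_id   BIGINT,
        order_date DATE,
        quantity   INTEGER,
        unit_price DECIMAL(12,2)
    );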

Denormalization

As mentioned in the Database Technology section, we recommend using columnar engines and denormalizing into wide flat tables. This approach helps compress columns efficiently and allows queries to read only the columns required by visualizations (instead of reading all columns, as they would in a row-store database).

Some database engines provide special features that help with denormalization, for example, materialized views or flattened tables.
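
As a minimal sketch, assuming hypothetical normalized orders, customers, and products tables, a wide flat table could be built like this (generic SQL; some engines would express it as a materialized view or a flattened table instead):

    -- Denormalize the fact and its dimensions into one wide flat table so
    -- queries read only the (well-compressed) columns they need
    CREATE TABLE order_wide AS
    SELECT
        o.order_id,
        o.order_date,
        o.quantity,
        o.unit_price,
        c.customer_name,
        c.customer_region,
        p.product_name,
        p.product_category
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
    JOIN products p ON p.product_id = o.product_id;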

Pre-Aggregation

Instead of using tables with raw data, prepare pre-aggregated tables in your database and map your logical data model (LDM) to them. Pre-aggregation helps you avoid unnecessary calculations such as calculating a sum of sales by month from a sales table with minute granularity.
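
A minimal sketch of such a pre-aggregated table, assuming a hypothetical minute-level sales table (generic SQL; DATE_TRUNC is available in engines such as PostgreSQL, Redshift, or Snowflake, but the function name may differ elsewhere):

    -- Roll minute-level sales up to monthly granularity once in the database
    -- and map the LDM to sales_monthly instead of the raw sales table
    CREATE TABLE sales_monthly AS
    SELECT
        DATE_TRUNC('month', sold_at) AS sales_month,
        product_id,
        SUM(amount) AS total_amount
    FROM sales
    GROUP BY DATE_TRUNC('month', sold_at), product_id;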

Clustered Engines

We recommend that you use database engines that can natively run in a cluster so they can scale horizontally. Horizontal scaling can, however, cause more complex queries to suffer from too much data being sent across the network.

Some database engines allow you to define a distribution key (a set of columns in a table) that determines how the data is distributed across the cluster. We recommend focusing on this aspect and using distribution keys in your wide flat tables. The keys should correspond to your analytical requirements (the columns used for aggregations or JOINs).
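
The syntax is engine specific; as one example, here is a sketch using Amazon Redshift-style distribution clauses on a hypothetical wide table (other engines use different clauses, such as Vertica's SEGMENTED BY or Greenplum's DISTRIBUTED BY):

    -- Distribute rows by a column that is frequently used in JOINs and
    -- aggregations so that matching rows end up on the same node
    CREATE TABLE order_wide (
        order_id    BIGINT,
        customer_id BIGINT,
        order_date  DATE,
        amount      DECIMAL(12,2)
    )
    DISTSTYLE KEY
    DISTKEY (customer_id);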

A single distribution key may not be sufficient for all dynamic analytics requirements, therefore we recommend that you create multiple versions of the table, each distributed differently. Ideally, those table versions should be automatically refreshed from the parent table (materialized views, projections).

Partitioning

We recommend that you use partitioning to take advantage of so-called partition pruning. Partition your large fact tables by the appropriate level of date granularity so that queries scan only the partitions they need. For example, visualizations that require only recent data read just the latest partitions and are processed much faster.
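
As a sketch, using PostgreSQL-style declarative partitioning on a hypothetical fact table (other engines declare partitions differently):

    -- Partition the fact table by date so that queries filtered on order_date
    -- scan only the relevant partitions (partition pruning)
    CREATE TABLE order_wide (
        order_id   BIGINT,
        order_date DATE,
        amount     DECIMAL(12,2)
    ) PARTITION BY RANGE (order_date);

    CREATE TABLE order_wide_2024_01 PARTITION OF order_wide
        FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');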

Materialized Views

Many database engines support materialized views. We recommend that you use them to perform expensive operations once in the database rather than repeatedly in each visualization execution.

Materialized views provide the following features:

  • JOIN and aggregation
  • Incremental updates (you do not need to invest in your own complicated implementation of incremental updates)
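
For example, a materialized view combining a JOIN with an aggregation might look like the following sketch (generic SQL over hypothetical orders and customers tables; support for automatic or incremental refresh varies by engine):

    -- The JOIN and the aggregation are computed once in the database;
    -- visualizations then read from the refreshed view instead of raw data
    CREATE MATERIALIZED VIEW monthly_sales_by_region AS
    SELECT
        DATE_TRUNC('month', o.order_date) AS sales_month,
        c.customer_region,
        SUM(o.amount) AS total_amount
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
    GROUP BY DATE_TRUNC('month', o.order_date), c.customer_region;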

Index

Row-store and hybrid databases provide indexes for query optimization. Different types of indexes serve different purposes.

However, the cost of adding many indexes to serve all analytics requirements is typically too high: they may require a lot of additional storage, and data ingestion may slow down too much. To avoid this, index only the columns that are most frequently used for JOIN and aggregation operations.
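
For example, on a hypothetical row-store orders table, you might index only the JOIN and aggregation columns (generic SQL):

    -- Index only the columns most frequently used for JOINs and aggregations;
    -- every additional index costs storage and slows down data ingestion
    CREATE INDEX idx_orders_customer_id ON orders (customer_id);
    CREATE INDEX idx_orders_order_date ON orders (order_date);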