Apache Drill

Disclaimer

Support for the Apache Drill Data Source Manager (DSM) is in beta. Beta features are available for users to test and provide feedback. Their implementation is not finalized, and their behavior or interface may change in the future.

Known limitations

  • The MEDIAN analytics function (or any alternative, such as PERCENTILE_CONT) is not supported by Drill.

Deployment

You can run Apache Drill in a Docker container. The image for Apache Drill is available on Docker Hub.

The following example demonstrates how to start GoodData.CN with Apache Drill, using MinIO as S3 storage:

version: '3.7'

services:
  gooddata-cn-ce:
    image: gooddata/gooddata-cn-ce:1.4.0
    ports:
      - "3000:3000"
      - "5432:5432"
    volumes:
      - gooddata-cn-ce-data:/data
    environment:
      LICENSE_AND_PRIVACY_POLICY_ACCEPTED: "YES"

  drill:
    image: apache/drill:1.19.0
    ports:
      - '8047:8047'
      - '31010:31010'
    volumes:
      - drill-data:/data
      # Inject JDBC drivers for data sources which you want to manage with Apache Drill, e.g.:
      - ./db-drivers/POSTGRESQL/postgresql-42.2.16.jar:/opt/drill/jars/3rdparty/postgresql-42.2.16.jar
      - ./db-drivers/VERTICA/vertica-jdbc-10.0.1-2.jar:/opt/drill/jars/3rdparty/vertica-jdbc-10.0.1-2.jar
      - ./db-drivers/REDSHIFT/RedshiftJDBC42-no-awssdk-1.2.50.1077.jar:/opt/drill/jars/3rdparty/RedshiftJDBC42-no-awssdk-1.2.50.1077.jar
      - ./db-drivers/MSSQL/mssql-jdbc-8.4.1.jre11.jar:/opt/drill/jars/3rdparty/mssql-jdbc-8.4.1.jre11.jar
      - ./db-drivers/SNOWFLAKE/snowflake-jdbc-3.12.9.jar:/opt/drill/jars/3rdparty/snowflake-jdbc-3.12.9.jar
      - ./db-drivers/ADS/datawarehouse-jdbc-driver-bundle-3.5.1.jar:/opt/drill/jars/3rdparty/datawarehouse-jdbc-driver-bundle-3.5.1.jar
      # If needed, override default settings
      - ./ds_managers/drill/drill-override.conf:/opt/drill/conf/drill-override.conf
      # Register default storage plugins
      - ./ds_managers/drill/storage-plugins-override.conf:/opt/drill/conf/storage-plugins-override.conf
    stdin_open: true
    tty: true

  minio:
    image: minio/minio:RELEASE.2021-08-25T00-41-18Z
    volumes:
      - minio-data:/data
    ports:
      - '19000:9000'
      - '19001:19001'
    environment:
      MINIO_ACCESS_KEY: tiger_abcde_k1234567
      MINIO_SECRET_KEY: tiger_abcde_k1234567_secret1234567890123
    command: server --console-address ":19001" /data
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
      interval: 30s
      timeout: 20s
      retries: 3

volumes:
  gooddata-cn-ce-data:
  drill-data:
  minio-data:
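
Once the containers are up, you may want to confirm that Drill is reachable before you connect GoodData.CN to it. The following is a minimal sketch, not an official procedure. It assumes that port 8047 is published on localhost as in the example above, and it uses Drill's REST query endpoint (/query.json) to read the sys.version system table. The function names are illustrative.

```python
import json
import urllib.request

# Assumption: port 8047 is published on localhost, as in the compose example.
DRILL_URL = "http://localhost:8047"


def build_query_request(sql: str, base_url: str = DRILL_URL) -> urllib.request.Request:
    """Build a POST request for Drill's REST query endpoint (/query.json)."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/query.json",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def drill_version(base_url: str = DRILL_URL, timeout: float = 10.0) -> str:
    """Return the version string that Drill reports in sys.version."""
    request = build_query_request("SELECT version FROM sys.version", base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["rows"][0]["version"]


# Usage, once the containers are running:
#   print(drill_version())
```

If the call raises a connection error, the Drill container is probably still starting or the port mapping differs from the one assumed here.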

Prepare Apache Drill for GoodData.CN

To learn how to register Data Sources to Apache Drill, refer to the official Apache Drill documentation for connecting a Data Source.

For additional considerations, refer to Preparing Data Source Managers for GoodData.CN.

Data Source Details

Use the following information when creating a data source to use with your Apache Drill DSM:

  • The following considerations apply when you are configuring the JDBC URL:
  • Basic authentication is most likely supported but is untested. You can test authentication by specifying the user and password.
  • You can set enableCaching to true and cachePath to ["dfs", "data"].

You must configure the writable storage plugin so that the path for dfs.data points to the local filesystem. You can find more information in the official Apache Drill documentation for Configuring Storage Plugins.
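
The data source details above can be collected in a small helper. The following is a sketch under stated assumptions, not the GoodData.CN API payload: the JDBC URL uses Drill's direct-connection form (jdbc:drill:drillbit=host:port), the host name drill and port 31010 are taken from the deployment example, and the dictionary keys url, username, and password are hypothetical. Only enableCaching and cachePath come from the list above.

```python
from typing import Optional


def drill_jdbc_url(host: str = "drill", port: int = 31010) -> str:
    """Build a JDBC URL that connects directly to a single Drillbit."""
    return f"jdbc:drill:drillbit={host}:{port}"


def data_source_settings(
    host: str = "drill",
    port: int = 31010,
    user: Optional[str] = None,
    password: Optional[str] = None,
) -> dict:
    """Collect the settings described above in one dictionary.

    The keys "url", "username", and "password" are hypothetical names
    chosen for this sketch.
    """
    settings = {
        "url": drill_jdbc_url(host, port),
        "enableCaching": True,
        # Must match a writable workspace of the dfs storage plugin.
        "cachePath": ["dfs", "data"],
    }
    # Basic authentication is untested, so credentials are optional here.
    if user is not None:
        settings["username"] = user
        settings["password"] = password
    return settings
```

Keeping cachePath as a list of plugin name and workspace name makes it easy to check against the workspaces defined in the storage plugin configuration.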

You can configure the DSM through the web UI, or you can store the configuration into the file storage-plugins-override.conf and mount it as a volume into the container.

The following example is a snippet that demonstrates the configuration settings for the Apache Drill DSM:

"storage": {
  dfs: {
      type: "file",
      connection: "file:///",
      enabled: true,
      workspaces: {
        "tmp": {
          "location": "/tmp",
          "writable": true,
          "defaultInputFormat": null,
          "allowAccessOutsideWorkspace": false
        },
        "root": {
          "location": "/",
          "writable": false,
          "defaultInputFormat": null,
          "allowAccessOutsideWorkspace": false
        },
        "data": {
          "location": "/data",
          "writable": true,
          "defaultInputFormat": null,
          "allowAccessOutsideWorkspace": false
        }
      },
      formats: {
        "parquet": {
          "type": "parquet"
        },
        # ... add other formats based on your needs ...
      }
  }
}

Performance Tips

If you want to query large datasets, or join large datasets from different data sources, we recommend that you first snapshot the datasets into Apache Drill (CREATE TABLE AS) and then query the table snapshots.
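
The snapshot approach can be sketched as a pair of SQL statements: one CREATE TABLE AS statement that materializes the source query, and one query against the result. The helper below only builds the statements. It assumes that the writable dfs.data workspace from the configuration example is the snapshot target. The table name orders_snapshot and the source pg.public.orders are hypothetical.

```python
def ctas_statement(table_name: str, select_sql: str, workspace: str = "dfs.data") -> str:
    """Build a CREATE TABLE AS statement that snapshots a query result."""
    if "`" in table_name:
        raise ValueError("table name must not contain backticks")
    return f"CREATE TABLE {workspace}.`{table_name}` AS {select_sql}"


def snapshot_query(table_name: str, workspace: str = "dfs.data") -> str:
    """Build a query that reads from the snapshot instead of the source."""
    if "`" in table_name:
        raise ValueError("table name must not contain backticks")
    return f"SELECT * FROM {workspace}.`{table_name}`"


# Hypothetical example: snapshot a table exposed by a storage plugin named "pg".
create_sql = ctas_statement("orders_snapshot", "SELECT * FROM pg.public.orders")
read_sql = snapshot_query("orders_snapshot")
print(create_sql)
print(read_sql)
```

You can run the generated statements with any Drill client. The snapshot is written once, so later queries read local files instead of querying the remote data source again.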