Data Profiling

The databases you connect to GoodData often contain large tables that vary significantly in data quality. Without intimate knowledge of these data sets, it’s easy to overlook issues that could later impact the development of the Logical Data Model (LDM), or even cause complications when defining metrics and creating visualizations.

The data profiling feature scans your connected tables and creates a detailed profile of each one. The profile includes statistics on key data characteristics such as missing values, data types, and value distribution. This helps you understand data quality and distribution at a glance, setting the stage for more accurate and reliable analytics.

Review Your Dataset at a Glance

Once you connect a data source to the LDM, you can create new datasets from the available tables. After a dataset is created, GoodData automatically gathers profiling information for it. You can review this information in the Data profiling tab of the dataset’s View details dialog:

Profiling tab in a dataset

You can hover over a column to display additional information about the field. Overall, the Data profiling tab provides information about:

  • Table dimensions
  • Empty (null) values
  • MIN, MAX, AVG and a histogram of values for facts
  • Frequency distribution of unique values for attributes
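To make these categories concrete, the following is a minimal sketch of how comparable statistics could be computed by hand for a small in-memory table. It is not how GoodData implements profiling; the function name `profile_table`, the sample rows, and the split into `facts` and `attributes` are all assumptions made for illustration, and the histogram is left out for brevity.

```python
from collections import Counter
from statistics import mean


def profile_table(rows, facts, attributes):
    """Summarize a list of dict rows (hypothetical helper, not a GoodData API)."""
    profile = {
        "row_count": len(rows),
        "column_count": len(facts) + len(attributes),
        "columns": {},
    }
    for name in facts + attributes:
        values = [row.get(name) for row in rows]
        present = [v for v in values if v is not None]
        # Empty (null) values are counted for every column.
        summary = {"nulls": len(values) - len(present)}
        if name in facts and present:
            # Numeric facts get MIN, MAX and AVG over the non-null values.
            summary.update(min=min(present), max=max(present), avg=mean(present))
        elif name in attributes:
            # Attributes get a frequency count of each unique value.
            summary["frequencies"] = dict(Counter(present))
        profile["columns"][name] = summary
    return profile


# Illustrative sample data (assumed, not taken from any real table).
rows = [
    {"region": "East", "price": 20.0},
    {"region": "West", "price": 40.0},
    {"region": "East", "price": None},
]
profile = profile_table(rows, facts=["price"], attributes=["region"])
print(profile)
```

Comparing a hand-computed summary like this with what the Data profiling tab shows can be a quick way to confirm that you are looking at the table you expect.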

Spot Anomalies in Your Data

Use the Data profiling tab to quickly review the database tables you have connected and make sure your mental model of the data matches reality.

Table Dimensions

Even something as simple as checking the table dimensions can reveal useful insights.

Profiling tab showing table dimensions

Suppose you are dealing with annual sales data for your entire company. Is 6,000 rows, corresponding to 6,000 transactions, a plausible number? If you believe you made only 100 sales, or conversely that your company made at least 1 million sales over the past year, the row count may indicate an issue with your data.

Suspicious Outliers

Reviewing the profiling information lets you do a quick “sanity check” on your data. For example, a suspicious-looking histogram can alert you to outliers:

Profiling tab showing a histogram

Do you really expect there to be an item worth $20,001 if your average item price is less than $100?
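If you want to confirm such a suspicion outside the profiling tab, one common heuristic is the interquartile range (IQR) rule. The sketch below is only an illustration of that heuristic, not the method GoodData uses to draw the histogram; the function name `iqr_outliers`, the multiplier `k=1.5`, and the sample prices are assumptions.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Assumed sample: prices mostly under $100 plus one suspicious entry.
prices = [12, 35, 48, 60, 75, 90, 99, 20001]
suspicious = iqr_outliers(prices)
print(suspicious)
```

A flagged value is not necessarily wrong; it is a prompt to check the source record before building metrics on top of it.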

Missing Values

Fields with missing values will display warnings:

Profiling tab showing empty values

GoodData can work with empty values, but they can create unsightly gaps in your visualizations, so it is good to know about them before you start building analytics on top of the data.
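If you also want to check for empty values directly in the source database, a per-column null count is straightforward to query. The sketch below uses Python's built-in `sqlite3` module with an in-memory table purely for illustration; the `orders` table, its columns, and the `null_counts` helper are hypothetical, and your own database would need its own driver and connection.

```python
import sqlite3


def null_counts(conn, table, columns):
    """Count NULLs per column (hypothetical helper; names must be trusted input)."""
    counts = {}
    for col in columns:
        # COUNT(*) counts every row, COUNT(col) skips NULLs in that column.
        query = f'SELECT COUNT(*) - COUNT("{col}") FROM "{table}"'
        counts[col] = conn.execute(query).fetchone()[0]
    return counts


# Assumed sample table standing in for a connected source table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, price REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "East", 20.0), (2, None, 40.0), (3, "West", None), (4, None, 15.0)],
)
result = null_counts(conn, "orders", ["id", "region", "price"])
print(result)
```

Columns with a non-zero count here are the ones you would expect to see flagged with a warning in the Data profiling tab.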

Note that every dataset with empty values also generates a warning in the Model validation panel on the right side of the modeler:

Data validation panel showing empty values

To learn more about this panel, see Model Validation.