Private Beta Feature
Interested in trying this feature? The functionality detailed in this article is in private beta testing. To request exclusive access, please connect with us on Slack.
The databases you connect to GoodData often contain large tables that vary significantly in data quality. Without intimate knowledge of these datasets, it’s easy to overlook issues that could later impact the development of the Logical Data Model (LDM), or even cause complications when defining metrics and creating visualizations.
The data profiling feature scans your connected tables and creates a detailed profile of each one. The profile includes statistics on key data characteristics such as missing values, data types, and value distribution. This gives you an at-a-glance understanding of data quality and distribution, setting the stage for more accurate and reliable analytics.
Review Your Dataset at a Glance
Once you connect a data source to the LDM, you go on to create new datasets from the available tables. After a dataset is created, GoodData automatically gathers profiling information for it. You can review this information in the Data profiling tab of the dataset’s View details dialog:
You can hover over the columns to display additional information about each field. Overall, the Data profiling tab provides information about:
- Table dimensions
- Empty (null) values
- MIN, MAX, AVG and a histogram of values for facts
- Frequency distribution of unique values for attributes
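To make the listed statistics concrete, the following is a minimal Python sketch that computes the same kinds of figures for a small in-memory table. It is an illustration only, not GoodData's implementation: the function name `profile_column`, the `kind` labels, the equal-width binning, and the sample rows are all assumptions made for this example.

```python
from collections import Counter
from statistics import mean


def profile_column(values, kind, bins=5):
    """Profile one column. `kind` is "fact" (numeric) or "attribute" (categorical).

    Hypothetical helper for illustration; None stands in for an empty (null) value.
    """
    present = [v for v in values if v is not None]
    profile = {
        "rows": len(values),                   # row count (one table dimension)
        "empty": len(values) - len(present),   # number of empty (null) values
    }
    if kind == "fact" and present:
        lo, hi = min(present), max(present)
        profile.update(min=lo, max=hi, avg=mean(present))
        # Equal-width histogram; the maximum value falls into the last bin.
        width = (hi - lo) / bins or 1
        counts = [0] * bins
        for v in present:
            counts[min(int((v - lo) / width), bins - 1)] += 1
        profile["histogram"] = counts
    elif kind == "attribute":
        # Frequency distribution of unique values, most common first.
        profile["frequencies"] = Counter(present).most_common()
    return profile


# Made-up sample data: one attribute column and one fact column.
rows = [
    {"region": "East", "price": 20.0},
    {"region": "West", "price": 40.0},
    {"region": "East", "price": None},
    {"region": None, "price": 60.0},
]
price_profile = profile_column([r["price"] for r in rows], "fact", bins=2)
region_profile = profile_column([r["region"] for r in rows], "attribute")
```

Here `price_profile` reports 4 rows, 1 empty value, a minimum of 20.0, a maximum of 60.0, and an average of 40.0, while `region_profile` lists "East" twice and "West" once alongside 1 empty value.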
Spot Anomalies in Your Data
Use the Data profiling tab to do a quick review of the database tables you have connected to ensure your mental model of the data is in line with reality.
Something as simple as checking the table dimensions might reveal useful insights.
Are you dealing with annual sales data for your entire company? Is 6,000 rows, corresponding to 6,000 transactions, a plausible number? If you believe you made only 100 sales, or conversely that your company made at least 1 million sales over the past year, this may indicate an issue with your data.
Reviewing the profiling information lets you do a quick “sanity check” on your data. For example, a suspicious-looking histogram can alert you to outliers:
Do you really expect there to be an item worth $20,001 if your average item price is less than $100?
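If you want to confirm a suspicion like this outside the modeler, a simple outlier rule is enough. The sketch below uses Tukey's interquartile-range rule from the Python standard library; it is one common heuristic chosen for this example, not the method the Data profiling tab uses, and the function name `find_outliers`, the multiplier `k`, and the sample prices are hypothetical.

```python
from statistics import quantiles


def find_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Made-up item prices: most are under $100, one is clearly suspicious.
prices = [45, 60, 75, 80, 95, 99, 20001]
outliers = find_outliers(prices)
```

With these sample prices, `outliers` contains only 20001. A quartile-based rule is used here because a single extreme value pulls the mean and standard deviation toward itself, which can hide the very outlier you are looking for.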
Fields with missing values will display warnings:
GoodData can work with empty values, but they can create unsightly gaps in your visualizations, so it is good to know about them before you start building analytics on top of the data.
Note that every dataset with empty values generates a warning, which is displayed in the Model validation panel on the right side of the modeler:
To learn more about this panel, see Model Validation.