Why AI in Analytics Needs Metadata

7 min read | Published Jun 9, 2025

Štěpán is a developer advocate with a deep interest in ML and GPU computation. He started at GoodData in 2022 as a UX Designer, later switching to the role of Sr. Developer Advocate. He is currently part of the AI team as a backend developer, shifting his interest towards AI and how it can help people use their data more efficiently.

Why Metadata Is Enough (Most of the Time)

When you interact with data effectively, you're rarely dealing directly with raw values. Instead, you're primarily engaging with metadata, along with already computed metrics and aggregations. Data engineers and analysts also typically don't consume raw data directly; they rely heavily on defined functions, metrics, and computed results to identify trends, clusters, or anomalies.

This insight applies equally to AI-driven analytics. AI doesn't necessarily require access to your raw data to generate meaningful insights. Metadata alone (such as schema definitions, computed metrics, aggregations, and data relationships) is often sufficient for AI to execute precise analytical queries and produce reliable visualizations or actionable insights.

Let's start small with a concrete example using PandasAI. PandasAI effectively demonstrates how an AI model can utilize column names, data types, and computational functions without needing direct access to raw underlying data:

import pandasai as pai

# A simple DataFrame
sales_df = pai.DataFrame({
    "country": ["United States", "United Kingdom", "France", "Germany", "Italy", "Spain", "Canada", "Australia", "Japan", "China"],
    "revenue": [5000, 3200, 2900, 4100, 2300, 2100, 2500, 2600, 4500, 7000]
})

# API key for PandasAI (using BambooLLM by default)
pai.api_key.set("your-pai-api-key")

top_countries = sales_df.chat('Which are the top 5 countries by revenue?')

print(top_countries)
# Output: [China, United States, Japan, Germany, United Kingdom]

Here, the language model does not need to read the full raw data; it works from available metadata (like column names [“country”, “revenue”], data types, and predefined computations) to produce the required analytical operation, which is then executed against the DataFrame to return a precise answer. This foundational concept (metadata-based analytics) is precisely what GoodData expands upon at scale, integrating seamlessly into more complex and secure enterprise environments.
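
To make the idea of "metadata only" concrete, here is a minimal sketch in plain Python. It is not how PandasAI is implemented internally; the helper names `describe_schema` and `build_prompt` are made up for illustration. It only shows that a prompt can carry column names and types without carrying a single cell value.

```python
# Illustrative only: this is not PandasAI's internal implementation.
rows = [
    {"country": "United States", "revenue": 5000},
    {"country": "China", "revenue": 7000},
    {"country": "Japan", "revenue": 4500},
]


def describe_schema(records):
    """Return column names and Python type names, without any cell values."""
    schema = {}
    for record in records:
        for column, value in record.items():
            schema.setdefault(column, type(value).__name__)
    return schema


def build_prompt(schema, question):
    """Build the text a model would see: schema and question only."""
    columns = ", ".join(f"{name} ({dtype})" for name, dtype in schema.items())
    return f"Table columns: {columns}\nQuestion: {question}"


schema = describe_schema(rows)
prompt = build_prompt(schema, "Which are the top 5 countries by revenue?")
print(prompt)
```

The prompt contains the words "country" and "revenue" but no country names and no revenue figures, which is the property the rest of this article builds on.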

AI Analytics Scaling Beyond a Simple Dataframe

Running AI analytics on simple dataframes works well for straightforward scenarios, but real-world analytics typically involve multiple large datasets, complex relationships, diverse metrics, and data stored across different sources. Basic dataframe operations become insufficient as complexity grows. Out-of-the-box AI systems lack inherent understanding of intricate data relationships, business contexts, and permissions structures, which is precisely where metadata-driven analytics shines.

How Does Metadata-First AI Analytics Work in Practice?

For AI to effectively answer sophisticated business questions, it needs a deep and structured understanding of your data’s semantic organization: what data points represent, how they interrelate, and the business logic governing their usage. Metadata-first analytics enables AI to translate this comprehensive understanding into accurate dashboards, visualizations, or actionable insights without needing to directly handle or expose the raw underlying data.

In practice, this approach results in two clearly separated layers:

Data Layer: Secured raw data, accessed strictly within your control.
Metadata Layer: Contains structured definitions, such as table schemas, calculated metrics, dimension hierarchies, and permission rules, but not the data itself.

In GoodData’s metadata-first AI analytics architecture, the Large Language Model (LLM) interacts only with the metadata layer, which protects data privacy and security. The execution, computation, and actual data crunching are handled by deterministic algorithms.

Simplified architecture of metadata-first AI Analytics
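
The separation of the two layers can be sketched in a few lines of Python. This is a toy model under stated assumptions, not GoodData's engine: `METADATA`, `DATA`, `fake_llm_plan`, and `execute` are hypothetical names, and the "LLM" is a stub that returns a fixed structured request.

```python
# Hypothetical metadata layer: definitions only, no rows.
METADATA = {
    "metrics": {"total_revenue": {"fact": "revenue", "agg": "sum"}},
    "attributes": {"country"},
}

# Data layer: stays on the trusted side and is never placed in a prompt.
DATA = [
    {"country": "France", "revenue": 2900},
    {"country": "France", "revenue": 100},
    {"country": "Japan", "revenue": 4500},
]


def fake_llm_plan(question, metadata):
    """Stand-in for an LLM call: it receives metadata only and returns a structured request."""
    return {"metric": "total_revenue", "group_by": "country"}


def execute(plan, metadata, data):
    """Deterministic engine: validate the plan against metadata, then compute."""
    metric = metadata["metrics"].get(plan["metric"])
    if metric is None or plan["group_by"] not in metadata["attributes"]:
        raise ValueError("plan references objects that are not in the metadata layer")
    if metric["agg"] != "sum":
        raise ValueError(f"unsupported aggregation: {metric['agg']}")
    totals = {}
    for row in data:
        key = row[plan["group_by"]]
        totals[key] = totals.get(key, 0) + row[metric["fact"]]
    return totals


plan = fake_llm_plan("Revenue by country?", METADATA)
result = execute(plan, METADATA, DATA)
print(result)
```

The stub never receives `DATA`, and the engine refuses any plan that names a metric or attribute missing from the metadata layer.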

This metadata-first architecture isn’t just a design preference; it’s a response to how risky and fragile alternative approaches can be. It may be tempting to take what appears to be a simpler path: letting an LLM generate SQL directly. But this shortcut introduces serious trade-offs.

Why Letting LLMs Write SQL Is Playing with Fire

While letting an LLM write SQL may seem like a fast path to natural language querying, it introduces serious risks and long-term limitations. The risks are twofold:

SQL injections (a security time bomb)
Exposing the entire physical data model (an unnecessary vulnerability vector)

SQL Injections

While LLMs may not “intend” (using this term very generously) to cause you harm, imagine a scenario in which you are using BigQuery. You can, for example, fall into the SELECT * trap, which can be extremely costly. If you use an LLM to create your SQL, you inherently trust it not only to return valid results, but also to do so efficiently. And believe me, that's a dangerous assumption with very real financial consequences. If you’d like to learn more about that, see this Medium Article.
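
A back-of-the-envelope sketch shows why SELECT * hurts on a columnar warehouse that bills by bytes scanned. Every number here is hypothetical: the column sizes are invented and `PRICE_PER_TB` is a placeholder, not a quoted rate.

```python
# Hypothetical column sizes in bytes for one table; real sizes come from your warehouse.
COLUMN_BYTES = {
    "order_id": 8 * 10**9,
    "country": 20 * 10**9,
    "revenue": 8 * 10**9,
    "raw_payload": 2_000 * 10**9,
}


def scanned_bytes(columns):
    """Bytes a column-oriented engine would read for the selected columns."""
    if columns == ["*"]:
        columns = list(COLUMN_BYTES)
    return sum(COLUMN_BYTES[name] for name in columns)


def estimated_cost(columns, price_per_tb):
    """Cost under a pay-per-bytes-scanned model; price_per_tb is an assumed input."""
    return scanned_bytes(columns) / 10**12 * price_per_tb


PRICE_PER_TB = 5.0  # hypothetical price, not a quoted rate
narrow = estimated_cost(["country", "revenue"], PRICE_PER_TB)
wide = estimated_cost(["*"], PRICE_PER_TB)
print(f"narrow query: ${narrow:.2f}, SELECT *: ${wide:.2f}")
```

With these made-up sizes, one wide column dominates the bill, so a query that only needed two small columns pays for the whole table.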

And now the second issue. If you let an LLM generate the SQL that is executed against your database, a bad actor can use the LLM to inject SQL statements that read unauthorized data. We mitigate this with a complex set of permissions, which GoodData AI respects.

Exposing the Entire Physical Data Model

To write accurate SQL, the LLM must fully understand your database: table names, column types, relationships, and other schema-level details. This means exposing your entire physical data model to the LLM, significantly expanding your attack surface and reducing control. Even outside the SQL execution itself, it is not a great idea to expose your entire physical data model.

Even beyond the security concerns, this approach is fragile:

Small schema changes can break LLM-generated queries.
There's often no governance layer to validate or explain the logic.
Generated SQL is opaque, hard to audit, and difficult to maintain.

At that point, the need to fully access your schema is just a step away from the disclosure of your raw data. That’s not just risky, it undermines the principles of safe, governed analytics.

GoodData takes a fundamentally safer approach. We don’t let LLMs guess SQL. Instead, our semantic layer functions as a perfect abstraction to define queries in a structured, business-aligned way. Our analytics engine (refined for over a decade) translates these definitions, ensuring governance, stability, and security at every step.
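
As a rough illustration of the difference, the sketch below compiles a structured request into SQL from an allowlist instead of accepting free-form SQL. It is a simplified stand-in, not GoodData's engine: `SEMANTIC_LAYER` and `compile_query` are hypothetical, and SQLite is used only because it ships with Python.

```python
import sqlite3

# Hypothetical semantic layer: business names mapped to vetted SQL fragments.
SEMANTIC_LAYER = {
    "metrics": {"total_revenue": "SUM(revenue)"},
    "attributes": {"country": "country"},
    "table": "sales",
}


def compile_query(request):
    """Translate a structured request into SQL using only allowlisted identifiers."""
    metric_sql = SEMANTIC_LAYER["metrics"].get(request["metric"])
    attribute_sql = SEMANTIC_LAYER["attributes"].get(request["group_by"])
    if metric_sql is None or attribute_sql is None:
        raise ValueError("unknown metric or attribute")
    limit = int(request.get("limit", 10))
    sql = (
        f"SELECT {attribute_sql}, {metric_sql} AS value "
        f"FROM {SEMANTIC_LAYER['table']} "
        f"GROUP BY {attribute_sql} ORDER BY value DESC LIMIT ?"
    )
    # The only user-controlled value is passed as a bound parameter.
    return sql, (limit,)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (country TEXT, revenue INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("China", 7000), ("Japan", 4500), ("Italy", 2300)],
)

sql, params = compile_query({"metric": "total_revenue", "group_by": "country", "limit": 2})
top = conn.execute(sql, params).fetchall()
print(top)
```

A request such as `{"metric": "total_revenue; DROP TABLE sales", ...}` never reaches the database, because the name is looked up in the allowlist and rejected before any SQL is built.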

Does Metadata-First Mean You Can’t Use an LLM to Explain Your Data?

Adopting a metadata-first approach doesn't prevent you from selectively leveraging LLMs in advanced analytical scenarios. We're actively exploring specialized use cases where machine learning algorithms first analyze your data directly, and the summarized results — not the raw data itself — are then provided to an LLM. The LLM helps deliver intuitive explanations, simplifying complex analytical insights such as key driver analysis, clustering, or anomaly detection, effectively serving as an analyst on demand.

Crucially, these enhanced scenarios remain clearly defined and optional, and they can run on top of local LLMs, so customers can bring their own on-premise LLM. Deterministic algorithms will continue to securely manage core analytics computations, ensuring accuracy and reliability, while still offering flexibility to leverage LLM-driven insights when additional clarity or depth is beneficial.
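
The "analyze first, explain second" pattern can be sketched as follows. This is an assumption-laden toy: a simple z-score check stands in for the ML step, the data is invented, and `build_explanation_prompt` is a hypothetical helper. The point is that only the summary reaches the prompt, never the full series.

```python
from statistics import mean, pstdev

# Invented daily revenue series; it stays on the trusted side.
daily_revenue = [100, 102, 98, 101, 99, 180, 100]


def detect_anomalies(values, threshold=2.0):
    """Return (index, value, z-score) for points more than `threshold` std devs from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []
    return [
        (i, v, round((v - mu) / sigma, 2))
        for i, v in enumerate(values)
        if abs(v - mu) / sigma > threshold
    ]


def build_explanation_prompt(anomalies, n_points):
    """Build a prompt from the summarized findings only."""
    findings = "; ".join(
        f"point {i} had value {v} (z-score {z})" for i, v, z in anomalies
    )
    return (
        f"Out of {n_points} daily revenue points, these were flagged: {findings}. "
        "Explain in plain language what this might mean."
    )


anomalies = detect_anomalies(daily_revenue)
prompt = build_explanation_prompt(anomalies, len(daily_revenue))
print(prompt)
```

The deterministic step decides what is anomalous; the LLM would only be asked to phrase the finding for a human reader.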

Leveraging the Semantic Layer for Precision

At GoodData, our Logical Data Model (LDM) serves as the core semantic layer, capturing all the necessary context and metadata required for meaningful analytics. The LDM structures data clearly, logically, and intuitively — initially created for human analysts but equally powerful for AI applications.

To enhance the capabilities of our semantic layer further, GoodData leverages vector databases to store the semantic embeddings of analytical objects. This facilitates efficient semantic searches, enabling the LLM to rapidly identify and utilize the correct metadata definitions. GoodData's semantic search system automatically verifies the compatibility, correctness, and computability of all analytical objects before they are exposed to the LLM, ensuring accuracy and consistency in every AI-generated insight.
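
Here is a toy version of semantic search over analytical objects. It is not GoodData's search system: the catalog is hypothetical, and word counts stand in for the learned embeddings and vector database a real deployment would use. The ranking mechanics (embed, compare by cosine similarity, return the closest object) are the same in spirit.

```python
import math
from collections import Counter

# Hypothetical catalog of analytical objects: id -> description.
CATALOG = {
    "metric/total_revenue": "total revenue from all sales orders",
    "metric/active_customers": "number of customers active in the period",
    "attribute/country": "country where the customer is located",
}


def embed(text):
    """Toy embedding: word counts. A real system would use a learned embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Precomputed index, playing the role of the vector database.
INDEX = {object_id: embed(text) for object_id, text in CATALOG.items()}


def search(query, top_k=1):
    """Return the ids of the top_k objects most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(INDEX, key=lambda oid: cosine(query_vector, INDEX[oid]), reverse=True)
    return ranked[:top_k]


best = search("revenue from sales")
print(best)
```

Only the matching object ids and their definitions would then be handed to the LLM, which keeps the prompt small and limited to objects that actually exist.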

What Does the Future Hold for AI Analytics?

The AI industry is evolving at a pace that is very hard to keep up with, so making a blanket statement regarding its future would be very short-sighted. Instead, let’s look at how you can prepare for the future (how to be AI-friendly) and what the next big thing is.

How Analytics Can Be AI-Friendly

If there's one thing AI excels at, it’s generating, updating, and refining code. GoodData capitalizes on this strength by providing a robust, developer-centric analytics environment powered by a highly structured semantic layer.

At the core of GoodData’s AI success is Analytics as Code (AaC). AaC transforms analytics into a code-first practice, making it incredibly developer-friendly. Analytical objects are expressed in clean, versionable .yaml files. This structure enables developers and data analysts to seamlessly create, modify, and collaborate on analytics directly from their IDEs or command-line interfaces, just like managing code in Git.

Because analytics definitions are represented in structured, human-readable code, AI models can effortlessly understand and manipulate them. An LLM, armed with the semantic metadata, can translate natural language questions into precise visualizations, dashboards, or analytical objects. In practice, you can simply ask your AI assistant for a particular dashboard or metric, and it can quickly generate the corresponding .yaml definitions, ready to be integrated.

Metric and Dataset as code with their UI counterparts
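
One practical benefit of code-based definitions is that an AI-generated object can be checked before it is accepted. The sketch below validates a metric definition as it might look after loading a YAML file. The field names, the `{kind/id}` reference syntax, and the `KNOWN_OBJECTS` catalog are assumptions made for illustration, not GoodData's actual schema.

```python
import re

# Hypothetical catalog of objects that already exist in the workspace.
KNOWN_OBJECTS = {"fact/revenue", "fact/quantity", "label/country"}

# A metric definition as it might look after loading a YAML file.
# Field names and the {kind/id} reference syntax are assumptions for illustration.
candidate_metric = {
    "type": "metric",
    "id": "avg_order_value",
    "title": "Average Order Value",
    "expression": "SELECT SUM({fact/revenue}) / SUM({fact/quantity})",
}

REQUIRED_FIELDS = ("type", "id", "title", "expression")
REFERENCE = re.compile(r"\{([a-z_]+/[a-z0-9_.]+)\}")


def validate(definition, known_objects):
    """Return a list of problems; an empty list means the definition can be accepted."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in definition]
    for reference in REFERENCE.findall(definition.get("expression", "")):
        if reference not in known_objects:
            problems.append(f"unknown reference: {reference}")
    return problems


print(validate(candidate_metric, KNOWN_OBJECTS))
```

A check like this could run in CI next to the versioned files, so a definition that references a nonexistent fact is caught in review rather than on a dashboard.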

It is also very important to provide comprehensive APIs and SDKs, enabling developers and AI models alike to orchestrate complete analytics workflows programmatically. For example, an LLM can use GoodData’s API documentation to automate tasks such as creating analytical pipelines, updating metrics definitions, or dynamically generating visualizations tailored precisely to user requests.

In essence, to make analytics AI-friendly, you need to seamlessly integrate AI into every aspect of analytics, from defining new analytical components through structured code to orchestrating complex analytics tasks via APIs and SDKs, providing a fully flexible, scalable, and developer-focused analytics solution.

Sneak Peek into the Future of Analytics — Ontology

We're currently experimenting with ontology as the foundational source of truth for analytics. Ontology enhances the semantic layer by providing AI models with deep, structured knowledge of business contexts, concepts, and relationships. This structured representation allows AI to understand not just the data itself but the underlying business semantics and logic.

With ontology integrated into analytics, AI models gain the ability to understand complex business relationships as thoroughly as domain experts or seasoned data analysts.

Imagine an analytics future where decision-making is simplified to describing the decision you need to make. Your AI assistant, leveraging ontology-driven knowledge, instantly delivers comprehensive, actionable insights tailored precisely to your business context.

Conclusion

AI-driven analytics dramatically accelerates decision-making by seamlessly bridging the gap between complex questions and actionable insights. Looking ahead, AI integrated deeply with business semantics through ontology and structured metadata will revolutionize how decisions are made, transforming data analytics into proactive decision support systems that deliver insights exactly when you need them.

However, deploying AI analytics securely and reliably requires strong protection against data exposure and safeguards to ensure metric accuracy. GoodData comprehensively addresses these challenges with its metadata-first analytics architecture and Analytics as Code framework. By clearly separating raw data from AI interactions, GoodData keeps your data safe while ensuring your analytics remain precise, powerful, and adaptable to any scenario.

Experience the power and security of AI-driven analytics firsthand — sign up for a free GoodData trial today and explore how easily and effectively you can integrate intelligent analytics into your workflow.

Blog | tags:

Artificial Intelligence