Written by Jan Soubusta
Recently I wrote an unconventional article about exposing analytics use cases in virtual reality. Though it was just a hackathon project, it pushed me to think about what APIs (and in which form) should be exposed by headless BI platforms.
An idea popped into my head: create a data application utilizing the full set of APIs that a headless BI platform should provide.
Currently, one of the most feature-rich data applications is the one allowing users to build reports (visualizations/charts/insights), so I decided to create such an application using Streamlit and our Python SDK.
This article is backed by an open-sourced demo. It contains not only the Streamlit app but also a corresponding end-to-end data pipeline. It is worth mentioning that the demo allows you to create a single pull request to deliver everything consistently:
- Extract from data sources and load to the data warehouse (Meltano)
- Data transformations (dbt models)
- Declarative definitions of analytics (GoodData)
- Data applications (VR demo, Streamlit)
Why Headless BI?
We describe it here.
In particular, you can connect Streamlit directly to data warehouses or even to files, but headless BI offers more:
- Declare a semantic model just once (logical data model, metrics, reports, …)
- Connect any clients (including Streamlit), while relying on a single source of truth
- Provide low enough latency to end users (scalability, caching)
- Prevent data warehouses from becoming performance bottlenecks or getting too costly
Let me spoil it here and show you the full picture first. This is a screenshot of the final application:
What can you see in the picture? What am I going to talk about in the following chapters?
Use cases in self-service analytics!
- Semantic model - presented in the left panel. Users build reports by selecting business names. No SQL!
- Reports - presented in the main canvas. Various visualization types.
- Interactivity - filters, sorting
- Context awareness - the catalog is filtered based on the already existing report
- Multi-tenancy - switch between multiple isolated workspaces
- Caching - both Streamlit and GoodData caching
If you want to start immediately with a hands-on experience instead of preparing the whole ecosystem on your laptop, you can try it here.
The demo repository contains all the information about how the semantic model is generated.
We want to expose the model to end users in the Streamlit data application. The Python SDK provides various functions for this purpose. It is possible to list each entity type separately (attributes, facts, metrics, etc.), and there is also a function that returns the full catalog.
Moreover, the SDK provides a function to filter the model by an already existing report. What does that mean? When you put some entities into a report, that can limit which other entities you can combine them with. The model consists of datasets connected by relations. Not all datasets have to be connected, and even when they are, the direction of the connection can affect whether entities can be combined.
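However the set of valid entity IDs is obtained from the backend (the SDK exposes a call for it, whose exact signature varies by version, so I will not guess it here), the client-side restriction of the catalog is then a trivial filter. A minimal sketch, with entity objects assumed to expose an `id` attribute as the SDK's catalog classes do:

```python
def restrict_entities(entities, valid_ids):
    """Keep only catalog entities the backend reported as still
    combinable with the entities already placed in the report."""
    return [entity for entity in entities if entity.id in set(valid_ids)]
```

The restricted list can then be fed to the same widgets that render the unrestricted catalog.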
Finally, we want to cache the catalog so we do not call the backend with every page refresh.
For instance, here is the function collecting the whole semantic model (catalog):
Then, a Streamlit component like “multiselect” can be populated by catalog entities:
Helper functions are used here to extract IDs and titles. Also, the Streamlit state is utilized here to set the selected values.
The Python SDK provides various options for executing reports. Because we are building a Python application, it makes sense to use the Pandas extension, which returns Pandas data frames. They can be printed 1:1 in Streamlit, or they can be passed directly as arguments to various visualization libraries; in this case, I use the Altair and Folium libraries.
We need to collect all the selected catalog entities and fill them into a report definition.
Every unique request is cached by Streamlit. It is possible to clear the cache by using a dedicated button in the left panel.
Although GoodData provides an editor for creating metrics in its custom MAQL language (which is far easier to use than SQL), users often just want to create very simple metrics like SUM(fact) or COUNT(attribute). The Streamlit application supports this, allowing users to pick a fact/attribute as a metric and to specify an aggregation function (SUM, COUNT, …) for each.
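Generating the MAQL body for such a simple metric from the user's picks is a one-liner; the helper name is mine, and whether COUNT applies to a given attribute depends on your model:

```python
def simple_metric_maql(entity_type: str, entity_id: str, function: str) -> str:
    """Build the MAQL body of a simple aggregation metric, e.g.
    SELECT SUM({fact/price}) or SELECT COUNT({attribute/order_id})."""
    return f"SELECT {function}({{{entity_type}/{entity_id}}})"
```

The resulting body can then be registered as a metric or used in an ad-hoc execution; the details depend on the SDK version.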
The application provides an option to pick an attribute as a filter. It is possible to list all the available values for each attribute and display them in the Streamlit “multiselect” component.
Here is how the attribute values can be collected from the server:
Though I implemented only positive attribute filters (the attribute value must equal one of the selected values), GoodData, through the Python SDK, provides many other filter types out-of-the-box, e.g. negative filters, metric value filters, date filters, etc.
I decided to apply sorting and paging in the Streamlit application, on the full result set (data frame). However, GoodData supports sorting/paging out-of-the-box, and in the future I would like to extend the current solution accordingly.
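The client-side approach currently looks roughly like this (column names and the default page size are illustrative):

```python
def sort_and_page(df, sort_by=None, ascending=True, page=0, page_size=20):
    """Sort and page the full result set (a pandas DataFrame) on the
    client side; GoodData could do both server-side instead."""
    if sort_by:
        df = df.sort_values(by=sort_by, ascending=ascending)
    start = page * page_size
    return df.iloc[start:start + page_size]

# st.dataframe(sort_and_page(df, sort_by="revenue", ascending=False, page=0))
```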
GoodData provides an option to create isolated workspaces. It is easy to support this in the Streamlit app — we just list the available workspaces, populate a dedicated “selectbox” with them, and let users pick the workspace they want to explore.
Why Streamlit Rocks?
It is really easy to onboard. Many building blocks are already implemented and easy to use, e.g. checkbox, multiselect, input box (text area), etc.
Streamlit offers first-class support for state management. It is easy to persist even complex variables to the state and access them (after a page reload) using dict or property syntax.
It is possible to cache even very complex structures. You simply use the @st.cache_data annotation, and the result of the annotated function is cached for each combination of the function's argument values.
Finally, Streamlit provides a good cloud offering. Developers must register, and then they can create apps and bind them to GitHub repositories. Any merge to the repository redeploys the app with zero downtime. Cool! Moreover, once the app is displayed in the browser, it provides a developer console containing logs, settings, etc.
Where Streamlit Fails?
Although state management is powerful and easy to use, it is sometimes tricky, especially when you need to refresh components based on changes in other components, which is the case with catalog filtering: when you pick an attribute in “View by”, it can limit the list of metrics. The most robust solution I found is to specify the “key” property of selectbox/multiselect components. But sometimes even that did not work as expected, and I spent hours finding a workaround. That is why the code is full of “debug” calls, btw ;-)
Regarding cache management: the @st.cache_data annotation can be put on class methods, but it does not work there. I contributed to the corresponding Streamlit forum thread.
With rising latency between the Streamlit application and GoodData, the application starts behaving weirdly during page reloads, e.g. the same selectbox is displayed twice, once active and once inactive.
Custom page design is quite hard to achieve. In my case, for instance, I wanted to create a top bar containing e.g. the workspace picker, but I did not find a solution for it. There is a corresponding issue that has been open for years.
Moreover, a typical self-service analytics application provides a drag-and-drop experience. However, implementing this feature with standard Streamlit building blocks seems impossible. Fortunately, my colleague successfully overcame this limitation by implementing a separate React application. This React application can easily be integrated with a native Streamlit app. I plan to write about the integration in a follow-up article.
Finally, I was sad to find that GitLab is not supported. What a pity! My pipeline benefits from GitLab a lot. To test the cloud deployment, I ended up pushing from my local machine to a GitHub “clone” repository, and it worked as expected. Personally, I would appreciate it if it were possible to trigger the deployment from the pipeline, even before the merge, to create a DEV environment that could be used as part of the code review. It would be perfect if the URL of such a DEV deployment could be posted to the pull request as a comment ;-)
So, Should You Use Streamlit?
Short answer — definitely yes.
Long answer — definitely yes, if you are OK with the limitations described in the previous chapter. On top of that, Streamlit (and Python in general) provides a huge amount of functionality and many libraries in the area of data analytics/science. Personally, I am most excited by the idea of mixing the demo app I described here with an embedded Jupyter notebook (a library for this exists) to provide a combined experience for data analysts/scientists.
Try Headless BI for Yourself
Ready to experience the power of headless BI? Start your 30-day free trial today.