Data Analytics and Machine Learning Integration
Written by Patrik Braborec
The rise of machine learning applications is indisputable. Almost every company tries to utilize the technology to grow its business, and the same can be said about data analytics. What does this mean? Every company wants to know what works, what does not, and what will work in the future. Combining data analytics and machine learning tools can significantly help companies answer these questions and make predictions.
The issue is that building data analytics and machine learning systems is difficult and usually requires highly specialized, skilled people. On top of that, the two worlds often operate separately: you need one set of people to build analytics and a different set to build machine learning. How can you overcome this? In this article, I will use a stock price prediction example to demonstrate that the right technologies let companies do data analytics and machine learning without employing dozens of software engineers, data scientists, and data engineers. Using the right technologies can deliver the right answers and save money. Let's dive into it!
Prediction of Stock Price with Data Analytics and Machine Learning
The best way to show how to build a data analytics and machine learning system is a real use case. As the title suggests, it will be stock price prediction. If you have read anything about stocks, you know that predicting stock prices is very difficult, maybe even impossible, because countless variables can influence a stock price. Why bother with something that is almost impossible, then? Well, the example I will show you is quite simple (please note that it is for demo purposes only), but at the end of the article I will share an idea of how the whole stock price prediction/analysis might be improved. Now, let's move to an overview of the example's architecture.
Overview of Architecture
You can imagine the whole architecture as a set of four key parts. Every part is responsible just for one thing, and data flows from the beginning (extract and load) to the end (machine learning).
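The four parts described below can be sketched as a single pipeline:

```
RapidAPI (extract)  ->  Python script (load)  ->  PostgreSQL input_stage
        ->  dbt (transform)  ->  PostgreSQL output_stage
        ->  GoodData (metrics & reports)  ->  PyCaret (train & predict)
```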
The solution I built for this article runs only locally on my computer, but it can be easily put, for example, into a CI/CD pipeline — if you are interested in this approach, you can check my article How to Automate Data Analytics Using CI/CD.
Part 1: Extract and Load
The extract part is done with the help of RapidAPI, which offers thousands of APIs with easy management. The best part of RapidAPI is that you can test individual APIs directly in the browser, which makes it very easy to find the API that best fits your needs. The load part (loading data into a PostgreSQL database) is done by a Python script. The result of this part is a schema input_stage with a data column of type JSON (the API response is JSON).
Part 2: Transform
The data lands in a PostgreSQL JSON column, and that is not something you want to connect to analytics: you would lose information about the individual items. The data therefore needs to be transformed, and dbt makes that quite easy. Simply put, dbt executes SQL scripts against your database schemas and transforms them into the desired output. Another advantage is that you can write tests and documentation, which is very helpful when building a bigger system. The result of this part is a schema output_stage with transformed data ready for analytics.
Part 3: Analytics
Once the data is extracted, loaded, and transformed, it can be consumed by analytics. GoodData makes it possible to create metrics using MAQL (a proprietary language for metric creation) and to prepare reports that are used to train an ML model. Another advantage is that GoodData is an API-first platform, so you can fetch data from the platform programmatically. You can use the API directly or use the GoodData Python SDK, which simplifies the process. The result of this part is a set of reports with metrics used to train an ML model.
Part 4: Machine Learning
PyCaret is an open-source machine learning library in Python that automates machine learning workflows. The library significantly simplifies the application of machine learning. Instead of writing a thousand lines of code that require deep domain knowledge, you write just a few lines, and being a professional data scientist is not a prerequisite. I would say that in some ways it is comparable to AutoML. Still, according to the PyCaret documentation, the library focuses on the emerging role of citizen data scientists: power users who can perform both simple and moderately sophisticated analytical tasks that would previously have required more technical expertise.
Example of Implementation
The following section describes key parts of the implementation. You can find the whole example in the repository gooddata-and-ml — feel free to try it on your own! I added notes to README.md on how to start.
Please note that to run the whole example successfully, you will need a database (such as PostgreSQL) and a GoodData account; you can use either GoodData Cloud with a 30-day trial or GoodData Community Edition.
Step 1: Extract and Load
To train an ML model, you need historical data. I used the Alpha Vantage API to get historical data on MSFT stock. The script needs the RapidAPI key and host; as mentioned above, RapidAPI helps with API management. If the API fetch is successful, the get_data function returns the data, which is then loaded into the PostgreSQL database (into the input_stage schema).
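A minimal sketch of that extract-and-load script. The endpoint path, the table name input_stage.stock_data, and the environment variable names are assumptions; adapt them to your setup:

```python
import json
import os

RAPIDAPI_HOST = "alpha-vantage.p.rapidapi.com"


def build_query_params(symbol: str) -> dict:
    """Query parameters for Alpha Vantage daily historical prices."""
    return {"function": "TIME_SERIES_DAILY", "symbol": symbol, "outputsize": "full"}


def get_data(symbol: str) -> dict:
    """Fetch historical prices via RapidAPI; raises on a failed request."""
    import requests  # imported lazily so the sketch can be read without the package

    response = requests.get(
        f"https://{RAPIDAPI_HOST}/query",
        params=build_query_params(symbol),
        headers={
            "X-RapidAPI-Key": os.environ["RAPIDAPI_KEY"],
            "X-RapidAPI-Host": RAPIDAPI_HOST,
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()


def load_data(data: dict) -> None:
    """Insert the raw JSON response into the input_stage schema."""
    import psycopg2  # needs the psycopg2-binary package

    with psycopg2.connect(os.environ["DATABASE_URL"]) as conn:
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO input_stage.stock_data (data) VALUES (%s)",
                [json.dumps(data)],
            )


if __name__ == "__main__":
    load_data(get_data("MSFT"))
```

The raw response is stored as-is; all flattening is deferred to the transform step, which keeps the load script trivial.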
Step 2: Transform
From the previous step, the data sits in input_stage and can be transformed. As discussed in the architecture overview, dbt transforms data using SQL scripts. The transformation of the loaded stock data extracts the fields from the JSON column and converts them into individual database columns.
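A sketch of what such a dbt model could look like. The model and source names are assumptions; the JSON keys follow the Alpha Vantage TIME_SERIES_DAILY response format:

```sql
-- models/stock_data.sql (hypothetical model name)
-- Flattens the raw JSON API response from input_stage into typed columns.
with raw as (
    select data from {{ source('input_stage', 'stock_data') }}
)

select
    entry.key::date                        as trade_date,
    (entry.value ->> '1. open')::numeric   as open,
    (entry.value ->> '2. high')::numeric   as high,
    (entry.value ->> '3. low')::numeric    as low,
    (entry.value ->> '4. close')::numeric  as close,
    (entry.value ->> '5. volume')::bigint  as volume
from raw,
     jsonb_each(raw.data::jsonb -> 'Time Series (Daily)') as entry
```

Running `dbt run` materializes this model into the output_stage schema, ready for analytics.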
Step 3: Analytics
The most important step is the metric definition using MAQL. For the demonstration, I computed a simple moving average (SMA) on the fact close (the price of the stock when the stock market closed). The formula for SMA is as follows:

SMA = (A1 + A2 + … + An) / n

where An is the price of the stock at period n, and n is the number of total periods.
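To make the formula concrete, here is the same computation in pandas. This is illustrative only: the prices are made up, and the article defines the metric in MAQL inside GoodData, not in Python:

```python
import pandas as pd

# A toy series of daily close prices (made-up values).
close = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0])

# A 3-period SMA for brevity; the article uses a 20-day window.
# Values are NaN until the window has enough periods.
sma = close.rolling(window=3).mean()
print(sma.tolist())
```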
Investors use the SMA and other metrics as technical indicators, which can help determine whether a stock price will continue to grow or decline. It is computed by summing a range of prices and dividing by the number of periods in that range. The definition of the SMA metric using MAQL is the following (you can see that I selected a 20-day range):
The ML model will not be trained on just this one metric but on the whole report. I created the report using GoodData Analytics Designer with a simple drag-and-drop experience:
Step 4: Machine Learning
The last step is to get the data from GoodData and train an ML model. Thanks to the GoodData Python SDK, it takes just a few lines of code; the same applies to the ML model thanks to PyCaret. The ML part consists of two function calls: setup and compare_models. setup initializes the training environment; compare_models trains and evaluates the performance of all estimators available in the model library using cross-validation, and its output is a scoring grid with average cross-validated scores. Once training is done, you can call predict_model, which predicts the value (in this case, the close price of the stock); see the next section for a demonstration.
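A sketch of that step, assuming the gooddata-pandas and pycaret packages are installed. The workspace and insight identifiers, environment variable names, and target column name are hypothetical; note also that newer PyCaret versions name the prediction column prediction_label rather than Label:

```python
import os

# Hypothetical identifiers: replace with your workspace and insight (report) IDs.
WORKSPACE_ID = "stock-analytics"
INSIGHT_ID = "msft-close-sma"
TARGET_COLUMN = "close"


def fetch_report():
    """Fetch the GoodData report as a pandas DataFrame."""
    from gooddata_pandas import GoodPandas  # needs the gooddata-pandas package

    gp = GoodPandas(host=os.environ["GOODDATA_HOST"], token=os.environ["GOODDATA_TOKEN"])
    return gp.data_frames(WORKSPACE_ID).for_insight(INSIGHT_ID)


def train_and_predict(df):
    """Train candidate models with PyCaret and predict the close price."""
    from pycaret.regression import compare_models, predict_model, setup  # needs pycaret

    setup(data=df, target=TARGET_COLUMN, session_id=42)  # initialize the training environment
    best_model = compare_models()  # cross-validate all estimators, keep the best one
    return predict_model(best_model, data=df)  # adds the prediction as a new column


if __name__ == "__main__":
    print(train_and_predict(fetch_report()).head())
```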
The demonstration covers just the last step (machine learning). If you run the machine learning script mentioned above, the first thing you will see is the data fetched from GoodData:
Immediately after that, PyCaret infers the data types and asks you whether you want to continue:
If everything is alright, you can continue, and PyCaret will train the models and pick the best one.
To run a prediction, execute the following code:
The result is as follows (Label is the predicted value):
That’s it! With PyCaret it is very easy to start with machine learning!
At the beginning of the article, I teased an idea for an improvement that I think might be pretty cool. In this article, I demonstrated a simple use case. Now imagine adding data from multiple other APIs/data sources: news (Yahoo Finance, Bloomberg, etc.), Twitter, and LinkedIn. It is well known that news and sentiment can influence stock prices, and conveniently, these AutoML tools also support sentiment analysis. If you combine all this data, train multiple models on top of it, and display the results in analytics, you can have a handy helper for investing in stocks. What do you think?
Thank you for reading! I want to hear your opinion: let us know in the comments, or join the GoodData community Slack to discuss this exciting topic. Do not forget to follow GoodData on Medium so you never miss new content. Thanks!