Hi, I’m Naoki Komoto (河本 直起), working as a Machine Learning Engineer at AnyMind.
At AnyMind, we are developing our MLOps environment, and in a previous article I introduced the architecture used during the initial phase of our machine learning implementation. As a sequel, in this article I would like to introduce our current batch prediction architecture implemented with Vertex AI.
Previous Architecture
As described in that article, in the previous architecture the product application fetched features from its RDB and sent them in requests to a predict API. In addition, data used for training was fetched directly from the product application’s RDB.
Problems with the Previous Architecture
Regarding the predict API, the first problem is that the implementation on the product application depends on the features used by the machine learning model. The input/output of the predict API would change with each feature change, requiring development by the product application team each time. This also leads to unnecessary communication costs and bugs caused by miscommunication.
The second problem is that, since the product application is the one connected to the data store during prediction, the accumulation and reuse of prediction results must be implemented on the product application side.
In cases where prediction results needed to be accumulated, such as using them as features for other machine learning models or returning cached results for previously predicted records, we had to communicate with the product application team and have them implement the accumulation in each case. This communication was made harder by the fact that the data storage requirements of machine learning applications differ greatly from those of product applications: machine learning applications have specific requirements such as versioning of models and storing the data needed to verify prediction results. A further problem was that the RDB was originally designed to meet the requirements of the product application and did not fit the data storage requirements of the machine learning application.
The last problem concerns distributing requests on a model-by-model basis for A/B testing and canary releases; in the previous architecture, this distribution had to be done on the product application side. Vertex AI allows multiple Vertex AI Models to be deployed to a Vertex AI Endpoint and requests to be split between them in a defined proportion. However, because the features needed for prediction differ from model to model and were fetched by the product application, we could not take advantage of that functionality.
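For context, the Vertex AI traffic-split functionality referred to here can be used roughly as follows. This is only a minimal sketch with the google-cloud-aiplatform SDK; the project, region, endpoint name, and model IDs are placeholders, not our actual resources.

```python
from google.cloud import aiplatform

# Placeholders: replace with your own project, region, and model resource names.
aiplatform.init(project="my-project", location="asia-northeast1")

endpoint = aiplatform.Endpoint.create(display_name="demo-endpoint")
model_a = aiplatform.Model("projects/my-project/locations/asia-northeast1/models/111")
model_b = aiplatform.Model("projects/my-project/locations/asia-northeast1/models/222")

# Deploy the first model; it initially receives all traffic.
model_a.deploy(endpoint=endpoint, machine_type="n1-standard-2", traffic_percentage=100)

# Deploy a second model version with 10% of the traffic;
# the first model keeps the remaining 90%.
model_b.deploy(endpoint=endpoint, machine_type="n1-standard-2", traffic_percentage=10)
```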
In summary, in the previous architecture, the separation of responsibilities and the scope of actual development were not aligned. The machine learning engineering team had to delegate what should have been implemented by them to the product application team, resulting in high communication costs, miscommunication, and discrepancies with the requirements. In addition, because the interface exposed to the product application was not stable, development was required on the product application side for each release, making it difficult to improve the model frequently.
Requirements to Solve the Problems
In this section, I will explain what requirements are needed to resolve the above issues.
Creation of Data Sources for Machine Learning Application
First, the data source for machine learning must be isolated from the product application. Therefore, we created a new data source in BigQuery that is shared by the prediction and model training processes, and switched to it from the RDB. For more information on this, please refer to the following article.
Cloud Composer (Airflow) for Machine Learning Data Pipeline
As a result, we were able to migrate our data sources to BigQuery as follows.
Batch Prediction Process
Given the above issues, it is a requirement that the following two functions be handled on the machine learning application side and separated from the product application:
- Distribution of requests to machine learning models
- Storing of prediction results
Since the current business requirements can be met by batch prediction alone, we set performing batch prediction while meeting the above requirements as our first goal.
New Architecture
In the following sections, we introduce the new architecture that meets the above requirements.
Batch Prediction Flow
Vertex Pipelines (Kubeflow) is used for model training and batch prediction. For more information, please refer to the following article.
Vertex Pipelines (Kubeflow) for Machine Learning Model Training Pipeline
The flow of the model training and batch prediction is as follows.
First, a model training pipeline generates models from features. The subsequent batch prediction pipeline performs batch prediction using the generated models and stores the prediction results in BigQuery and Firestore (Datastore mode). BigQuery is mainly used to check and verify prediction results and to reuse them across machine learning applications as described below, while Firestore (Datastore mode) is used to retrieve prediction results from the serving API described below.
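For illustration, storing one batch of prediction results in both stores might look roughly like this. This is a minimal sketch, not our actual implementation; the table, Kind, and field names such as user_id are hypothetical.

```python
from google.cloud import bigquery, datastore

# Hypothetical names: replace with your own table and Kind.
RESULTS_TABLE = "my-project.predictions.recommend_v1"
DATASTORE_KIND = "recommend_v1"

def store_predictions(rows: list[dict]) -> None:
    """Store batch prediction results in BigQuery (verification/reuse)
    and Firestore in Datastore mode (low-latency serving)."""
    bq = bigquery.Client()
    errors = bq.insert_rows_json(RESULTS_TABLE, rows)
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")

    ds = datastore.Client()
    entities = []
    for row in rows:
        # Key each entity by the lookup key the serving API will receive (e.g. user_id).
        key = ds.key(DATASTORE_KIND, row["user_id"])
        entity = datastore.Entity(key=key)
        entity.update(row)
        entities.append(entity)
    ds.put_multi(entities)

# Example usage with a dummy prediction result.
store_predictions([{"user_id": "u1", "items": ["a", "b"], "model_version": "v1"}])
```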
The reason for using Firestore (Datastore mode) instead of Vertex Featurestore or Cloud Spanner is that it meets the necessary requirements and at the same time has the lowest estimated cost.
Serving of Prediction Results
The architecture of serving prediction results is as follows.
The product application sends the keys of the prediction results it needs (e.g., a user ID for per-user inference) and metadata (e.g., the top k of recommendations) to the newly added serving API. The serving API then retrieves the prediction results from Firestore (Datastore mode) based on the request and returns them. The serving API runs on Cloud Run.
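A minimal sketch of such a serving API is shown below, assuming FastAPI on Cloud Run; the endpoint path, Kind, and field names are hypothetical and only illustrate the lookup-by-key pattern.

```python
from fastapi import FastAPI, HTTPException
from google.cloud import datastore

app = FastAPI()
ds = datastore.Client()

# Hypothetical Kind name; in practice it is derived per model version (see below).
DATASTORE_KIND = "recommend_v1"

@app.get("/recommendations/{user_id}")
def get_recommendations(user_id: str, top_k: int = 10):
    """Return stored batch prediction results for one key (here, a user ID)."""
    entity = ds.get(ds.key(DATASTORE_KIND, user_id))
    if entity is None:
        raise HTTPException(status_code=404, detail="no prediction for this key")
    return {"user_id": user_id, "items": entity["items"][:top_k]}
```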
In this way, the accumulation of prediction results can be done on the machine learning application side, and the interface for the product application stays the same. This eliminates the need for development on the product application side when changing features or machine learning models.
Allocation of Requests per Model
As the interface for the product application, the serving API returns prediction results in response to requests. At the same time, the serving API internally distributes requests on a model-by-model basis for A/B testing and canary releases.
Currently, machine learning models are stored with {project}_{model version}_{time the model was trained} as the minimum unit. The {model version} is a unit that is switched when the model’s internal processing changes. Prediction results are then stored in units of {project}_{model version}.
The serving API keeps a configuration of which model versions to use and in what proportions. If the client does not specify a model version, the version to use is chosen according to the proportions defined in the configuration. Model versions are kept separate in the prediction results by using a separate table in BigQuery and a separate Kind in Firestore (Datastore mode) per version.
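A minimal sketch of how such configuration-driven version selection could work is shown below; the configuration format, function names, and naming scheme are assumptions for illustration, not our exact implementation.

```python
import random

# Hypothetical traffic configuration: model version -> proportion of requests.
TRAFFIC_CONFIG = {"v2": 0.9, "v3": 0.1}

def choose_model_version(requested_version: str | None = None) -> str:
    """Pick the model version for a request.

    If the client specifies a version, honor it; otherwise sample one
    according to the proportions defined in the configuration.
    """
    if requested_version is not None:
        return requested_version
    versions = list(TRAFFIC_CONFIG)
    weights = list(TRAFFIC_CONFIG.values())
    return random.choices(versions, weights=weights, k=1)[0]

def datastore_kind(project: str, version: str) -> str:
    # Prediction results are separated per model version, e.g. one Kind per
    # {project}_{model version} (naming is illustrative).
    return f"{project}_{version}"
```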
The following is how the prediction results are assigned on a model-by-model basis.
Deployment Unit
The overall machine learning process is divided into the following deployment units, each of which has its own version.
- Serving of prediction results
- Training model / Model-dependent data processing / Batch prediction
- Generation of features
This prevents duplicated implementation, such as creating a separate feature generation process for each model, while at the same time facilitating future generalization and making handover to other teams and division of work easier.
Use of Prediction Results across Machine Learning Applications
There are cases where we want to use the prediction results of one machine learning model as features for other models, for example, using predicted KOL categories as features for calculating KOL similarity, or sharing embeddings generated by one model across models.
In batch prediction and model training, BigQuery is used as the common interface for fetching prediction results. To avoid mixing prediction results from different models, the model version is pinned when the prediction results are fetched.
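As an illustration, fetching another model’s prediction results with a pinned version might look like the sketch below; the table naming and column names (kol_id, predicted_category) are hypothetical.

```python
from google.cloud import bigquery

def fetch_predictions(model_version: str) -> list:
    """Fetch the prediction results of one model, pinned to a specific version,
    for use as features in another model. Table naming is hypothetical."""
    client = bigquery.Client()
    query = f"""
        SELECT kol_id, predicted_category
        FROM `my-project.predictions.kol_category_{model_version}`
    """
    return list(client.query(query).result())

# Example: downstream training always reads from the pinned version "v2".
rows = fetch_predictions("v2")
```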
Although not yet developed, we are considering using the above serving API as a common interface for online prediction.
In addition, even processes that do not use machine learning models, such as text tokenization, are treated as models and used in the same way. By sharing prediction results, improvements in one model can lead to improvements in other models.
Summary
In this article, I have introduced our architecture for serving batch prediction results. I hope it will be helpful.