AnyMind Group

[Tech Blog] Online Inference Using Vertex AI Models / Endpoints in the Early Phase of Machine Learning Introduction

Hi, I’m Naoki Komoto (河本 直起), working as a Machine Learning Engineer at AnyMind.

At AnyMind, we are developing our MLOps environment from scratch.

In previous articles, we introduced our efforts to build an MLOps environment, such as the data pipeline and the model training pipeline.

This time, I would like to introduce the online prediction infrastructure built on Vertex Pipelines (Kubeflow) and Vertex AI Models / Endpoints, which we adopted during the early phase of introducing machine learning.


As mentioned in the article about the model training pipeline, model serving methods at AnyMind varied from project to project, which led to developer-specific code and architecture. This drove up the cost of both improvement and operation, so we needed a common environment to bring those costs down.

At the same time, as a machine learning engineer I had to build the MLOps environment while also developing multiple machine learning features for release, so few man-hours could be devoted to the MLOps environment itself. It was therefore necessary to create a simplified common environment, on the assumption that it would be improved and expanded step by step.


To reduce the man-hours required for development, I set the following restrictions:

  • Using the RDB provided by the product
  • Returning all results with online prediction

I then developed the following simple architecture:

The model is trained using Vertex Pipelines, as described in this article. Features are fetched from the RDB of the product application. Once the model is trained, an image containing the trained model and the prediction logic is uploaded to Vertex AI Models and then deployed to Vertex AI Endpoints. The product application fetches features from the RDB and sends them to Vertex AI Endpoints to get prediction results. Fetching and preprocessing of features, uploading of the image to Vertex AI Models, and deployment to Vertex AI Endpoints are all executed from Vertex Pipelines.
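
The request made by the product application can be sketched roughly as follows. This is a minimal illustration and not the actual client code: the instance shape (`{"features": [...]}`), the column names, and the helper names `build_instances` and `request_prediction` are assumptions made for this example. Only `aiplatform.init`, `aiplatform.Endpoint`, and `Endpoint.predict` come from the google-cloud-aiplatform SDK.

```python
from typing import Any, Dict, List, Sequence


def build_instances(
    feature_rows: Sequence[Dict[str, Any]],
    feature_names: Sequence[str],
) -> List[Dict[str, Any]]:
    """Convert rows fetched from the RDB into prediction instances.

    The {"features": [...]} shape is a hypothetical convention; the real
    shape is whatever the prediction container expects.
    """
    return [
        {"features": [row[name] for name in feature_names]}
        for row in feature_rows
    ]


def request_prediction(
    project_id: str,
    region: str,
    endpoint_id: str,
    instances: List[Dict[str, Any]],
) -> List[Any]:
    """Send instances to a deployed Vertex AI Endpoint and return predictions."""
    # Imported here so that build_instances can be used without the SDK installed
    from google.cloud import aiplatform

    aiplatform.init(project=project_id, location=region)
    endpoint = aiplatform.Endpoint(endpoint_id)
    response = endpoint.predict(instances=instances)
    return list(response.predictions)


# Rows as they might come back from the product RDB (hypothetical columns)
rows = [{"user_id": 1, "age": 30, "clicks": 5}]
instances = build_instances(rows, ["age", "clicks"])
```

In this sketch the client has to know which columns the model uses and in which order, which is the kind of coupling discussed later in the article.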

Why Vertex AI Models / Endpoints?

The prediction API could have been developed using Cloud Run, but in the end I decided to implement it using Vertex AI Models / Endpoints.

The first reason was the ease of coupling with the model training pipeline on Vertex Pipelines (Kubeflow). An API that performs prediction using a model can be released only after the model has been trained. To achieve that release flow with a service such as Cloud Run, the release functionality would have to be developed separately for the prediction API. With Vertex AI Models / Endpoints, on the other hand, releasing the prediction API can be incorporated into the pipeline itself using the google-cloud-aiplatform SDK for Python.

The machine learning models used for prediction also need to be retrained and switched periodically. Since Vertex AI Models / Endpoints separates the interface from the models used internally, switching models after release can also be done easily as part of the pipeline using google-cloud-aiplatform.

The second reason was that we plan to incorporate batch prediction in the future. Batch prediction and online prediction require different machine resources. At the same time, any difference in processing between the two leads to differences in prediction results, so developers need to be restricted to processing that is common to both. A model in Vertex AI Models can be deployed to Vertex AI Endpoints for online prediction and can also be used for batch prediction through Vertex AI Batch Prediction.

That’s why I chose to use Vertex AI Models / Endpoints.

Implementation Details

I will introduce the details below.

Deployment of Prediction API

A simple API that loads the trained model and returns inference results for the input features is implemented with FastAPI. The API has to satisfy the interface that Vertex AI Models defines for serving containers.

Then, I prepared functions for uploading a model to Vertex AI Models and deploying the uploaded model to Vertex AI Endpoints, as shown below. These functions are executed as pipeline components. To serve trained models, the components are executed after the model training process in Vertex Pipelines (Kubeflow) is completed.
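
As a rough illustration of that interface, a custom serving container on Vertex AI receives a JSON body with an `instances` list on the predict route, returns a JSON body with a `predictions` list, and answers a health check route. Vertex AI passes the routes to the container through the `AIP_PREDICT_ROUTE` and `AIP_HEALTH_ROUTE` environment variables. The sketch below is not the actual API from this project: `load_model`, the stand-in `MeanModel`, and the `{"features": [...]}` instance shape are hypothetical.

```python
import os
from typing import Any, Dict, List

# Vertex AI sets these in the serving container; the defaults are for local runs
PREDICT_ROUTE = os.environ.get("AIP_PREDICT_ROUTE", "/predict")
HEALTH_ROUTE = os.environ.get("AIP_HEALTH_ROUTE", "/health")


class MeanModel:
    """Stand-in for a trained model: predicts the mean of each feature row."""

    def predict(self, rows: List[List[float]]) -> List[float]:
        return [sum(row) / len(row) for row in rows]


def load_model() -> MeanModel:
    """Hypothetical loader; a real container would load the trained artifact."""
    return MeanModel()


def handle_predict(model: Any, body: Dict[str, Any]) -> Dict[str, Any]:
    """Map a Vertex AI request body to a Vertex AI response body."""
    rows = [[float(v) for v in inst["features"]] for inst in body["instances"]]
    return {"predictions": model.predict(rows)}


def create_app() -> Any:
    """Build the FastAPI app around the pure functions above."""
    # Imported lazily so handle_predict can be tested without FastAPI installed
    from fastapi import FastAPI, Request

    app = FastAPI()
    model = load_model()

    @app.get(HEALTH_ROUTE)
    def health() -> Dict[str, str]:
        return {"status": "ok"}

    @app.post(PREDICT_ROUTE)
    async def predict(request: Request) -> Dict[str, Any]:
        return handle_predict(model, await request.json())

    return app


result = handle_predict(
    load_model(),
    {"instances": [{"features": [1, 2, 3]}, {"features": [4, 6]}]},
)
```

Assuming the file is saved as `main.py`, the app could be started locally with `uvicorn main:create_app --factory --host 0.0.0.0 --port 8080`. On Vertex AI the port to listen on is given by `AIP_HTTP_PORT`, which defaults to 8080.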

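def upload_predict_model(
        project_id: str,
        region: str,
        model_display_name: str,
        predict_image_uri: str,
        predict_route: str = "/predict"
    ) -> str:
    """
    light weight component to upload image as vertex ai model
    """
    from google.cloud import aiplatform

    # initialize vertex ai client
    aiplatform.init(project=project_id, location=region)

    # upload prediction image as vertex ai model
    # (the call arguments were lost in the original listing and are
    # reconstructed here from this function's parameters)
    model = aiplatform.Model.upload(
        display_name=model_display_name,
        serving_container_image_uri=predict_image_uri,
        serving_container_predict_route=predict_route,
    )
    return model_display_name

def deploy_predict_model_to_endpoint(
        service_account: str,
        project_id: str,
        region: str,
        machine_type: str,
        min_replica_count: int,
        max_replica_count: int,
        endpoint_display_name: str,
        input_model_display_name: str
    ) -> None:
    """
    light weight component to deploy uploaded vertex ai model to endpoint
    """
    from google.cloud import aiplatform

    aiplatform.init(project=project_id, location=region)

    # get existing endpoint
    # (filter and ordering are reconstructed; oldest first, so [-1] is the latest)
    exist_endpoints = aiplatform.Endpoint.list(
        filter=f'display_name="{endpoint_display_name}"',
        order_by="create_time",
    )
    # create new if not exist
    if len(exist_endpoints) == 0:
        endpoint = aiplatform.Endpoint.create(display_name=endpoint_display_name)
    else:
        # use latest endpoint
        endpoint = exist_endpoints[-1]

    # get deployed model
    deployed_models = endpoint.list_models()

    # get uploaded model
    # (filter and ordering are reconstructed in the same way as above)
    uploaded_models = aiplatform.Model.list(
        filter=f'display_name="{input_model_display_name}"',
        order_by="create_time",
    )
    if len(uploaded_models) == 0:
        raise ValueError(f"model_display_name [{input_model_display_name}] does not exist")
    else:
        uploaded_model = uploaded_models[-1]

    # deploy uploaded model and route all traffic to it
    # (call reconstructed from this function's parameters)
    endpoint.deploy(
        model=uploaded_model,
        machine_type=machine_type,
        min_replica_count=min_replica_count,
        max_replica_count=max_replica_count,
        traffic_percentage=100,
        service_account=service_account,
    )

    # undeploy other models (those deployed before this run)
    for deployed_model in deployed_models:
        print(f"undeploy: {deployed_model.id}")
        endpoint.undeploy(deployed_model_id=deployed_model.id)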

Overall Flow

The overall flow of how the prediction API is deployed is as follows:

The prediction API image is pushed to Container Registry in advance via Cloud Build. Once the model has been trained, the components described above are triggered: the prediction API image is uploaded to Vertex AI Models and then deployed to Vertex AI Endpoints. The trained model is loaded by the prediction API container.


This mechanism was developed to be expanded and improved in stages, and it has the following problems:

  • Inability to perform batch prediction
  • Client applications need to implement feature fetching
  • Interface is not fixed

Regarding the second problem, the mismatch between the scope of responsibility and the scope of development causes unnecessary communication and development costs. It also leads to differences between the features used for model training and those used for online prediction, because the features are not created by common logic. Regarding the third problem, because the features are part of the interface, the interface changes every time the features change, which requires client-side development each time. Both become bottlenecks when improving the model.

These problems have now been resolved, and I hope to introduce the solution in the next article.
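
The third problem can be illustrated with a small sketch. The feature names and payload below are hypothetical and are not taken from the actual product. The point is only that a client written against one feature set stops being valid as soon as the model expects another.

```python
from typing import Any, Dict, List

# Hypothetical feature sets for two successive model versions
FEATURES_V1 = ["age", "clicks"]
FEATURES_V2 = ["age", "clicks", "purchases"]


def missing_features(instance: Dict[str, Any], expected: List[str]) -> List[str]:
    """Return the expected feature names that the client did not send."""
    return [name for name in expected if name not in instance]


# Payload built by a client that was written against the first model version
client_instance = {"age": 30, "clicks": 5}

missing_for_v1 = missing_features(client_instance, FEATURES_V1)
missing_for_v2 = missing_features(client_instance, FEATURES_V2)
```

Here `missing_for_v1` is empty, while `missing_for_v2` contains `purchases`, so the client would need a code change before the second model could be released.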


In this article, I introduced the online prediction mechanism using machine learning models that we adopted in the early phase. I hope it will be of some help.
