Hello everybody, my name is Anton. I’m a tech lead at AnyTag, a product that you might have heard about from one of our previous articles.
In short, the product is a matching platform for social media influencers and big advertisers. Right now, we have 2 sub-platforms: AnyCreator and AnyTag. AnyCreator is a platform for influencers, where they can find advertising campaigns to join, while AnyTag is mostly for brands that look for influencers to promote their services and products.
■ Influencers
We have about 200 000 influencers in the database whose social media activity is tracked by the system on a daily basis. The number doesn’t seem so frightening, but don’t forget that we are not talking about typical social media users. We are talking about influencers who are very active: they post a lot, they have lots of followers, and they generate more data in general.
And even though we have so many influencers, the main load doesn’t actually come from influencers directly. It comes from keeping their social media data up-to-date. This aspect of the system is crucial: we need to store the most recent information from social media because we promote these influencers to our business clients on AnyTag.
■ Challenges
Before we define the challenges, let’s take a quick look at our architecture.
As you can see from the picture above, we have the following components:
- 1. Postgres databases where we store influencers’ data (Google Cloud SQL)
- 2. AnyTag back-end services where we process the business logic (GKE)
- 3. Async AWS Lambda functions that we use to get the social media data from 3rd-party APIs like the YouTube API, Facebook API, etc. (AWS)
The social media data update flow that we talked about is fairly simple:
- 1. K8S CRON jobs trigger REST services
- 2. The services trigger AWS Lambda functions that collect the social media data for influencers asynchronously
- 3. The functions get the data using 3rd party APIs and send the data back to the AnyTag back-end using REST APIs
- 4. The services persist the received data in the DB
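To make the last step concrete, here is a minimal sketch of what such an ingest endpoint could look like on the back-end, assuming a Spring-based service. The URL, table, and field names are hypothetical; the point is simply that every Lambda callback turns into one synchronous HTTP request and one DB write:

```java
import java.util.Map;

import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

// Hypothetical "before" picture: every Lambda callback is a synchronous
// HTTP request that turns into a DB write while the caller waits.
@RestController
public class SocialMediaStatsController {

    private final JdbcTemplate jdbcTemplate;

    public SocialMediaStatsController(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // Hypothetical endpoint that the Lambda functions call with the collected stats
    @PostMapping("/api/influencer-stats")
    public void ingest(@RequestBody Map<String, Object> stats) {
        jdbcTemplate.update(
            "INSERT INTO influencer_stats (influencer_id, followers) VALUES (?, ?)",
            stats.get("influencerId"), stats.get("followers"));
    }
}
```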
Everything had been working fine for a while before we faced our first back-end outages under load. The problems arose in step 3: we simply had too many concurrent updates coming from Lambda functions, which negatively affected the overall performance of the system. The DB struggled to handle so many transactions, and the back-end services couldn’t cope with the load. As a quick fix, we scaled the DB vertically and scaled the back-end horizontally by adding more replicas. However, we couldn’t buy more servers every time we had a performance problem. It was apparent that we couldn’t rely on REST APIs anymore, for the following reasons:
- 1. Poor load-balancing capabilities
- 2. Reliability issues
Let’s talk about these problems in more detail.
□ Poor load-balancing capabilities in REST APIs
Even with a load balancer running in front of your REST APIs, it’s quite challenging to cope with the load gracefully without harming the client-side. Let’s say your REST service cannot handle more than 100 requests at a time, and imagine it gets killed by K8S for reaching its memory limit (OOM) when request 101 arrives. What are you going to do with 100 000 requests coming at the service? There are at least 2 potential solutions: you can either buy more worker nodes to cope with the load or reject all incoming requests that you can’t process with HTTP status 429.
In the first scenario, everything is fine, except for the fact that communism hasn’t been established yet and you have to pay for every piece of hardware that you need. In the second scenario, the service stays healthy but the client-side has most of its requests rejected.
□ Reliability issues
REST API services are simply not reliable. If the back-end is offline, then all incoming requests fail, which leads to data loss. Even if the back-end is online but a system that it relies on is not (for example, a DB instance), the same problem arises.
■ The solution
As a result, we decided to migrate from REST API to async messaging. Given that most of our components run in the Google environment, Google PubSub seemed like an excellent candidate.
□ How does it work?
In short, you publish a message to a topic and read the message in a subscription.
- 1. A publisher (your client) sends a message to a topic
- 2. Google PubSub persists the message
- 3. Google PubSub sends the message to all connected subscriptions
- 4. A subscriber (your app) reads the message and processes it
- 5. If the message is processed correctly, then the subscriber sends an acknowledgment request to PubSub to signal that the message was processed with no problems. If Google PubSub receives a negative "ack" or doesn’t receive one at all (the timeout is configurable), then the message gets re-delivered to another subscriber.
A topic may have multiple subscriptions. In this case, Google PubSub delivers all messages to all connected subscriptions. A subscription, in turn, may have multiple subscribers (apps). In this case, the messages are load-balanced between subscribers.
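As an illustration, here is a minimal publisher sketch with the standard Google PubSub Java client; the project name, topic name, and payload are made up for the example:

```java
import com.google.api.core.ApiFuture;
import com.google.cloud.pubsub.v1.Publisher;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;

public final class StatsPublisher {

    public static void main(String[] args) throws Exception {
        // Hypothetical project and topic names
        TopicName topic = TopicName.of("my-project", "influencer-updates");
        Publisher publisher = Publisher.newBuilder(topic).build();

        PubsubMessage message = PubsubMessage.newBuilder()
            .setData(ByteString.copyFromUtf8("{\"influencerId\":42,\"followers\":123456}"))
            .build();

        // The future completes once PubSub has persisted the message (step 2)
        ApiFuture<String> messageId = publisher.publish(message);
        System.out.println("Published message " + messageId.get());

        publisher.shutdown();
    }
}
```

The returned future completes as soon as PubSub has persisted the message, so the publisher never waits for any subscriber.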
There are 2 types of supported subscriptions: pull and push. With pull, the subscriber side pulls the messages, which is good for throughput since the subscriber may pull multiple messages at once. With push, Google PubSub calls an endpoint on your back-end for every message, which is good for latency.
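On the receiving side, a pull subscriber with the same Java client looks roughly like the sketch below (the subscription name is, again, hypothetical); the ack() call is the acknowledgment from step 5 above:

```java
import com.google.cloud.pubsub.v1.AckReplyConsumer;
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.pubsub.v1.ProjectSubscriptionName;
import com.google.pubsub.v1.PubsubMessage;

public final class StatsSubscriber {

    public static void main(String[] args) {
        // Hypothetical subscription name
        ProjectSubscriptionName subscription =
            ProjectSubscriptionName.of("my-project", "influencer-updates-sub");

        // Called for every pulled message
        MessageReceiver receiver = (PubsubMessage message, AckReplyConsumer consumer) -> {
            System.out.println("Received: " + message.getData().toStringUtf8());
            consumer.ack(); // positive ack: PubSub will not re-deliver this message
        };

        Subscriber subscriber = Subscriber.newBuilder(subscription, receiver).build();
        subscriber.startAsync().awaitRunning();
        subscriber.awaitTerminated(); // keep pulling until the process is stopped
    }
}
```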
In our situation, we use pull because most of our auto-update processes work in the background. Among other notable features, I’d like to mention these:
- 1. Dead-letter support
- 2. Flow control
- 3. Message filtering
- 4. Message ordering
- 5. Data encryption
■ The impact on AnyTag
The migration from REST API has had a huge impact on the system’s performance. The improvements can be grouped into the following aspects:
- 1. System reliability
- 2. Async communication
- 3. Advanced load balancing
- 4. Error handling
- 5. Decoupling
- 6. Impact on CPU
Now, let’s cover the topics one by one.
□ System reliability
With Google PubSub, the back-end doesn’t have to be online when the client-side publishes messages. The only thing clients need to care about is whether Google PubSub persisted the messages successfully.
The back-end will read and process the messages right after its recovery.
Moreover, if there was a partial failure (a single replica died) in the middle of message processing, then Google PubSub re-delivers all unprocessed messages to a healthy subscriber. Isn’t it great?
□ Async communication
Google PubSub, unlike REST API, is asynchronous by nature, which has had a big impact on the system. Now the client-side publishes messages without waiting for the back-end to process anything. The only thing the client-side has to wait for is for Google PubSub to persist a message. In certain cases, this aspect of communication is vital: for example, with AWS Lambda you pay for execution time.
Let’s consider an example where request processing takes 10 seconds on average. This is how it had been working before with REST API:
- – A Lambda function gets the data (~5 sec) from 3rd party APIs
- – The function calls the REST API and waits until the request has finished (~10 sec)
In total, the function had to wait for 15 seconds.
The situation changes dramatically with Google PubSub:
- – A Lambda function gets the data (~5 sec) from 3rd party APIs
- – The function publishes a message to Google PubSub (~100 ms)
In total, the function has to wait for only about 5 seconds, reducing the cost of running the AWS Lambda functions.
□ Advanced load balancing
With Google PubSub flow control, the back-end side can control the message throughput gracefully without harming the client-side. Flow control lets you cap the number of messages that the subscriber side pulls from Google PubSub at once, which limits the memory footprint and lets the load be shared evenly with other subscribers.
For example, if you have a pod that can store or handle no more than 100 requests, then you can configure flow control to pull no more than 100 messages at a time. As a result, the subscriber side will not request more messages from Google PubSub than it can handle. This is especially important when the flow of messages is higher than the pace at which the back-end processes the requests.
Let’s be more specific and say you have 10 000 messages in a topic. A pod with flow control configured to store no more than 100 messages will read just the first 100 messages, leaving the remaining 9900 messages for other pods to process.
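With the plain Java client, this limit can be expressed through flow control settings on the subscriber builder. The sketch below reuses the `subscription` and `receiver` from the earlier subscriber example and applies the 100-message cap from above; the byte limit is just an illustrative value:

```java
import com.google.api.gax.batching.FlowControlSettings;
import com.google.cloud.pubsub.v1.Subscriber;

// Keep at most 100 unacknowledged messages (and ~50 MB of data) in memory;
// the remaining messages stay in PubSub for other pods to pull.
FlowControlSettings flowControl = FlowControlSettings.newBuilder()
    .setMaxOutstandingElementCount(100L)
    .setMaxOutstandingRequestBytes(50L * 1024 * 1024)
    .build();

Subscriber subscriber = Subscriber.newBuilder(subscription, receiver)
    .setFlowControlSettings(flowControl)
    .build();
```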
□ Error handling
With PubSub, it’s easier to handle certain errors. Let’s imagine a situation when your DB was temporarily offline but then recovered very quickly. This is how it had been working with REST APIs:
- 1. An async Lambda function gets the data from 3rd party APIs
- 2. The function calls a REST service
- 3. The REST service fails with status 5xx because the DB was offline
- 4. The Lambda fails
- 5. AWS re-runs the function, repeating everything from the beginning, even step 1.
As a result, you re-run code that executed with no problems at all. With PubSub, the situation changes again:
- 1. An async Lambda function gets the data from 3rd party APIs
- 2. The function publishes a message and exits
- 3. The subscriber fails to handle a message because the DB was offline
- 4. The message is re-delivered to the subscriber again later on.
As a result, you only re-process the part that actually failed, not the whole flow from the beginning.
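In code, the difference comes down to which reply the subscriber sends. Below is a sketch of such a receiver; `saveToDatabase` is a hypothetical placeholder for our persistence logic:

```java
import com.google.cloud.pubsub.v1.AckReplyConsumer;
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.pubsub.v1.PubsubMessage;

public final class StatsReceiver {

    // Hypothetical placeholder for our persistence logic (e.g. a write to Postgres)
    static void saveToDatabase(String payload) {
        // ...
    }

    static final MessageReceiver RECEIVER = (PubsubMessage message, AckReplyConsumer consumer) -> {
        try {
            saveToDatabase(message.getData().toStringUtf8());
            consumer.ack();  // success: the message is done
        } catch (Exception e) {
            consumer.nack(); // failure (e.g. the DB is offline): PubSub will re-deliver it later
        }
    };
}
```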
□ Decoupling
The system becomes decoupled, meaning that the client-side doesn’t have to know your REST API URLs at all. You just publish messages to an abstract entity called a “topic”. This is especially crucial in microservice environments with dozens of services.
□ Impact on CPU
The system has become more stable and processes more requests with very limited computing resources.
As you can see in the picture, the CPU load on the back-end side decreased after the migration from REST API to PubSub on the 10th of April, even though the load got significantly higher.
■ Challenges, pitfalls, and problems
Unfortunately, everything has downsides and Google PubSub is no exception.
□ Infinite loops and dead letters
It’s very easy to get into an infinite loop with just 1 faulty message if dead lettering is not configured correctly. Let’s say you have a message that cannot be processed because of a bug in your software (division by zero, for example). In this situation, the subscriber acknowledges the message negatively, which leads to the message being re-delivered to a different subscriber. If that subscriber cannot handle the message either, it gets re-delivered again, and again, and again until the message expires (by default, after 7 days). If you are not careful enough, this can exhaust the system and put it offline.
Thankfully, it’s possible to configure dead lettering so that Google PubSub stops the re-delivery process after a certain number of retries. With dead lettering on, "bad" messages are stored in a separate topic, from which they can be re-processed later.
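As an illustration, a dead letter policy can be attached when a subscription is created with the Java admin client; the project, topic, and subscription names below are hypothetical, and 5 delivery attempts is just an example value:

```java
import com.google.cloud.pubsub.v1.SubscriptionAdminClient;
import com.google.pubsub.v1.DeadLetterPolicy;
import com.google.pubsub.v1.ProjectSubscriptionName;
import com.google.pubsub.v1.Subscription;
import com.google.pubsub.v1.TopicName;

public final class DeadLetterSetup {

    public static void main(String[] args) throws Exception {
        try (SubscriptionAdminClient admin = SubscriptionAdminClient.create()) {
            Subscription request = Subscription.newBuilder()
                // Hypothetical project, topic, and subscription names
                .setName(ProjectSubscriptionName.of("my-project", "influencer-updates-sub").toString())
                .setTopic(TopicName.of("my-project", "influencer-updates").toString())
                .setDeadLetterPolicy(DeadLetterPolicy.newBuilder()
                    // "Bad" messages are parked here after 5 failed delivery attempts
                    .setDeadLetterTopic(TopicName.of("my-project", "influencer-updates-dead").toString())
                    .setMaxDeliveryAttempts(5)
                    .build())
                .build();
            admin.createSubscription(request);
        }
    }
}
```

Keep in mind that the PubSub service account also needs permission to publish to the dead letter topic and to subscribe to the original subscription; without those permissions, the policy doesn’t take effect, which lines up with the trap described below.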
However, be very careful with dead letters. At first glance, it may seem to work fine when that’s not the case. Always pay attention to the dead letter topic’s permissions: if the permissions are missing, the infinite loop issue is still there, hiding and waiting to kill the back-end.
Moreover, even if you don’t see the warning, maybe you just don’t have the permissions to get the warning about missing permissions!
□ Spring GCP library spawns too many threads
The Spring GCP library creates way too many threads. This is what you may observe in Java VisualVM one day:
After digging into the source code, we found out that the library generated 6 threads per subscription that we defined in our app. Most of these threads are created for handling scheduled tasks and not for message processing itself. Thankfully, it’s possible to reconfigure the library to use one fixed-size shared thread pool for this purpose.
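The threads ultimately come from the underlying Google client, so the idea can be illustrated with the plain Java client: hand every subscriber one shared fixed-size scheduler instead of letting each subscription create its own pool. The sketch reuses the `subscription` and `receiver` from the earlier examples; Spring Cloud GCP exposes similar knobs through its subscriber executor configuration properties:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;

import com.google.api.gax.core.FixedExecutorProvider;
import com.google.cloud.pubsub.v1.Subscriber;

// One fixed-size scheduler shared by all subscriptions in the app
ScheduledExecutorService sharedPool = Executors.newScheduledThreadPool(4);

Subscriber subscriber = Subscriber.newBuilder(subscription, receiver)
    .setExecutorProvider(FixedExecutorProvider.create(sharedPool))
    .build();
```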
■ Conclusion
Google PubSub may be a good fit for your APIs, but it has to be used wisely. Always monitor your system logs and metrics before drawing any conclusions.