AnyMind Group

[Tech Blog] Improving the search for influencers using Elasticsearch in AnyTag

AnyTag is an influencer marketing platform that connects advertisers with influencers. Want to promote your business? No problem. The platform contains more than 300,000 influencers and content creators across Asia. To help you find the right influencer or content creator for your advertising campaign, we’ve built a search feature that lets users filter by:

  1. Number of followers
  2. Demographic attributes
  3. Engagement rate
  4. Gender
  5. Interests and many more…

But how does this actually work? That’s the main topic of this article.

The first try

In the beginning, we used Postgres for searches, which worked well initially as we only had a few thousand influencers and almost no platform users. As the business grew and more influencers and users joined, so did the load on the system. Postgres could no longer return the information in time: slow responses and timeouts became inevitable. At first, we tried to optimize our 300-line SQL queries, but we soon realized we had reached our limits. Optimizing them any further had little to no effect. That was the moment we decided to migrate to Elasticsearch.

Elasticsearch overview

Elasticsearch is a distributed, RESTful search and analytics engine capable of addressing a growing number of use cases. As the heart of the Elastic Stack, it centrally stores your data for lightning-fast search, fine‑tuned relevancy, and powerful analytics that scale with ease.

How does it work?

If you haven’t worked with Elasticsearch but have some RDBMS experience, then the following glossary may help you understand the rest of this article. Note, however, that the glossary is oversimplified and doesn’t imply a 1-to-1 match.

RDBMS               Elasticsearch
DB                  Index
Record, row         Document
Schema              Mapping
To insert, update   To index

Elasticsearch stores documents in indices that hold data in a data structure called an inverted index. Elasticsearch analyzes texts, splits text by terms (words, syllables, or however you configure it), and uses inverted indices to store term-to-documents relations. For example, a document A "Hello World" and a document B "Hello Tony" would be stored in the following way:

Term    Documents
hello   Document A, Document B
world   Document A
tony    Document B
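
The table above can be reproduced with a few lines of Python. This is only a toy sketch of the idea, assuming whitespace splitting and lowercasing as the "analysis" step; the function name build_inverted_index is ours, and real Elasticsearch analyzers and Lucene indices are far more involved.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercase term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Toy "analysis": lowercase the text and split it on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(doc_ids) for term, doc_ids in index.items()}


documents = {"A": "Hello World", "B": "Hello Tony"}
inverted_index = build_inverted_index(documents)
print(inverted_index)
# {'hello': ['A', 'B'], 'world': ['A'], 'tony': ['B']}
```

Looking up a term is then a single dictionary access instead of a scan over every document.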

To split texts into terms, Elasticsearch provides a variety of text analyzers. For example, the English analyzer turns "Hello World! This is Elasticsearch! I want a hamburger" into the terms "hello", "world", "elasticsearch", "i", "want", and "hamburg". Some words are filtered out because they are too common: "this", "is", and "a" do not become terms. Other words are reduced to their stems: "hamburger" becomes "hamburg".

By default, Elasticsearch tries to guess how to analyze your text. But you can help Elasticsearch by defining a mapping, since as a developer you know the nature of your data. For example:

"content": {
    "type": "text",
    "analyzer": "thai"
}

In this situation, the Thai language analyzer will be used to split texts into terms instead of the default one.
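
For context, a field definition like the one above normally sits under "mappings" and "properties" in the body sent when an index is created. The sketch below only builds that body in Python; the helper name build_index_body is an assumption made for illustration, and the field name "content" is simply reused from the fragment above.

```python
import json


def build_index_body(field_name, analyzer):
    """Build a create-index body that maps one text field to a given analyzer."""
    return {
        "mappings": {
            "properties": {
                field_name: {"type": "text", "analyzer": analyzer},
            }
        }
    }


# Body for a request such as PUT /<index>; the index name is up to you.
index_body = build_index_body("content", "thai")
print(json.dumps(index_body, indent=2))
```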

It is also worth mentioning that Elasticsearch is called "Elastic" for a reason: it stores data in shards that are distributed among multiple nodes. That greatly improves availability and load balancing, making the system stable, safe, and robust.

Shards are distributed among multiple nodes

One more major difference is that Elasticsearch indexes documents in near real-time which means that recently indexed documents don’t become visible for search immediately. Instead, Elasticsearch buffers data and, after some time, makes the buffered data visible. This is called "refresh". By default, the refresh interval is 1 second.
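
The effect of a refresh can be pictured with a toy model. The class below is purely illustrative and is not how Elasticsearch is implemented internally: it only assumes that newly indexed documents sit in a buffer until refresh() is called, and every name in it is ours.

```python
class NearRealTimeIndex:
    """Toy model: indexed documents become searchable only after a refresh."""

    def __init__(self):
        self._buffer = []      # indexed, but not yet visible to searches
        self._searchable = []  # visible to searches

    def index(self, document):
        self._buffer.append(document)

    def refresh(self):
        # Elasticsearch does this periodically (every 1 second by default).
        self._searchable.extend(self._buffer)
        self._buffer.clear()

    def search(self, word):
        return [d for d in self._searchable if word.lower() in d.lower().split()]


nrt = NearRealTimeIndex()
nrt.index("Hello World")
before_refresh = nrt.search("hello")  # nothing is visible yet
nrt.refresh()
after_refresh = nrt.search("hello")   # the document is visible now
print(before_refresh, after_refresh)
```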

Why are searches fast?

Even with the default configuration, Elasticsearch is incredibly fast at searching documents. But what makes it so fast? There are at least two major technical features worth focusing on:

  1. All documents are meant to be self-contained. No joins required as all needed data is already in one place.
  2. Inverted indices fit perfectly for full-text searches as you don’t have to scan the whole dataset sequentially. It’s as if every word in your favourite relational database had an index.

If you have decent experience in IT, then you know that there is no silver-bullet technology. That simply translates to "if something works too well, then something else works too badly". In Elasticsearch, the "too badly" part is indexing. Forget about transactions, real-time updates, and strong data consistency. Basically, you sacrifice "writes" for "reads".

Deployment and monitoring

In AnyTag, we use a managed Elasticsearch cluster provided by cloud.elastic.co with 2 data nodes and 1 tiebreaker node. The tiebreaker node doesn’t store data; it only takes part in master elections, so that a master node can always be chosen by majority and split-brain situations are avoided.

Elasticsearch deployment in AnyTag

It’s possible to monitor the cluster directly from cloud.elastic.co.

Deployment monitoring from cloud.elastic.co

However, we also use the Elasticvue extension for Google Chrome to visualize storage internals such as shards, indices, CPU load, etc.

Elasticvue monitoring

Testing and debugging

Elasticsearch can be easily tested. In AnyTag, we write code in Kotlin and Python, and we haven’t had any problems so far. In CircleCI, the CI/CD solution used in almost all AnyMind projects, you can simply run an Elasticsearch container for your tests:

  - image: docker.elastic.co/elasticsearch/elasticsearch:7.15.2
    environment:
      - transport.host: localhost
      - network.host: 127.0.0.1
      - http.port: 9200
      - cluster.name: es-test-cluster
      - discovery.type: single-node
      - xpack.security.enabled: false
      - ES_JAVA_OPTS: "-Xms256m -Xmx256m"

There is only one thing to remember: with the default configuration, you need to wait about 1 second between indexing and searching, because indexing is near real-time and data is not visible immediately, as mentioned earlier.
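
As an alternative to a fixed delay, the index API accepts a refresh query parameter, and refresh=wait_for makes the call return only once the document is searchable. The sketch below merely builds such a request with the Python standard library and does not send it; the helper name, the index name "influencers", and the sample document are assumptions made for illustration, not code from our test suite.

```python
import json
from urllib.request import Request


def build_index_request(base_url, index, doc_id, document, refresh="wait_for"):
    """Build (but do not send) a PUT request that indexes one document."""
    url = f"{base_url}/{index}/_doc/{doc_id}?refresh={refresh}"
    return Request(
        url,
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_index_request(
    "http://127.0.0.1:9200", "influencers", "1", {"name": "Tony"}
)
print(request.get_method(), request.full_url)
# In a test, the request would be sent with urllib.request.urlopen(request).
```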

Moreover, debugging is a pleasure in Elasticsearch because it provides RESTful APIs that can tell you how text is analyzed, why documents were or weren’t included in the results, etc.
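
Two endpoints we have in mind here are _analyze, which returns the terms an analyzer produces for a given text, and _explain, which describes why a specific document does or doesn’t match a query. The sketch below only assembles the method, path, and body for each call; the helper names, the index name "influencers", the document id, and the field "content" are hypothetical.

```python
import json


def analyze_request(analyzer, text):
    """Describe a call that shows how an analyzer splits a text into terms."""
    return {
        "method": "POST",
        "path": "/_analyze",
        "body": {"analyzer": analyzer, "text": text},
    }


def explain_request(index, doc_id, query):
    """Describe a call that explains why one document matches a query or not."""
    return {
        "method": "GET",
        "path": f"/{index}/_explain/{doc_id}",
        "body": {"query": query},
    }


analyze = analyze_request("english", "I want a hamburger")
explain = explain_request("influencers", "1", {"match": {"content": "hamburger"}})
for call in (analyze, explain):
    print(call["method"], call["path"], json.dumps(call["body"]))
```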

Pricing

In AnyTag, we have 2 data nodes with 8GB of RAM and 180GB of storage each. At the time of writing, the cost is $0.60 an hour, which comes to about $432 a month (0.60 × 24 hours × 30 days).

The impact on AnyTag

After the migration from Postgres to Elasticsearch, searches have become faster and better: web pages take no more than 2 seconds to load now and the influencers are finally sorted by relevance. Users are very happy to see how fast the search among 80 million social media posts works. To top it off, Elasticsearch has given us an opportunity to implement other search-related features for the future such as auto-complete.

Challenges

Elasticsearch, like any other technology, has its downsides too. In our situation, we encountered the following problems:

  1. Slow indexing
  2. Japanese language support

Slow indexing

With the default configuration, Elasticsearch is not that good at handling big volumes of incoming data. In AnyTag, we have very frequent data updates, which was a problem for our Elasticsearch deployment. Indexing responses were slow and CPU load was at 100% on all nodes. At some point, Elasticsearch rejected our requests after a 30-second timeout.

We had to consider the following:

  1. Don’t index those values that are not used for filtering
  2. Change the refresh interval

Let’s go one-by-one.

Don’t index those values that are not used for filtering

In Elasticsearch, every value is added to an inverted index by default to make it searchable. There are cases, however, when values are only meant to be read from storage and are never used for filtering. For example, we don’t filter by influencers’ profile pictures; we only render them on web pages. For this particular case, you can tell Elasticsearch not to add values to an inverted index:

 "profile_pic_url": {
    "type": "keyword",
    "index": false
 }

With "index": false, values can’t be filtered but can be read as usual. It reduces indexing time and storage.

Change the refresh interval

The refresh interval is the interval after which Elasticsearch makes recently written data available for search. By default, the value is 1 second, which doesn’t work well with frequent updates. We changed the value to 30s, which means that Elasticsearch waits up to 30 seconds before making newly indexed data visible. Surprisingly, this little change has had the biggest impact on indexing time. If you have trouble indexing big volumes of data, consider this first.
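
The refresh interval is an index-level setting named refresh_interval that can be changed on a live index through the update settings endpoint. A minimal sketch of the request follows; the helper name and the index name "influencers" are assumptions made for illustration.

```python
import json


def refresh_interval_request(index, interval):
    """Describe a call that changes the refresh interval of an existing index."""
    return {
        "method": "PUT",
        "path": f"/{index}/_settings",
        "body": {"index": {"refresh_interval": interval}},
    }


settings_call = refresh_interval_request("influencers", "30s")
print(settings_call["method"], settings_call["path"], json.dumps(settings_call["body"]))
```

The trade-off is the one described above: fewer refreshes mean cheaper indexing, but new data can take up to 30 seconds to show up in search results.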

Other optimizations for fast indexing can be found in the official documentation.

Japanese language support

Another very unexpected problem we faced was Japanese language support. Japan is the main contributor to AnyTag, so we have a special Japanese version of both the AnyCreator and AnyTag apps. As you might expect, Japanese managers search for Japanese influencers in Japanese. Here is the catch, though: nobody on the tech team speaks Japanese! How can a Japanese manager explain to a non-Japanese engineer why they can’t find relevant data in Japanese? Believe it or not, we managed to work it out gracefully even without speaking the language.

Luckily, Elasticsearch supports many text analyzers out of the box, including CJK (Chinese, Japanese, Korean). We used this analyzer from the beginning; however, the results contained too many irrelevant documents. To understand the problem, let’s take a look at the following sentence:

"世界中のボランティアの共同作業によって執筆及び作成されるフリーの多言語[インターネット百科事典である]"

It translates to:

"A free multilingual [Internet encyclopedia] written and created by the collaboration of volunteers from around the world"

Non-Japanese speakers might be surprised that the Japanese sentence is completely continuous, with no spaces between words. Moreover, the text mixes Katakana, Hiragana, and Kanji characters in a single sentence.

Maybe it’s not a big problem for native Japanese speakers, but for us it was an issue. After a short investigation, we found out that the CJK analyzer divides text into bigrams. For example, "ソウル" (souru) becomes "ソウ" (sou) and "ウル" (uru). Let’s draw an oversimplified analogy with English and say the text is "Hello World". CJK would split it into "he", "el", "ll", "lo", "ow", "wo", "or", "rl", and "ld". By default, Elasticsearch returns documents that contain any of the terms. Of course, many documents contain "he" and "el" because these bigrams are quite common in English. As a result, we had too many irrelevant documents returned. If we had the documents "Hello World", "Hey, how are you doing?", and "It was a hell of a day", then all of them would be returned for the search query "Hello World". Once again, this is just an example we introduced for the sake of simplicity; the CJK analyzer doesn’t actually split English text into bigrams.
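
The analogy can be tried out with a small sketch. It is a toy approximation and not the actual CJK analyzer: the bigrams helper simply drops spaces and punctuation and pairs up neighbouring characters, and matches_any imitates "return a document if it shares any term with the query". Both names are ours.

```python
def bigrams(text):
    """Split text into overlapping two-character terms (toy version)."""
    chars = [c for c in text.lower() if c.isalnum()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


def matches_any(query, document):
    """True if the document shares at least one bigram with the query."""
    return bool(set(bigrams(query)) & set(bigrams(document)))


documents = ["Hello World", "Hey, how are you doing?", "It was a hell of a day"]
query = "Hello World"

print(bigrams("ソウル"))  # ['ソウ', 'ウル']
print(bigrams(query))     # ['he', 'el', 'll', 'lo', 'ow', 'wo', 'or', 'rl', 'ld']
any_term_results = [doc for doc in documents if matches_any(query, doc)]
print(any_term_results)   # all three documents are returned
```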

Fortunately, we found a solution to this problem. We configured our search queries to return only documents that contain at least 75% of the query terms. In the example above, the query would require at least 75% of the terms "he", "el", "ll", "lo", "ow", "wo", "or", "rl", and "ld" to be present in a document. This solution tolerates typos (up to 25% of the terms may be missing) and returns relevant documents in most cases.
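
In query terms, a threshold like this can be expressed with the minimum_should_match parameter of a match query. The sketch below shows such a query body together with a toy simulation of the rule. It rests on a few assumptions: the field name "content" is hypothetical, the bigrams helper is a simplified stand-in for the CJK analyzer, the percentage is rounded down as we understand Elasticsearch does, and at least one matching term is always required.

```python
import math


def bigrams(text):
    """Split text into overlapping two-character terms (toy version)."""
    chars = [c for c in text.lower() if c.isalnum()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


def build_query(text, share="75%"):
    """Match query that requires a minimum share of the query terms."""
    return {"query": {"match": {"content": {"query": text, "minimum_should_match": share}}}}


def matches_min_share(query, document, share=0.75):
    """Toy simulation: does the document contain enough of the query bigrams?"""
    query_terms = set(bigrams(query))
    required = max(1, math.floor(len(query_terms) * share))  # 9 terms -> 6
    found = len(query_terms & set(bigrams(document)))
    return found >= required


documents = [
    "Hello World",
    "Helo World",  # typo: still shares 8 of the 9 query bigrams
    "Hey, how are you doing?",
    "It was a hell of a day",
]
query = "Hello World"
results = [doc for doc in documents if matches_min_share(query, doc)]
print(build_query(query))
print(results)  # ['Hello World', 'Helo World']
```

The two unrelated sentences share only 2 and 4 bigrams with the query, so they no longer pass the threshold, while the misspelled document still does.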

Summary

Elasticsearch might be an excellent tool in your arsenal. It’s fast and highly available; however, the learning curve is steep. If you want to use it in a production environment, then knowing Elasticsearch internals is a must. You can’t just install Elasticsearch and expect it to work fast on its own.
