– by Sourav, AnyMind Datascience team
At AnyMind Group, we take a keen interest in trends expressed by our influencers and their followers. Our users span commercially important locations such as Japan, Vietnam, Thailand, the Philippines, Indonesia and Taiwan. Our core users post and comment in various languages across multiple social networking services (SNS) such as Instagram, Twitter and Facebook. Our data often includes posts and comments that mix more than one language and carry multiple hashtags.
Multimedia, in the form of images and videos, may not always give the full picture, but comments and post descriptions convey important feedback. Sentiment analysis, or opinion mining, is the computational study of user opinions, sentiments, and attitudes towards products, services, and issues. Tracking the mood of users in this way creates actionable knowledge. In contemporary business culture, understanding the customer is becoming increasingly important. Knowledge of customer likes and dislikes can be used to understand and predict future business directions. We consider sentiment an important metric for improving business strategies and gaining insights.
In this blog post, we will go over how sentiment analysis has progressed and discuss the state-of-the-art methods that we employ for social media posts across Instagram, YouTube and other platforms.
In the past, several different methods were used to understand the emotion behind social media messages. Shallow perceptrons, deep belief networks, convolutional neural networks (CNNs), and sequence modeling methods such as recurrent neural networks (RNNs) and long short-term memory networks (LSTMs), along with their derivative architectures, have all been tried.
For a very long time, LSTM-based methods (especially bi-directional LSTMs, i.e. models that read the input both front to back and back to front) remained the dominant way to model our data. We encoded input sequences as word embeddings, or processed them with statistical techniques such as TF-IDF, before learning against a perceptual score signifying the emotion (e.g. -1 for unhappy, 0 for neutral and +1 for happy). There were several challenges: the availability of high-quality annotated multilingual data, expensive computation, and the reliability of deployment. With LSTMs, data scientists could daisy-chain only a few LSTM units (roughly 3-4). Consequently, we had to split our inputs into chunks of manually chosen size before processing them.
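To make the older feature pipeline concrete, here is a minimal sketch of plain TF-IDF weighting paired with the -1/0/+1 scores described above. The example comments, the whitespace tokenization and the unsmoothed formula are assumptions chosen for illustration, not our production preprocessing.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {token: weight} dict per document (plain TF-IDF, no smoothing)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each token appear?
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append({
            tok: (cnt / len(toks)) * math.log(n_docs / df[tok])
            for tok, cnt in counts.items()
        })
    return vectors

# Hypothetical annotated comments: -1 unhappy, 0 neutral, +1 happy.
comments = ["love this product", "hate this product", "this product arrived"]
scores = [1, -1, 0]
features = tf_idf(comments)
```

Tokens that occur in every comment (here "this" and "product") receive a weight of zero, while distinctive tokens such as "love" and "hate" carry the signal that a downstream model would learn against the scores.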
"Enter the Transformer"
In the past few years, there have been significant advances in the Natural Language Processing domain. Our focus in this article is on one breakthrough architecture: the Transformer, proposed by Vaswani et al. in 2017. Transformers excel at modeling sequential inputs. Since natural language is sequential to begin with, we have seen tremendous progress in modeling it. At their core, Transformers are stacks of encoders and decoders, with associated embedding layers. Encoders contain the all-important self-attention layers that compute relationships between input words.
While processing our inputs, the attention mechanism in a Transformer enables the model to focus on words in the input that are closely related. For example, consider the sentence: A boy is rolling a red ball down a slope. The token ball is closely related to red and rolling, but less so to boy. The Transformer architecture uses self-attention to statistically learn the relationship between every word in the input sequence and every other word. These architectural changes have resulted in far better performance than conventional designs.
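The idea can be sketched with scaled dot-product attention, the core operation of self-attention. This is a toy illustration under loud assumptions: the two-dimensional vectors are hand-picked rather than learned, and the same vectors serve as queries, keys and values, whereas a real Transformer derives each from learned projections.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of the value vectors.
        outputs.append([sum(wj * v[i] for wj, v in zip(w, values))
                        for i in range(len(values[0]))])
    return outputs, weights

# Hand-picked toy embeddings (assumption): "ball" points nearly the same way as "red".
tokens = ["boy", "red", "ball"]
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.2, 1.0]]
outputs, weights = self_attention(embeddings, embeddings, embeddings)
ball_weights = dict(zip(tokens, weights[2]))
```

With these toy vectors, the row of weights for ball places more attention on red than on boy, mirroring the example sentence. In a trained model, such relationships emerge from the data rather than being set by hand.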
In our sentiment analysis workflow, we are working with BERT, a Transformer-based language model. Its key innovation is bi-directional training on the input sequence: as a masked-language model, it learns to predict a hidden token from the context on both sides of it. Our training corpus consists of user-submitted social media posts and comments, which we have carefully scored in a pre-designated range. We fine-tune our BERT model on the original text together with a simplified translation (without any special tokens added). We have observed that the model captures a deep sense of language context and flow in user comments, with an accuracy far superior to any of our previous iterations, even though a single model handles inputs in multiple languages.
For these workflow implementations, we use publicly available APIs from HuggingFace Inc. With thousands of open-source commits, the Huggingface APIs are well tested across several different sequence modeling tasks. Apart from the output layers, the model architecture remains the same between Google's pre-training and our fine-tuning process. This is very beneficial to us, since we start from a model that Google has already pre-trained on very large text corpora. The pre-trained model parameters are used to initialize networks for specific downstream tasks, and all parameters are fine-tuned to our requirements. Using tokenizers available from Huggingface, we can apply special tokens such as [SEP] to our multilingual data to better annotate the model inputs.
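As a rough illustration of what these special tokens do, the sketch below lays out a BERT-style input for a pair of texts: [CLS] opens the sequence, [SEP] closes each segment, and segment ids mark which text a token belongs to. The function name and the whitespace split are assumptions made to keep the example self-contained; a real tokenizer splits text into WordPiece sub-words and adds these tokens for you.

```python
def build_pair_input(text_a, text_b):
    """Lay out two texts the way BERT expects a sentence pair."""
    tokens_a = text_a.split()  # stand-in for WordPiece tokenization
    tokens_b = text_b.split()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], the first text and its [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

# Hypothetical comment and its simplified translation.
tokens, segment_ids = build_pair_input("kawaii desu", "so cute")
```

Here `tokens` starts with [CLS] and contains one [SEP] after each text, and `segment_ids` has one entry per token.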
BERT is publicly available as BASE and LARGE models, which come in variants such as multilingual-uncased. In our sentiment analysis task, we have built on the state-of-the-art designs provided with the BASE architecture, which employs approximately 110 million parameters. Whereas our previous designs relied on elaborate PyTorch methods for tokenizing and data loading, we have now abstracted away much of that code and made the core structure reusable. With this approach, we fine-tune on annotated social media comments using model-agnostic training functions. Thereafter, we deploy the inference logic by relying on Huggingface's pipeline() API. This approach delegates most of the complex code to the library, reducing bugs and memory overhead:
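```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# model_checkpoint: hub name or local path of the fine-tuned 3-class model
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=3)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
# return_all_scores=True yields a score for every label (newer releases use top_k=None)
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer, return_all_scores=True)
...
results = [classifier(text) for text in texts]  # texts: a list of strings
## results: NEGATIVE, with score: 0.998
## POSITIVE, with score: 0.77
## ...
## NEUTRAL, with score: 0.48
```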
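When every label's score is returned, a small helper can reduce each result to its most likely label. The sketch below is an assumption about how one might post-process the output, not part of the library. It accepts both a flat list of label/score dictionaries and the same list wrapped in an extra list, since the exact nesting has differed between library versions. The sample scores are made up.

```python
def top_label(result):
    """Return (label, score) for the highest-scoring label of one input."""
    # Unwrap the extra list that some versions place around a single input.
    while result and isinstance(result[0], list):
        result = result[0]
    best = max(result, key=lambda item: item["score"])
    return best["label"], best["score"]

# Made-up output for one comment, in the nested shape.
sample = [[
    {"label": "NEGATIVE", "score": 0.02},
    {"label": "NEUTRAL", "score": 0.21},
    {"label": "POSITIVE", "score": 0.77},
]]
label, score = top_label(sample)
```

For the made-up scores above, the helper selects the POSITIVE entry.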
pipeline() is lightweight and needs only a few PyTorch modules to run. Inference can be done on standard E2 GCP instances with as little as 2 vCPU and 4 GB of RAM. Our inference platform can process up to 10 large blocks of multilingual social media text per second. Furthermore, whenever we need to upgrade our model to cover more data or adopt a better architecture, we only need to swap out the model and tokenizer types. Because the training routine is model-agnostic, we only need to specify the new model type. All of the aforementioned services can be deployed using Docker, TorchServe or Kubernetes-based services. By following these practices, we can rapidly adopt the state of the art in sentiment analysis while deriving accurate insights from our users.
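One way to keep such swaps cheap is to hold the checkpoint name in a small configuration object and let the Auto classes resolve the matching model and tokenizer. The sketch below is an assumption about how this could be organized, not our deployed code. The class name, function name and checkpoint paths are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SentimentModelConfig:
    checkpoint: str      # hub name or local path of a fine-tuned model
    num_labels: int = 3  # negative / neutral / positive

def load_classifier(config):
    """Build an inference pipeline for whichever checkpoint the config names."""
    # Imported lazily so that configs can be created without loading PyTorch.
    from transformers import (AutoModelForSequenceClassification,
                              AutoTokenizer, pipeline)
    model = AutoModelForSequenceClassification.from_pretrained(
        config.checkpoint, num_labels=config.num_labels)
    tokenizer = AutoTokenizer.from_pretrained(config.checkpoint)
    return pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

# Hypothetical local paths; upgrading the model means changing only this value.
current = SentimentModelConfig(checkpoint="models/sentiment-bert-base")
candidate = SentimentModelConfig(checkpoint="models/sentiment-next")
```

Calling `load_classifier(candidate)` instead of `load_classifier(current)` would then be the only change needed at inference time, provided the new checkpoint was fine-tuned with the same label set.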
Systematic reviews in sentiment analysis: a tertiary study, Ligthart et al., 2021
Recurrent neural networks, Neural Computation, Vol. 31 Issue 7, 2019
Long short-term memory, Hochreiter & Schmidhuber, 1997
Attention is all you need, Vaswani et al., NIPS 2017
BERT, Google Research, 2018
Huggingface pipelines, https://huggingface.co/transformers/main_classes/pipelines.html