Social media's influence today is massive, spanning personal, social, political, economic, and cultural realms. Monitoring user sentiment helps organizations quickly understand public reaction to events, trends, and products, and this data-driven insight is crucial for reputation management, market research, and decision making. Twitter, now known as "X," plays an influential role among social media platforms by providing a channel for real-time communication and information sharing. With approximately 530 million active users as of 2024 posting an estimated 500 million tweets daily, it is a powerful real-time channel for gauging public sentiment. Tracking these changes as they happen enables organizations to understand sentiment patterns and take timely, appropriate actions. However, real-time sentiment monitoring is compute-intensive and can quickly drive up resource usage (both compute and cost) if not managed effectively.
In this blog, we will demonstrate how to build a distributed Kubernetes cluster on Arm Neoverse-based CPUs to monitor sentiment changes in real time based on tweets, so you can fully utilize the Arm Neoverse computing foundation for its performance, efficiency, and flexibility.
Amazon Web Services (AWS) offers EC2 instances powered by AWS Graviton processors, which are based on the Arm Neoverse architecture. These instances, built on Graviton2, Graviton3, and Graviton4, provide strong performance with significant cost efficiency [1-5]. To leverage these benefits, we developed our use case on AWS Graviton instances using Amazon Kinesis, Apache Spark, Amazon EKS (Graviton3 instances), Amazon EC2 (a Graviton4 instance), Elasticsearch with a Kibana dashboard, Prometheus, and Grafana (see Graph 1). This setup enables the fast creation and execution of massively parallel machine-learning jobs across different nodes to derive real-time insights, which organizations can use to stay adaptable, responsive, and resilient in a rapidly changing world.
Graph 1: Logical architecture diagram, using AWS as an example.
Please also keep in mind that Arm Neoverse-powered instances are available in Google Cloud and Microsoft Azure as well, so this type of logical architecture should also allow you to set up a similar solution using their services. We will now walk through each component in the diagram, explaining its purpose and how it is constructed, to give you a full understanding of the entire system, using AWS as the example. Later, we'll release a learning path with code examples so you can replicate and build your own solution.
To retrieve new tweets as soon as they are published, we will use the Twitter Developer API, a set of programming tools and protocols provided by Twitter that allows developers to access and interact with Twitter data programmatically. It lets us gather, filter, and analyze information from Twitter's vast database of tweets, user information, and other social media content.
To set it up, you will first need to create a Twitter developer account, then create a project and an App in the developer portal. Next, generate an API Key, API Secret, Access Token, and Access Token Secret to authenticate your application and read tweets. Note that, to provide reliable service, Twitter applies rate limits and constraints on the number of tweets you can retrieve, depending on your subscription tier.
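As a minimal sketch of that authentication step, the stdlib-only Python snippet below builds an authenticated request against Twitter's v2 filtered-stream endpoint using a bearer token from your developer App. The environment variable name and the printing logic are our own assumptions for illustration.

```python
import json
import os
import urllib.request

STREAM_URL = "https://api.twitter.com/2/tweets/search/stream"

def build_stream_request(bearer_token: str) -> urllib.request.Request:
    """Build an authenticated request for Twitter's v2 filtered-stream endpoint."""
    return urllib.request.Request(
        STREAM_URL,
        headers={"Authorization": f"Bearer {bearer_token}"},
    )

if __name__ == "__main__":
    # Requires a real bearer token from your Twitter developer App.
    req = build_stream_request(os.environ["TWITTER_BEARER_TOKEN"])
    with urllib.request.urlopen(req) as stream:
        for line in stream:  # one JSON object per matching tweet
            if line.strip():
                tweet = json.loads(line)
                print(tweet["data"]["text"])
```

In a production client you would also handle reconnects and HTTP 429 (rate-limit) responses, which the official libraries do for you.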
Amazon Kinesis is a fully managed data streaming service built to handle and process large volumes of real-time data. In our setup, we will use Kinesis to capture live data from the Twitter API, ensuring that every tweet matching our filters (such as hashtags, keywords, accounts, language, or timeframe) flows into Kinesis as soon as it's posted. To configure this, follow the step-by-step guide provided in this document. The Twitter API script sends each tweet as a JSON object into a Kinesis stream, making the data readily available for subscribers to consume.
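The snippet below sketches that handoff into Kinesis with boto3. The stream name, region, and sample tweet are placeholders for illustration, and the partition-key choice is one common convention rather than a requirement.

```python
import json

STREAM_NAME = "tweets-stream"  # assumed Kinesis stream name

def tweet_to_record(tweet: dict) -> dict:
    """Wrap one tweet JSON object as a Kinesis record, partitioned by
    tweet id so records are spread evenly across shards."""
    return {
        "Data": json.dumps(tweet).encode("utf-8"),
        "PartitionKey": str(tweet["id"]),
    }

if __name__ == "__main__":
    import boto3  # AWS SDK; needs credentials configured for your account

    kinesis = boto3.client("kinesis", region_name="us-east-1")
    tweet = {"id": 1, "text": "Arm Neoverse on AWS Graviton is fast!"}
    kinesis.put_record(StreamName=STREAM_NAME, **tweet_to_record(tweet))
```

For higher throughput, `put_records` (plural) batches up to 500 records per call.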
The sentiment analyzer is a text classification model that detects the emotional tone of tweets, categorizing them into three or more groups based on the words used. This allows application users to quickly understand real-time opinions on a specific topic without having to read each tweet manually. The results provide valuable sentiment insights, enabling users to make data-driven decisions. There are several ways to calculate sentiment: you can train your own text classification model, which requires labeled data and can be time-consuming, or, as in our approach, you can use a pretrained sentiment classification model.
We process the sentiment of tweets using Spark Streaming, an API within Spark for reliable, high-throughput processing of data streams from sources such as Kafka, AWS Kinesis, HDFS/S3, and Flume. It splits the input stream into mini-batches and runs them through the Spark engine, producing a stream of batches of processed data. Spark also provides a streaming API on top of Spark SQL called Structured Streaming, which presents data as Datasets/DataFrames (APIs built on top of RDDs) and lets the optimized Spark SQL engine process the streaming data.
The Spark Streaming API reads the stream of tweets from the Kinesis stream. The Spark engine runs jobs on the received DataFrames, processing them with a pretrained sentiment classification model from the Stanford CoreNLP library and assigning each tweet one of the following labels: [VERY_NEGATIVE, NEGATIVE, NEUTRAL, POSITIVE, VERY_POSITIVE]. The results are then sent to Elasticsearch.
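To make the labeling step concrete, here is a PySpark-shaped sketch. The `toy_classify` function is a deliberately simplistic stand-in for the CoreNLP sentiment annotator (which runs on the JVM), and the Kinesis source options are assumptions that depend on which connector package you use.

```python
LABELS = ["VERY_NEGATIVE", "NEGATIVE", "NEUTRAL", "POSITIVE", "VERY_POSITIVE"]

def label_sentiment(score: int) -> str:
    """Map a sentiment class index (0-4, as CoreNLP emits) to its label."""
    return LABELS[score]

def toy_classify(text: str) -> int:
    """Toy stand-in for the model: counts a few cue words.
    The real pipeline calls Stanford CoreNLP's sentiment annotator."""
    lowered = text.lower()
    good = sum(word in lowered for word in ("great", "love", "fast"))
    bad = sum(word in lowered for word in ("slow", "hate", "bad"))
    return max(0, min(4, 2 + good - bad))

if __name__ == "__main__":
    # Sketch only: requires pyspark plus a Kinesis connector package;
    # the source format and option names below are assumptions.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("tweet-sentiment").getOrCreate()
    label_udf = udf(lambda t: label_sentiment(toy_classify(t)), StringType())
    tweets = (spark.readStream.format("kinesis")
              .option("streamName", "tweets-stream")
              .option("region", "us-east-1")
              .load())
    labeled = tweets.withColumn("sentiment",
                                label_udf(col("data").cast("string")))
    labeled.writeStream.format("console").start().awaitTermination()
```

In the real deployment the console sink is replaced by an Elasticsearch sink, and the UDF delegates to CoreNLP instead of the toy classifier.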
Elasticsearch is a robust, open-source search and analytics engine designed to efficiently store, search, and analyze large volumes of data in near real-time. It allows for fast data ingestion and nearly instant searchability. Its real-time indexing capability is crucial for handling high-velocity streams, such as tweets, that continuously flow in from APIs or event streams. To set up Elasticsearch on an AWS EC2 instance, you can follow these instructions.
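As an illustration of the ingestion side, the stdlib-only sketch below indexes one analyzed tweet through Elasticsearch's document API. The endpoint (a local instance), index name, and document fields are assumptions for this example.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumed Elasticsearch endpoint

def build_index_request(index: str, doc: dict) -> urllib.request.Request:
    """Build a POST request that indexes one document into Elasticsearch."""
    return urllib.request.Request(
        f"{ES_URL}/{index}/_doc",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

if __name__ == "__main__":
    doc = {
        "tweet_id": 1,
        "text": "Arm Neoverse on AWS Graviton is fast!",
        "sentiment": "POSITIVE",
        "created_at": "2024-11-01T12:00:00Z",
    }
    with urllib.request.urlopen(build_index_request("tweets", doc)) as resp:
        print(resp.status)  # 201 when the document is created
```

At higher tweet rates you would switch to the `_bulk` endpoint to index many documents per request.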
Kibana is an open-source visualization tool that works seamlessly with Elasticsearch, providing an interface for exploring, visualizing, and interacting with data. With Elasticsearch and Kibana, users can interact with the data, apply filters, and receive alerts if sentiment drops sharply, all in real time. If your Elasticsearch deployment did not initially include a Kibana instance, you can follow these instructions to enable Kibana first; for new Elasticsearch clusters, a Kibana instance is created automatically and you can access it directly. Once Kibana is enabled, you can follow this document to set up your desired visualizations to display the data from Elasticsearch.
Prometheus is a monitoring and alerting toolkit. It’s widely used for collecting and querying real-time metrics in cloud-native environments like Kubernetes. Prometheus collects essential metrics (e.g., CPU, memory usage, pod counts, request latency) that help in monitoring the health and performance of Kubernetes clusters.
Grafana is a visualization and analytics tool that integrates with data sources such as Prometheus to create interactive dashboards for monitoring and analyzing Kubernetes metrics over time. We deployed Prometheus and Grafana on Kubernetes using Helm; this blog provides a good tutorial.
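For a sense of how these metrics can also be pulled programmatically, the sketch below issues an instant query against Prometheus's HTTP API. The endpoint assumes a local port-forward to the Prometheus service, and the PromQL expression is just one example metric.

```python
import json
import urllib.parse
import urllib.request

PROM_URL = "http://localhost:9090"  # assumed port-forwarded Prometheus

def build_query_url(promql: str) -> str:
    """Build a Prometheus HTTP-API instant-query URL for a PromQL expression."""
    return f"{PROM_URL}/api/v1/query?" + urllib.parse.urlencode({"query": promql})

if __name__ == "__main__":
    # Cluster-wide CPU usage, as scraped from cAdvisor by Prometheus.
    url = build_query_url("sum(rate(container_cpu_usage_seconds_total[5m]))")
    with urllib.request.urlopen(url) as resp:
        result = json.load(resp)
    print(result["data"]["result"])
```

Grafana dashboards issue essentially the same PromQL queries under the hood.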
Amazon Elastic Kubernetes Service (EKS) is a managed Kubernetes service from AWS that allows you to deploy, manage, and scale applications. Since tweet volume can fluctuate greatly depending on trending topics or events, EKS's auto-scaling of Kubernetes pods and nodes ensures that the sentiment analysis application has the resources to handle peak loads and automatically scales down when traffic subsides, optimizing cost efficiency.
HashiCorp provides documentation on how to provision an EKS cluster on AWS, and Terraform scripts are available to set it up automatically. To run it on Graviton3-based instances, a few changes to the node-group configuration are required:
```hcl
Name           = "eks-nodes-aarch64"
ami_type       = "AL2023_ARM_64_STANDARD"
instance_types = ["r7g.4xlarge"]
```
The inference time for each tweet depends on its length and the model used. For more accurate sentiment predictions, you can select a larger model [7], which increases latency compared to a smaller model but yields higher sentiment accuracy. Since tweet lengths vary, inference times fluctuate accordingly, averaging a couple of hundred milliseconds per tweet. This means our use case can process approximately 5-10 tweets per second with the large model. A smaller model is usually faster, roughly halving the latency and processing 20-30 tweets per second [8].
Globally, about 6,000 tweets are posted on Twitter every second, with roughly 4,000 unique hashtags identified [6]. This translates to 1-2 tweets per hashtag per second.*
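As a quick back-of-the-envelope check of these numbers (the helper function is ours, for illustration only):

```python
def tweets_per_second(latency_ms: float, workers: int = 1) -> float:
    """Sustained throughput when each worker processes one tweet at a time."""
    return workers * 1000.0 / latency_ms

# At ~200 ms per tweet with the large model, a single worker sustains
# about 5 tweets/s, comfortably above the ~1-2 tweets/s of one hashtag.
```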
We built our use case on Twitter, but you can take the main principles of this use case and deploy a similar solution on other social media platforms, taking full advantage of Arm Neoverse-based cloud instances across multiple major cloud providers, including AWS, Google Cloud, and Microsoft Azure. If you would like to learn more about this use case or explore the significant performance and efficiency benefits of Arm Neoverse-based instances in the cloud, please visit our booth (N12) at KubeCon 24.
*This use case currently operates with two worker nodes, which might not handle all global tweets but can effectively manage those related to specific hashtags.
Scaling up the number of worker nodes would enable processing of a higher tweet volume if needed.