Every day, roughly 2.5 quintillion bytes of data are created globally. A quintillion is a 1 with 18 zeroes after it. Let that sink in for a moment. While a lot of this data may be cat videos on the Internet, a considerable amount is still produced as text and other traditional readable content. While popular search engines like Google have simplified parsing this public data deluge, organizations both small and large still rely on search-based tools to unravel insights from the day-to-day content generated within the company's firewalled boundaries, or from custom text and datasets that serve their business needs.
Elasticsearch is a highly scalable, open-source text-search and analytics engine based on the Apache Lucene library. Primary use cases for Elasticsearch include full-text search (notably in e-commerce applications), document storage with cataloging, time-series events and metrics, and so on.
Recently, the team at elastic.co added Arm64 binaries for Elasticsearch. This allows users to deploy Elasticsearch on Arm Neoverse-powered AWS Graviton2 instances. In this blog, we show an Elasticsearch analytics use case for Twitter data analysis on a cluster of AWS Graviton2-based Amazon EC2 M6g instances. In addition, we conducted performance benchmarking using Rally as the benchmarking tool, comparing Arm-powered Amazon EC2 M6g instances to x86-based M5 instances to showcase the benefit of using these instances for an Elasticsearch deployment.
For performing analytics with Elasticsearch, these instances deliver up to 25% higher throughput and correspondingly lower latency than x86-based M5 instances across varying types of data analytics. These instances also provide a 20% cost benefit. These are significant cost and performance gains for customers, as deploying Elasticsearch on Arm is seamless and requires no additional investment in time.
In this use case, we gather tweets and relevant data based on keywords and insert that data into Elasticsearch. We also create indexes and shards while inserting the data. We execute a Python script that interacts with the Twitter streaming API and fetches live tweets based on the keywords we specify. This data is then inserted into an Elasticsearch cluster running on AWS Graviton2-based Amazon M6g instances. On completion of the script, search queries are executed, and the data is analyzed.
The entire flow of the use case is captured in the following video:
For more details on how to set up this use case, please refer to the Configurations section towards the end of this blog.
Now, let us look at performance metrics on Elasticsearch comparing AWS Graviton2-based M6g instances with x86-based M5 instances. For benchmarking Elasticsearch, we used Rally from Elastic. We executed benchmarking tests on two instance types with the following specifications:
Table 1. EC2 instances types and size
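For reference, a Rally benchmark against an already-running remote cluster can be launched with a command along the following lines. This is a hedged sketch, not the exact invocation we used: the track name and host are placeholders, and recent Rally versions expose the run through the `race` subcommand.

```shell
# Illustrative example: benchmark an existing remote Elasticsearch cluster
# with the http_logs track. The benchmark-only pipeline tells Rally not to
# provision Elasticsearch itself. Substitute your own cluster address.
esrally race --track=http_logs \
  --pipeline=benchmark-only \
  --target-hosts=<clusterIP>:9200
```

Running Rally from a separate client instance, as noted in the Configurations section, keeps the load generator from competing with Elasticsearch for CPU and memory.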
The following tracks were tested using Rally to measure Elasticsearch performance across a variety of datasets:
For each of the tracks described above, the following metrics are measured from Rally's benchmark report:
In Rally, each track performs different tasks while benchmarking Elasticsearch. We observed two major types of operations:
The table below shows a comparative analysis of metrics observed during the 'index-append' task for Elasticsearch. The following three tracks use large datasets (20-30 GB), comprising many log files and documents. In this table, we see 15%-25% better performance from M6g instances over the equivalent M5 instances.
Table 2. Single node Elasticsearch instance – Performance metrics for batch-style operations
The two tracks covered below exercise geo queries and structured data, with smaller datasets (2-3 GB). In this case, we observe 6%-15% better performance from M6g instances compared to M5 instances.
Table 3. Single node Elasticsearch instance - Performance metrics for batch-style operations
The following table shows a comparative analysis of metrics during interactive tasks like scroll. As explained above, such tasks define a target throughput, so lower latency and service time indicate stable performance from the instance.
Table 4. Single Node Elasticsearch instance - Performance metrics for interactive operations
Additionally, the following table shows the performance metrics for the 'index-append' task on a three-node Elasticsearch cluster of EC2 instances.
Table 5. Three (3) Node Elasticsearch Cluster - Performance metrics for batch-style operations
The following table shows a comparative analysis of interactive operations on a three-node Elasticsearch cluster:
Table 6. Three (3) Node Elasticsearch Cluster - Performance metrics for interactive operations
To conclude, Elasticsearch can be used for a variety of use cases, and AWS Graviton2 provides better performance at lower cost. Arm-based M6g instances deliver up to 25% higher throughput and correspondingly lower latency than x86-based M5 instances across varying types of data analytics, and they also provide a 20% cost benefit.
For more information on the software ecosystem on AWS Graviton2, please visit the AWS sessions at Arm DevSummit, and for questions, reach us here.
Register for Arm DevSummit
These are the performance related settings we updated to achieve the results described previously:
1. Change the default JVM heap size to 50% of the memory of each instance.
sudo vi /etc/elasticsearch/jvm.options
-Xms8g (for an xlarge instance)
-Xmx8g
2. Turn off memory swap and make sure Elasticsearch is the only service running in the instance.
sudo swapoff -a (on each instance)
3. If turning off memory swap is not possible for some reason, edit the elasticsearch.yml file and change the following setting (in Elasticsearch 7.x this setting is named bootstrap.memory_lock):
bootstrap.memory_lock: true
Then allow the service to lock memory:
sudo vi /etc/default/elasticsearch
MAX_LOCKED_MEMORY=unlimited
4. Ensure that Elasticsearch is configured and running on a machine with a 10GbE networking interface.
5. Run the Rally tool on a separate instance and make sure there are no network disconnects or latency issues while connecting to Elasticsearch.
The following are the prerequisites for installing Elasticsearch:
Amazon Machine Image (AMI) – Ubuntu 20.04 (arm64 based)
OpenJDK 14.0.1+7 installed on each EC2 M6g instance (only required if you're using the no-JDK binary of Elasticsearch)
You can use either the default distribution of Elasticsearch or the open-source version, as described below.
Download Elasticsearch using the following command:
curl -L -O https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-oss-7.8.0-aarch64.deb
Additionally, download sha512 from the same location:
curl -L -O https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-oss-7.8.0-aarch64.deb.sha512
Verify the SHA-512 checksum for the binary that you have downloaded:
shasum -a 512 -c elasticsearch-oss-7.8.0-aarch64.deb.sha512
You should see an OK message printed out on the screen.
Now, you are ready to install Elasticsearch.
sudo dpkg -i elasticsearch-oss-7.8.0-aarch64.deb
Repeat these steps on each M6g instance in AWS.
After the installation is complete, edit the following configuration file on each instance.
sudo vi /etc/elasticsearch/elasticsearch.yml
Set the name of the Elasticsearch cluster by locating the field cluster.name and replacing its value with your own.
Now, set the name of the Elasticsearch node by editing the following field:
node.name: <hostname of the node>
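For a multi-node cluster on Elasticsearch 7.x, the same file also needs discovery settings so the nodes can find each other and form a cluster. The fragment below is a minimal sketch, assuming three nodes; the cluster name and all IP/hostname placeholders are examples to be replaced with your own values:

```yaml
cluster.name: es-graviton2-demo        # example name; use your own
node.name: <hostname of the node>
network.host: 0.0.0.0                  # listen on all interfaces
discovery.seed_hosts: ["<node1IP>", "<node2IP>", "<node3IP>"]
cluster.initial_master_nodes: ["<node1 hostname>", "<node2 hostname>", "<node3 hostname>"]
```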
Start the Elasticsearch service with the following command.
sudo systemctl start elasticsearch.service
Check the status of the service.
sudo systemctl status elasticsearch.service
Execute a simple curl command to check whether you can query Elasticsearch.
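A typical check is to request the root endpoint of the node, which returns the node name, cluster name, and version details (shown here against the local node; adjust the host if querying remotely):

```shell
# Query the local Elasticsearch node; a JSON document with cluster and
# version information indicates a successful install.
curl -XGET 'http://localhost:9200'
```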
Figure 2. Command output to show successful install of Elasticsearch on a single node
After the installation is complete on all three nodes, check the status of the Elasticsearch cluster by executing the following commands on each node:
curl -XGET 'http://<clusterIP>:9200/_cluster/state?pretty'
Check the health of the Elasticsearch cluster by executing the following command. It should display the status as green.
curl -XGET 'http://<clusterIP>:9200/_cat/health?v'
Figure 3. Command output to show health of the Elasticsearch cluster
Once the Elasticsearch cluster is running, we install a Python library called Tweepy on our client machine. It is a simple Python library used to interact with the Twitter streaming API.
pip3 install tweepy
We have a sample Python application that uses the Twitter streaming API to fetch live tweets based on the keywords we specify. To use this application, you need to sign up for a Twitter developer account. It is straightforward, and the steps are listed here.
Once you have created the account and registered your application, you should have an access token, an API key, and an API secret key. These need to be provided in the Python script. The sample script can be downloaded from the GitHub repo here.
As shown in the following image, we run a Python script that searches for live tweets based on specific keywords. The script connects to a three-node Elasticsearch cluster running on AWS Graviton2-based Amazon M6g instances. The live tweets collected by the script are formatted and sent to the Elasticsearch cluster. On successful completion of the script, search queries are executed, and the data is analyzed.
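The actual script lives in the GitHub repo linked above. As a rough sketch of its indexing step, the fragment below uses only the Python standard library to POST a formatted tweet into the 'sentiment' index; the field names, index name, and cluster URL are illustrative assumptions, and in the real script a Tweepy stream listener would call index_tweet() for each matching tweet:

```python
import json
import urllib.request

ES_URL = "http://<esclusterIP>:9200"   # placeholder: your cluster endpoint
KEYWORDS = ["aws", "graviton2", "arm"]

def format_tweet(tweet):
    """Keep only the fields we index and analyze later (field names illustrative)."""
    return {
        "user": tweet["user"]["screen_name"],
        "text": tweet["text"],
        "created_at": tweet["created_at"],
    }

def index_tweet(tweet, es_url=ES_URL):
    """POST one formatted tweet document into the 'sentiment' index."""
    req = urllib.request.Request(
        f"{es_url}/sentiment/_doc",
        data=json.dumps(format_tweet(tweet)).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    return urllib.request.urlopen(req)

# In the real script, a Tweepy stream listener invokes index_tweet() for each
# live tweet matching KEYWORDS, e.g. via tweepy.Stream(...).filter(track=KEYWORDS).
```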
Figure 4. Script execution result
In this script, we look for keywords like "aws", "graviton2", or "arm". On execution, the script looks for live tweets that reference any of these keywords and inserts the tweet data into the Elasticsearch database. We ran the script for two hours to collect a considerable amount of data. Now, it is time to search for our keywords and analyze the tweets.
Execute the following command to search for the keyword 'graviton2':
curl -XGET 'http://<esclusterIP>:9200/sentiment/_search?q=graviton2'
Figure 5. Search and analysis result for keyword
It shows data for a tweet from a few minutes earlier that referenced 'graviton2'. These stats can also be viewed with tools such as Kibana or Grafana.
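Beyond the q= URI shortcut, the same search can be expressed with the Elasticsearch query DSL in a request body. This is an illustrative variant: 'text' is an assumed field name in the indexed tweet documents, not confirmed by the sample script.

```shell
# Match query against the 'sentiment' index; ?pretty formats the JSON response.
curl -XGET 'http://<esclusterIP>:9200/sentiment/_search?pretty' \
  -H 'Content-Type: application/json' \
  -d '{ "query": { "match": { "text": "graviton2" } } }'
```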
Before installing Rally, you need Python 3 and pip configured on the machine.
sudo apt install python3-pip
To install Rally, execute the following command:
pip3 install esrally
After the tool is installed, add its location to the PATH environment variable and execute the configuration command:
esrally configure
We should see the following output on a successful configuration: