Arm Treasure Data is a Customer Data Platform built with the necessary scalability and architecture flexibility to support clients across all industry verticals and sizes and help them tackle a wide range of pesky Big Data challenges such as dealing with data silos, broken pipelines, and rigid schemas. In today's dynamic digital landscape, important customer data often gets generated and stored in multiple disparate systems, and then needs to be managed by different teams. This disconnect between data sources and end users can create unwanted hurdles and delays in important business initiatives that rely on quick and easy access to data. In fact, at Treasure Data we believe that most marketing problems are just Big Data problems, based on our experience working with clients in the IT and Martech space for many years. Our main mission as a CDP is to help our clients eliminate those Big Data problems and provide their business users and marketing teams with quick and easy access to clean, unified, reliable customer data. This is so they can execute data-driven use cases, drive business growth and improve marketing KPIs.
One of those challenges is the process of building and productizing Machine Learning models on very large volumes of data. Typically, when Data Science teams set out to plan the architecture of a complex ML model that needs to be trained on hundreds of Gigabytes (or even terabytes) of data and built to run efficiently in production, they often think of Apache Spark as one of their go-to solutions. Spark is a unified computing engine and a set of libraries for parallel data processing with high-level APIs that supports languages such as Java, Scala, Python, and R. Spark does not require its own disk storage of data after executing each step of data transformation, but rather handles loading data from storage systems and performing multiple computations on it in-memory with only a single scan of the original data. This architecture results in faster processing speed and lower write-to-disk requirements of most Spark tasks. This makes it a great solution when it comes to processing speed, but because it does most of its processing in-memory, it can at times become very memory intensive for certain tasks. Furthermore, to make the best use of efficient Spark functions, an end user, such as a Data Engineer in charge of maintaining an RDBS system, often has to find a way to incorporate Spark clusters into their existing data architecture to help with the heavy in-memory compute requirements as well as learn Spark syntax for complex data transformations.
To make life easier for our clients, who are SQL users, Arm Treasure Data supports an Open Source ML Library called Hivemall, that runs on Apache Hive and is written entirely in SQL language. To understand the advantages of Hivemall when building production-level ML-models on very large volumes of data, one must first understand Hive. Apache Hive is a data warehouse system that converts SQL-like queries into MapReduce jobs for easy execution and processing of extremely large volumes of data. Hive’s storage engine is called Hadoop, which uses Hadoop Distributed File System (HDFS) to store the data across multiple servers, allowing distributed data processing. The MapReduce method stores Intermediate results on disk, which makes Hive very efficient when it comes to memory usage. Hive is also much more resilient to node failures because any individual Hive task can be restarted on a different node and users can easily manage the task duration and possible restart time, which allows to handle troubleshooting and build an error-proof architecture of complex data processes such as building efficient ML models on large volumes of data. Hive comes with lots of enterprise-grade features and capabilities which can help organizations build efficient, high-end data warehousing solutions.
One such enterprise-grade solution built on top of Hive is Apache Hivemall - enter the world of distributed machine learning. Hivemall is an open-source project that was started by Treasure Data engineers and was awarded the InfoWorld Bossie Awards (the best open source big data tools) in 2014. Its sole mission is to enable the wider community of Data Engineers or Data Analysts, who are not Python or Spark users, to be able to apply Data Science methods using only SQL language. Those methods can be applied directly to any HDFS type database through the Apache Hive interface, so once you have the Hive engine enabled on your platform, no additional infrastructure is required. Apache Hivemall offers various functionalities such as regression, classification, recommendation, anomaly detection, k-nearest neighbor, feature engineering and more. It also supports state-of-the-art machine learning algorithms such as Soft Confidence Weighted, Adaptive Regularization of Weight Vectors, Factorization Machines, AdaDelta, XGBoost, and more.
At Treasure Data, we have built a lot of efficient ML-driven solutions using only Hivemall SQL-based functions such as Propensity Models (to Churn, to Buy, to Engage with certain ad or content), Customer Lifetime Value (CLTV) Predictions, Next Best Action recommendation systems, Profile Engagement Scores and Auto-Segmentation Clustering algorithms.
So, let us go a little deeper into “the rabbit hole” and look at the tech specifications of one of those models, to understand the flexibility and efficiency that Hivemall offers. Today we look at our Next Best Action recommendation system that we built entirely using Hivemall functions, and below are some of the reasons why we decided to go with Hivemall for his model:
Next Best Action Recommendation Engine Overview
Our NBA Recommendation System informs what specific marketing action to take against different segments of users in real time, and where & when that marketing action is most likely to generate best results and help optimize marketing KPIs. Thus, enabling the very popular marketing mantra - “Deliver the right message to the right user, on the right channel, at the right time” to improve the customer experience and optimize the likelihood of engagement.
Next Best Action Model Setup and Architecture
The following is an architecture Diagram of how our CDP allows clients to go from raw user data, to a unified view of the customer, with all their NBA Recommendation scores available to marketing teams in an easy and clean format, so they can use those to do Segmentation and Campaign Orchestration based on Next Best Action Recommendation scores.
The following is a quick example of how easy it is to apply business rules to build granular segments and apply smart targeting using insights from our Next Best Action Recommendation engine.
Next Best Product Recommendation Logic Deep Dive
The Hivemall recommendation approach is based on cosine similarity of item cooccurrence. The elements in the item-item matrix are cooccurrence of user purchases and views for each item-item pair. We call this the “all-pairs similarity” problem, sometimes known as a “similarity join.” More specifically, the recommendation logic first computes approximate cosine similarity in an item-item matrix using the DIMSUM algorithm. Then, based on recent user interests (for example, top-5 recent item purchases or recent product pageviews) and their corresponding similarity scores with all other items available, top-k recommended items for each user will be computed and presented to the end user (marketer). The topk_recommended_items table is generated by the DIMSUM algorithm using a “k-nearest-neighbor” matrix of similarity scores, which is often used in collaborative based Recommender systems.
We are using DIMSUM (“Dimension-Independent Matrix Square using MapReduce) because it is the most efficient algorithm to solve the similarity join problem, given how sparse some of these item-item matrices can be. The DIMSUM approach in Hivemall can make very expensive computations on large datasets 40% more efficient, because it allows for a sampling technique that only calculates cosine similarity for pairs that are similar enough to each other and have higher frequency of occurrence and excludes a lot of low similarity and low occurrence pair from the vectorization and cosine similarity computations. For more detail and code examples, please refer to the Hivemall documentation page: https://hivemall.incubator.apache.org/userguide/recommend/item_based_cf.html
To put that in perspective, we applied the algorithm to a transaction table with 10 years’ worth of data and over 2.3B rows (transaction records), 35m unique users and 15 unique items, and it took DIMSUM less than three minutes to calculate all similarity scores. The full model also included various data transformation steps, a top-k recommendation algorithm as well as splitting the transaction data into a train and test set and calculating recommendation accuracy based on what each user did in the real world post recommendation. Written entirely in SQL, the full model took a little over 35 minutes to run end-to-end, which is quite impressive given that we are not using Spark or any other in-memory processing engine and we are working with billions of rows of data and highly matrix dimensionality.
The following is an example of some of the filters that can be easily applied to the SQL code to achieve the efficient runtime and compute mentioned previously:
Ensures that we are excluding items recently purchased from the final recommender logic, which can reduce matrix dimensions and improve model efficiency.
Tells algorithm to only compute item-to-item similarity for the top 3 items, for each item in data because that is how many items we want to recommend in the end. Again, this saves a lot of compute time, by not performing unnecessary calculations on lots of item combos that we will not need for the final recommender output.
Depending on use case and business model, we can tell the algorithm to look at a fixed n-number of recent items purchased and take aggregate recommender scores based on that. Reducing that number will also reduce model complexity and runtime.
This is a standard hyperparameter for most recommendation systems, which just tells Hivemall code how many recommended items we want to output per user.
This is another important parameter that can reduce the matrix sparsity significantly, by ensuring that we are only calculating cosine similarity for users who have purchased at least two items, otherwise no item-to-item similarity can be calculated with just 1 purchase.
This filter is applied to the final top-k recommender SQL code, ensuring that we are only considering item-to-item scores that are above our acceptable threshold when trying to pick top-k recommended items. This also helps reduce total compute time.
So far we have covered most of the details of the Data Engineering and Data Science processes that went into building this Hivemall recommendation system. The last important piece of the puzzle is model deployment. Depending on the use case and the applications that need to consume the model output Arm Treasure Data has a wide arsenal of activation points. We have over 200 OTB integrations to Martech platforms and BI tools, which can send those recommendations in near real-time to activate use cases such as:
Another option is to make those real-time NBA recommendations available at the edge of devices such as IoT devices and web applications. That can be executed with REST APIs, Webhooks or FluentBit. The ladder is a lightweight solution, developed by Treasure Data engineers, which allows you to collect data or logs from different sources, unify and send them to multiple destinations in real-time very quickly, efficiently and securely.
In conclusion, this article’s purpose was to focus on one end-to-end example of the flexibility, ease, and speed with which one can build and deploy scalable and efficient ML models on data of large volumes and complexity, using Hivemall and SQL only code. For more information, on some of the other models that we have pre-built and made available to our clients, please check out our Treasure Boxes Catalog of solutions here https://boxes.treasuredata.com/hc/en-us or reach out to us at this link with any questions and requests for a deeper dive demo of Arm Treasure Data CDP and our ML-driven solutions: https://www.treasuredata.com/contact-us/.