Think of the times when you want to talk to a friend. You call, but the other person doesn’t pick up the phone. It’s irritating, isn’t it? Well, here is team Drishti to rescue you from this problem. We will help you predict the best time at which your friend will pick up the phone, so that this irritation is wiped out of your life. Sounds interesting, right?
As of now, we are developing the system so that it helps our clients give personalized service to their customers by calling them at their preferred time. Yeah, you got it: things will get a little technical from here, but don’t worry, I will try to explain everything in layman’s terms so that every reader can follow along. So, let’s get started.
I will start with the two components of the Best Time To Call (BTTC) system:
- Training
- Prediction
Understand it this way: a teacher teaches students and then asks them questions on the same subject. Similarly, we first train a model in the training pipeline with the data provided (we will talk about the nature of the data later in the blog). Once the trained model is ready, we can use it to predict whether a customer will get connected in a particular slot or not.
Let us dig a little deeper!
When the system was at its inception stage, it was conceived as comprising four important parts:
- A computing framework – Spark
- A scalable, highly available, and high-performance database – Cassandra
- A workflow engine – Oozie
- A machine learning library – DL4j
SPARK: Apache Spark is a lightning-fast cluster-computing technology designed for fast computation. Its main feature is in-memory cluster computing, which increases an application’s processing speed and is the reason it is often preferred over Hadoop MapReduce. Spark supports not only ‘map’ and ‘reduce’ but also SQL queries, streaming data, machine learning (ML), and graph algorithms.
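To give you a feel for the programming model, here is a minimal Spark sketch in Scala; the file path and column names are invented for illustration and are not from our actual pipeline.

```scala
import org.apache.spark.sql.SparkSession

object SparkQuickLook {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("bttc-quick-look")
      .getOrCreate()

    // Load call records and cache them in memory; this in-memory reuse
    // is what gives Spark its speed edge over disk-based MapReduce.
    val calls = spark.read
      .option("header", "true")
      .csv("hdfs:///data/calls.csv")
      .cache()

    // The same cached data can then be queried with plain SQL.
    calls.createOrReplaceTempView("calls")
    spark.sql("SELECT slot, COUNT(*) AS attempts FROM calls GROUP BY slot").show()

    spark.stop()
  }
}
```

Because `calls` is cached, the SQL query (and any further ML step) reads it from memory instead of re-reading it from HDFS.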
CASSANDRA: Apache Cassandra is a free and open-source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.
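To make the Spark–Cassandra pairing concrete, here is a hedged sketch using the open-source DataStax spark-cassandra-connector (the post does not name the exact connector, so treat this as one plausible wiring); the keyspace and table names, `bttc` and `call_history`, are placeholders.

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

object CassandraRoundTrip {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("bttc-cassandra")
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)

    // Read rows distributed across the Cassandra cluster into an RDD.
    val history = sc.cassandraTable("bttc", "call_history")
    println(s"accumulated rows: ${history.count()}")

    // Write a few (customer, slot, connected) records back.
    val sample = sc.parallelize(Seq(("c1", 9, true), ("c2", 14, false)))
    sample.saveToCassandra("bttc", "call_history",
      SomeColumns("customer_id", "slot", "connected"))

    sc.stop()
  }
}
```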
OOZIE: Oozie is a workflow scheduler system for managing Apache Hadoop jobs. Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions. Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability. Oozie is a scalable, reliable, and extensible system.
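Workflows themselves are defined in XML, but they can also be kicked off from code. Here is a small sketch using Oozie’s Java client API from Scala; the Oozie URL, host names, and HDFS application path are placeholders.

```scala
import org.apache.oozie.client.OozieClient

object SubmitWorkflow {
  def main(args: Array[String]): Unit = {
    val oozie = new OozieClient("http://oozie-host:11000/oozie")

    // Point the client at the workflow definition stored on HDFS.
    val conf = oozie.createConfiguration()
    conf.setProperty(OozieClient.APP_PATH, "hdfs:///apps/bttc")
    conf.setProperty("jobTracker", "yarn-rm-host:8032")
    conf.setProperty("nameNode", "hdfs://namenode-host:8020")

    // run() submits and starts the workflow, returning its job id.
    val jobId = oozie.run(conf)
    println(s"Started Oozie workflow: $jobId")
  }
}
```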
DL4j: Deeplearning4j is an open-source, distributed deep-learning library written for Java and Scala. Integrated with Hadoop and Spark, DL4J is designed to be used in business environments on distributed GPUs and CPUs.
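Since (as you will see below) our model is a 2-layer DNN, here is roughly what defining such a network looks like with DL4j. The layer sizes, activations, and hyperparameters here are illustrative guesses, not our production values.

```scala
import org.deeplearning4j.nn.conf.NeuralNetConfiguration
import org.deeplearning4j.nn.conf.layers.{DenseLayer, OutputLayer}
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork
import org.deeplearning4j.nn.weights.WeightInit
import org.nd4j.linalg.activations.Activation
import org.nd4j.linalg.learning.config.Adam
import org.nd4j.linalg.lossfunctions.LossFunctions

object BttcModel {
  def build(numFeatures: Int): MultiLayerNetwork = {
    val conf = new NeuralNetConfiguration.Builder()
      .seed(42)
      .updater(new Adam(0.001))
      .weightInit(WeightInit.XAVIER)
      .list()
      // Hidden layer over the transformed customer/slot features.
      .layer(0, new DenseLayer.Builder()
        .nIn(numFeatures).nOut(64)
        .activation(Activation.RELU).build())
      // Output: probability that the call connects in a given slot.
      .layer(1, new OutputLayer.Builder(LossFunctions.LossFunction.XENT)
        .nIn(64).nOut(1)
        .activation(Activation.SIGMOID).build())
      .build()

    val net = new MultiLayerNetwork(conf)
    net.init()
    net
  }
}
```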
(For those of you wondering, “OMG! What is Apache?”: don’t worry, even I had no clue about it once. The mission of the Apache Software Foundation (ASF) is to provide software for the public good. It does this by providing services and support for many like-minded software project communities of individuals who choose to join the ASF. You can find more details on this link.)
We use AMBARI for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs. The Ambari-managed services that we use in our system are:
- HDFS: A Java-based file system that provides scalable and reliable data storage; we use it to store data across the Hadoop cluster.
- YARN: It acts as the resource manager for Spark jobs.
- ZOOKEEPER: It enables highly reliable distributed coordination; basically, it provides a failover facility.
By now, you should have an overview of the BTTC system. Let’s get you going further!
As I said, there are two components of the BTTC system; we will talk about both of them one by one.
- TRAINING: It further has two components:
- DATA CLEANING: The data is brought from the Ameyo database (Ameyo is a contact-centre suite) into the BTTC system and stored in the Cassandra database. Once the data accumulation is done, small Spark jobs, each doing a specific task, run over the data through an Oozie workflow in yarn-cluster mode. These tasks involve data cleaning, which includes removing unwanted data, replacing null values, breaking up features, and manipulating features.
- DATA TRANSFORMATION: The cleaned data has to be transformed into a form that the machine learning algorithms can understand (basically, numbers). There can be many ways of transforming the data depending upon its nature; mean normalisation and one-hot encoding are the ones we have used in our BTTC system. Once the data is transformed, a model is trained and saved in HDFS, from where it can be restored and used for prediction later. The model is a 2-layer DNN (deep neural network) built using the DL4j library, of the kind sketched earlier. A sketch of these cleaning and transformation steps follows this list.
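Here is the promised sketch of the cleaning and transformation steps as a small Spark job. The column names (`age`, `slot`) and the 24-slot one-hot encoding are invented for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object TrainingPrep {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("bttc-training-prep").getOrCreate()

    val raw = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///bttc/training_raw.csv")

    // --- Cleaning: drop unusable rows, replace remaining nulls. ---
    val cleaned = raw
      .filter(col("slot").isNotNull)   // remove unwanted records
      .na.fill(Map("age" -> 0))        // replace null values

    // --- Transformation 1: mean-normalise a numeric feature. ---
    // The mean and stddev computed here must be persisted and reused
    // verbatim at prediction time (see the IMPORTANT notes below).
    val stats = cleaned.select(mean("age"), stddev("age")).first()
    val (ageMean, ageStd) = (stats.getDouble(0), stats.getDouble(1))
    val normalised = cleaned.withColumn("age_norm", (col("age") - ageMean) / ageStd)

    // --- Transformation 2: one-hot encode a categorical feature. ---
    // A tiny manual version for a day split into 24 hourly slots:
    // one 0/1 column per slot.
    val encoded = (0 until 24).foldLeft(normalised) { (df, s) =>
      df.withColumn(s"slot_$s", when(col("slot") === s, 1).otherwise(0))
    }

    encoded.write.mode("overwrite").parquet("hdfs:///bttc/training_prepared")
    spark.stop()
  }
}
```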
- PREDICTION: It basically has four components:
- DATA CLEANING: The data is brought from the Customer Data Tables into the BTTC system and stored in Cassandra. Once the data has accumulated, we apply the same data-cleaning process used for training. (IMPORTANT: One thing I want to bring to your notice is that we have to strictly follow the same data-preparation steps that we followed for training, as we cannot train on some data and predict from totally different data. For example: let’s say we decide to convert the zip code of the customer’s location into a particular format during training; then we have to convert the customer’s zip code using the same format during prediction as well.)
- DATA TRANSFORMATION: After the data is cleaned, it has to be transformed into data the machine learning algorithm can understand, exactly as we did at training time. (IMPORTANT: We need to transform every field in the data the same way we did at training time, using the same factors. For example: let’s say we decide to normalise a particular feature using a mean of 0.5 and a standard deviation of 1; then we have to use the same mean and standard deviation for that feature during prediction as well.)
- GET PREDICTION: Once the data is prepared, the trained model is restored from HDFS and used to get predictions for the customers.
- SAVE PREDICTION: The final step, after we get the predictions, is to update them in the Ameyo DB so that calls can be scheduled to the customers based on the results. A sketch of this prediction flow follows below.
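And here is the promised sketch of the prediction side: restoring the saved network from HDFS and scoring one already-transformed feature vector. The model path and the feature values are placeholders.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.deeplearning4j.util.ModelSerializer
import org.nd4j.linalg.factory.Nd4j

object PredictSlot {
  def main(args: Array[String]): Unit = {
    // Restore the model the training pipeline saved to HDFS
    // (assumes fs.defaultFS points at the HDFS namenode).
    val fs = FileSystem.get(new Configuration())
    val in = fs.open(new Path("/bttc/models/bttc_dnn.zip"))
    val model = ModelSerializer.restoreMultiLayerNetwork(in)
    in.close()

    // The features must already be cleaned and transformed with the SAME
    // parameters (means, stddevs, encodings) that were used in training.
    val features = Nd4j.create(Array(0.37, 0.0, 1.0, 0.0), Array(1, 4))

    // output() returns the network's sigmoid score: the estimated
    // probability that the customer connects in this slot.
    val p = model.output(features).getDouble(0L)
    if (p > 0.5) println(f"Schedule the call (p = $p%.2f)")
    else println(f"Try another slot (p = $p%.2f)")
  }
}
```

In the full system, this scoring runs at scale and the results go back into the Ameyo DB, as described above, rather than to the console.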
So, this was our BTTC world; I hope you enjoyed getting to know it. Hope to meet you next time with a completely new world. Till then, take care.