Benchmarking Machine learning prediction models

When surfing the internet, it is easy to find sites comparing the most popular Machine Learning toolkits (datascience.stackexchange.com, oreilly.com or udacity.com). These sites give you a lot of information about the strengths and weaknesses of the libraries and how they work, along with examples that show how easy each tool is to use. If you are new to the field, they are very helpful for finding the right library to start studying your data. In short, they are written by Data Scientists for Data Scientists.

However, as a Software Engineer you would rather know whether these tools are going to work well or just crash your servers. With that in mind, the main objective of this article is to explore some Machine Learning libraries and see how they behave in a real-time, semi-production scenario.

A bit of background

First, a little background on how Machine Learning projects work. These projects are mainly divided into two phases:

Modeling: this first phase is focused on data, and the objective is to find the model that best fits and explains the input.
- Reading: obtaining data from all the available sources, such as databases, ETLs or logs.
- Cleaning: although there is plenty of information in bulk, it is just raw data and would be hard to exploit as it is, so several techniques have to be applied to convert it into valuable and readable information.
- Modeling: the model, or vector of equations, is obtained by applying different algorithms. Later this model can be used to predict or classify input data.
- Validation: after the model is created, it must be validated using techniques such as Cross-Validation in order to know whether it behaves as expected. If not, the model must be calculated again.
- Evaluation: a metric to evaluate the precision of the model is defined from a subset of the available data. This metric will also be used as a criterion for evaluating the model in the update phase.
Model execution: in this phase the previously calculated model is used to predict or classify the input data.
- Loading the model: just reading the model from a DB or disk.
- Reading the input data: the input data must be read and pre-processed so that our model can use it.
- Execution: applying the model to our input data. Depending on the nature of the model, the result is a prediction or a classification.
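
To make the two phases concrete, here is a minimal sketch in plain Python. It is only an illustration under loud assumptions: the "model" is a toy most-frequent-label lookup (not the Random Forest benchmarked later), the rows are made-up values, and the function names `train`, `predict` and `cross_validate` are hypothetical.

```python
import pickle
from collections import Counter, defaultdict

def train(rows):
    """Modeling: learn the most frequent label for each feature value (toy model)."""
    counts = defaultdict(Counter)
    for feature, label in rows:
        counts[feature][label] += 1
    return {feature: c.most_common(1)[0][0] for feature, c in counts.items()}

def predict(model, feature, default="unknown"):
    """Execution: apply the model to one pre-processed input."""
    return model.get(feature, default)

def cross_validate(rows, k=3):
    """Validation: mean accuracy over k folds (fold i holds out every k-th row)."""
    scores = []
    for i in range(k):
        held_out = rows[i::k]
        training = [row for j, row in enumerate(rows) if j % k != i]
        model = train(training)
        hits = sum(predict(model, f) == label for f, label in held_out)
        scores.append(hits / len(held_out))
    return sum(scores) / k

# Made-up rows for illustration: (platform token, browser label).
rows = [("Win32", "Chrome"), ("MacIntel", "Safari"), ("Win32", "Chrome"),
        ("MacIntel", "Safari"), ("Win32", "Firefox"), ("MacIntel", "Safari")]

score = cross_validate(rows)        # modeling phase: validate and evaluate
blob = pickle.dumps(train(rows))    # persist the model (a DB or disk in practice)
model = pickle.loads(blob)          # execution phase: load the model...
prediction = predict(model, "Win32")  # ...and apply it to the input
print(prediction)
```

The point of the split is that everything expensive happens in the first phase, while the second phase only loads an artifact and applies it, which is the part the benchmark below measures.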

ML Toolkits

Although there are plenty of Machine Learning toolkits, this article focuses only on the three most important ones from my point of view: R, Scikit-learn and Spark MLLib.

Comparison Chart

To give a global view of the three ML libraries, the table below compares them:

	R	Scikit-learn	MLLib
Programming language	R	Python	Scala/Java/Python
Range of algorithms	Extensive	Good	Limited to distributed algorithms
Speed (small-medium data size)	Med-High	Med	Med
Scalability for Big Data	Very limited (vertical scaling only)	Very limited (vertical scaling only)	Excellent
Data source integration	Good	Very good	Very good
Visualization tools	Very good	Very good	Limited, depends on third-party tools
Learning curve	High	Small	Average
IDEs	RStudio / Jupyter	Eclipse / IntelliJ IDEA / Jupyter	Eclipse / IntelliJ IDEA / Jupyter

Model Execution Options

The main objective of this article is to benchmark the execution of an ML model in a semi real-time environment. Therefore, a simple scenario has been selected: customers send requests containing some attributes of their browsers to a REST service, and it responds with a prediction of their browser using a trained ML model.

R scenario

Although the R language has an HTTP server library⁴, it is very limited. Therefore the approach is quite different from the other cases.

The REST server is based on the J2EE platform and implemented using the Spring framework.
The server receives the request and executes an R script (creating a new session every time) using an Rserve instance.
To increase performance, there are some optimizations:
- The server calls a cluster of 2 Rserve instances using round-robin scheduling.
- The R script caches the model using an in-memory DB (Redis).
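
Round-robin scheduling simply hands each new request to the next backend in a fixed rotation. The benchmarked server does this in Java; the sketch below shows the same idea in Python for brevity. The class name `RoundRobin` and the two host names are hypothetical (6311 is Rserve's default port).

```python
from itertools import cycle
from threading import Lock

class RoundRobin:
    """Hand out backends in a fixed rotating order, safely across threads."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("at least one backend is required")
        self._backends = cycle(backends)
        self._lock = Lock()

    def next_backend(self):
        # The lock keeps concurrent request threads from racing on the iterator.
        with self._lock:
            return next(self._backends)

# Hypothetical addresses for the two Rserve instances.
rserve_pool = RoundRobin(["rserve-1:6311", "rserve-2:6311"])
picked = [rserve_pool.next_backend() for _ in range(4)]
print(picked)
# ['rserve-1:6311', 'rserve-2:6311', 'rserve-1:6311', 'rserve-2:6311']
```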

Scikit-learn scenario

In this scenario a Python server has been implemented which will give the best performance for the Scikit-learn library.

The REST server is implemented in Python using the Flask framework.
The server receives the request and executes a Scikit-learn model.
In order to optimize the performance, the model is cached in memory.
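
Caching the model in memory means it is deserialized once and then reused by every request. The following is a rough sketch of that idea using only the standard library, not the code of the benchmarked server: the pickled lookup table stands in for a trained Scikit-learn model, and `get_model`, `handle_request` and the `platform` field are hypothetical names.

```python
import os
import pickle
import tempfile
from functools import lru_cache

load_count = 0  # only here to show how often the disk is actually read

@lru_cache(maxsize=1)
def get_model(path):
    """Deserialize the model on first use; later calls reuse the same object."""
    global load_count
    load_count += 1
    with open(path, "rb") as fh:
        return pickle.load(fh)

def handle_request(path, features):
    """Stand-in for a REST handler: fetch the cached model and apply it."""
    model = get_model(path)
    return model.get(features["platform"], "unknown")

# Stand-in "model": a lookup table pickled to a temporary file.
with tempfile.NamedTemporaryFile(suffix=".pkl", delete=False) as tmp:
    pickle.dump({"Win32": "Chrome", "MacIntel": "Safari"}, tmp)
    model_path = tmp.name

answers = [handle_request(model_path, {"platform": p})
           for p in ("Win32", "MacIntel", "Linux")]
os.remove(model_path)  # the cached copy would keep serving after the file is gone
print(answers, load_count)
# ['Chrome', 'Safari', 'unknown'] 1
```

Three requests are served but the file is read only once, which is the saving the cache is meant to provide. As usual with pickle, only load files you trust.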

MLLib scenario

In the last scenario, a pure Java solution has been developed giving the best performance for MLLib.

The REST server is implemented in Java using the Spring framework.
The server receives the request and executes an MLLib model.
In order to optimize the performance, the model is cached in memory.

Benchmark Specifications

These are the specs for the model training:

The chosen model is a Random Forest classifier with 25 trees in the forest.
The data source has 100,000 entries with 3 input variables and 1 output.
To obtain numeric features, the algorithm used is FeatureHasher.
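
For readers unfamiliar with feature hashing, the idea is to hash each "name=value" token into a column index of a fixed-length vector, so string inputs become numbers without keeping a vocabulary. The sketch below is a hand-rolled illustration of that trick, not Scikit-learn's FeatureHasher (which uses a signed 32-bit MurmurHash3 and returns sparse matrices). The function name, the MD5 hash, the vector size of 16 and the three input names are all assumptions made for the example.

```python
import hashlib

def hash_features(features, n_features=16):
    """Map a dict of string features to a fixed-length numeric vector."""
    vector = [0.0] * n_features
    for name, value in features.items():
        token = f"{name}={value}".encode("utf-8")
        # A stable hash is needed: Python's built-in hash() is randomized per process.
        digest = int.from_bytes(hashlib.md5(token).digest()[:8], "big")
        index = digest % n_features                    # column the token lands in
        sign = 1.0 if (digest >> 63) & 1 else -1.0     # signed hashing softens collisions
        vector[index] += sign
    return vector

# Hypothetical request with three input variables.
request = {"platform": "Win32", "language": "en-US", "vendor": "Google Inc."}
vector = hash_features(request)
print(vector)
```

The resulting vector always has the same length, whatever values arrive at prediction time, which is convenient for a service that receives arbitrary browser strings.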

For the load tests:

The server machine is a large AWS instance with 2 cores, 8 GB RAM and SSD disks.
The tool used to measure performance is JMeter.
Three different test scenarios have been selected:
- Low traffic: 200 requests using 4 threads.
- Average traffic: 500 requests using 10 threads.
- High traffic: 1000 requests using 10 threads.
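
For a feel of what such a test measures, here is a small closed-loop load generator. It is not JMeter and was not used for the results below; it is only a sketch, where `run_load_test` is a hypothetical helper and `fake_request` replaces the real HTTP call with a short sleep.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_load_test(send_request, total_requests, threads):
    """Fire total_requests calls from a pool of threads and summarize latency."""
    def timed_call(_):
        start = time.perf_counter()
        send_request()
        return (time.perf_counter() - start) * 1000.0  # milliseconds

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        latencies = list(pool.map(timed_call, range(total_requests)))
    elapsed = time.perf_counter() - wall_start

    return {
        "requests": len(latencies),
        "avg_ms": sum(latencies) / len(latencies),
        "min_ms": min(latencies),
        "max_ms": max(latencies),
        "req_per_s": len(latencies) / elapsed,
    }

def fake_request():
    """Placeholder for an HTTP call to the prediction service."""
    time.sleep(0.005)

# The "low traffic" scenario from the list above: 200 requests, 4 threads.
stats = run_load_test(fake_request, total_requests=200, threads=4)
print(stats)
```

The returned fields mirror the columns reported in the result tables: number of requests, average, minimum and maximum latency, and requests per second.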

Test results

*Low Traffic*	# requests	Avg (ms)	Min (ms)	Max (ms)	Req/s
R	200	938	516	1735	4.18
Scikit-learn	200	164	141	994	23.11
MLLib	200	65	51	130	45.58

*Average Traffic*	# requests	Avg (ms)	Min (ms)	Max (ms)	Req/s
R	500	2411	818	2807	4.1
Scikit-learn	500	366	141	445	25.97
MLLib	500	67	51	219	115.15

*High Traffic*	# requests	Avg (ms)	Min (ms)	Max (ms)	Req/s
R	1000	4833	2223	5244	4.11
Scikit-learn	1000	748	142	821	25.71
MLLib	1000	57	51	119	246.91

Conclusion

After reviewing the results of the benchmark, the fastest option is clearly MLLib: in all the scenarios its average response time stayed below 70 ms. The next fastest solution is Scikit-learn. Although it always responded in less than a second, a real-time system could hardly afford that processing time, even though its average is below 200 ms with low traffic. Finally, *R* has shown really bad results. Its average response time was close to 1 second with low traffic and well above it in the other scenarios, so under no circumstances can it be used for predicting in a real-time system. Not only is it really slow, it has also pushed the system to the limit.

R

Pros:
- Wide range of ML models.
- Widely used by the scientific community.
Cons:
- Very low performance.
- Model training is neither scalable nor distributable.
- Higher maintenance costs because it would require a cluster of Rserve servers.

Scikit-learn

Pros:
- Wide range of ML models
- Widely used by the scientific community
Cons:
- Low performance
- Model training is neither scalable nor distributable

MLLib

Pros:
- Good performance
- Scalable and distributable model training
Cons:
- Small range of ML models

Author

admin
