A Simple Trick to Utilise Databricks’ Intel Hardware to Improve Model Training Efficiency

Introduction

As a data scientist who has been in the industry a while, you have probably experienced many twists and turns in its evolution. Data scientists have come a long way from building machine learning models locally on their own machines. The shift to the cloud has revolutionised the way data scientists work, opening up a whole host of machine learning tools and pushing the capabilities of the field. One of the most popular of these tools is none other than the famous Databricks.

Data scientists use Databricks for several reasons, as it provides a unified analytics platform that combines big data and artificial intelligence (AI) capabilities. Databricks is powered by Apache Spark and is designed to simplify and accelerate data science and machine learning workflows.

One of the key features of Databricks is its distributed computing capability, which allows far larger datasets to be processed than would previously have been practical. As far as machine learning is concerned, Databricks not only provides easy access to the major libraries used in ML development, but also surrounds the model with supporting features such as deployment options and feature stores.

Stepping back from Databricks, one recent development in the machine learning world is Intel’s answer to improving model training efficiency. Intel has optimised popular machine learning libraries such as Scikit-learn and TensorFlow, delivering computational performance gains of up to 100 times when running on Intel hardware (depending on the chip and the workload).

Bringing these two things together is rarely advertised, because Databricks does not make it obvious which chips its compute clusters use. Databricks clusters do, in fact, use Intel hardware, meaning data scientists can capitalise on both with minimal additional setup complexity.

Use Case

Advancing Analytics, a UK-based consultancy, recently used Intel’s optimised libraries to improve efficiency within Databricks for one of their clients. The use case of interest was a fraud detection model that carried long training times, owing to the complexity of the model and the hyperparameter tuning integrated into the training process.

Advancing Analytics benchmarked the solution and found it achieved a 29% improvement in training performance. This was measured on a training set of 60,043 rows and 69 features, run on Azure Databricks using a Standard_D8_v3 cluster. VMs in this series all use Intel hardware of varying types, one of the following:

· 3rd Generation Intel® Xeon® Platinum 8370C (Ice Lake),

· Intel® Xeon® Platinum 8272CL (Cascade Lake),

· Intel® Xeon® 8171M 2.1 GHz (Skylake),

· Intel® Xeon® E5-2673 v4 2.3 GHz (Broadwell), or

· Intel® Xeon® E5-2673 v3 2.4 GHz (Haswell) processors, with Intel® Turbo Boost Technology 2.0.

See the Azure documentation for the Dv3 series for more information.

Each processor offers a different level of performance gain. Within Databricks, the user can check exactly what hardware is being used by running the following command on the cluster.
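For example, reading /proc/cpuinfo from a notebook cell reveals the exact processor (a %sh cell running lscpu works just as well):

```python
# Print the CPU model of the driver node (Databricks nodes run Linux,
# so /proc/cpuinfo is available).
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("model name"):
            print(line.strip())
            break
```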

The notebook the performance was tested on was the main training notebook, where hyperparameter tuning using Hyperopt was adopted. Within the objective function passed to Hyperopt, SMOTENC over-sampling combined with random under-sampling was used. Stratified k-fold cross-validation was used to produce the mean F1 score returned as the loss value. Scikit-learn pipelines were used with a ColumnTransformer to preprocess the data before passing it to an XGBoost model. This routine took the majority of the training time, and hence was where the most significant computational gains could be achieved.
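As an illustration, below is a minimal sketch of such an objective function. The dataset, column indices, and search space are invented for demonstration and are not the client’s actual code.

```python
# A minimal sketch of the training objective described above.
import numpy as np
from hyperopt import fmin, tpe, hp, STATUS_OK
from imblearn.over_sampling import SMOTENC
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

# Synthetic stand-in for the fraud dataset: 2 categorical + 3 numeric features.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(0, 3, 500),            # categorical feature
    rng.integers(0, 2, 500),            # categorical feature
    rng.normal(size=(500, 3)),          # numeric features
])
y = (rng.random(500) < 0.15).astype(int)  # imbalanced fraud labels
CAT_COLS, NUM_COLS = [0, 1], [2, 3, 4]

preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), CAT_COLS),
    ("num", StandardScaler(), NUM_COLS),
])

def objective(params):
    model = Pipeline([
        # SMOTENC over-samples the minority class, then the majority
        # class is randomly under-sampled, on the training folds only.
        ("smote", SMOTENC(categorical_features=CAT_COLS,
                          sampling_strategy=0.5, random_state=42)),
        ("under", RandomUnderSampler(sampling_strategy=1.0, random_state=42)),
        ("prep", preprocess),
        ("xgb", XGBClassifier(
            max_depth=int(params["max_depth"]),
            learning_rate=params["learning_rate"],
            n_estimators=int(params["n_estimators"]),
            eval_metric="logloss",
        )),
    ])
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    mean_f1 = cross_val_score(model, X, y, cv=cv, scoring="f1").mean()
    # Hyperopt minimises the loss, so negate the mean F1 score.
    return {"loss": -mean_f1, "status": STATUS_OK}

space = {
    "max_depth": hp.quniform("max_depth", 3, 10, 1),
    "learning_rate": hp.loguniform("learning_rate", np.log(0.01), np.log(0.3)),
    "n_estimators": hp.quniform("n_estimators", 100, 600, 50),
}
best = fmin(objective, space, algo=tpe.suggest, max_evals=20)
```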

Setting up Intel Optimised Scikit-learn

There are two approaches to setting up the optimised libraries.

The first method uses the cluster console to install the optimised libraries. This is the most straightforward approach: open the Libraries tab in the cluster configuration settings, select PyPI as the source, and install ‘scikit-learn-intelex’ or ‘intel-tensorflow’. If using the TensorFlow library, version 2.6.0 is recommended.

The second option is to install the library using an init script. An init script can be created as follows.
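For example, here is a sketch using dbutils.fs.put from a notebook; the script path and filename are assumed for illustration:

```python
# Write a cluster init script to DBFS that installs the Intel-optimised
# libraries at cluster startup. The path below is an assumed example.
dbutils.fs.put(
    "dbfs:/databricks/init-scripts/install-intel-libs.sh",
    """#!/bin/bash
/databricks/python/bin/pip install scikit-learn-intelex
/databricks/python/bin/pip install intel-tensorflow==2.6.0
""",
    overwrite=True,
)
```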

The init script can then be stored in DBFS, and the libraries will be installed on startup once the script’s path is added under Advanced Options > Init Scripts in the cluster configuration settings.

Although this method is more complicated to set up, it allows a more programmatic approach to installing the library and may be useful as part of something like a CI/CD pipeline.

Installing the optimised library doesn’t overwrite the original Scikit-learn install on an ML runtime. If the user wishes to use the optimised version, they must apply a patch in the notebook of interest. This can be achieved as follows.
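This is the documented entry point of scikit-learn-intelex; the patch must be applied before importing the Scikit-learn modules it accelerates:

```python
# Swap Scikit-learn's implementations for Intel's optimised ones.
from sklearnex import patch_sklearn
patch_sklearn()

# Any Scikit-learn imports after the patch use the accelerated versions.
from sklearn.svm import SVC
```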

This will, in turn, activate the optimised library version.

Conclusion

This small trick can achieve a large gain in training efficiency. The exact gain depends on the Intel hardware in use at the time, but it is a no-brainer to reduce training time (and ultimately cost) by making this slight tweak to a Databricks notebook.

Luke Menzies