Distil8 — Solution Accelerators

Many businesses recognise the benefits of Artificial Intelligence (AI) and Machine Learning (ML) for unlocking the potential of their data, and this trend has been growing over the last few years. A machine-learning project requires data scientists to build the models. In today’s market, demand outstrips supply, so data scientists are often expensive to hire, which makes getting ML projects off the ground costly and slow. This is where solution accelerators come in handy, taking an ML project from an idea to a proof of concept within weeks. This article introduces the concept of accelerators to anyone thinking of taking on a machine-learning project.

Time spent on a Data Science project

Any veteran data scientist knows that the majority of the time spent on model development is consumed by data wrangling, cleaning and organising. The figure varies depending on who you ask, but estimates range from 60% to 90% of the time being spent on this before an algorithm is ever run to ‘learn’ from the data.

The cost of data scientists also varies quite wildly. A company may have one or more in-house data scientists, which could mean costs of £20–£35 per hour (£160–£280 per day). Hiring data science consultants can cost £1000s per day. Project durations can range from a couple of weeks to 6 months. This produces a huge cost range per project (£1.6k–£130k+): roughly 10 working days at £160 per day at the low end, and about 130 working days at £1,000 per day or more at the high end. Considering the cost levels these projects can reach, it would be hugely beneficial if the bulk of the time (and ultimately cost) spent generating these models could be reduced.

This is where solution accelerators come in. Cleaning, transforming and preparing data for models is often done by composing tasks selected from a catalogue of repeatable routines. Combined with experience and domain knowledge, these known routines allow the tasks to be semi-automated. Solution accelerators provide templated shortcuts that replace handwriting lengthy code. This makes machine-learning projects more appealing for businesses to take on, especially smaller businesses that rely on frugal budgets to survive. Even if money is no issue, the time saved by using accelerators can be allocated to longer periods of robust testing or further manual tweaks to squeeze out even greater performance. It is a no-brainer to use solution accelerators when possible.
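To make the idea of composing tasks from a catalogue of repeatable routines concrete, here is a minimal sketch in plain Python. It is not Distil8 code: the routine names (drop_incomplete, strip_text, deduplicate), the CATALOGUE dictionary and build_pipeline are all hypothetical and chosen only for illustration.

```python
from functools import reduce

# Hypothetical catalogue of repeatable cleaning routines (illustrative names only).
def drop_incomplete(rows):
    """Remove records that contain any missing (None) value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def strip_text(rows):
    """Trim surrounding whitespace from every string field."""
    return [{k: v.strip() if isinstance(v, str) else v for k, v in r.items()}
            for r in rows]

def deduplicate(rows):
    """Drop exact duplicate records, keeping the first occurrence."""
    seen, unique = set(), []
    for r in rows:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

CATALOGUE = {
    "drop_incomplete": drop_incomplete,
    "strip_text": strip_text,
    "deduplicate": deduplicate,
}

def build_pipeline(step_names):
    """Compose the named routines, in order, into a single callable."""
    steps = [CATALOGUE[name] for name in step_names]
    return lambda rows: reduce(lambda data, step: step(data), steps, rows)

raw = [
    {"name": " Ada ", "age": 36},
    {"name": "Ada", "age": 36},
    {"name": "Bob", "age": None},
]
clean = build_pipeline(["drop_incomplete", "strip_text", "deduplicate"])(raw)
print(clean)  # [{'name': 'Ada', 'age': 36}]
```

The point of the design is that a new project only has to choose and order the step names; the routines themselves are written once and reused.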

Distil8

Distil8 is Advancing Analytics’ library of solution accelerators. It aims to provide accelerators for a wide range of machine-learning tasks, reducing the time spent preparing data or performing preliminary tasks for data science projects. These solution accelerators include routines for recommendation engines, time-series forecasting, GIS, NLP and semi-automated optimisation. The tool aims to deliver Proofs of Concept (PoCs) in 2–3 weeks, dramatically reducing the duration of a project.

Where it differs from other accelerators is that it is designed for a more hands-on approach. This means the tool is appropriate for data scientists or people with knowledge of the field. The benefit is that its capabilities extend beyond those of the more accommodating, mass-market accelerators. It aims to work in conjunction with data scientists on a project, reducing the time spent on upfront tasks. It also integrates with Databricks, benefiting from its Spark capabilities. Logging can be performed with either MLflow or a general Python logger.
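As a rough illustration of the choice between MLflow and a general Python logger, the sketch below returns a single log(key, value) function backed by either one. The function get_metric_logger and its backend argument are hypothetical and are not Distil8's interface; the only library calls assumed are mlflow.log_metric and the standard logging module. MLflow is imported optionally so the example still runs where it is not installed.

```python
import logging

try:
    import mlflow  # optional: only needed for the MLflow backend
except ImportError:
    mlflow = None

def get_metric_logger(backend="python", name="accelerator"):
    """Return log(key, value), which records a metric via the chosen backend.

    Hypothetical helper for illustration; not part of any real library.
    """
    if backend == "mlflow":
        if mlflow is None:
            raise RuntimeError("MLflow is not installed")
        # mlflow.log_metric writes to the active run (starting one if needed)
        return lambda key, value: mlflow.log_metric(key, float(value))
    logger = logging.getLogger(name)
    return lambda key, value: logger.info("metric %s=%s", key, value)

logging.basicConfig(level=logging.INFO)
log = get_metric_logger("python")
log("rmse", 0.42)
```

Switching backends then only means passing "mlflow" instead of "python"; the calling code stays the same.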

Distil8 is an ongoing project, and more accelerators are continually being added.

GIS

Geographic Information System (GIS) accelerators allow geographical information to be manipulated with relative ease. GIS data is notorious for being very large and often difficult to manoeuvre; it definitely comes under the category of big data. These characteristics can make building GIS-related machine-learning models challenging. Distil8 utilises the Databricks Spark engine, combined with Mosaic, to deliver effective ML models.

Time-series

Time-series regression is a special case of regression where the data consists of consecutive observations ordered in time. Examples include diagnostic signals from machines or devices, stock market data and weather data. Although time-series modelling is a type of regression, it has to be treated differently because the order of the observations matters. Distil8 provides the appropriate routines to deliver a time-series model using the latest algorithms, whilst benefiting from Spark’s performance (if required) on Databricks. It tests multiple models, allowing the user to pick the model best suited to their needs. It can additionally use MLflow, a useful library for tracking, packaging and deploying models.
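The idea of testing multiple models and picking the best one can be sketched with three deliberately simple forecasters scored on a chronological hold-out (the final points of the series, never a random sample). This is an illustrative stand-in using only the standard library, not the algorithms Distil8 uses; all function names here are hypothetical.

```python
from statistics import mean

def naive_forecast(train, horizon):
    """Repeat the last observed value."""
    return [train[-1]] * horizon

def moving_average_forecast(train, horizon, window=3):
    """Repeat the mean of the last `window` observations."""
    return [mean(train[-window:])] * horizon

def linear_trend_forecast(train, horizon):
    """Fit y = a + b*t by ordinary least squares and extrapolate."""
    n = len(train)
    t_mean, y_mean = (n - 1) / 2, mean(train)
    slope = (sum((t - t_mean) * (y - y_mean) for t, y in enumerate(train))
             / sum((t - t_mean) ** 2 for t in range(n)))
    intercept = y_mean - slope * t_mean
    return [intercept + slope * (n + h) for h in range(horizon)]

def mae(actual, predicted):
    """Mean absolute error."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))

def compare_models(series, horizon):
    """Score each candidate on the last `horizon` points; best (lowest MAE) first."""
    train, test = series[:-horizon], series[-horizon:]
    candidates = {
        "naive": naive_forecast,
        "moving_average": moving_average_forecast,
        "linear_trend": linear_trend_forecast,
    }
    scores = {name: mae(test, model(train, horizon))
              for name, model in candidates.items()}
    return dict(sorted(scores.items(), key=lambda kv: kv[1]))

series = [10 + 2 * t for t in range(20)]  # synthetic upward trend
scores = compare_models(series, horizon=4)
print(scores)
```

On this synthetic trending series the linear-trend model ranks first, as expected; on real data the ranking depends on the series, which is why comparing several candidates is worthwhile.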

NLP

Natural Language Processing (NLP) deals with text and speech analysis to deliver things such as text segmentation, document classification, sentiment analysis and speech recognition. Distil8 houses the appropriate routines and remedies to deliver the NLP projects businesses need.

Recommendation Engines

Distil8 has its own accelerator for recommendation engines. This model, unique to Advancing Analytics, provides a multi-pronged approach to recommendations, utilising both collaborative and content-based filtering to give a fast and robust tool for recommending products or services. It also incorporates Elasticsearch for quick, effective lookups.
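One common way to combine collaborative and content-based filtering is a weighted blend of two similarity scores. The sketch below is a generic illustration of that idea and makes no claim to match the Advancing Analytics model; hybrid_recommend, the alpha weight and the toy data are all assumptions, every rated item is assumed to appear in item_features, and the Elasticsearch lookup is not shown.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def hybrid_recommend(ratings, item_features, user, alpha=0.5, top_n=2):
    """Rank items the user has not rated.

    ratings:       {user: {item: rating}}
    item_features: {item: [feature values]}
    alpha:         weight on the collaborative score (1 - alpha on content)
    """
    users, items = sorted(ratings), sorted(item_features)
    # Collaborative view: represent each item by its column of user ratings
    rating_vec = {i: [ratings[u].get(i, 0.0) for u in users] for i in items}
    seen = ratings[user]
    scores = {}
    for cand in items:
        if cand in seen:
            continue
        # Similarity to each item the user rated, weighted by that rating
        collab = sum(cosine(rating_vec[cand], rating_vec[i]) * r
                     for i, r in seen.items())
        content = sum(cosine(item_features[cand], item_features[i]) * r
                      for i, r in seen.items())
        scores[cand] = alpha * collab + (1 - alpha) * content
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

ratings = {
    "ann": {"kettle": 5, "toaster": 4},
    "bob": {"kettle": 5, "toaster": 5, "blender": 4},
    "cat": {"novel": 5},
    "dan": {"novel": 4, "atlas": 5},
}
item_features = {  # toy features: [is_kitchen, is_book]
    "kettle": [1, 0], "toaster": [1, 0], "blender": [1, 0],
    "novel": [0, 1], "atlas": [0, 1],
}
recommended = hybrid_recommend(ratings, item_features, "ann")
print(recommended[0])  # blender
```

Blending the two signals means a new item with no ratings can still be recommended on its content, while well-rated items benefit from what similar users chose.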

Segmentation

Segmentation is the act of taking a categorised output and adjusting the dimensions/characteristics of this output so that they best describe the categories. The technique used to do this is clustering. Distil8 offers a method that iterates and adjusts to arrive at the best output available. Clustering is categorised as unsupervised learning, meaning no human input is available to identify what the right output should be. It therefore cannot be scored against labelled, held-out data in the way a supervised model can. Distil8 uses its own ways of measuring the level of success in segmenting a domain space.
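The article does not describe how Distil8 scores a segmentation, so the sketch below substitutes a standard label-free measure, the silhouette coefficient, and iterates over candidate cluster counts with a plain k-means. It is an illustration of the general iterate-and-evaluate idea under those assumptions, using only the standard library; kmeans, silhouette and best_k are hypothetical helper names.

```python
from math import dist
from statistics import mean

def kmeans(points, k, iterations=20):
    """Lloyd's algorithm with deterministic farthest-point initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next centroid: the point farthest from its nearest chosen centroid
        centroids.append(max(points, key=lambda p: min(dist(p, c) for c in centroids)))
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # leave the centroid in place if its cluster is empty
                centroids[c] = tuple(mean(axis) for axis in zip(*members))
    return labels

def silhouette(points, labels):
    """Mean silhouette coefficient; values near 1 mean tight, separated clusters."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        return 0.0
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)  # convention for single-point clusters
            continue
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        b = min(mean(dist(p, q) for q in other)
                for m, other in clusters.items() if m != l)
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return mean(scores)

def best_k(points, candidates=(2, 3, 4, 5)):
    """Try each cluster count and keep the one with the highest silhouette."""
    scored = {k: silhouette(points, kmeans(points, k)) for k in candidates}
    return max(scored, key=scored.get), scored

# Three well-separated toy segments of four points each
points = [(cx + dx, cy + dy)
          for cx, cy in [(0, 0), (10, 10), (20, 0)]
          for dx, dy in [(0, 0), (1, 0), (0, 1), (1, 1)]]
k, scored = best_k(points)
print(k)  # 3
```

Because the score needs no labels, the loop can be repeated after changing which dimensions are used, which mirrors the iterate-and-adjust approach described above.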

Conclusion

Accelerators have become a very inviting tool for any data science team to take up. Catapulting projects towards the finish line whilst reducing costs is something any business should take note of. Advancing Analytics aims to deliver the whole host of accelerators described above. If you are interested in knowing more, please contact us at hello@advancinganalytics.co.uk

Luke Menzies