
Advancing Analytics at SQLBits 2022

SQLBits 2022 is just around the corner, taking place between March 8th and 12th at the ExCel in London.

Across our team we’ll be delivering 3 pre-con training days and 12 general sessions across Data Engineering, Data Science, and DevOps. Here’s the agenda of all our sessions, and when and where to find us.


Tuesday - 8th (Training Days)

A Data Engineer's Guide to Azure Synapse

Presenters: Simon Whiteley, Zach Stagers, and Stijn Wyants

Track: Data Engineering

There has been an explosion of interest in Azure Synapse Analytics as everyone races to get to grips with the all-in-one data analytics platform. But when opening up the box, you find it's a lot more complex than it's made out to be, with several different powerful compute engines, each with their own idiosyncrasies! Why do we have different flavours of each engine? When should you use Spark pools over SQL? What's the most cost-effective approach for different scenarios? What types of users should be using each service? The answers to these questions aren't always met with clarity!

This training day breaks down the Synapse workspace into its component parts and provides a foundation of knowledge for each piece. During the day, we will cover:

- Fundamentals of building a Lake-based analytical platform - how you structure a lake, what file format to choose, what kinds of data work it's suited for
- How the SQL Pools work, patterns for optimising performance and cost and how we can use our SQL endpoints to integrate with other services
- The Synapse Spark engine, demonstrating how you can write dynamic workflows in Python or bring your existing SQL logic to Spark directly
- Data Explorer pools and how you can use them for deep exploration of logs, time series, and other fast-moving unstructured data sources
- Synapse Integrations, how you can take your workspace and integrate directly with tools such as Azure Purview, CosmosDB and the wider Dataverse

There is a huge amount to cover, but you'll be guided by Data Platform MVP Simon Whiteley & veteran analytics consultant Zach Stagers, both of whom have deep knowledge across the whole of this wide and sprawling tech stack.


Wednesday - 9th (Training Days)

Achieving DevOps Nirvana: Automating Azure Data Platform Deployments with Terraform

Presenters: Anna Wykes and Falek Miah

Track: DevOps

Adopting full Infrastructure as Code (IaC) can be a daunting task, not always accessible to every data developer, given the variety in experience and skill sets. It is important we work towards the DevOps dream of us all being part of the process, and all being responsible for and understanding our solutions' infrastructure – but how do we achieve this dream?

Terraform is a highly popular, easy-to-learn IaC solution that simplifies the deployment process. Terraform can be used with all the major cloud providers: Azure, AWS & GCP. Specialist analytics tools such as Databricks have also introduced their own Terraform providers to assist with deploying and managing resources across all major cloud providers.

In this workshop you will be introduced to Terraform, and learn its core concepts and components. We will then focus on designing and deploying an Azure Data Platform solution, including a Resource Group, Key Vault, ADLS (Azure Data Lake Store), Synapse and Databricks.

Once we have our solution, we will run our Terraform via a DevOps CI/CD (Continuous Integration/Continuous Deployment) pipeline. We will also cover some of the most common security and networking challenges, before finishing with best practice guidelines and comparisons with other popular IaC solutions.

Join Anna Wykes and Falek Miah to develop the core knowledge you need to work with Terraform for your Azure Data Platform solution(s), along with transferable Terraform skills that can be used with other cloud providers.

Get there faster with Machine Learning in Azure Synapse

Presenters: Terry McCann and Gavi Regunath

Track: Data Science

Azure Synapse is Microsoft's unified data analytics platform. You will see a lot about the great problems it solves in Data Engineering and with Data Lakehouse architectures, but that is never the end of the story: a data lake needs to be distilled and refined into its predictive potential. This is where Machine Learning comes into play. Each month there are new and interesting features added to Azure Synapse, and one area getting more and more attention is Machine Learning. In this full-day session we go from zero to hero in not only Azure Synapse but, most importantly, how you build models for production in Azure Synapse.

This is a lab-heavy session, so you will need to bring a laptop with an Azure subscription. We start the day with an overview of Azure Synapse: what it is and how it works. From there we explore how Synapse implements Machine Learning. The morning will focus on the integrations with AutoML and Cognitive Services before we move on to training models from scratch in Python and PySpark. Along the way we will talk about model management and an increasingly important topic: MLOps.

Learn from the industry experts and Microsoft MVPs at Advancing Analytics in this full-day session. It is great for those who are new to Synapse, new to Machine Learning but familiar with Synapse, or new to both.


Thursday - 10th (General Sessions)

Lessons in Lakehouse Automation

Presenter: Simon Whiteley

Track: Data Engineering

Time & Place: 12:00, Room 11

The term Data Lakehouse is still new to many, but the technology has now reached a level of maturity and sophistication that makes it more accessible than ever before. But where do you start with building a Data Lakehouse? How can you achieve the same level of maturity that we have with relational data warehouses? How do you avoid reinventing the wheel?

In this session, we're going to take you on the journey that Advancing Analytics has taken over the past few years, looking at the evolution of lakehouse architectures alongside the new techniques for code automation & metadata management that they unlock. We'll talk about some real-world problem scenarios and how you can model them within a reference architecture. We'll also touch upon our framework accelerator Hydr8 and how that design helps accelerate our clients' lakehouse adoption.

Bringing Data Lakes to your Purview

Presenter: Simon Whiteley

Track: Data Engineering

Time & Place: 16:40, Room 6

Data Lakes are a tricky beast, always followed with eyerolling jokes about "data swamps", but how DO you keep your lake under control? Way back in the day we had Azure Data Catalog, which did a decent job at cataloguing relational databases, but was utterly rubbish at anything else. With Azure Purview we have a second shot, a chance to perform true Data Governance over Lake-based platforms.

This session focuses specifically on this use case, taking the core elements of Azure Purview and scanning lake data, creating resource sets, plugging in Hive metastores, and creating that lake catalog we've always dreamed of, all in 20 minutes or less!

Docker & Kubernetes for the Data Scientist

Presenter: Terry McCann

Track: Data Science

Time & Place: 17:10, Room 3

Deploying Machine Learning models is known as the hardest problem in Data Science. Too many models live and die on a developer's machine. We need a way to deploy our models in a repeatable way. In this session we will look at the basics and the history of Docker. We will build a Machine Learning model in Python, serialise it and containerise it.
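To give a flavour of the serialisation step, here is a minimal sketch in pure Python. The model class below is a hypothetical stand-in, not the session's actual code; a real project would typically serialise a model trained with a library such as scikit-learn.

```python
import pickle

# A stand-in "model": a simple threshold classifier with a predict method.
class ThresholdModel:
    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, x):
        return 1 if x >= self.threshold else 0

model = ThresholdModel(threshold=0.5)

# Serialise the trained model to bytes, as you would before baking it
# into a Docker image (typically written out to a file such as model.pkl).
blob = pickle.dumps(model)

# Inside the container, the serving code deserialises the model and predicts.
restored = pickle.loads(blob)
print(restored.predict(0.7))  # -> 1
```

The same pickled file can then be copied into a Docker image and loaded by whatever serving code the container runs.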

Docker is great for packaging our applications, but we need somewhere to run them. For this we will use Kubernetes. Again, we will look at the basics and history of K8s (how the kool kids write Kubernetes). We will then get our Docker container running our model live and into production.

Too few machine learning developers can deploy models. Let's change that by running through all the examples together in this session.


Friday - 11th (General Sessions)

Implementing a Data Quality Framework in Purview

Presenter: Ust Oldfield

Track: Data Engineering

Time & Place: 10:10, Room 10

Azure Purview is Microsoft's latest data governance offering, with extensive Data Glossary functionality. In this demo-heavy session, we'll look at Purview, its functionality as a Data Catalog, and how we can expand it into a Data Quality solution with the help of Databricks.

So you want to be a Data Engineer?

Presenters: Anna Wykes, Michael Robson, and Ust Oldfield

Track: Data Engineering

Time & Place: 15:20, Room 6

As Data Engineers, we think ours is a cool profession to work in. It's also becoming one of the most in-demand skill sets across industries and sectors.

In this session, Anna, Mikey, and Ust will introduce you to the role of a Data Engineer and some of the technologies and tools used within the discipline, before guiding you to resources that will help you learn the basics and further develop your expertise.

We’ll share parts of our own journeys to becoming Data Engineers, and show how you can use your existing experience to transition from careers such as database administration or software engineering – or simply build on a passion for data and problem solving.

AutoML: An Introduction To Get You Started

Presenter: Gavi Regunath

Track: Data Science

Time & Place: 13:40, Room 4

AutoML, short for Automated Machine Learning, empowers data teams to quickly build and deploy machine learning models. It aims to curb the time and expertise required to generate a machine learning model by automating the heavy lifting of preprocessing, feature engineering, and model creation, tuning, and evaluation. While AutoML may initially appeal solely to enterprises for their citizen data scientists, it has the potential to become a valuable tool for seasoned data scientists as well. This session aims to demonstrate how to get started with AutoML using Azure Databricks to solve a machine learning problem.
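To give a flavour of the idea AutoML automates, here is a purely illustrative sketch, with toy data and hand-rolled candidate models rather than the Databricks AutoML API: fit several candidates, score each on held-out data, and keep the best.

```python
# Toy training and validation data: (x, y) pairs where y is roughly 2x.
train = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)]
valid = [(5, 10.1), (6, 11.8)]

def fit_mean(data):
    """Baseline model: always predict the mean of y."""
    mean_y = sum(y for _, y in data) / len(data)
    return lambda x: mean_y

def fit_linear(data):
    """Least-squares line through the training points."""
    n = len(data)
    sx = sum(x for x, _ in data); sy = sum(y for _, y in data)
    sxx = sum(x * x for x, _ in data); sxy = sum(x * y for x, y in data)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return lambda x: slope * x + intercept

def mse(model, data):
    """Mean squared error of a fitted model on held-out data."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

# The "AutoML loop": train every candidate, score on validation data,
# and keep whichever generalises best.
candidates = {"mean_baseline": fit_mean, "linear": fit_linear}
scores = {name: mse(fit(train), valid) for name, fit in candidates.items()}
best = min(scores, key=scores.get)
print(best)  # -> linear
```

Real AutoML tooling does the same thing at scale, searching over many model families and hyperparameters, and layering in preprocessing and feature engineering along the way.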

Machine Learning in Azure Synapse

Presenter: Terry McCann

Track: Data Science

Time & Place: 15:20, Room 6

Revised for 2022! There is a lot of content available on Synapse for Data Engineering, but what about Machine Learning? In this session we will look at how to train models in Azure Synapse with SparkML, AutoML, and Cognitive Services.

"Cultivating the Catalogue" - Growing Data Governance with Azure Purview

Presenters: Craig Porteous and Chris Williams

Track: Data Engineering

Time & Place: 17:10, Room 3

Providing a business with an organised and well-populated data catalogue is essential given the ever-growing demands and questions about your data platform. How are your data assets organised, and how can you track lineage?

In this session we will help you sow the seeds of good data governance and show how Azure Purview fits into your pathway. We will demonstrate how processes, business terminology, and ownership drive data governance forward. To close out this introductory session, we'll dig deep into ways to extend Purview's capability with the Atlas API.


Saturday - 12th (General Sessions)

Synapse Data Flows - Will Citizen ETL Replace the Data Engineer?

Presenter: Zach Stagers

Track: Data Engineering

Time & Place: 10:50, Room 9

We're in the middle of a data lake boom, with companies investing huge amounts in new lake-based data platforms, and all of the engineering that comes with it. But what if you don't want to learn python? What if your data experts aren't even comfortable with SQL? How do we bring data transformation to the people who understand the business?

Synapse Data Flows is a low-code, drag-and-drop "citizen ETL" tool, but under the hood it uses incredibly powerful Apache Spark clusters to interpret and execute your transformations. But how far can you take it? Is this the death of SQL and Python, or are Mapping Data Flows just intended for lightweight projects?

In this session we'll talk about what Mapping Data Flows can do, what they can't do, and how far you can take them in building a data platform framework.

Data Science and Analytics from the Trenches: Real-World Experience from Diverse Voices in the Field

Presenters: Tori Tompkins, Gavi Regunath, and Jennifer Stirrup

Track: Data Science

Time & Place: 11:20, Room 9

In this session, we will cut through the marketing buzzwords to share experiences, tips, and tricks on how to be successful with Data Science and Analytics in the real world. Tune in to hear the team share real-world experience and get takeaways from industry insiders on real projects with impact. We will also discuss the ethics and fairness of Data Science and Analytics projects and how we can be more inclusive from a technology, people, and process standpoint.

Join this lively and interactive session to hear practical examples from the speakers on how to be a more successful data scientist. Bring your questions for discussion!

Automate the deployment of Databricks components using Terraform

Presenters: Anna Wykes and Falek Miah

Track: DevOps

Time & Place: 11:20, Room 4

Databricks is a great data analytics tool for data science and data engineering, but provisioning Databricks resources (workspace, clusters, secrets, mount storage etc.) can be complex and time consuming.

Automating the deployment of Databricks resources has been tricky in the past, even with Terraform, an Infrastructure as Code tool. It has required a mix of Terraform Azure providers and/or ARM templates, PowerShell, the Databricks CLI, or REST APIs, which made deployments harder to repeat and led to inconsistent environments.

Databricks has since introduced its own Terraform provider to assist with deploying and managing Databricks resources on the Azure, Google Cloud (GCP), and Amazon Web Services (AWS) cloud platforms, giving you the ability to automate the deployment of Databricks resources at the same time as provisioning the infrastructure, and making environments easier to manage and maintain.

This session will introduce you to Terraform and the Databricks provider, and take you through the steps required to build an automated solution that provisions a Databricks workspace and resources into the Azure cloud platform using Terraform.

By the end of this session, you will have everything you need to automate your Databricks environment deployments and ensure consistency.

Optimisation in Business

Presenter: Luke Menzies

Track: Data Science

Time & Place: 12:50, Room 8

An often overlooked application of data science in business involves prescriptive methods. These are the methods associated with making cars drive autonomously, dynamically adjusting routes to account for traffic, or automatically regulating system temperatures to deliver optimal operations. In business, this field is often referred to by the catchier name of optimisation: the art of choosing the set of actions that achieves the best possible outcome. This talk covers a range of use cases for which a business might not think to call upon the expertise of a data scientist, and will also delve into the mechanics of the techniques used to deliver prescribed choices.