Tips for the Databricks Certified Data Engineer Associate Certification

After passing the Databricks Certified Data Engineer Associate certification in July of last year, I thought I’d take inspiration from Chris Williams’ great blog series on the Associate Developer for Apache Spark 3.0 exam. In this post, I’ll go through some key things to remember when attempting this certification, and link to study materials from Databricks and the wider community that were really helpful!

 

COURSE AND CERTIFICATION UPDATES

Databricks have recently updated the Data Engineer Associate certification and its course in the training platform to V3. You can still study V2 of this course and complete V2 of the exam before both are removed on May 31st 2023. However, it is highly recommended that you use the V3 content instead, so that you keep yourself up to date with best practices and feature updates from Databricks!

You can see the changes from V2 to V3 below (mostly new content – even more to learn!):

Databricks have also added some knowledge tests to the end of each section of the course, giving you a nice way to track your progress and make sure you’ve understood the content fully!

 

EXAM INFORMATION

Your primary resource for information on the exam is the Databricks certification page on their website. Here, Databricks runs through what a minimally qualified candidate should know, how much the exam costs and other helpful tips. Importantly, the webpage is updated regularly by Databricks, so the latest information should appear there before it reaches the Databricks Academy (which can get quite confusing with different versions of multiple courses!). The webpage also links to the Databricks Academy FAQ, which has a certification section with details on how to reschedule your exam, when to expect your results and more.

They also include this handy list of the major exam topics. This exam has 45 multiple-choice questions, split into these areas:

·        Databricks Lakehouse Platform – 24% (11/45)

·        ELT with Spark SQL and Python – 29% (13/45)

·        Incremental Data Processing – 22% (10/45)

·        Production Pipelines – 16% (7/45)

·        Data Governance – 9% (4/45)

You’ll need a mark of 70% (32/45) to pass the exam, so focusing your revision on the first three topics (which make up 75% of the questions) is the best approach. You could technically pass the exam by getting 100% in just those areas, but make sure you cover Production Pipelines and Data Governance too!
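
If you want to check the arithmetic behind those numbers yourself, here is a minimal sketch in Python. It only uses the weights and pass mark quoted above; the variable names are my own, and the per-topic counts are simple roundings of the percentages rather than anything Databricks guarantees for a given sitting.

```python
import math

# Figures quoted from the exam page: 45 questions, 70% pass mark.
TOTAL_QUESTIONS = 45
PASS_PERCENT = 70

# Topic weights in percent, as listed above.
weights = {
    "Databricks Lakehouse Platform": 24,
    "ELT with Spark SQL and Python": 29,
    "Incremental Data Processing": 22,
    "Production Pipelines": 16,
    "Data Governance": 9,
}

# Approximate question count per topic, rounded to the nearest whole question.
questions = {
    topic: round(TOTAL_QUESTIONS * pct / 100) for topic, pct in weights.items()
}

# Smallest whole number of correct answers that reaches the pass mark.
needed = math.ceil(TOTAL_QUESTIONS * PASS_PERCENT / 100)

# Questions available across the three most heavily weighted topics.
top_three = sum(sorted(questions.values(), reverse=True)[:3])

print(questions)
print(f"Need {needed}/{TOTAL_QUESTIONS}; top three topics cover {top_three}")
```

The three biggest topics come to 34 questions against a pass mark of 32, so a perfect score there leaves a margin of only two questions, which is exactly why the smaller topics are still worth revising.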

As with the Apache Spark certification, Databricks provide a practice exam (only available in Python as the actual exam is just Python currently) to test your skills before registering to do the real thing. Unfortunately for me, this practice exam hadn’t been released when I completed this certification, but the related course on the Databricks Academy had more than enough information to pass the real exam!

All Databricks exams are proctored by Kryterion via Webassessor, and you’ll need a Webassessor account to register for the exam. If you have any queries about how the exam is proctored, you can visit their FAQ page (or contact them directly).

Finally, it’s really important to note that unlike the Apache Spark certification, you do not have access to the Spark documentation during the exam! There are no other resources available during the exam, so you’ll need to brush up on your syntax. Databricks have a Certification Overview course that runs through the basics of the exam, but be aware that as of 20/04/2023, this course still only covers V2! It will hopefully be updated soon, but if you want to use this course for V3 of the exam, make sure you know which sections are still being examined so you aren’t caught out!

 

EXAMPLE CODE QUESTIONS

Below, we’ll look at two examples of code-based questions from the Databricks practice exam so you’re prepared for how the questions will be written in the real exam (these questions have been retired from the actual exam). The first question is below, and is the easier of the two:

This question is solved by knowing how to expand (or ‘explode’) nested data – the answer is D here. This one is easier as you just need to remember one correct method. For our more difficult example below, you need to be aware of the correct syntax typically found in Medallion ETL pipelines:

The answer is C here. A and B use aggregations which are typical of the Gold layer, and D and E are not performing transformations. The actual exam questions will definitely be more difficult, so be careful not to rely too heavily on this practice exam alone!
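
To make the two ideas above a bit more concrete, here is a small sketch in plain Python (not Spark, and not the code from the practice questions, which aren’t reproduced here). The data, column names and helper functions are all hypothetical. The `explode` helper imitates what Spark’s `explode` function does to an array column, the Silver step does row-level clean-up only, and the Gold step is where the aggregation happens.

```python
from collections import defaultdict

# Hypothetical Bronze data: raw orders, strings everywhere, nested item lists.
bronze = [
    {"order_id": "1", "customer": "ann",
     "items": [{"sku": "A", "price": "10.50"}, {"sku": "B", "price": "4.50"}]},
    {"order_id": "2", "customer": "bob", "items": []},
    {"order_id": "3", "customer": "ann",
     "items": [{"sku": "A", "price": "10.50"}]},
]


def explode(rows, array_col, out_col):
    """Plain-Python stand-in for Spark's explode: one output row per array
    element. Rows whose array is empty or None produce no output rows."""
    for row in rows:
        base = {k: v for k, v in row.items() if k != array_col}
        for element in row[array_col] or []:
            yield {**base, out_col: element}


def to_silver(rows):
    """Bronze -> Silver: flatten nested items and cast types. No aggregation."""
    return [
        {
            "order_id": int(r["order_id"]),
            "customer": r["customer"],
            "sku": r["item"]["sku"],
            "price": float(r["item"]["price"]),
        }
        for r in explode(rows, "items", "item")
    ]


def to_gold(rows):
    """Silver -> Gold: aggregate to a business-level summary per customer."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["customer"]] += r["price"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(silver)
print(gold)
```

Notice that the order with no items disappears after the explode step, which matches how `explode` behaves in Spark (`explode_outer` is the variant that keeps such rows). In the exam the same pattern would be written in Spark SQL or PySpark, so treat this purely as a way to picture which kind of operation belongs to which layer.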

 

STUDY MATERIALS

The best resource for this exam is Data Engineering with Databricks V3, found in the Databricks Academy – it’s designed specifically as an all-in-one course to prepare you for the exam, and each Databricks certification has a course similar to this one. As mentioned before, make sure you’re doing V3 of this course so you’re up to date with current practices and features!

The course features some prerequisites that would be helpful to have:

When I went through the original version of this course in June 2022, the main code blocks were in Spark SQL with some additional Python cells where needed, but Databricks now provides course content in both Spark SQL and PySpark. Regardless of which you prefer, however, the exam page still states that DML will be written in SQL and any other code will be in Python, so it’s best to focus on Spark SQL for the code-based questions!

In my experience, the Databricks Academy courses linked to each exam are more than enough to help you pass, but now that the Data Engineer Associate exam has been updated, educational content creators on sites like Udemy are producing practice exams for this certification.

One such course that is particularly highly rated can be found here – the reviews suggest the practice questions are as challenging as the exam itself! This instructor also has a preparation course here with similarly high ratings, costing £59.99. Thankfully, both of these courses have been updated to include content for V3 of the exam, so they’ll have everything you need if you’re looking for something outside of the Databricks Academy (if Udemy courses aren’t your thing, you can always use the Apache Spark and Databricks documentation to help with specific methods and syntax!).

As a final tip, connecting with Databricks staff on LinkedIn is a great way to keep an eye out for exam vouchers, which are either 100% or 75% off any Databricks exam! I’ve connected with Youssef Mrini and Samantha Menot, who both post regularly about new Databricks features, training and exam vouchers!

Best of luck for your exam! 😊

Dylan Jones