
This article on Airbyte S3 initially appeared on Medium. It was republished with the author’s credit and consent. 

In this blog, I’ll show a simple implementation of Airbyte on Kubernetes with S3 integration on Pure Storage® FlashBlade®.

On our Kubernetes server, which already has Helm installed, I first add the required Helm repo for Airbyte.

Then, I deploy the chart to the desired namespace, passing a values.yaml if required.
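As a rough sketch of these two steps, the helm invocations can be assembled and run from Python. The repo URL is Airbyte's public Helm chart repository, and the release name s3airbyte is inferred from the service name used later in this post. The namespace airbyte, the --create-namespace flag, and the helper names are assumptions for illustration only.

```python
import subprocess

# Airbyte's public chart repository. The local repo alias is an arbitrary choice.
HELM_REPO_NAME = "airbyte"
HELM_REPO_URL = "https://airbytehq.github.io/helm-charts"


def helm_commands(release="s3airbyte", namespace="airbyte", values_file=None):
    """Return the helm invocations as argument lists, in the order to run them.

    The defaults are assumptions: "s3airbyte" is inferred from the service
    name in this post, and "airbyte" is a hypothetical namespace.
    """
    install = [
        "helm", "install", release, f"{HELM_REPO_NAME}/airbyte",
        "--namespace", namespace, "--create-namespace",
    ]
    if values_file:
        # Only pass a values file when one is actually needed.
        install += ["--values", values_file]
    return [
        ["helm", "repo", "add", HELM_REPO_NAME, HELM_REPO_URL],
        ["helm", "repo", "update"],
        install,
    ]


def deploy(**kwargs):
    """Run each helm command in turn, stopping at the first failure."""
    for cmd in helm_commands(**kwargs):
        subprocess.run(cmd, check=True)
```

Calling deploy(values_file="values.yaml") would run the three commands in order. Printing the result of helm_commands() first is a cheap way to review them before anything touches the cluster.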

I edit service/s3airbyte-airbyte-webapp-svc to change its type from ClusterIP to NodePort, which gives quick access to the web interface through a port on the node.
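The same change can be made non-interactively with kubectl patch instead of an interactive edit. This is a substitute technique, not the step performed above. The sketch below builds that command; the namespace airbyte and the helper names are assumptions.

```python
import json
import subprocess


def nodeport_patch_command(service="s3airbyte-airbyte-webapp-svc", namespace="airbyte"):
    """Build a kubectl command that switches the service type to NodePort.

    The namespace default is hypothetical; use the one the chart was
    installed into.
    """
    patch = json.dumps({"spec": {"type": "NodePort"}})
    return [
        "kubectl", "patch", "service", service,
        "--namespace", namespace,
        "--patch", patch,
    ]


def expose_webapp(**kwargs):
    """Apply the patch; raises CalledProcessError if kubectl fails."""
    subprocess.run(nodeport_patch_command(**kwargs), check=True)
```

After the patch, kubectl get service on the same service name shows the node port that Kubernetes allocated.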


Airbyte interface.

Let’s now create a simple connector to pull data from an S3 bucket. I select Create Connector and choose S3 as the type. Then, I fill in the optional fields for the AWS access key and secret key, as well as the endpoint. Since my demo environment has no SSL certificate, I specify the endpoint as a plain http:// URL.
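Before entering these values in Airbyte, it can help to confirm that the same endpoint and keys work from a script. The sketch below assumes the third-party boto3 library is installed. The helper names and the example hostname are hypothetical, and the check simply rejects an endpoint that lacks an explicit http:// or https:// scheme.

```python
from urllib.parse import urlparse


def s3_client_kwargs(endpoint, access_key, secret_key):
    """Build keyword arguments for boto3.client("s3", ...) with a custom endpoint."""
    parsed = urlparse(endpoint)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(
            f"endpoint must include a scheme, e.g. http://host, got {endpoint!r}"
        )
    return {
        "endpoint_url": endpoint,
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
    }


def make_s3_client(endpoint, access_key, secret_key):
    """Create an S3 client pointed at the given endpoint (requires boto3)."""
    import boto3  # imported lazily so s3_client_kwargs has no dependencies

    return boto3.client("s3", **s3_client_kwargs(endpoint, access_key, secret_key))
```

For example, make_s3_client("http://flashblade.example.internal", access_key, secret_key) returns a client for a hypothetical data address reached over plain HTTP.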


Airbyte will test the source and then prompt for a destination to be created.

For the sake of this blog post, I’ll simply use a second, newly created S3 bucket on our FlashBlade as the destination and test converting the data from Parquet to Avro as the transformation.


Again, Airbyte will test the destination, and after validation, I’m presented with the Configure Connection settings page. Adjust the settings to suit your needs; I’ll leave everything at the defaults.


After setup, I’m taken to the Connection Management pages, where I can see the connection’s status, job history, replication, transformation, and settings.


While the sync is in progress, I quickly check the objects in the source and destination buckets.
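One way to run this kind of check from a script is to page through each bucket with the S3 ListObjectsV2 call. The sketch below assumes a boto3-style S3 client that already points at the FlashBlade endpoint. The function name and the bucket names in the usage note are hypothetical.

```python
def list_keys(client, bucket):
    """Return (key, size) pairs for every object in the bucket.

    `client` is assumed to be a boto3 S3 client (or anything exposing a
    compatible list_objects_v2 method). Pagination is followed until the
    response is no longer truncated.
    """
    objects = []
    kwargs = {"Bucket": bucket}
    while True:
        page = client.list_objects_v2(**kwargs)
        # An empty bucket returns a page with no "Contents" entry.
        for obj in page.get("Contents", []):
            objects.append((obj["Key"], obj["Size"]))
        if not page.get("IsTruncated"):
            return objects
        kwargs["ContinuationToken"] = page["NextContinuationToken"]
```

Calling list_keys(client, "airbyte-source") and list_keys(client, "airbyte-destination") while the job runs gives a side-by-side view of what has been written so far.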

To see the progress of the job, select Job History > View logs:


The log shows the current count of records processed for my example S3 connector job:


When the job finishes, Job History displays the details of the successful sync:


Let’s check the destination bucket contents. I now have four objects from our Parquet-to-Avro conversion.
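A quick way to confirm that the destination holds the expected format is to count object keys by file extension. This is a minimal sketch. The function name and the keys in the usage note are hypothetical, and real Airbyte output keys may follow a different naming pattern.

```python
from collections import Counter
from pathlib import PurePosixPath


def count_by_suffix(keys):
    """Count object keys by lower-cased file extension.

    Keys without an extension are grouped under "(none)".
    """
    return Counter(PurePosixPath(key).suffix.lower() or "(none)" for key in keys)
```

Given a listing of destination keys such as ["data/part_0.avro", "data/part_1.avro"], the function returns a count of 2 for ".avro", and any leftover ".parquet" entries would show up as a separate count.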

Airbyte provides a simple platform to extract, transform, and load data between multiple sources and destinations, thanks to its 300+ connectors.

Pure Storage FlashBlade’s S3 storage is simple to integrate and provides a fast, scalable S3 layer for Airbyte and analytics applications to leverage within the larger data pipeline picture.