Hive-metastore on Kubernetes with S3 External Table

This blog on hive-metastore originally appeared on Medium. It has been republished with the author’s credit and consent.

In this blog, I’ll cover how to set up Hive metastore on Kubernetes and then leverage external S3 data sets.

Installation

In order to deploy Hive components on the Kubernetes cluster, I first add the required helm chart repo:

<span style="font-weight: 400;">$ helm repo add bigdata-gradiant https://gradiant.github.io/bigdata-charts/</span>

<span style="font-weight: 400;">$ helm repo update</span>

$ helm repo add bigdata-gradiant https://gradiant.github.io/bigdata-charts/

$ helm repo update

I can then search the new repo for the available helm charts. At the time of this writing, these are:

<span style="font-weight: 400;">$ helm search repo bigdata-gradiant</span>

<span style="font-weight: 400;">NAME                              CHART VERSION APP VERSION DESCRIPTION                                      </span>

<span style="font-weight: 400;">bigdata-gradiant/hbase            0.1.6         2.0.1       HBase is an open-source non-relational distribu...</span>

<span style="font-weight: 400;">bigdata-gradiant/hdfs             0.1.10        2.7.7       The Apache Hadoop software library is a framewo...</span>

<span style="font-weight: 400;">bigdata-gradiant/hive             0.1.6         2.3.6       The Apache Hive ™ data warehouse software facil...</span>

<span style="font-weight: 400;">bigdata-gradiant/hive-metastore   0.1.3         2.3.6       The Apache Hive ™ data warehouse software facil...</span>

<span style="font-weight: 400;">bigdata-gradiant/jupyter          0.1.11        6.0.3       Helm </span><span style="font-weight: 400;">for</span><span style="font-weight: 400;"> jupyter single server with pyspark sup...</span>

<span style="font-weight: 400;">bigdata-gradiant/kafka-connect-ui 0.1.0         0.9.7       Helm </span><span style="font-weight: 400;">for</span><span style="font-weight: 400;"> Landoop/kafka-connect-ui                </span>

<span style="font-weight: 400;">bigdata-gradiant/opentsdb         0.1.7         2.4.0       Store and serve massive amounts of time series ...</span>

<span style="font-weight: 400;">bigdata-gradiant/spark-standalone 0.1.0         2.4.4       Apache Spark™ is a unified analytics engine </span><span style="font-weight: 400;">for</span><span style="font-weight: 400;">...</span>

$ helm search repo bigdata-gradiant

NAME CHART VERSION APP VERSION DESCRIPTION

bigdata-gradiant/hbase 0.1.6 2.0.1 HBase is an open-source non-relational distribu...

bigdata-gradiant/hdfs 0.1.10 2.7.7 The Apache Hadoop software library is a framewo...

bigdata-gradiant/hive 0.1.6 2.3.6 The Apache Hive ™ data warehouse software facil...

bigdata-gradiant/hive-metastore 0.1.3 2.3.6 The Apache Hive ™ data warehouse software facil...

bigdata-gradiant/jupyter 0.1.11 6.0.3 Helm for jupyter single server with pyspark sup...

bigdata-gradiant/kafka-connect-ui 0.1.0 0.9.7 Helm for Landoop/kafka-connect-ui

bigdata-gradiant/opentsdb 0.1.7 2.4.0 Store and serve massive amounts of time series ...

bigdata-gradiant/spark-standalone 0.1.0 2.4.4 Apache Spark™ is a unified analytics engine for...

To link Trino to some S3 data sets, I’ll be deploying a Hive metastore:

<span style="font-weight: 400;">$</span> <span style="font-weight: 400;">helm</span> <span style="font-weight: 400;">install</span> <span style="font-weight: 400;">hivems</span> <span style="font-weight: 400;">bigdata-gradiant/hive-metastore</span> <span style="font-weight: 400;">-n</span> <span style="font-weight: 400;">analytics</span>

<span style="font-weight: 400;">NAME:</span> <span style="font-weight: 400;">hivems</span>

<span style="font-weight: 400;">LAST DEPLOYED:</span> <span style="font-weight: 400;">Thu</span> <span style="font-weight: 400;">Aug</span> <span style="font-weight: 400;">10</span> <span style="font-weight: 400;">10</span><span style="font-weight: 400;">:21:47</span> <span style="font-weight: 400;">2023</span>

<span style="font-weight: 400;">NAMESPACE:</span> <span style="font-weight: 400;">analytics</span>

<span style="font-weight: 400;">STATUS:</span> <span style="font-weight: 400;">deployed</span>

<span style="font-weight: 400;">REVISION:</span> <span style="font-weight: 400;">1</span>

<span style="font-weight: 400;">TEST SUITE:</span> <span style="font-weight: 400;">None
</span>

$ helm install hivems bigdata-gradiant/hive-metastore -n analytics

NAME: hivems

LAST DEPLOYED: Thu Aug 10 10:21:47 2023

NAMESPACE: analytics

STATUS: deployed

REVISION: 1

TEST SUITE: None

I check that my pods have started correctly within Kubernetes:

<span style="font-weight: 400;">$ kubectl -n analytics get pods</span>

<span style="font-weight: 400;">NAME                                                READY   STATUS  </span>

<span style="font-weight: 400;">...                0             42m</span>

<span style="font-weight: 400;">hivems-hive-metastore-0                             1/1     Running  </span>

<span style="font-weight: 400;">hivems-postgresql-0                                 1/1     Running </span>

<span style="font-weight: 400;">...</span>

$ kubectl -n analytics get pods

NAME READY STATUS

... 0 42m

hivems-hive-metastore-0 1/1 Running

hivems-postgresql-0 1/1 Running

...

I then copy out the created configmap and append the following to the data hive-site.xml section (replace with your S3 endpoint values):

<span style="font-weight: 400;">$ kubectl -n analytics get configmap hivems-hive-metastore -o yaml  &gt; hivems-hive-metastore.yaml</span>

<span style="font-weight: 400;">$ vi hivems-hive-metastore.yaml</span>

<span style="font-weight: 400;">### I MAKE THE FOLLOWING CHANGE/ADDITION ###</span>

<span style="font-weight: 400;">data:</span>

<span style="font-weight: 400;"> hive-site.xml: |</span>

<span style="font-weight: 400;">   </span><span style="font-weight: 400;">&lt;?xml version=</span><span style="font-weight: 400;">"1.0"</span><span style="font-weight: 400;">?&gt;</span>

<span style="font-weight: 400;">   </span><span style="font-weight: 400;">&lt;?xml-stylesheet type=</span><span style="font-weight: 400;">"text/xsl"</span><span style="font-weight: 400;"> href=</span><span style="font-weight: 400;">"configuration.xsl"</span><span style="font-weight: 400;">?&gt;</span>

<span style="font-weight: 400;">   </span><span style="font-weight: 400;">&lt;configuration&gt;</span>

<span style="font-weight: 400;">       </span><span style="font-weight: 400;">&lt;property&gt;</span>

<span style="font-weight: 400;">         </span><span style="font-weight: 400;">&lt;name&gt;</span><span style="font-weight: 400;">fs.s3a.endpoint</span><span style="font-weight: 400;">&lt;/name&gt;</span>

<span style="font-weight: 400;">         </span><span style="font-weight: 400;">&lt;value&gt;</span><span style="font-weight: 400;">192.168.2.2</span><span style="font-weight: 400;">&lt;/value&gt;</span>

<span style="font-weight: 400;">       </span><span style="font-weight: 400;">&lt;/property&gt;</span>

<span style="font-weight: 400;">       </span><span style="font-weight: 400;">&lt;property&gt;</span>

<span style="font-weight: 400;">         </span><span style="font-weight: 400;">&lt;name&gt;</span><span style="font-weight: 400;">fs.s3a.access.key</span><span style="font-weight: 400;">&lt;/name&gt;</span>

<span style="font-weight: 400;">         </span><span style="font-weight: 400;">&lt;value&gt;</span><span style="font-weight: 400;">PSFB....JEIA</span><span style="font-weight: 400;">&lt;/value&gt;</span>

<span style="font-weight: 400;">       </span><span style="font-weight: 400;">&lt;/property&gt;</span>

<span style="font-weight: 400;">       </span><span style="font-weight: 400;">&lt;property&gt;</span>

<span style="font-weight: 400;">         </span><span style="font-weight: 400;">&lt;name&gt;</span><span style="font-weight: 400;">fs.s3a.secret.key</span><span style="font-weight: 400;">&lt;/name&gt;</span>

<span style="font-weight: 400;">         </span><span style="font-weight: 400;">&lt;value&gt;</span><span style="font-weight: 400;">A121....eJOEN</span><span style="font-weight: 400;">&lt;/value&gt;</span>

<span style="font-weight: 400;">       </span><span style="font-weight: 400;">&lt;/property&gt;</span>

<span style="font-weight: 400;">       </span><span style="font-weight: 400;">&lt;property&gt;</span>

<span style="font-weight: 400;">         </span><span style="font-weight: 400;">&lt;name&gt;</span><span style="font-weight: 400;">fs.s3a.connection.ssl.enabled</span><span style="font-weight: 400;">&lt;/name&gt;</span>

<span style="font-weight: 400;">         </span><span style="font-weight: 400;">&lt;value&gt;</span><span style="font-weight: 400;">false</span><span style="font-weight: 400;">&lt;/value&gt;</span>

<span style="font-weight: 400;">       </span><span style="font-weight: 400;">&lt;/property&gt;</span>

$ kubectl -n analytics get configmap hivems-hive-metastore -o yaml > hivems-hive-metastore.yaml

$ vi hivems-hive-metastore.yaml

### I MAKE THE FOLLOWING CHANGE/ADDITION ###

data:

hive-site.xml: |

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<name>fs.s3a.endpoint</name>

</property>

<name>fs.s3a.access.key</name>

</property>

<name>fs.s3a.secret.key</name>

<value>A121....eJOEN</value>

</property>

<name>fs.s3a.connection.ssl.enabled</name>

<value>false</value>

</property>

Note: I had to create a new container image to include the missing hadoop-aws-2.7.4.jar file. This can be obtained from my Docker repo jboothomas/hive-metastore-s3:v6. I then simply provided this as the image to use for the hive-metastore deployment.

Table Creation

To create an external table, I exec into the hive metastore pod and connect to hive:

<span style="font-weight: 400;">$ kubectl -n analytics </span><span style="font-weight: 400;">exec</span><span style="font-weight: 400;"> -it hivems-hive-metastore-0 - /bin/sh</span>

<span style="font-weight: 400;"># hive</span>

<span style="font-weight: 400;">SLF4J: Class path contains multiple SLF4J bindings.</span>

<span style="font-weight: 400;">SLF4J: Found binding </span><span style="font-weight: 400;">in</span><span style="font-weight: 400;"> [jar:file:/opt/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]</span>

<span style="font-weight: 400;">SLF4J: Found binding </span><span style="font-weight: 400;">in</span><span style="font-weight: 400;"> [jar:file:/opt/hadoop-2.7.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]</span>

<span style="font-weight: 400;">SLF4J: See https://www.slf4j.org/codes.html</span><span style="font-weight: 400;">#multiple_bindings for an explanation.</span>

<span style="font-weight: 400;">SLF4J: Actual binding is of </span><span style="font-weight: 400;">type</span><span style="font-weight: 400;"> [org.apache.logging.slf4j.Log4jLoggerFactory]</span>

$ kubectl -n analytics exec -it hivems-hive-metastore-0 - /bin/sh

# hive

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in [jar:file:/opt/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/opt/hadoop-2.7.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See https://www.slf4j.org/codes.html#multiple_bindings for an explanation.

SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

Logging initialized using configuration in jar:file:/opt/hive/lib/hive-common-2.3.2.jar!/hive-log4j2.properties Async: true

Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.

<span style="font-weight: 400;">hive&gt;</span>

1	<span style="font-weight: 400;">hive></span>

Now I can create my external table:

<span style="font-weight: 400;">create</span> <span style="font-weight: 400;">external</span> <span style="font-weight: 400;">table</span><span style="font-weight: 400;"> if </span><span style="font-weight: 400;">not</span> <span style="font-weight: 400;">exists</span><span style="font-weight: 400;"> nyctaxi(</span>

<span style="font-weight: 400;">   VendorID </span><span style="font-weight: 400;">bigint</span><span style="font-weight: 400;">,</span>

<span style="font-weight: 400;">   tpep_pickup_datetime </span><span style="font-weight: 400;">timestamp</span><span style="font-weight: 400;">,</span>

<span style="font-weight: 400;">   tpep_dropoff_datetime </span><span style="font-weight: 400;">timestamp</span><span style="font-weight: 400;">,</span>

<span style="font-weight: 400;">   passenger_count </span><span style="font-weight: 400;">double</span><span style="font-weight: 400;">,</span>

<span style="font-weight: 400;">   trip_distance </span><span style="font-weight: 400;">double</span><span style="font-weight: 400;">,</span>

<span style="font-weight: 400;">   RatecodeID </span><span style="font-weight: 400;">double</span><span style="font-weight: 400;">,</span>

<span style="font-weight: 400;">   store_and_fwd_flag string,</span>

<span style="font-weight: 400;">   PULocationID </span><span style="font-weight: 400;">bigint</span><span style="font-weight: 400;">,</span>

<span style="font-weight: 400;">   DOLocationID </span><span style="font-weight: 400;">bigint</span><span style="font-weight: 400;">,</span>

<span style="font-weight: 400;">   payment_type </span><span style="font-weight: 400;">bigint</span><span style="font-weight: 400;">,</span>

<span style="font-weight: 400;">   fare_amount </span><span style="font-weight: 400;">double</span><span style="font-weight: 400;">,</span>

<span style="font-weight: 400;">   extra </span><span style="font-weight: 400;">double</span><span style="font-weight: 400;">,</span>

<span style="font-weight: 400;">   mta_tax </span><span style="font-weight: 400;">double</span><span style="font-weight: 400;">,</span>

<span style="font-weight: 400;">   tip_amount </span><span style="font-weight: 400;">double</span><span style="font-weight: 400;">,</span>

<span style="font-weight: 400;">   tolls_amount </span><span style="font-weight: 400;">double</span><span style="font-weight: 400;">,</span>

<span style="font-weight: 400;">   improvement_surcharge </span><span style="font-weight: 400;">double</span><span style="font-weight: 400;">,</span>

<span style="font-weight: 400;">   total_amount </span><span style="font-weight: 400;">double</span>

<span style="font-weight: 400;">   )</span>

<span style="font-weight: 400;">   STORED </span><span style="font-weight: 400;">AS</span><span style="font-weight: 400;"> PARQUET</span>

<span style="font-weight: 400;">   LOCATION </span><span style="font-weight: 400;">'s3a://nyctaxi/'</span><span style="font-weight: 400;">;</span>

<span style="font-weight: 400;">Time</span><span style="font-weight: 400;"> taken: </span><span style="font-weight: 400;">2.756</span><span style="font-weight: 400;"> seconds</span>

create external table if not exists nyctaxi(

VendorID bigint,

tpep_pickup_datetime timestamp,

tpep_dropoff_datetime timestamp,

passenger_count double,

trip_distance double,

RatecodeID double,

store_and_fwd_flag string,

PULocationID bigint,

DOLocationID bigint,

payment_type bigint,

fare_amount double,

extra double,

mta_tax double,

tip_amount double,

tolls_amount double,

improvement_surcharge double,

total_amount double

)

STORED AS PARQUET

LOCATION 's3a://nyctaxi/';

Time taken: 2.756 seconds

From this point onwards, I can leverage various analytics tools and point to my hive-metastore service to run queries against this table (for example: trino).

Hive-metastore on Kubernetes with S3 External Table

Installation

Table Creation

Upskill Your Knowledge!

IT Orchestration vs. IT Automation: What’s the Difference?

Failover vs. Failback: What’s the Difference?

How to Deploy a Docker Image with Ansible

Docker vs. Podman

Demystifying DirectFlash Modules vs. SSDs vs. HDDs vs. Hybrid

IT Orchestration vs. IT Automation: What’s the Difference?

Failover vs. Failback: What’s the Difference?

How to Deploy a Docker Image with Ansible

Docker vs. Podman

Hive-metastore on Kubernetes with S3 External Table

Installation

Table Creation

Upskill Your Knowledge!

Related Stories

Top Stories