hive

Apache Hive Puppet Module

Version information

released Mar 1st 2020

This version is compatible with:

Puppet Enterprise 2025.2.x, 2025.1.x, 2023.8.x, 2023.7.x, 2023.6.x, 2023.5.x, 2023.4.x, 2023.3.x, 2023.2.x, 2023.1.x, 2023.0.x, 2021.7.x, 2021.6.x, 2021.5.x, 2021.4.x, 2021.3.x, 2021.2.x, 2021.1.x, 2021.0.x, 2019.8.x, 2019.7.x, 2019.5.x, 2019.4.x, 2019.3.x, 2019.2.x, 2019.1.x, 2019.0.x, 2018.1.x, 2017.3.x, 2017.2.x, 2017.1.x, 2016.5.x, 2016.4.x
Puppet >=3.4.0

Start using this module

Installation method

Add this module to your Puppetfile:

mod 'cesnet-hive', '0.15.0'

Add this module to your Bolt project:

bolt module add cesnet-hive

Manually install this module globally with Puppet module tool:

puppet module install cesnet-hive --version 0.15.0

Tags: hadoop, hive

Documentation

cesnet/hive — version 0.15.0 Mar 1st 2020

Apache Hive Puppet Module

####Table of Contents

Module Description - What the module does and why it is useful
Setup - The basics of getting started with Hive
Usage - Configuration options and additional functionality
Reference - An under-the-hood peek at what the module is doing and how
Limitations - OS compatibility, etc.
Development - Guide for contributing to the module

##Module Description

This module installs and sets up the Apache Hive data warehouse software running on top of a Hadoop cluster. Hive services can be collocated with other services in the cluster or run on separate nodes. Optionally, Kerberos-based security can be enabled; it should be enabled whenever security is enabled on the Hadoop cluster.

A Puppet client configured with stringify_facts=false is recommended, but not required (see also the schema_file parameter).

Tested with:

Debian 7/wheezy, 8/jessie: Cloudera distribution (tested on Hive 0.13.1, 2.1.1)
RHEL 6 and clones: Cloudera distribution (tested with Hadoop 2.6.0)

##Setup

###What cesnet-hive module affects

Packages: installs Hive packages (common packages, subsets for requested services, hcatalog, and/or hive client)
Files modified:
/etc/hive/* (or /etc/hive/conf/*)
/usr/local/sbin/hivemanager (only when the administrator manager script is requested through the features parameter)
Alternatives:
alternatives are used for /etc/hive/conf in Cloudera
this module switches to the new alternative by default, so the Cloudera original configuration can be kept intact
Services: only requested Hive services are setup and started
metastore
server2
Helper Files:
/var/lib/hadoop-hdfs/.puppet-hive-dir-created (created by cesnet-hadoop module)
Secret Files (keytabs): permissions are modified for hive service keytab (/etc/security/keytab/hive.service.keytab)
Facts: hive_schemas (stringify_facts=false is needed when using this fact)
Databases: for supported databases (when not disabled), a user is created and the database schema is imported using the puppetlabs modules

###Setup Requirements

There are several known or intended limitations in this module.

Be aware of:

Repositories - see cesnet-hadoop module Setup Requirements for details
No inter-node dependencies are handled: a running HDFS namenode is required for the Hive metastore server to start
Secure mode: keytabs must be prepared in /etc/security/keytab/ (see realm parameter)
Database setup: MariaDB/MySQL and PostgreSQL are supported. You need to install the puppetlabs-mysql or puppetlabs-postgresql module yourself, because they are not declared as dependencies.
Hadoop: it should be configured locally, or you should use the hdfs_hostname parameter (see hive class)

###Beginning with Hive

Let's start with basic examples.

Example: The simplest setup, without security or zookeeper, with everything on a single machine:

class{"hive":
  hdfs_hostname => $::fqdn,
  metastore_hostname => $::fqdn,
  server2_hostname => $::fqdn,
}

node <HDFS_NAMENODE> {
  # HDFS initialization must be done on the namenode
  # (or /user/hive on HDFS must be created)
  include hive::hdfs
}

node default {
  # server
  include ::hive::metastore
  include ::hive::server2
  # client
  include ::hive::frontend
  include ::hive::hcatalog
  # worker nodes
  include ::hive::worker
}

Modify $::fqdn and node(s) section as needed.

We recommend:

using zookeeper and setting the hive parameter zookeeper_hostnames (the cesnet-zookeeper module can be used to install zookeeper)
if collocated with the HDFS namenode, adding the dependency Class['hadoop::namenode::service'] -> Class['hive::metastore::service']
if not collocated, making sure the HDFS namenode is running first, or restarting the Hive metastore later
using the hadoop class plus some other component (or the hadoop::common::config class) - see the hdfs_hostname parameter
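
As a rough sketch of the first two recommendations, the initial example could be extended as follows. The zookeeper hostnames are hypothetical placeholders, and the ordering line only makes sense on a node that also runs the HDFS namenode:

```puppet
class { 'hive':
  hdfs_hostname       => $::fqdn,
  metastore_hostname  => $::fqdn,
  server2_hostname    => $::fqdn,
  # hypothetical hostnames of the zookeeper quorum
  zookeeper_hostnames => ['zoo1.example.com', 'zoo2.example.com', 'zoo3.example.com'],
}

# only when the metastore is collocated with the HDFS namenode:
# start the namenode before the Hive metastore
Class['hadoop::namenode::service'] -> Class['hive::metastore::service']
```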

##Usage

It is highly recommended to use a real database backend instead of Derby. Security can also be enabled.

Hive is used together with other components in roles in cesnet::site_hadoop puppet module.

Alternatively, the following examples show how to use the hive puppet module directly:

Example 1: Setup with security:

Additional permissions in the Hadoop cluster are needed: add the hive proxy user.

class{"hadoop":
...
  properties => {
    'hadoop.proxyuser.hive.groups' => 'hive,impala,oozie,users',
    'hadoop.proxyuser.hive.hosts' => '*',
  },
...
}

class{"hive":
  group => 'users',
  metastore_hostname => $::fqdn,
  realm => 'MY.REALM',
}

Use the node sections from the initial example; modify $::fqdn and the node sections as needed.

Example 2: MySQL database (the puppetlabs-mysql puppet module must be installed).

Add this to the initial example:

class{"hive":
  ...
  db          => 'mysql',
  #db          => 'mariadb',
  db_password => 'hivepassword',
}

node default {
  ...

  class { 'mysql::server':
    root_password  => 'strongpassword',
  }

  class { 'mysql::bindings':
    java_enable       => true,
    #java_package_name => 'libmariadb-java',
  }
}

The database is created in the hive::metastore::db class (part of hive::metastore).

Example 3: PostgreSQL database (the puppetlabs-postgresql puppet module must be installed).

Add this to the initial example:

class{"hive":
  ...
  db          => 'postgresql',
  db_password => 'hivepassword',
}

node default {
  ...

  class { 'postgresql::server':
    postgres_password => 'strongpassword',
  }
  include postgresql::lib::java
  ...
}

###Enable Security

Security in Hadoop (and Hive) is based on Kerberos. Keytab files need to be prepared in the proper places before enabling security.

The following parameters are used for security (see also hive class):

realm: enables security and specifies the Kerberos realm to use. An empty string disables security. Enabling security requires:
- installed Kerberos client (Debian: krb5-user/heimdal-clients; RedHat: krb5-workstation)
- configured Kerberos client (/etc/krb5.conf, /etc/krb5.keytab)
- /etc/security/keytab/hive.service.keytab (on all server nodes)
sentry_hostname: enables usage of the Sentry authorization service. When not specified, Hive server2 impersonation is enabled and authorization works through HDFS permissions.

####Impersonation

Authorization by impersonation of the user. Used when sentry_hostname is not specified.

Hadoop needs to have proxyuser enabled for it:

# 'users' is the group in *group* parameter
hadoop.proxyuser.hive.groups => 'hive,users'
hadoop.proxyuser.hive.hosts  => '*'

Users need access to the warehouse directory. The group is set to users by default. Other add-ons (such as impala) need to be in the users group too!

Another option is to add users to the hive group and use that group instead (simpler, but less secure).

####Sentry

Authorization by Sentry. Used when sentry_hostname is specified.

Hive itself runs under the 'hive' user. Hadoop and Hive must have security enabled.

Warehouse directory must have 'hive' group ownership. It is set by the puppet module by default.
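
A minimal sketch of a Sentry-enabled setup, using only parameters documented for the hive class. The Sentry hostname is a hypothetical placeholder, and the Sentry service itself is assumed to be installed separately:

```puppet
class { 'hive':
  metastore_hostname => $::fqdn,
  realm              => 'MY.REALM',
  # hypothetical hostname of the external Sentry service
  sentry_hostname    => 'sentry.example.com',
  # group is left unset: with Sentry it defaults to 'hive'
}
```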

###Multihome Support

Multihome is supported by Hive out-of-the-box.

###Changing defaultFS (converting non-HA cluster, ...)

Changing defaultFS can be needed when, for example:

changing Hadoop cluster name
using cluster name because of converting non-HA cluster to High Availability

However, existing objects in the Hive schema still use the old URL with the previous defaultFS and need to be converted.

Getting the old URL:

hive --service metatool -listFSRoot 2>/dev/null

Convert (you can try a test run first using --dryRun):

OLD_URL="hdfs://NAMENODE_HOSTNAME:8020"
NEW_URL="hdfs://CLUSTER_NAME"
hive --service metatool -updateLocation ${NEW_URL} ${OLD_URL} --dryRun
hive --service metatool -updateLocation ${NEW_URL} ${OLD_URL}

###Cluster with more HDFS Name nodes

If the Hadoop cluster uses more than one HDFS namenode (high availability, namespaces, ...), the 'hive' system user must exist on all of them for authorization to work properly. You could install the full Hive client (using hive::frontend::install), but just creating the user is enough (using hive::user).

Note that the hive::hdfs class must be used too, but only on one of the HDFS namenodes. It includes hive::user.

Example:

node <HDFS_NAMENODE> {
  include hive::hdfs
}

node <HDFS_OTHER_NAMENODE> {
  include hive::user
}

###Upgrade

The best way is to refresh the configuration from the new original (i.e. remove the old one) and relaunch puppet on top of it. The schema also needs to be updated using schematool or the upgrade scripts in /usr/lib/hive/scripts/metastore/upgrade/DATABASE/.

For example (using mysql, from Hive 0.13.0):

alternative='cluster'
d='hive'
mv /etc/${d}/conf.${alternative} /etc/${d}/conf.cdhXXX
update-alternatives --auto ${d}-conf

# upgrade
...

# metadata schema upgrade
mysqldump --opt metastore > metastore-backup.sql
mysqldump --skip-add-drop-table --no-data metastore > my-schema-backup.mysql.sql
/usr/lib/hive/bin/schematool -dbType mysql -upgradeSchemaFrom 0.13.0 -userName root -passWord MYSQL_ROOT_PASSWORD

puppet agent --test
#or: puppet apply ...

##Reference

###Classes

hive: The main configuration class for Apache Hive
hive::hbase: Client Support for HBase
hive::hdfs: HDFS initializations
hive::params
hive::service
common:
hive::common::config
hive::common::daemon
hive::common::postinstall
hive::frontend: Hive Client
hive::frontend::config
hive::frontend::install
hive::hcatalog: Hive HCatalog Client
hive::hcatalog::config
hive::hcatalog::install
hive::metastore: Hive Metastore
hive::metastore::config
hive::metastore::install
hive::metastore::db
hive::metastore::service
hive::server2: Hive Server
hive::server2::config
hive::server2::install
hive::server2::service
hive::user: Create hive system user, if needed
hive::worker: Hive support at the worker node

###Facts

hive_schemas: database schema file for each database backend

###hive class

####confdir

Hive config directory. Default: '/etc/hive/conf' or '/etc/hive'.

####group

Hive group on HDFS. Default: 'users' (without sentry), 'hive' (with sentry).

Hive impersonation (without sentry) expects all users to belong to the specified group.

The group is not updated on HDFS when this parameter changes. When changing it, either remove the /var/lib/hadoop-hdfs/.puppet-hive-dir-created file or update the group of /user/hive on HDFS manually.

####hdfs_hostname

HDFS hostname (or defaultFS value), if different from core-site.xml Hadoop file. Default: undef.

It is recommended to have the core-site.xml file instead. core-site.xml will be created when installing any Hadoop component or if you include hadoop::common::config class.

####keytab

Hive keytab file. Default: '/etc/security/keytab/hive.service.keytab'.

Only used with security (realm parameter).

####keytab_source

Puppet source for keytab file. Default: undef.

When specified, the Hive keytab file is created from this puppet source (or sources). Otherwise only permissions are set on the keytab file.

Only used with security (realm parameter).
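
For illustration, a keytab could be distributed from a site-specific module like this. The module name site_hadoop and the file name in the source URL are hypothetical; only the puppet:///modules/... URL scheme is standard Puppet:

```puppet
class { 'hive':
  metastore_hostname => $::fqdn,
  realm              => 'MY.REALM',
  # hypothetical location of the keytab inside a site-specific module
  keytab_source      => 'puppet:///modules/site_hadoop/hive.service.keytab',
}
```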

####metastore_hostname

Hostname of the metastore server. Default: undef.

When specified, remote mode is activated (recommended).

####principal

Hive Kerberos principal. Default: '::default' (="hive/_HOST@${hive::realm}").

####sentry_hostname

Hostname of the (external) Sentry service. Default: undef.

Non-empty value will enable Hive settings needed to use Sentry authorization service.

When Sentry is enabled, the hive user also needs to be added to allowed.system.users for Hadoop YARN containers.

####server2_hostname

Hostname of the Hive server. Default: undef.

Used only for hivemanager script.

####zookeeper_hostnames

Array of hostnames of the zookeeper quorum. Default: undef.

Used for lock management (recommended).

####zookeeper_port

Zookeeper port, if different from the default (2181). Default: undef.

####realm

Kerberos realm. Default: ''.

Empty string disables the security.

When security is enabled, you also need either Sentry service (sentry_hostname parameter) or proxyuser properties to Hadoop cluster for Hive impersonation. See Enable Security.

####properties

Additional properties. Default: undef.

####descriptions

Descriptions for the additional properties. Default: undef.
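
A small sketch of how these two parameters might be combined. It assumes both are hashes keyed by the Hive property name, which is not spelled out above; hive.server2.thrift.port is a standard HiveServer2 property used here only as an example:

```puppet
class { 'hive':
  metastore_hostname => $::fqdn,
  # assumption: hash of property name => value
  properties         => {
    'hive.server2.thrift.port' => '10000',
  },
  # assumption: hash of property name => description text
  descriptions       => {
    'hive.server2.thrift.port' => 'Port of the HiveServer2 Thrift interface',
  },
}
```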

####alternatives

Switches the alternatives used for the configuration. Default: 'cluster' (Debian) or undef.

Use it only when supported (for example with Cloudera distribution).

####database_setup_enable

Enables database setup (if supported). Default: true.

####db

Database behind the metastore. Default: undef.

The default is the embedded database (derby), but using a proper database is recommended.

Values:

derby (default): embedded database
mysql: MySQL/MariaDB
postgresql: PostgreSQL

####db_host

Database hostname for mysql, postgresql, and oracle. Default: 'localhost'.

It can be overridden by javax.jdo.option.ConnectionURL property.

####db_name

Database name for mysql and postgresql. Default: 'metastore'.

For oracle 'xe' schema is used. Can be overridden by javax.jdo.option.ConnectionURL property.

####db_user

Database user for mysql, postgresql, and oracle. Default: 'hive'.

####db_password

Database password for mysql, postgresql, and oracle. Default: undef.

####features

Enable additional features. Default: {}.

Values:

manager - script (/usr/local/sbin/hivemanager) to start/stop the Hive daemons relevant for the given node

####schema_dir

Hive directory with database schemas. Default: undef (/usr/lib/hive/scripts/metastore/upgrade).

####schema_file

Hive database schema file. Default: undef (autodetect).

Autodetection requires puppet configured with stringify_facts=false. But the value can be set directly instead (for example hive-schema-2.1.1.mysql.sql).
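
When stringify_facts=false is not an option, the schema file can be pinned explicitly. A sketch using the file name mentioned above; the name has to match the installed Hive version and the chosen database:

```puppet
class { 'hive':
  metastore_hostname => $::fqdn,
  db                 => 'mysql',
  db_password        => 'hivepassword',
  # set directly instead of relying on the hive_schemas fact
  schema_file        => 'hive-schema-2.1.1.mysql.sql',
}
```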

##Limitations

The idea behind this module is to do only one thing - set up the Hive software - and not to limit its generic usage by doing other things. You can have your own repository with the Hadoop software, and you can select which Kerberos implementation or Java version to use.

On the other hand, this leads to the limitations mentioned in the Setup Requirements section, and usage is more complicated: you may need a site-specific puppet module together with this one, such as cesnet-site_hadoop.

The puppetlabs-mysql and puppetlabs-postgresql modules are used for the database, but they are not declared as dependencies. You can disable database setup altogether with the database_setup_enable parameter.
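
For instance, when the metastore database is managed outside of Puppet, a configuration along these lines might be used. The database hostname is a hypothetical placeholder, and the user, database, and schema are assumed to already exist there:

```puppet
class { 'hive':
  metastore_hostname    => $::fqdn,
  db                    => 'postgresql',
  # hypothetical external database server
  db_host               => 'db.example.com',
  db_name               => 'metastore',
  db_user               => 'hive',
  db_password           => 'hivepassword',
  # skip user creation and schema import by the puppetlabs modules
  database_setup_enable => false,
}
```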

##Development

Repository: https://github.com/MetaCenterCloudPuppet/cesnet-hive
Tests:
basic: see .travis.yml
vagrant: https://github.com/MetaCenterCloudPuppet/hadoop-tests

The MIT License (MIT)

Copyright (c) 2014-2020 CESNET

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
