Basic HBase Java Classes and Methods – Part 1: Getting Started

This series of articles is meant to familiarize you with basic HBase Java classes and methods.  The goal is not to teach HBase best practices.  In fact, we will be making many compromises, as we are deploying on what is likely your desktop environment.  The focus is not the setup and configuration of Hadoop and HBase, but rather how to code against them to explore basic HBase classes and methods in Java.

Ideally, in a production environment, you should be using a Hadoop distribution such as Cloudera or Hortonworks. Only a fool would not take advantage of this pre-packaged goodness.  For learning the HBase Java methods, however, our environment will look the same to you as a developer as any Cloudera or Hortonworks environment would.

Download Hadoop

Get the latest stable version of Hadoop that is compatible with the latest stable version of HBase.  At the time of this article that is 2.7.5.  Read the HBase documentation, which tells you which versions of Hadoop are compatible with which versions of HBase. Trust me, it matters.  This is one more reason going with a distribution is smart: they figure all this out for you.  The Hadoop ecosystem is vast, and working through version compatibility matrices is not where you want to be spending your time.

Make a directory, put the binary tarball into it and untar it.

Download HBase

Get the latest stable bin version.  Make a directory, put the binary tarball into it and untar it.

Configure HBase hbase-site.xml

We need to set HBase to pseudo-distributed mode and tell it the location of HDFS.  Of course, you could run HBase in fully distributed mode, and you could use alternate filesystems as well.  But for the purpose of this exercise we are going to assume pseudo-distributed mode with HDFS.

Ensure your port number is the correct port number for HDFS; the one we configure in this article is 9000.
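The configuration that originally followed is not reproduced here.  As a minimal sketch, assuming HDFS on localhost:9000 and an HBase root directory named /hbase (an assumption, any HDFS path would do), the following writes an hbase-site.xml with the two standard properties hbase.cluster.distributed and hbase.rootdir.  It writes into a scratch directory (SKETCH_DIR is a hypothetical variable name) so you can review the file before copying it into HBase's conf directory.

```shell
# Scratch directory for the generated file; SKETCH_DIR is a hypothetical name.
SKETCH_DIR="${SKETCH_DIR:-./hbase-conf-sketch}"
mkdir -p "$SKETCH_DIR"

# Pseudo-distributed mode, with HBase data stored in HDFS on port 9000.
# The /hbase root directory is an assumption; any HDFS path would work.
cat > "$SKETCH_DIR/hbase-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:9000/hbase</value>
  </property>
</configuration>
EOF
```

If the file looks right, copy it over conf/hbase-site.xml in your HBase directory.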

 

Configure HBase hbase-env.sh

We need to find the location of our Java home.  To do this you can use the command /usr/libexec/java_home

The Java version needs to be 1.7 or later, because that is what HBase requires.

Now that you have the location of the Java home, you can edit hbase-env.sh and set JAVA_HOME where you see it defined.  Here I just leave the default commented out and add my own line below it.
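A small sketch of that step, assuming macOS (where /usr/libexec/java_home exists) with a fallback to whatever JAVA_HOME is already set elsewhere.  JAVA_HOME_DIR is a hypothetical helper variable; the script only prints the line to paste and does not edit hbase-env.sh for you.

```shell
# /usr/libexec/java_home exists on macOS; elsewhere fall back to $JAVA_HOME.
if [ -x /usr/libexec/java_home ]; then
  JAVA_HOME_DIR="$(/usr/libexec/java_home)"
else
  JAVA_HOME_DIR="${JAVA_HOME:-}"
fi

# Print the line to paste into conf/hbase-env.sh below the commented default.
echo "export JAVA_HOME=$JAVA_HOME_DIR"
```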

Add environment variables

We need to have a couple of environment variables.  We need to define the location of Hadoop and HBase and also add Hadoop and HBase bin to our PATH.  Your location may vary.  Typically you would want to add these to your .bashrc or .bash_profile.
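For example, a sketch of the lines you might add.  The install paths are assumptions, and hbase-x.y.z is a placeholder for whichever HBase version you downloaded.

```shell
# Install locations are assumptions; point these at wherever you untarred
# the Hadoop and HBase tarballs. Replace x.y.z with your HBase version.
export HADOOP_HOME="$HOME/hadoop/hadoop-2.7.5"
export HBASE_HOME="$HOME/hbase/hbase-x.y.z"

# Put the Hadoop and HBase launcher scripts on the PATH.
export PATH="$PATH:$HADOOP_HOME/bin:$HBASE_HOME/bin"
```

After reloading your shell, commands such as hadoop, hdfs and hbase should resolve without a full path.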

Basic Hadoop Test

Let's make sure Hadoop has its Java set right and all is well:
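The original command is not preserved here.  A likely candidate, and only an assumption, is the grep example from the Apache Hadoop single-node guide, which fits the later references to an output directory.  The sketch below assumes Hadoop 2.7.5 untarred under $HOME/hadoop and does nothing if the examples jar is not found.

```shell
# Assumed install location; adjust to where you untarred Hadoop.
HADOOP_HOME="${HADOOP_HOME:-$HOME/hadoop/hadoop-2.7.5}"
EXAMPLES_JAR="$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.5.jar"

if [ -f "$EXAMPLES_JAR" ]; then
  (
    cd "$HADOOP_HOME" || exit 1
    # Use Hadoop's own config files as sample input.
    mkdir -p input
    cp etc/hadoop/*.xml input
    # Count strings matching the regex; results are written to ./output,
    # which must not exist before the job runs.
    bin/hadoop jar "$EXAMPLES_JAR" grep input output 'dfs[a-z.]+'
    cat output/*
  )
else
  echo "examples jar not found: $EXAMPLES_JAR" >&2
fi
```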

The result should be:

This is running Hadoop in standalone mode.  We will put Hadoop into pseudo-distributed mode, which means that although it’s a single machine configuration, it will use a separate Java process for each function of Hadoop.

If it’s working, remove the output directory:

Configure Hadoop for Pseudo-Distributed Mode

We want to set up Hadoop so each daemon runs in its own Java process.

We will tell it to run HDFS on port 9000.  Add the following to etc/hadoop/core-site.xml:

We will tell it to keep only a single replica of each block, which effectively turns off replication. Add the following to etc/hadoop/hdfs-site.xml:
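The two settings just described correspond to the standard properties fs.defaultFS and dfs.replication.  As a sketch, the following writes both files into a scratch directory (SKETCH_DIR is a hypothetical variable name) rather than straight into etc/hadoop, so nothing existing is overwritten.

```shell
# Scratch directory for the generated files; SKETCH_DIR is a hypothetical name.
SKETCH_DIR="${SKETCH_DIR:-./hadoop-conf-sketch}"
mkdir -p "$SKETCH_DIR"

# core-site.xml: clients reach HDFS at localhost on port 9000.
cat > "$SKETCH_DIR/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF

# hdfs-site.xml: keep one copy of each block (single-machine setup).
cat > "$SKETCH_DIR/hdfs-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF
```

Copy the two files into etc/hadoop under your Hadoop directory once you have checked them.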

Make sure SSH is enabled on your host, and that passphraseless login works for localhost.  To set up passphraseless login for localhost, do the following:
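One way to do this, sketched as a small function.  The function name setup_passphraseless_ssh is hypothetical; the ssh-keygen options are the standard ones for an RSA key with an empty passphrase.  It reuses an existing id_rsa if one is already present and avoids adding the same public key twice.

```shell
# Create a passphrase-less RSA key (if none exists) and authorize it for
# logins to this machine. The optional argument is the SSH directory.
setup_passphraseless_ssh() {
  ssh_dir="${1:-$HOME/.ssh}"
  mkdir -p "$ssh_dir" && chmod 700 "$ssh_dir"
  if [ ! -f "$ssh_dir/id_rsa" ]; then
    ssh-keygen -q -t rsa -P '' -f "$ssh_dir/id_rsa"
  fi
  # Append the public key only if it is not already authorized.
  grep -qxFf "$ssh_dir/id_rsa.pub" "$ssh_dir/authorized_keys" 2>/dev/null ||
    cat "$ssh_dir/id_rsa.pub" >> "$ssh_dir/authorized_keys"
  chmod 0600 "$ssh_dir/authorized_keys"
}
```

Run setup_passphraseless_ssh with no arguments, then try ssh localhost; it should log you in without asking for a passphrase.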

Format the HDFS filesystem:

Start the NameNode and DataNode daemons:
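The format and start steps can be sketched together as follows, assuming Hadoop 2.7.5 lives under $HOME/hadoop (adjust HADOOP_HOME otherwise).  The commands are the stock hdfs and start-dfs.sh scripts shipped in the tarball; the script does nothing if they are not found.

```shell
# Assumed install location; adjust to where you untarred Hadoop.
HADOOP_HOME="${HADOOP_HOME:-$HOME/hadoop/hadoop-2.7.5}"

if [ -x "$HADOOP_HOME/bin/hdfs" ]; then
  # One-time step: initializes the NameNode metadata. Re-running it on an
  # existing filesystem discards that metadata, so format only once.
  "$HADOOP_HOME/bin/hdfs" namenode -format
  # Starts the NameNode, DataNode and SecondaryNameNode daemons.
  "$HADOOP_HOME/sbin/start-dfs.sh"
else
  echo "hdfs not found under $HADOOP_HOME" >&2
fi
```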

The Hadoop daemon log output is written to the $HADOOP_LOG_DIR directory (defaults to $HADOOP_HOME/logs).

You may see a warning:

Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

This typically means the bundled native library, libhadoop.so.1.0.0, was built for a different platform or architecture than the one you are running (for example, the binary tarball ships Linux libraries, which cannot load on macOS).  You don’t have to worry about this, as it will not affect functionality.  However, if you want, you can download the Hadoop source package and compile it on your system instead of using the binary package we used as a quick start.

Browse the web interface for the NameNode:

  • NameNode – http://localhost:50070/

You should see something like this:

Make the HDFS directories required to execute MapReduce jobs:

Copy the input files into the distributed system:

Run the same example again as a test:

Copy output files from HDFS to the local filesystem and examine them:
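The four steps above can be sketched as one script.  It assumes HDFS is already running, Hadoop 2.7.5 under $HOME/hadoop, and the grep example from the Apache single-node guide as "the same example"; HDFS_USER_DIR is a hypothetical variable holding your HDFS home directory, against which relative HDFS paths such as input and output are resolved.

```shell
# Assumed install location; adjust to where you untarred Hadoop.
HADOOP_HOME="${HADOOP_HOME:-$HOME/hadoop/hadoop-2.7.5}"
EXAMPLES_JAR="$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.5.jar"
# Relative HDFS paths resolve against /user/<your username>.
HDFS_USER_DIR="/user/$(id -un)"

if [ -x "$HADOOP_HOME/bin/hdfs" ]; then
  (
    cd "$HADOOP_HOME" || exit 1
    # 1. Create the HDFS home directory for the current user.
    bin/hdfs dfs -mkdir -p "$HDFS_USER_DIR"
    # 2. Copy the local config directory into HDFS as "input".
    bin/hdfs dfs -put etc/hadoop input
    # 3. Run the grep example against the HDFS copy.
    bin/hadoop jar "$EXAMPLES_JAR" grep input output 'dfs[a-z.]+'
    # 4. Fetch the results to the local filesystem and print them.
    bin/hdfs dfs -get output output
    cat output/*
  )
else
  echo "hdfs not found under $HADOOP_HOME" >&2
fi
```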

Output should be similar to

Configure YARN in pseudo-distributed mode

This is not necessary for what we are doing.  However, modern Hadoop uses YARN for MapReduce, so we enable it because it’s straightforward and the right way to do things.

Edit etc/hadoop/mapred-site.xml:

Edit etc/hadoop/yarn-site.xml:
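The original file contents are not preserved here.  A minimal sketch, assuming the standard single-node settings: mapreduce.framework.name set to yarn, and the mapreduce_shuffle auxiliary service enabled on the NodeManager.  As before, the files go into a scratch directory (SKETCH_DIR is a hypothetical variable name).  In the 2.7.x tarball mapred-site.xml may exist only as mapred-site.xml.template, in which case you create the file yourself.

```shell
# Scratch directory for the generated files; SKETCH_DIR is a hypothetical name.
SKETCH_DIR="${SKETCH_DIR:-./hadoop-conf-sketch}"
mkdir -p "$SKETCH_DIR"

# mapred-site.xml: run MapReduce jobs on YARN instead of the local runner.
cat > "$SKETCH_DIR/mapred-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
EOF

# yarn-site.xml: enable the shuffle service MapReduce needs in the NodeManager.
cat > "$SKETCH_DIR/yarn-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
EOF
```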

Start the ResourceManager and NodeManager:
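In the Hadoop 2.x tarball both daemons are started by sbin/start-yarn.sh.  A sketch, assuming the same install location as before:

```shell
# Assumed install location; adjust to where you untarred Hadoop.
HADOOP_HOME="${HADOOP_HOME:-$HOME/hadoop/hadoop-2.7.5}"

if [ -x "$HADOOP_HOME/sbin/start-yarn.sh" ]; then
  # Starts the ResourceManager and a NodeManager on this machine.
  "$HADOOP_HOME/sbin/start-yarn.sh"
else
  echo "start-yarn.sh not found under $HADOOP_HOME" >&2
fi
```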

Browse the web interface for the ResourceManager:

  • ResourceManager – http://localhost:8088/

You should see something like this:

Run the same example again as a test.  This time, you should see references to YARN in the output:

The basic installation of Hadoop is now complete.  If you wish to stop Hadoop at any time, simply run:
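A sketch of the shutdown, assuming the stock stop scripts from the Hadoop 2.x tarball and the same install location as before.  YARN is stopped first, then HDFS.

```shell
# Assumed install location; adjust to where you untarred Hadoop.
HADOOP_HOME="${HADOOP_HOME:-$HOME/hadoop/hadoop-2.7.5}"

if [ -x "$HADOOP_HOME/sbin/stop-yarn.sh" ]; then
  # Stop the ResourceManager and NodeManager.
  "$HADOOP_HOME/sbin/stop-yarn.sh"
  # Stop the NameNode, DataNode and SecondaryNameNode.
  "$HADOOP_HOME/sbin/stop-dfs.sh"
else
  echo "stop scripts not found under $HADOOP_HOME" >&2
fi
```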

Check to make sure all Hadoop components are running

The parts which relate to HDFS are:

  • NameNode
  • SecondaryNameNode
  • DataNode

The parts which relate to YARN are:

  • ResourceManager
  • NodeManager
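One way to check for these is the JDK's jps tool, which lists running Java processes as "pid ClassName" lines (that jps is what was originally used here is an assumption).  The helper below, with the hypothetical name missing_daemons, reads jps-style output on stdin and prints every expected daemon that is not in it.

```shell
# Print each expected daemon name that does not appear in jps-style output
# read from stdin (lines of the form "<pid> <ClassName>").
missing_daemons() {
  running="$(awk '{print $2}')"
  for name in "$@"; do
    # -x matches whole lines, so NameNode does not match SecondaryNameNode.
    printf '%s\n' "$running" | grep -qxF "$name" || echo "$name"
  done
}
```

For example, jps | missing_daemons NameNode SecondaryNameNode DataNode ResourceManager NodeManager prints nothing when all five are up.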

Start HBase
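With HDFS running, HBase is started with the start-hbase.sh script from its bin directory.  A sketch, where the install path is an assumption and hbase-x.y.z is a placeholder for your HBase version:

```shell
# Assumed install location; replace x.y.z with your HBase version.
HBASE_HOME="${HBASE_HOME:-$HOME/hbase/hbase-x.y.z}"

if [ -x "$HBASE_HOME/bin/start-hbase.sh" ]; then
  # Starts HBase using the settings in conf/hbase-site.xml.
  "$HBASE_HOME/bin/start-hbase.sh"
else
  echo "start-hbase.sh not found under $HBASE_HOME" >&2
fi
```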

Check that you see the HBase processes running:

You will now see additional processes:

ZooKeeper:

  • HQuorumPeer

HBase:

  • HMaster
  • HRegionServer

Browse the web interface for HBase:

  • HMaster – http://localhost:16010/

You should see something like this:

This completes the installation of Hadoop and HBase for use with our future exercises.

See you in the next part of the series, Basic HBase Java Classes and Methods – Part 2: HBase Shell.
