Basic HBase Java Classes and Methods – Part 4: Putting Data into a Table

To put data into a table in HBase, we will create a class very similar in structure to our last class from Part 3.  Instead of using an Admin object, which is used to create or delete a table, we will work with a regular Table object.  All data in HBase is stored as byte arrays.  Let's create our imports and basic variables to store our column family and column names.
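
A sketch of the imports and constants (the class name PutDataTable is illustrative, and the column names follow the employee schema from Part 2):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PutDataTable {

        // Column family names as byte arrays
        private static final byte[] PERSONAL_CF = Bytes.toBytes("personal");
        private static final byte[] PROFESSIONAL_CF = Bytes.toBytes("professional");

        // Column qualifiers as byte arrays (not all are used below;
        // we deliberately leave some columns empty)
        private static final byte[] FIRST_NAME_COL = Bytes.toBytes("first_name");
        private static final byte[] LAST_NAME_COL = Bytes.toBytes("last_name");
        private static final byte[] AGE_COL = Bytes.toBytes("age");
        private static final byte[] GENDER_COL = Bytes.toBytes("gender");
        private static final byte[] MARITAL_STATUS_COL = Bytes.toBytes("marital_status");
        private static final byte[] OCCUPATION_COL = Bytes.toBytes("occupation");
        private static final byte[] EDUCATION_COL = Bytes.toBytes("education");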

Now we create our main method, create a connection to our Table, instantiate a Put object, and add columns to it using the addColumn method.   Finally, we use the put method on the Table object to put the data into the table.  We define the table variable outside of the try block because we need to check it in the finally block later on; we can't do that if it's defined inside the try block, as it would then be out of scope.
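
A minimal sketch of the main method (the row key and cell values are illustrative):

    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        Connection connection = ConnectionFactory.createConnection(conf);
        Table table = null;  // declared outside the try block so the finally block can see it
        try {
            table = connection.getTable(TableName.valueOf("employee"));

            // A Put is keyed by row key; each addColumn call adds one cell
            Put put = new Put(Bytes.toBytes("1"));
            put.addColumn(PERSONAL_CF, FIRST_NAME_COL, Bytes.toBytes("John"));
            put.addColumn(PERSONAL_CF, LAST_NAME_COL, Bytes.toBytes("Smith"));
            put.addColumn(PERSONAL_CF, AGE_COL, Bytes.toBytes("35"));
            put.addColumn(PROFESSIONAL_CF, OCCUPATION_COL, Bytes.toBytes("engineer"));

            table.put(put);  // write the row to the table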

This is a very simple case.  We did not have to insert all of the columns; we could have left many out, just as we did before when using the HBase shell.  The HBase Table put method is overloaded and supports passing in either a single Put object or a list of Put objects.  We will now put more data in using a list of Put objects.
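
A sketch using a list of Puts (values again illustrative; note we skip row key 2, as we did in the shell):

            List<Put> puts = new ArrayList<Put>();

            Put put2 = new Put(Bytes.toBytes("3"));
            put2.addColumn(PERSONAL_CF, FIRST_NAME_COL, Bytes.toBytes("Jane"));
            put2.addColumn(PROFESSIONAL_CF, OCCUPATION_COL, Bytes.toBytes("doctor"));
            puts.add(put2);

            Put put3 = new Put(Bytes.toBytes("4"));
            put3.addColumn(PERSONAL_CF, FIRST_NAME_COL, Bytes.toBytes("Bob"));
            put3.addColumn(PROFESSIONAL_CF, OCCUPATION_COL, Bytes.toBytes("teacher"));
            puts.add(put3);

            // The overloaded put method also accepts a List<Put>
            table.put(puts);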

Last, we will use our finally block to check whether we have an open table, close it if so, and then close our connection to HBase.
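
Roughly:

        } finally {
            if (table != null) {
                table.close();  // close the table if we opened it
            }
            connection.close();  // always close the connection
        }
    }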

So the completed program looks like so:
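
A complete sketch (class name and sample values are, again, illustrative):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PutDataTable {

        private static final byte[] PERSONAL_CF = Bytes.toBytes("personal");
        private static final byte[] PROFESSIONAL_CF = Bytes.toBytes("professional");
        private static final byte[] FIRST_NAME_COL = Bytes.toBytes("first_name");
        private static final byte[] LAST_NAME_COL = Bytes.toBytes("last_name");
        private static final byte[] AGE_COL = Bytes.toBytes("age");
        private static final byte[] OCCUPATION_COL = Bytes.toBytes("occupation");

        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            Connection connection = ConnectionFactory.createConnection(conf);
            Table table = null;  // declared outside the try block so the finally block can see it
            try {
                table = connection.getTable(TableName.valueOf("employee"));

                // Single Put
                Put put = new Put(Bytes.toBytes("1"));
                put.addColumn(PERSONAL_CF, FIRST_NAME_COL, Bytes.toBytes("John"));
                put.addColumn(PERSONAL_CF, LAST_NAME_COL, Bytes.toBytes("Smith"));
                put.addColumn(PERSONAL_CF, AGE_COL, Bytes.toBytes("35"));
                put.addColumn(PROFESSIONAL_CF, OCCUPATION_COL, Bytes.toBytes("engineer"));
                table.put(put);

                // List of Puts
                List<Put> puts = new ArrayList<Put>();
                Put put2 = new Put(Bytes.toBytes("3"));
                put2.addColumn(PERSONAL_CF, FIRST_NAME_COL, Bytes.toBytes("Jane"));
                put2.addColumn(PROFESSIONAL_CF, OCCUPATION_COL, Bytes.toBytes("doctor"));
                puts.add(put2);
                Put put3 = new Put(Bytes.toBytes("4"));
                put3.addColumn(PERSONAL_CF, FIRST_NAME_COL, Bytes.toBytes("Bob"));
                put3.addColumn(PROFESSIONAL_CF, OCCUPATION_COL, Bytes.toBytes("teacher"));
                puts.add(put3);
                table.put(puts);
            } finally {
                if (table != null) {
                    table.close();
                }
                connection.close();
            }
        }
    }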

We can see that our data was added properly by checking with scan from the HBase shell.
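
The scan output should look something like this (timestamps elided):

    hbase(main):001:0> scan 'employee'
    ROW   COLUMN+CELL
     1    column=personal:age, timestamp=..., value=35
     1    column=personal:first_name, timestamp=..., value=John
     1    column=personal:last_name, timestamp=..., value=Smith
     1    column=professional:occupation, timestamp=..., value=engineer
     3    column=personal:first_name, timestamp=..., value=Jane
     3    column=professional:occupation, timestamp=..., value=doctor
     4    column=personal:first_name, timestamp=..., value=Bob
     4    column=professional:occupation, timestamp=..., value=teacher
    3 row(s)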

Next we will explore how we can retrieve column data from the HBase table in Basic HBase Java Classes and Methods – Part 5: Getting Data from a Table.


Basic HBase Java Classes and Methods – Part 3: Table Creation

We will cover these basic steps:

  • Instantiating a configuration object
  • Establishing a connection to HBase
  • Manipulating tables using an administration object
  • Manipulating data within a table using a table instance

Creating our table

I am using Maven, and below is my pom.xml that I will be using for all of these examples.
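
A minimal sketch of the pom.xml; the groupId/artifactId are illustrative, and the hbase-client version (1.2.6 here) is an assumption that should match your HBase install:

    <project xmlns="http://maven.apache.org/POM/4.0.0">
      <modelVersion>4.0.0</modelVersion>
      <groupId>com.example</groupId>
      <artifactId>hbase-examples</artifactId>
      <version>1.0-SNAPSHOT</version>

      <dependencies>
        <dependency>
          <groupId>org.apache.hbase</groupId>
          <artifactId>hbase-client</artifactId>
          <version>1.2.6</version>
        </dependency>
      </dependencies>

      <build>
        <plugins>
          <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <configuration>
              <source>1.7</source>
              <target>1.7</target>
            </configuration>
          </plugin>
        </plugins>
      </build>
    </project>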

You may also optionally want to create a log4j.properties file.  The HBase client libraries use log4j for logging, and placing this file in our src/main/resources folder puts it on the classpath.  For more information on log4j visit the project site.  Here is the basic properties file we will use.
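
A basic configuration that logs INFO and above to the console:

    log4j.rootLogger=INFO, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.target=System.err
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n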

We start with the AdminCreateTable class: its basic imports and an empty main method.
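
Roughly:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class AdminCreateTable {
        public static void main(String[] args) throws IOException {
            // the steps below go here
        }
    }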

Instantiating a Configuration object 

We will instantiate a Configuration object using the HBaseConfiguration.create() static method.  The Configuration class is the base class for all configuration objects in the Hadoop ecosystem.
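
For example:

    // Reads hbase-site.xml and hbase-default.xml from the classpath
    Configuration conf = HBaseConfiguration.create();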

Establish a connection to the HBase cluster

We use the ConnectionFactory to create a connection object by passing in our Configuration object.
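
For example:

    Connection connection = ConnectionFactory.createConnection(conf);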

All code which uses our connection should be inside a try/finally block so that we can manage the connection manually and close it when we are done using it.

Instantiate an Admin object

Because operations such as creating or removing tables are administrative functions, they must be done using an Admin object.  We will create an Admin object from our Connection object.
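
A sketch of the structure:

    Connection connection = ConnectionFactory.createConnection(conf);
    try {
        Admin admin = connection.getAdmin();
        // table operations go here
    } finally {
        connection.close();
    }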

You can see we put the instantiation of the Admin object inside of our try/finally block.  The rest of our commands will also be inside this block.  We added a command to close our connection in the finally block.

Create the table schema using an HTableDescriptor

We use an HTableDescriptor to define our table and its properties, such as column families, performance settings, etc.  We will create a table named employee with two column families: personal and professional.
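
For example:

    HTableDescriptor tableDescriptor = new HTableDescriptor(TableName.valueOf("employee"));
    tableDescriptor.addFamily(new HColumnDescriptor("personal"));
    tableDescriptor.addFamily(new HColumnDescriptor("professional"));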

Create the table

We check to see if the table exists using our Admin object.  If it does not exist, we create it; if it does exist, we print a message saying so.  We use the createTable method on our Admin object and pass in the HTableDescriptor we created previously.
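
Roughly (the printed messages are illustrative):

    if (!admin.tableExists(tableDescriptor.getTableName())) {
        admin.createTable(tableDescriptor);
        System.out.println("Table employee created");
    } else {
        System.out.println("Table employee already exists");
    }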

Putting it all together we have the following:
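
A complete sketch:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class AdminCreateTable {
        public static void main(String[] args) throws IOException {
            // Configuration picks up hbase-site.xml from the classpath
            Configuration conf = HBaseConfiguration.create();
            Connection connection = ConnectionFactory.createConnection(conf);
            try {
                Admin admin = connection.getAdmin();

                // Define the employee table with two column families
                HTableDescriptor tableDescriptor = new HTableDescriptor(TableName.valueOf("employee"));
                tableDescriptor.addFamily(new HColumnDescriptor("personal"));
                tableDescriptor.addFamily(new HColumnDescriptor("professional"));

                if (!admin.tableExists(tableDescriptor.getTableName())) {
                    admin.createTable(tableDescriptor);
                    System.out.println("Table employee created");
                } else {
                    System.out.println("Table employee already exists");
                }
                admin.close();
            } finally {
                connection.close();
            }
        }
    }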

Build and Run the code. You can verify the table has been created by going to the HBase Shell and verifying it exists.
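
For example:

    $ hbase shell
    hbase(main):001:0> list
    TABLE
    employee
    1 row(s)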

See you in the next part, Basic HBase Java Classes and Methods – Part 4: Putting Data into a Table.


Basic HBase Java Classes and Methods – Part 2: HBase Shell

For the purpose of these exercises we will be working with a basic table which has two column families.  The first column family is “personal” and will contain first_name, last_name, age, gender, and marital_status.  The second column family is “professional” and will contain “occupation” and “education”.

We will walk through all of the steps, from creating the table and column families, through populating, changing, and deleting data, to dropping the table.  We will first show you this in the HBase shell so you can become familiar with the data we are working with.  In Part 3 we will start to do these tasks programmatically using Java.

We are making a very brisk journey through the shell, with little to no explanation of the various parts of HBase; we assume you have learned the basics from reading the documentation.  Our goal is just to do some basic, common HBase table operations using the shell, and then replicate them using Java.

Our employee table will look like so:

    ROW KEY | personal                                            | professional
    --------+-----------------------------------------------------+-----------------------
    ID      | first_name, last_name, age, gender, marital_status | occupation, education

First we fire up the HBase shell
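
    $ hbase shell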

We can request the basic status from HBase
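
Output will vary with your setup; something like:

    hbase(main):001:0> status
    1 active master, 0 backup masters, 1 servers, 0 dead, 2.0000 average load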

We ask it who we are, similar to the whoami UNIX command, and get a list of any tables
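
Something like (the username shown is illustrative):

    hbase(main):002:0> whoami
    hbaseuser (auth:SIMPLE)
        groups: hbaseuser

    hbase(main):003:0> list
    TABLE
    0 row(s)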

We see that there are no tables.  We create the employee table with two column families, personal and professional.
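
For example (timing will vary):

    hbase(main):004:0> create 'employee', 'personal', 'professional'
    0 row(s) in 1.2570 seconds

    => Hbase::Table - employee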

We can use the describe command to give us more detailed information about the table.  Many of these parameters have to do with the underlying Hadoop layer and are not important for our exercises.
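
Truncated output, roughly:

    hbase(main):005:0> describe 'employee'
    Table employee is ENABLED
    employee
    COLUMN FAMILIES DESCRIPTION
    {NAME => 'personal', BLOOMFILTER => 'ROW', VERSIONS => '1', ...}
    {NAME => 'professional', BLOOMFILTER => 'ROW', VERSIONS => '1', ...}
    2 row(s)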

We can see we have no actual records in the table
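
    hbase(main):006:0> scan 'employee'
    ROW   COLUMN+CELL
    0 row(s)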

Let’s insert a single record with an ID (row key) of 1.
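
The value is illustrative:

    hbase(main):007:0> put 'employee', '1', 'personal:first_name', 'John'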

You can see we inserted a single record with a column first_name inside of the column family personal.  There are no constraints in our table, so we can leave entire columns out or create new ones on the fly.

We will add a bunch more data
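
The values below are illustrative; the point to notice is that the next employees use row keys 3 and 4:

    put 'employee', '1', 'personal:last_name', 'Smith'
    put 'employee', '1', 'personal:age', '35'
    put 'employee', '1', 'personal:gender', 'M'
    put 'employee', '1', 'personal:marital_status', 'married'
    put 'employee', '1', 'professional:occupation', 'engineer'
    put 'employee', '1', 'professional:education', 'BS'
    put 'employee', '3', 'personal:first_name', 'Jane'
    put 'employee', '3', 'personal:last_name', 'Doe'
    put 'employee', '3', 'professional:occupation', 'doctor'
    put 'employee', '4', 'personal:first_name', 'Bob'
    put 'employee', '4', 'professional:occupation', 'teacher'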

Notice I inserted the above records using row keys that were not 2; instead I skipped 2.  HBase doesn’t care what you make the row key; it can be a string, a number, even an arbitrary byte array.

Now that we have inserted information regarding three employees, let’s take a look at our table.
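
Timestamps elided:

    hbase(main):019:0> scan 'employee'
    ROW   COLUMN+CELL
     1    column=personal:age, timestamp=..., value=35
     1    column=personal:first_name, timestamp=..., value=John
     ...
     4    column=professional:occupation, timestamp=..., value=teacher
    3 row(s)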

We can easily make changes to any information:
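
For example, overwriting the age cell for row 1 (a put to an existing cell simply writes a newer version):

    hbase(main):020:0> put 'employee', '1', 'personal:age', '36'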

We can delete just a single cell if we wish
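
    hbase(main):021:0> delete 'employee', '1', 'personal:marital_status'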

We can use the exists command to see if a table exists
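
    hbase(main):022:0> exists 'employee'
    Table employee does exist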

We have to disable a table before we can drop it. Disabling a table flushes all the data in memory to disk.
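
    hbase(main):023:0> disable 'employee'
    hbase(main):024:0> drop 'employee'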

We can see there are no more tables
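
    hbase(main):025:0> list
    TABLE
    0 row(s)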

We will be repeating these commands using Java in the next part, Basic HBase Java Classes and Methods – Part 3: Table Creation.


Basic HBase Java Classes and Methods – Part 1: Getting Started

This series of articles is meant to familiarize you with basic HBase Java classes and methods.  The goal of these articles is not to teach HBase best practices.  In fact, we will be making many compromises as we deploy on what is likely your desktop environment.  The purpose is not the setup and configuration of Hadoop and HBase, but rather how to code against them and explore basic HBase classes and methods in Java.

Ideally, in a production environment, you should be using a Hadoop distribution such as Cloudera or Hortonworks. Only a fool would not take advantage of this pre-packaged goodness.  To learn HBase Java methods, however, our environment will present you with the same developer experience as any Cloudera or Hortonworks environment would.

Download Hadoop

Get the latest stable version of Hadoop which is compatible with the latest stable version of HBase.  At the time of this article it is 2.7.5.  You should read the HBase documentation, which will tell you which versions of Hadoop are compatible with which versions of HBase. Trust me, it matters.  This is one more reason that going with a distribution is smart: they figure all of this out for you.  The Hadoop ecosystem is vast, and figuring out version compatibility matrices is not where you want to be spending your time.

Make a directory, put the binary tarball into it and untar it.
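
For example (the directory location is up to you):

    $ mkdir ~/hadoop
    $ mv hadoop-2.7.5.tar.gz ~/hadoop/
    $ cd ~/hadoop
    $ tar xzf hadoop-2.7.5.tar.gz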

Download HBase

Get the latest stable bin version.  Make a directory, put the binary tarball into it and untar it.
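
For example (the 1.2.6 version number is an assumption; use whatever the current stable HBase release is):

    $ mkdir ~/hbase
    $ mv hbase-1.2.6-bin.tar.gz ~/hbase/
    $ cd ~/hbase
    $ tar xzf hbase-1.2.6-bin.tar.gz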

Configure HBase hbase-site.xml

We need to set HBase to pseudo-distributed mode and tell it the location of HDFS.  Of course, you could be running HBase in fully distributed mode, and you could be using alternate filesystems as well.  But for the purpose of this exercise we are going to assume pseudo-distributed mode with HDFS.

Ensure your port number is the correct port number for HDFS; the default is 9000.
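
Something like the following in conf/hbase-site.xml:

    <configuration>
      <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
      </property>
      <property>
        <name>hbase.rootdir</name>
        <value>hdfs://localhost:9000/hbase</value>
      </property>
    </configuration>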

 

Configure HBase hbase-env.sh

We need to find the location of our Java home.  To do this (on macOS) you can use the command /usr/libexec/java_home.
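
The output path below is illustrative; yours will differ:

    $ /usr/libexec/java_home
    /Library/Java/JavaVirtualMachines/jdk1.8.0_152.jdk/Contents/Home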

The Java version needs to be 1.7 or later, because that is what HBase requires.

Now that you have the location of the Java home, you can edit hbase-env.sh and add it where you see JAVA_HOME defined.  Here I just leave the default commented out and add mine below it.
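
For example (the JAVA_HOME path is the java_home output from above):

    # The java implementation to use.
    # export JAVA_HOME=/usr/java/jdk1.6.0/
    export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_152.jdk/Contents/Home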

Add environment variables

We need to set a couple of environment variables.  We need to define the locations of Hadoop and HBase, and also add the Hadoop and HBase bin directories to our PATH.  Your locations may vary.  Typically you would add these to your .bashrc or .bash_profile.
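
For example (the paths are illustrative):

    export HADOOP_HOME=~/hadoop/hadoop-2.7.5
    export HBASE_HOME=~/hbase/hbase-1.2.6
    export PATH=$PATH:$HADOOP_HOME/bin:$HBASE_HOME/bin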

Basic Hadoop Test

Let's make sure Hadoop has its Java set right and all is well.
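
We run the grep example that ships with Hadoop, as in the Apache quick start:

    $ cd $HADOOP_HOME
    $ mkdir input
    $ cp etc/hadoop/*.xml input
    $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.5.jar grep input output 'dfs[a-z.]+'
    $ cat output/*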

The result should be:
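
    1       dfsadmin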

This is running Hadoop in standalone mode.  We will put Hadoop into pseudo-distributed mode, which means that although it’s a single machine configuration, it will use a separate Java process for each function of Hadoop.

If it's working, remove the output directory:
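
    $ rm -r output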

Configure Hadoop for Pseudo-Distributed Mode

We want to set up Hadoop so each daemon is running in its own Java process.

We will tell it to run HDFS on port 9000.  Add the following to etc/hadoop/core-site.xml:
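
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>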

We will tell it to keep only one replica, which effectively turns off replication. Add the following to etc/hadoop/hdfs-site.xml:
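
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>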

Make sure SSH is enabled on your host and that passphraseless operation works for localhost.  To set up passphraseless SSH for localhost, do the following:
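
    $ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
    $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    $ chmod 0600 ~/.ssh/authorized_keys

Then verify that ssh localhost works without prompting for a passphrase.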

Format the HDFS filesystem:
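
    $ bin/hdfs namenode -format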

Start the NameNode and DataNode daemons:
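
    $ sbin/start-dfs.sh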

The hadoop daemon log output is written to the $HADOOP_LOG_DIR directory (defaults to $HADOOP_HOME/logs).

You may see a warning

Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

This is because the default hadoop library libhadoop.so.1.0.0 is compiled for a 32-bit system and you are probably running a 64-bit system.  You don’t have to worry about this, as it will not affect functionality.  However, if you want, you can download the Hadoop source package and recompile it on your system instead of using the binary package we used as a quick start.

Browse the web interface for the NameNode:

  • NameNode – http://localhost:50070/

You should see the NameNode overview page with summary information about the filesystem.

Make the HDFS directories required to execute MapReduce jobs:
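
    $ bin/hdfs dfs -mkdir /user
    $ bin/hdfs dfs -mkdir /user/<username>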

Copy the input files into the distributed system:
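
    $ bin/hdfs dfs -put etc/hadoop input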

Run the same example again as a test:
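
    $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.5.jar grep input output 'dfs[a-z.]+'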

Copy output files from HDFS to the local filesystem and examine them:
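
    $ bin/hdfs dfs -get output output
    $ cat output/*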

Output should be similar to
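
The exact counts depend on your configuration files; something like:

    4       dfs.class
    4       dfs.audit.logger
    3       dfs.server.namenode.
    2       dfs.period
    1       dfsadmin
    1       dfs.replication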

Configure YARN in pseudo-distributed mode

This is not necessary for what we are doing; however, modern Hadoop uses YARN for MapReduce, so we enable it because it’s straightforward and the right way to do things.

Edit etc/hadoop/mapred-site.xml:
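
You may need to create this file from the template first (cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml):

    <configuration>
      <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
      </property>
    </configuration>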

Edit etc/hadoop/yarn-site.xml:
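
    <configuration>
      <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
      </property>
    </configuration>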

Start the ResourceManager and NodeManager:
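
    $ sbin/start-yarn.sh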

Browse the web interface for the ResourceManager:

  • ResourceManager – http://localhost:8088/

You should see the ResourceManager cluster page.

Run the same example again as a test, but this time observe the output; you should see references to YARN being used:
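
Remove the previous output directory first, since MapReduce will refuse to overwrite it:

    $ bin/hdfs dfs -rm -r output
    $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.5.jar grep input output 'dfs[a-z.]+'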

Now the basic installation of Hadoop is complete.  If you wish to stop Hadoop at any time, simply run:
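
    $ sbin/stop-yarn.sh
    $ sbin/stop-dfs.sh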

Check to make sure all Hadoop components are running
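
We can use jps; the process IDs shown are illustrative:

    $ jps
    21945 NameNode
    22038 DataNode
    22163 SecondaryNameNode
    22292 ResourceManager
    22386 NodeManager
    22463 Jps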

The parts which relate to HDFS are:

  • NameNode
  • SecondaryNameNode
  • DataNode

The parts which relate to YARN are:

  • ResourceManager
  • NodeManager

Start HBase
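
Since the HBase bin directory is on our PATH:

    $ start-hbase.sh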

Check that you see the HBase processes running:
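
Again with jps (PIDs illustrative):

    $ jps
    21945 NameNode
    22038 DataNode
    22163 SecondaryNameNode
    22292 ResourceManager
    22386 NodeManager
    23101 HQuorumPeer
    23187 HMaster
    23324 HRegionServer
    23410 Jps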

You will now see additional processes:

Zookeeper:

  • HQuorumPeer

HBase:

  • HMaster
  • HRegionServer

Browse the web interface for HBase:

  • HMaster – http://localhost:16010/

You should see the HBase Master status page.

This completes the installation of Hadoop and HBase for use with our future exercises.

See you in the next part of the series, Basic HBase Java Classes and Methods – Part 2: HBase Shell.


Downgrading Apache Hadoop YARN to MapReduce v1

This post is somewhat dated material.  Several years back, when YARN was first making headway and vendors started adopting it as part of Hadoop 2.x, there were many times where I needed to downgrade to MapReduce v1.  I had written a lot of code for MRv1, and there were times when downgrading was the best approach to getting things back up and running.  For those that may need to, here are my notes for downgrading from YARN to MRv1:

After the downgrade, all should be good on localhost:50030 (the MRv1 JobTracker UI) and localhost:50070 (the NameNode UI).
