This series of articles will familiarize you with basic HBase Java classes and methods. The goal of these articles is not HBase best practices. In fact, we will be making many compromises as we deploy on what is likely your desktop environment. The purpose is not the setup and configuration of Hadoop and HBase, but rather how to code against them as we explore basic HBase classes and methods in Java.
Ideally, in a production environment, you should be using a Hadoop distribution such as Cloudera or Hortonworks. Only a fool would pass up that pre-packaged goodness. To learn HBase Java methods, however, our environment will behave the same way a Cloudera or Hortonworks environment would present itself to you as a developer.
Download Hadoop
Get the latest stable version of Hadoop that is compatible with the latest stable version of HBase. At the time of this article, that is 2.7.5. You should read the HBase documentation, which will tell you which versions of Hadoop are compatible with which versions of HBase. Trust me, it matters. This is one more reason going with a distribution is smart: they figure all of this out for you. The Hadoop ecosystem is vast, and puzzling over version compatibility matrices is not where you want to be spending your time.
Make a directory, put the binary tarball into it and untar it.
```shell
Brian-Feenys-Mac-Pro:hbase bfeeny$ mkdir hadoop
Brian-Feenys-Mac-Pro:hbase bfeeny$ cd hadoop
Brian-Feenys-Mac-Pro:hadoop bfeeny$ mv ~/Downloads/hadoop-2.7.5.tar.gz .
Brian-Feenys-Mac-Pro:hadoop bfeeny$ tar -xzvf hadoop-2.7.5.tar.gz
```
Download HBase
Get the latest stable bin version. Make a directory, put the binary tarball into it and untar it.
```shell
Brian-Feenys-Mac-Pro:Documents bfeeny$ mkdir hbase
Brian-Feenys-Mac-Pro:Documents bfeeny$ cd hbase
Brian-Feenys-Mac-Pro:hbase bfeeny$ mv ~/Downloads/hbase-1.2.6-bin.tar.gz .
Brian-Feenys-Mac-Pro:hbase bfeeny$ tar -xzvf hbase-1.2.6-bin.tar.gz
```
Configure HBase hbase-site.xml
We need to set HBase to pseudo-distributed mode and tell it the location of HDFS. You could of course run HBase in fully distributed mode, and you could use alternate filesystems as well. But for the purpose of this exercise we are going to assume pseudo-distributed mode with HDFS.
Ensure your port number is the correct port number for HDFS; the default is 9000.
```xml
<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
</property>
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://localhost:9000/hbase</value>
</property>
<property>
  <name>hbase.zookeeper.property.dataDir</name>
  <value>file:///Users/bfeeny/zookeeper</value>
</property>
```
Configure HBase hbase-env.sh
We need to find the location of our Java home. To do this you can use the command /usr/libexec/java_home
```shell
Brian-Feenys-Mac-Pro:Documents bfeeny$ /usr/libexec/java_home
/Library/Java/JavaVirtualMachines/jdk1.8.0_102.jdk/Contents/Home
```
The Java version needs to be 1.7 or later, because that is what HBase requires.
Now that you have the location of the Java home, you can edit hbase-env.sh and add it to the file where you see it defined. Here I just leave the default commented out and add mine below it.
```shell
# Set environment variables here.

# This script sets variables multiple times over the course of starting an hbase
# process, so try to keep things idempotent unless you want to take an even deeper
# look into the startup scripts (bin/hbase, etc.)

# The java implementation to use. Java 1.7+ required.
# export JAVA_HOME=/usr/java/jdk1.6.0/
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_102.jdk/Contents/Home
```
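Since the whole point of this series is Java, it doesn't hurt to confirm which JVM you are actually running before pointing HBase at it. Here is a minimal, stdlib-only sketch (the class name `JavaCheck` is my own, not from HBase) that prints the running JVM's version and install location; the version should report 1.7 or later, and the home should match what `/usr/libexec/java_home` gave you:

```java
// Print the running JVM's version and install location.
// Useful for confirming JAVA_HOME points at a 1.7+ JDK before starting HBase.
public class JavaCheck {
    public static void main(String[] args) {
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("java.home    = " + System.getProperty("java.home"));
    }
}
```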
Add environment variables
We need a couple of environment variables. We need to define the locations of Hadoop and HBase, and also add the Hadoop and HBase bin directories to our PATH. Your locations may vary. Typically you would want to add these to your .bashrc or .bash_profile.
```shell
# Hadoop
export HADOOP_HOME=$HOME/Documents/hadoop/hadoop-2.7.5
export PATH="$HADOOP_HOME/bin:$PATH"

# HBase
export HBASE_HOME=$HOME/Documents/hbase/hbase-1.2.6
export PATH="$HBASE_HOME/bin:$PATH"
```
Basic Hadoop Test
Let's make sure Hadoop has its Java set correctly and all is well.
```shell
$ mkdir input
$ cp etc/hadoop/*.xml input
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.5.jar grep input output 'dfs[a-z.]+'
$ cat output/*
```
The result should be:
```shell
Brian-Feenys-Mac-Pro:hadoop-2.7.5 bfeeny$ cat output/*
6       dfs.audit.logger
4       dfs.class
3       dfs.server.namenode.
2       dfs.period
2       dfs.audit.log.maxfilesize
2       dfs.audit.log.maxbackupindex
1       dfsmetrics.log
1       dfsadmin
1       dfs.servers
1       dfs.replication
1       dfs.file
```
This is running Hadoop in standalone mode. We will put Hadoop into pseudo-distributed mode, which means that although it’s a single machine configuration, it will use a separate Java process for each function of Hadoop.
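Conceptually, the `grep` example job above is just a regex match-and-count over the input files. Here is a rough, stdlib-only Java sketch of the same idea, not the actual Hadoop example's source; the class name `GrepSketch` and the sample input are my own illustration:

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// A rough sketch of what the Hadoop "grep" example does: find every match
// of the regex in the input and count how often each distinct match occurs.
public class GrepSketch {
    public static void main(String[] args) {
        // Sample input standing in for the copied *.xml config files.
        String input = "<name>dfs.replication</name>\n"
                     + "<name>dfs.replication</name>\n"
                     + "<name>fs.defaultFS</name>";
        Pattern pattern = Pattern.compile("dfs[a-z.]+");
        Map<String, Integer> counts = new TreeMap<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            counts.merge(m.group(), 1, Integer::sum);
        }
        counts.forEach((match, count) -> System.out.println(count + "\t" + match));
        // prints: 2	dfs.replication
    }
}
```

Hadoop distributes this matching across map tasks and the counting across reduce tasks, which is overkill for a handful of XML files but is exactly what you want at scale.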
If it’s working, remove the output directory:
```shell
Brian-Feenys-Mac-Pro:hadoop-2.7.5 bfeeny$ rm -rf output
```
Configure Hadoop for Pseudo-Distributed Mode
We want to setup Hadoop so each daemon is running its own Java process.
We will tell it to run HDFS on port 9000. Add the following to etc/hadoop/core-site.xml:
```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```
We will set the replication factor to 1, which effectively turns off replication. Add the following to etc/hadoop/hdfs-site.xml:
```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```
Make sure SSH is enabled on your host, and that passphraseless operation is working for localhost. To set up passphraseless SSH for localhost, do the following:
```shell
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
```
Format the HDFS filesystem:
```shell
$ bin/hdfs namenode -format
```
Start the HDFS daemons:
```shell
$ sbin/start-dfs.sh
```
The Hadoop daemon log output is written to the $HADOOP_LOG_DIR directory (defaults to $HADOOP_HOME/logs).
You may see a warning: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable. This is because the default Hadoop native library libhadoop.so.1.0.0 is compiled on a 32-bit system, and you are probably using a 64-bit system. You don’t have to worry about this, as it will not affect functionality. However, if you want, you can download the Hadoop source package and recompile it on your system instead of using the binary package we used as a quick start.
Browse the web interface for the NameNode:
- NameNode – http://localhost:50070/
You should see the NameNode status page.
Make the HDFS directories required to execute MapReduce jobs:
```shell
$ bin/hdfs dfs -mkdir /user
$ bin/hdfs dfs -mkdir /user/<username>
```
Copy the input files into the distributed system:
```shell
$ bin/hdfs dfs -put etc/hadoop input
```
Run the same example again as a test:
```shell
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.5.jar grep input output 'dfs[a-z.]+'
```
Copy output files from HDFS to the local filesystem and examine them:
```shell
$ bin/hdfs dfs -get output output
$ cat output/*
```
Output should be similar to:
```shell
6       dfs.audit.logger
4       dfs.class
3       dfs.server.namenode.
2       dfs.period
2       dfs.audit.log.maxfilesize
2       dfs.audit.log.maxbackupindex
1       dfsmetrics.log
1       dfsadmin
1       dfs.servers
1       dfs.replication
1       dfs.file
```
Configure YARN in pseudo-distributed mode
This is not strictly necessary for what we are doing; however, modern Hadoop uses YARN for MapReduce, so we enable it because it’s straightforward and the right way to do things.
Edit etc/hadoop/mapred-site.xml:
```xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```
Edit etc/hadoop/yarn-site.xml:
```xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```
Start the ResourceManager and NodeManager:
```shell
$ sbin/start-yarn.sh
```
Browse the web interface for the ResourceManager:
- ResourceManager – http://localhost:8088/
You should see the ResourceManager status page.
Run the same example again as a test, but this time observe the output; you should see references to YARN being used:
```shell
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.5.jar grep input output 'dfs[a-z.]+'
```
Now the basic installation of Hadoop is complete. If you wish to stop Hadoop at any time, simply run:
```shell
$ sbin/stop-dfs.sh
$ sbin/stop-yarn.sh
```
Check to make sure all Hadoop components are running:
```shell
Brian-Feenys-Mac-Pro:Documents bfeeny$ jps
33861 ResourceManager
33943 NodeManager
33096 DataNode
33018 NameNode
33195 SecondaryNameNode
35549 Jps
```
The parts which relate to HDFS are:
- NameNode
- SecondaryNameNode
- DataNode
The parts which relate to YARN are:
- ResourceManager
- NodeManager
Start HBase
```shell
Brian-Feenys-Mac-Pro:hbase-1.2.6 bfeeny$ bin/start-hbase.sh
```
Check that you see the HBase processes running:
```shell
Brian-Feenys-Mac-Pro:hbase-1.2.6 bfeeny$ jps
37667 HQuorumPeer
33861 ResourceManager
33943 NodeManager
33096 DataNode
37897 Jps
33018 NameNode
37738 HMaster
37834 HRegionServer
33195 SecondaryNameNode
```
You will now see additional processes:
Zookeeper:
- HQuorumPeer
HBase:
- HMaster
- HRegionServer
Browse the web interface for HBase:
- HMaster – http://localhost:16010/
You should see the HBase Master status page.
This completes the installation of Hadoop and HBase for use with our future exercises.
See you in the next part of the series Basic HBase Java Classes and Methods – Part 2: HBase Shell