Basic HBase Java Classes and Methods - Part 1: Getting Started

Download Hadoop
Get the latest stable binary version. Make a directory, put the binary tarball into it and untar it.
Brian-Feenys-Mac-Pro:hbase bfeeny$ mkdir hadoop
Brian-Feenys-Mac-Pro:hbase bfeeny$ cd hadoop
Brian-Feenys-Mac-Pro:hadoop bfeeny$ mv ~/Downloads/hadoop-2.7.5.tar.gz .
Brian-Feenys-Mac-Pro:hadoop bfeeny$ tar -xzvf hadoop-2.7.5.tar.gz
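If you want a quick sanity check that the unpacked distribution runs (this assumes a JDK is already installed), print the Hadoop version:
$ cd hadoop-2.7.5
$ bin/hadoop version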
Download HBase
Get the latest stable binary version. Make a directory, put the binary tarball into it and untar it.
Brian-Feenys-Mac-Pro:Documents bfeeny$ mkdir hbase
Brian-Feenys-Mac-Pro:Documents bfeeny$ cd hbase
Brian-Feenys-Mac-Pro:hbase bfeeny$ mv ~/Downloads/hbase-1.2.6-bin.tar.gz .
Brian-Feenys-Mac-Pro:hbase bfeeny$ tar -xzvf hbase-1.2.6-bin.tar.gz
Configure HBase hbase-site.xml
We need to set HBase to pseudo-distributed mode and tell it the location of HDFS. You could of course run HBase in fully distributed mode, or use an alternate filesystem, but for the purpose of this exercise we are going to assume pseudo-distributed mode with HDFS.
Ensure the port number matches the one HDFS is listening on; the default is 9000.
<configuration>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.rootdir</name>
<value>hdfs://localhost:9000/hbase</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>file:///Users/bfeeny/zookeeper</value>
</property>
</configuration>

Configure HBase hbase-env.sh
We need to find the location of our Java home. To do this you can use the command /usr/libexec/java_home:
Brian-Feenys-Mac-Pro:Documents bfeeny$ /usr/libexec/java_home
/Library/Java/JavaVirtualMachines/jdk1.8.0_102.jdk/Contents/Home
The Java version needs to be 1.7 or later, which is what HBase requires.
Now that you have the location of your Java home, edit hbase-env.sh and set JAVA_HOME where you see it defined. Here I leave the default commented out and add my line below it.
# Set environment variables here.
# This script sets variables multiple times over the course of starting an hbase
# process,so try to keep things idempotent unless you want to take an even deeper
# look into the startup scripts (bin/hbase, etc.)
# The java implementation to use. Java 1.7+ required.
# export JAVA_HOME=/usr/java/jdk1.6.0/
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_102.jdk/Contents/Home

Add environment variables
We need to set a couple of environment variables: the locations of Hadoop and HBase, plus their bin directories added to our PATH. Your locations may vary. Typically you would add these to your .bashrc or .bash_profile.
# Hadoop
export HADOOP_HOME=$HOME/Documents/hadoop/hadoop-2.7.5
export PATH="$HADOOP_HOME/bin:$PATH"
# HBase
export HBASE_HOME=$HOME/Documents/hbase/hbase-1.2.6
export PATH="$HBASE_HOME/bin:$PATH"
Basic Hadoop Test
Let's make sure Hadoop has its Java set correctly and all is well:
$ mkdir input
$ cp etc/hadoop/*.xml input
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.5.jar grep input output 'dfs[a-z.]+'
$ cat output/*
The result should be:
Brian-Feenys-Mac-Pro:hadoop-2.7.5 bfeeny$ cat output/*
6 dfs.audit.logger
4 dfs.class
3 dfs.server.namenode.
2 dfs.period
2 dfs.audit.log.maxfilesize
2 dfs.audit.log.maxbackupindex
1 dfsmetrics.log
1 dfsadmin
1 dfs.servers
1 dfs.replication
1 dfs.file
This is running Hadoop in standalone mode. We will put Hadoop into pseudo-distributed mode, which means that although it's a single machine configuration, it will use a separate Java process for each function of Hadoop.
If it's working, remove the output directory:
Brian-Feenys-Mac-Pro:hadoop-2.7.5 bfeeny$ rm -rf output

Configure Hadoop for Pseudo-Distributed Mode
We want to set up Hadoop so each daemon runs in its own Java process.
We will tell it to run HDFS on port 9000. Add the following to etc/hadoop/core-site.xml:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
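You can confirm the setting is picked up from the config files (hdfs getconf reads the configuration directly, so no daemons need to be running):
$ bin/hdfs getconf -confKey fs.defaultFS
hdfs://localhost:9000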
We will tell it to keep only one replica of each block, which effectively turns off replication. Add the following to etc/hadoop/hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
Make sure SSH is enabled on your host, and that passphraseless login to localhost works. To set up passphraseless SSH for localhost, do the following:
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
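To confirm it works, ssh to localhost; you should be logged in without being prompted for a passphrase:
$ ssh localhost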
Format the HDFS filesystem:
$ bin/hdfs namenode -format
Start the NameNode and DataNode daemons:
$ sbin/start-dfs.sh
The Hadoop daemon log output is written to the $HADOOP_LOG_DIR directory (which defaults to $HADOOP_HOME/logs).
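For example, to peek at the NameNode log (file names include your username and hostname, so yours will differ):
$ tail -n 20 logs/hadoop-*-namenode-*.log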
You may see a warning:

Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

This is because the stock Hadoop native library libhadoop.so.1.0.0 is compiled for a 32-bit system and you are probably on a 64-bit system. You don't have to worry about this, as it will not affect functionality. However, if you want, you can download the Hadoop source package and recompile it on your system instead of using the binary package we used as a quick start.

Browse the web interface for the NameNode:
NameNode - http://localhost:50070/

Make the HDFS directories required to run MapReduce jobs:
$ bin/hdfs dfs -mkdir /user
$ bin/hdfs dfs -mkdir /user/<username>
Copy the input files into the distributed system:
$ bin/hdfs dfs -put etc/hadoop input
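You can confirm the files landed in HDFS:
$ bin/hdfs dfs -ls input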
Run the same example again as a test:
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.5.jar grep input output 'dfs[a-z.]+'
Copy output files from HDFS to the local filesystem and examine them:
$ bin/hdfs dfs -get output output
$ cat output/*
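Alternatively, you can view the output files directly on HDFS without copying them back:
$ bin/hdfs dfs -cat output/*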
Either way, the output should be similar to:
6 dfs.audit.logger
4 dfs.class
3 dfs.server.namenode.
2 dfs.period
2 dfs.audit.log.maxfilesize
2 dfs.audit.log.maxbackupindex
1 dfsmetrics.log
1 dfsadmin
1 dfs.servers
1 dfs.replication
1 dfs.file

Configure YARN in pseudo-distributed mode
This is not strictly necessary for what we are doing; however, modern Hadoop uses YARN for MapReduce, so we enable it because it's straightforward and the right way to do things.
Edit etc/hadoop/mapred-site.xml:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Edit etc/hadoop/yarn-site.xml:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
Start the ResourceManager and NodeManager daemons:
$ sbin/start-yarn.sh
Browse the web interface for the ResourceManager:
ResourceManager - http://localhost:8088/

Run the same example again, this time on YARN:
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.5.jar grep input output 'dfs+'
Now the basic installation of Hadoop is complete. If you wish to stop Hadoop at any time, simply run:
$ sbin/stop-dfs.sh
$ sbin/stop-yarn.sh
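Note that once HBase is also running (we start it below), you should stop it first, from the HBase directory, so it shuts down cleanly before HDFS goes away:
$ bin/stop-hbase.sh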
Check to make sure all Hadoop components are running
Brian-Feenys-Mac-Pro:Documents bfeeny$ jps
33861 ResourceManager
33943 NodeManager
33096 DataNode
33018 NameNode
33195 SecondaryNameNode
35549 Jps
The parts which relate to HDFS are:
NameNode
SecondaryNameNode
DataNode

The parts which relate to YARN are:
ResourceManager
NodeManager

Start HBase
Brian-Feenys-Mac-Pro:hbase-1.2.6 bfeeny$ bin/start-hbase.sh
Check that you see the HBase processes running:
Brian-Feenys-Mac-Pro:hbase-1.2.6 bfeeny$ jps
37667 HQuorumPeer
33861 ResourceManager
33943 NodeManager
33096 DataNode
37897 Jps
33018 NameNode
37738 HMaster
37834 HRegionServer
33195 SecondaryNameNode
You will now see additional processes:
Zookeeper:
HQuorumPeer

HBase:
HMaster
HRegionServer

Browse the web interface for HBase:
HMaster - http://localhost:16010/
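As a final smoke test, you can open the HBase shell and ask for the cluster status (the exact numbers in your output will vary):
$ bin/hbase shell
hbase(main):001:0> status
1 active master, 0 backup masters, 1 servers, 0 dead, 2.0000 average load
hbase(main):002:0> exit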