Basic HBase Java Classes and Methods – Part 1: Getting Started

This series of articles is intended to familiarize you with basic HBase Java classes and methods.  The goal of these articles is not HBase best practices; in fact, we will be making many compromises as we deploy on what is likely your desktop environment.  The purpose is not the setup and configuration of Hadoop and HBase, but rather how to code against them to explore basic HBase classes and methods in Java.

Ideally, in a production environment, you should be using a Hadoop distribution such as Cloudera or Hortonworks. Only a fool would not take advantage of this pre-packaged goodness.  For learning HBase Java methods, however, our environment will work just the same as any Cloudera or Hortonworks environment would present to you as a developer.

Download Hadoop

Get the latest stable version of Hadoop that is compatible with the latest stable version of HBase.  At the time of this article it is 2.7.5.  You should read the HBase documentation, which will tell you which versions of Hadoop are compatible with which versions of HBase. Trust me, it matters.  This is another reason going with a distribution is smart: they figure all of this out for you.  The Hadoop ecosystem is vast, and working through version compatibility matrices is not where you want to be spending your time.

Make a directory, put the binary tarball into it and untar it.
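For example (assuming you downloaded the 2.7.5 binary tarball; the directory layout and download path here are just suggestions):

mkdir ~/hadoop
cd ~/hadoop
tar xzf ~/Downloads/hadoop-2.7.5.tar.gz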

Download HBase

Get the latest stable bin version.  Make a directory, put the binary tarball into it and untar it.
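For example (the HBase version and paths are placeholders for whichever stable build you grabbed):

mkdir ~/hbase
cd ~/hbase
tar xzf ~/Downloads/hbase-1.2.6-bin.tar.gz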

Configure HBase hbase-site.xml

We need to set HBase to pseudo-distributed mode and tell it the location of HDFS.  Of course you could be running HBase in fully distributed mode, and you could be using alternate filesystems as well, but for the purpose of this exercise we are going to assume pseudo-distributed mode with HDFS.

Ensure your port number is the correct port number for HDFS; the default is 9000.
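A minimal hbase-site.xml along these lines should do it (assuming HDFS is at hdfs://localhost:9000; the /hbase path is simply where HBase will keep its data in HDFS):

<configuration>
    <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
    </property>
    <property>
        <name>hbase.rootdir</name>
        <value>hdfs://localhost:9000/hbase</value>
    </property>
</configuration>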

 

Configure HBase hbase-env.sh

We need to find the location of our Java home.  To do this you can use the command /usr/libexec/java_home

The Java version needs to be 1.7 or later, because that is what HBase requires.

Now that you have the location of the Java home, you can edit hbase-env.sh and add it to the file where you see JAVA_HOME defined.  Here I just leave the default commented out and add below it.
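For example (letting the macOS /usr/libexec/java_home command mentioned above fill in the path; substitute your own path on other systems):

# The shipped JAVA_HOME line stays commented out; add your own below it.
export JAVA_HOME=$(/usr/libexec/java_home)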

Add environment variables

We need to set a couple of environment variables.  We need to define the locations of Hadoop and HBase and also add the Hadoop and HBase bin directories to our PATH.  Your locations may vary.  Typically you would add these to your .bashrc or .bash_profile.
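Something like the following (the install paths and version numbers are examples only; point them at wherever you untarred Hadoop and HBase):

export HADOOP_HOME=$HOME/hadoop/hadoop-2.7.5
export HBASE_HOME=$HOME/hbase/hbase-1.2.6
export PATH=$PATH:$HADOOP_HOME/bin:$HBASE_HOME/bin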

Basic Hadoop Test

Let's make sure Hadoop has its Java set right and all is well.
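A quick check is the grep example from the Apache single-node quick start (adjust the examples jar name if your Hadoop version is not 2.7.5). From your Hadoop directory:

mkdir input
cp etc/hadoop/*.xml input
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.5.jar grep input output 'dfs[a-z.]+'
cat output/*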

The result should be:
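With the default configuration files as input, the grep example typically finds a single match:

1       dfsadmin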

This is running Hadoop in standalone mode.  We will put Hadoop into pseudo-distributed mode, which means that although it’s a single machine configuration, it will use a separate Java process for each function of Hadoop.

If it's working, remove the output directory:
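Assuming the output directory from the grep example above:

rm -r output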

Configure Hadoop for Pseudo-Distributed Mode

We want to set up Hadoop so each daemon runs in its own Java process.

We will tell it to run HDFS on port 9000.  Add the following to etc/hadoop/core-site.xml:
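This mirrors the Apache pseudo-distributed setup guide:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>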

We will tell it to keep only one replica of each block, which effectively turns off replication. Add the following to etc/hadoop/hdfs-site.xml:
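Again per the Apache guide:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>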

Make sure SSH is enabled on your host, and that it is working with passphraseless operation for localhost.  To set up passphraseless SSH for localhost, do the following:
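The usual recipe (this generates a key with no passphrase and authorizes it for localhost):

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
ssh localhost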

Format the HDFS filesystem:
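From your Hadoop directory:

bin/hdfs namenode -format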

Start the HDFS daemons:
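start-dfs.sh launches the NameNode, SecondaryNameNode, and DataNode together:

sbin/start-dfs.sh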

The hadoop daemon log output is written to the $HADOOP_LOG_DIR directory (defaults to $HADOOP_HOME/logs).

You may see a warning:

Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

This is because the default Hadoop library, libhadoop.so.1.0.0, was compiled on a 32-bit system and you are probably using a 64-bit system.  You don't have to worry about this, as it will not affect functionality.  However, if you want, you can download the Hadoop source package and recompile it on your system instead of using the binary package we used as a quick start.

Browse the web interface for the NameNode:

  • NameNode – http://localhost:50070/

You should see the NameNode status page.

Make the HDFS directories required to execute MapReduce jobs:
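Replace <username> with the user you are running as:

bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/<username>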

Copy the input files into the distributed system:
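Here we reuse the Hadoop configuration files as sample input, as in the standalone test:

bin/hdfs dfs -put etc/hadoop input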

Run the same example again as a test:
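Same jar and regex as before, but the input and output directories now live in HDFS:

bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.5.jar grep input output 'dfs[a-z.]+'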

Copy output files from HDFS to the local filesystem and examine them:
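One way to do it:

bin/hdfs dfs -get output output
cat output/*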

The output should be similar to the earlier standalone run.

Configure YARN in Pseudo-Distributed Mode

This is not strictly necessary for what we are doing; however, modern Hadoop uses YARN for MapReduce, so we enable it because it's straightforward and the right way to do things.

Edit etc/hadoop/mapred-site.xml:
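If etc/hadoop/mapred-site.xml does not exist yet, copy it from mapred-site.xml.template, then set:

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>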

Edit etc/hadoop/yarn-site.xml:
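Per the Apache guide:

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>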

Start the ResourceManager and NodeManager daemons:
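From your Hadoop directory:

sbin/start-yarn.sh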

Browse the web interface for the ResourceManager:

  • ResourceManager – http://localhost:8088/

You should see the ResourceManager web UI.

Run the same example again as a test, but this time observe the output; you should see references to YARN being used:

The basic installation of Hadoop is now complete.  If you wish to stop Hadoop at any time, simply run:
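From your Hadoop directory:

sbin/stop-yarn.sh
sbin/stop-dfs.sh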

Check to make sure all Hadoop components are running
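The JDK's jps tool is the easiest way to check:

jps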

The parts which relate to HDFS are:

  • NameNode
  • SecondaryNameNode
  • DataNode

The parts which relate to YARN are:

  • ResourceManager
  • NodeManager

Start HBase
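From your HBase directory:

bin/start-hbase.sh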

Check that you see the HBase processes running:
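Run jps again:

jps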

You will now see additional processes:

Zookeeper:

  • HQuorumPeer

HBase:

  • HMaster
  • HRegionServer

Browse the web interface for HBase:

  • HMaster – http://localhost:16010/

You should see the HBase Master status page.

This completes the installation of Hadoop and HBase for use with our future exercises.

See you in the next part of the series Basic HBase Java Classes and Methods – Part 2: HBase Shell


Downgrading Apache Hadoop YARN to MapReduce v1

This post is somewhat dated material.  Several years back, when YARN was first making headway and vendors started adopting it as part of Hadoop 2.x, there were many times when I needed to downgrade to MapReduce v1.  I had written a lot of code for MRv1, and there were times when downgrading was the best approach to getting things back up and running.  For those that may need to, here are my notes for downgrading from YARN to MRv1:

Once the downgrade is complete, all should be good on localhost:50030 and 50070 (the JobTracker and NameNode web UIs).


Scaling data for Deep Learning

When building deep learning models, it can be very beneficial to scale your data.  Oftentimes data can have a huge range of unbounded values.  The goal of scaling is to bound these values.  Typically the activation functions of a neuron are going to be tanh, sigmoid or ReLU.

 

 

In the case of sigmoid, recall that the output is in the interval (0,1), and with tanh the output is in the interval (-1,1).  Rectified Linear Unit (ReLU) activations are unbounded above.  Although ReLU is commonly used these days, we can still scale the data anyway.

We can use Python’s sklearn MinMaxScaler to accomplish this.

By default, MinMaxScaler will scale to the interval [0,1].
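A minimal sketch (the values in the array are made up purely for illustration):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.array([[10.0], [25.0], [50.0], [100.0]])  # made-up single-feature data

scaler = MinMaxScaler()              # default feature_range=(0, 1)
scaled = scaler.fit_transform(data)
print(scaled)                        # smallest value maps to 0, largest to 1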

This form of scaling is referred to as normalization, that is, the data is rescaled so that all values are in the interval [0,1].  The way this is done mathematically is:
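x_scaled = (x - x_min) / (x_max - x_min)

where x_min and x_max are the minimum and maximum values of the feature, taken from the data the scaler was fit on.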

MinMaxScaler takes care of this for us, but it is important to understand the math that is at work and the consequences of it.

One of the key rules in machine learning in general is that we do not want the training data to be tainted by any of the test data.  There are many ways in which this can happen, so one must be careful about anything done to the data before it is split.  This includes operations such as scaling.  Because scaling is based on the min and max of the set, this min and max would be greatly influenced if the scaling were done on the combined dataset.  Instead, it is important that we first do our train/test split to establish two sets of data, fit our scaler on the training set, and then scale both sets of data with that scaler.  Failure to handle the scaling properly will positively bias your results.
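A sketch of the right order of operations (the data here is randomly generated just to make the example self-contained):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Made-up features and labels purely for illustration
X = np.random.rand(100, 3) * 1000.0
y = np.random.randint(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn min/max from training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training min/max on the test set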

 


Hacking NX-OS Part 3

Some notes from when I first started hacking away at NX-OS in 2011:

The Nexus underlying operating system is made by MontaVista, and was formerly called Hard Hat Linux (a hardened version of Red Hat Linux).  I can tell you that there are numerous ways to attack these boxes.  Some that I have found:

  1. PATH not properly set in shell scripts
  2. Input not properly sanity-checked in scripts
  3. IFS together with PATH is exploitable
  4. gdbserver running as root can allow you to kill any process, including securityd
  5. The binaries are for the most part stripped, so there is no symbol information; I plan to eventually reconstruct the symbol table using some tools.  This, combined with gdb, would give you the ability to call any function you want as root.
  6. Many processes run as root (via /etc/sudoers); it's very sloppy
  7. I have found at least 5 ways to get shell access.
  8. gdb could (theoretically) be used to overflow the stack in a number of functions to run arbitrary shellcode.  I haven't done this because it's tedious, but it should work.  The security problem is that you can use gdb to remotely connect in the first place.
  9. At least one serious problem is the ability to crash a Nexus remotely via CDP; I don't believe this is fixed yet.

A productive evening.  I was able to get shell access on a 5k, 7k, 1000v, and MDS; that is, from the CLI I was able to get to an actual bash shell.  Oddly, I had to use different exploits on the MDS vs. the 5k/7k/1000v.  As far as I know these are not known to Cisco.  It's not really a serious issue, since you have to have access to the box anyway.  I only tried as admin, but it's likely to work from any user level.

I did not post every method that I was able to obtain root with, nor did I post the straightforward malicious methods, such as constructing a special CDP packet that will take NX-OS down every time (at least it used to).  If you have found any interesting things in NX-OS, please let me know!

gdb

The gdb binary is visible via the which command.  You can do "sh processes" to see all processes, then use "gdb <process id>" to run gdb as a server.

Then from your workstation you can connect to the gdb process by running gdb and issuing "target remote x.x.x.x:yyyyy", where x.x.x.x is the IP address of the MDS and yyyyy is the port gdb says it is listening on (starting at 10001).  Then you can use gdb to do things like stack smashing and other hacks.  These are advanced topics beyond what I am willing to write here, but trivial for those that know security and gdb.
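For illustration only (the IP address and port here are placeholders):

gdb
(gdb) target remote 192.0.2.10:10001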

I have found many security holes in the shell programs, and can pass things from the CLI that crash the system.  Yes, most of these work in older versions of SAN-OS as well as NX-OS.

 


Hacking NX-OS Part 2

You can see in my previous article that I used the "bash" command.  In later NX-OS versions this was not possible.  After rooting the box, I spent a lot of time learning about all of the shell scripts and binaries on the filesystem, and I continued to hack at them.

What became my "go-to" command was "this".  I think "this" was an undocumented command, but once you hacked into the filesystem you could see it was a command that was available.

The most common hack I would do was like so:

and then just use :shell from within vi.  This gives you a shell; you can look around and do whatever you like.

When spawning shells from within NX-OS, you may not end up with an interactive shell, so you must redirect output to your tty to see it, like so:

 

df > /dev/pts/0
Filesystem           1k-blocks      Used Available Use% Mounted on
/dev/pssblkdrv           59493       214     56207   1% /data_store
none                    409600    158696    250904  39% /isan
none                    102400      164    102236   1% /var/tmp
none                    153600        0    153600   0% /var/sysmgr
none                    307200    25748    281452   9% /var/sysmgr/ftp
none                    204800     3936    200864   2% /dev/shm
none                     61440        8     61432   1% /volatile
none                      2048         0      2048   0% /debug
/dev/hd-cfg0             19564      1145     17409   7% /mnt/cfg/0
/dev/hd-cfg1             19317      1145     17175   7% /mnt/cfg/1
/dev/hd-pss              19580      2826     15743  16% /mnt/pss
/dev/hd-bootflash       181724     94174     78168  55% /bootflash
127.1.2.2:/mnt/cf/partner
186683    13960    163085   8% /modflash_2-1

id > /dev/pts/0
uid=2002(admin) gid=503(network-admin) groups=503(network-admin)

uname -a > /dev/pts/0
Linux MDS4 2.4.20_mvl31-cpci735 #1 Wed Dec 16 15:50:36 PST 2009 i686 unknown

cat /etc/passwd > /dev/pts/0
root:*:0:0:root:/root:/isanboot/bin/nobash
bin:*:1:1:bin:/bin:
daemon:*:2:2:daemon:/usr/sbin:
sys:*:3:3:sys:/dev:
ftp:*:15:14:ftp:/var/ftp:/isanboot/bin/nobash
ftpuser:UvdRSOzORvz9o:99:14:ftpuser:/var/ftp:/isanboot/bin/nobash
nobody:*:65534:65534:nobody:/home:/bin/sh
admin:x:2002:503::/var/home/admin:/isan/bin/vsh_perm
