We live in interesting times. As I write, the political life of a great nation is currently in thrall to a wealthy right-wing populist with a rather distinctive hairstyle.
But enough about Boris Johnson.
For someone used to the way things are done in Oracle, Hadoop can be something of a culture shock.
My initial introduction to Hadoop and its “vibrant ecosystem” was some internally written documentation.
As with most technical documentation, it was written with the assumption of a certain level of prior knowledge on the part of the reader. For this particular reader, it proved to be an erroneous assumption.
After half an hour of struggling through this text, I was left wondering what you would use a Kafka Banana Oozie for.
Maybe it’s for killing zombie processes? Let’s face it, with all that Java running on your system there are bound to be a few knocking around.
I’m a little hazy on my Zombie Lore, so I’m not entirely clear as to why a rapid application of fresh fruit would pose an existential threat to your average zombie. Maybe they’re potassium intolerant?
There is a bewildering array of tools associated with Hadoop, many of which appear to provide almost identical functionality.
For example, a relational database traditionally requires only one SQL engine…and here I’ll be taking a look at two of them.
Moving from Oracle to Hadoop can feel rather like trading your car in for a box of Lego.
If the box does contain any instructions, they seem to have been translated from Java into English…by someone who doesn’t speak either.
Note to reader: please substitute your own language as appropriate.
Fortunately, there are Hadoop distributions available which bundle the core tools required to get up and running. We’ll gloss over the fact that these distributions – Cloudera and Hortonworks – don’t contain the exact same tools.
In my search to find a coherent explanation of how to use Hadoop, I’ve found that the VM provided by Cloudera, together with the introductory tutorial, is a valuable aid to familiarising myself with the basics.
To start with then, I’ll cover getting hold of the Cloudera VM and setting it up in VirtualBox.
Then I’ll go through some of the tools available and what they do.
I’ll do this from the perspective of an Oracle developer (mainly because I don’t have much option) and will point out the driftwood of familiarity that you might be able to cling to in order to stay afloat in your Data Lake.
What I’ll cover is:
- The core components of Hadoop
- HDFS commands
- Transferring data between a relational database and Hadoop using SQOOP
- Querying structured data using Hive and Impala
- Uploading data using Kite
Ready to dive in?