Thursday, June 7, 2012

Ingest data from database into Hadoop with Sqoop (2)

Here, I explore few other variations for importing data from database into HDFS. This is a continuation of previous article..

Previous sqoop command listed were good for one time fetch when you want to import all the current data for a table in database.

A more practical workflow is to fetch data regularly and incrementally into HDFS for analysis. You do not want to skip any previously imported data. For this you have to mark a column for incremental import and also provide an initial value. This column mostly happens to be time-stamp.
sqoop import 
             --connect jdbc:oracle:thin:@//HOST:PORT/DB
             --username DBA_USER 
             -P
             --table TABLENAME
             --columns "column1,column2,column3,.."
             --as-textfile
             --target-dir /target/directory/in/hdfs
             -m 1
             --check-column COLUMN3
             --incremental  lastmodified
             --last-value "LAST VALUE"

Ingest data from database into Hadoop with Sqoop (1)

Sqoop is an easy tool to import data from databases to HDFS and export data from Hadoop/Hive tables to Databases. Databases has been de-facto standard for storing structured data. Running complex queries on large data in databases can be detrimental to their performance.
It is some times useful to import the data into Hadoop for ad hoc analysis. Tools like hive, raw map-reduce can provide tremendous flexibility in performing various kinds of analysis.
This becomes particularly useful when database has been used mostly as storage device (Ex: Storing XML or unstructured string data as clob data).

Sqoop is very simple on it's face. Internally, it uses map-reduce in parallel data import from Database and utilizes JDBC connection for the purpose.

I am jumping straight into using sqoop with oracle database and will leave installation for some other post.

Sqoop commands are executed from command lines using following structure:
sqoop COMMAND [ARGS]
All available sqoop commands can be listed with: sqoop help

Article focuses on importing from database specifically Oracle DB.