Integrating MySQL with Hadoop
- Anirudh Prabhu
- Aug 8, 2019
- 3 min read
Updated: Nov 22, 2020
Think of this scenario:
Imagine you have an actual relational database (RDBMS) and you want to distribute its data across a cluster.
Why would you do this? Or rather, why would you ever need to?
Because an RDBMS (for example, MySQL) is monolithic in nature: it handles OLTP transactions well, but not OLAP workloads. Hadoop (Hive), on the other hand, works for OLAP but not for OLTP.
The conventional, simplest and most straightforward way to do this is to dump the data from the database into a CSV file and then manually load that into Hadoop.
But it can be done far more efficiently using a MapReduce-based tool called Sqoop. Sqoop can move data from MySQL into Hadoop so that you can run complex queries on it, or vice versa.

Sqoop is basically a MapReduce job in the back end.

The Sqoop import is just a single line on the command line: the import tool extracts individual tables from an RDBMS and copies them to HDFS. We can also set the number of mappers we want to use by adding the -m parameter to the import command.
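As a sketch, the basic syntax looks like this (the host, database, user, table and target directory are placeholders you would swap for your own values):

```bash
# -m controls the number of parallel map tasks; -P prompts for the DB password.
sqoop import \
  --connect jdbc:mysql://<mysql_host>/<database> \
  --username <db_user> \
  -P \
  --table <table_name> \
  --target-dir /user/<hdfs_user>/<table_name> \
  -m 4
```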

Here is how I learned to load data with Sqoop into a Hive data warehouse on Hadoop.
Prerequisite: a Hadoop environment with Sqoop and Hive installed and working. To speed things up, I am using the Cloudera Quickstart VM (requires 4GB of RAM), although you can also work with the Hortonworks Data Platform (requires 8GB of RAM). Since my laptop has only 8GB of RAM, I prefer to work with the Cloudera VM image.
If you are working with the Cloudera/HDP VM all fired up in VirtualBox, it becomes much easier to work with the Hadoop ecosystem, since many of its packages come pre-installed (MySQL, Oozie, Hadoop, Hive, ZooKeeper, Storm, Kafka, Spark, etc.).
Code walkthrough:
1. Create a table in MySQL
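A minimal sketch of this step follows; the MySQL database name (retail), the column names and the sample rows are assumptions. The only details relied on later are a table named customer with an id column and no primary key.

```sql
-- Hypothetical schema: table name, id column and the missing PRIMARY KEY
-- are the only details the rest of the walkthrough depends on.
CREATE DATABASE IF NOT EXISTS retail;
USE retail;

CREATE TABLE customer (
  id   INT,
  name VARCHAR(50),
  city VARCHAR(50)
);

-- A few sample records
INSERT INTO customer VALUES
  (1, 'Alice', 'Pune'),
  (2, 'Bob',   'Mumbai'),
  (3, 'Carol', 'Delhi');
```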

2. Let’s start Sqooping
As you can see, the customer table does not have a primary key, and I have added a few records to it. By default, Sqoop will identify the primary key column (if present) in a table and use it as the splitting column. The low and high values for the splitting column are retrieved from the database, and the map tasks operate on evenly-sized components of the total range.
If the actual values for the primary key are not uniformly distributed across its range, then this can result in unbalanced tasks. You should explicitly choose a different column with the --split-by argument. For example, --split-by id.
Since I want to import this table directly into Hive, I am adding --hive-import to my Sqoop command.
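Putting it all together, the import command looks roughly like this (the JDBC host, MySQL database name, username and target directory are assumptions based on the Cloudera Quickstart setup; each option is explained below):

```bash
# Import the MySQL customer table into Hive as sqoop_workspace.customers
sqoop import \
  --connect jdbc:mysql://localhost/retail \
  --username root \
  -P \
  --table customer \
  --split-by id \
  --target-dir /user/cloudera/customer \
  --fields-terminated-by "," \
  --hive-import \
  --create-hive-table \
  --hive-table sqoop_workspace.customers
```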

Here's what each Sqoop command option means:
--connect: the JDBC connection string.
--username: the database username.
-P: prompts for the password on the console. Alternatively you can use --password, but that is not good practice since the password is visible in your job execution logs and is asking for trouble. One way to deal with this is to store the DB password in a file in HDFS and supply it at runtime.
--table: the table you want to import from MySQL; here it is customer.
--split-by: the splitting column; I am specifying id here.
--target-dir: the HDFS destination directory.
--fields-terminated-by: I have specified a comma (by default Sqoop imports data into HDFS as comma-separated values anyway).
--hive-import: import the table into Hive (uses Hive's default delimiters if none are set).
--create-hive-table: if set, the job will fail if the Hive table already exists, which is what we want in this case.
--hive-table: specify <db_name>.<table_name>; here it is sqoop_workspace.customers, where sqoop_workspace is my database and customers is the table name.
Remember that Sqoop runs as a MapReduce job. Notice that I am using the -P option for the password. While this works, it can also be parameterized by reading the password from a file, for example with Sqoop's --password-file option.
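As a sketch of that approach (the paths and the password itself are placeholders), the password is written to a file, stored in HDFS with tight permissions, and referenced at runtime:

```bash
# Store the password (no trailing newline) in a file and copy it to HDFS
echo -n "mysecretpassword" > mysql-pass.txt
hdfs dfs -put mysql-pass.txt /user/cloudera/mysql-pass.txt
hdfs dfs -chmod 400 /user/cloudera/mysql-pass.txt

# Reference the file at runtime instead of prompting with -P
sqoop import \
  --connect jdbc:mysql://localhost/retail \
  --username root \
  --password-file /user/cloudera/mysql-pass.txt \
  --table customer \
  --target-dir /user/cloudera/customer_pw_demo
```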

The above import command makes Sqoop run a MapReduce job that copies the table into HDFS and loads it into Hive under the database 'sqoop_workspace' with the table name 'customers'.
Now, let us verify our tables with Hive.
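In my case that check was roughly the following from the Hive shell (the database and table names come from the --hive-table option above):

```sql
SHOW DATABASES;
USE sqoop_workspace;
SHOW TABLES;
DESCRIBE customers;
```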

The next step is to query it in Hive:
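For example, a minimal sanity check (the rows returned will be whatever you inserted in MySQL):

```sql
SELECT * FROM sqoop_workspace.customers LIMIT 10;
```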

Success!
We can also do incremental imports using Sqoop. We can keep our Hive table in sync with the RDBMS by using a timestamp filter to run incremental imports over time. This allows data to be actively and continuously exchanged between the different database environments, keeping the shared information consistent and up to date. Pretty cool stuff, eh!
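As a sketch of such an incremental import using Sqoop's lastmodified mode (the updated_at timestamp column and the cutoff value are assumptions and are not part of the table created earlier):

```bash
# Pull only rows modified since the last run and append them to the existing
# HDFS directory; on the next run, bump --last-value to the new high-water mark.
sqoop import \
  --connect jdbc:mysql://localhost/retail \
  --username root \
  -P \
  --table customer \
  --incremental lastmodified \
  --check-column updated_at \
  --last-value "2019-08-08 00:00:00" \
  --target-dir /user/cloudera/customer \
  --append
```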