Getting Started with Sqoop — Part I

Most organizations store their operational data in relational databases, so there was a need for a tool that could import and export data from those databases. This is why Apache Sqoop was born. Sqoop integrates easily with Hadoop and can dump structured data from relational databases onto HDFS.

So What is Sqoop?

Apache Sqoop is a tool in the Hadoop ecosystem designed to transfer bulk data between HDFS (Hadoop storage) and relational database servers, in both directions.

This is how Sqoop got its name — “SQL to Hadoop & Hadoop to SQL”.

Features of Sqoop

Sqoop provides several salient features:

  1. Full Load: Apache Sqoop can load a whole table with a single command. You can also load all the tables from a database with a single command.
  2. Incremental Load: Apache Sqoop also supports incremental load, so you can import only the rows of a table that have been added or updated since the last import.
  3. Parallel import/export: Sqoop uses the YARN framework to import and export data, which provides fault tolerance on top of parallelism.
  4. Import results of an SQL query: You can also import the result set of an SQL query into HDFS.
  5. Compression: You can compress your data using the deflate (gzip) algorithm with the --compress argument, or by specifying the --compression-codec argument. You can also load compressed tables into Apache Hive.
  6. Connectors for all major RDBMS databases: Apache Sqoop provides connectors for multiple RDBMS databases, covering almost every major database in use.
  7. Kerberos security integration: Kerberos is a computer-network authentication protocol that uses 'tickets' to let nodes communicating over a non-secure network prove their identity to one another securely. Sqoop supports Kerberos authentication.
  8. Load data directly into Hive/HBase: You can load data directly into Apache Hive for analysis, and also dump your data into HBase, a NoSQL database.
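
As an illustration of the incremental load feature, an incremental append import might look like the sketch below. The check column and last-value are illustrative assumptions; Sqoop will fetch only rows whose check-column value exceeds the recorded last value.

```shell
# Incremental append import: fetch only rows added since the last run,
# i.e. rows where user_id > 100 (column and value are illustrative).
sqoop import \
  --connect jdbc:mysql://localhost:3306/sqoop_demo \
  --username root \
  -P \
  --table traveller_reviews \
  --incremental append \
  --check-column user_id \
  --last-value 100 \
  --target-dir /sqoop_directory
```

At the end of the run, Sqoop prints the new last value, which you would pass as --last-value on the next run (or let a saved Sqoop job track it for you).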

Sqoop Commands

Before we execute the Sqoop commands, let us load the data into MySQL (our RDBMS).

Download data set

[npntraining@centos8 ~]$ mysql -u root -p
Enter password: npntraining
mysql> CREATE DATABASE sqoop_demo;
mysql> USE sqoop_demo;
mysql> LOAD DATA LOCAL INFILE '/home/npntraining/Desktop/PracticeFiles/data-set/tripadvisor_review_01.csv'
INTO TABLE traveller_reviews
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 ROWS;
ERROR 1148 (42000): The used command is not allowed with this MySQL version

If you hit this error, local_infile is disabled on the server. Check and enable it, then re-run the LOAD DATA statement (you may also need to reconnect with the client-side flag enabled: mysql --local-infile=1 -u root -p):

mysql> SHOW VARIABLES LIKE 'local_infile';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| local_infile  | OFF   |
+---------------+-------+
mysql> SET GLOBAL local_infile = 1;
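
Note that LOAD DATA assumes the traveller_reviews table already exists. A minimal sketch of its definition, assuming a numeric user id followed by per-category rating columns (the exact column list depends on the downloaded data set):

```sql
-- Illustrative schema; only the columns referenced later in this
-- article are shown, the remaining rating columns are elided.
CREATE TABLE traveller_reviews (
  user_id       INT,
  art_galleries DECIMAL(4,2),
  dance_clubs   DECIMAL(4,2)
  -- ... remaining rating columns from the CSV
);
```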

Sqoop help Command

[npntraining@centos8 ~]$ sqoop help
Available commands:
codegen Generate code to interact with database records
create-hive-table Import a table definition into Hive
eval Evaluate a SQL statement and display the results
export Export an HDFS directory to a database table
help List available commands
import Import a table from a database to HDFS
import-all-tables Import tables from a database to HDFS
import-mainframe Import datasets from a mainframe server to HDFS
job Work with saved jobs
list-databases List available databases on a server
list-tables List available tables in a database
merge Merge results of incremental imports
metastore Run a standalone Sqoop metastore
version Display version information

Sqoop list-databases Command

[npntraining@centos8 ~]$ sqoop list-databases \
--connect jdbc:mysql://localhost:3306/ \
--username root \
--password npntraining

20/08/07 23:31:26 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.

As the warning suggests, the -P option prompts for the password at runtime instead of exposing it on the command line:

[npntraining@centos8 ~]$ sqoop list-databases \
--connect jdbc:mysql://localhost:3306/ \
--username root \
-P

Sqoop list-tables Command

[npntraining@centos8 ~]$ sqoop list-tables \
--connect jdbc:mysql://localhost:3306/sqoop_demo \
--username root \
-P

Sqoop import Command

The import command is used to import a table from a relational database into HDFS. In our case, we will import tables from a MySQL database into HDFS.

[npntraining@centos8 ~]$ sqoop import \
--connect jdbc:mysql://localhost:3306/sqoop_demo \
--username root \
--password npntraining \
--table traveller_reviews \
--target-dir /sqoop_directory \
--delete-target-dir \
-m 1

Instead of hard-coding the password on the command line, we can store it in a file on HDFS and point Sqoop at it with the --password-file argument:

[npntraining@centos8 ~]$ echo -n "npntraining" > .password
[npntraining@centos8 ~]$ hdfs dfs -put .password /
[npntraining@centos8 ~]$ hdfs dfs -chmod 400 /.password
[npntraining@centos8 ~]$ sqoop import \
--connect jdbc:mysql://localhost:3306/sqoop_demo \
--username root \
--password-file /.password \
--table traveller_reviews \
--target-dir /sqoop_directory \
-m 1

We can also import only specific columns using the --columns argument:

[npntraining@centos8 ~]$ sqoop import \
--connect jdbc:mysql://localhost:3306/sqoop_demo \
--username root \
--password-file /.password \
--table traveller_reviews \
--columns "user_id,art_galleries,dance_clubs" \
--target-dir /sqoop_directory \
--delete-target-dir \
-m 1
[npntraining@centos8 ~]$ hdfs dfs -cat /sqoop_directory/part-m-00000
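
Building on the compression feature mentioned earlier, the same import can be compressed. The sketch below uses the default gzip codec via --compress, so the output part files end in .gz; a different codec could be chosen with --compression-codec.

```shell
# Same import as above, but with gzip-compressed output files.
sqoop import \
  --connect jdbc:mysql://localhost:3306/sqoop_demo \
  --username root \
  --password-file /.password \
  --table traveller_reviews \
  --compress \
  --target-dir /sqoop_directory \
  --delete-target-dir \
  -m 1
```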

Importing with conditions

sqoop import \
--connect jdbc:mysql://localhost:3306/sqoop_demo \
--username root \
--password npntraining \
--table traveller_reviews \
--columns "user_id,art_galleries,dance_clubs" \
--where "user_id < 100" \
-m 1 \
--target-dir /sqoop_directory \
--delete-target-dir

We can also import the results of a free-form SQL query using the --query argument. The $CONDITIONS placeholder is mandatory; Sqoop replaces it with split conditions when distributing the work across mappers:

sqoop import \
--connect jdbc:mysql://localhost:3306/sqoop_demo \
--username root \
--password npntraining \
--query 'select * from traveller_reviews where user_id >5 and $CONDITIONS' \
-m 1 \
--target-dir /sqoop_directory \
--delete-target-dir
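
All of the examples so far use a single mapper (-m 1). To import in parallel, increase the mapper count and tell Sqoop which column to split on; with a free-form query, --split-by is required. The mapper count of 4 below is an illustrative choice:

```shell
# Parallel free-form query import: Sqoop turns $CONDITIONS into a
# range predicate on user_id for each of the 4 mappers.
sqoop import \
  --connect jdbc:mysql://localhost:3306/sqoop_demo \
  --username root \
  --password npntraining \
  --query 'select * from traveller_reviews where $CONDITIONS' \
  --split-by user_id \
  -m 4 \
  --target-dir /sqoop_directory \
  --delete-target-dir
```

This produces one part file per mapper (part-m-00000 through part-m-00003) under the target directory.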

Importing with a Specific Delimiter

[npntraining@centos8 ~]$ sqoop import \
--connect jdbc:mysql://localhost:3306/sqoop_demo \
--username root \
--password npntraining \
--table traveller_reviews \
-m 1 \
--target-dir /sqoop_directory \
--delete-target-dir \
--fields-terminated-by "|"
[npntraining@centos8 ~]$ hdfs dfs -cat /sqoop_directory/part-m-00000

If you look at the commands above, a lot of boilerplate (connection string, username, password, query) is repeated in every invocation; it can be stored in a separate options file.

vi sqoop-options.txt

--connect
jdbc:mysql://localhost:3306/sqoop_demo
--username
root
--password
npntraining
--query
'select * from traveller_reviews where user_id <5 and $CONDITIONS'

We can then invoke Sqoop with the options file using the --options-file argument:

[npntraining@centos8 ~]$ sqoop import \
--options-file ~/Desktop/sqoop-options.txt \
-m 1 \
--target-dir /sqoop_directory \
--delete-target-dir

Looking for Big Data Training!

Join our Big Data Hadoop and Spark Developer Training which is designed for professionals who are seeking to deepen their knowledge in the field of Big Data.

Data Engineering Training


Naveen - (Founder & Trainer @ NPN Training)

A software training institute which believes that technology has to be learnt under experienced practitioners — www.npntraining.com