Getting Started with Sqoop — Part I
Most organizations store their operational data in relational databases, so there was a need for a tool that could import and export data between those databases and Hadoop. This is why Apache Sqoop was born: Sqoop integrates easily with Hadoop and can dump structured data from relational databases into HDFS.
So what is Sqoop?
Apache Sqoop is a tool in the Hadoop ecosystem designed to transfer bulk data between HDFS (Hadoop storage) and relational database servers, in either direction.
This is how Sqoop got its name: "SQL to Hadoop & Hadoop to SQL".
Features of Sqoop
Sqoop provides many salient features like:
- Full Load: Apache Sqoop can load the whole table by a single command. You can also load all the tables from a database using a single command.
- Incremental Load: Apache Sqoop also supports incremental load, importing only the rows that were added or updated since the previous import.
- Parallel import/export: Sqoop uses YARN framework to import and export the data, which provides fault tolerance on top of parallelism.
- Import results of SQL query: You can also import the result set returned by an SQL query into HDFS.
- Compression: You can compress your data with the deflate (gzip) algorithm using the --compress argument, or with another codec by specifying the --compression-codec argument. You can also load compressed tables into Apache Hive.
- Connectors for all major RDBMS databases: Apache Sqoop provides connectors for multiple RDBMS databases, covering nearly all of the commonly used ones (MySQL, PostgreSQL, Oracle, SQL Server, DB2).
- Kerberos Security Integration: Kerberos is a computer network authentication protocol which works on the basis of ‘tickets’ to allow nodes communicating over a non-secure network to prove their identity to one another in a secure manner. Sqoop supports Kerberos authentication.
- Load data directly into HIVE/HBase: You can load data directly into Apache Hive for analysis and also dump your data in HBase, which is a NoSQL database.
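The incremental-load feature mentioned above can be sketched with a single command. This is only a sketch: it assumes the sqoop_demo database and traveller_reviews table used later in this post, with a numeric key column user_id.

```shell
# Incremental append import: only rows whose user_id is greater than the
# recorded --last-value are fetched. At the end of the run Sqoop prints
# the new last-value to pass to the next incremental run.
sqoop import \
  --connect jdbc:mysql://localhost:3306/sqoop_demo \
  --username root \
  -P \
  --table traveller_reviews \
  --incremental append \
  --check-column user_id \
  --last-value 0 \
  --target-dir /sqoop_directory \
  -m 1
```

Note that --delete-target-dir cannot be combined with --incremental append, since incremental runs append new files next to the existing ones.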
Sqoop Commands
Before we execute the Sqoop commands, let us load some data into MySQL (our RDBMS).
[npntraining@centos8 ~]$ mysql -u root -p
mysql> CREATE DATABASE sqoop_demo;
mysql> USE sqoop_demo;
mysql> LOAD DATA LOCAL INFILE '/home/npntraining/Desktop/PracticeFiles/data-set/tripadvisor_review_01.csv'
    -> INTO TABLE traveller_reviews
    -> FIELDS TERMINATED BY ','
    -> LINES TERMINATED BY '\n'
    -> IGNORE 1 ROWS;
ERROR 1148 (42000): The used command is not allowed with this MySQL version
mysql> SHOW VARIABLES LIKE 'local_infile';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| local_infile  | OFF   |
+---------------+-------+
mysql> SET GLOBAL local_infile = 1;
After enabling local_infile (you may also need to reconnect with mysql --local-infile=1), re-run the LOAD DATA statement. The traveller_reviews table must already exist in sqoop_demo before loading.
Sqoop help Command
[npntraining@centos8 ~]$ sqoop help
Available commands:
codegen Generate code to interact with database records
create-hive-table Import a table definition into Hive
eval Evaluate a SQL statement and display the results
export Export an HDFS directory to a database table
help List available commands
import Import a table from a database to HDFS
import-all-tables Import tables from a database to HDFS
import-mainframe Import datasets from a mainframe server to HDFS
job Work with saved jobs
list-databases List available databases on a server
list-tables List available tables in a database
merge Merge results of incremental imports
metastore Run a standalone Sqoop metastore
version Display version information
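Of these, eval is handy for sanity-checking connectivity before running a full import. A minimal sketch against the sqoop_demo database used below (-P prompts for the password):

```shell
# Run an ad-hoc SQL statement on the source database and print the result.
# Useful for verifying the JDBC URL and credentials before a long import.
sqoop eval \
  --connect jdbc:mysql://localhost:3306/sqoop_demo \
  --username root \
  -P \
  --query "SELECT COUNT(*) FROM traveller_reviews"
```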
Sqoop list-databases Command
[npntraining@centos8 ~]$ sqoop list-databases \
--connect jdbc:mysql://localhost:3306/ \
--username root \
--password npntraining
20/08/07 23:31:26 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
[npntraining@centos8 ~]$ sqoop list-databases \
--connect jdbc:mysql://localhost:3306/ \
--username root \
-P
Sqoop list-tables command
[npntraining@centos8 ~]$ sqoop list-tables \
--connect jdbc:mysql://localhost:3306/sqoop_demo \
--username root \
-P
Sqoop import command
The import command is used to import a table from a relational database into HDFS. In our case, we are going to import tables from a MySQL database into HDFS.
[npntraining@centos8 ~]$ sqoop import \
--connect jdbc:mysql://localhost:3306/sqoop_demo \
--username root \
--password npntraining \
--table traveller_reviews \
--target-dir /sqoop_directory \
--delete-target-dir \
-m 1
Instead of hardcoding the password on the command line, we can store it in a file on HDFS:
[npntraining@centos8 ~]$ echo -n "npntraining" > .password
[npntraining@centos8 ~]$ hdfs dfs -put .password /
[npntraining@centos8 ~]$ hdfs dfs -chmod 400 /.password
[npntraining@centos8 ~]$ sqoop import \
--connect jdbc:mysql://localhost:3306/sqoop_demo \
--username root \
--password-file /.password \
--table traveller_reviews \
--target-dir /sqoop_directory \
-m 1
We can also import only specific columns:
[npntraining@centos8 ~]$ sqoop import \
--connect jdbc:mysql://localhost:3306/sqoop_demo \
--username root \
--password-file /.password \
--table traveller_reviews \
--columns "user_id,art_galleries,dance_clubs" \
--target-dir /sqoop_directory \
--delete-target-dir \
-m 1
[npntraining@centos8 ~]$ hdfs dfs -cat /sqoop_directory/part-m-00000
Importing with a condition
sqoop import \
--connect jdbc:mysql://localhost:3306/sqoop_demo \
--username root \
--password npntraining \
--table traveller_reviews \
--columns "user_id,art_galleries,dance_clubs" \
--where "user_id < 100" \
-m 1 \
--target-dir /sqoop_directory \
--delete-target-dir
We can also import the results of a free-form SQL query. When --query is used, the WHERE clause must contain the literal token $CONDITIONS, which Sqoop replaces with its own split conditions:
sqoop import \
--connect jdbc:mysql://localhost:3306/sqoop_demo \
--username root \
--password npntraining \
--query 'select * from traveller_reviews where user_id >5 and $CONDITIONS' \
-m 1 \
--target-dir /sqoop_directory \
--delete-target-dir
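All of the examples above use -m 1, i.e. a single mapper. To import in parallel, Sqoop needs a column to split the work on; with --query this must be supplied explicitly via --split-by. A sketch assuming user_id is a numeric key with a reasonably even distribution:

```shell
# Parallel free-form-query import: 4 mappers, work split on user_id.
# Sqoop computes MIN/MAX of user_id and gives each mapper one range,
# substituted into the query in place of $CONDITIONS.
sqoop import \
  --connect jdbc:mysql://localhost:3306/sqoop_demo \
  --username root \
  --password-file /.password \
  --query 'select * from traveller_reviews where user_id > 5 and $CONDITIONS' \
  --split-by user_id \
  -m 4 \
  --target-dir /sqoop_directory \
  --delete-target-dir
```

With 4 mappers the output directory contains up to four files, part-m-00000 through part-m-00003.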
Importing with a specific delimiter
[npntraining@centos8 ~]$ sqoop import \
--options-file ~/Desktop/sqoop-options.txt \
-m 1 \
--target-dir /sqoop_directory \
--delete-target-dir \
--fields-terminated-by "|"
[npntraining@centos8 ~]$ hdfs dfs -cat /sqoop_directory/part-m-00000
As you can see in the commands above, several arguments repeat in every invocation; these common arguments can be stored in a separate options file.
vi sqoop-options.txt
--connect
jdbc:mysql://localhost:3306/sqoop_demo
--username
root
--password
npntraining
--query
'select * from traveller_reviews where user_id <5 and $CONDITIONS'
We can reference sqoop-options.txt with the --options-file argument, as shown below:
[npntraining@centos8 ~]$ sqoop import \
--options-file ~/Desktop/sqoop-options.txt \
-m 1 \
--target-dir /sqoop_directory \
--delete-target-dir
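Going in the other direction, the export command from the sqoop help listing writes an HDFS directory back into a database table. A minimal sketch, assuming a target table traveller_reviews_export (a hypothetical name) has already been created in MySQL with a schema matching the files:

```shell
# Export the previously imported files back into MySQL.
# The target table must exist before the export runs; Sqoop will not
# create it. Field delimiters must match the files in --export-dir.
sqoop export \
  --connect jdbc:mysql://localhost:3306/sqoop_demo \
  --username root \
  --password-file /.password \
  --table traveller_reviews_export \
  --export-dir /sqoop_directory \
  -m 1
```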
Looking for Big Data Training!
Join our Big Data Hadoop and Spark Developer Training which is designed for professionals who are seeking to deepen their knowledge in the field of Big Data.