Importing MySQL table data into Hive with Sqoop
I recently got back into Hadoop, mostly because a vCenter host with 40 GB of RAM was sitting idle. I set up a CDH5 cluster on it: one namenode, one secondary namenode, and three datanodes. Then I learned how to import MySQL data into Hive, using Sqoop as the tool.
Installing the MySQL JDBC driver
The test environment is CDH5, so Sqoop is already set up. One caveat: for licensing reasons the MySQL JDBC driver is not bundled, so you have to download it yourself and copy it into the /var/lib/sqoop directory:
cd /dist/dist
wget -c http://ftp.ntu.edu.tw/MySQL/Downloads/Connector-J/mysql-connector-java-5.1.36.tar.gz
cd /dist/src
tar xvf /dist/dist/mysql-connector-java-5.1.36.tar.gz
cp -v mysql-connector-java-5.1.36/mysql-connector-java-5.1.36-bin.jar /var/lib/sqoop
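Before running any import it's worth confirming the connector jar actually landed where Sqoop looks for it. A minimal sketch, assuming the CDH convention of /var/lib/sqoop used above (pass another path as the first argument to check a different directory):

```shell
# Check that a MySQL connector jar is present and readable in the given
# directory (defaults to /var/lib/sqoop, the CDH location used above).
check_driver() {
  local dir="${1:-/var/lib/sqoop}"
  if ls "$dir"/mysql-connector-java-*.jar >/dev/null 2>&1; then
    echo "driver found in $dir"
  else
    echo "no MySQL connector jar in $dir" >&2
    return 1
  fi
}
```

Call it as `check_driver` (or `check_driver /some/other/dir`); a non-zero exit status means the jar is missing.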
Granting privileges on the MySQL side
Sqoop connects to MySQL over JDBC, so the MySQL server needs to grant the corresponding privileges:
# mysql -uroot -p<root_pass>
mysql> create user 'hadoop'@'192.168.1.2' IDENTIFIED BY 'hadoop';
mysql> create user 'hadoop'@'192.168.1.3' IDENTIFIED BY 'hadoop';
mysql> create user 'hadoop'@'192.168.1.4' IDENTIFIED BY 'hadoop';
mysql> create user 'hadoop'@'192.168.1.5' IDENTIFIED BY 'hadoop';
mysql> create user 'hadoop'@'192.168.1.6' IDENTIFIED BY 'hadoop';
mysql> GRANT ALL PRIVILEGES ON *.* TO 'hadoop'@'192.168.1.2' WITH GRANT OPTION;
mysql> GRANT ALL PRIVILEGES ON *.* TO 'hadoop'@'192.168.1.3' WITH GRANT OPTION;
mysql> GRANT ALL PRIVILEGES ON *.* TO 'hadoop'@'192.168.1.4' WITH GRANT OPTION;
mysql> GRANT ALL PRIVILEGES ON *.* TO 'hadoop'@'192.168.1.5' WITH GRANT OPTION;
mysql> GRANT ALL PRIVILEGES ON *.* TO 'hadoop'@'192.168.1.6' WITH GRANT OPTION;
mysql> flush privileges;
- 192.168.1.2 - 192.168.1.6 are the Hadoop cluster machines.
- 192.168.1.7 is the MySQL server.
(GRANT ALL on *.* is more than Sqoop strictly needs; for anything beyond a test setup, restrict the grant to the databases being imported.)
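The statements above can also be generated in a loop instead of typed five times. A sketch using the same host range, user name, and password as above; pipe its output into `mysql -uroot -p` to apply it:

```shell
# Emit CREATE USER / GRANT statements for every cluster node (.2 through .6).
for i in 2 3 4 5 6; do
  echo "CREATE USER 'hadoop'@'192.168.1.$i' IDENTIFIED BY 'hadoop';"
  echo "GRANT ALL PRIVILEGES ON *.* TO 'hadoop'@'192.168.1.$i' WITH GRANT OPTION;"
done
echo "FLUSH PRIVILEGES;"
```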
Load some production data into MySQL for testing:
mysql -uroot -p<root_pass>
mysql> create database hd_test_db;
mysql> quit;
mysql -uroot -p<root_pass> hd_test_db < hd_test_db_struct.sql
cd /data/backup/hd_test_db
mysqlimport --local -uroot -p<root_pass> hd_test_db *.txt
We'll use this table:
mysql -uroot -p<root_pass>
mysql> use hd_test_db;
mysql> select count(*) from items;
+----------+
| count(*) |
+----------+
| 14822656 |
+----------+
1 row in set (0.00 sec)
Hive
Create a Hive database:
# hive shell
hive> create database hd_test_db;
hive> quit;
Sqoop
Now run the import:
# sudo -u hdfs sqoop import --connect jdbc:mysql://192.168.1.7/hd_test_db --username hadoop --password hadoop --table items --target-dir /user/hive/warehouse/hd_test_db --fields-terminated-by '\t' --hive-import --hive-overwrite --create-hive-table --hive-table hd_test_db.items
Each option is described in the output of sqoop help import.
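Passing the password with --password puts it in the process list and shell history. A sketch of a parameterized variant that keeps it out, using --password-file and -m (number of map tasks), both standard Sqoop 1 options; the HDFS path to the password file is an assumption, and the script only prints the command (a dry run) rather than executing it:

```shell
# Build the import command with the password read from a file in HDFS.
HOST=192.168.1.7
DB=hd_test_db
TABLE=items
CMD="sqoop import --connect jdbc:mysql://$HOST/$DB \
 --username hadoop --password-file /user/hdfs/.mysql.password \
 --table $TABLE -m 4 \
 --fields-terminated-by '\t' \
 --hive-import --hive-overwrite --create-hive-table \
 --hive-table $DB.$TABLE"
echo "$CMD"   # dry run: print the command instead of executing it
```

To actually run it, invoke the printed command under sudo -u hdfs as above.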
After the import finishes, open Hive and check the data:
# sudo -u hdfs hive
hive> show databases;
OK
default
hd_test_db
Time taken: 0.674 seconds, Fetched: 2 row(s)
hive> use hd_test_db;
OK
Time taken: 0.959 seconds
hive> show tables;
OK
items
Time taken: 0.533 seconds, Fetched: 1 row(s)
hive> select count(*) from items;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_1440428885404_0008, Tracking URL = http://hd1.hadoop.com:8088/proxy/application_1440428885404_0008/
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1440428885404_0008
Hadoop job information for Stage-1: number of mappers: 7; number of reducers: 1
2015-08-30 21:45:24,588 Stage-1 map = 0%, reduce = 0%
2015-08-30 21:45:37,436 Stage-1 map = 10%, reduce = 0%, Cumulative CPU 9.43 sec
2015-08-30 21:45:39,577 Stage-1 map = 29%, reduce = 0%, Cumulative CPU 12.02 sec
2015-08-30 21:45:50,259 Stage-1 map = 57%, reduce = 0%, Cumulative CPU 21.97 sec
2015-08-30 21:46:02,077 Stage-1 map = 62%, reduce = 0%, Cumulative CPU 26.61 sec
2015-08-30 21:46:03,132 Stage-1 map = 86%, reduce = 0%, Cumulative CPU 33.64 sec
2015-08-30 21:46:09,538 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 35.34 sec
2015-08-30 21:46:15,921 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 37.94 sec
MapReduce Total cumulative CPU time: 37 seconds 940 msec
Ended Job = job_1440428885404_0008
MapReduce Jobs Launched:
Job 0: Map: 7 Reduce: 1 Cumulative CPU: 37.94 sec HDFS Read: 1575346947 HDFS Write: 9 SUCCESS
Total MapReduce CPU Time Spent: 37 seconds 940 msec
OK
14822656
Time taken: 72.172 seconds, Fetched: 1 row(s)
Join queries
Import two tables and run a join across them:
# sudo -u hdfs sqoop import --connect jdbc:mysql://192.168.1.7/hd_test_db --username hadoop --password hadoop --table users --target-dir /user/hive/warehouse/hd_test_db --fields-terminated-by '\t' --hive-import --hive-overwrite --create-hive-table --hive-table hd_test_db.users
# sudo -u hdfs sqoop import --connect jdbc:mysql://192.168.1.7/hd_test_db --username hadoop --password hadoop --table pays --target-dir /user/hive/warehouse/hd_test_db --fields-terminated-by '\t' --hive-import --hive-overwrite --create-hive-table --hive-table hd_test_db.pays
To run these directly as the root user, you first need to create a user directory for root in HDFS:
# sudo -u hdfs hadoop fs -mkdir /user/root
# sudo -u hdfs hadoop fs -chown root:root /user/root
Run the join query in Hive:
# hive
hive> use hd_test_db;
hive> describe users;
OK
pid bigint None
user_id bigint None
user_name string None
create_time int None
sex string None
auto int None
Time taken: 0.098 seconds, Fetched: 6 row(s)
hive> describe pays;
OK
user_id bigint None
user_pay int None
zero_dateline int None
Time taken: 0.079 seconds, Fetched: 3 row(s)
hive> select t2.user_pay, t1.user_name from users t1 join pays t2 on t1.user_id = t2.user_id order by t2.user_pay desc limit 10;
Elapsed time, with 26,464 rows in users and 54,316 in pays:
Time taken: 47.608 seconds, Fetched: 10 row(s)
Of course, tables this small don't show off Hadoop's strengths at all. I'll find time to test with a much larger table.