Importing MySQL table data into Hive with Sqoop
I recently got back into Hadoop, mostly because a vCenter host with 40 GB of RAM was sitting idle. I set up a CDH5 cluster on it: one namenode, one secondary namenode, and three datanodes. Then I learned how to import MySQL data into Hive, using Sqoop as the tool.
Installing the MySQL JDBC driver
The test environment is CDH5, so Sqoop is already set up. One caveat: for licensing reasons the MySQL JDBC driver is not bundled, so you have to download it yourself and copy it into the /var/lib/sqoop directory:
cd /dist/dist
wget -c http://ftp.ntu.edu.tw/MySQL/Downloads/Connector-J/mysql-connector-java-5.1.36.tar.gz
cd /dist/src
tar xvf /dist/dist/mysql-connector-java-5.1.36.tar.gz
cp -v mysql-connector-java-5.1.36/mysql-connector-java-5.1.36-bin.jar /var/lib/sqoop
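Before running any import it's worth confirming the connector jar actually landed where Sqoop looks for it. A minimal sketch, assuming the CDH convention of /var/lib/sqoop used above (pass another path as the first argument to check a different directory):

```shell
# Check that a MySQL connector jar is present and readable in the given
# directory (defaults to /var/lib/sqoop, the CDH location used above).
check_driver() {
  local dir="${1:-/var/lib/sqoop}"
  if ls "$dir"/mysql-connector-java-*.jar >/dev/null 2>&1; then
    echo "driver found in $dir"
  else
    echo "no MySQL connector jar in $dir" >&2
    return 1
  fi
}
```

Call it as `check_driver` (or `check_driver /some/other/dir`); a non-zero exit status means the jar is missing.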
Granting privileges on the MySQL side
Sqoop connects to MySQL over JDBC, so the MySQL server needs to grant the corresponding privileges:
# mysql -uroot -p<root_pass>
mysql> create user 'hadoop'@'192.168.1.2' IDENTIFIED BY 'hadoop';
mysql> create user 'hadoop'@'192.168.1.3' IDENTIFIED BY 'hadoop';
mysql> create user 'hadoop'@'192.168.1.4' IDENTIFIED BY 'hadoop';
mysql> create user 'hadoop'@'192.168.1.5' IDENTIFIED BY 'hadoop';
mysql> create user 'hadoop'@'192.168.1.6' IDENTIFIED BY 'hadoop';
mysql> GRANT ALL PRIVILEGES ON *.* TO 'hadoop'@'192.168.1.2' WITH GRANT OPTION;
mysql> GRANT ALL PRIVILEGES ON *.* TO 'hadoop'@'192.168.1.3' WITH GRANT OPTION;
mysql> GRANT ALL PRIVILEGES ON *.* TO 'hadoop'@'192.168.1.4' WITH GRANT OPTION;
mysql> GRANT ALL PRIVILEGES ON *.* TO 'hadoop'@'192.168.1.5' WITH GRANT OPTION;
mysql> GRANT ALL PRIVILEGES ON *.* TO 'hadoop'@'192.168.1.6' WITH GRANT OPTION;
mysql> flush privileges;
- 192.168.1.2 - 192.168.1.6 are the Hadoop cluster machines.
- 192.168.1.7 is the MySQL server.
(GRANT ALL on *.* is more than Sqoop strictly needs; for anything beyond a test setup, restrict the grant to the databases being imported.)
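The statements above can also be generated in a loop instead of typed five times. A sketch using the same host range, user name, and password as above; pipe its output into `mysql -uroot -p` to apply it:

```shell
# Emit CREATE USER / GRANT statements for every cluster node (.2 through .6).
for i in 2 3 4 5 6; do
  echo "CREATE USER 'hadoop'@'192.168.1.$i' IDENTIFIED BY 'hadoop';"
  echo "GRANT ALL PRIVILEGES ON *.* TO 'hadoop'@'192.168.1.$i' WITH GRANT OPTION;"
done
echo "FLUSH PRIVILEGES;"
```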
Load some production data into MySQL for testing:
mysql -uroot -p<root_pass>
mysql> create database hd_test_db;
mysql> quit;
mysql -uroot -p<root_pass> hd_test_db < hd_test_db_struct.sql
cd /data/backup/hd_test_db
mysqlimport --local -uroot -p<root_pass> hd_test_db *.txt
We'll use this table:
mysql -uroot -p<root_pass>
mysql> use hd_test_db;
mysql> select count(*) from items;
+----------+
| count(*) |
+----------+
| 14822656 |
+----------+
1 row in set (0.00 sec)
Hive
Create a Hive database:
# hive shell
hive> create database hd_test_db;
hive> quit;
Sqoop
Now run the import:
# sudo -u hdfs sqoop import --connect jdbc:mysql://192.168.1.7/hd_test_db --username hadoop --password hadoop --table items --target-dir /user/hive/warehouse/hd_test_db --fields-terminated-by '\t' --hive-import --hive-overwrite --create-hive-table --hive-table hd_test_db.items
Each option is described in the output of sqoop help import.
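Passing the password with --password puts it in the process list and shell history. A sketch of a parameterized variant that keeps it out, using --password-file and -m (number of map tasks), both standard Sqoop 1 options; the HDFS path to the password file is an assumption, and the script only prints the command (a dry run) rather than executing it:

```shell
# Build the import command with the password read from a file in HDFS.
HOST=192.168.1.7
DB=hd_test_db
TABLE=items
CMD="sqoop import --connect jdbc:mysql://$HOST/$DB \
 --username hadoop --password-file /user/hdfs/.mysql.password \
 --table $TABLE -m 4 \
 --fields-terminated-by '\t' \
 --hive-import --hive-overwrite --create-hive-table \
 --hive-table $DB.$TABLE"
echo "$CMD"   # dry run: print the command instead of executing it
```

To actually run it, invoke the printed command under sudo -u hdfs as above.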
After the import finishes, open Hive and check the data:
# sudo -u hdfs hive
hive> show databases;
OK
default
hd_test_db
Time taken: 0.674 seconds, Fetched: 2 row(s)
hive> use hd_test_db;
OK
Time taken: 0.959 seconds
hive> show tables;
OK
items
Time taken: 0.533 seconds, Fetched: 1 row(s)
hive> select count(*) from items;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_1440428885404_0008, Tracking URL = http://hd1.hadoop.com:8088/proxy/application_1440428885404_0008/
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1440428885404_0008
Hadoop job information for Stage-1: number of mappers: 7; number of reducers: 1
2015-08-30 21:45:24,588 Stage-1 map = 0%, reduce = 0%
2015-08-30 21:45:37,436 Stage-1 map = 10%, reduce = 0%, Cumulative CPU 9.43 sec
2015-08-30 21:45:39,577 Stage-1 map = 29%, reduce = 0%, Cumulative CPU 12.02 sec
2015-08-30 21:45:50,259 Stage-1 map = 57%, reduce = 0%, Cumulative CPU 21.97 sec
2015-08-30 21:46:02,077 Stage-1 map = 62%, reduce = 0%, Cumulative CPU 26.61 sec
2015-08-30 21:46:03,132 Stage-1 map = 86%, reduce = 0%, Cumulative CPU 33.64 sec
2015-08-30 21:46:09,538 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 35.34 sec
2015-08-30 21:46:15,921 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 37.94 sec
MapReduce Total cumulative CPU time: 37 seconds 940 msec
Ended Job = job_1440428885404_0008
MapReduce Jobs Launched:
Job 0: Map: 7 Reduce: 1 Cumulative CPU: 37.94 sec HDFS Read: 1575346947 HDFS Write: 9 SUCCESS
Total MapReduce CPU Time Spent: 37 seconds 940 msec
OK
14822656
Time taken: 72.172 seconds, Fetched: 1 row(s)
Join queries
Import two tables and run a join across them:
# sudo -u hdfs sqoop import --connect jdbc:mysql://192.168.1.7/hd_test_db --username hadoop --password hadoop --table users --target-dir /user/hive/warehouse/hd_test_db --fields-terminated-by '\t' --hive-import --hive-overwrite --create-hive-table --hive-table hd_test_db.users
# sudo -u hdfs sqoop import --connect jdbc:mysql://192.168.1.7/hd_test_db --username hadoop --password hadoop --table pays --target-dir /user/hive/warehouse/hd_test_db --fields-terminated-by '\t' --hive-import --hive-overwrite --create-hive-table --hive-table hd_test_db.pays
To run these directly as the root user, you first need to create a user directory for root in HDFS:
# sudo -u hdfs hadoop fs -mkdir /user/root
# sudo -u hdfs hadoop fs -chown root:root /user/root
Run the join query in Hive:
# hive
hive> use hd_test_db;
hive> describe users;
OK
pid bigint None
user_id bigint None
user_name string None
create_time int None
sex string None
auto int None
Time taken: 0.098 seconds, Fetched: 6 row(s)
hive> describe pays;
OK
user_id bigint None
user_pay int None
zero_dateline int None
Time taken: 0.079 seconds, Fetched: 3 row(s)
hive> select t2.user_pay, t1.user_name from users t1 join pays t2 on t1.user_id = t2.user_id order by t2.user_pay desc limit 10;
Elapsed time, with 26,464 rows in users and 54,316 in pays:
Time taken: 47.608 seconds, Fetched: 10 row(s)
Of course, tables this small don't show off Hadoop's strengths at all. I'll find time to test with a much larger table.