Preliminary thoughts on integrating hive with tsdb
hive can read data stored in hbase, but that requires StorageHandlers, and I have not had time to study that part yet. So, building on what I have recently learned about hive, I will first try exporting the tsdb data to a text file and then loading that file into a hive table.
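For reference, the StorageHandlers route would eventually use DDL roughly like the following sketch. The table name, column family, and column names here are only placeholders; whether tsdb's binary row keys and qualifiers can be mapped this cleanly is exactly what still needs to be investigated:
hive> create external table hbase_demo (rowkey string, val string)
> stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> with serdeproperties ("hbase.columns.mapping" = ":key,cf:val")
> tblproperties ("hbase.table.name" = "demo_table");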
tsdb data query interface
Use the tsdb query interface:
# ./tsdb query
Not enough arguments.
Usage: query [Gnuplot opts] START-DATE [END-DATE] <query> [queries...]
A query has the form:
FUNC [rate] [counter,max,reset] [downsample FUNC N] SERIES [TAGS]
For example:
2010/03/11-20:57 sum my.awsum.metric host=blah sum some.other.metric host=blah state=foo
Dates must follow this format: YYYY/MM/DD-HH:MM[:SS] or Unix Epoch
or relative time such as 1y-ago, 2d-ago, etc.
Supported values for FUNC: [min, mimmin, max, mimmax, dev, sum, avg, zimsum]
Gnuplot options are of the form: +option=value
--config=PATH Path to a configuration file (default: Searches for file see docs).
--graph=BASEPATH Output data points to a set of files for gnuplot. The path of the output files will start with BASEPATH.
--table=TABLE Name of the HBase table where to store the time series (default: tsdb).
--uidtable=TABLE Name of the HBase table to use for Unique IDs (default: tsdb-uid).
--verbose Print more logging messages and not just errors.
--zkbasedir=PATH Path under which is the znode for the -ROOT- region (default: /hbase).
--zkquorum=SPEC Specification of the ZooKeeper quorum to use (default: localhost).
-v Short for --verbose.
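For example, a command along these lines (the time range and metric name are just illustrations, matching the sample line below) dumps the data points into a text file:
# ./tsdb query 1d-ago sum xxxx.yyyy > /data/dump.txt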
Redirect the output to a text file; each line has this format:
metric timestamp value {tag1=tag_value1, tag2=tag_value2, ...}
xxxx.yyyy 1441411568000 25866 {unit=u_entry, project=xxxx, role=yyyy}
Converting the text format
The field delimiter needs to be changed from a space to a tab (\t). Note that the tags are also separated by spaces, so a global sed substitution cannot be used; only the first three spaces on each line should be converted:
sed -i -e 's/ /\t/1' -e 's/ /\t/1' -e 's/ /\t/1' /data/dump.txt
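If the repeated sed expressions feel awkward, an equivalent awk one-liner does the same thing, converting just the first three spaces of every line (shown only as an alternative sketch; the output path is arbitrary):
# awk '{sub(/ /,"\t"); sub(/ /,"\t"); sub(/ /,"\t"); print}' /data/dump.txt > /data/dump.tsv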
Creating the hive table
hive> create table tsdb (metric string, ts bigint, value float, tags string)
> row format delimited
> fields terminated by '\t';
OK
Time taken: 0.756 seconds
hive> show tables;
OK
tsdb
Time taken: 0.386 seconds, Fetched: 1 row(s)
hive> describe extended tsdb;
OK
metric string None
ts bigint None
value float None
tags string None
Detailed Table Information Table(tableName:tsdb, dbName:default, owner:root, createTime:1441624414, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:metric, type:string, comment:null), FieldSchema(name:timestamp, type:int, comment:null), FieldSchema(name:value, type:float, comment:null), FieldSchema(name:tags, type:string, comment:null)], location:hdfs://hd1.mc.com:8020/user/hive/warehouse/tsdb, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format= , field.delim=
Time taken: 0.104 seconds, Fetched: 6 row(s)
Load the text data into the hive table:
hive> load data local inpath "/data/dump.txt" into table tsdb;
Copying data from file:/data/dump.txt
Copying file: file:/data/dump.txt
Loading data to table default.tsdb
chgrp: changing ownership of '/user/hive/warehouse/tsdb/dump.txt': User does not belong to hive
Table default.tsdb stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 1288241, raw_data_size: 0]
OK
Time taken: 1.712 seconds
Verifying the import
# wc -l /data/dump.txt
13561 /data/dump.txt
# awk -F'\t' '{if($3>max)max=$3}END{print max}' /data/dump.txt
40329
Check whether hive gives the same results:
hive> select count(1) from tsdb;
...
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 5.54 sec HDFS Read: 1288470 HDFS Write: 6 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 540 msec
OK
13561
Time taken: 39.406 seconds, Fetched: 1 row(s)
hive> select max(value) from tsdb;
...
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 6.24 sec HDFS Read: 1288470 HDFS Write: 8 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 240 msec
OK
40329.0
Time taken: 38.202 seconds, Fetched: 1 row(s)
The two match, so loading the data into hive works fine.
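Since the tags are kept as a single string, per-tag analysis is still possible with hive's string functions. A rough sketch, using the project tag from the sample line above (regexp_extract is a built-in hive UDF):
hive> select regexp_extract(tags, 'project=([^,}]+)', 1) as project, max(value)
> from tsdb
> group by regexp_extract(tags, 'project=([^,}]+)', 1);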
impala
hive is still too slow, so let's test with impala:
[root@hd2 ~]# impala-shell
Starting Impala Shell without Kerberos authentication
Connected to hd2.mc.com:21000
Server version: impalad version 1.4.0-cdh5-INTERNAL RELEASE (build e801bd8c0d134e783c2313c7dd422a5ad06591af)
Welcome to the Impala shell. Press TAB twice to see a list of available commands.
Copyright (c) 2012 Cloudera, Inc. All rights reserved.
(Shell build version: Impala Shell v1.4.0-cdh5-INTERNAL (e801bd8) built on Sat Jul 12 06:45:04 PDT 2014)
[hd2.mc.com:21000] > invalidate metadata;
Query: invalidate metadata
Returned 0 row(s) in 1.03s
[hd2.mc.com:21000] > show tables;
Query: show tables
+------+
| name |
+------+
| tsdb |
+------+
Returned 1 row(s) in 0.01s
[hd2.mc.com:21000] > select count(1) from tsdb;
Query: select count(1) from tsdb
+----------+
| count(1) |
+----------+
| 13561 |
+----------+
Returned 1 row(s) in 0.26s
[hd2.mc.com:21000] > select max(value) from tsdb;
Query: select max(value) from tsdb
+------------+
| max(value) |
+------------+
| 40329 |
+------------+
Returned 1 row(s) in 0.32s
As expected, it is much faster.