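Spent all of yesterday tinkering with this, so here are some quick notes.
- Avro: a serialization and storage file format commonly used on Hadoop, alongside the plain files you usually keep in HDFS. Put simply, it is a binary file that is read against a predefined schema.
- Avro schema: the schema itself is written in JSON. For example, this one: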
    {
      "type" : "record",
      "name" : "Tweet",
      "namespace" : "com.miguno.avro",
      "fields" : [
        { "name" : "username", "type" : "string", "doc" : "Name of the user account on Twitter.com" },
        { "name" : "tweet", "type" : "string", "doc" : "The content of the user's Twitter message" },
        { "name" : "timestamp", "type" : "long", "doc" : "Unix epoch time in seconds" }
      ],
      "doc" : "A basic schema for storing Twitter messages"
    }
- Once the schema is defined, you can build against it in Java...
- Avro to Hive: you can create a Hive external table directly on top of the Avro files. (Same link as above.)
    CREATE EXTERNAL TABLE tweets
      COMMENT "A table backed by Avro data with the Avro schema stored in HDFS"
      ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
      STORED AS
        INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
        OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
      LOCATION '/user/YOURUSER/examples/input/'
      TBLPROPERTIES (
        'avro.schema.url'='hdfs:///user/YOURUSER/examples/schema/twitter.avsc'
      );
After that, you query it like any other Hive table.
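One more note on the "binary file read against a schema" point, since it took me a while to get. Below is a rough, hand-rolled Python sketch of how a single Tweet record from the schema above could be laid out in Avro's binary encoding: a `long` is zigzag-mapped and written as a variable-length integer, a `string` is a `long` byte count followed by UTF-8 bytes, and a record is just its fields back to back in schema order. This is not the Avro library. The function names and the sample tweet are made up for illustration, and a real Avro data file also wraps records in a container with a header (which embeds the schema) and sync markers, which this sketch skips.

```python
import io

# Field order and types copied by hand from the Tweet schema above.
TWEET_FIELDS = [("username", "string"), ("tweet", "string"), ("timestamp", "long")]


def encode_long(n: int) -> bytes:
    """Zigzag-map a signed 64-bit value, then emit it as a base-128 varint."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # high bit set: more bytes follow
        z >>= 7
    out.append(z)
    return bytes(out)


def decode_long(buf: io.BytesIO) -> int:
    """Read one varint from buf and undo the zigzag mapping."""
    z, shift = 0, 0
    while True:
        b = buf.read(1)[0]
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1)


def encode_string(s: str) -> bytes:
    """Length (as an Avro long) followed by the UTF-8 bytes."""
    raw = s.encode("utf-8")
    return encode_long(len(raw)) + raw


def decode_string(buf: io.BytesIO) -> str:
    return buf.read(decode_long(buf)).decode("utf-8")


def encode_tweet(record: dict) -> bytes:
    """Concatenate the encoded fields in schema order; no field names are stored."""
    parts = []
    for name, avro_type in TWEET_FIELDS:
        encoder = encode_string if avro_type == "string" else encode_long
        parts.append(encoder(record[name]))
    return b"".join(parts)


def decode_tweet(data: bytes) -> dict:
    """The bytes only make sense when read back with the same field list."""
    buf = io.BytesIO(data)
    record = {}
    for name, avro_type in TWEET_FIELDS:
        decoder = decode_string if avro_type == "string" else decode_long
        record[name] = decoder(buf)
    return record


# Hypothetical sample record, not taken from any real dataset.
sample = {"username": "alice", "tweet": "hello avro", "timestamp": 1366150681}
encoded = encode_tweet(sample)
print(encoded)
print(decode_tweet(encoded))
```

The thing to notice is that the encoded bytes carry no field names or type tags at all, which is why the reader (Hive included) needs the schema to make sense of them.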