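Background

Kudu is a project that Cloudera contributed to Apache, billed as the next generation of Hadoop storage. It is still in beta, but some companies are already using it. Xiaomi, for example, has publicly endorsed Kudu, and Xiaomi and Cloudera have collaborated on it.

I first heard of it while interviewing a Hadoop engineer. We were using Hive at the time, and modifying data in Hive is awkward because you can only load data partition by partition. I asked the candidate whether he knew a better solution, and he mentioned Kudu.

Back then Kudu was only at version 0.5 and felt very immature, so I never bothered to look into it. Recently I noticed that Kudu 1.1.0 has been released and it seems reasonably stable now, so I decided to try building something with it.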

Apache Kudu

Kudu is a columnar storage engine built for the Hadoop ecosystem. The features claimed by the official documentation:

Fast processing of OLAP workloads.
Integration with MapReduce, Spark and other Hadoop ecosystem components.
Tight integration with Cloudera Impala, making it a good, mutable alternative to using HDFS with Parquet.
Strong but flexible consistency model, allowing you to choose consistency requirements on a per-request basis, including the option for strict-serializable consistency.
Strong performance for running sequential and random workloads simultaneously.
Easy to administer and manage with Cloudera Manager.
High availability. Tablet Servers and Masters use the Raft Consensus Algorithm, which ensures that as long as more than half the total number of replicas is available, the tablet is available for reads and writes. For instance, if 2 out of 3 replicas or 3 out of 5 replicas are available, the tablet is available.
Reads can be serviced by read-only follower tablets, even in the event of a leader tablet failure.
Structured data model.
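
The availability rule quoted in the feature list comes down to a strict-majority check. Here is a minimal Python sketch of that arithmetic; the function name is my own and is not part of any Kudu API.

```python
def tablet_available(available_replicas: int, total_replicas: int) -> bool:
    """Return True when a strict majority of replicas is reachable.

    Illustrative only: this mirrors the "more than half" rule quoted
    from the documentation and is not Kudu code.
    """
    if total_replicas <= 0 or not 0 <= available_replicas <= total_replicas:
        raise ValueError("replica counts out of range")
    return available_replicas * 2 > total_replicas


# The two cases from the documentation, plus two that fail the rule.
for available, total in [(2, 3), (3, 5), (1, 3), (2, 4)]:
    print(available, total, tablet_available(available, total))
```

Note that 2 out of 4 is exactly half, so it does not count as a majority.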

The official documentation also lists some use cases:

Reporting applications where newly-arrived data needs to be immediately available for end users
Time-series applications that must simultaneously support:
- queries across large amounts of historic data
- granular queries about an individual entity that must return very quickly
Applications that use predictive models to make real-time decisions with periodic refreshes of the predictive model based on all historic data

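Overall, the Kudu + Impala combination claims to bring traditional OLAP and OLTP together: it offers analytics over massive data sets while also allowing inserts, updates, and deletes. I previously used Phoenix, the SQL engine on top of HBase, and it was quite slow, which made it hard to build statistics and analytics applications where user experience matters. So my goal in trying out Kudu + Impala this time is to see whether Kudu can provide massive data storage and still deliver good query speed.

Installation

Installing Kudu has two parts: installing Kudu itself, and installing the Kudu-enabled build of Impala. We run CDH 5.8 at my company, so installation is fairly simple. Note that the Impala bundled with CDH does not support Kudu, so you need to install impala_kudu instead. In my case I removed the existing Impala and installed from scratch.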

kudu

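Installing Kudu is straightforward: add the parcel and you are done, with no special settings needed. The only things to watch are the fs_wal_dir and fs_data_dirs parameters: the former may have the same value as the latter, but it must not be a subdirectory of the latter. Also, if a Kudu master and a Kudu tablet server run on the same machine, the two services must be given different values for these parameters; otherwise the services will fail to start.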
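
The directory rules above are easy to get wrong, so here is a rough Python sketch that checks a planned layout before you type it into Cloudera Manager. The function names and example paths are made up for illustration; the only assumption taken from Kudu itself is that fs_data_dirs is a comma-separated list.

```python
from pathlib import PurePosixPath
from typing import List


def check_dirs(fs_wal_dir: str, fs_data_dirs: str) -> List[str]:
    """Return a list of problems; an empty list means the layout looks fine."""
    wal = PurePosixPath(fs_wal_dir)
    problems = []
    for raw in fs_data_dirs.split(","):
        data = PurePosixPath(raw.strip())
        # Equal paths are allowed; only a strict subdirectory is rejected.
        if data in wal.parents:
            problems.append(f"{wal} is a subdirectory of {data}")
    return problems


def shared_dirs(master_dirs: List[str], tserver_dirs: List[str]) -> List[str]:
    """Directories claimed by both a master and a tablet server on one host."""
    master = {PurePosixPath(d) for d in master_dirs}
    tserver = {PurePosixPath(d) for d in tserver_dirs}
    return sorted(str(p) for p in master & tserver)


# Hypothetical layout: the WAL directory sits inside a data directory.
print(check_dirs("/data/kudu/wal", "/data/kudu"))
# Hypothetical layout: master and tablet server pointed at the same path.
print(shared_dirs(["/data/kudu"], ["/data/kudu/"]))
```

Both calls report a problem for these sample paths. Giving the master and the tablet server separate directories (say, one directory per role) makes both checks come back empty.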

impala_kudu

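I ran into more problems installing impala_kudu. Kudu releases move quickly, so the documentation lags behind: the service started fine, but creating tables ran into some problems. Our previous Impala setup used Sentry for access control, but it was a pain to configure, so I dropped it this time.

Two things need to be configured:

Impala Service Environment Advanced Configuration Snippet (Safety Valve): add IMPALA_KUDU=1. I could not find this setting when searching in the Chinese UI; it only showed up after I switched to the English UI.
Impala Daemon Command Line Argument Advanced Configuration Snippet (Safety Valve): set the master address, for example -kudu_master_hosts=10.0.1.1:7051. I have no idea why Cloudera Manager does not configure this by itself.

Also, when starting impala-shell, you need to launch the impala_kudu version. You can switch to it with alternatives: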

alternatives --display impala-shell
alternatives --set impala-shell /opt/cloudera/parcels/IMPALA_KUDU-2.7.0-3.cdh5.9.0.p0.10/bin/impala-shell

Verification

Once that is done, start impala-shell and run select if(version() like '%KUDU%', "all set to go!", "check your configs") as s;. If you see

Query: select if(version() like '%KUDU%', "all set to go!", "check your
configs") as s

+----------------+
| s              |
+----------------+
| all set to go! |
+----------------+
Fetched 1 row(s) in 0.02s

then it is ready to use.
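
If you want to script this check rather than eyeball it, something like the following Python sketch might do. It assumes impala-shell is on the PATH and that an impalad listens on localhost:21000; both function names are my own, and only the -i, -B and -q options of impala-shell are used.

```python
import subprocess

CHECK_QUERY = "select version()"


def is_kudu_build(version_text: str) -> bool:
    """Same test as version() like '%KUDU%': a case-sensitive substring match."""
    return "KUDU" in version_text


def fetch_version(impalad: str = "localhost:21000") -> str:
    """Run the query through impala-shell; -B asks for plain delimited output."""
    result = subprocess.run(
        ["impala-shell", "-i", impalad, "-B", "-q", CHECK_QUERY],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Calling is_kudu_build(fetch_version()) on a machine with the impala_kudu shell installed should then give the same answer as the query above.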

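Creating tables

The CREATE TABLE statements in the Kudu documentation no longer work on the latest version. The IMPALA-2848 issue "simplified" the CREATE TABLE syntax, and the old form stopped working... Here is an example in the new syntax: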

CREATE TABLE realtime.binlog (
    pk STRING PRIMARY KEY,
    exec_time STRING,
    db_name STRING,
    table_name STRING,
    event_type STRING,
    entry_type STRING,
    is_ddl BOOLEAN,
    before STRING,
    after STRING,
    ts BIGINT,
    salt BIGINT)
DISTRIBUTE BY HASH INTO 64 BUCKETS
STORED AS KUDU;

STORED AS KUDU replaces the old approach based on table properties.
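
To get a feel for what DISTRIBUTE BY HASH INTO 64 BUCKETS does, here is a toy Python sketch that spreads keys over 64 buckets. It assumes the hash is taken over the primary key pk, and it uses CRC32 purely as a stand-in: Kudu's real hash function is different, and the row keys are invented.

```python
import zlib
from collections import Counter

NUM_BUCKETS = 64  # matches INTO 64 BUCKETS in the statement above


def bucket_for(pk: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a primary key to a bucket. CRC32 is a stand-in, not Kudu's hash."""
    return zlib.crc32(pk.encode("utf-8")) % num_buckets


# Hypothetical sequential keys, the kind that would all land next to
# each other if rows were placed by key order instead of by hash.
counts = Counter(bucket_for(f"row-{i:08d}") for i in range(64000))
print(len(counts), min(counts.values()), max(counts.values()))
```

The printed numbers are the count of buckets that received rows and the smallest and largest bucket sizes, which gives a quick read on how evenly the writes would be spread.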