spark sql
重要
shark
- driver 用来解析器
task:
- ddltask/sparkddltask ->metastore.api
- functiontask curd udf/udtf/udaf
- explaintask/sparkexpaintask (physical plan)
- mapredtak/sparktask/mapredlocaltask
- movetask
- fetchtask
sql parser:
- 语义解析(sqlstring 词法分析,ast语法分析)
- 语义分析(ast-metastore出来的叫logical plan逻辑计划) shark-hive sqlparser
query optimizer :(优化)
- 实际情况改写逻辑计划
- 复用hive逻辑:mapjoin 避免shuffle 把join变mapjoin
physical plan tree:
- execution operator1-nmapreduce
- logical plan = DAG ,生成SPARK RDD
使用查看:
EXPLAIN [exended|dependency]
catalystvs shark
功能 | catalyst | shark |
---|---|---|
parser | sql parse n->1catalyst logical plan tree | shark则就一个hive sql parse |
bind | 语义分析 spark基于内存元数据管理和metastore | shark metastore |
optimizer | 自己的rule | hive的 |
physical | 策略及rdd和mapreduce | rdd |
整体 | scala:implicit conversion pattern matching partial function ->dslapi |
实现
sql dsl api
- sqlcontext,hivecontext
- sql,tablename,json,parquet,rdd :schemaRdd
- cacheTable,unCacheTaable,registerRDDAsTable(),isCached memory_only
- logicalPlanToSparkQuery(plan:LogicalPlan):SchemaRDD
- SchemaRDD:onetoone,row,sqlcontext,logical,select
- Row类似resultset
- 数据类型:
- dsp:select where join orderby groupby as unionall
- dslapi表达式 列名和普通变量名:'coll 隐式转换symbolToUnresolvedAttribute函数
- parquet:列存储,thrift
看图: