spark sql


spark sql

重要

shark

  • driver 用来解析器

task:

  • ddltask/sparkddltask ->metastore.api
  • functiontask curd udf/udtf/udaf
  • explaintask/sparkexpaintask (physical plan)
  • mapredtak/sparktask/mapredlocaltask
  • movetask
  • fetchtask

sql parser:

  • 语义解析(sqlstring 词法分析,ast语法分析)
  • 语义分析(ast-metastore出来的叫logical plan逻辑计划) shark-hive sqlparser

query optimizer :(优化)

  • 实际情况改写逻辑计划
  • 复用hive逻辑:mapjoin 避免shuffle 把join变mapjoin

physical plan tree:

  • execution operator1-nmapreduce
  • logical plan = DAG ,生成SPARK RDD

使用查看:

EXPLAIN [exended|dependency]

catalystvs shark

功能 catalyst shark
parser sql parse n->1catalyst logical plan tree shark则就一个hive sql parse
bind 语义分析 spark基于内存元数据管理和metastore shark metastore
optimizer 自己的rule hive的
physical 策略及rdd和mapreduce rdd
整体 scala:implicit conversion pattern matching partial function ->dslapi

实现

sql dsl api

  • sqlcontext,hivecontext
  • sql,tablename,json,parquet,rdd :schemaRdd
  • cacheTable,unCacheTaable,registerRDDAsTable(),isCached memory_only
  • logicalPlanToSparkQuery(plan:LogicalPlan):SchemaRDD
  • SchemaRDD:onetoone,row,sqlcontext,logical,select
  • Row类似resultset
  • 数据类型:
  • dsp:select where join orderby groupby as unionall
  • dslapi表达式 列名和普通变量名:'coll 隐式转换symbolToUnresolvedAttribute函数
  • parquet:列存储,thrift

看图:

大山 /
Published under (CC) BY-NC-SA in categories 逻辑与现象  spark  tagged with spark  sql  sparksql