网站你啦怎样做旺仔饼干,网站建设有几种方法,网站购物车功能怎么做,网站建设需要匹配人员测试环境#xff1a;hive on spark spark版本#xff1a;3.3.1 一、执行时机二、对结果集的影响三、效率对比1.内连接1#xff09;on2#xff09;where 2.外连接1#xff09;on2#xff09;where 四、总结PS 一、执行时机
sql连接中#xff0c;where属于过滤条件#… 测试环境hive on spark spark版本3.3.1 一、执行时机二、对结果集的影响三、效率对比1.内连接1on2where 2.外连接1on2where 四、总结PS 一、执行时机
sql连接中where属于过滤条件用于对join的结果集进行过滤所以理论上的执行时机在join之后。on属于关联条件决定了满足什么样条件的数据才可以被关联到一起因此理论上的执行时机在join时。
但是大多数数据库系统为了提升效率都采用了一些优化技术思想都是将where中的筛选条件或是on中的关联条件尽可能的提前到数据源侧进行筛选目的是减少参与关联的数据量。因此它们实际的执行时机大多时候和理论上的不同。
二、对结果集的影响
内连接中条件放在where或者on中对结果集无影响。
外连接中以左外连接为例因为左外连接是完全保留左表记录on在join时生效因此最终的结果集也会保留左表的全部记录。where是对join后的结果集进行操作所以会过滤掉一些数据导致二者的结果集不相同。
三、效率对比
测试数据量如下 poi_data.poi_res表数据量8300W bi_report.mon_ronghe_pv表分区表总数据量120E本次采用分区20240522的数据关联数据量5900W其中 bid like ‘1%’ pv100 的数据量120W 两表的关联字段均无重复值。 1.内连接
1on
selectt1.bid,t1.name,t1.point_x,t1.point_y,t2.pv
from poi_data.poi_res t1
join (select bid, pv from bi_report.mon_ronghe_pv where event_day20240522) t2
on t1.bidt2.bid
and t2.bid like 1% and t2.pv100;Physical Plan
AdaptiveSparkPlan (28)
- Final Plan CollectLimit (17)- * Project (16)- * SortMergeJoin Inner (15):- * Sort (6): - AQEShuffleRead (5): - ShuffleQueryStage (4), Statistics(sizeInBytes5.3 GiB, rowCount4.57E7): - Exchange (3): - * Filter (2): - Scan hive poi_data.poi_res (1)- * Sort (14)- AQEShuffleRead (13)- ShuffleQueryStage (12), Statistics(sizeInBytes58.5 MiB, rowCount1.28E6)- Exchange (11)- * Project (10)- * Filter (9)- * ColumnarToRow (8)- Scan parquet bi_report.mon_ronghe_pv (7)
- Initial Plan CollectLimit (27)- Project (26)- SortMergeJoin Inner (25):- Sort (20): - Exchange (19): - Filter (18): - Scan hive poi_data.poi_res (1)- Sort (24)- Exchange (23)- Project (22)- Filter (21)- Scan parquet bi_report.mon_ronghe_pv (7)(1) Scan hive poi_data.poi_res
Output [4]: [bid#297, name#299, point_x#316, point_y#317]
Arguments: [bid#297, name#299, point_x#316, point_y#317], HiveTableRelation [poi_data.poi_res, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [bid#297, type#298, name#299, address#300, phone#301, alias#302, post_code#303, catalog_id#304, c..., Partition Cols: []](2) Filter [codegen id : 1]
Input [4]: [bid#297, name#299, point_x#316, point_y#317]
Condition : (StartsWith(bid#297, 1) AND isnotnull(bid#297))(3) Exchange
Input [4]: [bid#297, name#299, point_x#316, point_y#317]
Arguments: hashpartitioning(bid#297, 600), ENSURE_REQUIREMENTS, [plan_id774](4) ShuffleQueryStage
Output [4]: [bid#297, name#299, point_x#316, point_y#317]
Arguments: 0(5) AQEShuffleRead
Input [4]: [bid#297, name#299, point_x#316, point_y#317]
Arguments: coalesced(6) Sort [codegen id : 3]
Input [4]: [bid#297, name#299, point_x#316, point_y#317]
Arguments: [bid#297 ASC NULLS FIRST], false, 0(7) Scan parquet bi_report.mon_ronghe_pv
Output [3]: [bid#334, pv#335, event_day#338]
Batched: true
Location: InMemoryFileIndex [afs://kunpeng.afs.baidu.com:9902/user/g_spark_rdw/rdw/poi_engine/warehouse/bi_report.db/mon_ronghe_pv/event_day20240522]
PartitionFilters: [isnotnull(event_day#338), (event_day#338 20240522)]
PushedFilters: [IsNotNull(bid), IsNotNull(pv), StringStartsWith(bid,1), GreaterThan(pv,100)]
ReadSchema: structbid:string,pv:int(8) ColumnarToRow [codegen id : 2]
Input [3]: [bid#334, pv#335, event_day#338](9) Filter [codegen id : 2]
Input [3]: [bid#334, pv#335, event_day#338]
Condition : (((isnotnull(bid#334) AND isnotnull(pv#335)) AND StartsWith(bid#334, 1)) AND (pv#335 100))(10) Project [codegen id : 2]
Output [2]: [bid#334, pv#335]
Input [3]: [bid#334, pv#335, event_day#338](11) Exchange
Input [2]: [bid#334, pv#335]
Arguments: hashpartitioning(bid#334, 600), ENSURE_REQUIREMENTS, [plan_id799](12) ShuffleQueryStage
Output [2]: [bid#334, pv#335]
Arguments: 1(13) AQEShuffleRead
Input [2]: [bid#334, pv#335]
Arguments: coalesced(14) Sort [codegen id : 4]
Input [2]: [bid#334, pv#335]
Arguments: [bid#334 ASC NULLS FIRST], false, 0(15) SortMergeJoin [codegen id : 5]
Left keys [1]: [bid#297]
Right keys [1]: [bid#334]
Join condition: None(16) Project [codegen id : 5]
Output [5]: [bid#297, name#299, point_x#316, point_y#317, pv#335]
Input [6]: [bid#297, name#299, point_x#316, point_y#317, bid#334, pv#335](17) CollectLimit
Input [5]: [bid#297, name#299, point_x#316, point_y#317, pv#335]
Arguments: 1000(18) Filter
Input [4]: [bid#297, name#299, point_x#316, point_y#317]
Condition : (StartsWith(bid#297, 1) AND isnotnull(bid#297))(19) Exchange
Input [4]: [bid#297, name#299, point_x#316, point_y#317]
Arguments: hashpartitioning(bid#297, 600), ENSURE_REQUIREMENTS, [plan_id759](20) Sort
Input [4]: [bid#297, name#299, point_x#316, point_y#317]
Arguments: [bid#297 ASC NULLS FIRST], false, 0(21) Filter
Input [3]: [bid#334, pv#335, event_day#338]
Condition : (((isnotnull(bid#334) AND isnotnull(pv#335)) AND StartsWith(bid#334, 1)) AND (pv#335 100))(22) Project
Output [2]: [bid#334, pv#335]
Input [3]: [bid#334, pv#335, event_day#338](23) Exchange
Input [2]: [bid#334, pv#335]
Arguments: hashpartitioning(bid#334, 600), ENSURE_REQUIREMENTS, [plan_id760](24) Sort
Input [2]: [bid#334, pv#335]
Arguments: [bid#334 ASC NULLS FIRST], false, 0(25) SortMergeJoin
Left keys [1]: [bid#297]
Right keys [1]: [bid#334]
Join condition: None(26) Project
Output [5]: [bid#297, name#299, point_x#316, point_y#317, pv#335]
Input [6]: [bid#297, name#299, point_x#316, point_y#317, bid#334, pv#335](27) CollectLimit
Input [5]: [bid#297, name#299, point_x#316, point_y#317, pv#335]
Arguments: 1000(28) AdaptiveSparkPlan
Output [5]: [bid#297, name#299, point_x#316, point_y#317, pv#335]
Arguments: isFinalPlantrue从物理执行计划可以看到第2步中的Filter使用条件Condition : (StartsWith(bid#297, 1) AND isnotnull(bid#297))在t1表读取源数据时进行了过滤在第7步中通过谓词下推在t2表scan源数据时使用条件PushedFilters: [IsNotNull(bid), IsNotNull(pv), StringStartsWith(bid,1), GreaterThan(pv,100)]进行了过滤两表都是在数据源侧进行的数据过滤减少了shuffle和参与join的数据量。
2where
selectt1.bid,t1.name,t1.point_x,t1.point_y,t2.pv
from poi_data.poi_res t1
join (select bid, pv from bi_report.mon_ronghe_pv where event_day20240522) t2
on t1.bidt2.bid
where t2.bid like 1% and t2.pv100;Physical Plan
AdaptiveSparkPlan (28)
- Final Plan CollectLimit (17)- * Project (16)- * SortMergeJoin Inner (15):- * Sort (6): - AQEShuffleRead (5): - ShuffleQueryStage (4), Statistics(sizeInBytes5.3 GiB, rowCount4.57E7): - Exchange (3): - * Filter (2): - Scan hive poi_data.poi_res (1)- * Sort (14)- AQEShuffleRead (13)- ShuffleQueryStage (12), Statistics(sizeInBytes58.5 MiB, rowCount1.28E6)- Exchange (11)- * Project (10)- * Filter (9)- * ColumnarToRow (8)- Scan parquet bi_report.mon_ronghe_pv (7)
- Initial Plan CollectLimit (27)- Project (26)- SortMergeJoin Inner (25):- Sort (20): - Exchange (19): - Filter (18): - Scan hive poi_data.poi_res (1)- Sort (24)- Exchange (23)- Project (22)- Filter (21)- Scan parquet bi_report.mon_ronghe_pv (7)(1) Scan hive poi_data.poi_res
Output [4]: [bid#350, name#352, point_x#369, point_y#370]
Arguments: [bid#350, name#352, point_x#369, point_y#370], HiveTableRelation [poi_data.poi_res, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [bid#350, type#351, name#352, address#353, phone#354, alias#355, post_code#356, catalog_id#357, c..., Partition Cols: []](2) Filter [codegen id : 1]
Input [4]: [bid#350, name#352, point_x#369, point_y#370]
Condition : (StartsWith(bid#350, 1) AND isnotnull(bid#350))(3) Exchange
Input [4]: [bid#350, name#352, point_x#369, point_y#370]
Arguments: hashpartitioning(bid#350, 600), ENSURE_REQUIREMENTS, [plan_id908](4) ShuffleQueryStage
Output [4]: [bid#350, name#352, point_x#369, point_y#370]
Arguments: 0(5) AQEShuffleRead
Input [4]: [bid#350, name#352, point_x#369, point_y#370]
Arguments: coalesced(6) Sort [codegen id : 3]
Input [4]: [bid#350, name#352, point_x#369, point_y#370]
Arguments: [bid#350 ASC NULLS FIRST], false, 0(7) Scan parquet bi_report.mon_ronghe_pv
Output [3]: [bid#387, pv#388, event_day#391]
Batched: true
Location: InMemoryFileIndex [afs://kunpeng.afs.baidu.com:9902/user/g_spark_rdw/rdw/poi_engine/warehouse/bi_report.db/mon_ronghe_pv/event_day20240522]
PartitionFilters: [isnotnull(event_day#391), (event_day#391 20240522)]
PushedFilters: [IsNotNull(bid), IsNotNull(pv), StringStartsWith(bid,1), GreaterThan(pv,100)]
ReadSchema: structbid:string,pv:int(8) ColumnarToRow [codegen id : 2]
Input [3]: [bid#387, pv#388, event_day#391](9) Filter [codegen id : 2]
Input [3]: [bid#387, pv#388, event_day#391]
Condition : (((isnotnull(bid#387) AND isnotnull(pv#388)) AND StartsWith(bid#387, 1)) AND (pv#388 100))(10) Project [codegen id : 2]
Output [2]: [bid#387, pv#388]
Input [3]: [bid#387, pv#388, event_day#391](11) Exchange
Input [2]: [bid#387, pv#388]
Arguments: hashpartitioning(bid#387, 600), ENSURE_REQUIREMENTS, [plan_id933](12) ShuffleQueryStage
Output [2]: [bid#387, pv#388]
Arguments: 1(13) AQEShuffleRead
Input [2]: [bid#387, pv#388]
Arguments: coalesced(14) Sort [codegen id : 4]
Input [2]: [bid#387, pv#388]
Arguments: [bid#387 ASC NULLS FIRST], false, 0(15) SortMergeJoin [codegen id : 5]
Left keys [1]: [bid#350]
Right keys [1]: [bid#387]
Join condition: None(16) Project [codegen id : 5]
Output [5]: [bid#350, name#352, point_x#369, point_y#370, pv#388]
Input [6]: [bid#350, name#352, point_x#369, point_y#370, bid#387, pv#388](17) CollectLimit
Input [5]: [bid#350, name#352, point_x#369, point_y#370, pv#388]
Arguments: 1000(18) Filter
Input [4]: [bid#350, name#352, point_x#369, point_y#370]
Condition : (StartsWith(bid#350, 1) AND isnotnull(bid#350))(19) Exchange
Input [4]: [bid#350, name#352, point_x#369, point_y#370]
Arguments: hashpartitioning(bid#350, 600), ENSURE_REQUIREMENTS, [plan_id893](20) Sort
Input [4]: [bid#350, name#352, point_x#369, point_y#370]
Arguments: [bid#350 ASC NULLS FIRST], false, 0(21) Filter
Input [3]: [bid#387, pv#388, event_day#391]
Condition : (((isnotnull(bid#387) AND isnotnull(pv#388)) AND StartsWith(bid#387, 1)) AND (pv#388 100))(22) Project
Output [2]: [bid#387, pv#388]
Input [3]: [bid#387, pv#388, event_day#391](23) Exchange
Input [2]: [bid#387, pv#388]
Arguments: hashpartitioning(bid#387, 600), ENSURE_REQUIREMENTS, [plan_id894](24) Sort
Input [2]: [bid#387, pv#388]
Arguments: [bid#387 ASC NULLS FIRST], false, 0(25) SortMergeJoin
Left keys [1]: [bid#350]
Right keys [1]: [bid#387]
Join condition: None(26) Project
Output [5]: [bid#350, name#352, point_x#369, point_y#370, pv#388]
Input [6]: [bid#350, name#352, point_x#369, point_y#370, bid#387, pv#388](27) CollectLimit
Input [5]: [bid#350, name#352, point_x#369, point_y#370, pv#388]
Arguments: 1000(28) AdaptiveSparkPlan
Output [5]: [bid#350, name#352, point_x#369, point_y#370, pv#388]
Arguments: isFinalPlantrue物理执行计划没有变化因此可以说当数据库支持谓词下推时筛选条件用where还是on没有区别数据库都会在数据源侧进行数据过滤减少参与关联的数据量。
2.外连接
1on
selectt1.bid,t1.name,t1.point_x,t1.point_y,t2.pv
from poi_data.poi_res t1
left join (select bid, pv from bi_report.mon_ronghe_pv where event_day20240522) t2
on t1.bidt2.bid
and t2.bid like 1% and t2.pv100;Physical Plan
AdaptiveSparkPlan (28)
- Final Plan CollectLimit (17)- * Project (16)- * SortMergeJoin LeftOuter (15):- * Sort (6): - AQEShuffleRead (5): - ShuffleQueryStage (4), Statistics(sizeInBytes36.5 MiB, rowCount3.07E5): - Exchange (3): - * LocalLimit (2): - Scan hive poi_data.poi_res (1)- * Sort (14)- AQEShuffleRead (13)- ShuffleQueryStage (12), Statistics(sizeInBytes58.5 MiB, rowCount1.28E6)- Exchange (11)- * Project (10)- * Filter (9)- * ColumnarToRow (8)- Scan parquet bi_report.mon_ronghe_pv (7)
- Initial Plan CollectLimit (27)- Project (26)- SortMergeJoin LeftOuter (25):- Sort (20): - Exchange (19): - LocalLimit (18): - Scan hive poi_data.poi_res (1)- Sort (24)- Exchange (23)- Project (22)- Filter (21)- Scan parquet bi_report.mon_ronghe_pv (7)(1) Scan hive poi_data.poi_res
Output [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: [bid#403, name#405, point_x#422, point_y#423], HiveTableRelation [poi_data.poi_res, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [bid#403, type#404, name#405, address#406, phone#407, alias#408, post_code#409, catalog_id#410, c..., Partition Cols: []](2) LocalLimit [codegen id : 1]
Input [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: 1000(3) Exchange
Input [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: hashpartitioning(bid#403, 600), ENSURE_REQUIREMENTS, [plan_id1043](4) ShuffleQueryStage
Output [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: 0(5) AQEShuffleRead
Input [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: coalesced(6) Sort [codegen id : 3]
Input [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: [bid#403 ASC NULLS FIRST], false, 0(7) Scan parquet bi_report.mon_ronghe_pv
Output [3]: [bid#440, pv#441, event_day#444]
Batched: true
Location: InMemoryFileIndex [afs://kunpeng.afs.baidu.com:9902/user/g_spark_rdw/rdw/poi_engine/warehouse/bi_report.db/mon_ronghe_pv/event_day20240522]
PartitionFilters: [isnotnull(event_day#444), (event_day#444 20240522)]
PushedFilters: [IsNotNull(bid), IsNotNull(pv), StringStartsWith(bid,1), GreaterThan(pv,100)]
ReadSchema: structbid:string,pv:int(8) ColumnarToRow [codegen id : 2]
Input [3]: [bid#440, pv#441, event_day#444](9) Filter [codegen id : 2]
Input [3]: [bid#440, pv#441, event_day#444]
Condition : (((isnotnull(bid#440) AND isnotnull(pv#441)) AND StartsWith(bid#440, 1)) AND (pv#441 100))(10) Project [codegen id : 2]
Output [2]: [bid#440, pv#441]
Input [3]: [bid#440, pv#441, event_day#444](11) Exchange
Input [2]: [bid#440, pv#441]
Arguments: hashpartitioning(bid#440, 600), ENSURE_REQUIREMENTS, [plan_id1067](12) ShuffleQueryStage
Output [2]: [bid#440, pv#441]
Arguments: 1(13) AQEShuffleRead
Input [2]: [bid#440, pv#441]
Arguments: coalesced(14) Sort [codegen id : 4]
Input [2]: [bid#440, pv#441]
Arguments: [bid#440 ASC NULLS FIRST], false, 0(15) SortMergeJoin [codegen id : 5]
Left keys [1]: [bid#403]
Right keys [1]: [bid#440]
Join condition: None(16) Project [codegen id : 5]
Output [5]: [bid#403, name#405, point_x#422, point_y#423, pv#441]
Input [6]: [bid#403, name#405, point_x#422, point_y#423, bid#440, pv#441](17) CollectLimit
Input [5]: [bid#403, name#405, point_x#422, point_y#423, pv#441]
Arguments: 1000(18) LocalLimit
Input [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: 1000(19) Exchange
Input [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: hashpartitioning(bid#403, 600), ENSURE_REQUIREMENTS, [plan_id1029](20) Sort
Input [4]: [bid#403, name#405, point_x#422, point_y#423]
Arguments: [bid#403 ASC NULLS FIRST], false, 0(21) Filter
Input [3]: [bid#440, pv#441, event_day#444]
Condition : (((isnotnull(bid#440) AND isnotnull(pv#441)) AND StartsWith(bid#440, 1)) AND (pv#441 100))(22) Project
Output [2]: [bid#440, pv#441]
Input [3]: [bid#440, pv#441, event_day#444](23) Exchange
Input [2]: [bid#440, pv#441]
Arguments: hashpartitioning(bid#440, 600), ENSURE_REQUIREMENTS, [plan_id1030](24) Sort
Input [2]: [bid#440, pv#441]
Arguments: [bid#440 ASC NULLS FIRST], false, 0(25) SortMergeJoin
Left keys [1]: [bid#403]
Right keys [1]: [bid#440]
Join condition: None(26) Project
Output [5]: [bid#403, name#405, point_x#422, point_y#423, pv#441]
Input [6]: [bid#403, name#405, point_x#422, point_y#423, bid#440, pv#441](27) CollectLimit
Input [5]: [bid#403, name#405, point_x#422, point_y#423, pv#441]
Arguments: 1000(28) AdaptiveSparkPlan
Output [5]: [bid#403, name#405, point_x#422, point_y#423, pv#441]
Arguments: isFinalPlantrue因为左关联on中的条件属于连接条件结果需要保留左表全部记录所以t1表全量读取t2表使用了谓词下推过滤。
2where
selectt1.bid,t1.name,t1.point_x,t1.point_y,t2.pv
from poi_data.poi_res t1
left join (select bid, pv from bi_report.mon_ronghe_pv where event_day20240522) t2
on t1.bidt2.bid
where t2.bid like 1% and t2.pv100;Physical Plan
AdaptiveSparkPlan (28)
- Final Plan CollectLimit (17)- * Project (16)- * SortMergeJoin Inner (15):- * Sort (6): - AQEShuffleRead (5): - ShuffleQueryStage (4), Statistics(sizeInBytes5.3 GiB, rowCount4.57E7): - Exchange (3): - * Filter (2): - Scan hive poi_data.poi_res (1)- * Sort (14)- AQEShuffleRead (13)- ShuffleQueryStage (12), Statistics(sizeInBytes58.5 MiB, rowCount1.28E6)- Exchange (11)- * Project (10)- * Filter (9)- * ColumnarToRow (8)- Scan parquet bi_report.mon_ronghe_pv (7)
- Initial Plan CollectLimit (27)- Project (26)- SortMergeJoin Inner (25):- Sort (20): - Exchange (19): - Filter (18): - Scan hive poi_data.poi_res (1)- Sort (24)- Exchange (23)- Project (22)- Filter (21)- Scan parquet bi_report.mon_ronghe_pv (7)(1) Scan hive poi_data.poi_res
Output [4]: [bid#456, name#458, point_x#475, point_y#476]
Arguments: [bid#456, name#458, point_x#475, point_y#476], HiveTableRelation [poi_data.poi_res, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [bid#456, type#457, name#458, address#459, phone#460, alias#461, post_code#462, catalog_id#463, c..., Partition Cols: []](2) Filter [codegen id : 1]
Input [4]: [bid#456, name#458, point_x#475, point_y#476]
Condition : (StartsWith(bid#456, 1) AND isnotnull(bid#456))(3) Exchange
Input [4]: [bid#456, name#458, point_x#475, point_y#476]
Arguments: hashpartitioning(bid#456, 600), ENSURE_REQUIREMENTS, [plan_id1176](4) ShuffleQueryStage
Output [4]: [bid#456, name#458, point_x#475, point_y#476]
Arguments: 0(5) AQEShuffleRead
Input [4]: [bid#456, name#458, point_x#475, point_y#476]
Arguments: coalesced(6) Sort [codegen id : 3]
Input [4]: [bid#456, name#458, point_x#475, point_y#476]
Arguments: [bid#456 ASC NULLS FIRST], false, 0(7) Scan parquet bi_report.mon_ronghe_pv
Output [3]: [bid#493, pv#494, event_day#497]
Batched: true
Location: InMemoryFileIndex [afs://kunpeng.afs.baidu.com:9902/user/g_spark_rdw/rdw/poi_engine/warehouse/bi_report.db/mon_ronghe_pv/event_day20240522]
PartitionFilters: [isnotnull(event_day#497), (event_day#497 20240522)]
PushedFilters: [IsNotNull(bid), IsNotNull(pv), StringStartsWith(bid,1), GreaterThan(pv,100)]
ReadSchema: structbid:string,pv:int(8) ColumnarToRow [codegen id : 2]
Input [3]: [bid#493, pv#494, event_day#497](9) Filter [codegen id : 2]
Input [3]: [bid#493, pv#494, event_day#497]
Condition : (((isnotnull(bid#493) AND isnotnull(pv#494)) AND StartsWith(bid#493, 1)) AND (pv#494 100))(10) Project [codegen id : 2]
Output [2]: [bid#493, pv#494]
Input [3]: [bid#493, pv#494, event_day#497](11) Exchange
Input [2]: [bid#493, pv#494]
Arguments: hashpartitioning(bid#493, 600), ENSURE_REQUIREMENTS, [plan_id1201](12) ShuffleQueryStage
Output [2]: [bid#493, pv#494]
Arguments: 1(13) AQEShuffleRead
Input [2]: [bid#493, pv#494]
Arguments: coalesced(14) Sort [codegen id : 4]
Input [2]: [bid#493, pv#494]
Arguments: [bid#493 ASC NULLS FIRST], false, 0(15) SortMergeJoin [codegen id : 5]
Left keys [1]: [bid#456]
Right keys [1]: [bid#493]
Join condition: None(16) Project [codegen id : 5]
Output [5]: [bid#456, name#458, point_x#475, point_y#476, pv#494]
Input [6]: [bid#456, name#458, point_x#475, point_y#476, bid#493, pv#494](17) CollectLimit
Input [5]: [bid#456, name#458, point_x#475, point_y#476, pv#494]
Arguments: 1000(18) Filter
Input [4]: [bid#456, name#458, point_x#475, point_y#476]
Condition : (StartsWith(bid#456, 1) AND isnotnull(bid#456))(19) Exchange
Input [4]: [bid#456, name#458, point_x#475, point_y#476]
Arguments: hashpartitioning(bid#456, 600), ENSURE_REQUIREMENTS, [plan_id1161](20) Sort
Input [4]: [bid#456, name#458, point_x#475, point_y#476]
Arguments: [bid#456 ASC NULLS FIRST], false, 0(21) Filter
Input [3]: [bid#493, pv#494, event_day#497]
Condition : (((isnotnull(bid#493) AND isnotnull(pv#494)) AND StartsWith(bid#493, 1)) AND (pv#494 100))(22) Project
Output [2]: [bid#493, pv#494]
Input [3]: [bid#493, pv#494, event_day#497](23) Exchange
Input [2]: [bid#493, pv#494]
Arguments: hashpartitioning(bid#493, 600), ENSURE_REQUIREMENTS, [plan_id1162](24) Sort
Input [2]: [bid#493, pv#494]
Arguments: [bid#493 ASC NULLS FIRST], false, 0(25) SortMergeJoin
Left keys [1]: [bid#456]
Right keys [1]: [bid#493]
Join condition: None(26) Project
Output [5]: [bid#456, name#458, point_x#475, point_y#476, pv#494]
Input [6]: [bid#456, name#458, point_x#475, point_y#476, bid#493, pv#494](27) CollectLimit
Input [5]: [bid#456, name#458, point_x#475, point_y#476, pv#494]
Arguments: 1000(28) AdaptiveSparkPlan
Output [5]: [bid#456, name#458, point_x#475, point_y#476, pv#494]
Arguments: isFinalPlantruewhere属于过滤条件影响左关联的最终结果所以执行计划第2步中将where提前到join关联之前按照bid对t1表进行过滤。
四、总结
假设数据库系统支持谓词下推的前提下
内连接内连接的两个执行计划中对t2表都使用了PushedFilters: [IsNotNull(bid), IsNotNull(pv), StringStartsWith(bid,1), GreaterThan(pv,100)]对t1表都使用了Condition : (StartsWith(bid#297, 1) AND isnotnull(bid#297)) 因此可以说内连接中where和on在执行效率上没区别。外连接还是拿左外连接来说右表相关的条件会使用谓词下推而左表是否会提前过滤数据取决于where还是on以及筛选条件是否与左表相关1当为on时左表的数据必须全量读取此时效率的差别主要取决于左表的数据量。2当为where时如果筛选条件涉及到左表则会进行数据的提前过滤否则左表仍然全量读取。
PS
在内连接的物理执行计划中对poi_res表的过滤单独作为一个Filter步骤2Condition : (StartsWith(bid#297, 1) AND isnotnull(bid#297))而对mon_ronghe_pv表的过滤在第7步scan中PushedFilters: [IsNotNull(bid), IsNotNull(pv), StringStartsWith(bid,1), GreaterThan(pv,100)] 二者有什么区别查了一些资料说的是可以将PushedFilters理解为在读取数据时的过滤不满足条件的数据直接不读取。Filter时将数据读取之后再判断是否满足条件决定是否参与后续计算。
既然都是在数据源侧进行数据过滤为什么Filter不能像PushedFilters那样直接在读取数据的时候判断减少读入的数据量呢这样也可以提升效率这是一开始个人的疑问。查了一些资料说的是是否支持在scan时filter数据主要受数据源的影响。大数据中的存储方式主要分为行式存储和列式存储列式存储的数据存储方式和丰富的元数据对谓词下推技术有更好的支持。当前测试中mon_ronghe_pv表的存储格式为parquetpoi_res表存储格式text。