2.2.3 Optimize

2.2.3.1 SQL

2.2.3.1.1 RB (Rule-Based Optimization)

1. Join selection: in Hadoop, MapReduce implements a map join with the DistributedCache. The small table is placed in the DistributedCache, shipped to every task, and loaded into memory as a map-like structure; the mapper then iterates over every record of the large table, probes the small table, and drops records with no match. Spark implements the map join with broadcast variables instead (a minimal sketch follows this list).
2. Predicate pushdown
3. Column pruning
4. Constant replacement (constant folding)
5. Partition pruning
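As an illustration of the broadcast-variable approach from item 1, the sketch below forces a broadcast (map-side) join in Spark SQL. It is a minimal sketch; the table and column names (`dw.orders`, `dim.cities`, `city_id`) are hypothetical and not from the original article.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder()
  .appName("broadcast-join-sketch")
  .enableHiveSupport()
  .getOrCreate()

// Hypothetical tables: a large fact table joined to a small dimension table.
val orders = spark.table("dw.orders")   // large table
val cities = spark.table("dim.cities")  // small table

// Mark the small side for broadcast: Spark ships it to every executor
// (the role the DistributedCache plays in MapReduce) instead of shuffling both sides.
val joined = orders.join(broadcast(cities), Seq("city_id"))

joined.explain()  // the physical plan should show BroadcastHashJoin
```

The same hint is available in SQL as `/*+ BROADCAST(cities) */`; without any hint, Spark broadcasts automatically when the smaller side is estimated to be below `spark.sql.autoBroadcastJoinThreshold`.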
2.2.3.1.2 CBO (Cost-Based Optimization)

Once the cost-based optimizer is enabled (via spark.sql.cbo.enabled), the following additional optimizations become available (a configuration sketch follows this list):

1. Build-side selection
2. Join-type optimization
3. Multi-join reordering
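CBO can only estimate plan costs from table and column statistics, so those have to be collected first. Below is a minimal sketch of enabling CBO and gathering statistics; the table and column names are hypothetical assumptions, not from the article.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cbo-sketch")
  .enableHiveSupport()
  .getOrCreate()

// Turn on the cost-based optimizer and CBO-driven join reordering.
spark.conf.set("spark.sql.cbo.enabled", "true")
spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")

// Table-level statistics (size in bytes, row count).
spark.sql("ANALYZE TABLE dw.orders COMPUTE STATISTICS")

// Column-level statistics (distinct count, min/max, null count) for the
// columns that appear in joins and filters.
spark.sql("ANALYZE TABLE dw.orders COMPUTE STATISTICS FOR COLUMNS city_id, amount")
```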
2.2.3.1.3 AE (Adaptive Execution)

2.2.3.1.3.1 Auto Setting the Shuffle Partition Number

| Property Name | Default | Meaning |
| --- | --- | --- |
| spark.sql.adaptive.enabled | false | Set to true to enable adaptive execution. |
| spark.sql.adaptive.minNumPostShufflePartitions | 1 | Minimum number of post-shuffle partitions under adaptive execution; controls the minimum parallelism. |
| spark.sql.adaptive.maxNumPostShufflePartitions | 500 | Maximum number of post-shuffle partitions under adaptive execution; controls the maximum parallelism. |
| spark.sql.adaptive.shuffle.targetPostShuffleInputSize | 67108864 (64 MB) | Minimum amount of data each reduce-side task should process when partitions are coalesced dynamically. |
| spark.sql.adaptive.shuffle.targetPostShuffleRowCount | 20000000 | Minimum number of rows each task should process after dynamic adjustment. Only takes effect when row-count statistics collection is enabled. |

2.2.3.1.3.2 Optimizing Join Strategy at Runtime

| Property Name | Default | Meaning |
| --- | --- | --- |
| spark.sql.adaptive.join.enabled | true | Whether the join strategy is adjusted dynamically at runtime. |
| spark.sql.adaptiveBroadcastJoinThreshold | equals spark.sql.autoBroadcastJoinThreshold | Threshold used at runtime to decide whether a broadcast join can be used. If not set, it equals spark.sql.autoBroadcastJoinThreshold. |

2.2.3.1.3.3 Handling Skewed Join

| Property Name | Default | Meaning |
| --- | --- | --- |
| spark.sql.adaptive.skewedJoin.enabled | false | Whether skew is handled automatically at runtime. |
| spark.sql.adaptive.skewedPartitionFactor | 10 | A partition is treated as skewed if its size exceeds the median partition size multiplied by this factor and also exceeds spark.sql.adaptive.skewedPartitionSizeThreshold, or if its row count exceeds the median row count multiplied by this factor and also exceeds spark.sql.adaptive.skewedPartitionRowCountThreshold. |
| spark.sql.adaptive.skewedPartitionSizeThreshold | 67108864 | A skewed partition must be at least this large in bytes. |
| spark.sql.adaptive.skewedPartitionRowCountThreshold | 10000000 | A skewed partition must contain at least this many rows. |
| spark.shuffle.statistics.verbose | false | When enabled, MapStatus collects per-partition row counts, which are used to detect skew and handle it accordingly. |
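The tables above use the property names of older adaptive-execution builds; Spark 3.x exposes the same ideas under spark.sql.adaptive.coalescePartitions.* and spark.sql.adaptive.skewJoin.* (see the Properties table in 2.2.3.2.4). A minimal sketch of enabling these knobs, assuming such an older build, could look like this:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("adaptive-execution-sketch")
  // Auto-tune the post-shuffle partition number between the min and max bounds.
  .config("spark.sql.adaptive.enabled", "true")
  .config("spark.sql.adaptive.minNumPostShufflePartitions", "1")
  .config("spark.sql.adaptive.maxNumPostShufflePartitions", "500")
  .config("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", "67108864") // 64 MB per task
  // Re-plan the join strategy at runtime once real shuffle sizes are known.
  .config("spark.sql.adaptive.join.enabled", "true")
  // Detect and split skewed partitions; needs per-partition row counts from MapStatus.
  .config("spark.sql.adaptive.skewedJoin.enabled", "true")
  .config("spark.shuffle.statistics.verbose", "true")
  .getOrCreate()
```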
2.2.3.2 Compute

2.2.3.2.1 Dynamic Executor Allocation

2.2.3.2.2 Parallelism

2.2.3.2.3 Data Skew/Shuffle

The mitigation techniques are the same as the skew handling covered in the Spark article and are not repeated here.

2.2.3.2.4 Properties

More configuration options are listed below:

| Property Name | Default | Meaning |
| --- | --- | --- |
| spark.sql.inMemoryColumnarStorage.compressed | true | Compresses the in-memory columnar storage. |
| spark.sql.codegen | false | When set to true, bytecode is compiled on the fly to speed up large queries. |
| spark.sql.inMemoryColumnarStorage.batchSize | 10000 | Batch size for the column cache; increasing it improves memory utilization but risks OOM. |
| spark.sql.files.maxPartitionBytes | 134217728 (128 MB) | The maximum number of bytes to pack into a single partition when reading files. Effective only for file-based sources such as Parquet, JSON and ORC. |
| spark.sql.files.openCostInBytes | 4194304 (4 MB) | The estimated cost to open a file, measured by the number of bytes that could be scanned in the same time. Used when putting multiple files into a partition. It is better to over-estimate; partitions with small files will then be faster than partitions with bigger files (which are scheduled first). Effective only for file-based sources such as Parquet, JSON and ORC. |
| spark.sql.files.minPartitionNum | Default Parallelism | The suggested (not guaranteed) minimum number of split file partitions. If not set, the default value is spark.default.parallelism. Effective only for file-based sources such as Parquet, JSON and ORC. |
| spark.sql.broadcastTimeout | 300 | Timeout in seconds for the broadcast wait time in broadcast joins. |
| spark.sql.autoBroadcastJoinThreshold | 10485760 (10 MB) | Maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join. Setting this value to -1 disables broadcasting. Note that statistics are currently only supported for Hive Metastore tables where ANALYZE TABLE tableName COMPUTE STATISTICS noscan has been run. |
| spark.sql.shuffle.partitions | 200 | The number of partitions to use when shuffling data for joins or aggregations. |
| spark.sql.sources.parallelPartitionDiscovery.threshold | 32 | Threshold to enable parallel listing of job input paths. If the number of input paths is larger than this threshold, Spark lists the files with a distributed job; otherwise it falls back to sequential listing. Effective only for file-based data sources such as Parquet, ORC and JSON. |
| spark.sql.sources.parallelPartitionDiscovery.parallelism | 10000 | Maximum listing parallelism for job input paths. If the number of input paths is larger than this value, it is throttled down to this value. As above, effective only for file-based data sources such as Parquet, ORC and JSON. |
| spark.sql.adaptive.coalescePartitions.enabled | true | When true and spark.sql.adaptive.enabled is true, Spark coalesces contiguous shuffle partitions according to the target size (specified by spark.sql.adaptive.advisoryPartitionSizeInBytes) to avoid too many small tasks. |
| spark.sql.adaptive.coalescePartitions.minPartitionNum | Default Parallelism | The minimum number of shuffle partitions after coalescing. If not set, the default value is the default parallelism of the Spark cluster. Only effective when spark.sql.adaptive.enabled and spark.sql.adaptive.coalescePartitions.enabled are both enabled. |
| spark.sql.adaptive.coalescePartitions.initialPartitionNum | (none) | The initial number of shuffle partitions before coalescing. If not set, it equals spark.sql.shuffle.partitions. Only effective when spark.sql.adaptive.enabled and spark.sql.adaptive.coalescePartitions.enabled are both enabled. |
| spark.sql.adaptive.advisoryPartitionSizeInBytes | 64 MB | The advisory size in bytes of a shuffle partition during adaptive optimization (when spark.sql.adaptive.enabled is true). Takes effect when Spark coalesces small shuffle partitions or splits skewed shuffle partitions. |
| spark.sql.adaptive.localShuffleReader.enabled | true | When adaptive execution is enabled, Spark uses the local shuffle reader to read shuffle data; this only happens when no shuffle repartitioning is required. |
| spark.sql.adaptive.skewJoin.enabled | true | When true and spark.sql.adaptive.enabled is true, Spark dynamically handles skew in sort-merge joins by splitting (and replicating if needed) skewed partitions. |
| spark.sql.adaptive.skewJoin.skewedPartitionFactor | 5 | A partition is considered skewed if its size is larger than this factor multiplied by the median partition size and also larger than spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes. |
| spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes | 256 MB | A partition is considered skewed if its size in bytes is larger than this threshold and also larger than spark.sql.adaptive.skewJoin.skewedPartitionFactor multiplied by the median partition size. Ideally this should be set larger than spark.sql.adaptive.advisoryPartitionSizeInBytes. |
| spark.sql.optimizer.maxIterations | 100 | The maximum number of iterations the optimizer and analyzer run. |
| spark.sql.optimizer.inSetConversionThreshold | 10 | The threshold of set size for InSet conversion. |
| spark.sql.inMemoryColumnarStorage.partitionPruning | true | When true, enables partition pruning for in-memory columnar tables. |
| spark.sql.inMemoryColumnarStorage.enableVectorizedReader | true | Enables the vectorized reader for columnar caching. |
| spark.sql.columnVector.offheap.enabled | true | When true, use OffHeapColumnVector in ColumnarBatch. |
| spark.sql.join.preferSortMergeJoin | true | When true, prefer sort-merge join over shuffled hash join. |
| spark.sql.sort.enableRadixSort | true | When true, enable radix sort when possible. Radix sort is much faster but requires additional memory to be reserved up front. The memory overhead may be significant when sorting very small rows (up to 50% more in this case). |
| spark.sql.limit.scaleUpFactor | 4 | Minimal increase rate in the number of partitions between attempts when executing a take on a query. Higher values lead to more partitions being read; lower values may lead to longer execution times as more jobs will be run. |
| spark.sql.hive.advancedPartitionPredicatePushdown.enabled | true | When true, advanced partition predicate pushdown into the Hive metastore is enabled. |
| spark.sql.subexpressionElimination.enabled | true | When true, common subexpressions are eliminated. |
| spark.sql.caseSensitive | false | Whether the query analyzer should be case sensitive. Defaults to case insensitive; enabling case sensitivity is highly discouraged. |
| spark.sql.crossJoin.enabled | false | When false, an error is thrown if a query contains a cartesian product without explicit CROSS JOIN syntax. |
| spark.sql.files.ignoreCorruptFiles | false | Whether to ignore corrupt files. If true, Spark jobs continue to run when encountering corrupted files, and the contents that have been read are still returned. |
| spark.sql.files.ignoreMissingFiles | false | Whether to ignore missing files. If true, Spark jobs continue to run when encountering missing files, and the contents that have been read are still returned. |
| spark.sql.files.maxRecordsPerFile | 0 | Maximum number of records to write out to a single file. If this value is zero or negative, there is no limit. |
| spark.sql.cbo.enabled | false | Enables CBO for estimation of plan statistics when set to true. |
| spark.sql.cbo.joinReorder.enabled | false | Enables join reordering in CBO. |
| spark.sql.cbo.joinReorder.dp.threshold | 12 | The maximum number of joined nodes allowed in the dynamic programming algorithm. |
| spark.sql.cbo.joinReorder.card.weight | 0.7 | The weight of cardinality (number of rows) for plan cost comparison in join reordering: cost = rows * weight + size * (1 - weight). |
| spark.sql.cbo.joinReorder.dp.star.filter | false | Applies star-join filter heuristics to cost-based join enumeration. |
| spark.sql.cbo.starSchemaDetection | false | When true, enables join reordering based on star schema detection. |
| spark.sql.cbo.starJoinFTRatio | 0.9 | Specifies the upper limit of the ratio between the largest fact tables for a star join to be considered. |
| spark.sql.windowExec.buffer.in.memory.threshold | 4096 | Threshold for the number of rows guaranteed to be held in memory by the window operator. |

2.2.3.3 Storage

2.2.3.3.1 Small File

The harm done by small files needs no repeating; what matters is knowing where they come from:

1. The source data: if the input already contains small files, merge them before computing, to avoid spawning a huge number of tasks and wasting resources.
2. During computation: adjust the number of partitions according to the actual data volume and distribution.
3. On write: the number of output files is tied to the number of reducers/partitions, so tune the parallelism to the actual data volume or configure automatic merging (a write-side sketch appears at the end of this Storage section).

2.2.3.3.2 Cold And Hot Data

2.2.3.3.3 Compress And Serializable

1. Use an appropriate storage format and compression codec for files.
2. Use an appropriate, efficient serializer, such as Kryo (also covered in the sketch at the end of this section).

| Property Name | Default | Meaning |
| --- | --- | --- |
| spark.sql.parquet.compression.codec | snappy | Compression codec used when writing Parquet files; snappy by default. |
| spark.sql.sources.fileCompressionFactor | 1.0 | When estimating the output data size of a table scan, multiply the file size by this factor, in case the data is compressed in the file and would otherwise lead to a heavily underestimated result. |
| spark.sql.parquet.mergeSchema | false | When true, the Parquet data source merges schemas collected from all data files; otherwise the schema is picked from the summary file or a random data file if no summary file is available. |
| spark.sql.parquet.respectSummaryFiles | false | When true, all part-files of Parquet are assumed to be consistent with summary files and are ignored when merging schemas. Otherwise (the default), all part-files are merged. This is an expert-only option and shouldn't be enabled before knowing exactly what it means. |
| spark.sql.parquet.binaryAsString | false | Some other Parquet-producing systems, in particular Impala and older versions of Spark SQL, do not differentiate between binary data and strings when writing out the Parquet schema. This flag tells Spark SQL to interpret binary data as a string to provide compatibility with these systems. |
| spark.sql.parquet.filterPushdown | true | Enables Parquet filter push-down optimization when set to true. |
| spark.sql.parquet.columnarReaderBatchSize | 4096 | The number of rows to include in a Parquet vectorized reader batch. The number should be chosen carefully to minimize overhead and avoid OOMs while reading data. |
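To make the small-file and compression/serialization points concrete, here is a minimal write-side sketch. The Kryo setting, the target file count of 16, and the table/path names are assumptions for illustration, not recommendations from the article.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("storage-sketch")
  // Use Kryo instead of Java serialization for shuffled and cached objects.
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Parquet compression codec; snappy is the default, zstd/gzip trade CPU for ratio.
  .config("spark.sql.parquet.compression.codec", "snappy")
  .enableHiveSupport()
  .getOrCreate()

// Hypothetical daily result set.
val result = spark.table("dw.orders").where("dt = '2023-01-01'")

// Bound the number of output files before writing, so a high shuffle
// parallelism does not turn into thousands of tiny files on disk.
result
  .repartition(16) // choose a value that matches the actual output volume
  .write
  .mode("overwrite")
  .parquet("/warehouse/dw/orders_daily") // hypothetical output path
```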
2.2.3.4 Other

2.2.3.4.1 Closed Loop Feedback

2.2.3.4.1.1 Real-time analysis of runtime information

2.2.3.4.1.2 Offline statistical analysis of runtime information

For example: statistics on high-frequency tables and columns, aggregated error messages, records of whether optimization strategies actually took effect, and so on.
