在并发查询下,Java UDAF 聚合函数存在某一节点读数据阶段耗时长尾效应

【Doris 使用环境】测试

【Doris 版本】doris-1.2.2-rc01

【问题描述】我们在进行性能测试,我们的写了一个 Java UDAF 聚合函数,函数的功能类似于 count,表类型为 duplicate,我们在进行并发测试时,发现当并发逐渐从1增涨到40的过程中,其中一个 fragment instance 的耗时会急剧增长,而其他的节点都会保持差不多的时间,这就导致出现了长尾效应,导致响应耗时激增。

【操作系统】Centos 7.6

【机器配置】
FE:8c16G * 1 台
BE:32c128GB * 3台

以下是出问题的那个 fragment 对比
这是 2 并发

Fragment 1:
Instance f8b6dec98491443b-90a9d4947a3482c9 (host=TNetworkAddress(hostname:10.227.100.33, port:9030)):(Active: 3s491ms, % non-child: 0.01%)
- FragmentCpuTime: 2s884ms
- RowsProduced: 1
BlockMgr:
- BlockWritesOutstanding: 0
- BlocksCreated: 0
- BlocksRecycled: 0
- BufferedPins: 0
- BytesWritten: 0.00
- MaxBlockSize: 8.00 MB
- TotalBufferWaitTime: 0ns
- TotalEncryptionTime: 0ns
- TotalIntegrityCheckTime: 0ns
- TotalReadBlockTime: 0ns
VDataStreamSender (dst_id=2, dst_fragments=[{“f8b6dec98491443b-90a9d4947a3482cb”}]):(Active: 248.96us, % non-child: 0.01%)
- BlocksSent: 2
- BrpcSendTime: 436.609us
- BrpcSendTime.Wait: 494.764us
- BytesSent: 12.04 KB
- CompressTime: 122.180us
- IgnoreRows: 0
- LocalBytesSent: 0.00
- LocalSendTime: 0ns
- LocalSentRows: 0
- OverallThroughput: 47.396185874938965 MB/sec
- PeakMemoryUsage: 3.13 KB
- SerializeBatchTime: 131.93us
- SplitBlockDistributeByChannelTime: 0ns
- SplitBlockHashComputeTime: 0ns
- UncompressedRowBatchSize: 37.17 KB
VAGGREGATION_NODE (id=1):(Active: 3s490ms, % non-child: 81.31%)
- BuildTime: 2s799ms
- DeserializeDataTime: 0ns
- ExecTime: 2s798ms
- ExprTime: 1.270ms
- GetResultsTime: 0ns
- HashTableComputeTime: 0ns
- HashTableInputCount: 0
- HashTableIterateTime: 0ns
- HashTableSize: 0
- InsertKeysToColumnTime: 0ns
- MaxRowSizeInBytes: 0
- MergeTime: 0ns
- PeakMemoryUsage: 0.00
- ProjectionTime: 0ns
- RowsReturned: 1
- RowsReturnedRate: 0
- SerializeDataTime: 0ns
- SerializeKeyTime: 0ns
- SerializeResultTime: 2.65ms
- StreamingAggTime: 0ns
VNewOlapScanNode(r_1y5w_c_500_b_60_t_32_dup) (id=0):(Active: 640.843ms, % non-child: 18.29%)
- PushDownPredicates: [{f1 IN [1]}]
- KeyRanges: ScanKeys:
- TabletIds: [16402, 16312, 16410, 16390, 16320, 16330, 16338, 16342, 16308, 16370, 16394, 16358, 16366, 16376, 16326, 16398, 16392, 16322, 16304, 16354, 16374, 16324, 16400, 16364, 16306]
- AcuireRuntimeFilterTime: 332ns
- GetNextTime: 579.549ms
- MaxScannerThreadNum: 12
- NewlyCreateFreeBlocksNum: 374
- NumDiskAccess: 3
- NumScanners: 25
- PeakMemoryUsage: 0.00
- PreAllocFreeBlocksNum: 192
- ProcessConjunctTime: 68.229us
- ProjectionTime: 0ns
- RowsRead: 585.693K (585693)
- RowsReturned: 585.693K (585693)
- RowsReturnedRate: 913.94K /sec
- ScannerBatchWaitTime: 572.678ms
- ScannerCtxSchedCount: 836
- ScannerSchedCount: 307
- ScannerWorkerWaitTime: 1s224ms
- TabletNum: 25
- TotalReadThroughput: 0
VScanner:
- PreEvaluatePredicates:
ComparisonPredicateBase(STRING, EQ), column_id=1, opposite=false
- BlockConvertTime: 0ns
- BlockFetchTime: 18s825ms
- ReaderInitTime: 0ns
- RowsDelFiltered: 0
- ScannerConvertBlockTime: 0ns
- ScannerCpuTime: 18s239ms
- ScannerFilterTime: 1.207ms
- ScannerGetBlockTime: 18s687ms
- ScannerInitTime: 18.240ms
- ScannerPrefilterTime: 0ns
SegmentIterator:
- BitmapIndexFilterTimer: 282.898us
- BlockConditionsFilteredTime: 5.495ms
- BlockInitSeekCount: 0
- BlockInitSeekTime: 0ns
- BlockInitTime: 14.522ms
- BlockLoadTime: 18s813ms
- BlocksLoad: 15.013K (15013)
- CachedPagesNum: 0
- CompressedBytesRead: 1013.03 MB
- DecompressorTimer: 0ns
- FirstReadSeekCount: 227
- FirstReadSeekTime: 56.820ms
- FirstReadTime: 975.707ms
- IOTimer: 404.453ms
- LazyReadSeekCount: 0
- LazyReadSeekTime: 0ns
- LazyReadTime: 17s44ms
- NumSegmentFiltered: 0
- NumSegmentTotal: 227
- OutputColumnTime: 62.547ms
- RawRowsRead: 58.591571M (58591571)
- RowsBitmapIndexFiltered: 0
- RowsBloomFilterFiltered: 0
- RowsConditionsFiltered: 858
- RowsKeyRangeFiltered: 0
- RowsStatsFiltered: 858
- RowsVectorPredFiltered: 58.005878M (58005878)
- ShortPredEvalTime: 478.990us
- TotalPagesNum: 77.308K (77308)
- UncompressedBytesRead: 1011.24 MB
- VectorPredEvalTime: 38.551ms

这是 40 并发,可以对比看一下

Fragment 1:
Instance fc487c641232419a-896cd737eb114386 (host=TNetworkAddress(hostname:10.227.100.33, port:9030)):(Active: 1m, % non-child: 0.00%)
- FragmentCpuTime: 3s457ms
- RowsProduced: 1
BlockMgr:
- BlockWritesOutstanding: 0
- BlocksCreated: 0
- BlocksRecycled: 0
- BufferedPins: 0
- BytesWritten: 0.00
- MaxBlockSize: 8.00 MB
- TotalBufferWaitTime: 0ns
- TotalEncryptionTime: 0ns
- TotalIntegrityCheckTime: 0ns
- TotalReadBlockTime: 0ns
VDataStreamSender (dst_id=2, dst_fragments=[{“fc487c641232419a-896cd737eb114388”}]):(Active: 283.218us, % non-child: 0.00%)
- BlocksSent: 2
- BrpcSendTime: 469.646us
- BrpcSendTime.Wait: 463.844us
- BytesSent: 12.05 KB
- CompressTime: 126.739us
- IgnoreRows: 0
- LocalBytesSent: 0.00
- LocalSendTime: 0ns
- LocalSentRows: 0
- OverallThroughput: 41.55223560333252 MB/sec
- PeakMemoryUsage: 3.13 KB
- SerializeBatchTime: 136.69us
- SplitBlockDistributeByChannelTime: 0ns
- SplitBlockHashComputeTime: 0ns
- UncompressedRowBatchSize: 37.17 KB
VAGGREGATION_NODE (id=1):(Active: 1m, % non-child: 8.48%)
- BuildTime: 5s105ms
- DeserializeDataTime: 0ns
- ExecTime: 5s103ms
- ExprTime: 1.625ms
- GetResultsTime: 0ns
- HashTableComputeTime: 0ns
- HashTableInputCount: 0
- HashTableIterateTime: 0ns
- HashTableSize: 0
- InsertKeysToColumnTime: 0ns
- MaxRowSizeInBytes: 0
- MergeTime: 0ns
- PeakMemoryUsage: 0.00
- ProjectionTime: 0ns
- RowsReturned: 1
- RowsReturnedRate: 0
- SerializeDataTime: 0ns
- SerializeKeyTime: 0ns
- SerializeResultTime: 2.387ms
- StreamingAggTime: 0ns
VNewOlapScanNode(r_1y5w_c_500_b_60_t_32_dup) (id=0):(Active: 55s361ms, % non-child: 91.49%)
- PushDownPredicates: [{f1 IN [1]}]
- KeyRanges: ScanKeys:
- TabletIds: [16354, 16342, 16390, 16338, 16370, 16312, 16320, 16364, 16398, 16324, 16330, 16306, 16322, 16376, 16374, 16308, 16326, 16358, 16400, 16366, 16394, 16410, 16304, 16392, 16402]
- AcuireRuntimeFilterTime: 649ns
- GetNextTime: 54s939ms
- MaxScannerThreadNum: 12
- NewlyCreateFreeBlocksNum: 0
- NumDiskAccess: 3
- NumScanners: 25
- PeakMemoryUsage: 0.00
- PreAllocFreeBlocksNum: 192
- ProcessConjunctTime: 69.191us
- ProjectionTime: 0ns
- RowsRead: 585.693K (585693)
- RowsReturned: 585.693K (585693)
- RowsReturnedRate: 10.579K /sec
- ScannerBatchWaitTime: 54s935ms
- ScannerCtxSchedCount: 50
- ScannerSchedCount: 25
- ScannerWorkerWaitTime: 8m16s
- TabletNum: 25
- TotalReadThroughput: 0
VScanner:
- PreEvaluatePredicates:
ComparisonPredicateBase(STRING, EQ), column_id=1, opposite=false
- BlockConvertTime: 0ns
- BlockFetchTime: 1m11s
- ReaderInitTime: 0ns
- RowsDelFiltered: 0
- ScannerConvertBlockTime: 0ns
- ScannerCpuTime: 43s102ms
- ScannerFilterTime: 1.762ms
- ScannerGetBlockTime: 1m10s
- ScannerInitTime: 40.280ms
- ScannerPrefilterTime: 0ns
SegmentIterator:
- BitmapIndexFilterTimer: 312.298us
- BlockConditionsFilteredTime: 8.270ms
- BlockInitSeekCount: 0
- BlockInitSeekTime: 0ns
- BlockInitTime: 17.726ms
- BlockLoadTime: 1m11s
- BlocksLoad: 15.013K (15013)
- CachedPagesNum: 0
- CompressedBytesRead: 1013.03 MB
- DecompressorTimer: 0ns
- FirstReadSeekCount: 227
- FirstReadSeekTime: 296.486ms
- FirstReadTime: 3s892ms
- IOTimer: 702.4ms
- LazyReadSeekCount: 0
- LazyReadSeekTime: 0ns
- LazyReadTime: 1m6s
- NumSegmentFiltered: 0
- NumSegmentTotal: 227
- OutputColumnTime: 53.988ms
- RawRowsRead: 58.591571M (58591571)
- RowsBitmapIndexFiltered: 0
- RowsBloomFilterFiltered: 0
- RowsConditionsFiltered: 858
- RowsKeyRangeFiltered: 0
- RowsStatsFiltered: 858
- RowsVectorPredFiltered: 58.005878M (58005878)
- ShortPredEvalTime: 501.588us
- TotalPagesNum: 77.308K (77308)
- UncompressedBytesRead: 1011.24 MB
- VectorPredEvalTime: 82.52ms

1 是否方便提供下sql,表名和字段名可以脱敏,就是想看下sql的结构
2 如果把UDAF去掉是否还可以复现,想排除下其他因素的影响