前言

昨天下午3点左右，我们业务用户反映我们的app出现大面积登录成功，访问缓慢的情况。

查询后台日志发现大量出现TimeoutException，如下：

Caused by: com.alibaba.dubbo.remoting.TimeoutException: Waiting server-side response timeout by scan timer. start time: 2021-05-11 16:09:44.785, end time: 2021-05-11 16:10:34.796, client elapsed: 0 ms, server elapsed: 5011 ms, timeout: 5000 ms, request: Request [id=7467, version=2.0.0, twoway=true, event=false, broken=false, data=RpcInvocation [methodName=xxxx, parameterTypes=[class com.xxxx.xjd.late.common.dto.xxx.xxxx], arguments=[com.xxxx.xjd.late.common.dto.xxxx.xxxx@e2adb42], attachments={path=com.xxxx.xjd.late.api.xxxx, interface=com.xxxx.xjd.late.api.xxxx, version=1.0.0, timeout=50000, group=xxxx-xxxx}]], channel: /10.0.0.1:56811 -> /10.0.0.3:8077

PS：Linux指令，查询某个关键字最新一行数据

1	grep 'TimeoutException' xxxxx.log \| tail -n 1

正文

初期分析

线上出现大量dubbo请求超时，我们开始分析问题，一开始认为是网络原因导致的，咨询了网络运维组，运维组排查内网环境说网络正常。

而且经过跟踪，我们发现有部分请求是成功的。

由于我们是分布式部署，A模块称为client项目，用于为App提供接口服务，其调用了service项目（主要与数据库、其他模块、三方等交互）来进行服务。

我们发现clinet项目的请求会访问到service项目，但是处理过程中超时了，也没有任何业务异常。

而且超时接口广泛，不是单一一个接口出现超时异常。

我们找到了一个最简单的接口，这个接口是前端操作后，app将数据发送过来，service处理后保存入库（普通save操作）。

而后我们通过日志找到这个用户，登录堡垒机查询数据库发现刚才用户数据已经保存成功。

因此我们大概意识到是数据库执行缓慢导致的。

问题浮现

正在我们分析过程中，数据库管理员找到我们，说xxx应用是不是你们负责，再跑什么业务，现在数据库很卡，CPU要打满了。

在数据库管理员的帮助下，我们定位到了一些慢SQL，及一个一直在执行的插入SQL，其IP定位到我们的管理后台。

找到了相关项目，我们根据SQL定位到了如下这段代码：

//....部分代码略
platfContentService.save(platfContent);//新建或者编辑保存
if("2".equals(platfContent.getType())){//公告群发
    new Thread(){
        @Override
        public void run(){
            List<String> userIds = null;
            List<PlatfInbox> inboxs = new ArrayList();
            Integer offset = 0;
            Integer pageSize = 1000;
            if(String.valueOf(Constant.targetUserEnum.ALL.getValue())
                    .equals(platfContent.getTargetUser())){
                userIds = platfContentService.findUserIdsFromPlatfUser(null,
                        offset,pageSize);
                while(userIds != null
                        && userIds.size() > 0){
                    for(String userId : userIds){
                        inboxs.add(encapsulatePlatfInbox(userId,platfContent));
                    }
                    platfContentService.batchInsertPlatfInbox(inboxs);
                    userIds.clear();
                    inboxs.clear();
                    offset = offset+pageSize;
                    userIds =  platfContentService.findUserIdsFromPlatfUser(null,offset,pageSize);
                }

            }else if(String.valueOf(Constant.targetUserEnum.YQ_USER.getValue())
                    .equals(platfContent.getTargetUser())
                    || String.valueOf(Constant.targetUserEnum.JK_USER.getValue())
                    .equals(platfContent.getTargetUser())
                    || String.valueOf(Constant.targetUserEnum.YHK_USER.getValue())
                    .equals(platfContent.getTargetUser())) {
                userIds = platfContentService.findUserIdsFromLateAccount(platfContent,offset,pageSize);
                while(userIds.size() > 0){
                    for(String userId : userIds){
                        inboxs.add(encapsulatePlatfInbox(userId,platfContent));
                    }
                    platfContentService.batchInsertPlatfInbox(inboxs);
                    userIds.clear();
                    inboxs.clear();
                    offset = offset+pageSize;
                    userIds =  platfContentService.findUserIdsFromLateAccount(platfContent,offset,pageSize);
                }
            }

        }

    }.start();
}
j.setSuccess(true);
j.setMsg("操作成功");
//....部分代码略

这段代码逻辑如下：

我们管理后台原来有个发送公告消息模块，运营之前没用过，昨天下午使用了，添加了一条全体公告。

代码的逻辑是开个线程去批量将每个用户查出来拿到id，在更新进inbox消息表……

由于用户量巨大，导致对数据库造成巨大压力，原来一些慢SQL的问题也浮现了。

同时这种方案开启线程去处理，如果正好服务挂了，剩余部分数据将得不到有效处理。

问题解决

可以看到代码是开启线程处理数据，我们将管理后台重启即可。而后数据库压力缓解，系统服务恢复正常。

关于这部分问题，原来的设计是每有一条消息，向数据库里插入一条消息记录。

通过查询可用看到消息表数据量已达到6600w。

我们后面通过分析优化，基本处理方法如下：

对于一些全局消息，只记录一条即可，并通过Redis缓存辅助处理。
对于具体用户个人消息，设置保存时限，定时清除部分实际无用消息，并通过缓存近期消息提高查询效率。

同时DBA给我们拉取了一些慢查询，我们需要处理及优化。

# Query 10: 0.64 QPS, 5.78x concurrency, ID 0xCC2F786C3D667858 at byte 593627453
# This item is included in the report because it matches --limit.
# Scores: V/M = 5.81
# Time range: 2021-05-11T13:34:04 to 2021-05-11T16:18:59
# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count          2    6294
# Exec time      2  57150s      3s     89s      9s     21s      7s      6s
# Lock time      0   867ms    56us    17ms   137us   159us   521us    98us
# Rows sent      0   6.15k       1       1       1       1       0       1
# Rows examine  39  39.00G   6.34M   6.35M   6.35M   6.29M       0   6.29M
# Query size     0   1.05M     167     194  174.23  174.84    4.27  166.51
# String:
# Hosts        10.0.0.xx (3173/50%), 10.0.0.xx (3121/49%)
# Users        xxxx_web_w
# Query_time distribution
#   1us
#  10us
# 100us
#   1ms
#  10ms
# 100ms
#    1s  ################################################################
#  10s+  #################
# Tables
#    SHOW TABLE STATUS LIKE 'xxxx_detail'\G
#    SHOW CREATE TABLE `xxxx_detail`\G
# EXPLAIN /*!50100 PARTITIONS*/
SELECT
                COUNT(0)
                FROM xxxx_detail a
                 WHERE a.user_id != 'xxxxxxx'

                        AND a.gmail_account = 'xxx@gmail.com' AND a.del_flag = 0\G

这条SQL用了 != ，实际查询用不到索引，造成全表扫描。

可以看到这个SQL的操作是查询除了该用户，有没有其他用户使用这个邮箱，这种可以对gmail_account字段添加索引，如果涉及大量查询，可以考虑使用Redis来进行处理。

# Query 1: 2.98 QPS, 21.71x concurrency, ID 0xEC345910A7B575A6 at byte 389892839
# This item is included in the report because it matches --limit.
# Scores: V/M = 8.98
# Time range: 2021-05-11T13:34:39 to 2021-05-11T16:18:59
# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count          9   29401
# Exec time     10 214024s      1s    133s      7s     23s      8s      4s
# Lock time      0      4s    52us    37ms   122us   176us   436us    93us
# Rows sent     27 574.24k      20      20      20      20       0      20
# Rows examine  50  50.01G 590.49k   2.41M   1.74M   2.26M 467.04k   1.86M
# Query size     0   2.94M     105     105     105     105       0     105
# String:
# Hosts        10.0.0.xx (14751/50%), 10.0.0.xx (14650/49%)
# Users        xxxx_web_w
# Query_time distribution
#   1us
#  10us
# 100us
#   1ms
#  10ms
# 100ms
#    1s  ################################################################
#  10s+  ################
# Tables
#    SHOW TABLE STATUS LIKE 'platf_inbox'\G
#    SHOW CREATE TABLE `platf_inbox`\G
# EXPLAIN /*!50100 PARTITIONS*/
select *
                from platf_inbox a
                where a.del_flag= '0' and a.msg_type=3
                order by a._id desc
                limit 0,20\G

该SQL查询20条最新消息，我们通过Redis缓存数据来减小数据库压力。

# Query 59: 0.09 QPS, 0.49x concurrency, ID 0x9B09EF8E6B818E92 at byte 517278788
# This item is included in the report because it matches --outliers.
# Scores: V/M = 3.53
# Time range: 2021-05-11T14:04:08 to 2021-05-11T16:03:49
# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count          0     668
# Exec time      0   3517s      2s     64s      5s     13s      4s      4s
# Lock time      0   119ms    83us     4ms   178us   247us   209us   144us
# Rows sent      0     326       0       1    0.49    0.99    0.50       0
# Rows examine   3   2.57G   3.93M   3.93M   3.93M   3.86M    0.06   3.86M
# Query size     0 546.01k     837     837     837     837       0     837
# String:
# Hosts        10.0.0.xx (338/50%), 10.0.0.xx (330/49%)
# Users        xxxx_web_w
# Query_time distribution
#   1us
#  10us
# 100us
#   1ms
#  10ms
# 100ms
#    1s  ################################################################
#  10s+  ######
# Tables
#    SHOW TABLE STATUS LIKE 'xxxx_ocr_record'\G
#    SHOW CREATE TABLE `xxxx_ocr_record`\G
# EXPLAIN /*!50100 PARTITIONS*/
select
                *
                from xxxx_ocr_record a
                WHERE a.ocr_result = 1  and a.user_id='xxxxx' order BY a.create_date DESC LIMIT 1\G

该表没有索引，虽然只查一条，但是全表扫描，查询效率非常低，需要添加索引。

总结

从此次线上事故到恢复，查不到耗时3小时左右，感觉自己从中学到了一些经验。

对线上问题处理及优化有了更多的一些认识，虽然是程序逻辑问题造成的，但是仍有不少收获。

SakuraTears的博客

记一次线上问题（dubbo连接超时问题）

前言

正文

初期分析

问题浮现

问题解决

总结