问题描述 最近业务团队预发使用平台redis框架包(redis客户端使用lettuce,超时时间配置为5s)的时候经常出现命令超时问题。
报错日志,issue地址
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 Sep 2 17 :04 :07 live-uat dxy-live-extensions-uat [117] : 2020-09-02 17 :04 :07 .029 dxy-live-extensions [lettuce-epollEventLoop-5-1] ERROR c .d .redis .lock .RedisDistributedLock - set redis occured an exception :Redis command timed out ; nested exception is io .lettuce .core .RedisCommandTimeoutException : Command timed out Sep 2 17 :04 :07 live-uat dxy-live-extensions-uat [117] : org .springframework .dao .QueryTimeoutException : Redis command timed out ; nested exception is io .lettuce .core .RedisCommandTimeoutException : Command timed out Sep 2 17 :04 :07 live-uat dxy-live-extensions-uat [117] : #011at org .springframework .data .redis .connection .lettuce .LettuceExceptionConverter .convert (LettuceExceptionConverter.java :70 )Sep 2 17 :04 :07 live-uat dxy-live-extensions-uat [117] : #011at org .springframework .data .redis .connection .lettuce .LettuceExceptionConverter .convert (LettuceExceptionConverter.java :41 )Sep 2 17 :04 :07 live-uat dxy-live-extensions-uat [117] : #011at org .springframework .data .redis .PassThroughExceptionTranslationStrategy .translate (PassThroughExceptionTranslationStrategy.java :44 )Sep 2 17 :04 :07 live-uat dxy-live-extensions-uat [117] : #011at org .springframework .data .redis .FallbackExceptionTranslationStrategy .translate (FallbackExceptionTranslationStrategy.java :42 )Sep 2 17 :04 :07 live-uat dxy-live-extensions-uat [117] : #011at org .springframework .data .redis .connection .lettuce .LettuceConnection .convertLettuceAccessException (LettuceConnection.java :257 )Sep 2 17 :04 :07 live-uat dxy-live-extensions-uat [117] : #011at org .springframework .data .redis .connection .lettuce .LettuceScriptingCommands .convertLettuceAccessException (LettuceScriptingCommands.java :236 )Sep 2 17 :04 :07 live-uat dxy-live-extensions-uat [117] : #011at org .springframework .data .redis .connection .lettuce .LettuceScriptingCommands .evalSha (LettuceScriptingCommands.java :195 )Sep 2 17 :04 :07 live-uat dxy-live-extensions-uat [117] : #011at org .springframework .data .redis .connection .DefaultedRedisConnection .evalSha (DefaultedRedisConnection.java :1240 )Sep 2 17 :04 :07 live-uat dxy-live-extensions-uat [117] : #011at sun .reflect .NativeMethodAccessorImpl .invoke0 (Native Method)Sep 2 17 :04 :07 live-uat dxy-live-extensions-uat [117] : #011at sun .reflect .NativeMethodAccessorImpl .invoke (NativeMethodAccessorImpl.java :62 )Sep 2 17 :04 :07 live-uat dxy-live-extensions-uat [117] : #011at sun .reflect .DelegatingMethodAccessorImpl .invoke (DelegatingMethodAccessorImpl.java :43 )Sep 2 17 :04 :07 live-uat dxy-live-extensions-uat [117] : #011at java .lang .reflect .Method .invoke (Method.java :498 )Sep 2 17 :04 :07 live-uat dxy-live-extensions-uat [117] : #011at org .springframework .data .redis .core .CloseSuppressingInvocationHandler .invoke (CloseSuppressingInvocationHandler.java :61 )Sep 2 17:04:07 live-uat dxy-live-extensions-uat[117]: #011at com.sun.proxy.$Proxy169.evalSha(Unknown Source) Sep 2 17 :04 :07 live-uat dxy-live-extensions-uat [117] : #011at org .springframework .data .redis .core .script .DefaultScriptExecutor .eval (DefaultScriptExecutor.java :77 )Sep 2 17:04:07 live-uat dxy-live-extensions-uat[117]: #011at org.springframework.data.redis.core.script.DefaultScriptExecutor.lambda$execute$0(DefaultScriptExecutor.java:68) Sep 2 17 :04 :07 live-uat dxy-live-extensions-uat [117] : #011at org .springframework .data .redis .core .RedisTemplate .execute (RedisTemplate.java :224 )Sep 2 17 :04 :07 live-uat dxy-live-extensions-uat [117] : #011at org .springframework .data .redis .core .RedisTemplate .execute (RedisTemplate.java :184 )Sep 2 17 :04 :07 live-uat dxy-live-extensions-uat [117] : #011at org .springframework .data .redis .core .RedisTemplate .execute (RedisTemplate.java :171 )Sep 2 17 :04 :07 live-uat dxy-live-extensions-uat [117] : #011at org .springframework .data .redis .core .script .DefaultScriptExecutor .execute (DefaultScriptExecutor.java :58 )Sep 2 17 :04 :07 live-uat dxy-live-extensions-uat [117] : #011at org .springframework .data .redis .core .script .DefaultScriptExecutor .execute (DefaultScriptExecutor.java :52 )Sep 2 17 :04 :07 live-uat dxy-live-extensions-uat [117] : #011at org .springframework .data .redis .core .RedisTemplate .execute (RedisTemplate.java :346 )Sep 2 17 :04 :07 live-uat dxy-live-extensions-uat [117] : #011at com .dxy .redis .lock .RedisDistributedLock .redisLock (RedisDistributedLock.java :121 )Sep 2 17 :04 :07 live-uat dxy-live-extensions-uat [117] : #011at com .dxy .redis .lock .RedisDistributedLock .doLock (RedisDistributedLock.java :92 )Sep 2 17 :04 :07 live-uat dxy-live-extensions-uat [117] : #011at com .dxy .redis .lock .AbstractDistributedLock .lock (AbstractDistributedLock.java :46 )Sep 2 17 :04 :07 live-uat dxy-live-extensions-uat [117] : #011at com .dxy .redis .util .RedisLockUtils .tryLock (RedisLockUtils.java :64 )Sep 2 17 :04 :07 live-uat dxy-live-extensions-uat [117] : #011at com .dxy .listener .AbstractMessageListener .tryLock (AbstractMessageListener.java :21 )Sep 2 17 :04 :07 live-uat dxy-live-extensions-uat [117] : #011at com .dxy .listener .LiveInfoChangeMessageListener .onMessage (LiveInfoChangeMessageListener.java :34 )Sep 2 17 :04 :07 live-uat dxy-live-extensions-uat [117] : #011at org .springframework .data .redis .connection .lettuce .LettuceMessageListener .message (LettuceMessageListener.java :43 )Sep 2 17 :04 :07 live-uat dxy-live-extensions-uat [117] : #011at org .springframework .data .redis .connection .lettuce .LettuceMessageListener .message (LettuceMessageListener.java :29 )
排查问题
查看redis服务器慢命令,慢命令设置为3s,但是最终还是没有慢命令日志 ;
1 2 3 4 5 6 7 8 9 10 # 查看客户端连接数 client listsinfo clientsinfo stats # 查看最大客户端连接数 config get maxclients # 命令执行超过3秒记录慢日志 CONFIG SET slowlog-log-slower-than 3000000 # 只保留最近1000条慢日志 CONFIG SET slowlog-max-len 1000 # 查看慢日志 showlog get
升级lettuce包,官方推荐使用5.3+稳定版本,但是升级完成之后,还是没有效果 ;
排查业务代码使用了哪些场景,发现业务团队使用了大量的sub订阅,使用方式直接去redisconnectionfactory获取连接,然后直接订阅。由于lettcue我们默认是使用共享连接配置(即与server端连接采用多路复用连接机制,默认只有一个客户端连接)。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 @Component public class HostChangeMessageListener extends AbstractMessageListener implements MessageListener { @Resource private IMCacheService cacheService; @Resource private IMMessageService imMessageService; @Override public void onMessage (Message message, byte [] pattern) { String content = new String(message.getBody()); if (tryLock(message)) { } } }
查看lettuce的命令超时文档
RedisCommandTimeoutException
with a stack trace like:
1 2 3 4 5 6 io.lettuce.core.RedisCommandTimeoutException: Command timed out after 1 minute(s) at io.lettuce.core.ExceptionFactory.createTimeoutException(ExceptionFactory.java:51) at io.lettuce.core.LettuceFutures.awaitOrCancel(LettuceFutures.java:114) at io.lettuce.core.FutureSyncInvocationHandler.handleInvocation(FutureSyncInvocationHandler.java:69) at io.lettuce.core.internal.AbstractInvocationHandler.invoke(AbstractInvocationHandler.java:80) at com.sun.proxy.$Proxy94.set(Unknown Source)
Diagnosis:
Check the debug log (log level DEBUG
or TRACE
for the logger io.lettuce.core.protocol
)
Take a Thread dump to investigate Thread activity
Cause:
Command timeouts are caused by the fact that a command was not completed within the configured timeout. Timeouts may be caused for various reasons:
Redis server has crashed/network partition happened and your Redis service didn’t recover within the configured timeout
Command was not finished in time. This can happen if your Redis server is overloaded or if the connection is blocked by a command (e.g. BLPOP 0
, long-running Lua script). See also blpop(Duration.ZERO, …)
gives RedisCommandTimeoutException
.
Configured timeout does not match Redis’s performance.
If you block the EventLoop
(e.g. calling blocking methods in a RedisFuture
callback or in a Reactive pipeline)
Action:
Check for the causes above. If the configured timeout does not match your Redis latency characteristics, consider increasing the timeout. Never block the EventLoop
from your code.
解决方案 采用spring redis监听器方式,避免多个客户端连接订阅导致其他命令超时。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 @Bean public MessageListenerAdapter liveInfoChangeMessageListenerAdapter () { return new MessageListenerAdapter(liveInfoChangeMessageListener, "onMessage" ); } @Bean public RedisMessageListenerContainer container (RedisConnectionFactory connectionFactory) { RedisMessageListenerContainer listenerContainer = new RedisMessageListenerContainer(); listenerContainer.setConnectionFactory(connectionFactory); listenerContainer.addMessageListenerliveInfoChangeMessageListenerAdapter(), new PatternTopic("IM_LIVE_ENTRY_INFO_CHANGE" )); return listenerContainer; }
Leetuce注意点 针对需要阻塞的命令最好针对命令配置超时时间,如下:
There are various options:
Configure a higher default timeout.
Consider a timeout that meets the default timeout when calling blocking commands.
Configure TimeoutOptions
with a custom TimeoutSource
1 2 3 4 5 6 7 8 9 10 11 12 TimeoutOptions timeoutOptions = TimeoutOptions.builder().timeoutSource(new TimeoutSource() { @Override public long getTimeout(RedisCommand<?, ?, ?> command) { if (command.getType() == CommandType.BLPOP) { return TimeUnit.MILLISECONDS.toNanos(CommandArgsAccessor.getFirstInteger(command.getArgs())); } // -1 indicates fallback to the default timeout return -1; } }).build();
Note that commands that timed out may block the connection until either the timeout exceeds or Redis sends a response.
Redis阻塞问题总结
Redis数据结构或API使用不合理,导致可能存在大对象且大对象使用复杂度高的命令;
对一个有千万个元素的hash执行hgetall操作, 或del操作.类似的这种操作都会造成Redis阻塞
对于这种大对象可以采用redis-cli -h {host} -p {port} bigkeys 来查看。但是该命令只能查询某类型中的其 中最大的一个key。如果你想查询多个。可以采用修改redis-cli源代码的方式(Redis的源代码是C)。如果不想修 改源代码的话也可以使用scan来完成。
对于Scan命令需要注意。该命令只能扫描单台Redis上的数据。如果你是一个集群,需要每台机器执行一遍。但是如果你使用开源的客户端的话(比如:Java的Lettuce客户端)就已经帮你把scan命令实现为可以扫描整个集群了。
然后对大对象进行拆分。具体拆分要视业务而定了。
Redis的cpu使用率接近100%
从机同步主机数据。从机接受到rdb文件后从磁盘加载数据
主从持久化数据。
将cpu使用率达到100%,有可能是真实业务访问量确实很大。单台Redis达到每秒处理6万+的请求。这个时候就只能做水平扩展了
如果Redis每秒操作数只有几百,或者几千,且cpu还是很高的话就有可能使用了高算法复杂度的命令。例如hgetall。还有一种可能是内存的过度优化导致。这种情况目前暂时没有遇到,但也纳入考虑范围。
CPU竞争
Redis是一个CPU密集型的应用,不适合和其他CPU密集的服务部署在一起。
在生产环境中,我们一台服务器的配置是32核逻辑cpu, 256GB内存。每台机器如果只部署一台Redis比较浪费。所以可能会一台机器部署多个Redis。通常会将Redis进程绑定到CPU上。但是在生成RDB文件或者AOF持久话时,就会产生子进程。这样子进程与父进程会产生CPU竞争。所以当开启持久化或者主节点。不建议绑定CPU
内存交换
Redis是一个内存型数据库,所有数据全部放在内存中。所以强烈建议不开启内存交换
网络问题
主从同步网络延迟较大的话,导致从机经常断线重连。如果断线时间久了。导致从机再次连接上主机时会全量同步,这时主机,从机都会收到影响