Java: Analyzing Connection Closed Exception in a Spring/JPA/MySQL/Tomcat application

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA terms, link to the original, and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/21698675/

Date: 2020-08-13 10:08:35 | Source: igfitidea

Analyzing Connection Closed Exception in Spring/JPA/Mysql/Tomcat app

Tags: java, spring, hibernate, tomcat, connection-pooling

Asked by ncmadhan

PROBLEM


I have recently been put in charge of a Java web application with code already written and in place. The app receives moderately high traffic, with peak hours between 11am and 3pm every day. The application uses Spring, JPA (Hibernate), and a MySQL DB. Spring has been configured to use the Tomcat JDBC connection pool to make connections to the DB. (Details of the configuration are at the end of the post.)

For the past few days, during the application's peak load hours, the application has been going down because Tomcat becomes unresponsive to requests, and Tomcat has had to be restarted multiple times.

Going through the tomcat catalina.out logs, I noticed a whole lot of


Caused by: java.sql.SQLException: Connection has already been closed.
    at org.apache.tomcat.jdbc.pool.ProxyConnection.invoke(ProxyConnection.java:117)
    at org.apache.tomcat.jdbc.pool.JdbcInterceptor.invoke(JdbcInterceptor.java:109)
    at org.apache.tomcat.jdbc.pool.DisposableConnectionFacade.invoke(DisposableConnectionFacade.java:80)
    at com.sun.proxy.$Proxy28.prepareStatement(Unknown Source)
    at org.hibernate.jdbc.AbstractBatcher.getPreparedStatement(AbstractBatcher.java:505)
    at org.hibernate.jdbc.AbstractBatcher.getPreparedStatement(AbstractBatcher.java:423)
    at org.hibernate.jdbc.AbstractBatcher.prepareQueryStatement(AbstractBatcher.java:139)
    at org.hibernate.loader.Loader.prepareQueryStatement(Loader.java:1547)
    at org.hibernate.loader.Loader.doQuery(Loader.java:673)
    at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:236)
    at org.hibernate.loader.Loader.loadCollection(Loader.java:1994)
    ... 115 more

These appear frequently just before the crash.


Looking further back in the logs, before these exceptions, I noticed a whole lot of connections being abandoned just before the Connection Closed exceptions.

WARNING: Connection has been abandoned PooledConnection[com.mysql.jdbc.Connection@543c2ab5]:java.lang.Exception
    at org.apache.tomcat.jdbc.pool.ConnectionPool.getThreadDump(ConnectionPool.java:1065)
    at org.apache.tomcat.jdbc.pool.ConnectionPool.borrowConnection(ConnectionPool.java:782)
    at org.apache.tomcat.jdbc.pool.ConnectionPool.borrowConnection(ConnectionPool.java:618)
    at org.apache.tomcat.jdbc.pool.ConnectionPool.getConnection(ConnectionPool.java:188)
    at org.apache.tomcat.jdbc.pool.DataSourceProxy.getConnection(DataSourceProxy.java:128)
    at org.hibernate.ejb.connection.InjectedDataSourceConnectionProvider.getConnection(InjectedDataSourceConnectionProvider.java:47)
    at org.hibernate.jdbc.ConnectionManager.openConnection(ConnectionManager.java:423)
    at org.hibernate.jdbc.ConnectionManager.getConnection(ConnectionManager.java:144)
    at org.hibernate.jdbc.AbstractBatcher.prepareQueryStatement(AbstractBatcher.java:139)

These seem to appear frequently just before the Connection Closed exceptions. And these seem to be the first symptoms of impending doom in the logs.

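The abandon mechanism behind these warnings can be sketched in a few lines of plain Java (the class and method names below are hypothetical, not tomcat-jdbc internals): a borrowed connection counts as abandoned once it has been held longer than removeAbandonedTimeout, after which the pool reclaims it and any further use fails with "Connection has already been closed".

```java
// Hypothetical sketch of the abandoned-connection check: a connection
// borrowed more than removeAbandonedTimeout ago is reclaimed by the pool.
public class AbandonedCheck {
    // removeAbandonedTimeout from the pool config, converted to milliseconds
    static final long REMOVE_ABANDONED_TIMEOUT_MS = 60_000;

    // true when a connection borrowed at 'borrowedAt' should be reclaimed now
    static boolean isAbandoned(long borrowedAt, long now) {
        return now - borrowedAt > REMOVE_ABANDONED_TIMEOUT_MS;
    }

    public static void main(String[] args) {
        long now = 1_000_000L;
        // held for 30s: still considered in use
        System.out.println(isAbandoned(now - 30_000, now));
        // held for 61s: reclaimed; later calls on the pool's proxy then throw
        // "Connection has already been closed"
        System.out.println(isAbandoned(now - 61_000, now));
    }
}
```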

ANALYSIS


Going by the logs, I set out to see whether any connection pool or MySQL configuration might be causing the issue, and went through a couple of excellent articles that show how to tune the pool for a production environment (links 1 and 2).

Going by these articles, I noticed that:


  1. The below line in JHanik's article (link 1) mentions this

    Setting the value of abandonWhenPercentageFull to 100 would mean that connections are not considered abandoned unless we've reached our maxActive limit.

    I figured this might be important in my case, because I see a lot of connections being abandoned.

  2. My max_connections setting does not match what is being recommended (in link 2)

    mysql max_connections should be equal to max_active+max_idle

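The sizing rule quoted above is plain arithmetic; a small hypothetical helper makes it easy to check a configuration against it (numbers taken from this post):

```java
// Checks the rule from link 2: MySQL max_connections should be at least
// max_active + max_idle, so the server never refuses a pooled connection.
public class PoolSizing {
    static boolean sizingOk(int maxConnections, int maxActive, int maxIdle) {
        return maxConnections >= maxActive + maxIdle;
    }

    public static void main(String[] args) {
        // original settings: 500 vs 200 + 50
        System.out.println(sizingOk(500, 200, 50));   // true
        // changed settings: 600 vs 350 + 250 (exactly at the limit)
        System.out.println(sizingOk(600, 350, 250));  // true
    }
}
```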

WHAT I TRIED


So, as per the recommendations from the articles, I did the following two things:


  1. Changed abandonWhenPercentageFull to 100
  2. In my MySQL server, max_connections was set to 500; I increased it to 600. In my connection pool settings, max_active was 200 and max_idle was 50; I changed them to max_active=350 and max_idle=250.

THIS DID NOT HELP


The next day, the following observations were made during peak hours:


  1. Tomcat did not go down. The app stayed up throughout peak hours. However, performance went from bad to worse, and the app was barely usable even though it did not actually go down.
  2. The DB connection pool, though increased in size, was completely utilized; at one point I could see 350 active connections to the DB.

FINALLY, MY QUESTION:


It clearly looks like there are issues with the way DB connections are being made from the app server. So I have two directions to take this analysis forward.


My question is which of these should I be taking?


1. The issue is not with the connection pool settings. The code is what is causing the issue


There might be places in the code where DB connections are not being closed, which would explain the high number of open connections.

The code uses a GenericDao which is extended in every Dao class. The GenericDao uses Spring's JpaTemplate to fetch an EntityManager instance, which in turn is used for all DB operations. My understanding is that JpaTemplate handles the nitty-gritty of closing DB connections internally.

So, where exactly should I be looking for possible connection leaks?
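One concrete thing to look for is any code path that obtains a connection and can exit without releasing it, typically via an exception thrown before close(). The sketch below uses a hypothetical stand-in for java.sql.Connection so it runs without a database; the pattern is the same:

```java
// Demonstrates how an exception turns a borrowed connection into a leak,
// and how try-with-resources guarantees the release on every exit path.
public class LeakDemo {
    static int open = 0;  // counts currently-open (borrowed) connections

    // stand-in for java.sql.Connection, so the demo needs no database
    static class FakeConnection implements AutoCloseable {
        FakeConnection() { open++; }
        @Override public void close() { open--; }
    }

    // Leaky: if the "query" throws, close() is never reached.
    static void leaky(boolean fail) {
        FakeConnection c = new FakeConnection();
        if (fail) throw new RuntimeException("query failed");
        c.close();
    }

    // Safe: try-with-resources closes the connection on every path.
    static void safe(boolean fail) {
        try (FakeConnection c = new FakeConnection()) {
            if (fail) throw new RuntimeException("query failed");
        }
    }

    public static void main(String[] args) {
        try { leaky(true); } catch (RuntimeException ignored) {}
        System.out.println("open after leaky: " + open);  // 1, leaked
        try { safe(true); } catch (RuntimeException ignored) {}
        System.out.println("open after safe:  " + open);  // still 1, no new leak
    }
}
```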

2. The issue is with the connection pool/MySQL config parameters. However, the optimizations I have put in need to be tuned further.

If yes, which parameters should I be looking at? Should I be collecting some data to determine more appropriate values for my connection pool (e.g. max_active, max_idle, max_connections)?



Addendum: Complete Connection Pool configuration


   <bean id="dataSource" class="org.apache.tomcat.jdbc.pool.DataSource" destroy-method="close">
        <property name="driverClassName" value="com.mysql.jdbc.Driver" />
        <property name="url" value="jdbc:mysql://xx.xx.xx.xx" />
        <property name="username" value="xxxx" />
        <property name="password" value="xxxx" />
        <property name="initialSize" value="10" />
        <property name="maxActive" value="350" />
        <property name="maxIdle" value="250" />
        <property name="minIdle" value="90" />
        <property name="timeBetweenEvictionRunsMillis" value="30000" />
        <property name="removeAbandoned" value="true" />
        <property name="removeAbandonedTimeout" value="60" />
        <property name="abandonWhenPercentageFull" value="100" />
        <property name="testOnBorrow" value="true" />
        <property name="validationQuery" value="SELECT 1" />
        <property name="validationInterval" value="30000" />
        <property name="logAbandoned" value="true" />
        <property name="jmxEnabled" value="true" />
    </bean>

Answered by M. Deinum

The code uses a GenericDao which is extended in every Dao class. The GenericDao uses Spring's JpaTemplate to fetch an EntityManager instance, which in turn is used for all DB operations. My understanding is that JpaTemplate handles the nitty-gritty of closing DB connections internally.

This is probably the root of your problem: you shouldn't use the JpaTemplate to get the EntityManager, as this will give you an unmanaged EntityManager. In fact, you shouldn't be using JpaTemplate at all.

It is recommended to write DAOs based on the plain EntityManager API and simply inject the EntityManager as you normally would (with @PersistenceContext).
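As an illustrative, non-compilable sketch of that recommended style (the entity and bean names here are invented, and the snippet assumes the usual Spring/JPA annotations on the classpath):

```java
// Hypothetical DAO written against the plain EntityManager API.
// Spring injects a managed, transaction-bound EntityManager, so connection
// acquisition and release are handled entirely by the framework.
@Repository
public class OrderDao {

    @PersistenceContext
    private EntityManager em;   // managed; never close it yourself

    public Order find(Long id) {
        return em.find(Order.class, id);
    }
}
```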

If you really want to use the JpaTemplate, use the execute method and pass in a JpaCallback, which will give you a managed EntityManager.

Also make sure that you have set up transactions correctly; without a proper transaction setup, connections will not be closed, because Spring doesn't know that it should close them.

Answered by Ryan P.

This is woefully late for the OP, but maybe it will help someone else in the future:


I ran into something similar to this in a production environment with long-running batch jobs. The problem arises if your code needs a connection for longer than the time specified by this property:

name="removeAbandonedTimeout" value="60"

and you have enabled:


<property name="removeAbandoned" value="true" />


then it will get disconnected during the processing after 60 seconds. One possible workaround (that didn't work for me) is to enable the interceptor:


jdbcInterceptors="ResetAbandonedTimer"


This will reset the abandoned timer for that connection on every read/write that occurs. Unfortunately, in my case, the processing would sometimes still take longer than the timeout before anything was read from or written to the database. So I was forced to either bump the timeout length or disable removeAbandoned (I chose the former).
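In the bean style used in the question's addendum, the interceptor would be switched on with a property along these lines (an illustrative config fragment, not the OP's actual file):

```xml
<!-- resets the abandoned timer on every read/write on the connection -->
<property name="jdbcInterceptors" value="ResetAbandonedTimer" />
```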

Hope this helps someone else if they run into something similar!


Answered by V H

I was recently asked to investigate why a production system sometimes goes down. I wanted to share my findings, since they involve a correlation of events that takes a JVM Tomcat app with JDBC issues as outlined above and actually crashes the app. This uses MySQL as a backend, so it is probably most useful for that scenario, but if the issue hits on another platform the cause is likely to be the same.

Simply getting connection-closed errors does not imply the application is broken.

This is under a Grails application, but it is relevant to all JVM-related apps:

The tomcat/context.xml DB configuration. Notice the very small DB pool and removeAbandonedTimeout="10"; yes, we want things to break:

<Resource
 name="jdbc/TestDB"  auth="Container" type="javax.sql.DataSource"
              driverClassName="com.mysql.jdbc.Driver"
              url="jdbc:mysql://127.0.0.1:3306/test"
              username="XXXX"
              password="XXXX"
              testOnBorrow="true"
              testWhileIdle="true"
              testOnReturn="true"
              factory="org.apache.tomcat.jdbc.pool.DataSourceFactory"
              removeAbandoned="true"
              logAbandoned="true"
              removeAbandonedTimeout="10"
              maxWait="5000"
              initialSize="1"
              maxActive="2"
              maxIdle="2"
              minIdle="2"
              validationQuery="Select 1" />

A Quartz job that runs every minute (not that it matters; I think the app dies on the first attempt):

class Test2Job {
    static  triggers = {
               cron name: 'test2', cronExpression: "0 0/1 * * * ?"
        }
        def testerService
        def execute() {
        println "starting job2 ${new Date()}"
        testerService.basicTest3()

    }

}

Now our testerService, with comments, so please follow the comments:

def dataSource

  /**
   * When using this method in quartz all the jdbc settings appear to get ignored
   * the job actually completes notice huge sleep times compared to basicTest
   * strange and very different behaviour.
   * If I add Tester t = Tester.get(1L) and then execute below query I will get
   * connection pool closed error
   * @return
   */
  def basicTest2() {
      int i=1
      while (i<21) {
          def sql = new Sql(dataSource)
          def query="""select id as id  from tester t
                  where id=:id"""
          def instanceList = sql.rows(query,[id:i as Long],[timeout:90])
          sleep(11000)
          println "-- working on ${i}"
          def sql1 = new Sql(dataSource)
          sql1.executeUpdate(
                  "update tester t set t.name=? where t.id=?",
                  ['aa '+i.toString()+' aa', i as Long])

          i++
          sleep(11000)
      }
      println "run ${i} completed"
  }


  /**
   * This is described in above oddity
   * so if this method is called instead you will see connection closed issues
   */
  def basicTest3() {
      int i=1
      while (i<21) {
          def t = Tester.get(i)
          println "--->>>> test3 t ${t.id}"

          /**
           * APP CRASHER - This is vital and most important
           * Without this declared lots of closed connections and app is working
           * absolutely fine,
           * The test was originally based on execRun() which returns 6650 records or something
           * This test query is returned in time and does not appear to crash app
           *
           * The moment this method is called and please check what it is currently doing. It is simply
           * running a huge query which go beyond the time out values and as explained in previous emails MYSQL states
           *
           * The app is then non responsive and logs clearly show application is broke 
           */
          execRun2()


          def sql1 = new Sql(dataSource)
          sleep(10000)
          sql1.executeUpdate("update tester t set t.name=? where t.id=?",['aa '+i.toString()+' aa', t.id])
          sleep(10000)
          i++
      }

  }


  def execRun2() {
      def query="""select new map (t as tester) from Tester t left join t.children c
                  left join c.childrena childrena
                  left join childrena.childrenb childrenb
                  left join childrenb.childrenc childrenc , Tester t2 left join t2.children c2
                  left join c2.childrena children2a
                  left join children2a.childrenb children2b
                  left join children2b.childrenc children2c
             where ((c.name like (:name) or
                  childrena.name like (:name) or
                  childrenb.name like (:name) or (childrenc is null or childrenc.name like (:name))) or
                  (
                  c2.name like (:name) or
                  children2a.name like (:name) or
                  children2b.name like (:name) or (children2c is null or children2c.name like (:name))
      ))

          """
      //println "query $query"
      def results = Tester.executeQuery(query,[name:'aa'+'%'],[timeout:90])
      println "Records: ${results.size()}"

      return results
  }


  /**
   * This is no different to basicTest2 and yet
   * this throws a connection closed error and notice it is 20 not 20000
   * quite instantly a connection closed error is thrown when a .get is used vs
   * sql = new Sql(..) is a manuall connection
   *
   */
  def basicTest() {
      int i=1
      while (i<21) {
          def t = Tester.get(i)
          println "--- t ${t.id}"
          sleep(20)
          //println "publishing event ${event}"
          //new Thread({
          //    def event=new PurchaseOrderPaymentEvent(t,t.id)
          //    publishEvent(event)
          //} as Runnable ).start()

          i++
      }
  }

This only happens when the query takes longer than the expected time, but there has to be another element: even though the statement is killed on the app side, the query itself still sits on MySQL, which keeps chewing away at processing it.

I think what is going on is


job 1  - hits app -> hits mysql ->    (9/10 left)
         {timeout} -> app killed  -> mysql running (9/10)
job 2  - hits app -> hits mysql ->    (8/10 left)
         {timeout} -> app killed  -> mysql running (8/10)
.....
job 10 - hits app -> hits mysql ->    (0/10 left)
         {timeout} -> app killed  -> mysql running (0/10)
job 11 - hits app ->

If by this time job 1 has not completed, then we have nothing left in the pool; the app is simply broken now, JDBC errors are thrown, and so on. It doesn't matter if the query completes after the crash.
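The countdown above can be sketched as a tiny simulation (hypothetical and app-side only): each job borrows a slot that the abandon timeout never really frees, because MySQL is still executing the killed query.

```java
// Simulates the pool countdown described above: each job borrows one of the
// pooled connections and the slot is never returned, since MySQL keeps
// running the query even after the app-side statement is killed.
public class PoolExhaustion {
    // Returns the index of the first job that finds the pool empty.
    static int firstFailingJob(int poolSize) {
        int free = poolSize;
        for (int job = 1; ; job++) {
            if (free == 0) return job;   // nothing left: app is broken
            free--;                      // borrowed, then abandoned
        }
    }

    public static void main(String[] args) {
        System.out.println("first failing job with a pool of 10: "
                + firstFailingJob(10));  // job 11, as in the timeline above
    }
}
```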

You can monitor what is going on by checking MySQL. The query appeared to run for a longer period, which goes against what they have suggested this value should be doing; but then again, maybe this isn't really based on any of this and relates to a problem elsewhere.

Whilst testing, I noticed there were two states: Sending data / Sending to client:

|  92 | root | localhost:58462 | test | Query   |   80 | Sending data      | select tester0_.id as col_0_0_ from tester tester0_ left outer join tester_childa children1_ on test |
|  95 | root | localhost:58468 | test | Query   |  207 | Sending to client | select tester0_.id as col_0_0_ from tester tester0_ left outer join tester_childa children1_ on test |
|  96 | root | localhost:58470 | test | Query   |  147 | Sending data      | select tester0_.id as col_0_0_ from tester tester0_ left outer join tester_childa children1_ on test |
|  97 | root | localhost:58472 | test | Query   |  267 | Sending data      | select tester0_.id as col_0_0_ from tester tester0_ left outer join tester_childa children1_ on test |
|  98 | root | localhost:58474 | test | Sleep   |   18 |                   | NULL                                                                                                 |
|  99 | root | localhost:58476 | test | Query   |  384 | Sending to client | select tester0_.id as col_0_0_ from tester tester0_ left outer join tester_childa children1_ on test |
| 100 | root | localhost:58478 | test | Query   |  327 | Sending data      | select tester0_.id as col_0_0_ from tester tester0_ left outer join tester_childa children1_ on test |

Seconds later:

|  91 | root | localhost:58460 | test | Query   |   67 | Sending to client | select tester0_.id as col_0_0_ from tester tester0_ left outer join tester_childa children1_ on test |
|  92 | root | localhost:58462 | test | Query   |  148 | Sending to client | select tester0_.id as col_0_0_ from tester tester0_ left outer join tester_childa children1_ on test |
|  97 | root | localhost:58472 | test | Query   |  335 | Sending to client | select tester0_.id as col_0_0_ from tester tester0_ left outer join tester_childa children1_ on test | |
| 100 | root | localhost:58478 | test | Query   |  395 | Sending to client | select tester0_.id as col_0_0_ from tester tester0_ left outer join tester_childa children1_ on test |

Seconds after that (all dead):

|  58 | root | localhost       | NULL | Query   |    0 | starting | show processlist |
|  93 | root | localhost:58464 | test | Sleep   |  167 |          | NULL             |
|  94 | root | localhost:58466 | test | Sleep   |  238 |          | NULL             |
|  98 | root | localhost:58474 | test | Sleep   |   74 |          | NULL             |
| 101 | root | localhost:58498 | test | Sleep   |   52 |          | NULL             |

It may be that a script needs to be created to monitor the process list, perhaps with a deeper result set containing the exact queries running, to work out which of your query events is killing your app.