Spring Boot Observability: Discovering a Database Connection Leak

Databases and connections are common in most software projects: we need to save data somewhere.

Database connections are a scarce resource, and we should be conscious of how to use them, doesn’t matter if you use a framework to handle them or not.

In this post, we are going to talk about Gatling as a loading test tool, create a database connection leak, use Grafana to assess the problem and solve it.

TRY IT YOURSELF: You can find the source code of this post here.

Spring Boot Observability Series

If you like this post and are interested in hearing more about my journey as a Software Engineer, you can follow me on Twitter and travel together.

Requirements

To set up the environment, refer to Setting up Micrometer, Grafana and Prometheus

Loading Testing with Gatling

As we need to have enough data in Prometheus to observe in our tests, we should use an automated load testing tool to help us.

Gatling is a tool for load testing, it allows to create scenarios and has a lot of options to setup them.

To use it, first, add the following plugin to the gradle.build:

plugins {
	...
	id 'io.gatling.gradle' version "3.7.2"
        ...
}

Now, we create a test on src/gatling/java:

public class CustomersSimulation extends Simulation {
  HttpProtocolBuilder httpProtocol = http
          .baseUrl(
                  "http://localhost:8080");

  ScenarioBuilder scn = scenario("Constant requests")
          .exec(http("request_1")
                  .get("/customers"));

  {
    setUp(scn.injectOpen(constantUsersPerSec(1)
                    .during(60))
            .protocols(httpProtocol));
  }
}

There, we created a test to call the app once each second, for 60 seconds.

To run these tests, you can execute gradlew :gatlingRun

You will be able to access a report that looks like this (you can find the link on the logs after running the job):

Gatling report

With this load testing tool, we can talk now about database connections and how they work.

About Database Connections Pool

Any project usually needs a database to persist data. To talk with a database, we should create a network connection to send commands through it. Creating this connection is expensive and takes time, that’s why pooling database connections is a good idea, and reusing them as much as we can.

Database connection pool

However, pooling brings new challenges like how many should we pool? having few might slow down the application transactions, as one transaction needs to wait for a connection to be released; having many, could be a waste of resources.

To define those features, we should analyze how our application behaves regarding the database interaction. Let’s start with installing a Grafana dashboard to help us make decisions.

Understanding a Database Connection Dashboard

We can find several open-source dashboards on the Grafana page, one of them is a Spring Boot HikariCP / JDBC.

To import the dashboard, open Grafana and you should do:

  1. Click on the “four squares” in the sidebar to open the Configuration menu.
  2. Click on “Manager”.
  3. Click on “Import”.
  4. Set the Grafana dashboard id to 6083
  5. Click on “Load”

After you navigate to the new dashboard, you will see something like this:

Database Grafana dashbaord

Starting the application should be enough to get some data on the dashboard. Let’s talk about each panel.

Max, Min and Timeout

Number of connections panel

In the above view, we can find three metrics:

  • Max: Maximum available connections on the cache
  • Min: Minimum available connections on the cache.
  • Total Timeout: How many threads timeout because there were not a available connection for them.

If Max and Min are equals, it means, there will always be a fixed amount available of connections.

NOTE: In the case of a Hiraki pool, it is better to have Max and Min equal (default). Also, by default, Hiraki pool defines 10 connections.

CONNECTIONS

Connections panel

The connections view shows how the connections behave through time, moving between the following three states:

  • Active: A connection is active when it is used by a thread of the application, which means, the connection is outside the pool and cannot be used by any other thread
  • Idle: A connection is idle when the connection is inside the pool waiting for a thread to be used.
  • Pending threads: A pending thread is a thread that requires a connection but couldn’t find any inside the pool
Active, Idel and Pending threads

CONNECTIONS TIME

Connections time panel

The connections time view shows us how the connections behave through time in terms of operations that take time to complete, as follows:

  • Usage: When a connection is active, that means, it is being used by a thread. This metric shows how much time the connection was used by the thread
  • Creation: A connection takes time to be created. This metrics show how much it took to create them.
  • Acquire: This metrics shows the time a thread needed to acquire a idle connection and move it to active connection.
Usage, creation and acquire times

Assessing a Connection Leak

What if a database connection keeps a long time attached to a thread in an active state?

When a new thread requires a database connection, the pool gives it to him, and that new connection could again keep a long time attached to the new thread. If this process repeats by each thread, will be a moment when the connection pool won’t have any idle connections to give, and the next thread requiring a connection will need to wait (a pending thread) until a connection is released from a different thread, as you see in the following image:

Pending threads increasing

As more requests arrive at the application, and each request starts a new thread that requires a new database connection, the pending threads will start to accumulate, the requests will start to wait and the performance of the application will degrade.

Let’s see how we can use Spring Boot metrics plus Grafana to find this problem.

A New Endpoint on the Test App

Let’s add a new endpoint to the Spring Boot test app:

  @GetMapping("/customers")
  public List getCustomers(){
    return customerRepository.findAll();
  }

  @GetMapping("/customers/transform")
  public List getCustomersTransform()
          throws InterruptedException {
    List customers = customerRepository.findAll();

    Thread.sleep(5000); //slow operation

    return customers;
  }

We added a new endpoint /customer/transform, this executes the same operation that we have on /customer, but we added a slow operation after we access the database at line 10. This slow operation simulates the following:

  • A slow external network call
  • A slow file system access
  • A slow algorithm

Modifying the Load Test

Let’s add the new endpoint to the Gatling load tests we had before:

public class CustomersSimulation extends Simulation {
  HttpProtocolBuilder httpProtocol = http
          .baseUrl(
                  "http://localhost:8080");

  ScenarioBuilder scn = scenario("Constant requests")
          .exec(http("request_1")
                  .get("/customers"))
          .exec(http("request_2")
                  .get("/customers/transform"));

  {
    setUp(scn.injectClosed(rampConcurrentUsers(5)
                    .to(100)
                    .during(160))
            .protocols(httpProtocol));
  }
}

There, we added the new endpoint /customers/transform, plus, we changed the load tests configuration to the following: Ramp up concurrent users, from 5 to 100, in a 2 minutes period.

Running the Load Test and Assessing the Results

After running the load tests again, we can see the following Grafana graphs, starting with connections:

The behavior of active, idle and pending threads with load

Before the test started, the idle connections were 10 (the available ones). After the test started, idle connections transformed to active connections, and as the load increases, the app reached a point where no more idle connections are available, therefore, pending threads started to wait for free connections. Pending threads accumulate over time and start dropping (they might timeout?). The behavior is weird, the load is not that much to break the app.

Now, let’s see the connections time graph:

The behavior of usage, creation and acquire time with load

There, we see interesting things: first, the usage time max is huge, 5 seconds, also, we see how the usage time through time is at least 1 second (pretty coincidence with the Thread.sleep of 5 seconds we have).

Second, the acquire time max was 22 seconds !!, which means, the pending threads took 22 seconds to grab a connection from the pool in the worst-case scenario. That is terrible because it means a request holds for 22 seconds before responding (customers are not happy for sure).

Finally, let’s see the Gatling report:

We see there that some requests are KO (failed), also, the 99th percentile response time is 30 seconds !!, that means, most of the requests are under that time, the 50th percentile is under 12 seconds !!, that means, the 50 % of the request, their response time was below 12 seconds. These numbers are terrible, customers unhappy for sure.

NOTE: By default, Hikari has no connection leak detection. You can activate it, setting spring.datasource.hikari.leak-detection-threshold, with seconds. This will throw an exception if the connection is outside the pool more than that value. The following is an example of that exception:

java.lang.Exception: Apparent connection leak detected
	at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:128) ~[HikariCP-4.0.3.jar:na]

After this analysis, we can conclude that for that load, something is wrong with our application, most probably, a connection leak. The question now is, how? the database access queries are pretty simple, but for sure, the Thread.sleep (slow operation not related to database access) has something to do here.

Seems like the database connections are active for more time than we think they are, which means, a starving of connections if the load increases. Let’s discuss why.

Finding a Connection Leak

If we think the database connections are active more time than they need to, a good candidate for this problem is open-in-view configuration or transactions, let’s talk about them.

ORMs and Lazy Loading

ORMs allow lazy loading data on-demand, this means, after you retrieve an Entity from the database, that Entity can still be connected to the database, and execute commands/operations against it using the Entity methods.

For instance, if you have a Customer entity, and inside you have a list of Orders, you can do the following:

Customer customer = orm.findById(....);

List orders = customer.getOrders();

If you setup the Customer orders property to be lazy loaded, when you execute the customer.getOrders(), the ORM is going to access the database and retrieve the orders. This means, if you use the Customer entity on a Controller, you will be accessing the database from the controller without realizing it.

For the ORM to be able to do this through the whole application, it needs to HOLD a connection for the whole life of an HTTP request/response.

For more details about this pattern, check this out.

The spring.jpa.open-in-view Spring Boot Property

On Spring Boot, spring.jpa.open-in-view property allows to activate/deactivate the lazy loading through the whole application. Interesting enough, this property is active by default, which might explain the behavior we see in our application, and why the Thread.sleep is affecting the time a connection is active: if the connection is active when HTTP request begins, until a response is delivered, this means, the Thread.sleep is summing up an active time to that connection, holding it more time that we need to.

NOTE: There is an interesting discussion about if this property should be activated by default or not. You can take a look here.

If you don’t need lazy loading access, deactivating this property should be the best.

Transactions and Active Connections

If you have transactions on your application, for instance, using @Transactional, you need to be aware that a connection will be active for the whole transaction (similar effect like the open-in-view, but, with a narrow scope).

As removing transactions might not be valid for your use cases, we should try to have them as narrow as possible to avoid an active connection holding much time.

Fixing the Connection Leak

As our application doesn’t use transactions, we should take care of the open-in-view property, as follows:

spring:
  application:
    name: spring-observability
  datasource:
    hikari:
      leakDetectionThreshold: 2000
  jpa:
    open-in-view: false

Also, we added a leak detection of 2 seconds. (if a connection is active for more than 2 seconds, and an exception is thrown).

Running again the load tests, we have:

The behavior of active, idle and pending threads on load after the fix

Interesting enough, there is not graphic movements, seems like the connections moved from one state to another so fast, the graph cannot render it, which is good. Let’s see connection time:

The behavior of usage, creation and acquire time on load after the fix

There we see how fast the usage time is getting: 4 ms tops, which means, the connections got released so fast, and that explains why the previous graph doesn’t show information at all.

Regarding the Gatling report:

Gatling report after the fix

We have a pretty good improvement:

  • All requests succeed.
  • 3424 requests processed instead of 818 before
  • 99th percentil is 5 seconds instead of 30 before
  • 50th percentil is 478 milliseconds instead of 12 seconds before
  • No connection leak errors.

Only with disabling open-in-view, we improve the application performance considerably.

Final Thought

Observability is key to being able to assess problems like connection leaks, trying to find how to fix these problems blind, is going to be hard.

Sometimes performance problems are related to how we setup our application, so, we should be conscious about what framework we are using, and what it does for us on behind.

Trust the framework, but, don’t be blind on how you setup it and use it.

2 comments

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s