Solution summary for Docker dropping connections after 15 min


Hey guys,

I would like to summarize the problem we had with dropped TCP connections and the solution we implemented, for other developers facing the same issue.

Problem

Most systems drop idle TCP connections after 2h, but Docker (maybe Swarm) actually drops them after only 15 min.
Services or clients that open a connection to Core without frequent activity get their connections dropped after 15 min.
The first issue we had was knowing that the connection had actually dropped. The server was getting an error around 2h after the drop, but clients never got any error until they tried to use the connection. The error returned on the server was also not very explicit: transport is closing.

Solution

We then found out about the gRPC keepalive options. The full config is available here.

type ServerParameters struct {
	// After a duration of this time if the server doesn't see any activity it
	// pings the client to see if the transport is still alive.
	Time time.Duration // The current default value is 2 hours.
	// ... other fields elided ...
}

As you can see, the default value for ServerParameters.Time is 2h, much higher than Docker’s 15 min.

Changing ServerParameters.Time to 1 min solves the problem: the server pings clients every minute, which prevents Docker from dropping the connection.

In short: ServerParameters.Time < Docker idle connection config (15 min)
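
For reference, here is a minimal sketch of how this can be wired up on the server side (the listener address and service registration are illustrative, not from our actual setup):

package main

import (
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	lis, err := net.Listen("tcp", ":50051")
	if err != nil {
		log.Fatalf("failed to listen: %v", err)
	}

	srv := grpc.NewServer(
		// Ping idle clients every minute so the connection never stays
		// silent long enough for Docker's 15 min idle timeout to kick in.
		grpc.KeepaliveParams(keepalive.ServerParameters{
			Time: 1 * time.Minute, // default is 2 hours
		}),
	)

	// register your gRPC services here

	if err := srv.Serve(lis); err != nil {
		log.Fatalf("failed to serve: %v", err)
	}
}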

Going further

We also wanted clients to continuously ping the server so they get an error as soon as the connection drops, instead of only discovering it the next time they actually use the connection.

The keepalive options that gRPC provides also include a configuration for the client:

type ClientParameters struct {
	// After a duration of this time if the client doesn't see any activity it
	// pings the server to see if the transport is still alive.
	Time time.Duration // The current default value is infinity.
	// ... other fields elided ...
}

By default, ClientParameters.Time is set to never ping the server, so we started by setting it to 1 min as well (same as the server).

Clients were now actively pinging the server every minute, but the server was also pinging the clients every minute (one ping in each direction every minute). This is not normal behavior: only one side should ping and the other should respond, not both all the time.
Setting the client value to 2 min resolved this: the server pings every minute and the client only replies to those pings. If the connection drops, the client pings the server after 2 min and gets an error.

In short: ClientParameters.Time > ServerParameters.Time.
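
A rough sketch of the client side (the target address is illustrative, and it assumes the same imports as the server example; the 2 min value is refined further below):

conn, err := grpc.Dial(
	"core:50051",
	grpc.WithInsecure(),
	// Ping the server after 2 min of inactivity so a dropped connection
	// is detected without waiting for the next real RPC.
	grpc.WithKeepaliveParams(keepalive.ClientParameters{
		Time: 2 * time.Minute, // must be > ServerParameters.Time
	}),
)
if err != nil {
	log.Fatalf("failed to dial: %v", err)
}
defer conn.Close()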


More problematic: after around 8 min, clients were again getting the transport is closing error! After running the clients with GODEBUG=http2debug=2, we saw that the actual HTTP/2 error was GOAWAY - ENHANCE_YOUR_CALM - too_many_pings.

This error was caused by the EnforcementPolicy.MinTime gRPC configuration, which defaults to 5 min:

// EnforcementPolicy is used to set keepalive enforcement policy on the
// server-side. Server will close connection with a client that violates this
// policy.
type EnforcementPolicy struct {
	// MinTime is the minimum amount of time a client should wait before sending
	// a keepalive ping.
	MinTime time.Duration // The current default value is 5 minutes.
	// ... other fields elided ...
}

Clients should not ping the server more often than this option allows. Setting ClientParameters.Time to the same value as EnforcementPolicy.MinTime (5 min) solves the issue.

In short: ClientParameters.Time >= EnforcementPolicy.MinTime
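
If you want this limit to be explicit on the server side, the enforcement policy can be set alongside the keepalive parameters. This is only a sketch; the 5 min value simply restates the gRPC default:

srv := grpc.NewServer(
	grpc.KeepaliveParams(keepalive.ServerParameters{
		Time: 1 * time.Minute,
	}),
	// Close connections from clients that ping more often than every
	// 5 minutes (this restates the default; lower it if clients must
	// ping more frequently).
	grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
		MinTime: 5 * time.Minute,
	}),
)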

Summary

ServerParameters.Time < Docker idle connection config (15 min)
ClientParameters.Time > ServerParameters.Time
ClientParameters.Time >= EnforcementPolicy.MinTime

Here are the properties and the values used:

| Property                  | Default  | Value used for solution |
|---------------------------|----------|-------------------------|
| ServerParameters.Time     | 2h       | 1 min                   |
| EnforcementPolicy.MinTime | 5 min    | 5 min                   |
| ClientParameters.Time     | Infinity | 5 min                   |
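
Putting it all together, the final configuration looks roughly like this (addresses and dial options are illustrative):

// Server side
srv := grpc.NewServer(
	grpc.KeepaliveParams(keepalive.ServerParameters{
		Time: 1 * time.Minute, // ping idle clients well before Docker's 15 min timeout
	}),
	grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
		MinTime: 5 * time.Minute, // default value, shown for completeness
	}),
)

// Client side
conn, err := grpc.Dial(
	"core:50051",
	grpc.WithInsecure(),
	grpc.WithKeepaliveParams(keepalive.ClientParameters{
		Time: 5 * time.Minute, // >= EnforcementPolicy.MinTime and > ServerParameters.Time
	}),
)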
