Google's gRPC is an RPC system that supports many languages and is fairly widely used, probably because it is used for parts of Docker and Kubernetes. I think gRPC is mostly fine, but it is surprisingly easy to break by misconfiguring it. Part of that is because RPC systems are challenging to get right: they need to work for a wide variety of services, from high request rate services that handle thousands of tiny requests each second, to services that transfer a lot of data, to services with thousands of concurrent, slow requests that take minutes to complete. As a result, an RPC system like gRPC needs to be very configurable. Unfortunately, it is also pretty easy to configure it in a way that causes hard-to-understand errors.
The rest of this blog post describes two annoying edge cases I ran into recently. I wasted about a day debugging and understanding each of them. Mostly I'm hoping that if someone else runs into these errors, they will find this article and save themselves that time. I'm also hopeful that the gRPC team will eventually make the library easier to use, by improving the error messages and documenting best practices.
gRPC is designed to reuse TCP connections for multiple requests. However, many networks terminate connections that are idle for too long. For example, the AWS NLB TCP load balancer has a 350 second idle timeout. TCP has an optional keepalive mechanism, which periodically sends an empty packet to keep a connection active. It is enabled by default on Linux, but with a 2 hour delay before the first keepalive packet is sent. That is useful for cleaning up long-dead connections, but far too slow to keep connections alive through NATs or load balancers. Go configures TCP keepalives to every 15 seconds by default, which should be often enough to keep the network connection alive.
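For reference, this is roughly how the TCP keepalive interval is controlled on a plain Go connection (a minimal sketch; the address is a placeholder, and 15 seconds is just the net package's current default made explicit):

package main

import (
	"log"
	"net"
	"time"
)

func main() {
	// net.Dialer enables TCP keepalives by default; setting KeepAlive
	// explicitly documents the probe interval instead of relying on
	// the library default (currently 15 seconds).
	dialer := net.Dialer{KeepAlive: 15 * time.Second}

	conn, err := dialer.Dial("tcp", "example.com:443") // placeholder address
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}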
Unfortunately, TCP keepalives are invisible to applications and may not be supported by some operating systems. As a result, gRPC has its own keepalive pings. However, the gRPC client-side keepalive specification itself says: "client authors must coordinate with service owners for whether a particular client-side setting is acceptable". If a client sends keepalive pings more often than the server permits, the server closes the connection. The intention is to prevent large numbers of idle clients from consuming too many server resources. What this means in practice is that if the client and server are accidentally configured with mismatched settings, you will occasionally see dropped RPCs.
In my opinion, this means the client keepalive time can only safely be set to its default value: 5 minutes. Unfortunately, that is too long for some networks. For example, Azure's TCP load balancer drops idle connections after 4 minutes by default, although it can be configured to wait longer. If you do want to deploy a shorter gRPC keepalive time in a running system, you have to be extremely careful: first deploy all servers to permit the more frequent pinging, then deploy the clients. To undo it, do the opposite: first deploy the clients to ping less often, then deploy the servers. If you get this order wrong, or forget to configure one side, you get intermittent "transport is closing" errors. I wrote a long bug report asking for the gRPC documentation to be improved, but that did not happen because the gRPC maintainers disagreed with me.
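To make the coordination concrete, here is a minimal sketch using grpc-go's keepalive package. The 1 minute interval is an arbitrary example of a "shorter" setting; the point is that the server's MinTime must already permit it before any client starts pinging that often:

package rpcconfig

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// Server side: deploy this first, so that clients pinging every minute
// are not treated as abusive.
var serverOptions = []grpc.ServerOption{
	grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
		MinTime:             1 * time.Minute, // allow pings this frequent
		PermitWithoutStream: true,            // allow pings on idle connections
	}),
}

// Client side: deploy this second. Time must not be shorter than the
// server's MinTime, or the server will close the connection.
var clientOptions = []grpc.DialOption{
	grpc.WithKeepaliveParams(keepalive.ClientParameters{
		Time:                1 * time.Minute,
		Timeout:             20 * time.Second,
		PermitWithoutStream: true,
	}),
}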
For others who run into this error: when the client keepalive is too aggressive, client RPCs will fail with gRPC code UNAVAILABLE (14) and the message "transport is closing". The solution is to remove the client keepalive configuration. If you enable verbose gRPC logs, the client will log:
INFO: 2021/03/14 11:02:26 [transport] Client received GoAway with http2.ErrCodeEnhanceYourCalm.
The server will log an error like the following:
ERROR: 2021/03/14 11:02:26 [transport] transport: Got too many pings from the client, closing the connection.
Setting the gRPC client keepalive also has another important side effect: gRPC will turn on the TCP_USER_TIMEOUT socket option, which causes it to detect failed connections after 20 seconds. Without this, gRPC can keep using a dead connection for many minutes. I recommend that every gRPC client set the keepalive time to 5 minutes, to ensure this gets turned on without pinging servers too often.
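Concretely, this is the kind of client-only configuration I mean (a sketch; the target and the insecure credentials are placeholders). The 5 minute Time matches the server-side default enforcement minimum, so it should not require any server changes, and as I understand it the Timeout value is what gets used for TCP_USER_TIMEOUT:

package rpcclient

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func dial(target string) (*grpc.ClientConn, error) {
	return grpc.Dial(target,
		grpc.WithInsecure(), // placeholder: use real transport credentials
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			// 5 minutes matches the default server enforcement minimum,
			// so this does not need a coordinated server deploy.
			Time: 5 * time.Minute,
			// Also used as TCP_USER_TIMEOUT, so dead connections are
			// detected after about 20 seconds instead of many minutes.
			Timeout: 20 * time.Second,
		}),
	)
}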
If you return an error from a gRPC request, it consists of a status code, a status message (a Unicode string), and optional error details (undocumented, but supported by the library). So what happens when a server accidentally returns a really large error message? Unfortunately, the connection may get closed. In general, you can only return a maximum of about 7 kiB in your gRPC error message (3 kiB if you use the optional error details). That should be plenty. However, if an error message includes a variable-length data structure, the right request can cause this limit to be exceeded, which is how I ran into this problem. The solution is to return shorter error messages. I also added a gRPC server interceptor to truncate errors, to make sure I don't accidentally do this again.
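A rough sketch of such a truncating interceptor in Go might look like the following (the 1 kiB cap and the names are arbitrary choices for illustration):

package rpcserver

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/status"
)

// maxErrorMessageBytes is an arbitrary cap, well under the ~7 kiB limit.
const maxErrorMessageBytes = 1024

// truncateErrors is a unary server interceptor that shortens overly long
// status messages, so a single unlucky request cannot exceed the HTTP/2
// header size limits and kill the connection.
func truncateErrors(
	ctx context.Context,
	req interface{},
	info *grpc.UnaryServerInfo,
	handler grpc.UnaryHandler,
) (interface{}, error) {
	resp, err := handler(ctx, req)
	if s, ok := status.FromError(err); ok && len(s.Message()) > maxErrorMessageBytes {
		// Note: this drops any error details and may split a multi-byte
		// character, but it is good enough as a safety net.
		err = status.Error(s.Code(), s.Message()[:maxErrorMessageBytes]+" [truncated]")
	}
	return resp, err
}

// Registered with: grpc.NewServer(grpc.UnaryInterceptor(truncateErrors))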
The Go gRPC implementation limits error messages to 16 MiB by default. If you exceed this limit, the client will see one of the two errors below. The server logs nothing, because as far as it is aware, it returned the error correctly.
gRPC code=13 (Internal): peer header list size exceeded limit
gRPC code=14 (Unavailable): transport is closing
The C client limits error messages to 8 kiB. If you exceed this limit, the client will see one of two errors, depending on whether the server is a Go or C server.
code=StatusCode.RESOURCE_EXHAUSTED: received trailing metadata size exceeds limit
code=StatusCode.INTERNAL: Received RST_STREAM with error code 2
On the server, you will see an error like:
ERROR: 2021/03/05 10:13:18 [transport] header list size to send violates the maximum size (8192 bytes) set by client