Tuesday, August 12, 2014

HBase client response times

By Lars Hofhansl

When talking about latency expectations for HBase you'll hear various anecdotes, war stories, and horror stories, with reported latencies ranging from a few milliseconds to many minutes.

In this blog post I will classify the latency conditions and their causes.

There are five principal causes of latency in HBase:

  1. Normal network round trip time (RTT) plus internal work that HBase has to do to retrieve or write the data. This type of latency is in the order of milliseconds - RTT + disk seeks, etc.
  2. Client retries due to moved regions or splits. Regions can be moved when HBase decides that the cluster is not balanced. As regions grow in size they are split.
    The HBase client rides over this by retrying a few times. Make sure you have the client retry count and interval configured according to your needs (see the first sketch after this list). This can take in the order of one second. Note that this is independent of the size of the region, as no data is actually moved - it remains in its location in HDFS - only ownership is transferred to another region server.
  3. GC (see also hbase-gc-tuning-observations). This is an issue with nearly every Java application, and HBase is no exception. A well-tuned HBase install can keep this below 2s at all times, with average GC times around 50ms (this is for large heaps of 32GB).
  4. Server outages. This is where large tail-end latency is caused. When a region server dies it takes - by default - 90s to detect, then that server's regions are reassigned to other servers, which then have to replay the logs and bring the regions online. Depending on the amount of uncommitted data the region server had, this can take minutes. So in total this is in the order of a few minutes - if a client requests any data for any of the involved regions. Interactions with other region servers are not impacted.
  5. Writers overwhelming the cluster. Imagine a set of clients trying to write data into an HBase cluster faster than the cluster's hardware (network for 3-way replication, or disks) can absorb it. HBase can buffer some spikes in memory (the memstore), but under sustained write load it eventually has to force the clients to stop. In 0.94 a region server will simply block any writers for a configurable maximum amount of time (90s by default). In 0.96 and later the server throws a RegionTooBusyException back to the client, and the client then has to retry until HBase has enough resources to accept the request (see the second sketch after this list). See also about-hbase-flushes-and-compactions. Depending on how fast HBase can compact the excess HFiles, this condition can last minutes.
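
For case 2, here is a minimal sketch of tuning the client-side retry settings mentioned above. The property names (hbase.client.retries.number, hbase.client.pause, hbase.rpc.timeout, hbase.ipc.client.tcpnodelay) are standard HBase client settings, but the values shown are purely illustrative and should be chosen for your own latency budget; the HTable-based API matches the 0.94/0.96 clients discussed here.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;

    public class RetryTunedClient {
        public static HTable openTable(String tableName) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            // How many times the client retries an operation, e.g. after a region moved.
            conf.setInt("hbase.client.retries.number", 10);
            // Base sleep in ms between retries; the effective sleep grows with each attempt.
            conf.setLong("hbase.client.pause", 100);
            // Upper bound in ms for a single RPC to a region server.
            conf.setInt("hbase.rpc.timeout", 10000);
            // Disable Nagle's algorithm for lower per-request latency (see below).
            conf.setBoolean("hbase.ipc.client.tcpnodelay", true);
            return new HTable(conf, tableName);
        }
    }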
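
For case 5, this is a rough sketch of a writer that backs off when the server reports it is too busy, matching the 0.96+ behavior described above. Depending on the version and the client's own retry settings, a RegionTooBusyException may only surface after the client's internal retries are exhausted (possibly wrapped in another exception), so take this as an illustration of the backoff idea rather than exact exception handling for every release; the backoff values are made up.

    import java.io.IOException;
    import org.apache.hadoop.hbase.RegionTooBusyException;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;

    public class BackoffWriter {
        // Retry a single put with exponential backoff while the region is reported too busy.
        public static void putWithBackoff(HTable table, Put put)
                throws IOException, InterruptedException {
            long sleepMs = 100;
            while (true) {
                try {
                    table.put(put);
                    return;
                } catch (RegionTooBusyException e) {
                    // The region is still flushing/compacting; wait and try again,
                    // doubling the sleep up to a cap so we don't hammer the server.
                    Thread.sleep(sleepMs);
                    sleepMs = Math.min(sleepMs * 2, 10000);
                }
            }
        }
    }
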
All of these cases refer to delays caused by HBase itself. Scanning large regions or writing a lot of data in bulk-write requests naturally has to adhere to physics: the data needs to be loaded from disk - potentially from other machines - or it needs to be written across the network to three replicas. The time needed here depends on the particular network/disk setup.

The gist is that when things are smooth (and you have disabled Nagle's algorithm, i.e. enabled tcpnodelay) you'll see latencies of a few ms if the data is in the blockcache, or < 20ms or so when disk seeks are needed.
The 99th percentile will include GCs, region splits, and region moves, and you should see something around 1-2s.
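
To see where your own setup falls within these ranges, a crude but useful check is to time individual Gets from the client and look at the percentiles. This sketch assumes an already opened HTable and a column family "f" with qualifier "q" - those names, and the idea that rowKeys is a non-empty sample of existing rows, are just illustrative assumptions.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.util.Bytes;

    public class GetLatencySampler {
        // Time a batch of Gets and print rough p50/p99 latencies in milliseconds.
        public static void sample(HTable table, List<byte[]> rowKeys) throws IOException {
            List<Long> latenciesMs = new ArrayList<Long>();
            for (byte[] row : rowKeys) {
                Get get = new Get(row);
                get.addColumn(Bytes.toBytes("f"), Bytes.toBytes("q"));
                long start = System.nanoTime();
                table.get(get);
                latenciesMs.add((System.nanoTime() - start) / 1000000L);
            }
            Collections.sort(latenciesMs);
            System.out.println("p50: " + latenciesMs.get(latenciesMs.size() / 2) + " ms");
            System.out.println("p99: " + latenciesMs.get((int) (latenciesMs.size() * 0.99)) + " ms");
        }
    }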

In the event of failures such as failed region servers or overwhelming the cluster with too many write requests, latency can go as high as a few minutes for requests to the involved regions.

2 comments:

  1. Hi, found your blog post from long ago, very useful info. I'm not sure if you're still active but wanted to ask if you have any experience with similar response time spikes during major compactions (whether run explicitly or regular minor compactions that got upgraded to major ones).

    Thanks.
