Sunday, October 11, 2015

Gentle Introduction to Hystrix - Motivation

In the last few days I have been exploring the Netflix Hystrix library and have come to appreciate the features provided by this excellent library.

To quote from the Hystrix site:

Hystrix is a latency and fault tolerance library designed to isolate points of access to remote systems, services and 3rd party libraries, stop cascading failure and enable resilience in complex distributed systems where failure is inevitable.

There are a whole lot of keywords to parse here, however the best way to experience Hystrix in my mind is to try out a sample use case.

An unpredictable Service

Consider a service, an odd one, which takes a json message of the following structure and returns an acknowledgement:

    "payload": "Sample Payload",
    "delay_by": 0

The service takes in a payload, but additionally takes in two fields - delay_by which makes the service acknowledge a response after the delay in milliseconds and a "throw_exceptions" field which will result in an exception after the specified delay!

Here is a sample response:

 "received":"Sample Payload",
 "payload":"Reply Message"

If you are following along, here is my github repo with this sample, I have used Netflix Karyon 2 for this sample and the code which handles the request can be expressed very concisely the following way - see how the rx-java library is being put to good use here:

import rx.Observable;
import service1.domain.Message;
import service1.domain.MessageAcknowledgement;

import java.util.concurrent.TimeUnit;

public class MessageHandlerServiceImpl implements MessageHandlerService {

    private String replyMessage;

    public Observable<MessageAcknowledgement> handleMessage(Message message) {"About to Acknowledge");
        return Observable.timer(message.getDelayBy(), TimeUnit.MILLISECONDS)
                .map(l -> message.isThrowException())
                .map(throwException -> {
                    if (throwException) {
                        throw new RuntimeException("Throwing an exception!");
                    return new MessageAcknowledgement(message.getId(), message.getPayload(), replyMessage);


At this point we have a good candidate service which can be made to respond with an arbitrary delay and failure.

A client to the Service

Now onto a client to this service. I am using Netflix Feign to make this call, yet another awesome library, all it requires is a java interface annotated the following way:

package aggregate.service;

import aggregate.domain.Message;
import aggregate.domain.MessageAcknowledgement;
import feign.RequestLine;

public interface RemoteCallService {
    @RequestLine("POST /message")
    MessageAcknowledgement handleMessage(Message message);

It creates the necessary proxy implementing this interface using configuration along these lines:

RemoteCallService remoteCallService = Feign.builder()
        .encoder(new JacksonEncoder())
        .decoder(new JacksonDecoder())
        .target(RemoteCallService.class, "");

I have multiple endpoints which delegate calls to this remote client, all of them expose a url pattern along these lines - http://localhost:8888/noHystrix?message=Hello&delay_by=0&throw_exception=false, this first one is an example where the endpoint does not use Hystrix.

No Hystrix Case

As a first example, consider calls to the Remote service without Hystrix, if I were to try a call to http://localhost:8888/noHystrix?message=Hello&delay_by=5000&throw_exception=false or say to http://localhost:8888/noHystrix?message=Hello&delay_by=5000&throw_exception=true, in both instances the user request to the endpoints will simply hang for 5 seconds before responding.

There should be a few things immediately apparent here:

1. If the service responds slowly, then the client requests to the service will be forced to wait for the response to come back.

2. Under heavy load it is very likely that all threads handling user traffic will be exhausted, thus failing further user requests.

2. If the service were to throw an exception, the client does not handle it gracefully.

Clearly there is a need for something like Hystrix which handles all these issues.

Hystrix command wrapping Remote calls

I conducted a small load test using a 50 user load on the previous case and got a result along these lines:

---- Global Information --------------------------------------------------------
> request count                                         50 (OK=50     KO=0     )
> min response time                                   5007 (OK=5007   KO=-     )
> max response time                                  34088 (OK=34088  KO=-     )
> mean response time                                 17797 (OK=17797  KO=-     )
> std deviation                                       8760 (OK=8760   KO=-     )
> response time 50th percentile                      19532 (OK=19532  KO=-     )
> response time 75th percentile                      24386 (OK=24386  KO=-     )
> mean requests/sec                                  1.425 (OK=1.425  KO=-     )

Essentially a 5 second delay from the service results in a 75th percentile time of 25 seconds!, now consider the same test with Hystrix command wrapping the service calls:

---- Global Information --------------------------------------------------------
> request count                                         50 (OK=50     KO=0     )
> min response time                                      1 (OK=1      KO=-     )
> max response time                                   1014 (OK=1014   KO=-     )
> mean response time                                    22 (OK=22     KO=-     )
> std deviation                                        141 (OK=141    KO=-     )
> response time 50th percentile                          2 (OK=2      KO=-     )
> response time 75th percentile                          2 (OK=2      KO=-     )
> mean requests/sec                                 48.123 (OK=48.123 KO=-     )

Strangely the 75th percentile time now is 2 millseconds!, how is this possible, and the answer becomes obvious using the excellent tools that Hystrix provides, here is a Hystrix dashboard view for this test:

What happened here is that the first 10 requests timed out, anything more than a second by default times out with Hystrix command in place, once the first ten transactions failed Hystrix short circuited the command thus blocking anymore requests to the remote service and hence the low response time. On why these transactions were not showing up as failed, this is because there is a fallback in place here which responds to the user request gracefully on failure.


The purpose here was to set the motivation for why a library like Hystrix is required, I will follow this up with the specifics of what is needed to integrate Hystrix into an application and the breadth of features that this excellent library provides.

No comments:

Post a Comment