Sunday, January 16, 2022

Service to Service Call Pattern - Multi-Cluster Ingress

Multi-Cluster Ingress is a neat feature of Anthos and GKE (Google Kubernetes Engine): a user accessing an application that is hosted on multiple GKE clusters in different zones is directed to the cluster nearest to them!

For example, consider two GKE clusters, one in us-west1, based out of Oregon, USA, and another in europe-north1, based out of Finland. An application is installed on both clusters. A user accessing the application from the US will be led to the GKE cluster in us-west1, and a user coming in from Europe will be led to the GKE cluster in europe-north1. Multi-Cluster Ingress enables this easily!

Enabling Multi-Cluster Ingress

Alright, so how does this work?

Let me once again assume that I have two clusters available in my GCP project, one in the us-west1-a zone and another in europe-north1-a, and an app called "Caller" deployed to both clusters. For a single cluster, traffic from a user outside the cluster is typically brought in using an "Ingress".

This works great for a single cluster, but not for a set of clusters. A different kind of Ingress resource is required, one that spans GKE clusters, and this is where Multi-Cluster Ingress comes in.

Multi-Cluster Ingress is a Custom resource provided by GKE and looks something like this:

It is defined in one of the clusters, designated as a "config" cluster. 
See how there is a reference to "sample-caller-mcs" above. It points to a "MultiClusterService" resource, which is again a custom resource that works only in the context of a GKE project. The definition of such a resource looks almost like a Service, and here is the one for "sample-caller-mcs":

Now that there is a MultiClusterIngress defined pointing to a MultiClusterService, here is what happens under the covers:
1. A load balancer is created which uses an IP advertised using anycast - better details are here. This anycast IP helps get the request through to the cluster closest to the user.
2. A Network Endpoint Group (NEG) is created for every cluster that matches the definition of the MultiClusterService. These NEGs are used as the backends of the load balancer.

Sample Application

I have a sample set of applications and deployment manifests available here that demonstrates Multi-Cluster Ingress. There are instructions to go with it here. This brings up an environment which looks like this:

Simulating a request coming in from us-west1-a is easy for me since I am in the US. Another approach is to simply spin up an instance in us-west1-a and use that to make a request the following way:

And the "caller" invoked should be the one in us-west1-a, similarly if the request is made from an instance in europe-north1-a:

The "caller" invoked will be the one in europe-north1-a!!


This really boggles my mind, being able to spin up two clusters on two different continents, and having a request from the user directed to the one closest to them, in a fairly simple way. There is a lot going on under the covers, however this is abstracted out using the resource types of MultiClusterIngress and MultiClusterService. 

Tuesday, January 4, 2022

Service to Service call pattern - Using Anthos Service Mesh

Anthos Service Mesh makes it very simple for a service in one cluster to call a service in another cluster, and to do so securely, with fault tolerance and observability built in.

This is the fourth in a series of posts on service to service call patterns in Google Cloud.

The first post explored Service to Service call pattern in a GKE runtime using a Kubernetes Service abstraction

The second post explored Service to Service call pattern in a GKE runtime with Anthos Service mesh

The third post explored the call pattern across multiple GKE runtimes with Multi-Cluster Service

Target Call Pattern

There are two services deployed to two different clusters. The "caller" in "cluster1" invokes the "producer" in "cluster2".

Creating Clusters and Anthos Service Mesh

The entire script to create the clusters is here. The script:
1. Spins up two GKE standard clusters
2. Adds firewall rules to enable IPs in one cluster to reach the other cluster
3. Installs the service mesh on each of the clusters

Caller and Producer Installation

The caller and the producer are deployed using normal Kubernetes deployment descriptors; no additional special resource is required to get the set-up to work. For example, the caller's deployment looks like this:

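apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-caller-v1
  labels:
    app: sample-caller
    version: v1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sample-caller
      version: v1
  template:
    metadata:
      labels:
        app: sample-caller
        version: v1
    spec:
      serviceAccountName: sample-caller-sa
      containers:
        - name: sample-caller
          # image: not shown in the source
          ports:
            - containerPort: 8080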

Caller to Producer Call

The neat thing with this entire set-up is that from the caller's perspective a call continues to be made to the DNS name of a service representing the producer. So assuming that the producer's service is deployed to the same namespace, a DNS name of "sample-producer" should just work.

So with this in place, a call from the caller to producer looks something like this:

The call fails, with a message that the "sample-producer" host name in cluster1 cannot be resolved. This is perfectly okay as such a service has not been created in cluster1. Creating such a service:

resolves the issue and a call cleanly goes through!! This is magical, see how the service in cluster 1 resolves the pods in cluster2!

Additionally, the presence of the x-forwarded-client-cert header in the producer indicates that mTLS is being used during the call.

Fault Tolerance

So security via mTLS is accounted for; now I want to layer in some level of fault tolerance. This can be done by ensuring that calls time out instead of just hanging, and by not making repeated calls to the producer if it starts to be non-responsive. This is typically done using Istio configuration. Since Anthos Service Mesh is essentially a managed Istio, the configuration for the timeout looks something like this, using a VirtualService:

And the circuit breaker uses a DestinationRule, which looks like this:

All of it is just straight kubernetes configuration and it just works across multiple clusters.


The fact that I can treat multiple clusters as if they were a single cluster is, I believe, the real value proposition of Anthos Service Mesh. All the work around enabling such communication securely and with fault tolerance is what the mesh brings to the table.

My repository has all the samples that I have used for the post.

Thursday, December 30, 2021

Cloud Bigtable - Write and Retrieval

This is a quick write up based on a few days of experimentation with Cloud Bigtable, with the following objectives:

1. Using an emulator for local development

2. A high level schema design with retrieval patterns in mind

3. Finding records


Cloud Bigtable emulator provides a way to test the Bigtable functionality locally.  Setting up the emulator is easy and is described in this document. Assuming that the gcloud utility, which is a CLI to work with the Google Cloud resources, is available on the machine, then the following command should get the emulator in place:
gcloud components install bigtable
Once installed, the emulator can be started up using the following command:
gcloud beta emulators bigtable start --host-port=localhost:8086
This brings up the emulator at port 8086.

Working with the Emulator

Now that a local instance of Bigtable is up, working with it requires another utility called "cbt", which can be installed, again using gcloud, the following way:
gcloud components install cbt
A table called "hotels" to hold an entity modeled after a "Hotel", along with a column family called "hotel_details" to hold the details, can be created this way:
cbt -project "project-id" createtable hotels
cbt -project "project-id" createfamily hotels hotel_details
Now that the emulator and the cbt utility are available, let's start with a modeling exercise. Take this modeling exercise with a pinch of salt: my knowledge of Bigtable is evolving and the approach here will likely need heavy polishing.

Schema Design for an Entity

So my objective is to provide basic write and read functionality on a "Hotel" entity, described using a golang struct the following way:
type Hotel struct {
	Id      string
	Name    string
	Address string
	Zip     string
	State   string
}
To store such an entity in Bigtable, attention should be paid to how the data will ultimately be read. In my case, there are going to be two read patterns.
  1. Retrieval by Hotel's id field
  2. Retrieving a list of hotels by the zip code
Now, Bigtable supports only one index, called the "Row key". A single record can be retrieved using this row key, and a set of records can be retrieved using a prefix of the row key.

In my case it will be difficult to support retrieval by id AND retrieval by zip code using one row key, so my schema design is to have multiple records with different row keys for a single Hotel entity. Say for a Hotel which looks like this:

To support retrieval by id my row key looks something like this:
H/id#id1, with the data for the hotel set in different columns.

To support retrieval by zip code my row key looks like this:
H/Zip#OR-1/Id#id1. The data this time points to the row key of the actual data, which is H/id#id1, so the entire data for the hotel does not have to be duplicated. Given this row key, if all hotels with a zip code of OR-1 have to be retrieved, I can do it using a row key prefix of "H/Zip#OR-1" and then hydrate the information using the key stored in the data.
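
The two row key shapes can be captured in a few small helper functions. This is only a minimal sketch: the function names are mine and not from the repository, and the trailing "/" in the scan prefix is my own addition so that a scan for OR-1 does not also pick up a zip such as OR-10.

```go
package main

import (
	"fmt"
	"strings"
)

// idRowKey builds the row key under which the full hotel record is stored.
func idRowKey(id string) string {
	return fmt.Sprintf("H/id#%s", id)
}

// zipRowKey builds the row key of the index row that points back to the
// full record of a hotel in the given zip code.
func zipRowKey(zip, id string) string {
	return fmt.Sprintf("H/Zip#%s/Id#%s", zip, id)
}

// zipPrefix builds the prefix used to scan all index rows of a zip code.
// Assumption: the trailing "/" is added here to keep OR-1 from matching OR-10.
func zipPrefix(zip string) string {
	return fmt.Sprintf("H/Zip#%s/", zip)
}

// inZip reports whether an index row key belongs to the given zip code.
func inZip(rowKey, zip string) bool {
	return strings.HasPrefix(rowKey, zipPrefix(zip))
}

func main() {
	fmt.Println(idRowKey("id1"))
	fmt.Println(zipRowKey("OR-1", "id1"))
	fmt.Println(inZip(zipRowKey("OR-1", "id1"), "OR-1"))
	fmt.Println(inZip(zipRowKey("OR-10", "id9"), "OR-1"))
}
```

Keeping the key construction in one place makes it harder for the write path and the read path to drift apart.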

So with this storing the information of a real hotel into Bigtable and querying it back looks like this in raw form:

  hotel_details:address                    @ 2021/12/29-20:53:30.816000
    "525 SW Morrison St, Portland"
  hotel_details:id                         @ 2021/12/29-20:53:30.816000
  hotel_details:name                       @ 2021/12/29-20:53:30.816000
    "The Nines"
  hotel_details:state                      @ 2021/12/29-20:53:30.816000
  hotel_details:zip                        @ 2021/12/29-20:53:30.816000
  hotel_details:key                        @ 2021/12/29-20:53:30.816000
This works quite well. I am not entirely sure if this is optimal though; I will revisit the approach once I have gained a little more experience with Bigtable.

Retrieving by Zip Code

Assuming that a bunch of Hotels are present in the database with this schema design, a retrieval by zip code looks like this in golang:

func findHotels(table *bigtable.Table, ctx context.Context, zip string) ([]types.Hotel, error) {
	searchPrefix := fmt.Sprintf("H/Zip#%s", zip)
	var keys []string
	var hotels []types.Hotel
	// First pass: scan the index rows for the zip code and collect the data row keys.
	err := table.ReadRows(ctx, bigtable.PrefixRange(searchPrefix),
		func(row bigtable.Row) bool {
			keys = append(keys, keyFromRow(row))
			return true
		})

	if err != nil {
		return nil, fmt.Errorf("error in searching by zip code: %v", err)
	}

	// Second pass: fetch all the data rows in a single batched read.
	err = table.ReadRows(ctx, bigtable.RowList(keys), func(row bigtable.Row) bool {
		hotels = append(hotels, hotelFromRow(row))
		return true
	})
	if err != nil {
		return nil, fmt.Errorf("error in retrieving by keys: %v", err)
	}
	return hotels, nil
}
The code starts by generating the search prefix, which has a pattern of "H/Zip#zipcode" and retrieves the id from the retrieved records, and then batches a call to the table with the retrieved id's to get the details.
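
To see the two-step lookup without a Bigtable instance, here is a rough in-memory sketch of the same idea. The "store" type is a stand-in I made up for a table (row key to columns), the second hotel is invented sample data, and only a subset of the Hotel fields is used; none of this is the Bigtable client API.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Hotel mirrors a subset of the struct used in the post.
type Hotel struct {
	Id   string
	Name string
	Zip  string
}

// store is an in-memory stand-in for a table: row key -> column -> value.
type store map[string]map[string]string

// put writes the data row and the zip index row for a hotel.
func (s store) put(h Hotel) {
	dataKey := "H/id#" + h.Id
	s[dataKey] = map[string]string{"id": h.Id, "name": h.Name, "zip": h.Zip}
	// The index row only stores the key of the data row.
	s["H/Zip#"+h.Zip+"/Id#"+h.Id] = map[string]string{"key": dataKey}
}

// findByZip scans the index rows by prefix, then fetches each data row.
func (s store) findByZip(zip string) []Hotel {
	prefix := "H/Zip#" + zip
	var keys []string
	for rowKey, cols := range s {
		if strings.HasPrefix(rowKey, prefix) {
			keys = append(keys, cols["key"])
		}
	}
	// A Go map has no ordering, so sort to get a stable result.
	sort.Strings(keys)
	hotels := make([]Hotel, 0, len(keys))
	for _, k := range keys {
		row := s[k]
		hotels = append(hotels, Hotel{Id: row["id"], Name: row["name"], Zip: row["zip"]})
	}
	return hotels
}

func main() {
	s := store{}
	s.put(Hotel{Id: "id1", Name: "The Nines", Zip: "OR-1"})
	s.put(Hotel{Id: "id2", Name: "Second Hotel", Zip: "WA-2"}) // invented sample data
	for _, h := range s.findByZip("OR-1") {
		fmt.Println(h.Id, h.Name)
	}
}
```

The index rows stay small because they carry only a pointer to the data row, at the cost of a second read to hydrate the results.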


It may be easier to follow this along with real code, which is available in my GitHub repository here. It has samples that write to Bigtable and then retrieve from it.

Monday, December 20, 2021

Service to Service call patterns - Multi-cluster Service

This is the third blog post in a series exploring service to service call patterns in different application runtimes in Google Cloud.

The first post explored Service to Service call pattern in a GKE runtime using a Kubernetes Service abstraction

The second post explored Service to Service call pattern in a GKE runtime with Anthos Service mesh

This post will explore the call pattern across multiple GKE runtimes with Multi-Cluster Service providing a way for calls to be made across clusters.

Mind you, the preferred way for service to service call ACROSS clusters is using Anthos Service Mesh, which will be covered in the next blog post, however Multi-Cluster service is also a perfectly valid approach in the absence of Anthos Service Mesh.

Target Architecture

A target architecture that I am aiming for is the following:

Here two different applications are hosted on two separate Kubernetes clusters in different availability zones, and the Service (called "Caller") in one cluster invokes the Service (called "Producer") in another cluster.

Creating the Cluster with Multi-Cluster Services

The steps to bring up 2 clusters and enable Multi-cluster Services are described in this document.

Services Installation

Assuming that the 2 GKE clusters are now available, the first cluster holds the Caller and an Ingress Gateway to enable the UI of the caller to be accessible to the user. This is through a deployment descriptor which looks something like this for the caller:

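apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-caller-v1
  labels:
    app: sample-caller
    version: v1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sample-caller
      version: v1
  template:
    metadata:
      labels:
        app: sample-caller
        version: v1
    spec:
      serviceAccountName: sample-caller-sa
      containers:
        - name: sample-caller
          # image: not shown in the source
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8080
          securityContext:
            runAsUser: 1000
          resources:
            requests: # assumed; the source does not show whether this was requests or limits
              memory: "256Mi"
          livenessProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
            initialDelaySeconds: 3
            periodSeconds: 3
          readinessProbe:
            httpGet:
              path: /actuator/health/readiness
              port: 8080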
I have reproduced the entire yaml just for demonstration, there is nothing that should stand out in the file.

Along the same lines the Producer application is deployed to the second cluster.

Caller to Producer call - Using Multi-Cluster Services

The right approach to getting service to service calls working across clusters is to use a feature of Anthos called Multi-cluster Services, which is described in detail in this blog post and this how-to post.

The short of it is that if a "ServiceExport" resource is defined in cluster 2 and the same namespace exists in cluster 1, then the Service is resolved using a host name of the form "service-name.namespace.svc.clusterset.local". In my case this maps to "sample-producer.istio-apps.svc.clusterset.local". The ServiceExport resource looks something like this:

apiVersion: net.gke.io/v1
kind: ServiceExport
metadata:
  namespace: istio-apps
  name: sample-producer
This is the only change that I have to make to the caller: instead of calling the Producer using "sample-producer", it now uses the host name "sample-producer.istio-apps.svc.clusterset.local". Everything resolves cleanly and the call continues to work across clusters.
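
On the caller side the change amounts to building a different host name. A minimal sketch in Go, assuming a helper function named clusterSetHost (my name, not from the sample repository) and a hypothetical port of 8080 for the producer:

```go
package main

import "fmt"

// clusterSetHost returns the host name under which a Service exported with a
// ServiceExport can be reached from other clusters.
func clusterSetHost(service, namespace string) string {
	return fmt.Sprintf("%s.%s.svc.clusterset.local", service, namespace)
}

func main() {
	host := clusterSetHost("sample-producer", "istio-apps")
	// The port is an assumption made for this illustration.
	fmt.Printf("http://%s:8080/\n", host)
}
```

In practice the host name would more likely come from configuration, so that the same caller image works with either the in-cluster name or the clusterset name.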

View from the caller:

View from the Producer:


I hope this clarifies to some extent how service to service call can be enabled across multiple clusters, even across regions. 

There are a few small catches. For example, getting mutual TLS to work across clusters is not easy. This is cleanly solved when using Anthos Service Mesh and will be detailed in the next blog post.

Thursday, November 18, 2021

Service to Service call patterns - GKE with Anthos Service Mesh on a single cluster

This is the second in a series of posts exploring service to service call patterns in some of the application runtimes on Google Cloud. The first in the series explored service to service call patterns in GKE.

This post will expand on it by adding in a Service Mesh, specifically Anthos Service Mesh, and explore how the service to service patterns change in the presence of a mesh. The service to service call will be across services in a single cluster. The next post will explore services deployed to multiple GKE clusters.


The steps to set up a GKE cluster and install Anthos Service Mesh on top of it are described in this document. In brief, these are the commands that I had to run in my GCP project to get a cluster running:

If the installation of the cluster and the mesh has run through cleanly, a good way to verify the installation is to see if the cluster gets registered as an Anthos managed cluster in the Google Cloud Console.

The services that I will be installing are fairly simple and look like this:

Using a UI, the caller can make the producer behave in certain ways:
  • Introduce response time delays
  • Respond with certain status codes
This will help check how the mesh environment will behave in the face of these behaviors.

The codebase for the "caller" and "producer" is in this repository, along with Kubernetes manifests to bring up these services.

Behavior 1 - Mutual TLS

The first behavior that I want to see is for the caller and the producer to verify each other's identities by presenting and validating their certificates.

This can be done by adding an Istio DestinationRule for the producer, along these lines:

This also adds a DestinationRule for the caller, because the caller gets the call from the browser via an Ingress Gateway and even this call needs to be authenticated using mTLS.

Alright, now that the set-up is in place, the following is what gets captured as the request flows from the Browser to the Ingress Gateway to the Caller to the Producer.

The sign that mTLS works is the "x-forwarded-client-cert" header, which shows up both in the Caller's headers coming in from the Ingress Gateway and in the Producer's headers coming in from the Caller.
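
An application can make the same check in code by looking for that header on an incoming request. The snippet below is a small sketch in Go; the function name hasClientCert and the sample header value are made up for illustration, and the real value set by the sidecar carries more fields.

```go
package main

import (
	"fmt"
	"net/http"
)

// hasClientCert reports whether the request headers carry the
// x-forwarded-client-cert header that shows up when mTLS is in use.
func hasClientCert(h http.Header) bool {
	// Header.Get is case-insensitive with respect to the name passed in.
	return h.Get("x-forwarded-client-cert") != ""
}

func main() {
	withMesh := http.Header{}
	// Illustrative value only, not captured from a real call.
	withMesh.Set("x-forwarded-client-cert", "By=spiffe://example.org/ns/apps/sa/producer;URI=spiffe://example.org/ns/apps/sa/caller")

	fmt.Println(hasClientCert(withMesh))
	fmt.Println(hasClientCert(http.Header{}))
}
```

Inside an HTTP handler the same function can be called with r.Header.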

Behavior 2 - Timeout

The second behavior that I want to explore is the timeouts. A request timeout can be set for the call from the Caller to Producer by creating a Virtual Service for the Producer with the value set, along these lines:

With this configuration in place, a request from the caller with a delay of 6 seconds causes the mesh to time out and present an error that looks like this:

The mesh responds with a http status code of 504 with a message of "Upstream timed out". 

Behavior 3 - Circuit Breaker

A circuit breaker is implemented using a DestinationRule resource.
Here I have a configuration which breaks the circuit if 3 consecutive 5XX responses are received from the Producer in a 15 second interval, and then does not make a request for another 15 seconds.

With this configuration in place a request with broken circuit looks like this:

The mesh responds with a http status code of 503 and a message of "no healthy upstream"


The neat thing is that in all scenarios so far, the way the Caller calls the Producer remains exactly the same, it is the mesh which injects in the appropriate security controls through mTLS and the resilience of calling service through timeouts and circuit breaker. 

Wednesday, October 27, 2021

Service to Service call patterns in Google Cloud - GKE

This is a series of posts that will explore service to service call patterns in some of the application runtimes in Google Cloud. This specific post will explore GKE without using a service mesh and the next post will explore GKE with Anthos Service Mesh.

Set Up

The set-up is simple: two applications, caller and producer, are hosted on the application runtime, with the caller making an HTTP request to the producer. An additional UI is packaged with the caller that should make it easy to test the different scenarios.

The producer is special: a few faults can be injected into the producer's response based on the post body from the caller:

  1. An arbitrary delay
  2. A specific response http status code

These will be used for checking how the runtimes behave under faulty conditions.

GKE Autopilot Runtime

The fastest way to get a fully managed Kubernetes cluster in Google Cloud is to spin up a GKE Autopilot cluster. Assuming such a cluster is available, the service to service call pattern is through the abstraction of a Kubernetes service and looks something like this:

A manifest file which enables this is the following:

Once a service resource is created, here called "sample-producer" for instance, a client can call it using the service's FQDN - sample-producer.default.svc.cluster.local. In my sample, the caller and the producer are in the same namespace, and for such cases calling by just the service name is sufficient.

A sample service to service call and its output in a simple UI looks like this:

A few things to see here:
  1. As the request flows from the browser to the caller to the producer, the headers are captured at each stage and presented. There is nothing special with the headers so far, once service meshes come into play they start to get far more interesting.
  2. Nothing guards against the delay: the browser and the caller end up waiting no matter how long the delay.
  3. Along the same lines, if the producer starts failing, the caller continues to send requests down to the service instead of short-circuiting.


Service to service call in a Kubernetes environment is straightforward with the abstraction of a Kubernetes service resource providing a simple way for clients to reach the instances hosting an application. Layering in a service mesh provides a great way for the service to service calls to be much more resilient without the application explicitly needing to add in libraries to handle request timeouts or faulty upstream services. This will be the topic of the next blog post. 

Thursday, September 30, 2021

Google Cloud Deploy - CD for a Java based project

This is a short write-up on using Google Cloud Deploy for Continuous Deployment of a Java-based project. 

Google Cloud Deploy is a new entrant to the CD space. It currently facilitates continuous deployment to GKE based targets, and in the future to other Google Cloud application runtime targets.

Let's start with why such a tool is required. Why not an automation tool like Cloud Build or Jenkins? In my mind it comes down to these things:

  1. State - a dedicated CD tool can keep track of the artifact and the environments where the artifact is deployed. This way promoting a deployment, rolling back to an older version, or rolling forward is easily done. Such an integration can be built into a CI tool but it will involve a lot of coding effort.
  2. Integration with the deployment environment - a CD tool integrates well with the target deployment platform without too much custom code needed.

Target Flow

I am targeting a flow which looks like this, any merge to a "main" branch of a repository should:
1. Test and build an image
2. Deploy the image to a "dev" GKE cluster
3. The deployment can be promoted from the "dev" to the "prod" GKE cluster

Building an Image

Running the tests and building the image is handled with a combination of Cloud Build, providing the build automation environment, and skaffold, providing tooling through Cloud Native Buildpacks. It may be easier to look at the code repository to see how both are wired up.

Deploying the image to GKE

Now that an image has been baked, the next step is to deploy it into a GKE Kubernetes environment. Cloud Deploy has a declarative way of specifying the environments (referred to as Targets) and how to promote the deployment through the environments. A Google Cloud Deploy pipeline looks like this:

The pipeline is fairly easy to read. Target(s) describe the environments to deploy the image to and the pipeline shows how progression of the deployment across the environments is handled. 

One thing to notice is that the "prod" target has been marked with a "requires approval" flag, which ensures that promotion to the prod environment happens only with an approval. The Cloud Deploy documentation has good coverage of all these concepts. Also, there is a strong dependence on skaffold for generating the Kubernetes manifests and deploying them to the relevant targets.

Given such a deployment pipeline, it can be put in place using:

gcloud beta deploy apply --file=clouddeploy.yaml --region=us-west1

Alright, now that the CD pipeline is in place, a "Release" can be triggered once testing is completed on the "main" branch. A command like the following is integrated with the Cloud Build pipeline to do this, with a file pointing to the build artifacts:

gcloud beta deploy releases create release-01df029 --delivery-pipeline hello-skaffold-gke --region us-west1 --build-artifacts artifacts.json
This deploys the generated kubernetes manifests pointing to the right build artifacts to the "dev" environment

and can then be promoted to additional environments, prod in this instance.


This is a whirlwind tour of Google Cloud Deploy and the features that it offers. It is still early days and I am excited to see where the product goes. The learning curve is fairly steep; a developer is expected to understand:
  1. Kubernetes, which is the only application runtime currently supported. Expect other runtimes to be supported as the product evolves.
  2. skaffold, which is used for building, tagging, and generating Kubernetes artifacts
  3. Cloud Build and its YAML configuration
  4. Google Cloud Deploy's YAML configuration

It will get simpler as the Product matures.