The TransferWise stack, 2020 edition

By Yuriy Opryshko, Platform Engineer at TransferWise

Over three years ago Alvar published the 2016 take on our tech stack. A lot has happened since then. The TransferWise community has grown at the speed of light from a million users in 2016 to over 7 million users today. Our customers now transfer over 4 billion pounds a month, saving over a billion pounds a year versus the banks.

The engineering organisation has grown too, from 120 to over 400 engineers in dozens of teams, all working together to achieve our mission. The number of services powering TransferWise has grown from 50 in 2016 to over 250 today, and on a normal working day over 120 production deployments happen, compared to around 30 back in 2016. That growth has meant our stack and infrastructure have had to evolve, and we’ve changed a lot in three years.

This is the first in a series of posts about the inner workings of TransferWise. We’re going to deep dive into our engine and look at how far we’ve come in the last three years. We’ll look at the programming languages, frameworks and technologies we work with and contribute to.

Later posts will focus on more specific areas: platform and infrastructure, our approach to front end development, observability and more.

Platform and SRE — embracing the cloud

We went with AWS for several reasons:

  1. It was (and still is) the most mature and feature-rich cloud provider, and both vendor and open source tooling make our lives a lot easier;
  2. AWS knowledge was already widespread within TransferWise, as well as in the broader industry;
  3. We already had a good working relationship with them.

Regardless of that, most of our platform is based on open source software, so the choice of cloud provider isn’t set in stone. The option to switch, or even to use multiple providers, is a useful feature.

All our AWS infrastructure is described as code in Terraform. We use resource tagging heavily for reporting the costs back to teams.
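As a rough illustration of how tags can drive cost reporting, the sketch below sums billing line items per team. The record shape (`cost`, `tags`), the `team` tag key and the `untagged` bucket are assumptions made for this example, not our actual billing export or tagging scheme.

```python
from collections import defaultdict


def costs_by_team(line_items, tag_key="team", untagged="untagged"):
    """Sum the cost of billing line items per value of a resource tag.

    `line_items` is an iterable of dicts with a numeric "cost" and an
    optional "tags" dict. This shape is hypothetical, chosen for the example.
    """
    totals = defaultdict(float)
    for item in line_items:
        # Resources without the tag land in a separate bucket so that
        # untagged spend stays visible instead of silently disappearing.
        team = item.get("tags", {}).get(tag_key, untagged)
        totals[team] += item["cost"]
    return dict(totals)


# Hypothetical sample data.
sample_items = [
    {"resource": "db-1", "cost": 10.0, "tags": {"team": "payments"}},
    {"resource": "k8s-node-7", "cost": 5.5, "tags": {"team": "payments"}},
    {"resource": "bucket-3", "cost": 2.0, "tags": {"team": "platform"}},
    {"resource": "orphan-volume", "cost": 1.25},
]
report = costs_by_team(sample_items)
```

Keeping an explicit bucket for untagged resources is a design choice for the sketch: it makes gaps in tagging show up in the report itself.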

Cloudflare sits on the edge, serving the website and the API to end users and partners, as well as providing attack mitigation and CDN functionality.

The services run in our Kubernetes clusters, which we spin up using our home-baked AMIs. Jose talked about the setup and lessons we learned migrating to k8s in great detail at GOTO Amsterdam last year and his slides provide a lot of useful insight.

For the database layer, we use PostgreSQL and MariaDB, selecting the engine based on the task. MongoDB is our choice for NoSQL solutions. Most of the databases run on RDS, which allows us to automate tasks such as backups and multi-availability zone deployments, whilst maintaining compatibility. Highly available EC2-based clusters are set up when RDS is too limited for a particular use case.

Envoy handles the service mesh layer, providing a transparent way for services to reach other services.

We use Kafka for messaging. It processes several thousand messages per second, even if we exclude the logging pipeline, which uses Kafka for log shipping.

Backend: microservices galore

We’ve also recently introduced a unified service template. So, when starting a new service you get all the basics like logging, the aforementioned task executor, monitoring, metrics and error reporting out of the box. It no longer relies on Spring Initializr, but lives in a maintained git repository that you can base a new service on, thanks to GitHub’s excellent template repository feature.

We’re ramping up adoption of our homegrown service-to-service communication framework. With it, all the requests flowing between our services will have priorities, deadlines and idempotency flags set. This will allow services to prioritise critical traffic and shed or postpone the less critical requests. With idempotency controls, requests will be automatically retried in all configured cases, not only when a network issue or another problem occurs before the request has been sent, which is what we already do by default.
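To make the idea concrete, here is a minimal sketch of how priorities, deadlines and idempotency flags could feed into shedding and retry decisions. It is not the framework itself: the names `Priority`, `RequestMeta`, `should_shed` and `should_retry`, as well as the load thresholds, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    CRITICAL = 0
    NORMAL = 1
    BACKGROUND = 2


@dataclass(frozen=True)
class RequestMeta:
    priority: Priority
    deadline: float   # absolute time (seconds) after which the caller gives up
    idempotent: bool  # True if repeating the request has no extra effect


# Assumed load levels (0.0 to 1.0) at which each priority starts being shed.
# CRITICAL has no entry, so it is never shed because of load.
SHED_THRESHOLDS = {Priority.NORMAL: 0.9, Priority.BACKGROUND: 0.7}


def should_shed(meta, load, now):
    """Server side: reject work that is expired or too low priority for the load."""
    if now >= meta.deadline:
        return True  # nobody is waiting for the answer any more
    threshold = SHED_THRESHOLDS.get(meta.priority)
    return threshold is not None and load >= threshold


def should_retry(meta, request_was_sent, attempt, max_attempts, now):
    """Client side: decide whether a failed attempt may be repeated."""
    if attempt >= max_attempts or now >= meta.deadline:
        return False
    if not request_was_sent:
        return True  # the server never saw the request, so a retry is safe
    # The request may have been processed already: retry only if idempotent.
    return meta.idempotent
```

For example, a `BACKGROUND` request arriving at 80% load would be shed under these assumed thresholds, while a `CRITICAL` one would still be served as long as its deadline has not passed.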

Some of our earlier tech bets didn’t pay off: we got rid of Eureka and Zuul (replaced by Envoy), as well as Spring Config Server (replaced by Kubernetes manifests and sealed secrets).

Our Grails monolith app is still there, but has reduced in size. Getting rid of it isn’t a priority, so we’ll just retire it when it becomes naturally irrelevant in future. In the meantime, we’re making sure our standard toolset for development and deployment works consistently with any service, small or large.

Frontend: let the 🦀 do the work

The JavaScript code itself is written using ES6 or TypeScript.

And all our frontend apps use our unified design system, thanks to regular collaboration with our designers.

Mobile apps: iterative evolution

The iOS app went through several tech and UX iterations: from a stock Objective-C-based MVP back in the early days to the current app, which is based on modularized Swift and custom UI components (lightweight wrappers around UIKit). We also have our own lightweight NSURLSession-based networking stack, and a similarly lightweight layer on top of Core Data.

All new code in the Android app is written in Kotlin, which runs rings around Android’s Java version. Using Kotlin throughout the stack allows us to develop consistent APIs and create clean integrations between other core libraries, such as Retrofit and Room. The RxJava code is being migrated to Kotlin Coroutines as well. The current reactive MVVM architecture has replaced the reactive MVP architecture from back in the day, and the overall codebase is now around 75% Kotlin. We also have a custom design and UI component library we use throughout.

Deployments, observability, analytics and security

The services’ log files are gathered by Fluent Bit and shipped via Kafka to Elasticsearch, where they can be viewed by developers in Kibana.

All metrics (platform, service and business-level ones) are gathered with Prometheus, stored in a Thanos store and displayed in Grafana on a number of dashboards. Alertmanager is used for alerts, which trigger pages in VictorOps for relevant teams. We believe in knowing what’s going on inside our running code at all times, so our observability stack is one of the most important parts of the engine.

In terms of product analytics, the relevant data in our databases is replicated to a Snowflake instance in near real time, while being stripped of any personal information, using our own open source PipelineWise. We use Looker to query and visualise the data, and it isn’t just for the analysts or product people: out of 2200+ people working at TransferWise, almost a third of us use Looker every day to make data-driven decisions.
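The idea of stripping personal information during replication can be sketched as a per-row transformation applied before loading. This is a generic illustration and does not use PipelineWise's actual configuration or API. The function name, the column names and the choice of SHA-256 are all assumptions.

```python
import hashlib


def strip_personal_data(row, drop=(), hash_columns=()):
    """Return a copy of `row` with personal columns removed or hashed.

    Columns in `drop` are omitted entirely. Columns in `hash_columns` are
    replaced by a SHA-256 digest so they can still be used as join keys.
    Note that plain hashing is pseudonymisation rather than anonymisation.
    """
    cleaned = {}
    for column, value in row.items():
        if column in drop:
            continue
        if column in hash_columns and value is not None:
            cleaned[column] = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
        else:
            cleaned[column] = value
    return cleaned


# Hypothetical source row and column choices.
source_row = {"user_id": 42, "email": "someone@example.com", "amount": 100, "currency": "GBP"}
analytics_row = strip_personal_data(source_row, drop={"email"}, hash_columns={"user_id"})
```

The input row is left untouched, which keeps the transformation easy to test in isolation.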

Our security bounty program is now public — so let us know if you find a vulnerability. New accepted submissions get rewarded! Be sure to follow the responsible disclosure terms, though.

The future outlook

Another thing that we’re working on these days is a cross-team disaster recovery exercise. While disasters (like a whole datacenter being gone) are, by definition, improbable, we want to have plans and playbooks for quickly failing over to a backup region and restoring service, no matter what has happened. As this is a large effort spanning multiple teams all across the organisation, it’s also an excellent opportunity for everyone involved to learn more about our platform and to get involved with the parts they wouldn’t normally work with.

P.S. Interested in working with us? We’re hiring! Check out our open Engineering roles here.