Engineering at N26

Someone messaged me on LinkedIn while I was at N26. They wanted to apply but assumed the backend was mostly Java — verbose, legacy, not where they wanted to go. When I told them the reality was Kotlin microservices on Kubernetes, they changed their mind entirely.

That gap between outside perception and inside reality is why I wrote this. N26 had published a stack overview in 2018; by 2022 the picture had changed a lot. I had spent nearly three years inside it. This is the tour I wish I’d had on day one.

Snapshot in time

Written in September 2022, during my time at N26. Numbers, tooling, and migration percentages evolve — banks don’t stand still. For what came later on financial crime and stream processing, see Rebuilding Fraud Prevention from Scratch in Six Months.

The shape of the system

N26 is a licensed EU neobank. That sentence matters more than any framework choice: regulatory constraints, audit trails, and incident response aren’t afterthoughts — they’re load-bearing walls in the architecture.

At a high level, here’s what we were running:

Stack at a glance (2022)

Product surface

Mobile apps and web, backed by API gateways and BFF-style services. Customer-facing latency is the metric everyone feels.

REST, mobile, API gateway

Microservices

Roughly 230 services, heavily JVM/Kotlin, communicating over HTTP and increasingly over events. The default unit of ownership.

Kotlin, Java, Python, TypeScript

Event layer

Async backbone for decoupling — SQS and Kinesis where they fit, Apache Kafka where we needed scale, resilience, and replay.

Kafka, Kinesis, SQS

Data

100+ PostgreSQL instances, standardized provisioning, compliance baked into how databases get created.

PostgreSQL, IaC, ML pipelines

Platform

Kubernetes everywhere, GitHub Actions + Argo CD for delivery, Datadog + ELK (moving to OpenSearch) for observability.

Kubernetes, Argo CD, Datadog

9k+ containers under management

~230 microservices in the distributed stack

3 TB daily inter-service traffic (10 TB at peak)

100+ PostgreSQL databases, provisioned via PR

Everything lives in the cloud. At this scale, “we’ll fix the deploy later” is a business risk, not a backlog item.

Containers everywhere

When I say everything lives in the cloud, I mean it literally — including the orchestration layers we outgrew.

We ran over 9,000 containers. For a long time that meant Nomad plus a home-grown orchestration layer on SpotInst. It worked until it didn’t. Pipelines built on the old setup had grown to ~1.5 hours end to end once you counted E2E tests, security gates, quality checks, and monitoring hooks. We had been deploying continuously to production since 2017, but that only helps if the pipeline stays fast enough to use.

Kubernetes became the obvious next step. Not because it’s fashionable — because we needed predictable scheduling, a real ecosystem, and deployment times that didn’t punish teams for shipping.

Orchestration evolution

2017: Continuous deployment to prod. Shipping frequently was already cultural; infra had to keep up.

→ Nomad + custom orchestration. Step up from home-grown, but pipelines kept slowing down.

2022: Kubernetes in all environments. 80%+ of microservices migrated; typical deploy ~30 min, some pipelines under 15.

The migration wasn’t painless. Kubernetes has sharp edges, especially when you’re regulated and can’t afford mystery failures at 2 a.m. But the direction was clear: faster, more stable deploys, and infrastructure that scaled with the business instead of fighting it.

Kotlin, on purpose

A few years before this snapshot, N26 made an explicit bet on Kotlin for the JVM backend. It paid off.

When I joined, we were still in the mixed Java/Kotlin phase — new services bootstrapped in Kotlin, older Java codebases in maintenance mode. By 2022, Kotlin was the default for new JVM work. Not because Kotlin solves every problem — we still had Python, TypeScript, and Java where history or fit demanded it — but because the ergonomics matter at 230 services. Less ceremony, safer refactors, teams that actually wanted to open the repo.

The LinkedIn message wasn’t wrong to care about language — it was wrong about which language dominated. Perception lagged reality by years. That’s common at fast-moving companies and worth fixing publicly.

CI/CD: separating build from release

We were mid-flight migrating from Jenkins to GitHub Actions for CI and Argo CD for CD. The split matters: build artifacts in one place, deploy intent in another, clearer audit trail for regulators who ask what shipped when and who approved it.
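To make the build/release split concrete, here is a minimal Python sketch. It is an illustration under my own assumptions, not N26's tooling: the `BuildArtifact` and `DeployIntent` types, the field names, and the sample digests and approvers are all hypothetical. The point it shows is that once the artifact (what CI built) and the intent (what was approved to run where) are separate records, "what shipped when and who approved it" becomes a simple query.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class BuildArtifact:
    """Output of CI: an immutable image identified by its content digest."""
    service: str
    image_digest: str  # hypothetical digests are used in the sample data below
    commit: str


@dataclass(frozen=True)
class DeployIntent:
    """Input to CD: a reviewed statement of what should run where."""
    artifact: BuildArtifact
    environment: str
    approved_by: str
    approved_at: datetime


def audit_trail(intents, service, environment):
    """Return (when, digest, approver) rows for one service and environment, oldest first."""
    rows = [
        (i.approved_at, i.artifact.image_digest, i.approved_by)
        for i in intents
        if i.artifact.service == service and i.environment == environment
    ]
    return sorted(rows)


# Hypothetical sample data: two builds of one service, three approved deploys.
artifact_v1 = BuildArtifact("payments", "sha256:aaa", "c0ffee1")
artifact_v2 = BuildArtifact("payments", "sha256:bbb", "c0ffee2")
intents = [
    DeployIntent(artifact_v2, "prod", "bob", datetime(2022, 9, 2, 10, 0, tzinfo=timezone.utc)),
    DeployIntent(artifact_v1, "prod", "alice", datetime(2022, 9, 1, 9, 0, tzinfo=timezone.utc)),
    DeployIntent(artifact_v2, "staging", "bob", datetime(2022, 9, 1, 15, 0, tzinfo=timezone.utc)),
]

for when, digest, approver in audit_trail(intents, "payments", "prod"):
    print(when.isoformat(), digest, approver)
```

In a GitOps setup the deploy intent would typically live as a reviewed commit in a repository rather than as an in-memory object; the sketch only models the shape of the data.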

Modern delivery pipeline (target state)

CI (minutes)

GitHub Actions — test, scan, build container image. Native Kubernetes awareness made migrated pipelines faster and less flaky than the Jenkins era.

CD (≤30 min)

Argo CD — declarative deploys to K8s clusters across environments. Promotion paths aligned with our change-management rules.

Verify (continuous)

Monitoring-as-code alerts and SLO dashboards. If prod misbehaves, the owning team's metrics fire first — not a central on-call guessing game.

Early days of the migration, but the migrated pipelines were already winning on speed and stability. In a bank, “faster CI” isn’t a developer convenience — it’s how you patch fraud vectors before the weekend.

Observability as infrastructure

Continuous deployment without observability is just faster breakage.

We invested heavily in Datadog for infra and app metrics, custom business metrics on top, and ELK for log access (with a move toward OpenSearch underway). The important shift wasn’t the vendor list — it was Monitoring as Code: alerts and dashboards living in the same repos as the services they watch.

Want to change an alert threshold? Open a PR on the service that owns the metric. Review it like code. Ship it like code. That sounds obvious until you’ve lived in a world where only three people can touch Grafana and everyone else pages them at midnight.

Logs on demand

Engineers could reach production logs through the ELK stack without a ticket-based ritual — critical when you're debugging a payment path at peak.

Metrics with business context

Infra graphs tell you the CPU spiked. Business metrics tell you card auths failed — different urgency, different runbook.

Alerts owned by services

Threshold changes via PR kept accountability local. The team that understands the failure mode owns the pager.
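The "threshold change via PR" idea can be sketched in a few lines of Python. This is a hypothetical model, not Datadog's API and not the format we actually used: `AlertRule`, `validate`, `fires`, the metric name, and the 2% threshold are assumptions made for illustration. It shows an alert as a plain, reviewable value that lives next to the service, plus the kind of check a CI job could run before the change merges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    """One alert, stored in the repository of the service that emits the metric."""
    name: str
    metric: str
    threshold: float
    comparison: str  # "above" or "below"
    owner_team: str


def validate(rule):
    """Checks a CI job could run on every PR; returns a list of problems."""
    problems = []
    if rule.comparison not in ("above", "below"):
        problems.append(f"{rule.name}: comparison must be 'above' or 'below'")
    if not rule.owner_team:
        problems.append(f"{rule.name}: every alert needs an owning team")
    return problems


def fires(rule, value):
    """True when the observed value breaches the rule's threshold."""
    if rule.comparison == "above":
        return value > rule.threshold
    return value < rule.threshold


# Hypothetical rule: alert the cards team when the auth failure ratio exceeds 2%.
card_auth_failures = AlertRule(
    name="card-auth-failure-ratio",
    metric="cards.auth.failure_ratio",
    threshold=0.02,
    comparison="above",
    owner_team="cards",
)

print(validate(card_auth_failures))
print(fires(card_auth_failures, 0.05))
print(fires(card_auth_failures, 0.01))
```

Changing the threshold is then a one-line diff that the owning team reviews, which is the whole point: the history of the alert is the history of the repository.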

Data: regulated, automated, valuable

Data at a bank is both asset and liability. N26’s data org had to extract customer value — fraud signals, support automation, product analytics — without treating compliance as someone else’s problem.

Our DBA function supported 100+ PostgreSQL databases. We had migrated away from MySQL for most use cases — PostgreSQL’s versatility and operational maturity won. New database instances were provisioned through automation: one PR, minutes to live, auditable from creation.

That standardization let a small DBA team punch far above its headcount. It rested on cloud-native Postgres, Infrastructure as Code, and ruthless elimination of snowflake setups. Machine learning usage had scaled too: the chatbot and related ML workloads were part of daily operations, not a lab experiment.
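As a rough illustration of "compliance baked into how databases get created", here is a Python sketch of a policy check that could run against a database request in a PR. Everything specific in it is an assumption: the request fields (`engine`, `owner_team`, `encrypted_at_rest`, `backup_retention_days`) and the 7-day baseline are hypothetical and do not describe N26's actual rules or tooling.

```python
REQUIRED_BACKUP_DAYS = 7  # assumed baseline for the example, not a real policy value


def review_database_request(request):
    """Return the policy violations in a request; an empty list means it can merge."""
    violations = []
    if request.get("engine") != "postgresql":
        violations.append("engine must be postgresql")
    if not request.get("owner_team"):
        violations.append("owner_team is required")
    if not request.get("encrypted_at_rest", False):
        violations.append("encryption at rest must be enabled")
    if request.get("backup_retention_days", 0) < REQUIRED_BACKUP_DAYS:
        violations.append(f"backups must be kept at least {REQUIRED_BACKUP_DAYS} days")
    return violations


# Hypothetical requests, as they might appear in a PR.
good_request = {
    "name": "cards-ledger",
    "engine": "postgresql",
    "owner_team": "cards",
    "encrypted_at_rest": True,
    "backup_retention_days": 30,
}
bad_request = {"name": "scratch-db", "engine": "mysql", "backup_retention_days": 1}

print(review_database_request(good_request))
print(review_database_request(bad_request))
```

A check like this is what lets the review focus on intent: the baseline is enforced mechanically, and the merged PR doubles as the audit record of when and why the instance was created.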

From HTTP coupling to events

~230 microservices talking to each other sounds elegant on a slide. In practice, synchronous HTTP everywhere builds a distributed monolith: change one service, coordinate five deploys, flip feature flags, hope nothing times out.

Our network moved over 3 TB per day, peaking around 10 TB, with TCP latency usually under 5 ms. HTTP is fast until it isn’t — until every user request fans out into a tree of blocking calls and one slow dependency owns your p99.
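A back-of-the-envelope calculation shows why fan-out hurts. Assume, purely for illustration, that each downstream call is independently "slow" 1% of the time (that is, it lands above that dependency's own p99). The chance that a user request touching n dependencies hits at least one slow call is 1 - (1 - 0.01)^n. The independence assumption and the 1% figure are mine, not measurements from our network.

```python
def chance_of_slow_request(dependencies, slow_fraction=0.01):
    """Probability that at least one of `dependencies` calls is slow, assuming independence."""
    return 1 - (1 - slow_fraction) ** dependencies


for n in (1, 5, 10, 50):
    share = chance_of_slow_request(n)
    print(f"{n:>2} downstream calls: {share:.1%} of requests hit at least one slow call")
```

With ten downstream calls, roughly 9.6% of requests see at least one slow dependency, so what was a p99 event for each service becomes close to a one-in-ten event for the user.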

We had async patterns early — AWS SQS and Kinesis for workloads that fit. Kafka entered when we needed higher throughput, stronger ordering guarantees, replay for audit and recovery, and a path off the worst coupling points.

The coupling problem

User request (mobile app) → Service A (sync HTTP) → Service B (sync HTTP) → Service C (blocks on B)

One concrete migration — moving a critical flow from synchronous HTTP to an async Kafka path — cut p95 latency by ~50%, doubled throughput, and reduced load on a downstream system by ~66%. At peak, several critical processes were doing 1,000+ messages per second through Kafka.

Sync HTTP mesh

Request chains block on every downstream call
Deploy coordination across coupled services
Failure modes propagate instantly to the user
Hard to replay or audit async side effects

Event-driven paths

Critical flows decoupled with bounded latency
Consumers scale independently of producers
Replay and audit trails for regulated workflows
Domain modeling conversations — events as contracts
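The two properties that matter most in that second list, independent consumers and replay, both fall out of one idea: an append-only log where each consumer tracks its own offset. The Python sketch below models that idea in memory. It is not Kafka's client API, and the `EventLog` and `Consumer` classes and the `CardAuthorized` event are hypothetical names chosen for the example.

```python
class EventLog:
    """Append-only log: records get increasing offsets and are never mutated."""

    def __init__(self):
        self._records = []

    def append(self, event):
        self._records.append(event)
        return len(self._records) - 1  # offset of the new record

    def read_from(self, offset):
        """Return (offset, event) pairs starting at the given offset."""
        return list(enumerate(self._records))[offset:]


class Consumer:
    """Tracks its own position, so it can lag, catch up, or replay independently."""

    def __init__(self, log, handler):
        self.log = log
        self.handler = handler
        self.offset = 0

    def poll(self):
        for offset, event in self.log.read_from(self.offset):
            self.handler(event)
            self.offset = offset + 1

    def replay(self, from_offset=0):
        """Rewind and reprocess; the producer is not involved at all."""
        self.offset = from_offset
        self.poll()


log = EventLog()
log.append({"type": "CardAuthorized", "amount": 20})
log.append({"type": "CardAuthorized", "amount": 35})

ledger_amounts = []
audit_copy = []
ledger = Consumer(log, lambda e: ledger_amounts.append(e["amount"]))
audit = Consumer(log, audit_copy.append)

ledger.poll()  # the ledger consumer is up to date
log.append({"type": "CardAuthorized", "amount": 5})
audit.poll()   # the audit consumer starts late and still sees every record
ledger.poll()  # the ledger picks up only the new record

print(sum(ledger_amounts), len(audit_copy))
```

The producer never learns who is reading or how far behind they are. That is the decoupling a synchronous chain cannot offer, and it is also why rebuilding an audit view is a rewind rather than a recovery project.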

We introduced Kafka gradually. Banks don’t YOLO new infrastructure. Stabilize internally, learn failure modes, then expand. Plenty of systems were still on the roadmap to break apart — event-driven architecture is a direction, not a checkbox.

The honest part

If you’ve read this far, you’re probably waiting for the catch. There is one.

We were still breaking down the original monolith — officially deprecated years earlier, unofficially longer. Macroservices from the early days still existed. Some integrations ran on third-party software with decades-old internals (cloud-hosted, but old). Kubernetes migration was past 80%, not 100%. Kafka adoption was real but incomplete.

That’s normal at scale. The mistake would be pretending otherwise in a recruiting blog post. The interesting work wasn’t “we have perfect microservices.” It was how a licensed bank modernizes under regulatory load — card expiries hitting at once, subscription billing at scale, financial crime prevention (a topic big enough for its own article), and the daily grind of making deploys safe enough to run continuously.

What this stack taught me

Perception lags reality

Outside engineers still thought 'Java bank.' Inside, Kotlin and K8s were the default. Publish the truth or lose candidates and partners to stale assumptions.

Platform speed is product speed

1.5-hour pipelines aren't an infra metric — they're a cap on how fast you can respond to fraud, outages, and regulation.

Events beat chains

HTTP between ~230 services doesn't scale socially or operationally. Kafka wasn't magic; the events it carried were contracts that teams could evolve independently.

Observability is ownership

Monitoring-as-code moved alerts next to the people who understand the service. That changes incident culture more than any tool brand.

Where to go next

This article is the wide-angle shot — containers, languages, delivery, data, events. The deep cut on financial crime and stream processing came later: Rebuilding Fraud Prevention from Scratch in Six Months. For why banks freeze code anyway, Stop Complaining About Code Freezes covers the other side of continuous deployment.

I also spoke about pieces of this stack at Kafka Summit London 2023 and KotlinConf 2024.

Originally published on InsideN26

This piece first appeared as Engineering at N26: a Tour of our Tech Stack and Architecture (September 2022). Rewritten and expanded for this site — recruiting CTA removed, cross-links added, numbers preserved from the original snapshot.