Multi-cloud resilience with GSLB
GSLB is just DNS that knows which of your targets are healthy and where the client is. That turns out to be enough to route around a dead region.
If your whole app lives in one region of one cloud, that region is your availability ceiling. When it has a bad day, so do you, and there is nothing in your stack that can do anything about it because every part of your stack is in the same place. Multi-cloud, or at least multi-region, is how you raise that ceiling. The open question is how traffic finds the side that is still up.
That routing decision has to happen before a connection is made, which puts it at the DNS layer. Global server load balancing (GSLB) is the name for authoritative DNS that answers differently depending on which of your targets are healthy and, optionally, where the query came from.
What GSLB actually is
A normal A record is static. Whoever asks, whenever they ask, gets the same address. GSLB replaces that static answer with a small decision made at query time:
                    query: app.example.com A ?
                    ┌──────────────────────────┐
resolver ─────────▶ │ authoritative DNS (GSLB) │
                    │ pool: app-prod           │
                    │ • is each member up?     │
                    │ • where is the client?   │
                    └────────────┬─────────────┘
                                 │ returns the address of a
                                 │ healthy member, by policy
                                 ▼
       ┌────────────────┬────────────────┬────────────────┐
       │ AWS us-east    │ GCP europe     │ on-prem DC     │
       │ 1.2.3.4 (up)   │ 5.6.7.8 (up)   │ 9.10.11.12 (✗) │
       └────────────────┴────────────────┴────────────────┘
The members can be anywhere: different regions of one cloud, different clouds, or a cloud plus a box in a colo. DNS does not care where an address points, which is why it works as a cross-provider steering layer when nothing else spans all of them.
Health checks are the whole point
A pool is only as good as its knowledge of which members are alive. A separate health checker probes each member on its own loop and marks it up or down. The DNS answer is built from that state, so a member that fails its checks simply stops being returned.
before                               after primary fails its check
pool app-prod                        pool app-prod
  1. us-east (up)  ◀── served          1. us-east (down)  ── skipped
  2. europe  (up)                      2. europe  (up)    ◀── served now
                                          (us-east returns automatically
                                           once it passes checks again)
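The up/down bookkeeping described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any particular GSLB product implements it: the `MemberHealth` type, the TCP probe on port 443, and the rule that three consecutive disagreeing results flip the state are all hypothetical choices made for the example.

```python
import socket
from dataclasses import dataclass


@dataclass
class MemberHealth:
    """Health state for one pool member (hypothetical type for this sketch)."""
    address: str
    up: bool = True
    streak: int = 0  # consecutive probe results that disagree with `up`


def tcp_probe(address: str, port: int = 443, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to address:port opens within timeout."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def record_probe(member: MemberHealth, ok: bool, threshold: int = 3) -> bool:
    """Fold one probe result into the member's state.

    The state flips only after `threshold` consecutive results that disagree
    with it (an assumed policy), so one dropped probe does not cause failover.
    Returns True if the member changed state on this probe.
    """
    if ok == member.up:
        member.streak = 0
        return False
    member.streak += 1
    if member.streak >= threshold:
        member.up = ok
        member.streak = 0
        return True
    return False
```

A checker loop would call `record_probe(member, tcp_probe(member.address))` once per interval for each member. The same threshold applies in both directions here, which is what lets a recovered member return to service without a manual step.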
Selection policy decides which healthy member wins. The two that cover most needs:
- Active-passive. Members have a priority order. The pool serves the highest-priority healthy member and only moves down the list when that member goes down. This is failover: a primary site with a standby that takes over.
- Geo. The client's location decides which member answers, so European users get the European target and North American users get the North American one. Unhealthy members are excluded first, so geo and failover compose: nearest healthy.
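Both policies reduce to "filter out unhealthy members, then pick". A minimal Python sketch follows, assuming a hypothetical `Member` record where a lower `priority` number means more preferred and `region` is a coarse label such as "na" or "eu". A real geo policy would rank by actual distance or latency; this one simply prefers members in the client's region and falls back to priority order.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Member:
    """One pool member (hypothetical shape for this sketch)."""
    name: str
    address: str
    priority: int  # lower number = preferred (assumed convention)
    region: str    # coarse location label, e.g. "na" or "eu" (assumed)
    up: bool


def select_active_passive(members: list[Member]) -> Optional[Member]:
    """Return the highest-priority healthy member, or None if all are down."""
    healthy = [m for m in members if m.up]
    return min(healthy, key=lambda m: m.priority, default=None)


def select_geo(members: list[Member], client_region: str) -> Optional[Member]:
    """Prefer a healthy member in the client's region, else any healthy one."""
    healthy = [m for m in members if m.up]  # health filter always runs first
    local = [m for m in healthy if m.region == client_region]
    return min(local or healthy, key=lambda m: m.priority, default=None)


pool = [
    Member("us-east", "1.2.3.4", priority=1, region="na", up=True),
    Member("europe", "5.6.7.8", priority=2, region="eu", up=True),
]
```

With this pool, `select_geo(pool, "eu")` picks `europe`. If `europe` were marked down, the same call would fall back to `us-east`, which is the "nearest healthy" composition described above.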
Where GSLB helps, and where it does not
GSLB earns its place in cross-region and cross-cloud recovery. A region or a provider has an incident, the members there fail their checks, and new queries get sent to the side that is still serving. You do not change application code, and you are not tied to one cloud's load balancer to do it.
What it does not do is fail over in milliseconds. The health check needs time to notice the failure, and DNS answers are cached by resolvers, operating systems, and browsers. Some of those caches hold an answer longer than your TTL allows, so there is a recovery window after a member goes down during which some clients still hold the old answer. That is why DNS failover is never truly instant. GSLB is the right tool for "send people to the other region over the next minute or two", and the wrong tool for "one process crashed, recover in 500 ms". For the tight case you want anycast or a load balancer in front; for the regional and multi-cloud case, DNS is exactly where the decision belongs.
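A back-of-the-envelope estimate of that recovery window, for clients that do honor the TTL, is detection time plus TTL. The sketch below is only that arithmetic. The function name and parameters are hypothetical, and the model assumes a member is marked down after a fixed number of consecutive failed probes.

```python
def worst_case_recovery_seconds(
    probe_interval: float,
    failures_to_mark_down: int,
    probe_timeout: float,
    ttl: float,
) -> float:
    """Upper bound on how long a TTL-respecting client can hold a dead answer.

    Worst case: the member fails right after a successful probe, so the
    checker needs `failures_to_mark_down` full intervals, plus the timeout
    of the last probe, to mark it down. A resolver that cached the old
    answer just before that moment keeps it for one more full TTL.
    """
    detection = failures_to_mark_down * probe_interval + probe_timeout
    return detection + ttl
```

With a 30-second probe interval and assumed values of two failures to mark down, a 5-second probe timeout, and a 60-second TTL, `worst_case_recovery_seconds(30, 2, 5, 60)` gives 125 seconds, in line with "the next minute or two". Caches that overstay the TTL push individual clients past this bound.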
How dnswiz does it
You define a pool, add members with their addresses and a priority, and attach a health check. The health checker probes each member every 30 seconds by default and the authoritative answer is built from the live result, so failover and recovery happen on their own with no manual step. Geo selection uses the query's location to pick the nearest healthy member. When state changes, the dashboard records the transition with a timestamp, so an incident leaves a trail you can read afterwards instead of a guess.
Pools, members, and checks are all in the API and the Terraform provider, so a failover topology is something you commit and review rather than click together. That side of it is its own subject, covered in DNS as code.
See also: DNS as code, What feature-rich DNS actually means.