More than a hundred milliseconds. That’s roughly the round-trip latency gap between a user in Singapore hitting a server in Singapore versus hitting one in London, because the fiber path alone is over 10,000 km each way. It sounds trivial. But for a real-time trading platform, a multiplayer game, a video collaboration tool, or a surgical robotics interface, those milliseconds are the difference between a smooth experience and an unusable one.
Latency isn’t just a performance metric. It’s a product quality issue, a revenue issue, and increasingly, a competitive differentiator. As applications become more interactive and users become less tolerant of lag, engineering teams at every scale are being pushed to take latency seriously, not just at the infrastructure layer, but across the entire delivery stack.
This is a guide to doing that well.
Understanding Latency: More Than Just Distance
Before you can reduce latency, it helps to be precise about where it comes from. “Network latency” is often used as a catch-all, but the actual delay a user experiences is the sum of several distinct contributors.
Propagation delay is the unavoidable physics: the time it takes a signal to travel from point A to point B through fiber or copper. Light travels through fiber at roughly two-thirds the speed of light in a vacuum, or about 200,000 km/s. You cannot engineer your way around this. New York and Sydney are about 16,000 km apart, so a packet traveling between them will take at least 80ms one-way no matter how good your infrastructure is, and more in practice because real cable routes are not straight lines.
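To make that floor concrete, here is a minimal sketch that estimates the best-case one-way propagation delay between two cities. It assumes a perfectly straight great-circle fiber path and the rough two-thirds figure above; the function names and city coordinates are illustrative choices, not part of any library.

```python
import math

SPEED_OF_LIGHT_KM_S = 299_792.458               # in a vacuum
FIBER_SPEED_KM_S = SPEED_OF_LIGHT_KM_S * 2 / 3  # rough figure for light in fiber
EARTH_RADIUS_KM = 6371.0                        # mean Earth radius


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def min_one_way_ms(lat1, lon1, lat2, lon2):
    """Lower bound on one-way propagation delay over an idealized fiber path."""
    return great_circle_km(lat1, lon1, lat2, lon2) / FIBER_SPEED_KM_S * 1000


# Approximate city-center coordinates (assumed for illustration).
NEW_YORK = (40.7128, -74.0060)
SYDNEY = (-33.8688, 151.2093)

if __name__ == "__main__":
    print(f"{min_one_way_ms(*NEW_YORK, *SYDNEY):.1f} ms one-way, best case")
```

Real routes follow cable paths and pass through switching equipment, so measured latency will sit above this number, never below it.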
Transmission delay is the time required to push all the bits of a packet onto the link. For high-bandwidth connections, this is negligible. For low-bandwidth or congested links, it adds up.
Processing delay is the time routers, load balancers, firewalls, and application servers spend actually handling the packet. This is highly engineerable: bad software, bloated middleware, and inefficient routing decisions all inflate processing delay.
Queuing delay occurs when packets stack up waiting to be processed. During congestion, this can dwarf all other delay components combined. Queuing is where well-architected systems pull dramatically ahead of poorly designed ones under load.
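A rough sketch can show why transmission delay is usually negligible while queuing delay can explode. The queuing part assumes the textbook M/M/1 model (random arrivals, a single server), which is a simplification chosen for illustration; real routers and services behave less neatly. All function names here are hypothetical.

```python
def transmission_delay_ms(packet_bytes, link_bits_per_s):
    """Time to push every bit of one packet onto the link."""
    return packet_bytes * 8 / link_bits_per_s * 1000


def mm1_queue_wait_ms(arrivals_per_s, service_per_s):
    """Mean time a packet waits in queue under the M/M/1 model.

    Uses the standard result Wq = rho / (mu - lambda), where
    rho = lambda / mu is the utilization of the server.
    """
    if arrivals_per_s >= service_per_s:
        raise ValueError("unstable queue: arrivals meet or exceed service rate")
    utilization = arrivals_per_s / service_per_s
    return utilization / (service_per_s - arrivals_per_s) * 1000


if __name__ == "__main__":
    # A 1500-byte packet on a 1 Gbit/s link versus a 1 Mbit/s link.
    print(transmission_delay_ms(1500, 1e9), "ms")
    print(transmission_delay_ms(1500, 1e6), "ms")
    # The same server (1000 packets/s capacity) at 50% and 99% utilization.
    print(mm1_queue_wait_ms(500, 1000), "ms")
    print(mm1_queue_wait_ms(990, 1000), "ms")
```

Under these assumptions, going from 50% to 99% utilization takes the average queue wait from about 1 ms to about 99 ms, which is why headroom under load matters more than raw speed at idle.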
Understanding which of these dominates in your environment tells you where to focus. Many organizations spend too much time tuning application code when their biggest latency problem is geographic, or the reverse.
Best Practice 1: Get Physically Closer to Your Users
This sounds obvious, but it’s consistently under-executed. The single highest-impact thing most organizations can do to reduce latency is move compute and content closer to the people accessing it.
Content Delivery Networks (CDNs) are the most accessible starting point. CDNs cache static assets (images, scripts, stylesheets, video segments) at edge nodes distributed around the world. A user in Jakarta retrieves your homepage’s assets from a Jakarta PoP (Point of Presence), not your origin server in Frankfurt. For content-heavy applications, this alone can cut perceived load times dramatically.
Edge computing goes further. Rather than caching static content, edge platforms like Cloudflare Workers, AWS Lambda@Edge, and Fastly Compute (formerly Compute@Edge) let you run actual application logic at the edge. Authentication checks, A/B testing logic, personalization, and API response transformation can all happen in the same PoP that’s serving your assets, eliminating additional round trips to origin.
Multi-region deployments are the most operationally complex but most impactful approach for latency-sensitive applications. Rather than running your application in a single region and distributing only cached content, you deploy full application stacks in multiple regions and use intelligent routing to direct each user to their nearest healthy instance. Global traffic-steering services (AWS Global Accelerator, Google Cloud’s global anycast network, or Cloudflare Load Balancing) handle the routing logic automatically.
The principle in all three approaches is the same: reduce the physical distance between the user and the resource they need.
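The "nearest healthy instance" decision in a multi-region setup can be sketched in a few lines. This is an illustration only: it picks by great-circle distance, whereas managed global load balancers steer on measured network latency and health checks. The `Region` type, the region list, and the coordinates are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str
    lat: float
    lon: float
    healthy: bool = True


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def nearest_healthy_region(user_lat, user_lon, regions):
    """Return the closest region that is currently marked healthy."""
    candidates = [r for r in regions if r.healthy]
    if not candidates:
        raise RuntimeError("no healthy regions available")
    return min(
        candidates,
        key=lambda r: great_circle_km(user_lat, user_lon, r.lat, r.lon),
    )


# Hypothetical deployment footprint with approximate coordinates.
REGIONS = [
    Region("frankfurt", 50.11, 8.68),
    Region("singapore", 1.35, 103.82),
    Region("virginia", 38.95, -77.45),
]
JAKARTA = (-6.21, 106.85)

if __name__ == "__main__":
    print(nearest_healthy_region(*JAKARTA, REGIONS).name)
```

With this footprint, a user in Jakarta lands on the Singapore region, and falls back to Frankfurt if Singapore is marked unhealthy.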
Best Practice 2: Optimize Your Routing
Not all internet paths are created equal. The route a packet takes between two points is determined by BGP (Border Gateway Protocol), the internet’s routing protocol, and BGP optimizes for policy and stability, not speed. Two geographically close servers can have poor latency if they’re connected through suboptimal transit paths.
Anycast routing is one of the most powerful tools for latency optimization at scale. With anycast, a single IP address is announced from multiple geographic locations simultaneously. The network automatically routes each user’s traffic to the nearest (in network terms) instance of that IP. This is how major CDNs and DNS providers achieve consistently low latency globally: your DNS query doesn’t travel to a single server, it travels to whichever server is closest on the network topology.
Private backbone networks bypass the public internet entirely for inter-region traffic. Rather than letting packets traverse unpredictable public paths, providers like AWS (with its global backbone), Cloudflare (with its Argo Smart Routing), and Google (with its private fiber network) route traffic over dedicated infrastructure with known, optimized paths. For applications with heavy inter-region traffic or strict latency budgets, using a provider’s private backbone rather than relying on public internet peering can reduce latency noticeably. Reductions in the range of 30–40% are sometimes cited, but the actual gain depends on the route and on how congested the public path is, so measure it against your own traffic.
Peering arrangements matter for high-traffic operations. When two networks exchange traffic directly rather than through transit providers, they eliminate hops and reduce latency. Large content operators negotiate direct peering agreements with major ISPs and cloud providers specifically for this reason.
Best Practice 3: Minimize Round Trips in Your Application Layer
Even on a fast network, a chatty application can feel slow. Every round trip, every request/response exchange between client and server, adds at least one full RTT (round-trip time) of latency. Applications that require many sequential round trips to complete a single user action compound that latency into something painful.
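The compounding effect is easy to demonstrate with a simulation. The sketch below stands in for three independent requests, each costing one assumed 50 ms round trip, and compares issuing them one after another with issuing them concurrently. `fake_request` and the timing constant are invented for illustration; no real network is involved.

```python
import asyncio
import time

SIMULATED_RTT_S = 0.05  # assumed 50 ms round trip, purely illustrative


async def fake_request(name):
    """Stand-in for a network call: waits one simulated round trip."""
    await asyncio.sleep(SIMULATED_RTT_S)
    return name


async def load_sequential(names):
    """Each request waits for the previous one: N round trips in total."""
    return [await fake_request(n) for n in names]


async def load_concurrent(names):
    """Independent requests overlap: roughly one round trip in total."""
    return list(await asyncio.gather(*(fake_request(n) for n in names)))


def timed(loader, names):
    """Run a loader coroutine function and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = asyncio.run(loader(names))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    resources = ["profile", "settings", "feed"]
    _, sequential_s = timed(load_sequential, resources)
    _, concurrent_s = timed(load_concurrent, resources)
    print(f"sequential: {sequential_s * 1000:.0f} ms")
    print(f"concurrent: {concurrent_s * 1000:.0f} ms")
```

Overlapping only helps when the requests do not depend on each other. When they do, the fix is usually to change the API so one response carries everything the client needs.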
Use HTTP/3 and QUIC. HTTP/3 is built on the QUIC transport protocol, which was specifically designed to reduce latency. With HTTP/2 over TCP, one lost packet holds up every stream multiplexed on the connection (head-of-line blocking); QUIC delivers each stream independently, so a loss stalls only the stream it belongs to. QUIC also folds the transport and TLS handshakes into a single round trip, and its 0-RTT connection resumption means returning users don’t pay the handshake cost on reconnection. For mobile users on variable connections, the improvement is particularly significant.
Implement aggressive connection reuse. HTTP keep-alive, connection pooling, and multiplexing (multiple requests over a single connection) all reduce the overhead of establishing new connections. For high-frequency API communication, gRPC with its built-in multiplexing and binary serialization is significantly more efficient than REST over HTTP/1.1.
Push data proactively where possible. Rather than waiting for the client to request data it will predictably need, server-sent events or WebSocket connections allow the server to push updates as soon as they’re available. For real-time dashboards, live feeds, and collaborative applications, this eliminates polling, which is both latency-inducing and wasteful.
Optimize your DNS. DNS resolution happens before any content can be fetched, and slow DNS adds directly to every user’s first-request latency. Use a high-performance authoritative DNS provider with a globally distributed (ideally anycast) nameserver network, and set TTLs (time-to-live) long enough to allow aggressive caching at the resolver level while staying short enough for failover changes to propagate. DNS prefetching for known linked resources further reduces the impact.
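Before tuning DNS, it helps to see what a lookup costs from where you sit. The following is a minimal sketch using the standard library’s `socket.getaddrinfo`; the `time_lookup_ms` helper and its injectable `resolve` parameter are assumptions made here so the timing logic can be exercised without a network.

```python
import socket
import time


def time_lookup_ms(hostname, resolve=socket.getaddrinfo):
    """Time one name resolution through the operating system's resolver.

    The result includes any local or upstream caching, so the first call
    for a name is usually the slowest and repeats are often much faster.
    """
    start = time.perf_counter()
    resolve(hostname, 443)
    return (time.perf_counter() - start) * 1000


if __name__ == "__main__":
    # example.com is a placeholder; substitute a hostname you serve.
    for attempt in range(3):
        print(f"lookup {attempt + 1}: {time_lookup_ms('example.com'):.1f} ms")
```

A large gap between the first and later lookups is the cache at work, which is the effect that sensible TTLs are meant to preserve.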
Best Practice 4: Tune Your Infrastructure Configuration
Architecture choices matter, but so does the configuration of the infrastructure you’re running on. Significant latency often hides in default settings that weren’t designed for high-performance workloads.
TCP tuning for high-throughput connections. TCP’s congestion control algorithms and buffer sizes have defaults optimized for average conditions, not peak performance. Increasing TCP receive and send buffer sizes, enabling TCP BBR (Google’s congestion control algorithm designed for high-bandwidth, high-latency paths), and tuning the initial congestion window can meaningfully improve throughput and reduce latency on busy links.
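The reason buffer sizes matter comes down to the bandwidth-delay product: the amount of data that must be in flight to keep a link full. The sketch below computes it, along with the throughput ceiling a given window imposes. The helper names are illustrative, and the figures are examples rather than recommendations.

```python
def bandwidth_delay_product_bytes(bandwidth_bits_per_s, rtt_ms):
    """Bytes that must be in flight to keep a path of this speed and RTT full."""
    return bandwidth_bits_per_s * (rtt_ms / 1000) / 8


def max_throughput_bits_per_s(window_bytes, rtt_ms):
    """Upper bound on throughput when at most one window is sent per RTT."""
    return window_bytes * 8 / (rtt_ms / 1000)


if __name__ == "__main__":
    # A 1 Gbit/s path with a 160 ms round trip (illustrative numbers).
    bdp = bandwidth_delay_product_bytes(1e9, 160)
    print(f"buffer needed to fill the path: {bdp / 1e6:.0f} MB")
    # The same path limited to a 64 KiB window.
    capped = max_throughput_bits_per_s(64 * 1024, 160)
    print(f"throughput with a 64 KiB window: {capped / 1e6:.1f} Mbit/s")
```

On that example path, a 64 KiB window caps a single connection at roughly 3.3 Mbit/s regardless of link speed, which is why long-distance transfers benefit from larger buffers.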
Reduce TLS handshake overhead. TLS 1.3 cut the handshake from two round trips to one, and 0-RTT session resumption can bring it to zero for returning connections. Because 0-RTT data can be replayed by an attacker, reserve it for idempotent requests. If you’re still running TLS 1.2, upgrading is one of the lower-effort latency improvements available. Additionally, terminating TLS at the edge (at your CDN or load balancer) rather than at origin means the costly handshake happens close to the user, not on a distant server.
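The handshake savings are simple arithmetic once round trips are counted. This sketch tallies the setup round trips for a few common stacks and converts them into time to first byte for a given RTT. The counts assume full handshakes without TCP Fast Open, and the table and function names are illustrative.

```python
# Round trips spent on connection setup before the first request can be sent.
SETUP_ROUND_TRIPS = {
    "TCP + TLS 1.2": 1 + 2,             # TCP handshake, then two for TLS
    "TCP + TLS 1.3": 1 + 1,             # TCP handshake, then one for TLS
    "TCP + TLS 1.3 (0-RTT)": 1 + 0,     # resumed session, early data
    "QUIC": 1,                          # transport and TLS combined
    "QUIC (0-RTT)": 0,                  # resumed session, early data
}


def time_to_first_byte_ms(setup_round_trips, rtt_ms, server_time_ms=0.0):
    """Setup round trips plus one for the request and response itself."""
    return (setup_round_trips + 1) * rtt_ms + server_time_ms


if __name__ == "__main__":
    rtt_ms = 160  # illustrative long-haul round trip
    for stack, trips in SETUP_ROUND_TRIPS.items():
        print(f"{stack:<24} {time_to_first_byte_ms(trips, rtt_ms):>5.0f} ms")
```

The same table also shows why terminating TLS at the edge helps: the setup round trips are multiplied by the short RTT to the edge rather than the long RTT to origin.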
Load balancer and proxy configuration. Default keepalive timeouts, buffer sizes, and connection limits on load balancers like NGINX or HAProxy are often too conservative for high-concurrency workloads. Review and tune these settings based on your actual traffic patterns, and ensure your load balancers are not becoming bottlenecks under peak load.
Best Practice 5: Measure Everything, Continuously
You cannot optimize what you are not measuring. And latency measurement is trickier than it sounds.
Synthetic monitoring sends automated test requests from nodes around the world on a continuous schedule, giving you a baseline view of performance from each geography regardless of actual user traffic. Tools like Catchpoint, ThousandEyes, and Pingdom offer globally distributed synthetic testing.
Real User Monitoring (RUM) collects performance data from actual users’ browsers or apps as they interact with your service. Unlike synthetic monitoring, RUM reflects the reality of your user base: the ISPs, devices, and network conditions they’re actually on. This is where you discover that your application performs well in synthetic tests but poorly for users on mobile connections in Southeast Asia.
Distributed tracing maps the latency contribution of every component in your stack for individual requests. Tools like Jaeger, Zipkin, or commercial APM platforms like Datadog and New Relic show you exactly where time is being spent, whether it’s a slow database query, a blocking external API call, or an inefficient middleware layer. Without tracing, latency debugging is guesswork.
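The core idea behind tracing, timing named spans within a single request, can be sketched without any tracing system at all. The `Trace` class below is a hand-rolled illustration and is not the API of Jaeger, Zipkin, or any APM product; real tracers also propagate context across service boundaries, which this does not attempt.

```python
import time
from contextlib import contextmanager


class Trace:
    """Collects (name, duration_ms) pairs for the steps of one request."""

    def __init__(self):
        self.spans = []

    @contextmanager
    def span(self, name):
        """Time the enclosed block and record it, even if it raises."""
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.spans.append((name, elapsed_ms))

    def slowest(self):
        """Return the (name, duration_ms) pair that took the longest."""
        return max(self.spans, key=lambda span: span[1])


if __name__ == "__main__":
    trace = Trace()
    with trace.span("db_query"):
        time.sleep(0.03)   # stand-in for a slow database call
    with trace.span("render"):
        time.sleep(0.005)  # stand-in for template rendering
    name, duration_ms = trace.slowest()
    print(f"slowest span: {name} ({duration_ms:.0f} ms)")
```

Even this crude version answers the question that matters during debugging: which step is actually consuming the time.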
Set latency budgets. Define acceptable latency targets for each part of your stack (DNS resolution, TLS handshake, time-to-first-byte, total page load) and alert when actuals exceed budget. Latency tends to degrade gradually and go unnoticed until it becomes a significant user experience problem. Budgets and alerts catch the drift early.
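A latency budget can start as nothing more than a table of limits and a comparison. In the sketch below, the stage names mirror the ones listed above, while the numeric limits and the `over_budget` helper are hypothetical placeholders to be replaced with targets that fit your product.

```python
# Hypothetical per-stage budgets in milliseconds; tune these to your product.
LATENCY_BUDGET_MS = {
    "dns": 50,
    "tls_handshake": 100,
    "time_to_first_byte": 300,
    "page_load": 2000,
}


def over_budget(measured_ms, budget_ms=LATENCY_BUDGET_MS):
    """Return {stage: overshoot_ms} for every measured stage above its limit."""
    return {
        stage: measured_ms[stage] - limit
        for stage, limit in budget_ms.items()
        if stage in measured_ms and measured_ms[stage] > limit
    }


if __name__ == "__main__":
    measured = {
        "dns": 30,
        "tls_handshake": 140,
        "time_to_first_byte": 300,
        "page_load": 2500,
    }
    for stage, overshoot in over_budget(measured).items():
        print(f"ALERT {stage}: {overshoot} ms over budget")
```

In practice the measured values would come from your synthetic or real-user monitoring, and the alerts would feed whatever paging or dashboard system you already run.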
Putting It Together
Reducing latency at global scale is not a single fix. It’s a layered practice: geographic distribution at the infrastructure level, routing optimization at the network level, round-trip reduction at the application level, careful configuration at the system level, and continuous measurement threading through all of it.
The organizations running the lowest-latency global operations didn’t achieve it in a single sprint. They built the measurement infrastructure first, identified their dominant latency contributors, addressed them in order of impact, and kept iterating.
Start with where your users are versus where your infrastructure is. Close that gap. Then work down the stack.
Every millisecond you reclaim is a better experience for someone, somewhere in the world, trying to use what you built.