Scaling Go Services: From Startup to Enterprise

A business-oriented guide to scaling Go backend services, covering horizontal scaling strategies, performance optimization, and the organizational practices that enable sustainable growth.

business · 7 min read · By Klivvr Engineering

Scaling a backend system is rarely a purely technical challenge. It is a business challenge that requires technical solutions. At Klivvr, Andromeda started as a backend serving a few hundred requests per minute. Today it handles tens of thousands. The path between those two points was not a single dramatic rewrite but a series of deliberate, incremental improvements driven by actual bottlenecks rather than anticipated ones. This article shares the scaling journey of Andromeda, the decisions we made at each stage, and the organizational practices that made those decisions effective.

The audience for this piece is engineering leaders and senior engineers who need to plan for growth. We cover both the technical mechanisms and the business context that drives scaling decisions.

Stage One: Vertical Scaling and Profiling

The first instinct when a service slows down is to throw more hardware at it. In Go, this instinct is surprisingly effective for the first leg of growth. Go's runtime efficiently uses multiple CPU cores, its garbage collector is tuned for low latency, and its memory allocation patterns are cache-friendly. Doubling the CPU and memory allocation of a single instance often doubles throughput with zero code changes.

But vertical scaling has limits, and more importantly, it has no diagnostic value. A faster machine masks the real bottleneck. Before scaling anything, we profile.

Go's built-in profiling tools are exceptional. Every Andromeda service exposes pprof endpoints in non-production environments:

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof handlers on http.DefaultServeMux
)

func startDebugServer() {
    go func() {
        // Only enabled in non-production environments.
        // Capture a CPU profile with:
        //   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=300
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
}

A five-minute CPU profile reveals where time is actually spent. In our experience, the answer is almost never "Go code is slow." It is one of three things: database queries, network calls to other services, or JSON/protobuf serialization of large payloads. These are the bottlenecks worth fixing.

For database queries, the fix is usually indexing, query optimization, or connection pool tuning. For network calls, it is batching, caching, or parallelization. For serialization, it is reducing payload size or using more efficient encoding. None of these fixes require more hardware.
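As one illustration of the network-call fix, independent downstream calls can be issued concurrently so total latency is the maximum of the calls rather than the sum. This is a minimal sketch; fetchProfile and fetchBalance are hypothetical stand-ins for real network calls, not Andromeda APIs:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// fetchProfile and fetchBalance stand in for independent downstream calls.
func fetchProfile(userID string) string {
	time.Sleep(10 * time.Millisecond) // simulated network latency
	return "profile:" + userID
}

func fetchBalance(userID string) string {
	time.Sleep(10 * time.Millisecond)
	return "balance:" + userID
}

func main() {
	var wg sync.WaitGroup
	var profile, balance string

	// Issue both calls concurrently; each goroutine writes to its own variable.
	wg.Add(2)
	go func() { defer wg.Done(); profile = fetchProfile("u42") }()
	go func() { defer wg.Done(); balance = fetchBalance("u42") }()
	wg.Wait()

	fmt.Println(profile, balance)
}
```

The same shape applies to any fan-out where the calls do not depend on each other's results.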

The business lesson from this stage is: invest in observability before investing in infrastructure. A team that can identify bottlenecks in minutes will scale more effectively than a team that scales blindly.

Stage Two: Horizontal Scaling

When a single instance cannot handle the load regardless of hardware, you scale horizontally. Go services are well-suited for this because they are typically stateless: all state lives in databases, caches, or message brokers. Adding more instances is straightforward.

In Andromeda, every service runs as a Kubernetes Deployment with a Horizontal Pod Autoscaler (HPA). The HPA scales based on CPU utilization and custom metrics:

# Abbreviated HPA configuration (scaleTargetRef and metadata omitted)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: grpc_requests_per_second
      target:
        type: AverageValue
        averageValue: "500"

The minimum replica count of two ensures high availability. The maximum of twenty provides headroom for traffic spikes. Scaling on both CPU and request rate catches different load profiles: the CPU target handles compute-heavy work (like cryptographic operations), while the request-rate target catches I/O-bound services that saturate on concurrent requests long before CPU climbs (like services that spend most of their time waiting on database queries).

For gRPC services, horizontal scaling requires attention to load balancing. Unlike HTTP/1.1, where load balancers can distribute traffic at the request level across short-lived connections, gRPC multiplexes many streams over a single long-lived HTTP/2 connection. A naive connection-level (L4) load balancer therefore sends all streams from a single client to the same backend. We use gRPC-aware load balancing, either through a service mesh or client-side balancing, to distribute streams evenly across instances:

import (
    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
    _ "google.golang.org/grpc/balancer/roundrobin"
)

// Round-robin balancing only helps if the resolver returns multiple
// addresses, e.g. a "dns:///" target pointing at a headless service.
func dialWithLoadBalancing(target string) (*grpc.ClientConn, error) {
    return grpc.NewClient(
        target,
        grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
        grpc.WithTransportCredentials(insecure.NewCredentials()),
    )
}

Stage Three: Data Layer Scaling

The service layer is rarely the true bottleneck at scale. The data layer is. Databases, caches, and message brokers are shared resources that every service instance competes for. Scaling the service layer without scaling the data layer just moves the bottleneck.

We address data layer scaling through several strategies:

Read replicas. For read-heavy services (which most of ours are), we route read queries to PostgreSQL read replicas and reserve the primary for writes. This is transparent to the application code because we use a connection wrapper that routes based on which method is called:

// ReadWriteDB routes reads to a replica and writes to the primary.
type ReadWriteDB struct {
    primary *sql.DB
    replica *sql.DB
}

// QueryContext and QueryRowContext serve reads from the replica. Reads
// that must observe a just-committed write (read-your-writes) should
// query the primary directly; this wrapper does not handle that case.
func (rw *ReadWriteDB) QueryContext(ctx context.Context, query string, args ...interface{}) (*sql.Rows, error) {
    return rw.replica.QueryContext(ctx, query, args...)
}

// ExecContext sends writes to the primary.
func (rw *ReadWriteDB) ExecContext(ctx context.Context, query string, args ...interface{}) (sql.Result, error) {
    return rw.primary.ExecContext(ctx, query, args...)
}

func (rw *ReadWriteDB) QueryRowContext(ctx context.Context, query string, args ...interface{}) *sql.Row {
    return rw.replica.QueryRowContext(ctx, query, args...)
}

Connection pooling. Go's database/sql package includes a connection pool, but its defaults are conservative. We tune MaxOpenConns, MaxIdleConns, and ConnMaxLifetime based on the service's concurrency profile:

db, err := sql.Open("postgres", dsn)
if err != nil {
    log.Fatalf("open database: %v", err)
}
db.SetMaxOpenConns(25)                 // cap concurrent connections to protect the database
db.SetMaxIdleConns(10)                 // keep warm connections for bursts
db.SetConnMaxLifetime(5 * time.Minute) // recycle connections, e.g. after failovers
db.SetConnMaxIdleTime(1 * time.Minute)

Caching. For data that is read frequently and changes infrequently (like configuration, feature flags, or user profiles), we use Redis as a read-through cache. The cache layer sits between the application service and the repository, reducing database load by an order of magnitude for cached entities.
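The read-through flow can be sketched as follows. This is a minimal, self-contained illustration: the Cache interface, the in-memory map standing in for Redis, and the loadProfile helper are all hypothetical, not Andromeda's production code:

```go
package main

import "fmt"

// Cache is an illustrative stand-in for a Redis client.
type Cache interface {
	Get(key string) ([]byte, bool)
	Set(key string, value []byte)
}

type mapCache struct{ data map[string][]byte }

func (c *mapCache) Get(key string) ([]byte, bool) { v, ok := c.data[key]; return v, ok }
func (c *mapCache) Set(key string, value []byte)  { c.data[key] = value }

// loadProfile reads through the cache: serve hits from the cache, fall
// back to the repository on a miss, then populate the cache for next time.
func loadProfile(cache Cache, repo func(id string) ([]byte, error), id string) ([]byte, error) {
	if v, ok := cache.Get("profile:" + id); ok {
		return v, nil // cache hit: no database round trip
	}
	v, err := repo(id)
	if err != nil {
		return nil, err
	}
	cache.Set("profile:"+id, v)
	return v, nil
}

func main() {
	cache := &mapCache{data: map[string][]byte{}}
	dbHits := 0
	repo := func(id string) ([]byte, error) {
		dbHits++ // count trips to the backing store
		return []byte("user-" + id), nil
	}

	loadProfile(cache, repo, "42") // miss: goes to the repository
	loadProfile(cache, repo, "42") // hit: served from the cache
	fmt.Println(dbHits)            // prints 1: the repository was queried once
}
```

A production version would add TTLs and invalidation on write, which is where most of the real complexity lives.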

NATS JetStream scaling. For event-driven workloads, NATS JetStream scales horizontally through its clustered mode. We partition high-volume event streams across multiple subjects and use consumer groups to distribute processing across service instances.
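Subject partitioning can be sketched as a stable hash of a partition key, so all events for one key land on the same subject and preserve their relative order. The subject layout and partition count here are illustrative, not Andromeda's actual stream topology:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const numPartitions = 8 // illustrative partition count

// subjectFor maps a partition key (e.g. an account ID) to a stable
// subject; the same key always hashes to the same partition.
func subjectFor(key string) string {
	h := fnv.New32a()
	h.Write([]byte(key))
	return fmt.Sprintf("payments.events.%d", h.Sum32()%numPartitions)
}

func main() {
	// Deterministic: repeated calls for one key yield one subject.
	fmt.Println(subjectFor("account-123") == subjectFor("account-123")) // prints true
}
```

Consumers then bind to subject wildcards per partition, and adding instances redistributes partitions rather than individual messages.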

Stage Four: Organizational Scaling

Technical scaling is necessary but not sufficient. As the system grows, so does the team. And a team that cannot ship changes quickly is a bottleneck no amount of hardware can fix.

Our organizational scaling practices include:

Service ownership. Each service has a designated owning team. The owning team is responsible for the service's reliability, performance, and evolution. Ownership does not mean exclusivity; anyone can contribute to any service. But the owning team reviews changes, triages incidents, and sets the technical direction.

Runbooks. Every service has a runbook that documents common failure modes, debugging procedures, and recovery steps. Runbooks are stored alongside the service code in the monorepo, ensuring they stay up to date.

Load testing as a habit. We run load tests against staging before every significant release. Load tests are scripted, repeatable, and measure both throughput and latency percentiles. The results are compared against baselines to catch regressions before they reach production.

Capacity planning. Every quarter, we review traffic growth trends and project forward. This exercise, conducted jointly by engineering and product, ensures that scaling work is budgeted and prioritized alongside feature work. There is nothing worse than a scaling emergency that could have been prevented with a week of work three months earlier.

The Cost of Premature Scaling

A recurring theme in our scaling journey is the cost of premature optimization. Early in Andromeda's life, we introduced a distributed cache before we needed one. The cache added operational complexity (cache invalidation, consistency issues, an additional service to monitor) without measurable benefit because the database was handling the load just fine. We eventually removed the cache, simplified the architecture, and reintroduced it six months later when traffic genuinely warranted it.

The lesson is straightforward: measure before you optimize, and optimize the actual bottleneck, not the one you imagine. Go's profiling tools make measurement cheap. Use them aggressively.

Conclusion

Scaling Go services from startup to enterprise is a journey of incremental improvements, not a one-time architectural overhaul. Start with profiling to identify real bottlenecks. Exploit vertical scaling while it is effective. Move to horizontal scaling when single-instance limits are reached. Scale the data layer to match the service layer. And invest in organizational practices that allow the team to ship scaling improvements quickly and safely.

The most important insight is that scaling is a business function, not a technical one. The goal is not to handle the most requests per second possible. The goal is to handle the requests your users actually send, with acceptable latency, at reasonable cost. Every scaling decision should be evaluated against that goal. Go's efficiency, simplicity, and excellent tooling make it an ideal language for this kind of pragmatic, measured approach to growth.
