Cluster Design Principles
A GPU cluster for AI inference is not just multiple GPUs in a rack. The interconnect topology, network fabric, storage subsystem, and scheduling layer all determine whether your cluster delivers linear performance scaling or becomes a bottleneck-limited system where half the GPUs sit idle waiting for data. We design clusters based on your model sizes, parallelism requirements, and concurrency targets to ensure every GPU is productively utilized.
GPU Selection and Sizing
Match A100, H100, or L40S to your workload. H100 for maximum throughput on large models. A100 for proven reliability at lower cost. L40S for mixed workloads in standard server chassis. Right-size to avoid idle capacity.
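A minimal sizing sketch of the idea above: pick the smallest GPU count whose combined memory holds the model weights with headroom left for KV cache. The function name, the 20% headroom figure, and the example model sizes are illustrative assumptions, not a sizing guarantee.

```python
# Rough VRAM sizing sketch (illustrative assumptions, not vendor guidance).
# Weights dominate memory; KV cache and activations need headroom on top.

GPU_VRAM_GB = {"A100-80GB": 80, "H100-80GB": 80, "L40S": 48}

def min_gpus(params_b: float, bytes_per_param: float, vram_gb: int,
             overhead: float = 0.2) -> int:
    """Smallest power-of-two GPU count whose combined VRAM holds the
    weights, reserving `overhead` for KV cache and activations."""
    weights_gb = params_b * bytes_per_param
    usable_per_gpu = vram_gb * (1 - overhead)
    gpus = 1
    while gpus * usable_per_gpu < weights_gb:
        gpus *= 2  # tensor parallelism is typically run at power-of-two degrees
    return gpus

# A 70B model in FP16 (~140 GB of weights) on 80 GB GPUs:
print(min_gpus(70, 2.0, GPU_VRAM_GB["H100-80GB"]))  # -> 4
```

The same arithmetic flags idle capacity in the other direction: a model that fits on two GPUs gains nothing from an eight-GPU tensor-parallel group.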
NVLink Topology
Tensor parallelism across GPUs requires high-bandwidth interconnect. NVLink 4.0 provides 900 GB/s between H100 GPUs versus 64 GB/s on PCIe. NVSwitch enables all-to-all GPU communication within a node.
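To make the bandwidth gap concrete, here is a back-of-envelope comparison using the figures quoted above (900 GB/s NVLink 4.0 vs. 64 GB/s PCIe). The per-step traffic volume is an illustrative assumption; real all-reduce traffic depends on model width, layer count, and batch size.

```python
# Time to move the same tensor-parallel all-reduce traffic over NVLink 4.0
# vs. PCIe. Bandwidths are from the text; the 2 GiB traffic figure is an
# illustrative assumption, not a measured workload.

def transfer_ms(bytes_moved: float, bw_gb_s: float) -> float:
    """Ideal transfer time in milliseconds at the given bandwidth (GB/s)."""
    return bytes_moved / (bw_gb_s * 1e9) * 1e3

traffic = 2 * 1024**3  # assume 2 GiB of all-reduce traffic per step
print(f"NVLink 4.0: {transfer_ms(traffic, 900):.2f} ms")  # ~2.39 ms
print(f"PCIe:       {transfer_ms(traffic, 64):.2f} ms")   # ~33.55 ms
```

The ~14x gap tracks the raw bandwidth ratio, which is why tensor-parallel groups are kept inside the NVLink domain.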
Cluster Networking
InfiniBand NDR (400 Gb/s) or RoCE v2 for multi-node communication. RDMA-capable NICs reduce CPU overhead. Non-blocking fat-tree or rail-optimized topology based on traffic patterns and budget.
Right-Sizing Methodology
Profile your workload to determine GPU count, memory per GPU, and inter-GPU bandwidth requirements. Start with the minimum viable cluster and scale based on measured utilization, not theoretical projections.
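One way to sketch the "scale on measured utilization" rule: size the next replica count from sustained observed utilization against a target band. The function and the 70% target are illustrative assumptions; the point is that the input is a measurement, not a projection.

```python
# Right-sizing sketch: grow (or shrink) the replica count from measured
# sustained GPU utilization. The 0.7 target band is an illustrative assumption.

def next_replica_count(current: int, measured_util: float,
                       target_util: float = 0.7) -> int:
    """Replica count that would bring sustained utilization near the target."""
    if measured_util <= 0:
        return current
    needed = current * measured_util / target_util
    return max(1, round(needed))

print(next_replica_count(4, 0.95))  # sustained 95% on 4 replicas -> 5
print(next_replica_count(4, 0.30))  # sustained 30% on 4 replicas -> 2
```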
Cluster Design Process
Profile
Workload analysis and modeling
Design
Topology and component selection
Validate
Benchmark before full build
Deploy
Installation and burn-in testing

GPU Cluster Architecture
Multi-GPU Configurations
The number of GPUs and how they communicate determines what models you can serve and at what throughput. We design configurations that match your specific model portfolio.
Single-node tensor parallelism. Split a single large model across 2, 4, or 8 GPUs within one server. NVLink provides the bandwidth for tensor parallel communication. An 8xH100 node with NVSwitch serves 405B-parameter models at production throughput. This is the simplest multi-GPU configuration and the most common for enterprise inference.
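A quick arithmetic check of the 8xH100 claim above: 405B parameters quantized to one byte each (FP8) is roughly 405 GB of weights against 8 x 80 GB = 640 GB of combined VRAM, leaving room for KV cache. The FP8 assumption is ours; at FP16 the weights alone (~810 GB) would not fit on one node.

```python
# Fit check for a 405B-parameter model on an 8x H100 80 GB node,
# assuming FP8 (1 byte/param) quantization. Illustrative arithmetic only.
weights_gb = 405e9 * 1 / 1e9   # ~405 GB of weights at one byte per parameter
vram_gb = 8 * 80               # 8x H100, 80 GB each
assert weights_gb < vram_gb
print(vram_gb - weights_gb)    # headroom left for KV cache and activations
```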
Multi-node pipeline parallelism. Distribute model layers across multiple servers connected via InfiniBand. Enables serving models larger than a single node can hold. Pipeline parallelism introduces bubble overhead, so it is only justified for models that truly cannot fit on one node.
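The bubble overhead mentioned above has a standard closed form: with p pipeline stages and m microbatches in flight, the idle-time fraction is (p - 1) / (m + p - 1). A sketch, with example values chosen for illustration:

```python
# Pipeline bubble sketch: the classic idle-time fraction for a p-stage
# pipeline fed m microbatches is (p - 1) / (m + p - 1). This overhead is
# why pipeline parallelism is reserved for models that cannot fit on one node.

def bubble_fraction(stages: int, microbatches: int) -> float:
    return (stages - 1) / (microbatches + stages - 1)

print(f"{bubble_fraction(4, 16):.1%}")  # 4 stages, 16 microbatches -> 15.8%
print(f"{bubble_fraction(4, 64):.1%}")  # more microbatches shrink the bubble
```

More microbatches amortize the bubble but raise latency per request, which is the trade a design review has to weigh.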
Replica scaling. Multiple independent model replicas behind a load balancer. Each replica serves a complete model on one or more GPUs. Linear throughput scaling with no inter-node communication overhead. The preferred scaling approach for models that fit on a single node.
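Because replicas are independent, capacity planning for this pattern reduces to division plus headroom. A minimal sketch; the throughput numbers and the 20% failover headroom are illustrative assumptions.

```python
# Replica-scaling sketch: independent replicas scale throughput linearly,
# so sizing is division with headroom. Numbers are illustrative assumptions.
import math

def replicas_needed(target_tps: float, per_replica_tps: float,
                    headroom: float = 0.2) -> int:
    """Replica count for a target aggregate tokens/s, keeping `headroom`
    spare so one replica can fail without dropping below target."""
    return math.ceil(target_tps / (per_replica_tps * (1 - headroom)))

print(replicas_needed(10_000, 1_500))  # 10k tok/s target at 1.5k per replica -> 9
```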
Network Fabric Design
The network connecting GPU nodes is often the performance bottleneck in multi-node clusters. We design network fabrics that match the communication patterns of your inference workload.
InfiniBand NDR. 400 Gb/s per port with RDMA for low-latency, CPU-bypass communication. The gold standard for multi-node AI clusters. ConnectX-7 adapters with NVIDIA Quantum-2 switches. Higher cost but significantly lower latency than Ethernet alternatives.
RoCE v2 over Ethernet. RDMA over Converged Ethernet provides RDMA capability on standard 100/200/400 GbE networks. Lower cost than InfiniBand with acceptable performance for replica-based scaling where inter-node communication is limited to load balancing decisions.
Who This Is For
GPU cluster architecture consulting is for organizations building inference infrastructure that goes beyond a single server. Whether you need to serve very large models across multiple GPUs or scale throughput with multiple replicas, the cluster design determines your cost efficiency and performance ceiling.
Contact us at ben@oakenai.tech
