Plan Before You Provision
GPU infrastructure is expensive. A single H100 server costs $200,000+. Cloud GPU instances run $2-30 per hour per GPU. Without capacity planning, organizations either over-provision (wasting $50,000+/year on idle GPUs) or under-provision (users hit latency spikes and adoption stalls). Capacity planning uses your actual usage data, growth projections, and cost constraints to determine the right infrastructure size at the right time.
Cost Modeling
Build financial models comparing on-prem CAPEX, cloud reserved instances, on-demand pricing, and hybrid approaches. Factor in electricity, cooling, maintenance, and staffing for on-prem. Factor in data transfer and storage for cloud.
Scaling Policies
Define when and how to scale based on GPU utilization, queue depth, latency percentiles, and the business calendar, using auto-scaling policies that respond to demand without manual intervention.
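A minimal sketch of how those signals can combine into a scaling decision. The thresholds (80% utilization, queue depth of 20, a 500 ms p95 SLO) are illustrative assumptions, not recommendations:

```python
def scaling_decision(gpu_util, queue_depth, p95_latency_ms,
                     util_high=0.80, util_low=0.30,
                     queue_max=20, latency_slo_ms=500):
    """Return 'scale_up', 'scale_down', or 'hold' from current signals.

    Thresholds here are placeholder values for illustration.
    """
    # Any overload signal triggers a scale-up.
    if (gpu_util > util_high or queue_depth > queue_max
            or p95_latency_ms > latency_slo_ms):
        return "scale_up"
    # Scale down only when every signal shows comfortable headroom.
    if gpu_util < util_low and queue_depth == 0 and p95_latency_ms < latency_slo_ms / 2:
        return "scale_down"
    return "hold"

print(scaling_decision(0.85, 5, 300))   # utilization over 80% -> scale_up
print(scaling_decision(0.20, 0, 120))   # idle and fast -> scale_down
```

In practice the same logic would feed an autoscaler (e.g. a cloud auto-scaling group or a Kubernetes autoscaler) rather than a print statement.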
Instance Mix Optimization
Balance reserved instances (cheapest per hour), on-demand (most flexible), and spot (cheapest but interruptible). The optimal mix depends on your workload predictability and tolerance for interruption.
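The mix trade-off above can be sketched numerically. The per-GPU-hour prices below are example assumptions (reserved cheaper than on-demand, spot cheapest), not quotes from any provider:

```python
# Example per-GPU-hour prices -- assumptions for illustration only.
PRICES = {"reserved": 2.0, "on_demand": 4.0, "spot": 1.2}

def mix_hourly_cost(baseline_gpus, burst_gpus, spot_fraction):
    """Hourly cost when reserved covers baseline and burst splits
    between spot (interruptible) and on-demand (flexible)."""
    spot = burst_gpus * spot_fraction
    on_demand = burst_gpus - spot
    return (baseline_gpus * PRICES["reserved"]
            + spot * PRICES["spot"]
            + on_demand * PRICES["on_demand"])

# 16 baseline GPUs, 8 burst GPUs, half the burst tolerant of interruption
print(f"${mix_hourly_cost(16, 8, 0.5):.2f}/hour")
```

Sweeping `spot_fraction` against your actual interruption tolerance is usually enough to find the cheapest workable mix.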
Growth Forecasting
Project infrastructure needs 3, 6, and 12 months ahead based on current adoption rate, planned rollouts, and seasonal patterns. Account for GPU procurement lead times (4-16 weeks for on-prem hardware).
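A simple compound-growth projection illustrates the horizon math; the 20% monthly adoption growth rate is a placeholder assumption you would replace with your measured rate:

```python
import math

def projected_gpus(current_gpus, monthly_growth, months):
    """Project GPU demand by compound growth, rounded up
    because capacity comes in whole GPUs."""
    return math.ceil(current_gpus * (1 + monthly_growth) ** months)

# Placeholder inputs: 10 GPUs today, 20% monthly growth.
for months in (3, 6, 12):
    print(f"{months:2d} months: {projected_gpus(10, 0.20, months)} GPUs")
```

Layer planned rollouts and seasonal peaks on top of the compound baseline, then subtract the procurement lead time to find the order date.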
Capacity Planning Cycle
Measure: current utilization and trends
Model: cost scenarios and projections
Plan: procurement and scaling decisions
Review: monthly actuals vs. forecast
Capacity Planning Metrics
Cost Modeling Approach
We build spreadsheet-based cost models that compare your options with real numbers, not vendor-provided estimates that conveniently favor their solution.
On-premises total cost of ownership. Hardware purchase price plus 3 years of electricity (~$0.10/kWh, 10-12 kW per 8-GPU server), cooling (typically 40% of compute power cost), maintenance contracts (10-15% of hardware cost per year), rack space ($500-2,000/month per rack), and staff time for operations. Depreciated over 3-5 years. Break-even versus cloud typically occurs at 40-60% average GPU utilization.
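A worked version of that arithmetic, using midpoints of the ranges quoted above (11 kW, 12% maintenance, $1,250/month rack, 4-year depreciation); the $3/GPU-hour cloud rate is an added assumption for the break-even comparison:

```python
HOURS_PER_YEAR = 24 * 365

def onprem_annual_cost(hardware_price=200_000, dep_years=4, power_kw=11.0,
                       kwh_price=0.10, maint_rate=0.12, rack_month=1_250):
    """Annualized on-prem TCO for one 8-GPU server.
    Defaults are midpoints of the ranges in the text."""
    depreciation = hardware_price / dep_years
    electricity = power_kw * HOURS_PER_YEAR * kwh_price
    cooling = 0.40 * electricity            # ~40% of compute power cost
    maintenance = maint_rate * hardware_price
    rack = 12 * rack_month
    return depreciation + electricity + cooling + maintenance + rack

def breakeven_utilization(gpus=8, cloud_hourly_per_gpu=3.0):
    """Average utilization at which cloud spend matches on-prem TCO.
    The $3/GPU-hour cloud rate is an assumption, not a quote."""
    cloud_full = gpus * cloud_hourly_per_gpu * HOURS_PER_YEAR
    return onprem_annual_cost() / cloud_full

print(f"on-prem: ${onprem_annual_cost():,.0f}/yr")
print(f"break-even utilization vs cloud: {breakeven_utilization():.0%}")
```

With these placeholder inputs the break-even lands in the 40-60% utilization band cited above; staff time is deliberately excluded here and should be added per your org.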
Cloud cost projection. On-demand pricing as the ceiling. Reserved instance pricing (1-year and 3-year) as the floor for predictable workloads. Spot pricing for batch workloads. Data transfer costs for ingestion and inference traffic. Storage costs for model weights and cached data. Monitoring and logging costs that scale with request volume.
Hybrid optimization. On-prem for baseline capacity, cloud for burst. On-prem for sensitive workloads, cloud for general-purpose. A hybrid model captures the best economics of both approaches when workloads split cleanly along these sensitivity and variability lines.
Procurement Timing
GPU hardware has significant lead times. Ordering too late means capacity shortfalls. Ordering too early means idle hardware depreciating in a rack.
Lead time awareness. NVIDIA H100 servers: 4-12 weeks depending on configuration and vendor. A100 servers: 2-6 weeks (more available on secondary market). Cloud reserved instances: immediate availability but 1- or 3-year commitment required.
Trigger-based procurement. Set utilization thresholds that trigger procurement processes with enough lead time. Example: when average GPU utilization exceeds 70% for 2 consecutive weeks, initiate hardware order for delivery in 8 weeks. This prevents both over-provisioning and emergency orders at premium prices.
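The trigger in that example can be sketched as a simple check over weekly utilization averages; the data here is illustrative:

```python
def should_order(weekly_avg_util, threshold=0.70, weeks_required=2):
    """True when the trailing `weeks_required` weekly averages all
    exceed `threshold` -- mirroring the 70%-for-2-weeks example."""
    if len(weekly_avg_util) < weeks_required:
        return False
    return all(u > threshold for u in weekly_avg_util[-weeks_required:])

history = [0.55, 0.62, 0.71, 0.74]   # weekly average GPU utilization
print(should_order(history))         # last two weeks above 70% -> True
```

Running this check weekly against your monitoring data, with the order placed for delivery inside the known lead-time window, is the whole mechanism.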
Who This Is For
Capacity planning is for organizations spending $10,000+/month on AI infrastructure or planning to. The savings from right-sizing and optimal purchasing strategies typically pay for the planning engagement within the first quarter.
Contact us at ben@oakenai.tech
