GPU Cloud Comparison 2026

Twelve providers compared on H100 pricing, billing model, and which workloads each one fits.

This page tracks twelve GPU cloud providers running H100, H200, A100, and adjacent accelerators for AI training, fine-tuning, and inference.

Pricing reflects each provider's publicly listed on-demand H100 rate at the time of the most recent refresh. Providers that sell only into committed-capacity contracts show "On request" in place of a number.

The 2026 GPU cloud market splits into four shapes, with aggregators sitting across the layers. Hyperscale neoclouds like CoreWeave and Nebius compete for sustained training capacity at contract pricing, with Nscale as a sales-led alternative targeting larger enterprises. On-demand specialists like Lambda, RunPod, and Together AI publish per-minute or per-second rates. Serverless platforms like Modal, Fal, Replicate, and Baseten run inference behind a function-call API. Marketplaces like Vast.ai aggregate independent GPU owners at the lowest published rates anywhere, with reliability varying by host. zCLOUD routes capacity across a network of underlying providers spanning all four. The table covers all of them.

Provider comparison

12 tracked, ordered alphabetically

Refreshed: May 26, 2026 · Source: public pricing pages

Provider    | Category                  | H100 / hr  | Billing    | GPUs available           | Regions    | Access
------------|---------------------------|------------|------------|--------------------------|------------|-----------
Baseten     | Inference platform        | $6.50/hr   | per-minute | B200·H100·A100           | US         | Self-serve
CoreWeave   | Hyperscale neocloud       | $6.16/hr   | per-hour   | B200·H200·H100·A100·L40S | US, EU     | Self-serve
Fal         | Inference, media          | $1.89*/hr  | per-second | B200·H200·H100·A100      | Global     | Self-serve
Lambda      | Training-focused          | $3.99*/hr  | per-minute | B200·H100·A100           | US         | Self-serve
Modal       | Serverless compute        | $3.95/hr   | per-second | B200·H200·H100·A100·L40S | US         | Self-serve
Nebius      | European neocloud         | $2.95/hr   | per-second | B200·H200·H100·L40S      | EU, US     | Self-serve
Nscale      | AI infrastructure         | On request | contract   | B200·H200·H100           | EU, UK, US | Sales
Replicate   | Model-hosting             | $5.49/hr   | per-second | H100·A100·L40S           | Global     | Self-serve
RunPod      | On-demand and spot        | $1.99*/hr  | per-second | B200·H200·H100·A100·L40S | Global     | Self-serve
Together AI | Inference and training    | $5.49*/hr  | per-minute | B200·H200·H100           | US         | Self-serve
Vast.ai     | Marketplace               | $1.87*/hr  | per-second | B200·H200·H100·A100·L40S | Global     | Self-serve
zCLOUD      | Aggregator, 40+ providers | On request | bid/ask    | B200·H200·H100·A100·L40S | Global     | Sales
*Marked prices are starting-at rates or lowest-tier pricing. Each provider profile explains the full context.
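
For readers who want to sort or filter the table themselves, the rates can be treated as data. The following is a minimal Python sketch, assuming the figures are copied by hand from the table. The tuple layout, the use of `None` for "On request", and the `by_price` helper are illustrative choices, not anything a provider publishes.

```python
# H100 on-demand rates copied by hand from the table above.
# None stands in for "On request"; starting_at mirrors the * footnote.
# The tuple layout and helper name are illustrative only.
ROWS = [
    # (provider, h100_per_hr, billing, starting_at)
    ("Baseten", 6.50, "per-minute", False),
    ("CoreWeave", 6.16, "per-hour", False),
    ("Fal", 1.89, "per-second", True),
    ("Lambda", 3.99, "per-minute", True),
    ("Modal", 3.95, "per-second", False),
    ("Nebius", 2.95, "per-second", False),
    ("Nscale", None, "contract", False),
    ("Replicate", 5.49, "per-second", False),
    ("RunPod", 1.99, "per-second", True),
    ("Together AI", 5.49, "per-minute", True),
    ("Vast.ai", 1.87, "per-second", True),
    ("zCLOUD", None, "bid/ask", False),
]


def by_price(rows, billing=None):
    """Rows with a published rate, cheapest first, optionally filtered by billing model."""
    published = [r for r in rows if r[1] is not None]
    if billing is not None:
        published = [r for r in published if r[2] == billing]
    return sorted(published, key=lambda r: r[1])


if __name__ == "__main__":
    for name, rate, billing, starting_at in by_price(ROWS, billing="per-second"):
        marker = "*" if starting_at else ""
        print(f"{name:<12} ${rate:.2f}{marker}/hr  {billing}")
```

Rows marked with `*` are starting-at rates, so a sorted list like this ranks headline prices rather than what a given workload will actually pay.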

Choosing a provider

Pricing alone is not a decision. These four framings cover the most common starting points, and the provider profiles below explain each name in detail.

If you want the lowest published rate

Three providers sit within twelve cents of each other at the floor. Vast.ai at $1.87/hr lists the lowest of the three, sourced from a marketplace of independent GPU owners worldwide. Fal at $1.89/hr publishes a flat self-serve rate on dedicated H100 compute. RunPod at $1.99/hr comes in ten cents above that on its Community Cloud tier, while Secure Cloud sits higher for workloads that need a tighter SLA.

Each one trades something for the price. Vast.ai's hosts vary in consistency since the supply comes from independent operators, Fal's product surface is narrower than the broader-platform providers, and RunPod's cheaper tier ships with weaker uptime guarantees than its Secure Cloud equivalent. If the workload tolerates some variance, the floor is real and the savings compound. If it doesn't, the next framing fits better.

If you want predictable production performance

Lambda, Modal, and Baseten all own their hardware and run it as either a single-tenant rental or a function-call service. Lambda's $3.99/hr H100 rate suits training workloads where a full 8-GPU node runs for hours or days continuously. Modal at $3.95/hr bills per-second on a true serverless model, so idle time costs nothing and bursty inference scales cleanly. Baseten at $6.50/hr runs production inference behind an API with the platform layer built into the rate, covering auto-scaling, observability, and rollbacks.

Replicate sits adjacent to this group with a different shape. The platform pairs per-output billing on a large catalog of popular open models with per-second dedicated hardware at $5.49/hr, which makes it the fastest path from "we want to try this model" to "we have a working endpoint." Builders prototyping around an open model often start there and evaluate Modal or Baseten as production volume grows.

These are three different shapes of production-ready, and the right one depends on whether the workload is bursty inference, sustained training, or a managed model endpoint.

If you need cluster-scale capacity

CoreWeave, Nebius, and Together AI sell multi-GPU configurations as their primary product rather than as an upsell. CoreWeave at $6.16/hr per GPU sells the HGX H100 8-GPU node and runs the broadest hardware fleet in the table, including GB200 NVL72 systems aimed at frontier training. Nebius at $2.95/hr per GPU undercuts most of the field on per-GPU pricing while operating from EU and US regions, which makes it a strong default for European teams. Together AI at $5.49/hr per GPU pairs cluster compute with its open-model serverless inference stack, so a team training a model can move it to production on the same platform.

This framing matters when the workload needs sustained access to 8, 64, or 256 GPUs rather than ad hoc rentals.

If your requirements change, or you do not know yet

The first three framings assume the buyer already knows the answer, but most don't. Training requirements shift between fine-tuning runs, inference traffic spikes unpredictably, and a region constraint can turn into a hard requirement halfway through a project.

zCLOUD aggregates supply from 40+ providers and routes capacity at purchase time using a bid/ask system. No rate is published in advance because the price depends on what's actually available when the request lands. The fit is for buyers who would rather not commit to one provider's strengths and weaknesses before they've worked out which ones matter most. Nscale fits a similar shape on the contract side, sales-led from the start and focused on large-scale infrastructure with negotiated terms.

Provider profiles

Twelve providers, alphabetical. What each company does, how it bills, who tends to use it, and where it earns its place in the table.

Baseten

Inference platform · Self-serve · US

Baseten runs production inference for teams that want a model behind an API without operating GPU infrastructure themselves. The platform sits a layer above raw compute, handling cold starts, auto-scaling, and observability so the engineering work focuses on the model rather than the cluster.

On-demand pricing is now public. The H100 80GB rate sits at $6.50/hr with per-minute billing, and the broader catalog spans B200, H100, and A100. The headline rate runs higher than raw-compute providers because the platform layer is built into the price, which is the trade Baseten asks production teams to make.

Common workloads include large language models, transcription, and image generation served at production latency, often with traffic shapes that need scale-from-zero and burst handling. Teams that have outgrown a hobby-tier inference provider but don't want to run their own Kubernetes typically land here.

The Model APIs product, separate from dedicated deployments, lets teams call popular open-weights models like DeepSeek, Kimi, and GLM at per-token rates. The two surfaces share the same infrastructure.

Visit baseten.co

CoreWeave

Hyperscale neocloud · Self-serve · US, EU

CoreWeave runs one of the larger NVIDIA fleets in the market, with capacity spanning GB200 NVL72, B200, H200, H100, A100, and L40S. The hardware breadth is the headline story for buyers, since it covers everything from a single L40S inference instance up to a multi-thousand-GPU training cluster.

On-demand pricing exists and is self-serve. An 8-GPU HGX H100 node prices at $49.24/hr, which works out to $6.16 per GPU per hour, with billing measured in instance-hours rather than seconds or minutes. Reserved capacity discounts run up to 60% for committed terms, and the company sells contracts measured in months or years as its primary revenue path.
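
The per-GPU figure is plain division, and the reserved discount can be applied the same way. A minimal sketch in Python follows, using `Decimal` so the cent rounding is exact. The helper names are illustrative, and the 60% figure is the stated ceiling rather than a quoted reserved price.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def per_gpu_rate(node_price_per_hr: Decimal, gpus_per_node: int) -> Decimal:
    """Split a node's hourly price evenly across its GPUs, rounded to the cent."""
    return (node_price_per_hr / gpus_per_node).quantize(CENT, rounding=ROUND_HALF_UP)


def with_discount(rate: Decimal, discount: Decimal) -> Decimal:
    """Apply a fractional discount (0.60 means 60% off), rounded to the cent."""
    return (rate * (1 - discount)).quantize(CENT, rounding=ROUND_HALF_UP)


node_price = Decimal("49.24")  # 8-GPU HGX H100 node, per hour
on_demand = per_gpu_rate(node_price, 8)
# "Up to 60%" is a ceiling, so this is a best case, not a quote.
best_case = with_discount(on_demand, Decimal("0.60"))

print(f"on-demand per GPU: ${on_demand}/hr")           # $6.16/hr
print(f"best-case reserved per GPU: ${best_case}/hr")  # $2.46/hr
```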

The fit covers sustained training and large-scale inference workloads where high-bandwidth networking and capacity guarantees matter more than per-hour billing granularity. Teams running multi-node training jobs across dozens or hundreds of GPUs tend to land here once the workload outgrows a single-cluster provider.

Procurement on the contract side runs in weeks. Self-serve on-demand spin-up exists for buyers who want to try the platform before committing.

Visit coreweave.com

Fal

Inference platform · Self-serve · Global

Fal specializes in inference for image, video, and audio models, with a catalog of more than 1,000 production-ready open-weights models exposed behind a unified API. The platform handles cold starts, queueing, and model warming so a single endpoint can serve burst consumer-facing traffic without a self-managed cluster.

The company runs two pricing surfaces. The Model APIs charge per output unit (per image, per video-second, per audio-second), which suits teams that don't want to think about GPU hours at all. Dedicated compute rents H100 capacity at $1.89/hr with per-second billing, which puts Fal near the floor of the table for raw GPU pricing.

The low Compute rate likely reflects economics from the Model APIs side, where the company captures margin on the platform layer. Customers who use the Compute product typically have a model that doesn't fit the standard catalog and want raw access to a GPU at a competitive rate.

Teams shipping consumer-facing media products land here because the Model APIs include current-generation models like Flux, Veo, Kling, and Seedance behind a single integration. Production traffic at scale routes through dedicated endpoints with reserved capacity.

Visit fal.ai

Lambda

Training-focused · Self-serve · US

Lambda sells GPU compute aimed primarily at training and fine-tuning workloads. Current capacity spans B200, H100, A100, and older Tesla V100s across US regions.

The H100 SXM rate prices at $3.99/hr per GPU on an 8-GPU instance, with single-GPU rentals priced higher at $4.29/hr. The 1-Click Clusters product extends the same hardware into 16-to-2,000-GPU configurations with B200 capacity starting at $9.86/hr per GPU on 16-GPU clusters, scaling down to $8.87/hr at 256+ GPUs. Billing runs per-minute on on-demand, weekly on 1-Click Clusters.
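
Weekly billing changes the size of the smallest possible bill. The sketch below estimates it from the two published B200 endpoints, assuming a billing week is 168 hours and that the whole cluster is charged for whole weeks. Rates for cluster sizes between 16 and 256 GPUs are not stated here, so they are left out.

```python
from decimal import Decimal

HOURS_PER_WEEK = 24 * 7  # assumption: one billing week is 168 hours

# The two published endpoints for B200 1-Click Clusters, per GPU per hour.
RATE_AT_16_GPUS = Decimal("9.86")
RATE_AT_256_GPUS = Decimal("8.87")


def weekly_cost(gpus: int, rate_per_gpu_hr: Decimal, weeks: int = 1) -> Decimal:
    """Total cost when every GPU in the cluster is billed for whole weeks."""
    return gpus * rate_per_gpu_hr * HOURS_PER_WEEK * weeks


small = weekly_cost(16, RATE_AT_16_GPUS)
large = weekly_cost(256, RATE_AT_256_GPUS)
# Fractional per-GPU saving between the two published tiers.
saving = (RATE_AT_16_GPUS - RATE_AT_256_GPUS) / RATE_AT_16_GPUS

print(f"16 GPUs, one week:  ${small:,.2f}")
print(f"256 GPUs, one week: ${large:,.2f}")
print(f"per-GPU saving at 256+: {saving:.1%}")
```

Under these assumptions a one-week 16-GPU cluster lands in the mid five figures, which is why the product suits planned runs rather than short experiments.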

The platform suits training and fine-tuning work where the team wants a familiar Linux box with the standard CUDA stack and the option to scale into multi-node clusters. The "self-serve, first-come access" model means capacity isn't always available at the size you want, which is one trade for the relatively clean pricing structure.

Lambda is no longer the cheapest H100 in the table, and the gap to the floor is wide. The current positioning is closer to "stable training-grade infrastructure with predictable pricing" than "lowest published rate."

Visit lambdalabs.com

Nebius

European neocloud · Self-serve · EU, US

Nebius operates large H200, H100, B200, and L40S capacity across EU and US regions, with the European footprint and compliance posture as the primary draw for buyers based in or selling into the EU.

On-demand H100 prices at $2.95/hr with per-second billing, putting Nebius among the cheaper hyperscale-grade options that still own their hardware. Reserved capacity discounts run up to 35% for committed terms, and the unified billing model introduced in late 2025 bundles GPU, vCPU, and RAM into a single per-GPU-hour rate rather than charging components separately.

The platform suits European AI companies handling regulated data or building products for the EU market, and the residency story is genuine rather than marketing language. US workloads route competitively too, particularly for teams that want a self-serve alternative to CoreWeave at training scale.

The fleet includes GB200 and B300 NVLink systems for frontier workloads, though those tier into "Contact us" pricing rather than published rates.

Visit nebius.com

Nscale

AI infrastructure · Sales-led · EU, UK, US

Nscale operates bare-metal NVIDIA GPU infrastructure for large-scale training and inference workloads. Data centers span Norway, the UK, the US, Portugal, and Iceland, with additional capacity coming online in West Virginia.

The product line spans bare-metal GPU compute, managed Slurm and Kubernetes, and AI services like inference endpoints and fine-tuning. Published on-demand pricing for raw compute remains limited, and engagements run through sales sized to workload, term, and capacity reservation across data centers.

The AI Services product (inference, fine-tuning, prompt workbench) is self-serve with Stripe-based credit purchases, but the compute infrastructure that anchors the company's revenue continues to be sales-led. Buyers who only need inference endpoints can land on the platform without a sales conversation.

Workloads that fit Nscale involve a procurement process measured in weeks or months, capacity reservations spanning multiple regions, and a contract specifying performance and operational guarantees.

Visit nscale.com

Replicate

Model-hosting platform · Self-serve · Global

Replicate runs a model-hosting platform with a large catalog of open-weights models exposed behind a per-second-billed API. The product surface targets developers who want to call a model the way they call any other API, with cold-start handling, model versioning, and a web playground reducing the friction between idea and working prototype.

H100 capacity prices at $5.49/hr with per-second billing for dedicated hardware, alongside A100 at $5.04/hr and L40S at $3.51/hr. The catalog also includes per-output pricing for popular public models like Flux, Claude, and DeepSeek, which is the dominant usage pattern.

Builders prototyping a product around an open model, or running batch inference jobs at modest volume, tend to find Replicate fastest to value. The deployment experience for custom models, built on Cog (Replicate's open-source packaging tool), means a team can move from "model trained on local machine" to "model running behind a production API" in an afternoon.

Teams that scale into steady high-volume traffic usually evaluate Modal or Fal as the per-request economics shift in their favor.

Visit replicate.com

RunPod

On-demand and spot · Self-serve · Global

RunPod sells on-demand and spot GPU instances at per-second billing across a fleet that spans B200, H200, H100, A100, L40S, and consumer-class GPUs like the 4090 and 5090. The platform covers more than 30 regions globally, which is the broadest geographic footprint of any provider in the table.

The H100 PCIe rate on Community Cloud prices at $1.99/hr, with H100 SXM at $2.69/hr and the higher-reliability Secure Cloud tier priced above both. Community Cloud uses capacity from independent hosts at lower rates and weaker SLAs, while Secure Cloud runs on RunPod-owned hardware with stronger uptime guarantees.

Three product modes serve different workloads. Pods are persistent GPU rentals priced by the hour at per-second granularity. Serverless adds autoscaling with per-second billing on cold-started workers, with the H100 Serverless rate at $4.18/hr. Clusters launch multi-GPU configurations from 8 to 64 GPUs for training jobs.
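
"Priced by the hour at per-second granularity" matters most for short jobs. The sketch below compares the same 95-second job under three billing increments, assuming usage is rounded up to the next increment. Real providers may add minimum charges or round differently, and the per-minute and per-hour cases are hypothetical comparisons at the same $1.99/hr rate, not RunPod offerings.

```python
import math


def billed_cost(runtime_seconds: int, rate_per_hr: float, increment_seconds: int) -> float:
    """Round runtime up to the next billing increment, then price it at the hourly rate."""
    increments = math.ceil(runtime_seconds / increment_seconds)
    billed_seconds = increments * increment_seconds
    return billed_seconds * rate_per_hr / 3600


RATE = 1.99   # Community Cloud H100 PCIe rate, $/hr
JOB = 95      # a short test run, in seconds

for label, increment in [("per-second", 1), ("per-minute", 60), ("per-hour", 3600)]:
    print(f"{label:<11} ${billed_cost(JOB, RATE, increment):.4f}")
```

The per-hour case bills the full $1.99 for a job that costs about five cents per-second, which is the gap that makes fine-grained billing attractive for iteration.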

Iteration, fine-tuning, and cost-sensitive inference workloads benefit from the fast spin-up and per-second billing. Production teams that need stricter SLAs route through Secure Cloud or move to providers with more rigid reliability guarantees.

Visit runpod.io

Together AI

Inference and training · Self-serve · US

Together AI runs three distinct products that share the same underlying GPU infrastructure. The Serverless Inference catalog exposes a wide range of open-weights models (Llama, DeepSeek, Qwen, Kimi, GLM) at per-token pricing, which is the company's most-used product. Dedicated Inference rents single-tenant GPU instances by the hour for teams that need guaranteed performance ($6.49/hr H100). GPU Clusters sell multi-node configurations for training and fine-tuning at $5.49/hr per GPU on-demand, with reserved pricing as low as $3.99/hr at 91+ day commitments.

Billing on the dedicated hardware side runs per-minute, with the company's tooling targeting teams who train, fine-tune, and serve their own models on the same platform.

A Batch Inference API runs at a 50% discount for workloads that don't need real-time responses, which materially shifts the economics for large-scale data labeling, evaluation, or offline generation jobs.

Customers building on open-weights models tend to start with the Serverless inference catalog and move to dedicated capacity as volume grows. The path from notebook to multi-node cluster on the same provider is a meaningful operational advantage that hyperscalers and pure-inference platforms struggle to match.

Visit together.ai

Vast.ai

GPU marketplace · Self-serve · Global

Vast.ai operates a marketplace where independent GPU owners list capacity for rent, which makes it structurally different from every other provider in the table. The company doesn't own the hardware. It runs the platform that matches buyers to sellers, charges by the second, and lets the market set the price.

H100 SXM pricing starts around $1.87/hr at the marketplace floor, with median pricing closer to $2.12/hr and the upper quartile at $3.40/hr depending on host reliability score and region. The fleet covers more than 60 GPU types, from current-gen B200 down to consumer 3060s, with the headline floor reflecting the cheapest available hosts at any given moment.

Three instance types serve different workloads. On-demand provides guaranteed uptime at the standard rate. Interruptible runs at a discount of 50% or more, but the host can reclaim capacity. Reserved offers commitment-based discounts up to 50% off.

Researchers, hobbyists, and price-sensitive teams use Vast.ai for training runs and experimentation where uptime guarantees take a back seat to dollar cost. Production workloads generally avoid the marketplace, though the platform's own filters let buyers screen out lower-tier hosts at higher prices. Bandwidth charges apply per byte transferred and vary by host, which can materially affect total cost on data-intensive workloads.

Visit vast.ai

zCLOUD

Aggregator · Sales-led · Global

zCLOUD aggregates capacity across a network of 40+ underlying GPU providers and presents it as a single procurement surface. The platform sits a layer above the providers in this table rather than alongside them, which makes the comparison structurally different from every other row.

Pricing uses a bid/ask system. Rather than publishing a rate card, zCLOUD prices each request against live availability from the underlying network at the time of purchase, then routes the workload to the best available match. Region constraints, hardware requirements, and term length all enter the routing logic.
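
To make the idea concrete, here is a purely hypothetical sketch of constraint-based matching in Python. It is not zCLOUD's routing logic, which is not published. The `Ask` and `Bid` types, their fields, the provider names, the prices, and the "cheapest eligible offer wins" rule are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Ask:
    """One hypothetical offer of capacity from an underlying provider."""
    provider: str
    gpu: str
    region: str
    price_per_gpu_hr: float
    gpus_available: int
    min_term_hours: int


@dataclass(frozen=True)
class Bid:
    """One hypothetical buyer request with its constraints."""
    gpu: str
    regions: frozenset
    gpus_needed: int
    term_hours: int
    max_price_per_gpu_hr: float


def match(bid: Bid, asks: list) -> Optional[Ask]:
    """Return the cheapest ask that satisfies every constraint, or None."""
    eligible = [
        a for a in asks
        if a.gpu == bid.gpu
        and a.region in bid.regions
        and a.gpus_available >= bid.gpus_needed
        and a.min_term_hours <= bid.term_hours
        and a.price_per_gpu_hr <= bid.max_price_per_gpu_hr
    ]
    return min(eligible, key=lambda a: a.price_per_gpu_hr, default=None)


# Made-up availability snapshot; none of these are real quotes.
ASKS = [
    Ask("provider-a", "H100", "us-east", 2.40, 8, 1),
    Ask("provider-b", "H100", "eu-west", 2.10, 64, 720),
    Ask("provider-c", "H100", "eu-west", 2.80, 16, 1),
    Ask("provider-d", "H200", "eu-west", 3.50, 8, 1),
]

bid = Bid("H100", frozenset({"eu-west"}), gpus_needed=8, term_hours=24,
          max_price_per_gpu_hr=3.00)
print(match(bid, ASKS))
```

In this made-up snapshot the cheapest EU offer requires a 720-hour term, so a 24-hour request falls through to the next eligible one. That is the kind of interaction between term length and price that a static rate card cannot express.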

The fit covers teams whose requirements shift faster than a single provider's strengths can keep up with. Typical cases include training workloads that move between fine-tuning runs, inference traffic that spikes unpredictably, region constraints that turn into hard requirements mid-project, or simply a buyer who hasn't yet decided which single provider's strengths matter most.

Engagement runs through sales because the routing logic depends on workload specifics that a self-serve checkout can't fully capture. The trade for that procurement step is access to breadth that no single provider can match, including the eleven other providers on this page when they offer the right capacity at the right price.

The site that hosts this comparison is operated by Zettabyte Technology, which also operates zCLOUD. The footer carries the disclosure.

Visit zettabytecloud.com

Frequently asked questions

Common questions about GPU cloud pricing, provider selection, and where the table draws its lines.

Which GPU cloud has the cheapest H100?

At the time of the last refresh, Vast.ai lists the lowest published H100 rate at $1.87/hr through its marketplace model.

The headline rate comes with the trade-offs of marketplace economics. Host reliability varies, capacity availability fluctuates, and production workloads usually pay a small premium for higher-reliability hosts.

Among curated single-provider clouds, Fal at $1.89/hr and RunPod at $1.99/hr sit close to the floor.

What's the difference between on-demand and reserved GPU pricing?

On-demand pricing applies to GPU capacity rented by the second, minute, or hour with no time commitment.

Reserved pricing applies to capacity bought on a contract, typically months or years, at a discount that often runs 30-60% below on-demand.

On-demand wins for short workloads, experimentation, and bursty traffic. Reserved wins once a workload runs sustained for weeks or longer.

Is RunPod or Modal better for inference?

The shape of the traffic decides it.

RunPod runs cheaper per GPU-hour and works well when a workload runs continuously enough to fill a rented instance.

Modal runs more expensive per GPU-hour but scales to zero when idle and starts in seconds, so it wins for variable traffic, batch jobs, and inference patterns with low average utilization.

Continuous high-volume inference favors RunPod or a dedicated cluster.
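
The crossover can be estimated from the two headline rates. This is a rough sketch, assuming an always-on instance at RunPod's $1.99/hr Community Cloud rate against scale-to-zero compute at Modal's $3.95/hr, billed only while active. It ignores cold starts, hardware differences between the two listings, RunPod's own serverless tier, and any minimum charges. The 730-hour month is also an assumption.

```python
HOURS_PER_MONTH = 730  # assumption: average month length in hours


def break_even_utilization(always_on_rate: float, scale_to_zero_rate: float) -> float:
    """Fraction of time busy at which both options cost the same."""
    return always_on_rate / scale_to_zero_rate


def monthly_always_on(rate: float) -> float:
    """An always-on instance is billed for every hour, busy or idle."""
    return rate * HOURS_PER_MONTH


def monthly_scale_to_zero(rate: float, utilization: float) -> float:
    """Scale-to-zero compute is billed only for the busy fraction of the month."""
    return rate * HOURS_PER_MONTH * utilization


RUNPOD, MODAL = 1.99, 3.95
print(f"break-even utilization: {break_even_utilization(RUNPOD, MODAL):.1%}")
for u in (0.10, 0.50, 0.90):
    print(f"busy {u:.0%}: always-on ${monthly_always_on(RUNPOD):,.0f}, "
          f"scale-to-zero ${monthly_scale_to_zero(MODAL, u):,.0f}")
```

Under these assumptions the crossover sits near 50% utilization: workloads busy less than about half the time come out cheaper on scale-to-zero billing, and busier ones favor the always-on instance.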

Why don't all providers publish on-demand prices?

Providers that sell primarily into committed-capacity contracts often skip publishing on-demand rates because the contract negotiation captures more value than a public price list. Nscale and aggregators like zCLOUD fall into this group.

The "On request" label in the table reflects this. The price exists but lives in a quote, sized to the customer's workload and term.

What GPU cloud should a startup use?

The answer depends on the workload phase, and startups often span multiple phases inside a single month.

Early prototyping and fine-tuning fit on RunPod, Lambda, or Vast.ai at on-demand rates, where per-second or per-minute billing keeps experiment costs low.

Production inference behind variable traffic fits on Modal, Fal, Replicate, or Baseten, all of which scale automatically and bill only on active compute.

Sustained training at multi-node scale usually justifies the procurement effort of CoreWeave, Nebius, or a reserved cluster with Together AI.

Startups whose requirements move faster than these categories can accommodate sometimes route through zCLOUD instead, since the aggregator model means the right provider for the current workload is selected at purchase rather than locked in by a prior decision. The fit is for founders who would rather not bet on a single provider's strengths before the workload has stabilized.

Nscale sits outside the typical startup fit. Their procurement model and contract-based pricing target larger enterprises with sustained, predictable capacity needs.

How often does the comparison table update?

The table refreshes regularly, with the "Refreshed" date at the top reflecting the most recent update.

When a provider changes published pricing between refreshes, the table can update out-of-cycle.

Are AWS, GCP, and Azure GPU instances comparable?

The hyperscalers sit outside the scope of this table.

Their pricing models bundle network, storage, and support into rates that resist apples-to-apples comparison with neoclouds and on-demand specialists.

Hyperscaler GPU instances suit teams already deep in the broader AWS, GCP, or Azure ecosystem. The providers in this table compete on price and GPU-specific tooling against that incumbency.

What does "self-serve" versus "sales" mean in the access column?

Self-serve providers let a buyer sign up, enter a card, and launch a GPU within minutes.

Sales-led providers route procurement through a quote, contract, and onboarding process measured in weeks.

The choice maps closely to workload duration. Self-serve suits experiments and short runs. Sales-led suits sustained workloads where contract terms create the savings.

Methodology

Pricing source

Prices reflect each provider's publicly listed on-demand H100 rate at the time of the most recent refresh.

The figures come from each provider's pricing page or product documentation, captured manually rather than scraped, so a human eye catches edge cases like tiered minute-blocks or region-conditional rates.

Providers that sell only into committed-capacity contracts show "On request" in place of a number.

Provider selection

The table covers GPU clouds running H100, H200, A100, or comparable accelerators with relevance to AI training, fine-tuning, or production inference in 2026.

Hyperscalers (AWS, GCP, Azure) sit outside the scope. Their pricing models differ enough that a side-by-side comparison mislabels what's being compared.

Marketplace, on-demand, serverless, neocloud, and aggregator shapes all appear when they meet the audience-relevance bar.

Refresh cadence

The table refreshes regularly, with the "Refreshed" date in the header reflecting the most recent update.

When a provider changes pricing between refreshes, the table can update out-of-cycle.

Historical pricing isn't retained on this page. Comparison pages that reference past rates use snapshots captured at the time of writing.

What gets compared

The table compares headline rates and structural attributes: billing model, available GPU types, regions, and access path.

It doesn't score reliability, customer support, or networking quality, since these vary enough by workload that a single rating would mislead.

Comparison pages get into those dimensions where a specific provider pair makes the comparison meaningful.