Prometheus is free. Running it in production is not. Self-hosted monitoring for 100 hosts costs $2,000-5,000/mo in infrastructure plus 0.25-0.5 FTE in engineering time, for a total of roughly $7,000-10,000/mo. Datadog for the same scale: $5,000-15,000/mo. Grafana Cloud: $3,000-9,000/mo. Which option is cheapest for you depends on your team's capacity and whether you have dedicated platform engineers.
The question "should we use Prometheus or Datadog?" comes up regularly on r/devops and Hacker News. It keeps being asked because the answer is genuinely nuanced and depends on factors that most comparison articles ignore: engineering team capacity, Kubernetes expertise, willingness to maintain monitoring infrastructure, and the value of engineering time that could go to product development instead of monitoring tool maintenance. Most comparisons online are written by vendors: SigNoz promotes open source, Grafana Labs promotes Grafana Cloud, and Datadog promotes itself. This page aims to be an independent TCO analysis.
This page provides a complete total cost of ownership comparison across three monitoring approaches: fully self-hosted open source (Prometheus, Grafana, Loki, Tempo), managed open source (Grafana Cloud), and fully commercial (Datadog). We include the costs that self-hosted advocates often understate (engineering time, on-call overhead, upgrade maintenance) and the costs that commercial advocates often inflate (the real price of vendor flexibility and competitive alternatives). Our goal is to help you make the right decision for your specific situation, not to advocate for any particular approach.
This table compares the three primary monitoring approaches for a standard 100-host deployment with APM on 50% of hosts, 100GB/day log ingest, and 15-day retention. Engineering costs assume a fully-loaded engineer cost of $12,500/month (approximately $150,000/year including benefits, equipment, and overhead). Setup costs are amortised over 12 months. The self-hosted option includes Prometheus for metrics, Grafana for dashboards, Loki for logs, and Tempo for distributed traces.
| Cost Component | Self-Hosted OSS | Grafana Cloud | Datadog |
|---|---|---|---|
| Software License | $0 | $3,000-9,000 | $5,000-15,000 |
| Infrastructure (compute + storage) | $2,000-5,000 | $0 (included) | $0 (included) |
| Setup Engineering (amortised/mo) | $780 | $104 | $52 |
| Ongoing Maintenance (FTE/mo) | $4,375 | $625 | $250 |
| Training (amortised/mo) | $208 | $83 | $42 |
| Total Monthly TCO | $7,363-10,363 | $3,812-9,812 | $5,344-15,344 |
Self-hosted maintenance assumes 0.35 FTE ongoing. Grafana Cloud assumes 0.05 FTE. Datadog assumes 0.02 FTE. Setup costs amortised over 12 months.
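The totals in the table can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the `Approach` class and its field names are hypothetical, and every figure is copied from the table and the $12,500/month fully loaded engineer cost stated above.

```python
from dataclasses import dataclass
from typing import Tuple

FTE_MONTHLY_COST = 12_500  # fully loaded engineer cost per month, from the text


@dataclass
class Approach:
    """One column of the TCO table (class and field names are illustrative)."""
    name: str
    license: Tuple[int, int]         # (low, high) USD per month
    infrastructure: Tuple[int, int]  # (low, high) USD per month
    setup_amortised: int             # USD per month, setup spread over 12 months
    maintenance_fte: float           # fraction of one engineer
    training_amortised: int          # USD per month

    def monthly_tco(self) -> Tuple[int, int]:
        """Return the (low, high) total monthly cost in USD."""
        fixed = (
            self.setup_amortised
            + round(self.maintenance_fte * FTE_MONTHLY_COST)
            + self.training_amortised
        )
        low = self.license[0] + self.infrastructure[0] + fixed
        high = self.license[1] + self.infrastructure[1] + fixed
        return low, high


approaches = [
    Approach("Self-Hosted OSS", (0, 0), (2_000, 5_000), 780, 0.35, 208),
    Approach("Grafana Cloud", (3_000, 9_000), (0, 0), 104, 0.05, 83),
    Approach("Datadog", (5_000, 15_000), (0, 0), 52, 0.02, 42),
]

for a in approaches:
    low, high = a.monthly_tco()
    print(f"{a.name}: ${low:,}-{high:,}/month")
```

Changing `maintenance_fte` or `FTE_MONTHLY_COST` to match your own team is the quickest way to see how sensitive the self-hosted column is to engineering time.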
Self-hosting a monitoring stack based on Prometheus, Grafana, Loki, and Tempo requires dedicated infrastructure and ongoing engineering investment. The software itself is free and open source: Prometheus is licensed under Apache 2.0, while Grafana, Loki, and Tempo are licensed under AGPL-3.0. The costs come from three areas: the cloud infrastructure to run the monitoring systems, the engineering time to set up and maintain them, and the opportunity cost of engineering time spent on monitoring infrastructure rather than product development.
A production-ready self-hosted monitoring stack for 100 hosts requires dedicated compute and storage resources that scale with the volume of telemetry data being collected. The infrastructure must be separate from the application workloads being monitored to ensure monitoring remains available during application incidents. For high availability, each component should run as a multi-replica deployment.
Prometheus/Mimir (metric ingestion and storage). For 100 hosts generating ~500K active time series: 2-3 instances with 8 vCPU, 32GB RAM, 500GB SSD each. Mimir recommended over vanilla Prometheus for horizontal scaling. Monthly compute cost: $800-1,500. Storage cost (1-2TB/month): $200-400.
Grafana (dashboard and alerting UI). Lightweight compared to data stores: 1-2 instances with 2 vCPU, 4GB RAM. Monthly compute cost: $80-150. Database for dashboard storage (PostgreSQL or SQLite): $50-100. This is the easiest component to self-host.
Loki (log aggregation). For 100GB/day log ingest: 3-4 instances with 4 vCPU, 16GB RAM for ingesters, 2 instances for queriers. Object storage backend (S3/GCS) for log data: $200-500/month depending on retention. Monthly total: $600-1,200.
Tempo (distributed trace storage). For APM-equivalent tracing on 50 hosts: 2-3 instances with 4 vCPU, 8GB RAM. Object storage backend: $100-300/month. Monthly total: $300-600. Tempo is the newest component and requires the most operational expertise.
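As a sanity check, the per-component estimates above can be summed to see how they relate to the $2,000-5,000 infrastructure line in the comparison table. This is a minimal sketch: the dictionary keys and the `infrastructure_range` helper are illustrative names, and the dollar figures are the ones quoted in the four component descriptions.

```python
# (low, high) monthly USD per item, copied from the component estimates above
COMPONENTS = {
    "metric ingestion and storage: compute": (800, 1_500),
    "metric ingestion and storage: storage": (200, 400),
    "dashboard and alerting UI: compute": (80, 150),
    "dashboard and alerting UI: database": (50, 100),
    "log aggregation: monthly total": (600, 1_200),
    "distributed trace storage: monthly total": (300, 600),
}


def infrastructure_range(components):
    """Sum the low and high ends of every component's monthly cost."""
    low = sum(lo for lo, _ in components.values())
    high = sum(hi for _, hi in components.values())
    return low, high


low, high = infrastructure_range(COMPONENTS)
print(f"Itemised infrastructure: ${low:,}-{high:,}/month")
```

The itemised sum comes to $2,030-3,950/month, which sits inside the table's $2,000-5,000 range and leaves some headroom at the top end for items not listed individually.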
Engineering time is the most commonly underestimated cost of self-hosted monitoring. Initial setup for a production-ready Prometheus + Grafana + Loki + Tempo stack takes 4-8 weeks for a senior platform engineer, including high availability configuration, alerting rule migration, dashboard creation, and team training. Ongoing maintenance averages 0.25-0.5 FTE, covering Prometheus/Mimir upgrades (quarterly), capacity planning, storage management, alert rule tuning, dashboard maintenance, and troubleshooting data ingestion issues. This engineering time has an opportunity cost: every hour spent maintaining monitoring infrastructure is an hour not spent on product engineering. For companies where engineering capacity is the bottleneck, this opportunity cost can exceed the direct savings from avoiding vendor licensing.
Vanilla Prometheus has a practical single-node scaling limit of approximately 10-15 million time series. Beyond this, you need a horizontally scalable metric store like Mimir, Thanos, or VictoriaMetrics. Mimir is recommended as the most actively developed option with native Grafana Labs support. VictoriaMetrics is an excellent alternative that is more resource-efficient but has a smaller community. Thanos is well-established but has been largely superseded by Mimir for new deployments. Each of these adds operational complexity compared to single-node Prometheus, requiring coordination across multiple components, compaction management, and more sophisticated capacity planning.
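To put that limit in context, here is a rough headroom estimate. It assumes series count grows linearly with host count at the ~500K series per 100 hosts quoted earlier, which is a simplification: real cardinality depends heavily on labels and workloads. The function names are illustrative.

```python
# ~500K active series across 100 hosts (from the text) -> series per host
SERIES_PER_HOST = 500_000 / 100
# Approximate single-node range quoted above
SINGLE_NODE_LIMIT = (10_000_000, 15_000_000)


def hosts_at_limit(series_per_host: float, limit: int) -> int:
    """Hosts one node could serve before reaching `limit` series,
    assuming series scale linearly with hosts (a simplification)."""
    return int(limit // series_per_host)


def utilisation(hosts: int, series_per_host: float, limit: int) -> float:
    """Fraction of the single-node limit consumed by `hosts` hosts."""
    return hosts * series_per_host / limit


low_hosts = hosts_at_limit(SERIES_PER_HOST, SINGLE_NODE_LIMIT[0])
high_hosts = hosts_at_limit(SERIES_PER_HOST, SINGLE_NODE_LIMIT[1])
print(f"Single node saturates at roughly {low_hosts:,}-{high_hosts:,} hosts")
print(f"100 hosts use {utilisation(100, SERIES_PER_HOST, SINGLE_NODE_LIMIT[0]):.0%} of the lower limit")
```

Under these assumptions a 100-host deployment is far below the single-node ceiling, so the case for a horizontally scalable store at this size rests more on high availability and long-term retention than on raw series count.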
Many organisations adopt a hybrid approach that combines open-source metrics collection with commercial log management and APM. This is not a compromise but a pragmatic architecture that optimises cost per telemetry type. The most common hybrid pattern is Prometheus for metrics (where open source excels and per-host pricing is expensive), combined with Datadog or Grafana Cloud for APM and log management (where commercial tooling provides significantly better query performance and analysis capabilities than self-hosted alternatives).
The hybrid approach works because metrics and logs have fundamentally different cost profiles. Metrics are structured, compact, and well-handled by Prometheus at scale. Logs are unstructured, voluminous, and require sophisticated indexing and query engines that are difficult to self-host efficiently. APM traces require complex correlation and analysis that commercial tools handle better than self-hosted Jaeger or Tempo. By self-hosting the metric layer and using a managed service for logs and traces, you capture 60-70% of the potential self-hosting savings while avoiding the hardest operational challenges.
A typical hybrid stack for 100 hosts might cost $1,500-3,000/month for self-hosted Prometheus + Grafana (metrics and dashboards) plus $2,000-5,000/month for Grafana Cloud or Datadog (logs and APM only), totalling $3,500-8,000/month. This compares to $5,000-15,000/month for fully commercial or $7,000-10,000/month for fully self-hosted. The hybrid approach often represents the best cost-to-effort ratio for mid-market companies with moderate platform engineering capacity.
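The comparison above is simple range arithmetic, sketched below. The helper names are illustrative, and ranking options by the midpoint of each range is an assumption made here for convenience, not a claim about where real bills land.

```python
def add_ranges(*ranges):
    """Add (low, high) cost ranges component-wise."""
    return sum(lo for lo, _ in ranges), sum(hi for _, hi in ranges)


def midpoint(cost_range):
    """Crude single-number summary of a (low, high) range."""
    return (cost_range[0] + cost_range[1]) / 2


# Monthly USD figures for 100 hosts, copied from the paragraph above
self_hosted_metrics = (1_500, 3_000)  # Prometheus + Grafana
managed_logs_apm = (2_000, 5_000)     # Grafana Cloud or Datadog, logs and APM only

hybrid = add_ranges(self_hosted_metrics, managed_logs_apm)

options = {
    "hybrid": hybrid,
    "fully commercial": (5_000, 15_000),
    "fully self-hosted": (7_000, 10_000),
}

cheapest = min(options, key=lambda name: midpoint(options[name]))
for name, (lo, hi) in options.items():
    print(f"{name}: ${lo:,}-{hi:,}/month (midpoint ${midpoint((lo, hi)):,.0f})")
print(f"Lowest midpoint: {cheapest}")
```

Because the ranges overlap heavily, the ordering can flip for a specific deployment; substituting your own quotes for the tuples is more informative than the midpoints.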
If you are currently on Datadog and considering a migration to self-hosted monitoring, the one-time migration cost is the critical factor that determines whether the switch is financially worthwhile. Migration is not just installing new software. It requires rewriting every dashboard, reconfiguring every alert, updating every runbook, retraining every engineer, and running parallel systems for validation. Based on typical mid-market deployments, the migration costs break down as follows.
| Migration Task | Engineering Weeks | Estimated Cost |
|---|---|---|
| Infrastructure setup (Prometheus, Grafana, Loki, Tempo) | 2-4 | $7,500-15,000 |
| Dashboard recreation (20-100 dashboards) | 2-4 | $7,500-15,000 |
| Alert rule migration (50-500 rules) | 1-2 | $3,750-7,500 |
| Runbook and documentation updates | 1 | $3,750 |
| Team training and onboarding | 1 | $3,750 |
| Parallel running and validation | 4-8 | $15,000-30,000 |
| Total Migration Cost | 11-20 weeks | $41,250-75,000 |
For the migration to break even within 12 months, you need monthly savings of at least $3,440-6,250 from the new platform versus Datadog (the $41,250-75,000 migration cost divided by 12). At 100 hosts, the typical monthly savings from switching to self-hosted is $2,000-8,000, meaning break-even ranges from about 5 months (cheapest migration, largest savings) to about 38 months (most expensive migration, smallest savings). The financial case for migration is strongest above 200 hosts, where monthly savings exceed $5,000 and break-even occurs within roughly 8-15 months.
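The break-even reasoning is a single division, sketched below so you can plug in your own numbers. The function names are illustrative; the inputs are the migration cost range from the table and the 100-host savings range from the paragraph above. The sketch pairs the cheapest migration with the largest savings for the best case, and the most expensive migration with the smallest savings for the worst case.

```python
MIGRATION_COST = (41_250, 75_000)          # one-time USD, from the table
MONTHLY_SAVINGS_100_HOSTS = (2_000, 8_000)  # USD per month, from the text


def break_even_months(migration_cost: float, monthly_savings: float) -> float:
    """Months until cumulative savings cover the one-time migration cost."""
    if monthly_savings <= 0:
        raise ValueError("migration never pays back without positive savings")
    return migration_cost / monthly_savings


def required_monthly_savings(migration_cost: float, target_months: int = 12) -> float:
    """Monthly savings needed to recover the migration cost in `target_months`."""
    return migration_cost / target_months


best = break_even_months(MIGRATION_COST[0], MONTHLY_SAVINGS_100_HOSTS[1])
worst = break_even_months(MIGRATION_COST[1], MONTHLY_SAVINGS_100_HOSTS[0])
print(f"Break-even: best case {best:.1f} months, worst case {worst:.1f} months")

needed = [required_monthly_savings(cost) for cost in MIGRATION_COST]
print(f"Savings needed for 12-month payback: ${needed[0]:,.0f}-{needed[1]:,.0f}/month")
```

This ignores the cost of money and any savings lost while both systems run in parallel, so treat the result as an optimistic lower bound on payback time.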
Prometheus is 100% free and open source under the Apache 2.0 license. There is no paid tier, no premium features, and no usage limits in the software itself. However, running Prometheus in production requires cloud infrastructure (compute instances, storage volumes, networking) that costs $800-3,000/month for a 100-host deployment, plus engineering time for setup (4-8 weeks), ongoing maintenance (0.25-0.5 FTE), and the operational expertise to manage scaling, high availability, and retention. The total cost of ownership for a self-hosted Prometheus stack is typically $7,000-10,000/month at 100 hosts when engineering time is included, which is comparable to or slightly less than Datadog at the same scale.
Open source monitoring software is free. Operating it in production is not. The three hidden costs of self-hosted monitoring are: infrastructure costs ($2,000-5,000/month for 100 hosts on AWS/GCP), engineering time for setup and maintenance (0.25-0.5 FTE, worth roughly $3,100-6,250/month at a fully loaded $12,500/month), and opportunity cost (engineering time spent on monitoring infrastructure instead of product development). At small scale (under 50 hosts), these costs often exceed what a commercial vendor would charge. At large scale (500+ hosts), self-hosted monitoring becomes significantly cheaper than commercial alternatives because infrastructure costs scale sub-linearly while vendor pricing scales linearly per host. The breakeven point is typically around 100-200 hosts.
The total cost of ownership of self-hosted monitoring (Prometheus + Grafana + Loki + Tempo) for a 100-host deployment is approximately $7,000-10,000 per month. This breaks down as: cloud infrastructure for monitoring servers and storage ($2,000-5,000), ongoing engineering maintenance at 0.35 FTE ($4,375), amortised setup costs ($780/month over 12 months), and training ($208/month amortised). At 500 hosts, the TCO rises only to approximately $12,000-20,000/month, no more than double for five times the hosts, because infrastructure costs grow sub-linearly. Compare this to Datadog at 500 hosts ($25,000-75,000/month) or Grafana Cloud at 500 hosts ($15,000-45,000/month) to understand the savings potential at larger scales.
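Dividing each figure by the host count makes the scaling difference easier to see. The sketch below uses only the monthly ranges quoted on this page; the `COSTS` layout and `per_host` helper are illustrative names.

```python
# (low, high) monthly USD at each host count, copied from the text
COSTS = {
    "Self-hosted": {100: (7_000, 10_000), 500: (12_000, 20_000)},
    "Grafana Cloud": {100: (3_000, 9_000), 500: (15_000, 45_000)},
    "Datadog": {100: (5_000, 15_000), 500: (25_000, 75_000)},
}


def per_host(cost_range, hosts):
    """Convert a (low, high) monthly total into a per-host range."""
    low, high = cost_range
    return low / hosts, high / hosts


for name, by_scale in COSTS.items():
    for hosts, cost_range in sorted(by_scale.items()):
        lo, hi = per_host(cost_range, hosts)
        print(f"{name} @ {hosts} hosts: ${lo:,.0f}-{hi:,.0f} per host/month")
```

With these inputs the two commercial options stay flat per host between 100 and 500 hosts, while the self-hosted per-host cost falls from $70-100 to $24-40, which is the sub-linear effect described above.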