Azure can get expensive fast. In this post I’ll share cost optimisation techniques from real workloads, starting with the well-known levers and then moving to the non-obvious ones most teams miss. That second group is where the real money is.

The Standard Levers
1. Right-Size VMs with Azure Advisor
Azure Advisor → Cost analyses CPU/memory over 7–14 days. Typical savings: 30–50%. 15 minutes to identify, 30 to fix.
    az monitor metrics list \
      --resource /subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.Compute/virtualMachines/{vm} \
      --metric "Percentage CPU" --interval PT1H \
      --query "value[0].timeseries[0].data[*].average"
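To turn that series into a resize decision, one option is to average the hourly samples. This is a minimal sketch, assuming the command above is run with `-o tsv` so each sample lands on its own line. The function names `avg_cpu` and `is_underused`, and the 20% cut-off, are illustrative choices, not Advisor's own rule.

```bash
#!/usr/bin/env bash
# avg_cpu: read one CPU percentage per line on stdin and print the mean.
# Non-numeric lines (for example "None" for hours with no data) are skipped.
avg_cpu() {
  awk '$1 ~ /^[0-9]+(\.[0-9]+)?$/ { sum += $1; n++ }
       END { if (n > 0) printf "%.1f\n", sum / n }'
}

# is_underused: succeed when the mean is below a threshold.
# The default of 20 is an assumed cut-off; tune it to your workload.
is_underused() {
  local mean="$1" threshold="${2:-20}"
  awk -v m="$mean" -v t="$threshold" 'BEGIN { exit !(m < t) }'
}
```

Usage: pipe the metrics output into `avg_cpu`, then pass the result to `is_underused` to decide whether the VM belongs on a resize shortlist.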
2. Reserved Instances
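1-year: ~40% off. 3-year: ~60–65%. If a VM runs 24/7 in production, commit. Instance size flexibility applies the discount across sizes in the same flexibility group according to their ratios, so a reservation bought on a small size still covers part of a larger VM in that group.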
3. Spot VMs
Up to 90% discount. 30-second eviction notice. Use for: batch, CI/CD agents, ML training, dev/test. Never for production services.
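The eviction notice arrives through the Scheduled Events endpoint of the Instance Metadata Service, where a Spot eviction shows up as an event of type `Preempt`. The following is a sketch of a watcher, assuming it runs inside the VM. The function names and the 5-second polling interval are illustrative.

```bash
#!/usr/bin/env bash
# Scheduled Events endpoint of the Instance Metadata Service (reachable only from inside the VM).
IMDS_URL="http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"

# has_preempt_event: read a Scheduled Events JSON document on stdin and
# succeed if any event has EventType "Preempt" (the Spot eviction signal).
has_preempt_event() {
  grep -Eq '"EventType"[[:space:]]*:[[:space:]]*"Preempt"'
}

# watch_for_eviction: poll the endpoint and run the given command once an eviction is announced.
# The 5-second interval is an assumed value chosen to fit inside the 30-second notice.
watch_for_eviction() {
  while true; do
    if curl -s -H "Metadata:true" "$IMDS_URL" | has_preempt_event; then
      "$@"
      return 0
    fi
    sleep 5
  done
}
```

Usage: `watch_for_eviction ./drain-and-checkpoint.sh`, where the script name is a placeholder for whatever saves your batch or training state.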
4. Auto-Shutdown Dev VMs
az vm auto-shutdown --resource-group rg-dev --name vm-dev-01 --time 1900 --email you@example.com
5. Azure Hybrid Benefit
Windows Server or SQL Server SA licences? Up to 40% off Windows VMs, 55% off SQL. A single checkbox many teams forget.
6. Blob Lifecycle Policies
Cool tier (~50% cheaper) after 30 days, Archive (~90% cheaper) after 180 days. Set once, runs forever.
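The 30-day and 180-day thresholds above can be written as a lifecycle management policy. This is a sketch only: the rule name, the function names, and the choice to apply it to all block blobs are assumptions, so narrow the filter to the containers you actually want tiered.

```bash
#!/usr/bin/env bash
# write_lifecycle_policy: write a lifecycle policy document to the given file.
write_lifecycle_policy() {
  cat > "$1" <<'JSON'
{
  "rules": [
    {
      "enabled": true,
      "name": "cool-30-archive-180",
      "type": "Lifecycle",
      "definition": {
        "actions": {
          "baseBlob": {
            "tierToCool": { "daysAfterModificationGreaterThan": 30 },
            "tierToArchive": { "daysAfterModificationGreaterThan": 180 }
          }
        },
        "filters": { "blobTypes": ["blockBlob"] }
      }
    }
  ]
}
JSON
}

# apply_lifecycle_policy: attach the policy file to a storage account.
apply_lifecycle_policy() {
  local account="$1" rg="$2" file="$3"
  az storage account management-policy create \
    --account-name "$account" --resource-group "$rg" --policy @"$file"
}
```

Usage: `write_lifecycle_policy policy.json && apply_lifecycle_policy <account> <resource-group> policy.json`.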
7. Reserved Database Capacity
Azure SQL, Cosmos DB, PostgreSQL: 1yr (~35%) and 3yr (~55%) reserved capacity. Same concept as VM reservations.
8. Cost Budgets and Alerts
Cost Management → Budgets → Add. Alert at 80% and 100%. Doesn’t reduce spend but gives early warning.
9. Enforce Resource Tagging
Azure Policy requiring Environment, Owner, CostCentre. Without attribution, Cost Management is noise.
🔴 Log Analytics Workspace — The Silent Budget Killer
This is the one I’ve seen cause the biggest bill shocks. Log Analytics charges ~$2.30/GB for ingestion on pay-as-you-go. Sounds manageable — until a developer enables verbose diagnostic logs on a busy service, retention gets set to 2 years across every table, and nobody notices for three months. I’ve seen LAW bills go from $200/month to over $8,000/month after a “temporary” debug session that never got turned off.
Find what you’re actually ingesting
    // Top data sources by billable ingestion volume (last 30 days)
    Usage
    | where TimeGenerated > ago(30d)
    | where IsBillable == true
    | summarize TotalGB = sum(Quantity) / 1000 by DataType
    | order by TotalGB desc
    | take 20
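AppTraces, AzureDiagnostics, ContainerLog are the usual offenders. Then find tables nobody queries. Row counts cannot tell you this; query history lives in LAQueryLogs, which is only populated once query auditing is enabled in the workspace's diagnostic settings:
    // High ingestion, no audited query mentioning the table in 30 days
    // Assumes query auditing is enabled, so LAQueryLogs holds the query history
    Usage
    | where TimeGenerated > ago(30d) and IsBillable == true
    | summarize IngestedGB = sum(Quantity) / 1000 by DataType
    | extend JoinKey = 1
    | join kind=leftouter (
        LAQueryLogs
        | where TimeGenerated > ago(30d)
        | project QueryText, JoinKey = 1
      ) on JoinKey
    | summarize Queries = countif(QueryText has DataType) by DataType, IngestedGB
    | where Queries == 0
    | order by IngestedGB desc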
Per-table retention — not one global setting
AppTraces doesn’t need the same retention as SecurityEvent. Set them independently — this alone cuts storage 40–60% on workspaces where a compliance retention was applied globally:
    # AppTraces: 30 days
    az monitor log-analytics workspace table update \
      --resource-group rg-monitoring --workspace-name law-prod \
      --name AppTraces --retention-time 30

    # SecurityEvent: 2 years for compliance
    az monitor log-analytics workspace table update \
      --resource-group rg-monitoring --workspace-name law-prod \
      --name SecurityEvent --retention-time 730
Data Collection Rules — filter before it arrives
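DCRs transform and filter log data before it reaches the workspace, so you never pay to store data you drop. For example, to drop debug-level entries from a stream whose SeverityLevel is a string, such as Syslog (for tables like AppTraces, where SeverityLevel is numeric, filter on the integer level instead):
    transformKql: "source | where SeverityLevel !~ 'debug'"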
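Application Insights adaptive sampling: check that it is actually on
Without sampling, 100% of telemetry is ingested, and a busy API can generate 15–50GB/day in App Insights alone. Adaptive sampling reduces this while preserving accuracy for P95 latency and error rates. The classic ASP.NET and ASP.NET Core SDKs enable it by default, so confirm nobody has turned it off; other SDKs and languages need sampling configured explicitly. In ASP.NET Core the items-per-second cap may need to be set in code rather than in this file, so treat the second key as illustrative:
    // appsettings.json (.NET)
    {
      "ApplicationInsights": {
        "EnableAdaptiveSampling": true,
        "MaxTelemetryItemsPerSecond": 5
      }
    }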
💡 Real impact: 15GB/day → under 2GB/day on a production API, zero loss of diagnostic value.
LAW ingestion commitment tiers
Like Reserved Instances for compute — commit to an ingestion tier (100GB/day etc.) for up to 30% discount vs pay-as-you-go. Do this after you’ve cleaned up your volume.
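Whether a tier pays off depends on how close your real volume sits to the commitment. Since you pay for the full tier whether or not you use it, the tier is cheaper than pay-as-you-go only when daily ingestion exceeds the tier size multiplied by (1 − discount). The sketch below does that arithmetic; the function name is illustrative, and the discount is an input because the actual percentage varies by tier and region.

```bash
#!/usr/bin/env bash
# commitment_breakeven_gb: daily volume (GB) above which a commitment tier
# costs less than pay-as-you-go.
#   $1 = tier size in GB/day, $2 = tier discount in whole percent
# Uses integer arithmetic, so the result is rounded down.
commitment_breakeven_gb() {
  local tier_gb="$1" discount_pct="$2"
  echo $(( tier_gb * (100 - discount_pct) / 100 ))
}
```

For a 100GB/day tier at the full 30% discount, `commitment_breakeven_gb 100 30` prints 70: below roughly 70GB/day of steady ingestion, pay-as-you-go is still the cheaper option.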
🤖 AI Workload Costs — Model Routing Saves 60–80%
Most teams send every LLM request through the most expensive model. 60–80% of enterprise AI requests are simple enough for a mini model:
| Request type | Right model | Cost per 1M tokens |
|---|---|---|
| FAQ, classification, simple retrieval | GPT-4o-mini / Phi-3 | ~$0.15 |
| Summarisation, moderate reasoning | GPT-4o (cached) | ~$1.25 |
| Complex analysis, long context | GPT-4o / o1 | ~$5–15 |
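The table can be read as a routing rule. This is a deliberately small sketch of that idea: the category labels and deployment names are placeholders, and how a request gets its category (keyword rules, a classifier, a cheap model call) is left out entirely.

```bash
#!/usr/bin/env bash
# route_model: map a request category to a model deployment name.
# Categories and deployment names are hypothetical; replace them with your own.
route_model() {
  case "$1" in
    faq|classification|retrieval)  echo "gpt-4o-mini" ;;
    summarisation|moderate)        echo "gpt-4o" ;;
    analysis|long-context)         echo "o1" ;;
    *)                             echo "gpt-4o" ;;  # unknown category: fall back to the general model
  esac
}
```

Sending unknown categories to the general model rather than the mini model trades some cost for safety; flip the default once you trust your classification.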
🔗 AI Model Router: How to Cut Your LLM Bill by 60–80% Without Sacrificing Quality →
🔍 Hidden Costs Most Teams Ignore
Orphaned resources at scale
Organisations with 100+ VMs churned over 2 years can have 20–40 unattached managed disks billing silently. Premium SSD P30 (1TB) = $135/month each:
    az graph query -q "Resources
    | where type == 'microsoft.compute/disks'
    | where properties.diskState == 'Unattached'
    | project name, resourceGroup, sku = tostring(sku.name), sizeGB = toint(properties.diskSizeGB)
    | order by sizeGB desc" --subscriptions {sub-id}
Also: unused public IPs ($3.65/month each), empty App Service Plans, unused Load Balancers, old snapshots.
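The same Resource Graph approach can be used to hunt for unused public IPs. The sketch below only prints the query text, so it can be reviewed before being run. It assumes an unused IP is one with neither an IP configuration nor a NAT gateway attached; the function name is illustrative.

```bash
#!/usr/bin/env bash
# unattached_public_ip_query: print a Resource Graph query for public IPs
# that have no IP configuration and no NAT gateway associated with them.
unattached_public_ip_query() {
  cat <<'KQL'
Resources
| where type =~ 'microsoft.network/publicipaddresses'
| where isempty(properties.ipConfiguration) and isempty(properties.natGateway)
| project name, resourceGroup, location, sku = tostring(sku.name)
KQL
}
```

Usage: `az graph query -q "$(unattached_public_ip_query)" --subscriptions {sub-id}`. Check each result before deleting, since a reserved but currently unattached IP may be intentional.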
Azure Firewall per-subscription sprawl
~$900/month per Firewall instance. 15–20 subscriptions with independent firewalls = $13,500–18,000/month in firewall costs alone. Fix: hub-spoke with a single Firewall and Policy inheritance. Same coverage, one instance.
AKS user node pools on spot
System pools can’t use spot. User pools — where your application workloads run — absolutely can:
    az aks nodepool add \
      --cluster-name aks-prod --resource-group rg-aks \
      --name spotnodes --priority Spot \
      --eviction-policy Delete --spot-max-price -1 \
      --node-count 3 --node-vm-size Standard_D4s_v5
Cost Management anomaly detection — built in, almost nobody uses it
Cost Management → Cost alerts → Anomaly alerts → Enable. Add an action group. You get an email the day a spike starts — not at month end when the damage is done.
NAT Gateway vs per-VM public IPs
20 VMs with public IPs = $73/month just in IPs. One NAT Gateway (~$32/month, plus per-GB data processing) handles all outbound traffic, is more secure, and is simpler to manage.
Where to Start
- Run the LAW ingestion KQL now. If you find a table over 5GB/month with zero queries — that’s immediate recoverable money.
- Open Azure Advisor → Cost. Tells you exactly which VMs to resize.
- Enable Cost Management anomaly alerts. Two minutes. Already built into your subscription.
Cost optimisation is ongoing, not a project. Monthly review: Advisor, LAW ingestion audit, anomaly alerts, reserved instance coverage.
What’s your biggest Azure cost surprise? Drop a comment below.




