Azure Cost Optimisation: Beyond the Obvious — Techniques That Actually Save Money

Reading time: 2 minutes, 57 seconds

Azure can get expensive fast. In this post I’ll share cost optimisation techniques from real workloads — starting with the well-known levers, then the non-obvious ones most teams miss. Those second ones are where the real money is.

Figure 1 — The four main levers of Azure cost optimisation

The Standard Levers

1. Right-Size VMs with Azure Advisor

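Azure Advisor → Cost analyses CPU and memory utilisation over a 7–14 day lookback and flags underused VMs. Typical savings on the flagged VMs: 30–50%. Allow about 15 minutes to identify candidates and 30 to resize them. To check a VM’s CPU history yourself: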

# Hourly average CPU for one VM; {sub}, {rg} and {vm} are placeholders
az monitor metrics list \
  --resource /subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.Compute/virtualMachines/{vm} \
  --metric "Percentage CPU" --aggregation Average --interval PT1H \
  --query "value[0].timeseries[0].data[*].average"

2. Reserved Instances

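1-year: ~40% off. 3-year: ~60–65%. If a VM runs 24/7 in production, commit. Buy the smallest size in a series: with instance size flexibility, Azure applies the reservation to larger sizes in the same flexibility group in proportion to their size ratio, so several small reservations can cover one larger VM.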
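
Why "runs 24/7" is the test: a reservation is billed for every hour of the term whether or not the VM is running, so it only beats pay-as-you-go once the VM runs for more than (1 - discount) of the time. A minimal sketch of that arithmetic follows. The $0.20 hourly rate is a made-up figure for illustration, and the discounts are the approximate ones quoted above.

```python
HOURS_PER_MONTH = 730  # average number of hours in a month


def break_even_utilisation(discount: float) -> float:
    """Fraction of hours a VM must run before a reservation is cheaper."""
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    return 1.0 - discount


def monthly_cost(hourly_rate: float, hours_running: float, discount=None) -> float:
    """Pay-as-you-go bills only running hours; a reservation bills every hour."""
    if discount is None:
        return hourly_rate * hours_running
    return hourly_rate * (1.0 - discount) * HOURS_PER_MONTH


if __name__ == "__main__":
    rate = 0.20  # hypothetical pay-as-you-go rate in USD per hour
    for label, discount in (("1-year", 0.40), ("3-year", 0.60)):
        share = break_even_utilisation(discount)
        print(f"{label}: pays off above {share:.0%} utilisation")
    # A dev VM running 300 hours a month is cheaper left on pay-as-you-go
    print(monthly_cost(rate, 300), monthly_cost(rate, 300, discount=0.40))
```

With a ~40% discount the break-even is about 60% utilisation, which is why always-on production VMs qualify and office-hours dev VMs usually do not.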

3. Spot VMs

Up to 90% discount. 30-second eviction notice. Use for: batch, CI/CD agents, ML training, dev/test. Never for production services.

4. Auto-Shutdown Dev VMs

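# Shut the VM down daily at 19:00 (--time is HHmm in UTC) and send notifications to the given address
az vm auto-shutdown --resource-group rg-dev --name vm-dev-01 --time 1900 --email you@example.com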

5. Azure Hybrid Benefit

Already own Windows Server or SQL Server licences with Software Assurance (SA)? Azure Hybrid Benefit takes up to 40% off Windows VMs and up to 55% off SQL. It’s a single checkbox many teams forget.

6. Blob Lifecycle Policies

Cool tier (~50% cheaper) after 30 days, Archive (~90% cheaper) after 180 days. Set once, runs forever.
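
To see what such a policy does to a storage bill, here is a rough sketch using only the relative prices quoted above (Hot = 1.0, Cool about 0.5, Archive about 0.1). It assumes blobs are tiered by days since last modification and ignores transaction, retrieval and early-deletion charges, so treat the result as an upper bound on the saving. The sample blob inventory is invented.

```python
# Relative per-GB storage prices taken from the approximate discounts above
RELATIVE_PRICE = {"Hot": 1.0, "Cool": 0.5, "Archive": 0.1}


def tier_for_age(days_since_modified: int, cool_after: int = 30,
                 archive_after: int = 180) -> str:
    """Tier a blob would sit in under the 30/180-day policy."""
    if days_since_modified > archive_after:
        return "Archive"
    if days_since_modified > cool_after:
        return "Cool"
    return "Hot"


def relative_cost(blobs: list) -> tuple:
    """blobs is a list of (days_since_modified, size_gb) pairs.

    Returns (cost with everything in Hot, cost with the policy applied),
    both in units of 'one GB of Hot storage'.
    """
    all_hot = sum(size for _, size in blobs)
    tiered = sum(size * RELATIVE_PRICE[tier_for_age(age)] for age, size in blobs)
    return all_hot, tiered


if __name__ == "__main__":
    inventory = [(5, 100), (60, 100), (400, 800)]  # hypothetical data set
    all_hot, tiered = relative_cost(inventory)
    print(f"saving: {1 - tiered / all_hot:.0%}")
```

The saving depends almost entirely on how much of the data is old, so it is worth checking the age distribution before promising a number.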

7. Reserved Database Capacity

Azure SQL, Cosmos DB, PostgreSQL: 1yr (~35%) and 3yr (~55%) reserved capacity. Same concept as VM reservations.

8. Cost Budgets and Alerts

Cost Management → Budgets → Add. Alert at 80% and 100%. Doesn’t reduce spend but gives early warning.

9. Enforce Resource Tagging

Azure Policy requiring Environment, Owner, CostCentre. Without attribution, Cost Management is noise.


🔴 Log Analytics Workspace — The Silent Budget Killer

This is the one I’ve seen cause the biggest bill shocks. Log Analytics charges ~$2.30/GB for ingestion on pay-as-you-go. Sounds manageable — until a developer enables verbose diagnostic logs on a busy service, retention gets set to 2 years across every table, and nobody notices for three months. I’ve seen LAW bills go from $200/month to over $8,000/month after a “temporary” debug session that never got turned off.
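
To put those bills in terms of data volume, the sketch below converts a monthly charge into average daily ingestion at the ~$2.30/GB rate quoted above. It assumes the whole bill is ingestion (no retention charges or free allowances) and a 30-day month, so the figures are approximate.

```python
PAYG_PRICE_PER_GB = 2.30  # approximate pay-as-you-go ingestion price from above


def ingestion_gb_per_day(monthly_bill: float,
                         price_per_gb: float = PAYG_PRICE_PER_GB,
                         days: int = 30) -> float:
    """Average daily ingestion implied by a monthly ingestion bill."""
    if price_per_gb <= 0 or days <= 0:
        raise ValueError("price_per_gb and days must be positive")
    return monthly_bill / price_per_gb / days


if __name__ == "__main__":
    for bill in (200, 8000):
        print(f"${bill}/month is roughly {ingestion_gb_per_day(bill):.1f} GB/day")
```

In other words, the jump described above is from about 3 GB/day to well over 100 GB/day, which a single chatty service with verbose logging can produce on its own.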

Find what you’re actually ingesting

// Top billable data sources by ingestion volume (last 30 days)
Usage
| where TimeGenerated > ago(30d)
| where IsBillable == true
| summarize TotalGB = sum(Quantity) / 1000 by DataType   // Quantity is reported in MB
| order by TotalGB desc
| take 20

AppTraces, AzureDiagnostics, ContainerLog are the usual offenders. Then find tables nobody queries:

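// High ingestion, zero queries in 30 days: paying for nothing.
// Assumes query auditing is enabled on the workspace (it populates LAQueryLogs).
// Matching table names against QueryText is a heuristic, not an exact parse.
Usage
| where TimeGenerated > ago(30d) and IsBillable == true
| summarize IngestedGB = sum(Quantity) / 1000 by DataType
| extend JoinKey = 1
| join kind=leftouter (
    LAQueryLogs
    | where TimeGenerated > ago(30d)
    | project QueryText, JoinKey = 1
) on JoinKey
| summarize Queries = countif(QueryText has DataType) by DataType, IngestedGB
| where Queries == 0
| order by IngestedGB desc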

Per-table retention — not one global setting

AppTraces doesn’t need the same retention as SecurityEvent. Set them independently — this alone cuts storage 40–60% on workspaces where a compliance retention was applied globally:

# AppTraces: 30 days
az monitor log-analytics workspace table update \
--resource-group rg-monitoring --workspace-name law-prod \
--name AppTraces --retention-time 30
# SecurityEvent: 2 years for compliance
az monitor log-analytics workspace table update \
--resource-group rg-monitoring --workspace-name law-prod \
--name SecurityEvent --retention-time 730

Data Collection Rules — filter before it arrives

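DCRs transform and filter log data before it reaches the workspace, so data you drop is not billed as ingestion (Azure Monitor may charge for processing when a transformation discards more than half of the incoming data). Drop debug-level records entirely. The example targets Syslog, where SeverityLevel is a lowercase string; in AppTraces it is numeric (0 = Verbose), so use SeverityLevel > 0 there:

transformKql: "source | where SeverityLevel != 'debug'"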

Application Insights adaptive sampling — almost nobody enables this

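Without sampling, 100% of telemetry is ingested. The classic ASP.NET and ASP.NET Core SDKs turn adaptive sampling on by default, but it is often switched off or bypassed, and SDKs for other languages don’t offer it, so check what your service actually sends. A busy API can generate 15–50GB/day in App Insights alone. Adaptive sampling reduces this while preserving accuracy for P95 latency and error rates: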

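// appsettings.json (.NET)
// Depending on the SDK version, MaxTelemetryItemsPerSecond may not be read from
// this file and may need to be set in code via UseAdaptiveSampling(...) instead.
{
  "ApplicationInsights": {
    "EnableAdaptiveSampling": true,
    "MaxTelemetryItemsPerSecond": 5
  }
}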

💡 Real impact: 15GB/day → under 2GB/day on a production API, zero loss of diagnostic value.

LAW ingestion commitment tiers

Like Reserved Instances for compute — commit to an ingestion tier (100GB/day etc.) for up to 30% discount vs pay-as-you-go. Do this after you’ve cleaned up your volume.
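
A quick way to check whether a tier is worth it: you pay for the full tier every day, so it only wins once your ingestion exceeds tier size x (1 - discount). The sketch below uses the ~$2.30/GB price and the "up to 30%" discount from this post; the real discount for a given tier may be smaller, so plug in the figure from the pricing page. It also assumes ingestion above the tier is billed at the tier’s effective per-GB rate.

```python
PAYG_PRICE_PER_GB = 2.30  # approximate pay-as-you-go price quoted in this post


def break_even_gb_per_day(tier_gb: float, discount: float) -> float:
    """Daily ingestion above which the commitment tier beats pay-as-you-go."""
    return tier_gb * (1.0 - discount)


def daily_cost(ingested_gb: float, tier_gb=None, discount: float = 0.0,
               payg_price: float = PAYG_PRICE_PER_GB) -> float:
    """Daily ingestion cost with or without a commitment tier."""
    if tier_gb is None:
        return ingested_gb * payg_price
    effective_price = payg_price * (1.0 - discount)
    # The tier is a floor: you pay for tier_gb even when you ingest less
    return max(ingested_gb, tier_gb) * effective_price


if __name__ == "__main__":
    print(break_even_gb_per_day(100, 0.30))
    for gb in (50, 70, 100, 120):
        print(gb, round(daily_cost(gb), 2),
              round(daily_cost(gb, tier_gb=100, discount=0.30), 2))
```

This is also why the clean-up comes first: committing at today’s inflated volume and then cutting ingestion in half leaves you paying for a tier you no longer fill.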


🤖 AI Workload Costs — Model Routing Saves 60–80%

Most teams send every LLM request through the most expensive model. 60–80% of enterprise AI requests are simple enough for a mini model:

Request type | Right model | Cost per 1M tokens
--- | --- | ---
FAQ, classification, simple retrieval | GPT-4o-mini / Phi-3 | ~$0.15
Summarisation, moderate reasoning | GPT-4o (cached) | ~$1.25
Complex analysis, long context | GPT-4o / o1 | ~$5–15
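
As a rough illustration of where the saving comes from, the sketch below computes a blended price for a routed workload. The prices are the approximate ones from the table (using $5 for the top tier); the 60/25/15 traffic split is a hypothetical mix, and the calculation assumes every request type uses a similar number of tokens.

```python
# Approximate cost per 1M tokens, taken from the table above
PRICE_PER_1M = {"simple": 0.15, "moderate": 1.25, "complex": 5.00}


def blended_price(traffic_share: dict, prices: dict = PRICE_PER_1M) -> float:
    """Average cost per 1M tokens when each request type goes to its own model."""
    if abs(sum(traffic_share.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * prices[kind] for kind, share in traffic_share.items())


def saving_vs_top_model(traffic_share: dict, prices: dict = PRICE_PER_1M) -> float:
    """Fractional saving compared with sending everything to the top model."""
    return 1.0 - blended_price(traffic_share, prices) / prices["complex"]


if __name__ == "__main__":
    mix = {"simple": 0.60, "moderate": 0.25, "complex": 0.15}  # hypothetical
    print(f"blended: ${blended_price(mix):.4f} per 1M tokens")
    print(f"saving:  {saving_vs_top_model(mix):.0%}")
```

The result is sensitive to the share of complex requests, since they dominate the blended price; measure your own mix before quoting a figure.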

🔗 AI Model Router: How to Cut Your LLM Bill by 60–80% Without Sacrificing Quality →


🔍 Hidden Costs Most Teams Ignore

Orphaned resources at scale

Organisations with 100+ VMs churned over 2 years can have 20–40 unattached managed disks billing silently. Premium SSD P30 (1TB) = $135/month each:

az graph query -q "
Resources
| where type == 'microsoft.compute/disks'
| where properties.diskState == 'Unattached'
| project name, resourceGroup, sku = tostring(sku.name), sizeGB = toint(properties.diskSizeGB)
| order by sizeGB desc" --subscriptions {sub-id}

Also: unused public IPs ($3.65/month each), empty App Service Plans, unused Load Balancers, old snapshots.

Azure Firewall per-subscription sprawl

~$900/month per Firewall instance. 15–20 subscriptions with independent firewalls = $13,500–18,000/month in firewall costs alone. Fix: hub-spoke with a single Firewall + Policy inheritance. Same coverage, one instance.

AKS user node pools on spot

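System pools can’t use spot. User pools, where your application workloads run, can, provided those workloads tolerate eviction. AKS taints spot nodes with kubernetes.azure.com/scalesetpriority=spot:NoSchedule, so pods need a matching toleration to land on them: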

az aks nodepool add \
--cluster-name aks-prod --resource-group rg-aks \
--name spotnodes --priority Spot \
--eviction-policy Delete --spot-max-price -1 \
--node-count 3 --node-vm-size Standard_D4s_v5

Cost Management anomaly detection — built in, almost nobody uses it

Cost Management → Cost alerts → Anomaly alerts → Enable. Add an action group. You get an email the day a spike starts — not at month end when the damage is done.

NAT Gateway vs per-VM public IPs

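20 VMs with public IPs = $73/month just in IPs. One NAT Gateway (~$32/month, plus its own public IP and a per-GB data processing charge) handles all outbound traffic, is more secure because the VMs are no longer directly reachable, and is simpler to manage. It only replaces public IPs used for outbound access; anything that must accept inbound connections still needs its own entry point.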
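
A small sketch of the comparison, using the $3.65 per IP and ~$32 gateway figures from this post. The gateway’s own public IP is included, and the per-GB data processing price is left as a parameter because it is not quoted here; the 0.045 used in the example is an assumption to verify against the pricing page.

```python
import math

PUBLIC_IP_MONTHLY = 3.65     # per public IP, from the figures above
NAT_GATEWAY_MONTHLY = 32.00  # approximate fixed gateway charge, from above


def per_vm_ip_monthly(vm_count: int) -> float:
    """Monthly cost of giving every VM its own public IP."""
    return vm_count * PUBLIC_IP_MONTHLY


def nat_gateway_monthly(outbound_gb: float, price_per_gb: float,
                        gateway_ips: int = 1) -> float:
    """Fixed gateway charge + its public IP(s) + data processing."""
    return (NAT_GATEWAY_MONTHLY
            + gateway_ips * PUBLIC_IP_MONTHLY
            + outbound_gb * price_per_gb)


def break_even_vm_count(outbound_gb: float, price_per_gb: float) -> int:
    """Smallest number of VMs for which the NAT Gateway is cheaper."""
    return math.ceil(nat_gateway_monthly(outbound_gb, price_per_gb)
                     / PUBLIC_IP_MONTHLY)


if __name__ == "__main__":
    print(per_vm_ip_monthly(20))
    # Assumed data processing price of 0.045 USD/GB; check current pricing
    print(nat_gateway_monthly(outbound_gb=200, price_per_gb=0.045))
    print(break_even_vm_count(outbound_gb=0, price_per_gb=0.0))
```

On fixed charges alone the gateway wins from about 10 VMs upward, but heavy outbound traffic shifts the break-even, so include your egress volume before deciding on cost grounds.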


Where to Start

  1. Run the LAW ingestion KQL now. If you find a table over 5GB/month with zero queries — that’s immediate recoverable money.
  2. Open Azure Advisor → Cost. Tells you exactly which VMs to resize.
  3. Enable Cost Management anomaly alerts. Two minutes. Already built into your subscription.

Cost optimisation is ongoing, not a project. Monthly review: Advisor, LAW ingestion audit, anomaly alerts, reserved instance coverage.

What’s your biggest Azure cost surprise? Drop a comment below.
