Observability
Overview
Observability is a core engineering discipline that applies to every service in the Convenient Checkout platform — not just v2 implementations. This guide defines the standards for logging, monitoring, alerting, and dashboarding that all engineers must follow.
Our primary observability stack:
| Tool | Purpose |
|---|---|
| Azure Application Insights | Distributed tracing, exception tracking, performance monitoring, dependency tracking |
| Splunk | Log aggregation, search, alerting, operational dashboards |
Logging Standards
Structured Logging
All services must produce structured JSON logs. Unstructured freeform log messages are not acceptable in production.
Required fields in every log entry:
| Field | Description | Example |
|---|---|---|
| timestamp | ISO 8601 UTC | 2026-04-04T14:30:00.123Z |
| level | Log level | INFO, WARN, ERROR |
| service | Service name | wallet-payment-service |
| requestId | Correlation ID from X-Request-Id header | abc123-def456-ghi789 |
| merchantId | Merchant context (when available) | b955db5e-aef2-47de-bbb9-c80b9cc16e8f |
| message | Human-readable description | Payment created successfully |
| traceId | Distributed trace ID (from Application Insights) | 4bf92f3577b34da6a3ce929d0e0e4736 |
Optional contextual fields (add when relevant):
| Field | When to Include |
|---|---|
| resourceId | Any operation on a specific resource (paymentId, refundId, etc.) |
| status | State transitions |
| duration | Timed operations (vendor calls, database queries) |
| errorCode | Error scenarios |
| userId | User-initiated actions |
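
The field tables above can be illustrated with a minimal sketch built on Python's standard `logging` module. It assumes callers supply `requestId`, `merchantId`, `traceId`, and any optional fields through the `extra` argument; the class name `JsonLogFormatter` and the helper `build_logger` are hypothetical, and the platform's services may use a different language or logging library.

```python
import json
import logging
from datetime import datetime, timezone

# Optional contextual fields from the table above; copied only when present.
OPTIONAL_FIELDS = ("resourceId", "status", "duration", "errorCode", "userId")


class JsonLogFormatter(logging.Formatter):
    """Render each log record as one JSON object with the required fields."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        # ISO 8601 UTC with millisecond precision, e.g. 2026-04-04T14:30:00.123Z
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            # Python names this level WARNING; the standard above uses WARN.
            "level": "WARN" if record.levelname == "WARNING" else record.levelname,
            "service": self.service,
            "requestId": getattr(record, "requestId", None),
            "merchantId": getattr(record, "merchantId", None),
            "message": record.getMessage(),
            "traceId": getattr(record, "traceId", None),
        }
        for name in OPTIONAL_FIELDS:
            if hasattr(record, name):
                entry[name] = getattr(record, name)
        return json.dumps(entry)


def build_logger(service: str) -> logging.Logger:
    """Hypothetical helper: a logger that writes JSON lines to stderr."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonLogFormatter(service))
    logger = logging.getLogger(service)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `build_logger("wallet-payment-service").info("Payment created successfully", extra={"requestId": "abc123-def456-ghi789", "resourceId": "pay_1"})` would then emit a single JSON line. Fields that are not supplied, such as `merchantId` here, are written as `null` rather than omitted, which is a design choice of this sketch and not a stated requirement.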
Log Levels
| Level | When to Use | Examples |
|---|---|---|
| ERROR | Unexpected failures requiring investigation | Unhandled exceptions, database connection failures, vendor errors that indicate a bug |
| WARN | Expected but noteworthy conditions | Retry attempts, circuit breaker trips, deprecated API usage, approaching rate limits |
| INFO | Normal operations worth recording | Resource created, state transition, external call completed |
| DEBUG | Detailed diagnostic information (disabled in production) | Request/response payloads (masked), internal decision logic |
What to Log — What NOT to Log
| ✅ Always Log | ❌ Never Log |
|---|---|
| Resource IDs (paymentId, customerId, etc.) | Full card numbers (PAN) |
| Status transitions with before/after | CVV / CVC |
| Error codes and messages | Full SSN |
| External call duration and status | Full bank account numbers |
| Masked card last-four and brand | OAuth token values |
| Request correlation IDs | API secret keys |
| Merchant ID | Customer passwords |
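
One way to enforce the table above is to sanitize payloads before they reach the logger. The sketch below is illustrative only: the key names in `SENSITIVE_KEYS` and the output key `cardLast4` are hypothetical and would need to match the actual payload schema.

```python
import re

# Hypothetical key names; align these with the real request/response schema.
SENSITIVE_KEYS = {
    "cardNumber", "cvv", "ssn", "accountNumber",
    "accessToken", "apiSecret", "password",
}


def last_four(pan: str) -> str:
    """Return the last four digits of a card number, ignoring separators."""
    return re.sub(r"\D", "", pan)[-4:]


def sanitize(payload: dict) -> dict:
    """Return a copy of payload that is safe to log.

    Card numbers are reduced to their last four digits; every other
    sensitive key is dropped entirely. Nested dicts are handled recursively.
    """
    safe = {}
    for key, value in payload.items():
        if key == "cardNumber" and isinstance(value, str):
            safe["cardLast4"] = last_four(value)
        elif key in SENSITIVE_KEYS:
            continue  # never log the value, not even a placeholder
        elif isinstance(value, dict):
            safe[key] = sanitize(value)
        else:
            safe[key] = value
    return safe
```

A key-based filter like this only catches fields it knows about, so it complements code review and log scanning; it does not replace them.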
Azure Application Insights
Application Insights is the primary tool for distributed tracing and exception tracking across services.
What Application Insights Provides
| Capability | How We Use It |
|---|---|
| Request tracking | Every inbound HTTP request is automatically traced with duration, status code, and dependency calls |
| Dependency tracking | Outbound calls (Stripe API, database, other services) are captured with timing and success/failure |
| Exception tracking | Unhandled and explicitly logged exceptions are captured with stack traces |
| Distributed tracing | Trace IDs propagate across service boundaries, enabling end-to-end request visualization |
| Performance metrics | Server response time, dependency duration, failure rate, availability |
| Custom events | Domain-specific events (e.g., payment state transitions) can be tracked as custom telemetry |
Engineering Requirements
| Requirement | Standard |
|---|---|
| SDK integration | Every service must include the Application Insights SDK and configure its connection string (or the legacy instrumentation key where that is still in use) |
| Trace propagation | Services must propagate the W3C traceparent header, which carries the trace ID, on all outbound calls |
| Custom dimensions | Add merchantId, resourceId, and requestId as custom dimensions on all telemetry |
| Exception logging | All caught exceptions must be logged to Application Insights with trackException() — do not swallow silently |
| Sensitive data | Never include sensitive data in custom dimensions or custom event properties — same rules as logging |
Useful Queries (Kusto / KQL)
Find all exceptions for a specific request:
exceptions
| where tostring(customDimensions.requestId) == "abc123-def456"
| order by timestamp desc
Track request latency by service:
requests
| where cloud_RoleName == "wallet-payment-service"
| summarize percentile(duration, 50), percentile(duration, 95), percentile(duration, 99) by bin(timestamp, 5m)
| render timechart
Find failed dependency calls:
dependencies
| where success == false
| summarize count() by target, resultCode, bin(timestamp, 1h)
| render timechart
Splunk
Splunk is the primary tool for log aggregation, search, and operational alerting.
Engineering Requirements
| Requirement | Standard |
|---|---|
| Log shipping | All services must ship structured JSON logs to Splunk via the configured log shipper |
| Index | Logs must be written to the team-specific Splunk index |
| Source type | Use _json source type for structured log parsing |
| Retention | Follow organizational retention policy (typically 30-90 days depending on environment) |
Useful Search Patterns
Find all logs for a request:
index=<team-index> requestId="abc123-def456" | sort _time
Find all errors in a service over the last hour:
index=<team-index> service="wallet-payment-service" level="ERROR" earliest=-1h | stats count by errorCode, message
Track error rate over time:
index=<team-index> service="wallet-payment-service" level="ERROR" | timechart count by errorCode
Dashboards
Every service must have an operational dashboard. Dashboards should be created in Splunk and/or Azure Application Insights.
Required Dashboard Panels
| Panel | Metric | Why |
|---|---|---|
| Request rate | Requests/second by endpoint | Detect traffic spikes or drops |
| Error rate | Errors/second by HTTP status code | Detect failures early |
| Latency | P50, P95, P99 response time | Detect performance degradation |
| External dependency health | Success rate and latency for Stripe, database, other services | Detect downstream issues |
| State transitions | Count of resources entering each state per interval (domain-specific) | Detect stuck or abnormal processing |
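
To make the latency panel concrete, the sketch below computes P50, P95, and P99 from raw request durations using the nearest-rank method. It is an illustration of what the percentiles mean, not how the dashboards compute them: Application Insights and Splunk apply their own estimators, which may be approximate, so values can differ slightly. The function names are hypothetical.

```python
from typing import Dict, Iterable, Sequence


def percentile(sorted_values: Sequence[float], p: int) -> float:
    """Nearest-rank percentile of an ascending, non-empty sequence.

    p is an integer from 1 to 100. Integer arithmetic avoids
    floating-point rounding when computing the rank.
    """
    if not sorted_values:
        raise ValueError("percentile of an empty sequence is undefined")
    rank = (p * len(sorted_values) + 99) // 100  # ceil(p * n / 100)
    return sorted_values[max(rank, 1) - 1]


def latency_summary(durations_ms: Iterable[float]) -> Dict[str, float]:
    """Return the three percentiles the latency panel requires."""
    ordered = sorted(durations_ms)
    return {f"p{p}": percentile(ordered, p) for p in (50, 95, 99)}
```

For example, `latency_summary([120, 80, 100, 3000])` reports a P50 of 100 ms but a P95 of 3000 ms, which shows why the panel tracks tail percentiles and not only the median: a single slow request is invisible at P50.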
Dashboard Per Domain
Each domain should maintain a dashboard focused on its specific metrics:
| Domain | Key Metrics |
|---|---|
| Payments | Payment success rate, decline rate, average processing time, split-tender rollback count |
| Customers | Customer creation rate, lookup failures, wallet operations |
| Merchants | Configuration propagation latency, onboarding success rate |
| Reporting | Report generation time, export success rate |
Alerting
Alert Standards
| Standard | Detail |
|---|---|
| Every alert must be actionable | If an alert fires, the on-call engineer must know what to investigate and what action to take |
| No alert fatigue | Tune thresholds to avoid false positives; review alert noise monthly |
| Severity levels | SEV1 (customer-facing outage), SEV2 (degraded but functional), SEV3 (non-critical anomaly) |
| Notification channels | SEV1 → PagerDuty/on-call; SEV2 → Slack channel; SEV3 → Dashboard only |
Required Alerts (All Services)
| Alert | Condition | Severity |
|---|---|---|
| High error rate | Error rate > 5% of requests for 5 min | SEV2 |
| Service unavailable | 0 successful requests for 2 min | SEV1 |
| High latency | P95 > 10s for 5 min | SEV2 |
| External dependency failure | Dependency error rate > 10% for 5 min | SEV2 |
| Database connection pool exhaustion | Available connections < 10% for 2 min | SEV1 |
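
The first two conditions in the table, together with the severity routing defined earlier, can be sketched as follows. This assumes metrics arrive as one-minute buckets and interprets "for N min" as "every one of the last N buckets breaches the threshold"; both are assumptions, as are all names in the code. In practice these rules would live in Splunk or Azure Monitor alert definitions, not in service code.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# Severity to notification channel, per the alert standards above.
CHANNELS = {"SEV1": "pagerduty", "SEV2": "slack", "SEV3": "dashboard"}


@dataclass(frozen=True)
class MinuteStats:
    """Request counts for one one-minute bucket (hypothetical shape)."""
    total: int
    errors: int


def high_error_rate(windows: Sequence[MinuteStats],
                    threshold: float = 0.05, minutes: int = 5) -> bool:
    """True when the error rate exceeds threshold in each of the last N buckets."""
    if len(windows) < minutes:
        return False
    return all(w.total > 0 and w.errors / w.total > threshold
               for w in windows[-minutes:])


def service_unavailable(windows: Sequence[MinuteStats], minutes: int = 2) -> bool:
    """True when there were no successful requests in each of the last N buckets."""
    if len(windows) < minutes:
        return False
    return all(w.total - w.errors == 0 for w in windows[-minutes:])


def evaluate(windows: Sequence[MinuteStats]) -> List[Tuple[str, str, str]]:
    """Return (alert name, severity, channel) for every firing alert."""
    firing = []
    if service_unavailable(windows):
        firing.append(("Service unavailable", "SEV1", CHANNELS["SEV1"]))
    if high_error_rate(windows):
        firing.append(("High error rate", "SEV2", CHANNELS["SEV2"]))
    return firing
```

Requiring every bucket in the window to breach the threshold, instead of alerting on a single spike, is one way to reduce false positives in line with the alert fatigue standard.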
Correlation ID Propagation
All services must propagate the X-Request-Id header (recorded as requestId in logs and telemetry) across every inter-service call to enable end-to-end tracing.
| Rule | Detail |
|---|---|
| Generate if missing | If the inbound request has no X-Request-Id, generate a UUID and use it |
| Propagate always | Pass X-Request-Id on every outbound HTTP call and message publication |
| Log always | Every log line must include requestId — this is the primary debugging key |
| Application Insights | Add requestId as a custom dimension on all telemetry for trace correlation |
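
The "generate if missing" and "propagate always" rules can be sketched with two small helpers. The function names are hypothetical, and in a real service this logic would normally sit in shared middleware so that individual handlers cannot forget it.

```python
import uuid
from typing import Dict, Mapping, Optional

REQUEST_ID_HEADER = "X-Request-Id"


def resolve_request_id(inbound_headers: Mapping[str, str]) -> str:
    """Return the inbound X-Request-Id, or a new UUID if it is missing or blank."""
    # HTTP header names are case-insensitive, so compare in lowercase.
    for name, value in inbound_headers.items():
        if name.lower() == REQUEST_ID_HEADER.lower() and value.strip():
            return value.strip()
    return str(uuid.uuid4())


def outbound_headers(request_id: str,
                     extra: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Build headers for an outbound call, always carrying the request ID."""
    headers = dict(extra or {})
    headers[REQUEST_ID_HEADER] = request_id
    return headers
```

The resolved value is the same one written to the `requestId` log field, so a single search on that ID returns the full path of a request across services.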