Version: v2

Observability

Overview

Observability is a core engineering discipline that applies to every service in the Convenient Checkout platform — not just v2 implementations. This guide defines the standards for logging, monitoring, alerting, and dashboarding that all engineers must follow.

Our primary observability stack:

| Tool | Purpose |
| --- | --- |
| Azure Application Insights | Distributed tracing, exception tracking, performance monitoring, dependency tracking |
| Splunk | Log aggregation, search, alerting, operational dashboards |

Logging Standards

Structured Logging

All services must produce structured JSON logs. Unstructured, free-form log messages are not acceptable in production.

Required fields in every log entry:

| Field | Description | Example |
| --- | --- | --- |
| `timestamp` | ISO 8601 UTC | `2026-04-04T14:30:00.123Z` |
| `level` | Log level | `INFO`, `WARN`, `ERROR` |
| `service` | Service name | `wallet-payment-service` |
| `requestId` | Correlation ID from `X-Request-Id` header | `abc123-def456-ghi789` |
| `merchantId` | Merchant context (when available) | `b955db5e-aef2-47de-bbb9-c80b9cc16e8f` |
| `message` | Human-readable description | `Payment created successfully` |
| `traceId` | Distributed trace ID (from Application Insights) | `4bf92f3577b34da6a3ce929d0e0e4736` |

Optional contextual fields (add when relevant):

| Field | When to Include |
| --- | --- |
| `resourceId` | Any operation on a specific resource (paymentId, refundId, etc.) |
| `status` | State transitions |
| `duration` | Timed operations (vendor calls, database queries) |
| `errorCode` | Error scenarios |
| `userId` | User-initiated actions |
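Taken together, a compliant log entry can be produced with a small formatter. The sketch below uses Python's standard `logging` module; the wiring and field values are illustrative, not a mandated implementation:

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON line carrying the required fields."""

    LEVEL_NAMES = {"WARNING": "WARN"}  # align Python's level names with the standard above

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": self.LEVEL_NAMES.get(record.levelname, record.levelname),
            "service": self.service,
            "requestId": getattr(record, "requestId", None),
            "merchantId": getattr(record, "merchantId", None),
            "message": record.getMessage(),
            "traceId": getattr(record, "traceId", None),
        }
        # Optional contextual fields: include only when the caller supplied them.
        for field in ("resourceId", "status", "duration", "errorCode", "userId"):
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)

# Typical wiring: pass per-request context via `extra` so it lands on the record.
logger = logging.getLogger("wallet-payment-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter("wallet-payment-service"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info(
    "Payment created successfully",
    extra={"requestId": "abc123-def456-ghi789",
           "traceId": "4bf92f3577b34da6a3ce929d0e0e4736"},
)
```

Passing context through `extra` keeps call sites terse while guaranteeing every line carries the required fields.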

Log Levels

| Level | When to Use | Examples |
| --- | --- | --- |
| ERROR | Unexpected failures requiring investigation | Unhandled exceptions, database connection failures, vendor errors that indicate a bug |
| WARN | Expected but noteworthy conditions | Retry attempts, circuit breaker trips, deprecated API usage, approaching rate limits |
| INFO | Normal operations worth recording | Resource created, state transition, external call completed |
| DEBUG | Detailed diagnostic information (disabled in production) | Request/response payloads (masked), internal decision logic |

What to Log — What NOT to Log

| ✅ Always Log | ❌ Never Log |
| --- | --- |
| Resource IDs (paymentId, customerId, etc.) | Full card numbers (PAN) |
| Status transitions with before/after | CVV / CVC |
| Error codes and messages | Full SSN |
| External call duration and status | Full bank account numbers |
| Masked card last-four and brand | OAuth token values |
| Request correlation IDs | API secret keys |
| Merchant ID | Customer passwords |
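The "masked card last-four" and "never log" rules can be enforced with small helpers applied before anything reaches a log call. This is an illustrative sketch; `mask_pan`, `scrub`, and the forbidden key names are assumptions, not existing platform utilities:

```python
import re

# Keys that must never be serialized, per the "Never Log" column (names illustrative).
FORBIDDEN_KEYS = {"pan", "cardNumber", "cvv", "cvc", "ssn",
                  "bankAccountNumber", "accessToken", "apiSecret", "password"}

def mask_pan(pan):
    """Keep only the last four digits of a card number; mask the rest."""
    digits = re.sub(r"\D", "", pan)
    return "*" * (len(digits) - 4) + digits[-4:]

def scrub(entry):
    """Drop forbidden keys from a log entry dict before it is serialized."""
    return {k: v for k, v in entry.items() if k not in FORBIDDEN_KEYS}
```

Running every log payload through a scrubber like this is cheaper than auditing individual call sites after a leak.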

Azure Application Insights

Application Insights is the primary tool for distributed tracing and exception tracking across services.

What Application Insights Provides

| Capability | How We Use It |
| --- | --- |
| Request tracking | Every inbound HTTP request is automatically traced with duration, status code, and dependency calls |
| Dependency tracking | Outbound calls (Stripe API, database, other services) are captured with timing and success/failure |
| Exception tracking | Unhandled and explicitly logged exceptions are captured with stack traces |
| Distributed tracing | Trace IDs propagate across service boundaries, enabling end-to-end request visualization |
| Performance metrics | Server response time, dependency duration, failure rate, availability |
| Custom events | Domain-specific events (e.g., payment state transitions) can be tracked as custom telemetry |

Engineering Requirements

| Requirement | Standard |
| --- | --- |
| SDK integration | Every service must include the Application Insights SDK and configure the instrumentation key |
| Trace propagation | Services must propagate the `traceparent` / `traceId` header on all outbound calls |
| Custom dimensions | Add `merchantId`, `resourceId`, and `requestId` as custom dimensions on all telemetry |
| Exception logging | All caught exceptions must be logged to Application Insights with `trackException()` — do not swallow silently |
| Sensitive data | Never include sensitive data in custom dimensions or custom event properties — same rules as logging |
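One way to satisfy the custom-dimensions and exception-logging rules together is a thin wrapper around the telemetry client. This is a sketch: the injected `client` stands in for the real Application Insights SDK client, and the `track_exception(exc, properties=...)` call shape is an assumption to adapt to the SDK version in use:

```python
class ExceptionReporter:
    """Send caught exceptions to Application Insights with required custom dimensions."""

    def __init__(self, client, service):
        self.client = client  # injected telemetry client (SDK stand-in)
        self.service = service

    def report(self, exc, request_id, merchant_id=None, resource_id=None):
        # Required custom dimensions only; never put sensitive values here.
        dimensions = {
            "service": self.service,
            "requestId": request_id,
            "merchantId": merchant_id,
            "resourceId": resource_id,
        }
        properties = {k: v for k, v in dimensions.items() if v is not None}
        self.client.track_exception(exc, properties=properties)
```

Centralizing this in one wrapper makes it harder for a catch block to swallow an exception without reporting it.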

Useful Queries (Kusto / KQL)

Find all exceptions for a specific request:

```
exceptions
| where customDimensions.requestId == "abc123-def456"
| order by timestamp desc
```

Track request latency by service:

```
requests
| where cloud_RoleName == "wallet-payment-service"
| summarize percentiles(duration, 50, 95, 99) by bin(timestamp, 5m)
| render timechart
```

Find failed dependency calls:

```
dependencies
| where success == false
| summarize count() by target, resultCode, bin(timestamp, 1h)
| render timechart
```

Splunk

Splunk is the primary tool for log aggregation, search, and operational alerting.

Engineering Requirements

| Requirement | Standard |
| --- | --- |
| Log shipping | All services must ship structured JSON logs to Splunk via the configured log shipper |
| Index | Logs must be written to the team-specific Splunk index |
| Source type | Use the `_json` source type for structured log parsing |
| Retention | Follow organizational retention policy (typically 30-90 days depending on environment) |

Useful Search Patterns

Find all logs for a request:

```
index=<team-index> requestId="abc123-def456" | sort _time
```

Find all errors in a service over the last hour:

```
index=<team-index> service="wallet-payment-service" level="ERROR" earliest=-1h | stats count by errorCode, message
```

Track error rate over time:

```
index=<team-index> service="wallet-payment-service" level="ERROR" | timechart count by errorCode
```

Dashboards

Every service must have an operational dashboard. Dashboards should be created in Splunk and/or Azure Application Insights.

Required Dashboard Panels

| Panel | Metric | Why |
| --- | --- | --- |
| Request rate | Requests/second by endpoint | Detect traffic spikes or drops |
| Error rate | Errors/second by HTTP status code | Detect failures early |
| Latency | P50, P95, P99 response time | Detect performance degradation |
| External dependency health | Success rate and latency for Stripe, database, other services | Detect downstream issues |
| State transitions | Count of resources entering each state per interval (domain-specific) | Detect stuck or abnormal processing |

Dashboard Per Domain

Each domain should maintain a dashboard focused on its specific metrics:

| Domain | Key Metrics |
| --- | --- |
| Payments | Payment success rate, decline rate, average processing time, split-tender rollback count |
| Customers | Customer creation rate, lookup failures, wallet operations |
| Merchants | Configuration propagation latency, onboarding success rate |
| Reporting | Report generation time, export success rate |

Alerting

Alert Standards

| Standard | Detail |
| --- | --- |
| Every alert must be actionable | If an alert fires, the on-call engineer must know what to investigate and what action to take |
| No alert fatigue | Tune thresholds to avoid false positives; review alert noise monthly |
| Severity levels | SEV1 (customer-facing outage), SEV2 (degraded but functional), SEV3 (non-critical anomaly) |
| Notification channels | SEV1 → PagerDuty/on-call; SEV2 → Slack channel; SEV3 → Dashboard only |

Required Alerts (All Services)

| Alert | Condition | Severity |
| --- | --- | --- |
| High error rate | Error rate > 5% of requests for 5 min | SEV2 |
| Service unavailable | 0 successful requests for 2 min | SEV1 |
| High latency | P95 > 10s for 5 min | SEV2 |
| External dependency failure | Dependency error rate > 10% for 5 min | SEV2 |
| Database connection pool exhaustion | Available connections < 10% for 2 min | SEV1 |
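The high-error-rate condition reads as a sliding-window check. The sketch below simplifies it to fixed per-minute buckets; that bucketing is an assumption for illustration, not how the actual alerting engine evaluates the rule:

```python
def high_error_rate(request_counts, error_counts, window=5, threshold=0.05):
    """Return True when errors exceed `threshold` of requests over the last `window` buckets.

    Inputs are per-minute counts, oldest first.
    """
    requests = sum(request_counts[-window:])
    errors = sum(error_counts[-window:])
    if requests == 0:
        return False  # zero traffic is covered by the SEV1 "service unavailable" alert
    return errors / requests > threshold
```

Computing the rate over the whole window, rather than per bucket, keeps one quiet minute from masking a sustained failure.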

Correlation ID Propagation

All services must propagate the X-Request-Id header (or requestId) across every inter-service call to enable end-to-end tracing.

| Rule | Detail |
| --- | --- |
| Generate if missing | If the inbound request has no `X-Request-Id`, generate a UUID and use it |
| Propagate always | Pass `X-Request-Id` on every outbound HTTP call, message publication, and log entry |
| Log always | Every log line must include `requestId` — this is the primary debugging key |
| Application Insights | The `requestId` should be added as a custom dimension for trace correlation |
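The generate-if-missing and propagate-always rules can be sketched as two small helpers (illustrative Python; the points where a framework would call these are assumptions):

```python
import uuid

REQUEST_ID_HEADER = "X-Request-Id"

def ensure_request_id(inbound_headers):
    """Generate if missing: reuse the inbound X-Request-Id or mint a new UUID."""
    request_id = inbound_headers.get(REQUEST_ID_HEADER)
    if not request_id:
        request_id = str(uuid.uuid4())
    return request_id

def outbound_headers(request_id, extra=None):
    """Propagate always: every outbound call carries the same X-Request-Id."""
    headers = dict(extra or {})
    headers[REQUEST_ID_HEADER] = request_id
    return headers
```

A service would call `ensure_request_id` once at ingress, stash the result in request-scoped context, and route every outbound call and log line through it.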