Version: v2

Observability

Overview

Observability is a core engineering discipline that applies to every service in the Convenient Checkout platform — not just v2 implementations. This guide defines the standards for logging, monitoring, alerting, and dashboarding that all engineers must follow.

Our primary observability stack:

| Tool | Purpose |
| --- | --- |
| Azure Application Insights | Distributed tracing, exception tracking, performance monitoring, dependency tracking |
| Splunk | Log aggregation, search, alerting, operational dashboards |

Logging Standards

Structured Logging

All services must produce structured JSON logs. Unstructured, free-form log messages are not acceptable in production.

Required fields in every log entry:

| Field | Description | Example |
| --- | --- | --- |
| `timestamp` | ISO 8601 UTC | `2026-04-04T14:30:00.123Z` |
| `level` | Log level | `INFO`, `WARN`, `ERROR` |
| `service` | Service name | `wallet-payment-service` |
| `requestId` | Correlation ID from `X-Request-Id` header | `abc123-def456-ghi789` |
| `merchantId` | Merchant context (when available) | `b955db5e-aef2-47de-bbb9-c80b9cc16e8f` |
| `message` | Human-readable description | `Payment created successfully` |
| `traceId` | Distributed trace ID (from Application Insights) | `4bf92f3577b34da6a3ce929d0e0e4736` |

Optional contextual fields (add when relevant):

| Field | When to Include |
| --- | --- |
| `resourceId` | Any operation on a specific resource (paymentId, refundId, etc.) |
| `status` | State transitions |
| `duration` | Timed operations (vendor calls, database queries) |
| `errorCode` | Error scenarios |
| `userId` | User-initiated actions |
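Taken together, a compliant log entry can be produced with a small formatter. The sketch below uses Python's standard `logging` module; the wiring and field values are illustrative, not a mandated implementation:

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON line carrying the required fields."""

    LEVEL_NAMES = {"WARNING": "WARN"}  # align Python's level names with the standard above

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": self.LEVEL_NAMES.get(record.levelname, record.levelname),
            "service": self.service,
            "requestId": getattr(record, "requestId", None),
            "merchantId": getattr(record, "merchantId", None),
            "message": record.getMessage(),
            "traceId": getattr(record, "traceId", None),
        }
        # Optional contextual fields: include only when the caller supplied them.
        for field in ("resourceId", "status", "duration", "errorCode", "userId"):
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)

# Typical wiring: pass per-request context via `extra` so it lands on the record.
logger = logging.getLogger("wallet-payment-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter("wallet-payment-service"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info(
    "Payment created successfully",
    extra={"requestId": "abc123-def456-ghi789",
           "traceId": "4bf92f3577b34da6a3ce929d0e0e4736"},
)
```

Passing context through `extra` keeps call sites terse while guaranteeing every line carries the required fields.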

Log Levels

| Level | When to Use | Examples |
| --- | --- | --- |
| ERROR | Unexpected failures requiring investigation | Unhandled exceptions, database connection failures, vendor errors that indicate a bug |
| WARN | Expected but noteworthy conditions | Retry attempts, circuit breaker trips, deprecated API usage, approaching rate limits |
| INFO | Normal operations worth recording | Resource created, state transition, external call completed |
| DEBUG | Detailed diagnostic information (disabled in production) | Request/response payloads (masked), internal decision logic |

What to Log — What NOT to Log

| ✅ Always Log | ❌ Never Log |
| --- | --- |
| Resource IDs (paymentId, customerId, etc.) | Full card numbers (PAN) |
| Status transitions with before/after | CVV / CVC |
| Error codes and messages | Full SSN |
| External call duration and status | Full bank account numbers |
| Masked card last-four and brand | OAuth token values |
| Request correlation IDs | API secret keys |
| Merchant ID | Customer passwords |
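The "masked card last-four" and "never log" rules can be enforced with small helpers applied before anything reaches a log call. This is an illustrative sketch; `mask_pan`, `scrub`, and the forbidden key names are assumptions, not existing platform utilities:

```python
import re

# Keys that must never be serialized, per the "Never Log" column (names illustrative).
FORBIDDEN_KEYS = {"pan", "cardNumber", "cvv", "cvc", "ssn",
                  "bankAccountNumber", "accessToken", "apiSecret", "password"}

def mask_pan(pan):
    """Keep only the last four digits of a card number; mask the rest."""
    digits = re.sub(r"\D", "", pan)
    return "*" * (len(digits) - 4) + digits[-4:]

def scrub(entry):
    """Drop forbidden keys from a log entry dict before it is serialized."""
    return {k: v for k, v in entry.items() if k not in FORBIDDEN_KEYS}
```

Running every log payload through a scrubber like this is cheaper than auditing individual call sites after a leak.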

Azure Application Insights

Application Insights is the primary tool for distributed tracing and exception tracking across services.

What Application Insights Provides

| Capability | How We Use It |
| --- | --- |
| Request tracking | Every inbound HTTP request is automatically traced with duration, status code, and dependency calls |
| Dependency tracking | Outbound calls (Stripe API, database, other services) are captured with timing and success/failure |
| Exception tracking | Unhandled and explicitly logged exceptions are captured with stack traces |
| Distributed tracing | Trace IDs propagate across service boundaries, enabling end-to-end request visualization |
| Performance metrics | Server response time, dependency duration, failure rate, availability |
| Custom events | Domain-specific events (e.g., payment state transitions) can be tracked as custom telemetry |

Engineering Requirements

| Requirement | Standard |
| --- | --- |
| SDK integration | Every service must include the Application Insights SDK and configure the instrumentation key |
| Trace propagation | Services must propagate the `traceparent` / `traceId` header on all outbound calls |
| Custom dimensions | Add `merchantId`, `resourceId`, and `requestId` as custom dimensions on all telemetry |
| Exception logging | All caught exceptions must be logged to Application Insights with `trackException()` — do not swallow silently |
| Sensitive data | Never include sensitive data in custom dimensions or custom event properties — same rules as logging |
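One way to satisfy the custom-dimensions and exception-logging rules together is a thin wrapper around the telemetry client. This is a sketch: the injected `client` stands in for the real Application Insights SDK client, and the `track_exception(exc, properties=...)` call shape is an assumption to adapt to the SDK version in use:

```python
class ExceptionReporter:
    """Send caught exceptions to Application Insights with required custom dimensions."""

    def __init__(self, client, service):
        self.client = client  # injected telemetry client (SDK stand-in)
        self.service = service

    def report(self, exc, request_id, merchant_id=None, resource_id=None):
        # Required custom dimensions only; never put sensitive values here.
        dimensions = {
            "service": self.service,
            "requestId": request_id,
            "merchantId": merchant_id,
            "resourceId": resource_id,
        }
        properties = {k: v for k, v in dimensions.items() if v is not None}
        self.client.track_exception(exc, properties=properties)
```

Centralizing this in one wrapper makes it harder for a catch block to swallow an exception without reporting it.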

Useful Queries (Kusto / KQL)

Find all exceptions for a specific request:

```
exceptions
| where customDimensions.requestId == "abc123-def456"
| order by timestamp desc
```

Track request latency by service:

```
requests
| where cloud_RoleName == "wallet-payment-service"
| summarize percentiles(duration, 50, 95, 99) by bin(timestamp, 5m)
| render timechart
```

Find failed dependency calls:

```
dependencies
| where success == false
| summarize count() by target, resultCode, bin(timestamp, 1h)
| render timechart
```

Splunk

Splunk is the primary tool for log aggregation, search, and operational alerting.

Engineering Requirements

| Requirement | Standard |
| --- | --- |
| Log shipping | All services must ship structured JSON logs to Splunk via the configured log shipper |
| Index | Logs must be written to the team-specific Splunk index |
| Source type | Use the `_json` source type for structured log parsing |
| Retention | Follow organizational retention policy (typically 30-90 days depending on environment) |

Useful Search Patterns

Find all logs for a request:

```
index=<team-index> requestId="abc123-def456" | sort _time
```

Find all errors in a service over the last hour:

```
index=<team-index> service="wallet-payment-service" level="ERROR" earliest=-1h | stats count by errorCode, message
```

Track error rate over time:

```
index=<team-index> service="wallet-payment-service" level="ERROR" | timechart count by errorCode
```

Dashboards

Every service must have an operational dashboard. Dashboards should be created in Splunk and/or Azure Application Insights.

Required Dashboard Panels

| Panel | Metric | Why |
| --- | --- | --- |
| Request rate | Requests/second by endpoint | Detect traffic spikes or drops |
| Error rate | Errors/second by HTTP status code | Detect failures early |
| Latency | P50, P95, P99 response time | Detect performance degradation |
| External dependency health | Success rate and latency for Stripe, database, other services | Detect downstream issues |
| State transitions | Count of resources entering each state per interval (domain-specific) | Detect stuck or abnormal processing |

Dashboard Per Domain

Each domain should maintain a dashboard focused on its specific metrics:

| Domain | Key Metrics |
| --- | --- |
| Payments | Payment success rate, decline rate, average processing time, split-tender rollback count |
| Customers | Customer creation rate, lookup failures, wallet operations |
| Merchants | Configuration propagation latency, onboarding success rate |
| Reporting | Report generation time, export success rate |

Alerting

Alert Standards

| Standard | Detail |
| --- | --- |
| Every alert must be actionable | If an alert fires, the on-call engineer must know what to investigate and what action to take |
| No alert fatigue | Tune thresholds to avoid false positives; review alert noise monthly |
| Severity levels | SEV1 (customer-facing outage), SEV2 (degraded but functional), SEV3 (non-critical anomaly) |
| Notification channels | SEV1 → PagerDuty/on-call; SEV2 → Slack channel; SEV3 → Dashboard only |

Required Alerts (All Services)

| Alert | Condition | Severity |
| --- | --- | --- |
| High error rate | Error rate > 5% of requests for 5 min | SEV2 |
| Service unavailable | 0 successful requests for 2 min | SEV1 |
| High latency | P95 > 10s for 5 min | SEV2 |
| External dependency failure | Dependency error rate > 10% for 5 min | SEV2 |
| Database connection pool exhaustion | Available connections < 10% for 2 min | SEV1 |
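The high-error-rate condition reads as a sliding-window check. The sketch below simplifies it to fixed per-minute buckets; that bucketing is an assumption for illustration, not how the actual alerting engine evaluates the rule:

```python
def high_error_rate(request_counts, error_counts, window=5, threshold=0.05):
    """Return True when errors exceed `threshold` of requests over the last `window` buckets.

    Inputs are per-minute counts, oldest first.
    """
    requests = sum(request_counts[-window:])
    errors = sum(error_counts[-window:])
    if requests == 0:
        return False  # zero traffic is covered by the SEV1 "service unavailable" alert
    return errors / requests > threshold
```

Computing the rate over the whole window, rather than per bucket, keeps one quiet minute from masking a sustained failure.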

Correlation ID Propagation

All services must propagate the X-Request-Id header (or requestId) across every inter-service call to enable end-to-end tracing.

| Rule | Detail |
| --- | --- |
| Generate if missing | If the inbound request has no `X-Request-Id`, generate a UUID and use it |
| Propagate always | Pass `X-Request-Id` on every outbound HTTP call, message publication, and log entry |
| Log always | Every log line must include `requestId` — this is the primary debugging key |
| Application Insights | The `requestId` should be added as a custom dimension for trace correlation |
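The generate-if-missing and propagate-always rules can be sketched as two small helpers (illustrative Python; the points where a framework would call these are assumptions):

```python
import uuid

REQUEST_ID_HEADER = "X-Request-Id"

def ensure_request_id(inbound_headers):
    """Generate if missing: reuse the inbound X-Request-Id or mint a new UUID."""
    request_id = inbound_headers.get(REQUEST_ID_HEADER)
    if not request_id:
        request_id = str(uuid.uuid4())
    return request_id

def outbound_headers(request_id, extra=None):
    """Propagate always: every outbound call carries the same X-Request-Id."""
    headers = dict(extra or {})
    headers[REQUEST_ID_HEADER] = request_id
    return headers
```

A service would call `ensure_request_id` once at ingress, stash the result in request-scoped context, and route every outbound call and log line through it.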