Prompt Techniques & Caching
This page covers the structure of the system prompt, how to modify it, how prompt caching works on Azure OpenAI, and how to estimate the savings it provides.
System Prompt Overview
The system prompt lives in src/components/ChatBot/aiConfig.js as the systemPrompt constant. It is a template literal string — plain text, no external file, no build step required.
```
src/components/ChatBot/aiConfig.js
└── systemPrompt (imported by walletAIService.js)
    └── sent as system_prompt field in every POST /api/ask payload
```
Every request to POST /api/ask includes the full system prompt in the request body:
```json
{
  "query": "...",
  "session_id": "<uuid>",
  "system_prompt": "<systemPrompt>",
  "search_config": { "index_name": "wallet-docs-index-v2", "top": 5 }
}
```
The system prompt is stateless — it does not change between requests. This is what makes prompt caching effective (see below).
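Because the prompt travels with every request, the client-side payload assembly is trivial. A minimal sketch, assuming a helper named buildAskPayload (the helper name is illustrative; only the field names and the index name come from the payload shown above):

```javascript
// Sketch of a client-side payload builder for POST /api/ask.
// `systemPrompt` stands in for the constant exported from aiConfig.js;
// `buildAskPayload` is an illustrative name, not the real code.
const systemPrompt = 'You are a Senior Integration Architect for CCG ...';

function buildAskPayload(query, sessionId) {
  return {
    query,
    session_id: sessionId,
    // Byte-identical on every request, which is what lets the
    // Azure OpenAI prompt cache reuse the prefix (see below).
    system_prompt: systemPrompt,
    search_config: { index_name: 'wallet-docs-index-v2', top: 5 },
  };
}
```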
Prompt Structure
The system prompt is divided into six sections:
| Section | Purpose |
|---|---|
| ROLE | Establishes model identity as Senior Integration Architect for CCG |
| CORE PRINCIPLES | Zero-hallucination rules; security-first requirements |
| ROLE-SPECIFIC GUIDANCE | Audience-aware behaviour (Dev, QA, PO, Business) |
| RESPONSE RULES | Business vs API question routing; silent section omission rule |
| SECURITY | Repeats critical security constraints so they are near the end of the prompt |
| REQUIRED RESPONSE FORMAT | Bold label format (**Summary**), ==highlight== syntax, fenced `bash` code blocks |
Why bold labels instead of headings
The chat widget's MarkdownContent renderer detects bold-only lines (/^\*\*([^*]+?)\*\*[\s:.,]*$/) and styles them as section headers. Markdown ## headings are not used in responses because the Azure OpenAI model was inserting literal ## characters that appeared unstyled in the chat bubble.
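The detection logic can be sketched in a few lines. This is a reconstruction from the regex quoted above, not the actual MarkdownContent source:

```javascript
// Bold-only-line detection, per the pattern quoted above.
// A line qualifies when it is *only* a **bold** span, optionally
// followed by whitespace or light punctuation.
const BOLD_LABEL = /^\*\*([^*]+?)\*\*[\s:.,]*$/;

function isSectionLabel(line) {
  return BOLD_LABEL.test(line.trim());
}
```

Lines such as `**Summary**` or `**Next Steps**:` match; a `**bold**` span followed by prose does not, so inline emphasis is left alone.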
The ==highlight== convention
The system prompt instructs the model to wrap important inline values in double-equals signs:
```
Use ==value== to highlight error codes, status values, field names, and HTTP methods.
Example: ==PAYMENT_FAILED==, ==400==, ==paymentMethodId==
```
ChatBot.js converts ==text== to <mark class="chatbot-highlight"> at render time.
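A sketch of that conversion (the real ChatBot.js implementation may differ, for example in how it escapes HTML; the regex here is an assumption):

```javascript
// Replace ==text== spans with <mark> elements at render time.
// Assumes the input has already been HTML-escaped upstream.
function renderHighlights(text) {
  return text.replace(
    /==([^=]+)==/g,
    '<mark class="chatbot-highlight">$1</mark>'
  );
}
```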
Modifying the System Prompt
The system prompt is a plain string in src/components/ChatBot/aiConfig.js (the systemPrompt constant described above). Edit it directly; no separate build step is required.
After editing:
- Restart api/server.js
- Send a few test queries covering business, integration, and API scenarios
- Verify that sections appear in order and no `##` headings leak into responses
Adding a new audience section
Add a new block under ROLE-SPECIFIC GUIDANCE:
```
For <New Audience>
- <Specific instruction about how to handle responses for this audience>
- <Second instruction>
```
No code changes are needed — the model reads the section from the prompt at inference time.
Tightening hallucination controls
If the model starts inferring undocumented fields, add stronger constraints to CORE PRINCIPLES:
- NEVER infer or suggest fields that are not explicitly listed in the documentation context.
- If a field name does not appear verbatim in the context, do not mention it.
Rate Limiting
The server enforces a sliding-window rate limit per client IP to prevent abuse and keep Azure OpenAI costs bounded.
| Parameter | Value | Config location |
|---|---|---|
| Enabled by default | Yes | server.rateLimit.enabled in config.js |
| Max requests | 5 per minute | server.rateLimit.maxRequests |
| Window | 60 seconds | server.rateLimit.windowMs |
| Override for local dev | RATE_LIMIT_ENABLED=false | Environment variable |
When the limit is exceeded the server returns:
```
HTTP 429 Too Many Requests
Retry-After: 60

{ "error": "Rate limit exceeded. Max 5 requests per minute." }
```
The rate limiter reads the real client IP from X-Forwarded-For (set by nginx) before falling back to req.socket.remoteAddress. This means all requests proxied through nginx share the correct per-user IP, not the nginx pod IP.
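The behaviour described above can be sketched as follows. The names (clientIp, isRateLimited, the hits map) are illustrative, not the actual server.js code; only the limits and the X-Forwarded-For fallback order come from this page:

```javascript
// Sketch of a per-IP sliding-window limiter matching the documented
// behaviour: 5 requests per 60-second window.
const WINDOW_MS = 60_000;
const MAX_REQUESTS = 5;
const hits = new Map(); // ip -> array of request timestamps

function clientIp(req) {
  // X-Forwarded-For may hold a comma-separated chain; the first
  // entry is the original client (proxies append to the right).
  const fwd = req.headers['x-forwarded-for'];
  if (fwd) return fwd.split(',')[0].trim();
  return req.socket.remoteAddress;
}

function isRateLimited(req, now = Date.now()) {
  const ip = clientIp(req);
  // Keep only timestamps inside the sliding window.
  const recent = (hits.get(ip) || []).filter(t => now - t < WINDOW_MS);
  if (recent.length >= MAX_REQUESTS) {
    hits.set(ip, recent);
    return true; // caller responds 429 with Retry-After: 60
  }
  recent.push(now);
  hits.set(ip, recent);
  return false;
}
```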
Disabling for local development
Add RATE_LIMIT_ENABLED=false to your local start command:
```bash
RATE_LIMIT_ENABLED=false \
AZURE_SEARCH_ENDPOINT=https://ccg-docs.search.windows.net \
...
node api/server.js
```
Or set it permanently in a .env-style shell alias for local work.
Response Cache
The server maintains an in-process cache (a plain JavaScript Map) that stores the full { answer, docs } payload keyed by the normalised query string. A cache hit skips both the Azure Search call(s) and the Azure OpenAI call entirely.
| Parameter | Value | Where |
|---|---|---|
| TTL | None — entries live until server restart | server.js |
| Max entries | 500 | CACHE_MAX_SIZE in server.js |
| Eviction policy | Oldest-inserted first | Map insertion order |
| Scope | Single Node process | Lost on restart; not shared across K8s pods |
Because there is no TTL, restart the server after running uploadToAzureSearch.js to prevent users from getting cached answers based on stale docs.
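A sketch of this cache, using Map insertion order for oldest-first eviction. The _getCached / _setCached names are taken from the trade-offs discussion on this page; the body is a reconstruction, not the actual server.js code:

```javascript
// In-process response cache: no TTL, capped at 500 entries,
// oldest-inserted evicted first via Map insertion order.
const CACHE_MAX_SIZE = 500;
const cache = new Map(); // key -> { answer, docs }

function _getCached(key) {
  return cache.get(key); // undefined on miss
}

function _setCached(key, value) {
  if (cache.size >= CACHE_MAX_SIZE) {
    // A Map iterates in insertion order, so the first key is the oldest.
    cache.delete(cache.keys().next().value);
  }
  cache.set(key, value);
}
```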
What it saves
| | Without cache | With cache hit |
|---|---|---|
| Azure Search calls | 1–2 | 0 |
| Azure OpenAI call | 1 | 0 |
| Cost | ~$0.008–0.024 | $0.000 |
| Latency | 2–8 s | < 1 ms |
How the cache key is formed
The key is query.trim().slice(0, 500) — the same normalised string that is sent to Azure Search. Queries that differ only in leading/trailing whitespace will share a cache entry. Case differences will not (e.g. "What is CCG" and "what is CCG" are separate entries). This is intentional — adding .toLowerCase() to the key is a safe future optimisation if needed.
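The key derivation itself is one line; a sketch showing the behaviour described above:

```javascript
// The documented normalisation: trim whitespace, cap at 500 characters.
// Deliberately case-sensitive (no .toLowerCase()).
function makeCacheKey(query) {
  return query.trim().slice(0, 500);
}
```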
Logging
Cache hits are logged as:
```
Cache hit — "what is convenient checkout"
```
Cache misses log the query type as normal:
```
Query type: business/general — "what is convenient checkout"
```
Trade-offs vs Redis
| | In-process Map (current) | Azure Cache for Redis |
|---|---|---|
| Cost | Free | ~$16–55/month |
| Shared across pods | No | Yes |
| Survives restart | No | Yes |
| Setup required | None | Redis client npm package + K8s secret |
For a single-replica internal docs assistant, the in-process cache is the right choice. If the assistant ever scales to multiple replicas, add Redis using the same _getCached / _setCached interface — only the implementation of those two functions changes.
Azure OpenAI Prompt Caching
Prompt caching (available on GPT-4.1 via Azure OpenAI) automatically reuses the KV-cache for the static prefix of the messages array — specifically the system prompt — when:
- The system prompt content is identical between requests
- The deployment supports caching (GPT-4.1 does)
- The cache is warm (within a short time window — typically a few minutes)
When the cache is hit, Azure OpenAI charges at 50% of the normal input token rate for the cached prefix.
What it saves
The system prompt is ~600 tokens. With caching:
| Scenario | Input cost without cache | Input cost with cache | Saving |
|---|---|---|---|
| API query (6 750 total input tokens) | $0.0135 | $0.0129 | ~$0.0006 per query |
| Business query (2 750 total input tokens) | $0.0055 | $0.0049 | ~$0.0006 per query |
At 1 000 queries/day, prompt caching saves roughly $0.60/day on the system prompt alone. The bigger savings come from keeping contextMaxChars minimal (see Token Utilization).
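The arithmetic above can be reproduced directly. This sketch assumes GPT-4.1 input pricing of $2 per 1M tokens (the rate implied by the table) and the 50% cached-token discount:

```javascript
// Reproduce the per-query and per-day savings from the table above.
const INPUT_RATE_PER_TOKEN = 2 / 1_000_000; // assumed: $2 per 1M input tokens
const CACHED_DISCOUNT = 0.5;                // cached prefix billed at 50%
const systemPromptTokens = 600;
const queriesPerDay = 1000;

const savingPerQuery = systemPromptTokens * INPUT_RATE_PER_TOKEN * CACHED_DISCOUNT;
const savingPerDay = savingPerQuery * queriesPerDay;
// savingPerQuery ≈ $0.0006, savingPerDay ≈ $0.60
```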
Does server.js enable it?
Yes — automatically. Azure OpenAI GPT-4.1 enables prompt caching by default for qualifying deployments. No code change is required.
The key condition is that the system prompt must be byte-for-byte identical across requests. In our setup this is guaranteed because systemPrompt is a module-level constant loaded once at startup.
If you restart api/server.js while a high volume of requests is being served, the cache will be cold for the first few requests after restart. This is expected and has negligible impact.
Verifying cache hits
The Azure OpenAI response body includes usage metadata. You can log json.usage in callOpenAI() to inspect whether tokens were served from cache:
```javascript
// In callOpenAI(), after parsing the response:
const json = JSON.parse(data);
if (json.usage) {
  console.log('Token usage:', JSON.stringify(json.usage));
  // Example output:
  // {"prompt_tokens":6750,"completion_tokens":420,
  //  "total_tokens":7170,
  //  "prompt_tokens_details":{"cached_tokens":600,"audio_tokens":0},
  //  "completion_tokens_details":{"reasoning_tokens":0,"audio_tokens":0}}
}
```
prompt_tokens_details.cached_tokens will be 600 (the system prompt length) when the cache is warm.
Adding usage logging temporarily
To add temporary usage logging without changing production behavior, edit callOpenAI() in api/server.js:
```javascript
const json = JSON.parse(data);

// Temporary: log token usage for cost monitoring
if (process.env.LOG_TOKEN_USAGE === 'true' && json.usage) {
  const u = json.usage;
  const cached = u.prompt_tokens_details?.cached_tokens || 0;
  console.log(
    `[tokens] prompt=${u.prompt_tokens} (cached=${cached})` +
    ` completion=${u.completion_tokens} total=${u.total_tokens}`
  );
}
resolve(json.choices?.[0]?.message?.content || '');
```
Enable with LOG_TOKEN_USAGE=true in the environment — no code change needed in production.
Separate System Prompt vs Inline Prompts
The current design keeps the system prompt in a single constant (systemPrompt in src/components/ChatBot/aiConfig.js). An alternative is splitting it into an external .txt or .md file and loading it with fs.readFileSync at startup.
| Approach | Pros | Cons |
|---|---|---|
| Inline constant (current) | Simple; no file I/O; Docker image is self-contained | Slightly harder to diff prompt changes in PRs |
| External .txt file | Easy to diff in PRs; supports comments | One more file to manage; must be included in the Docker image |
| External .md file | Human-readable; previews nicely on GitHub | Markdown syntax in the prompt file may confuse editors |
For this project, the inline constant is the right trade-off. The system prompt is ~50 lines and changes rarely.