Prompt Techniques & Caching
This page covers the structure of the system prompt, how to modify it, how prompt caching works on Azure OpenAI, and how to estimate the savings it provides.
System Prompt Overview
The system prompt lives in src/components/ChatBot/aiConfig.js as the systemPrompt constant. It is a template literal string — plain text, no external file, no build step required.
```
src/components/ChatBot/aiConfig.js
└── systemPrompt (imported by walletAIService.js)
    └── sent as system_prompt field in every POST /api/ask payload
```
Every request to POST /api/ask includes the full system prompt in the request body:
```json
{
  "query": "...",
  "session_id": "<uuid>",
  "system_prompt": "<systemPrompt>",
  "search_config": { "index_name": "wallet-docs-index-v2", "top": 5 }
}
```
The system prompt is stateless — it does not change between requests. This is what makes prompt caching effective (see below).
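Because the prompt travels with every request, the client-side payload assembly is trivial. A minimal sketch, assuming a helper named buildAskPayload (the helper name is illustrative; only the field names and the index name come from the payload shown above):

```javascript
// Sketch of a client-side payload builder for POST /api/ask.
// `systemPrompt` stands in for the constant exported from aiConfig.js;
// `buildAskPayload` is an illustrative name, not the real code.
const systemPrompt = 'You are a Senior Integration Architect for CCG ...';

function buildAskPayload(query, sessionId) {
  return {
    query,
    session_id: sessionId,
    // Byte-identical on every request, which is what lets the
    // Azure OpenAI prompt cache reuse the prefix (see below).
    system_prompt: systemPrompt,
    search_config: { index_name: 'wallet-docs-index-v2', top: 5 },
  };
}
```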
Prompt Structure
The system prompt is divided into six sections:
| Section | Purpose |
|---|---|
| ROLE | Establishes model identity as Senior Integration Architect for CCG |
| CORE PRINCIPLES | Zero-hallucination rules; security-first requirements |
| ROLE-SPECIFIC GUIDANCE | Audience-aware behaviour (Dev, QA, PO, Business) |
| RESPONSE RULES | Business vs API question routing; silent section omission rule |
| SECURITY | Repeats critical security constraints so they are near the end of the prompt |
| REQUIRED RESPONSE FORMAT | Bold label format (**Summary**), ==highlight== syntax, fenced `bash` code blocks |
Why bold labels instead of headings
The chat widget's MarkdownContent renderer detects bold-only lines (/^\*\*([^*]+?)\*\*[\s:.,]*$/) and styles them as section headers. Markdown ## headings are not used in responses because the Azure OpenAI model was inserting literal ## characters that appeared unstyled in the chat bubble.
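The detection logic can be sketched in a few lines. This is a reconstruction from the regex quoted above, not the actual MarkdownContent source:

```javascript
// Bold-only-line detection, per the pattern quoted above.
// A line qualifies when it is *only* a **bold** span, optionally
// followed by whitespace or light punctuation.
const BOLD_LABEL = /^\*\*([^*]+?)\*\*[\s:.,]*$/;

function isSectionLabel(line) {
  return BOLD_LABEL.test(line.trim());
}
```

Lines such as `**Summary**` or `**Next Steps**:` match; a `**bold**` span followed by prose does not, so inline emphasis is left alone.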
The ==highlight== convention
The system prompt instructs the model to wrap important inline values in double-equals signs:
```
Use ==value== to highlight error codes, status values, field names, and HTTP methods.
Example: ==PAYMENT_FAILED==, ==400==, ==paymentMethodId==
```
ChatBot.js converts ==text== to <mark class="chatbot-highlight"> at render time.
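A sketch of that conversion (the real ChatBot.js implementation may differ, for example in how it escapes HTML; the regex here is an assumption):

```javascript
// Replace ==text== spans with <mark> elements at render time.
// Assumes the input has already been HTML-escaped upstream.
function renderHighlights(text) {
  return text.replace(
    /==([^=]+)==/g,
    '<mark class="chatbot-highlight">$1</mark>'
  );
}
```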
Modifying the System Prompt
The system prompt is a plain string in src/components/ChatBot/aiConfig.js (the systemPrompt constant described above). Edit it directly; no separate build step is required.
After editing:
- Restart api/server.js
- Send a few test queries covering business, integration, and API scenarios
- Verify that sections appear in order and no `##` headings leak into responses
Adding a new audience section
Add a new block under ROLE-SPECIFIC GUIDANCE:
```
For <New Audience>
- <Specific instruction about how to handle responses for this audience>
- <Second instruction>
```
No code changes are needed — the model reads the section from the prompt at inference time.
Tightening hallucination controls
If the model starts inferring undocumented fields, add stronger constraints to CORE PRINCIPLES:
- NEVER infer or suggest fields that are not explicitly listed in the documentation context.
- If a field name does not appear verbatim in the context, do not mention it.
Rate Limiting
The server enforces a sliding-window rate limit per client IP to prevent abuse and keep Azure OpenAI costs bounded.
| Parameter | Value | Config location |
|---|---|---|
| Enabled by default | Yes | server.rateLimit.enabled in config.js |
| Max requests | 5 per minute | server.rateLimit.maxRequests |
| Window | 60 seconds | server.rateLimit.windowMs |
| Override for local dev | RATE_LIMIT_ENABLED=false | Environment variable |
When the limit is exceeded the server returns:
```
HTTP 429 Too Many Requests
Retry-After: 60

{ "error": "Rate limit exceeded. Max 5 requests per minute." }
```
The rate limiter reads the real client IP from X-Forwarded-For (set by nginx) before falling back to req.socket.remoteAddress. This means all requests proxied through nginx share the correct per-user IP, not the nginx pod IP.
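The behaviour described above can be sketched as follows. The names (clientIp, isRateLimited, the hits map) are illustrative, not the actual server.js code; only the limits and the X-Forwarded-For fallback order come from this page:

```javascript
// Sketch of a per-IP sliding-window limiter matching the documented
// behaviour: 5 requests per 60-second window.
const WINDOW_MS = 60_000;
const MAX_REQUESTS = 5;
const hits = new Map(); // ip -> array of request timestamps

function clientIp(req) {
  // X-Forwarded-For may hold a comma-separated chain; the first
  // entry is the original client (proxies append to the right).
  const fwd = req.headers['x-forwarded-for'];
  if (fwd) return fwd.split(',')[0].trim();
  return req.socket.remoteAddress;
}

function isRateLimited(req, now = Date.now()) {
  const ip = clientIp(req);
  // Keep only timestamps inside the sliding window.
  const recent = (hits.get(ip) || []).filter(t => now - t < WINDOW_MS);
  if (recent.length >= MAX_REQUESTS) {
    hits.set(ip, recent);
    return true; // caller responds 429 with Retry-After: 60
  }
  recent.push(now);
  hits.set(ip, recent);
  return false;
}
```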
Disabling for local development
Add RATE_LIMIT_ENABLED=false to your local start command:
```bash
RATE_LIMIT_ENABLED=false \
AZURE_SEARCH_ENDPOINT=https://ccg-docs.search.windows.net \
...
node api/server.js
```
Or set it permanently in a .env-style shell alias for local work.
Response Cache
The server maintains an in-process cache (a plain JavaScript Map) that stores the full { answer, docs } payload keyed by the normalised query string. A cache hit skips both the Azure Search call(s) and the Azure OpenAI call entirely.
| Parameter | Value | Where |
|---|---|---|
| TTL | None — entries live until server restart | server.js |
| Max entries | 500 | CACHE_MAX_SIZE in server.js |
| Eviction policy | Oldest-inserted first | Map insertion order |
| Scope | Single Node process | Lost on restart; not shared across K8s pods |
Because there is no TTL, restart the server after running uploadToAzureSearch.js to prevent users from getting cached answers based on stale docs.
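A sketch of this cache, using Map insertion order for oldest-first eviction. The _getCached / _setCached names are taken from the trade-offs discussion on this page; the body is a reconstruction, not the actual server.js code:

```javascript
// In-process response cache: no TTL, capped at 500 entries,
// oldest-inserted evicted first via Map insertion order.
const CACHE_MAX_SIZE = 500;
const cache = new Map(); // key -> { answer, docs }

function _getCached(key) {
  return cache.get(key); // undefined on miss
}

function _setCached(key, value) {
  if (cache.size >= CACHE_MAX_SIZE) {
    // A Map iterates in insertion order, so the first key is the oldest.
    cache.delete(cache.keys().next().value);
  }
  cache.set(key, value);
}
```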
What it saves
| | Without cache | With cache hit |
|---|---|---|
| Azure Search calls | 1–2 | 0 |
| Azure OpenAI call | 1 | 0 |
| Cost | ~$0.008–0.024 | $0.000 |
| Latency | 2–8 s | < 1 ms |
How the cache key is formed
The key is query.trim().slice(0, 500) — the same normalised string that is sent to Azure Search. Queries that differ only in leading/trailing whitespace will share a cache entry. Case differences will not (e.g. "What is CCG" and "what is CCG" are separate entries). This is intentional — adding .toLowerCase() to the key is a safe future optimisation if needed.
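The key derivation itself is one line; a sketch showing the behaviour described above:

```javascript
// The documented normalisation: trim whitespace, cap at 500 characters.
// Deliberately case-sensitive (no .toLowerCase()).
function makeCacheKey(query) {
  return query.trim().slice(0, 500);
}
```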
Logging
Cache hits are logged as:
```
Cache hit — "what is convenient checkout"
```
Cache misses log the query type as normal:
```
Query type: business/general — "what is convenient checkout"
```
Trade-offs vs Redis
| | In-process Map (current) | Azure Cache for Redis |
|---|---|---|
| Cost | Free | ~$16–55/month |
| Shared across pods | No | Yes |
| Survives restart | No | Yes |
| Setup required | None | Redis client npm package + K8s secret |
For a single-replica internal docs assistant, the in-process cache is the right choice. If the assistant ever scales to multiple replicas, add Redis using the same _getCached / _setCached interface — only the implementation of those two functions changes.
Azure OpenAI Prompt Caching
Prompt caching (available on GPT-4.1 via Azure OpenAI) automatically reuses the KV-cache for the static prefix of the messages array — specifically the system prompt — when:
- The system prompt content is identical between requests
- The deployment supports caching (GPT-4.1 does)
- The cache is warm (within a short time window — typically a few minutes)
When the cache is hit, Azure OpenAI charges at 50% of the normal input token rate for the cached prefix.
What it saves
The system prompt is ~600 tokens. With caching:
| Scenario | Input cost without cache | Input cost with cache | Saving |
|---|---|---|---|
| API query (6 750 total input tokens) | $0.0135 | $0.0129 | ~$0.0006 per query |
| Business query (2 750 total input tokens) | $0.0055 | $0.0049 | ~$0.0006 per query |
At 1 000 queries/day, prompt caching saves roughly $0.60/day on the system prompt alone. The bigger savings come from keeping contextMaxChars minimal (see Token Utilization).
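The arithmetic above can be reproduced directly. This sketch assumes GPT-4.1 input pricing of $2 per 1M tokens (the rate implied by the table) and the 50% cached-token discount:

```javascript
// Reproduce the per-query and per-day savings from the table above.
const INPUT_RATE_PER_TOKEN = 2 / 1_000_000; // assumed: $2 per 1M input tokens
const CACHED_DISCOUNT = 0.5;                // cached prefix billed at 50%
const systemPromptTokens = 600;
const queriesPerDay = 1000;

const savingPerQuery = systemPromptTokens * INPUT_RATE_PER_TOKEN * CACHED_DISCOUNT;
const savingPerDay = savingPerQuery * queriesPerDay;
// savingPerQuery ≈ $0.0006, savingPerDay ≈ $0.60
```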
Does server.js enable it?
Yes — automatically. Azure OpenAI GPT-4.1 enables prompt caching by default for qualifying deployments. No code change is required.
The key condition is that the system prompt must be byte-for-byte identical across requests. In our setup this is guaranteed because systemPrompt is a module-level constant loaded once at startup.
If you restart api/server.js while a high volume of requests is being served, the cache will be cold for the first few requests after restart. This is expected and has negligible impact.
Verifying cache hits
The Azure OpenAI response body includes usage metadata. You can log json.usage in callOpenAI() to inspect whether tokens were served from cache:
```javascript
// In callOpenAI(), after parsing the response:
const json = JSON.parse(data);
if (json.usage) {
  console.log('Token usage:', JSON.stringify(json.usage));
  // Example output:
  // {"prompt_tokens":6750,"completion_tokens":420,
  //  "total_tokens":7170,
  //  "prompt_tokens_details":{"cached_tokens":600,"audio_tokens":0},
  //  "completion_tokens_details":{"reasoning_tokens":0,"audio_tokens":0}}
}
```
prompt_tokens_details.cached_tokens will be 600 (the system prompt length) when the cache is warm.
Adding usage logging temporarily
To add temporary usage logging without changing production behavior, edit callOpenAI() in api/server.js:
```javascript
const json = JSON.parse(data);

// Temporary: log token usage for cost monitoring
if (process.env.LOG_TOKEN_USAGE === 'true' && json.usage) {
  const u = json.usage;
  const cached = u.prompt_tokens_details?.cached_tokens || 0;
  console.log(
    `[tokens] prompt=${u.prompt_tokens} (cached=${cached})` +
    ` completion=${u.completion_tokens} total=${u.total_tokens}`
  );
}
resolve(json.choices?.[0]?.message?.content || '');
```
Enable with LOG_TOKEN_USAGE=true in the environment — no code change needed in production.
Separate System Prompt vs Inline Prompts
The current design keeps the system prompt in a single constant (systemPrompt in src/components/ChatBot/aiConfig.js). An alternative is splitting it into an external .txt or .md file and loading it with fs.readFileSync at startup.
| Approach | Pros | Cons |
|---|---|---|
| Inline constant (current) | Simple; no file I/O; Docker image is self-contained | Slightly harder to diff prompt changes in PRs |
| External .txt file | Easy to diff in PRs; supports comments | One more file to manage; must be included in the Docker image |
| External .md file | Human-readable; previews nicely on GitHub | Markdown syntax in the prompt file may confuse editors |
For this project, the inline constant is the right trade-off. The system prompt is ~50 lines and changes rarely.