Version: v2

Prompt Techniques & Caching

This page covers the structure of the system prompt, how to modify it, how prompt caching works on Azure OpenAI, and how to estimate the savings it provides.


System Prompt Overview

The system prompt lives in src/components/ChatBot/aiConfig.js as the systemPrompt constant. It is a template literal string — plain text, no external file, no build step required.

src/components/ChatBot/aiConfig.js
└── systemPrompt (imported by walletAIService.js)
    └── sent as system_prompt field in every POST /api/ask payload

Every request to POST /api/ask includes the full system prompt in the request body:

{
  "query": "...",
  "session_id": "<uuid>",
  "system_prompt": "<systemPrompt>",
  "search_config": { "index_name": "wallet-docs-index-v2", "top": 5 }
}
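For illustration, the client-side call that sends this payload might look like the following sketch (function names are hypothetical; the real request is built in walletAIService.js):

```javascript
// Hypothetical helper mirroring the payload shape above.
function buildAskPayload(query, sessionId, systemPrompt) {
  return {
    query,
    session_id: sessionId,
    system_prompt: systemPrompt,
    search_config: { index_name: 'wallet-docs-index-v2', top: 5 },
  };
}

// Sketch of the POST /api/ask call.
async function ask(query, sessionId, systemPrompt) {
  const res = await fetch('/api/ask', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(buildAskPayload(query, sessionId, systemPrompt)),
  });
  if (!res.ok) throw new Error(`/api/ask failed: ${res.status}`);
  return res.json(); // { answer, docs }
}
```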

The system prompt is stateless — it does not change between requests. This is what makes prompt caching effective (see below).


Prompt Structure

The system prompt is divided into six sections:

| Section | Purpose |
|---|---|
| ROLE | Establishes model identity as Senior Integration Architect for CCG |
| CORE PRINCIPLES | Zero-hallucination rules; security-first requirements |
| ROLE-SPECIFIC GUIDANCE | Audience-aware behaviour (Dev, QA, PO, Business) |
| SECURITY | Repeats critical security constraints so they are near the end of the prompt |
| RESPONSE RULES | Business vs API question routing; silent section omission rule |
| REQUIRED RESPONSE FORMAT | Bold label format (**Summary**), ==highlight== syntax, fenced `bash` code blocks |

Why bold labels instead of headings

The chat widget's MarkdownContent renderer detects bold-only lines (/^\*\*([^*]+?)\*\*[\s:.,]*$/) and styles them as section headers. Markdown ## headings are not used in responses because the Azure OpenAI model was inserting literal ## characters that appeared unstyled in the chat bubble.

The ==highlight== convention

The system prompt instructs the model to wrap important inline values in double-equals signs:

Use ==value== to highlight error codes, status values, field names, and HTTP methods.
Example: ==PAYMENT_FAILED==, ==400==, ==paymentMethodId==

ChatBot.js converts ==text== to <mark class="chatbot-highlight"> at render time.
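A minimal sketch of how these two render-time conventions can be applied (illustrative only; the real logic lives in MarkdownContent and ChatBot.js):

```javascript
// Bold-only lines are styled as section headers (regex from MarkdownContent).
const BOLD_LABEL = /^\*\*([^*]+?)\*\*[\s:.,]*$/;

function isBoldLabel(line) {
  return BOLD_LABEL.test(line.trim());
}

// Convert ==text== spans to <mark> elements at render time.
function renderHighlights(text) {
  return text.replace(/==([^=]+)==/g, '<mark class="chatbot-highlight">$1</mark>');
}
```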


Modifying the System Prompt

The system prompt is a plain string constant in src/components/ChatBot/aiConfig.js (see System Prompt Overview). Edit it directly; no separate build step is required.

After editing:

  1. Restart api/server.js so the in-process response cache cannot serve answers generated with the old prompt
  2. Send a few test queries covering business, integration, and API scenarios
  3. Verify that sections appear in order and no ## headings leak into responses

Adding a new audience section

Add a new block under ROLE-SPECIFIC GUIDANCE:

For <New Audience>
- <Specific instruction about how to handle responses for this audience>
- <Second instruction>

No code changes are needed — the model reads the section from the prompt at inference time.

Tightening hallucination controls

If the model starts inferring undocumented fields, add stronger constraints to CORE PRINCIPLES:

- NEVER infer or suggest fields that are not explicitly listed in the documentation context.
- If a field name does not appear verbatim in the context, do not mention it.

Rate Limiting

The server enforces a sliding-window rate limit per client IP to prevent abuse and keep Azure OpenAI costs bounded.

| Parameter | Value | Config location |
|---|---|---|
| Enabled by default | Yes | server.rateLimit.enabled in config.js |
| Max requests | 5 per minute | server.rateLimit.maxRequests |
| Window | 60 seconds | server.rateLimit.windowMs |
| Override for local dev | RATE_LIMIT_ENABLED=false | Environment variable |

When the limit is exceeded the server returns:

HTTP 429 Too Many Requests
Retry-After: 60
{ "error": "Rate limit exceeded. Max 5 requests per minute." }

The rate limiter reads the real client IP from X-Forwarded-For (set by nginx) before falling back to req.socket.remoteAddress. Requests proxied through nginx are therefore limited per end-user IP rather than all counting against the nginx pod's IP.
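The behaviour above can be sketched as follows (an illustrative shape using the default limits, not the actual api/server.js implementation):

```javascript
// Illustrative sliding-window limiter.
const WINDOW_MS = 60_000;  // server.rateLimit.windowMs
const MAX_REQUESTS = 5;    // server.rateLimit.maxRequests
const hits = new Map();    // ip -> timestamps of recent requests

function isRateLimited(ip, now = Date.now()) {
  // Keep only requests inside the sliding window.
  const recent = (hits.get(ip) || []).filter((t) => now - t < WINDOW_MS);
  if (recent.length >= MAX_REQUESTS) {
    hits.set(ip, recent);
    return true; // caller responds 429 with Retry-After: 60
  }
  recent.push(now);
  hits.set(ip, recent);
  return false;
}

// Prefer the first X-Forwarded-For entry (set by nginx), then the socket address.
function clientIp(req) {
  const xff = req.headers['x-forwarded-for'];
  return xff ? xff.split(',')[0].trim() : req.socket.remoteAddress;
}
```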

Disabling for local development

Add RATE_LIMIT_ENABLED=false to your local start command:

RATE_LIMIT_ENABLED=false \
AZURE_SEARCH_ENDPOINT=https://ccg-docs.search.windows.net \
...
node api/server.js

Or set it permanently in a .env-style shell alias for local work.


Response Cache

The server maintains an in-process cache (a plain Map with no TTL) that stores the full { answer, docs } payload keyed by the normalised query string. A cache hit skips both the Azure Search call(s) and the Azure OpenAI call entirely.

| Parameter | Value | Where |
|---|---|---|
| TTL | None; entries live until server restart | server.js |
| Max entries | 500 | CACHE_MAX_SIZE in server.js |
| Eviction policy | Oldest-inserted first | Map insertion order |
| Scope | Single Node process | Lost on restart; not shared across K8s pods |
caution

Because there is no TTL, restart the server after running uploadToAzureSearch.js to prevent users getting cached answers based on old docs.

What it saves

| | Without cache | With cache hit |
|---|---|---|
| Azure Search calls | 1–2 | 0 |
| Azure OpenAI call | 1 | 0 |
| Cost | ~$0.008–0.024 | $0.000 |
| Latency | 2–8 s | < 1 ms |

How the cache key is formed

The key is query.trim().slice(0, 500) — the same normalised string that is sent to Azure Search. Queries that differ only in leading/trailing whitespace therefore share a cache entry, but queries that differ in case do not (e.g. "What is CCG" and "what is CCG" are separate entries). This is intentional; adding .toLowerCase() to the key is a safe future optimisation if needed.
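Under those rules, a minimal sketch of the get/set pair might look like this (an assumed shape; the real implementation lives in server.js):

```javascript
const CACHE_MAX_SIZE = 500;
// A Map preserves insertion order, which doubles as the eviction order.
const cache = new Map();

const cacheKey = (query) => query.trim().slice(0, 500);

function _getCached(query) {
  return cache.get(cacheKey(query)); // undefined on a miss
}

function _setCached(query, payload) {
  if (cache.size >= CACHE_MAX_SIZE) {
    // Evict the oldest-inserted entry.
    cache.delete(cache.keys().next().value);
  }
  cache.set(cacheKey(query), payload);
}
```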

Logging

Cache hits are logged as:

Cache hit — "what is convenient checkout"

Cache misses log the query type as normal:

Query type: business/general — "what is convenient checkout"

Trade-offs vs Redis

| | In-process Map (current) | Azure Cache for Redis |
|---|---|---|
| Cost | Free | ~$16–55/month |
| Shared across pods | No | Yes |
| Survives restart | No | Yes |
| Setup required | None | Redis client npm package + K8s secret |

For a single-replica internal docs assistant, the in-process cache is the right choice. If the assistant ever scales to multiple replicas, add Redis using the same _getCached / _setCached interface — only the implementation of those two functions changes.

Azure OpenAI Prompt Caching

Prompt caching (available on GPT-4.1 via Azure OpenAI) automatically reuses the KV-cache for the static prefix of the messages array — specifically the system prompt — when:

  1. The system prompt content is identical between requests
  2. The deployment supports caching (GPT-4.1 does)
  3. The cache is warm (within a short time window — typically a few minutes)

When the cache is hit, Azure OpenAI charges at 50% of the normal input token rate for the cached prefix.

What it saves

The system prompt is ~600 tokens. With caching:

| Scenario | Input cost without cache | Input cost with cache | Saving |
|---|---|---|---|
| API query (6,750 total input tokens) | $0.0135 | $0.0129 | ~$0.0006 per query |
| Business query (2,750 total input tokens) | $0.0055 | $0.0049 | ~$0.0006 per query |

At 1 000 queries/day, prompt caching saves roughly $0.60/day on the system prompt alone. The bigger savings come from keeping contextMaxChars minimal (see Token Utilization).
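The per-query numbers above can be sanity-checked with a small calculation, assuming GPT-4.1 input pricing of $2 per 1M tokens (an assumption — verify against current Azure pricing) and the 50% cached-token discount:

```javascript
// Assumed pricing: $2 per 1M input tokens; cached prefix billed at 50%.
const INPUT_PER_TOKEN = 2 / 1e6;
const CACHED_PER_TOKEN = INPUT_PER_TOKEN / 2;

// Total input cost when `cachedTokens` of `totalTokens` are served from cache.
function inputCost(totalTokens, cachedTokens = 0) {
  return (totalTokens - cachedTokens) * INPUT_PER_TOKEN
       + cachedTokens * CACHED_PER_TOKEN;
}

const perQuerySaving = inputCost(6750) - inputCost(6750, 600); // ≈ $0.0006
const dailySaving = perQuerySaving * 1000;                     // ≈ $0.60/day
```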

Does server.js enable it?

Yes — automatically. Azure OpenAI GPT-4.1 enables prompt caching by default for qualifying deployments. No code change is required.

The key condition is that the system prompt must be byte-for-byte identical across requests. In our setup this is guaranteed because systemPrompt is a module-level constant loaded once at startup.

caution

If you restart api/server.js while a high volume of requests is being served, the cache will be cold for the first few requests after restart. This is expected and has negligible impact.

Verifying cache hits

The Azure OpenAI response body includes usage metadata. You can log json.usage in callOpenAI() to inspect whether tokens were served from cache:

// In callOpenAI(), after parsing the response:
const json = JSON.parse(data);
if (json.usage) {
  console.log('Token usage:', JSON.stringify(json.usage));
  // Example output:
  // {"prompt_tokens":6750,"completion_tokens":420,
  //  "total_tokens":7170,
  //  "prompt_tokens_details":{"cached_tokens":600,"audio_tokens":0},
  //  "completion_tokens_details":{"reasoning_tokens":0,"audio_tokens":0}}
}

prompt_tokens_details.cached_tokens reports how many prompt tokens were served from cache — roughly the system prompt length when the cache is warm.

Adding usage logging temporarily

To add temporary usage logging without changing production behavior, edit callOpenAI() in api/server.js:

const json = JSON.parse(data);
// Temporary: log token usage for cost monitoring
if (process.env.LOG_TOKEN_USAGE === 'true' && json.usage) {
  const u = json.usage;
  const cached = u.prompt_tokens_details?.cached_tokens || 0;
  console.log(
    `[tokens] prompt=${u.prompt_tokens} (cached=${cached})` +
    ` completion=${u.completion_tokens} total=${u.total_tokens}`
  );
}
resolve(json.choices?.[0]?.message?.content || '');

Enable with LOG_TOKEN_USAGE=true in the environment — no code change needed in production.


Separate System Prompt vs Inline Prompts

The current design keeps the system prompt in a single constant in src/components/ChatBot/aiConfig.js. An alternative is splitting it into an external .txt or .md file and loading it with fs.readFileSync at startup.

| Approach | Pros | Cons |
|---|---|---|
| Inline constant (current) | Simple; no file I/O; Docker image is self-contained | Slightly harder to diff prompt changes in PRs |
| External .txt file | Easy to diff in PRs; supports comments | Adds one file to manage; must be included in Docker image |
| External .md file | Human-readable; previews in GitHub | Markdown syntax in the prompt file may confuse editors |

For this project, the inline constant is the right trade-off. The system prompt is ~50 lines and changes rarely.