AI Search Indexing
This page describes exactly how scripts/ai-search/uploadToAzureSearch.js builds the wallet-docs-index Azure AI Search index — covering which files are parsed, how URLs are constructed, how keywords are generated, and why the index is structured the way it is.
Overview
The upload script runs in 3 sequential steps:
- Step 1: `yarn build` → resolves all OpenAPI `$ref`s into `build/redocusaurus/`
- Step 2: Parse source files → extract markdown docs + OpenAPI spec operations + error code JSONs + UI validation JSONs
- Step 3: Upload → (re)create the index schema, clear old docs, batch-upload new docs
Run it with:
AZURE_SEARCH_ENDPOINT=https://ccg-docs.search.windows.net \
AZURE_SEARCH_API_KEY=<admin-key> \
AZURE_SEARCH_INDEX=wallet-docs-index \
node scripts/ai-search/uploadToAzureSearch.js
Requires an admin key — the script creates the index schema and deletes all existing documents. A query key (used by the runtime server) is read-only and cannot be used here.
Step 1 — Why yarn build Runs First
OpenAPI source files in openapi/v2/ contain $ref pointers to external YAML schema files:
# openapi/v2/apispec.yaml
schema:
$ref: './schemas/PaymentRequest.yaml#/components/schemas/PaymentRequest'
The upload script does not resolve these external files itself. Instead, it reads from build/redocusaurus/ — the fully-bundled YAML files that Docusaurus generates during yarn build. By the time Redocusaurus outputs these files, every $ref is inlined. The built files contain zero external cross-file references.
| Built file | Content | External $ref count |
|---|---|---|
| build/redocusaurus/plugin-redoc-1.yaml | v2 public API | 0 (all inlined) |
| build/redocusaurus/plugin-redoc-2.yaml | v2 Webhooks | 0 (all inlined) |
| build/redocusaurus/plugin-redoc-0.yaml | v1 public API | 0 (all inlined) |
If yarn build fails, the script aborts immediately to prevent uploading stale content from a previous build.
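The "zero external $ref" property of the bundled files can be sanity-checked mechanically. A minimal sketch (the helper name `countExternalRefs` is illustrative and is not part of the upload script):

```javascript
// Counts $ref pointers that target another file. Internal refs ('#/...')
// are expected in a bundled spec; external ones ('./schemas/...') are not.
function countExternalRefs(yamlText) {
  const matches = yamlText.match(/\$ref:\s*['"]?(?!#\/)[^'"\s]+/g);
  return matches ? matches.length : 0;
}
```

A bundled file should always report zero, while the raw `openapi/v2/` sources will not.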
Step 2 — Parsing Source Files
2a. Markdown documents
All .md and .mdx files under docs/ are walked recursively.
Per file, the script:
- Reads the raw file content
- Strips YAML frontmatter (the `---…---` block)
- Extracts the page title — the first `# Heading` line
- Extracts sub-headings — all `##` and `###` lines (used for the `headings` field)
- Strips all remaining markdown syntax (links, bold, tables, code blocks, JSX tags, Docusaurus admonitions) to produce plain `bodyText`
- Truncates `bodyText` to 2,000 chars for the `content` (snippet) field
- Computes the URL (see URL Generation below)
- Computes keywords (see Keyword Extraction below)
Pages without a # Title heading are silently skipped (no indexed entry created).
Release notes pages are also skipped (matched by title or URL pattern).
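The frontmatter, title, and heading extraction described above can be sketched as follows (`parseMarkdownDoc` is an illustrative name, not the script's actual export, and the markdown-stripping and truncation steps are omitted for brevity):

```javascript
// Minimal sketch of the per-file parsing: strip frontmatter, require a
// "# Title" line, and collect ##/### sub-headings for the headings field.
function parseMarkdownDoc(raw) {
  // Strip a leading YAML frontmatter block (--- ... ---)
  const body = raw.replace(/^---\n[\s\S]*?\n---\n/, '');
  // Page title: the first "# Heading" line; pages without one are skipped
  const titleMatch = body.match(/^#\s+(.+)$/m);
  if (!titleMatch) return null;
  // Sub-headings: all ## and ### lines
  const headings = [...body.matchAll(/^#{2,3}\s+(.+)$/gm)].map(m => m[1].trim());
  return { title: titleMatch[1].trim(), headings };
}
```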
2b. OpenAPI spec operations
For each spec file in the SPECS array (v2 → webhooks → v1):
- The bundled YAML is loaded from `build/redocusaurus/`
- The script detects whether it is a webhook spec (`webhooks:` top-level key) or a regular API spec (`paths:` key)
- For each path/event and each HTTP method (or `post` only for webhooks), one indexed entry is created
Per operation, the snippet includes (in order):
| Part | Format | Example |
|---|---|---|
| Endpoint line | Endpoint: METHOD BASE_URL/path | Endpoint: POST https://api.healthsafepay.com/v2/sessions |
| Summary | plain text | Create a session |
| Description | truncated to 500 chars | full operation description |
| Parameters | name [in]: type (required) — description | merchantId [header]: string (required) |
| Request body | full schema tree (see below) | amount: integer (required) — Payment amount in cents |
| Responses | code: description per status | 200: Session created |
| curl skeleton | ready-to-copy curl command | curl -X POST ... |
The full snippet is capped at 8,000 chars per operation.
URL Generation
Markdown pages
URL is derived from the file path relative to docs/. The conversion rules are:
- If the frontmatter has a `slug:` field, that value is used directly (prefixed with `/docs`)
- Otherwise, each path segment has its numeric prefix stripped (`01-`, `03-`, etc.) and is lowercased with underscores converted to hyphens
- `index` filename segments are dropped (the directory becomes the URL)
docs/03-developers/1-Getting-Started/overview.md
→ /docs/developers/getting-started/overview
docs/01-business/3-Core-Capabilities/payments.md
→ /docs/business/core-capabilities/payments
docs/03-developers/5-convenient-checkout-api/index.md
→ /docs/developers/convenient-checkout-api
These URLs match the Docusaurus-generated site URLs exactly, so Related Pages links in the chat widget navigate to the correct pages.
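The path-to-URL conversion rules above can be sketched as a small function (`docUrlFromPath` is an illustrative name; the frontmatter `slug:` override is omitted here):

```javascript
// Converts a path under docs/ into its Docusaurus URL: strip the extension,
// drop numeric prefixes, lowercase, convert underscores, drop index segments.
function docUrlFromPath(relPath) {
  const segments = relPath.replace(/\.mdx?$/, '').split('/').slice(1); // drop leading "docs"
  const cleaned = segments
    .map(s => s.replace(/^\d+-/, '').toLowerCase().replace(/_/g, '-'))
    .filter(s => s !== 'index'); // index files resolve to their directory
  return '/docs/' + cleaned.join('/');
}
```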
API spec operations
URL format: {apiRefRoute}#tag/{TagSlug}/operation/{operationId}
| Spec | apiRefRoute | Example URL |
|---|---|---|
| v2 public | /api-reference-v2/ | /api-reference-v2/#tag/Payments/operation/createSession |
| v2 webhooks | /webhooks-v2/ | /webhooks-v2/#tag/Webhooks/operation/paymentFailed |
| v1 public | /api-reference/ | /api-reference/#tag/Payments/operation/createPayment |
- `TagSlug` — first tag on the operation, percent-encoded (`encodeURIComponent`)
- `operationId` — taken directly from the spec; falls back to `slugify(METHOD-/path)` if absent
These match the anchor URLs that Redocusaurus generates for each operation in the rendered API reference pages.
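A sketch of the anchor URL construction, assuming a simple slugify for the fallback case (the script's actual slugify may differ in detail; `operationUrl` is an illustrative name):

```javascript
// Builds {apiRefRoute}#tag/{TagSlug}/operation/{operationId} for one operation.
function operationUrl(apiRefRoute, op, method, path) {
  const tag = encodeURIComponent((op.tags && op.tags[0]) || 'default');
  // operationId from the spec, else a slug built from the method and path
  const opId = op.operationId ||
    `${method}-${path}`.toLowerCase().replace(/[^a-z0-9]+/g, '-').replace(/^-|-$/g, '');
  return `${apiRefRoute}#tag/${tag}/operation/${opId}`;
}
```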
Request Body Schema Expansion
Schemas are expanded inline using schemaToText() — a recursive function that resolves $ref chains and produces a human-readable field listing:
amount: integer (required) — Payment amount in cents
currency: string — ISO 4217 currency code
metadata: object — Key-value pairs containing payment metadata
key: string
value: string
customer: object — The customer object for authenticated flows
hsid: string — HealthSafeId of the customer
enterpriseId: string
metadata: object
agent: object — Information about the agent submitting on behalf of a customer
firstName: string
lastName: string
msid: string
refundAllocations: array of:
paymentMethodId: string (required)
amount: integer (required)
Circular reference guard: `visited` is a Set of already-seen `$ref` keys. If a `$ref` is encountered that is already in `visited`, the text `(circular)` is emitted and recursion stops.
Composition handling: all `oneOf` / `anyOf` / `allOf` variants are expanded as `variant N:` blocks.
Because the 8,000-char snippet can still truncate deeply nested schemas, all field names are also separately indexed as keywords (see below), ensuring they are always searchable.
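A trimmed-down sketch of this recursion, assuming `defs` maps schema names to schema objects (this omits the oneOf/anyOf/allOf variant blocks and other details of the real `schemaToText`):

```javascript
// Expands a schema into an indented field listing, emitting "(circular)"
// when a $ref has already been visited on this expansion path.
function schemaToText(schema, defs, visited = new Set(), indent = '') {
  if (schema.$ref) {
    if (visited.has(schema.$ref)) return indent + '(circular)\n';
    visited.add(schema.$ref);
    const name = schema.$ref.split('/').pop();
    return schemaToText(defs[name], defs, visited, indent);
  }
  let out = '';
  const required = new Set(schema.required || []);
  for (const [field, sub] of Object.entries(schema.properties || {})) {
    const req = required.has(field) ? ' (required)' : '';
    const desc = sub.description ? ` — ${sub.description}` : '';
    out += `${indent}${field}: ${sub.type || 'object'}${req}${desc}\n`;
    if (sub.properties || sub.$ref) {
      out += schemaToText(sub, defs, visited, indent + '  ');
    }
  }
  return out;
}
```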
Keyword Extraction
Keywords are stored in the summary field and used by Azure AI Search as prioritized keyword fields in the semantic configuration. There are two separate extraction paths:
For markdown docs — extractKeywords()
Combines title + headings + bodyText, lowercases everything, extracts all words of 3+ chars, removes a fixed stop-word list (common English words like "the", "and", "for"), and returns the top 15 by frequency.
Title: "Create a Session"
Headings: ["Prerequisites", "Request Body", "Response Fields"]
Body: "A session is required before submitting a payment..."
→ keywords: ["session", "payment", "required", "request", "create", ...]
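The extraction above amounts to a word-frequency count over the combined text. A minimal sketch, with an abbreviated stop-word list (the script's actual list is longer):

```javascript
// Lowercases title + headings + body, keeps words of 3+ chars, drops
// stop words, and returns the most frequent terms.
function extractKeywords(title, headings, bodyText, limit = 15) {
  const STOP = new Set(['the', 'and', 'for', 'with', 'that', 'this', 'are', 'you']);
  const counts = new Map();
  const text = [title, ...headings, bodyText].join(' ').toLowerCase();
  for (const word of text.match(/[a-z][a-z0-9]{2,}/g) || []) {
    if (!STOP.has(word)) counts.set(word, (counts.get(word) || 0) + 1);
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit)
    .map(([word]) => word);
}
```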
For API spec operations — field names + operation metadata
Keywords for API entries are built from a union of:
| Source | Example values |
|---|---|
| HTTP method | POST, GET |
| Path segments | sessions, payments, refunds (split on /, _, {}) |
| Operation tags | Payments, Merchant |
| Summary words | create, session, refund |
| operationId | createSession, getPaymentById |
| All request body field names (recursive) | metadata, customer, agent, refundAllocations, paymentMethodId, amount, hsid, enterpriseId |
The last entry — recursive field name collection — is done by collectSchemaFieldNames(). It walks the entire request body schema tree up to 5 levels deep and gathers every property name. This guarantees that a user asking "what is the metadata field?" or "how do I send agent information?" will match the correct operation's indexed entry even if those field names appear past the 8,000-char snippet cutoff.
Keywords are capped at 40 entries per operation.
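The recursive field-name walk can be sketched as follows, assuming `defs` maps schema names to schema objects (illustrative only; the real `collectSchemaFieldNames` may differ in how it resolves `$ref`s):

```javascript
// Gathers every property name in a request body schema tree, up to 5 levels
// deep, so nested field names stay searchable past the snippet cutoff.
function collectSchemaFieldNames(schema, defs, depth = 0, names = new Set()) {
  if (!schema || depth > 5) return names;
  if (schema.$ref) {
    const target = defs[schema.$ref.split('/').pop()];
    return collectSchemaFieldNames(target, defs, depth, names);
  }
  for (const [field, sub] of Object.entries(schema.properties || {})) {
    names.add(field);
    collectSchemaFieldNames(sub, defs, depth + 1, names);
  }
  if (schema.items) collectSchemaFieldNames(schema.items, defs, depth + 1, names);
  return names;
}
```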
Step 2c — API Error Code JSON files
The loadErrorCodeEntries() function reads all *-api.json files from docs/03-developers/5-convenient-checkout-api/4-error-codes/. One indexed entry is created per individual error code.
File structure
Each *-api.json file has this shape:
{
"apiName": "Payment API",
"apiVersion": "v2",
"basePath": "/v2/payments",
"searchKeywords": ["error code", "api error", "payment error", "troubleshooting", "http status", "validation error", "error message", "error handling"],
"categories": [
{
"name": "Authorization error",
"errors": [
{
"id": "v2-payment-merchant-not-linked",
"title": "FORBIDDEN",
"httpStatus": "403",
"message": "403 FORBIDDEN. RequestId: ${x-request-id}",
"scenario": ["..."],
"resolution": "...",
"description": "..."
}
]
}
]
}
The searchKeywords field
Each JSON file contains a root-level searchKeywords array. These terms are placed first in the summary field (the prioritizedKeywordsFields slot in the semantic configuration), giving them the highest ranking weight in Azure AI Search.
This ensures that any user query containing words like "error", "troubleshooting", "http status", or "validation error" will reliably surface error-code entries above general documentation pages.
To add or adjust search priority terms for an API's errors, edit its searchKeywords array:
// docs/03-developers/5-convenient-checkout-api/4-error-codes/payment-api.json
"searchKeywords": ["error code", "api error", "payment error", "troubleshooting", "http status", "validation error", "error message", "error handling"]
How each error entry is built
| Field | Value |
|---|---|
| title | [Error Code] {error.title} ({apiName}) |
| snippet | [Error Code] {apiName} — {category} \| HTTP {status}: {title} \| {message} \| {scenario} \| {resolution} |
| headings | [apiName, category, error.title, "error code", "troubleshooting"] |
| keywords | searchKeywords from file + ["error", "error code", "troubleshooting", apiName, httpStatus, id, category] |
The [Error Code] prefix in both title and snippet gives semantic re-ranking an additional signal that these entries are specifically about error handling.
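Following the field mapping above, one error code becomes an indexed entry roughly like this (`buildErrorEntry` is an illustrative name, not the script's actual function):

```javascript
// Turns a single error object (from a *-api.json file) into an index entry.
function buildErrorEntry(apiName, category, error, searchKeywords) {
  return {
    title: `[Error Code] ${error.title} (${apiName})`,
    snippet: [
      `[Error Code] ${apiName} — ${category}`,
      `HTTP ${error.httpStatus}: ${error.title}`,
      error.message,
      (error.scenario || []).join(' '),
      error.resolution,
    ].filter(Boolean).join(' | '),
    headings: [apiName, category, error.title, 'error code', 'troubleshooting'],
    // searchKeywords from the file come first, then the baseline terms
    keywords: [...searchKeywords, 'error', 'error code', 'troubleshooting',
               apiName, error.httpStatus, error.id, category],
  };
}
```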
Step 2d — UI Validation Error JSON files
The loadUiErrorEntries() function recursively walks docs/03-developers/4-convenient-checkout-ui/13-Error-Messages/ and reads all *-validation-data.json and *-form-validation-data.json files. One indexed entry is created per file.
Why JSON files are not modified
These files are imported directly by MDX pages as plain arrays:
import cardValidationData from './card-validation-data.json';
Adding a root-level searchKeywords field would change the array structure and break those imports. All keyword enrichment is therefore derived from the folder path and filename entirely within the upload script.
How context is extracted
The walker passes folderContext down through each directory level, accumulating the folder names:
13-Error-Messages/
├── Wallet Mode/ ← folderContext = "Wallet Mode"
│ └── Payment Method/ ← folderContext = "Wallet Mode Payment Method"
│ └── card-validation-data.json
└── Payment Mode/ ← folderContext = "Payment Mode"
└── card-validation-data.json
The context words (wallet, mode, payment, method) are extracted from folderContext and added to keywords. The payment type (card, ach, telephonic) is derived from the filename.
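A minimal sketch of that derivation (`contextWords` is a hypothetical helper name, not the script's actual function):

```javascript
// Derives keyword material from the accumulated folder context and the
// filename's payment-type prefix (card / ach / telephonic).
function deriveUiErrorKeywords(folderContext, fileName) {
  const contextWords = folderContext.toLowerCase().split(/\s+/).filter(Boolean);
  const typeMatch = fileName.match(/^(card|ach|telephonic)/);
  return { contextWords, paymentType: typeMatch ? typeMatch[1] : null };
}
```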
How each UI error entry is built
| Field | Value |
|---|---|
| title | [UI Error] {label} — {folderContext} |
| snippet | [UI Validation Error] {folderContext} \| Field: {fieldName} \| {validationRules} \| {errorMessages} |
| headings | [label, "Validation", "Error Messages", "UI error", folderContext, ...paymentTypes] |
| keywords | ["error", "validation error", "ui error", "error message", "troubleshooting", "form validation" or "field validation", ...paymentTypes, ...contextWords, ...labelWords] |
The [UI Validation Error] prefix in the snippet and "ui error" / "validation error" baseline keywords ensure these entries rank highly for queries about checkout widget form errors, field validation messages, and UI error states.
Index Schema
The wallet-docs-index schema is defined in ensureIndex() and uses a PUT /indexes/{name} call with allowIndexDowntime=false.
| Field | Type | Searchable | Retrievable | Notes |
|---|---|---|---|---|
| id | Edm.String (key) | No | Yes | Base64-encoded URL, URL-safe characters only ([^a-zA-Z0-9_-] → _) |
| title | Edm.String | Yes | Yes | Page title or operation summary |
| file_path | Edm.String | No | Yes | Relative URL — used as the link in the chat widget |
| file_name | Edm.String | No | Yes | Last segment of the URL |
| content | Edm.String | Yes | Yes | Truncated snippet (2,000 chars for docs, 8,000 chars for specs) |
| headings | Edm.String | Yes | Yes | Pipe-separated list of ##/### headings |
| summary | Edm.String | Yes | Yes | Top keywords (comma-separated; up to 40 for specs, 15 for docs) |
| section | Edm.String | Yes | Yes | Second URL segment (e.g. developers, business) |
| category | Edm.String | Yes | Yes | First URL segment — used for the two-pass search filter |
| last_modified | Edm.String | No | Yes | ISO timestamp of the upload run |
Semantic configuration
{
"name": "default",
"prioritizedFields": {
"titleField": { "fieldName": "title" },
"prioritizedContentFields": [{ "fieldName": "content" }],
"prioritizedKeywordsFields":[{ "fieldName": "summary" }]
}
}
Azure AI Search uses this configuration to apply L2 semantic re-ranking — title gets the highest semantic weight, content is the primary text, and summary (keywords) boost recall for exact field-name or endpoint matches.
The category Field and Two-Pass Search
The category value is always the first segment of the URL:
| Category value | Documents |
|---|---|
| docs | All markdown pages |
| api-reference-v2 | v2 API operations |
| webhooks-v2 | v2 webhook events |
| api-reference | v1 API operations |
The runtime server (api/server.js) uses this field in a two-pass parallel search:
- Pass 1 — `filter: "category ne 'docs'"` + `top: 3` — guarantees at least 3 spec/webhook entries reach the OpenAI context window regardless of how prose docs rank in the overall semantic score
- Pass 2 — no filter + `top: 8` — captures the best-matching markdown docs
The passes run in parallel and their results are merged (spec entries first, v2/webhooks before v1, deduplicated by file_path).
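The merge step can be sketched as follows; the two searches themselves would be Azure Search REST calls run via `Promise.all`, and this sketch shows only the ordering and deduplication logic (`mergeResults` is an illustrative name):

```javascript
// Merges the two result lists: spec entries first, deduplicated by file_path.
function mergeResults(specResults, docResults) {
  const seen = new Set();
  const merged = [];
  for (const r of [...specResults, ...docResults]) {
    if (!seen.has(r.file_path)) {
      seen.add(r.file_path);
      merged.push(r);
    }
  }
  return merged;
}
```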
Document ID Generation
id: Buffer.from(e.url).toString('base64').replace(/[^a-zA-Z0-9_-]/g, '_')
Azure AI Search key fields cannot contain /, +, or =, all of which can appear in standard Base64 output. The URL is therefore Base64-encoded and every character outside [a-zA-Z0-9_-] is replaced with _. This produces a stable, deterministic ID: re-uploading the same URL always yields the same key, enabling mergeOrUpload semantics.
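Wrapped as a function, the same transformation can be exercised directly to confirm the determinism and URL-safe character set (`docId` is an illustrative wrapper name):

```javascript
// Stable document key: Base64 of the URL with unsafe characters mapped to _.
function docId(url) {
  return Buffer.from(url).toString('base64').replace(/[^a-zA-Z0-9_-]/g, '_');
}
```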
Upload Mechanics
- Action: `mergeOrUpload` — creates the document if it does not exist, updates it if it does (matched by `id`)
- Batch size: 500 documents per POST request to the Azure Search batch endpoint
- Pre-upload: all existing documents are deleted with `deleteAllDocs()` before uploading, so stale entries (renamed or deleted pages) do not persist in the index
- Failure handling: any failed documents within a batch are logged; the script exits with code 1 if any failures occurred