Version: v2

AI Search Indexing

This page describes exactly how scripts/ai-search/uploadToAzureSearch.js builds the wallet-docs-index Azure AI Search index — covering which files are parsed, how URLs are constructed, how keywords are generated, and why the index is structured the way it is.


Overview

The upload script runs in three sequential steps:

Step 1 — yarn build → resolves all OpenAPI $refs into build/redocusaurus/
Step 2 — Parse source files → extract markdown docs, OpenAPI spec operations, error code JSONs, and UI validation JSONs
Step 3 — Upload → (re)create the index schema, clear old docs, batch-upload new docs

Run it with:

AZURE_SEARCH_ENDPOINT=https://ccg-docs.search.windows.net \
AZURE_SEARCH_API_KEY=<admin-key> \
AZURE_SEARCH_INDEX=wallet-docs-index \
node scripts/ai-search/uploadToAzureSearch.js
caution

Requires an admin key — the script creates the index schema and deletes all existing documents. A query key (used by the runtime server) is read-only and cannot be used here.


Step 1 — Why yarn build Runs First

OpenAPI source files in openapi/v2/ contain $ref pointers to external YAML schema files:

# openapi/v2/apispec.yaml
schema:
$ref: './schemas/PaymentRequest.yaml#/components/schemas/PaymentRequest'

The upload script does not resolve these external files itself. Instead, it reads from build/redocusaurus/ — the fully-bundled YAML files that Docusaurus generates during yarn build. By the time Redocusaurus outputs these files, every $ref is inlined. The built files contain zero external cross-file references.

| Built file | Content | External $ref count |
| --- | --- | --- |
| build/redocusaurus/plugin-redoc-1.yaml | v2 public API | 0 (all inlined) |
| build/redocusaurus/plugin-redoc-2.yaml | v2 Webhooks | 0 (all inlined) |
| build/redocusaurus/plugin-redoc-0.yaml | v1 public API | 0 (all inlined) |

If yarn build fails, the script aborts immediately to prevent uploading stale content from a previous build.
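The "zero external $refs" invariant can be checked mechanically. A minimal sketch, assuming the bundled YAML has already been parsed into a plain object (countExternalRefs is a hypothetical helper, not part of the script):

```javascript
// Hypothetical helper: count cross-file $ref pointers in a parsed spec object.
// A local $ref starts with '#'; anything else points at an external file.
function countExternalRefs(node) {
  if (node === null || typeof node !== 'object') return 0;
  let count = 0;
  for (const [key, value] of Object.entries(node)) {
    if (key === '$ref' && typeof value === 'string' && !value.startsWith('#')) count += 1;
    count += countExternalRefs(value); // recurse into nested objects and arrays
  }
  return count;
}
```

Running this over a file in build/redocusaurus/ should always return 0; a non-zero count would indicate the bundle step did not fully inline the schemas.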


Step 2 — Parsing Source Files

2a. Markdown documents

All .md and .mdx files under docs/ are walked recursively.

Per file, the script:

  1. Reads the raw file content
  2. Strips YAML frontmatter (---…--- block)
  3. Extracts the page title — first # Heading line
  4. Extracts sub-headings — all ## and ### lines (used for the headings field)
  5. Strips all remaining markdown syntax (links, bold, tables, code blocks, JSX tags, Docusaurus admonitions) to produce plain bodyText
  6. Truncates bodyText to 2,000 chars for the content (snippet) field
  7. Computes the URL (see URL Generation below)
  8. Computes keywords (see Keyword Extraction below)

Pages without a # Title heading are silently skipped (no indexed entry created).

Release notes pages are also skipped (matched by title or URL pattern).
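The per-file steps above can be sketched as follows (a simplified illustration, not the script's actual implementation — the real markdown stripping handles more syntax, such as JSX tags and admonitions):

```javascript
// Simplified sketch of the per-file markdown parsing steps.
function parseMarkdownDoc(raw) {
  // 1–2. Strip YAML frontmatter (---…--- block)
  const body = raw.replace(/^---\r?\n[\s\S]*?\r?\n---\r?\n/, '');
  // 3. Page title: first "# Heading" line; pages without one are skipped
  const titleMatch = body.match(/^#\s+(.+)$/m);
  if (!titleMatch) return null;
  // 4. Sub-headings: all ## and ### lines
  const headings = [...body.matchAll(/^#{2,3}\s+(.+)$/gm)].map((m) => m[1]);
  // 5–6. Strip remaining markdown syntax and truncate to 2,000 chars
  const bodyText = body
    .replace(/```[\s\S]*?```/g, ' ')           // code blocks
    .replace(/\[([^\]]*)\]\([^)]*\)/g, '$1')   // links → link text
    .replace(/[#*_`>|]/g, ' ')                 // residual markdown characters
    .replace(/\s+/g, ' ')
    .trim()
    .slice(0, 2000);
  return { title: titleMatch[1], headings, bodyText };
}
```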

2b. OpenAPI spec operations

For each spec file in the SPECS array (v2 → webhooks → v1):

  1. The bundled YAML is loaded from build/redocusaurus/
  2. The script detects whether it is a webhook spec (webhooks: top-level key) or a regular API spec (paths: key)
  3. For each path/event and each HTTP method (or post only for webhooks), one indexed entry is created

Per operation, the snippet includes (in order):

| Part | Format | Example |
| --- | --- | --- |
| Endpoint line | Endpoint: METHOD BASE_URL/path | Endpoint: POST https://api.healthsafepay.com/v2/sessions |
| Summary | plain text | Create a session |
| Description | truncated to 500 chars | full operation description |
| Parameters | name [in]: type (required) — description | merchantId [header]: string (required) |
| Request body | full schema tree (see below) | amount: integer (required) — Payment amount in cents |
| Responses | code: description per status | 200: Session created |
| curl skeleton | ready-to-copy curl command | curl -X POST ... |

The full snippet is capped at 8,000 chars per operation.
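The assembly of those parts can be sketched as below. This is a hedged illustration: buildOperationSnippet and its signature are hypothetical, and the real curl skeleton includes headers and a body, but the ordering and caps follow the description above.

```javascript
// Hypothetical sketch of assembling one operation's snippet in the documented order.
function buildOperationSnippet(method, path, op, baseUrl) {
  const parts = [`Endpoint: ${method.toUpperCase()} ${baseUrl}${path}`];
  if (op.summary) parts.push(op.summary);
  if (op.description) parts.push(op.description.slice(0, 500));   // 500-char description cap
  for (const p of op.parameters || []) {
    const req = p.required ? ' (required)' : '';
    const desc = p.description ? ` — ${p.description}` : '';
    parts.push(`${p.name} [${p.in}]: ${(p.schema && p.schema.type) || 'string'}${req}${desc}`);
  }
  for (const [code, resp] of Object.entries(op.responses || {})) {
    parts.push(`${code}: ${resp.description}`);
  }
  parts.push(`curl -X ${method.toUpperCase()} ${baseUrl}${path}`); // curl skeleton
  return parts.join('\n').slice(0, 8000);                         // 8,000-char snippet cap
}
```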


URL Generation

Markdown pages

URL is derived from the file path relative to docs/. The conversion rules are:

  1. If the frontmatter has a slug: field, that value is used directly (prefixed with /docs)
  2. Otherwise, each path segment has its numeric prefix stripped (01-, 03-, etc.) and is lowercased with underscores converted to hyphens
  3. index filename segments are dropped (the directory becomes the URL)

docs/03-developers/1-Getting-Started/overview.md
→ /docs/developers/getting-started/overview

docs/01-business/3-Core-Capabilities/payments.md
→ /docs/business/core-capabilities/payments

docs/03-developers/5-convenient-checkout-api/index.md
→ /docs/developers/convenient-checkout-api

These URLs match the Docusaurus-generated site URLs exactly, so Related Pages links in the chat widget navigate to the correct pages.
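Rules 2 and 3 can be sketched as a small conversion function (hypothetical helper; the slug: frontmatter override from rule 1 is omitted here):

```javascript
// Hypothetical sketch of the path → URL conversion (slug: frontmatter branch omitted).
function docPathToUrl(relPath) {
  const segments = relPath
    .replace(/\.mdx?$/, '')             // drop the .md / .mdx extension
    .split('/')
    .map((s) => s
      .replace(/^\d+-/, '')             // strip numeric prefixes: 01-, 3-, …
      .toLowerCase()                    // lowercase every segment
      .replace(/_/g, '-'))              // underscores → hyphens
    .filter((s) => s !== 'index');      // index.md → the directory URL
  return '/docs/' + segments.join('/');
}
```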

API spec operations

URL format: {apiRefRoute}#tag/{TagSlug}/operation/{operationId}

| Spec | apiRefRoute | Example URL |
| --- | --- | --- |
| v2 public | /api-reference-v2/ | /api-reference-v2/#tag/Payments/operation/createSession |
| v2 webhooks | /webhooks-v2/ | /webhooks-v2/#tag/Webhooks/operation/paymentFailed |
| v1 public | /api-reference/ | /api-reference/#tag/Payments/operation/createPayment |
  • TagSlug — first tag on the operation, percent-encoded (encodeURIComponent)
  • operationId — taken directly from the spec; falls back to slugify(METHOD-/path) if absent

These match the anchor URLs that Redocusaurus generates for each operation in the rendered API reference pages.
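A sketch of the URL format, including the slugify fallback for operations without an operationId (function name and exact slugify rules are illustrative assumptions):

```javascript
// Hypothetical sketch of the operation anchor URL with a slugify fallback.
function operationUrl(apiRefRoute, method, path, op) {
  const tag = encodeURIComponent((op.tags && op.tags[0]) || 'default');
  const opId = op.operationId ||
    `${method}-${path}`
      .toLowerCase()
      .replace(/[^a-z0-9]+/g, '-')   // collapse non-alphanumeric runs
      .replace(/^-|-$/g, '');        // trim leading/trailing hyphens
  return `${apiRefRoute}#tag/${tag}/operation/${opId}`;
}
```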


Request Body Schema Expansion

Schemas are expanded inline using schemaToText() — a recursive function that resolves $ref chains and produces a human-readable field listing:

amount: integer (required) — Payment amount in cents
currency: string — ISO 4217 currency code
metadata: object — Key-value pairs containing payment metadata
  key: string
  value: string
customer: object — The customer object for authenticated flows
  hsid: string — HealthSafeId of the customer
  enterpriseId: string
  metadata: object
agent: object — Information about the agent submitting on behalf of a customer
  firstName: string
  lastName: string
  msid: string
refundAllocations: array of:
  paymentMethodId: string (required)
  amount: integer (required)

Circular reference guard: visited is a Set of already-seen $ref keys. If a $ref is encountered that is already in visited, the text (circular) is emitted and recursion stops.

Composition handling: all oneOf / anyOf / allOf variants are expanded as variant N: blocks.

info

Because the 8,000-char snippet can still truncate deeply nested schemas, all field names are also separately indexed as keywords (see below), ensuring they are always searchable.
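A minimal sketch of schemaToText(), assuming local '#/components/...' pointers and omitting the oneOf/anyOf/allOf handling; the real function covers more schema shapes:

```javascript
// Hypothetical sketch of schemaToText(): indented field listing with a
// circular-reference guard based on a Set of already-seen $ref keys.
function schemaToText(schema, spec, indent = 0, visited = new Set()) {
  const pad = '  '.repeat(indent);
  if (schema.$ref) {
    if (visited.has(schema.$ref)) return `${pad}(circular)\n`;
    visited.add(schema.$ref);
    // Resolve a local '#/…' pointer against the spec document
    schema = schema.$ref.replace('#/', '').split('/').reduce((n, k) => n[k], spec);
  }
  if (schema.type === 'array' && schema.items) {
    return `${pad}array of:\n` + schemaToText(schema.items, spec, indent + 1, visited);
  }
  let out = '';
  const required = new Set(schema.required || []);
  for (const [name, prop] of Object.entries(schema.properties || {})) {
    const req = required.has(name) ? ' (required)' : '';
    const desc = prop.description ? ` — ${prop.description}` : '';
    out += `${pad}${name}: ${prop.type || 'object'}${req}${desc}\n`;
    if (prop.$ref || prop.properties || (prop.type === 'array' && prop.items)) {
      out += schemaToText(prop, spec, indent + 1, visited); // recurse into nested schemas
    }
  }
  return out;
}
```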


Keyword Extraction

Keywords are stored in the summary field and used by Azure AI Search as prioritised keyword fields in the semantic configuration. There are two separate extraction paths:

For markdown docs — extractKeywords()

Combines title + headings + bodyText, lowercases everything, extracts all words of 3+ chars, removes a fixed stop-word list (common English words like "the", "and", "for"), and returns the top 15 by frequency.

Title: "Create a Session"
Headings: ["Prerequisites", "Request Body", "Response Fields"]
Body: "A session is required before submitting a payment..."

→ keywords: ["session", "payment", "required", "request", "create", ...]
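A sketch of this frequency-based extraction, using an abbreviated stop-word list (the real list is larger):

```javascript
// Sketch of extractKeywords(): lowercase, take words of 3+ chars, drop stop
// words, return the top N by frequency. STOP here is a small illustrative subset.
function extractKeywords(text, limit = 15) {
  const STOP = new Set(['the', 'and', 'for', 'with', 'that', 'this', 'are', 'you']);
  const counts = new Map();
  for (const word of text.toLowerCase().match(/[a-z]{3,}/g) || []) {
    if (!STOP.has(word)) counts.set(word, (counts.get(word) || 0) + 1);
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])   // most frequent first
    .slice(0, limit)
    .map(([word]) => word);
}
```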

For API spec operations — field names + operation metadata

Keywords for API entries are built from a union of:

| Source | Example values |
| --- | --- |
| HTTP method | POST, GET |
| Path segments (split on /, _, {}) | sessions, payments, refunds |
| Operation tags | Payments, Merchant |
| Summary words | create, session, refund |
| operationId | createSession, getPaymentById |
| All request body field names (recursive) | metadata, customer, agent, refundAllocations, paymentMethodId, amount, hsid, enterpriseId |

The last entry — recursive field name collection — is done by collectSchemaFieldNames(). It walks the entire request body schema tree up to 5 levels deep and gathers every property name. This guarantees that a user asking "what is the metadata field?" or "how do I send agent information?" will match the correct operation's indexed entry even if those field names appear past the 8,000-char snippet cutoff.

Keywords are capped at 40 entries per operation.
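The recursive walk can be sketched as below (hypothetical shape of collectSchemaFieldNames, assuming local $refs and the 5-level depth cap described above):

```javascript
// Hypothetical sketch of collectSchemaFieldNames(): gather every property name
// in the request body schema tree, up to 5 levels deep.
function collectSchemaFieldNames(schema, spec, depth = 0, names = new Set()) {
  if (!schema || depth > 5) return names;     // depth cap also breaks $ref cycles
  if (schema.$ref) {
    schema = schema.$ref.replace('#/', '').split('/').reduce((n, k) => n && n[k], spec);
    if (!schema) return names;
  }
  for (const [name, prop] of Object.entries(schema.properties || {})) {
    names.add(name);
    collectSchemaFieldNames(prop, spec, depth + 1, names);
  }
  if (schema.items) collectSchemaFieldNames(schema.items, spec, depth + 1, names);
  return names;
}
```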


Step 2c — API Error Code JSON files

The loadErrorCodeEntries() function reads all *-api.json files from docs/03-developers/5-convenient-checkout-api/4-error-codes/. One indexed entry is created per individual error code.

File structure

Each *-api.json file has this shape:

{
  "apiName": "Payment API",
  "apiVersion": "v2",
  "basePath": "/v2/payments",
  "searchKeywords": ["error code", "api error", "payment error", "troubleshooting", "http status", "validation error", "error message", "error handling"],
  "categories": [
    {
      "name": "Authorization error",
      "errors": [
        {
          "id": "v2-payment-merchant-not-linked",
          "title": "FORBIDDEN",
          "httpStatus": "403",
          "message": "403 FORBIDDEN. RequestId: ${x-request-id}",
          "scenario": ["..."],
          "resolution": "...",
          "description": "..."
        }
      ]
    }
  ]
}

The searchKeywords field

Each JSON file contains a root-level searchKeywords array. These terms are placed first in the summary field (the prioritizedKeywordsFields slot in the semantic configuration), giving them the highest ranking weight in Azure AI Search.

This ensures that any user query containing words like "error", "troubleshooting", "http status", or "validation error" will reliably surface error-code entries above general documentation pages.

To add or adjust search priority terms for an API's errors, edit its searchKeywords array:

// docs/03-developers/5-convenient-checkout-api/4-error-codes/payment-api.json
"searchKeywords": ["error code", "api error", "payment error", "troubleshooting", "http status", "validation error", "error message", "error handling"]

How each error entry is built

| Field | Value |
| --- | --- |
| title | [Error Code] {error.title} ({apiName}) |
| snippet | [Error Code] {apiName} — {category} \| HTTP {status}: {title} \| {message} \| {scenario} \| {resolution} |
| headings | [apiName, category, error.title, "error code", "troubleshooting"] |
| keywords | searchKeywords from file + ["error", "error code", "troubleshooting", apiName, httpStatus, id, category] |

The [Error Code] prefix in both title and snippet gives semantic re-ranking an additional signal that these entries are specifically about error handling.
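Putting the field rules together, one entry per error code can be sketched as (buildErrorEntry and the exact joining characters are illustrative assumptions):

```javascript
// Hypothetical sketch of building one indexed entry per error code.
function buildErrorEntry(api, category, err) {
  return {
    title: `[Error Code] ${err.title} (${api.apiName})`,
    snippet: `[Error Code] ${api.apiName} — ${category.name} | HTTP ${err.httpStatus}: ` +
             `${err.title} | ${err.message} | ${(err.scenario || []).join(' ')} | ${err.resolution}`,
    headings: [api.apiName, category.name, err.title, 'error code', 'troubleshooting'].join(' | '),
    keywords: [
      ...(api.searchKeywords || []),   // file-level priority terms come first
      'error', 'error code', 'troubleshooting',
      api.apiName, err.httpStatus, err.id, category.name,
    ],
  };
}
```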


Step 2d — UI Validation Error JSON files

The loadUiErrorEntries() function recursively walks docs/03-developers/4-convenient-checkout-ui/13-Error-Messages/ and reads all *-validation-data.json and *-form-validation-data.json files. One indexed entry is created per file.

Why JSON files are not modified

These files are imported directly by MDX pages as plain arrays:

import cardValidationData from './card-validation-data.json';

Adding a root-level searchKeywords field would change the array structure and break those imports. All keyword enrichment is therefore derived from the folder path and filename entirely within the upload script.

How context is extracted

The walker passes folderContext down through each directory level, accumulating the folder names:

13-Error-Messages/
├── Wallet Mode/                      ← folderContext = "Wallet Mode"
│   └── Payment Method/               ← folderContext = "Wallet Mode Payment Method"
│       └── card-validation-data.json
└── Payment Mode/                     ← folderContext = "Payment Mode"
    └── card-validation-data.json

The context words (wallet, mode, payment, method) are extracted from folderContext and added to keywords. The payment type (card, ach, telephonic) is derived from the filename.
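A sketch of that derivation (uiErrorContext is a hypothetical helper; the real walker accumulates folderContext incrementally as it descends):

```javascript
// Hypothetical sketch: derive keyword context from the folder path and filename.
function uiErrorContext(relPath) {
  const segments = relPath.split('/');
  const fileName = segments.pop();
  const folderContext = segments.join(' ');                 // e.g. "Wallet Mode Payment Method"
  const contextWords = folderContext.toLowerCase().split(/\s+/).filter(Boolean);
  const match = fileName.match(/^(card|ach|telephonic)/);   // payment type from filename
  return { folderContext, contextWords, paymentType: match ? match[1] : null };
}
```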

How each UI error entry is built

| Field | Value |
| --- | --- |
| title | [UI Error] {label} — {folderContext} |
| snippet | [UI Validation Error] {folderContext} \| Field: {fieldName} \| {validationRules} \| {errorMessages} |
| headings | [label, "Validation", "Error Messages", "UI error", folderContext, ...paymentTypes] |
| keywords | ["error", "validation error", "ui error", "error message", "troubleshooting", "form validation" or "field validation", ...paymentTypes, ...contextWords, ...labelWords] |

The [UI Validation Error] prefix in the snippet and "ui error" / "validation error" baseline keywords ensure these entries rank highly for queries about checkout widget form errors, field validation messages, and UI error states.


Index Schema

The wallet-docs-index schema is defined in ensureIndex() and uses a PUT /indexes/{name} call with allowIndexDowntime=false.

| Field | Type | Searchable | Retrievable | Notes |
| --- | --- | --- | --- | --- |
| id | Edm.String (key) | No | Yes | Base64-encoded URL with characters matching [^a-zA-Z0-9_-] replaced by _ |
| title | Edm.String | Yes | Yes | Page title or operation summary |
| file_path | Edm.String | No | Yes | Relative URL — used as the link in the chat widget |
| file_name | Edm.String | No | Yes | Last segment of the URL |
| content | Edm.String | Yes | Yes | Truncated snippet (2,000 chars for docs, 8,000 chars for specs) |
| headings | Edm.String | Yes | Yes | Pipe-separated list of ##/### headings |
| summary | Edm.String | Yes | Yes | Top keywords (comma-separated; up to 40 for specs, 15 for docs) |
| section | Edm.String | Yes | Yes | Second URL segment (e.g. developers, business) |
| category | Edm.String | Yes | Yes | First URL segment — used for the two-pass search filter |
| last_modified | Edm.String | No | Yes | ISO timestamp of the upload run |

Semantic configuration

{
  "name": "default",
  "prioritizedFields": {
    "titleField": { "fieldName": "title" },
    "prioritizedContentFields": [{ "fieldName": "content" }],
    "prioritizedKeywordsFields": [{ "fieldName": "summary" }]
  }
}

Azure AI Search uses this configuration to apply L2 semantic re-ranking: title gets the highest semantic weight, content is the primary text, and summary (keywords) boosts recall for exact field-name or endpoint matches.


The category value is always the first segment of the URL:

| Category value | Documents |
| --- | --- |
| docs | All markdown pages |
| api-reference-v2 | v2 API operations |
| webhooks-v2 | v2 webhook events |
| api-reference | v1 API operations |

The runtime server (api/server.js) uses this field in a two-pass parallel search:

  • Pass 1 — filter: "category ne 'docs'" + top: 3 — guarantees at least 3 spec/webhook entries reach the OpenAI context window regardless of how prose docs rank in the overall semantic score
  • Pass 2 — no filter + top: 8 — captures the best-matching markdown docs

The passes run in parallel and their results are merged (spec entries first, v2/webhooks before v1, deduplicated by file_path).
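The dedup-and-merge step can be sketched as below, assuming the spec-pass results arrive already ordered v2/webhooks before v1 (mergeResults is a hypothetical name, not necessarily what api/server.js calls it):

```javascript
// Hypothetical sketch of the merge: spec hits first, then docs hits,
// deduplicated by file_path so an operation never appears twice.
function mergeResults(specHits, docHits) {
  const seen = new Set();
  const merged = [];
  for (const hit of [...specHits, ...docHits]) {
    if (!seen.has(hit.file_path)) {
      seen.add(hit.file_path);
      merged.push(hit);
    }
  }
  return merged;
}
```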


Document ID Generation

id: Buffer.from(e.url).toString('base64').replace(/[^a-zA-Z0-9_-]/g, '_')

Azure AI Search key fields cannot contain /, +, or =. The URL is Base64-encoded and all non-alphanumeric non-_/- characters are replaced with _. This produces a stable, deterministic ID — re-uploading the same URL always produces the same key, enabling mergeOrUpload semantics.


Upload Mechanics

  • Action: mergeOrUpload — creates the document if it does not exist, updates if it does (matched by id)
  • Batch size: 500 documents per POST request to the Azure Search batch endpoint
  • Pre-upload: All existing documents are deleted with deleteAllDocs() before uploading so stale entries (renamed/deleted pages) do not persist in the index
  • Failure handling: Any failed documents within a batch are logged; the script exits with code 1 if any failures occurred
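The batching step can be sketched as follows; the '@search.action' property is the Azure AI Search documents API convention, while the function name and return shape are illustrative:

```javascript
// Sketch of batching: 500 documents per POST to the Azure Search batch
// endpoint, each tagged with the mergeOrUpload action.
function toBatches(docs, size = 500) {
  const batches = [];
  for (let i = 0; i < docs.length; i += size) {
    batches.push({
      value: docs.slice(i, i + size).map((d) => ({ '@search.action': 'mergeOrUpload', ...d })),
    });
  }
  return batches;
}
```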