AI Search Indexing
This page describes exactly how scripts/ai-search/uploadToAzureSearch.js builds the wallet-docs-index Azure AI Search index — covering which files are parsed, how URLs are constructed, how keywords are generated, and why the index is structured the way it is.
Overview
The upload script runs in 3 sequential steps:
- Step 1: `yarn build` → resolves all OpenAPI `$ref`s into `build/redocusaurus/`
- Step 2: Parse source files → extract markdown docs + OpenAPI spec operations + error code JSONs + UI validation JSONs
- Step 3: Upload → (re)create the index schema, clear old docs, batch-upload new docs
Run it with:
AZURE_SEARCH_ENDPOINT=https://ccg-docs.search.windows.net \
AZURE_SEARCH_API_KEY=<admin-key> \
AZURE_SEARCH_INDEX=wallet-docs-index \
node scripts/ai-search/uploadToAzureSearch.js
Requires an admin key — the script creates the index schema and deletes all existing documents. A query key (used by the runtime server) is read-only and cannot be used here.
Step 1 — Why yarn build Runs First
OpenAPI source files in openapi/v2/ contain $ref pointers to external YAML schema files:
# openapi/v2/apispec.yaml
schema:
$ref: './schemas/PaymentRequest.yaml#/components/schemas/PaymentRequest'
The upload script does not resolve these external files itself. Instead, it reads from build/redocusaurus/ — the fully-bundled YAML files that Docusaurus generates during yarn build. By the time Redocusaurus outputs these files, every $ref is inlined. The built files contain zero external cross-file references.
| Built file | Content | External $ref count |
|---|---|---|
| build/redocusaurus/plugin-redoc-1.yaml | v2 public API | 0 (all inlined) |
| build/redocusaurus/plugin-redoc-2.yaml | v2 Webhooks | 0 (all inlined) |
| build/redocusaurus/plugin-redoc-0.yaml | v1 public API | 0 (all inlined) |
If yarn build fails, the script aborts immediately to prevent uploading stale content from a previous build.
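The "zero external $ref" property of the bundled files can be sanity-checked mechanically. A minimal sketch (the helper name `countExternalRefs` is illustrative and is not part of the upload script):

```javascript
// Counts $ref pointers that target another file. Internal refs ('#/...')
// are expected in a bundled spec; external ones ('./schemas/...') are not.
function countExternalRefs(yamlText) {
  const matches = yamlText.match(/\$ref:\s*['"]?(?!#\/)[^'"\s]+/g);
  return matches ? matches.length : 0;
}
```

A bundled file should always report zero, while the raw `openapi/v2/` sources will not.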
Step 2 — Parsing Source Files
2a. Markdown documents
All .md and .mdx files under docs/ are walked recursively.
Per file, the script:
- Reads the raw file content
- Strips YAML frontmatter (the `---…---` block)
- Extracts the page title — the first `# Heading` line
- Extracts sub-headings — all `##` and `###` lines (used for the `headings` field)
- Strips all remaining markdown syntax (links, bold, tables, code blocks, JSX tags, Docusaurus admonitions) to produce plain `bodyText`
- Truncates `bodyText` to 2,000 chars for the `content` (snippet) field
- Computes the URL (see URL Generation below)
- Computes keywords (see Keyword Extraction below)
Pages without a # Title heading are silently skipped (no indexed entry created).
Release notes pages are also skipped (matched by title or URL pattern).
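The frontmatter, title, and heading extraction described above can be sketched as follows (`parseMarkdownDoc` is an illustrative name, not the script's actual export, and the markdown-stripping and truncation steps are omitted for brevity):

```javascript
// Minimal sketch of the per-file parsing: strip frontmatter, require a
// "# Title" line, and collect ##/### sub-headings for the headings field.
function parseMarkdownDoc(raw) {
  // Strip a leading YAML frontmatter block (--- ... ---)
  const body = raw.replace(/^---\n[\s\S]*?\n---\n/, '');
  // Page title: the first "# Heading" line; pages without one are skipped
  const titleMatch = body.match(/^#\s+(.+)$/m);
  if (!titleMatch) return null;
  // Sub-headings: all ## and ### lines
  const headings = [...body.matchAll(/^#{2,3}\s+(.+)$/gm)].map(m => m[1].trim());
  return { title: titleMatch[1].trim(), headings };
}
```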
2b. OpenAPI spec operations
For each spec file in the SPECS array (v2 → webhooks → v1):
- The bundled YAML is loaded from `build/redocusaurus/`
- The script detects whether it is a webhook spec (`webhooks:` top-level key) or a regular API spec (`paths:` key)
- For each path/event and each HTTP method (or `post` only for webhooks), one indexed entry is created
Per operation, the snippet includes (in order):
| Part | Format | Example |
|---|---|---|
| Endpoint line | Endpoint: METHOD BASE_URL/path | Endpoint: POST https://api.healthsafepay.com/v2/sessions |
| Summary | plain text | Create a session |
| Description | truncated to 500 chars | full operation description |
| Parameters | name [in]: type (required) — description | merchantId [header]: string (required) |
| Request body | full schema tree (see below) | amount: integer (required) — Payment amount in cents |
| Responses | code: description per status | 200: Session created |
| curl skeleton | ready-to-copy curl command | curl -X POST ... |
The full snippet is capped at 8,000 chars per operation.
URL Generation
Markdown pages
URL is derived from the file path relative to docs/. The conversion rules are:
- If the frontmatter has a `slug:` field, that value is used directly (prefixed with `/docs`)
- Otherwise, each path segment has its numeric prefix stripped (`01-`, `03-`, etc.) and is lowercased with underscores converted to hyphens
- `index` filename segments are dropped (the directory becomes the URL)
docs/03-developers/1-Getting-Started/overview.md
→ /docs/developers/getting-started/overview
docs/01-business/3-Core-Capabilities/payments.md
→ /docs/business/core-capabilities/payments
docs/03-developers/5-convenient-checkout-api/index.md
→ /docs/developers/convenient-checkout-api
These URLs match the Docusaurus-generated site URLs exactly, so Related Pages links in the chat widget navigate to the correct pages.
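The path-to-URL conversion rules above can be sketched as a small function (`docUrlFromPath` is an illustrative name; the frontmatter `slug:` override is omitted here):

```javascript
// Converts a path under docs/ into its Docusaurus URL: strip the extension,
// drop numeric prefixes, lowercase, convert underscores, drop index segments.
function docUrlFromPath(relPath) {
  const segments = relPath.replace(/\.mdx?$/, '').split('/').slice(1); // drop leading "docs"
  const cleaned = segments
    .map(s => s.replace(/^\d+-/, '').toLowerCase().replace(/_/g, '-'))
    .filter(s => s !== 'index'); // index files resolve to their directory
  return '/docs/' + cleaned.join('/');
}
```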
API spec operations
URL format: {apiRefRoute}#tag/{TagSlug}/operation/{operationId}
| Spec | apiRefRoute | Example URL |
|---|---|---|
| v2 public | /api-reference-v2/ | /api-reference-v2/#tag/Payments/operation/createSession |
| v2 webhooks | /webhooks-v2/ | /webhooks-v2/#tag/Webhooks/operation/paymentFailed |
| v1 public | /api-reference/ | /api-reference/#tag/Payments/operation/createPayment |
- `TagSlug` — first tag on the operation, percent-encoded (`encodeURIComponent`)
- `operationId` — taken directly from the spec; falls back to `slugify(METHOD-/path)` if absent
These match the anchor URLs that Redocusaurus generates for each operation in the rendered API reference pages.
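A sketch of the anchor URL construction, assuming a simple slugify for the fallback case (the script's actual slugify may differ in detail; `operationUrl` is an illustrative name):

```javascript
// Builds {apiRefRoute}#tag/{TagSlug}/operation/{operationId} for one operation.
function operationUrl(apiRefRoute, op, method, path) {
  const tag = encodeURIComponent((op.tags && op.tags[0]) || 'default');
  // operationId from the spec, else a slug built from the method and path
  const opId = op.operationId ||
    `${method}-${path}`.toLowerCase().replace(/[^a-z0-9]+/g, '-').replace(/^-|-$/g, '');
  return `${apiRefRoute}#tag/${tag}/operation/${opId}`;
}
```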
Request Body Schema Expansion
Schemas are expanded inline using schemaToText() — a recursive function that resolves $ref chains and produces a human-readable field listing:
amount: integer (required) — Payment amount in cents
currency: string — ISO 4217 currency code
metadata: object — Key-value pairs containing payment metadata
key: string
value: string
customer: object — The customer object for authenticated flows
hsid: string — HealthSafeId of the customer
enterpriseId: string
metadata: object
agent: object — Information about the agent submitting on behalf of a customer
firstName: string
lastName: string
msid: string
refundAllocations: array of:
paymentMethodId: string (required)
amount: integer (required)
Circular reference guard: `visited` is a Set of already-seen `$ref` keys. If a `$ref` is encountered that is already in `visited`, the text `(circular)` is emitted and recursion stops.
Composition handling: all `oneOf` / `anyOf` / `allOf` variants are expanded as `variant N:` blocks.
Because the 8,000-char snippet can still truncate deeply nested schemas, all field names are also separately indexed as keywords (see below), ensuring they are always searchable.
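A trimmed-down sketch of this recursion, assuming `defs` maps schema names to schema objects (this omits the oneOf/anyOf/allOf variant blocks and other details of the real `schemaToText`):

```javascript
// Expands a schema into an indented field listing, emitting "(circular)"
// when a $ref has already been visited on this expansion path.
function schemaToText(schema, defs, visited = new Set(), indent = '') {
  if (schema.$ref) {
    if (visited.has(schema.$ref)) return indent + '(circular)\n';
    visited.add(schema.$ref);
    const name = schema.$ref.split('/').pop();
    return schemaToText(defs[name], defs, visited, indent);
  }
  let out = '';
  const required = new Set(schema.required || []);
  for (const [field, sub] of Object.entries(schema.properties || {})) {
    const req = required.has(field) ? ' (required)' : '';
    const desc = sub.description ? ` — ${sub.description}` : '';
    out += `${indent}${field}: ${sub.type || 'object'}${req}${desc}\n`;
    if (sub.properties || sub.$ref) {
      out += schemaToText(sub, defs, visited, indent + '  ');
    }
  }
  return out;
}
```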
Keyword Extraction
Keywords are stored in the summary field and used by Azure AI Search as prioritized keyword fields in the semantic configuration. There are two separate extraction paths:
For markdown docs — extractKeywords()
Combines title + headings + bodyText, lowercases everything, extracts all words of 3+ chars, removes a fixed stop-word list (common English words like "the", "and", "for"), and returns the top 15 by frequency.
Title: "Create a Session"
Headings: ["Prerequisites", "Request Body", "Response Fields"]
Body: "A session is required before submitting a payment..."
→ keywords: ["session", "payment", "required", "request", "create", ...]
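The extraction above amounts to a word-frequency count over the combined text. A minimal sketch, with an abbreviated stop-word list (the script's actual list is longer):

```javascript
// Lowercases title + headings + body, keeps words of 3+ chars, drops
// stop words, and returns the most frequent terms.
function extractKeywords(title, headings, bodyText, limit = 15) {
  const STOP = new Set(['the', 'and', 'for', 'with', 'that', 'this', 'are', 'you']);
  const counts = new Map();
  const text = [title, ...headings, bodyText].join(' ').toLowerCase();
  for (const word of text.match(/[a-z][a-z0-9]{2,}/g) || []) {
    if (!STOP.has(word)) counts.set(word, (counts.get(word) || 0) + 1);
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit)
    .map(([word]) => word);
}
```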
For API spec operations — field names + operation metadata
Keywords for API entries are built from a union of:
| Source | Example values |
|---|---|
| HTTP method | POST, GET |
| Path segments | sessions, payments, refunds (split on /, _, {}) |
| Operation tags | Payments, Merchant |
| Summary words | create, session, refund |
| operationId | createSession, getPaymentById |
| All request body field names (recursive) | metadata, customer, agent, refundAllocations, paymentMethodId, amount, hsid, enterpriseId |
The last entry — recursive field name collection — is done by collectSchemaFieldNames(). It walks the entire request body schema tree up to 5 levels deep and gathers every property name. This guarantees that a user asking "what is the metadata field?" or "how do I send agent information?" will match the correct operation's indexed entry even if those field names appear past the 8,000-char snippet cutoff.
Keywords are capped at 40 entries per operation.
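The recursive field-name walk can be sketched as follows, assuming `defs` maps schema names to schema objects (illustrative only; the real `collectSchemaFieldNames` may differ in how it resolves `$ref`s):

```javascript
// Gathers every property name in a request body schema tree, up to 5 levels
// deep, so nested field names stay searchable past the snippet cutoff.
function collectSchemaFieldNames(schema, defs, depth = 0, names = new Set()) {
  if (!schema || depth > 5) return names;
  if (schema.$ref) {
    const target = defs[schema.$ref.split('/').pop()];
    return collectSchemaFieldNames(target, defs, depth, names);
  }
  for (const [field, sub] of Object.entries(schema.properties || {})) {
    names.add(field);
    collectSchemaFieldNames(sub, defs, depth + 1, names);
  }
  if (schema.items) collectSchemaFieldNames(schema.items, defs, depth + 1, names);
  return names;
}
```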
Step 2c — API Error Code JSON files
The loadErrorCodeEntries() function reads all *-api.json files from docs/03-developers/5-convenient-checkout-api/4-error-codes/. One indexed entry is created per individual error code.
File structure
Each *-api.json file has this shape:
{
"apiName": "Payment API",
"apiVersion": "v2",
"basePath": "/v2/payments",
"searchKeywords": ["error code", "api error", "payment error", "troubleshooting", "http status", "validation error", "error message", "error handling"],
"categories": [
{
"name": "Authorization error",
"errors": [
{
"id": "v2-payment-merchant-not-linked",
"title": "FORBIDDEN",
"httpStatus": "403",
"message": "403 FORBIDDEN. RequestId: ${x-request-id}",
"scenario": ["..."],
"resolution": "...",
"description": "..."
}
]
}
]
}
The searchKeywords field
Each JSON file contains a root-level searchKeywords array. These terms are placed first in the summary field (the prioritizedKeywordsFields slot in the semantic configuration), giving them the highest ranking weight in Azure AI Search.
This ensures that any user query containing words like "error", "troubleshooting", "http status", or "validation error" will reliably surface error-code entries above general documentation pages.
To add or adjust search priority terms for an API's errors, edit its searchKeywords array:
// docs/03-developers/5-convenient-checkout-api/4-error-codes/payment-api.json
"searchKeywords": ["error code", "api error", "payment error", "troubleshooting", "http status", "validation error", "error message", "error handling"]
How each error entry is built
| Field | Value |
|---|---|
| title | [Error Code] {error.title} ({apiName}) |
| snippet | [Error Code] {apiName} — {category} \| HTTP {status}: {title} \| {message} \| {scenario} \| {resolution} |
| headings | [apiName, category, error.title, "error code", "troubleshooting"] |
| keywords | searchKeywords from file + ["error", "error code", "troubleshooting", apiName, httpStatus, id, category] |
The [Error Code] prefix in both title and snippet gives semantic re-ranking an additional signal that these entries are specifically about error handling.
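Following the field mapping above, one error code becomes an indexed entry roughly like this (`buildErrorEntry` is an illustrative name, not the script's actual function):

```javascript
// Turns a single error object (from a *-api.json file) into an index entry.
function buildErrorEntry(apiName, category, error, searchKeywords) {
  return {
    title: `[Error Code] ${error.title} (${apiName})`,
    snippet: [
      `[Error Code] ${apiName} — ${category}`,
      `HTTP ${error.httpStatus}: ${error.title}`,
      error.message,
      (error.scenario || []).join(' '),
      error.resolution,
    ].filter(Boolean).join(' | '),
    headings: [apiName, category, error.title, 'error code', 'troubleshooting'],
    // searchKeywords from the file come first, then the baseline terms
    keywords: [...searchKeywords, 'error', 'error code', 'troubleshooting',
               apiName, error.httpStatus, error.id, category],
  };
}
```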
Step 2d — UI Validation Error JSON files
The loadUiErrorEntries() function recursively walks docs/03-developers/4-convenient-checkout-ui/13-Error-Messages/ and reads all *-validation-data.json and *-form-validation-data.json files. One indexed entry is created per file.
Why JSON files are not modified
These files are imported directly by MDX pages as plain arrays:
import cardValidationData from './card-validation-data.json';
Adding a root-level searchKeywords field would change the array structure and break those imports. All keyword enrichment is therefore derived from the folder path and filename entirely within the upload script.
How context is extracted
The walker passes folderContext down through each directory level, accumulating the folder names:
13-Error-Messages/
├── Wallet Mode/ ← folderContext = "Wallet Mode"
│ └── Payment Method/ ← folderContext = "Wallet Mode Payment Method"
│ └── card-validation-data.json
└── Payment Mode/ ← folderContext = "Payment Mode"
└── card-validation-data.json
The context words (wallet, mode, payment, method) are extracted from folderContext and added to keywords. The payment type (card, ach, telephonic) is derived from the filename.
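A minimal sketch of that derivation (`contextWords` is a hypothetical helper name, not the script's actual function):

```javascript
// Derives keyword material from the accumulated folder context and the
// filename's payment-type prefix (card / ach / telephonic).
function deriveUiErrorKeywords(folderContext, fileName) {
  const contextWords = folderContext.toLowerCase().split(/\s+/).filter(Boolean);
  const typeMatch = fileName.match(/^(card|ach|telephonic)/);
  return { contextWords, paymentType: typeMatch ? typeMatch[1] : null };
}
```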
How each UI error entry is built
| Field | Value |
|---|---|
| title | [UI Error] {label} — {folderContext} |
| snippet | [UI Validation Error] {folderContext} \| Field: {fieldName} \| {validationRules} \| {errorMessages} |
| headings | [label, "Validation", "Error Messages", "UI error", folderContext, ...paymentTypes] |
| keywords | ["error", "validation error", "ui error", "error message", "troubleshooting", "form validation" or "field validation", ...paymentTypes, ...contextWords, ...labelWords] |
The [UI Validation Error] prefix in the snippet and "ui error" / "validation error" baseline keywords ensure these entries rank highly for queries about checkout widget form errors, field validation messages, and UI error states.
Index Schema
The wallet-docs-index schema is defined in ensureIndex() and uses a PUT /indexes/{name} call with allowIndexDowntime=false.
| Field | Type | Searchable | Retrievable | Notes |
|---|---|---|---|---|
| id | Edm.String (key) | No | Yes | Base64-encoded URL, URL-safe characters only ([^a-zA-Z0-9_-] → _) |
| title | Edm.String | Yes | Yes | Page title or operation summary |
| file_path | Edm.String | No | Yes | Relative URL — used as the link in the chat widget |
| file_name | Edm.String | No | Yes | Last segment of the URL |
| content | Edm.String | Yes | Yes | Truncated snippet (2,000 chars for docs, 8,000 chars for specs) |
| headings | Edm.String | Yes | Yes | Pipe-separated list of ##/### headings |
| summary | Edm.String | Yes | Yes | Top keywords (comma-separated; up to 40 for specs, 15 for docs) |
| section | Edm.String | Yes | Yes | Second URL segment (e.g. developers, business) |
| category | Edm.String | Yes | Yes | First URL segment — used for the two-pass search filter |
| last_modified | Edm.String | No | Yes | ISO timestamp of the upload run |
Semantic configuration
{
"name": "default",
"prioritizedFields": {
"titleField": { "fieldName": "title" },
"prioritizedContentFields": [{ "fieldName": "content" }],
"prioritizedKeywordsFields":[{ "fieldName": "summary" }]
}
}
Azure AI Search uses this configuration to apply L2 semantic re-ranking — title gets the highest semantic weight, content is the primary text, and summary (keywords) boost recall for exact field-name or endpoint matches.
The category Field and Two-Pass Search
The category value is always the first segment of the URL:
| Category value | Documents |
|---|---|
| docs | All markdown pages |
| api-reference-v2 | v2 API operations |
| webhooks-v2 | v2 webhook events |
| api-reference | v1 API operations |
The runtime server (api/server.js) uses this field in a two-pass parallel search:
- Pass 1 — `filter: "category ne 'docs'"` + `top: 3` — guarantees at least 3 spec/webhook entries reach the OpenAI context window regardless of how prose docs rank in the overall semantic score
- Pass 2 — no filter + `top: 8` — captures the best-matching markdown docs
The passes run in parallel and their results are merged (spec entries first, v2/webhooks before v1, deduplicated by file_path).
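The merge step can be sketched as follows; the two searches themselves would be Azure Search REST calls run via `Promise.all`, and this sketch shows only the ordering and deduplication logic (`mergeResults` is an illustrative name):

```javascript
// Merges the two result lists: spec entries first, deduplicated by file_path.
function mergeResults(specResults, docResults) {
  const seen = new Set();
  const merged = [];
  for (const r of [...specResults, ...docResults]) {
    if (!seen.has(r.file_path)) {
      seen.add(r.file_path);
      merged.push(r);
    }
  }
  return merged;
}
```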
Document ID Generation
id: Buffer.from(e.url).toString('base64').replace(/[^a-zA-Z0-9_-]/g, '_')
Azure AI Search key fields cannot contain /, +, or =, all of which can appear in standard Base64 output. The URL is therefore Base64-encoded and every character outside [a-zA-Z0-9_-] is replaced with _. This produces a stable, deterministic ID: re-uploading the same URL always yields the same key, enabling mergeOrUpload semantics.
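Wrapped as a function, the same transformation can be exercised directly to confirm the determinism and URL-safe character set (`docId` is an illustrative wrapper name):

```javascript
// Stable document key: Base64 of the URL with unsafe characters mapped to _.
function docId(url) {
  return Buffer.from(url).toString('base64').replace(/[^a-zA-Z0-9_-]/g, '_');
}
```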
Upload Mechanics
- Action: `mergeOrUpload` — creates the document if it does not exist, updates it if it does (matched by `id`)
- Batch size: 500 documents per POST request to the Azure Search batch endpoint
- Pre-upload: all existing documents are deleted with `deleteAllDocs()` before uploading, so stale entries (renamed or deleted pages) do not persist in the index
- Failure handling: any failed documents within a batch are logged; the script exits with code 1 if any failures occurred