# Knowledge sources and crawling

How the assistant learns what's on your site, what your PDFs say, and what you've written by hand.

## Three knowledge sources

Every answer the assistant gives draws on one or more of three sources you control:

- **Crawled pages** — Pages we fetch automatically from your website. Good for product pages, FAQs, blog posts — anything that's already public.
- **Uploaded documents** — PDF, DOCX, and DOC files you upload. Good for price lists, terms, internal handbooks, anything that isn't on the public site.
- **Articles you write** — Knowledge articles you author directly in the dashboard. Good for filling gaps the crawl missed or for answers you want phrased a specific way.

## One site, multiple sources

A Clarifier site can pull content from more than one place. The site has a primary domain (set when you create it) and you can add additional crawl sources for sibling subdomains or specific URLs. All sources feed the same knowledge base — visitors get answers from any of your content regardless of which source it came from.

- **Primary domain** — The URL you used to create the site. Cannot be removed; the widget can always be embedded here.
- **Additional subdomain** — A sibling domain like blog.acme.com or shop.acme.com — must share the same registrable domain as the primary. Each subdomain source is scanned independently and gets its own include/exclude patterns. The widget can also be embedded on any added subdomain.
- **Single page** — One specific URL — useful for a page the scan missed, a freshly published page you don't want to wait for the next full re-crawl to pick up, or a high-importance page outside your normal patterns.

All sources share the same per-plan crawl-page limit. On Pro, for example, the 5,000-page allowance can be split however you like across the primary domain, subdomains, and single-page sources combined.

## Scan vs. crawl

When you add a site, Clarifier first runs a scan: it reads your sitemap and a sample of pages to suggest URL patterns to include or exclude. The scan is fast and doesn't fetch every page. The crawl is the actual fetch — it follows the patterns you've approved and downloads each page's content. Both can be re-run from the dashboard whenever you change your site.
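
To make the idea of "suggesting URL patterns" concrete, here is a minimal sketch that groups a list of URLs by their first path segment. It is an illustration only, not Clarifier's actual scan logic: the function name `suggest_patterns` and the grouping rule are assumptions made for this example.

```python
from collections import Counter
from urllib.parse import urlparse


def suggest_patterns(urls):
    """Group URLs by first path segment and return (pattern, page_count) pairs.

    Hypothetical helper: the real scan may group pages differently.
    """
    counts = Counter()
    for url in urls:
        segments = [s for s in urlparse(url).path.split("/") if s]
        if len(segments) >= 2:
            # /blog/my-post and /blog/other both fall under /blog/*
            counts[f"/{segments[0]}/*"] += 1
        else:
            # Top-level pages such as / or /about are kept as exact paths.
            counts["/" + "/".join(segments)] += 1
    return counts.most_common()


sample = [
    "https://acme.com/",
    "https://acme.com/blog/a",
    "https://acme.com/blog/b",
    "https://acme.com/produkter/x",
    "https://acme.com/about",
]
for pattern, pages in suggest_patterns(sample):
    print(pattern, pages)
```

Patterns with many pages appear first, which is roughly the view you would want when deciding what to include or skip.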

## Include and exclude patterns

After the scan you'll see URL patterns grouped by structure — for example, /produkter/*, /blog/*, /admin/*. Tick the ones to include in the crawl and untick the ones to skip. Common things to exclude: admin pages, search-result pages, paginated archives, customer-account pages. The crawl only fetches pages that match an included pattern and don't match any excluded one.
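
The include/exclude rule can be sketched in a few lines. This example assumes the patterns behave like simple globs where `*` matches any run of characters (including `/`); the exact matching semantics Clarifier uses are not specified here, and `should_crawl` is a hypothetical name.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse


def should_crawl(url, include, exclude):
    """Return True if the URL path matches an included pattern and no excluded one.

    Assumption: glob-style patterns, matched case-sensitively against the path.
    """
    path = urlparse(url).path or "/"
    included = any(fnmatchcase(path, pattern) for pattern in include)
    excluded = any(fnmatchcase(path, pattern) for pattern in exclude)
    return included and not excluded


include = ["/produkter/*", "/blog/*"]
exclude = ["/blog/page/*", "/admin/*"]

print(should_crawl("https://acme.com/blog/hello", include, exclude))   # True
print(should_crawl("https://acme.com/blog/page/2", include, exclude))  # False
print(should_crawl("https://acme.com/about", include, exclude))        # False
```

Note that exclusion wins: `/blog/page/2` matches the included `/blog/*` pattern but is still skipped because it also matches `/blog/page/*`.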

## robots.txt and blocking

Clarifier respects robots.txt. If your robots.txt blocks our crawler, the dashboard will tell you and link to a fix — usually removing a Disallow rule that targets all crawlers. While the crawl is blocked you can still build a knowledge base from uploaded documents and hand-written articles; the assistant just won't know about your live pages.
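
If you want to check locally whether a robots.txt would block a crawler, Python's standard `urllib.robotparser` can evaluate the rules. The sketch below uses the placeholder user agent `ExampleCrawler`; this document does not state the user-agent string Clarifier's crawler sends, so substitute the real one from the dashboard.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Evaluate robots.txt content (as a string) for one user agent and URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A rule that targets all crawlers and blocks the whole site.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

# Same block, but with an explicit exception for one (hypothetical) crawler.
ALLOW_ONE = """\
User-agent: ExampleCrawler
Allow: /

User-agent: *
Disallow: /
"""

url = "https://acme.com/pricing"
print(is_allowed(BLOCK_ALL, "ExampleCrawler", url))  # False
print(is_allowed(ALLOW_ONE, "ExampleCrawler", url))  # True
print(is_allowed(ALLOW_ONE, "OtherBot", url))        # False
```

The second file shows one way to keep a site-wide block for other crawlers while letting a specific crawler through.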

## What happens after a crawl

Once content is fetched (whether from the crawl, a document upload, or an article), it goes through the same three-step processing pipeline:

- The text is split into overlapping chunks of roughly 1,500 characters each, with a 200-character overlap so context isn't cut at chunk boundaries.
- Each chunk is converted into a 1,536-dimensional embedding vector using OpenAI's text-embedding-3-small model.
- Vectors are stored in our search index. When a visitor asks a question, we embed their question the same way and retrieve the closest chunks, which the language model uses to write the answer.
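
The chunking and retrieval steps above can be sketched as follows. This is a simplified illustration under stated assumptions: it splits on raw character offsets (the production pipeline may respect sentence or paragraph boundaries), it ranks by cosine similarity (the text only says "closest"), and it uses tiny 3-dimensional vectors in place of real 1,536-dimensional embeddings. The function names are hypothetical.

```python
import math

CHUNK_SIZE = 1500  # characters per chunk, from the description above
OVERLAP = 200      # characters shared between neighbouring chunks


def split_into_chunks(text, size=CHUNK_SIZE, overlap=OVERLAP):
    """Split text into fixed-size chunks whose edges overlap."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k=3):
    """Return indices of the k chunk vectors most similar to the query."""
    order = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return order[:k]


text = "".join(str(i % 10) for i in range(3000))
chunks = split_into_chunks(text)
print(len(chunks))                           # 3
print(chunks[0][-OVERLAP:] == chunks[1][:OVERLAP])  # True

# Toy vectors standing in for real embeddings.
print(top_k([1, 0, 0], [[0, 1, 0], [1, 0.1, 0], [0.5, 0.5, 0]], k=2))  # [1, 2]
```

With a 1,500-character chunk and a 200-character overlap, each new chunk starts 1,300 characters after the previous one, so a 3,000-character page yields three chunks.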

## Per-plan limits

The crawl page limit caps how many pages we'll fetch from your site. The knowledge sources limit (the "Documents + articles" column below) caps the combined number of uploaded documents and hand-written articles you can have per site.

| Plan | Crawl pages | Documents + articles |
|---|---:|---:|
| Starter | 500 | 10 |
| Pro | 5,000 | 50 |
| Business | 25,000 | 200 |

If your site has more pages than your plan allows, use exclude patterns to skip low-value sections (search results, archives, account pages) so the crawl focuses on the pages that matter for visitor questions.

## Keeping the assistant up to date

When you publish new content, re-run the crawl from the dashboard. Re-uploading a document replaces the old one; editing an article updates it in place. There's no automatic re-crawl yet — refresh manually whenever your site changes meaningfully.
