Knowledge sources and crawling

How the assistant learns what's on your site, what your PDFs say, and what you've written by hand.

Three knowledge sources

Every answer the assistant gives draws on one or more of three sources you control:

Crawled pages — Pages we fetch automatically from your website. Good for product pages, FAQs, blog posts — anything that's already public.
Uploaded documents — PDF, DOCX, and DOC files you upload. Good for price lists, terms, internal handbooks, anything that isn't on the public site.
Articles you write — Knowledge articles you author directly in the dashboard. Good for filling gaps the crawl missed or for answers you want phrased a specific way.

One site, multiple sources

A Clarifier site can pull content from more than one place. The site has a primary domain (set when you create it) and you can add additional crawl sources for sibling subdomains or specific URLs. All sources feed the same knowledge base — visitors get answers from any of your content regardless of which source it came from.

Primary domain — The URL you used to create the site. Cannot be removed; the widget can always be embedded here.
Additional subdomain — A sibling domain like blog.acme.com or shop.acme.com — must share the same registrable domain as the primary. Each subdomain source is scanned independently and gets its own include/exclude patterns. The widget can also be embedded on any added subdomain.
Single page — One specific URL — useful for a page the scan missed, a freshly published page you don't want to wait for the next full re-crawl to pick up, or a high-importance page outside your normal patterns.
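
The "same registrable domain" rule for additional subdomains can be illustrated with a rough sketch. This is not Clarifier's actual validation: the function names are hypothetical, and the short suffix set is an assumption standing in for the full Public Suffix List that a real check would consult.

```python
from urllib.parse import urlsplit

# Assumption: a tiny illustrative subset of multi-label public suffixes.
# A production check would use the complete Public Suffix List instead.
MULTI_LABEL_SUFFIXES = {"co.uk", "com.au", "co.jp"}


def registrable_domain(url_or_host: str) -> str:
    """Return the registrable domain (e.g. 'acme.com') for a URL or bare host."""
    # urlsplit only finds a hostname when a scheme is present, so fall back
    # to treating the input as a bare hostname.
    host = urlsplit(url_or_host).hostname or url_or_host
    labels = host.lower().rstrip(".").split(".")
    # Keep three labels when the last two form a known multi-label suffix.
    keep = 3 if ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES else 2
    return ".".join(labels[-keep:])


def can_add_subdomain(primary: str, candidate: str) -> bool:
    """True when the candidate shares the primary's registrable domain."""
    return registrable_domain(primary) == registrable_domain(candidate)
```

Under these assumptions, blog.acme.com and shop.acme.com would be accepted for a primary of acme.com, while acme.org would not.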

All sources share the same per-plan crawl-page limit. On Pro, for example, the 5,000 pages can be split however you like across the primary domain, subdomains, and single-page sources combined.

Scan vs. crawl

When you add a site, Clarifier first runs a scan: it reads your sitemap and a sample of pages to suggest URL patterns to include or exclude. The scan is fast and doesn't fetch every page. The crawl is the actual fetch — it follows the patterns you've approved and downloads each page's content. Both can be re-run from the dashboard whenever you change your site.
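
To make the scan step more concrete, here is a minimal sketch of how sampled URLs might be grouped into suggested patterns. Grouping by the first path segment is an assumption made for illustration, and suggest_patterns is a hypothetical name; the real scan may group URLs differently.

```python
from collections import Counter
from urllib.parse import urlsplit


def suggest_patterns(urls: list[str]) -> list[tuple[str, int]]:
    """Group URLs by first path segment and return (pattern, count) pairs."""
    counts = Counter()
    for url in urls:
        segments = [s for s in urlsplit(url).path.split("/") if s]
        # Pages with zero or one path segment are grouped under "/".
        pattern = f"/{segments[0]}/*" if len(segments) > 1 else "/"
        counts[pattern] += 1
    # Most frequent patterns first.
    return counts.most_common()
```

Each returned pattern could then be shown with its page count so you can decide whether to include or exclude it.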

Include and exclude patterns

After the scan you'll see URL patterns grouped by structure — for example, /produkter/*, /blog/*, /admin/*. Tick the ones to include in the crawl and untick the ones to skip. Common things to exclude: admin pages, search-result pages, paginated archives, customer-account pages. The crawl only fetches pages that match an included pattern and don't match any excluded one.
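
The include/exclude rule can be sketched as a small filter. This assumes shell-style glob matching on the URL path, where * matches any run of characters including slashes; should_crawl is a hypothetical name, and the crawler's exact matching semantics may differ.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def should_crawl(url: str, include: list[str], exclude: list[str]) -> bool:
    """Fetch a page only if it matches an included pattern and no excluded one."""
    path = urlsplit(url).path or "/"
    included = any(fnmatchcase(path, pattern) for pattern in include)
    excluded = any(fnmatchcase(path, pattern) for pattern in exclude)
    return included and not excluded


# Example patterns: include the blog and product pages, skip paginated archives.
INCLUDE = ["/blog/*", "/produkter/*"]
EXCLUDE = ["/admin/*", "/blog/page/*"]
```

With these example patterns, /blog/hello would be fetched, /blog/page/2 would be skipped because an exclude pattern wins, and /about would be skipped because no include pattern matches it.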

robots.txt and blocking

Clarifier respects robots.txt. If your robots.txt blocks our crawler, the dashboard will tell you and link to a fix — usually removing a Disallow rule that targets all crawlers. While the crawl is blocked you can still build a knowledge base from uploaded documents and hand-written articles; the assistant just won't know about your live pages.
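
If you want to check in advance whether a robots.txt file would block a crawler, Python's standard urllib.robotparser module can evaluate the rules locally. The sketch below is only an illustration: the user-agent string "ClarifierBot" is a placeholder assumption, not the crawler's documented name, and the two robots.txt bodies are made-up examples.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Evaluate robots.txt content against a user agent and URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A rule that targets all crawlers and blocks the whole site.
BLOCK_ALL = """User-agent: *
Disallow: /
"""

# A narrower rule that only blocks the admin area.
ALLOW_PUBLIC = """User-agent: *
Disallow: /admin/
"""

# "ClarifierBot" is a hypothetical user-agent name used for illustration.
blocked = is_allowed(BLOCK_ALL, "ClarifierBot", "https://acme.com/blog/")
allowed = is_allowed(ALLOW_PUBLIC, "ClarifierBot", "https://acme.com/blog/")
```

Here the first file blocks every page, which is the kind of rule that typically needs to be removed or narrowed, while the second keeps public pages crawlable.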

What happens after a crawl

Once content is fetched (whether from the crawl, a document upload, or an article), it goes through the same three-step processing pipeline:

1. The text is split into overlapping chunks of roughly 1,500 characters each, with a 200-character overlap so context isn't cut at chunk boundaries.
2. Each chunk is converted into a 1,536-dimensional embedding vector using OpenAI's text-embedding-3-small model.
3. Vectors are stored in our search index. When a visitor asks a question, we embed their question the same way and retrieve the closest chunks, which the language model uses to write the answer.
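
The chunking and retrieval steps above can be sketched as follows. This is a simplified illustration, not the production pipeline: it splits on raw character offsets, leaves out the embedding call entirely, and assumes cosine similarity as the measure of "closest", which the description above does not specify. The function names are hypothetical.

```python
import math


def chunk_text(text: str, size: int = 1500, overlap: int = 200) -> list[str]:
    """Split text into chunks of `size` characters that overlap by `overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        # Stop once this chunk reaches the end of the text.
        if start + size >= len(text):
            break
    return chunks


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 3) -> list[int]:
    """Return indices of the k chunk vectors most similar to the query."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]
```

With the default sizes, each new chunk starts 1,300 characters after the previous one, so the last 200 characters of one chunk are repeated at the start of the next.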

Per-plan limits

The crawl page limit caps how many pages we'll fetch across all of a site's crawl sources. The knowledge sources limit caps the combined number of uploaded documents and hand-written articles per site.

Plan	Crawl pages	Documents + articles
Starter	500	10
Pro	5,000	50
Business	25,000	200

If your site has more pages than your plan allows, use exclude patterns to skip low-value sections (search results, archives, account pages) so the crawl focuses on the pages that matter for visitor questions.
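
As a worked example of the shared budget, the sketch below adds up the pages from each crawl source and compares the total with the plan's limit from the table above. The function name and the per-source page counts are hypothetical.

```python
# Crawl-page limits per plan, taken from the table above.
CRAWL_PAGE_LIMITS = {"Starter": 500, "Pro": 5_000, "Business": 25_000}


def remaining_pages(plan: str, pages_per_source: dict[str, int]) -> int:
    """Return pages left in the shared budget; negative means over the limit."""
    return CRAWL_PAGE_LIMITS[plan] - sum(pages_per_source.values())


# Hypothetical page counts for a Pro site with three crawl sources.
sources = {"acme.com": 3200, "blog.acme.com": 1500, "single pages": 40}
left = remaining_pages("Pro", sources)  # 5000 - 4740 = 260
```

A negative result would mean the selected patterns cover more pages than the plan allows, which is the point at which excluding low-value sections helps.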

Keeping the assistant up to date

When you publish new content, re-run the crawl from the dashboard. Re-uploading a document replaces the old one; editing an article updates it in place. There's no automatic re-crawl yet — refresh manually whenever your site changes meaningfully.