Guide

HTML to Markdown for LangChain and LlamaIndex

A RAG pipeline is only as clean as the source material it loads. Paepae Stack fits before ingestion: clean the page into Markdown first, then load the result into your retrieval workflow.

Clean first, load second
Inspect the source before it becomes chunks and embeddings
Useful for both manual and early automated RAG pipelines

Ingestion path

Paepae Stack belongs before the document loader, not instead of it.

LangChain and LlamaIndex both give you ways to load and transform documents, but a noisy web page can still carry navigation, boilerplate, repeated links, and layout text into your chunks. Cleaning first makes the next stage easier to reason about.

Suggested flow

public page or copied HTML
-> Paepae Stack cleanup
-> inspected Markdown
-> document loader
-> splitter or ingestion pipeline
-> embeddings or retrieval store

Why clean first

Chunking noisy HTML usually preserves the wrong things.

If chunking starts from raw browser payloads, chunks can begin with navigation, repeat site-wide boilerplate, or split the useful article body in awkward places. Cleaning first lets you confirm that headings, code, and tables survived before they become retrieval inputs.

Manual staging workflow

Start with a few clean source documents before you automate further.

The main goal here is not to build the perfect ingestion stack in one move. It is to prove that the source shape is useful before you scale the pipeline.

Step 1

Run the public page or copied HTML through Paepae Stack.

Step 2

Copy or download the cleaned Markdown.

Step 3

Save it in your project as a source document.

Step 4

Load the Markdown with the loader or reader you already use.

Step 5

Split by headers or sections when the document structure supports it.

Step 6

Keep source URL, page title, and capture date in metadata.
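Steps 4 to 6 can be sketched without committing to a particular framework. The snippet below is a minimal, standard-library stand-in for a header-aware splitter, not the LangChain or LlamaIndex API: `Chunk` and `split_by_headings` are hypothetical names introduced for illustration, and the metadata keys are assumptions that mirror the fields named in Step 6.

```python
import re
from dataclasses import dataclass, field

# ATX-style Markdown headings: one to six '#' characters followed by a title.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class Chunk:
    heading_path: list[str]  # e.g. ["Widget API", "Authentication"]
    text: str
    metadata: dict = field(default_factory=dict)


def split_by_headings(markdown: str, metadata: dict) -> list[Chunk]:
    """Split cleaned Markdown into one chunk per heading section.

    Lines inside fenced code blocks are never treated as headings,
    so a '# comment' in a code sample stays with its section.
    """
    chunks: list[Chunk] = []
    path: list[tuple[int, str]] = []  # stack of (level, title)
    body: list[str] = []
    in_fence = False

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            titles = [title for _, title in path]
            chunks.append(Chunk(titles, text, dict(metadata)))
        body.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()  # close the previous section under its own heading path
            level = len(match.group(1))
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return chunks
```

Each chunk carries a copy of the source metadata, so a retrieval hit can be traced back to its page. If you already use a framework splitter, the same idea applies: pass the source URL, title, and capture date through as document metadata rather than reattaching them later.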

Example handoff shape

Keep metadata close to the cleaned source.

Source URL, title, and capture date make later QA, refresh work, and debugging much easier. That matters more than using one specific loader API.

Example Markdown document

---
source_url: https://example.com/docs/widget-api
title: Widget API
captured_at: 2026-04-29
prepared_with: Paepae Stack HTML to Markdown for AI
---

# Widget API

## Authentication

...
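A document shaped like the example above can be read back into metadata and body with a few lines of Python. This is a hedged sketch that assumes flat `key: value` front matter between two `---` lines, as in the example; `parse_front_matter` is a hypothetical helper, and nested or multi-line YAML would need a real YAML parser instead.

```python
def parse_front_matter(document: str) -> tuple[dict, str]:
    """Return (metadata, body) for a Markdown document with simple front matter.

    If the document has no opening or closing '---' line, the metadata
    is empty and the document is returned unchanged.
    """
    lines = document.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, document

    metadata: dict = {}
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            body = "\n".join(lines[index + 1:]).lstrip("\n")
            return metadata, body
        # Split on the first colon only, so URLs keep their 'https://' intact.
        key, separator, value = line.partition(":")
        if separator:
            metadata[key.strip()] = value.strip()

    return {}, document  # no closing delimiter found
```

The returned dictionary can be attached to whatever document object your loader produces, and the body is what goes on to the splitter.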

What to test

Trust the pipeline only after you inspect the retrieval outcomes.

A cleaner source should make the chunks easier to understand and the retrieval results easier to believe. If it does not, the source shape still needs work.

Test 1

Do retrieval results quote the main content instead of navigation?

Test 2

Do chunks start under meaningful headings?

Test 3

Did code blocks or tables survive in usable form?

Test 4

Can a human inspect the Markdown and understand the source?

Test 5

Does metadata preserve where the page came from?
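Some of these checks can be partly automated as a first pass before a human reads the chunks. The sketch below is illustrative only: `check_chunk`, the link-density threshold, and the required metadata keys are assumptions chosen for this example, and the heuristics flag candidates for review rather than prove a chunk is good.

```python
import re

# Inline Markdown links such as [Docs](/docs).
LINK = re.compile(r"\[[^\]]*\]\([^)]*\)")
REQUIRED_METADATA = ("source_url", "title", "captured_at")


def check_chunk(text: str, metadata: dict) -> list[str]:
    """Return warnings for one chunk; an empty list means nothing was flagged."""
    warnings: list[str] = []

    # Heuristic: navigation residue tends to be mostly links with little prose.
    link_count = len(LINK.findall(text))
    word_count = len(text.split())
    if link_count >= 3 and word_count < link_count * 4:
        warnings.append("looks like navigation (mostly links)")

    # An odd number of fence lines suggests a code block was cut in half.
    fence_lines = [ln for ln in text.splitlines() if ln.lstrip().startswith("```")]
    if len(fence_lines) % 2:
        warnings.append("unbalanced code fence")

    missing = [key for key in REQUIRED_METADATA if not metadata.get(key)]
    if missing:
        warnings.append("missing metadata: " + ", ".join(missing))

    return warnings
```

Running this over every chunk and reading only the flagged ones keeps the manual inspection small. It does not replace Test 1 or Test 4, which still need a person to look at real retrieval results and the Markdown itself.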

Common questions

Answer the implementation questions before you automate more aggressively.

These questions usually show up right before teams decide whether the cleanup layer is ready to become part of a real ingestion path.

Should I use raw HTML or Markdown for RAG?

Use raw HTML when DOM structure matters. Use Markdown when you want readable content structure without most of the navigation, boilerplate, and layout noise a browser page carries.

Can Paepae Stack replace my document loader?

No. Paepae Stack is a cleanup step before loading. Your RAG stack still needs its loader, splitter, embeddings, and retrieval layer.

Should I build a crawler first?

For a small source set, usually no. Start with a few cleaned pages and inspect the retrieval quality before you automate more aggressively.

Next steps

Keep the ingestion path moving with the next useful branch.

From here, the strongest follow-up is to run the cleanup, step back to the broader RAG framing, or validate the source shape before you automate more aggressively.

Run the cleanup on a source page

Open the tool when you want to convert a public page or copied HTML into a Markdown document before it reaches your loader.

Open HTML to Markdown for AI

Step back to the broader RAG framing

Use the RAG guide when you want the higher-level format, chunking, and retrieval explanation behind this developer-specific branch.

Read the RAG guide

Inspect chunks before loading

Use the Markdown Chunk Inspector when you want token estimates, warnings, and JSONL-ready records before a loader or splitter takes over.

Open chunk inspector

Compare chunking patterns

Use the chunking examples guide when you want to compare heading-based, token-window, paragraph-safe, and code/table-safe outputs before ingestion.

View chunking examples

Review examples before you automate further

Check cleanup examples when you want a clearer sense of what Paepae Stack preserves across docs, support, blog, and wiki-style pages.

View cleanup examples

Use the n8n workflow route when code is not the next step

Move into the n8n page when the same cleanup logic needs to feed a workflow tool instead of a code-first ingestion stack.

Read the n8n page