Parser Cleaning
Email bodies often contain forwarded-message wrappers, signatures, tracking links, legal footers, invisible characters, repeated whitespace, and newsletter boilerplate. MailAtlas exposes parser cleaning controls so you can tune output for the workflow that will consume it.
Use stricter cleaning when the output will feed retrieval chunks, model prompts, or analytics. Use lighter cleaning when you need forensic review or want to preserve the message as closely as possible.
Cleaning affects body_text and parser metadata. It does not remove the stored raw message from the workspace.
The current CLI defaults enable all parser-cleaning controls. Each flag supports a matching --no-... form when you want to disable one control for a run.
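The paired on/off flag behavior can be sketched with the Python standard library. This is a minimal illustration, assuming a CLI built on `argparse`. It is not MailAtlas's actual CLI code, and the program name `cleaning-flags-demo` and both helper functions are hypothetical. The control names are the ones listed on this page.

```python
import argparse

# Control names as listed on this page. Everything else here is a
# hypothetical stand-in, not MailAtlas's CLI implementation.
CONTROLS = [
    "strip_forwarded_headers",
    "strip_boilerplate",
    "strip_link_only_lines",
    "stop_at_footer",
    "strip_invisible_chars",
    "normalize_whitespace",
]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="cleaning-flags-demo")
    for name in CONTROLS:
        flag = "--" + name.replace("_", "-")
        # BooleanOptionalAction (Python 3.9+) registers both --flag and
        # --no-flag; default=True models "all controls enabled".
        parser.add_argument(
            flag, action=argparse.BooleanOptionalAction, default=True
        )
    return parser


def resolve_controls(argv: list[str]) -> dict[str, bool]:
    """Return the effective on/off state of every control for one run."""
    return vars(build_parser().parse_args(argv))


if __name__ == "__main__":
    # Only strip_boilerplate is switched off; the rest keep their defaults.
    print(resolve_controls(["--no-strip-boilerplate"]))
```

With this pattern, a run only needs to name the controls it wants to change; everything else falls back to the enabled default.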
Available controls
| Control | Use it when | Tradeoff |
|---|---|---|
| strip_forwarded_headers | You want to remove common forwarded-message headers from cleaned text. | May remove context that matters for forensic review. |
| strip_boilerplate | You want cleaner outputs from newsletters, templates, or automated emails. | May remove content that looks repetitive but is meaningful in some messages. |
| strip_link_only_lines | You want to remove lines that contain only links or tracking URLs. | May remove important standalone links. |
| stop_at_footer | You want to stop body extraction at a detected footer or signature boundary. | May truncate content if the footer detector is too aggressive. |
| strip_invisible_chars | You want to remove invisible Unicode characters that can interfere with matching or chunking. | Usually safe, but preserve raw messages for exact review. |
| normalize_whitespace | You want predictable spacing for search, export, or retrieval. | May change visual spacing from the original message. |
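To make the tradeoffs concrete, here is a rough sketch of what three of these controls could do to a piece of text. The regular expressions and the character list are assumptions chosen for illustration; MailAtlas's real detection rules are not documented on this page and may differ.

```python
import re

# Hypothetical stand-ins for three controls. These rules are assumptions,
# not MailAtlas's implementation.

# Zero-width space/joiners, word joiner, BOM, and soft hyphen, mapped to None
# so str.translate deletes them.
INVISIBLE = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff\u00ad"))

# A line that consists of one or more URLs and nothing else.
LINK_ONLY = re.compile(r"^\s*(?:<?https?://\S+>?\s*)+$")


def strip_invisible_chars(text: str) -> str:
    return text.translate(INVISIBLE)


def strip_link_only_lines(text: str) -> str:
    kept = [line for line in text.split("\n") if not LINK_ONLY.match(line)]
    return "\n".join(kept)


def normalize_whitespace(text: str) -> str:
    # Collapse runs of spaces/tabs within each line, then cap blank-line runs.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.split("\n")]
    collapsed = re.sub(r"\n{3,}", "\n\n", "\n".join(lines))
    return collapsed.strip()


if __name__ == "__main__":
    raw = (
        "Hi\u200b team,\n\n\n\n"
        "https://example.com/track?id=1\n"
        "See   the  notes at https://example.com/notes today.\n"
    )
    print(normalize_whitespace(strip_link_only_lines(strip_invisible_chars(raw))))
```

Note that the link inside a sentence survives while the standalone tracking link is dropped, which is exactly the case the "May remove important standalone links" tradeoff warns about.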
CLI usage
Apply cleaning controls during file ingest:

    mailatlas ingest sample-data/fixtures/eml/atlas-founder-forward.eml \
      --strip-forwarded-headers \
      --strip-boilerplate \
      --stop-at-footer

Disable one default control for a run:

    mailatlas ingest sample-data/fixtures/eml/atlas-founder-forward.eml \
      --no-strip-boilerplate

Apply the same controls during IMAP receive:

    mailatlas receive \
      --provider imap \
      --folder INBOX \
      --strip-boilerplate \
      --normalize-whitespace

Python usage

    from mailatlas import ParserConfig, parse_eml

    document = parse_eml(
        "sample-data/fixtures/eml/atlas-founder-forward.eml",
        parser_config=ParserConfig(
            strip_forwarded_headers=True,
            strip_boilerplate=True,
            stop_at_footer=True,
        ),
    )

Use ParserConfig with a storage-backed instance:

    from mailatlas import MailAtlas, ParserConfig

    atlas = MailAtlas(
        workspace_path=".mailatlas",
        db_path=".mailatlas/store.db",
        parser_config=ParserConfig(
            strip_boilerplate=True,
            normalize_whitespace=True,
        ),
    )

Choosing cleaning settings
Retrieval or RAG
Use more cleaning. Strip boilerplate, strip link-only lines, stop at footers when footer detection is reliable for your mail, and normalize whitespace. This reduces repeated content and usually produces cleaner chunks.
Archival or legal review
Use lighter cleaning. Preserve forwarded headers, avoid footer stopping unless you have reviewed where extraction stops, and keep raw messages linked to the stored document.
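One way to keep these choices repeatable is to name them as profiles. The sketch below is hypothetical: `CleaningProfile`, `RETRIEVAL`, `ARCHIVAL`, and `to_cli_flags` are illustrative names, not part of MailAtlas. The archival profile only turns off the two controls the guidance above calls out; whether to relax the others is a judgment call. Flag spellings for controls not shown in the CLI examples on this page are assumed to follow the same `--name` / `--no-name` pattern.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CleaningProfile:
    """Hypothetical container for the six control names on this page."""

    strip_forwarded_headers: bool = True
    strip_boilerplate: bool = True
    strip_link_only_lines: bool = True
    stop_at_footer: bool = True
    strip_invisible_chars: bool = True
    normalize_whitespace: bool = True


# Retrieval or RAG: everything on, mirroring the CLI defaults described here.
RETRIEVAL = CleaningProfile()

# Archival or legal review: keep forwarded headers and never stop at a footer.
ARCHIVAL = CleaningProfile(strip_forwarded_headers=False, stop_at_footer=False)


def to_cli_flags(profile: CleaningProfile) -> list[str]:
    """Render a profile as explicit --flag / --no-flag arguments.

    Assumes every control has a flag spelled like its name with hyphens.
    """
    flags = []
    for name, enabled in asdict(profile).items():
        flag = name.replace("_", "-")
        flags.append(f"--{flag}" if enabled else f"--no-{flag}")
    return flags


if __name__ == "__main__":
    print(" ".join(to_cli_flags(ARCHIVAL)))
```

If `ParserConfig` accepts all six names as keyword arguments (only four appear in the examples on this page, so this is an assumption), `asdict(profile)` could be unpacked into it directly.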
Parser development
Run the same fixture with different cleaning options and compare stored metadata and exports.

    mailatlas ingest sample-data/fixtures/eml/atlas-founder-forward.eml
    mailatlas ingest sample-data/fixtures/eml/atlas-founder-forward.eml --no-strip-boilerplate

Metadata
MailAtlas records parser notes in metadata, including:
- cleaning.removed_forwarded_headers
- cleaning.dropped_line_count
- cleaning.stopped_at_footer
- parser_config.*
- provenance.is_forwarded
- provenance.forwarded_chain
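When comparing two runs of the same fixture, it can help to flatten each document's metadata to dotted keys and diff them. The sketch below assumes the metadata is available as a nested dictionary whose paths match the dotted names above; that shape, the value types, and the sample values are all made up for illustration. `flatten` and `diff_metadata` are hypothetical helpers, not MailAtlas functions.

```python
def flatten(meta: dict, prefix: str = "") -> dict:
    """Turn {"cleaning": {"dropped_line_count": 3}} into dotted keys."""
    flat = {}
    for key, value in meta.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def diff_metadata(before: dict, after: dict) -> dict:
    """Return {dotted_key: (before_value, after_value)} for changed keys."""
    a, b = flatten(before), flatten(after)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(a.keys() | b.keys())
        if a.get(key) != b.get(key)
    }


if __name__ == "__main__":
    # Hypothetical metadata from two runs; the values are invented.
    default_run = {
        "cleaning": {"dropped_line_count": 12, "stopped_at_footer": True},
        "parser_config": {"strip_boilerplate": True},
        "provenance": {"is_forwarded": True},
    }
    no_boilerplate_run = {
        "cleaning": {"dropped_line_count": 3, "stopped_at_footer": True},
        "parser_config": {"strip_boilerplate": False},
        "provenance": {"is_forwarded": True},
    }
    for key, (old, new) in diff_metadata(default_run, no_boilerplate_run).items():
        print(f"{key}: {old} -> {new}")
```

Keys that are unchanged between runs are omitted, so the output shows only what a given cleaning option actually affected.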
Why it matters
Cleaning choices affect downstream behavior. Search indexes, retrieval systems, agents, audits, and exports can all need different representations. MailAtlas keeps raw artifacts linked so you can tune cleaned outputs without losing the original message.
Next step
- Use Document Schema to inspect cleaning metadata.
- Use Quickstart to run cleaning against local files.
- Use IMAP Receive to apply cleaning during mailbox receive.