Document Schema

A MailAtlas document is the normalized record created from an inbound email message. Documents are stored in SQLite and linked to local files in the workspace root.

A document can come from:

  • An .eml file.
  • One message inside an mbox archive.
  • A message fetched from an IMAP folder.

The schema is alpha. Field names and compatibility guarantees may evolve before a stable schema version is published.

Use this page when you are building against stored document output. See the Workspace Model page when you need to know where linked files live on disk.

| Field | Description |
| --- | --- |
| id | Stable MailAtlas document ID used by CLI, Python API, exports, and asset references. |
| source_kind | Input path that produced the document, such as eml, mbox, or imap. |
| message_id | Email Message-ID header when present. Used for dedupe when available. |
| thread_id | Thread identifier when MailAtlas can derive or store one. |
| subject | Email subject. |
| sender_name | Display name from the sender header when present. |
| sender_email | Sender email address when present. |
| author | Normalized author string used by downstream consumers. |
| received_at | Received timestamp when available. |
| published_at | Publication-like timestamp when MailAtlas derives one for downstream use. |
| body_text | Cleaned plain-text body. |
| body_html_path | Path to the normalized HTML snapshot when available. |
| raw_path | Path to the stored raw email bytes. |
| content_hash | Normalized hash used for dedupe when no usable message_id exists. |
| metadata | Parser notes, provenance, parser configuration, and source-specific details. |
| created_at | Timestamp when MailAtlas created the stored document record. |
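
The field table describes a dedupe precedence: message_id when it is available, content_hash otherwise. The following is a minimal sketch of that precedence, assuming a document has been loaded as a plain dict with the field names above. `dedupe_key` is a hypothetical helper for illustration, not a MailAtlas API, and MailAtlas's internal dedupe logic may differ.

```python
from typing import Any, Mapping, Optional


def dedupe_key(doc: Mapping[str, Any]) -> Optional[str]:
    """Return a key for spotting duplicate documents.

    Hypothetical helper (not part of MailAtlas). It mirrors the precedence
    described in the field table: message_id first, content_hash as fallback.
    """
    message_id = (doc.get("message_id") or "").strip()
    if message_id:
        return f"message_id:{message_id}"

    content_hash = (doc.get("content_hash") or "").strip()
    if content_hash:
        return f"content_hash:{content_hash}"

    # Neither field is usable, so the caller has to decide how to treat it.
    return None
```

Prefixing the key with the field it came from keeps a Message-ID and a hash from colliding if they ever share the same string.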

Assets are files extracted from the message and associated with a document.

| Field | Description |
| --- | --- |
| id | Asset ID. |
| document_id | ID of the document that owns the asset. |
| ordinal | Asset order within the document. |
| kind | inline for embedded HTML assets, or attachment for regular file attachments. |
| mime_type | MIME type when known. |
| file_path | Local path to the copied asset. |
| cid | Content ID for inline assets when present. |
| sha256 | SHA-256 hash of the stored asset file when available. |
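
Because each asset record can carry a sha256 of the stored file, application code can check that a file on disk still matches its record. The sketch below assumes the hash is stored as a hex digest and that file_path is relative to the workspace root. Both are assumptions, and `asset_hash_matches` is a hypothetical helper rather than a MailAtlas function.

```python
import hashlib
from pathlib import Path
from typing import Any, Mapping, Optional


def asset_hash_matches(asset: Mapping[str, Any], workspace_root: Path) -> Optional[bool]:
    """Compare an asset file on disk with its recorded sha256.

    Hypothetical helper. Assumes a hex-encoded digest and a file_path that is
    relative to the workspace root. Returns None when no hash was recorded.
    """
    recorded = asset.get("sha256")
    if not recorded:
        return None

    digest = hashlib.sha256()
    with (workspace_root / asset["file_path"]).open("rb") as handle:
        # Read in chunks so large attachments are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)

    return digest.hexdigest() == recorded.lower()
```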

metadata carries parser notes and provenance so application code can inspect what happened during parsing and cleaning.

| Metadata path | Description |
| --- | --- |
| provenance.is_forwarded | Whether MailAtlas detected forwarded-message structure. |
| provenance.forwarded_chain | Forwarded-chain details when detected. |
| cleaning.removed_forwarded_headers | Whether forwarded headers were removed. |
| cleaning.dropped_line_count | Number of lines removed during cleaning. |
| cleaning.stopped_at_footer | Whether cleaning stopped at a detected footer. |
| parser_config.* | Parser configuration used for the run. |
| source.kind | Source kind, such as eml, mbox, or imap. |
| source.host | IMAP host for synced documents when present. |
| source.folder | IMAP folder for synced documents when present. |
| source.uid | IMAP UID for synced documents when present. |
| source.uidvalidity | IMAP UIDVALIDITY value for synced documents when present. |

    {
      "id": "doc_123",
      "source_kind": "imap",
      "message_id": "<message-id>",
      "subject": "Daily market digest",
      "sender_name": "Example Sender",
      "sender_email": "<sender-email>",
      "body_text": "Cleaned message text...",
      "body_html_path": "html/doc_123.html",
      "raw_path": "raw/doc_123.eml",
      "content_hash": "<hash>",
      "metadata": {
        "source": {
          "kind": "imap",
          "host": "imap.example.com",
          "folder": "INBOX",
          "uid": "4812",
          "uidvalidity": "11"
        },
        "cleaning": {
          "dropped_line_count": 0,
          "stopped_at_footer": false
        },
        "provenance": {
          "is_forwarded": false
        }
      },
      "assets": [
        {
          "id": "asset_1",
          "document_id": "doc_123",
          "ordinal": 1,
          "kind": "inline",
          "mime_type": "image/svg+xml",
          "file_path": "assets/doc_123/001-route-heatmap.svg",
          "cid": "route-heatmap",
          "sha256": "<sha256>"
        }
      ],
      "created_at": "2026-04-18T15:05:00Z"
    }
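
Many metadata entries are qualified with "when present" or "when detected", so code that reads them should tolerate missing keys. A minimal sketch of a dotted-path lookup over the metadata object follows, assuming the document has been loaded into a Python dict (for example with `json.load`). `metadata_get` is a hypothetical helper, not part of MailAtlas.

```python
from collections.abc import Mapping
from typing import Any


def metadata_get(metadata: Mapping, dotted_path: str, default: Any = None) -> Any:
    """Look up a path such as "cleaning.dropped_line_count" in nested metadata.

    Hypothetical helper. Returns `default` as soon as any segment is missing.
    """
    current: Any = metadata
    for key in dotted_path.split("."):
        if not isinstance(current, Mapping) or key not in current:
            return default
        current = current[key]
    return current


# A trimmed copy of the metadata object from the example document.
metadata = {
    "source": {"kind": "imap", "folder": "INBOX"},
    "cleaning": {"dropped_line_count": 0, "stopped_at_footer": False},
    "provenance": {"is_forwarded": False},
}

dropped = metadata_get(metadata, "cleaning.dropped_line_count")
chain = metadata_get(metadata, "provenance.forwarded_chain", default=[])
```

Here `dropped` is read from the stored value, while `chain` falls back to the supplied default because the example has no forwarded_chain entry.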

For single-message files, source_kind is eml. Metadata records the source path or equivalent source reference when available.

For mailbox archives, source_kind is mbox. Metadata identifies the mailbox file and message position or source reference when available.

For synced messages, source_kind is imap. metadata.source.* records the mailbox folder and UID details that produced the stored document.
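
In IMAP, a UID identifies a message only within one mailbox and one UIDVALIDITY value, so an application that wants to refer back to the originating message usually needs the whole set of source fields. The sketch below builds such an identity from metadata.source.*, assuming the document is a plain dict shaped like the example above. `imap_identity` is a hypothetical helper, not a MailAtlas API.

```python
from typing import Any, Mapping, Optional, Tuple


def imap_identity(doc: Mapping[str, Any]) -> Optional[Tuple[str, str, str, str]]:
    """Return (host, folder, uidvalidity, uid) for an IMAP-synced document.

    Hypothetical helper. Returns None for non-IMAP documents or when any of
    the four source fields is missing.
    """
    if doc.get("source_kind") != "imap":
        return None

    source = (doc.get("metadata") or {}).get("source") or {}
    parts = [source.get(key) for key in ("host", "folder", "uidvalidity", "uid")]
    if any(part is None for part in parts):
        return None

    # Normalize to strings so values compare equally whether stored as text or numbers.
    host, folder, uidvalidity, uid = (str(part) for part in parts)
    return (host, folder, uidvalidity, uid)
```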

Paths such as raw_path, body_html_path, and assets[].file_path resolve inside the workspace root unless a specific export mode writes a separate bundle path.
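
A minimal sketch of resolving a stored path against the workspace root follows, assuming the stored value is a relative path like those in the example document. The `resolve_in_workspace` helper and the guard against paths that escape the root are illustrative assumptions, not documented MailAtlas behavior.

```python
from pathlib import Path


def resolve_in_workspace(workspace_root: Path, stored_path: str) -> Path:
    """Resolve raw_path, body_html_path, or an asset file_path inside the workspace.

    Hypothetical helper. Raises ValueError if the result would fall outside
    the workspace root (for example through ".." segments).
    """
    root = workspace_root.resolve()
    candidate = (root / stored_path).resolve()
    if candidate != root and root not in candidate.parents:
        raise ValueError(f"{stored_path!r} resolves outside the workspace root")
    return candidate
```

For instance, `resolve_in_workspace(Path("workspace"), "raw/doc_123.eml")` returns an absolute path ending in `workspace/raw/doc_123.eml`. Documents written by an export mode with a separate bundle path would need that bundle directory passed as the root instead.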