feat: support document/archive extensions in MEDIA: tag extraction #8255

Open
huangke19 wants to merge 1 commit into NousResearch:main from huangke19:feat/media-document-extensions

Problem

The extract_media() regex in gateway/platforms/base.py only matched audio, video, and image extensions (png|jpe?g|gif|webp|mp4|...|m4a). Document formats such as .epub, .pdf, and .zip were not matched explicitly, so MEDIA:/path/to/file.epub fell through to the generic \S+ branch, which can fail silently depending on the path format.

The send routing (line 1705) already has an else branch that calls send_document() for any file that is not audio, video, or image, so the delivery infrastructure was already in place; only the extraction regex was too narrow.
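To make the routing argument concrete, here is a minimal sketch of extension-based dispatch with a catch-all document branch. It is not the code at line 1705: the extension sets and the names send_image, send_video, send_audio, and pick_sender are assumptions for illustration. Only send_document() is named in this PR.

```python
from pathlib import Path

# Hypothetical extension sets; the real lists in base.py may differ.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".gif", ".webp"}
VIDEO_EXTS = {".mp4"}
AUDIO_EXTS = {".m4a"}


def pick_sender(path: str) -> str:
    """Return the name of the send method a path would be routed to."""
    ext = Path(path).suffix.lower()
    if ext in IMAGE_EXTS:
        return "send_image"  # assumed name
    if ext in VIDEO_EXTS:
        return "send_video"  # assumed name
    if ext in AUDIO_EXTS:
        return "send_audio"  # assumed name
    # Anything else, including .epub or .pdf, lands in the document branch.
    return "send_document"
```

Under this shape, a document path only needs to be extracted correctly; no routing change is required for it to reach send_document().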

Fix

Add common document and archive extensions to the extraction regex:
epub|pdf|zip|rar|7z|docx?|xlsx?|pptx?|txt|csv|apk|ipa
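A rough sketch of what widening the extension alternation looks like. The overall pattern shape and the names MEDIA_EXTS, DOC_EXTS, MEDIA_RE, and extract_media_paths are assumptions, not the actual regex in base.py. The media list contains only the extensions spelled out in this description; the ones elided by "..." are left out.

```python
import re

# Media extensions named in the PR description (the elided ones are omitted).
MEDIA_EXTS = r"png|jpe?g|gif|webp|mp4|m4a"
# Document and archive extensions added by this PR.
DOC_EXTS = r"epub|pdf|zip|rar|7z|docx?|xlsx?|pptx?|txt|csv|apk|ipa"

# Assumed pattern shape: a MEDIA: tag followed by a path that ends in a known
# extension. The extension group is case-insensitive; the tag itself is not.
MEDIA_RE = re.compile(rf"MEDIA:(\S+\.(?i:{MEDIA_EXTS}|{DOC_EXTS}))\b")


def extract_media_paths(text: str) -> list[str]:
    """Return every path that follows a MEDIA: tag and has a known extension."""
    return MEDIA_RE.findall(text)
```

Keeping the two extension groups as separate constants makes it easy to see which entries are new, at the cost of a slightly longer pattern.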

Testing

Verified the updated regex compiles and correctly matches:

  • MEDIA:/path/to/book.epub ✓
  • MEDIA:/tmp/report.pdf ✓
  • Existing image/video/audio paths ✓
  • Non-MEDIA text (no false positives) ✓
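The checks listed above could be captured as a small table-driven test along these lines. This is a sketch under assumptions: PATTERN stands in for the real regex in base.py, its shape is a guess, and the /tmp/photo.jpeg case is an illustrative example of an existing media path.

```python
import re

# Assumed pattern shape; not the actual regex from base.py.
EXTS = (
    r"png|jpe?g|gif|webp|mp4|m4a"
    r"|epub|pdf|zip|rar|7z|docx?|xlsx?|pptx?|txt|csv|apk|ipa"
)
PATTERN = re.compile(rf"MEDIA:(\S+\.(?i:{EXTS}))\b")

# (input text, expected extracted paths)
CASES = [
    ("MEDIA:/path/to/book.epub", ["/path/to/book.epub"]),
    ("MEDIA:/tmp/report.pdf", ["/tmp/report.pdf"]),
    ("MEDIA:/tmp/photo.jpeg", ["/tmp/photo.jpeg"]),  # existing media type
    ("plain text mentioning /tmp/report.pdf", []),   # no MEDIA: tag
]


def failing_cases():
    """Return (text, actual) pairs whose extraction differs from the expectation."""
    failures = []
    for text, expected in CASES:
        actual = PATTERN.findall(text)
        if actual != expected:
            failures.append((text, actual))
    return failures
```

An empty list from failing_cases() means every case matched its expectation.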

Add epub, pdf, zip, rar, 7z, docx, xlsx, pptx, txt, csv, apk, ipa to
the MEDIA: path regex in extract_media(). These file types were already
routed to send_document() in the delivery loop (base.py:1705), but the
extraction regex only matched media extensions (audio/video/image), so
document paths fell through to the generic \S+ branch, which could fail
silently in some cases. The explicit list ensures reliable matching and
delivery for common document formats.