An open-source implementation for fine-tuning the Qwen-VL series by Alibaba Cloud.
Run generative AI models on SOPHGO BM1684X/BM1688.
🍑 relsim: Relational Visual Similarity | pip install relsim 🌍 (CVPR 2026)
Automated image & video captioning using Qwen-VL, Gemma4 and SAM3.
Two lines of code to add absolute time awareness to Qwen2.5-VL's MRoPE.
A batched implementation for efficient Qwen2.5-VL inference.
Dedicated Colab notebooks for experimenting with OCR models (Nanonets OCR, Monkey OCR, OCRFlux 3B, Typhoon OCR 3B & more) on a free-tier T4 GPU.
Qwen-Image-Edit-2509-LoRAs-Fast is a high-performance, user-friendly web application built with Gradio that leverages the advanced Qwen/Qwen-Image-Edit-2509 model from Hugging Face for seamless image editing tasks.
Multimodal-OCR is an experimental, high-performance visual reasoning and optical character recognition suite designed to accurately extract text, analyze visual content, and parse complex document structures, built on a diverse ecosystem of cutting-edge vision-language models.
[AAAI'26] Official implementation of CMMCoT: Enhancing Complex Multi-Image Comprehension via Multi-Modal Chain-of-Thought and Memory Augmentation
ArtSeek: Deep artwork understanding via multimodal in-context reasoning and late interaction retrieval
A Gradio-based demo application for comparing state-of-the-art OCR models: DeepSeek-OCR, Dots.OCR, HunyuanOCR, and Nanonets-OCR2-3B.
Qwen3-VL-Outpost is an experimental, high-performance visual reasoning and multimodal inference suite designed for advanced image analysis, optical character recognition, and complex scene understanding, built around the state-of-the-art Qwen3-VL and Qwen2.5-VL model families.
Multimodal-OCR3 is a highly capable, experimental optical character recognition and visual processing suite designed for precise text extraction, document parsing, and markdown generation, leveraging a powerful selection of vision-language models.
A Gradio-based demonstration for the AllenAI SAGE-MM-Qwen3-VL-4B-SFT_RL multimodal model, specialized in video reasoning tasks. Users upload MP4 videos, provide natural language prompts (e.g., "Describe this video in detail" or custom questions), and receive detailed textual analyses.
QIE-Bbox-Studio (Qwen Image Edit Bounding Box Studio) is an advanced AI-powered image editing interface built on top of the Qwen2.5-VL and Qwen-Image-Edit models. This application allows users to manipulate images with extreme precision by defining bounding boxes and providing natural language prompts.
Qwen-Image-Edit-2509-LoRAs-Fast-Fusion is a fast, interactive web application built with Gradio that enables advanced image editing using the Qwen/Qwen-Image-Edit-2509 model from Alibaba's Qwen team. It leverages specialized LoRA adapters for efficient, low-step inference (as few as 4 steps).
RxLM-Med: A multimodal clinical AI agent featuring System 2 reasoning, cross-lingual hierarchical RAG (BM25 + FAISS + RRF), deterministic medical calculation engine, and Traffic Light Protocol (TLP) safety alignment — built on Qwen-VL with LoRA fine-tuning, SFT/DPO alignment, and INT4 quantization for real-world lab report interpretation.