📚 VibeVoice Audiobook Converter

Transform your text documents into professional audiobooks with Microsoft's VibeVoice TTS technology.

This is a personal fork of Microsoft VibeVoice with added audiobook conversion capabilities. The original VibeVoice framework is an open-source research project for generating expressive, long-form, multi-speaker conversational audio.

✨ What This Project Does

This audiobook converter provides a complete pipeline for converting text files into professional audiobooks. It uses Microsoft's VibeVoice-Realtime-0.5B model for high-quality text-to-speech generation with GPU acceleration, automatic chapter detection, and multiple voice options.

🎯 Key Features

📖 Automatic Chapter Detection - Recognizes patterns like "Chapter 1", "Chapter One", "CHAPTER I"
🎯 Intelligent Text Processing - Removes code blocks, formulas, and special symbols
🧩 Smart Chunking - Splits long text at sentence boundaries with overlap for continuity
🎤 7 Premium Voices - Emma, Grace (female) | Carter, Davis, Frank, Mike, Samuel (male)
⚡ GPU Acceleration - Roughly 2-3x faster generation than CPU when CUDA is available
🔊 High Quality Audio - 44.1kHz WAV output with configurable silence between chapters
📝 Template Scripts - Simple one-command usage for female or male voices
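
To make the chapter detection and chunking features above more concrete, here is a minimal sketch of how they could work. It is not the project's actual implementation: the names `CHAPTER_RE`, `split_chapters`, and `chunk_sentences`, the `max_chars` and `overlap` parameters, and the regex-based sentence splitter are all assumptions made for illustration. The Quick Start downloads NLTK's punkt data, so the real pipeline likely splits sentences with NLTK; a plain regex is used here only to keep the example dependency-free.

```python
import re

# Hypothetical pattern: "Chapter" followed by digits, a Roman numeral,
# or a spelled-out number (one to twenty), matched case-insensitively.
_NUMBER_WORDS = (
    "one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve|"
    "thirteen|fourteen|fifteen|sixteen|seventeen|eighteen|nineteen|twenty"
)
CHAPTER_RE = re.compile(
    rf"^[ \t]*chapter[ \t]+(\d+|[ivxlcdm]+|{_NUMBER_WORDS})\b.*$",
    re.IGNORECASE | re.MULTILINE,
)


def split_chapters(text):
    """Return (heading, body) pairs; text before the first heading is 'Front matter'."""
    matches = list(CHAPTER_RE.finditer(text))
    if not matches:
        return [("Untitled", text.strip())]
    chapters = []
    preface = text[: matches[0].start()].strip()
    if preface:
        chapters.append(("Front matter", preface))
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chapters.append((match.group(0).strip(), text[match.end():end].strip()))
    return chapters


def chunk_sentences(text, max_chars=400, overlap=1):
    """Group sentences into chunks of roughly max_chars characters.

    Each new chunk starts with the last `overlap` sentences of the previous
    one. A chunk can exceed max_chars when the carried-over sentences plus
    the next sentence are already longer than the limit.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "Preface text.\n\nChapter 1\nFirst body.\n\nCHAPTER II: The Return\nSecond body."
for heading, body in split_chapters(sample):
    print(heading, "->", chunk_sentences(body))
```

How the overlapping sentences are handled during audio generation (for example, whether they are trimmed again before merging) is not described here, so treat the overlap logic as one possible reading of "overlap for continuity".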

🚀 Quick Start

# Install audiobook dependencies
pip install -r requirements_audiobook.txt
python -c "import nltk; nltk.download('punkt')"

# Create folders and add your book
mkdir -p input_books output_audiobooks
cp your_book.txt input_books/

# Generate audiobook with female voice (Emma)
python demo/audiobook/templates/female_template.py

# Or male voice (Carter)
python demo/audiobook/templates/male_template.py

# Your audiobook is ready!
ls output_audiobooks/

📊 Performance

Device	Speed (tokens/s)	Time for 300 words
CPU	2-3	~10-15 min
GPU (RTX 4060)	4-9	~5 min
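
As a rough planning aid, the 300-word timings in the table can be scaled to a full book. The sketch below assumes generation time grows linearly with word count and ignores model loading and merging overhead; the 60,000-word book length and the `estimate_minutes` helper are hypothetical values chosen for illustration.

```python
def estimate_minutes(word_count, minutes_per_300_words):
    """Scale the table's 300-word timing linearly to a longer text."""
    return word_count / 300 * minutes_per_300_words


# Hypothetical 60,000-word book; per-300-word timings come from the table above.
book_words = 60_000
gpu_hours = estimate_minutes(book_words, 5) / 60
cpu_hours_low = estimate_minutes(book_words, 10) / 60
cpu_hours_high = estimate_minutes(book_words, 15) / 60

print(f"GPU: ~{gpu_hours:.1f} h, CPU: ~{cpu_hours_low:.1f}-{cpu_hours_high:.1f} h")
# prints: GPU: ~16.7 h, CPU: ~33.3-50.0 h
```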

📖 Documentation

Quick Start Guide - Installation and basic usage
Architecture - System design and components
Troubleshooting - Common issues and solutions
Technical Reference - Complete implementation details

🎯 Roadmap

Phase 1: TXT → WAV ✅ Complete

Text file parsing with chapter detection
Text cleaning and intelligent chunking
Audio generation with multiple voices
WAV merging with configurable silence
GPU acceleration support
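
The WAV merging step listed above can be illustrated with Python's standard `wave` module. This is a sketch under stated assumptions, not the project's code: the function name `merge_wavs` and the `silence_seconds` parameter are hypothetical, and all input files are assumed to be uncompressed PCM with the same channel count, sample width, and sample rate.

```python
import wave


def merge_wavs(paths, out_path, silence_seconds=1.0):
    """Concatenate same-format WAV files, inserting silence between them."""
    params = None
    frames = []
    for path in paths:
        with wave.open(path, "rb") as src:
            fmt = (src.getnchannels(), src.getsampwidth(), src.getframerate())
            if params is None:
                params = fmt
            elif fmt != params:
                raise ValueError(f"{path} has format {fmt}, expected {params}")
            frames.append(src.readframes(src.getnframes()))
    if params is None:
        raise ValueError("no input files given")

    channels, width, rate = params
    # Zero bytes are silence for signed PCM (16-bit and wider);
    # 8-bit WAV samples are unsigned, so silence there is 0x80.
    silent_byte = b"\x80" if width == 1 else b"\x00"
    silence = silent_byte * (int(rate * silence_seconds) * channels * width)

    with wave.open(out_path, "wb") as dst:
        dst.setnchannels(channels)
        dst.setsampwidth(width)
        dst.setframerate(rate)
        # join() places the silence only between chapters, not at the ends.
        dst.writeframes(silence.join(frames))
    return out_path
```

A call such as `merge_wavs(["ch1.wav", "ch2.wav"], "book.wav", silence_seconds=2.0)` would write both chapters to one file with a two-second gap; the file names here are placeholders.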

Phase 2: Advanced Formats (Planned)

EPUB support with chapter extraction
Markdown support
M4B output with chapter markers

📘 About the Original VibeVoice Framework

This project is built on Microsoft's VibeVoice technology. For the official implementation and more information, visit:

Official Repository: microsoft/VibeVoice
Project Page: microsoft.github.io/VibeVoice
Model on Hugging Face: VibeVoice-Realtime-0.5B
Technical Report: arxiv.org/pdf/2508.19205

Overview

VibeVoice is a novel framework designed for generating expressive, long-form, multi-speaker conversational audio, such as podcasts, from text. It addresses significant challenges in traditional Text-to-Speech (TTS) systems, particularly in scalability, speaker consistency, and natural turn-taking.

VibeVoice currently includes two model variants:

Long-form multi-speaker model: Synthesizes conversational/single-speaker speech up to 90 minutes with up to 4 distinct speakers, surpassing the typical 1–2 speaker limits of many prior models.
Realtime streaming TTS model: Produces initial audible speech in ~300 ms and supports streaming text input for single-speaker real-time speech generation; designed for low-latency generation.

A core innovation of VibeVoice is its use of continuous speech tokenizers (Acoustic and Semantic) operating at an ultra-low frame rate of 7.5 Hz. These tokenizers efficiently preserve audio fidelity while significantly boosting computational efficiency for processing long sequences. VibeVoice employs a next-token diffusion framework, leveraging a Large Language Model (LLM) to understand textual context and dialogue flow, and a diffusion head to generate high-fidelity acoustic details.
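
To see why the 7.5 Hz frame rate matters for long sequences, the following arithmetic sketch counts how many tokenizer frames a 90-minute recording would need. The 50 Hz comparison rate is a hypothetical figure chosen only for contrast, and the helper name `frames_for` is illustrative. The count covers speech frames only; the full model input also includes text tokens, which are not estimated here.

```python
FRAME_RATE_HZ = 7.5  # tokenizer frame rate stated in the text above


def frames_for(minutes, frame_rate_hz=FRAME_RATE_HZ):
    """Number of tokenizer frames needed to cover `minutes` of audio."""
    return int(minutes * 60 * frame_rate_hz)


print(frames_for(90))        # 40500 frames at 7.5 Hz
print(frames_for(90, 50.0))  # 270000 frames at a hypothetical 50 Hz
```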

🎵 Demo Examples

Video Demo

We produced this video with Wan2.2. We sincerely appreciate the Wan-Video team for their great work.

English: ES_._3.mp4
Chinese: default.mp4
Cross-Lingual: 1p_EN2CH.mp4
Spontaneous Singing: 2p_see_u_again.mp4
Long Conversation with 4 people: 4p_climate_45min.mp4

For more examples, see the Project Page.

Risks and limitations

While efforts have been made to optimize the model through various techniques, it may still produce outputs that are unexpected, biased, or inaccurate. VibeVoice inherits any biases, errors, or omissions produced by its base model (specifically, Qwen2.5 1.5b in this release).

Potential for Deepfakes and Disinformation: High-quality synthetic speech can be misused to create convincing fake audio content for impersonation, fraud, or spreading disinformation. Users must ensure transcripts are reliable, check content accuracy, and avoid using generated content in misleading ways. Users are expected to use the generated content and deploy the models lawfully, in full compliance with all applicable laws and regulations in the relevant jurisdictions. It is best practice to disclose the use of AI when sharing AI-generated content.

English and Chinese only: Transcripts in languages other than English or Chinese may result in unexpected audio outputs.

Non-Speech Audio: The model focuses solely on speech synthesis and does not handle background noise, music, or other sound effects.

Overlapping Speech: The current model does not explicitly model or generate overlapping speech segments in conversations.

We do not recommend using VibeVoice in commercial or real-world applications without further testing and development. This model is intended for research and development purposes only. Please use responsibly.

🤝 Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

Areas where help is needed:

Phase 2 Features: EPUB parser, Markdown support, M4B output
Testing: More comprehensive test coverage
Documentation: Additional examples and tutorials
Bug Reports: Found an issue? Please open a GitHub issue

📝 License

This project inherits the MIT License from the original VibeVoice project. See the original repository for license details.

🙏 Acknowledgments

Microsoft Research for developing and open-sourcing VibeVoice
The VibeVoice team for their groundbreaking work in TTS technology
All contributors to this audiobook converter project
