Commit 947bc23

Authored by RuixiangMa, github-actions[bot], and dg845

[chore] Add diffusers-format example to LongCatAudioDiTPipeline (#13483)

* [chore] Add diffusers-format example and seed parameter to LongCatAudioDiTPipeline
* Apply style fixes
* Apply suggestions from code review
* upd
* Apply style fixes

Signed-off-by: Lancer <maruixiang6688@gmail.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>

1 parent 71a6fd9 · commit 947bc23

2 files changed: 39 additions & 16 deletions

docs/source/en/api/pipelines/longcat_audio_dit.md — 13 additions & 16 deletions

````diff
@@ -14,15 +14,10 @@ specific language governing permissions and limitations under the License.
 
 LongCat-AudioDiT is a text-to-audio diffusion model from Meituan LongCat. The diffusers integration exposes a standard [`DiffusionPipeline`] interface for text-conditioned audio generation.
 
-This pipeline supports loading the original flat LongCat checkpoint layout from either a local directory or a Hugging Face Hub repository containing:
-
-- `config.json`
-- `model.safetensors`
-
-The loader builds the text encoder, transformer, and VAE from `config.json`, restores component weights from `model.safetensors`, and ties the shared UMT5 embedding when needed.
-
 This pipeline was adapted from the LongCat-AudioDiT reference implementation: https://github.com/meituan-longcat/LongCat-AudioDiT
 
+This pipeline supports loading from a local directory or Hugging Face Hub repository in diffusers format (containing `text_encoder/`, `transformer/`, `vae/`, `tokenizer/`, and `scheduler/` subfolders).
+
 ## Usage
 
 ```py
@@ -31,27 +26,29 @@ import torch
 from diffusers import LongCatAudioDiTPipeline
 
 pipeline = LongCatAudioDiTPipeline.from_pretrained(
-    "meituan-longcat/LongCat-AudioDiT-1B",
+    "ruixiangma/LongCat-AudioDiT-1B-Diffusers",
     torch_dtype=torch.float16,
 )
 pipeline = pipeline.to("cuda")
 
+prompt = "A calm ocean wave ambience with soft wind in the background."
 audio = pipeline(
-    prompt="A calm ocean wave ambience with soft wind in the background.",
-    audio_end_in_s=5.0,
+    prompt,
+    audio_duration_s=5.0,
     num_inference_steps=16,
     guidance_scale=4.0,
-    output_type="pt",
-).audios
+    generator=torch.Generator("cuda").manual_seed(42),
+).audios[0, 0]
 
-output = audio[0, 0].float().cpu().numpy()
-sf.write("longcat.wav", output, pipeline.sample_rate)
+sf.write("longcat.wav", audio, pipeline.sample_rate)
 ```
 
 ## Tips
 
-- `audio_end_in_s` is the most direct way to control output duration.
-- `output_type="pt"` returns a PyTorch tensor shaped `(batch, channels, samples)`.
+- `audio_duration_s` is the most direct way to control output duration.
+- Use `generator=torch.Generator("cuda").manual_seed(42)` to make generation reproducible.
+- Output shape is `(batch, channels, samples)` - use `.audios[0, 0]` to get a single audio sample.
+- The pipeline outputs mono audio (1 channel). If you need stereo, you can duplicate the channel: `audio.unsqueeze(0).repeat(1, 2, 1)`.
 
 ## LongCatAudioDiTPipeline
````
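The updated tips describe a `(batch, channels, samples)` output shape and a mono-to-stereo channel-duplication trick. That tensor arithmetic can be sanity-checked without loading any model; the sketch below uses a random tensor as a stand-in for real pipeline output (the sample count 24000 is an arbitrary assumption):

```python
import torch

# Dummy stand-in for pipeline output: (batch, channels, samples), mono audio.
audios = torch.randn(1, 1, 24000)

# Select the single mono waveform, as in the docs: .audios[0, 0] -> (samples,)
audio = audios[0, 0]
assert audio.shape == (24000,)

# Duplicate the mono channel to get stereo, per the tip.
# repeat() with more sizes than dims prepends dims: (samples,) -> (1, 2, samples)
stereo = audio.unsqueeze(0).repeat(1, 2, 1)
assert stereo.shape == (1, 2, 24000)

# Both channels are identical copies of the mono signal.
assert torch.equal(stereo[0, 0], stereo[0, 1])
```

Note that `Tensor.repeat` accepts more repeat sizes than the tensor has dimensions, treating the missing leading dimensions as size 1, which is what makes the one-liner in the tip work.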

src/diffusers/pipelines/longcat_audio_dit/pipeline_longcat_audio_dit.py — 26 additions & 0 deletions

````diff
@@ -25,12 +25,35 @@
 from ...models import LongCatAudioDiTTransformer, LongCatAudioDiTVae
 from ...schedulers import FlowMatchEulerDiscreteScheduler
 from ...utils import logging
+from ...utils.doc_utils import replace_example_docstring
 from ...utils.torch_utils import randn_tensor
 from ..pipeline_utils import AudioPipelineOutput, DiffusionPipeline
 
 
 logger = logging.get_logger(__name__)
 
+EXAMPLE_DOC_STRING = """
+    Examples:
+        ```py
+        >>> import soundfile as sf
+        >>> import torch
+        >>> from diffusers import LongCatAudioDiTPipeline
+
+        >>> pipe = LongCatAudioDiTPipeline.from_pretrained("ruixiangma/LongCat-AudioDiT-1B-Diffusers")
+        >>> pipe.to("cuda")
+
+        >>> prompt = "A calm ocean wave ambience with soft wind in the background."
+        >>> audio = pipe(
+        ...     prompt,
+        ...     audio_duration_s=5.0,
+        ...     num_inference_steps=20,
+        ...     guidance_scale=4.0,
+        ...     generator=torch.Generator("cuda").manual_seed(42),
+        ... ).audios[0, 0]
+        >>> sf.write("output.wav", audio, pipe.sample_rate)
+        ```
+"""
+
 
 def _lens_to_mask(lengths: torch.Tensor, length: int | None = None) -> torch.BoolTensor:
     if length is None:
@@ -194,6 +217,7 @@ def check_inputs(
         )
 
     @torch.no_grad()
+    @replace_example_docstring(EXAMPLE_DOC_STRING)
     def __call__(
         self,
         prompt: str | list[str],
@@ -228,6 +252,8 @@ def __call__(
             inputs specified by `callback_on_step_end_tensor_inputs`.
             callback_on_step_end_tensor_inputs (`list`, defaults to `["latents"]`):
                 Tensor inputs passed to `callback_on_step_end`.
+
+        Examples:
         """
         if prompt is None:
             prompt = []
````
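The `@replace_example_docstring(EXAMPLE_DOC_STRING)` decorator splices the example block into `__call__`'s docstring at the `Examples:` placeholder, which is why the diff also adds a bare `Examples:` line to the docstring body. The following is a minimal sketch of how such a decorator can work, for illustration only (not the actual diffusers implementation):

```python
def replace_example_docstring(example_docstring):
    """Return a decorator that fills the 'Examples:' placeholder in a docstring."""
    def decorator(fn):
        if fn.__doc__ and "Examples:" in fn.__doc__:
            # Substitute the bare placeholder with the full example block.
            fn.__doc__ = fn.__doc__.replace("Examples:", example_docstring)
        return fn
    return decorator


EXAMPLE = "Examples:\n    >>> print('hi')"


@replace_example_docstring(EXAMPLE)
def call(prompt):
    """Generate audio.

    Examples:
    """
    return prompt


assert ">>> print('hi')" in call.__doc__  # example was spliced in
assert call("x") == "x"                   # the function itself is unchanged
```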
