Extract the audio track from raw video bytes using PyAV.
PyAV wraps FFmpeg's C libraries in-process — no subprocess is spawned, which is critical to avoid crashing CUDA-active vLLM worker processes.
The returned waveform is at the native sample rate of the video's audio stream. Resampling to a model-specific rate is left to the downstream `AudioResampler` in the parsing pipeline.
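To make the native-rate contract concrete, here is a minimal sketch of what a downstream resampling step could look like. It is not the `AudioResampler` implementation: `resample_linear` is a hypothetical helper that uses plain linear interpolation via `np.interp`, and the 48 kHz source and 16 kHz target rates are assumed values chosen only for illustration.

```python
import numpy as np


def resample_linear(
    waveform: np.ndarray, orig_sr: float, target_sr: float
) -> np.ndarray:
    """Resample a mono waveform by linear interpolation (illustrative only)."""
    if orig_sr <= 0 or target_sr <= 0:
        raise ValueError("Sample rates must be positive.")
    if len(waveform) == 0 or orig_sr == target_sr:
        return waveform.astype(np.float32)
    # Number of output samples covering the same duration as the input.
    n_out = max(1, int(round(len(waveform) * target_sr / orig_sr)))
    # Timestamps (in seconds) of the input and output samples.
    t_in = np.arange(len(waveform)) / orig_sr
    t_out = np.arange(n_out) / target_sr
    return np.interp(t_out, t_in, waveform).astype(np.float32)


# One second of a 440 Hz tone at an assumed native rate of 48 kHz,
# standing in for the waveform the extractor would return.
native_sr = 48_000.0
waveform = np.sin(2 * np.pi * 440 * np.arange(48_000) / native_sr).astype(np.float32)

# Bring it down to an assumed model rate of 16 kHz.
resampled = resample_linear(waveform, native_sr, 16_000.0)
```

Linear interpolation applies no low-pass filter, so downsampling this way can alias. A production resampler would normally filter first, which is one reason to leave this step to a dedicated component.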
Parameters:
| Name | Type | Description | Default |
|------|------|-------------|---------|
| data | bytes | Raw video file bytes (e.g. from an mp4 file). | required |
Returns:
| Type | Description |
|------|-------------|
| tuple[NDArray, float] | A tuple of (waveform, sample_rate) suitable for use as an AudioItem. |
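A caller-side sketch may help show how the returned tuple and the function's `ValueError` cases fit together. The names `try_extract_audio`, `duration_seconds`, and `fake_extract` are hypothetical. `fake_extract` is a stand-in with the same call shape as `extract_audio_from_video_bytes`, so the example runs without PyAV or vLLM installed.

```python
from typing import Callable, Optional, Sequence, Tuple

# (waveform, sample_rate), mirroring the documented return value.
Audio = Tuple[Sequence[float], float]


def try_extract_audio(
    extract: Callable[[bytes], Audio], data: bytes
) -> Optional[Audio]:
    """Call an extractor and return None if it reports unusable input."""
    try:
        waveform, sample_rate = extract(data)
    except ValueError:
        # Raised for empty bytes, a missing audio stream, or undecodable data.
        return None
    return waveform, sample_rate


def duration_seconds(audio: Audio) -> float:
    """Duration implied by the sample count and the native sample rate."""
    waveform, sample_rate = audio
    return len(waveform) / sample_rate


def fake_extract(data: bytes) -> Audio:
    # Hypothetical stand-in for extract_audio_from_video_bytes.
    if not data:
        raise ValueError("Cannot extract audio: video bytes are missing or empty.")
    return [0.0] * 8000, 16000.0


audio = try_extract_audio(fake_extract, b"\x00\x01")
missing = try_extract_audio(fake_extract, b"")
```

Passing the extractor in as a parameter is only a convenience for this sketch. Real code would import the function from `vllm/multimodal/media/audio.py` and call it directly.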
Source code in vllm/multimodal/media/audio.py
    def extract_audio_from_video_bytes(
        data: bytes,
    ) -> tuple[npt.NDArray, float]:
        """Extract the audio track from raw video bytes using PyAV.

        PyAV wraps FFmpeg's C libraries in-process — no subprocess is
        spawned, which is critical to avoid crashing CUDA-active vLLM
        worker processes.

        The returned waveform is at the native sample rate of the video's
        audio stream. Resampling to a model-specific rate is left to the
        downstream :class:`AudioResampler` in the parsing pipeline.

        Args:
            data: Raw video file bytes (e.g. from an mp4 file).

        Returns:
            A tuple of ``(waveform, sample_rate)`` suitable for use as an
            :class:`AudioItem`.
        """
        if data is None or len(data) == 0:
            raise ValueError(
                "Cannot extract audio: video bytes are missing or empty. "
                "Ensure video was loaded with keep_video_bytes=True for "
                "audio-in-video extraction."
            )
        try:
            with av.open(BytesIO(data)) as container:
                if not container.streams.audio:
                    raise ValueError("No audio stream found in the video.")
                stream = container.streams.audio[0]
                native_sr = stream.rate
                chunks: list[npt.NDArray] = []
                for frame in container.decode(audio=0):
                    arr = frame.to_ndarray()
                    chunks.append(arr.mean(axis=0) if arr.ndim > 1 else arr)
        except ValueError:
            raise
        except Exception as e:
            raise ValueError(
                "Invalid or corrupted video data when extracting audio. "
                "Ensure the input is valid video bytes (e.g. a complete MP4)."
            ) from e
        if not chunks:
            raise ValueError("No audio found in the video.")
        audio = np.concatenate(chunks).astype(np.float32)
        return audio, float(native_sr)
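The per-frame downmix and final concatenation in the listing above can be tried in isolation with synthetic arrays. This is a sketch, not part of vLLM: `downmix_and_concat` is a hypothetical helper, and the input frames are assumed to use a planar layout with shape `(channels, samples)`. Other sample layouts may need different handling.

```python
import numpy as np


def downmix_and_concat(frames: list[np.ndarray]) -> np.ndarray:
    """Average channels per frame, then join all frames into one waveform."""
    chunks = []
    for arr in frames:
        # 2-D arrays are averaged over axis 0; 1-D arrays pass through as-is.
        chunks.append(arr.mean(axis=0) if arr.ndim > 1 else arr)
    if not chunks:
        raise ValueError("No audio found.")
    return np.concatenate(chunks).astype(np.float32)


# Two synthetic stereo frames, assumed planar: shape (channels, samples).
frame_a = np.array([[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]])
frame_b = np.array([[0.5, 0.5], [0.5, 0.5]])

mono = downmix_and_concat([frame_a, frame_b])
```

Here `mono` holds five samples: three from the first frame and two from the second, each the mean of the two channels. Frames of different lengths concatenate cleanly because the channel axis has already been collapsed.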