Article Summary (Model: gpt-5.2)
Subject: PersonaPlex on-device Swift
The Gist:
The post describes adding NVIDIA’s PersonaPlex 7B full‑duplex speech‑to‑speech model to the author’s Swift/MLX library qwen3-asr-swift, enabling on-device “audio in, audio out” generation on Apple Silicon with streaming output. Instead of an ASR→LLM→TTS pipeline, PersonaPlex directly consumes audio tokens and produces audio tokens, allowing simultaneous listening/speaking and lower perceived latency. The author also details converting NVIDIA’s 16.7GB PyTorch checkpoint into an MLX-friendly 4‑bit quantized safetensors package (~5.3GB), plus a set of inference/streaming and performance optimizations.
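The 16.7GB → ~5.3GB reduction is roughly what 4-bit quantization predicts. A back-of-envelope check (the bit-cost and overhead figures here are generic assumptions about group-wise 4-bit quantization, not the converter's actual accounting):

```swift
// Rough size sanity check for the 4-bit conversion (illustrative assumptions):
// a 16.7 GB checkpoint of 16-bit weights holds ~8.35e9 values; group-wise
// 4-bit quantization costs ~4.5 bits/weight once per-group scales/biases are
// included, and some tensors (e.g. embeddings, codec) may stay unquantized.
let fp16Bytes = 16.7e9
let weightCount = fp16Bytes / 2.0              // ~8.35e9 16-bit values
let bitsPerQuantWeight = 4.5                   // 4 bits + scale/bias overhead (assumed)
let quantGB = weightCount * bitsPerQuantWeight / 8.0 / 1e9
print(quantGB)                                 // ~4.7 GB; unquantized tensors push it toward ~5.3 GB
```

The gap between the ~4.7GB estimate and the published ~5.3GB is consistent with parts of the model being kept at higher precision.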
Key Claims/Facts:
- One-model full duplex: PersonaPlex collapses ASR/LLM/TTS into a single speech-to-speech model operating on audio tokens (17 parallel streams at 12.5Hz) with a Mimi codec front/back end.
- MLX 4-bit port: The NVIDIA checkpoint is converted and quantized (temporal transformer + Depformer) to run on Apple Silicon via MLX; published as aufklarer/PersonaPlex-7B-MLX-4bit (~5.3GB).
- Streaming + speed: respondStream() emits ~2s audio chunks via AsyncThrowingStream; on an M2 Max the author reports ~68ms/step (RTF 0.87, i.e., faster than real-time) after optimizations such as eval consolidation, batching, and optional MLX compile.
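The reported step time and RTF are consistent with the 12.5Hz frame rate: at 12.5 frames per second, each generation step must cover 80ms of audio, so ~68ms of compute per step lands under real time. A minimal sketch of that arithmetic (the RTF definition here, compute time divided by audio time, is an assumption that matches the reported numbers):

```swift
// Back-of-envelope real-time-factor check (assumed RTF = compute / audio):
// at 12.5 Hz, one generation step accounts for 80 ms of audio.
let frameRateHz = 12.5
let audioPerStepMs = 1000.0 / frameRateHz      // 80 ms of audio per step
let measuredStepMs = 68.0                      // reported on M2 Max
let rtf = measuredStepMs / audioPerStepMs      // < 1 means faster than real time
print(rtf)                                     // 0.85, close to the reported 0.87
```

The small difference from the reported 0.87 plausibly comes from per-chunk overhead (codec decode, stream bookkeeping) on top of the per-step transformer time.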
Discussion Summary (Model: gpt-5.2)
Consensus: Cautiously optimistic—people like the low-latency voice tech, but doubt a 7B full-duplex model is useful without a larger “brain” and better orchestration.
Top Critiques & Pushback:
Better Alternatives / Prior Art:
Expert Context: