They're very similar, but they're not the exact same thing.
Llasa uses xcodec2, a much simpler, lossless 16 kHz WAV codec. This makes it superior for one-shot voice cloning.
Orpheus's 24 kHz SNAC codec is lossy, which makes it difficult to use for zero-shot cloning because the reference audio gets degraded during tokenization. You can test this here:
https://huggingface.co/spaces/Gapeleon/snac_test
But when fine-tuned on 50+ audio samples, Orpheus produces much cleaner 24 kHz audio than Llasa, and the SNAC model is much easier to run on consumer hardware than xcodec2 (real-time speech needs about 87 tokens/s, which an RTX 3080 can reach, for example).
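To put the real-time figure in perspective, here is a rough sketch of the arithmetic. The 87 tokens/s requirement is taken from the comment above; the function names and the example throughputs are made up for illustration, not measured on any GPU.

```python
# Tokens per second of audio needed for real-time playback (figure quoted above).
REALTIME_TOKENS_PER_SEC = 87.0


def real_time_factor(generated_tokens_per_sec, required=REALTIME_TOKENS_PER_SEC):
    """Seconds of audio produced per second of wall-clock time (>= 1.0 is real time)."""
    return generated_tokens_per_sec / required


def seconds_to_generate(audio_seconds, generated_tokens_per_sec,
                        required=REALTIME_TOKENS_PER_SEC):
    """Wall-clock seconds needed to generate `audio_seconds` of speech."""
    return audio_seconds * required / generated_tokens_per_sec


if __name__ == "__main__":
    # Hypothetical throughputs, only to show how the ratio behaves.
    for speed in (60.0, 87.0, 120.0):
        rtf = real_time_factor(speed)
        status = "real time" if rtf >= 1.0 else "too slow for streaming"
        print(f"{speed:6.1f} tok/s -> RTF {rtf:.2f} ({status})")
```

Anything at or above a factor of 1.0 can stream without gaps; below that, the audio has to be buffered before playback.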
No, you just condition it with text/voice-token pairs. Then, when you prompt further inference with text alone, the generated voice tokens tend to match the voice in the pairs further up in the context.
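A minimal sketch of what that in-context conditioning could look like at the token level. The special-token IDs and the exact layout are hypothetical; each model defines its own prompt format, so treat this as an illustration of the idea rather than the format Orpheus or Llasa actually uses.

```python
# Hypothetical special-token IDs; a real model ships its own values.
BOS_TEXT, EOS_TEXT, BOS_AUDIO, EOS_AUDIO = 1, 2, 3, 4


def build_cloning_prompt(reference_pairs, target_text_ids):
    """Concatenate (text, voice-token) pairs, then the new text with an open audio slot.

    reference_pairs: list of (text_token_ids, audio_token_ids) from the reference speaker.
    target_text_ids: token IDs of the text to be spoken in that voice.
    """
    prompt = []
    for text_ids, audio_ids in reference_pairs:
        prompt += [BOS_TEXT, *text_ids, EOS_TEXT]
        prompt += [BOS_AUDIO, *audio_ids, EOS_AUDIO]
    # The prompt ends right after BOS_AUDIO, so the model continues with
    # voice tokens, which tend to follow the pairs earlier in the context.
    prompt += [BOS_TEXT, *target_text_ids, EOS_TEXT, BOS_AUDIO]
    return prompt


if __name__ == "__main__":
    pairs = [([10, 11], [100, 101, 102])]  # toy IDs, not real tokens
    print(build_cloning_prompt(pairs, [12]))
```

No weights are updated here; the reference voice only lives in the context window.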
They’re both lossy. They use a VQ-VAE-type architecture trained with a combination of losses and discriminators. The differences are mainly the encoder/decoder architecture, the type of bottleneck quantization (RVQ, FSQ, etc.), and of course the training data.
(https://github.com/canopyai/Orpheus-TTS)
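For anyone wondering why a quantized bottleneck is lossy by construction, here is a toy sketch of the two bottleneck types mentioned above, RVQ (residual vector quantization) and FSQ (finite scalar quantization). The codebooks and inputs are invented, and real codecs wrap these steps in learned encoder/decoder networks, so this only shows the quantization step itself.

```python
import math


def nearest(codebook, v):
    """Index of the codebook entry closest to v (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - a) ** 2 for c, a in zip(codebook[i], v)))


def rvq_encode(x, codebooks):
    """RVQ: each stage quantizes whatever the previous stages left over."""
    residual = list(x)
    indices = []
    for cb in codebooks:
        i = nearest(cb, residual)
        indices.append(i)
        residual = [r - c for r, c in zip(residual, cb[i])]
    return indices


def rvq_decode(indices, codebooks):
    """Sum the chosen entry from every stage to rebuild the vector."""
    out = [0.0] * len(codebooks[0][0])
    for cb, i in zip(codebooks, indices):
        out = [o + c for o, c in zip(out, cb[i])]
    return out


def fsq_quantize(x, levels=5):
    """FSQ: squash each dimension into (-1, 1), then round to a fixed grid.

    Assumes an odd number of levels so the grid is symmetric around zero.
    """
    half = (levels - 1) / 2
    return [round(math.tanh(v) * half) / half for v in x]


if __name__ == "__main__":
    # Toy two-stage codebooks (made up for illustration).
    codebooks = [[[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [0.25, -0.25]]]
    x = [1.2, 0.8]
    idx = rvq_encode(x, codebooks)
    print("RVQ indices:", idx, "reconstruction:", rvq_decode(idx, codebooks))
    print("FSQ:", fsq_quantize([0.0, 10.0, -10.0, 0.3]))
```

In both cases many different inputs map to the same codes, so the original signal cannot be recovered exactly; more stages or more levels shrink the error but do not remove it.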