
What's the best open source text to speech? Eleven Labs and others are interesting but closed source. I want to use them mainly for audiobooks as I have a lot of ePubs and I'm just using the basic Google text to speech voices on my Android, via Moon+ Reader. It works fine but it's still more robotic than state of the art.


POST-EDIT, CORRECTED ANSWER

I doubt it's currently actually "the best open source text to speech", but the answer I came up with when throwing a couple of hours at the problem some months ago was "ttsprech" [3].

Following the guide, it was pretty trivial to make the model render my sample text in about 100 English "voices" (many of which were similar to each other, and of varying quality). Sampling those, I found about 10 that were pretty "good", maybe 6 that were the "best ones" (very natural, not annoying to listen to, sounding like an actual person by and large), and maybe 2 that made the top (as in, a toss-up for the most listenable, all factors considered).

IIRC, the license was free for noncommercial use only. I'm not sure exactly "how open source" they are, but it was simple to install the dependencies and write the basic Python to try it out; I had to write a for loop to try all the voices like I wanted. I ended up using something else for the project for other reasons, but this could still be a fairly good backup option for some use cases, IMO.
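For anyone wanting to repeat the "try every voice" experiment, the loop can be as small as the sketch below. The names here (`render_all_voices`, `synthesize`, the voice IDs) are hypothetical placeholders, not the actual ttsprech or Silero API; `synthesize` stands in for whatever call your TTS library offers to turn text plus a voice name into encoded audio bytes.

```python
from pathlib import Path
from typing import Callable, Iterable


def render_all_voices(
    text: str,
    voices: Iterable[str],
    synthesize: Callable[[str, str], bytes],
    out_dir: str = "samples",
) -> list[Path]:
    """Render `text` once per voice and write one audio file per voice.

    `synthesize(text, voice)` is a hypothetical stand-in for the real TTS
    call; it is assumed to return the encoded audio (e.g. WAV) as bytes.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for voice in voices:
        try:
            audio = synthesize(text, voice)
        except Exception as exc:  # one bad voice should not stop the batch
            print(f"skipping {voice}: {exc}")
            continue
        path = out / f"{voice}.wav"
        path.write_bytes(audio)
        written.append(path)
    return written
```

Naming each file after its voice makes it easy to shortlist the handful worth keeping by ear.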

PRE-EDIT, ERRONEOUS ANSWER

Same as above, but I had said "Silero" [0, 1, 2] originally, which I started trying out too, before switching to a third (less open) option.

  [0] https://github.com/snakers4/silero-models#text-to-speech
  [1] https://silero.ai
  [2] https://github.com/snakers4/silero-models#standalone-use
  [3] https://github.com/Grumbel/ttsprech#usage


For neutral-sounding, very fast/efficient voices, I find the Coqui TTS VITS models to be very good. For slower, more expressive voices or voice cloning, I think Coqui TTS XTTS is good (or you can look at mrq/tortoise-tts).

I'm still awaiting a StyleTTS2 implementation. The audio samples sound top notch: https://styletts2.github.io/


You're in luck, the code dropped 6 hours ago :) https://github.com/yl4579/StyleTTS2

Looks promising, I'm going to check it out too! MIT license, even! If it's fast enough for real time, it could be the new best option. The paper claims faster inference than VITS...


Ha awesome! I just checked the repo literally before I posted and it was still empty, thanks for the heads up, will give it a spin now.


Just a followup for those interested, inference implementation notes and comparison clip between StyleTTS2, TTS VITS, and XTTS: https://fediverse.randomfoo.net/notice/AaOgprU715gcT5GrZ2


Wow you got it working so fast! I'm still stuck in package manager hell trying to debug a million little issues.


In my post I link to my issue, where I outline what I needed to do from a clean mamba env; that might help.

PyTorch nightly (which I use for CUDA 12) doesn't work with Python 3.12, but if you stick with 3.11 or 3.10 you should be OK. The rest of the requirements are listed without version numbers, so on a clean venv they should install fine. However, there's a bug in the Utils lib that requires a 1-line fix if you're trying to run inference (also linked). nltk was the only dependency not listed, so not bad compared to most code drops!


I spent a couple of hours debugging why jupyter's debugger wasn't working right, so not exactly related to the code. I did also find and fix that utils bug you mentioned. But my current issue is that phonemizer won't find espeak even though I set the environment variables that are supposed to work. I'll figure it out eventually...

Thanks for writing up your experience! Good to know it works! And it's fast!


Are you on Windows? I've had the issue and was able to fix it by manually adding these system variables:

  PHONEMIZER_ESPEAK_LIBRARY = c:\Program Files\eSpeak NG\libespeak-ng.dll

  PHONEMIZER_ESPEAK_PATH = c:\Program Files\eSpeak NG
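If you'd rather not edit the system settings, the same two variables can be set from Python instead. This is a minimal sketch, assuming the default eSpeak NG install location quoted above; `configure_espeak` is a hypothetical helper name, and whether your phonemizer version honors both variables is worth checking against its docs.

```python
import os
from pathlib import Path


def configure_espeak(install_dir: str = r"c:\Program Files\eSpeak NG") -> bool:
    """Point phonemizer at eSpeak NG via the two environment variables above.

    Returns True if the DLL was found and the variables were set, False
    otherwise (in which case the environment is left untouched).
    """
    base = Path(install_dir)
    dll = base / "libespeak-ng.dll"
    if not dll.is_file():
        return False
    os.environ["PHONEMIZER_ESPEAK_LIBRARY"] = str(dll)
    os.environ["PHONEMIZER_ESPEAK_PATH"] = str(base)
    return True
```

Calling this before importing phonemizer is the safe ordering. The explicit file check also catches a mistyped path early, rather than failing later with an "espeak not found" error.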


Yes I'm on Windows at the moment. I did try setting those yesterday but I must have made a typo or something. I'll try again, thanks!

Edit: Got it working, sounds really great and is super fast as advertised. Amazing! Just tried modifying the code to make it speak more quickly and it worked first try and still sounds good too! This is way better than using Coqui TTS. Just need a few more pretrained models and the voice cloning that was in the paper and this will become super popular very quickly.


We bought the $300/month plan for a few months earlier this year... and you'd only get 40 hours of audio generation for that. It wasn't really sufficient to our needs.

How many audio books is 40 hours?

Also, while its voice cloning was truly amazing, every once in a while the voice would get a little nutty and sound like an insect just flew down their throat, or maybe they had an LSD flashback. Normal normal normal, then it's some Bobcat Goldthwait skit. And if you dialed down that parameter (I think it's called stability?) then it goes monotone really quickly.

We're probably several years out from it being something people use personally for audio books.


> We bought the $300/month plan for a few months earlier this year... and you'd only get 40 hours of audio generation for that. It wasn't really sufficient to our needs.

All of these AI as a Service (AaaS?) API companies are going to race each other to razor thin margins. Immediately after ElevenLabs raised, five other TTS services raised nearly the same amount of money.


>How many audio books is 40 hours?

Are you reading War & Peace or Cat In The Hat?


I always assume 200 to 250 pages per book when someone talks about large quantities of books.


That's fairly short. I read about 100 books a year and it includes thousand page tomes like The Count of Monte Cristo.


I always assumed that book to be rather short since it just needs to be a number of sandwiches eaten.

100 books/year. That's an impressive feat regardless of the number of pages. Are these downloaded ebooks or physical printed copies of books?


It's mostly audiobooks, but I have some ePubs that don't have audiobooks anywhere, such as many Japanese light novel fan (or official) translations into English. I can get through them because I understand audio faster than I can read text, playing it back at 3 to 5x speed.


what's your retention/comprehension of the content at those speeds? i find that those speeds allow me to understand the concept as it's whizzing by, but the retention of it is not good. everything i've ever been taught, and my personal experience with long term retention, says speed is not conducive to it.


Retention is pretty good, but that's because I've been training myself for the past 5 to 10 years to get to that speed. It's similar to how the TTS speeds blind people use are incomprehensible to most other listeners.


I like to read with my eyes, not listen. I honestly have no idea how long an audio book is, hours-wise.

I've seen a few for download, and they're always like hundreds of meg, if not over a gig. And that's in mp3, where it should be compressed heavily.


In my Audible library, the shortest is the first Hitchhiker's Guide to the Galaxy at 5h51m. The longest is The Power Broker at 66h9m. Most of the books I have are in the 15-25 hour range, but I also have a lot of fantasy stuff that gets near 50 hours (Game of Thrones, Brandon Sanderson...).


Well, then we're talking $300 or more to have ElevenLabs do a single GoT book, but maybe 6 or 7 books for HHGTG-length stuff.

That's just not good value. Was sort of my point.
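To put rough numbers on that, here is a back-of-the-envelope sketch using only figures quoted in this thread ($300/month for 40 hours, and the book lengths from the Audible comment). The 20-hour "typical" book is an assumed midpoint of the 15-25 hour range, and the function names are made up for illustration.

```python
PLAN_PRICE_USD = 300.0   # monthly plan price quoted in this thread
PLAN_HOURS = 40.0        # hours of generated audio included per month


def hours(h: int, m: int = 0) -> float:
    """Convert hours and minutes to fractional hours."""
    return h + m / 60.0


def cost_usd(book_hours: float) -> float:
    """Cost of one book at the plan's effective per-hour rate."""
    return book_hours * PLAN_PRICE_USD / PLAN_HOURS


def books_per_month(book_hours: float) -> float:
    """How many books of this length fit in one month's quota."""
    return PLAN_HOURS / book_hours


for title, length in [
    ("Hitchhiker's Guide (5h51m)", hours(5, 51)),
    ("assumed typical book (20h)", hours(20)),
    ("long fantasy book (~50h)", hours(50)),
]:
    print(f"{title}: ${cost_usd(length):.2f}, {books_per_month(length):.1f} books/month")
```

At the effective rate of $7.50 per hour, that works out to roughly $44 for the 5h51m title (a little under 7 such books per month), $150 for a 20-hour book, and $375 for a 50-hour one, which is more than a full month's quota.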


I've tried a few, not an expert, but I think Coqui's new XTTS models are decent performance and quality wise (just in terms of how the speech sounds, can't speak to the voice cloning fidelity as I don't care about that). Open source code but non-commercial license for the model. They also have a bunch of models with more permissive licenses that aren't as good.

I doubt they're better than Google's TTS though.


Bark seems pretty good

https://github.com/suno-ai/bark Demo at https://huggingface.co/spaces/suno/bark

In the couple of samples I tried, it was substantially better at picking up meaning than VALL-E-X.


> What's the best open source text to speech?

I haven't re-evaluated OSS TTS options for a few months but from my own experience earlier in the year I've been pleased with the results I've gotten from Piper:

* https://github.com/rhasspy/piper

I've primarily used it with the LibriTTS-based voices due to their license but if it's for personal local use you can probably use some of the other even higher quality voices.

The official samples are here: https://rhasspy.github.io/piper-samples/

Here's a small number of pre-rendered samples I've used that were generated from a WIP Piper port of my Dialogue Tool[0] project: https://rancidbacon.gitlab.io/piper-tts-demos/

While it's not perfect & output quality varies for a number of reasons, I've been using it because it's MIT licensed & there are multiple diverse voice options with licenses that suit my purposes.

(In terms of quality, Piper and its predecessors Larynx & Mimic3 are significantly ahead of where other FLOSS options had been before they existed.)

[0] https://rancidbacon.itch.io/dialogue-tool-for-larynx-text-to...

----

Edit to add links to some of my notes related to FLOSS TTS, in case they're of interest:

* https://gitlab.com/RancidBacon/notes_public/-/blob/main/note...

* https://gitlab.com/RancidBacon/notes_public/-/blob/main/note...

* https://gitlab.com/RancidBacon/notes_public/-/blob/main/note...


Would also like to know this. Can't seem to find an open source tts engine that works on mobile to read muh books



