It's kind of wild that these tools transfer a fresh copy of these models every time they're spun up (whether to a Google Colab notebook or a local machine).
This must mean Hugging Face's bandwidth bill is crazy, or am I missing something (maybe they have a peering agreement? heavy caching?)
Their Python module caches downloads and checks the cache before fetching anything again... but you're probably not wrong about the crazy bandwidth bill. They look to have crazy VC money though, considering the current climate.
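For the curious, the caching is visible right in the huggingface_hub API; a minimal sketch (the repo and filename are just arbitrary examples):

    # First call downloads into ~/.cache/huggingface/hub; repeated calls
    # return the cached path without re-downloading, as long as the remote
    # revision hasn't changed.
    from huggingface_hub import hf_hub_download

    path = hf_hub_download(repo_id="gpt2", filename="config.json")
    print(path)  # e.g. ~/.cache/huggingface/hub/models--gpt2/snapshots/...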
Unmetered 10+ gigabit connections were on the order of $1 per Mbit/s per month wholesale over a decade ago when I priced out a custom CDN, so for the cost of 100 TB of data transfer out of AWS you could get a 24/7 sustained 10 Gbit/s (>3 PB per month at 100% utilization).
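Back-of-the-envelope for that >3 PB figure (just arithmetic, assuming a 30-day month):

    rate_GBps = 10 / 8                # 10 Gbit/s = 1.25 GB/s
    seconds = 30 * 24 * 3600          # 2,592,000 seconds per month
    print(rate_GBps * seconds / 1e6)  # ~3.24 PB at 100% utilization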
Not all connections are created equal. Even some big providers clearly have iffy peering agreements upstream that’ll manifest as terrible performance if you have a widely-geographically-distributed bandwidth-heavy load.
That's pretty expensive. Sonic offers 1-10 Gbps (depending on where you live) unmetered symmetric connections for $60/mo in the Bay Area... they're also the only ISP that petitioned the FCC in favor of net neutrality.
For work I often end up transferring 50-150 GB, sometimes daily. I've never heard a word from them that this has been a problem.
If you host copies of your data with a few big providers, could you do something smart like detecting requests coming from AWS and redirecting them to an S3 bucket, so you never pay for bandwidth leaving the provider?
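Something like this should be doable at the CDN/origin layer; here's a hypothetical sketch using Amazon's published IP ranges (the bucket and CDN hostnames are placeholders, and whether the transfer is actually free depends on regions and who pays the S3 egress):

    import ipaddress, json, urllib.request

    # Amazon publishes its IP space; filter down to the EC2 prefixes.
    with urllib.request.urlopen("https://ip-ranges.amazonaws.com/ip-ranges.json") as r:
        prefixes = json.load(r)["prefixes"]
    aws_nets = [ipaddress.ip_network(p["ip_prefix"])
                for p in prefixes if p["service"] == "EC2"]

    def pick_origin(client_ip: str, path: str) -> str:
        # Requests from inside AWS get pointed at an S3 mirror (placeholder
        # bucket); everyone else hits the normal CDN.
        if any(ipaddress.ip_address(client_ip) in net for net in aws_nets):
            return "https://models-mirror.s3.amazonaws.com" + path
        return "https://cdn.example.com" + path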
1. AWS is far behind Azure and GCP in AI, so they gotta partner up to gain credibility.
2. Hugging Face probably does face insane bills compared to GitHub. But AWS can probably develop some optimizations to save on bandwidth costs. There's 100% some sort of generalized differential storage method being developed for AI models.
Is Hugging Face just a model repository the way GitHub is a code repository? It seems you can rent compute, both CPU & GPU, but you're right that most models seem to be run elsewhere.
I haven't used Windows in a while, but I thought it supported some form of cross-volume symlink? Or at least mounting an image stored on another volume at an arbitrary path.
It's so not-well-known that several tools that really should know better don't check for junctions, with occasionally disastrous results in a filesystem walk. (Using junctions sounded really clever to me until this had me up all night figuring out why the backup system crashed.)
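For anyone who wants to skip that all-nighter, the missing check is small; a sketch in Python (junctions are reparse points, which a naive os.walk() happily descends into on Windows):

    import os, stat

    def is_reparse_point(path):
        # st_file_attributes only exists on Windows stat results.
        st = os.stat(path, follow_symlinks=False)
        return bool(getattr(st, "st_file_attributes", 0)
                    & stat.FILE_ATTRIBUTE_REPARSE_POINT)

    def safe_walk(root):
        for dirpath, dirnames, filenames in os.walk(root):
            # Prune junctions/symlinks in place so walk() never enters them.
            dirnames[:] = [d for d in dirnames
                           if not is_reparse_point(os.path.join(dirpath, d))]
            yield dirpath, dirnames, filenames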
This article only covers the musical aspects of AI voice cloning, but there's another dynamic to AI voice cloning that's more complicated: replacing general voice actors in movies/video games/anime (example: https://www.axios.com/2023/07/24/ai-voice-actors-victoria-at... )
Unlike musicians who can't be replaced without significant postprocessing, have enough money to not be impacted by competition, and have legal muscle, voice over artists:
- Can be reproduced with good-enough results from out-of-the-box voice cloning settings on ElevenLabs or an open source equivalent (Bark, VALL-E X)
- Are already underpaid for their work as-is
- Have no legal ownership of their voices since they are contractors; their voicework is owned by their clients, who may not be as incentivized to protect the VO artist.
I want to write a blog post about it but I suspect most people on Hacker News won't be interested in a treatise on the cultural impacts of the voicework in Persona 5 and Genshin Impact.
What I find interesting is the prospect that eventually these companies will hire some college kid who needs a couple thousand bucks and a free pizza. Have them read the right scripts, sign the right 'give everything away' contract, and use their voice forever. Or do it sneakily: ship a voice assistant whose ToS says 'we can use a copy of your voice for anything'.
The existing voice actors will simply be out of work. There will be a small cadre of groups that want real voices, but for some projects that won't be that important.
Don't expect that to last more than a year or two, assuming it's even still a problem for the best voice-generation AIs. Generating high-quality samples is the hard problem; generating specific high-quality samples is, by comparison, a lot easier.
Remember when Stable Diffusion was released a year ago, and one of the big artist copes was "sure, it can generate random images, but it'll never be able to generate the same character repeatedly!"? They were already wrong when they said it, because Textual Inversion and DreamBooth had already been published; both were soon ported to SD, and people could then dump out thousands of images of the same character in the same consistent style (and did).
The issue is more that I can't get the equivalent of a slider control to adjust one or more properties of the voice from the AI in real time. Take a vocal fry slider, to use an example of something most people are capable of doing deliberately when they want to. The currently available models are pre-trained to sound like the average/median of one specific person (or character), and while I imagine the tools for controlling and customizing the training of these models will improve, I don't see a clear path from the current model architectures to one where I can freely control the stylistic aspects of the vocal output without loading in a completely different set of model weights trained for that new desired output.
No, that's easy. We had the equivalent of that in GANs many years ago. If you've never seen GAN editing, here's a quick video: https://www.youtube.com/watch?v=Z1-3JKDh0nI (Background: https://gwern.net/face#reversing-stylegan-to-control-modify-... ) You just classify the latents and then you can edit it. These days, with pretrained models like CLIP, you don't necessarily even need a latent space: you can take a model which has been trained on sound/text descriptions, like AudioCLIP, prompt it with a text like "vocal fry", and then the generated samples are subtly skewed to try to maximize similarity with "vocal fry". You put a slider on that for how much weight/skewing it does, and now you have a slider control to adjust properties of the voice from the AI. If something like this doesn't exist, it's obvious how to do it. (Even the realtime problem is being solved by figuring out how to train diffusion models to do a GAN-like single pass: https://arxiv.org/abs/2309.06380 )
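To make that slider idea concrete, here's a toy sketch of the mechanic with stand-in networks (the Linear layers and random target vector below are placeholders for a real generator and AudioCLIP's encoders, not anything pretrained):

    import torch
    import torch.nn.functional as F

    G = torch.nn.Linear(64, 1024)        # stand-in generator: latent -> "audio"
    embed = torch.nn.Linear(1024, 128)   # stand-in audio encoder
    target = torch.randn(128)            # stand-in embedding of the text "vocal fry"

    def skew(z, strength, steps=20, lr=0.05):
        # Nudge the latent so the output embeds closer to the target text;
        # `strength` is the slider, and 0 leaves the voice untouched.
        z = z.clone().requires_grad_(True)
        opt = torch.optim.Adam([z], lr=lr)
        for _ in range(steps):
            sim = F.cosine_similarity(embed(G(z)), target, dim=0)
            (-strength * sim).backward()
            opt.step()
            opt.zero_grad()
        return z.detach()

    z_fry = skew(torch.randn(64), strength=0.8)  # slider turned most of the way up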
I didn't get to really explore the GAN generation of ML work, since I had no supported hardware (no desire to support the Nvidia monopoly on ML work) and refused to blow money on cloud instances I'd probably forget about and wind up with a giant bill.
It's a really different world now that I've got massive models running on my laptop thanks to Apple Silicon and its unified memory architecture, and the C++ ports of various diffusion image models and several families of large language models work well on my AMD GPU too... it's so much easier to participate in the current generation of applied ML work without having to go out of my way for specifically supported hardware.
Not sure if your hypothetical was meant to be a reference to the absolutely hilarious classic "Gilbert Gottfried reads Fifty Shades of Grey", but it has me wondering how much of the inherent comedy comes from "the voice" and how much comes from the idea that the man himself sat down and recorded those lines.
> wondering how much of the inherent comedy comes from “the voice” and how much comes from the idea that the man himself sat down and recorded those lines
For me it came from the voice; I hadn't heard of Gilbert Gottfried as a specific person until I read this discussion. The reaction faces of the women listeners were also amusing.
I still like getting surprised when a new or unorthodox narrator knocks it out of the park, but I'd really enjoy a "salvage this purchase" exit hatch with an AI voice alternative. I'd even pay a buck or two on top of an existing purchase to automatically fix a bad narration.
Head over to Audible reviews: some books are widely considered to be great as written, but the audiobook is reviewed as one to avoid because it was recorded poorly, the narrator paced it wrong, had an annoying voice, couldn't do a voice of the opposite gender, whatever.
Plus it seems like a great accessibility feature. Many books are recorded for the vision impaired community by volunteers and that’s admirable, but some of the AI today does a much better job.
These are some very fair points. There was one book, 'Electron Fire', all about the creation of the transistor, I think. I say that because never have I heard a more unenthused narrator. He makes Henry Kissinger sound like a dramatic actor.
Any AI voice could save that one. Any of them! Heck, the original voice on the 1984 Macintosh could do better.
Recent voice models by OpenAI, Meta, and ElevenLabs all state upfront that they work with paid professional voice actors, so this space will get interesting fast.
HN isn't the only community to write for. While most people here seem to be unsympathetic to such job concerns, unconventional articles do hit the front page from time to time.
The get-rich-at-any-cost types like to post on these articles at a higher rate, I think. When you read a larger and broader range of HN posts, you see that a substantial part of the population here has concerns about this.
I would as well. It isn't that I'm unsympathetic, it is just that we haven't outlawed technology that put others out of work, and I'm curious why we would decide as a society this time should be different. If there are good reasons I want to know.
Putting people out of work is one thing, that's bad enough and societies should take care to guide change and support those affected.
The danger behind AI and other manipulative technology is that it erodes trust. We already have serious issues with trust in media, and not just the obvious cases of Russian/Chinese propaganda, but also stuff like kids getting anorexia from extremely photoshopped advertising.
Add AI on top and no one can be certain about anything anymore. Say someone distributes a fake "recording" of the US President calling for glassing Moscow, or the Serbian President declaring war on Kosovo? That has the potential to actually cost lives on a massive scale.
Yeah, all that is bad, but those consequences are already here aren't they? Restricting further research just means it will be done in clandestine government labs like chemical or biological weapons except with equipment costs orders of magnitude lower. I can imagine policies that would save the jobs of voice actors, but none that would prevent the wave of deepfake propaganda that is coming.
> voice actors are fearing that the ability for generative AI to replicate their voices may cost them work
I'm not sure how to feel about that. I'm against the idea that some people "deserve" to be paid for being lucky enough to be born with an interesting voice.
On the other hand, the world has always worked like that. A hard-working farmer or doctor was also lucky to be born with the traits necessary to make their living, while others weren't.
A lot of skills are not simple, but computers have taken them over anyway. For example, financial bookkeeping is not just writing and storing the books; it's a professional skill with many tricks to learn. Yet databases and spreadsheets have taken over the major part of those jobs. The same could be said of programmers who learned assembly language. Or of performing: vinyl records and CDs have largely displaced orchestras and traveling musicians.
I would vote for it only if it somehow encouraged voice actors to experiment and create new, interesting styles. Kind of like patents were designed to do: encourage inventors (although that has recently become controversial in the IT world).
Yes, everyone has a voice. The number of people who can convincingly act with said voice is remarkably small and requires a good deal of innate ability or training, generally both.
You could have made that argument more effectively in the past, when voice actors had to be able to mimic multiple voices (Dan Castellaneta, Mel Blanc, etc.). Nowadays, we're seeing more and more shows where the voices of the characters are just... the normal voices of the voice actors.
Of course it's not totally devoid of skill; you need to be able to emote, inflect, and convey emotion, but the bar is far, far lower.
> that some people "deserve" being paid for being lucky born with an interesting voice
The majority of success is attained like this, though. Athletes are paid for being born strong, tall, and fast; models are paid for being pretty; rich families for being born rich; smart people for being born smart, or hardworking, etc. It's the most dominant factor everywhere.
It's always funny to me when people cite old American case law and try to wrap their heads around how it can apply to a situation the case's participants couldn't possibly have imagined. Shouldn't the correct way to do this be new legislation, created after consulting interest groups, to answer modern problems arising from modern realities, like what the EU is doing? That seems a much more sensible approach than wondering how the author of a 15th-century ruling would have applied his thinking to something he couldn't even dream of.
Well yes, you need to ask representatives of the people that will be impacted by a law what the impact will be, assess expert opinions, etc. Lobbying isn't only the American political bribery system; there are legitimate reasons behind it.
Of course! And that those with the deepest pockets can afford to have the most convincing folks spend the most time waiting for an opening in the various Representatives' calendars is not surprising, and only natural.
That it often results in them getting an equivalent (or greater) mindshare of the Representatives' views is also not surprising, and only natural.
It doesn't inspire warm fuzzies in those too busy working to survive though.
You probably mean common law, also sometimes known as case law, vs. civil law, which traces its origins to the Napoleonic civil code and is used in most of the world outside the former British colonies.
My law classes did cover common law, yes, but not favourably (can you guess I come from a civil law country?). It sounds like a system that made sense in 15th-century Britain but is quite the complex beast nowadays, with many issues it doesn't need to have.
However that still doesn't answer my original question, why is there no new legislation to cover the newly existing scenarios talked about? It seems to me that even the UK does that at least for some things, and they're the original common law country.
As long as they don't claim the voice is the original actor (misspell the name perhaps, or the Hollywood classic 'based on'), they won't be impersonating, no?
The Ford ad didn't say it was Midler, they just implied it by using her song with a soundalike. There was another similar case with a parody ruled as impersonation. I don't think there's good precedent for exactly where that line is drawn.
AI voice is cheaper, but it's also a more boring and generic performance. There is zero progress made towards any sort of creative AI that produces good unique work.
The market for this then is small businesses who can't afford a professional voice actor. AI is opening up new markets, not killing the jobs of the truly talented.
This is the case for all generative "art." The people at the high end will still get paid well. The people who specialize in more utilitarian or low budget tasks in higher volume will take the biggest hit. Nobody who'd planned on hiring Morgan Freeman to do a voice over will be tempted to use AI Morgan Freeman instead.
>There is zero progress made towards any sort of creative AI that produces good unique work.
It's only been a year. Give it some time and I'm sure AI will have much better results. Right now, you can get some of that unique work by fine-tuning the AI on a person's existing portfolio.
It saddens me because of how much impact they had on my family as we played through the storyline in Genshin and immersed ourselves in its world. At some point we met a few of the voice actors at a convention, and they were like stars to us, while I'm sure their circumstances are as you describe.
Most likely you'd see a lot of people saying that somehow getting rid of voice actors is good for "progress". Whatever that means.
Random aside: someone really needs to make a Hacker News that focuses more on game development and the other arts, so blog posts like the ones you're talking about would have a proper community to discuss them.
Which are, in my view, really minor advantages when compared to the disadvantages. Not only in terms of putting people out of work, but in terms of increasing the artifice of the world around us and decreasing its humanity.
> "putting people out of work" by automating jobs is also a good thing
Unless you're one of the people out of work. And even if you don't care anything about them, if there's enough of them then the resulting unrest will be your problem anyway.
There's almost nothing more important to the happiness of humanity than increased productivity per capita. That sounds crazy, but when you think about it, it's true.
Well, this is a very one-sided view of the world, I'd say. From personal experience, I can surely tell you that I was much happier in countries where productivity was lower. The people there are just so much more pure of heart.
Property is a bundle of rights, and often hard to pin down. In the case of voices, if a company owns enough of your data to train a good simulacrum, and they have the right to do it, then they kind of do own your voice -- or more precisely, a damn good substitute.
> Belyaev is a 29-year-old synthetic-speech artist at the Ukrainian start-up Respeecher, which uses archival recordings and a proprietary A.I. algorithm to create new dialogue with the voices of performers from long ago. The company worked with Lucasfilm to generate the voice of a young Luke Skywalker for Disney+’s The Book of Boba Fett, and the recent Obi-Wan Kenobi series tasked them with making Darth Vader sound like James Earl Jones’s dark side villain from 45 years ago, now that Jones’s voice has altered with age and he has stepped back from the role.
Copyright is complex. And artists' rights sit outside of copyright in some respects. An example: in the past, painters have had their works bought and then hung in unfavourable conditions, or in places/locations which reflect poorly upon the work of art.
Artists have sued, and won, to have artwork moved, shown differently, or force-sold back to the artist.
Now, everything you say is copyright... you. At least in my legal jurisdiction! Even my image is, in Quebec! Yes, that includes if you take my picture outside.
So what of one's voice? Say there isn't a real agreement to use that voice in any way desired, and then you use that voice to... I don't know, advocate for terrorists or something weird.
What then?
I don't think it's completely clearcut, and I think there will be changes, decisions on this going down the road.
We've seen plenty of examples of famous people suing companies for using their likeness in ads as if they are promoting a product. Tom Hanks' name is currently in the news for this.
If a company edits an actor's previously recorded dialog in a way that makes them sound in favor of terrorism, in an attempt to make people think the actor said those words, we have issues on so many levels. If the dialog is chopped/re-edited for use as dialog by the same character in later works, then I really don't have issues with it.
I pay little attention to SAG contracts, but after the Writer's Guild strike, I'd be expecting SAG to follow suit with major asks to protect its members from AI if they have not already covered it.
Thanks. I've recently been asked by a couple of acquaintances who have done a few character voices in the past what I think of AI and what can really be done with it. Because of their infrequent performances they aren't union members, but I'll pass along these links.
"Independent of the author's economic rights, and even after the transfer of the said rights, the author shall have the right to claim authorship of the work and to object to any distortion, modification of, or other derogatory action in relation to the said work, which would be prejudicial to the author's honor or reputation."
"The authors of dramatic works (plays, etc.) also have the right to authorize the public performance of their works (Article 11, Berne Convention)."
"The protection of the moral rights of an author is based on the view that a creative work is in some way an expression of the author's personality: the moral rights are therefore personal to the author and cannot be transferred to another person except by testament when the author dies."
"“Author” is used in a very wide sense, and includes composers, artists, sculptors and even architects"
Architects can deny changes in interior design (lighting, artwork, etc.) long after the building is finished. Just a few days ago I talked with a theater director: the author of the original work has the right to deny a production for whatever reason, e.g. if they don't like the nose of an actor.
I bet my voice is mine under most jurisdictions (and I mean most; the Berne convention has been signed by 181 countries), even if I signed a contract that gives you wide permission to use it. And if I didn't, you can't use it outside of the very narrow scope of the work I produced for you. Even if you simply want to reuse an existing recording in another context.
If they are using their real voice, then they kind of screwed themselves. If they are performing a character voice, then at least they only lose out on that kind of work.
I'm guessing contracts will need to be updated to say that a character's AI-generated voice can't be reused, so that a completely different production can't claim they have the actor attached for publicity purposes.
A person's voice is effectively owned by the corresponding person through right of publicity, which includes voice depending on jurisdiction.
California, for example:
"Any person who knowingly uses another’s name, voice, signature, photograph, or likeness, in any manner, on or in products, merchandise, or goods, or for purposes of advertising or selling, or soliciting purchases of, products, merchandise, goods or services, without such person’s prior consent, or, in the case of a minor, the prior consent of his parent or legal guardian, shall be liable for any damages sustained by the person or persons injured as a result thereof."
Voices can sound very similar, they're far from unique. Clearly if you say or somehow strongly imply that a voice belongs to a specific person then that is protected. But what if you use someone's voice, someone not especially well known, and don't make any claims about where it comes from?
I don't think it's that clear at all. You own your "likeness", but the limits of what that means are highly untested. Of the similar examples that have been tested in court thus far, Midler v. Ford is the closest, but the court specifically called out the fact that, as a singer, her voice is a distinctive part of her identity, and so it is protected.
It's sad if the only way voice actors are going to be able to make a living is by doing stuff like Critical Role on YouTube. I love Critical Role, but it likely wouldn't be the same if those guys hadn't spent years honing their craft. Watching people play RPGs online has replaced a lot of my streaming viewing now, but the market is much smaller, and I imagine it can only sustain a much smaller pool of creatives than the current voice-over market can.
Wow. I just realized any one of us could redo Weird Al's songs with his lyrics, but with the original singer's voice. We could be listening to Michael Jackson singing "Just Eat It" by lunchtime.
I am constantly amazed at how the new AI tech can be used.
Of course this would be illegal under most countries' copyright laws.
There's also the Weird Al piece "I Think I'm a Clone Now", for which an AI clone voice performance would definitely be fitting. (The original song was "I Think We're Alone Now" by Tommy James and the Shondells, but it seems Weird Al was parodying Tiffany's 1980s cover.)
While Weird Al himself asks for permission, it's well established that parody is not copyright infringement. There should be room for parody performances by AI voices as well, especially if argued by a good lawyer.
My absolute favorite application of this tech so far is The Beach Boys singing 'Hurt'. It's the first time I seriously didn't notice any artifacts, and it just works so well even though it really shouldn't.
I don't know what I was expecting but that isn't Hurt, it's Surfin' USA with Hurt's lyrics that sound extremely jittery and grainy.
I'm curious though if some AI soon could in fact synthesize the Beach Boys' style with the actual chords and melody from the NIN song, possibly with some of the pathos of Johnny Cash as well.
This is potentially something that generative AI could be good at doing (at least recreating vocals), but this parody of the Talking Heads required a lot of very clever insight into what made a good Talking Heads song and returned a convincing and novel melody. And I think we are still a ways off.
Yeah, I hate it to the point of being personally offended. It has nothing to do with Johnny Cash's rendition. I'd probably feel a bit better, but not much, if it were advertised as a NIN mashup.
Yeah, based on the parent, and the genius of the musicians involved, I was expecting something more than the sum of its parts. Hurt is an incredibly powerful song, and the Cash rendition imbues it with another beautiful layer.
As a joke, I can see it being funny, but it was a jarring way to experience it.
This account is one of the absolute top tier creators for weird music mixes. The recent deep faking stuff has been shockingly good. I think this is a good example of an "acceptable" use of AI, as long as artists/composers etc rights are all settled.
It's always more fun when it's a real group of talented people being silly, but I'd listen to an album of weird mashups like this for sure.
The graininess of the recording covers over a lot of potential problems. But given that this attempt keeps the Beach Boys' tempo and enunciation, I think this technique, whatever it is, would make a much more compelling version of Michael Jackson covering "Eat It".
The sampled voices sound neither like Michael Jackson nor Weird Al. A good effort, but a professional impersonator could likely do better on either front.
Sometimes I’ll watch a movie with voiceover work, where some character has a very specific accent, and I’ll be watching along for twenty minutes and the VA will let slip just a couple syllables of their real voice and my ears will prick up and I’ll think, hey I know this guy. Isn’t that… oh the guy from the thing. From <wrong movie>, no wait I mean <other movie>? Yes, it is.
That’s what this sounds like. Five syllables of Michael Jackson while he’s trying to be Action Hero or Big Villain, or Funny Sidekick (a problem Eddie Murphy has never had, all evidence from Coming to America notwithstanding).
I know what you mean. It's more noticeable (imo) on the Michael one... but it's definitely in there. I think the pitch correction is to blame for a bit of the weirdness.
I did not know about this: "The center of the A.I. cover songs community is a massive 500,000+ member Discord called A.I. Hub, where members trade new tips, tools, techniques, and links to their original and cover songs."
Me neither. That’s what’s so weird about the internet.
Imagine half a million people out in the streets together. You’d definitely notice that. Meanwhile, we can have these massive online communities and you’d never know unless you accidentally stumbled across it or someone told you about it.
More accurate to say that, while 500,000 people joined the Discord by clicking a link, some much, much smaller number are actually active on any sort of regular basis.
Yeah, one of the "worst" (good for metrics, bad for legibility) parts of the trend of moving to Discord for any sort of online community is that you have to "join" the community to even view any of the resources ensconced within. That means it's poorly indexed (Discord search is okay, but not great) and not available at all to external crawlers.
If this community were available for crawling, then LLMs would crawl it, and there would be no value in participating in the community because you could just ask an LLM about all of that, no?
If the value your community provides is low enough that it can be effectively replaced by a general purpose LLM, then it should be. The value of a community should be pushing the boundaries of knowledge, not gatekeeping it.
C'mon, this is hacker news, what happened to "information should be free"?
So to continue the analogy: 500,000 people walked down that street at some point. Some unknown percentage of that number consists of unrecognized duplicates (same person, new username).
> Imagine half a million people out in the streets together. You’d definitely notice that.
In the streets, sure. Meeting up at out of town conference centers a few times a year, probably not. Most real communities have always been "dark matter" to those outside them; Discord working the same way feels more authentic than most of the internet.
Something I think we're slowly coming to terms with is that the current generation of techies (the ones who can afford to spend hours upon hours tweaking models and sharing results) really prefer Discord over our Web 2.0 forum-type communities like this one. Even on Reddit, which is lagging in popularity among Gen Z compared to Discord or TikTok, you can immediately tell upon reading /r/LocalLLaMA that a really big chunk of this community is underage. To be clear, I think this is a good thing!
There was a generation that preferred mailing lists. There was a generation that preferred IRC and BBSes, and "my" generation, which likes forums and lengthy comment threads. One would be naive to think this style (the one we're engaging in here) would last forever.
There are definitely very real criticisms of Discord, searchability and discoverability being the most common, but at this point I think the die has been cast. Young people have made their choice.
Agreed. I'm in my early 30s and jump between most platforms, but engage very little with TikTok/Discord.
But I have to admit a lot of newer content (and tech framework support) has migrated to Discord channels. Even some YouTube sports talk shows have their own Discord for call-ins, etc.
These big teleconference apps are usually hit or miss, but Discord currently seems to be the winner for actual "social networking"; add in its popularity in the gaming community too.
I kind of disagree? I'm Gen Z myself and have used Reddit extensively. While I like Discord a lot, I strongly disagree with using it to host content, essentially gating non-members from what they want (which is what leads to these communities with ludicrously inflated member counts). And this sentiment definitely isn't just me: a lot of the techie "CS major" people I know lean towards slightly older services, which is also probably why the aforementioned /r/LocalLLaMA community still has more than 60 thousand members.
That being said, Discord does have some advantages over older forum-type communities: it's usually way better for cultivating smaller communities, and its no-effort-required chat means you can always hop on and discuss things on the cutting edge. That's quite important in a field like AI, where it feels like something revolutionary happens every other week.
(Also, I don't know if that implication was intentional, but gen Z and "underaged" haven't meant the same thing for many years now)
I poked around there for a while, and my takeaway was "sub-par" all around, which might be the reason for its relative obscurity? The thing is, I can't tell to what extent it's the tech, and to what extent it's just very uninteresting source material.
Like, there's a whole lot of "classic song done by presently popular rapper," and I'll be the first to insist that there is nearly nothing vocally interesting coming from today's popular hip-hop artists (and I say this as a longtime hip-hop aficionado).
What's the best open-source text-to-speech? ElevenLabs and others are interesting but closed source. I want to use it mainly for audiobooks, as I have a lot of ePubs and I'm currently just using the basic Google text-to-speech voices on my Android via Moon+ Reader. It works fine, but it's still more robotic than the state of the art.
I doubt it's currently actually "the best open source text to speech", but the answer I came up with when throwing a couple of hours at the problem some months ago was "ttsprech" [3].
Following the guide, it was pretty trivial to make the model render my sample text in about 100 English "voices" (many of which were similar to each other, and in varying quality). Sampling those, I got about 10 that were pretty "good". And maybe 6 that were the "best ones" (very natural, not annoying to listen to, actually sounded like a person by and large), and maybe 2 made the top (as in, a tossup for the most listenable, all factors considered).
IIRC, the license was free for noncommercial use only; I'm not sure exactly how open source they are, but it was simple to install the dependencies and write the basic Python to try it out (I had to write a for loop to try all the voices like I wanted). I ended up using something else for the project for other reasons, but this could still be a fairly good backup option for some use cases, IMO.
PRE-EDIT, ERRONEOUS ANSWER
Same as above, but I had said "Silero" [0, 1, 2] originally, which I started trying out too, before switching to a third (less open) option.
For neutral-sounding, very fast/efficient voices, I find the Coqui TTS VITS models to be very good. For slower, more expressive voices or voice cloning, I think Coqui TTS's XTTS is good (or you can look at mrq/tortoise-tts).
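If it helps anyone evaluating, both are a few lines with the Coqui TTS Python API (model names from their catalog as I remember them; `tts --list_models` shows what's current):

    from TTS.api import TTS

    # Fast, neutral single-speaker VITS model:
    vits = TTS("tts_models/en/ljspeech/vits")
    vits.tts_to_file(text="Chapter one.", file_path="vits.wav")

    # Slower but more expressive XTTS, cloning a voice from a reference clip:
    xtts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
    xtts.tts_to_file(text="Chapter one.", file_path="xtts.wav",
                     speaker_wav="reference.wav", language="en")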
Looks promising, I'm going to check it out too! MIT license, even! If it's fast enough for real time, it could be the new best option. The paper claims faster inference than VITS...
In my post I link to my issue, where I outline what I needed to do from a clean mamba env; that might help.
PyTorch nightly (I use it for CUDA 12) doesn't work with Python 3.12, but if you stick with 3.11 or 3.10 you should be OK. The rest was listed without version numbers; on a clean venv you should be fine, though there's a bug in the utils lib that requires a one-line fix if you're trying to run inference (also linked). nltk was the only dependency not listed, so not bad compared to most code drops!
I spent a couple of hours debugging why Jupyter's debugger wasn't working right, so not exactly related to the code. I did also find and fix that utils bug you mentioned. But my current issue is that phonemizer won't find espeak, even though I set the environment variables that are supposed to work. I'll figure it out eventually...
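For reference, this is what I'm setting (the two variables phonemizer documents; the paths are the default eSpeak NG install locations on Windows, so adjust if yours differ):

    import os

    # Both must be set before phonemizer is imported, or it won't see them.
    os.environ["PHONEMIZER_ESPEAK_LIBRARY"] = r"C:\Program Files\eSpeak NG\libespeak-ng.dll"
    os.environ["PHONEMIZER_ESPEAK_PATH"] = r"C:\Program Files\eSpeak NG\espeak-ng.exe"

    from phonemizer import phonemize
    print(phonemize("hello world"))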
Thanks for writing up your experience! Good to know it works! And it's fast!
Yes I'm on Windows at the moment. I did try setting those yesterday but I must have made a typo or something. I'll try again, thanks!
Edit: Got it working, sounds really great and is super fast as advertised. Amazing! Just tried modifying the code to make it speak more quickly and it worked first try and still sounds good too! This is way better than using Coqui TTS. Just need a few more pretrained models and the voice cloning that was in the paper and this will become super popular very quickly.
We bought the $300/month plan for a few months earlier this year... and you'd only get 40 hours of audio generation for that. It wasn't really sufficient to our needs.
How many audio books is 40 hours?
Also, while its voice cloning was truly amazing, every once in a while the voice would get a little nutty and sound like an insect just flew down the speaker's throat, or maybe they had an LSD flashback. Normal, normal, normal, then it's some Bobcat Goldthwait skit. And if you dialed down that parameter (I think it's called stability?), it went monotone really quickly.
We're probably several years out from it being something people use personally for audio books.
> We bought the $300/month plan for a few months earlier this year... and you'd only get 40 hours of audio generation for that. It wasn't really sufficient to our needs.
All of these AI as a Service (AaaS?) API companies are going to race each other to razor thin margins. Immediately after ElevenLabs raised, five other TTS services raised nearly the same amount of money.
It's mostly audiobooks. I have some ePubs that don't have audiobooks anywhere, such as many Japanese light novel fan (or official) translations into English. I can get through them because I can understand audio faster than I can read text; I play back at 3 to 5x speed.
What's your retention/comprehension of the content at those speeds? I find that those speeds allow me to understand a concept as it's whizzing by, but my retention of it is not good. Everything I've ever been taught about long-term retention, plus personal experience, says speed is not the most conducive.
Retention is pretty good, but that's because I've been training myself for the past 5 to 10 years to get to that speed. It's similar to how the sped-up TTS blind users listen to is incomprehensible to most sighted people.
In my Audible library, the shortest is the first Hitchhiker's Guide to the Galaxy at 5h51m. The longest is The Power Broker at 66h9m. Most of the books I have are in the 15-25 hour range, but I also have a lot of fantasy that gets near 50 hours (Game of Thrones, Brandon Sanderson...).
I've tried a few, not an expert, but I think Coqui's new XTTS models are decent performance and quality wise (just in terms of how the speech sounds, can't speak to the voice cloning fidelity as I don't care about that). Open source code but non-commercial license for the model. They also have a bunch of models with more permissive licenses that aren't as good.
I haven't re-evaluated OSS TTS options for a few months but from my own experience earlier in the year I've been pleased with the results I've gotten from Piper:
I've primarily used it with the LibriTTS-based voices due to their license but if it's for personal local use you can probably use some of the other even higher quality voices.
While it's not perfect & output quality varies for a number of reasons, I've been using it because it's MIT licensed & there are multiple diverse voice options with licenses that suit my purposes.
(Piper and its predecessors Larynx & Mimic3 are significantly ahead of where other FLOSS options had been up until their existence in terms of quality.)
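For anyone who wants to hear it quickly, here's a minimal sketch of driving Piper from Python (it's also a plain CLI; the flags are from its README as I remember them, and the voice file name assumes you've downloaded one of the LibriTTS .onnx voices):

    import subprocess

    TEXT = "Piper is a fast, local neural text-to-speech system."
    # Piper reads text on stdin and writes a wav:
    subprocess.run(
        ["piper", "--model", "en_US-libritts-high.onnx",
         "--output_file", "out.wav"],
        input=TEXT, text=True, check=True)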
> Artifacts aside, it sounds like Michael Jackson doing a Weird Al impression?! Every line has a distinctly “white and nerdy” vibe: it loses any seriousness and edge, exaggerating words for comic effect and enunciating lyrics really clearly so the punchlines can be heard.
No, it sounds like someone doing an impression of Weird Al doing an impression of Michael Jackson. Someone whose mom told them they were special, and they believed it.
These examples are standing on a ridge line, surveying the uncanny valley and looking for the best way to cross.
I have an accent. If not for that, I'd be a great presenter.
If I could translate my voice into a poor Neil deGrasse Tyson, a poor Patrick Stewart, a poor Carl Sagan, a poor Morgan Freeman, etc., my presentations would be... better.
If it makes you more comfortable and confident, that is helping you.
This isn't autotune for the spoken word, though. It's not fixing pacing or vocabulary, and in the audio above it isn't even fixing intonation. Yes, a thick German accent will give you away as being of German extraction. But you're also using the word 'since' where Brits and Americans would use 'for', and it's not going to fix that, any more than it'll fix my French when I make the exact same mistake going the other direction (for=duration vs for=purpose vs for=interval). If I hear 'since one month', you're likely German or Indian. If you ask how long I've been in Marseille, you'll know I'm American in about half that time.
Finally, a way to not have to fix society's prejudices: just give everybody the tools to emulate the ideal of perfection, no matter what color their skin is or what their accent sounds like.