I just tried a few, and nothing in the open-source space seems complete, with an easy, freely available checkpoint setup and good documentation. Do they all require proprietary weights or worse?
I tried Bark, SpeechT5, and MMS, and looked at Elevenlabs and Silero: the last two because they are enabled as options in Oobabooga, the first three because they are on Hugging Face.
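For context, this is roughly the kind of checkpoint setup I mean for the Hugging Face ones. A sketch of the SpeechT5 path as I understand it from the transformers docs, assuming the microsoft/speecht5_tts and microsoft/speecht5_hifigan checkpoints plus the CMU Arctic xvectors dataset for speaker embeddings:

```python
import torch
import soundfile as sf
from datasets import load_dataset
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

# Free checkpoints pulled straight from the Hub
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# SpeechT5 needs a speaker embedding; the docs use the CMU Arctic xvectors
embeddings = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embedding = torch.tensor(embeddings[7306]["xvector"]).unsqueeze(0)

inputs = processor(text="Testing open source text to speech.", return_tensors="pt")
speech = model.generate_speech(inputs["input_ids"], speaker_embedding, vocoder=vocoder)
sf.write("speech.wav", speech.numpy(), samplerate=16000)
```

That kind of minimal path is what I was hoping for across the board, with documentation that goes further than the hello-world example.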
I have used all of the above. In my experience, Elevenlabs is the most natural-sounding (and easiest to use), with the open-source alternatives (kind of) close behind it.
Unfortunately, Elevenlabs' code is proprietary, so there's a bit of a compromise there (unless you want to use one of the open-source alternatives you mentioned). To your point, though, those aren't the most user-friendly.
TTS has definitely been a neglected field in this wave of AI development, but I think it's only a matter of time before new options emerge as startups and other projects take flight this year and next. It will be a crucial area to nail for immersive video game dialogue, so I'm sure someone will come up with a new platform or approach. Fingers crossed they make it open-source.
For now, my suggestion is to stick with whatever TTS workflow works best with your current tech stack until something new comes out.
If you end up finding something worth sharing, let us know! I’m very curious to see how audio and speech synthesis develops alongside all of this other fosai tech we’ve been seeing.
Well, I tried Tortoise TTS today and got a bit farther than with the others, but it still doesn't work for me. I almost have it working, but figuring out the API and playing the audio from a conda environment inside a Distrobox container (just to shield my system from the outdated dependencies the project uses) may prove to be too much for my skills. The documentation for offline execution is crap.
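For what it's worth, the basic generation path looks manageable once you dig through the repo's examples. A sketch of my understanding, assuming the bundled "tom" voice and the "fast" preset from their samples; I'd just write a wav and play it on the host rather than fight audio passthrough from the nested containers:

```python
import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voice

# Loads the model checkpoints (downloads them on first run, as far as I can tell)
tts = TextToSpeech()

# One of the voices shipped with the repo
voice_samples, conditioning_latents = load_voice("tom")

gen = tts.tts_with_preset(
    "Testing Tortoise from inside the container.",
    voice_samples=voice_samples,
    conditioning_latents=conditioning_latents,
    preset="fast",
)

# Write the 24 kHz output to disk and play the file outside the container
torchaudio.save("generated.wav", gen.squeeze(0).cpu(), 24000)
```

The offline part is what the docs don't really cover: as far as I can tell, the first run wants to download checkpoints, so you'd have to cache them before cutting network access.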
I'm actually getting farther into these configurations by keeping a WizardLM 30B GGML running in instruct mode the whole time and asking it questions. It is quite capable of taking in most error output from a terminal and giving almost useful advice in many cases. That 30B model in a GGML setup with 10 CPU threads and 20 layers offloaded to a 3080 Ti (16 GB) is very close to the speed of a Llama 2 7B running entirely on the GPU. It only crashes if I feed it something larger than what might fit on a single page of a PDF.

My machine has 32 GB of system memory; I think I need to go to the 64 GB max. As far as I have seen, a 7B model lies half the time, a 13B lies about 20% of the time, and my 30B lies around 10% of the time at 4-bit. With a ton of extra RAM I want to see how much better a 30B is at 8-bit, or whether a 70B is feasible and maybe closes the gap.
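In case anyone wants to replicate the setup, this is roughly how I drive it through llama-cpp-python (a sketch; the GGML filename and the Alpaca-style instruct template are placeholders for whichever WizardLM quant you actually have):

```python
from llama_cpp import Llama

# Hypothetical filename; 10 CPU threads plus 20 layers offloaded to the 3080 Ti
llm = Llama(
    model_path="./wizardlm-30b.ggmlv3.q4_0.bin",
    n_ctx=2048,
    n_threads=10,
    n_gpu_layers=20,
)

# Paste terminal errors straight into the instruction block
prompt = (
    "### Instruction:\n"
    "Explain this terminal error and suggest a fix:\n"
    "ModuleNotFoundError: No module named 'torchaudio'\n\n"
    "### Response:\n"
)

out = llm(prompt, max_tokens=512, stop=["### Instruction:"])
print(out["choices"][0]["text"])
```

Feeding it error output like that is the whole workflow; anything much bigger than a page of text is where it falls over for me.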