If you've had a chance to play with ChatGPT's recently released human speech synthesis, hopefully you've noticed how incredibly realistic it is. I think it's the addition of umms and ahhs and other regular speech patterns that really helps to sell it.
Check out the samples from their announcement blog article: https://openai.com/blog/chatgpt-can-now-see-hear-and-speak
Does anyone have an idea of what tech stack they're using to generate it? They indicate they hired professional voice actors for the training, but what about the software side?