I’ve been using these prompts to compare how different LLMs perform, and the differences have been staggering.
The toughest one is Wheel of Fortune, which only works consistently on GPT4.
3.5 Turbo rarely works, or it works only at a surface level while misunderstanding the gameplay.
Bard never works.
BingChat kinda works, but sometimes gets sassy and ends the chat.