I've created a very simple package for the idea introduced in the Databricks blog post (https://www.databricks.com/blog/enhancing-llm-as-a-judge-wit...).
It has proved quite useful for the use cases I've worked on: with grading notes you can leave small details about the domain concepts that LLMs tend to get wrong, rather than writing out a full reference answer, which takes a lot more labeling time.
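To make the idea concrete, here's a minimal sketch of how a grading note might be fed into a judge prompt in place of a full reference answer. The names (`GradedExample`, `build_judge_prompt`), the prompt wording, and the example data are all made up for illustration and are not the package's actual API; the call to the judge LLM is left out.

```python
from dataclasses import dataclass


@dataclass
class GradedExample:
    question: str
    # A short hint about what a correct answer must get right,
    # written by a domain expert instead of a full reference answer.
    grading_note: str


def build_judge_prompt(example: GradedExample, candidate_answer: str) -> str:
    """Assemble a judge prompt from a question, a grading note, and a candidate answer."""
    return (
        "You are grading an answer to a question.\n"
        f"Question: {example.question}\n"
        f"Grading note: {example.grading_note}\n"
        f"Candidate answer: {candidate_answer}\n"
        "Reply with PASS if the answer satisfies the grading note, otherwise FAIL."
    )


# Hypothetical example data, invented for this sketch.
example = GradedExample(
    question="What is the refund window for annual plans?",
    grading_note="Must state 30 days; must not apply the monthly-plan policy.",
)
prompt = build_judge_prompt(example, "You can get a refund within 30 days.")
print(prompt)
```

The resulting string would then be sent to whichever judge model you use. The labeling effort per example is one short note rather than a complete answer.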
I'd like to hear whether this or a similar approach has been useful for others too.