Skip to content

Benchmarks

Benchmarking LLMs for data labeling

Key takeaways from our technical report:

  • State of the art LLMs can label text datasets at the same or better quality compared to skilled human annotators, but ~20x faster and ~7x cheaper.
  • For achieving the highest quality labels, GPT-4 is the best choice among out of the box LLMs (88.4% agreement with ground truth, compared to 86% for skilled human annotators).
  • For achieving the best tradeoff between label quality and cost, GPT-3.5-turbo, PaLM-2 and open source models like FLAN-T5-XXL are compelling.
  • Confidence based thresholding can be a very effective way to mitigate impact of hallucinations and ensure high label quality.