Evaluating large language models trained on code
OpenAI's HumanEval benchmark, introduced alongside the Codex release, evaluates code generation models by measuring the functional correctness of the programs they generate.
Read at source: https://openai.com/index/evaluating-large-language-models-trained-on-code