Eval Quadratic Python Code Tutorial

RefineBench: Evaluating Refinement Capability of Language Models via Checklists

👋 Welcome to RefineBench — a comprehensive evaluation library for testing refinement capabilities of language models across multiple settings and domains. To reproduce the full results reported in ...

GitHub

CATArena: Engineering-Level Tournament Evaluation Platform for LLM-Driven Code Agents

CATArena (Code Agent Tournament Arena) is an open-ended environment where LLMs write executable code agents to battle each other and then learn from each other. CATArena is an engineering-level ...

IEEE

Methodology for Code Synthesis Evaluation of LLMs Presented by a Case Study of ChatGPT and Copilot

Abstract: Large Language Models (LLMs) have grown in popularity in recent years and are now employed in a variety of software engineering domains thanks to their Natural Language Processing (NLP) ...

IEEE

Development of Automatic Source Code Evaluation Tests Using Grey-Box Methods: A Programming Education Case Study

Abstract: Increasing the effectiveness of programming education has emerged as an important goal in teaching programming languages in the last decade. Automatic evaluation of the correctness of the ...
