Scientists warn that current AI tests reward polite responses rather than real moral reasoning in large language models.
As large language models (LLMs) gain momentum worldwide, there’s a growing need for reliable ways to measure their performance. Benchmarks that evaluate LLM outputs allow developers to track ...
Understanding complex biological pathways, such as gene-gene interactions and gene regulatory networks, is crucial for exploring disease mechanisms and advancing drug development. However, manual ...
In a remote, within-participant simulation, 26 oncologists from the United Kingdom, United States, Spain, and Singapore reviewed synthetic breast cancer cases and created comprehensive summaries for ...
Scoping review finds large language models can support glaucoma education and decision support, but accuracy and multimodal limits persist.
Enter large language model (LLM) evaluation: the practice of analyzing and refining GenAI outputs to improve their accuracy and reliability while avoiding bias. The evaluation process ...
Artificial intelligence has traditionally advanced through automatic accuracy tests in tasks meant to approximate human knowledge. Carefully crafted benchmark tests such as The General Language ...
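At their core, automatic accuracy benchmarks of this kind reduce to a simple loop: run the model on each test example and score its output against a reference answer. A minimal sketch of that idea, where the model stub, the toy examples, and the exact-match metric are illustrative assumptions rather than any specific benchmark's actual data or scoring code:

```python
def exact_match_accuracy(model, examples):
    """Score a model by the fraction of prompts whose output
    exactly matches the reference answer (after normalization)."""
    correct = 0
    for prompt, reference in examples:
        prediction = model(prompt).strip().lower()
        if prediction == reference.strip().lower():
            correct += 1
    return correct / len(examples)

# Toy usage: a stand-in "model" that answers every prompt with "paris".
examples = [
    ("Capital of France?", "Paris"),
    ("Capital of Japan?", "Tokyo"),
]
score = exact_match_accuracy(lambda prompt: "paris", examples)
print(score)  # 0.5
```

Real suites differ mainly in scale and metric choice (accuracy, F1, correlation), but the run-and-score structure is the same.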
A new systematic review reveals that only 5% of health care evaluations for large language models use real patient data, with significant gaps in assessing bias, fairness, and a wide range of tasks, ...
Explore how vision-language-action models like Helix, GR00T N1, and RT-1 are enabling robots to understand instructions and act autonomously.