ELOQUENT lab for evaluation of generative language model quality

The ELOQUENT lab evaluates the quality and usefulness of generative language models, addressing high-level quality criteria through a set of open-ended shared tasks designed to require minimal human assessment effort.

Tasks

  • Task 1 - Voight-Kampff

    Generate text samples that a classifier will then try to distinguish from human-authored text (a minimal generation sketch follows this list).

  • Task 2 - Robustness and Consistency

    Explore how much a generative language model's output is affected by stylistic, dialectal, or other non-topical variation in the input (see the consistency-scoring sketch after this list).

  • Task 3 - Dictionary Definition

    Have a generative language model provide a definition for a previously unseen word and identify false or erroneous definitions.

  • Task 4 - Preference Score Prediction

    Predict the preference scores that human assessors assigned to sets of LLM-generated responses, and generate judgments that explain the choices.

  • Task 5 - Sensemaking

    Given a set of possibly noisy texts, generate questions and answers about the topic.
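
To make the task setup concrete, here is a minimal Task 1 sketch using the Hugging Face transformers library; the model name, prompts, and sampling settings are illustrative assumptions rather than lab requirements.

    # Minimal Task 1 (Voight-Kampff) sketch: produce machine-generated
    # samples for the human-vs-machine classifier to judge.
    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")  # placeholder open model

    prompts = [
        "The city council met on Tuesday to discuss",
        "In a surprising turn of events,",
    ]

    for prompt in prompts:
        out = generator(prompt, max_new_tokens=100, do_sample=True, temperature=0.9)
        print(out[0]["generated_text"])
        print("---")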
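
For Task 2, one simple way to quantify how strongly non-topical input variation changes a model's output is to embed the outputs and compare them. The sketch below assumes the sentence-transformers library; the embedding model and the example outputs are placeholders.

    # Minimal Task 2 sketch: score output consistency across stylistic
    # variants of the same input via embedding cosine similarity.
    from sentence_transformers import SentenceTransformer, util

    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

    # Outputs returned by the model under test for a formal and a colloquial
    # phrasing of the same question (placeholder strings here).
    output_formal = "The capital of France is Paris."
    output_colloquial = "Paris, obviously."

    embeddings = embedder.encode([output_formal, output_colloquial], convert_to_tensor=True)
    similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
    print(f"Cross-variant output similarity: {similarity:.3f}")

A score near 1.0 suggests the input variants elicited essentially the same answer, while lower scores flag inconsistencies worth inspecting by hand.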

Organizers

  • Jussi Karlgren, Silo AI
  • Ekaterina Artemova, Toloka AI
  • Ondřej Bojar, Charles University
  • Timothee Mickus, University of Helsinki
  • Vladislav Mikhailov, University of Oslo
  • Aarne Talman, University of Helsinki
  • Magnus Sahlgren, AI Sweden
  • Erik Velldal, University of Oslo
  • Lilja Øvrelid, University of Oslo

Contact

  • eloquent-clef2025-organizers@gmail.com