PhD Student at Universidad Carlos III de Madrid
Hi, I'm Lorena.
I'm finishing my PhD at Universidad Carlos III de Madrid, where my research focuses on making topic models more usable for end users, particularly within the Science, Technology, and Innovation domain.
My work lies at the intersection of NLP, human-computer interaction, and usable software design. I'm passionate about creating language technologies that truly serve people — bridging the gap between technical innovation and real-world usability.
As I complete my PhD (defending on November 28th!), I'm excited to shift my focus toward Clinical NLP, continuing to explore how we can make NLP tools more effective, transparent, and user-centered.
") does not match the recommended repository name for your site ("").
", so that your site can be accessed directly at "http://".
However, if the current repository name is intended, you can ignore this message by removing "{% include widgets/debug_repo_name.html %}" in index.html.
",
which does not match the baseurl ("") configured in _config.yml.
baseurl in _config.yml to "".

Lorena Calvo-Bartolomé, Valérie Aldana, Karla Cantarero, Alonso Madroñal de Mesa, Jerónimo Arenas-García, Jordan Lee Boyd-Graber
EMNLP 2025
Multilingual question answering (QA) systems must ensure factual consistency across languages, especially for objective queries such as "What is jaundice?", while also accounting for cultural variation in subjective responses. We propose MIND, a user-in-the-loop fact-checking pipeline to detect factual and cultural discrepancies in multilingual QA knowledge bases. MIND highlights divergent answers to culturally sensitive questions (e.g., "Who assists in childbirth?") that vary by region and context. We evaluate MIND on a bilingual QA system in the maternal and infant health domain and release a dataset of bilingual questions annotated for factual and cultural inconsistencies. We further test MIND on datasets from other domains to assess generalization. In all cases, MIND reliably identifies inconsistencies, supporting the development of more culturally aware and factually consistent QA systems.
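A minimal sketch of the general idea, not the released MIND pipeline: compare each question's answers across languages with a multilingual encoder and flag low-agreement pairs for human review. The encoder name, threshold, and field names below are assumptions for illustration.

```python
# Sketch: flag EN/ES answer pairs whose cross-lingual similarity is low,
# so a reviewer can decide whether the divergence is factual or cultural.
from sentence_transformers import SentenceTransformer, util

# Multilingual encoder; the model actually used by MIND is an assumption here.
encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def flag_discrepancies(qa_pairs, threshold=0.75):
    """qa_pairs: dicts like {"question": ..., "answer_en": ..., "answer_es": ...}."""
    flagged = []
    for pair in qa_pairs:
        emb = encoder.encode([pair["answer_en"], pair["answer_es"]],
                             convert_to_tensor=True)
        score = util.cos_sim(emb[0], emb[1]).item()
        if score < threshold:  # low agreement -> send to human review
            flagged.append({**pair, "similarity": round(score, 3)})
    return flagged

kb = [{"question": "What is jaundice?",
       "answer_en": "Jaundice is a yellowing of the skin caused by excess bilirubin.",
       "answer_es": "La ictericia es una coloración amarillenta de la piel por exceso de bilirrubina."}]
for item in flag_discrepancies(kb):
    print(item["question"], item["similarity"])
```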

Zongxia Li, Lorena Calvo-Bartolomé, Alexander Hoyle, Paiheng Xu, Daniel Stephens, Alden Dima, Juan Francisco Fung, Jordan Lee Boyd-Graber
ACL 2025
A common use of NLP is to facilitate the understanding of large document collections, with models based on Large Language Models (LLMs) replacing probabilistic topic models. Yet the effectiveness of LLM-based approaches in real-world applications remains underexplored. This study measures the knowledge users acquire with topic models—including traditional, unsupervised, and supervised LLM-based approaches—on two datasets. While LLM-based methods generate more human-readable topics and show higher average win probabilities than traditional models for data exploration, they produce overly generic topics for domain-specific datasets that do not easily allow users to learn much about the documents. Adding human supervision to LLM-based topic models improves data exploration by addressing hallucination and genericity but requires more human effort. In contrast, traditional models like Latent Dirichlet Allocation (LDA) remain effective for exploration but are less user-friendly. This paper provides best practices—there is no one right model, the choice of models is situation-specific—and suggests potential improvements for scalable LLM-based topic models.
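For context, a minimal sketch of the traditional baseline discussed above: fitting LDA with gensim and inspecting its top words, which is the kind of output users compared against LLM-based topics. The toy corpus and hyperparameters are assumptions, not the study's setup.

```python
# Sketch: fit a small LDA model and print the top words of each topic.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [
    ["topic", "model", "document", "cluster"],
    ["llm", "prompt", "topic", "label"],
    ["patent", "research", "project", "document"],
]  # toy tokenized corpus; the real experiments used two full datasets

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               random_state=0, passes=10)

for topic_id in range(lda.num_topics):
    words = [w for w, _ in lda.show_topic(topic_id, topn=4)]
    print(f"Topic {topic_id}: {', '.join(words)}")
```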

Alexander Hoyle, Lorena Calvo-Bartolomé, Jordan Boyd-Graber, Philip Resnik
ACL 2025
Topic model and document-clustering evaluations either use automated metrics that align poorly with human preferences or require expert labels that are intractable to scale. We design a scalable human evaluation protocol and a corresponding automated approximation that reflect practitioners' real-world usage of models. Annotators (or an LLM-based proxy) review text items assigned to a topic or cluster, infer a category for the group, then apply that category to other documents. Using this protocol, we collect extensive crowdworker annotations of outputs from a diverse set of topic models on two datasets. We then use these annotations to validate automated proxies, finding that the best LLM proxies are statistically indistinguishable from a human annotator and can therefore serve as a reasonable substitute in automated evaluations. Package, web interface, and data are at https://github.com/ahoho/proxann
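A hedged sketch of the two-step protocol described above, with an LLM standing in for the human annotator. `ask_llm` is a placeholder for any chat-completion call; the prompts and structure here are illustrative, not the released ProxAnn package.

```python
# Sketch: (1) infer a category from documents grouped under one topic,
# (2) apply that category to held-out documents as a yes/no judgment.
from typing import Callable, List

def infer_category(topic_docs: List[str], ask_llm: Callable[[str], str]) -> str:
    """Step 1: read documents assigned to one topic and name their shared category."""
    prompt = ("Here are documents grouped together by a model:\n"
              + "\n".join(f"- {d}" for d in topic_docs)
              + "\nIn a few words, what category do they share?")
    return ask_llm(prompt).strip()

def fits_category(category: str, document: str, ask_llm: Callable[[str], str]) -> bool:
    """Step 2: decide whether a held-out document belongs to the inferred category."""
    prompt = (f"Category: {category}\nDocument: {document}\n"
              "Does the document belong to this category? Answer yes or no.")
    return ask_llm(prompt).strip().lower().startswith("yes")

# Usage with any backend, e.g. ask_llm = lambda p: my_client.complete(p)
# category = infer_category(top_docs_for_topic, ask_llm)
# agreement = [fits_category(category, d, ask_llm) for d in held_out_docs]
```

Keeping the backend behind a plain callable mirrors the idea that either a crowdworker or an LLM proxy can fill the annotator role.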

Lorena Calvo-Bartolomé, Jerónimo Arenas-García, David Pérez-Fernández
COLING 2025
In recent years, there has been growing interest in using NLP tools for decision support systems, particularly in Science, Technology, and Innovation (STI). Among these, topic modeling has been widely used for analyzing large document collections, such as scientific articles, research projects, or patents, yet its integration into decision-making systems remains limited. This paper introduces CASE, a tool for exploiting topic information for semantic analysis of large corpora. The core of CASE is a Solr engine with a customized indexing strategy to represent information from Bayesian and neural topic models, allowing efficient topic-enriched searches. Through ad hoc plug-ins, CASE enables topic inference on new texts and semantic search. We demonstrate the versatility and scalability of CASE through two use cases: the calculation of aggregated STI indicators and the implementation of a web service to help evaluate research projects.
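A simplified sketch of what topic-enriched indexing can look like with pysolr: each document stores the weight of its dominant topics so searches can be filtered or boosted by topic. This is a stand-in for illustration only; CASE's actual indexing strategy differs, and the collection URL and field names are assumptions.

```python
# Sketch: index documents with per-topic weight fields and run a
# keyword search restricted to documents where one topic is prominent.
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/case_demo", timeout=10)

# Index documents together with their (topic id -> weight) distribution,
# using Solr's default *_f dynamic float fields.
solr.add([
    {"id": "doc1", "title": "Deep learning for patent analysis",
     "topic_5_f": 0.62, "topic_12_f": 0.21},
    {"id": "doc2", "title": "Funding indicators for research projects",
     "topic_12_f": 0.70, "topic_3_f": 0.15},
], commit=True)

# Topic-enriched search: keep only documents where topic 12 carries weight >= 0.3,
# then rank the usual keyword matches within that slice.
results = solr.search("title:research",
                      fq="topic_12_f:[0.3 TO *]",
                      fl="id,title,topic_12_f")
for doc in results:
    print(doc["id"], doc.get("topic_12_f"))
```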