Mapping the State of Open Infrastructure: Insights from OSMI Working Group 4
Cristina Huidiu & Pragya Chaube, Working Group 4 Co-chairs
When we talk about open science, we often think about open-access journals or FAIR data. But there’s another layer that quietly powers it all — the infrastructure that captures, connects, and analyses research metadata.
Earlier this year, members of OSMI Working Group 4 set out to understand what this landscape really looks like in practice. To that end, we ran an internal survey among the working group's members; twenty-two people responded. Their answers paint a surprisingly candid picture of where we stand.
The Survey
Respondents came from 15 countries, concentrated mainly in Europe, especially France and Germany, with additional voices from the Netherlands, Italy, Canada, the U.S., India, Chile, Tanzania, Australia, and others.
About half (10 of 22) said they build software, models, or lead engineering teams, while the rest are policy officers, administrators, or publishers involved in research dissemination.
This mix matters. It means the survey captures both technical realities (how infrastructure is actually built) and policy-level concerns (how it’s governed, funded, and sustained).
We asked participants how they monitor research output (publications, datasets, grants, and so on), which tools they use for text and data mining (TDM), the challenges they face (technical, legal, and organisational), their readiness to share metadata, and what a 'successful' shared infrastructure might look like. It wasn't a tick-box exercise. It was a conversation, one that revealed not only how far we've come, but also how unevenly.
Uneven Infrastructure Readiness
One of the clearest messages is that institutions are at very different stages of maturity. Some, like INRIA or the French Ministry of Higher Education and Research, already manage large-scale systems such as HAL — processing millions of publications each month. Others, including some smaller universities or publishers, simply don’t monitor open science outputs at all. When asked why, many cited limited computing power, skill gaps, or the absence of institutional mandates. As one respondent put it, ‘Not our role’ — a reminder that open science is still perceived unevenly across sectors.
The data shows a strong reliance on open-source tools such as GROBID, FastText, Entity-Fishing, and other machine-learning-based frameworks for parsing and classification. Closed-source or commercial tools are rare — often dismissed as too expensive, too opaque, or legally complicated.
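For readers unfamiliar with these tools, here is a minimal sketch of how a monitoring pipeline might send a PDF to GROBID for parsing. It assumes a GROBID service running locally on its default port (8070); the /api/processFulltextDocument endpoint and the 'input' form field come from GROBID's documented REST API, while the wrapper function and file name are purely illustrative.

```python
import requests

# Default endpoint of a locally running GROBID service (assumption: port 8070).
GROBID_URL = "http://localhost:8070/api/processFulltextDocument"

def parse_pdf_to_tei(pdf_path: str) -> str:
    """Send a PDF to a running GROBID server and return the TEI XML it produces."""
    with open(pdf_path, "rb") as pdf:
        response = requests.post(
            GROBID_URL,
            files={"input": pdf},             # GROBID expects the PDF in the 'input' field
            data={"consolidateHeader": "1"},  # ask GROBID to consolidate header metadata
            timeout=120,
        )
    response.raise_for_status()
    return response.text  # TEI XML: header, references, affiliations, etc.

if __name__ == "__main__":
    tei = parse_pdf_to_tei("paper.pdf")  # hypothetical input file
    print(tei[:500])
```

In practice these services are often chained: GROBID produces structured TEI, and classifiers such as the FastText- or Entity-Fishing-based components mentioned above operate on the extracted text downstream.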
The Legal Grey Zone
Even when content is technically open, the legal landscape isn’t clear. Respondents described a maze of licensing ambiguities, publisher-specific API limits, and inconsistent metadata about reuse rights. Some can’t share extracted metadata at all because of national copyright laws or publisher clauses limiting how much text can be quoted. So while open access has grown, ‘openly usable’ data for mining and analytics remains limited. The infrastructure is open, but the permissions often aren’t.
Metadata: Abundant but Messy
Most institutions extract standard elements like references, affiliations, funder IDs, code or data mentions, and grant identifiers. A few go further, mining conflicts of interest or patent links. Yet almost everyone reports problems with data quality — incomplete metadata, poor OCR in PDFs, inconsistent formats across publishers.
Verifying accuracy often means manual checks rather than an automated pipeline. One respondent described the process plainly: 'We extract metadata from the article and from the metadata source & compare them.'
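To make that workflow concrete, here is a minimal sketch of the cross-check, assuming extracted records are plain Python dicts and using the public Crossref REST API (api.crossref.org/works/{doi}) as the comparison source. The field names, normalisation, and placeholder DOI are illustrative, not any respondent's actual pipeline.

```python
import requests

def crossref_record(doi: str) -> dict:
    """Fetch the registered metadata for a DOI from the public Crossref REST API."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=30)
    resp.raise_for_status()
    return resp.json()["message"]

def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so formatting noise doesn't count as a mismatch."""
    return " ".join(text.lower().split())

def title_matches(extracted: dict, doi: str) -> bool:
    """Compare the title extracted from the full text with the registered one."""
    registered_titles = crossref_record(doi).get("title", [])
    registered = registered_titles[0] if registered_titles else ""
    return normalise(extracted.get("title", "")) == normalise(registered)

# Hypothetical record produced by an extraction pipeline; the DOI is a placeholder.
extracted = {"doi": "10.1234/placeholder", "title": "An Extracted Title"}
try:
    if not title_matches(extracted, extracted["doi"]):
        print("Mismatch: flag for manual review")  # mirrors the respondents' manual checks
except requests.HTTPError:
    print("DOI not found in Crossref: flag for manual review")
```

A real pipeline would compare many more fields (authors, affiliations, funder IDs) and tolerate fuzzier matches, which is precisely why manual review remains part of the loop.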
Collaboration, but Conditional
Despite the challenges, there’s a strong appetite for collaboration. Most institutions said they’re willing to share extracted metadata under shared governance frameworks such as COMET, OpenAlex, or ScanR. However, that willingness comes with a caveat: they need clarity on licensing, data protection, and attribution. No one wants to risk violating copyright or GDPR — even accidentally.
Imagining Success
When asked what ‘success’ would look like for a shared metadata infrastructure, respondents described a common vision: a global, community-driven system that ensures equitable access to metadata tools and enhances the visibility of all researchers.
Others emphasised sustainability, interoperability, and fair participation — not just for well-funded Western institutions, but for the entire global research community. In other words, success isn’t only about better technology; it’s about shared stewardship.
Final Reflections
For all the brilliant tools, dedicated developers, and passionate open-science advocates, we're all navigating a web of policy, legal, and funding hurdles that no single institution can solve alone. The survey, despite its small sample size, captures a snapshot of a global movement still figuring itself out: a movement that's technically capable, legally uncertain, and quietly determined. If there's one takeaway, it's this: the future of open science won't be built by any one lab or nation; it will be co-created in the shared infrastructure we all depend on.