CEDA dataset expertise informs Google’s new dataset search
Posted on September 10, 2018 (Last modified on October 19, 2023)
CEDA staff have contributed to a new search tool launched on 5th September by Google that aims to help scientists, policy makers and other user groups more easily find the data required for their work and their stories, or simply to satisfy their intellectual curiosity.
In today’s world, scientists in many disciplines and a growing number of journalists live and breathe data. There are many thousands of data repositories on the web, providing access to millions of datasets, and local and national governments around the world publish their data as well. CEDA’s own archive holds over 6 petabytes of data in more than 154 million files, all focused on the atmospheric and earth observation fields.
UK Research and Innovation, the umbrella organisation that CEDA sits beneath, has made commitments to easy access to data. CEDA’s Data and Programme Manager, Dr Sarah Callaghan, was one of the experts who worked with Google to help develop Dataset Search, launched on 5th September.
Similar to how Google Scholar works, Dataset Search lets users find datasets wherever they’re hosted, whether it’s a publisher’s site, a digital library, or an author’s personal web page.
Google approached UK Research and Innovation’s Natural Environment Research Council (NERC) and Science and Technology Facilities Council (STFC) to help ensure their world-leading environmental datasets were included. These organisations’ heritage in managing huge, complex datasets on the atmosphere, oceans, climate change, and even the solar system led Sarah to work with Google on the project.
Dr Sarah Callaghan said: “In CEDA we manage, archive and distribute petabytes of data to make it available to scientific researchers and other interested parties. My experience making datasets findable, usable and interoperable enabled me to advise Google on their Dataset Search and how to best display their search results.”
“I was able to draw on my work with NERC and STFC datasets, not only in just archiving and managing data for the long term and the scientific record, but also helping users to understand if a dataset is the right one for their purposes.”
To create Dataset Search, Google developed guidelines for dataset providers to describe their data in a way that helps search engines better understand the content of their pages. These guidelines cover salient information about each dataset: who created it, when it was published, how the data was collected, what the terms are for using the data, and so on. This enables search engines to collect and link this information, work out where different versions of the same dataset might be, and find publications that describe or discuss the dataset. The approach is based on an open standard for describing this information (schema.org). Many environmental datasets held in the CEDA archive are already described in this way and are particularly good examples of findable, user-friendly datasets.
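As a rough illustration of what such a description can look like, the sketch below builds a schema.org "Dataset" record as JSON-LD and wraps it in the script tag a web page would carry. The property names (name, description, creator, datePublished, license, measurementTechnique) are schema.org terms that match the items listed above; the helper functions and every value in the example are hypothetical and do not describe a real CEDA record or CEDA's own tooling.

```python
import json


def build_dataset_jsonld(name, description, creator, date_published,
                         license_url, measurement_technique):
    """Return a schema.org Dataset description as a plain dictionary."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        # Who created the dataset
        "creator": {"@type": "Organization", "name": creator},
        # When it was published (ISO 8601 date)
        "datePublished": date_published,
        # The terms for using the data
        "license": license_url,
        # How the data was collected
        "measurementTechnique": measurement_technique,
    }


def to_script_tag(record):
    """Wrap the record in a JSON-LD script tag for embedding in a page."""
    body = json.dumps(record, indent=2)
    return f'<script type="application/ld+json">\n{body}\n</script>'


# Hypothetical example values, for illustration only.
example = build_dataset_jsonld(
    name="Example hourly rainfall observations",
    description="Illustrative rainfall record from a fictional gauge network.",
    creator="Example Data Centre",
    date_published="2018-09-05",
    license_url="https://example.org/licence",
    measurement_technique="Tipping-bucket rain gauge",
)

print(to_script_tag(example))
```

A data provider would place the printed tag in the HTML of the dataset's landing page, where a crawler can read the same fields a human visitor sees described in prose.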
“Standardised ways of describing data allows us to help researchers by building tools and services to make it easier to find and use data,” said Sarah. “If people don’t know what datasets exist, they won’t know how to look for what they need to solve their environmental problems. For example, an ecologist might not know where to go to find, or how to access, the rainfall data needed to understand a changing habitat. Making data easier to find will help introduce researchers from a variety of disciplines to the vast amount of data I and my colleagues manage for NERC and STFC.”
The new Google Dataset Search offers references to most datasets in environmental and social sciences, as well as data from other disciplines including government data and data provided by news organisations.
Professor Tim Wheeler, Director of Research and Innovation at NERC, said: “NERC is constantly working to raise awareness of the wealth of environmental information held within its Data Centres, and to improve access to it. This new tool will make it easier than ever for the public, business and science professionals to find and access the data that they’re looking for. We want to get as many people as possible interested in and able to benefit from data collected by the environmental science that we fund.”
Dr Chris Mutlow, Director of STFC RAL Space, said: “This work builds on RAL Space experience in data management and commitment to making it easily accessible. The expertise that Sarah and our other CEDA data scientists have in this area is becoming an ever more important global resource to call upon. The data centres we manage for NERC and STFC play an important role in scientific research and are a facility available to all.”
Image: An example search for weather records in Google Dataset Search. Credit: Google
For more information, contact Poppy Townsend by email (poppy.townsend@stfc.ac.uk) or phone (01235 446252).