New technical blog sharing expertise about managing petabytes of data
Posted on September 15, 2021 (Last modified on October 19, 2023) • 3 min read • 451 words
The CEDA team has a wealth of expertise in world-leading data management practices and software engineering, which we use to support the environmental science community. We want to share our knowledge, best practices and lessons learnt with others, so we have launched a new CEDA Technical Blog. The blog aims to showcase how our team uses open source tools to efficiently manage multiple petabytes of environmental data and collaborate with international research projects working on society’s most pressing environmental issues.
CEDA staff have research backgrounds in environmental science domains such as atmospheric and climate science, earth observation, geography and geology, oceanography, physics, computer science and more. The team covers a range of roles, including software engineers, data scientists, DevOps engineers and user support staff. The work our experts do is world-leading in the environmental science and data curation domains, so they are well placed to share technical information with interested communities.
To kick off the blog, we’ve written about some of the technical challenges our software engineers are working to solve. The topics discussed in the blog may be experimental and unfinished.
Search Futures - by Richard Smith, Senior Software Engineer
We have been looking for a flexible, scalable standard that would allow us to expose the bulk of the CEDA Archive via faceted search. This could then be used to build user interfaces and enhance search services at CEDA. In this blog post, we consider the feasibility and suitability of STAC (SpatioTemporal Asset Catalog) and discuss progress on an Elasticsearch-based implementation.
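To give a flavour of what faceted search over an Elasticsearch index involves, here is a minimal sketch of a request body that combines filters with `terms` aggregations, which return the value counts a user interface would show as facets. It is not taken from the CEDA implementation: the function name `build_facet_query` and the field names `platform` and `variable` are hypothetical, and the fields are assumed to be mapped as keywords.

```python
import json


def build_facet_query(filters, facet_fields, bucket_limit=10):
    """Build an Elasticsearch search body for a faceted search.

    filters: mapping of field name -> exact value the user has selected.
    facet_fields: fields for which to return value counts (the facets).
    All field names are illustrative assumptions, not a real CEDA schema.
    """
    return {
        "size": 20,  # number of matching documents to return
        "query": {
            "bool": {
                # Filter clauses restrict the results without affecting scoring.
                "filter": [{"term": {field: value}} for field, value in filters.items()]
            }
        },
        # One terms aggregation per facet: the most common values and their counts.
        "aggs": {
            field: {"terms": {"field": field, "size": bucket_limit}}
            for field in facet_fields
        },
    }


if __name__ == "__main__":
    body = build_facet_query({"platform": "sentinel-2"}, ["platform", "variable"])
    print(json.dumps(body, indent=2))
```

The resulting dictionary could be sent as the body of a search request; each aggregation in the response lists the buckets a search page would render as clickable facet values.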
What is a user? Removing anomalous behaviour from Anonymous access logs - by Mahir Rahman, Undergraduate Year-in-Industry student from University of York
The Climate Change Initiative (CCI) project’s goal is to provide open, registration-free access to essential climate variables. CEDA runs the CCI open data portal, a suite of services, including download and metadata services, that provides access to the CCI datasets held in the CEDA Archive. Dataset usage is an important metric for understanding uptake of the different datasets. However, because users are not required to register, it is difficult to determine distinct users. Recent changes in access patterns have led to spurious user counts when assuming that one IP address equals one user. This article looks at methods to determine “normal” thresholds to reduce the impact of the different access patterns on our usage statistics.
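As an illustration of what a “normal” threshold could look like, the sketch below counts requests per IP address and treats any address above the upper Tukey fence (Q3 + 1.5 × IQR) as anomalous. This is only one common outlier rule, chosen here for simplicity; it is not necessarily the method used in the article, and the function names and sample addresses are hypothetical.

```python
from collections import Counter
from statistics import quantiles


def normal_threshold(counts, k=1.5):
    """Upper Tukey fence, Q3 + k * IQR, for a list of per-IP request counts.

    Needs at least two counts, as required by statistics.quantiles.
    """
    q1, _, q3 = quantiles(counts, n=4, method="inclusive")
    return q3 + k * (q3 - q1)


def estimate_users(log_ips, k=1.5):
    """Count IPs whose request volume is at or below the threshold.

    log_ips: one IP address per logged request (assumed input format).
    Returns (number of "normal" IPs, threshold used).
    """
    per_ip = Counter(log_ips)
    limit = normal_threshold(list(per_ip.values()), k)
    normal_ips = [ip for ip, n in per_ip.items() if n <= limit]
    return len(normal_ips), limit


if __name__ == "__main__":
    # Four addresses with a handful of requests and one with 500 requests.
    sample = (
        ["10.0.0.1"] * 3
        + ["10.0.0.2"] * 2
        + ["10.0.0.3"] * 4
        + ["10.0.0.4"] * 3
        + ["10.0.0.5"] * 500
    )
    users, limit = estimate_users(sample)
    print(f"threshold={limit}, normal users={users}")
```

In the sample, the address with 500 requests falls well above the fence and is excluded from the user count, while the four low-volume addresses are kept. The multiplier `k` controls how aggressively high-volume addresses are excluded.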
We have a range of blog posts on data management and software engineering planned. If you have any suggested topics, please get in touch.
We hope you enjoy reading more of our blog posts in the future!