The Climate-G Testbed

Conference: UK e-Science All Hands Meeting 2009
Year: 2009
Contribution type: Oral
PDF file: AHM-Oxford-Climate-G.pdf
Authors:
Fiore, S., Aloisio, G., Blower, J., , Denvil, S., Fox, P., Petitdidier, M., Schwichtenberg, H.

Within the EGEE Earth Science Cluster community, several research efforts address data management issues at different levels and in different geoscience domains. One of them is Climate-G, a research effort devoted to the climate change community. It is a distributed testbed for climate change that addresses challenging data and metadata management issues at a very large scale. The testbed is an interdisciplinary effort that brings together expertise in climate change and computational science. The main goal of Climate-G is to allow scientists to carry out geographical and cross-institutional discovery, access, visualization and sharing of climate data.
The partners involved are: Centro Euro-Mediterraneo per i Cambiamenti Climatici (CMCC, Italy), Institut Pierre-Simon Laplace (IPSL/CNRS, France), Fraunhofer Institut für Algorithmen und Wissenschaftliches Rechnen (SCAI, Germany), National Center for Atmospheric Research (NCAR, USA), Rensselaer Polytechnic Institute (RPI, USA), University of Reading (Reading, UK), University of Cantabria (UC, Spain), and University of Salento (UniSalento, Italy).
In the first phase we addressed two main challenges derived from our user requirements: (i) distributed data/metadata management (hundreds of Petabytes of climate datasets) and (ii) a scientific gateway (the Climate-G Portal). Both the architectural design and the infrastructural implementation must meet these two challenges.

Data and Metadata Distribution
A key point for Climate-G is the full distribution of data and metadata.
Data distribution stems from the need to share data among centres without moving it into a central repository. Each partner can contribute new datasets to the testbed simply by adding a new data service to the infrastructure and mapping it to a specific metadata server. The distributed approach addresses scalability, performance and autonomy.
Metadata distribution, in turn, stems from the need to address local autonomy (in terms of access policy and rights granted to users), scalability (the whole set of metadata is split across several metadata services), fault tolerance (a central metadata service can become a single point of failure) and performance (a central point storing data and metadata can become a bottleneck for the whole system). Metadata management plays a critical role in such a distributed environment, since it (i) enables search and discovery, (ii) allows datasets to be described and catalogued, and (iii) makes data effectively accessible and shareable by the scientific community.
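The distribution argument above can be illustrated with a minimal Python sketch. It is not the Climate-G or GRelC implementation: the `MetadataService` class, the `federated_search` function, the site names and the example URLs are all hypothetical. It only shows why splitting metadata across several services avoids a single point of failure: a query still returns results when one service is unreachable.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataService:
    """Hypothetical stand-in for one partner's metadata service."""
    name: str
    # Maps a dataset identifier to the data service that serves it.
    records: dict = field(default_factory=dict)
    online: bool = True

    def search(self, keyword):
        if not self.online:
            raise ConnectionError(f"{self.name} is unreachable")
        return {ds: url for ds, url in self.records.items() if keyword in ds}


def federated_search(services, keyword):
    """Query every metadata service and skip the ones that fail."""
    hits, failed = {}, []
    for svc in services:
        try:
            hits.update(svc.search(keyword))
        except ConnectionError:
            failed.append(svc.name)
    return hits, failed


# Illustrative data only: three sites, one of them offline.
services = [
    MetadataService("site-a", {"tas_monthly_run1": "http://site-a.example/dap/tas_monthly_run1"}),
    MetadataService("site-b", {"pr_daily_run1": "http://site-b.example/dap/pr_daily_run1"}, online=False),
    MetadataService("site-c", {"tas_daily_run2": "http://site-c.example/dap/tas_daily_run2"}),
]
hits, failed = federated_search(services, "tas")
```

In this sketch a partner contributes a dataset by adding one entry to its own service's `records`, which mirrors the idea of mapping a new data service to a specific metadata server without touching the other sites.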

Climate-G Infrastructure
Climate-G exploits both grid technologies connected with the EGEE project (gLite middleware and EGEE RESPECT tools) and domain-based tools and services (e.g. OPeNDAP and OGC-compliant implementations). The former (general-purpose services) provide a solid basis at the infrastructural level, ensuring great flexibility, scalability and manageability (as proven in several testbeds, projects and production environments such as EGEE); the latter support domain-specific activities (e.g. data subsetting and map retrieval) and are well known, well tested and widely adopted in the climate change community. The coexistence of grid and domain-related services is important for satisfying user requirements on a robust and mature infrastructure.
Data virtualization is key to building a transparent environment for the climate community, and it is strongly connected with metadata and data management.
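As an illustration of the data subsetting that OPeNDAP supports, the following Python sketch builds a DAP2-style request URL in which a constraint expression selects an index range along each dimension of one variable. The server URL, the variable name `tas` and the helper `opendap_subset_url` are assumptions made for this example and do not refer to an actual Climate-G endpoint.

```python
def opendap_subset_url(dataset_url, variable, slices, suffix="ascii"):
    """Build a DAP2 request URL for a hyperslab of one variable.

    slices holds one inclusive (start, stop) index pair per dimension.
    suffix selects the response type, e.g. "ascii" or "dods".
    """
    hyperslab = "".join(f"[{start}:{stop}]" for start, stop in slices)
    return f"{dataset_url}.{suffix}?{variable}{hyperslab}"


# Hypothetical dataset: 12 time steps, a latitude band and a longitude band.
url = opendap_subset_url(
    "http://data.example.org/dap/tas_monthly.nc",
    "tas",
    [(0, 11), (40, 60), (100, 140)],
)
```

Only the requested slab is transferred, which is what makes this kind of service attractive when the full datasets are far too large to move.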
For metadata management, we adopted GRelC, a distributed metadata solution from CMCC that leverages P2P and grid technologies. This service (included in the EGEE RESPECT Program) enables geographical data sharing, search and discovery, and it can manage metadata held in heterogeneous grid data sources. It exposes a GSI- and VOMS-enabled interface providing full security support (authentication, authorization, data integrity, data confidentiality and delegation). In the Climate-G testbed, metadata are stored in both relational databases (Index-DB) and XML databases (Metadata-DB), all available through the same grid-enabled GRelC interface. While the Index-DBs contain only key information, the Metadata-DBs contain full metadata descriptions stored as XML files. Metadata harvesting is performed by the GRelC Harvester component, built on top of the GRelC SDK. Additional details about the GRelC Harvester are beyond the scope of this work. At present, the metadata schema managed by the GRelC metadata services is compliant with the ISO 19115/19139 standards.
The GRelC metadata services (see Fig. 1) are connected peer-to-peer in a self-monitored and self-controlled network. Monitoring facilities are also available in the Climate-G Portal (see Fig. 2), giving administrators full control of the underlying metadata system with real-time monitoring capabilities, reports and statistics about the resources involved.
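The split between a compact relational index and full XML descriptions can be sketched as a two-step lookup. The Python example below is only an illustration under stated assumptions: it does not use the GRelC interface, the `dataset_index` table layout and the sample record are invented, and an in-memory SQLite database and a dictionary stand in for the Index-DB and the Metadata-DB. The XML fragment follows the ISO 19139 element names and namespaces but is heavily simplified.

```python
import sqlite3
import xml.etree.ElementTree as ET

NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}

# Metadata-DB stand-in: full XML descriptions keyed by identifier.
metadata_db = {
    "cmcc-tas-001": """
<gmd:MD_Metadata xmlns:gmd="http://www.isotc211.org/2005/gmd"
                 xmlns:gco="http://www.isotc211.org/2005/gco">
  <gmd:fileIdentifier>
    <gco:CharacterString>cmcc-tas-001</gco:CharacterString>
  </gmd:fileIdentifier>
  <gmd:identificationInfo>
    <gmd:MD_DataIdentification>
      <gmd:abstract>
        <gco:CharacterString>Monthly near-surface air temperature (illustrative record).</gco:CharacterString>
      </gmd:abstract>
    </gmd:MD_DataIdentification>
  </gmd:identificationInfo>
</gmd:MD_Metadata>
""",
}

# Index-DB stand-in: only the key fields needed to narrow a search.
index_db = sqlite3.connect(":memory:")
index_db.execute(
    "CREATE TABLE dataset_index ("
    "id TEXT PRIMARY KEY, variable TEXT, start_year INTEGER, end_year INTEGER)"
)
index_db.execute(
    "INSERT INTO dataset_index VALUES (?, ?, ?, ?)",
    ("cmcc-tas-001", "tas", 1950, 2000),
)

ABSTRACT_PATH = (
    "gmd:identificationInfo/gmd:MD_DataIdentification/"
    "gmd:abstract/gco:CharacterString"
)


def discover(variable, year):
    """Filter on the index first, then read the full XML for each match."""
    rows = index_db.execute(
        "SELECT id FROM dataset_index "
        "WHERE variable = ? AND start_year <= ? AND end_year >= ?",
        (variable, year, year),
    ).fetchall()
    results = []
    for (dataset_id,) in rows:
        root = ET.fromstring(metadata_db[dataset_id].strip())
        results.append((dataset_id, root.findtext(ABSTRACT_PATH, namespaces=NS)))
    return results
```

The design point is that the cheap relational query prunes the candidate set, so the more expensive XML documents are parsed only for datasets that already match the key fields.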
