
Pelican: Advancing the Open Science Data Federation Platform

Powered by the Pelican Platform, the Open Science Data Federation (OSDF) enables AI workloads to access their training data, regardless of where they are executed. OSDF is a national cyberinfrastructure service that federates the nation’s scientific data repositories to provide effective, dependable, and scalable access to diverse datasets, from climate science to particle physics, through a uniform interface.

Brian Bockelman (Morgridge Institute for Research, PI), Miron Livny (UW-Madison, co-PI), Frank Würthwein (UC San Diego, co-PI)

The OSDF has a data delivery layer that blankets the nation’s R&E infrastructure. Because hardware is co-located within the R&E networks and at computing centers, users of the NAIRR CI always have a nearby OSDF presence that will deliver their scientific data.

Scalable data access is necessary to feed AI training and inference. OSDF adds an element of portability: AI researchers can easily stream their data to multiple compute resources, allowing workflows to run across distributed, heterogeneous compute capacity.

Pelican simplifies the job of connecting dataset repositories to the national distribution infrastructure. Institutions – or even individual researchers – wanting to make their unique datasets available to the AI research community can leverage OSDF to open their data without major infrastructure investments.

OSDF also enables institutions, large or small, to stream AI datasets to their local resources, empowering them to use local hardware alongside NAIRR allocations.

Pelican collaborates closely with NCAR to make NCAR’s Research Data Archive datasets available through the OSDF. Pelican is streamlining the use of OSDF for common climate science packages, including the Pangeo stack, and NCAR is developing corresponding science tutorials.

The OSDF enables workflows for a broad range of science domains, including gravitational wave physics, biology, mathematics, chemistry, and nuclear and particle physics. In June 2024, over 20 PB of data was delivered to science users.

For examples of the domain science impact, see: