NAIRR Pilot - Throughput Machine Learning with the Partnership to Advance Throughput Computing (PATh) (Supplement)

The Partnership to Advance Throughput Computing (PATh) is the NSF’s premier investment in throughput computing. PATh innovates in throughput computing technologies, developing the HTCondor Software Suite (HTCSS), and uses a translational computer science approach to move new ideas into the science and engineering (S&E) community. PATh will launch a new collaboration with a group of domain scientists and ML researchers to run a pathfinder project profiling the effects of throughput training and inference on distributed, heterogeneous capacity. The goals of this activity are threefold:

1. Characterize the impact of training ensembles across heterogeneous resources, as opposed to the traditional “single homogeneous cluster” approach.
2. Use the workloads developed in (1) to improve capabilities and services, lowering the barrier to entry for new ML researchers who want to use the PATh services to train ensembles with capacity powered by the NAIRR pilot.
3. Demonstrate running single workloads, managed by a single Access Point, effectively across as many of the NAIRR pilot resources as possible.

By the end of the proposed supplemental effort, PATh will be positioned to effectively support ensemble training of models with NAIRR capacity. ML researchers will be positioned to use distributed high-throughput computing and to show how the proposed high-throughput approach can impact their science. For the vision driving NAIRR, it will be a demonstration of how the federal government’s investments can be unified into an “AI data commons”.

Project leads, key team members

Miron Livny (PI, University of Wisconsin-Madison, Morgridge Institute for Research), Ian Ross (Technical Lead, University of Wisconsin-Madison), Tony Gitter (Co-PI, University of Wisconsin-Madison, Morgridge Institute for Research), Brian Bockelman (Co-PI, Morgridge Institute for Research)

Advancing NAIRR Infrastructure

We aim to understand the impact that heterogeneity in resources has on training ML models while also enhancing the HTCondor Software Suite to enable efficient migration of computing tasks into (and between) NAIRR resources.

Advancing AI research

By providing workflows and software updates that give researchers access to NAIRR resources from a single Access Point, we hope to enable them to use distributed high-throughput computing to train and apply ML models that transform their research.

Innovative partnerships

We are partnering with Anthony Gitter to use protein AI as an exemplar scientific domain. Protein structure prediction and engineering have wide potential impact, including cancer therapeutics, drought- and herbicide-resistant crops, novel biomanufacturing, and environmental benefits through plastic removal.

Broader Impacts

Traditional ML training approaches are very expensive and, as a result, often exclude all but the largest entities (even the largest academic institutions and flagship NSF cyberinfrastructure investments are small compared to the tech industry). PATh develops technologies that enable distributed high-throughput computing (dHTC), which empowers over 60 institutions to participate in single workflows on the OSPool. This supplemental budget will enable PATh to demonstrate the power and potential impact of dHTC to the AI community and shift more focus toward ensemble training of models.

More information

To learn how the Partnership to Advance Throughput Computing (PATh) project can meet your needs, contact our team directly at leadership@path-cc.io or visit our website at https://path-cc.io/.

This work is supported by supplemental funding to National Science Foundation Grant No. 2030508.