Advanced Computing Allocations to Advance AI Research and Education
As part of the launch of the NAIRR Pilot, the National Science Foundation (NSF) and the Department of Energy (DOE) are collaborating to create an early opportunity for the research community to request access to a set of advanced computing resources for projects related to the focus of this call. This initial call will be open from January 24 to March 1, 2024. We anticipate a follow-on open call for allocations in Spring 2024.
Eligibility
This call is generally open to meritorious advanced computing proposals from US-based researchers and educators. Note that the individual resources available via this call have differing eligibility rules; assignment of supported proposals will be guided by these constraints.
Thematic focus of this allocation call
This initial call for allocations emphasizes interest in projects focused on advancing Safe, Secure and Trustworthy AI, including but not limited to projects with the following types of research aims:
- Testing, evaluating, verifying, and validating AI systems.
- Improving accuracy, validity, and reliability of model performance, while controlling bias.
- Increasing the interpretability and privacy of learned models.
- Reducing the vulnerability of models to families of adversarial attacks.
- Advancing capabilities for assuring that model functionality aligns with societal values and obeys safety guarantees.
Other projects that align with the research thrusts of the NAIRR Pilot may secondarily be considered for allocation, including in the areas of healthcare, environment and infrastructure sustainability, and AI education, as well as projects in other areas of AI research and domain applications.
Available computational resources for this specific call for allocations
Summit is an IBM system located at the Oak Ridge Leadership Computing Facility. With a theoretical peak double-precision performance of approximately 200 PF, it is one of the most capable systems in the world for a wide range of traditional computational science applications. The basic building block of Summit is the IBM Power System AC922 node. Each of the approximately 4,600 compute nodes on Summit contains two IBM POWER9 processors and six NVIDIA Tesla V100 accelerators and provides a theoretical double-precision capability of approximately 40 TF. Each POWER9 processor is connected to its accelerators via dual NVLink bricks, each capable of a 25 GB/s transfer rate in each direction. Most Summit nodes contain 512 GB of DDR4 memory for use by the POWER9 processors, 96 GB of High Bandwidth Memory (HBM2) for use by the accelerators, and 1.6 TB of non-volatile memory that can be used as a burst buffer. A small number of nodes (54) are configured as “high memory” nodes; these contain 2 TB of DDR4 memory, 192 GB of HBM2, and 6.4 TB of non-volatile memory.

The POWER9 processor is built around IBM’s SIMD Multi-Core (SMC). The processor provides 22 SMCs with separate 32 KB L1 data and instruction caches. Pairs of SMCs share a 512 KB L2 cache and a 10 MB L3 cache. SMCs support Simultaneous Multi-Threading (SMT) up to a level of 4, meaning each physical core supports up to four hardware threads.

Summit nodes are connected to a dual-rail EDR InfiniBand network providing a node injection bandwidth of 23 GB/s. Nodes are interconnected in a non-blocking fat-tree topology: a three-level tree implemented by a switch connecting the nodes within each cabinet (first level) along with director switches (second and third levels) that connect cabinets together.
Special requirements for Summit allocations: Researchers supported by active research awards from federal agencies may submit a request for an allocation on the Summit system. Per the mission requirements of DOE leadership computing, each project is expected to be at a scale that requires at least 20% of the Summit system.
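As a rough sanity check when sizing a Summit request, the 20% threshold works out to roughly 920 of the system's approximately 4,600 nodes. A minimal sketch of that check (the node count and threshold are taken from the description above and should be confirmed with OLCF):

```python
import math

# Approximate Summit node count from the system description above;
# confirm the exact count and threshold with OLCF before proposing.
SUMMIT_NODES = 4_600
SCALE_FRACTION = 0.20  # projects should use at least 20% of the system

def meets_scale_requirement(nodes_per_job: int) -> bool:
    """Check whether a job's node count meets the leadership-scale threshold."""
    return nodes_per_job >= math.ceil(SUMMIT_NODES * SCALE_FRACTION)

# A 1,000-node job clears the ~920-node threshold; a 500-node job does not.
```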
The Delta GPU resource comprises four different node configurations intended to support accelerated computation across a broad range of domains such as soft-matter physics, molecular dynamics, replica-exchange molecular dynamics, machine learning, deep learning, natural language processing, textual analysis, visualization, ray tracing, and accelerated analysis of very large in-memory datasets. Delta is designed to support the transition of applications from CPU-only to GPU or hybrid CPU-GPU models.

Delta GPU capacity is predominantly provided by 200 single-socket nodes, each configured with one AMD EPYC 7763 (“Milan”) processor with 64 cores at 2.55 GHz and 256 GB of DDR4-3200 RAM. Half of these single-socket GPU nodes (100 nodes) are configured with four NVIDIA A100 GPUs with 40 GB HBM2 RAM and NVLink (400 A100 GPUs in total); the remaining half (100 nodes) are configured with four NVIDIA A40 GPUs with 48 GB GDDR6 RAM and PCIe 4.0 (400 A40 GPUs in total). Rounding out the GPU resource are six additional “dense” GPU nodes, each containing eight GPUs in a dual-socket CPU configuration (128 cores per node) with 2 TB of DDR4-3200 RAM, but otherwise configured similarly to the single-socket GPU nodes. Of the “dense” GPU nodes, five employ NVIDIA A100 GPUs (40 A100 GPUs in “dense” configuration) and one employs AMD MI100 GPUs (eight MI100 GPUs in total) with 32 GB HBM2 RAM. A 1.6 TB NVMe solid-state disk is available on each GPU node type for use as local scratch space during job execution. All Delta GPU compute nodes are interconnected to each other and to the Delta storage resource by a 100 Gb/s HPE Slingshot network fabric.
If a project has CPU node requirements in addition to GPU nodes, this should be noted in the request and would require a separate allocation of time under NCSA Delta CPU, which provides access to CPU-only compute nodes.
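For concreteness, a Delta GPU job is typically requested through a Slurm batch script along the following lines. This is a hypothetical sketch: the partition and account names are placeholders, and the actual values (and recommended CPU/memory ratios) should be taken from the NCSA Delta user documentation for your allocation.

```shell
#!/bin/bash
# Hypothetical Slurm batch script for a 4-GPU A100 job on a Delta
# single-socket GPU node. Partition and account names are placeholders.
#SBATCH --job-name=train-a100
#SBATCH --partition=gpuA100x4      # assumed name of the A100 partition
#SBATCH --account=YOUR_ALLOCATION  # replace with your project account
#SBATCH --nodes=1
#SBATCH --gpus-per-node=4
#SBATCH --cpus-per-task=16
#SBATCH --time=04:00:00

srun python train.py
```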
Frontera is intended for CPU-intensive projects at the largest scale, requiring either very large memory/node counts (up to 8,192 nodes per run) or very large numbers of smaller runs. Containers are supported. Frontera is a large-scale Dell/Intel CPU system with 8,392 nodes (470,000 cores) of Intel "Cascade Lake" 8280 Xeon processors. Each node has 56 cores and 192 GB of RAM, and all are connected via an NVIDIA HDR-100 InfiniBand interconnect. Frontera has more than 50 PB of shared disk filesystem and 3 PB of NVMe solid-state filesystem. A large-memory queue offers 16 nodes with 6 TB of non-volatile DIMMs, configurable as memory or storage. GPU resources are available in the Frontera project through the Frontera-GPU and Lonestar resources. Frontera-RTX is a GPU subsystem of the main Frontera supercomputer; it shares access to the same filesystems, interconnect, accounts, queue system, and login nodes as Frontera. Frontera-RTX consists of 90 nodes, each with four NVIDIA RTX-5000 GPUs (360 GPUs in total). Frontera-RTX is intended for GPU workloads that do not require double precision and can make use of multiple GPUs per node.
Lonestar-6 is a Dell/AMD/NVIDIA system with both CPU and GPU nodes. The current configuration contains 85 nodes with three NVIDIA A100 GPUs apiece (255 GPUs in total) and 4 GPU nodes with two NVIDIA H100 GPUs each (8 in total). The CPU portion contains an additional 530 nodes with two AMD "Milan" EPYC 7763 processors, each with 64 cores (128 cores and 256 GB per node). All nodes are interconnected via an NVIDIA HDR InfiniBand interconnect. A 5 PB parallel filesystem is available for scratch data. Containers are supported. Virtual machine queues allow individual jobs to run on portions of a node (one GPU per VM, or 16 cores/32 GB for small CPU runs). For NAIRR users, the GPU nodes are available for allocation for AI-based workloads; small CPU requests in addition to GPU requests will be allowed.
The Argonne Leadership Computing Facility (ALCF) AI Testbed is a resource platform supporting AI and data-centric workloads. The ALCF AI Testbed resources are made available to NAIRR Pilot projects primarily for science applications involving AI models, run either in isolation or in tandem with traditional AI-driven HPC codes that run on large-scale supercomputers. These AI models may be diverse, including generative AI, graph neural networks, and vision transformers, used in training and/or inference modes. The following AI accelerators are currently deployed at the ALCF AI Testbed:
- Cerebras CS-2: The Cerebras CS-2 is a wafer-scale deep learning accelerator comprising 850,000 processing cores, each providing 48KB of dedicated SRAM memory for an on-chip total of 40GB and interconnected to optimize bandwidth and latency. The ALCF CS-2 systems are configured as a Cerebras Wafer-Scale Cluster, designed to support large-scale models (up to and well beyond 1 billion parameters) and large-scale inputs. The cluster contains two CS-2 systems and can distribute jobs across one or both CS-2 systems in a data-parallel framework. The supporting CPU cluster consists of MemoryX, SwarmX, management, and input worker nodes. The Cerebras Wafer-Scale cluster is run as an appliance: a user submits a job to the appliance, and the appliance manages the preprocessing and streaming of the data, IO, and device orchestration within the appliance. It provides programming via PyTorch, with data-parallel distribution when using more than one CS-2.
- SambaNova DataScale SN30: The SambaNova DataScale SN30 system is architected around the next-generation Reconfigurable Dataflow Unit (RDU) processor for optimal dataflow processing and acceleration. Each RDU has 1,280 Pattern Compute Units (PCUs) and 1 TB of off-chip memory. The AI Testbed's SambaNova SN30 system consists of eight nodes in four full racks, each node featuring eight RDUs (64 RDUs in total) interconnected to enable model and data parallelism. SambaFlow, SambaNova's software stack, extracts, optimizes, and maps dataflow graphs to the RDUs from standard machine learning frameworks such as PyTorch.
- Graphcore Bow Pod 64: The Graphcore Bow Pod 64 system is a one-rack system consisting of 64 Bow-class Intelligence Processing Units (IPUs) with a custom interconnect. It has a total of 57.6 GB of In-Processor-Memory across 94,208 IPU cores. The system includes four servers for data processing. The Graphcore software stack supports TensorFlow and PyTorch via the Poplar SDK, the toolchain designed to create graph software for ML applications. The Poplar SDK integrates with traditional ML frameworks such as PyTorch and TensorFlow, allowing users to port existing code to IPU hardware. It includes PopTorch, a wrapper over the PyTorch framework optimized for IPU hardware, and the PopLibs libraries, and it enables users to construct graphs, define tensor data, and control how code and data are mapped onto the IPU for execution.
- Groq: ALCF's Groq system consists of a single GroqRack compute cluster that provides an extensible accelerator network of nine GroqNode servers with a rotational multi-node network topology. Each GroqNode consists of eight GroqCard accelerators with integrated chip-to-chip connections in a dragonfly multi-chip topology. Each GroqCard accelerator is a dual-width, full-height, three-quarter-length PCI Express Gen4 x16 adapter with a single GroqChip processor with 230 MB of on-chip memory. Based on the proprietary Tensor Streaming Processor (TSP) architecture, the GroqChip processor is a low-latency, high-throughput single-core SIMD compute engine with advanced vector and matrix mathematical acceleration units. The GroqChip processor is deterministic, providing predictable and repeatable performance. The GroqWare suite SDK uses an API-based programming model, enabling users to develop, compile, and run models on the GroqCard accelerator in a host server system. The SDK uses an ONNX/MLIR-enabled DAG compiler consisting of the Groq Compiler, the Groq API, and utility tools such as the GroqView™ profiler and the Groq runtime.
Special requirements for ALCF AI Testbed allocations: Researchers supported by active research awards from federal agencies may submit a request for an allocation on the AI Testbed system. To ensure appropriate use of this resource and equitable distribution of projects:
- Projects are expected to focus on AI and/or learning technologies, with the goal of understanding how AI-specific accelerators can be applied.
- Please be aware that these are modestly sized resources focused on evaluation.
Neocortex is a highly innovative advanced computing system ideal for foundation and large language models. Neocortex, which captures promising specialized innovative hardware technologies, is designed to vastly accelerate large deep learning (DL) models and high-performance computing (HPC) research in pursuit of science, discovery, and societal good. Neocortex features two Cerebras CS-2 systems, provisioned by an HPE Superdome Flex HPC server and the Bridges-2 filesystems. Each CS-2 system features a Cerebras WSE-2 (Wafer Scale Engine 2), the largest chip ever built, with 850,000 Sparse Linear Algebra Compute cores, 40 GB of SRAM on-chip memory, 20 PB/s aggregate memory bandwidth, and 220 Pb/s interconnect bandwidth. The HPE Superdome Flex (SDF) features 32 Intel Xeon Platinum 8280L CPUs with 28 cores (56 threads) each at 2.70-4.0 GHz with 38.5 MB cache, 24 TiB of RAM, an aggregate memory bandwidth of 4.5 TB/s, and 204.6 TB of aggregate local storage capacity with 150 GB/s read bandwidth. The SDF can provide 1.2 Tb/s to each CS-2 system and 1.6 Tb/s from the Bridges-2 filesystems.

Jobs are submitted via SLURM. The CS-2 systems can run customized TensorFlow and PyTorch containers, as well as programs written using the Cerebras SDK or the WSE Field Equation API. Currently recommended DL projects focus on foundation and large language models such as BERT, GPT-J, and Transformer, or combine supported TensorFlow or PyTorch layers. DL codes can also be developed “from scratch” using the Cerebras Software Development Kit (SDK). The SDK can be used to develop HPC codes, such as structured-grid-based PDE and ODE solvers and particle methods with regular communication. Interested researchers are encouraged to contact PSC at neocortex@psc.edu with comments and questions.
Researchers who are not certain which resource or resources would be appropriate for their project, or who have no preference, may choose to have their project assigned to a suitable resource based on the information provided in the project proposal. The better a proposal describes its resource needs, the better reviewers will be able to make an appropriate resource assignment.
Proposal evaluation
All proposals will be evaluated for scientific and technical appropriateness and feasibility. Proposals will be evaluated on the following criteria:
- Alignment with the thematic focus defined for this call;
- Project readiness and potential for near-term progress;
- Feasibility of the technical approach;
- Need for advanced research computing resources available via this call;
- Advanced research computing and data management knowledge and experience of the proposing team; and
- Estimated computing and data resource requirements.
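For the last criterion, a simple back-of-the-envelope calculation of GPU-hours and node-hours is usually sufficient. A minimal sketch (the function and its 4-GPU-per-node default are illustrative, chosen to match Delta's single-socket GPU nodes, and are not part of the call):

```python
import math

def estimate_request(gpus_per_run: int, hours_per_run: float,
                     num_runs: int, gpus_per_node: int = 4) -> dict:
    """Back-of-the-envelope GPU-hour and node-hour estimate for a proposal.

    All inputs are the proposer's own workload parameters; gpus_per_node
    defaults to 4, as on Delta's single-socket A100/A40 nodes.
    """
    nodes_per_run = math.ceil(gpus_per_run / gpus_per_node)
    return {
        "gpu_hours": gpus_per_run * hours_per_run * num_runs,
        "node_hours": nodes_per_run * hours_per_run * num_runs,
    }

# e.g. 50 training runs, each using 8 GPUs for 12 hours:
# -> 4,800 GPU-hours, i.e. 1,200 node-hours on 4-GPU nodes
```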
Proposals will be reviewed on a biweekly basis as they are received, until the resources available for this call are exhausted. Responses to proposers are also anticipated on a biweekly basis.
Allocation proposal preparation and submission
- Proposal allocations will be up to 6 months in duration.
- Applications will be accepted from January 24, 2024, through 8:00 pm EDT on March 1, 2024.
- Allocation proposals must be submitted electronically and follow the submission instructions and expectations.
- An ORCID iD is required to submit an allocation proposal. Create an ORCID iD if you don't have one.
Questions about this call and allocations onto specific resources can be directed to: help@allocations.nairrpilot.org