In 2024, the Ministry of Economy, Trade and Industry (METI) launched the Generative AI Accelerator Challenge (GENIAC)—a Japanese national program to boost generative AI by providing companies with funding, mentorship, and massive compute resources for foundation model (FM) development. AWS was selected as the cloud provider for GENIAC’s second cycle (cycle 2) and provided infrastructure and technical guidance for 12 participating organizations. On paper, the challenge seemed straightforward: give each team access to hundreds of GPUs or Trainium chips and let innovation ensue. In practice, successful FM training required far more than raw hardware.

AWS discovered that allocating over 1,000 accelerators was merely the starting point—the real challenge lay in architecting a reliable system and overcoming the obstacles of distributed training. During GENIAC cycle 2, 12 customers successfully deployed 127 Amazon EC2 P5 instances (NVIDIA H100 Tensor Core GPU servers) and 24 Amazon EC2 Trn1 instances (AWS Trainium servers) in a single day. Over the following 6 months, multiple large-scale models were trained, including notable projects such as Stockmark-2-100B-Instruct-beta, Llama 3.1 Shisa V2 405B, and Llama-3.1-Future-Code-Ja-8B.

This post shares the key insights from this engagement and valuable lessons for enterprises or national initiatives aiming to build FMs at scale.

Cross-functional engagement teams

A crucial early lesson from the GENIAC technical engagement was that running a multi-organization, national-scale machine learning (ML) initiative requires coordinated support across diverse internal teams. AWS established a virtual team that brought together account teams, specialist Solutions Architects, and service teams. The GENIAC engagement model thrives on close collaboration between customers and a multi-layered AWS team structure, as illustrated in the following figure.

Figure: Cross-functional team engagement model

Customers (Cx) typically consist of business and technical leads, including ML and platform engineers, and are responsible for executing training workloads. AWS account teams (Solutions Architects and Account Managers) manage the relationship, maintain documentation, and keep communication flowing between customers and internal specialists. The Worldwide Specialist Organization (WWSO) Frameworks team specializes in large-scale ML workloads, with a focus on core HPC and container services such as AWS ParallelCluster, Amazon Elastic Kubernetes Service (Amazon EKS), and Amazon SageMaker HyperPod. The WWSO Frameworks team is responsible for establishing the engagement structure and supervising the technical engagements in this program, leading them in partnership with other stakeholders and serving as the escalation point. They work directly with the service teams—Amazon Elastic Compute Cloud (Amazon EC2), Amazon Simple Storage Service (Amazon S3), Amazon FSx, and SageMaker HyperPod—to navigate engagements and escalations (business and technical) and to keep the engagement framework in working order. They also provide guidance on training and inference to customers and educate other teams on the technology. The WWSO Frameworks team worked closely with Lead Solutions Architects (Lead SAs), a role specifically designated to support GENIAC engagements. These Lead SAs are the cornerstone of the engagement: they act as an extension of the Frameworks specialist team, work directly with customers and account teams, and bring in their Frameworks specialist counterparts when in-depth technical discussions or troubleshooting require further expertise. With this layered structure, AWS can scale technical guidance effectively across complex FM training workloads.

Another critical success factor for GENIAC was establishing robust communication channels between customers and AWS team members. The foundation of our communication strategy was a dedicated internal Slack channel for GENIAC program coordination, connecting AWS account teams with Lead SAs. This channel enabled real-time troubleshooting, knowledge sharing, and rapid escalation of customer issues to the appropriate technical specialists and service team members. Complementing this was an external Slack channel that bridged AWS teams with customers, creating a collaborative environment where participants could ask questions, share insights, and receive immediate support. This direct line of communication significantly reduced resolution times and fostered a community of practice among participants.

AWS maintained comprehensive workload tracking documents that captured each customer’s training implementation details (model architecture, distributed training frameworks, and related software components) alongside infrastructure specifications (instance types and quantities, cluster configurations for AWS ParallelCluster or SageMaker HyperPod deployments, and storage solutions including Amazon FSx for Lustre and Amazon S3). The tracking system also maintained a chronological history of customer interactions and support cases. In addition, the engagement team held weekly review meetings to track outstanding customer inquiries and technical issues. This regular cadence made it possible for team members to share lessons learned and apply them to their own customer engagements, fostering continuous improvement and knowledge transfer across the program.

With a structured approach to communication and documentation, we could identify common challenges, such as a misconfigured NCCL library degrading multi-node performance, share solutions across teams, and continuously refine our engagement model. The detailed tracking system provided valuable insights for future GENIAC cycles, helping us anticipate customer needs and proactively address potential bottlenecks in the FM development process.

Reference architectures

Another early takeaway was the importance of solid reference architectures. Rather than let each team configure their own cluster from scratch, AWS created pre-validated templates and automation for two main approaches: AWS ParallelCluster (for a user-managed HPC cluster) and SageMaker HyperPod (for a managed, resilient cluster service). These reference architectures covered the full stack—from compute, network, and storage to container environments and monitoring—and were delivered as a GitHub repository so teams could deploy them with minimal friction.

AWS ParallelCluster proved invaluable as an open source cluster management tool for multi-node GPU training. It automates the provisioning of a Slurm-based HPC cluster on AWS, using a simple YAML configuration to stand up the environment. For the GENIAC program, AWS also offered SageMaker HyperPod as an option for some teams. SageMaker HyperPod is a managed service that provisions GPU and Trainium clusters for large-scale ML. HyperPod integrates with orchestrators such as Slurm or Kubernetes (Amazon EKS) for scheduling and provides additional managed functionality around cluster resiliency. By including reference architectures for both AWS ParallelCluster and SageMaker HyperPod, the GENIAC program gave participants flexibility—some opted for the fine-grained control of managing their own HPC cluster, whereas others preferred the convenience and resilience of a managed SageMaker HyperPod cluster.
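To give a sense of what the ParallelCluster path looks like in practice, the following is a minimal, illustrative YAML configuration for a Slurm cluster with a P5 compute queue and an FSx for Lustre mount. It is a sketch rather than the actual GENIAC template; the Region, subnet IDs, key pair, instance counts, and file system ID are placeholders.

```yaml
Region: ap-northeast-1
Image:
  Os: ubuntu2204
HeadNode:
  InstanceType: m5.8xlarge
  Networking:
    SubnetId: subnet-aaaa1111              # placeholder subnet for the head node
  Ssh:
    KeyName: my-keypair                    # placeholder EC2 key pair
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: gpu
      CapacityType: ONDEMAND
      ComputeResources:
        - Name: p5
          InstanceType: p5.48xlarge
          MinCount: 0
          MaxCount: 16                     # illustrative; teams sized queues to their allocation
          Efa:
            Enabled: true                  # Elastic Fabric Adapter for inter-node collectives
      Networking:
        SubnetIds:
          - subnet-bbbb2222                # placeholder private subnet for compute nodes
        PlacementGroup:
          Enabled: true                    # keep nodes close together for low-latency networking
SharedStorage:
  - MountDir: /fsx
    Name: fsx
    StorageType: FsxLustre
    FsxLustreSettings:
      FileSystemId: fs-0123456789abcdef0   # existing FSx for Lustre file system (placeholder)
```

A cluster defined this way is created with the pcluster create-cluster command and can be torn down and rebuilt reproducibly, which is what makes the template-based approach attractive for a multi-team program.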

The reference architecture (shown in the following diagram) seamlessly combines compute, networking, storage, and monitoring into an integrated system specifically designed for large-scale FM training.

The base infrastructure stack is available as an AWS CloudFormation template that provisions the complete environment with minimal effort. The template automatically configures a dedicated virtual private cloud (VPC) with optimized networking settings and implements a high-performance FSx for Lustre file system for training data (complemented by optional Amazon FSx for OpenZFS support for shared home directories). The architecture is completed with an S3 bucket that provides durable, long-term storage for datasets and model checkpoints, maintaining data availability well beyond individual training cycles. This hierarchical storage approach balances performance and cost-effectiveness: Amazon S3 holds training data and checkpoints durably and inexpensively, and the bucket is linked to the Lustre file system through a data repository association (DRA). The DRA enables automatic and transparent data transfer between Amazon S3 and FSx for Lustre, allowing high-performance access without manual copying. You can use a CloudFormation template to create the S3 bucket used in this architecture.
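As a hedged illustration of how these pieces fit together (this is not the official GENIAC template), a minimal CloudFormation sketch defining the bucket and its data repository association might look like the following; the file system ID and paths are placeholders.

```yaml
Resources:
  TrainingDataBucket:
    Type: AWS::S3::Bucket                              # durable store for datasets and checkpoints

  TrainingDataDRA:
    Type: AWS::FSx::DataRepositoryAssociation
    Properties:
      FileSystemId: fs-0123456789abcdef0               # FSx for Lustre file system from the base stack (placeholder)
      FileSystemPath: /data                            # path exposed under the Lustre mount point
      DataRepositoryPath: !Sub "s3://${TrainingDataBucket}/data"
      BatchImportMetaDataOnCreate: true                # make existing S3 objects visible when the DRA is created
      S3:
        AutoImportPolicy:
          Events: [NEW, CHANGED, DELETED]              # reflect S3 changes into the file system
        AutoExportPolicy:
          Events: [NEW, CHANGED, DELETED]              # export files written to Lustre (such as checkpoints) back to S3
```

With this association in place, training jobs read and write against the low-latency Lustre mount while Amazon S3 remains the system of record for data and checkpoints.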

The optional monitoring infrastructure combines Amazon Managed Service for Prometheus and Amazon Managed Grafana (or a self-managed Grafana server running on Amazon EC2) to provide comprehensive observability. It integrates DCGM Exporter for GPU metrics and EFA Exporter for network metrics, enabling real-time monitoring of system health and performance. This setup allows continuous tracking of GPU health, network performance, and training progress, with automated alerting for anomalies through Grafana dashboards. For example, the GPU Health Dashboard provides metrics for common GPU errors, including Uncorrectable Remapped Rows, Correctable Remapped Rows, XID Error Codes, Row Remap Failure, Thermal Violations, and Missing GPUs (from nvidia-smi), helping users identify hardware failures as quickly as possible.
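As an illustrative sketch (the exporter hostnames and the EFA exporter port are assumptions, not values from the GENIAC stack), a Prometheus scrape configuration feeding these dashboards can be as simple as the following:

```yaml
global:
  scrape_interval: 15s                   # how often GPU and network metrics are pulled

scrape_configs:
  - job_name: dcgm-exporter              # GPU health and performance metrics
    static_configs:
      - targets:
          - compute-node-1:9400          # DCGM Exporter's default port
          - compute-node-2:9400
  - job_name: efa-exporter               # EFA network metrics from each compute node
    static_configs:
      - targets:
          - compute-node-1:9100          # assumed node-exporter-style port for the EFA exporter
          - compute-node-2:9100
```

In the reference architecture this wiring is handled by the provided monitoring setup; the snippet is only meant to show which components feed the dashboards.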

Reproducible deployment guides and structured enablement sessions

Even the best reference architectures are only useful if teams know how to use them. A critical element of GENIAC’s success was reproducible deployment guides and structured enablement through workshops. On October 3, 2024, AWS Japan and the WWSO Frameworks team conducted a mass enablement session for GENIAC Cycle 2 participants, inviting Frameworks team members from the United States to share best practices for FM training on AWS.

The enablement session welcomed over 80 participants and provided a comprehensive mix of lectures, hands-on labs, and group discussions—earning a CSAT score of 4.75, reflecting its strong impact and relevance to attendees. The lecture sessions covered infrastructure fundamentals, exploring orchestration options such as AWS ParallelCluster, Amazon EKS, and SageMaker HyperPod, along with the software components necessary to build and train large-scale FMs using AWS. The sessions highlighted practical challenges in FM development—including massive compute requirements, scalable networking, and high-throughput storage—and mapped them to appropriate AWS services and best practices. (For more information, see the slide deck from the lecture session.) Another session focused on best practices, where attendees learned to set up performance dashboards with Prometheus and Grafana, monitor EFA traffic, and troubleshoot GPU failures using NVIDIA’s DCGM toolkit and custom Grafana dashboards based on the Frameworks team’s experience managing a cluster with 2,000 P5 instances.
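As a concrete example of the kind of automated GPU health alerting covered in that session, a Prometheus alerting rule over the DCGM exporter’s XID error metric might look like the following sketch (the rule name and thresholds are illustrative, and label names can vary by exporter version):

```yaml
groups:
  - name: gpu-health
    rules:
      - alert: GpuXidErrorDetected
        expr: DCGM_FI_DEV_XID_ERRORS > 0        # last XID error code reported by DCGM Exporter
        for: 1m                                 # require the condition to persist before firing
        labels:
          severity: critical
        annotations:
          summary: "XID error on {{ $labels.Hostname }} (GPU {{ $labels.gpu }})"
          description: "A GPU reported an XID error; the node may need to be drained, rebooted, or replaced."
```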

Additionally, the WWSO team prepared workshops for both AWS ParallelCluster (Machine Learning on AWS ParallelCluster) and SageMaker HyperPod (Amazon SageMaker HyperPod Workshop), providing detailed deployment guides for the aforementioned reference architecture. Using these materials, participants conducted hands-on exercises deploying their training clusters with Slurm and file systems including FSx for Lustre and FSx for OpenZFS, and running multi-node PyTorch distributed training. Another segment of the workshop focused on observability and performance tuning, teaching participants how to monitor resource utilization, network throughput (EFA traffic), and system health. By the end of these enablement sessions, customers and supporting AWS engineers had established a shared baseline of knowledge and a toolkit of best practices. Using the assets and knowledge gained during the workshops, customers then participated in onboarding sessions—structured, hands-on meetings with their Lead SAs. These sessions differed from the earlier workshops by focusing on customer-specific cluster deployments tailored to each team’s unique use case. During each session, Lead SAs worked directly with teams to deploy training environments, validate the setup using NCCL tests, and resolve technical issues in real time.

Customer feedback

“To fundamentally solve data entry challenges, we significantly improved processing accuracy and cost-efficiency by applying two-stage reasoning and autonomous learning with SLM and LLM for regular items, and visual learning with VLM using 100,000 synthetic data samples for detailed items. We also utilized Amazon EC2 P5 instances to enhance research and development efficiency. These ambitious initiatives were made possible thanks to the support of many people, including AWS. We are deeply grateful for their extensive support.”

– Takuma Inoue, Executive Officer, CTO at AI Inside

“Future chose AWS to develop large-scale language models specialized for Japanese and software development at GENIAC. When training large-scale models using multiple nodes, Future had concerns about environment settings such as inter-node communication, but AWS had a wide range of tools, such as AWS ParallelCluster, and we received strong support from AWS Solutions Architects, which enabled us to start large-scale training quickly.”

– Makoto Morishita, Chief Research Engineer at Future

Results and looking ahead

GENIAC has demonstrated that training FMs at scale is fundamentally an organizational challenge, not merely a hardware one. Through structured support, reproducible templates, and a cross-functional engagement team (WWSO Frameworks team, Lead SAs, and account teams), even small teams can successfully execute massive workloads in the cloud. Thanks to this structure, 12 customers launched 127 P5 instances and 24 Trn1 instances across multiple AWS Regions, including Asia Pacific (Tokyo), in a single day. Multiple large language models (LLMs) and custom models were trained successfully, including a 32B multimodal model on Trainium and a 405B tourism-focused multilingual model.

The technical engagement framework established through GENIAC Cycle 2 has provided crucial insights into large-scale FM development. Building on this experience, AWS is advancing improvements across multiple dimensions: engagement models, technical assets, and implementation guidance. We are strengthening cross-functional collaboration and systematizing knowledge sharing to establish a more efficient support structure. Reference architectures and automated training templates continue to be enhanced, and practical technical workshops and best practices are being codified based on lessons learned.

AWS has already begun preparations for the next cycle of GENIAC. As part of the onboarding process, AWS hosted a comprehensive technical event in Tokyo on April 3, 2025, to equip FM builders with hands-on experience and architectural guidance. The event, attended by over 50 participants, showcased the commitment AWS has to supporting scalable, resilient generative AI infrastructure.

The event highlighted the technical engagement model of AWS for GENIAC, alongside other support mechanisms, including the LLM Development Support Program and Generative AI Accelerator. The day featured an intensive workshop on SageMaker HyperPod and Slurm, where participants gained hands-on experience with multi-node GPU clusters, distributed PyTorch training, and observability tools. Sessions covered essential topics, including containerized ML, distributed training strategies, and AWS purpose-built silicon solutions. Classmethod Inc. shared practical SageMaker HyperPod insights, and AWS engineers demonstrated architectural patterns for large-scale GPU workloads. The event showcased AWS’s end-to-end generative AI support landscape, from infrastructure to deployment tools, setting the stage for GENIAC Cycle 3. As AWS continues to expand its support for FM development, the success of GENIAC serves as a blueprint for enabling organizations to build and scale their AI capabilities effectively.

Through these initiatives, AWS will continue to provide robust technical support, facilitating the smooth execution of large-scale FM training. We remain committed to contributing to the advancement of generative AI development all over the world through our technical expertise.

This post was contributed by AWS GENIAC Cycle 2 core members Masato Kobayashi, Kenta Ichiyanagi, and Satoshi Shirasawa, Accelerated Computing Specialist Mai Kiuchi, and Lead SAs Daisuke Miyamoto, Yoshitaka Haribara, Kei Sasaki, Soh Ohara, and Hiroshi Tokoyo, with executive sponsorship from Toshi Yasuda. Hiroshi Hata and Tatsuya Urabe also provided support as a core member and Lead SA, respectively, during their time at AWS.

The authors extend their gratitude to WWSO Frameworks members Maxime Hugues, Matthew Nightingale, Aman Shanbhag, Alex Iankoulski, Anoop Saha, Yashesh Shroff, Natarajan Chennimalai Kumar, Shubha Kumbadakone, and Sundar Ranganathan for their technical contributions. Pierre-Yves Aquilanti provided in-depth support during his time at AWS.


About the authors

Keita Watanabe is a Senior Specialist Solutions Architect on the AWS WWSO Frameworks team. His background is in machine learning research and development. Prior to joining AWS, Keita worked in the ecommerce industry as a research scientist developing image retrieval systems for product search. He leads GENIAC technical engagements.

Masaru Isaka is a Principal Business Development specialist on the AWS WWSO Frameworks team, focusing on machine learning and generative AI solutions. Having engaged with GENIAC since its inception, he leads go-to-market strategies for AWS’s generative AI offerings.


