
    How to run Qwen 2.5 on AWS AI chips using Hugging Face libraries

    March 17, 2025


    The Qwen 2.5 multilingual large language models (LLMs) are a collection of pre-trained and instruction-tuned generative models in 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B sizes (text in/text out and code out). The fine-tuned, text-only Qwen 2.5 models are optimized for multilingual dialogue use cases and outperform both previous generations of Qwen models and many publicly available chat models on common industry benchmarks.

    At its core, Qwen 2.5 is an auto-regressive language model that uses an optimized transformer architecture. The Qwen 2.5 collection supports over 29 languages and offers enhanced role-playing and condition-setting abilities for chatbots.
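
    Condition-setting is typically done with a system message rendered through the model's chat template. The following minimal sketch, using the Transformers tokenizer (an illustration, not part of the deployment steps; the system prompt content is just an example), shows how a role-setting conversation is formatted for Qwen 2.5:

    # Illustrative only: render a role-setting conversation with the chat template.
    from transformers import AutoTokenizer

    # The tokenizer ships with Qwen 2.5's chat template.
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

    # The system message sets the role/conditions; the user turn is the prompt.
    messages = [
        {"role": "system", "content": "You are a helpful assistant who answers concisely."},
        {"role": "user", "content": "Tell me about AWS."},
    ]

    # Produce the plain-text prompt the model expects for text-in/text-out use.
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    print(prompt)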

    In this post, we outline how to get started with deploying the Qwen 2.5 family of models on an Inferentia instance with Amazon Elastic Compute Cloud (Amazon EC2) and Amazon SageMaker, using the Hugging Face Text Generation Inference (TGI) container and the Hugging Face Optimum Neuron library. The Qwen 2.5 Coder and Math variants are also supported.

    Preparation

    Hugging Face provides two tools that are frequently used with AWS Inferentia and AWS Trainium: Text Generation Inference (TGI) containers, which provide support for deploying and serving LLMs, and the Optimum Neuron library, which serves as an interface between the Transformers library and the Inferentia and Trainium accelerators.

    The first time a model is run on Inferentia or Trainium, you compile it to produce a version that performs optimally on those chips. The Hugging Face Optimum Neuron library, together with the Optimum Neuron cache, will transparently supply a compiled model when one is available. If you're using a different model with the Qwen 2.5 architecture, you might need to compile the model before deploying. For more information, see Compiling a model for Inferentia or Trainium.
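
    As an illustration, a minimal ahead-of-time export with Optimum Neuron might look like the following sketch (the batch size, sequence length, core count, cast type, and output path are assumptions that mirror the .env values used later in this post):

    # Illustrative Optimum Neuron export sketch; the shape and core-count values
    # are assumptions and must match how you plan to serve the model.
    from optimum.neuron import NeuronModelForCausalLM

    model = NeuronModelForCausalLM.from_pretrained(
        "Qwen/Qwen2.5-7B-Instruct",
        export=True,            # trigger neuronx compilation
        batch_size=4,
        sequence_length=4096,
        num_cores=2,
        auto_cast_type="bf16",
    )

    # Save the compiled artifacts for reuse (for example, mounted as /data/exportedmodel).
    model.save_pretrained("./exportedmodel")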

    You can deploy TGI as a Docker container on an Inferentia or Trainium EC2 instance or on Amazon SageMaker.

    Option 1: Deploy TGI on Amazon EC2 Inf2

    In this example, you will deploy Qwen2.5-7B-Instruct on an inf2.xlarge instance. (See this article for detailed instructions on how to deploy an instance using the Hugging Face DLAMI.)

    For this option, you SSH into the instance and create two files: a .env file, where you define your constants and specify where your model is cached, and a file named docker-compose.yaml, where you define the environment parameters needed to deploy the model for inference. You can copy the following files for this use case.

    1. Create a .env file with the following content:
    MODEL_ID='Qwen/Qwen2.5-7B-Instruct'
    # Uncomment to load a locally compiled model from the mounted /data volume instead:
    #MODEL_ID='/data/exportedmodel'
    HF_AUTO_CAST_TYPE='bf16' # indicates the auto cast type that was used to compile the model
    MAX_BATCH_SIZE=4
    MAX_INPUT_TOKENS=4000
    MAX_TOTAL_TOKENS=4096

    2. Create a file named docker-compose.yaml with the following content:
    version: '3.7'
    
    services:
      tgi-1:
        image: ghcr.io/huggingface/neuronx-tgi:latest
        ports:
          - "8081:8081"
        environment:
          - PORT=8081
          - MODEL_ID=${MODEL_ID}
          - HF_AUTO_CAST_TYPE=${HF_AUTO_CAST_TYPE}
          - HF_NUM_CORES=2
          - MAX_BATCH_SIZE=${MAX_BATCH_SIZE}
          - MAX_INPUT_TOKENS=${MAX_INPUT_TOKENS}
          - MAX_TOTAL_TOKENS=${MAX_TOTAL_TOKENS}
          - MAX_CONCURRENT_REQUESTS=512
          #- HF_TOKEN=${HF_TOKEN} #only needed for gated models
        volumes:
          - $PWD:/data #can be removed if you aren't loading locally
        devices:
          - "/dev/neuron0"
    

    3. Use docker compose to deploy the model:

    docker compose -f docker-compose.yaml --env-file .env up

    4. To confirm that the model deployed correctly, send a test prompt to the model (a Python alternative is sketched after these steps):
    curl 127.0.0.1:8081/generate \
        -X POST \
        -d '{
      "inputs":"Tell me about AWS.",
      "parameters":{
        "max_new_tokens":60
      }
    }' \
        -H 'Content-Type: application/json'
    

    5. To confirm that the model can respond in multiple languages, try sending a prompt in Chinese:
    # The prompt translates to: "Tell me how to open an AWS account."
    curl 127.0.0.1:8081/generate \
        -X POST \
        -d '{
      "inputs":"告诉我如何开设 AWS 账户。", 
      "parameters":{
        "max_new_tokens":60
      }
    }' \
        -H 'Content-Type: application/json'
    
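    As referenced in step 4, you can also query the endpoint from Python instead of curl. This minimal sketch assumes the TGI container above is listening on 127.0.0.1:8081 (as configured in docker-compose.yaml) and that the huggingface_hub package is installed:

    # Python alternative to the curl tests against the local TGI endpoint.
    from huggingface_hub import InferenceClient

    client = InferenceClient("http://127.0.0.1:8081")

    # Mirrors the test prompt from step 4.
    print(client.text_generation("Tell me about AWS.", max_new_tokens=60))
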

    Option 2: Deploy TGI on SageMaker

    You can also use Hugging Face’s Optimum Neuron library to quickly deploy models directly from SageMaker, following the instructions on the Hugging Face Model Hub.

    1. From the Qwen 2.5 model card on the Hugging Face Model Hub, choose Deploy, then SageMaker, and then AWS Inferentia & Trainium.

    (Screenshots: the Deploy menu on the model card, showing how to deploy the model on Amazon SageMaker and where to find the code needed for AWS Inferentia and Trainium.)

    2. Copy the example code into a SageMaker notebook, then choose Run.
    3. The notebook you copied will look like the following:
    import json
    import sagemaker
    import boto3
    from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri
    
    try:
        role = sagemaker.get_execution_role()
    except ValueError:
        iam = boto3.client("iam")
        role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]
    
    # Hub Model configuration. https://huggingface.co/models
    hub = {
        "HF_MODEL_ID": "Qwen/Qwen2.5-7B-Instruct",
        "HF_NUM_CORES": "2",
        "HF_AUTO_CAST_TYPE": "bf16",
        "MAX_BATCH_SIZE": "8",
        "MAX_INPUT_TOKENS": "3686",
        "MAX_TOTAL_TOKENS": "4096",
    }
    
    
    region = boto3.Session().region_name
    image_uri = f"763104351884.dkr.ecr.{region}.amazonaws.com/huggingface-pytorch-tgi-inference:2.1.2-optimum0.0.27-neuronx-py310-ubuntu22.04"
    
    # create Hugging Face Model Class
    huggingface_model = HuggingFaceModel(
        image_uri=image_uri,
        env=hub,
        role=role,
    )
    
    # deploy model to SageMaker Inference
    predictor = huggingface_model.deploy(
        initial_instance_count=1,
        instance_type="ml.inf2.xlarge",
        container_startup_health_check_timeout=1800,
        volume_size=512,
    )
    
    # send request
    predictor.predict(
        {
            "inputs": "What is is the capital of France?",
            "parameters": {
                "do_sample": True,
                "max_new_tokens": 128,
                "temperature": 0.7,
                "top_k": 50,
                "top_p": 0.95,
            }
        }
    )
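
    The predict call sends the request to the deployed endpoint and returns the generation; with the TGI container, the response typically takes the form [{"generated_text": "..."}].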

    Clean Up

    Make sure that you terminate your EC2 instances and delete your SageMaker endpoints to avoid ongoing costs.

    Terminate EC2 instances through the AWS Management Console.
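
    You can also terminate an instance programmatically; the following is a minimal boto3 sketch (the instance ID is a placeholder you must replace with your own):

    import boto3

    # Placeholder instance ID; replace with the ID of your Inf2 instance.
    ec2 = boto3.client("ec2")
    ec2.terminate_instances(InstanceIds=["i-0123456789abcdef0"])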

    Terminate a SageMaker endpoint through the console or with the following commands:

    predictor.delete_model()
    predictor.delete_endpoint(delete_endpoint_config=True)
    

    Conclusion

    AWS Trainium and AWS Inferentia deliver high performance and low cost for deploying Qwen2.5 models. We’re excited to see how you will use these powerful models and our purpose-built AI infrastructure to build differentiated AI applications. To learn more about how to get started with AWS AI chips, see the AWS Neuron documentation.


    About the Authors

    Jim Burtoft is a Senior Startup Solutions Architect at AWS and works directly with startups as well as the team at Hugging Face. Jim is a CISSP, part of the AWS AI/ML Technical Field Community, part of the Neuron Data Science community, and works with the open source community to enable the use of Inferentia and Trainium. Jim holds a bachelor’s degree in mathematics from Carnegie Mellon University and a master’s degree in economics from the University of Virginia.

    Miriam Lebowitz is a Solutions Architect focused on empowering early-stage startups at AWS. She leverages her experience with AI/ML to guide companies in selecting and implementing the right technologies for their business objectives, setting them up for scalable growth and innovation in the competitive startup world.

    Rhia Soni is a Startup Solutions Architect at AWS. Rhia specializes in working with early stage startups and helps customers adopt Inferentia and Trainium. Rhia is also part of the AWS Analytics Technical Field Community and is a subject matter expert in Generative BI. Rhia holds a bachelor’s degree in Information Science from the University of Maryland.

    Paul Aiuto is a Senior Solution Architect Manager focusing on Startups at AWS. Paul created a team of AWS Startup Solution architects that focus on the adoption of Inferentia and Trainium. Paul holds a bachelor’s degree in Computer Science from Siena College and has multiple Cyber Security certifications.


