
    Technical approach for classifying human-AI interactions at scale

    By admin · July 23, 2025



    As large language models (LLMs) become foundational to modern AI systems, the ability to run them at scale—efficiently, reliably, and in near real-time—is no longer a nice-to-have. It’s essential. The Semantic Telemetry project tackles this challenge by applying LLM-based classifiers to hundreds of millions of sampled, anonymized Bing Chat conversations each week. These classifiers extract signals like user expertise, primary topic, and satisfaction, enabling deeper insight into human-AI interactions and driving continuous system improvement.

    But building a pipeline that can handle this volume isn’t just about plugging into an API. It requires a high-throughput, high-performance architecture that can orchestrate distributed processing, manage token and prompt complexity, and gracefully handle the unpredictability of remote LLM endpoints.

    In this latest post in our series on Semantic Telemetry, we’ll walk through the engineering behind that system: how we designed for scale from the start, the trade-offs we made, and the lessons we learned along the way. From batching strategies to token optimization and orchestration, we’ll share what it takes to build a real-time LLM classification pipeline.

    For additional project background, see Semantic Telemetry: Understanding how users interact with AI systems and Engagement, user expertise, and satisfaction: Key insights from the Semantic Telemetry Project.

    System architecture highlights

    The Semantic Telemetry pipeline is a highly scalable, highly configurable data transformation pipeline. While it follows a familiar ETL structure, several architectural innovations make it uniquely suited for high-throughput LLM integration:

    • Hybrid compute engine
      The pipeline combines the distributed power of PySpark with the speed and simplicity of Polars, enabling it to scale across large datasets or run lightweight jobs in Spark-less environments—without code changes.
    • LLM-centric transformation layer
      At the core of the pipeline is a multi-stage transformation process tailored for running across multiple LLM endpoints:
      • Model-agnostic execution: a generic LLM interface, with model-specific adapters built on top of it.
      • Prompt templates defined using the Prompty language specification for consistency and reuse, with options for users to include custom prompts.
      • Parsing and cleaning logic that ensures structured, schema-aligned outputs even when LLM responses are imperfect: removing extra characters from output, resolving inexact label matches (e.g., “create” versus “created”), and relabeling invalid classifications (a minimal sketch of this step follows the list).
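    To make that last point concrete, below is a minimal sketch of the kind of output-cleaning logic involved; the label set and helper are illustrative, not the pipeline’s actual code.

```python
import re
from difflib import get_close_matches

VALID_LABELS = ["create", "analyze", "summarize", "other"]  # illustrative label set

def normalize_label(raw_output: str, valid_labels=VALID_LABELS, fallback="other") -> str:
    """Map a raw LLM response onto the classifier's label schema."""
    # Strip markdown, quotes, and stray punctuation the model may add.
    cleaned = re.sub(r"[^a-zA-Z ]", "", raw_output).strip().lower()
    if cleaned in valid_labels:
        return cleaned
    # Resolve inexact matches such as "created" -> "create".
    match = get_close_matches(cleaned, valid_labels, n=1, cutoff=0.8)
    return match[0] if match else fallback  # relabel invalid classifications
```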
    Figure 1. Architecture diagram of the LLM classification workflow

    The pipeline supports multiple classification tasks (e.g., user expertise, topic, satisfaction) through modular prompt templates and configurable execution paths—making it easy to adapt to new use cases or environments.
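    As an illustration of that configurability, a task definition might look like the following; the field names and values are hypothetical stand-ins for the pipeline’s actual configuration format.

```python
# Hypothetical task registry: each classification task binds a Prompty template
# to a label schema and an execution path (distributed Spark or lightweight Polars).
CLASSIFICATION_TASKS = {
    "user_expertise": {
        "prompt_template": "prompts/user_expertise.prompty",
        "labels": ["novice", "intermediate", "expert"],
        "engine": "spark",   # distributed run over the full weekly sample
    },
    "satisfaction": {
        "prompt_template": "prompts/satisfaction.prompty",
        "labels": ["satisfied", "neutral", "unsatisfied"],
        "engine": "polars",  # lightweight, Spark-less run
    },
}
```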

    Engineering challenges & solutions

    Building a high-throughput, LLM-powered classification pipeline at scale introduced a range of engineering challenges—from managing latency and token limits to ensuring system resilience. Below are the key hurdles we encountered and how we addressed them.

    LLM endpoint latency & variability

    Challenge: LLM endpoints, especially those hosted remotely (e.g., Azure OpenAI), introduce unpredictable latency due to model load, prompt complexity, and network variability. This made it difficult to maintain consistent throughput across the pipeline.

    Solution: We implemented a combination of:

    • Multiple Azure OpenAI endpoints in rotation to increase throughput and distribute workload. We can analyze throughput and redistribute as needed.
    • Saving output at intervals, writing data asynchronously so completed work is not lost to network errors.
    • Utilizing models with higher tokens-per-minute (TPM) limits, such as OpenAI’s GPT-4o mini. GPT-4o mini has a 2M TPM limit, a 25x throughput increase over GPT-4 (80K TPM -> 2M TPM).
    • Timeouts and retries with exponential backoff (a sketch of the rotation-and-retry logic follows this list).
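    A simplified version of the rotation-and-retry logic is sketched below, assuming the official openai Python SDK; the endpoint URLs, API version, and model/deployment name are placeholders.

```python
import itertools
import random
import time

from openai import AzureOpenAI, APITimeoutError, RateLimitError

# Placeholder endpoints; in practice these come from configuration.
CLIENTS = itertools.cycle([
    AzureOpenAI(azure_endpoint=url, api_key="...", api_version="2024-06-01")
    for url in ("https://endpoint-a.openai.azure.com", "https://endpoint-b.openai.azure.com")
])

def classify_with_retry(messages, model="gpt-4o-mini", max_retries=5, timeout=30):
    """Rotate across endpoints and back off exponentially on transient failures."""
    for attempt in range(max_retries):
        client = next(CLIENTS)  # distribute load across endpoints
        try:
            response = client.chat.completions.create(
                model=model, messages=messages, timeout=timeout
            )
            return response.choices[0].message.content
        except (APITimeoutError, RateLimitError):
            # Exponential backoff with jitter before retrying on another endpoint.
            time.sleep(2 ** attempt + random.random())
    raise RuntimeError("LLM call failed after retries")
```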

    Evolving LLM models & prompt alignment

    Challenge: Each new LLM release—such as Phi, Mistral, DeepSeek, and successive generations of GPT (e.g., GPT-3.5, GPT-4, GPT-4 Turbo, GPT-4o)—brings improvements, but also subtle behavioral shifts. These changes can affect classification consistency, output formatting, and even the interpretation of prompts. Maintaining alignment with baseline expectations across models became a moving target.

    Solution: We developed a model evaluation workflow to test prompt alignment across LLM versions:

    • Small-sample testing: We ran the pipeline on a representative sample using the new model and compared the output distribution to a known baseline.
    • Distribution analysis: If the new model’s output aligned closely, we scaled up testing. If not, we iteratively tuned the prompts and re-ran comparisons.
    • Interpretation flexibility: We also recognized that a shift in distribution isn’t always a regression. Sometimes it reflects a more accurate or nuanced classification, especially as models improve.

    To support this process, we used tools like Sammo, which allowed us to compare outputs across multiple models and prompt variants. This helped us quantify the impact of prompt changes and model upgrades and make informed decisions about when to adopt a new model or adjust our classification schema.
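    A lightweight version of that distribution check can be run before reaching for richer tooling; the sketch below simply compares label frequencies between the baseline and candidate models, with an illustrative tolerance rather than the project’s actual criterion.

```python
from collections import Counter

def distribution_shift(baseline_labels, candidate_labels):
    """Return per-label difference in proportion between two classification runs."""
    base, cand = Counter(baseline_labels), Counter(candidate_labels)
    n_base, n_cand = len(baseline_labels), len(candidate_labels)
    labels = set(base) | set(cand)
    return {label: cand[label] / n_cand - base[label] / n_base for label in labels}

def needs_prompt_tuning(baseline_labels, candidate_labels, tolerance=0.05):
    """Flag the new model if any label's share moves by more than `tolerance`."""
    shift = distribution_shift(baseline_labels, candidate_labels)
    return any(abs(delta) > tolerance for delta in shift.values())
```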

    Dynamic concurrency scaling for LLM calls

    Challenge: LLM endpoints frequently encounter rate limits and inconsistent response times under heavy usage. The models’ speeds can also vary, complicating the selection of optimal concurrency levels. Furthermore, users may choose suboptimal settings due to lack of familiarity, and default concurrency configurations are rarely ideal for every situation. Dynamically adjusting concurrency based on measured throughput helps converge on the optimal level.

    Solution: We implemented a dynamic concurrency control mechanism that proactively adjusts the number of parallel LLM calls based on real-time system behavior:

    • External task awareness: The system monitors the number of parallel tasks running across the pipeline (e.g., Spark executors or async workers) and uses this to inform the initial concurrency level.
    • Success/failure rate monitoring: The system tracks the rolling success and failure rates of LLM calls. A spike in failures triggers a temporary reduction in concurrency, while sustained success allows for gradual ramp-up.
    • Latency-based feedback loop: Instead of waiting for rate-limit errors, the system measures the response time of LLM calls. If latency increases, it reduces concurrency; if latency decreases and success rates remain high, it cautiously scales up (a sketch of this feedback loop follows the list).
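    The sketch below captures the spirit of that feedback loop; the thresholds, step sizes, and windowing are illustrative, not the pipeline’s actual tuning.

```python
class ConcurrencyController:
    """Adjust the number of parallel LLM calls from rolling success and latency stats."""

    def __init__(self, initial=8, minimum=1, maximum=64,
                 failure_threshold=0.05, latency_threshold_s=10.0):
        self.limit = initial
        self.minimum, self.maximum = minimum, maximum
        self.failure_threshold = failure_threshold
        self.latency_threshold_s = latency_threshold_s

    def update(self, success_rate: float, avg_latency_s: float) -> int:
        """Call periodically with stats from the most recent window of LLM calls."""
        failing = (1 - success_rate) > self.failure_threshold
        slow = avg_latency_s > self.latency_threshold_s
        if failing or slow:
            self.limit = max(self.minimum, self.limit // 2)   # back off quickly
        else:
            self.limit = min(self.maximum, self.limit + 1)    # ramp up cautiously
        return self.limit
```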


    Optimization experiments

    To further improve throughput and efficiency, we ran a series of optimization experiments. Each approach came with trade-offs that we carefully measured.

    Batch endpoints (Azure/OpenAI)

    Batch endpoints are a cost-effective, moderately high-throughput way of executing LLM requests. Batch endpoints process large lists of LLM prompts over a 24-hour window, recording responses in a file. They are about 50% cheaper than non-batch endpoints and have separate token limits, enabling increased throughput when used alongside regular endpoints. However, they can take up to 24 hours to complete requests and provide lower overall throughput than non-batch endpoints, making them unsuitable for situations needing quick results.
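    For reference, submitting a batch job through the OpenAI Python SDK looks roughly like the sketch below (Azure OpenAI exposes an analogous batch interface); the file path, model name, and prompt are placeholders.

```python
import json
from openai import OpenAI

client = OpenAI()

# One JSON line per request; each line carries its own prompt and a custom_id.
with open("batch_requests.jsonl", "w") as f:
    for i, conversation in enumerate(["conversation text 1", "conversation text 2"]):
        f.write(json.dumps({
            "custom_id": f"conv-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {"model": "gpt-4o-mini",
                     "messages": [{"role": "user",
                                   "content": f"Classify the topic of: {conversation}"}]},
        }) + "\n")

batch_file = client.files.create(file=open("batch_requests.jsonl", "rb"), purpose="batch")
job = client.batches.create(input_file_id=batch_file.id,
                            endpoint="/v1/chat/completions",
                            completion_window="24h")
# Poll client.batches.retrieve(job.id) and download the output file when complete.
```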

    Conversation batching in prompts during pipeline runtime

    Batching multiple conversations for classification at once can significantly increase throughput and reduce token usage, but it may impact the accuracy of results. In our experiment with a domain classifier, classifying 10 conversations simultaneously led to an average of 15-20% of domain assignments changing between repeated runs of the same prompt. To address this, one mitigation approach is to use a grader LLM prompt: first classify the batch, then have the LLM identify any incorrectly classified conversations, and finally re-classify those as needed. While batching offers efficiency gains, it is important to monitor for potential drops in classification quality.
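    A sketch of the batch-then-grade pattern follows; the prompt wording and output format are hypothetical.

```python
def build_batch_prompt(conversations: list[str]) -> str:
    """Pack several conversations into one classification request (illustrative prompt)."""
    numbered = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(conversations))
    return (
        "Classify the domain of each conversation below. "
        "Return one line per conversation as '<index>: <domain>'.\n\n" + numbered
    )

def build_grader_prompt(conversations: list[str], first_pass: dict[int, str]) -> str:
    """Ask the LLM to flag indices whose first-pass label looks wrong, for re-classification."""
    listing = "\n".join(f"[{i}] label={first_pass[i]}: {c}" for i, c in enumerate(conversations))
    return ("Review the domain labels below and list the indices that look "
            "incorrectly classified, one per line.\n\n" + listing)
```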

    Combining classifiers in a single prompt

    Combining multiple classifiers into a single prompt increases throughput by allowing one call to the LLM instead of multiple calls. This not only multiplies the overall throughput by the number of classifiers processed but also reduces the total number of tokens used, since the conversation text is only passed in once. However, this approach may compromise classification accuracy, so results should be closely monitored.
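    For example, a combined prompt might request every label in a single JSON object, as in the illustrative sketch below (the field names and label sets are assumptions, not the project’s actual schema).

```python
import json

COMBINED_PROMPT = (
    "For the conversation below, return a JSON object with three fields: "
    '"expertise" (novice|intermediate|expert), "topic" (short phrase), '
    '"satisfaction" (satisfied|neutral|unsatisfied).\n\nConversation:\n{conversation}'
)

def parse_combined(response_text: str) -> dict:
    """One LLM call yields all classifier outputs; parse and validate the JSON."""
    result = json.loads(response_text)
    assert {"expertise", "topic", "satisfaction"} <= result.keys()
    return result
```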

    Classification using text embeddings

    An alternative approach is to train custom neural network models for each classifier using only the text embeddings of conversations. This method delivers both cost and time savings by avoiding making multiple LLM requests for every classifier and conversation—instead, the system only needs to request conversation text embeddings once and can reuse these embeddings across all classifier models.

    For example, start with a set of conversations for validating and testing the new model. Run these conversations through the original prompt-based classifier to generate a set of golden classifications, then obtain text embeddings for each conversation using a model such as text-embedding-3-large. These embeddings and their corresponding classifications are used to train a model such as a multi-layer perceptron. In production, the workflow involves retrieving the text embedding for each conversation and passing it through the trained model; with one model per classifier, a single embedding retrieval per conversation suffices for all classifiers.
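    A minimal sketch of that workflow, assuming the openai embeddings API and scikit-learn’s MLPClassifier; the example conversations and golden labels are placeholders.

```python
from openai import OpenAI
from sklearn.neural_network import MLPClassifier

client = OpenAI()

def embed(texts: list[str]) -> list[list[float]]:
    """Fetch embeddings once per conversation; reusable across all classifier models."""
    response = client.embeddings.create(model="text-embedding-3-large", input=texts)
    return [item.embedding for item in response.data]

# Placeholder training data: golden labels come from the prompt-based classifier run.
conversations = ["how do I tune spark executors?", "write a poem about rain"]
golden_labels = ["technology", "arts"]

X = embed(conversations)
clf = MLPClassifier(hidden_layer_sizes=(256, 64), max_iter=500).fit(X, golden_labels)

# In production: one embedding retrieval per conversation, then clf.predict([embedding]).
```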

    The benefits of this approach include significantly increased throughput and cost savings—since it’s not necessary to call the LLM for every classifier and conversation. However, this setup can require GPU compute which can increase costs and infrastructure complexity, and the resulting models may not achieve the same accuracy as prompt-based classification methods.

    Prompt compression

    Compressing prompts, either by manually eliminating unnecessary tokens or by using a tool such as LLMLingua to automate compression, can be applied to classification prompts ahead of time or in real time. This increases overall throughput and yields cost savings from the reduced token count, but there are risks: changes to the classifier prompt or conversation text may impact classification accuracy, and depending on the technique, compression can even decrease throughput if it takes longer than simply sending the uncompressed text to the LLM.
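    As a rough illustration, LLMLingua’s PromptCompressor can be applied before classification; the constructor defaults, method arguments, and token target below are assumptions and may differ across library versions.

```python
# Assumed LLMLingua interface; check the library's documentation for the current API.
from llmlingua import PromptCompressor

compressor = PromptCompressor()  # assumption: default compression model

conversation = "...long conversation text..."
result = compressor.compress_prompt(conversation, target_token=500)
compressed_conversation = result["compressed_prompt"]
# The compressed text is then substituted into the classification prompt.
```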

    Text truncation

    Truncating conversations to a specific length limits the overall number of tokens sent through an endpoint, offering cost savings and increased throughput much like prompt compression: fewer tokens per request means more requests fit under the endpoint’s tokens-per-minute (TPM) limit and fewer tokens are billed. However, the ideal truncation length depends on both the classifiers and the conversation content, so it’s important to assess how truncation affects output quality before adopting it. The main risk is that long conversations may have their most important content cut off, which can reduce classification accuracy.
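    Truncation itself is straightforward with a tokenizer such as tiktoken; the 2,000-token budget below is illustrative and should be tuned per classifier and conversation mix.

```python
import tiktoken

def truncate_conversation(text: str, max_tokens: int = 2000,
                          model: str = "gpt-4o-mini") -> str:
    """Keep only the first `max_tokens` tokens of a conversation."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")  # fallback encoding
    tokens = encoding.encode(text)
    return encoding.decode(tokens[:max_tokens])
```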

    Conclusion

    Building a scalable, high-throughput pipeline for LLM-based classification is far from trivial. It requires navigating a constantly shifting landscape of model capabilities, prompt behaviors, and infrastructure constraints. As LLMs become faster, cheaper, and more capable, they’re unlocking new possibilities for real-time understanding of human-AI interactions at scale. The techniques we’ve shared represent a snapshot of what’s working today. But more importantly, they offer a foundation for what’s possible tomorrow.
