
    Learning from other domains to advance AI evaluation and testing

June 23, 2025


    Illustrated headshots of the Guests from the limited podcast series, AI Testing and Evaluation: Learnings from Science and Industry

    As generative AI becomes more capable and widely deployed, familiar questions from the governance of other transformative technologies have resurfaced. Which opportunities, capabilities, risks, and impacts should be evaluated? Who should conduct evaluations, and at what stages of the technology lifecycle? What tests or measurements should be used? And how can we know if the results are reliable?  

Recent research and reports from Microsoft, the UK AI Security Institute, The New York Times, and MIT Technology Review have highlighted gaps in how we evaluate AI models and systems. These gaps also form foundational context for recent international expert consensus reports: the inaugural International AI Safety Report (2025) and the Singapore Consensus (2025). Closing these gaps at a pace that matches AI innovation will lead to more reliable evaluations that can help guide deployment decisions, inform policy, and deepen trust.

    Today, we’re launching a limited-series podcast, AI Testing and Evaluation: Learnings from Science and Industry, to share insights from domains that have grappled with testing and measurement questions. Across four episodes, host Kathleen Sullivan speaks with academic experts in genome editing, cybersecurity, pharmaceuticals, and medical devices to find out which technical and regulatory steps have helped to close evaluation gaps and earn public trust.

    We’re also sharing written case studies from experts, along with top-level lessons we’re applying to AI. At the close of the podcast series, we’ll offer Microsoft’s deeper reflections on next steps toward more reliable and trustworthy approaches to AI evaluation. 

    Lessons from eight case studies 

Our research on risk evaluation, testing, and assurance models in other domains began in December 2024, when Microsoft’s Office of Responsible AI gathered independent experts from the fields of civil aviation, cybersecurity, financial services, genome editing, medical devices, nanoscience, nuclear energy, and pharmaceuticals. In bringing this group together, we drew on our own learnings and feedback received on our e-book, Global Governance: Goals and Lessons for AI, in which we studied the higher-level goals and institutional approaches that had been leveraged for cross-border governance in the past.

    While approaches to risk evaluation and testing vary significantly across the case studies, there was one consistent, top-level takeaway: evaluation frameworks always reflect trade-offs among different policy objectives, such as safety, efficiency, and innovation.  

Experts across all eight fields noted that policymakers have had to weigh trade-offs in designing evaluation frameworks. These frameworks must account for both the limits of current science and the need for agility in the face of uncertainty. They likewise agreed that early design choices matter: often reflecting the “DNA” of the historical moment in which they are made, as cybersecurity expert Stewart Baker described it, they are difficult to scale back or undo later.

    Strict, pre-deployment testing regimes—such as those used in civil aviation, medical devices, nuclear energy, and pharmaceuticals—offer strong safety assurances but can be resource-intensive and slow to adapt. These regimes often emerged in response to well-documented failures and are backed by decades of regulatory infrastructure and detailed technical standards.  

    In contrast, fields marked by dynamic and complex interdependencies between the tested system and its external environment—such as cybersecurity and bank stress testing—rely on more adaptive governance frameworks, where testing may be used to generate actionable insights about risk rather than primarily serve as a trigger for regulatory enforcement.  

Moreover, in pharmaceuticals, where such interdependencies are also at play yet the emphasis falls on pre-deployment testing, experts highlighted a potential trade-off between that pre-deployment emphasis and post-market monitoring of downstream risks and efficacy.

    These variations in approaches across domains—stemming from differences in risk profiles, types of technologies, maturity of the evaluation science, placement of expertise in the assessor ecosystem, and context in which technologies are deployed, among other factors—also inform takeaways for AI.

    Applying risk evaluation and governance lessons to AI 

    While no analogy perfectly fits the AI context, the genome editing and nanoscience cases offer interesting insights for general-purpose technologies like AI, where risks vary widely depending on how the technology is applied.  

    Experts highlighted the benefits of governance frameworks that are more flexible and tailored to specific use cases and application contexts. In these fields, it is challenging to define risk thresholds and design evaluation frameworks in the abstract. Risks become more visible and assessable once the technology is applied to a particular use case and context-specific variables are known.  

These and other insights also helped us distill qualities essential to ensuring that testing is a reliable governance tool across domains, including the following (see the sketch after this list):

    1. Rigor in defining what is being examined and why it matters. This requires detailed specification of what is being measured and understanding how the deployment context may affect outcomes.
    2. Standardization of how tests should be conducted to achieve valid, reliable results. This requires establishing technical standards that provide methodological guidance and ensure quality and consistency. 
    3. Interpretability of test results and how they inform risk decisions. This requires establishing expectations for evidence and improving literacy in how to understand, contextualize, and use test results—while remaining aware of their limitations. 
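To make these qualities more concrete, here is a minimal, hypothetical sketch of how an evaluation specification might encode them as explicit fields. The class and field names are illustrative assumptions for this post, not an existing Microsoft framework or standard.

```python
from dataclasses import dataclass, field


@dataclass
class EvalSpec:
    """Illustrative (hypothetical) specification for a single AI evaluation."""

    # Rigor: what is measured, and why the deployment context matters.
    capability_under_test: str        # e.g., "factual accuracy of summaries"
    deployment_context: str           # e.g., "summarizing clinical notes"
    rationale: str                    # why this risk or opportunity matters here

    # Standardization: how the test must be run to yield valid, comparable results.
    protocol_reference: str           # pointer to a methodological standard
    dataset_version: str              # pinned data so results are reproducible
    metric: str                       # e.g., "claim-level precision and recall"

    # Interpretability: how results inform risk decisions, with known limits.
    decision_threshold: float         # evidence bar agreed before testing
    known_limitations: list[str] = field(default_factory=list)

    def interpret(self, score: float) -> str:
        """Map a raw score to a decision signal while surfacing caveats."""
        verdict = "meets threshold" if score >= self.decision_threshold else "below threshold"
        caveats = "; ".join(self.known_limitations) or "none recorded"
        return (f"{self.capability_under_test} in {self.deployment_context}: "
                f"{verdict} (score={score:.2f}); limitations: {caveats}")
```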

    Toward stronger foundations for AI testing 

    Establishing robust foundations for AI evaluation and testing requires effort to improve rigor, standardization, and interpretability—and to ensure that methods keep pace with rapid technological progress and evolving scientific understanding.  

    Taking lessons from other general-purpose technologies, this foundational work must also be pursued for both AI models and systems. While testing models will continue to be important, reliable evaluation tools that provide assurance for system performance will enable broad adoption of AI, including in high-risk scenarios. A strong feedback loop on evaluations of AI models and systems could not only accelerate progress on methodological challenges but also bring focus to which opportunities, capabilities, risks, and impacts are most appropriate and efficient to evaluate at what points along the AI development and deployment lifecycle.
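As a rough illustration of the model-versus-system distinction above (a sketch under assumed interfaces, not a description of any particular product or evaluation suite): a model-level evaluation scores the model in isolation, while a system-level evaluation exercises the model together with the retrieval, guardrails, and post-processing it ships with.

```python
from typing import Callable

# Assumed, illustrative interfaces: a "model" and a "system" are both treated
# here as functions from an input string to an output string.
Model = Callable[[str], str]
System = Callable[[str], str]


def model_level_eval(model: Model, prompts: list[str],
                     grade: Callable[[str], bool]) -> float:
    """Score the model in isolation on a fixed benchmark of prompts."""
    return sum(grade(model(p)) for p in prompts) / len(prompts)


def system_level_eval(system: System, scenarios: list[str],
                      grade: Callable[[str], bool]) -> float:
    """Score the end-to-end system (retrieval, guardrails, model, post-processing)
    on deployment-like scenarios, so results reflect shipped behavior."""
    return sum(grade(system(s)) for s in scenarios) / len(scenarios)


# A feedback loop can compare the two: a large gap suggests that risk (or
# mitigation) stems from components around the model rather than the model itself.
```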

    Acknowledgements 

    We would like to thank the following external experts who have contributed to our research program on lessons for AI testing and evaluation: Mateo Aboy, Paul Alp, Gerónimo Poletto Antonacci, Stewart Baker, Daniel Benamouzig, Pablo Cantero, Daniel Carpenter, Alta Charo, Jennifer Dionne, Andy Greenfield, Kathryn Judge, Ciaran Martin, and Timo Minssen.  

    Case studies 

    Civil aviation: Testing in Aircraft Design and Manufacturing, by Paul Alp 

    Cybersecurity: Cybersecurity Standards and Testing—Lessons for AI Safety and Security, by Stewart Baker 

    Financial services (bank stress testing): The Evolving Use of Bank Stress Tests, by Kathryn Judge 

    Genome editing: Governance of Genome Editing in Human Therapeutics and Agricultural Applications, by Alta Charo and Andy Greenfield 

    Medical devices: Medical Device Testing: Regulatory Requirements, Evolution and Lessons for AI Governance, by Mateo Aboy and Timo Minssen 

    Nanoscience: The regulatory landscape of nanoscience and nanotechnology, and applications to future AI regulation, by Jennifer Dionne 

    Nuclear energy: Testing in the Nuclear Industry, by Pablo Cantero and Gerónimo Poletto Antonacci 

    Pharmaceuticals: The History and Evolution of Testing in Pharmaceutical Regulation, by Daniel Benamouzig and Daniel Carpenter
