Engagement, user expertise, and satisfaction: Key insights from the Semantic Telemetry Project


The Semantic Telemetry Project aims to better understand complex, turn-based human-AI interactions in Microsoft Copilot using a new data science approach.

This understanding is crucial for recognizing how individuals use AI systems to address real-world tasks. It provides actionable insights, enhances key use cases, and identifies opportunities for system improvement.

In a recent blog post, we shared our approach for classifying chat log data using large language models (LLMs), which allows us to analyze these interactions at scale and in near real time. We also introduced two of our LLM-generated classifiers: Topics and Task Complexity.

This blog post examines how our suite of LLM-generated classifiers can serve as early indicators of user engagement and highlights how usage and satisfaction vary based on AI and user expertise.

The key findings from our research are:

When users engage in more professional, technical, and complex tasks, they are more likely to continue utilizing the tool and increase their level of interaction with it.
Novice users currently engage in simpler tasks, but their work is gradually becoming more complex over time.
More expert users are satisfied with AI responses only where AI expertise is on par with their own expertise on the topic, while novice users have low satisfaction rates regardless of AI expertise.

Read on for more information on these findings. Note that all analyses were conducted on anonymous Copilot in Bing interactions containing no personal information.

Classifiers mentioned in article:

Knowledge work classifier: Tasks that involve creating artifacts related to information work typically requiring creative and analytical thinking. Examples include strategic business planning, software design, and scientific research.

Task complexity classifier: Assesses the cognitive complexity of a task as if the user performed it without the use of AI. We group tasks into two categories: low complexity and high complexity.

Topics classifier: A single label for the primary topic of the conversation.

User expertise: Labels the user’s expertise on the primary topic within the conversation as one of the following categories: Novice (no familiarity with the topic), Beginner (little prior knowledge or experience), Intermediate (some basic knowledge or familiarity with the topic), Proficient (can apply relevant concepts from conversation), and Expert (deep and comprehensive understanding of the topic).

AI expertise: Labels the AI agent expertise based on the same criteria as user expertise above.

User satisfaction: A 20-question satisfaction/dissatisfaction rubric that the LLM evaluates to create an aggregate score for overall user satisfaction.
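As an illustration of how rubric answers could roll up into one number, here is a minimal sketch. The post does not describe the actual aggregation, so the scheme below (add one for each observed satisfaction signal, subtract one for each observed dissatisfaction signal) is an assumption, and the names `RubricAnswer` and `aggregate_score` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricAnswer:
    """One LLM-evaluated rubric item (hypothetical structure)."""
    question_id: str
    polarity: int    # +1 for a satisfaction signal, -1 for a dissatisfaction signal
    observed: bool   # the LLM's yes/no judgment for this conversation


def aggregate_score(answers: list[RubricAnswer]) -> int:
    """Sum the polarity of every observed signal (assumed scheme, not the authors')."""
    return sum(answer.polarity for answer in answers if answer.observed)


example = [
    RubricAnswer("thanked_assistant", +1, True),
    RubricAnswer("asked_follow_up", +1, True),
    RubricAnswer("repeated_request", -1, True),
    RubricAnswer("expressed_frustration", -1, False),
]
score = aggregate_score(example)
```

A weighted sum or a normalized ratio would be an equally plausible choice; the point is only that the per-question judgments reduce to a single comparable score.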

What keeps Bing Chat users engaged?

We conducted a study of a random sample of 45,000 anonymous Bing Chat users during May 2024. The data was grouped into three cohorts based on user activity over the course of the month:

Light (1 active chat session per week)
Medium (2-3 active chat sessions per week)
Heavy (4+ active chat sessions per week)
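The cohort rule above can be sketched in a few lines. This is an illustrative sketch rather than the pipeline used in the study: the function names are hypothetical, and because the post does not say how fractional weekly averages are handled, the cut-offs below (under 2 is light, under 4 is medium) are an assumption.

```python
def assign_cohort(sessions_per_week: float) -> str:
    """Map an average weekly session count to an engagement cohort."""
    if sessions_per_week <= 0:
        raise ValueError("user has no active chat sessions")
    if sessions_per_week < 2:   # assumed boundary between light and medium
        return "light"
    if sessions_per_week < 4:   # assumed boundary between medium and heavy
        return "medium"
    return "heavy"


def cohorts_for_month(session_counts: dict[str, int], weeks: float = 4.0) -> dict[str, str]:
    """Assign each user a cohort from a monthly session total.

    `weeks` is the assumed length of the observation window in weeks.
    """
    return {user: assign_cohort(count / weeks) for user, count in session_counts.items()}


cohorts = cohorts_for_month({"user_a": 4, "user_b": 10, "user_c": 20})
```

Here `user_a` averages 1 session per week, `user_b` averages 2.5, and `user_c` averages 5, so they fall into the light, medium, and heavy cohorts respectively.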

The key finding is that heavy users are doing more professional, complex work.

We used our knowledge work classifier to label chat log data as relating to knowledge work tasks. We found that knowledge work tasks made up the largest share of chats in every cohort, with the highest percentage among heavy users.

Bar chart illustrating knowledge work distribution across three engagement cohorts: light, medium, and heavy. The chart shows that all three cohorts engage in more knowledge work compared to the 'Not knowledge work' and 'Both' categories, with heavy users performing the most knowledge work. — *Figure 1: Knowledge work based on engagement cohort*

Analyzing task complexity, we observed that users with higher engagement performed the greatest number of high complexity tasks, while users with lower engagement performed more low complexity tasks.

Bar chart illustrating task complexity distribution across three engagement cohorts: light, medium, and heavy. The chart shows all three cohorts perform more high complexity tasks than low complexity tasks, with heavy users performing the greatest number of high complexity tasks. — *Figure 2: High complexity and low complexity tasks by engagement cohort*

Looking at the overall data, we can filter on heavy users and see higher numbers of chats where the user was performing knowledge work tasks. Based on task complexity, we see that most knowledge work tasks seek to apply a solution to an existing problem, primarily within programming and scripting. This is in line with our top overall topic, technology, which we discussed in the previous post.

Tree diagram illustrating how heavy users are engaging with Bing Chat. The visual selects the most common use case for heavy users: knowledge work, “apply” complexity and related topics. — *Figure 3: Heavy users tree diagram*

In contrast, light users tended to do more low complexity tasks (“Remember”), using Bing Chat like a traditional search engine and engaging more in topics like business and finance and computers and electronics.

Tree diagram illustrating how light users are engaging with Bing Chat. The visual selects the most common use case for light users: knowledge work, “remember” complexity and related topics. — *Figure 4: Light users tree diagram*

Novice queries are becoming more complex

We looked at Bing Chat data from January through August 2024 and we classified chats using our User Expertise classifier. When we looked at how the different user expertise groups were using the tool for professional tasks, we discovered that proficient and expert users tend to do more professional tasks with high complexity in topics like programming and scripting, professional writing and editing, and physics and chemistry.

Bar chart illustrating top topics for proficient and expert users with programming and scripting (18.3%), professional writing and editing (10.4%), and physics and chemistry (9.8%) as top three topics. — *Figure 5: Top topics for proficient/expert users*

Bar chart showing task complexity for proficient and expert users. The chart shows a greater number of high complexity chats than low complexity chats, with the highest percentage in categories “Understand” (30.8%) and “Apply” (29.3%). — *Figure 6: Task complexity for proficient/expert*

Bar chart illustrating top topics for novice users with business and finance (12.5%), education and learning (10.0%), and computers and electronics (9.8%) as top three topics. — *Figure 7: Top topics for novices*

In contrast, novice users engaged more in professional tasks relating to business and finance and education and learning, mainly using the tool to recall information.

Bar chart showing task complexity for novice users. The chart shows a greater number of low complexity chats than high complexity chats, with the highest percentage in categories “Remember” (48.6%). — *Figure 8: Task complexity for novices*

However, novices are targeting increasingly complex tasks over time. Over the eight-month period, the percentage of high complexity tasks rose from about 36% to 67%, suggesting that novices are learning and adapting quickly (see Figure 9).

Line chart showing weekly percentage of high complexity chats for novice users from January-August 2024. The line chart starts at 35.9% in January and ends at 67.2% in August. — *Figure 9: High complexity for novices Jan-Aug 2024*
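A weekly trend like the one in Figure 9 can be computed by grouping classified chats by week and taking the share labeled high complexity. The sketch below assumes each chat is available as a `(date, label)` pair with the label already reduced to `"high"` or `"low"`; that input shape and the function name are hypothetical and not taken from the project.

```python
from collections import defaultdict
from datetime import date


def weekly_high_complexity_share(chats):
    """Return {(iso_year, iso_week): percent of chats labeled 'high'}."""
    totals = defaultdict(int)
    highs = defaultdict(int)
    for chat_date, label in chats:
        iso = chat_date.isocalendar()
        week = (iso[0], iso[1])  # ISO year and ISO week number
        totals[week] += 1
        if label == "high":
            highs[week] += 1
    return {week: 100.0 * highs[week] / totals[week] for week in sorted(totals)}


sample = [
    (date(2024, 1, 1), "low"),
    (date(2024, 1, 2), "high"),
    (date(2024, 1, 8), "high"),
    (date(2024, 1, 9), "high"),
]
shares = weekly_high_complexity_share(sample)
```

With this toy input, the first ISO week of 2024 has one high complexity chat out of two and the second week has two out of two.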

How does user satisfaction vary according to expertise?

We classified both the user expertise and AI agent expertise for anonymous interactions in Copilot in Bing. We compared the level of user and AI agent expertise with our user satisfaction classifier.

The key takeaways are:

Experts and proficient users are only satisfied with AI agents with similar expertise (expert/proficient).
Novices are least satisfied, regardless of the expertise of the AI agent.

Table illustrating user satisfaction based on expertise level of user and agent. Each row of the table is a user expertise group (novice, beginner, intermediate, proficient, expert) and each column is an AI expertise group (novice, beginner, intermediate, proficient, expert). The table illustrates that novice users are least satisfied overall and expert/proficient users are satisfied with AI expertise of proficient/expert. — *Figure 10: Copilot in Bing satisfaction intersection of AI expertise and User expertise (August-September 2024)*
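A table of this shape can be built by averaging a satisfaction score over every pairing of user and AI expertise. The following is a minimal sketch under assumptions: the input rows of `(user_level, ai_level, score)` and the use of a simple mean are hypothetical, since the post does not state how the cells in Figure 10 were computed.

```python
from collections import defaultdict

LEVELS = ["novice", "beginner", "intermediate", "proficient", "expert"]


def satisfaction_matrix(rows):
    """Mean satisfaction per (user expertise, AI expertise) pair.

    Returns a nested dict indexed as matrix[user_level][ai_level];
    cells with no conversations are None.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for user_level, ai_level, score in rows:
        sums[(user_level, ai_level)] += score
        counts[(user_level, ai_level)] += 1

    matrix = {}
    for user_level in LEVELS:
        matrix[user_level] = {}
        for ai_level in LEVELS:
            n = counts.get((user_level, ai_level), 0)
            matrix[user_level][ai_level] = sums[(user_level, ai_level)] / n if n else None
    return matrix


rows = [
    ("expert", "expert", 4.0),
    ("expert", "expert", 2.0),
    ("novice", "expert", -1.0),
]
matrix = satisfaction_matrix(rows)
```

Leaving empty cells as `None` rather than zero keeps pairings with no data distinguishable from pairings with neutral satisfaction.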

Conclusion

Understanding these metrics is vital for grasping user behavior over time and relating it to real-world business indicators. Users are finding value in complex professional knowledge work tasks, and novices are quickly adapting to the tool and finding these high-value use cases. By analyzing user satisfaction in conjunction with expertise levels, we can tailor our tools to better meet the needs of different user groups. Ultimately, these insights can help improve user understanding across a variety of tasks.

In our next post, we will examine the engineering processes involved in LLM-generated classification.
