AI Tutors: Hype or Hope for Education?

In his thought-provoking book, Brave New Words, Sal Khan discusses his early experimentation with generative AI, or GenAI, models and how, over time, they might change education. If AI is a new frontier, Brave New Words reads much like the field notes of an explorer documenting his experiences and trying to make sense of what they mean for teaching and learning.

At the heart of Khan’s vision is the idea of AI-powered tutors that adapt to each student’s unique needs, abilities, and interests. These advanced systems, he suggests, will provide direct instruction, real-time feedback, and personalized support, enabling students to learn at their own pace and master concepts more thoroughly than in traditional classroom settings.

Khan discusses the early lessons his team is learning through Khanmigo, an AI-powered tutoring platform built around OpenAI’s flagship GenAI model GPT-4. Rather than just giving answers, this platform supports students by breaking down complex problems into manageable steps and providing explanations that guide students toward deeper understanding. Khanmigo can also assist students with their writing tasks, offering feedback and suggestions to improve their essays and helping them develop critical thinking skills.

In some respects, this vision is not entirely new. Educators have heard predictions about personalized learning solutions for decades, only to be disappointed by technologies that overpromised and underdelivered. It’s reminiscent of the classic Peanuts comic-strip scenario where Lucy time and again pulls away the football at the last second, causing Charlie Brown to fall flat on his back. Educators have seen wave after wave of hyped ed-tech solutions that sounded great in theory but fell short in practice. Many will feel a sense of déjà vu when they hear Khan’s vision of AI-powered personalized tutors, wondering if this is just the latest in a long line of footballs destined to be yanked away, leaving them disillusioned and disappointed.

So what makes this moment different? Why should educators believe that AI-powered tutoring systems like those envisioned by Khan will succeed where previous attempts at tech-facilitated personalized learning have fallen short?

I’m persuaded by Khan’s enthusiasm in part because of my own experience working with GenAI models over the last year, including participating in some early access and safety testing programs for several of the leading AI companies. What has struck me is that GenAI represents a paradigm shift that goes beyond previous innovations like the printing press or the Internet. While these earlier breakthroughs democratized access to information, AI goes a crucial step further by providing access to expertise. Books and the Internet serve as vast repositories of human knowledge, expanding our collective information base. However, they still require human intelligence to process, interpret, and apply that information effectively. AI, in contrast, not only stores and retrieves information but also simulates human-like intelligence to analyze, synthesize, and generate insights from it. That gives people the ability to apply on-demand expertise to a wide range of problems or tasks, including those common in education such as analyzing data, creating instructional materials, offering pedagogical insights, or brainstorming ideas.

AI-Powered Tutoring

As Khan’s book illustrates, the capabilities of these GenAI models make them uniquely suited to serve as tutors that can reflect a variety of teaching strategies, such as adopting a Socratic approach to a lesson or helping students reflect on their work. These general capabilities of GenAI can be fine-tuned to interact with custom data sets such as research findings, the school’s curriculum, or past student assessments. This gives GenAI more specific expertise, which can help drive coherence and ensure a consistent and integrated learning experience that reflects the school’s instructional goals and is based on rigorous research.

The high IQ of these models is now being matched by another, surprising characteristic—high EQ. GenAI produces fluent language that closely imitates the way humans talk and respond. In fact, a growing body of research shows not only that these models can provide accurate responses, but also that human evaluators rate their responses as more empathetic than those of other humans. Additional research is showing how this capacity might enable the GenAI system to better “understand” the emotional state of a user and respond appropriately. This can potentially allow an AI tutor to encourage, reassure, and motivate students and provide feedback to teachers on whether their lecture is engaging or boring.

Since the publication of Brave New Words, new capabilities have emerged among GenAI models that can enhance the tutoring experience. For example, many models now have the ability to analyze images, allowing students to upload a photo of a textbook page and receive instant assistance in understanding complex concepts.

Google’s Gemini 1.5 Pro goes one step further with the ability to analyze videos. An educator can provide it with a video of their instruction and ask questions about the video’s content just as easily as they could with a research paper. That capacity could provide a powerful tool to inform teacher practice or assess students. It also opens up an entirely new approach to training AI tutors based on analysis of videos that depict effective human tutors engaging with students.

Claude 3.5 Sonnet has introduced Artifacts, a feature that enables the AI model to generate small interactive resources alongside its text responses. In a physics course, Claude can generate simulations and interactive problem sets that allow students to apply their knowledge and see the implications of scientific principles.

OpenAI’s Advanced Voice technology has made significant strides in generating AI speech that closely mimics human speech patterns. The model incorporates subtle nuances such as simulated breathing sounds, filler words (like “um” and “uh”), laughter, and emotional inflections, to create a more natural and human-like listening experience. Additionally, the technology has the capability to detect and respond to users’ emotions, allowing for more empathetic and context-aware interactions.

This advancement should facilitate the development of AI tutors that are far more conversational and engaging than anything students have encountered with previous education technology. If a student is frustrated or confused, the AI model can adjust not only its response but also its tone, offering reassurance and encouragement. In fact, major AI companies are concerned that these systems could become too relational. OpenAI has cautioned that human-like voice interactions could lead users to anthropomorphize the AI tutor and develop an emotional connection to it or a reliance on it. Early testing surfaced risks with extended interaction that could change social norms and cause some users to prefer engaging with the AI bot over human interaction, an issue that the Christensen Institute’s Julia Freeland Fisher has warned about.

These are all emerging capabilities that may still have some limitations, but they will continue to evolve and improve over time. The best way to think about these capabilities is as building blocks, like LEGO pieces, that can be assembled and configured to create innovative tools and services. What could only have been a text-based tutoring system a year ago can now engage students through active listening and conversation using speech recognition and synthesis. Student work, including visual elements, can be analyzed through advanced image analysis techniques. And empathetic capabilities can be adjusted to provide appropriate levels of encouragement or motivation to help guide a student through their lesson, adapting to their individual needs and progress.

Answering the Skeptics

These capabilities have generated much excitement, but this latest generation of AI has also been met with skepticism from some observers, who are wary of becoming distracted by yet another “silver bullet” technology fad. Critics point to the limitations of current models, worry that the hype could lead to diverting funds and attention away from critical education priorities, and argue that the focus should remain on addressing the complex, systemic issues that have long plagued the sector.

These are understandable and, to some extent, reasonable concerns. However, dismissing the potential impact of GenAI based on its current limitations is shortsighted, as the rapid pace of advancements suggests that these models will likely overcome many of their shortcomings in short order.

Skeptics whose experience with GenAI is limited to the free, GPT-3.5-based version of ChatGPT from 2023 may not fully grasp the advancements made in the field. In just one year, GPT-4, once the leading frontier model, has been joined by a host of powerful contenders, including Google’s Gemini 1.5 Pro and Anthropic’s Claude 3.5 Sonnet, as well as new open models such as Meta’s Llama 3.1 and Mistral Large 2. Each of these models has unique strengths, excelling in different areas and tasks, and some do better with tutoring prompts than others. Just recently, OpenAI released a new model that is breaking AI records in complex reasoning, math, and science. These models will continue to get cheaper, faster, and more powerful over time, which will help support more experimentation in tutoring applications.

In some cases, disappointing GenAI output may be the result of the poor prompts it receives rather than the limitations of the technology. A growing body of research is showing that well-crafted prompts can dramatically improve an AI system’s performance on various tasks. Techniques such as assigning the model a particular role or expertise, providing relevant examples, guiding the AI’s reasoning process, or breaking complex problems into smaller steps can lead to much better results.
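
To make those four techniques concrete, here is a minimal sketch of how a tutoring prompt might be assembled from a role, a worked example, a set of smaller steps, and a request for visible reasoning. Everything in it is an illustrative assumption: the names TutorPrompt and build_prompt, the wording of the instructions, and the sample algebra problem are hypothetical and do not come from Khanmigo or any particular product. It only builds the prompt text; sending it to a model is left out.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TutorPrompt:
    """Hypothetical container for the pieces of a tutoring prompt."""
    role: str                      # technique 1: assign a role or expertise
    task: str                      # what the student actually needs help with
    examples: List[Tuple[str, str]] = field(default_factory=list)  # technique 2
    steps: List[str] = field(default_factory=list)                 # technique 4
    show_reasoning: bool = True    # technique 3: guide the reasoning process


def build_prompt(p: TutorPrompt) -> str:
    """Assemble the pieces into one prompt string, in a fixed order."""
    parts = [f"You are {p.role}."]
    for question, response in p.examples:
        parts.append(f"Example question: {question}\nExample response: {response}")
    if p.steps:
        numbered = "\n".join(f"{i}. {s}" for i, s in enumerate(p.steps, start=1))
        parts.append("Work through these steps in order:\n" + numbered)
    if p.show_reasoning:
        parts.append("Explain your reasoning before giving any conclusion.")
    parts.append(f"Task: {p.task}")
    return "\n\n".join(parts)


if __name__ == "__main__":
    prompt = TutorPrompt(
        role="a patient middle school math tutor",
        task="Help the student solve 3x + 5 = 20 without giving the answer.",
        examples=[("Solve 2x = 8.", "What could you do to both sides to get x alone?")],
        steps=["Ask what the student has tried", "Give one hint"],
    )
    print(build_prompt(prompt))
```

A bare request such as "solve 3x + 5 = 20" leaves the model to guess at the audience and the goal; the structured version states both, which is the kind of difference the prompting research points to.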

Critics are quick to point out errors or low-quality GenAI outputs, but quality control is also a challenge with traditional approaches. A RAND study found that educators often use platforms like Pinterest and Teachers Pay Teachers for instructional ideas, even though these sources tend to be of low quality and can contain their own “hallucinations” and biased content. Just over a third of teachers use at least one standards-aligned ELA curriculum, and less than half regularly use at least one aligned mathematics curriculum. Interventions like high-dosage (human) tutoring, which showed significant gains in smaller studies, have struggled to provide the same results when scaled, and survey research suggests these interventions are not reaching the students who need them the most. If the only contribution of a fine-tuned GenAI to the field of education is to accelerate the transition to high-quality instructional materials and support greater curricular coherence, it would still represent a worthwhile and arguably transformative advancement.

A nascent but growing body of research illustrates the effectiveness of GenAI tutors. For instance:

Tutor CoPilot, a human-AI system that provides expert-like guidance to tutors, improved student mastery of topics by 4 percentage points in a randomized controlled trial with 1,800 students. Lower-rated tutors saw the greatest benefit, with their students improving mastery by 9 percentage points. Tutor CoPilot helped less-effective tutors achieve outcomes comparable to more-effective peers.
A Harvard study found that students using a custom-designed AI chatbot tutor for a physics course showed approximately double the learning gains and significantly higher engagement compared to those in a traditional classroom. The AI tutor’s personalized feedback and students’ ability to self-pace proved especially beneficial when students were encountering new material.
In Ghana, an AI-powered math tutor called Rori, accessible via WhatsApp, led to significantly higher math growth scores for students who used it for one hour per week, with an effect size equivalent to an extra year of learning. Rori’s low cost of $5 per student suggests it could be a cost-effective intervention in educational settings with limited resources.
One study introduced Bridge, a method that employs a task analysis to model the decisionmaking processes that expert teachers use when they address student math errors. Researchers applied this method to a data set of 700 annotated real-world tutoring conversations with students from Title I schools. They found that when GPT-4 was given information about an expert teacher’s decisionmaking processes (including the type of mistake, teaching strategy, and response goal), the AI system’s responses to students’ math mistakes were rated 76 percent better by humans compared to when GPT-4 had to respond without that expert guidance. This study demonstrates the importance of incorporating expert knowledge into AI models for tutoring and other uses.
In a randomized controlled study, students using an AI tutor demonstrated significantly greater learning gains in less time compared to those in an active learning classroom. The AI-using students spent a median of 49 minutes on the tasks compared to the 60-minute lecture. AI-using students reported higher levels of engagement and motivation, with 83 percent considering the AI’s explanations as good as or better than human instructors’. The AI tutor’s effectiveness was attributed to its meticulous adherence to pedagogical best practices, including active learning, cognitive load management, growth mindset, scaffolding, accuracy, timely feedback, and self-pacing.
One study demonstrated how AI tutors can act as education experts, successfully replicating known teaching principles and creating improved math worksheets that significantly align with teacher judgments. Those capacities suggest that AI could speed up lesson design while highlighting the continued importance of human expertise and real student testing.
A field experiment with nearly 1,000 students in a Turkish high school used GPT-4 during three tutoring sessions covering 15 percent of the curriculum. Researchers found that access to the AI tutors significantly improved math performance (by 48 percent to 127 percent) but subsequently harmed educational outcomes when access was removed (17 percent reduction), suggesting students used GPT-4 as a “crutch” instead of actually learning critical skills. However, safeguards in the GPT tutor largely mitigated these negative effects, highlighting the need for caution when deploying generative AI to ensure long-term productivity through continued human learning.

The AI-powered tutoring systems of the near future are likely to be significantly more capable than today’s. When critics talk about GenAI not fully understanding a student or failing to build on the student’s previous learning, they’re ignoring how fast these models are becoming more relational. AI tutors will gain access to larger memory and context windows, including the ability to read and analyze a student’s previous work to better inform tutoring sessions. They will soon be able to see and listen to students, opening up new ways of engaging them and assessing their understanding of concepts. There’s reason to believe that future versions of these systems will have even greater empathic capabilities, allowing them to better motivate and engage students.

That said, while GenAI is a powerful tool, it is just that—a tool. The value comes not from the tool itself but from how and when it is used. Educators should implement AI tutors in targeted ways to solve specific instructional challenges, not simply adopt them for their own sake. These tools should serve to support and empower educators, not replace them. Most important, the use of GenAI must be balanced against the need to cultivate students’ ability to focus and sustain attention—skills that today’s digital distractions increasingly threaten.

Moment of Urgency

These GenAI tools and capabilities are emerging at the exact moment when the education sector urgently needs innovative solutions. Chronic absenteeism surged to include 28 percent of all K–12 students in 2022, with only a slight improvement in 2023. A Walton Family Foundation–Gallup “Voices of Gen Z” study found that between 25 percent and 54 percent of Gen Z K–12 students report that they lack engaging experiences in school. The average student has regained only a fraction of the learning lost during the pandemic, with just one-third of math losses and one-quarter of reading losses recovered. According to research by the Northwest Evaluation Association, students will need an average of four additional months of learning to catch up, and in some cases, as much as nine.

Perhaps more traditional reforms and tutoring will be able to address these challenges. I certainly hope they will help, but I doubt that they’ll prove sufficient for the depth and breadth of the challenges we’re facing. The urgency of the moment should be a call to experiment and pilot new approaches that explore how best to thoughtfully and purposefully harness the capabilities of GenAI. We need more, not less, experimentation with AI tutors. We need more efforts using GenAI to lighten the administrative load that often distracts teachers from their most important work: building the deep, meaningful relationships with students that are the foundation of academic success.

The vision Khan presents in Brave New Words is not a distant dream but an unfolding reality that demands our attention and active engagement. The rapid advancements in GenAI have opened up a world of possibilities for improving teaching and learning, but we must approach this new frontier with both excitement and caution. Realizing the full potential of AI in education will require more than just technological innovation; it will demand a collective commitment to ensuring that these powerful tools are harnessed in ways that genuinely benefit all students. Khan’s roadmap may not be fully realized in the immediate future, but it sets a course for a destination worth pursuing—a world in which every student, regardless of background, has access to the personalized support, engaging learning experiences, and high-quality education they need to thrive. The future of our students, and our society, depends on our willingness to act decisively and creatively at this crucial juncture.