Harsh Kumar

Research

Most AI evaluations stop at the interaction (e.g., was the answer correct, or did the user like it?). My work asks what happens afterward. Did the learner retain the concept? Could the writer still generate original ideas without the model? Did advice lead to better action, or only feel reassuring in the moment? I use randomized and longitudinal experiments to measure these downstream outcomes, then use the results to design better assistance, evaluations, and post-training objectives. The preferences used to train and evaluate models are proximal signals. Human experiments can test when those signals predict the downstream outcomes we care about and help us develop better scalable proxies.

Creativity & Cognition

LLMs can enhance human creativity when people co-create with them, but it is unclear what happens to unassisted human creativity. In our CHI 2025 paper, we conducted randomized experiments with 1,100 participants and found that LLM assistance boosted creative performance during use but hindered independent performance afterward, with evidence of homogenization that persisted after participants no longer had LLM access (we are seeing more evidence of this in the real world now). In ongoing work, we study how multiple LLMs with different personas (grounded in human social learning mechanisms) affect math problem-solving and open-ended writing (working draft). In another paper (CHI 2026), we investigated how the timing of LLM access affects critical thinking under time constraints. Larger experiments are underway, including on the emotional effects of creating with LLMs and on how AI involvement changes the way others receive a creative artifact.

Learning

How can LLMs be useful for learning? Back in 2023, the promise was clear (personalized explanations on demand). But there was insufficient scientific evidence to confirm the learning benefits. So we ran one of the first large RCTs on LLMs and math education. LLMs helped, most of all when learners attempted problems before consulting the model, and even when the explanations contained mistakes. This informed the work I did in classrooms at the University of Toronto, on guiding students in their use of LLMs and supporting self-reflection at scale. Increasingly, I think the focus on LLM tutors is inertia from older learning technology. We have had machines that answer questions before (e.g., cognitive tutors). Beyond access to content, there are other aspects of learning that have been overlooked because they were difficult to engineer or scale in a pre-LLM world. One direction I have recently explored uses different LLM personas to simulate a classroom. Learners surrounded by simulated peers did better on math tests and wrote less homogenized essays. Thinking beyond one learner, one tutor, one session makes room for factors such as motivation, self-efficacy, learners’ relationships with the people around them, and other human outcomes. I am exploring more in this direction.

Wellbeing & Mental Health

People increasingly bring personal struggles to LLMs. With Mental Health America, I led the technical side of a text-messaging intervention that was deployed to over 10,000 people, part of the Small Steps program. In our CHI 2026 paper, we compared AI and human responses to real advice-seeking posts. Our position paper argues that LLMs should be optimized for our well-being rather than just short-horizon preferences, account for collective interests alongside individual interests, and balance user autonomy with guidance. I am currently running longitudinal studies and designing LLM agents that optimize for longer-term wellbeing outcomes.

Methods

My experiments run in both controlled and field settings, including online platforms, classrooms, and deployed interventions. I develop the interfaces for conducting these experiments and collecting data, and I release the code on GitHub so that others can build on these experiments. Depending on the nature of the problem, I also use interviews, surveys, and qualitative analysis. I am comfortable with the LLM post-training stack: supervised fine-tuning and preference optimization (TRL, LoRA/QLoRA), inference with vLLM, multi-GPU training, and LLM-as-judge evaluation pipelines.

My long-term agenda is to build AI systems with positive societal impact, where success is measured not only by immediate task performance, but by how these systems shape human learning, creativity, and wellbeing over time. I want to run longitudinal experiments over longer horizons, and to scale the evaluations these experiments make possible. I also want to move beyond the single user as the unit of analysis. Many of the effects I care about play out in groups and communities, and these can be studied experimentally too. On the technical side, I want to develop post-training methods tailored directly to the domains I study, so that what we measure can change how models are trained.

Working with others

The problems I work on are interdisciplinary. Besides HCI and ML researchers, I regularly work with psychologists, philosophers, economists, learning scientists, clinicians, tutors, and artists. I have led teams, as in the Mental Health America project, and worked within larger organizations, as an Applied Scientist Intern at Microsoft. I regularly work with undergraduates, many of whom are co-authors on my papers. I have also served on the program committees of conferences such as AIED (2025) and Learning@Scale (2025–26), and as an expert reviewer for grant competitions, such as the Tools Competition. If you work on similar questions or if any of my work interests you, I’d be happy to talk.


Website template adapted from Abhraneel Sarma.