2026-02-02 4 min read

Can an Opus 4.5 agent do real-world qualitative research analysis?

I tested Claude Opus 4.5 on real HCI qualitative research - thematic analysis, coding transcripts, and generating codebooks. Here's what actually happened.

Okay, you’ve already seen AI agents doing complex coding, vibe-coding entire applications, running deep research…

They say AI is good at repetitive tasks or tasks that have a step-by-step process.

But can AI do real-world qualitative analysis for academic research?

I don’t actually expect AI to replace researchers or completely take over this task—research is all about novelty, new insights, and original contribution. But I do hope they can collaborate with researchers and speed up some tedious tasks.

I’m going to give AI a real qualitative research task in the HCI (Human-Computer Interaction) field and see how useful it actually is.

WARNING: This is not a research paper or a proper report, and it may be biased. If you set up and run the test yourself, you might get different results; take this as a case study or an example, not a benchmark.

The Task

The task setup contains:

  • 2 research questions that need to be answered
  • 3 interview transcripts from 3 different participants
  • 1 starter codebook with sample codes to guide the analysis

The task description is:

  • Perform the steps of (collaborative) thematic analysis
  • Use MS Word or Google Docs for coding your transcripts (the agent should spawn subagents to handle transcripts in parallel if possible), utilizing different colours and the comment function. Make sure to note which colour represents which code!
  • Group codes into themes
  • Note: The codebook provided is a starting point, but your task is to inductively create additional codes. In this case, create new codes and mark them in the codebook through a yellow highlight so it’s clear which codes you found beyond the provided codebook

Final output should contain:

  • A final codebook with new codes clearly marked through yellow highlight
  • A CSV file mapping all themes and codes from the transcripts
  • Each transcript coded with highlights in a separate Word file, then compiled into one document
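To make the CSV deliverable concrete: once codes are assigned to themes, the mapping file is a few lines of Python. A minimal sketch, using made-up theme and code names for illustration (not the actual codebook from the task):

```python
import csv

# Hypothetical theme-to-code mapping (illustrative names only,
# not the real codes from the task's codebook).
theme_to_codes = {
    "Trust in automation": ["over-reliance", "verification behaviour"],
    "Workflow friction": ["context switching", "tool fatigue"],
}

# Write one (theme, code) row per code, under a header row.
with open("themes_codes.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["theme", "code"])
    for theme, codes in theme_to_codes.items():
        for code in codes:
            writer.writerow([theme, code])
```

A long-format CSV like this (one row per theme–code pair) is easier to filter and pivot later than one column per theme.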

My Agent Setup

  • Claude Code with Opus 4.5, extended thinking enabled
  • Created a dedicated folder with all task documents
  • Custom CLAUDE.md to set the agent’s role as a research assistant, plus a custom skill for thematic analysis. Disabled all unrelated MCP tools and skills
  • Full access to Python and file manipulation tools

I configured the thematic analysis skill with detailed step-by-step instructions from my professor’s lectures, guided the agent carefully, and provided a structured assessment rubric for feedback.
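For illustration, the CLAUDE.md looked roughly like this. This is a paraphrased sketch of the role framing and workflow, not my exact file:

```markdown
# Role
You are a qualitative research assistant on an HCI project.

# Workflow
1. Read the research questions and the starter codebook first.
2. Code each transcript sentence by sentence, reusing codebook codes
   where they fit; propose new codes only when nothing fits.
3. Mark every newly created code in the codebook with a yellow highlight.
4. Group codes into themes that answer the research questions.

# Constraints
- Every code must point to actual transcript text; never invent quotes.
- Ask before restructuring the provided codebook.
```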

The Experiment

I compared 3 different approaches:

  1. Agent runs end-to-end with feedback loops (ralph-loop / human-in-the-loop)
  2. Human breaks down the work, agent executes each chunk
  3. Human does the analysis, AI only helps with simple tasks like highlighting and formatting the codebook

My hypothesis: Experiment 1 would fail—we’re not there yet. Experiments 2 and 3 should produce mediocre but usable results, depending on the user’s research skills and prompting ability.

Result

I was surprised that the outputs from experiments 1 and 2 were both really, really bad. The agent did produce a codebook and a CSV file, but the content was essentially random and didn’t meet the criteria; it didn’t answer the research questions at all. The codes were inconsistent (a different code for nearly every sentence, ignoring the codebook recommendations), the highlights landed on the wrong passages, and there were far too many themes, with codes grouped into themes that neither answered the research questions nor offered meaningful insight. Not even close to academic standard.

Experiment 3, however, worked well. The human did 85% of the work—the actual analysis and reasoning. AI handled the remaining 15%: formatting the codebook, organizing themes based on the codes I provided, and cleaning up the final output. It sped up the tedious formatting work, but it couldn’t actually think through the analysis, even with a structured codebook in front of it.

So… What Does This Mean?

Maybe Apple was right. AI can’t yet handle the abstract, high-level reasoning that academic research requires. Right now, these models are just tools: assistants that help us turn drafts into polished outputs faster. The actual thinking? Still on us. Maybe another one to three years until we see a glimpse of true AGI.