In mid-2023, there were only a handful of academic articles on using generative AI in qualitative research. Early pieces included work by Anis and French (2023), Gao et al. (2023), and Christou (2023). Now there are hundreds of papers, both preprints and peer-reviewed publications. The rapid growth, however, has not been matched by an equivalent growth in AI literacy. Peer review has not been a reliable filter for whether authors actually understand how these systems work.
Many of these studies use the technology in ways that almost guarantee poor results. Most rely on general-purpose chatbots, treat them as if they were qualitative analysis software, and then report “failures” that follow directly from this mismatch. The problem is not simply that AI is new, but that using it does not mean one has mastered it or understood its constraints.
There is also a split in the literature. Scholars with stronger technical expertise, often based in computer science, tend not to use chatbots at all. They build custom systems on top of generative models and avoid some of the most basic pitfalls. Yet they frequently lack grounding in qualitative methodology. Their work raises different concerns, which would require a separate discussion.
In this blog post, I focus on the first group: qualitative researchers who experiment with chatbots but have not yet developed sufficient AI literacy to design workable prompts or workflows. When their prompts fail, they conclude that the technology itself is unsuitable for qualitative analysis, rather than recognising that the setup made failure inevitable. In what follows, I present one such prompt, published in the study by Jowsey et al. (2025), and explain, step by step, why it was doomed from the start. The study did not need to be run to know that the results would be problematic; the shortcomings were already built into the way the task was formulated.
Jowsey et al. (2025) tested Microsoft Copilot against human researchers. The instructions for Copilot were as follows:
Undertake thematic analysis of the dataset I have uploaded alongside this prompt. Review the dataset in its provided format. Begin by familiarizing yourself with the content and noting key recurring ideas. Generate initial codes based on identified features and apply these codes consistently across the dataset. Group related codes into potential themes, using pattern detection algorithms if applicable. Evaluate and refine the themes to ensure they are coherent and relevant. Develop detailed descriptions and clear names for each theme. Prepare a comprehensive report summarizing the thematic findings, supported by relevant data excerpts. Ensure consistency in coding and theme development, validate the accuracy of themes, and document the entire process transparently. Provide between four and eight themes, supporting the themes with the number of participants that are represented in each theme. Provide quotes from participants to support each theme. Provide this thematic analysis summary within 800 words.
There are several issues with this prompt, but the most important is that the authors handed the analysis over to the LLM entirely. If the technology is used in this way, then yes, I have to agree with Nguyen & Welch (2025b) that this “erodes the ethical obligation to engage with participants with respect, care, and integrity, along with the scholarly commitment to generating meaningful knowledge in service of business and society.”
The prompt looks comprehensive at first glance. It asks Copilot to perform an entire thematic analysis: familiarise itself with the dataset, generate codes, apply them consistently, build and refine themes, check accuracy, track participant numbers, extract quotes, and write a structured summary. It reads like a checklist of what a human analyst would do.
But an LLM cannot perform these steps simply because they are written out in a single prompt. The prompt assumes cognitive and procedural abilities the model does not have. The failure is not an empirical finding about AI; it is a predictable outcome of the system they used. There are four structural reasons for it:
1. The prompt assumes that the chatbot can execute sequential analytic steps
A human analyst moves through a workflow. They read the data, create codes, return to the text, refine codes, compare cases, and slowly build themes. Each step depends on the previous one.
A chatbot cannot run this type of procedure. It generates one response based on the statistical patterns in the prompt and whatever parts of the dataset it attended to. It does not “store” initial codes or apply them across the data. It does not maintain a working memory of segments. It does not iteratively compare patterns. When you ask for a finished thematic analysis in one turn, the model produces something that resembles a finished analysis because that is the pattern it has seen in training—but it has not carried out the steps.
The prompt demands a process. The model can only generate a product. This mismatch cannot be fixed by longer instructions.
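To make this concrete, here is a minimal Python sketch. It uses a hypothetical ask_llm() function as a stand-in for whatever chat API is involved; nothing here is the setup Jowsey et al. used, only an illustration of the statelessness of chat calls. Nothing carries over between calls unless the researcher stores it and sends it back, so the “initial codes” the prompt asks the model to “apply consistently” have nowhere to live.

```python
# Minimal sketch: each call to a chat model is stateless. ask_llm() is a
# hypothetical stand-in for whatever chat API is used; it takes a prompt
# string and returns the model's text reply.

def ask_llm(prompt: str) -> str:
    return "[model response]"  # placeholder: replace with a real chat API call

# Step 1: ask for initial codes for ONE segment of data.
segment = "...one interview excerpt..."
codebook = ask_llm("Suggest 3-5 descriptive codes for this excerpt:\n" + segment)

# The model does not remember these codes. If the next step is to
# "apply these codes consistently across the dataset", the researcher
# has to store them and send them back explicitly with every new segment:
next_segment = "...another interview excerpt..."
applied = ask_llm(
    "Here is my current codebook:\n" + codebook +
    "\n\nApply these codes (and only these) to the following excerpt:\n" +
    next_segment
)
```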
2. The LLM had no retrieval mechanism to stay grounded in the data
Copilot was used in simple chat mode. This matters. A chat interface does not give the model ongoing, stable access to the dataset. There is no indexing, no retrieval system, and no way to return to earlier text reliably. Once the input window is processed, the model is generating from compressed representations, not from source material.
The results reported in the article reflect this limitation precisely. Copilot drew most of its quotes from the first two or three pages of data and largely ignored the rest. Without retrieval, this is exactly what happens: the model falls back on the parts of the text it “remembers” most strongly, which are usually the first segments. It is not an error. It is how chat models work.
This is also why hallucinated quotes appear. When a prompt requires evidence that the model cannot access, it generates text that looks like evidence. This is the problem RAG was designed to solve. But RAG was not used here.
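For readers who have not worked with retrieval-augmented generation, here is a minimal sketch of the idea. It uses TF-IDF similarity in place of the embeddings and vector store a production system would use, and the same hypothetical ask_llm() placeholder as above: the dataset is split into chunks, the chunks most relevant to a question are retrieved, and only those chunks are placed in the prompt, so every answer can be traced back to specific passages.

```python
# Minimal retrieval-augmented generation (RAG) sketch. TF-IDF similarity
# stands in for the embeddings and vector store a real system would use.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def ask_llm(prompt: str) -> str:
    return "[model response]"  # placeholder for a real chat API call

def retrieve(question: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k dataset chunks most similar to the question."""
    matrix = TfidfVectorizer().fit_transform(chunks + [question])
    scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    return [chunks[i] for i in scores.argsort()[::-1][:k]]

def grounded_answer(question: str, chunks: list[str]) -> str:
    """Answer a question using only retrieved excerpts, so every claim
    can be traced back to specific passages in the data."""
    evidence = retrieve(question, chunks)
    prompt = (
        "Answer the question using ONLY the excerpts below, and quote "
        "them verbatim when giving evidence.\n\n"
        + "\n---\n".join(evidence)
        + "\n\nQuestion: " + question
    )
    return ask_llm(prompt)
```

The point of the retrieval step is not sophistication but traceability: the researcher can always see which excerpts an answer was built from.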
3. The prompt forces the model into fabrication
The instructions demand:
- verbatim quotes from participants to support each theme,
- the number of participants represented in each theme, and
- validated, accurate themes documented transparently.
None of these can be generated reliably without grounded retrieval.
As the paper shows, 57.5% of Copilot’s quotes were modified or fabricated (Table 4). This is not a sign of LLM unreliability. It is a sign that the prompt asked for outputs the model had no mechanism to verify. When an LLM is instructed to produce a quote it cannot retrieve, it simply writes a plausible one. That is how generative models operate. The prompt almost guarantees hallucination—not because the model is defective, but because the task setup leaves it no alternative.
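One inexpensive safeguard is to check every quote the model returns against the source transcripts before reporting it. A small sketch of such a check, in plain Python with whitespace-insensitive matching (the function names are mine, not from the study):

```python
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so that formatting differences
    do not hide an otherwise verbatim match."""
    return re.sub(r"\s+", " ", text).strip().lower()

def verify_quotes(quotes: list[str], transcripts: list[str]) -> dict[str, bool]:
    """For each quote, report whether it occurs verbatim (up to
    whitespace and case) in at least one transcript."""
    corpus = [normalize(t) for t in transcripts]
    return {q: any(normalize(q) in t for t in corpus) for q in quotes}

# Any quote flagged False was altered or invented and has to be
# traced back to the data by hand before it can be reported.
```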
4. The prompt treats a chatbot as if it were analysis software
The prompt was written as if Copilot were a CAQDAS tool: something that can ingest data, store segments, track metadata, and perform systematic comparison. But Copilot is a general-purpose conversational assistant. It is not designed for multi-document analysis and has none of the workflow structures needed for systematic qualitative work.
The problem is not just that the prompt was long or ambitious. The problem is that the model was expected to behave like an analytic system when it is, by design, a pattern-based text generator. In short: the prompt could never have produced a trustworthy thematic analysis. It required:
- executing a multi-step analytic workflow,
- maintaining a working memory of codes and data segments,
- retrieving exact quotes from across the full dataset,
- tracking which participants contributed to each theme, and
- verifying its own claims against the source material.
None of these are capabilities of a single-turn, zero-shot chatbot prompt. The study’s results—theme summaries based on only part of the data, fabricated quotes, missing participant numbers, superficial themes—are not surprising. They follow directly from the methodological setup. The lesson is not that AI fails at qualitative analysis. The lesson is that the task must be aligned with how LLMs actually work.
A workable approach looks very different from a one-shot chatbot prompt. As I outline in CA to the Power of AI (Friese, 2025), qualitative analysis with LLMs only becomes reliable when the workflow is grounded and iterative. This requires three components: a retrieval system that anchors every answer in the actual dataset (RAG); a stepwise interaction where the researcher directs each stage of the analysis; and a dialogic process where the model is questioned, corrected, and refined rather than allowed to generate a full analysis in one turn. With these elements in place, the model stops guessing and starts retrieving, and the researcher stays in control of the interpretive work.
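To illustrate where the researcher sits in such a loop, here is a rough sketch under the same assumptions as the earlier snippets (the hypothetical ask_llm() and retrieve() helpers). It is not a definitive implementation, only the shape of a grounded, dialogic session:

```python
# Sketch of a grounded, researcher-led session: one analytic question at a
# time, answers built only from retrieved excerpts, nothing recorded
# without the researcher's explicit decision. Reuses the hypothetical
# ask_llm() and retrieve() helpers from the earlier sketches.

def analysis_session(chunks: list[str]) -> list[tuple[str, str, list[str]]]:
    notes = []  # the researcher's running record of accepted findings
    while True:
        question = input("Analytic question (blank to stop): ").strip()
        if not question:
            return notes
        evidence = retrieve(question, chunks)          # grounding step
        draft = ask_llm(
            "Using ONLY these excerpts, answer the question and quote your "
            "evidence verbatim.\n\n" + "\n---\n".join(evidence)
            + "\n\nQuestion: " + question
        )
        print(draft)
        decision = input("accept / revise / discard? ").strip().lower()
        if decision == "accept":
            notes.append((question, draft, evidence))  # audit trail
        # "revise" -> the researcher asks a sharper follow-up question;
        # "discard" -> nothing is recorded.
```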
In addition to my proposal, several authors have outlined similar iterative approaches that respect how LLMs actually work and keep the researcher actively involved in the analysis. Morgan (2025) has argued for staged interaction rather than one-shot prompts; Nguyen-Trung and Nguyen (2025) describe a process for narrative thematic analysis; and Krähnke et al. (2025) and Schäffer and Lieder (2023) emphasize researcher–LLM dialogue as a core analytic mechanism. All of these approaches share the same premise: qualitative analysis with AI only works when the process is iterative, grounded, and researcher-led, not automated through a single command.
The Jowsey et al. study is not just an isolated example of a clumsy prompt. It is symptomatic of a wider problem: qualitative researchers are importing expectations from human analytic practice into systems that do not share those capacities. A single-turn chatbot prompt is asked to execute an entire analytic workflow, maintain memory, retrieve exact quotes, track participants, and verify its own claims. When it fails, the conclusion drawn is that AI cannot do qualitative analysis, rather than that the task design was misaligned with how LLMs function.
The deeper issue exposed by this study is not the failure of a chatbot, but the persistence of workflows designed for a pre-AI era. Traditional qualitative analysis evolved under the constraint that humans cannot hold an entire corpus in working memory. Coding emerged as a way to cut data into manageable pieces, retrieve them later, and build abstractions step by step. When researchers try to make LLMs replicate this mechanical process, disappointment is inevitable. The tools are being forced into a workflow they were never built to execute.
LLMs introduce different affordances. They can process full interviews or entire datasets at once, making it possible to begin with broad patterns and then move inward—flipping the traditional order of analysis. Themes no longer represent final outcomes; they become entry points. From there, researchers can question, refine, challenge, and deepen their understanding without first segmenting the data into hundreds of codes. This shift requires more than prompt optimisation. It requires recognising how the technology works, reconsidering our analytic processes at a conceptual level, and then aligning the two. Only when we rethink the workflow—rather than asking LLMs to mimic coding—can we take advantage of what the technology actually offers while keeping interpretation firmly in human hands.
Anis, S., & French, J. A. (2023). Efficient, Explicatory, and Equitable: Why Qualitative Researchers Should Embrace AI, but Cautiously. Business & Society, 62(6), 1139-1144. https://doi.org/10.1177/00076503231163286
Christou, P. (2023). How to use artificial intelligence (AI) as a resource, methodological and analysis tool in qualitative research? The Qualitative Report. https://doi.org/10.46743/2160-3715/2023.6406
Friese, S. (2025). Conversational Analysis with AI – CA to the Power of AI: Rethinking Coding in Qualitative Analysis. SSRN. http://dx.doi.org/10.2139/ssrn.5232579
Gao, J., Choo, K. T. W., Cao, J., Lee, R. K., & Perrault, S. T. (2023). CoAIcoder: Examining the effectiveness of AI-assisted human-to-human collaboration in qualitative analysis. ACM Transactions on Computer-Human Interaction, 31(1), 1-38. https://doi.org/10.1145/3617362
Jowsey, T., Stapleton, P., Campbell, S., Davidson, A., McGillivray, C., Maugeri, I., Lee, M., & Keogh, J. (2025). Frankenstein, thematic analysis and generative artificial intelligence: Quality appraisal methods and considerations for qualitative research. PLOS ONE, 20(9), e0330217. https://doi.org/10.1371/journal.pone.0330217
Krähnke, U., Pehl, T., & Dresing, T. (2025). Hybride Interpretation textbasierter Daten mit dialogisch integrierten LLMs: Zur Nutzung generativer KI in der qualitativen Forschung [Hybrid interpretation of text-based data with dialogically integrated LLMs: On the use of generative AI in qualitative research]. SSOAR. https://nbn-resolving.org/urn:nbn:de:0168-ssoar-99389-7
Morgan, D. L. (2025). Query-based analysis: A strategy for analyzing qualitative data using ChatGPT. Qualitative Health Research. https://doi.org/10.1177/10497323251321712
Nguyen-Trung, K., & Nguyen, N. L. (2025, March 4). Narrative-Integrated Thematic Analysis (NITA): AI-Supported Theme Generation Without Coding. SocArXiv. https://doi.org/10.31219/osf.io/7zs9c_v1
Schäffer, B., & Lieder, F. R. (2023). Distributed interpretation – Teaching reconstructive methods in the social sciences supported by artificial intelligence. Journal of Research on Technology in Education, 55(1), 111-124. https://doi.org/10.1080/15391523.2022.2148786