Generative AI for Data Analysis: Two Experiments to Guide AI Adoption at a Small, Mission-Driven Nonprofit
HSAI Note: We are really excited to have our first guest blog from the incredible team at PERTS!
The Project for Education Research That Scales (PERTS) is a nonprofit research and development institute that translates insights from psychological science into cutting-edge tools, measures, and recommendations that educators anywhere can use to foster healthy and equitable academic engagement and success.
Guest authors:
Sarah Gripshover, Director of Research (PERTS) | Chris Macrander, Director of Technology (PERTS)
The Big Question: How Can We Use AI in Our Work?
Leaders in every industry are urgently asking themselves a simple question: “How can we get the most out of AI in our work?” As the Director of Research at PERTS, a small, mission-driven edtech nonprofit, I am asking myself this question too. When I learned that Large Language Models (LLMs) are increasingly being used alongside human analysts to conduct data analysis, I was intrigued: if our small team can use LLMs for data analysis even in a limited capacity, it would allow us to do more high-quality research with the same staff and budget. So together with my colleague Chris Macrander, I decided to conduct some experiments to find out what kinds of data analysis the newest models can do.
Experimenting with AI
On Sep 4–7, 2025, I (Sarah Gripshover, PERTS) conducted two mini-experiments to learn about the newest AI models' data-analysis capabilities. Experiment 1 used my own manual prompts to elicit regression analysis of fake data, and Experiment 2 used prompts developed by the AI Learning and Innovation Hub. This group has been supporting state governments in using LLMs for policy analysis with real data and statistics, and I was curious how well these tools would work for our use cases.
Experiment 1: Regression Analysis. The first thing I tried was asking an LLM to run a regression on fake data, using my own prompt. The results were poor: Google's Gemini 2.5 Flash fabricated the regression output outright.
A second attempt with Google's Gemini Pro and some smarter prompts on my part yielded what seemed to be a real attempt to analyze the data. Interestingly, the numbers it produced were conceptually aligned with the properties I knew were in the data. For example, I generated this fake data using a pre-specified covariance matrix and added a small treatment effect to the outcome variables. The regression output and conclusions were qualitatively consistent with these parameters. (Here is the full conversation I had with Gemini, and the code I used to generate the raw data. I'm also happy to provide the exact dataset I used upon request!) But again, the specific numbers in the regression and descriptive tables were always off, by a little or a lot. Even simple calculations, like means by subgroup, were plausibly aligned with the overall patterns in the data but numerically wrong.
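The linked generation script isn't reproduced here, but a minimal sketch of the kind of fake dataset described above (a pre-specified covariance matrix plus a small additive treatment effect) might look like the following. All of the parameter values are illustrative, not the ones I actually used:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Pre-specified covariance between a predictor and an outcome (illustrative)
cov = np.array([[1.0, 0.4],
                [0.4, 1.0]])
n = 500  # rows per condition

# Draw correlated (predictor, outcome) pairs for each condition
control = rng.multivariate_normal([0.0, 0.0], cov, size=n)
treated = rng.multivariate_normal([0.0, 0.0], cov, size=n)
treated[:, 1] += 0.3  # small treatment effect added to the outcome

df = pd.DataFrame(np.vstack([control, treated]),
                  columns=["predictor", "outcome"])
df["condition"] = ["control"] * n + ["treatment"] * n
```

A chatbot that correctly "read" such a dataset should recover both the predictor–outcome correlation and the small bump in the treatment group's outcome mean, which is the qualitative pattern Gemini did pick up on.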
I went through a detailed debugging process with the chatbot, asking it to do things like report the mean of just the first five values in the table, then the first 100, and so on. It explained that it was using my natural language prompts to generate Python scripts, then running them to get the answer, which is consistent with what I've read about how the newest models work. The chatbot and I "agreed" that there seemed to be some problem with how these scripts were subsetting the data. But the chatbot wasn't able to fix it, at least not with any prompt I could think of to help. These were very simple calculations, too: just a mean over a set of values where the row corresponded to treatment A or treatment B. It's one line of code in R, for example.
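To underline how simple the failing calculation was: the subgroup mean is a one-liner in pandas as well. This is a toy example with made-up values, not the actual dataset:

```python
import pandas as pd

# Toy data: one outcome value per row, tagged with its treatment condition
df = pd.DataFrame({
    "condition": ["A", "A", "B", "B"],
    "outcome":   [1.0, 2.0, 3.0, 5.0],
})

# Mean outcome per condition -- the entire calculation the scripts kept botching
means = df.groupby("condition")["outcome"].mean()
print(means["A"], means["B"])  # 1.5 4.0
```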
My conclusion with this experiment was that AI has advanced an incredible amount, but I still couldn’t trust it to run a simple regression analysis — at least, not yet.
Experiment 2: AI Learning and Innovation Hub Prompt. The next thing I tried was one of the prompts that the AI Learning and Innovation Hub created for policy analysis with ChatGPT (this one; the full conversation is also available here). This was much more encouraging: I found that ChatGPT with this prompt did shockingly well at answering a research question I devised. Not perfect, but well beyond what I thought LLMs could do a few months ago.
My approach was to approximate the results of a 2011 paper by Tom Dee and Brian Jacob. I used this paper because I wanted to be able to easily check the chatbot's conclusions against a robust, well-done policy analysis. (Actually, this question should have been outside the domain of expertise of the persona elicited by the prompt I used, which was only supposed to answer questions about transportation policy. But it seemed happy to oblige my question regardless.) ChatGPT showed its work a little bit, with tables of data from the same public database that Tom Dee's group used, and so on. (It was even able to clarify which numbers it was looking at in the table!) This output was impressive because I didn't tell it what database to use or what analyses to run. The analysis gave results that were conceptually aligned with the Dee & Jacob paper, but with different numbers (an effect size of .11 instead of .22, but on the same subset of grade level/subject area).
So then I told it to use the specific method referenced in the Dee & Jacob paper. At that point, ChatGPT actually referenced the paper (I hadn't mentioned it by name in any of my prompts) and summarized its methodology. It said it couldn't run the same analysis directly because the NCES data wasn't available by state (not true) and because the designation of which states had pre-existing accountability frameworks wasn't available (true). Then ChatGPT reported the stats and effect sizes directly from the Dee & Jacob paper and compared them with the numbers it had derived in its previous analysis. It concluded that the causal analyses produced even stronger effects, about 2x stronger, which is true, and an impressively sophisticated conclusion.
On the suggestion of Chris Macrander, PERTS Director of Technology, I then asked ChatGPT to display its source data so that I could re-run the calculations it claimed to have done, in order to understand how it seemed to be reasoning about the data. I found that its output was consistent with the process it said it went through to derive the effect size estimate of 0.11, but with a couple of problems/inaccuracies:
- It correctly pulled information from the NCES website, but made a typo (?) in pulling the information from one of the tables for a core calculation. (234 instead of 235 for the year 2003)
- More significantly, it calculated the effect size from 2003 to 2007 instead of 2000 to 2007, despite saying that 2003 is the first post-NCLB observation. Maybe that could be justified if a researcher thinks one year isn’t long enough for a new policy to have an impact, but it’s not explained at all in the AI’s report, which makes me think it’s just one of those “truthy” inconsistencies.
- As noted above, it also wrongly claimed that NCES doesn't report state trajectories; they obviously do, it's right there in the table. But the AI was right that NCES doesn't report which states had pre-existing accountability systems, so its conclusion that it can't replicate the Dee & Jacob analysis is still valid.
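To make the baseline-year issue concrete, here is a toy calculation with made-up scale scores (not the actual NCES figures) showing how much the choice of baseline window can move a standardized effect estimate:

```python
# Hypothetical NAEP-style scale scores by year; illustrative only,
# NOT the actual NCES table values.
scores = {2000: 226.0, 2003: 234.0, 2007: 239.0}
sd = 36.0  # hypothetical standard deviation of the score scale

def effect_size(baseline_year, end_year=2007):
    """Standardized gain in scale scores from baseline_year to end_year."""
    return (scores[end_year] - scores[baseline_year]) / sd

short = effect_size(2003)  # measured from the first post-NCLB observation
full = effect_size(2000)   # measured from the pre-NCLB baseline
print(round(short, 2), round(full, 2))  # 0.14 0.36
```

With these invented numbers the shorter window cuts the estimate by more than half, so an unexplained, silently shifted baseline is exactly the kind of choice that can produce a large, unexplained discrepancy in a reported effect size.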
Overall Conclusions
My first conclusion from these experiments is that LLM-based AI services do seem to be doing web searches now, and at least "attempting" (if intentionality language is appropriate here) to perform some kind of real analysis of the data. Reasoning capabilities have improved as well, and the new LLMs offer at least some transparency (presumably) into their "thought process," which in principle could be helpful for debugging.
However, we're not ready to trust these systems with data analysis just yet (though we're eager to do so in the future). In order to trust AI systems with data analysis at PERTS, here is what we would need:
- Much higher accuracy with exact calculations. I would need to see at least 99% of exact calculations be correct before I could trust AI with data analysis. Right now the rate is more like 30%, at least in the experiments we ran above.
- More consistency in the errors. A human junior data analyst often produces imperfect output, but their errors are predictable. They don't fail to calculate means correctly for seemingly no reason; they make mistakes due to conceptual errors about the structure of the data and the like. The errors I saw from the AI systems ranged from typos and incorrectly constructed Python code with no easily understandable pattern (at best) to completely fabricated output (at worst).
- Better debugging/quality assurance (QA). Most importantly, we would need to be able to subject the results to QA testing more easily and to debug more effectively when the system gets things wrong. If it's true that it's writing Python functions on the fly, we need to be able to see those functions and submit them to our standard code review processes. Receiving a natural language summary of what the chatbot "thinks" it's doing is somewhat helpful, but to rely on an analysis in research papers, or even just for internal decision-making, we really need to see the full script. There also needs to be a way for us to edit the code, re-run the analysis, and regenerate the report. Otherwise the QA process would be too cumbersome to save time, even if the analysis were accurate in the end.
Interestingly, across these experiments I found a curious pattern in which the AI was right about the overall pattern in the data but wrong about the specific numbers in its analyses. In Experiment 1, Google Gemini correctly picked up on the treatment effect I programmed into the fake data, even though it was unable to calculate the means accurately. In Experiment 2, ChatGPT arrived at approximately the correct effect size for the right subgroup in the data, despite calculation errors. This makes me wonder if there could be some limited utility here for exploratory data analysis. Maybe we could use AI to identify qualitative patterns in the data, and then have humans write code to produce reproducible analyses of those patterns.
I would be more comfortable doing this if I understood more about what was going on under the hood. How is it deriving these patterns? Was it just linguistic priming? That is, did my prompts somehow reveal my own hypotheses about the data in subtle ways that the LLMs picked up on and reflected in the output? Experiment 1 suggests otherwise, because Gemini's first "analysis" occurred before I had input any fake data at all (facepalm), and the correct pattern did not emerge until I pasted in the fake data. It seems to me that the LLMs are probably picking up on co-occurring text strings in the data and tailoring the reported patterns in the mean and regression tables to those patterns.
While it’s not yet enough for me to “take it to the bank,” this possibility is sufficiently intriguing to keep playing around with. There could be a limited use case where we have low priors about the relationships between variables in a dataset, and the LLM could parse through all that text and let us know what variables would be interesting to check manually.
If anyone has insight into what’s going on here, or if you’ve used LLMs this way yourself, we would love to hear about it! You can contact me, Sarah Gripshover, at my LinkedIn page, and I’d be happy to share the fake dataset I used, as well as any other resources!

