
Summary: This article describes the methodology and motivation for the analysis decisions of our investigation of the individual submissions to the NTIA’s AI Accountability Request for Comment. For our discussion of results and implications for policy makers, please see here.
Specifically, this report describes how we created the system to perform this analysis and how we validated its results. Throughout this exercise, we also critiqued our own work and investigated the limitations of what we ultimately produced. We hope that communicating both the strengths and weaknesses of our system will give readers greater confidence in how it actually behaves and will contextualize our conclusions about people’s concerns regarding AI Accountability. We believe this transparency is important when evaluating systems, and it is one piece of the puzzle of building safer AI systems.
We’ve uploaded our code to compute the results here.
We believe that, when using complicated AI systems to analyze data, those presenting an analysis ought to be responsible for also presenting enough detail about their methodology to enable the reader to verify the results. In particular, this means providing not only the raw data, but also the justifications for the various decisions made, as well as the procedures and methodologies that were used.
Here, we present what we believe represents a positive example of using large language models (LLMs) to analyze a relatively large and unstructured dataset – the set of all individual submissions [1] to the NTIA request for comment on AI accountability.
We had two goals for this project:
The focus of this report is on the second (meta-level) question – how can we use these powerful models in order to create an analysis that we can actually trust? For more on the first question (what was actually said in the report), see our other post here.
The rest of this document is broken down as follows:
We selected the following 15 questions after several rounds of iteration as we felt that in their entirety they would cover a majority of concerns within the submissions:
We attempted to answer these questions for every individual (non-corporate/non-entity) submission to the RFC that was:
This resulted in a final dataset of 1,198 assessed submissions. For each valid submission, we answered each of the 15 questions by querying GPT-4 with the following prompt:
We used a temperature of 0.2 and made at least 3 queries, taking the most common answer as the actual answer.
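As a minimal sketch of this voting step (an illustration, not the project's actual code), the snippet below assumes a hypothetical `ask` callable that sends a prompt to the model at a given temperature and returns an already-parsed answer such as "yes" or "no". The function names and the tie-breaking rule are assumptions.

```python
from collections import Counter
from typing import Callable, List


def modal_answer(answers: List[str]) -> str:
    """Return the most common answer.

    Ties resolve to the answer seen first (an assumption; the
    tie-breaking rule is not specified in this report).
    """
    if not answers:
        raise ValueError("need at least one answer")
    return Counter(answers).most_common(1)[0][0]


def answer_question(
    ask: Callable[[str, float], str],  # hypothetical model wrapper
    prompt: str,
    n_samples: int = 3,
    temperature: float = 0.2,
) -> str:
    """Query the model n_samples times and keep the majority answer."""
    answers = [ask(prompt, temperature) for _ in range(n_samples)]
    return modal_answer(answers)
```

Using an odd number of samples avoids ties when the answers are binary.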
We validated our results by creating a “gold” set of labels by answering each of the 15 questions for 18 hand-selected submissions. We selected these submissions for diversity, and ensured that even for the most-skewed questions, we had at least some examples of both positive and negative answers. We selected 18 submissions as a compromise between the cost of doing additional submissions and the need to have a sufficiently large and diverse sample to get meaningful results.
In addition to the responses from GPT-4, we used the crowdsourcing platform Prolific to collect a large number of responses from human participants. In total, we recruited 406 participants, each of whom answered questions about 6 submissions: 1 taken from the gold set and 5 others. Participants were allowed to participate in multiple surveys on different sets of submissions; the most involved participant completed 11 such surveys. We excluded submissions longer than 2,000 characters from human labeling to avoid forcing people to read extremely long documents. In total, human participants rated 929 unique submissions. We took several steps to ensure we conducted our human data labeling in a fair and ethical manner; see Appendix A for details.
We performed quality assurance on the human responses at multiple overlapping levels:
In practice, we found that all these steps contributed to increasing the quality of the human labeling data, and our final dataset contains a high level of agreement among participants, with a high level of confidence – the most common answer to any question about any submission was “Unsure” only around 1% of the time. The final instructions given to participants can be seen in Appendix B.
Our final results on our gold set are presented below in Table 1. For each question, we present 3 metrics of reliability:
Additionally, we present the same metrics for the LLMs, using the most common answer across 3 independent LLM runs as the modal LLM answer.
Table 1: Comparing Prolific Participants and LLMs on the Gold Set (higher agreement is better, lower false positives and false negatives are better).

This data allows us to draw two conclusions:
We also observe that the reliability of both human and LLM answers varies strongly between questions. To reinforce this finding, we also report two metrics computed across all submissions, golden or otherwise (see Table 2 below):
These metrics also vary consistently across questions. The two questions asking whether a commenter is an artist or a software developer are the most reliably answered, while the questions asking about concerns that AI lacks human values or behaves out of control are the least reliable.
Table 2: Prolific Participants internal agreement on questions.

Overall, the high agreement between our gold label set, the answers given by other human participants, and the LLMs gave us enough confidence in our methodology to feel comfortable presenting the results in our post summarizing our takeaways from analyzing the NTIA RFC dataset.
Our process did not begin with any of the details described above in the sections about questions, methodology, and validation. Rather, they each evolved over time as we iterated towards creating a set of results in which we had sufficient confidence to feel comfortable sharing more broadly.
We started the process by exploring what was possible.
We created an initial question list, evaluation set, and the LLM infrastructure to ask our questions across all submissions. Through this exercise we started to better understand what questions we should ask. We also realized a need to look closely at the data, the importance of having a representative evaluation set, and the need for human labeling to help scale up the creation of this evaluation. This process gave us a rough outline of all the experiments that we felt were worth running on our system to see how far we could push performance.
After this initial exploratory work, our next step was to better understand how exactly we should handle particular intricacies of the dataset and which submissions we could not analyze. In particular, we elected to filter out submissions that did not match our criteria above (too short, too long, non-English, etc.); these criteria reflect fundamental limitations of our approach based on LLMs and human labeling. Because the large majority (95%) of submissions passed these filters, we were able to manually review the remainder to ensure that their exclusion did not substantially affect the final conclusions. This work also allowed us to notice issues like duplicate submissions or submissions that contained attachments, which we handled by adding the text of each attachment to the submission.
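For illustration only, preprocessing of this kind could look like the sketch below. The `Submission` type, the whitespace-and-case normalization used to detect duplicates, and the choice to drop (rather than merge) duplicates are all assumptions. The length bounds are required arguments because the actual thresholds are not restated here, and language detection is omitted.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple


@dataclass
class Submission:  # hypothetical container, not the project's schema
    submission_id: str
    text: str
    attachments: List[str] = field(default_factory=list)


def full_text(sub: Submission) -> str:
    """Append the text of each attachment to the submission body."""
    return "\n\n".join([sub.text, *sub.attachments]).strip()


def preprocess(
    subs: Iterable[Submission], min_chars: int, max_chars: int
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Split submissions into (kept, excluded_ids).

    Excluded IDs are returned so they can be reviewed by hand.
    """
    seen = set()
    kept, excluded = [], []
    for sub in subs:
        text = full_text(sub)
        # Assumed duplicate test: identical after collapsing whitespace/case.
        normalized = " ".join(text.split()).lower()
        if normalized in seen or not (min_chars <= len(text) <= max_chars):
            excluded.append(sub.submission_id)
            continue
        seen.add(normalized)
        kept.append((sub.submission_id, text))
    return kept, excluded
```

Returning the excluded IDs alongside the kept texts makes it straightforward to inspect the filtered-out remainder manually.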
Much of our work and experimentation was around selecting the questions themselves. Through our own internal preliminary investigation of the questions, we removed any questions where the answer was subjective or the wording of the question itself led to ambiguity. We found that even short answer or non-binary multiple choice questions led to disagreement among humans, enough so that we struggled to build enough confidence in the LLM responses.
We developed the questions and the golden set iteratively, using a combination of human labellers, language models, and our own judgements.
We started with a list of questions based on our own hypotheses for what individuals who submitted to the RFC may be concerned about. For example, one such question was “Does the author express concern about loss of privacy or misuse of personal or restricted data?”
We then labeled each submission with yes/no answers using a language model, and selected the set of 18 golden submissions which we sampled pseudorandomly with the constraint that, for each question, there was at least one golden submission for which the LLM answered “yes” to that question, and one submission for which it answered “no”. We also ensured that a variety of submission lengths were represented.
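One simple way to draw such a constrained sample is rejection sampling: draw random subsets until one satisfies the coverage constraint. The sketch below is an assumption about how this could be done, not a record of the exact procedure. `llm_labels` is a hypothetical mapping from submission ID to per-question "yes"/"no" answers, and the additional check on submission length variety is left out.

```python
import random
from typing import Dict, List, Optional


def sample_golden_set(
    llm_labels: Dict[str, Dict[str, str]],  # submission -> question -> answer
    k: int,
    seed: int = 0,
    max_tries: int = 10_000,
) -> Optional[List[str]]:
    """Pseudorandomly pick k submissions such that every question has
    at least one "yes" and one "no" among them. Returns None if no
    valid subset is found within max_tries draws."""
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    ids = sorted(llm_labels)
    questions = sorted({q for labels in llm_labels.values() for q in labels})
    for _ in range(max_tries):
        candidate = rng.sample(ids, k)
        covered = all(
            {"yes", "no"} <= {llm_labels[s].get(q) for s in candidate}
            for q in questions
        )
        if covered:
            return candidate
    return None
```

For example, `sample_golden_set(llm_labels, k=18)` would return 18 submission IDs, or `None` if the constraint could not be met.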
Next, we recruited human participants on Prolific to answer each question for each submission, using 10-fold recruitment to assess consensus among participants. We also answered these questions ourselves, as did other members of our team. We presented questions to humans as multiple-choice between “Yes”, “No” and “Unsure”.
We removed, edited, or rewrote any questions for which the majority answer was “Unsure”, or for which participants did not exhibit a consensus on all golden submissions. We iterated this procedure until all questions were answered unambiguously by human participants in our golden submission set. For example, our final version of the above question now reads: “Does the author express concern about loss of privacy?”
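The filtering rule in this step can be sketched as follows. The 80% agreement threshold is purely an assumption for illustration, since no numeric definition of "consensus" is given here, and the function name is hypothetical.

```python
from collections import Counter
from typing import Dict, List


def question_is_unambiguous(
    responses_by_submission: Dict[str, List[str]],
    min_agreement: float = 0.8,  # assumed consensus threshold
) -> bool:
    """Return True if, for every golden submission, the majority answer
    is not "Unsure" and at least min_agreement of raters gave it."""
    for responses in responses_by_submission.values():
        if not responses:
            return False  # no ratings means no evidence of consensus
        top, count = Counter(responses).most_common(1)[0]
        if top == "Unsure" or count / len(responses) < min_agreement:
            return False
    return True
```

A question that fails this check would be a candidate for removal or rewording before the next round of labeling.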
Through this iterative procedure, we were able to arrive at a set of questions that was sufficiently unambiguous to function as an objective “gold” data set. This is what allowed us to benchmark both the LLMs as well as future human participants against a set of data that we could actually trust was correct.
We iteratively developed each of the quality assurance datasets as we began collecting small-scale versions of our human label data. For example, in one of the first runs, we noticed large variability in how long participants took to read submissions and answer questions. This led us to explicitly instruct participants about how much time they should spend reading, which largely alleviated the issue. We got a lot of value from being able to quickly run small-scale studies to test and refine our methodology, and ran about 10 studies over the course of the project.
We also introduced two extra questions to help us identify any concerns we may have missed and to perform sentiment analysis, which our LLM struggled with in our early experiments. Extra questions provided to human labellers:
Throughout the project, we constantly experimented with various prompts and methods of eliciting answers from the LLMs for all the questions we considered running. We often evaluated their performance on a small number of examples by hand when iterating, and on a larger number of examples as we started developing the ground truth and other human data.
We conducted our LLM experiments with off-the-shelf language models, specifically OpenAI’s GPT-4 and Anthropic’s Claude-2, so that our experiments and results could be replicated by others. As a first step, we iterated on simple prompting strategies until we felt confident the model was answering approximately correctly. This process consisted of two steps:
Given a prompt, we identified all golden submissions and questions where the LLM deviated from the golden answer. For each incorrectly answered question, we visualized the submission, prompts, and answers. We manually investigated this information to identify weaknesses in our prompting strategy and edited the prompt accordingly.
The prompt presented in the Methodology section is the one we settled on after this exploratory process. In particular, the final paragraph instructing the model to provide its answer between [answer] and [/answer] tags is helpful for ensuring that Claude-2 actually outputs the answer in the correct format.
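Extracting the answer from a completion that uses these tags can be done with a short regular expression. This is a minimal sketch: taking the last tagged span (in case the model repeats the tags while reasoning) and lowercasing the result are assumptions, not documented behavior of our pipeline.

```python
import re
from typing import Optional

# Matches text between [answer] and [/answer], across line breaks.
ANSWER_RE = re.compile(r"\[answer\](.*?)\[/answer\]", re.IGNORECASE | re.DOTALL)


def extract_answer(completion: str) -> Optional[str]:
    """Return the normalized text of the last [answer]...[/answer] span,
    or None if the model did not follow the output format."""
    matches = ANSWER_RE.findall(completion)
    if not matches:
        return None
    return matches[-1].strip().lower()
```

Returning `None` for malformed completions lets the caller decide whether to retry the query or discard the sample.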
Once we settled on a preliminary prompt and final golden evaluation set, we were then free to iterate on our system and experiment with alternatives that could boost agreement between LLMs and golden answers. Concretely, our quantitative metric was the percentage of golden questions and submissions for which the LLM’s output matched the golden answer, which we call the LLM-gold agreement. We also broke down any mismatches into false positives and false negatives, to assess any response bias in the models.
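The LLM-gold agreement and its breakdown can be computed along the following lines. This sketch assumes binary "yes"/"no" labels keyed by hypothetical (submission, question) pairs, and it reports false positives and false negatives as fractions of all compared pairs; that denominator is an assumption.

```python
from typing import Dict, Tuple

Key = Tuple[str, str]  # (submission_id, question_id), hypothetical IDs


def gold_agreement(
    gold: Dict[Key, str], predicted: Dict[Key, str]
) -> Dict[str, float]:
    """Compare predicted answers against gold answers on shared keys."""
    keys = gold.keys() & predicted.keys()
    if not keys:
        raise ValueError("no overlapping (submission, question) pairs")
    n = len(keys)
    agree = sum(predicted[k] == gold[k] for k in keys)
    # False positive: model says "yes" where the gold answer is "no".
    fp = sum(predicted[k] == "yes" and gold[k] == "no" for k in keys)
    # False negative: model says "no" where the gold answer is "yes".
    fn = sum(predicted[k] == "no" and gold[k] == "yes" for k in keys)
    return {
        "agreement": agree / n,
        "false_positive": fp / n,
        "false_negative": fn / n,
    }
```

A large gap between the false positive and false negative rates would indicate a response bias toward one answer.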
Additionally, we ran experiments where we changed LLM inference parameters to determine how to maximize the gold agreement. The specific parameters we tested were:
We found that, for both Claude-2 and GPT-4, results improved when we used a temperature of 0.2 and generated at least 3 samples. Although we could generate more samples, this would make the prompting process more expensive and time-consuming. We also found that GPT-4 outperformed Claude-2 in our use case. Finally, we experimented with the following three prompting strategies:
Surprisingly, the simplest strategy matched or outperformed these more complicated prompting strategies on our evaluation set. We observed some behaviors that may serve to explain why this happened:
Early on, we tried to investigate RFC submissions by asking language models to summarize their content. While this worked to some extent, we found that summaries often hid important content and did not accurately reflect the original commenter’s concerns. We eventually abandoned this approach in favor of our final question-based approach.
We also tried encouraging language models to give responses with an indicator of confidence by asking them to report answers on a 5-point scale. However, this data was unreliable, and calibrating scales between questions ended up being complicated (what does it mean to have “some concern”?). Therefore, we abandoned the 5-point scale and turned to a yes/no answer for all questions except for “Does the author feel optimistic about AI?” which was only answered by human labellers.
Similarly, we had trouble processing the model’s responses to open-ended questions, such as “What does the author do for a living?” Instead, we converted these questions to a yes/no format, replacing them with specific questions asking if the author was an artist or if they were a software engineer.
Finally, we learned that formulating questions in great detail tends to confuse both language models and human labellers. This led us to simplify our questions over time - for example, we decided to ask for concerns about “privacy” instead of “privacy and misuse of personal or restricted data”.
While we have tried to be as thorough as reasonably possible with this analysis, there remain many limitations at both the object level (e.g., what we can conclude about people’s concerns about AI) as well as at the meta level (e.g., how confident we can be in our conclusions).
In our view, the primary limitations of the analysis itself are as follows.
These limitations are extremely serious. One should not construe our analysis as saying something about the broader questions about AI risks, dangers, and harms because the underlying dataset simply does not allow such broad statements.
This is a major part of the reason why we are focused on the methodology — in a sense, the way in which we have answered these questions is somewhat more interesting than the questions or answers themselves.
At the object level, our analysis suffers from additional limitations: did we ask the right questions? Did we phrase them correctly? Do those questions cover the issues that were actually raised in the responses? Are the questions sufficiently unambiguous? Even taking as fixed the questions we ultimately decided to ask, there are also limitations in how confident we can even be in our answers to those questions.
Unlike the limitations on our analysis, the limitations on the methodology are somewhat less serious. Still, there are factors that could have introduced noise or error into our answers:
There are certainly other limitations of our methodology, though overall, we feel relatively confident that, at least for the questions we ended up with and the dataset we were working with, we were able to answer questions in a way that seems likely to have been relatively accurate. However, it is critical to remember that this was only possible because of our iteration on those questions — naively asking an LLM random questions about a set of documents does, empirically in the case of this dataset, lead to significantly less accurate results.
In this post, we presented a methodology for using LLMs to answer questions about a large body of text documents, using our analysis of the NTIA RFC as a particular example. We make no claims that this methodology applies beyond the scope of this particular analysis, though we hope that the extra details at least help to inspire confidence in the numbers that we report in our associated blog post, as well as to justify the decisions that we made during the project. We also highlighted some serious limitations, both of the dataset itself (and the strength of conclusions that can be drawn from it) and, to a lesser extent, our analysis choices and the level of confidence that we can have in our results.
In the future, we hope to see improved AI systems which are both more ergonomic to work with, and which can be more easily used to reach accurate conclusions more quickly. Until that point though, we would highly recommend that users of these systems take care and use good judgment in applying them to specific situations, spending time to dig into the data and validate that the results they are getting are in fact reflective of reality, and ideally share that work with downstream consumers of that information to foster trust and reproducible analyses.
We took a number of steps to ensure that our human data was collected in an ethical and appropriate manner:
To instruct human participants on the labeling task, we provided these instructions on separate screens:
Next, we instructed participants to answer three quiz questions:
If participants failed any of these simple questions, we instructed them to re-read the instructions and re-do the quiz. We set up our task to automatically exclude any participants who failed the quiz three times, but this never occurred in practice.
Finally, after a participant read a submission, we presented them with a multiple-choice question, “which of the following sentences appeared in the submission you just read”, with the following choices: