My Answer to Two Questions I’m Often Asked About AI and Grading

One is About Teacher Duty, and the Other is About Grading Accuracy

Peter Paccone
10 min read · Oct 7, 2024

I’m a recently retired AP US History teacher, a consultant for Class Companion, an AI-powered instant-feedback platform, and someone who frequently speaks and blogs about a range of AI-in-education topics.

This past year, whether during the Q&A sessions of teacher conference presentations I’ve given or webinars I’ve hosted, I’ve increasingly been asked the following two questions:

  1. Isn’t it the teacher’s job to grade FRQs?
  2. Can AI accurately grade students’ FRQ responses?

In what follows, I provide context for each question and then share my personal response based on my experiences and insights.

But first, it’s important to clarify the difference between grading and feedback. Grading refers to assigning a score or grade to a student’s work based on specific criteria, while feedback involves providing guidance or suggestions aimed at improving the student’s understanding or performance. AI can be used for both purposes: to quickly assign scores (grading) and to provide instant, constructive comments (feedback). This distinction is key to understanding how AI can support teachers in the classroom.

This post focuses solely on AI and grading, not on AI and feedback.

Question 1: Isn’t it the Teacher’s Job to Grade FRQs?

Leon Furze, in his blog post Don’t Use GenAI to Grade Student Work, certainly seems to think so, arguing in essence that grading should remain a human task because it requires a nuanced understanding and contextual awareness that AI cannot replicate. According to Furze, by delegating grading to AI, teachers risk losing the personal touch that helps guide students’ growth in writing and critical thinking.

This viewpoint resonates with 32% of teachers recently polled in various Facebook groups for educators and other online teacher communities. Here are some comments made by those who believe that it’s, indeed, the teacher's job to grade FRQs.

  • ‘Teachers who use AI for grading are engaging in a “dereliction of duty.”’
  • ‘When we use AI to handle grading, we risk losing our direct engagement with students’ work, which is vital for truly understanding their progress and challenges. It feels like a shortcut that undermines our duty to foster student growth in a meaningful way.’
  • ‘Grading most definitely is the teacher’s job. It’s how we connect with our students and understand their progress in a way that AI simply can’t replicate.’
  • ‘If we start depending on AI to handle grading, what’s next? Using AI to grade is doing nothing more than paving the way for replacing teachers altogether.’
  • ‘If I tell the students they can’t use AI to write it, I feel highly uncomfortable using AI to grade it.’
  • ‘AI grading is a shortcut that undermines the teaching profession.’
  • ‘Grading is a teacher’s job, not a machine’s.’
  • ‘At this point, without proper oversight, using AI for grading purposes is poor practice and, just as importantly, a poor model for students.’

While 32% of teachers view grading as the teacher's job, a responsibility that should remain exclusively human, 48% see AI as a tool that can and should be used to help them grade summatively, with the remaining 20% choosing to remain neutral.

My Response to Those Who Say It’s the Teacher’s Job to Grade Student Work

I handle this question by doing four things:

First, I answer the question directly, saying that, in my opinion, a teacher’s job, above all else, is to facilitate student learning. So, if using AI tools to grade student work helps students improve their writing and enhances their overall learning, then it is the teacher’s responsibility to use these tools. On the other hand, if AI grading hinders student growth, it should be avoided.

Second, I share a table showing just a few of the ways an APUSH, APWH, or APEuro teacher might use AI to grade their students’ FRQs:

The table suggests that the decision of whether teachers should rely solely on their judgment or use AI to grade students’ FRQ responses is not a straightforward yes-or-no choice.

Third, I describe how I personally use AI in my APUSH course, saying that I annually assign a large number of FRQs: countless SAQs, 20–25 LEQs, and 10–15 DBQs.

  • For the SAQs: I use AI to score (grade) all of them. Periodically and without prior notice, I select 6–9 scores to be recorded in the gradebook. However, I encourage students to dispute these scores if they feel it’s necessary.
  • For the LEQs: AI scores (grades) all of them, but I do not enter any of these scores into the gradebook, as they are meant for practice.
  • For the DBQs: AI scores all but a few (2–3 typically). I personally grade these, with the grade entered into the gradebook significantly impacting the students’ overall grades. The rest are treated as practice and are not recorded. (For the curious, a toy sketch of this whole routing policy appears just after this list.)
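For readers who like to see a policy spelled out, here is a toy Python sketch of the routing logic described above. Every name and number in it is illustrative of my classroom practice, not code from Class Companion or any other platform.

```python
# Toy sketch: AI scores everything, but only a random, unannounced subset of
# SAQ scores is recorded, LEQs stay formative, and a few DBQs are reserved
# for teacher grading. All names and counts are illustrative.
import random

def route_frq(frq_type: str, num_assigned: int) -> dict:
    """Decide which AI-scored responses get recorded or teacher-graded."""
    if frq_type == "SAQ":
        # Record 6-9 randomly chosen AI scores; students can dispute them.
        recorded = random.sample(range(num_assigned), k=random.randint(6, 9))
        return {"ai_scored": num_assigned, "recorded_in_gradebook": sorted(recorded)}
    if frq_type == "LEQ":
        # AI scores all of them, but nothing is recorded: practice only.
        return {"ai_scored": num_assigned, "recorded_in_gradebook": []}
    if frq_type == "DBQ":
        # Hold back 2-3 for the teacher to grade personally and record.
        teacher_graded = random.sample(range(num_assigned), k=random.randint(2, 3))
        return {"ai_scored": num_assigned - len(teacher_graded),
                "teacher_graded_and_recorded": sorted(teacher_graded)}
    raise ValueError(f"Unknown FRQ type: {frq_type}")

print(route_frq("SAQ", 30))   # e.g., record 7 of 30 AI-scored SAQs
print(route_frq("LEQ", 22))   # all practice, nothing recorded
print(route_frq("DBQ", 12))   # 2-3 teacher-graded, the rest practice
```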

Going this route has allowed me to significantly reduce the time I spend grading while dramatically increasing the number of writing opportunities I provide. This frequent practice, paired with immediate AI feedback, has led to noticeable improvements in my students’ mastery of the skills needed to respond effectively to the various FRQs placed before them on APUSH exam day.

Finally, I ask those who believe grading is solely the teacher’s responsibility the two questions appearing below:

  1. Do you think a year from now most teachers will still approach the question of AI grading as a strict yes/no proposition? Or will there be a shift, with more educators acknowledging that AI can and should be used to grade some kinds of FRQs — like SAQs, for example — or certain portions of LEQs and DBQs, such as whether the student has earned the contextualization or thesis point?
  2. Do you think a year from now, we’ll see a significant increase in the number of teachers who recognize the value of assigning large numbers of FRQs for practice, then having AI assess and provide instant feedback on student responses, but strictly for formative purposes?

Question 2: Can AI Accurately Grade Students’ FRQ Responses?

There are generally two types of teachers who ask me this question — those who genuinely want to know the answer and those looking for an opportunity to highlight what they perceive as flaws in the tool.

In either case, teachers who argue against AI grading have been heard to say things like:

  • ‘AI is often prone to making grading errors.’
  • ‘I highly recommend that teachers avoid using AI, especially for grading and summative assessments, and that’s because, when I have it score a College Board-graded essay, it usually assigns a score that’s 1–2 points higher. AI just doesn’t get it. It’s way too lenient.’
  • ‘If AI could grade accurately, don’t you think by now an organization like College Board would have bought one of those ed-tech grading start-ups we’ve been hearing so much about? But it hasn’t happened. Nor has there been any word leaked out about CB planning to introduce AI grading for AP essays in the coming years. The reason, I’m guessing, is that one of the biggest players in the game has reached the same conclusion many of us have: AI just can’t accurately grade students’ written work.’

My Response to Those Who Say AI Can’t Accurately Grade Students’ FRQ Responses

To claim AI can’t accurately grade is nonsense. So, too, is the claim that it is often prone to making mistakes and is way too lenient. If this were true, Texas would never have used AI to grade its STAAR tests this year.

But this is not to say that I think AI is the perfect grader — it’s not.

From experience, I’ve learned that AI performs differently depending on the type of FRQ.

  • SAQs: In my experience, AI is quite accurate when it comes to the grading of Short Answer Questions. While it might not capture every nuance perfectly, it’s generally consistent with the scoring guidelines.
  • LEQs: For Long Essay Questions, AI performs well overall, but it sometimes struggles with more nuanced aspects, such as assessing the complexity point. However, given that human readers often face similar challenges, this isn’t unique to AI.
  • DBQs: Document-based questions are the most challenging for AI to score accurately, particularly regarding sourcing and complexity points. However, it generally scores well on other aspects, such as the thesis, contextualization, and initial evidence. That said, if there is a discrepancy across the 7 DBQ points, it almost always seems to fall within a 1–2 point range. Not more.

Over the past year, I’ve also learned that to improve AI’s accuracy, LEQ and DBQ assignments should be broken down into steps. In other words:

  1. Instead of asking students to respond to an entire LEQ or DBQ at once and then having the AI score it, have students work on earning the points one at a time. Use AI to assess the attempt to earn each point before they move on to the next point.
  2. For the sourcing point, the AI should be directed to award the point only when one specific method has been properly used, with that method clearly specified upfront. The student should not be choosing from the four options; instead, they should use the method outlined at the beginning of the assignment. In my class, sure, I teach all four sourcing methods — Historical Context, Intended Audience, Purpose, and Point of View — but I instruct students to focus solely on Historical Context for their responses. By narrowing down the options and making the expectations clear, the AI is much more likely to grade their attempts accurately.
  3. When it comes to the complexity point, if the AI is directed to award the point only when the student uses a specific method, it becomes much more accurate. In my class, I teach all possible strategies, but I make it clear that I will only award the point if students use my “tack-it-on” approach. This involves adding a standalone paragraph at the end of their essay that presents a counterargument, nuance, or additional insight. Focusing the AI on a single, clear strategy, like the “tack-it-on” approach, makes it far more likely to assess the student’s attempt accurately. (A minimal code sketch of this single-point, single-method idea appears after this list.)
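To make steps 2 and 3 concrete, here is a minimal Python sketch of what a single-point, single-method check might look like if a teacher or developer wired one up with a general-purpose model. The model choice, rubric wording, and grade_sourcing_attempt helper are my own illustrative assumptions; this is not how Class Companion or any other grading platform actually works under the hood.

```python
# A minimal sketch of a single-point, single-method AI check, assuming the
# OpenAI Python client (pip install openai). Model name, rubric wording, and
# sample inputs are illustrative assumptions, not any platform's internals.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SOURCING_RUBRIC = """You are grading ONE point on an AP History DBQ: the sourcing point.
Award the point ONLY if the student explains the Historical Context of the
document (the one method assigned for this task). Do NOT award it for Intended
Audience, Purpose, or Point of View, even if those are done well.
Reply with exactly one line, "POINT: 1" or "POINT: 0", followed by one
sentence of justification."""

def grade_sourcing_attempt(document_excerpt: str, student_response: str) -> str:
    """Ask the model to rule on a single rubric point, using a single method."""
    completion = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        temperature=0,   # keep scoring as repeatable as possible
        messages=[
            {"role": "system", "content": SOURCING_RUBRIC},
            {"role": "user", "content": (
                f"Document:\n{document_excerpt}\n\n"
                f"Student response:\n{student_response}"
            )},
        ],
    )
    return completion.choices[0].message.content

if __name__ == "__main__":
    print(grade_sourcing_attempt(
        "Excerpt from a 1765 colonial pamphlet protesting the Stamp Act...",
        "The pamphlet appeared just after Parliament passed the Stamp Act, "
        "when colonial anger over taxation without representation was rising.",
    ))
```

The design choice mirrors the advice above: the prompt names exactly one point and one method, so the model never has to juggle the full seven-point rubric at once.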

To summarize, AI’s accuracy varies by FRQ type: it works very well with SAQs, performs fairly well with LEQs, and finds DBQs the most challenging, particularly the sourcing and complexity points. To improve its accuracy, break LEQ and DBQ assignments into smaller steps, have students earn points one at a time, and use AI to assess each attempt before they move on. For the sourcing point, specify a single method upfront, such as Historical Context, to guide the AI’s evaluation. For the complexity point, require a single clear strategy, such as the “tack-it-on” paragraph with its counterargument or nuance. By simplifying tasks and setting clear expectations, you make AI a more reliable grader.

Something else I’ve learned over the past year — those who are most likely to claim that AI ‘can’t grade accurately’ tend to be teachers with many years of experience, a well-earned reputation for expertise in the teaching and scoring of student writing, and an active presence in Facebook groups and AP online communities, where they are admired for their FRQ-related knowledge and guidance and often remind others that they have been a ‘reader.’

In other words, I’m suggesting that many of those who take a hardline ‘no use of AI for grading’ stance may be driven by more than their legitimate concerns, even if the one related to accuracy genuinely tops the list. They may also be motivated by an underlying fear that their expertise and the role they play within the AP community are being challenged. If AI tools gain more ground in classrooms, they might worry that they won’t be the go-to source for advice, which could make them feel less needed. This concern is certainly understandable, but, let’s face it, it can also lead to an overly harsh dismissal of tools that, when used thoughtfully, can actually support both teachers and students.

Now I’m not saying these critics don’t have valid points or that they’re only motivated by self-interest; rather, I’m suggesting that their skepticism may also stem from a fear of losing the unique role and influence they’ve built within the teaching community.

If only the critics were to say that AI isn’t 100% accurate and leave it at that, we could agree. AI isn’t 100% accurate, which is why I believe that teachers, as I’ve already made clear, shouldn’t rely exclusively on AI for grading. When it comes to the scores that truly matter, they should either self-check a certain percentage of assessments or, as I do, grade them all personally without AI.

Conclusion

In this post, I’ve addressed two key questions about the use of AI in grading. The first is whether teachers should rely solely on their own judgment to grade their students’ FRQ responses. My answer is an unequivocal ‘no.’ Teachers should not refrain from using AI for grading. They should explore it, lean in, and maybe even embrace it. It’s in the best interest of all students, I’m convinced.

My answer to the second question, the one about accuracy, is far more nuanced. While AI generally handles elements like the thesis, contextualization, and initial evidence quite well, it can sometimes struggle with nuances, especially the sourcing and complexity points of DBQs.

This is why a balanced approach is necessary: use AI where it excels and rely on teacher expertise for more complex evaluations.

By thoughtfully integrating AI into the FRQ grading process, we can best support student learning and development without sacrificing the quality of assessment.

Sidenote #1

Though I am a consultant for Class Companion, I wrote and published this post independently, without any input from anyone at Class Companion or any other AI grading or instant feedback platform. Nor was I encouraged or influenced by any organization to write this post. The observations, experiences, and conclusions shared here are entirely my own.

Sidenote #2

That said, I do want to extend a special thanks to APUSH teacher Richard Vanden Bosch for providing me with invaluable guidance and several great suggestions for improving this post.

Sidenote #3

Since publishing this post, a teacher friend reached out to highlight an important point in the AI grading debate: all teachers agree that their students matter. With this in mind, my friend suggested that teachers should poll their students to find out what they collectively want.

In this regard, he suggested a potential poll question: “How do you feel about me using AI tools like ChatGPT and/or Class Companion to grade your writing?”

Then, to give students something to think about before answering the question, he encouraged teachers to include arguments both for and against AI grading.

The possible responses to the poll question could be:

  • Totally fine, even if we can’t use it for writing.
  • Not cool if we’re not allowed to use it too.
  • I’m on the fence about it.
  • As long as the grading is fair, I’m good.
