The essential problem in data-driven assessment is called overfitting: a model trained on a small dataset learns the quirks of that data instead of patterns that carry over to new essays. The grading software must compare essays, work out which parts are strong and which are weak, and then condense this into a number that constitutes the grade, which in turn must be comparable with the grade of a different essay on a totally different topic. Sounds hard, doesn’t it? That’s because it is. Very hard. But still, not impossible. Google uses similar tactics when deciding which texts and images best match a given search term. The difference is that Google uses millions of data samples for its approximations. A single school could, at best, input a few thousand essays. This is like trying to solve a 1000-piece puzzle with just 50 pieces. Sure, some pieces can end up in the right place, but it’s mostly guesswork. Until there is a humongous database of millions and millions of essays, this problem will most likely be hard to work around.
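A minimal Python sketch of what overfitting looks like in practice. Everything here is an assumption made for illustration: the toy link between word count and grade, the noise level, and the "memorizer" model, which simply copies the grade of the most similar essay it has already seen. It is not how any real grading product works.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

def true_grade(word_count):
    # Assumed toy relationship: the grade rises slowly with essay length.
    return 2.0 + 0.004 * word_count

def sample(n):
    """Generate n (word_count, grade) pairs with random grading noise."""
    data = []
    for _ in range(n):
        wc = rng.uniform(100, 900)
        data.append((wc, true_grade(wc) + rng.gauss(0, 0.5)))
    return data

def memorizer_predict(train, word_count):
    # 1-nearest-neighbour: return the grade of the closest training essay.
    return min(train, key=lambda pair: abs(pair[0] - word_count))[1]

def mean_squared_error(train, data):
    errors = [(memorizer_predict(train, wc) - g) ** 2 for wc, g in data]
    return sum(errors) / len(errors)

train = sample(10)    # the "few essays" a single school can supply
test = sample(500)    # new essays the model has never seen

train_error = mean_squared_error(train, train)
test_error = mean_squared_error(train, test)
print(f"error on training essays: {train_error:.3f}")
print(f"error on unseen essays:   {test_error:.3f}")
```

The memorizer is perfect on the ten essays it was built from, because it has stored their noise along with their signal, and it does worse on essays it has not seen. That gap between training error and error on new data is the symptom of overfitting.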
The only plausible solution to overfitting is to specify a set of rules for the computer to follow when determining whether a text makes sense, since computers can’t read. This approach has worked in many other applications. Right now, auto-grading vendors are throwing everything they’ve got at coming up with these rules; it is just very hard to write a rule that decides the quality of creative work such as essays. Computers tend to solve problems the way they usually do: by counting.
In auto-grading, the grade predictors could be, for example, sentence length, the number of words, the number of verbs, the number of complex words and so on. Do these rules make for a sensible assessment? Not according to Perelman, at least. He says that the prediction rules are often set in a very rigid and limited way, which restricts the quality of these assessments. For example, he has found that:
- A longer essay is considered better than a short one (a coincidence according to auto-grading advocate and professor Mark D. Shermis)
- Specific words associated with complex thinking, such as ‘moreover’ and ‘however’, lead to better grades
- Towering words such as ‘avarice’ give more points than simple ones such as ‘greed’
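The counting-based predictors described above can be sketched in a few lines of Python. This is an illustration only: the transition-word list, the "seven letters or more counts as complex" rule and the weights are all invented here and are not taken from any vendor's system.

```python
import re

# Hypothetical lexicon; real systems would use their own word lists.
TRANSITION_WORDS = {"moreover", "however", "furthermore", "therefore"}

def extract_features(text):
    """Return simple surface counts of the kind listed above."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    return {
        "word_count": len(words),
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        "transition_words": sum(w in TRANSITION_WORDS for w in words),
        # Assumption: treat any word of 7+ letters as "complex".
        "long_words": sum(len(w) >= 7 for w in words),
    }

# Made-up weights, purely to show how counts could be turned into a score.
WEIGHTS = {
    "word_count": 0.01,
    "avg_sentence_length": 0.05,
    "transition_words": 0.5,
    "long_words": 0.1,
}

def naive_score(text):
    features = extract_features(text)
    return sum(WEIGHTS[name] * value for name, value in features.items())

plain = "Greed is bad. It hurts people."
padded = ("Moreover, avarice is detrimental. "
          "However, it invariably devastates communities.")

print(extract_features(plain))
print(extract_features(padded))
print(naive_score(plain) < naive_score(padded))
```

Both texts make roughly the same point, yet the second scores higher under these invented weights simply because it is longer and uses transition words and longer vocabulary. Nothing in the counts checks whether the content is true or sensible.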
In other instances he found examples of rules poorly applied or not applied at all. The software could not, for example, determine whether facts were true or false. In one published and automatically graded essay, the task was to discuss the main reasons why a college education is so expensive. Perelman argued that the explanation lies with greedy teaching assistants, who earn six times the salary of a college president and regularly use their complimentary private jets for South Sea vacations.
The essay was awarded the highest grade possible: 6/6.
To avoid the examining eye of Perelman and his peers, most vendors have restricted use of their software while development is still ongoing. Perelman hasn’t yet gotten his hands on the most prominent systems and admits that so far he has only been able to fool a couple of systems.
If we are to believe Perelman’s claims, automatic grading of college-level essays still has a long way to go. But remember that essays in lower grades are already being graded by computers today. Granted, this happens under meticulous human supervision, but technological progress can move fast. Considering how much effort is being exerted towards perfecting automatic grading, it is likely we will see a fast expansion in the not-too-distant future.
About the author: Hubert.ai is a young edtech company based in Stockholm, Sweden. We are working to disrupt teacher feedback by using AI conversational dialog with every student separately. Feedback is then analyzed and compiled down to a few recommendations on how you as a teacher can improve your skills and methods. Are you a teacher and would like to help us in development? Please sign up as a beta tester at our website :]
How U of Michigan Built Automated Essay-Scoring Software to Fill ‘Feedback Gap’ for Student Writing
By Jessica Leigh Brown Jun 6, 2017
The University of Michigan’s M-Write program is built on the idea that students learn best when they write about what they’re studying, rather than taking multiple-choice tests. The university has created a way for automated software to give students in large STEM courses feedback on their writing in cases where professors don’t have time to grade hundreds of essays.
The M-Write program started in 2015 as a way to give more writing feedback to students by enlisting other students to serve as peer mentors to help with revisions. This fall, the program will add automated text analysis, or ATA, to its toolbox, primarily to identify students who need extra help.
Senior lecturer Brenda Gunderson teaches a statistics course that will be first to adopt the automated element of M-Write. “It’s a large gateway course with about 2,000 students enrolled every semester,” Gunderson says. “We always have written exams, but it never hurts to have students communicate more through writing.”
As part of the M-Write program, Gunderson introduced a series of writing prompts in the course last year. The prompts are targeted to elicit specific responses that clearly indicate how well students grasp the concepts covered in class. Students who chose to participate in the program completed the writing assignments, submitted them electronically, and received three of their peers’ assignments for review. “We also hired students who’d previously done well in the course as writing fellows,” Gunderson says. “Each fellow is assigned to a group of students and is available to help them with the revision process.”
Rising senior Brittany Tang has been a writing fellow in the M-Write program for the past three semesters. “Right now, I have 60 students in two lab sections,” she says. “After every semester, professors and fellows review every student submission from the class and score them based on a rubric.”
To build the automated system, a software development team used that data to create course-specific algorithms that can identify students who are struggling to understand concepts.
“In developing this ATA system, we needed to go through the pilot project and have students do the writing assignments to collect the data,” Gunderson says. “This fall, we’ll be ready to roll out the program to all the students in the course.” Gunderson is also incorporating eCoach, a personalized student messaging system developed by a research team at U-M, to provide students with targeted advice based on their performance.
When a student submits a writing assignment, the ATA system will generate a score. After a writing fellow quickly reviews it, the score gets delivered to the student via the eCoach system. The student then has an opportunity to revise and resubmit the piece based on the combination of feedback from the assigned writing fellow, the ATA system, and peer review.
Filling the Feedback Gap
The university’s launch of ATA is part of a growing nationwide trend in both K-12 and higher education classrooms, according to Joshua Wilson, assistant professor of education at the University of Delaware. Wilson researches the application of automated essay scoring. “I project the fastest adoption in the K-12 arena, and pretty quick adoption at community colleges, where it is helpful for remedial English courses,” Wilson says. “U-M presents a really interesting model of adoption. It has required them to build a content-specific system, but there’s really a demand for that among faculty who aren’t trained to teach writing.”
Wilson says ATA’s critics dislike the systems because they seem to remove the human element from essay grading—a traditionally personal act. But in reality, systems are being “taught” how to respond by their human programmers. “Systems are designed by looking closely at a large body of representative student work and the strengths and weaknesses of those papers,” he says. “Essentially, they provide a subset to the computer and they develop a model used to evaluate future papers.”
While a computer program will never give the same depth of feedback a professor can, Wilson says these systems could fill a growing gap in many K-12 and higher education classrooms. “I think people who outright reject these systems forget what the status quo is. Unfortunately, we know that instructors don’t give enough feedback, often because the teacher-student ratio is such that they don’t have time.”
In Wilson’s view, ATA feedback isn’t as good as human feedback, but it’s better than nothing—and the quality is improving all the time. “Obviously, a computer can’t understand language the same way we can, but it can identify lexical proxies that, combined with machine learning, can produce a score that’s very consistent with a score given by humans, even though humans are reading it in a different way.”
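Wilson's description of fitting a model to human-scored papers can be illustrated with a toy example. The Python sketch below is not M-Write's or any vendor's method. It assumes a single lexical proxy (word count), six invented essays with invented human scores on a 1-6 scale, and an ordinary least-squares line standing in for the machine learning step.

```python
# Invented training data: (word_count, human_score on a 1-6 scale).
scored_essays = [(120, 2), (250, 3), (310, 3), (400, 4), (520, 5), (610, 5)]

def fit_line(pairs):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(scored_essays)

def predict(word_count):
    # Clamp to the 1-6 scale and round to a whole grade.
    return max(1, min(6, round(slope * word_count + intercept)))

# "Adjacent agreement": share of essays where the machine score and the
# human score differ by at most one point.
matches = sum(abs(predict(x) - y) <= 1 for x, y in scored_essays)
agreement = matches / len(scored_essays)
print(f"slope={slope:.4f}, adjacent agreement={agreement:.0%}")
```

The agreement here is measured on the same essays the line was fitted to, so it is optimistic. A real evaluation would compare machine and human scores on essays held back from training, and would use many more proxies than length alone.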