Can GenAI outperform Australian law students?

An NSW-based law lecturer recently undertook an experiment, pitting his criminal law cohort against 10 separate AI-generated responses for an end-of-semester exam. The results might surprise you.

Jerome Doraisamy · 25 September 2024 · Big Law

Since generative AI (GenAI) exploded into mainstream consciousness, much has been made of its capacity to perform the duties of legal professionals, reviving the discourse around lawyers being replaced by emerging technology.

Dr Armin Alimardani, a lecturer in law and emerging technologies at the University of Wollongong (UOW), has been investigating whether GenAI can outperform law students – or, indeed, an overwhelming majority of them.

His findings form the basis of a new paper, Generative Artificial Intelligence vs. Law Students: An Empirical Study on Criminal Law Exam Performance, published yesterday (Tuesday, 24 September) in Law, Innovation and Technology.

The trigger was OpenAI’s claim that its GPT-4 model scored in the top 10 per cent of test takers on a simulated United States bar exam. He said: “The OpenAI claim was impressive and could have significant implications in higher education; for instance, does this mean [that] students can just copy their assignments into generative AI and ace their tests?”

“Many of us have played around with generative AI models, and they don’t always seem that smart, so I thought why not test it out myself with some experiments.”

The experiment

Last year, in UOW’s second semester, Alimardani – in his capacity as subject coordinator for criminal law – compiled 10 AI-generated answers to the end-of-semester exam. Five responses were generated simply by feeding the exam into different versions of ChatGPT, and another five used various prompt engineering techniques to elicit stronger answers.
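
The paper’s exact workflow isn’t reproduced here, but a minimal sketch of how such answers might be collected programmatically is below. Everything in it – the model names, the prompt wording and the exam_question placeholder – is an illustrative assumption rather than a detail from the study, which refers to versions of ChatGPT rather than to any particular script.

    # Hypothetical sketch only: gathering exam answers from different models,
    # with and without a prompt-engineered instruction. Not the study's code.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    exam_question = "..."  # the criminal law exam problem would go here

    PLAIN = [{"role": "user", "content": exam_question}]
    ENGINEERED = [
        {"role": "system",
         "content": ("You are an Australian criminal law student. Answer in "
                     "IRAC structure, applying only the legal principles and "
                     "facts given in the question.")},  # one common technique
        {"role": "user", "content": exam_question},
    ]

    for model in ("gpt-3.5-turbo", "gpt-4"):  # illustrative model choices
        for label, messages in (("plain", PLAIN), ("engineered", ENGINEERED)):
            reply = client.chat.completions.create(model=model, messages=messages)
            print(model, label, reply.choices[0].message.content[:80])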

“My research assistant and I hand-wrote the AI-generated answers in different exam booklets and used fake student names and numbers. These booklets were indistinguishable from the real ones,” Alimardani said.

After the criminal law exam was held, Alimardani mixed the AI-generated papers with the real student papers and handed them to tutors for grading; each tutor unknowingly marked two AI papers in their allocated bundle.

The results showed that – for a cohort of 225 students sitting an exam marked out of 60 – the average mark was approximately 40 (i.e. 66 per cent).

Of the five papers generated with different versions of ChatGPT alone, two received bare passes and three failed. The best-performing paper of that quintet scored better than only 14.7 per cent of students.

“… this small sample suggests that if the students simply copied the exam question into one of the OpenAI models, they would have a 50 per cent chance of passing,” Alimardani said.
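
That percentile figure is straightforward to unpack: in a cohort of 225, scoring better than 14.7 per cent of students means outscoring roughly 33 of them. A minimal sketch of the computation, using made-up marks since the study’s raw data aren’t reproduced here:

    # Illustrative only: what share of a cohort a given mark beats.
    def percentile_rank(mark: float, cohort_marks: list[float]) -> float:
        """Percentage of cohort marks strictly below `mark`."""
        beaten = sum(1 for m in cohort_marks if m < mark)
        return 100 * beaten / len(cohort_marks)

    # Hypothetical 225-student distribution in which 33 students score below 30
    cohort = [40.0] * 192 + [20.0] * 33
    print(percentile_rank(30.0, cohort))  # -> 14.666..., i.e. ~14.7 per cent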

Of the five papers that used prompt engineering techniques, three “weren’t that impressive”, Alimardani noted, but two performed reasonably well, with one scoring 73 per cent and the other 78 per cent.

Ultimately, he said, “these results don’t quite match the glowing benchmarks from OpenAI’s United States bar exam simulation, and none of the 10 AI papers performed better than 90 per cent of the students”.

Another potentially surprising result was that “hallucination” – the generation of fabricated information by an AI tool – did not occur in this experiment; the models stayed true to existing legal principles and to the facts provided in the exam, Alimardani noted.

Implications

Looking ahead, “alignment” – the degree to which AI-generated outputs match the user’s intentions – will be the real problem, Alimardani warned.

“The AI-generated answers weren’t as comprehensive as we expected. It seemed to me that the models were fine-tuned to avoid hallucination by playing it safe and providing less detailed answers,” he said.

“My research shows that people can’t get too excited about the performance of GenAI models in benchmarks. The reliability of benchmarks may be questionable, and the way they evaluate models could differ significantly from how we evaluate students.”

Moreover, the findings suggest that graduates who know how to work with AI could have an advantage in the job market, Alimardani continued.

“Prompt engineering can significantly enhance the performance of GenAI models, and therefore, it is more likely that future employers would have higher expectations regarding students’ GenAI proficiency,” he said.

“It’s likely students will be increasingly assessed on their ability to collaborate with AI to complete tasks more efficiently and with higher quality.”

Elsewhere, there may be implications for legal educators, Alimardani said.

None of the tutors tasked with grading suspected that any of the papers were AI-generated, he noted, and they were “genuinely surprised” when they found out.

In addition, “three of the tutors admitted that even if the submissions were online, they wouldn’t have caught it”.

“So, if academics think they can spot an AI-generated paper, they should think again,” he said.

Jerome Doraisamy

Jerome Doraisamy is the editor of Lawyers Weekly. A former lawyer, he has worked at Momentum Media as a journalist on Lawyers Weekly since February 2018, and has served as editor since March 2022. He is also the host of all five shows under The Lawyers Weekly Podcast Network, and has overseen the brand's audio medium growth from 4,000 downloads per month to over 60,000 downloads per month, making The Lawyers Weekly Show the most popular industry-specific podcast in Australia. Jerome is also the author of The Wellness Doctrines book series, an admitted solicitor in NSW, and a board director of Minds Count.
