By Asia Education Review Team, Monday, 24 June 2024

AI Excels in China's 'Gaokao' Language Tests, Struggles in Math

According to a study by researchers from the Shanghai Artificial Intelligence Laboratory, artificial intelligence excelled at generating answers for the Chinese literature and English language sections of this year's national college entrance exams (gaokao), but its performance in mathematics was notably weaker. The study used six open-source AI models alongside GPT-4o, the latest model from leading AI company OpenAI, to simulate sitting the exams required for admission to Chinese universities.

Results released by the laboratory showed that the AI test-takers achieved an average accuracy of 67 percent in Chinese language and literature and 81 percent in English, but answered only 36 percent of the mathematics questions correctly. The top scorer was domestic company Alibaba's latest multilingual language model, Qwen2-72B, which got about 72 percent of the questions right, followed by GPT-4o and a model launched by the Shanghai Artificial Intelligence Laboratory itself on June 4.

Researchers noted that the exams consisted of various question formats, including multiple-choice, fill-in-the-blank, questions with single correct answers, and open-response questions such as essay writing on given themes. Each answer sheet was assessed by at least three tutors who were unaware of the test-takers' identities until after grading. The graders observed that AI tools demonstrated better comprehension of contemporary Chinese text but struggled with classical Chinese passages from pre-modern times. Additionally, few AI models were capable of employing techniques such as quoting adages in their written responses.

“On the math test, their subjective responses tend to be disorganized and confusing, and the answer could be correct despite errors in the process. They also exhibited a strong memorization capability for formulas but were not able to swiftly apply them to problem-solving”, the graders said. AI participants also had mediocre results in the preliminary round of the 2024 Alibaba Global Mathematics Competition. Organizers said this month that the average score of the 500-plus AI teams was 18 out of 120, and the highest score among them was only 34, compared with the highest human score of 113.

Cao Sanxing, deputy dean at the Institute for Internet Information Research of Communication University of China, suggested that the AI models' underperformance in mathematics does not necessarily indicate deficiencies in their reasoning and calculation abilities. “At present, AI training related to math questions is not the primary focus of the sector, and the majority of resources have been devoted to feeding human language materials into AI models, hence the higher score in Chinese and English languages”, Cao said.

Despite AI's strong performance in language-related subjects, Cao noted that AI-generated content still exhibits noticeable flaws, such as contradictory statements, and reveals a lack of deep thought. Xu Yi, a graduate student at Renmin University of China's Gaoling School of Artificial Intelligence, said that AI's predominant strength lies in summarizing information by analyzing extensive datasets, which explains its strong performance in text generation. Xiong Bingqi, director of the 21st Century Education Research Institute, likewise attributed the lower mathematics scores to insufficient programming related to math.