Received 26 February 2025, accepted 22 March 2025, date of publication 31 March 2025, date of current version 9 April 2025.
Digital Object Identifier 10.1109/ACCESS.2025.3556352
ET-GNN: Ensemble Transformer-Based Graph
Neural Networks for Holistic Automated Essay Scoring
HIND ALJUAID 1, AREEJ ALHOTHALI 1, OHOUD ALZAMZAMI1, HUSSEIN ASSALAHI2, AND TAHANI ALDOSEMANI 3
1 Department of Computer Science, King Abdulaziz University, Jeddah 21589, Saudi Arabia
2 English Language Institute, King Abdulaziz University, Jeddah 21589, Saudi Arabia
3 College of Education, Prince Sattam bin Abdulaziz University, Al-Kharj 16273, Saudi Arabia
Corresponding author: Hind Aljuaid (hhamadaljuaid0001@stu.kau.edu.sa)
This work was supported in part by the Deputyship for Research and Innovation, Ministry of Education in Saudi Arabia, under Project
IFPRC-039-126-2020; and in part by the Deanship of Scientific Research at King Abdulaziz University, Jeddah, Saudi Arabia.
ABSTRACT Essay writing tasks are crucial for assessing students’ writing skills, but manual evaluation
can be time-consuming and prone to inconsistencies. Automated Essay Scoring (AES) offers a solution
by automatically evaluating essays, reducing the need for human intervention. This paper presents a
hybrid method, called Ensemble Transformer-Based Graph Neural Networks (ET-GNN), which integrates
Transformer-based models with Graph Convolutional Networks (GCNs) for holistic AES. Three Transformer
models, DistilBERT, RoBERTa, and DeBERTaV3, were individually fine-tuned to generate contextual
embeddings for each essay. The GCNs process these embeddings, effectively capturing relevant semantic
information and inter-essay similarities. Additionally, ensemble methods are used to combine the
DistilBERT-GCN, RoBERTa-GCN, and DeBERTaV3-GCN models employing averaging for regression
tasks, majority voting for classification tasks, and a weighted ensemble method for both types of tasks. The
proposed ET-GNN method enhances the performance and robustness of AES systems, achieving Quadratic
Weighted Kappa (QWK) scores of 0.835 and 0.841 on the ASAP and AES 2.0 datasets, respectively. These
results outperform other state-of-the-art Transformer- and GCN-based models on the AES task.
INDEX TERMS Automated essay scoring (AES), transformer models, graph convolutional networks (GCNs), ensemble methods, natural language processing (NLP).

The associate editor coordinating the review of this manuscript and approving it for publication was Arianna D'Ulizia.

2025 The Authors. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/

I. INTRODUCTION
Essay writing tasks are often utilized to evaluate a student's creativity, knowledge, and intellect, as they reflect the ability to collect, synthesize, and present information clearly [1]. Essay assignments are widely used in various assessment contexts, including college and university entrance applications, classroom evaluations, and standardized tests such as TOEFL and IELTS [2]. However, manual evaluation becomes increasingly challenging when large volumes of essays must be evaluated in a limited time. In addition, even with a rubric and a scoring scale, human evaluation can lack precision due to potential biases and inconsistencies in raters' judgments [3]. To address these issues, Automated Essay Scoring (AES) has emerged as an efficient solution, using Natural Language Processing (NLP) and advanced algorithms to automatically evaluate writing, significantly reducing the need for human intervention [4]. AES systems provide consistent, reliable scoring that can enhance assessment processes and improve educational outcomes [5].

Existing AES models can be broadly categorized into two groups: traditional approaches that rely on handcrafted features and neural network methods that utilize automatic feature extraction. Traditional methods involve manually
extracting shallow features that reflect various characteristics of the essay, followed by applying machine learning techniques to predict scores [6]. Figure 1 illustrates the types of features typically employed in essay scoring, which can be classified into three main categories: lexical, style, and semantic features [7]. However, these feature-engineered models often struggle to understand deeper semantic features, which can result in lower accuracy and require significant time and effort for development [4].

FIGURE 1. Types of features typically used in essay scoring, categorized into lexical, style, and semantic features.

Recently, deep learning models, including Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) such as Long Short-Term Memory networks (LSTMs), have been widely used for AES tasks [5]. These approaches utilize word embedding techniques to transform characters, words, or sentences into n-dimensional vectors, enabling their use in various neural network architectures [4], [8]. The strength of these models lies in their ability to extract rich semantic features from essays, effectively capturing semantic and syntactic information in local consecutive word sequences [9], while addressing some of the limitations associated with traditional feature engineering [6], [10].

However, despite their advantages, these deep learning models often struggle to capture global word co-occurrences and long-range dependencies within a corpus, both of which are essential for understanding complex contextual semantics beyond consecutive words [9]. To address these challenges, GNNs, also referred to as graph embeddings, have emerged as a promising research area. GNNs have gained significant attention for their effectiveness in tasks with rich relational structures and their ability to preserve global structural information within graph embeddings [11].

A notable advancement within GNNs is the Graph Convolutional Network (GCN) model introduced by Kipf and Welling, which achieved state-of-the-art classification results on a variety of graph datasets [12]. GCNs have also been widely explored in NLP tasks such as text classification [9], [13], demonstrating a more comprehensive understanding of text due to their ability to model long-range contextual semantics and integrate local and global relationships [9].

Several recent studies have explored the potential of GCNs in the context of AES. For example, Ait Khayi and Rus [14], Tan et al. [15], and Ma et al. [16] have demonstrated the effectiveness of GCNs in capturing relationships between essays and improving feature extraction. Each study differs in the choice of word embeddings used to represent graph nodes. Ait Khayi and Rus [14] utilize TF-IDF and Word2Vec embeddings to encode term importance and semantic relationships, while Tan et al. [15] employ one-hot encoding for binary vector representations. Ma et al. [16] introduce GloVe embeddings to represent document, sentence, and token nodes, enabling more effective modeling of semantic and contextual relationships.

However, traditional word embedding methods, such as TF-IDF, Word2Vec, and GloVe, have limitations in handling polysemy and capturing contextual dependencies. These methods generate static representations that do not adapt to different contexts, which can be a significant drawback in AES. Pre-trained Transformer models address these challenges by generating dynamic embeddings that capture complex semantic and contextual features [17].

While Transformers excel at capturing deep linguistic representations, they lack the ability to model relationships across multiple essays. This gap drives the need to integrate Transformer-based embeddings with GCNs, a combination that has shown success in various NLP tasks, such as text classification [17], [18], [19], [20]. Combining the contextual power of Transformer-based embeddings with GCNs' ability to model relational structures between texts offers a more comprehensive solution for tasks like AES.

To bridge this gap, this paper introduces the Ensemble Transformer-Based Graph Neural Networks (ET-GNN) method, which effectively combines the contextualized representation power of pre-trained Transformers with the relational modeling capabilities of GCNs. This novel integration enhances AES performance by addressing key limitations in feature representation and relational modeling, offering a more robust and accurate solution for AES.

To further enhance the performance of the proposed hybrid model, ET-GNN, ensemble methods are employed to combine the outputs of multiple models, resulting in more reliable and accurate predictions. These methods mitigate overfitting and improve generalization across various data representations by leveraging the strengths of diverse models [21]. In the context of NLP applications, ensemble methods that integrate multi-featured deep learning models have demonstrated significant promise, especially for text classification tasks. These methods improve performance by combining multiple deep learning models, each trained on distinct features [22].

In this paper, the proposed ET-GNN method incorporates ensemble techniques to combine outputs from three top-performing models, namely DistilBERT-GCN,
RoBERTa-GCN, and DeBERTaV3-GCN. Common ensemble techniques, such as averaging for regression tasks [23], majority voting for classification tasks, and weighted ensembles that assign varying importance to individual models based on their performance, are utilized to refine predictions [24]. The proposed ET-GNN method was evaluated on two datasets, the ASAP and AES 2.0 datasets, achieving higher prediction accuracy than state-of-the-art models in AES. The ET-GNN method enhances both prediction accuracy and model robustness, demonstrating its effectiveness as a comprehensive and reliable solution for AES.

This paper is organized as follows: Section II reviews related work in the field of AES, highlighting previous methodologies, their limitations, and the exploration of GCNs in AES tasks. Section III presents the proposed methodology, detailing the integration of Transformer models, GCNs, and ensemble methods across multiple tasks. Section IV outlines the experiments, including dataset details, evaluation metrics, experimental setup, results, and comparisons with baseline studies. Finally, Section V discusses the conclusion and suggests directions for future research.

II. RELATED WORK
Research on holistic essay scoring has evolved from traditional machine learning methods, which relied on handcrafted features, to neural network-based methods leveraging word embeddings, resulting in improved scoring performance. However, these neural network-based methods often focus on local word sequences, overlooking global word co-occurrence. GCNs have emerged as a promising method to address this by capturing global relationships. This section reviews previous research in three key areas: traditional machine learning methods, deep neural network methods, and GCNs.

A. TRADITIONAL MACHINE LEARNING METHODS
Traditional studies in AES have primarily focused on feature engineering and classification or regression algorithms. Commonly used features include vocabulary, sentence length, grammatical errors, and semantic similarity [25], [26]. Advanced features, such as parse tree depth and type-token ratio, have also been explored [27], along with syntactic, semantic, and sentiment-based features aimed at enhancing predictive performance [28]. In addition, some studies have utilized ensemble learning methods to improve prediction accuracy by leveraging the strengths of multiple models [26], [27], [29]. In contrast to these traditional machine learning methods, which rely on handcrafted features, deep neural network methods automatically learn text embeddings, offering a more flexible and efficient approach to AES.

B. DEEP NEURAL NETWORK METHODS
Early deep learning methods used static embeddings, such as GloVe and Word2Vec, for text representation. Chen et al. [30] combined GloVe with CNNs and LSTMs, while Li et al. [31] integrated LSTMs with self-attention mechanisms for coherence detection. Other studies, such as Cai [32] and Zhu and Sun [33], explored the effectiveness of Gated Recurrent Units (GRUs), LSTMs, and Bi-directional LSTMs (Bi-LSTMs) in improving AES performance.

Unlike unidirectional LSTMs, Bi-LSTM models process text bi-directionally to enhance contextual representation and have been widely explored with various static embeddings. Xia et al. [34] proposed a two-layer Bi-LSTM with Word2Vec, while Muangkammuen and Fukumoto [35] and Chen et al. [36] employed Bi-LSTMs with GloVe embeddings. Chen and Li [36] further improved performance by incorporating attention mechanisms. However, static embeddings lack contextual awareness, as words are represented identically regardless of context.

Transformer-based models, such as Bidirectional Encoder Representations from Transformers (BERT), address this limitation by providing contextualized embeddings. Wangkriangkri et al. [37] demonstrated that BERT outperformed GloVe and ELMo when combined with LSTMs, leading to improved AES performance. Other studies have integrated BERT embeddings with CNNs, LSTMs [38], Capsule Networks [39], and Bi-LSTMs [40] to further enhance AES performance.

Some studies have combined static and contextualized embeddings to enhance AES performance. Li et al. [41] integrated RoBERTa and GloVe with LSTMs, while Beseiso et al. [8] combined Word2Vec and BERT. Zhou et al. [6] incorporated GloVe with self-trained embeddings in a Transformer-based model. Various studies have leveraged Transformer models, including BERT, DistilBERT, RoBERTa, and DeBERTa, for AES tasks [42], [43], [44]. Hybrid approaches, such as RoBERTa combined with Bi-LSTMs, have addressed length truncation issues, leading to enhanced performance [45]. Despite these advances, many deep learning methods primarily focus on local word sequences, without explicitly capturing global word co-occurrence [9].

C. GRAPH CONVOLUTIONAL NETWORKS (GCNs)
GCNs have recently demonstrated strong performance in various NLP tasks due to their ability to model relational structures within textual data [46], [47]. Inspired by these advancements, researchers have explored the use of GCNs in AES.

For instance, Tan et al. [15] and Ait Khayi and Rus [14] applied GCNs to short-answer scoring. Tan et al. conducted experiments on the SemEval-2013 (English) and Two-subject (Chinese) datasets, achieving competitive results using one-hot encoding for binary vector representations. Ait Khayi and Rus utilized TF-IDF and Word2Vec embeddings to incorporate term importance and semantic relationships, demonstrating promising performance on the DT-Grade dataset. Ma et al. [16] extended the use of GCNs to essay scoring by applying GCNs to the Automated Student
Assessment Prize (ASAP) dataset. This study employed GloVe word embeddings to represent essays, allowing GCNs to capture semantic relationships between texts. However, reliance on static word embeddings limited the ability to fully capture contextual word meanings, particularly for polysemous words.

Unlike earlier studies that relied on static embeddings with GCNs, this work is, based on existing research, the first to integrate Transformer-based contextual embeddings with GCNs for AES. Transformer models provide rich, context-dependent representations of essays, overcoming the limitations of static embeddings. By leveraging these contextual embeddings within a GCN architecture, ET-GNN captures both essay-level semantics and structural relationships between essays, offering a more comprehensive assessment approach. This hybrid method advances AES by combining the strengths of both architectures, improving predictive performance and the ability to model relationships among essays.

III. METHODOLOGY
This section presents ET-GNN, a hybrid method that integrates Transformer-based models and GCNs for holistic AES. Rather than grading individual sentences, the method evaluates overall essay quality by analyzing content, structure, and context, providing a more comprehensive assessment. The overall architecture of the proposed method is illustrated in Figure 2.

In this approach, the Transformer model extracts contextual embeddings that capture the semantic and syntactic features of each essay, but relationships between essays are not explicitly represented. To address this, GCNs incorporate contextual similarities and structural relationships that the Transformer model alone cannot capture. Through a graph-based structure, GCNs model inter-essay relationships, representing each essay as a node and establishing edges based on similarities in themes, writing styles, or other contextual features. This process refines predictions by combining both individual essay features and structural information from similar essays.

To further enhance performance and generalization, an ensemble of Transformer-GCN models is employed. By combining outputs from multiple models, the ensemble method improves reliability across diverse data embeddings. Additionally, ET-GNN is designed for both classification and regression tasks, effectively supporting discrete and continuous score predictions. This flexibility makes the method adaptable to various essay scoring rubrics and requirements.

A. TRANSFORMER-BASED MODEL
Transformer-based pre-trained language models, such as BERT, have achieved remarkable success in NLP [48], leveraging the innovative Transformer architecture introduced by Vaswani et al. [49]. The Transformer is an encoder-decoder model that processes all tokens simultaneously, unlike traditional models that handle inputs sequentially. While it does not inherently encode word order, it incorporates positional encoding to retain sequence information, ensuring that embeddings for the same word vary depending on its position within the sequence. The Transformer employs a self-attention mechanism to capture dependencies between tokens [49].

These models excel in learning universal language representations from vast amounts of unlabeled text data and applying this knowledge to downstream tasks through a process known as transfer learning. This approach involves pre-training the model on extensive, generic datasets in an unsupervised manner to develop a broad understanding of language, followed by fine-tuning on task-specific data. Fine-tuning adapts the model to the target domain, enabling it to achieve superior results compared to training from scratch [50].

This study evaluates the performance of three Transformer-based pre-trained models, DistilBERT, RoBERTa, and DeBERTaV3, on the AES task, using the base version of each model for comparison.

1) DistilBERT, short for "Distilled BERT," is a lightweight and efficient version of the BERT model, designed to address the computational and memory challenges of large Transformer architectures. It employs knowledge distillation, where a smaller "student" model is trained to replicate the behavior of a larger, pre-trained "teacher" model, in this case, BERT. By halving the number of layers in BERT-base and removing components such as the token-type embeddings and the pooler, DistilBERT retains 97% of BERT's performance on benchmarks like GLUE, while being 40% smaller and 60% faster during inference [51].

2) RoBERTa, short for "Robustly Optimized BERT Pretraining Approach," builds on the original BERT by optimizing hyperparameters and significantly increasing the size of the training data. Pre-trained on five diverse English-language datasets totaling over 160 GB of text, RoBERTa has achieved state-of-the-art results in several downstream tasks across three datasets [52].

3) DeBERTa, short for "Decoding-enhanced BERT with Disentangled Attention," further refines the BERT and RoBERTa models by introducing two novel techniques: a disentangled attention mechanism and an enhanced masked decoder. These techniques improve pretraining efficiency, enabling DeBERTa, trained on 78 GB of data, to outperform RoBERTa, which was pre-trained on 160 GB of data [53].

In this study, each model was fine-tuned on the specific AES datasets to generate embeddings tailored to each dataset. These embeddings served as the foundation for further processing in the GCNs. Additionally, the models were configured to handle both classification and regression tasks, ensuring compatibility with diverse scoring rubrics and prediction types.
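The paper does not detail how the fine-tuned models' token-level hidden states are collapsed into the single fixed-size vector that serves as a graph node feature. The sketch below, with a hypothetical `essay_embedding` helper operating on an already-computed hidden-state matrix, illustrates the two common choices, [CLS] pooling and masked mean pooling, as an assumption rather than the authors' actual procedure:

```python
import numpy as np

def essay_embedding(hidden_states, attention_mask, strategy="cls"):
    """Collapse token-level hidden states (seq_len, dim) into one
    fixed-size essay vector for use as a graph node feature.

    "cls"  -> take the first ([CLS]) token's vector
    "mean" -> average the vectors of non-padding tokens
    """
    if strategy == "cls":
        return hidden_states[0]
    mask = attention_mask.astype(bool)
    return hidden_states[mask].mean(axis=0)

# Toy case: 4 tokens (the last one is padding), hidden size 3.
states = np.array([[1.0, 0.0, 2.0],
                   [3.0, 4.0, 0.0],
                   [5.0, 2.0, 1.0],
                   [9.0, 9.0, 9.0]])   # padding row, excluded by "mean"
mask = np.array([1, 1, 1, 0])

cls_vec = essay_embedding(states, mask, "cls")    # -> [1., 0., 2.]
mean_vec = essay_embedding(states, mask, "mean")  # -> [3., 2., 1.]
```

Either pooling choice yields one vector per essay, which is what the graph construction in the next section consumes.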
FIGURE 2. ET-GNN is a hybrid method that integrates Transformer-based models and GCNs for holistic AES. Essay embeddings, generated from fine-tuned Transformer models to capture contextual information, are represented as nodes in a graph structure. Each node corresponds to a distinct essay embedding, and these nodes are processed by GCNs to capture relational dependencies between essays, thereby enhancing the method's scoring capabilities. ET-GNN supports both classification and regression tasks: for classification, the output is passed through a Softmax function to provide class probabilities, while for regression, raw continuous values are output directly without activation. To further improve performance and generalization, ensemble methods are applied, including majority voting for classification, averaging for regression, and a weighted ensemble method for both tasks.

B. GRAPH CONVOLUTIONAL NETWORKS (GCNs)
This section describes the architecture of the proposed GCN model, detailing both graph construction and the graph convolutional layers. The model is based on the foundational work introduced by Kipf and Welling [12], with modifications to the model's parameters to optimize performance for this specific task.

1) GRAPH CONSTRUCTION
A graph representation of the dataset is constructed using fine-tuned Transformer embeddings as node features, with edges defined by the cosine similarity between these embeddings. The details of the node representation and edge construction are outlined below.

a: NODE REPRESENTATION
In this model, each essay embedding, generated by the fine-tuned Transformer models, is represented as a node in the graph. These embeddings effectively capture the semantic content of the essays. Let h_i denote the embedding vector of the i-th essay in the dataset.

b: EDGE CONSTRUCTION
Edges between nodes are established based on the cosine similarity of their embedding vectors. A cosine similarity close to 1 indicates that the vectors are nearly identical, while a value close to 0 suggests little or no similarity. The cosine similarity between two essays, i and j, is calculated as follows:

cosine_similarity(i, j) = (h_i · h_j) / (‖h_i‖ ‖h_j‖),   (1)

where:
• h_i · h_j is the dot product of the vectors h_i and h_j,
• ‖h_i‖ and ‖h_j‖ represent the Euclidean norms of the vectors h_i and h_j, respectively.
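As a minimal NumPy sketch (not the authors' implementation, which would typically use a GNN library), Eq. (1), the similarity-thresholded edge construction, and one symmetrically normalized propagation step of the graph convolutional layer described below can be written as follows; the toy embeddings, the 0.98 threshold, and the identity weight matrix are illustrative assumptions:

```python
import numpy as np

def cosine_similarity(h_i, h_j):
    # Eq. (1): dot product over the product of Euclidean norms.
    return float(np.dot(h_i, h_j) / (np.linalg.norm(h_i) * np.linalg.norm(h_j)))

def build_graph(H, threshold):
    # Connect essays i != j whose cosine similarity exceeds the threshold.
    n = H.shape[0]
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            if cosine_similarity(H[i], H[j]) > threshold:
                A[i, j] = A[j, i] = 1.0  # undirected edge
    return A

def gcn_layer(H, A, W, act=lambda x: np.maximum(0.01 * x, x)):
    # One GCN step: H' = sigma(D~^-1/2 A~ D~^-1/2 H W), where A~ = A + I
    # adds self-loops, D~ is its degree matrix, and sigma is LeakyReLU.
    A_t = A + np.eye(A.shape[0])
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_t.sum(axis=1)))
    return act(d_inv_sqrt @ A_t @ d_inv_sqrt @ H @ W)

# Three toy "essay embeddings": the first two are nearly parallel.
H = np.array([[1.0, 0.0],
              [0.99, 0.14],
              [0.0, 1.0]])
A = build_graph(H, threshold=0.98)   # only essays 0 and 1 are linked
H1 = gcn_layer(H, A, W=np.eye(2))    # propagation mixes the linked nodes
```

After one step, the two connected essays receive averaged features while the isolated essay keeps its own, which is exactly the smoothing-over-similar-essays behavior the method relies on.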
An edge is added between nodes i and j if the cosine similarity between their embeddings exceeds a predefined threshold. During experimental trials, different threshold values starting from 0.7 were tested, and the values that achieved the best performance were selected. For regression tasks, the threshold is set to 0.98, while for classification tasks, it is set to 0.99. These high thresholds ensure that only essays with highly similar representations are connected, so that the edges in the graph capture meaningful relationships, such as shared themes or writing styles, while excluding the noise that weaker or irrelevant connections would introduce. By focusing on true similarities, the graph structure remains relevant and coherent, allowing the model to make better predictions based on meaningful connections between essays. The edge set E is defined as follows:

E = {(i, j) | cosine_similarity(i, j) > threshold, i ≠ j}.   (2)

Table 1 presents excerpts from essay pairs with high cosine similarity scores exceeding 0.98 and 0.99. Interestingly, although the essays cover diverse topics, the high similarity scores suggest that structural and stylistic elements, such as writing style, language use, and sentence construction, are the primary drivers of similarity. The varied content seems to have less impact on the similarity metrics than these linguistic and structural factors, demonstrating how essays on different subjects can still exhibit remarkably similar embeddings due to shared language and tone.

TABLE 1. Cosine similarity between pairs of essays.

The entire graph construction process is illustrated in Algorithm 1.

Algorithm 1 Graph Construction Process
Input: Set of nodes V = {v_1, v_2, . . . , v_n}, where each node is represented by its embedding e_1, e_2, . . . , e_n
Output: Graph G = (V, E), where E is the set of edges
Initialization:
1: Compute the cosine similarity matrix S ∈ R^(n×n) such that S_ij = cosine_similarity(e_i, e_j)
Graph Construction:
2: for each pair of nodes (i, j) do
3:   if S_ij > threshold then
4:     Add edge (v_i, v_j) to the graph G
5:   end if
6: end for
7: return G

This process results in an undirected, weighted graph where nodes represent the embeddings of essays, and edges indicate high similarity between these embeddings.

2) GRAPH CONVOLUTIONAL LAYERS
The GCN architecture consists of multiple layers, each contributing to the learning process. The input layer initializes node features using embeddings derived from a fine-tuned Transformer model. These embeddings serve as the initial feature representation for each node in the graph and act as the input to the network.

Following the input layer, the first graph convolutional layer aggregates features from neighboring nodes through a convolution operation. This process updates node representations by combining information from neighboring nodes and associated features, producing a hidden representation. To improve generalization and mitigate overfitting, a dropout layer is applied during training. The convolution operation is mathematically expressed as:

H^(l+1) = σ(D̃^(−1/2) Ã D̃^(−1/2) H^(l) W^(l)),   (3)

where:
• H^(l+1): Node feature matrix after layer l+1, representing the updated features for each node.
• Ã = A + I: Adjacency matrix A, augmented with self-loops by adding the identity matrix I. This ensures nodes incorporate their own features during aggregation.
• D̃: Degree matrix of Ã, a diagonal matrix where each element corresponds to the degree (number of connections) of a node.
• D̃^(−1/2): Inverse square root of the degree matrix, used to symmetrically normalize the adjacency matrix, ensuring nodes with higher degrees do not dominate feature propagation.
H. Aljuaid et al.: ET-GNN: Ensemble Transformer-Based Graph Neural Networks for Holistic AES •
W(l): Trainable weight matrix at layer l, learned during
Each weight is then calculated as follows:
training to transform node features. Qi
σ: Activation function (e.g., LeakyReLU) applied after wi = for i = 1, 2, 3. (8) Total QWK
the convolution to introduce non-linearity, enabling the
model to learn complex patterns in the data.
This calculation ensures that each model’s weight reflects
The second Graph Convolutional Layer applies the same
its performance, with models achieving higher QWK perfor-
mathematical operation as described above to the hidden
mance assigned greater weights.
representations, further refining the node features. However,
The final score S, calculated using the weighted method
no activation function is applied at this stage. This layer
for both regression and classification tasks, is given by:
produces the final output for each node. For classification n X
tasks, the output is processed through a Softmax function to S = wi · Oi, (9)
yield class probabilities. For regression tasks, the raw output i=1
is used directly, producing continuous values as predictions. where:
This architecture utilizes graph convolution operations to •
S represents the final score,
iteratively propagate and refine features, effectively capturing •
n denotes the total number of models,
structural relationships and learning meaningful patterns •
Oi is the output from model i, from the graph data. •
wi is the weight assigned to model i.
Incorporating model-specific weights is a crucial optimiza- C. ENSEMBLE METHODS
tion step that enables models with higher performance to
The ensemble method is a statistical approach that aggregates
contribute more significantly to the final score, ultimately
predictions from multiple models to enhance overall output.
improving the ensemble’s performance in both regression and
In this study, the method combines the DistilBERT-GCN, classification tasks.
RoBERTa-GCN, and DeBERTaV3-GCN models, resulting in
more reliable outcomes across diverse data embeddings.
IV. EXPERIMENTS AND DISCUSSION
Different methods are employed for each task. For
This section begins with an introduction to the datasets used
regression tasks, the final score S is computed by averaging
the predictions P_i from n Transformer-based GCN models as follows:

    S = \frac{1}{n} \sum_{i=1}^{n} P_i.  (4)

This operation ensures that contributions from all models are weighted equally, leading to a more stable and reliable final output.
In classification tasks, a majority voting mechanism is employed. The final class C is determined by the class that receives the highest number of votes V_j from the outputs of the individual models:

    C = \arg\max_j (V_j),  (5)

where j represents the classes. This method effectively aggregates the models' decisions to yield a consensus output.
In addition, a weighted ensemble method is utilized for both regression and classification tasks. The weights for each model are calculated based on the proportion of that model's QWK performance relative to the total QWK performance across all models. Let the QWK performances of the models be represented as a list:

    model_qwk = [Q_1, Q_2, Q_3].  (6)

To compute the weights w_i for each model, the total QWK performance is first calculated:

    Total QWK = Q_1 + Q_2 + Q_3.  (7)
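Taken together, the averaging rule of Eq. (4), the majority vote of Eq. (5), and the QWK-proportional weighting of Eqs. (6)-(7) can be sketched as follows. The function names and example values are illustrative rather than taken from the original implementation, and the normalization w_i = Q_i / Total QWK simply follows the proportionality described in the text:

```python
import numpy as np

def average_ensemble(preds):
    """Eq. (4): unweighted mean of n model predictions (regression)."""
    return np.mean(preds, axis=0)

def majority_vote(preds):
    """Eq. (5): per-sample class receiving the most votes (classification)."""
    preds = np.asarray(preds, dtype=int)  # shape: (n_models, n_samples)
    return np.array([np.bincount(col).argmax() for col in preds.T])

def qwk_weights(model_qwk):
    """Eqs. (6)-(7): weights proportional to each model's share of total QWK."""
    total = sum(model_qwk)
    return [q / total for q in model_qwk]

def weighted_ensemble(preds, model_qwk):
    """Weighted combination of regression outputs using QWK-based weights."""
    return np.average(preds, axis=0, weights=qwk_weights(model_qwk))

# Illustrative predictions and validation QWKs (not values from the paper):
preds = [[3.0, 4.0], [3.5, 4.5], [4.0, 5.0]]
print(weighted_ensemble(preds, [0.80, 0.83, 0.81]))
```

Weighting by QWK share gives models with stronger validation agreement proportionally more influence on the final score, while the plain average treats all three models as equally reliable.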
This section describes the datasets used in this study, followed by an overview of the performance evaluation methodology employed. The experimental setup is then described. Next, the section presents the experimental results, offering a detailed analysis and discussion of the findings for both classification and regression tasks. Finally, the discussion section compares the performance of the ET-GNN method with baseline models.

A. DATASETS
To validate and compare the results with existing models, two datasets were utilized. The first, an older dataset known as ASAP [54], originally had limitations such as inconsistent scoring scales. The second dataset, the Learning Agency Lab - AES 2.0 (AES 2.0) [55], is a more recent dataset specifically designed to overcome previous shortcomings, particularly focusing on argumentative essays. The details of each dataset are described below.

1) ASAP DATASET
The ASAP dataset [54] originates from a competition launched in 2012 on Kaggle to advance the development of AES systems. The ASAP-AES dataset contains a total of 12,978 essays, which are categorized into eight essay sets. Each set corresponds to a unique prompt, representing a variety of writing tasks, such as narrative, argumentative, and source-dependent responses. These essays, written by students in Grades 7 to 10, typically range from 150 to 550 words in length. The essays were evaluated by two or, in some cases, three human raters on a holistic scale that varied across prompts. One limitation of this dataset is the varying scoring scales across prompts, which adds complexity to the task of developing consistent and accurate scoring models. To address this, the essay scores were rescaled to a common range of 1 to 6, enabling training on all essay sets together. Table 2 provides comprehensive details for each set.

TABLE 2. Description of the ASAP dataset: For genre, ARG, RES, and NAR correspond to argumentative essays, response essays, and narrative essays, respectively.

2) AES 2.0 DATASET
Building on efforts from twelve years earlier, the "Learning Agency Lab - AES 2.0" competition [55], launched in 2024 on Kaggle, represents a significant advancement in automated essay grading technology. The updated dataset for this competition includes 17,307 student-written argumentative essays, each scored on a holistic scale of 1 to 6. This scoring range offers a clear and consistent measure for assessing the quality of student writing. The essays were collected to reflect high-quality, realistic classroom writing and to address the shortcomings of previous datasets, which did not focus on argumentative essays, a commonly used and relevant format in educational assessments.

3) SUMMARY OF DATASETS
Table 3 summarizes the key characteristics of the ASAP and AES 2.0 datasets.

TABLE 3. Summary of datasets. The genres are abbreviated as follows: ARG (Argumentative), RES (Response), NAR (Narrative).
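The rescaling of the heterogeneous per-prompt ASAP score ranges onto the common 1-6 scale can be illustrated with a linear min-max mapping; the exact mapping used in the study is not spelled out here, so both the formula and the example prompt range below are assumptions:

```python
def rescale_score(score, lo, hi, new_lo=1, new_hi=6):
    """Linearly map a raw score from [lo, hi] onto [new_lo, new_hi]."""
    return new_lo + (score - lo) * (new_hi - new_lo) / (hi - lo)

# Hypothetical prompt range (ASAP ranges differ per essay set):
print(round(rescale_score(8, lo=2, hi=12)))  # raw score 8 on a 2-12 scale -> 4
```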
B. PERFORMANCE EVALUATION
Quadratic Weighted Kappa (QWK) is a statistical measure designed to evaluate the agreement between predicted scores and true scores, making it particularly suitable for AES systems. Unlike simple accuracy metrics, QWK considers the degree of disagreement between predictions and actual scores, providing a more detailed assessment of model performance [56].
The QWK is calculated using the formula [56]:

    QWK = 1 - \frac{\sum_{i,j} w_{i,j} O_{i,j}}{\sum_{i,j} w_{i,j} E_{i,j}},  (10)

where:
• O_{i,j} is the observed agreement matrix, representing the agreement between predicted and actual scores.
• E_{i,j} is the expected agreement matrix, representing the agreement that would occur by chance.
• w_{i,j} is the weight matrix, typically calculated as:

    w_{i,j} = \frac{(i - j)^2}{(N - 1)^2},  (11)

where i and j represent the predicted and true score categories, and N is the total number of possible score categories.
QWK ranges from −1 to 1, where a value of 1 indicates perfect agreement between the predicted and true scores, 0 signifies no agreement beyond what would occur by chance, and negative values indicate worse-than-chance performance. QWK applies a quadratic weighting, which assigns greater penalties to larger discrepancies between the predicted and actual scores. This weighting is particularly important in educational contexts, such as AES, where precise grading is critical, and large scoring errors can significantly impact the reliability of the model's evaluation. As shown in Table 4, the interpretation of kappa values helps quantify the level of agreement, ranging from "no agreement" for values less than 0 to "almost perfect agreement" for values between 0.81 and 1.00 [56].

TABLE 4. Interpretation of kappa values.
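As a concrete check of Eqs. (10)-(11), the following sketch computes QWK from integer score labels; it is an independent reimplementation for illustration, not the authors' evaluation code:

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """QWK per Eq. (10): 1 - sum(w*O)/sum(w*E), quadratic weights per Eq. (11)."""
    # Observed agreement matrix O: counts of (true, predicted) pairs.
    O = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        O[t, p] += 1
    # Expected matrix E: outer product of the marginal histograms,
    # scaled to the same total count as O.
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()
    # Quadratic weight matrix w_{i,j} = (i - j)^2 / (N - 1)^2.
    i, j = np.indices((n_classes, n_classes))
    w = (i - j) ** 2 / (n_classes - 1) ** 2
    return 1.0 - (w * O).sum() / (w * E).sum()

print(quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3], n_classes=4))  # -> 1.0
```

Perfect predictions zero out every weighted cell of O, giving a QWK of exactly 1, while completely reversed predictions drive the ratio above 1 and the score negative.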
C. EXPERIMENTAL SETUP
In this study, both Transformer models and Transformer-based GCN models were configured to handle regression and classification tasks.
For the Transformer models, Stratified K-Fold Cross-Validation with 5 splits was applied to partition the dataset, ensuring balanced class representation by preserving the distribution of the target variable (i.e., scores) across each fold. Each subset contained approximately 20% of the samples. The maximum sequence length for model input was set to 1024 tokens for the DeBERTaV3 model and 512 tokens for the other models, with a learning rate of 1e−5. The tokenizer was customized to include tokens for new paragraphs ("\n") and double spaces ("  ") to accommodate specific formatting in the text. No dropout was applied to the Transformer models. The AdamW optimizer, with a weight decay of 1e−2, was utilized, and the learning rate followed a cosine annealing schedule without warmup (warmup ratio = 0.0). Each model was trained for 3 epochs,
using a batch size of 4 for training and 8 for evaluation. FP16 precision was enabled for faster computation. The evaluation was conducted using QWK after each epoch. To enhance predictive performance and robustness, the final result was obtained by averaging the outputs from all five models across the folds. This method incorporates the strengths of each model to provide a more stable and reliable prediction while ensuring that only the best-performing model is saved for further processing.
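The five-fold partitioning described above can be sketched with scikit-learn's StratifiedKFold; the fine-tuning call itself is elided, and the essays and scores below are placeholders rather than the actual data:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Placeholder data: 60 essays with holistic scores 1-6 (the real inputs
# would be tokenized essays paired with their gold scores).
scores = np.array([1, 2, 3, 4, 5, 6] * 10)
essays = np.arange(len(scores))

# 5 splits: each validation fold holds ~20% of the samples while
# preserving the score distribution across folds.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
val_folds = []
for fold, (train_idx, val_idx) in enumerate(skf.split(essays, scores)):
    # fine_tune(...) would train one Transformer on train_idx here (elided);
    # we only record fold membership to show the partitioning mechanics.
    val_folds.append(val_idx)

# Every essay lands in exactly one validation fold.
assert sorted(np.concatenate(val_folds).tolist()) == list(range(60))
```

The final prediction then averages the outputs of the five fold models, as described in the text.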
For the GCNs model, the input features were embeddings generated by the best-performing Transformer model. The dataset was randomly split into training, validation, and test sets. The GCNs model consisted of 2 layers with 512 hidden units and employed LeakyReLU activation with a negative slope of 1e−3. No dropout was applied for regression tasks, while a dropout rate of 0.5 was utilized for classification tasks.
The AdamW optimizer was also employed for the GCNs model, with a learning rate of 1e−3 and a weight decay of 5e−4. The learning rate followed a cosine annealing schedule, with Tmax = 2000, representing the maximum number of epochs for learning rate decay. The model was trained on a GPU to accelerate the process.
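A minimal PyTorch sketch of a two-layer GCN matching the hyperparameters above (512 hidden units, LeakyReLU with negative slope 1e−3, dropout 0.5 for classification, AdamW with lr 1e−3 and weight decay 5e−4, cosine annealing with T_max = 2000). The dense linear-plus-adjacency formulation and the 768-dimensional input are assumptions for illustration, not the authors' exact implementation:

```python
import torch
import torch.nn as nn

class SimpleGCN(nn.Module):
    """Two-layer GCN over Transformer essay embeddings (a sketch)."""
    def __init__(self, in_dim, hidden=512, out_dim=6, dropout=0.5):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hidden)
        self.w2 = nn.Linear(hidden, out_dim)
        self.act = nn.LeakyReLU(negative_slope=1e-3)
        self.drop = nn.Dropout(dropout)  # 0.5 for classification, 0 for regression

    def forward(self, x, adj):
        # adj: normalized essay-similarity adjacency matrix (assumed precomputed)
        h = self.drop(self.act(adj @ self.w1(x)))
        return self.w2(adj @ h)

model = SimpleGCN(in_dim=768)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=5e-4)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=2000)
```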
D. EXPERIMENTAL RESULTS
The results displayed in Table 5 illustrate the performance of various models, including Transformer models, Transformer-GCN models, and ensemble methods, across the ASAP and AES 2.0 datasets. Each model was evaluated for both classification and regression tasks using QWK as the assessment metric. To further illustrate these results, Figure 3 presents the QWK scores achieved by each model for both tasks on the ASAP and AES 2.0 datasets.

TABLE 5. QWK performance of various models on the ASAP and AES 2.0 datasets for classification and regression tasks. Ensemble-Transformers-GCN denotes an ensemble method that applies majority voting for classification and averaging for regression tasks, while Ensemble(W)-Transformers-GCN uses a weighted ensemble method, where outputs are combined based on model-specific weights for both tasks. For each dataset, the highest QWK values within each model group are presented in bold, and the highest across all models are marked with underlined formatting for both tasks.

FIGURE 3. QWK performance of various models on ASAP and AES 2.0 datasets for classification and regression tasks.

The detailed analysis of the results for each task, along with a summary of the findings, is described below.

1) ANALYSIS OF RESULTS IN CLASSIFICATION TASKS
For the classification task, the DeBERTaV3-base model achieves the highest QWK performance among Transformer-only models across both datasets. However, when combining Transformer models with the GCNs model, the RoBERTa-base-GCN outperforms all others, achieving QWK performance of 0.835 and 0.838 on the ASAP and AES 2.0 datasets, respectively.
In terms of ensemble methods, both the ensembling method with majority voting (Ensemble-Transformers-GCN) and the weighted ensemble (Ensemble(W)-Transformers-GCN) demonstrate competitive performance. The Ensemble(W)-Transformers-GCN achieves a QWK of 0.833 on the ASAP dataset and 0.841 on the AES 2.0 dataset, showing a slight improvement over the Ensemble-Transformers-GCN. This indicates that incorporating model-specific weights in the prediction combination process can enhance predictive results by assigning greater influence to more effective models. Although the improvement from the weighted ensemble over standard majority voting is marginal, it still represents a significant optimization, particularly on the AES 2.0 dataset, where the highest QWK of 0.841 is achieved.

2) ANALYSIS OF RESULTS IN REGRESSION TASKS
For the regression tasks, similar to the classification results, the DeBERTaV3-base model achieves the highest QWK performance among Transformer-only models on both datasets. However, when combining Transformer models with the GCNs model, the RoBERTa-base-GCN model demonstrates strong performance on the ASAP dataset, achieving a QWK of 0.835, which is the highest overall performance among all models across both datasets. In contrast, during the classification task, the RoBERTa-base-GCN model achieves the highest QWK on the AES 2.0 dataset.
Interestingly, while Transformer models perform well in regression tasks, integrating them with GCNs does not consistently yield significant improvements. This is particularly evident in the case of the DeBERTaV3-base-GCN, where the regression performance on the ASAP dataset drops to 0.724, compared to 0.812 for the standard DeBERTaV3-base model.
The ensemble methods, which include averaging for regression tasks and a weighted method, demonstrate robust results. Notably, there is no significant difference between the two ensemble methods for regression, as both achieve QWK performance of 0.819 on the ASAP dataset and 0.829 on the AES 2.0 dataset. The consistency of these results underscores the effectiveness of the ensemble methods in stabilizing performance across models.

3) SUMMARY
Overall, the integration of GCNs with Transformer-based models leads to significant performance improvements, particularly in classification tasks. However, results are more varied for regression tasks, with some models showing enhancements while others experience slight drops in performance. The ensemble methods, especially the weighted ensemble, provide a slight increase in performance, particularly in classification tasks, where the weighted ensemble consistently outperforms both individual models and the standard ensemble.
The findings suggest that the weighted ensemble method provides a flexible and effective approach for integrating model outputs, leveraging the strengths of each. Although the improvements from the weighted ensemble are not always substantial, they remain consistent, particularly in classification tasks where this method excels.

E. DISCUSSION
Since no prior studies have explored the AES 2.0 dataset, the effectiveness of the proposed ET-GNN method was evaluated by comparing it with two baseline studies that utilized the ASAP dataset. The first baseline study, conducted by Susanto et al. [44], used Transformer models, RoBERTa and DeBERTaV3, similar to those used in this research. Susanto et al. reported QWK values of 0.749 for the RoBERTa-base and 0.771 for the DeBERTaV3-base, highlighting the solid performance of these Transformer
models in the context of AES. The second baseline study, by Ma et al. [16], used GCNs with GloVe embeddings, achieving a QWK of 0.773. Although that study achieved notable results, the static nature of GloVe embeddings is a limitation compared to the contextually aware embeddings generated by Transformer models. Table 6 shows a comparison of QWK performance between the baseline studies and the proposed ET-GNN method.
The proposed ET-GNN method significantly outperforms these baseline models. For example, when comparing the QWK performance of Susanto et al. [44] with ET-GNN, the RoBERTa-base model in ET-GNN achieved a QWK of 0.793 for classification and 0.810 for regression, while the DeBERTaV3-base model reached 0.800 for classification and 0.812 for regression. These results surpass the QWK values reported by Susanto et al., which were 0.749 for the RoBERTa-base and 0.771 for the DeBERTaV3-base.
Similarly, when compared with Ma et al. [16], who utilized GloVe embeddings with a GCNs model and achieved a QWK value of 0.773, the ET-GNN method demonstrated significantly higher performance. The DistilBERT-GCN model in ET-GNN showed outstanding results, especially in the classification task, with a QWK of 0.804, while maintaining strong performance in regression with a QWK of 0.766. Furthermore, the RoBERTa-GCN model displayed superior performance, recording a QWK of 0.835 for both classification and regression tasks, highlighting the model's robustness across various scoring rubrics. Notably, the DeBERTaV3-GCN model achieved a QWK of 0.752 for classification and 0.724 for regression, indicating areas for potential improvement. Despite this, the overall results demonstrate that the ET-GNN method outperforms existing methods, suggesting a promising direction for future research in AES systems.
The ensemble methods incorporated into ET-GNN further enhanced scoring performance. The Ensemble-Transformers-GCN model, which employs majority voting for classification tasks and averaging for regression tasks, achieved notable QWK scores of 0.824 for classification and 0.819 for regression. The Ensemble(W)-Transformers-GCN model, which uses a weighted ensemble method, achieved higher classification performance, with QWK values of 0.833 for classification and 0.819 for regression. These results emphasize the effectiveness of ensemble methods in combining the
strengths of Transformers and GCNs to optimize predictive accuracy.

TABLE 6. Comparison of QWK performance between baseline studies and the proposed ET-GNN method on the ASAP dataset for both classification and regression tasks. Ensemble-Transformers-GCN utilizes majority voting for classification and averaging for regression tasks, while Ensemble(W)-Transformers-GCN combines outputs based on model-specific weights for both tasks. The highest QWK values for each task within each model group are highlighted in bold.

In summary, the comparison between the baseline studies and the ET-GNN method demonstrates the advancements made through the integration of Transformer models and GCNs. This combination results in improved performance and robustness in AES, further validating the potential of the ET-GNN method in educational assessments and suggesting promising directions for future research and development in this field.

V. CONCLUSION
This paper introduced ET-GNN, a hybrid method that integrates Transformer-based models and GCNs for holistic AES. Three Transformer models, DistilBERT, RoBERTa, and DeBERTaV3, were independently fine-tuned to generate contextual embeddings for each essay. GCNs then processed these embeddings separately, capturing both semantic information and similarities between essays. To further enhance performance, an ensemble method was used that combines the DistilBERT-GCN, RoBERTa-GCN, and DeBERTaV3-GCN models, using averaging for regression tasks, majority voting for classification tasks, and a weighted ensemble method for both tasks. This method improves the overall results and strengthens the robustness of the model by leveraging the strengths of multiple models and diverse feature representations, resulting in robust and generalized predictions. Experimental results on two datasets further confirm the effectiveness of ET-GNN. The study highlights the potential of combining different Transformer-based GCN models to advance AES performance across various scoring rubrics. Future work will focus on applying this method to analytic scale scoring, which allows for a more detailed evaluation of specific aspects of writing quality, including organization, ideas, sentence structure, word choice, cohesion, vocabulary, grammar, and conventions.

REFERENCES
[1] H. West, G. Malcolm, S. Keywood, and J. Hill, "Writing a successful essay," J. Geography Higher Educ., vol. 43, no. 4, pp. 609–617, Aug. 2019.
[2] K. Zupanc and Z. Bosnić, "Increasing accuracy of automated essay grading by grouping similar graders," in Proc. 8th Int. Conf. Web Intell., Mining Semantics, Jun. 2018, pp. 1–6.
[3] M. Uto and M. Okano, "Robust neural automated essay scoring using item response theory," in Proc. Int. Conf. Artif. Intell. Educ. Cham, Switzerland: Springer, Jan. 2020, pp. 549–561.
[4] M. Uto, "A review of deep-neural automated essay scoring models," Behaviormetrika, vol. 48, no. 2, pp. 459–484, Jul. 2021.
[5] D. Ramesh and S. K. Sanampudi, "An automated essay scoring systems: A systematic literature review," Artif. Intell. Rev., vol. 55, no. 3, pp. 2495–2527, Mar. 2022.
[6] X. Zhou, L. Yang, X. Fan, G. Ren, Y. Yang, and H. Lin, "Self-training vs pre-trained embeddings for automatic essay scoring," in Proc. China Conf. Inf. Retr. Cham, Switzerland: Springer, Jan. 2021, pp. 155–167.
[7] J. S. Tan and I. K. T. Tan, "Feature group importance for automated essay scoring," in Proc. Int. Conf. Multi-Disciplinary Trends Artif. Intell. Cham, Switzerland: Springer, Jan. 2021, pp. 58–70.
[8] M. Beseiso and S. Alzahrani, "An empirical analysis of BERT embedding for automated essay scoring," Int. J. Adv. Comput. Sci. Appl., vol. 11, no. 10, pp. 204–210, 2020.
[9] L. Yao, C. Mao, and Y. Luo, "Graph convolutional networks for text classification," in Proc. AAAI Conf. Artif. Intell., 2019, vol. 33, no. 1, pp. 7370–7377.
[10] J. Xue, X. Tang, and L. Zheng, "A hierarchical BERT-based transfer learning approach for multi-dimensional essay scoring," IEEE Access, vol. 9, pp. 125403–125415, 2021.
[11] H. Cai, V. W. Zheng, and K. C. Chang, "A comprehensive survey of graph embedding: Problems, techniques, and applications," IEEE Trans. Knowl. Data Eng., vol. 30, no. 9, pp. 1616–1637, Sep. 2018.
[12] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," in Proc. Int. Conf. Learn. Represent. (ICLR), 2017. [Online]. Available: https://arxiv.org/abs/1609.02907
[13] X. Liu, X. You, X. Zhang, J. Wu, and P. Lv, "Tensor graph convolutional networks for text classification," in Proc. AAAI Conf. Artif. Intell., Apr. 2020, vol. 34, no. 5, pp. 8409–8416.
[14] N. A. Khayi and V. Rus, "Graph convolutional networks for student answers assessment," in Proc. 23rd Int. Conf., Brno, Czech Republic. Cham, Switzerland: Springer, Sep. 2020, pp. 532–540.
[15] H. Tan, C. Wang, Q. Duan, Y. Lu, H. Zhang, and R. Li, "Automatic short answer grading by encoding student responses via a graph convolutional network," Interact. Learn. Environ., vol. 31, no. 3, pp. 1636–1650, Apr. 2023.
[16] J. Ma, X. Li, M. Chen, and W. Yang, "Enhanced hierarchical structure features for automated essay scoring," in Proc. China Conf. Inf. Retr. Cham, Switzerland: Springer, 2021, pp. 168–179.
[17] X. She, J. Chen, and G. Chen, "Joint learning with BERT-GCN and multi-attention for event text classification and event assignment," IEEE Access, vol. 10, pp. 27031–27040, 2022.
[18] W. Gao and H. Huang, "A gating context-aware text classification model with BERT and graph convolutional networks," J. Intell. Fuzzy Syst., vol. 40, no. 3, pp. 4331–4343, Mar. 2021.
[19] Z. Lu, P. Du, and J. Nie, "VGCN-BERT: Augmenting BERT with graph embedding for text classification," in Proc. Eur. Conf. Inf. Retr. Cham, Switzerland: Springer, Jan. 2020, pp. 369–382.
[20] B. AlBadani, R. Shi, J. Dong, R. Al-Sabri, and O. B. Moctard, "Transformer-based graph convolutional network for sentiment analysis," Appl. Sci., vol. 12, no. 3, p. 1316, Jan. 2022.
[21] D. Xibin, Z. Yu, W. Cao, Y. Shi, and Q. Ma, "A survey on ensemble learning," Frontiers Comput. Sci., vol. 14, no. 2, pp. 241–258, Aug. 2019.
[22] S. Abimannan, E.-S.-M. El-Alfy, Y.-S. Chang, S. Hussain, S. Shukla, and D. Satheesh, "Ensemble multifeatured deep learning models and applications: A survey," IEEE Access, vol. 11, pp. 107194–107217, 2023.
[23] M. S. Matena and C. A. Raffel, "Merging models with Fisher-weighted averaging," in Proc. Adv. Neural Inf. Process. Syst., vol. 35, 2022, pp. 17703–17716.
[24] S. Jaradat, R. Nayak, A. Paz, and M. Elhenawy, "Ensemble learning with pre-trained transformers for crash severity classification: A deep NLP approach," Algorithms, vol. 17, no. 7, p. 284, Jun. 2024.
[25] D. S. V. Madala, A. Gangal, S. Krishna, A. Goyal, and A. Sureka, "An empirical analysis of machine learning models for automated essay grading," PeerJ Preprints, Jan. 2018, doi: 10.7287/peerj.preprints.3518v1.
[26] S. Sharma and A. Goyal, "Automated essay grading: An empirical analysis of ensemble learning techniques," in Computational Methods and Data Engineering, vol. 2. Berlin, Germany: Springer, 2020, pp. 343–362.
[27] Y. Salim, V. Stevanus, E. Barlian, A. C. Sari, and D. Suhartono, "Automated English digital essay grader using machine learning," in Proc. IEEE Int. Conf. Eng., Technol. Educ. (TALE), Dec. 2019, pp. 1–6.
[28] H. K. Janda, A. Pawar, S. Du, and V. Mago, "Syntactic, semantic and sentiment analysis: The joint effect on automated essay evaluation," IEEE Access, vol. 7, pp. 108486–108503, 2019.
[29] A. Doewes and M. Pechenizkiy, "Structural explanation of automated essay scoring," in Proc. EDM, 2020, pp. 758–761.
[30] Z. Chen, Y. Quan, and D. Qian, "Automatic essay scoring model based on multi-channel CNN and LSTM," in Proc. BenchCouncil Int. Federated Intell. Comput. Block Chain Conferences. Singapore: Springer, Jan. 2021, pp. 337–346.
[31] X. Li, M. Chen, J. Nie, Z. Liu, Z. Feng, and Y. Cai, "Coherence-based automated essay scoring using self-attention," in Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. Berlin, Germany: Springer, 2018, pp. 386–397.
[32] C. Cai, "Automatic essay scoring with recurrent neural network," in Proc. 3rd Int. Conf. High Perform. Compilation, Comput. Commun., Mar. 2019, pp. 1–7.
[33] W. Zhu and Y. Sun, "Automated essay scoring system using multi-model machine learning," in Proc. MLNLP, BDIOT, ITCCMA, CSITY, DTMN, AIFZ, SIGPRO, Oct. 2020, pp. 109–117.
[34] L. Xia, J. Liu, and Z. Zhang, "Automatic essay scoring model based on two-layer bi-directional long-short term memory network," in Proc. 3rd Int. Conf. Comput. Sci. Artif. Intell., Dec. 2019, pp. 133–137.
[35] P. Muangkammuen and F. Fukumoto, "Multi-task learning for automated essay scoring with sentiment analysis," in Proc. 1st Conf. Asia–Pacific Chapter Assoc. Comput. Linguistics 10th Int. Joint Conf. Natural Lang. Process., Student Res. Workshop, 2020, pp. 116–123.
[36] M. Chen and X. Li, "Relevance-based automated essay scoring via hierarchical recurrent model," in Proc. Int. Conf. Asian Lang. Process. (IALP), Nov. 2018, pp. 378–383.
[37] P. Wangkriangkri, C. Viboonlarp, A. T. Rutherford, and E. Chuangsuwanich, "A comparative study of pretrained language models for automated essay scoring with adversarial inputs," in Proc. IEEE REGION 10 Conf. (TENCON), Nov. 2020, pp. 875–880.
[38] J. Wang, "A study of scoring English tests using an automatic scoring model incorporating semantics," Autom. Control Comput. Sci., vol. 57, no. 5, pp. 514–522, Oct. 2023.
[39] A. Sharma, A. Kabra, and R. Kapoor, "Feature enhanced capsule networks for robust automatic essay scoring," in Proc. Joint Eur. Conf. Mach. Learn. Knowl. Discovery Databases. Cham, Switzerland: Springer, 2021, pp. 365–380.
[40] Y. Yang and J. Zhong, "Automated essay scoring via example-based learning," in Proc. Int. Conf. Web Eng. Cham, Switzerland: Springer, Jan. 2021, pp. 201–208.
[41] F. Li, X. Xi, Z. Cui, D. Li, and W. Zeng, "Automatic essay scoring method based on multi-scale features," Appl. Sci., vol. 13, no. 11, p. 6775, Jun. 2023.
[42] R. Yang, J. Cao, Z. Wen, Y. Wu, and X. He, "Enhancing automated essay scoring performance via fine-tuning pre-trained language models with combination of regression and ranking," in Proc. Findings Assoc. Comput. Linguistics, 2020, pp. 1560–1569.
[43] E. Mayfield and A. W. Black, "Should you fine-tune BERT for automated essay scoring?" in Proc. 15th Workshop Innov. Use NLP Building Educ. Appl., 2020, pp. 151–162.
[44] H. Susanto, A. A. S. Gunawan, and M. F. Hasani, "Development of automated essay scoring system using DeBERTa as a transformer-based language model," in Proc. Comput. Methods Syst. Softw. Cham, Switzerland: Springer, Jan. 2024, pp. 202–215.
[45] M. Beseiso, O. A. Alzubi, and H. Rashaideh, "A novel automated essay scoring approach for reliable higher educational assessments," J. Comput. Higher Educ., vol. 33, no. 3, pp. 727–746, Dec. 2021.
[46] H. Ren, W. Lu, Y. Xiao, X. Chang, X. Wang, Z. Dong, and D. Fang, "Graph convolutional networks in language and vision: A survey," Knowl.-Based Syst., vol. 251, Sep. 2022, Art. no. 109250.
[47] U. A. Bhatti, H. Tang, G. Wu, S. Marjan, and A. Hussain, "Deep learning with graph convolutional networks: An overview and latest applications in computational intelligence," Int. J. Intell. Syst., vol. 2023, no. 1, Jan. 2023, Art. no. 8342104.
[48] A. Gillioz, J. Casas, E. Mugellini, and O. A. Khaled, "Overview of the transformer-based models for NLP tasks," in Proc. 15th Conf. Comput. Sci. Inf. Syst. (FedCSIS), Sep. 2020, pp. 179–183.
[49] A. Vaswani, "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5998–6008.
[50] T. Schomacker and M. Tropmann-Frick, "Language representation models: An overview," Entropy, vol. 23, no. 11, p. 1422, Oct. 2021.
[51] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, "DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter," 2019, arXiv:1910.01108.
[52] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, "RoBERTa: A robustly optimized BERT pretraining approach," 2019, arXiv:1907.11692.
[53] P. He, X. Liu, J. Gao, and W. Chen, "DeBERTa: Decoding-enhanced BERT with disentangled attention," 2020, arXiv:2006.03654.
[54] B. Hamner, J. Morgan, L. VanderVelde, M. Shermis, and T. V. Ark. (2012). The Hewlett Foundation: Automated Essay Scoring. [Online]. Available: https://kaggle.com/competitions/asap-aes
[55] S. Crossley, P. Baffour, J. King, L. Burleigh, W. Reade, and M. Demkin. (2024). Learning Agency Lab—Automated Essay Scoring 2.0. [Online]. Available: https://kaggle.com/competitions/learning-agency-lab-automated-essay-scoring-2
[56] V. Ramnarain-Seetohul, V. Bassoo, and Y. Rosunally, "Similarity measures in automated essay scoring systems: A ten-year review," Educ. Inf. Technol., vol. 27, no. 4, pp. 5573–5604, May 2022.

HIND ALJUAID received the bachelor's degree in computer science from Taif University, Taif, Saudi Arabia. She is currently pursuing the master's degree in computer science with King Abdulaziz University, Jeddah, Saudi Arabia. Her research interests include artificial intelligence (AI), machine learning (ML), deep learning (DL), and natural language processing (NLP).

AREEJ ALHOTHALI received the Ph.D. degree in computer science, specializing in artificial intelligence, from the University of Waterloo, Canada, in 2017. She is currently an Associate Professor with the Faculty of Computer Science and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia. Her research interests include machine learning, deep learning, natural language processing, computer vision, intelligent agent systems, and affective computing.
OHOUD ALZAMZAMI received the Ph.D. degree in computer science from the College of Engineering and Computer Science, Florida Atlantic University, USA, in 2018. She is currently an Assistant Professor with the Department of Computer Science, Faculty of Computing and Information Technology, King Abdulaziz University. She actively contributes to the academic community by serving as a reviewer for several prestigious international journals affiliated with IEEE, Elsevier, Wiley, PLOS, and Springer. Her research interests include smart city applications and technologies, the Internet of Things, wireless and mobile networks, and vehicular networks.

HUSSEIN ASSALAHI received the Ph.D. degree in TESOL from the University of Exeter, U.K., in 2016. He is currently an Associate Professor of TESOL with the English Language Institute, King Abdulaziz University. His research interests include second language teacher education, language teacher professional development, professionalism, second language acquisition, and critical pedagogy.

TAHANI ALDOSEMANI is currently a Professor of educational technology. She is the Director of cultural higher education. Her previous roles include a Professor of educational technology with Prince Sattam bin Abdulaziz University, a member of the University's Council, and the Vice Dean of the Information Technology and Distance Education. She also served as an Advisor for the Minister of Education, focusing on e-learning and international cooperation. She also served as the Co-Chair for the G20 2020 Education Group. She has received several international awards and recognitions in educational research and has many publications in educational technology and digital transformation in education. She has implemented many successful initiatives in education, presented at conferences, seminars, and workshops.