K-mer-based Machine Learning Method to Classify Echinococcus granulosus s.l. Mitochondrial DNA Sequences
Enas Al-khlifeh1*, Ahmad B. Hassanat2, Suleyman A. AlShowarah2,3, Ahmad S. Tarawneh2, Lujain A. Alhasanat4 and Awni Hammouri2
1Applied Biology Department, Faculty of Science, Al-Balqa Applied University, Al-Salt, Jordan
2Faculty of Information Technology, Mutah University, Al-Karak, Jordan
3Software Engineering Department, Faculty of Science and Information Technology, Al-Zaytoonah University of Jordan, Jordan
4Faculty of Medicine, Mutah University, Karak, 61710, Jordan
There are ongoing debates regarding the taxonomy of the Echinococcus granulosus sensu lato (s.l.) complex, with the species status of genotypes G1/G3 (E. granulosus s.s.) and the G6–G10 cluster (E. canadensis) being particularly contested. To address this challenge, we developed a machine learning (ML)-based k-mer method to predict genotypes of E. granulosus s.l. from the mitochondrial DNA (mtDNA) sequence data available in GenBank.
We used 7-mer sequences that represent mtDNA to reveal untapped diversity across 948 sequences of seven known genotypes of E. granulosus s.l.
We evaluated the utility of varying k-mer lengths, including fixed vs. adaptive lengths, for sequence comparisons.
Principal component analysis (PCA) was used to identify the most discriminative patterns and to address the computational challenges posed by high-dimensional features. The resulting components were fed into five distinct ML algorithms, including both conventional methods (logistic regression [LR], random forest [RF] and support vector machine [SVM]) and deep learning (DL) methods (1D-CNN and LSTM).
Moreover, 10-fold cross-validation provided insights into the performance implications and efficiency improvements of each configuration.
Our method attains approximately 95% accuracy in predicting E. granulosus s.l. genotypes. High genetic diversity within G6 and G8, and within G3, contributes significantly to the controversial taxonomy of the E. canadensis cluster and E. granulosus s.s., respectively. Moreover, ML can potentially compete with BLAST (the NCBI sequence alignment tool) when the BLAST-Similarity-KNN classifier is implemented on the E. granulosus mtDNA data. This study provides a novel approach for classifying E. granulosus s.l. at the species level, thereby supporting disease control activities.
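To illustrate the k-mer representation mentioned in the abstract, the following is a minimal sketch of how a DNA sequence could be turned into a fixed-length 7-mer frequency vector before dimensionality reduction and classification. It is not the authors' code: the function names `kmer_counts` and `kmer_vector`, the skipping of windows that contain ambiguous bases, and the normalisation to relative frequencies are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def kmer_counts(seq, k=7):
    """Count overlapping k-mers in seq.

    Assumption: windows containing anything other than A, C, G or T
    (for example the ambiguity code N) are skipped.
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set(ALPHABET):
            counts[kmer] += 1
    return counts


def kmer_vector(seq, k=7):
    """Return relative k-mer frequencies in lexicographic k-mer order.

    The vector always has 4**k entries (16384 for k=7), so sequences of
    different lengths map to feature vectors of the same size.
    """
    counts = kmer_counts(seq, k)
    total = sum(counts.values())
    vocabulary = ("".join(p) for p in product(ALPHABET, repeat=k))
    return [counts[kmer] / total if total else 0.0 for kmer in vocabulary]


if __name__ == "__main__":
    # Toy sequence, not real E. granulosus data.
    toy = "ACGTACGTACGTTTGACA"
    vector = kmer_vector(toy, k=7)
    print(len(vector), sum(1 for v in vector if v > 0))
```

A vector of this size is sparse for short mtDNA fragments, which is one reason a reduction step such as PCA is commonly applied before training a classifier.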
To Cite This Article: Al-khlifeh A, Hassanat AB, AlShowarah SA, Tarawneh AS, Alhasanat LA and Hammouri A, 2026. K-mer-based machine learning method to classify Echinococcus granulosus s.l. mitochondrial DNA sequences. Pak Vet J, 46(4): 816-829. http://dx.doi.org/10.29261/pakvetj/2026.076