
Forensic Science International 357 (2024) 111975

Available online 2 March 2024

Systematic analyses of AISNPs screening and classification algorithms based on genome-wide data for forensic biogeographic ancestry inference

Meiming Cai, Fanzhang Lei, Man Chen, Qiong Lan, Xiaolian Wu, Chen Mao, Meisen Shi, Bofeng Zhu

Guangzhou Key Laboratory of Forensic Multi-Omics for Precision Identification, School of Forensic Medicine, Southern Medical University, Guangzhou, Guangdong, China

Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou, Guangdong, China

Department of Epidemiology, School of Public Health, Southern Medical University, Guangzhou, Guangdong, China

Criminal Justice College of China University of Political Science and Law, Beijing, China

ARTICLE INFO

Keywords:

Biogeographical ancestry

AISNP

Machine learning

Feature selection

ABSTRACT

Identifying the biogeographic ancestral origin of a biological sample left at a crime scene can provide important evidence in judicial cases, as well as clues for narrowing down suspects. Ancestry informative single nucleotide polymorphisms (AISNPs) have become one of the most important classes of genetic markers in recent years for screening ancestry informative loci and analyzing population genetic background and structure, owing to their high number and wide distribution in the human genome. In this study, based on data from 26 populations in the 1000 Genomes Project Phase 3, a Random Forest classification model was constructed with a one-vs-rest classification strategy for embedded feature selection in order to obtain a panel with a small number of efficient AISNPs. The research aim was to clarify the differentiation of population genetic structures among continents and among subregions of East Asia. ADMIXTURE results showed that, based on the 58 AISNPs selected by the machine learning algorithm, the 26 populations involved in the study could be categorized into six intercontinental ancestry components: North East Asia, South East Asia, Africa, Europe, South Asia, and America. The 24 continental-specific AISNPs and 34 East Asian-specific AISNPs finally obtained were used to construct ancestry prediction models with the XGBoost algorithm, resulting in Matthews correlation coefficients of 0.94 and 0.89, and accuracies of 0.94 and 0.92, respectively. The machine learning models that we constructed using population-specific AISNPs were able to accurately predict the ancestral origins of continental and intra-East Asian populations. To summarize, screening a set of high-performing AISNPs for biogeographical ancestry inference using embedded feature selection has potential application in creating a layered inference system that accurately differentiates populations from the intercontinental level down to local subpopulations.

1. Introduction

If the biogeographical ancestry of the person from whom a biological sample originated can be deduced, it becomes possible to provide key investigative clues for the case, narrow the pool of suspects, and clarify the direction of the investigation[1]. Ancestry inference calculates the ancestral composition of a population’s genetic structure, or infers the ancestral origin of an individual, by evaluating a series of ancestry informative markers (AIMs). Researchers have mostly performed ancestry inference with single nucleotide polymorphism (SNP) genetic markers, which offer several advantages: relatively high polymorphism, low mutation rates, and short amplified fragments, which are better suited to multiplex polymerase chain reaction (PCR) on capillary electrophoresis platforms and to successful genotyping of severely degraded biological materials. Many AISNP panels have been published previously, most of which could distinguish among intercontinental populations[2–6]. However, these previously published ancestry inference systems were less able to further subdivide East Asian populations. Constructing a hierarchical inference system that resolves ancestry from intercontinental populations down to localized subgroups is essential for future forensic genetic development. Nevertheless, the majority of SNP combinations were

* Corresponding authors.

E-mail addresses: maochen9@smu.edu.cn (C. Mao), shimeisen2000@163.com (M. Shi), zhubofeng7372@126.com (B. Zhu).


https://doi.org/10.1016/j.forsciint.2024.111975

Received 19 November 2023; Received in revised form 23 January 2024; Accepted 1 March 2024


derived by assessing the genetic differentiation index (F_ST) and the allele frequency difference (δ) values among populations. This approach focused on selecting individually optimal SNPs and was less effective for delineating closely related populations. Hence, it is a crucial task to develop machine learning algorithms that screen for the best combinations of AISNPs, and then to explore promising discriminative methods for forensic ancestry inference.

Machine learning algorithms excel at extracting loci with high ancestry inference efficiency from large-scale genome-wide datasets, owing to their capacity to handle high-dimensional data. Several feature selection techniques, using various models, have been developed to identify appropriate sets of features for population genetics data[7,8]. Feature selection eliminates irrelevant or redundant features, thereby reducing the number of features and improving model accuracy. Feature selection methods fall into three main categories: filter, wrapper, and embedded. In filter and wrapper methods, feature selection is a step separate from training the learner, whereas embedded feature selection integrates the two into a single optimization process, i.e., feature selection is carried out automatically while the learner is trained. Therefore, this study used the Random Forest (RF) algorithm with a one-vs-rest classification strategy for embedded feature selection. In the RF algorithm, each decision tree is trained on a random subset of features. By calculating the importance of each feature, its contribution to model performance can be evaluated. The one-vs-rest classification strategy transforms a multi-class problem into multiple binary classification problems, so that feature selection can be performed for each class. Furthermore, in order to explore an optimal method for constructing the discriminative model, this study built ancestry prediction models based on six different machine learning algorithms: K Nearest Neighbors (KNN), Linear Discriminant Analysis (LDA), Support Vector Machines (SVM), Neural Networks (NN), XGBoost, and RF.

In this study, we performed feature selection for distinguishing intercontinental and intra-East Asian populations based on the RF classification algorithm with the one-vs-rest classification strategy, obtaining a set of 58 AISNPs. Of these, 24 AISNPs and 34 AISNPs were used to infer biogeographic ancestry at the continental and intra-East Asian levels, respectively. In addition, six machine learning algorithms were used to construct predictive models for ancestry inference, and the performances of the six methods were evaluated by the f1 score, Matthews correlation coefficient (MCC), and area under the receiver operating characteristic curve (AUC-ROC).

2. Materials and methods

2.1. Sample set sources

This study used population data from the 1000 Genomes Project Phase 3[9], which consists of 26 populations with a total of 2504 individuals: five East Asian populations (EAS), five European populations (EUR), five South Asian populations (SAS), seven African populations (AFR), and four American populations (AMR). Detailed information on the 26 populations can be found in Supplementary Table 1.

2.2. Preliminary selection of AISNP loci

Initial screening of AISNP loci in the 1000 Genomes Project Phase 3 was performed on a genome-wide scale using PLINK 2.0 software (https://www.cog-genomics.org/plink/2.0/). The basic conditions for screening AISNP loci were as follows: (1) loci are autosomal SNP genetic markers showing biallelic polymorphism; (2) loci have minor allele frequencies greater than 0.01; (3) loci on the same chromosome are separated by physical distances greater than 10 Mb, and pairwise loci conform to linkage equilibrium (R² less than 0.4); (4) all selected loci conform to Hardy-Weinberg equilibrium (HWE) in all reference populations; (5) δ values are greater than 0.5 between pairwise intercontinental populations, and greater than 0.3 between AMR and the other populations and between the EUR and SAS populations; (6) δ values are greater than 0.25 between EAS populations; (7) for each population comparison, the 10 loci per chromosome with the highest δ values are selected. We also added some AISNP loci from previously published studies[5,10–12]. After removing loci duplicated between the two sources above and pairwise AISNP loci in linkage disequilibrium (LD), a total of 1750 AISNPs with potential for ancestry inference were finally obtained in this study.
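As an illustration of the δ criterion in conditions (5) and (6), the following minimal sketch computes pairwise allele frequency differences at a single locus and keeps the population pairs that exceed a threshold. It assumes δ is the absolute difference in the frequency of one allele between two populations and that genotypes are coded as allele counts (0, 1 or 2); the function names and the toy genotypes are hypothetical, and the study itself used PLINK 2.0 rather than code of this kind.

```python
from itertools import combinations


def allele_frequency(genotypes):
    """Frequency of the counted allele; each genotype is 0, 1 or 2 copies."""
    return sum(genotypes) / (2 * len(genotypes))


def pairwise_delta(genotypes_by_pop):
    """Absolute allele frequency difference (delta) for every population pair."""
    freqs = {pop: allele_frequency(g) for pop, g in genotypes_by_pop.items()}
    return {
        (a, b): abs(freqs[a] - freqs[b])
        for a, b in combinations(sorted(freqs), 2)
    }


# Hypothetical genotypes at one locus (not real 1000 Genomes data).
toy_locus = {
    "AFR": [2, 2, 1, 2, 2],  # allele frequency 0.9
    "EAS": [0, 0, 1, 0, 0],  # allele frequency 0.1
    "EUR": [1, 0, 1, 0, 1],  # allele frequency 0.3
}

deltas = pairwise_delta(toy_locus)

# Pairs for which this locus passes the intercontinental threshold (delta > 0.5).
passing_pairs = {pair for pair, d in deltas.items() if d > 0.5}
print(deltas)
print(passing_pairs)
```

In this toy case the locus separates AFR from both EAS and EUR but would not qualify for the EAS/EUR comparison, which is the kind of pair-specific behavior the thresholds above are designed to capture.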

2.3. Dimensionality reduction analysis and visualization

High-dimensional data visualization relies on dimensionality reduction algorithms, which generally fall into two categories: (1) algorithms such as principal component analysis (PCA) and multidimensional scaling (MDS), which are good at presenting global characteristics of the data; and (2) algorithms such as t-distributed stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP), which tend to preserve local structural features of the data. PCA is an unsupervised model: the group membership of each sample is unknown to the model, and the analysis is performed purely on the characteristics of the data. The MDS algorithm makes visualization and analysis more intuitive by reducing high-dimensional data to two or three dimensions. Both t-SNE and UMAP are commonly used nonlinear dimensionality reduction techniques for mapping high-dimensional data into low-dimensional spaces.

To clarify the approximate genetic distribution pattern of the 2504 individuals in the 1000 Genomes Project Phase 3 based on the selected 1750 AISNPs, PCA was performed at the individual level using PLINK 2.0. In addition, we performed MDS, t-SNE and UMAP dimensionality reduction analyses and visualization using the ‘stats’, ‘Rtsne’, ‘umap’ and ‘ggplot2’ R packages in R (v4.2.2; http://www.r-project.org/).
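The two-stage reduction described in the Results (t-SNE applied to the leading 10 principal components) can be sketched as follows. This is only an illustration under stated assumptions: it uses scikit-learn in Python as a stand-in for PLINK 2.0 and the R packages named above, the genotype matrix is simulated rather than taken from the 1000 Genomes Project, and the perplexity value is an arbitrary choice for the small toy sample.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Simulated stand-in for a genotype matrix: 3 hypothetical populations,
# 20 individuals each, 40 loci coded as allele counts 0/1/2.
pop_freqs = rng.uniform(0.05, 0.95, size=(3, 40))
pop_index = np.repeat(np.arange(3), 20)
genotypes = rng.binomial(2, pop_freqs[pop_index])

# Stage 1: linear reduction to the 10 leading principal components.
pcs = PCA(n_components=10, random_state=0).fit_transform(genotypes)

# Stage 2: nonlinear embedding of those PCs into two dimensions.
embedding = TSNE(
    n_components=2, perplexity=15, init="pca", random_state=0
).fit_transform(pcs)

print(pcs.shape, embedding.shape)
```

The resulting two-column embedding is what would be passed to a plotting routine, with points colored by population label.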

2.4. Feature selection

‘OneVsRestClassifier’ and ‘RandomForestClassifier’ from the scikit-learn library (Python v3.8) were used to construct multiclass classifiers for intercontinental and intra-EAS distinction, respectively. ‘GridSearchCV’ from scikit-learn was used to search for optimal Random Forest model parameters, including the number of trees, the tree depth, and the minimum number of leaf nodes. The feature importance of each classifier was determined, and the top 10 continent-specific loci and the top 20 East Asia-specific loci, in descending order of importance, were selected. Additionally, the ‘learning_curve’ function in scikit-learn was used to plot the learning curve of each classifier, illustrating the 10-fold cross-validated classification accuracy of the RF model when using different numbers of loci.
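The one-vs-rest embedded selection described above can be sketched in a few lines. The example below is a minimal illustration, not the authors' pipeline: the genotype data are simulated, the informative loci are planted by construction, the hyperparameters are fixed rather than tuned with ‘GridSearchCV’, and the value of `top_k` is an arbitrary stand-in for the top 10 or top 20 loci used in the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)
n_per_class, n_loci = 100, 30
class_names = np.array(["AFR", "EAS", "EUR"])

# Simulated allele frequencies: every locus is uninformative (0.5) except
# three planted loci per class (hypothetical, for illustration only).
freqs = np.full((3, n_loci), 0.5)
freqs[0, 0:3] = 0.95  # loci 0-2 are informative for AFR
freqs[1, 3:6] = 0.95  # loci 3-5 are informative for EAS
freqs[2, 6:9] = 0.95  # loci 6-8 are informative for EUR

class_index = np.repeat(np.arange(3), n_per_class)
X = rng.binomial(2, freqs[class_index])  # genotypes coded 0/1/2
y = class_names[class_index]

# One binary Random Forest per class (that class versus all others).
ovr = OneVsRestClassifier(
    RandomForestClassifier(n_estimators=200, random_state=0)
)
ovr.fit(X, y)

# Rank loci by importance within each binary classifier, keep the top k.
top_k = 3
top_loci = {}
for cls, forest in zip(ovr.classes_, ovr.estimators_):
    ranked = np.argsort(forest.feature_importances_)[::-1][:top_k]
    top_loci[str(cls)] = sorted(ranked.tolist())

# The final panel is the union of the class-specific loci, duplicates removed.
panel = sorted(set().union(*top_loci.values()))
print(top_loci)
print(panel)
```

Because each binary forest is trained to separate one class from the rest, its importance ranking is specific to that class, which is why the per-class lists can overlap and duplicates have to be removed when the panel is assembled.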

2.5. Modeling biogeographical ancestry inference, model testing and efficacy evaluation

Various classification algorithms were used in this study, including RF, SVM, XGBoost, LDA, KNN and NN. The corresponding classifiers were used in Python v3.8: ‘RandomForestClassifier’, ‘SVC’, ‘LinearDiscriminantAnalysis’, ‘KNeighborsClassifier’ and ‘MLPClassifier’ from the scikit-learn library, and ‘XGBClassifier’, which is provided by the xgboost package through a scikit-learn-compatible interface. All methods used the same dataset for multiclass classification. RF is an ensemble learning method that combines multiple decision trees to make predictions. It randomly selects subsets of the data to build each tree, and then aggregates the results from all trees to make the final prediction. Support Vector Classification (SVC) is an implementation of SVM that uses a subset of the training samples to construct a hyperplane, or a set of hyperplanes, for classification. XGBoost is an optimized gradient boosting framework that uses decision trees as base learners; it improves upon traditional gradient boosting by incorporating regularization and parallel processing. LDA can perform supervised dimensionality reduction by projecting the input data onto a linear subspace consisting of the directions that maximize the separation between classes. In classification problems, KNN finds the k nearest neighbors of a query point and uses their labels to make predictions. The NN model was implemented as a multilayer perceptron (MLP), which uses the backpropagation algorithm to learn a nonlinear function approximator for classification.

The dataset was randomly divided into a 70% training set and a 30% testing set. The training set was used to train the models and the testing set was used to independently evaluate their performance. The six classification methods were evaluated by f1 score, MCC, and AUC-ROC. In addition, all prediction models were tested using 10-fold cross validation, and the mean 10-fold cross-validation accuracy was also used as an indicator of the efficacy of each prediction model.

3. Results

3.1. Dimensionality reduction analyses for 1750 candidate AISNP loci

MDS, PCA, t-SNE and UMAP dimensionality reduction analyses were performed on the raw genotyping data of 2504 individuals at 1750 SNP loci (Fig. 1). Dots of the same color in the figure indicate individuals of the same continental origin, with red, green, purple, blue and yellow indicating individuals from AFR, AMR, EAS, EUR and SAS, respectively. The MDS plot (Fig. 1A) revealed that, in the space spanned by MDS1 and MDS2, the AFR, EAS, SAS and EUR individuals were clearly separated, whereas the AMR individuals clustered between the SAS and EUR individuals. The individual-level PCA of the five continental populations (Fig. 1B) showed that populations from the same geographic region clustered together, and the 26 populations were divided into five clusters. Since t-SNE and UMAP were applied as a secondary dimensionality reduction of the PCA output, both methods visualize the 10 principal components (PCs) obtained from the original analysis in a two-dimensional space. In the t-SNE plot (Fig. 1C), EAS was divided into three clusters: Japanese (JPT), Han Chinese (CHS and CHB), and CDX and KHV (Southern East Asian populations, SEAS). GIH was distinguished from the other South Asian populations, and West African individuals (ESN, GWD) in AFR were also distinguished in this plot. The distribution pattern in the UMAP plot (Fig. 1D) was roughly similar to that of t-SNE, but UMAP focuses on preserving the global structure, so the distances between individuals within clusters were small, making it difficult to distinguish subgroup structure. The dimensionality reduction results confirmed that the 1750 AISNPs could effectively discriminate between AFR, EUR, SAS, AMR and EAS populations, and also suggested that the selected loci would help further differentiate subgroups within EAS.

Fig. 1. The MDS (A), PCA (B), t-SNE (C) and UMAP (D) plots based on 1750 AISNP loci in 2504 individuals. AFR, AMR, EAS, EUR and SAS are labeled in red, green, purple, blue and yellow, respectively. CDX, Chinese Dai in Xishuangbanna; JPT, Japanese in Tokyo; KHV, Kinh in Ho Chi Minh City; CHB, Chinese Beijing Han; CHS, Chinese Southern Han; ACB, African Caribbean in Barbados; ASW, African Ancestry in Southwest USA; MSL, Mende in Sierra Leone; ESN, Esan in Nigeria; YRI, Yoruba in Ibadan; GWD, Gambian in Western Division; LWK, Luhya in Webuye, Kenya; FIN, Finnish in Finland; CEU, Utah residents with Northern and Western European ancestry; GBR, British in England and Scotland; IBS, Iberian populations in Spain; TSI, Toscani in Italy; ITU, Indian Telugu in the UK; PJL, Punjabi in Lahore; STU, Sri Lankan Tamil in the UK; BEB, Bengali in Bangladesh; GIH, Gujarati Indian in Houston; CLM, Colombian in Medellin; PUR, Puerto Rican in Puerto Rico; MXL, Mexican Ancestry in Los Angeles; PEL, Peruvian in Lima. AFR, African populations; SAS, South Asian populations; EUR, European populations; EAS, East Asian populations; AMR, American populations.

Fig. 2. (A) Learning curves based on the top 10 AISNP loci by feature importance in each classifier of the continental ancestry inference models, showing the 10-fold cross-validated classification accuracy of the RF classification model when different numbers of AISNP loci were used. (B) Learning curves based on the 20 AISNP loci with the highest feature importance in each classifier of the EAS ancestry inference model. (C) t-SNE dimensionality reduction based on raw genotyping data of 58 AISNP loci in 2504 individuals. Population abbreviations are as defined in Fig. 1.


3.2. Embedded feature selection based on random forest algorithm via multi-classification and one-vs-rest classification strategies

To obtain a small number of features for efficient classification and to reduce model overfitting, we performed embedded feature selection based on the RF algorithm using the multi-classification and one-vs-rest strategies in turn. First, 746 candidate AISNP loci were screened from the 1750 AISNP loci by embedded feature selection based on the RF model with the multi-classification strategy. Of these 746 loci, 357 showed potential for continental ancestry inference, while 432 had potential for EAS ancestry inference. To obtain fewer loci with stronger specificity, a one-vs-rest classification strategy was then applied to these two sets of loci to obtain the top 10 (continental-specific) and top 20 (EAS-specific) loci in terms of feature importance for each classifier. Learning curves were plotted to represent the 10-fold cross-validated classification accuracy of the RF model when using different numbers of loci (Fig. 2A and B). We selected 24 continental-specific loci, comprising seven AFR ancestry loci (99.32%), one AMR ancestry locus (88.74%), seven EAS ancestry loci (98.88%), six EUR ancestry loci (95.25%) and three SAS ancestry loci (92.21%), which reached an average accuracy of 95% for the five-continent classification. For the EAS-specific loci, we selected 12 Han Chinese ancestry loci (84.29%), 19 Japanese ancestry loci (97.22%) and 11 SEAS ancestry loci (93.47%); after removing duplicates, we finally obtained 34 EAS-specific loci, which achieved an average accuracy of 92% for the three-way EAS classification.

Details of the 58 AISNPs screened from the 1000 Genomes Project Phase 3 using machine learning algorithms are shown in Supplementary Table 2. Of these loci, 24 have already been reported[1,3–5,10–18], and the other 34 AISNPs are novel. Supplementary Tables 3 and 4 show the F_ST and δ values of each AISNP locus for the pairwise intercontinental populations and the pairwise East Asian populations, respectively. Among the 24 continental-specific loci, the F_ST and I_n statistics for the five continental populations were 0.1270–0.6218 and 0.0501–0.2739, respectively. The loci rs575377 (F_ST = 0.1270, I_n = 0.0501), rs513265 (F_ST = 0.2581, I_n = 0.1219), and rs2072053 (F_ST = 0.4875, I_n = 0.1995) have not been reported previously. At the intercontinental level, AFR versus the other continental populations (AMR, EAS, EUR, SAS) had 14, 16, 14, and 10 loci with δ > 0.3, respectively; EAS versus AMR, EUR and SAS had 11, 14, and 10 loci with δ > 0.3; EUR versus AMR and SAS had five and eight loci with δ > 0.3; and there were four loci with δ_SAS/AMR > 0.3. Among the 34 East Asian-specific loci, the F_ST and I_n statistics of the five East Asian populations were 0.0472–0.3388 and 0.0235–0.1624, respectively. Thirty-one of the East Asian-specific loci have not been reported previously. At the East Asian level, the δ_CHB/KHV, δ_CHB/CDX, and δ_CHS/CDX values of rs434124 exceeded 0.4, and the δ_JPT/CDX and δ_KHV/JPT values of rs11629323 were both greater than 0.5. The loci with the highest δ_KHV/CHS, δ_JPT/CHS, and δ_CHB/JPT values were rs149768401, rs543086096, and rs2920295, respectively.

To visualize the distribution of the genotyping data of the 58 AISNP loci in the 26 populations (Supplementary Table 5), we used the t-SNE method to reduce the high-dimensional data to two dimensions and plotted the result in a two-dimensional coordinate system (Fig. 2C). AFR, EAS, SAS and EUR were separated from each other by obvious gaps, whereas some AMR individuals overlapped with EUR and SAS. In addition, the figure shows that EAS was divided into three clusters, namely the Japanese, Han Chinese and SEAS clusters, from top to bottom along the t-SNE2 axis.

3.3. Machine learning model construction, testing and evaluation

To investigate whether the selected AISNP markers show sufficiently large genetic differentiation among the target populations, PCA was used and the sample distributions were displayed in two dimensions. First, individual-level PCA was performed on the genotyping data of the selected 24 AISNPs in the 26 reference populations of the 1000 Genomes Project. In Fig. 3A, the first two PCs explained 28.19% and 23.64% of the total variance of the genetic distributions among the intercontinental populations, respectively. In the space formed by PC1 and PC2, the red, green, purple, blue and yellow dots denote individuals from AFR, AMR, EAS, EUR and SAS, respectively. AFR individuals were distributed in the lower-left quadrant, EAS individuals clustered in the lower-right quadrant, SAS individuals were distributed in the center, and EUR individuals in the upper center. Populations from the same geographic region roughly clustered together, whereas some AMR individuals were distributed among the SAS, EAS and EUR individuals.

RF, SVM, XGBoost, LDA, KNN and NN methods were then used to build classification models for intercontinental biogeographic ancestry prediction. The best parameters of these six models are shown in Supplementary Table 6, and their performances are shown in Table 1 and Fig. 3B-D (confusion matrices and ROC curves). The XGBoost and RF models achieved the best classification performances, with both MCC and accuracy values of 0.94. Both models fully identified EAS, but the XGBoost model misidentified 1% of AFR, 7% of EUR and 5% of SAS individuals, and the RF model misidentified 2% of AFR, 5% of EUR and 4% of SAS individuals. The f1 scores of both models for AMR were 0.83. Comparing the AUC-ROC values (Table 1) shows that all six models had the best ancestry inference efficacy for EAS (AUC-ROC = 1), and that all models except KNN also reached this level for AFR (AUC-ROC = 1). The micro-averaged (Fig. 3C) and macro-averaged (Fig. 3D) values of the six models were also calculated, and the ROC curves were plotted. Macro-averaging and micro-averaging are two different methods for calculating metrics in multi-class classification problems. Macro-averaging computes the metric for each class and then averages the results, so that every class counts equally, while micro-averaging pools the decisions over all classes and therefore reflects overall performance[19]. As can be seen in Fig. 3C and D, the AUC-ROC values of the XGBoost and RF models were identical and the highest among the six models.
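The difference between the two averaging schemes is easiest to see on a small imbalanced example. The sketch below uses the f1 score rather than ROC curves, purely because it can be computed by hand from a confusion matrix; the pooling-versus-averaging distinction is the same. The confusion matrix and function names are hypothetical.

```python
def f1(tp, fp, fn):
    """f1 score from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(confusion):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    n = len(confusion)
    tps = [confusion[i][i] for i in range(n)]
    fps = [sum(confusion[r][i] for r in range(n)) - tps[i] for i in range(n)]
    fns = [sum(confusion[i]) - tps[i] for i in range(n)]
    # Macro: compute f1 per class, then take the unweighted mean.
    macro = sum(f1(tps[i], fps[i], fns[i]) for i in range(n)) / n
    # Micro: pool the counts over all classes, then compute a single f1.
    micro = f1(sum(tps), sum(fps), sum(fns))
    return macro, micro


# Hypothetical two-class result: a large class predicted well and a small
# class predicted poorly.
toy_confusion = [
    [90, 10],  # true class 0: 90 correct, 10 predicted as class 1
    [5, 5],    # true class 1: 5 predicted as class 0, 5 correct
]

macro_f1, micro_f1 = macro_micro_f1(toy_confusion)
print(round(macro_f1, 4), round(micro_f1, 4))
```

Here the macro average is pulled down by the poorly predicted small class, while the micro average stays close to the overall accuracy, which is why both are worth reporting when class sizes differ as they do for the AMR group.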

In addition, we performed individual-level PCA on the genotyping data of the 34 selected AISNPs in the 504 individuals of the EAS populations in the 1000 Genomes Project (Fig. 4A). The 504 individuals were colored according to their geographical origins, with red, green, purple, blue and yellow dots representing individuals from the CDX, CHB, CHS, JPT and KHV populations, respectively. EAS was categorized into three clusters, i.e., SEAS (CDX, KHV), Han Chinese (CHS, CHB), and JPT, with the Han individuals lying between JPT and SEAS. The 504 EAS individuals were therefore divided into three subgroups (SEAS, Han, JPT), and EAS ancestry inference models were constructed based on RF, SVM, XGBoost, LDA, KNN and NN classifiers. The best parameters of these six models are shown in Supplementary Table 6, and their performances are displayed in Table 2 and Fig. 4B-D (confusion matrices and ROC curves). The XGBoost model had the best ancestry inference efficacy, with an MCC value of 0.89 and the highest AUC-ROC values and f1 scores in all three subgroups. The XGBoost and RF models both achieved 92% prediction accuracy in EAS. In Fig. 4C and D, the XGBoost and RF models had the same and the highest AUC-ROC values (0.99) among the six models.

3.4. ADMIXTURE analysis based on selected 58 AISNP loci

Population genetic structure analysis can identify the components of subgroups within a population and the degree of genetic exchange between populations, and can also reveal human origin, migration, evolutionary history and background. ADMIXTURE analysis was performed on the genotyping data of the selected 58 AISNP loci in the 26 reference populations (Fig. 5). The genetic structures and cross-validation errors of the 26 reference populations were analyzed using ADMIXTURE software. Fig. 5A shows the results of the population structure analyses for K = 1–7 ancestral components. Each individual is represented by a vertical line divided into K colored segments, whose lengths correspond to the proportions of the ancestral components in that sample. At K = 3, the EAS and AFR populations were distinguished from the other intercontinental populations, and the SAS populations were further separated at K = 4. At K = 5, MXL and PEL from AMR could not be distinguished from each other and could be considered one cluster, dominated by the light blue ancestral component, while the other two American populations, CLM and PUR, exhibited a strong mixture of ancestral components, with a larger proportion of the European ancestral component. At K = 6, all individuals were assigned to six ancestral clusters: AFR, SAS, EUR, North EAS, South EAS and AMR. JPT, CHB and CHS displayed predominantly the North EAS ancestral component (light blue), while CDX and KHV showed a strong South EAS ancestral component (green).

Cross-validation error for each K value estimated by the ADMIX-

TURE software could be used to determine the optimal K value, and the

results were shown in Fig. 5B. The results of the cross-validation errors

suggested that the optimal K value was six (cross-validation

error=0.4686), i.e., when six ancestral components were assumed for the 26 populations, the structural differences among populations were best explained. We visualized the proportions of ancestral components of the EAS and AMR subgroups at the optimal K = 6 as a stacked plot (Fig. 5C). JPT was dominated by the North EAS ancestral component (0.7755), whereas CDX and KHV were dominated by the South EAS ancestral component (0.7345 and 0.6223, respectively). The CHB and CHS groups carried North EAS ancestral components of 0.5836 and 0.4703 and South EAS ancestral components of 0.2912 and 0.3840, respectively. The MXL and PEL groups mainly carried the AMR ancestry component (0.4578 and 0.6415), while CLM and PUR mainly carried the EUR ancestry component (0.5114 and 0.5209, respectively). The ADMIXTURE results further confirmed that the selected 58 AISNP loci were effective in distinguishing the five continental populations and also in distinguishing these EAS populations.

Fig. 3. PCA analysis at the continental level, results of the six model predictions, and efficacy assessments. (A) PCA based on the raw genotyping data of 2504 individuals at 24 AISNP loci. (B) Confusion matrices of the six prediction models. (C) Micro-averaged ROC curves of the six prediction models. (D) Macro-averaged ROC curves of the six prediction models.

Table 1
Performance of the six optimal models using 24 continental-specific AISNPs, measured in terms of MCC, accuracy, AUC-ROC and f1 score.

Method   MCC   Accuracy  AUC-ROC                       f1 score
                         AFR   AMR   EAS   EUR   SAS   AFR   AMR   EAS   EUR   SAS
RF       0.94  0.94      1.00  0.97  1.00  0.99  1.00  0.98  0.83  1.00  0.95  0.96
SVM      0.91  0.92      1.00  0.96  1.00  0.99  0.99  0.99  0.74  0.99  0.90  0.92
XGBoost  0.94  0.94      1.00  0.98  1.00  1.00  1.00  0.99  0.83  1.00  0.93  0.95
LDA      0.90  0.91      1.00  0.96  1.00  0.99  0.99  0.99  0.72  0.98  0.90  0.92
KNN      0.86  0.88      0.99  0.82  1.00  0.97  0.98  0.98  0.53  0.97  0.84  0.91
NN       0.90  0.92      1.00  0.96  1.00  0.99  0.99  0.98  0.68  0.98  0.91  0.91
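As a reading aid for the metrics reported in Tables 1 and 2, the sketch below computes overall accuracy, the multiclass MCC and per-class f1 scores from a confusion matrix. It is a minimal illustration in plain Python and not the authors' evaluation code: the function name and the toy confusion matrix are hypothetical, and the MCC uses the standard multiclass generalization of the coefficient.

```python
from math import sqrt


def multiclass_metrics(cm):
    """Return (accuracy, mcc, per_class_f1) for a square confusion matrix.

    cm[i][j] is the number of samples whose true class is i and whose
    predicted class is j.
    """
    k = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(k))
    true_counts = [sum(cm[i]) for i in range(k)]                       # row sums
    pred_counts = [sum(cm[i][j] for i in range(k)) for j in range(k)]  # column sums

    accuracy = correct / total

    # Multiclass Matthews correlation coefficient.
    numerator = correct * total - sum(t * p for t, p in zip(true_counts, pred_counts))
    denominator = sqrt(total ** 2 - sum(p * p for p in pred_counts)) * sqrt(
        total ** 2 - sum(t * t for t in true_counts)
    )
    mcc = numerator / denominator if denominator else 0.0

    # Per-class f1 = 2*TP / (2*TP + FP + FN) = 2*TP / (row sum + column sum).
    per_class_f1 = []
    for c in range(k):
        support = true_counts[c] + pred_counts[c]
        per_class_f1.append(2 * cm[c][c] / support if support else 0.0)

    return accuracy, mcc, per_class_f1


# Hypothetical 3-class confusion matrix, for illustration only.
toy_cm = [
    [48, 1, 1],
    [2, 45, 3],
    [0, 4, 46],
]
acc, mcc, f1 = multiclass_metrics(toy_cm)
print(f"accuracy={acc:.3f} mcc={mcc:.3f} f1={[round(v, 3) for v in f1]}")
```

A low per-class f1 next to a high overall accuracy, as seen for AMR in Table 1, shows why per-class values are worth reporting alongside the aggregate scores.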

Fig. 4. PCA analysis at the East Asian level, results of the six model predictions, and efficacy assessments. (A) PCA based on the raw genotyping data of 504 individuals from EAS populations at 34 AISNP loci. (B) Confusion matrices of the six prediction models. (C) Micro-averaged ROC curves of the six prediction models. (D) Macro-averaged ROC curves of the six prediction models.

Table 2
Performance of the six optimal models using 34 East Asian-specific AISNPs, measured in terms of MCC, accuracy, AUC-ROC and f1 score.

Method   MCC   Accuracy  AUC-ROC            f1 score
                         Han   JPT   SEAS   Han   JPT   SEAS
RF       0.86  0.92      0.97  1.00  0.99   0.90  0.91  0.91
SVM      0.79  0.87      0.93  0.98  0.97   0.85  0.91  0.85
XGBoost  0.89  0.92      0.97  0.99  0.99   0.92  0.92  0.95
LDA      0.84  0.86      0.95  0.99  0.97   0.89  0.93  0.88
KNN      0.69  0.80      0.85  0.92  0.94   0.79  0.79  0.80
NN       0.79  0.85      0.92  0.97  0.96   0.85  0.89  0.86

4. Discussion

Screening genetic markers with high ancestry inference efficacy and selecting an optimal small set of AIMs, so as to develop a biogeographic ancestry inference system with higher accuracy and practicality, is of great value for forensic ancestry inference. In this study, we obtained a small set of AISNP loci through genome-wide screening of AIMs, literature search and feature selection. This combination of 58 AISNP loci was not only capable of distinguishing intercontinental populations, but also showed good efficacy in discriminating East Asian populations. The dimensionality reduction analyses showed that the 1750 AISNP loci selected on the basis of the locus selection tool and the literature search could distinguish the five continental populations; the t-SNE analysis further showed that some of these loci could also distinguish the East Asian populations, namely JPT, SEAS and Han Chinese. Compared with the MDS, PCA and UMAP methods, t-SNE is more suitable for discovering local structures and clusters[7,20]. To remove redundant loci and obtain a small set of high-performing AISNPs, embedded feature selection was performed based on the RF model using the one-vs-rest classification strategy. We obtained a combination of 58 AISNP loci, which can not only be used for ancestry inference of intercontinental populations, but is also valuable for the fine differentiation of East Asian populations.
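The embedded feature selection described above can be sketched roughly as follows. This is a hedged illustration, assuming scikit-learn's `OneVsRestClassifier` wrapped around `RandomForestClassifier`; it is not the authors' pipeline. The helper name `top_loci_per_class`, the population labels `pop_A` to `pop_C`, the allele frequencies and the hyperparameters are all hypothetical, and the genotypes are simulated rather than taken from the 1000 Genomes Project.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier


def top_loci_per_class(X, y, n_top=10, seed=0):
    """Rank loci by RF feature importance separately for each one-vs-rest task.

    Returns a dict mapping each class label (as str) to the column indices of
    its n_top most important loci, in descending order of importance.
    """
    ovr = OneVsRestClassifier(
        RandomForestClassifier(n_estimators=100, random_state=seed)
    )
    ovr.fit(X, y)
    selected = {}
    # One binary forest is fitted per class; each has its own importances.
    for label, forest in zip(ovr.classes_, ovr.estimators_):
        order = np.argsort(forest.feature_importances_)[::-1]
        selected[str(label)] = order[:n_top].tolist()
    return selected


# Simulated genotypes (0, 1 or 2 copies of an allele): illustration only.
rng = np.random.default_rng(0)
n_per_class, n_loci = 60, 30
freqs = np.full((3, n_loci), 0.3)            # background allele frequency
freqs[0, 0] = freqs[1, 1] = freqs[2, 2] = 0.95  # one informative locus per class
X_sim = np.vstack(
    [rng.binomial(2, freqs[c], size=(n_per_class, n_loci)) for c in range(3)]
)
y_sim = np.repeat(["pop_A", "pop_B", "pop_C"], n_per_class)

selection = top_loci_per_class(X_sim, y_sim, n_top=3)
panel = sorted({locus for loci in selection.values() for locus in loci})
print(selection)
print("union of selected loci:", panel)
```

Taking the union of the per-class lists mirrors the idea of assembling one compact panel from class-specific rankings; how many loci to keep per class is a tuning choice that the sketch leaves as a parameter.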

According to the results of the ADMIXTURE analysis, this new combination yielded an optimal K value of six, i.e., it was able to divide the 26 populations from the 1000 Genomes Project into six ancestral components, thus distinguishing the East Asian and American subgroups. Among the 58 AISNP loci, 24 AISNPs were finally selected for continental ancestry inference, while 34 AISNPs were used for ancestry inference within the East Asian populations. Based on genotyping data of 26 populations from five continents in the 1000 Genomes Project, the PCA results demonstrated that the 24 AISNPs could distinguish AFR, EAS, EUR and SAS, but were less effective for AMR. The 34 AISNPs were able to better differentiate the East Asian populations, including Han, Japanese and SEAS. The locus screening method in this study is simple and easy to implement, and makes full use of the advantages of machine learning. In addition, this study chose RF, a classic and commonly used machine learning algorithm, as the classifier and achieved good results in ancestry inference for East Asian populations.
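Choosing the optimal K from ADMIXTURE cross-validation errors amounts to picking the K with the lowest reported error. The sketch below assumes log lines of the form `CV error (K=6): 0.4686`, as printed when ADMIXTURE is run with cross-validation enabled. The helper name `best_k` is hypothetical, and in the example log only the K = 6 value comes from this study; the values for K = 5 and K = 7 are made-up placeholders so that the example runs.

```python
import re

# Matches lines such as "CV error (K=6): 0.4686".
CV_LINE = re.compile(r"CV error \(K=(\d+)\):\s*([0-9.]+)")


def best_k(log_text):
    """Return (K with the lowest CV error, {K: error}) parsed from log text."""
    errors = {int(k): float(err) for k, err in CV_LINE.findall(log_text)}
    if not errors:
        raise ValueError("no 'CV error' lines found in the log text")
    return min(errors, key=errors.get), errors


# Only the K=6 value is taken from the text; K=5 and K=7 are placeholders.
example_log = """\
CV error (K=5): 0.4800
CV error (K=6): 0.4686
CV error (K=7): 0.4750
"""
k_opt, cv_errors = best_k(example_log)
print("optimal K:", k_opt, "errors:", cv_errors)
```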

Fig. 5. Population genetic structure analysis based on 26 reference population data. (A) 26 reference population structures with K = 2–7 based on 58 AISNP loci via ADMIXTURE analysis. (B) Cross-validation error obtained from the ADMIXTURE analysis. (C) Stacked plot of the proportions of population ancestry components at the optimal K value of 6. CDX, Chinese Dai in Xishuangbanna; JPT, Japanese in Tokyo; KHV, Kinh in Ho Chi Minh City; CHB, Chinese Beijing Han; CHS, Chinese Southern Han; ACB, African Caribbean in Barbados; ASW, African Ancestry in Southwest USA; MSL, Mende in Sierra Leone; ESN, Esan in Nigeria; YRI, Yoruba in Ibadan; GWD, Gambian in Western Division; LWK, Luhya in Webuye, Kenya; FIN, Finnish in Finland; CEU, Utah residents with Northern and Western European ancestry; GBR, British in England and Scotland; IBS, Iberian populations in Spain; TSI, Toscani in Italy; ITU, Indian Telugu in the UK; PJL, Punjabi in Lahore; STU, Sri Lankan Tamil in the UK; BEB, Bengali in Bangladesh; GIH, Gujarati Indian in Houston; CLM, Colombian in Medellin; PUR, Puerto Rican in Puerto Rico; MXL, Mexican Ancestry in Los Angeles; PEL, Peruvian in Lima. AFR, African populations; SAS, South Asian populations; EUR, European populations; EAS, East Asian populations; AMR, American populations.

Choosing a suitable classification algorithm to construct a biogeographic ancestry inference model can significantly improve the recognition accuracy. RF has high accuracy and robustness, can handle a large number of features and samples, and can evaluate the importance of features. SVM is effective in high-dimensional spaces, can deal with nonlinearly separable problems, and has strong generalization ability. XGBoost has high accuracy and robustness, can deal with large-scale data, and can automatically handle missing values. LDA is simple and easy to interpret, can deal with multi-class problems, and can reduce dimensionality. KNN is simple and easy to implement, and is suitable for multi-class problems. NN can learn complex nonlinear relationships, is suitable for large-scale data, and has strong expressive power. In this study, we combined the 24 AISNPs and the 34 AISNPs with six classification algorithms to construct classification models, and analyzed the discriminative ability of these two AISNP combinations for intercontinental and intra-East Asian populations, respectively. The results showed that the XGBoost and RF models had the highest efficacy among the six models. Both the XGBoost and RF models achieved prediction accuracies of 94% for intercontinental populations and 92% for intra-East Asian populations. The XGBoost and RF models may therefore be more suitable than the other models for biogeographic ancestry inference based on SNP genotypes.
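A comparison of this kind can be sketched with cross-validated accuracy, as below. This is a simplified illustration under stated assumptions, not the authors' code: it uses simulated genotypes, default or arbitrary hyperparameters, and only four scikit-learn classifiers (RF, SVM, LDA and KNN). XGBoost and the neural network are left out to keep the sketch short and limited to one library; the helper name `compare_classifiers` is hypothetical.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC


def compare_classifiers(X, y, n_splits=10, seed=0):
    """Return the mean k-fold cross-validated accuracy for each classifier."""
    models = {
        "RF": RandomForestClassifier(n_estimators=100, random_state=seed),
        "SVM": SVC(),
        "LDA": LinearDiscriminantAnalysis(),
        "KNN": KNeighborsClassifier(n_neighbors=5),
    }
    # The same stratified folds are reused for every model.
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return {
        name: float(cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean())
        for name, model in models.items()
    }


# Simulated genotypes with three informative loci per class: illustration only.
rng = np.random.default_rng(0)
n_per_class, n_loci = 50, 20
freqs = np.full((3, n_loci), 0.2)
for c in range(3):
    freqs[c, 3 * c:3 * c + 3] = 0.9
X_demo = np.vstack(
    [rng.binomial(2, freqs[c], size=(n_per_class, n_loci)) for c in range(3)]
)
y_demo = np.repeat(np.arange(3), n_per_class)

scores = compare_classifiers(X_demo, y_demo)
for name, accuracy in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {accuracy:.3f}")
```

Reusing identical folds for every model keeps the comparison fair; on real panels the ranking of classifiers depends on the loci and populations involved, so the simulated scores carry no meaning beyond showing the workflow.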

Compared with previous studies that distinguished continental populations[4,21], the present study used fewer AIMs (24 AISNPs) and was better able to distinguish four of the continental groups (all except AMR). Compared with similar studies on population substructure within a continental subregion (East Asia)[12–14], the AIMs (34 AISNPs) used in this study categorized East Asia into three major clusters, namely JPT, SEAS and Han Chinese, thus better differentiating the intra-East Asian populations. Although the model we constructed based on the screened AISNPs could predict the biogeographic ancestries of the 26 populations with relatively high accuracy, the actual genetic structures of populations in different regions are very complex due to historical migrations, population interactions and genetic drift. Therefore, it is necessary to use real samples from different regions and ethnic origins to study the differences in genetic structures among populations. In the future, we will construct a multiplex amplification system on the basis of the screened AISNPs and use real samples for further validation, so as to give it practical application value.

In summary, this study identified the optimal AISNP combinations and corresponding classification algorithms for identifying the five continental and intra-East Asian populations by analyzing the genotyping data of 1750 AISNP loci in 2504 individuals. We believe that our results could be beneficial for forensic biogeographical tracing of the individual source of on-site biomaterial and for related population genetic research. In this study, we introduced a new set of 34 AISNP loci that could genetically differentiate the East Asian populations with a good balance between discriminating power and efficiency. This combination would contribute to the development of a subgroup ancestry inference system in East Asia and further enhance the ability to recognize internal differentiation of the East Asian populations.

CRediT authorship contribution statement

Bofeng Zhu: Writing – review & editing, Writing – original draft.

Meiming Cai: Writing – review & editing, Writing – original draft,

Visualization. Fanzhang Lei: Writing – review & editing. Xiaolian Wu:

Writing – review & editing. Chen Mao: Writing – review & editing.

Meisen Shi: Writing – review & editing. Man Chen: Writing – review &

editing. Qiong Lan: Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work was supported by grants from the Qian Duansheng Distinguished Scholars Program of China University of Political Science and Law (No: 01140065140); the Cross-disciplinary Construction Project of Evidence Investigation (No: 10322308); and the National Key R&D Program of China (2022YFC3302004, 2022YFC3302004–1).

Appendix A. Supporting information

Supplementary data associated with this article can be found in the

online version at doi:10.1016/j.forsciint.2024.111975.

References

[1] X.Y. Jin, Y.X. Guo, C. Chen, W. Cui, Y.F. Liu, Y.C. Tai, B.F. Zhu, Ancestry prediction comparisons of different AISNPs for five continental populations and population structure dissection of the Xinjiang Hui group via a self-developed panel, Genes 11 (2020) 505.

[2] T. Frudakis, K. Venkateswarlu, M.J. Thomas, Z. Gaskin, S. Ginjupalli, S. Gunturi, V. Ponnuswamy, S. Natarajan, P.K. Nachimuthu, A classifier for the SNP-based inference of ancestry, J. Forensic Sci. 48 (2003) 771–782.

[3] C. Phillips, A. Salas, J.J. Sánchez, M. Fondevila, A. Gómez-Tato, J. Alvarez-Dios, M. Calaza, M.C. de Cal, D. Ballard, M.V. Lareu, A. Carracedo, Inferring ancestral origin using a single multiplex assay of ancestry-informative marker SNPs, Forensic Sci. Int. Genet. 1 (2007) 273–280.

[4] Y.L. Wei, L. Wei, L. Zhao, Q.F. Sun, L. Jiang, T. Zhang, H.B. Liu, J.G. Chen, J. Ye, L. Hu, C.X. Li, A single-tube 27-plex SNP assay for estimating individual ancestry and admixture from three continents, Int. J. Leg. Med. 130 (2016) 27–37.

[5] K.K. Kidd, W.C. Speed, A.J. Pakstis, M.R. Furtado, R. Fang, A. Madbouly, M. Maiers, M. Middha, F.R. Friedlaender, J.R. Kidd, Progress toward an efficient panel of SNPs for ancestry inference, Forensic Sci. Int. Genet. 10 (2014) 23–32.

[6] A.J. Pakstis, L. Kang, L. Liu, Z. Zhang, T. Jin, E.L. Grigorenko, F.R. Wendt, B. Budowle, S. Hadi, M.S. Al Qahtani, N. Morling, H.S. Mogensen, G.E. Themudo, U. Soundararajan, H. Rajeevan, J.R. Kidd, K.K. Kidd, Increasing the reference populations for the 55 AISNP panel: the need and benefits, Int. J. Leg. Med. 131 (2017) 913–917.

[7] E. Pilli, S. Morelli, B. Poggiali, E. Alladio, Biogeographical ancestry, variable selection, and PLS-DA method: a new panel to assess ancestry in forensic samples via MPS technology, Forensic Sci. Int. Genet. 62 (2023) 102806.

[8] S. Zhao, C.M. Shi, L. Ma, Q. Liu, Y. Liu, F. Wu, L. Chi, H. Chen, AIM-SNPtag: a computationally efficient approach for developing ancestry-informative SNP panels, Forensic Sci. Int. Genet. 38 (2019) 245–253.

[9] A. Auton, L.D. Brooks, R.M. Durbin, E.P. Garrison, H.M. Kang, J.O. Korbel, J.L. Marchini, S. McCarthy, G.A. McVean, G.R. Abecasis, A global reference for human genetic variation, Nature 526 (2015) 68–74.

[10] R. Kosoy, R. Nassir, C. Tian, P.A. White, L.M. Butler, G. Silva, R. Kittles, M.E. Alarcon-Riquelme, P.K. Gregersen, J.W. Belmont, F.M. De La Vega, M.F. Seldin, Ancestry informative marker sets for determining continental origin and admixture proportions in common populations in America, Hum. Mutat. 30 (2009) 69–78.

[11] G. He, J. Liu, M. Wang, X. Zou, T. Ming, S. Zhu, H.Y. Yeh, C. Wang, Z. Wang, Y. Hou, Massively parallel sequencing of 165 ancestry-informative SNPs and forensic biogeographical ancestry inference in three southern Chinese Sinitic/Tai-Kadai populations, Forensic Sci. Int. Genet. 52 (2021) 102475.

[12] S. Qu, J. Zhu, Y. Wang, L. Yin, M. Lv, L. Wang, H. Jian, Y. Tan, R. Zhang, Y. Liu, F. Li, S. Huang, W. Liang, L. Zhang, Establishing a second-tier panel of 18 ancestry informative markers to improve ancestry distinctions among Asian populations, Forensic Sci. Int. Genet. 41 (2019) 159–167.

[13] X.Y. Jin, Y.Y. Wei, Q. Lan, W. Cui, C. Chen, Y.X. Guo, Y.T. Fang, B.F. Zhu, A set of novel SNP loci for differentiating continental populations and three Chinese populations, PeerJ 7 (2019) e6508.

[14] C.X. Li, A.J. Pakstis, L. Jiang, Y.L. Wei, Q.F. Sun, H. Wu, O. Bulbul, P. Wang, L.L. Kang, J.R. Kidd, K.K. Kidd, A panel of 74 AISNPs: improved ancestry inference within Eastern Asia, Forensic Sci. Int. Genet. 23 (2016) 101–110.

[15] O. Bulbul, W.C. Speed, C. Gurkan, U. Soundararajan, H. Rajeevan, A.J. Pakstis, K.K. Kidd, Improving ancestry distinctions among Southwest Asian populations, Forensic Sci. Int. Genet. 35 (2018) 14–20.

[16] H.L. Hwa, C.P. Lin, T.Y. Huang, P.H. Kuo, W.H. Hsieh, C.Y. Lin, H.I. Yin, L.H. Tseng, J.C. Lee, A panel of 130 autosomal single-nucleotide polymorphisms for ancestry assignment in five Asian populations and in Caucasians, Forensic Sci. Med. Pathol. 13 (2017) 177–187.

[17] C.M. Nievergelt, A.X. Maihofer, T. Shekhtman, O. Libiger, X. Wang, K.K. Kidd, J.R. Kidd, Inference of human continental origin and admixture proportions using a highly discriminative ancestry informative 41-SNP panel, Invest. Genet. 4 (2013) 13.

[18] C. Phillips, A. Freire Aradas, A.K. Kriegel, M. Fondevila, O. Bulbul, C. Santos, F. Serrulla Rech, M.D. Perez Carceles, A. Carracedo, P.M. Schneider, M.V. Lareu, Eurasiaplex: a forensic SNP assay for differentiating European and South Asian ancestries, Forensic Sci. Int. Genet. 7 (2013) 359–366.

[19] Y. Yang, An evaluation of statistical approaches to text categorization, Inf. Retr. 1 (1999) 69–90.

[20] E. Alladio, B. Poggiali, G. Cosenza, E. Pilli, Multivariate statistical approach and machine learning for the evaluation of biogeographical ancestry inference in the forensic field, Sci. Rep. 12 (2022) 8974.

[21] X.Y. Jin, W. Cui, C. Chen, Y.X. Guo, Y.W. Tao, Q. Lan, T.T. Kong, B.F. Zhu, Biogeographic origin prediction of three continental populations through 42 ancestry informative SNPs, Electrophoresis 41 (2020) 235–245.

M. Cai et al.

Bấm Tải xuống để xem toàn bộ.

Preview text:

Forensic Science International 357 (2024) 111975
Contents lists available at ScienceDirect
Forensic Science International
journal homepage: www.elsevier.com/locate/forsciint
Systematic analyses of AISNPs screening and classification algorithms
based on genome-wide data for forensic biogeographic ancestry inference
Meiming Cai a, Fanzhang Lei a, Man Chen a, Qiong Lan a,b, Xiaolian Wu a, Chen Mao c,*,
Meisen Shi d,*, Bofeng Zhu a,*
a Guangzhou Key Laboratory of Forensic Multi-Omics for Precision Identification, School of Forensic Medicine, Southern Medical University, Guangzhou, Guangdong, China
b Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou, Guangdong, China
c Department of Epidemiology, School of Public Health, Southern Medical University, Guangzhou, Guangdong, China
d Criminal Justice College of China University of Political Science and Law, Beijing, China A R T I C L E I N F O A B S T R A C T Keywords:
Identifying the biogeographic ancestral origin of biological sample left at a crime scene can provide important Biogeographical ancestry
evidence for judicial case, as well as clue for narrowing down suspect. Ancestry informative single nucleotide AISNP
polymorphism (AISNP) has become one of the most important genetic markers in recent years for screening Machine learning
ancestry information loci and analyzing the population genetic background and structure due to their high Feature selection
number and wide distributions in the human genome. In this study, based on data from 26 populations in the
1000 Genomes Project Phase 3, a Random Forest classification model was constructed with one-vs-rest classi-
fication strategy for embedded feature selection in order to obtain a panel with a small number of efficient
AISNPs. The research aim was to clarify differentiations of population genetic structures among continents and
subregions of East Asia. ADMIXTURE results showed that based on the 58 AISNPs selected by the machine
learning algorithm, the 26 populations involved in the study could be categorized into six intercontinental
ancestry components: North East Asia, South East Asia, Africa, Europe, South Asia, and America. The 24
continental-specific AISNPs and 34 East Asian-specific AISNPs were finally obtained, and used to construct the
ancestry prediction model using XGBoost algorithm, resulting in the Matthews correlation coefficients of 0.94
and 0.89, and accuracies of 0.94 and 0.92, respectively. The machine learning models that we constructed using
population-specific AISNPs were able to accurately predict the ancestral origins of continental and intra-East
Asian populations. To summarize, screening a set of high-perform AISNPs to infer biogeographical ancestral
information using embedded feature selection has potential application in creating a layered inference system
that accurately differentiates from intercontinental populations to local subpopulations. 1. Introduction
polymorphisms, low mutation rates, and shorter amplified fragments
which are better suited for multiplex polymerase chain reaction (PCR)
If the biogeographical ancestor of the person from whom the bio-
based on capillary electrophoresis platform and successful genotyping of
logical samples originated can be deduced, it will be possible to provide
severely degraded biological materials. Many AISNP panels have been
key investigative clues for the case, lock in the scope of the suspect, and
published previously, most of which could distinguished among inter-
clarify the direction of the investigation[1]. Ancestry inference is to
continental populations[2–6]. However, the above previously published
calculation the ancestral composition of a population’s genetic struc-
detection systems of ancestry inference were less able to further subdi-
ture, or infer the ancestral origin of an individual by evaluating a series
vision of the East Asian populations. Constructing a hierarchical infer-
of ancestry informative markers (AIM). Researchers mostly performed
ence system to distinguish the accuracy from inferring intercontinental
ancestry inference research with single nucleotide polymorphism (SNP)
populations to localized subgroups is a must for future forensic genetic
genetic markers, which provide many advantages: relatively higher
development. Nevertheless, the majority of SNP combinations were * Corresponding authors.
E-mail addresses: maochen9@smu.edu.cn (C. Mao), shimeisen2000@163.com (M. Shi), zhubofeng7372@126.com (B. Zhu).
https://doi.org/10.1016/j.forsciint.2024.111975
Received 19 November 2023; Received in revised form 23 January 2024; Accepted 1 March 2024 Available online 2 March 2024
0379-0738/© 2024 Elsevier B.V. All rights reserved. M. Cai et
Forensic Science International al. 357 (2024) 111975
derived by assessing the genetic differentiation index (FST) and the allele
distances of the loci on the same chromosome are greater than 10 Mb
frequency difference (δ) values among populations. This approach
and the pairwise loci conform to linkage equilibrium (R2 less than 0.4);
focused more on the selection of single optimal SNP and was less
(4) all selected AISNP loci conform to the Hardy-Weinberg equilibrium
effective for near-population delineation. Hence, it is a crucial task to
(HWE) in all reference populations; (5) pairwise intercontinental pop-
develop machine learning algorithms to screen the best combinations of
ulations with δ values greater than 0.5; δ values greater than 0.3 between
AISNPs, and then explore promising discriminative methods for forensic
the AMR and the other populations, and between the EUR and SAS ancestry inference.
populations, (6) δ values are greater than 0.25 between EAS populations,
Machine learning algorithms excel at extracting those loci with high
(7) loci with δ values in the top 10 loci in descending order per chro-
ancestral information inference efficiency from large-scale genome-wide
mosome between the populations are selected. We also added some
datasets, due to their capacity to handle high-dimensional data. Several
AISNP loci from previously published studies[5,10–12]. Removing
feature selection techniques, using various models, have been developed
duplicate loci in two methods mentioned above and pairwise AISNP loci
to identify the appropriate set of features for population genetics data[7,
in linkage disequilibrium (LD), a total of 1750 AISNPs with potential for
8]. Feature selection is able to eliminate irrelevant or redundant fea-
ancestry information inference were finally obtained in this study.
tures, thus achieving the purpose of reducing the number of features and
improving the model accuracy. According to the form of feature selec-
2.3. Dimensionality reduction analysis and visualization
tion, it is mainly divided into three categories: filtering, packing and
embedding. The feature selection processes of filtering and packing al-
The theoretical basis for realizing high-dimensional data visualiza-
gorithms are obviously different from the learner training process, while
tion is based on dimensionality reduction algorithms. The dimension-
the embedded feature selection integrates the feature selection process
ality reduction algorithms are generally classified into two categories:
with the learner training process, and the two are completed in the same
(1) algorithms such as principal component analysis (PCA), multidi-
optimization process, i.e., feature selection is carried out automatically
mensional scaling (MDS), can be good for presenting global character-
in the learner training process. Therefore, this study was based on the
istics of the data; and (2) t-distributed stochastic neighbor embedding (t-
Random Forest (RF) algorithm using one-vs-rest classification strategy
SNE), uniform manifold approximation and projection (UMAP), tend to
for the embedded feature selection. In RF algorithm, each decision tree
preserve the local structural features of the data. The nature of PCA is an
was trained based on a random subset of features. By calculating the
unsupervised model, in the model the grouping of each sample is un-
importance of each feature, its degree of contribution to the model
known, and the analysis is performed purely on the basis of the char-
performance could be evaluated. And the one-vs-rest classification
acteristics of the data. The MDS algorithm makes the visualization and
strategy could transform a multi-category problem into multiple binary
analysis of the data more intuitive by downscaling the high-dimensional
classification problems, so that feature selection could be performed for
data into two or three dimensions. Both t-SNE and UMAP are commonly
each category. Furthermore, in order to explore an optimal method to
used nonlinear dimensionality reduction techniques for mapping high-
construct the discriminative model, this study built the inference pre-
dimensional data into low-dimensional spaces.
diction models based on six different machine learning algorithms,
In order to clarify the approximate genetic distribution pattern of
including the K Nearest Neighbors (KNN), Linear Discriminant Analysis
2504 individuals in the 1000 Genomes Project Phase 3 based on the
(LDA), Support Vector Machines (SVM), Neural Networks (NN),
selected 1750 AISNPs, the PLINK 2.0 software was used for PCA at the XGBoost, and RF algorithms.
individual level. In addition, we performed MDS, t-SNE and UMAP
In this study, we performed feature selection for distinguishing
dimensionality reduction analyses and visualization operations succes-
intercontinental and intra-East Asian populations based on the RF
sively using the ‘stats’, ‘Rtsne’, ‘umap’ and ‘ggplot2’ R packages on the R
classification algorithm using the one-vs-rest classification strategy to
software (v4.2.2; http://www.r-project.org/), respectively.
obtain a set of 58 AISNPs. The 24 AISNPs and 34 AISNPs were used to
infer biogeographic ancestry information at the continental and intra- 2.4. Feature selection
East Asian levels, respectively. In addition, six machine learning algo-
rithms were used to construct predictive models for ancestry informa-
‘OneVsRestClassifier’ and ‘RandomForestClassifier’ from the scikit-
tion inference, and the performances of the six methods were evaluated
learn library of python v3.8 software were used to construct multiclass
by the f1 score, Matthews correlation coefficient (MCC), and area under
classifiers for intercontinental and intra-EAS distinction purposes,
the receiver operating characteristic curve (AUC-ROC).
respectively. And the ‘GridSearchCV’ from scikit-learn library was also
used to search for optimal Random Forest model parameters. Then,
2. Materials and methods
adjust the parameters such as the number of trees, depth, and minimum
number of leaf nodes. Meanwhile, use ‘learning_curve’ function in scikit-
2.1. Sample set sources
learn library for learning curve plotting. The feature importance of each
classifier was determined, and the top 10 loci specific to the continent
This study used population data from the 1000 Genomes Project
and the top 20 loci specific to East Asia in descending order were
Phase 3[9], which consists of 26 populations with a total of 2504 in-
selected. Additionally, the learning curve of each classifier was plotted
dividuals. There were five East Asian populations (EAS), five European
to illustrate the 10-fold cross-validated classification accuracy of the RF
populations (EUR), five South Asian populations (SAS), seven African
model when using different numbers of loci.
populations (AFR), and four American populations (AMR). Detailed
information of the 26 populations could be found in Supplementary
2.5. Modeling biogeographical ancestry inference, model testing and Table 1. efficacy evaluation
2.2. Preliminary selection of AISNP loci
Various classification algorithms were used in this study, including
RF, SVM, XGBoost, LDA, KNN and NN. Corresponding classifiers were
Initial screening of AISNP loci in the 1000 Genomes Project Phase 3
introduced from the scikit-learn library of python v3.8 software,
was performed on a genome-wide scale using PLINK 2.0 software (http
including ‘RandomForestClassifier’, ‘SVC’, ‘XGBClassifier’, ‘Line-
s://www.cog-genomics.org/plink/2.0/). The basic conditions for the
arDiscriminantAnalysis’, ‘KNeighborsClassifier’ and ‘MLPClassifier’. All
screening of AISNP loci were as follows: (1) loci belong to autosomal
methods use the same dataset for multiclass classification. RF is an
SNP genetic markers and show biallelic polymorphisms; (2) AISNP loci
ensemble learning method that combines multiple decision trees to
with minimum allele frequencies greater than 0.01; (3) the physical
make predictions. It randomly selects subsets of data to build each tree, 2 M. Cai et
Forensic Science International al. 357 (2024) 111975
and then aggregates the results from all trees to make the final predic-
fold cross validation accuracy was also used as an indicator to assess the
tion. Support Vector Classification (SVC) is an implementation of SVM, a
efficacy of each prediction model.
support vector classifier that uses some training samples to construct a
hyperplane or a set of hyperplanes which can be used for classification. 3. Results
XGBoost is an optimized gradient boosting framework that uses decision
trees as base learners. And it improves upon traditional gradient
3.1. Dimensionality reduction analyses for 1750 candidate AISNP loci
boosting by incorporating regularization technique and parallel pro-
cessing. LDA can perform supervised dimensionality reduction by pro-
The MDS, PCA, t-SNE and UMAP dimensionality reduction analyses
jecting the input data into a linear subspace consisting of directions
based on the raw genotyping data of 2504 individuals at 1750 SNP loci
which maximize the separation between classes. In classification prob-
were performed, which were shown in Fig. 1. Dots of the same color in
lems, KNN is based on the idea of finding the k nearest neighbors to a
the figure indicated different individuals of the same continental origin,
query point and using their labels to make predictions. A multilayer
with red, green, purple, blue and yellow indicating individuals from
perceptron (MLP) method has been implemented in NN, which uses
AFR, AMR, EAS, EUR and SAS, respectively. The MDS plot (Fig. 1A)
backpropagation algorithm to generate a nonlinear function approx-
revealed that in the dimensional space constituted by MDS1 and MDS2, imator for classification.
the significant separation occurred between the AFR, EAS, SAS and EUR
The dataset in this study was randomly divided into 70% training set
individuals. In contrast, the AMR clustered between SAS and EUR in-
and 30% testing set. The training set was used to train the model and the
dividuals. The results of the PCA analysis of the five continental pop-
testing data was used to independently evaluate the performance. The
ulations at the individual level (Fig. 1B) showed that populations from
six classification methods were evaluated by f1 score, MCC, and AUC-
the same geographic region clustered together and these 26 populations
ROC, respectively. In addition, all prediction models were tested using
were divided into five clusters. Since the t-SNE and UMAP methods are
10-fold cross validation, and the statistically obtained mean value of 10-
secondary dimensionality reduction treatment of PCA analysis, both
Fig. 1. The MDS (A), PCA (B), t-SNE (C) and UMAP (D) plots on basis of 1750 AISNP loci data in 2504 individuals. The AFR, AMR, EAS, EUR and SAS were labeled in
red, green, purple, blue and yellow, respectively. CDX, Chinese Dai in Xishuangbanna; JPT, Japanese in Tokyo; KHV, Kinh in Ho Chi Minh City; CHB, Chinese Beijing
Han; CHS, Chinese Southern Han; ACB, African Caribbean in Barbados; ASW, African Ancestry in Southwest USA; MSL, Mende in Sierra Leone; ESN, Esan in Nigeria;
YRI, Yoruba in Ibadan; GWD, Gambian in Western Division; LWK, Luhya in Webuye, Kenya; FIN, Finnish in Finland; CEU, Utah residents with Northern and Western
European ancestry; GBR, British in England and Scotland; IBS, Iberian populations in Spain; TSI, Toscani in Italy; ITU, Indian Telugu in the UK; PJL, Punjabi in
Lahore; STU, Sri Lankan Tamil in the UK; BEB, Bengali in Bangladesh; GIH, Gujarati Indian in Houston; CLM, Colombian in Medellin; PUR, Puerto Rican in Puerto
Rico; MXL, Mexican Ancestry in Los Angeles; PEL, Peruvian in Lima. AFR, African populations; SAS, South Asian populations; EUR, European populations; EAS, East
Asian populations; AMR, American populations. 3 M. Cai et
Forensic Science International al. 357 (2024) 111975
methods can effectively visualize the 10 principal components (PCs) obtained from the original analysis in a two-dimensional space. In the t-SNE plot (Fig. 1C), EAS was divided into three clusters: Japanese (JPT), Han Chinese (CHS and CHB), and CDX and KHV (Southern East Asian populations, SEAS). GIH was also distinguished from the other South Asian populations, and West African individuals (ESN, GWD) in AFR were likewise distinguished in this plot. The distribution pattern in the UMAP plot (Fig. 1D) was roughly similar to that of t-SNE, but UMAP focuses on preserving the global structure, and thus the distances between individuals within the clusters were small, making it difficult to distinguish the subgroup structure. The results of the dimensionality reduction analyses confirmed that the 1750 AISNPs could effectively discriminate between the AFR, EUR, SAS, AMR and EAS populations, and also suggested that the selected loci would help further differentiate subgroups in EAS.
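The PCA step behind Fig. 1B can be illustrated with a minimal sketch. It assumes genotypes coded as 0/1/2 allele counts and uses a pure-Python power iteration in place of a library PCA routine; the toy genotype matrix, the two groups and all function names are hypothetical and are not taken from the study. The t-SNE and UMAP steps, which take the leading PCs as input, are not shown.

```python
import random

def center_columns(X):
    """Subtract each locus mean so that every column has mean zero."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[x - m for x, m in zip(row, means)] for row in X]

def covariance(Xc):
    """Sample covariance matrix (loci x loci) of centered data."""
    n = len(Xc)
    cols = list(zip(*Xc))
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in cols]
            for ci in cols]

def top_components(C, k, iters=500, seed=0):
    """Leading k eigenpairs of C by power iteration with deflation."""
    rng = random.Random(seed)
    C = [row[:] for row in C]  # work on a copy, since deflation modifies it
    pairs = []
    for _ in range(k):
        v = [rng.uniform(-1.0, 1.0) for _ in C]
        for _ in range(iters):
            w = [sum(a * b for a, b in zip(row, v)) for row in C]
            length = sum(x * x for x in w) ** 0.5
            if length == 0.0:
                break
            v = [x / length for x in w]
        # Rayleigh quotient gives the eigenvalue for the unit vector v.
        lam = sum(vi * sum(a * b for a, b in zip(row, v))
                  for vi, row in zip(v, C))
        pairs.append((lam, v))
        # Remove the found component before searching for the next one.
        for i in range(len(C)):
            for j in range(len(C)):
                C[i][j] -= lam * v[i] * v[j]
    return pairs

# Hypothetical toy data: 6 individuals x 4 loci, genotypes coded 0/1/2.
genotypes = [
    [2, 2, 0, 0], [2, 1, 0, 0], [2, 2, 0, 1],   # made-up group A
    [0, 0, 2, 2], [0, 0, 2, 1], [0, 1, 2, 2],   # made-up group B
]
Xc = center_columns(genotypes)
C = covariance(Xc)
pairs = top_components(C, k=2)
total_var = sum(C[i][i] for i in range(len(C)))
explained = [lam / total_var for lam, _ in pairs]   # variance share per PC
scores = [[sum(x * v for x, v in zip(row, vec)) for _, vec in pairs]
          for row in Xc]                            # PC1/PC2 coordinates
```

In this toy example the two groups fall on opposite sides of PC1, which is the kind of separation the individual-level PCA plots are read for. The `explained` list corresponds to the proportion of total variance attributed to each PC.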
Fig. 2. (A) Learning curves based on the top 10 AISNP loci of feature importance in each classifier of the continental ancestry inference models were plotted to
represent the 10-fold cross-validated classification correctness when different numbers of AISNP loci were used for the RF classification model. (B) Learning curves
were drawn based on the 20 AISNP loci with the highest feature importance in each classifier of the EAS ancestry inference model. (C) The t-SNE downscaling
analysis was performed based on raw genotyping data of 58 AISNP loci in 2504 individuals. CDX, Chinese Dai in Xishuangbanna; JPT, Japanese in Tokyo; KHV, Kinh
in Ho Chi Minh City; CHB, Chinese Beijing Han; CHS, Chinese Southern Han; ACB, African Caribbean in Barbados; ASW, African Ancestry in Southwest USA; MSL,
Mende in Sierra Leone; ESN, Esan in Nigeria; YRI, Yoruba in Ibadan; GWD, Gambian in Western Division; LWK, Luhya in Webuye, Kenya; FIN, Finnish in Finland;
CEU, Utah residents with Northern and Western European ancestry; GBR, British in England and Scotland; IBS, Iberian populations in Spain; TSI, Toscani in Italy;
ITU, Indian Telugu in the UK; PJL, Punjabi in Lahore; STU, Sri Lankan Tamil in the UK; BEB, Bengali in Bangladesh; GIH, Gujarati Indian in Houston; CLM, Colombian
in Medellin; PUR, Puerto Rican in Puerto Rico; MXL, Mexican Ancestry in Los Angeles; PEL, Peruvian in Lima. AFR, African populations; SAS, South Asian populations; EUR, European populations; EAS, East Asian populations; AMR, American populations.
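The learning curves in Fig. 2A and B plot cross-validated classification correctness against the number of top-ranked loci. The idea can be sketched as follows, under loud assumptions: a nearest-centroid classifier is swapped in for the RF model to keep the example dependency-free, the importance ranking `ranked_loci` is assumed to be given, and the genotype data are synthetic. All names are hypothetical.

```python
import random

def nearest_centroid_predict(train_X, train_y, test_X):
    """Assign each test sample to the class with the closest mean genotype."""
    centroids = {}
    for label in sorted(set(train_y)):
        rows = [x for x, y in zip(train_X, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    preds = []
    for x in test_X:
        preds.append(min(
            centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])),
        ))
    return preds

def kfold_accuracy(X, y, k=10):
    """Fraction of samples classified correctly across k held-out folds."""
    correct = 0
    for fold in range(k):
        test_idx = [i for i in range(len(X)) if i % k == fold]
        train_idx = [i for i in range(len(X)) if i % k != fold]
        preds = nearest_centroid_predict(
            [X[i] for i in train_idx], [y[i] for i in train_idx],
            [X[i] for i in test_idx],
        )
        correct += sum(p == y[i] for p, i in zip(preds, test_idx))
    return correct / len(X)

def learning_curve(X, y, ranked_loci, k=10):
    """Cross-validated accuracy using the top 1, 2, ..., n ranked loci."""
    curve = []
    for n in range(1, len(ranked_loci) + 1):
        cols = ranked_loci[:n]
        X_top = [[row[j] for j in cols] for row in X]
        curve.append(kfold_accuracy(X_top, y, k))
    return curve

# Synthetic data: locus 0 separates the classes, loci 1-4 are pure noise.
rng = random.Random(0)
X, y = [], []
for i in range(40):
    label = "EAS" if i % 2 == 0 else "other"       # one-vs-rest style labels
    informative = 2 if label == "EAS" else 0
    noise = [rng.choice([0, 1, 2]) for _ in range(4)]
    X.append([informative] + noise)
    y.append(label)

ranked_loci = [0, 1, 2, 3, 4]   # assumed importance ranking, best first
curve = learning_curve(X, y, ranked_loci, k=10)
```

Reading such a curve from left to right shows where adding further loci stops improving accuracy, which is one way to justify a small final panel.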
3.2. Embedded feature selection based on random forest algorithm via multi-classification and one-vs-rest classification strategies

To generate a small number of features for efficient classification and to reduce model overfitting, we performed embedded feature selection based on the RF algorithm using the multi-classification and one-vs-rest strategies, respectively. First, 746 candidate AISNP loci were screened from the 1750 AISNP loci by embedded feature selection based on the RF model with the multi-classification strategy. Of these 746 loci, 357 showed potential for continental ancestry inference, while 432 showed potential for EAS ancestry inference. To further obtain fewer loci with stronger specificity, a one-vs-rest classification strategy was applied to the above two sets of specific AISNP loci to obtain the top 10 (each continental-specific) and top 20 (EAS-specific) loci in terms of feature importance for each classifier. Learning curves were plotted to represent the 10-fold cross-validated classification correctness of the RF model when different numbers of loci were used (Fig. 2A and B). We selected 24 continental-specific loci, including seven AFR ancestry loci (99.32%), one AMR ancestry locus (88.74%), seven EAS ancestry loci (98.88%), six EUR ancestry loci (95.25%) and three SAS ancestry loci (92.21%), which reached an average correct rate of 95% for the five-continent categorization. For the EAS-specific loci, we selected 12 Han Chinese ancestry loci (84.29%), 19 Japanese ancestry loci (97.22%) and 11 SEAS ancestry loci (93.47%); after removing duplicates, we finally obtained 34 EAS-specific loci, which achieved an average accuracy rate of 92% for the three EAS classifications.

Details of the 58 AISNPs screened from the 1000 Genomes Project Phase 3 using machine learning algorithms are shown in Supplementary Table 2. Of these loci, 24 have already been reported[1,3–5,10–18], and the other 34 AISNPs are novel loci. Supplementary Tables 3 and 4 show the F_ST and δ values of each AISNP locus for the pairwise intercontinental populations and the pairwise East Asian populations, respectively. Among the 24 continental-specific loci, the F_ST and I_n statistics for the five continental populations were 0.1270–0.6218 and 0.0501–0.2739, respectively. The loci rs575377 (F_ST=0.1270, I_n=0.0501), rs513265 (F_ST=0.2581, I_n=0.1219), and rs2072053 (F_ST=0.4875, I_n=0.1995) have not been reported previously. At the intercontinental level, AFR versus the other intercontinental populations (AMR, EAS, EUR, SAS) had 14, 16, 14, and 10 loci with δ > 0.3; EAS versus the other populations (AMR, EUR, SAS) had 11, 14, and 10 loci with δ > 0.3; and EUR versus the other populations (AMR, SAS) had five and eight loci with δ > 0.3, respectively. There were four loci with δ_SAS/AMR > 0.3. Among the 34 East Asian-specific loci, the F_ST and I_n statistics of the five East Asian populations were 0.0472–0.3388 and 0.0235–0.1624, respectively. Of these, 31 East Asian-specific loci have not been reported. At the East Asian level, the δ_CHB/KHV, δ_CHB/CDX, and δ_CHS/CDX values of rs434124 exceeded 0.4, and the δ_JPT/CDX and δ_KHV/JPT values of rs11629323 were both greater than 0.5. The loci with the highest δ_KHV/CHS, δ_JPT/CHS, and δ_CHB/JPT values were rs149768401, rs543086096, and rs2920295, respectively.

To visualize the distribution characteristics of the genotyping data of the 58 AISNP loci in the 26 populations (Supplementary Table 5), we used the t-SNE method to reduce the high-dimensional data to two dimensions and then plotted it in a two-dimensional coordinate system (Fig. 2C). AFR, EAS, SAS and EUR were separated from each other with obvious gaps, whereas some AMR individuals overlapped with EUR and SAS. In addition, the figure shows that EAS was divided into three clusters, namely the Japanese, Han Chinese and SEAS clusters, from top to bottom along the t-SNE2 axis.

3.3. Machine learning model construction, testing and evaluation

To investigate whether the selected AISNP genetic markers show sufficiently large genetic differentiation among the target populations under study, PCA was used, and the distributions of the samples were shown in two dimensions. First, individual-level PCA analysis was performed based on the genotyping data of the selected 24 AISNPs in the 26 reference populations of the 1000 Genomes Project. In Fig. 3A, the first two PCs explained 28.19% and 23.64% of the total variance of the genetic distributions among the intercontinental populations, respectively. In the space formed by PC1 and PC2, the red, green, purple, blue and yellow dots denote the individuals from AFR, AMR, EAS, EUR and SAS, respectively. The AFR individuals were distributed in the lower-left quadrant, the EAS individuals clustered in the lower-right quadrant, the SAS individuals were distributed in the center, and the EUR individuals were distributed in the upper center. Populations from the same geographic region roughly clustered together, whereas some AMR individuals were distributed among the SAS, EAS and EUR individuals.

Then the RF, SVM, XGBoost, LDA, KNN and NN methods were used to build classification models for intercontinental biogeographic ancestry prediction. The best parameters of these six models are shown in Supplementary Table 6. The performances of the six models are presented in Table 1 and Fig. 3B-D (confusion matrix and ROC curves). The XGBoost and RF models achieved the best classification performances, with both MCC and accuracy values of 0.94. The XGBoost and RF models could fully identify EAS, but the XGBoost model misidentified 1% of AFR, 7% of EUR and 5% of SAS, and the RF model misidentified 2% of AFR, 5% of EUR and 4% of SAS. The f1 scores of both models for AMR were 0.83. Comparing the AUC-ROC values (Table 1) shows that all six models had the best ancestry inference efficacy for EAS (AUC-ROC=1) and that, except for KNN, the other five models also had the best ancestry inference efficacy for AFR (AUC-ROC=1). In this study, the micro-averaging (Fig. 3C) and macro-averaging (Fig. 3D) values of the six models were also calculated, and the ROC curves were plotted. Macro-averaging and micro-averaging are two different methods used to calculate metrics in multi-category classification problems. Macro-averaging focuses on the performance of each category, while micro-averaging focuses on the overall performance[19]. As can be seen in Figs. 3C and D, the AUC-ROC values of the XGBoost and RF models were the same and the highest among the six models.

In addition, we performed individual-level PCA analysis based on the genotyping data of the 34 AISNPs in the 504 individuals of the EAS populations in the 1000 Genomes Project (Fig. 4A). The 504 individuals were categorized according to their geographical origins, represented by different colors: red, green, purple, blue and yellow dots represent the individuals from the CDX, CHB, CHS, JPT and KHV populations, respectively. EAS was categorized into three clusters, i.e., SEAS (CDX, KHV), Han Chinese (CHS, CHB), and JPT, with the Han individuals lying between JPT and SEAS. The 504 individuals originating from EAS were divided into three subgroups (SEAS, Han, JPT), and EAS ancestry inference models were constructed based on the RF, SVM, XGBoost, LDA, KNN and NN classifiers. The best parameters of these six models are shown in Supplementary Table 6. The performances of the six models are displayed in Table 2 and Fig. 4B-D (confusion matrix and ROC curves). The XGBoost model showed the best ancestry inference efficacy, with an MCC value of 0.89 and the highest AUC-ROC values and f1 scores in all three subgroups. Both the XGBoost and RF models achieved 92% prediction accuracy in EAS. In Figs. 4C and D, the XGBoost and RF models had the same and the highest AUC-ROC values (0.99) among the six models.

3.4. ADMIXTURE analysis based on selected 58 AISNP loci

Population genetic structure analysis can identify the components of subgroups within a population and the degree of genetic exchange between populations, and can also reveal human origin, migration, evolutionary history and background. ADMIXTURE analysis was performed based on the genotyping data of the selected 58 AISNP loci in the 26 reference populations in this study (Fig. 5). The genetic structures and cross-validation
Fig. 3. PCA analysis for the continental level, results of the six model predictions, and efficacy assessments. (A) PCA analysis based on the raw genotyping data from 2504 individuals at 24 AISNP loci. (B) Confusion matrix results of the six prediction models. (C) Micro-averaging ROC curves of the six prediction models. (D) Macro-averaging ROC curves of the six prediction models.

Table 1
Performance of the six optimal models using 24 continental-specific AISNPs. The results were measured in terms of AUC-ROC, MCC, accuracy and f1 score.

| Method | MCC | Accuracy | AUC-ROC AFR | AUC-ROC AMR | AUC-ROC EAS | AUC-ROC EUR | AUC-ROC SAS | f1 AFR | f1 AMR | f1 EAS | f1 EUR | f1 SAS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RF | 0.94 | 0.94 | 1.00 | 0.97 | 1.00 | 0.99 | 1.00 | 0.98 | 0.83 | 1.00 | 0.95 | 0.96 |
| SVM | 0.91 | 0.92 | 1.00 | 0.96 | 1.00 | 0.99 | 0.99 | 0.99 | 0.74 | 0.99 | 0.90 | 0.92 |
| XGBoost | 0.94 | 0.94 | 1.00 | 0.98 | 1.00 | 1.00 | 1.00 | 0.99 | 0.83 | 1.00 | 0.93 | 0.95 |
| LDA | 0.90 | 0.91 | 1.00 | 0.96 | 1.00 | 0.99 | 0.99 | 0.99 | 0.72 | 0.98 | 0.90 | 0.92 |
| KNN | 0.86 | 0.88 | 0.99 | 0.82 | 1.00 | 0.97 | 0.98 | 0.98 | 0.53 | 0.97 | 0.84 | 0.91 |
| NN | 0.90 | 0.92 | 1.00 | 0.96 | 1.00 | 0.99 | 0.99 | 0.98 | 0.68 | 0.98 | 0.91 | 0.91 |
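As a reading aid for the MCC and f1 columns of Table 1, the following minimal sketch computes a multiclass Matthews correlation coefficient and per-class f1 scores from true and predicted labels. It is a pure-Python illustration under stated assumptions: the multiclass MCC uses the standard generalization based on class totals, the six labels are made up, and the function names are hypothetical. It does not reproduce the study's evaluation code.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """f1 = 2*TP / (2*TP + FP + FN), computed separately for each class."""
    pairs = Counter(zip(y_true, y_pred))
    scores = {}
    for lab in sorted(set(y_true) | set(y_pred)):
        tp = pairs[(lab, lab)]
        fp = sum(n for (t, p), n in pairs.items() if p == lab and t != lab)
        fn = sum(n for (t, p), n in pairs.items() if t == lab and p != lab)
        denom = 2 * tp + fp + fn
        scores[lab] = 2 * tp / denom if denom else 0.0
    return scores

def multiclass_mcc(y_true, y_pred):
    """Multiclass MCC from the totals of true and predicted labels per class."""
    s = len(y_true)                                   # number of samples
    c = sum(t == p for t, p in zip(y_true, y_pred))   # correctly classified
    t_k = Counter(y_true)                             # true count per class
    p_k = Counter(y_pred)                             # predicted count per class
    labels = set(t_k) | set(p_k)
    numerator = c * s - sum(t_k[k] * p_k[k] for k in labels)
    spread_true = s * s - sum(t_k[k] ** 2 for k in labels)
    spread_pred = s * s - sum(p_k[k] ** 2 for k in labels)
    denom = (spread_true * spread_pred) ** 0.5
    return numerator / denom if denom else 0.0

# Made-up labels: one EUR individual is misclassified as AFR.
y_true = ["AFR", "AFR", "EAS", "EAS", "EUR", "EUR"]
y_pred = ["AFR", "AFR", "EAS", "EAS", "EUR", "AFR"]

f1 = per_class_f1(y_true, y_pred)
mcc = multiclass_mcc(y_true, y_pred)
```

A single misclassification lowers the f1 score of both the class that lost the sample and the class that wrongly gained it, while MCC summarizes the whole confusion matrix in one value.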
errors of the 26 reference populations were analyzed using ADMIXTURE software. Fig. 5A showed the results of the population structure analyses when the number of ancestors was K=1–7. Each individual was represented by a vertical line divided into K color segments, the lengths of which reflect the proportions of the ancestral components of the tested sample. At K=3, the EAS and AFR populations were distinguished from the other intercontinental populations, and the populations from SAS were further separated at K=4. At K=5, MXL and PEL from AMR could not be distinguished from each other and can be considered as one cluster, dominated by the light blue ancestral component, while the other two American populations, CLM and PUR, exhibited a strong mixture of ancestral components, with a larger proportion of the European ancestral component. At K=6, all the individuals were assigned to six ancestral clusters: AFR, SAS, EUR, North EAS, South EAS and AMR. JPT, CHB and CHS displayed predominantly the North EAS ancestral component (light blue), while CDX and KHV showed a strong South EAS ancestral component (green).

The cross-validation error estimated by the ADMIXTURE software for each K value can be used to determine the optimal K value, and the results are shown in Fig. 5B. The cross-validation errors suggested that the optimal K value was six (cross-validation error=0.4686), i.e., assuming six kinds of ancestral components for the 26 populations best explained the structural differences among populations. We visualized the proportions of the ancestral components of the EAS and AMR subgroups at the optimal K=6 as a stacked plot (Fig. 5C). It can be observed that JPT carried predominantly the North EAS ancestral component (0.7755), and CDX and KHV carried South EAS ancestral components of
Fig. 4. PCA analysis for the East Asian level, the results of six model predictions, and the efficacy assessment results. (A) PCA analysis based on the raw genotyping data of 34 AISNP loci in 504 individuals of EAS populations. (B) Confusion matrix results of the six prediction models. (C) Micro-averaging ROC curves of the six prediction models. (D) Macro-averaging ROC curves of the six prediction models.

Table 2
Performance of the six optimal models using 34 East Asian-specific AISNPs. The results were measured in terms of AUC-ROC, MCC, accuracy and f1 score.

| Method | MCC | Accuracy | AUC-ROC Han | AUC-ROC JPT | AUC-ROC SEAS | f1 Han | f1 JPT | f1 SEAS |
|---|---|---|---|---|---|---|---|---|
| RF | 0.86 | 0.92 | 0.97 | 1.00 | 0.99 | 0.90 | 0.91 | 0.91 |
| SVM | 0.79 | 0.87 | 0.93 | 0.98 | 0.97 | 0.85 | 0.91 | 0.85 |
| XGBoost | 0.89 | 0.92 | 0.97 | 0.99 | 0.99 | 0.92 | 0.92 | 0.95 |
| LDA | 0.84 | 0.86 | 0.95 | 0.99 | 0.97 | 0.89 | 0.93 | 0.88 |
| KNN | 0.69 | 0.80 | 0.85 | 0.92 | 0.94 | 0.79 | 0.79 | 0.80 |
| NN | 0.79 | 0.85 | 0.92 | 0.97 | 0.96 | 0.85 | 0.89 | 0.86 |
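The difference between the macro- and micro-averaged ROC curves in panels C and D can be made concrete with a minimal sketch. It assumes a one-vs-rest treatment of the three subgroups and computes AUC as the probability that a positive sample outscores a negative one. The class probabilities below are made up for illustration, and the function names are hypothetical.

```python
def binary_auc(labels, scores):
    """AUC as P(positive score > negative score), with ties counted as 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_micro_auc(y_true, proba, classes):
    """Per-class one-vs-rest AUC plus its macro and micro averages."""
    indicators = {c: [1 if t == c else 0 for t in y_true] for c in classes}
    columns = {c: [row[j] for row in proba] for j, c in enumerate(classes)}
    per_class = {c: binary_auc(indicators[c], columns[c]) for c in classes}
    # Macro: every class contributes equally, whatever its size.
    macro = sum(per_class.values()) / len(classes)
    # Micro: pool all (indicator, score) pairs and compute a single AUC.
    pooled_labels = [v for c in classes for v in indicators[c]]
    pooled_scores = [v for c in classes for v in columns[c]]
    micro = binary_auc(pooled_labels, pooled_scores)
    return per_class, macro, micro

classes = ["Han", "JPT", "SEAS"]
y_true = ["Han", "Han", "JPT", "JPT", "SEAS", "SEAS"]
# Made-up predicted probabilities, one row per individual (Han, JPT, SEAS).
proba = [
    [0.7, 0.2, 0.1],
    [0.4, 0.1, 0.5],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.3, 0.1, 0.6],
    [0.5, 0.1, 0.4],
]
per_class, macro, micro = macro_micro_auc(y_true, proba, classes)
```

With balanced classes the two averages tend to be close. They diverge mainly when class sizes differ, because pooling lets the larger classes dominate the micro average.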
0.7345 and 0.6223, respectively. In contrast, the CHB and CHS groups showed North EAS ancestral components of 0.5836 and 0.4703, and South EAS ancestral components of 0.2912 and 0.3840, respectively. The MXL and PEL groups mainly showed the AMR ancestry component (0.4578 and 0.6415, respectively), while CLM and PUR mainly showed the EUR ancestry component (0.5114 and 0.5209, respectively). The ADMIXTURE results further confirmed that the selected 58 AISNP loci had good efficacy in distinguishing the populations of the five continental origins, and also distinguished these EAS populations.

4. Discussion

How to screen genetic markers with high ancestry inference efficacy and select the optimal combination of a small set of AIMs to develop a biogeographic ancestry inference system with higher inference accuracy and practicality is of great value in the application of forensic ancestry inference. In this study, we obtained a small set of AISNP loci through genome-wide screening of AIMs, literature search and feature selection, and this combination of 58 AISNP loci was not only capable of distinguishing intercontinental populations, but also had good efficacy in discriminating East Asian populations. Based on the dimensionality reduction analyses, the 1750 AISNP loci selected on the basis of the loci selection tool and literature search could distinguish the five continental populations, and the t-SNE analysis showed that some of these loci could also distinguish the East Asian populations, namely JPT, SEAS and Han Chinese. Compared with MDS,
Fig. 5. Population genetic structure analysis based on 26 reference population data. (A) 26 reference population structures with K = 2–7 based on 58 AISNP loci via
ADMIXTURE analysis. (B) Cross validation error obtained based on the result of the ADMIXTURE analysis. (C) Stacked plot of the proportions of population ancestry
components at the optimal K value of 6. CDX, Chinese Dai in Xishuangbanna; JPT, Japanese in Tokyo; KHV, Kinh in Ho Chi Minh City; CHB, Chinese Beijing Han;
CHS, Chinese Southern Han; ACB, African Caribbean in Barbados; ASW, African Ancestry in Southwest USA; MSL, Mende in Sierra Leone; ESN, Esan in Nigeria; YRI,
Yoruba in Ibadan; GWD, Gambian in Western Division; LWK, Luhya in Webuye, Kenya; FIN, Finnish in Finland; CEU, Utah residents with Northern and Western
European ancestry; GBR, British in England and Scotland; IBS, Iberian populations in Spain; TSI, Toscani in Italy; ITU, Indian Telugu in the UK; PJL, Punjabi in
Lahore; STU, Sri Lankan Tamil in the UK; BEB, Bengali in Bangladesh; GIH, Gujarati Indian in Houston; CLM, Colombian in Medellin; PUR, Puerto Rican in Puerto
Rico; MXL, Mexican Ancestry in Los Angeles; PEL, Peruvian in Lima. AFR, African populations; SAS, South Asian populations; EUR, European populations; EAS, East
Asian populations; AMR, American populations.
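The choice of the optimal K from the cross-validation errors plotted in panel B can be sketched as follows. The sketch assumes that each run's log contains a line of the form `CV error (K=6): 0.4686`, which is how ADMIXTURE reports the error when cross-validation is enabled. The log text below is entirely made up for illustration and does not reproduce the study's values, and the function names are hypothetical.

```python
import re

# Assumed format of the cross-validation line in an ADMIXTURE log.
CV_LINE = re.compile(r"CV error \(K=(\d+)\): ([0-9.]+)")

def parse_cv_errors(log_text):
    """Map each K to its reported cross-validation error."""
    return {int(k): float(err) for k, err in CV_LINE.findall(log_text)}

def best_k(cv_errors):
    """Return the K with the smallest cross-validation error."""
    return min(cv_errors, key=cv_errors.get)

# Made-up log excerpt concatenated from six hypothetical runs.
example_log = """
CV error (K=2): 0.60
CV error (K=3): 0.55
CV error (K=4): 0.51
CV error (K=5): 0.48
CV error (K=6): 0.47
CV error (K=7): 0.49
"""

cv_errors = parse_cv_errors(example_log)
optimal_k = best_k(cv_errors)
```

Picking the minimum is the simplest rule. When neighbouring K values have nearly identical errors, inspecting the structure plots for each K, as in panel A, remains useful.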
PCA and UMAP methods, t-SNE is more suitable for discovering local structures and clustering[7,20]. To remove redundant loci and obtain a small set of AISNPs with high performance, embedded feature selection was performed based on the RF model using the one-vs-rest classification strategy. We obtained a combination of 58 AISNP loci, which can not only be used for ancestry inference of intercontinental populations, but is also of important value for the fine differentiation of East Asian populations.

According to the results of the ADMIXTURE analysis, this new combination obtained an optimal K value of six, i.e., it was able to divide the 26 populations from the 1000 Genomes Project into six kinds of ancestral components, thus distinguishing the East Asian and American subgroups. Among the 58 AISNP loci, 24 AISNPs were finally screened for continental ancestry inference, while 34 AISNPs were used for ancestry inference within the East Asian populations. Based on the genotyping data of 26 populations in five continents from the 1000 Genomes Project, the results of the PCA analyses demonstrated that the 24 AISNPs could distinguish AFR, EAS, EUR and SAS, but were less effective for AMR. The 34 AISNPs were able to better differentiate between East Asian populations, including Han, Japanese and SEAS. The method of screening loci in this study is simple and easy to implement, and fully utilizes the advantages of machine learning. In addition, this study chose RF, a classic and commonly used machine learning algorithm, as the classifier and achieved good results in ancestry inference for East Asian populations.

Choosing a suitable classification algorithm to construct a biogeographic ancestry inference model can significantly improve the recognition accuracy. RF has high accuracy and robustness, can handle a large number of features and samples, and can evaluate the importance of features. SVM is effective in high-dimensional spaces, can deal with nonlinearly separable problems, and has strong generalization ability. XGBoost has high accuracy and robustness, can deal with large-scale data, and can automatically handle missing values. LDA is simple and easy to interpret, can deal with multi-category problems, and can reduce the dimensionality. KNN is simple and easy to implement, and is suitable for multi-category problems. NN can learn complex nonlinear relationships, is suitable for large-scale data, and has strong expressive power. In this study, we combined 24 AISNPs and 34 AISNPs
with six classification algorithms to construct classification models and analyzed the discriminative ability of these two combinations of AISNPs for intercontinental and intra-East Asian populations, respectively. The results showed that, compared with the other models, the XGBoost and RF models had higher efficacy. The XGBoost and RF models achieved 94% and 92% prediction accuracy for intercontinental and intra-East Asian populations, respectively. The XGBoost and RF models may be more suitable than the other models for biogeographic ancestry inference based on SNP genotypes.

Compared with previous studies that distinguished continental populations[4,21], the present study used fewer AIMs (24 AISNPs) and was better able to distinguish the four continents (except AMR). Compared with similar studies on population substructures within a continental subregion (East Asia)[12–14], the AIMs (34 AISNPs) used in this study categorized East Asia into three major clusters, namely JPT, SEAS and Han Chinese, thus better differentiating the intra-East Asian populations. Although the model we constructed based on the screened AISNPs could predict the biogeographic ancestries of the 26 populations with relative accuracy, the actual genetic structures of populations in different regions are very complex due to historical migrations, population interactions and genetic drift. Therefore, it is necessary to use real samples from different regions and ethnic origins to study the differences in genetic structures among populations. In the future, we will construct a multiplex amplification system on the basis of the screened AISNPs and use real samples for further validation, so as to give it practical application value.

In summary, this study identified the optimal AISNP combinations and corresponding classification algorithms for identifying the five continental and intra-East Asian populations by analyzing the genotyping data of 1750 AISNP loci in 2504 individuals. We believe that our results could be beneficial for forensic biogeographical tracing of the individual source of on-site biomaterial and for related population genetic research. In this study, we pioneered a set of 34 AISNP loci that could differentiate the East Asian populations with a good balance between discriminatory effectiveness and efficiency. This combination would contribute to the development of a subgroup ancestry inference system in East Asia and further enhance the ability to recognize internal differentiation of the East Asian populations.

CRediT authorship contribution statement

Bofeng Zhu: Writing – review & editing, Writing – original draft. Meiming Cai: Writing – review & editing, Writing – original draft, Visualization. Fanzhang Lei: Writing – review & editing. Xiaolian Wu: Writing – review & editing. Chen Mao: Writing – review & editing. Meisen Shi: Writing – review & editing. Man Chen: Writing – review & editing. Qiong Lan: Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work was supported by grants from the Qian duansheng Distinguished Scholars Program of China University of the Political Science and Law (No: 01140065140); Cross disciplinary construction project of evidence investigation (No: 10322308); National Key R&D Program of China (2022YFC3302004, 2022YFC3302004–1).

Appendix A. Supporting information

Supplementary data associated with this article can be found in the online version at doi:10.1016/j.forsciint.2024.111975.

References

[1] X.Y. Jin, Y.X. Guo, C. Chen, W. Cui, Y.F. Liu, Y.C. Tai, B.F. Zhu, Ancestry prediction comparisons of different AISNPs for five continental populations and population structure dissection of the Xinjiang Hui group via a self-developed panel, Genes 11 (2020) 505.
[2] T. Frudakis, K. Venkateswarlu, M.J. Thomas, Z. Gaskin, S. Ginjupalli, S. Gunturi, V. Ponnuswamy, S. Natarajan, P.K. Nachimuthu, A classifier for the SNP-based inference of ancestry, J. Forensic Sci. 48 (2003) 771–782.
[3] C. Phillips, A. Salas, J.J. Sánchez, M. Fondevila, A. Gómez-Tato, J. Alvarez-Dios, M. Calaza, M.C. de Cal, D. Ballard, M.V. Lareu, A. Carracedo, Inferring ancestral origin using a single multiplex assay of ancestry-informative marker SNPs, Forensic Sci. Int. Genet. 1 (2007) 273–280.
[4] Y.L. Wei, L. Wei, L. Zhao, Q.F. Sun, L. Jiang, T. Zhang, H.B. Liu, J.G. Chen, J. Ye, L. Hu, C.X. Li, A single-tube 27-plex SNP assay for estimating individual ancestry and admixture from three continents, Int. J. Leg. Med. 130 (2016) 27–37.
[5] K.K. Kidd, W.C. Speed, A.J. Pakstis, M.R. Furtado, R. Fang, A. Madbouly, M. Maiers, M. Middha, F.R. Friedlaender, J.R. Kidd, Progress toward an efficient panel of SNPs for ancestry inference, Forensic Sci. Int. Genet. 10 (2014) 23–32.
[6] A.J. Pakstis, L. Kang, L. Liu, Z. Zhang, T. Jin, E.L. Grigorenko, F.R. Wendt, B. Budowle, S. Hadi, M.S. Al Qahtani, N. Morling, H.S. Mogensen, G.E. Themudo, U. Soundararajan, H. Rajeevan, J.R. Kidd, K.K. Kidd, Increasing the reference populations for the 55 AISNP panel: the need and benefits, Int. J. Leg. Med. 131 (2017) 913–917.
[7] E. Pilli, S. Morelli, B. Poggiali, E. Alladio, Biogeographical ancestry, variable selection, and PLS-DA method: a new panel to assess ancestry in forensic samples via MPS technology, Forensic Sci. Int. Genet. 62 (2023) 102806.
[8] S. Zhao, C.M. Shi, L. Ma, Q. Liu, Y. Liu, F. Wu, L. Chi, H. Chen, AIM-SNPtag: a computationally efficient approach for developing ancestry-informative SNP panels, Forensic Sci. Int. Genet. 38 (2019) 245–253.
[9] A. Auton, L.D. Brooks, R.M. Durbin, E.P. Garrison, H.M. Kang, J.O. Korbel, J.L. Marchini, S. McCarthy, G.A. McVean, G.R. Abecasis, A global reference for human genetic variation, Nature 526 (2015) 68–74.
[10] R. Kosoy, R. Nassir, C. Tian, P.A. White, L.M. Butler, G. Silva, R. Kittles, M.E. Alarcon-Riquelme, P.K. Gregersen, J.W. Belmont, F.M. De La Vega, M.F. Seldin, Ancestry informative marker sets for determining continental origin and admixture proportions in common populations in America, Hum. Mutat. 30 (2009) 69–78.
[11] G. He, J. Liu, M. Wang, X. Zou, T. Ming, S. Zhu, H.Y. Yeh, C. Wang, Z. Wang, Y. Hou, Massively parallel sequencing of 165 ancestry-informative SNPs and forensic biogeographical ancestry inference in three southern Chinese Sinitic/Tai-Kadai populations, Forensic Sci. Int. Genet. 52 (2021) 102475.
[12] S. Qu, J. Zhu, Y. Wang, L. Yin, M. Lv, L. Wang, H. Jian, Y. Tan, R. Zhang, Y. Liu, F. Li, S. Huang, W. Liang, L. Zhang, Establishing a second-tier panel of 18 ancestry informative markers to improve ancestry distinctions among Asian populations, Forensic Sci. Int. Genet. 41 (2019) 159–167.
[13] X.Y. Jin, Y.Y. Wei, Q. Lan, W. Cui, C. Chen, Y.X. Guo, Y.T. Fang, B.F. Zhu, A set of novel SNP loci for differentiating continental populations and three Chinese populations, PeerJ 7 (2019) e6508.
[14] C.X. Li, A.J. Pakstis, L. Jiang, Y.L. Wei, Q.F. Sun, H. Wu, O. Bulbul, P. Wang, L.L. Kang, J.R. Kidd, K.K. Kidd, A panel of 74 AISNPs: improved ancestry inference within Eastern Asia, Forensic Sci. Int. Genet. 23 (2016) 101–110.
[15] O. Bulbul, W.C. Speed, C. Gurkan, U. Soundararajan, H. Rajeevan, A.J. Pakstis, K.K. Kidd, Improving ancestry distinctions among Southwest Asian populations, Forensic Sci. Int. Genet. 35 (2018) 14–20.
[16] H.L. Hwa, C.P. Lin, T.Y. Huang, P.H. Kuo, W.H. Hsieh, C.Y. Lin, H.I. Yin, L.H. Tseng, J.C. Lee, A panel of 130 autosomal single-nucleotide polymorphisms for ancestry assignment in five Asian populations and in Caucasians, Forensic Sci. Med. Pathol. 13 (2017) 177–187.
[17] C.M. Nievergelt, A.X. Maihofer, T. Shekhtman, O. Libiger, X. Wang, K.K. Kidd, J.R. Kidd, Inference of human continental origin and admixture proportions using a highly discriminative ancestry informative 41-SNP panel, Invest. Genet. 4 (2013) 13.
[18] C. Phillips, A. Freire Aradas, A.K. Kriegel, M. Fondevila, O. Bulbul, C. Santos, F. Serrulla Rech, M.D. Perez Carceles, Á. Carracedo, P.M. Schneider, M.V. Lareu, Eurasiaplex: a forensic SNP assay for differentiating European and South Asian ancestries, Forensic Sci. Int. Genet. 7 (2013) 359–366.
[19] Y. Yang, An evaluation of statistical approaches to text categorization, Inf. Retr. 1 (1999) 69–90.
[20] E. Alladio, B. Poggiali, G. Cosenza, E. Pilli, Multivariate statistical approach and machine learning for the evaluation of biogeographical ancestry inference in the forensic field, Sci. Rep. 12 (2022) 8974.
[21] X.Y. Jin, W. Cui, C. Chen, Y.X. Guo, Y.W. Tao, Q. Lan, T.T. Kong, B.F. Zhu, Biogeographic origin prediction of three continental populations through 42 ancestry informative SNPs, Electrophoresis 41 (2020) 235–245.