Prediction of Wordle Based on Machine Learning

Yufei Shi

Abstract


Wordle brings a lot of fun to people’s daily lives. Players must guess a word in at most six attempts, and they may choose the difficult or the easy mode according to their own preference. A series of data is generated from the number of sign-ups in the different modes and the number of guesses needed for different words. We use machine learning to process the data of this game and classify the difficulty of different words, in order to predict the situation players will face when guessing words introduced in the future. We propose the following solutions to the questions posed[1].
For problem one, we derive a dataset by sliding-window processing to train an LSTM model, and then predict the number of reported results for March 1, 2023, obtaining a value of roughly 23321. After that we validate and normalize the model, and the computational results show that our model has little error and strong predictive performance. To assess the effect of word attributes on the number of enrollments in the different difficulty modes, we counted all combinations of word letters and fitted a linear regression model, obtaining Prob(F) = 0.587. The model is therefore not statistically significant, indicating that enrollment is not affected by word attributes.
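The sliding-window step for problem one can be sketched as follows. This is a minimal illustration, not the author's implementation: the function names, the window length of 3, the min-max form of normalization, and the sample counts are all assumptions made here for clarity.

```python
def make_windows(series, window):
    """Split a 1-D series into (input window, next value) training pairs."""
    if window < 1 or window >= len(series):
        raise ValueError("window must be in [1, len(series) - 1]")
    inputs, targets = [], []
    for start in range(len(series) - window):
        inputs.append(series[start:start + window])
        targets.append(series[start + window])
    return inputs, targets


def min_max_normalize(series):
    """Scale values to [0, 1]; also return (min, max) so predictions can be unscaled."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0 for _ in series], (lo, hi)
    return [(v - lo) / (hi - lo) for v in series], (lo, hi)


# Hypothetical daily report counts, for illustration only (not the paper's data).
counts = [100, 120, 90, 110, 130, 125]
X, y = make_windows(counts, window=3)
scaled, bounds = min_max_normalize(counts)
```

Each row of `X` holds three consecutive days and the matching entry of `y` is the day that follows, which is the supervised form a sequence model such as an LSTM would be trained on.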
For problem two, to predict the distribution of results for a given word on a future day, and considering that the task has multiple inputs and multiple outputs, we adopt a regression chain model. We then train a random forest regression algorithm within this model, dividing the samples into training and test sets. Finally we derive percentage data for the seven attempt categories: {0.2, 3.5, 18.3, 31.3, 27.5, 15.8, 2.9}, whose MAPE values are within acceptable limits. We then construct a mapping set on the attributes of the given word EERIE and derive the predicted data for the word. In addition, we compare the result with that obtained from the data processed by a neural network algorithm and find that the former model is better[2].
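The MAPE used to judge the predicted percentages is the mean absolute percentage error. A minimal sketch of the standard formula is given here; the function name and the sample values are hypothetical and are not taken from the paper.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Raises ValueError when an actual value is zero, because the
    percentage error is undefined there.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for a, p in zip(actual, predicted):
        if a == 0:
            raise ValueError("MAPE is undefined when an actual value is 0")
        total += abs(a - p) / abs(a)
    return 100.0 * total / len(actual)


# Toy observed vs. predicted percentages (illustrative only).
observed = [10.0, 20.0, 40.0]
forecast = [11.0, 18.0, 40.0]
error_pct = mape(observed, forecast)
```

Because each error is divided by the observed value, categories with very small observed percentages can dominate the score, which is worth keeping in mind when reading a per-category MAPE.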
For problem three, we divide the difficulty into three levels by the RSR method and export the data after the evaluation process. We then train on the data set with three machine learning algorithms, namely logistic regression, decision tree and XGBoost, and draw the corresponding learning curves. Underfitting and overfitting occur, and even logistic regression, the best of the three, fails to fit the test set well, with an F-score of 0.5. We therefore turn to a CNN for the classification prediction, and the final F-score on both the training and test sets is about 0.8, which we consider a good result. Finally, we analyze the difficulty of EERIE with this model, and the difficulty factor we get is 1, which means it is easy.
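The rank-sum ratio (RSR) grading can be illustrated with a short sketch. It uses the usual definition RSR_i = (sum of the ranks of item i) / (m × n) for n items and m indicators. The indicator choice, the sample rows, and the equal-width split into three levels are assumptions made here; the paper does not state how its thresholds were set, and RSR grading is often done with a probit-based split instead.

```python
def average_ranks(values):
    """Rank values ascending from 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def rank_sum_ratio(rows):
    """RSR_i = (sum of row i's ranks over all indicators) / (m * n)."""
    n, m = len(rows), len(rows[0])
    totals = [0.0] * n
    for j in range(m):
        column_ranks = average_ranks([row[j] for row in rows])
        for i in range(n):
            totals[i] += column_ranks[i]
    return [t / (m * n) for t in totals]


def three_levels(scores):
    """Assumed simplification: equal-width thirds of the observed RSR range."""
    lo, hi = min(scores), max(scores)
    step = (hi - lo) / 3 or 1.0  # avoid division by zero when all scores match
    return [min(3, int((s - lo) / step) + 1) for s in scores]


# Hypothetical indicators per word, both oriented so that larger means harder:
# [mean number of attempts, percentage of failed games].
rows = [[3.5, 1.0], [4.2, 3.0], [4.9, 9.0], [4.2, 2.0]]
scores = rank_sum_ratio(rows)
levels = three_levels(scores)  # 1 = easy, 3 = hard
```

With indicators oriented this way, level 1 corresponds to the easy end of the scale, matching the reading of a difficulty factor of 1 as easy.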
For problem four, we present the data in graphical form and analyze its relevant features through correlation and descriptive analysis methods[3].

Keywords


Prediction of attempt numbers; Machine learning; Significance test; Difficulty coefficient

References


[1] Haihong Fan. Application of SVM classification algorithm based on convolution neural network in image classification [J]. Science and Technology Bulletin, 2022, 38(08): 24-28. DOI: 10.13774/j.cnki.kjtb.2022.08.005.

[2] Xiaotong Hu, Chen Cheng. Time series prediction based on multi-dimensional and cross-scale LSTM model [J]. Computer Engineering and Design, 2023, 44(02): 440-446. DOI: 10.16208/j.issn1000-7024.2023.02.017.

[3] Shishi Dong, Zhexue Huang. Analysis of random forest theory [J]. Integrated Technology, 2013(1): 1-7.

[4] Lei Liu. Research on Classification of breast cancer Diagnostic Data Based on Logical Regression Algorithm [J]. Software Engineering, 2018, 21(02): 21-23+17. DOI: 10.19644/j.cnki.issn2096-1472.2018.02.007.

[5] Xun Wang, Jia Qiao, Yanping Yu. Risk assessment of gas pipeline based on decision tree classification algorithm [J]. Gas and Heat, 2022, 42(10): 41-43+46. DOI: 10.13608/j.cnki.1000-4416.2022.01.015.

[6] Jiqing Yan, Zhiyuan Shen, Jing Lv, et al. Automatic classification of bidding documents based on XGBoost and text focus model [J]. Journal of Wuhan University (Engineering Edition), 2022, 55(03): 310-318. DOI: 10.14188/j.1671-8844.2022-03-013.

[7] Chicco D, Warrens M J, Jurman G. The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation [J]. PeerJ Computer Science, 2021, 7: e623.




DOI: http://dx.doi.org/10.18686/ahe.v7i31.11564