Arabic/English Handwritten Digits Recognition using MLPs, CNN, RF, and CNN-RF

Day by day, machine learning and deep learning reduce the effort required of humans in many fields, and handwriting recognition is one of them. In Handwriting Recognition (HWR), a machine receives handwritten input from sources such as paper, touch screens, and images and converts it into a machine-readable format. Arab countries often use Arabic digits in addition to English digits in banks, business applications, etc. This article discusses four methods for recognizing Arabic/English handwritten digits: random forest (RF), multi-layer perceptrons (MLPs), convolutional neural network (CNN), and CNN-RF. These methods were implemented on the MNIST and MADBase datasets, and the results show that, in comparison with the other algorithms, the highest accuracy was obtained by the Convolutional Neural Network (CNN), with a value of 99.11%.


INTRODUCTION
Handwritten digit recognition has attracted great interest because of its wide range of applications, such as office computerization, check verification, and data entry [1]. Pattern recognition is a scientific discipline whose objective is to categorize objects into a number of classes or categories [2]. Handwritten digit recognition is the capacity of a machine to receive and interpret handwritten digits from many sources, such as images and documents, and then classify them into ten classes [3]. HWR may be applied on computer tablets, in the recognition of vehicle plate numbers, the processing of bank check amounts, digit entries on hand-filled forms, etc. [4]. Handwritten digit recognition faces many problems: handwriting is not always of the same size, width, direction, or position relative to the margins; style differs from person to person; the handwriting of the same person can change from time to time; and some digits resemble one another [5].
As shown in Fig 1, the general structure of a classical pattern recognition system has five parts through which each sample passes: sample acquisition, preprocessing, feature extraction, classifier design, and classification decision [6]. Each country has a language that is considered its official language. Arabic is one of the five most important languages in the world. Arabic digits are of wide significance in countries that write in Arabic, and most Arabs use Arabic digits in addition to English ones. Much research focuses on Latin languages, and fewer studies have been done on Arabic [7]. Many classification algorithms are available in machine learning and deep learning to recognize handwritten digits, such as Naive Bayes, Linear Regression, Random Forest, Support Vector Machine (SVM), Decision Trees, K-Nearest Neighbor (KNN), K-means clustering, neural networks, Convolutional Neural Networks (CNNs), etc. No single model can be identified as the best and most effective for recognizing handwritten digits; each researcher may use a different model and obtain appropriate results [8].
In this paper, four classification algorithms were implemented: Random Forest (RF), Multilayer Perceptrons (MLPs), Convolutional Neural Networks (CNN), and CNN-RF. They were compared to identify the algorithm that gives the highest accuracy.
The algorithms were applied to the Modified National Institute of Standards and Technology (MNIST) dataset combined with the MADBase database (Arabic handwritten digit images). The MNIST dataset has 70,000 handwritten digits, and the Arabic dataset also has 70,000 handwritten digits.

LITERATURE REVIEW
Handwritten digit recognition is a challenging problem in pattern recognition and computer vision. It has been studied with many methods and techniques, such as Convolutional Neural Networks (CNNs), Neural Networks (NNs), SVM, and KNN, and results have been reported on different datasets in different languages [1].
The authors of [9], "Hand Written Digit Recognition Using Machine Learning", examined how the accuracy of a CNN varies with the number of hidden layers and epochs and compared the variants on the MNIST database. They achieved an accuracy above 95% and used the CNN model for prediction in their web app. They found that the model fits with 10 epochs, a batch size of 200, and two hidden layers of 128 neurons and 50 neurons. Sonia Flora and Anju Kakkad [10] carried out a comparison study of an Artificial Neural Network and a Convolutional Neural Network. They found that on a CPU the error of the CNN is lower than that of the artificial neural network and that image classification is better performed by the CNN, but the CNN took longer on the CPU than the artificial neural network, and the optimum result for the CNN can be obtained on a GPU. The authors of [11] reviewed different methods of handwriting recognition and noted that the highest accuracy is obtained by the Convolutional Neural Network (CNN), although it requires more samples and time than the other methods, and that the lowest accuracy is obtained by the slope and slant correction method. Slope and slant correction is used to reduce the variation in writing style; it is suitable when the variation is slight, but it is not the best approach to handwriting recognition. In [12], the authors compared three classifiers, Random Forest (RF), K-Nearest Neighbors (KNN), and SVM, and obtained accuracies of 96.89% for RF, 96.67% for KNN, and 97.91% for SVM. The highest accuracy, 98.72%, was obtained by a CNN with 3 hidden layers, which also took the longest execution time. In [13], three classifiers were used to recognize the digits: a Neural Network, KNN, and SVM. The results on the MNIST dataset show that SVM and KNN made correct predictions while the MLP neural network made some mistakes. This happens because SVM and KNN predict from the extracted features directly, whereas the MLP is a non-linear function and is therefore more appropriate for learning non-linear models; in addition, MLPs have various random weight initializations that lead to different accuracies.
For Arabic digits, the authors of [14] proposed a deep learning approach based on a CNN and compared it with MLPs. Both models were trained and tested on the CMATERDB Arabic handwritten digit dataset with 1000 epochs and a batch size of 128. The CNN model gives 97.4% accuracy while the MLPs give 93.8%. When properly trained on a dataset, the CNN model can perform well above the MLP in classic image recognition. Amsal Pardamean, Dewy Yuliana, Sri Watmah, Sisferi Hikmawan, and Sfenrianto [15] proposed a method that uses the LeNet-5 architecture to train on and recognize handwritten images of Arabic digits. The dataset used is MADBase, on which they obtained an average accuracy of 97.67% by training the model with the backpropagation algorithm for 250 epochs. Rami S. Alkhawaldeh [16] proposed a model that combines a CNN with recurrent neural networks (RNNs), specifically long short-term memory (LSTM), to detect Arabic Handwritten Digits (AHD). The idea is that the LSTM layers take the features extracted by the CNN part. The model includes three parts: the CNN model, the LSTM layers, and the dense fully connected layers. The LSTM layers eliminate irrelevant information and retain the most important. The last output of the LSTM layers is then the most relevant information, which is input to the dense fully connected layers. He achieved an accuracy of up to 98.92%.
Vol. 28, No. 2, September 2023, pp. 252-260

Algorithms and Methods Used
Machine learning describes the ability of devices to learn from particular training data and then apply what they have learned to make informed decisions and solve associated tasks (e.g., decision tree, SVM, KNN, RF algorithm). Deep learning (DL) is a concept of machine learning related to artificial neural networks (e.g., CNN, RNN) [17]. An ANN is an intelligent technique that can be utilized in data processing, recognition, and other tasks [18]. In this research, a well-known deep learning algorithm, CNN, is used and compared with multi-layer perceptrons (MLPs), Random Forest (RF), and CNN-RF to show the effect of feature extraction before the classifier stage.

Multi-Layer Perceptrons (MLPs)
MLPs are a type of feedforward network containing three types of connected layers: an input layer, one or more hidden layers, and an output layer, as shown in Fig 2. Every layer consists of several nodes called neurons [19]. Neurons, or "processing units", take the inputs, compute a weighted sum of these inputs and the threshold, and apply the activation function [20]. An activation function is applied to the sum of the weighted inputs and the bias to determine whether or not a neuron fires. If the node fires, the data is sent to the following layer of the network; otherwise, no data is passed on to the next layer. The activation function can be either linear or nonlinear, and it is often referred to as the transfer function [21].
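The weighted-sum-then-activation behaviour of a layer of neurons can be sketched in a few lines of NumPy. This is only an illustration: the function names `relu` and `dense_forward` and all numeric values are assumptions chosen for the example, not taken from the paper.

```python
import numpy as np


def relu(z):
    """Rectified linear unit: keeps positive values, zeroes out the rest."""
    return np.maximum(0.0, z)


def dense_forward(x, weights, bias, activation=relu):
    """Forward pass of one fully connected layer.

    x:       input vector, shape (n_inputs,)
    weights: weight matrix, shape (n_inputs, n_neurons)
    bias:    bias vector, shape (n_neurons,)
    """
    z = x @ weights + bias   # weighted sum of the inputs plus the bias
    return activation(z)     # the activation decides what is passed on


# Illustrative values only (hypothetical, not from the paper).
x = np.array([1.0, 2.0])
weights = np.array([[0.5, -1.0],
                    [0.25, -0.5]])
bias = np.array([0.0, 1.0])
out = dense_forward(x, weights, bias)
```

Here the first neuron receives a positive weighted sum and passes it on, while the second receives a negative one and outputs zero, which corresponds to a neuron that does not fire.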

Convolutional Neural Networks (CNNs)
The Convolutional Neural Network, often called ConvNet [22], is one of the most popular algorithms in the DL field due to its superior performance in many computer vision and machine learning problems. CNNs are similar to feed-forward networks, but they are typically used for image-related tasks, for instance image classification, speech recognition, facial expression recognition, vehicle recognition, face and object detection, pattern recognition, and many others [23]. The input to the CNN passes sequentially through a series of processing steps. Each step is called a CNN layer, and the network includes three kinds of layers: the input layer, multiple hidden layers, and the output layer. The hidden layers include convolutional layers, pooling layers, fully connected (FC) layers, batch normalization, flattening layers, etc. [24]. In a CNN, the classification stage is combined with feature extraction [25]. Convolution is a specialized type of linear operation used to extract features. The CNN extracts the input features by applying a kernel (filter), which is an array of weights, followed by an activation function that introduces nonlinearity into the system. Recently, the Rectified Linear Unit (ReLU) has been used more than other activation functions such as tanh and sigmoid [26,27]. The pooling layer comes after a convolution layer [23]. It decreases the feature dimensions. This reduces the number of trainable parameters in the layers that follow, because only selected values are passed on to the next layer while unnecessary ones are excluded. Max pooling, one of the most common pooling techniques, reduces the number of parameters by selecting the maximum values [28]. The pooling function generates another output vector. Batch normalization is a technique that aims to improve the training of neural networks: it addresses feature distributions that differ across training and test data by stabilizing the layer distributions. It helps in two ways: faster learning and higher overall accuracy [29]. The fully connected layer is situated at the end of the CNN. Within this layer, each neuron is connected to all neurons in the previous layer, hence the name fully connected (FC) [22]. The fully connected layer works as an artificial neural network, and the number of outputs depends on the number of classes [4]. CNNs have been modified in several ways for handwriting recognition systems. These modifications include increasing or decreasing the number of neurons in the hidden layer to match the target efficiency, as well as changing the number of hidden layers such as pooling and convolution layers. Some research increases the complexity of the system to achieve the best performance [30].

Random Forest (RF)
RF is one of the top-performing learning algorithms due to its simplicity and its usability for both classification and regression tasks [31]. RF is an ensemble learning algorithm that classifies by means of a voting model, as shown in Fig 4 [32]. The prediction is made by aggregation (majority vote for classification). It is used in various fields, such as banking, the stock market, medicine, and e-commerce. The construction of RF is based on a set of decision trees, each built from a random selection of samples from the training data [33]. The output category is determined by the mode of the decision trees' outputs [34]. The name comes from the introduction of extra randomness during tree creation. When splitting a node, instead of searching for the most important feature overall, it searches for the best feature within a random subset of features. This produces more diverse trees [35].
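The two sources of randomness described above, a random selection of training samples per tree and a random subset of features per split, can be sketched as follows. The helper names are hypothetical, and the square-root rule for the number of candidate features is an assumption (a common default), not something the paper states.

```python
import numpy as np


def bootstrap_rows(n_samples, rng):
    """Row indices for one tree, drawn with replacement from the training set."""
    return rng.integers(0, n_samples, size=n_samples)


def candidate_features(n_features, rng):
    """Feature indices considered at one split, drawn without replacement.

    Assumption: sqrt(n_features) candidates per split.
    """
    k = max(1, int(np.sqrt(n_features)))
    return rng.choice(n_features, size=k, replace=False)


rng = np.random.default_rng(0)
rows = bootstrap_rows(n_samples=1000, rng=rng)
features = candidate_features(n_features=784, rng=rng)
```

With 784 pixel features, this assumed rule would examine 28 randomly chosen pixels at each split rather than all of them.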

MNIST Dataset
MNIST is a major dataset used for the problem of handwritten digit classification. It is well established among researchers and learners [36] and was created by the National Institute of Standards and Technology (NIST). It comprises 70,000 images of handwritten digits from zero to nine, each 28 x 28 pixels: 60,000 images for training and the rest for testing. The training set consists of handwritten digits collected from 250 people, of whom fifty percent were employees of the Census Bureau and the rest were high school students. Each image is represented as a 28×28 matrix of grayscale pixel values, which means each image is taken as an input vector of 784 dimensions [25].

MADBase Dataset
MADBase is an adjusted version of ADBase and is based on the MNIST dataset. Both the MADBase and ADBase databases consist of 70,000 Arabic digits written by 700 persons, each of whom wrote each digit (0 to 9) ten times. The database was collected from different institutions, including governmental institutions, high schools, and students of Engineering and Medicine, so that it covers different ages and therefore different writing styles. The size of each image is 28×28 pixels, the training set has 60,000 digits, and the test set has 10,000 digits [37].

IMPLEMENTATION
The models were implemented on a hardware platform with an Intel Core i5-4310U CPU at 2.60 GHz and 8 GB of RAM. Four methods (Random Forest, Neural Network, Convolutional Neural Network (CNN), and CNN-RF) were compared on the accuracy and runtime of each method, in order to find the best accuracy among them by training and testing on our dataset, using the Scikit-Learn package in the Python programming language [35].

Pre-Processing of Images
The first step is to prepare the data to fit the model, as shown in Fig 5. TensorFlow in Python already contains the MNIST dataset, which can be loaded with Keras, and MADBase is available on the internet. All datasets are CSV files representing the image pixel values and their corresponding labels. CNNs require a 4D array as input, so the input images are reshaped into a 4D tensor of shape (number of samples, 28, 28, 1), that is, grayscale images of 28 x 28 pixels. The dataset used is a merge of the MADBase and MNIST datasets and is split into 70% for training and 30% for testing. Normalization is applied to bring the pixel values into the range [0, 1], and one-hot encoding converts all label values into categorical form. Some samples from the merged dataset are shown in Fig 6.
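The reshaping, normalization, and one-hot encoding steps can be sketched with NumPy alone. This is a minimal sketch under stated assumptions: the function names are hypothetical, random pixels stand in for the CSV data, and shifting the Arabic labels to classes 10 to 19 is one plausible way to obtain 20 classes from the merged dataset, not necessarily the authors' scheme.

```python
import numpy as np


def merge_labels(english_labels, arabic_labels, offset=10):
    """Combine both label sets into one 20-class label vector.

    Assumption: Arabic digits are shifted to classes 10..19.
    """
    return np.concatenate([english_labels, arabic_labels + offset])


def preprocess(pixels, labels, num_classes=20):
    """Reshape, normalize, and one-hot encode flattened digit images.

    pixels: array of shape (n_samples, 784) with values in 0..255
    labels: integer class ids in 0..num_classes-1
    """
    x = pixels.reshape(-1, 28, 28, 1).astype("float32") / 255.0  # 4D, in [0, 1]
    y = np.eye(num_classes, dtype="float32")[labels]             # one-hot rows
    return x, y


# Random pixels stand in for rows read from the CSV files.
rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(4, 784))
labels = merge_labels(np.array([3, 7]), np.array([0, 9]))
x, y = preprocess(pixels, labels)
```

After this step `x` has shape (4, 28, 28, 1) and each row of `y` contains a single 1 at the position of its class.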

Feature Extraction
Selecting and extracting features from images has attracted considerable interest from researchers [38]. RF and MLPs are incapable of creating discriminative features from the raw data; therefore, feature engineering for these methods is usually done manually, which requires a lot of prior experience [6]. CNNs can automatically abstract the features of the images and classify them effectively, which means that they do not require manual feature extraction. Feature extraction is not only a cumbersome task but also plays a major part in the outcomes [32].

Implementation of the Considered Methods
Multilayer perceptron: The MLP was implemented using the scikit-learn package. A flatten layer is used at the input to generate a vector from each input image, so the input is an array of 784 values transformed from a 28 x 28 image. The hidden layer consists of 256 neurons with the ReLU activation function. A dropout layer was added to avoid over-fitting. The term "dropout" refers to dropping units in a neural network, whether in a hidden or a visible layer, by temporarily removing them from the network [39]. The softmax classifier is used in the output layer with 20 neurons, which is the number of classes. Softmax is a kind of activation function used in DL for classification: it converts the output scores into a probability for each class, and these probabilities sum to one. The predicted class is therefore the one with the highest probability [21,40].
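The softmax step at the output layer can be illustrated in isolation. The sketch below is not the authors' code; the three scores are made-up values, whereas the model described above would produce 20.

```python
import numpy as np


def softmax(scores):
    """Convert raw output scores into probabilities that sum to one."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


# Illustrative scores for three classes (hypothetical values).
scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
predicted_class = int(np.argmax(probs))  # index of the most probable class
```

The class with the largest score also receives the largest probability, so taking the arg max of the probabilities selects the predicted digit.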

Convolutional Neural Network (CNN or ConvNet): A simple convolutional neural network is used, as shown in Fig 7. The first hidden layer is a convolutional layer with the ReLU (Rectified Linear Unit) activation function. This layer has 64 feature maps with a size of 2×2 and acts as the input layer, taking images with a shape of 28 x 28 x 1. The next hidden layer is a max-pooling layer, which reduces the features in order to reduce overfitting. It is followed by a dropout layer of 25% to prevent overfitting. The second convolutional layer, also with ReLU, captures more features from the image and is followed by the second max-pooling layer and dropout. The last layer is the output layer with 20 neurons (the number of output classes), and it uses softmax for multiclass classification.
Random Forest: From the sklearn library, the RandomForestClassifier is used to implement the algorithm. Random forest is a meta-estimator that fits several decision tree classifiers on different subsamples of the dataset and uses averaging to improve predictive accuracy and control over-fitting [41]. The RF classifier needs only a few parameters to be adjusted. The value n_estimators = 100 is the number of decision trees used in the forest on the same dataset. The 28 x 28 grayscale images are fed in as vectors of length 784 with values between 0 and 255, and the output is determined by the majority of the decision trees.
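The majority decision over the trees can be sketched with the standard library. This shows plain hard voting as described in the text; library implementations may combine the trees differently (for example by averaging class probabilities), and the function name and vote values here are hypothetical.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Combine per-tree class predictions into one label per sample.

    tree_predictions: list of lists, one inner list per tree, each holding
    that tree's predicted class for every sample. Ties go to the class
    that was seen first.
    """
    per_sample = zip(*tree_predictions)  # regroup the votes by sample
    return [Counter(votes).most_common(1)[0][0] for votes in per_sample]


# Three trees voting on three samples (made-up predictions).
votes = [
    [3, 12, 7],  # tree 1
    [3, 12, 1],  # tree 2
    [5, 12, 7],  # tree 3
]
labels = majority_vote(votes)
```

With 100 trees, as in the configuration above, the same function would receive 100 inner lists.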

CNN-RF:
In this method, a CNN is combined with RF (CNN-RF). The CNN automatically captures the features of the input images, and these features are fed into the RF classifier to recognize the digits, as shown in Fig 7. The CNN-RF model contains the same layers as the CNN model except for the last two layers. The features are captured by the convolutional, max-pooling, and fully connected layers and have dimension (1 x 256), which means that each image has 256 features that are fed, together with the labels, to the RF classifier.
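To make the hand-off from feature maps to a tabular classifier concrete, the sketch below shows 2×2 max pooling followed by flattening each image into one feature row. It only illustrates the pooling and flattening mechanics on toy data; the 256-dimensional vector mentioned above comes from the trained network's fully connected layer, which is not reproduced here, and the function names are assumptions.

```python
import numpy as np


def max_pool_2x2(feature_maps):
    """2x2 max pooling with stride 2 on a batch shaped (n, h, w, c).

    Assumes h and w are even.
    """
    n, h, w, c = feature_maps.shape
    blocks = feature_maps.reshape(n, h // 2, 2, w // 2, 2, c)
    return blocks.max(axis=(2, 4))  # maximum over each 2x2 window


def to_feature_rows(feature_maps):
    """Flatten each sample's feature maps into one row for a classifier."""
    return feature_maps.reshape(feature_maps.shape[0], -1)


# Toy batch: two 4x4 single-channel maps filled with 0..31.
maps = np.arange(2 * 4 * 4 * 1, dtype="float32").reshape(2, 4, 4, 1)
pooled = max_pool_2x2(maps)
rows = to_feature_rows(pooled)
```

Each row of `rows` plays the role of one image's feature vector; in the CNN-RF setting such rows, paired with the labels, are what a classifier like RF would be trained on.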

RESULTS AND DISCUSSIONS
The total number of handwritten digits is 140000. In the training stage, 98000 handwritten digits (70%) are used as training samples, and 42000 handwritten digits (30%) are used as test samples to check the results. Both the MLPs and the CNNs were trained for 15 epochs with a batch size of 128. The Adam optimizer was used as the optimization function, and categorical cross-entropy was used to calculate the loss. After executing all the models, CNN has the highest accuracy; its only drawback is that it takes a lot of time, as shown in Table 1. The advantage of CNNs over MLPs is the ability to extract features along with good classification performance [32]. The accuracy of the CNN is higher than that of the MLP because the CNN uses convolution in its layers, a more powerful method of feature extraction [14]. This is shown in the confusion matrices in Fig 8 and Fig 9, where the diagonal elements represent the correct classifications and the other elements represent the samples that were misclassified by the model. Thus, the data fed into the softmax classifier in CNNs is a flattened feature map, while in MLPs it is the input image vector.
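The split sizes and the way accuracy is read off a confusion matrix can be checked with a short sketch. The 3-class matrix below is made up for illustration and does not reproduce the paper's results; the helper names are hypothetical.

```python
import numpy as np


def split_sizes(total, train_fraction):
    """Number of training and test samples for a given split."""
    n_train = round(total * train_fraction)
    return n_train, total - n_train


def accuracy_from_confusion(cm):
    """Fraction of correctly classified samples: diagonal sum over total."""
    return np.trace(cm) / cm.sum()


n_train, n_test = split_sizes(140000, 0.7)

# Illustrative matrix (rows: true class, columns: predicted class).
# These counts are invented and are not the paper's results.
cm = np.array([[48, 1, 1],
               [2, 47, 1],
               [0, 3, 47]])
acc = accuracy_from_confusion(cm)
```

The diagonal holds the correctly classified samples, so the same calculation applies unchanged to a 20×20 matrix.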

Fig. 8 Confusion Matrix of CNN model
RF is characterized by low complexity, fast computing speed, the capacity to handle a large amount of data, etc. [32]. The RF classifier has higher accuracy than CNN-RF, although their execution times are close. The reason is that the inputs of the RF classifier in CNN-RF are the extracted features, so their dimensionality is lower, whereas RF is more suitable for classifying high-dimensional data [42]. Random forests are suitable for the classification of large, high-dimensional data because the ensemble model uses decision tree induction to build the component classifiers. This means that the CNN-RF method demands a large amount of data to train the model [32].

CONCLUSION
In this paper, handwritten digits were recognized by four methods on the same dataset. The results show that CNN is the best classifier for recognizing Arabic/English handwritten digits, with an accuracy of 99.11%, but it is more time-consuming than the other models. It was also observed that RF has a fast computing speed but requires large, high-dimensional data to achieve excellent accuracy.

Future Work
First, more pre-processing steps could be used to make training more robust and resilient. A segmentation stage could also be added to deal with sequences of digits. In addition, a method could be proposed to distinguish Arabic from English digits in the case of mixed numbers. With regard to the model, Transfer Learning (TL) could be employed, which allows a pre-trained model such as VGG, ResNet, AlexNet, or MobileNet to be trained on the digits dataset.

ACKNOWLEDGEMENTS
The authors thank the Department of Computer Engineering at the University of Mosul for its assistance in improving the quality of this paper.

Fig. 3 Elementary constituents of CNN [23]

Fig. 7 Convolutional neural network and (CNN-RF) models

Fig. 9 Confusion Matrix of MLP model

Table 1: Performance comparison of methods