AN INVESTIGATION OF THE IMPACT OF DIFFERENT DATA CLEANING TECHNIQUES ON METRIC RESULT QUALITY IN MACHINE LEARNING

Date
2022-06-14
Authors
ABBAS, Israa Mustafa
Abstract
The enormous growth of data generated by e-commerce platforms and online applications poses a major challenge for data analysis and processing. E-commerce web sites now routinely let customers review the products they have purchased, and these reviews are a valuable source of information about those products. Product reviews are also an important data source for sentiment analysis, which is used across online retail firms. The sheer volume of this data is itself a challenge, and the datasets contain a variety of data quality issues. In many cases, different data mining techniques are therefore applied before the data is put to use, especially for supervised machine learning models, which are trained on historical, labelled data to predict unseen data, that is, data the model has never learned from. This thesis presents a design-of-experiments study in machine learning [1]. We applied Ronald Fisher's theories [2] consistently to find cause-effect relationships. To carry out the study, we chose supervised machine learning classification algorithms for sentiment analysis, an approach within natural language processing (NLP). Sentiment analysis is a popular way for organizations to determine and categorize opinions about a product or service. It uses data mining, machine learning and artificial intelligence to mine text for sentiment and subjective information [3]. The study used Multinomial Naïve Bayes, Random Forest and Logistic Regression to analyze the impact of five experimental groups (duplicate data, punctuation marks, stop words, lemmatizer, TF-IDF transform) compared with one control group (no data cleaning applied). The aim was to determine the impact of each experimental group on the three models' efficiency and classification ratio and to explain the interesting observations. A simulation was run on 353 projects chosen randomly from twenty-four different categories of the Amazon product review dataset.
The dataset was collected from Amazon.com by McAuley and Leskovec [4][5]. After the metric dataset was collected, SPSS was used to analyze it. A repeated-measures ANOVA was performed to examine the research question, together with descriptive statistics of the metrics used. The analysis shows that data cleaning affects the performance of machine learning models in different ways. In some cases data cleaning had a positive impact on Random Forest and a negative impact on Multinomial Naïve Bayes and Logistic Regression. In other cases it had no impact at all. Overall, the experimental results showed that the Random Forest classifier is more sensitive to data cleaning than the Multinomial Naïve Bayes and Logistic Regression classifiers, both of which achieved high classification scores on the uncleaned dataset. Moreover, the results showed that data issues behave differently from one machine learning model to another, so data quality issues cannot be treated as irrelevant data for every machine learning algorithm. The analysis results are explained in detail in the results and discussion chapters (Chapters 4 and 5).
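The control-versus-treatment design described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the thesis's actual pipeline: the toy reviews, the short stop-word list, and all function names are hypothetical. Only three of the five treatments are shown, since lemmatization and the TF-IDF transform would normally come from an NLP or machine learning library. In a full experiment, each resulting variant of the corpus would presumably be used to train the three classifiers and the scores compared against the control group.

```python
import string

# Hypothetical toy reviews; the thesis used Amazon product reviews.
REVIEWS = [
    "Great phone, works well!",
    "Great phone, works well!",  # exact duplicate of the first review
    "The battery is not good.",
]

# Small illustrative stop-word list (an assumption, not the thesis's list).
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to"}


def control(reviews):
    """Control group: return the reviews with no cleaning applied."""
    return list(reviews)


def drop_duplicates(reviews):
    """Remove exact duplicate reviews, keeping first occurrences in order."""
    return list(dict.fromkeys(reviews))


def strip_punctuation(reviews):
    """Delete every ASCII punctuation character from each review."""
    table = str.maketrans("", "", string.punctuation)
    return [review.translate(table) for review in reviews]


def remove_stop_words(reviews):
    """Drop whitespace-separated tokens that appear in STOP_WORDS."""
    return [
        " ".join(w for w in review.split() if w.lower() not in STOP_WORDS)
        for review in reviews
    ]


# One entry per group; each treatment is applied on its own to the raw data.
TREATMENTS = {
    "control": control,
    "duplicates_removed": drop_duplicates,
    "punctuation_removed": strip_punctuation,
    "stop_words_removed": remove_stop_words,
}


def build_groups(reviews):
    """Return a dict mapping each group name to its version of the corpus."""
    return {name: treat(reviews) for name, treat in TREATMENTS.items()}


groups = build_groups(REVIEWS)
for name, corpus in groups.items():
    print(name, corpus)
```

Applying each treatment in isolation to the same raw corpus, rather than chaining them, is what would allow any change in a model's score to be attributed to that single cleaning step.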
Description
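Turkish title: MAKİNE ÖĞRENMESİNDE, FARKLI VERİ TEMİZLEME TEKNİKLERLERİNİN SONUÇ ÖLÇEVLERİ ÜZERİNDEKİ ETKİSİNİN İNCELENMESİ (English: An Investigation of the Effect of Different Data Cleaning Techniques on Result Metrics in Machine Learning)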
Keywords
computer engineering