dc.contributor.author: Ndaba, Simisani
dc.date.accessioned: 2022-04-19T09:08:40Z
dc.date.available: 2022-04-19T09:08:40Z
dc.date.issued: 2019
dc.identifier.uri: http://hdl.handle.net/10311/2388
dc.description: A dissertation submitted to the Dept. of Computer Science, Faculty of Science, University of Botswana in partial fulfillment of the requirement of the degree of Masters in Computer Information Systems. Citation: Ndaba, S. (2019) An empirical evaluation of writing style features in cross-topic and cross-genre documents in authorship identification, University of Botswana. [en_US]
dc.description.abstract: This dissertation evaluates writing style features for cross-topic and cross-genre documents in Authorship Identification. The study extracts writing style features from related works and uses an ablation process to evaluate which of them work best for cross-topic and cross-genre documents. The ablation process shows whether a writing style feature increases or decreases performance when it is removed from or added to a classification model. The study follows the Cross Industry Standard Process for Data Mining (CRISP-DM) methodology because it provides a structured approach. The classification techniques used are Naïve Bayes, Support Vector Machine and Random Forest, chosen because evidence from previous studies suggests that they generally perform well in a variety of tasks. The study first investigates whether the writing style features used in successful related works on single-topic and single-genre documents can be used effectively on cross-genre and cross-topic documents for Authorship Identification. The evaluation results showed that they can, because they performed reasonably well when used in the classification model. In addition, the study investigated which types of writing style features work best for cross-genre and cross-topic Authorship Identification. The Syntactical writing style features identified as best were Parts of Speech Tag (POST) unigram, bigram, trigram and quad-gram, and Punctuation Bigram. This shows that word-based adjectives contribute positively to Authorship Identification performance.
Furthermore, the study examined which writing style features can be combined to work best on cross-genre and cross-topic documents in Authorship Identification. The feature set combination that gave the highest results on cross-genre and cross-topic documents was the Lexical, Syntactical, Structural and Content combination. This shows that a combination of adjective (Content), layout (Structural) and character-word collocation (Lexical, Syntactical) features contributes to successful cross-genre and cross-topic Authorship Identification. Finally, the study set out to find whether its results generalise across the three different families of classifiers. The results generally showed that, regardless of the classifier used, most of the highest results came from the Syntactical set, followed by the Lexical, Content and Structural sets, in that order. This ordering was the same in the initial evaluation and after the ablation process. When feature sets were combined, the Syntactical and Lexical feature sets generated the highest results. Combinations consisting mostly of Content features performed moderately, and combinations consisting mostly of Structural features had the lowest results across the classifiers. The study achieved its highest score of 0.837 with the Lexical, Syntactical, Structural and Content feature set. [en_US]
dc.language.iso: en [en_US]
dc.publisher: University of Botswana, www.ub.bw [en_US]
dc.subject: Evaluation of writing style [en_US]
dc.subject: cross-topic and cross-genre [en_US]
dc.subject: authorship Identification [en_US]
dc.subject: writing style features [en_US]
dc.subject: data Mining [en_US]
dc.subject: cross industry standard process [en_US]
dc.title: An empirical evaluation of writing style features in cross-topic and cross-genre documents in authorship identification [en_US]
dc.type: Masters Thesis/Dissertation [en_US]
dc.link: Unpublished [en_US]
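
The ablation process mentioned in the abstract can be illustrated with a minimal sketch. This is not the dissertation's code. The function name `ablation_deltas`, the group labels, and the scorer `toy_score` with its integer weights are all hypothetical placeholders, and none of the numbers come from the study. In a real experiment the scorer would presumably be something like the cross-validated accuracy of a classifier trained on the selected feature groups.

```python
from typing import Callable, Dict, FrozenSet

# Hypothetical labels for the four feature families named in the abstract.
FEATURE_GROUPS: FrozenSet[str] = frozenset(
    {"lexical", "syntactical", "structural", "content"}
)


def ablation_deltas(
    groups: FrozenSet[str],
    score: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Return, per group, score(all groups) minus score(all groups but that one).

    A positive value means removing the group lowered the score (the group
    helped); a negative value means removing it raised the score.
    """
    baseline = score(groups)
    deltas: Dict[str, float] = {}
    for group in sorted(groups):
        reduced = groups - {group}  # leave exactly one group out
        deltas[group] = baseline - score(reduced)
    return deltas


# Placeholder scorer: made-up integer weights, NOT results from the study.
TOY_WEIGHTS = {"lexical": 2, "syntactical": 3, "structural": -1, "content": 1}


def toy_score(groups: FrozenSet[str]) -> float:
    """Stand-in for a real evaluation such as cross-validated accuracy."""
    return sum(TOY_WEIGHTS[g] for g in groups)


deltas = ablation_deltas(FEATURE_GROUPS, toy_score)
for group, delta in deltas.items():
    print(f"{group}: {delta:+}")
```

Leaving out one group at a time keeps the comparison simple. A fuller ablation could also add groups one by one to an empty model, which the abstract also describes.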