An empirical evaluation of writing style features in cross-topic and cross-genre documents in authorship identification
Date
2019
Author
Ndaba, Simisani
Publisher
University of Botswana, www.ub.bw
Link
Unpublished
Type
Masters Thesis/Dissertation
Abstract
This dissertation evaluates writing style features for cross-topic and cross-genre documents in Authorship Identification. The study extracts writing style features from related works and uses an ablation process to evaluate which of them work best for cross-topic and cross-genre documents. The ablation process shows whether a writing style feature increases or decreases performance when it is removed from or added to a classification model. The study follows the Cross Industry Standard Process for Data Mining (CRISP-DM) methodology because it provides a structured approach. The classification techniques used include Naïve Bayes, Support Vector Machine and Random Forest, which were chosen because evidence from previous studies suggests that they generally perform well in a variety of tasks.
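As a rough illustration of the ablation idea described here, the sketch below removes one feature group at a time and records how the score changes relative to the full set. It is not the dissertation's implementation: the group names `f1`, `f2`, `f3`, the function `ablation_deltas` and the scorer `toy_score` are hypothetical, and the scores are made-up integers rather than results from the study. In practice the scorer would train and evaluate a classifier on the active feature groups.

```python
from typing import Callable, Dict, FrozenSet, Iterable

# A scorer maps the set of active feature groups to a performance score.
Scorer = Callable[[FrozenSet[str]], float]


def ablation_deltas(groups: Iterable[str], score: Scorer) -> Dict[str, float]:
    """Score change caused by removing each group from the full set.

    A negative delta means the score fell without the group (the group
    helps); a positive delta means the score rose (the group hurts).
    """
    full = frozenset(groups)
    baseline = score(full)
    return {g: score(full - {g}) - baseline for g in sorted(full)}


def toy_score(active: FrozenSet[str]) -> float:
    # Hypothetical stand-in for "train a classifier and measure performance".
    # The per-group effects are invented purely to exercise the loop.
    effect = {"f1": 3, "f2": -1, "f3": 0}
    return 10 + sum(effect[g] for g in active)


deltas = ablation_deltas(["f1", "f2", "f3"], toy_score)
for group, delta in deltas.items():
    print(group, delta)
```

Under this toy scorer, removing `f1` lowers the score, removing `f2` raises it, and removing `f3` leaves it unchanged, which is the kind of signal an ablation process uses to decide which features to keep.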
The study first investigates whether the writing style features used in successful related works on single-topic and single-genre documents can be used effectively on cross-genre and cross-topic documents for Authorship Identification. The evaluation results showed that they can, because these features performed reasonably well when used in the classification model. In addition, the study investigated which types of writing style features work best for cross-genre and cross-topic Authorship Identification. The Syntactical writing style features identified as best were: Parts of Speech Tag (POST) unigram, bigram, trigram and quad-gram, and Punctuation Bigram. This shows that word-based adjectives make a positive contribution to Authorship Identification performance.
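To make the n-gram features named here concrete, the following sketch counts n-grams over a sequence of part-of-speech tags and bigrams over the punctuation marks of a text. It rests on assumptions not stated in the abstract: the tag sequence is hand-written for illustration rather than produced by a tagger, and a "punctuation bigram" is taken to mean a pair of consecutive punctuation marks in order of appearance, ignoring all other characters. The function names are hypothetical.

```python
import string
from collections import Counter
from typing import List, Sequence, Tuple


def ngrams(seq: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous n-grams of seq; empty if seq is shorter than n."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def punctuation_bigrams(text: str) -> Counter:
    """Count pairs of consecutive ASCII punctuation marks in text.

    Assumption: non-punctuation characters are dropped first, so marks
    that are far apart in the text can still form a bigram.
    """
    marks = [ch for ch in text if ch in string.punctuation]
    return Counter(ngrams(marks, 2))


# Hand-written, illustrative tag sequence (not tagger output).
tags = ["DT", "JJ", "NN", "VBZ", "DT", "JJ", "NN"]

pos_unigrams = Counter(ngrams(tags, 1))
pos_bigrams = Counter(ngrams(tags, 2))
pos_trigrams = Counter(ngrams(tags, 3))
pos_quadgrams = Counter(ngrams(tags, 4))

print(pos_bigrams.most_common(2))
print(punctuation_bigrams("Well, yes; no, wait!"))
```

The resulting counts could then be normalised and used as a feature vector for a classifier; that step is omitted here.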
Furthermore, the study examined which writing style features can be combined to work best on cross-genre and cross-topic documents in Authorship Identification. The best-performing combination was the Lexical, Syntactical, Structural and Content feature set. This shows that a combination of adjectives (Content), layout (Structural) and character-word collocations (Lexical, Syntactical) features contributes to successful cross-genre and cross-topic Authorship Identification. Finally, the study set out to determine whether its results generalise across the three different families of classifiers. The results generally showed that, regardless of the classifier used, most of the highest results came from the Syntactical set, followed by the Lexical, Content and Structural sets, in that order. This ordering held both in the initial evaluation and after the ablation process. When the feature sets were combined, the Syntactical and Lexical feature sets generated the highest results. Combinations consisting mostly of Content features performed moderately, and combinations consisting mostly of Structural features had the lowest results across the classifiers. The study achieved its highest score of 0.837 with the Lexical, Syntactical, Structural and Content feature set.