UBRISA

  •   Ubrisa Home
  • Theses and Dissertations
  • Faculty of Science Theses and Dissertations
  • Masters Dissertations
  • View Item

    An empirical evaluation of writing style features in cross-topic and cross-genre documents in authorship identification

    Ndaba_Unpublished (MSc)_2019.pdf (1.825Mb)
    Date
    2019
    Author
    Ndaba, Simisani
    Publisher
    University of Botswana, www.ub.bw
    Link
    Unpublished
    Type
    Masters Thesis/Dissertation
    Abstract
    This dissertation evaluates writing style features for cross-topic and cross-genre documents in Authorship Identification. It extracts writing style features from related works and uses an ablation process to determine which of them work best for cross-topic and cross-genre documents. The ablation process shows whether performance increases or decreases when a writing style feature is removed from or added to a classification model. The study follows the Cross Industry Standard Process for Data Mining (CRISP-DM) methodology because it provides a structured approach. The classification techniques used include Naïve Bayes, Support Vector Machine and Random Forest, which were chosen because evidence from previous studies suggests that they generally perform well in a variety of tasks. The study first investigates whether the writing style features used in successful related works on single-topic and single-genre documents can be used effectively on cross-genre and cross-topic documents for Authorship Identification. The evaluation results showed that they can, because they performed reasonably well when used in the classification model. In addition, the study investigated which types of writing style features are ideal for cross-genre and cross-topic Authorship Identification. The Syntactical writing style features identified as ideal were Parts of Speech Tag (POST) unigram, bigram, trigram and quad-gram, and Punctuation Bigram. This shows that word-based adjectives contribute positively to Authorship Identification performance. Furthermore, the study examined which writing style features can be combined to work best on cross-genre and cross-topic documents in Authorship Identification. The best-performing combination was the Lexical, Syntactical, Structural and Content feature set. This shows that a combination of adjectives (Content), layout (Structural) and character-word collocations (Lexical, Syntactical) features contributes to successful cross-genre and cross-topic Authorship Identification. Finally, the study set out to find whether its results generalise across the three different families of classifiers. The results generally showed that, regardless of the classifier used, most of the highest results came from the Syntactical set, followed by the Lexical set, then the Content set and lastly the Structural set. This ranking was the same in the initial evaluation and after the ablation process. When the feature sets are combined, the Syntactical and Lexical feature sets generated the highest results. Combinations consisting mostly of Content features performed moderately, and combinations consisting mostly of Structural features had the lowest results across the classifiers. The study achieved its highest score of 0.837 with the Lexical, Syntactical, Structural and Content feature set.
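
The ablation idea described in the abstract can be illustrated with a minimal sketch. This is not the dissertation's code: the function `ablate`, the scorer `toy_evaluate` and the values in `TOY_WEIGHTS` are hypothetical, and the weights are arbitrary placeholders rather than reported results. In a real experiment the scorer would train and evaluate a classifier such as Naïve Bayes, Support Vector Machine or Random Forest on the selected feature groups.

```python
from typing import Callable, Dict, FrozenSet, Iterable


def ablate(
    groups: Iterable[str],
    evaluate: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Leave-one-out ablation over feature groups.

    For each group, return the change in score when that group is removed
    from the full set. A negative value means the group was helping.
    """
    full = frozenset(groups)
    baseline = evaluate(full)
    return {g: evaluate(full - {g}) - baseline for g in sorted(full)}


# Placeholder scorer. These weights are arbitrary illustration values,
# NOT results from the dissertation.
TOY_WEIGHTS = {
    "Lexical": 0.25,
    "Syntactical": 0.5,
    "Structural": -0.125,
    "Content": 0.125,
}


def toy_evaluate(active: FrozenSet[str]) -> float:
    """Stand-in for training and scoring a classifier on the active groups."""
    return sum(TOY_WEIGHTS[g] for g in active)


deltas = ablate(TOY_WEIGHTS, toy_evaluate)
helpful = [g for g, d in deltas.items() if d < 0]  # removal lowered the score
harmful = [g for g, d in deltas.items() if d > 0]  # removal raised the score
```

Passing the scorer in as a callable keeps the ablation loop independent of the classifier, so the same loop could in principle be reused for each of the three classifier families.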
    URI
    http://hdl.handle.net/10311/2388
    Collections
    • Masters Dissertations [34]

    Related items

    Showing items related by title, author, creator and subject.

    • Adherence to anti-diabetic drugs among patients with Type 2 diabetes mellitus at Muhimbili National Hospital, Dar es Salaam, Tanzania- A cross-sectional study 

      Rwegerera, Godfrey (Pan African Medical Journal in partnership with AFNET,www.panafrican-med-journal.com, 2014-04-07)
      Introduction: Adherence to diabetes mellitus treatment regimens among Type 2 diabetes patients in Tanzania has not been well documented. This study sought to assess adherence to antidiabetic drugs and associated factors ...
    • Influence of weak and strong donor groups on the fluorescence parameters and the intersystem crossing rate constant 

      Nijegorodov, N.; Mabbs, R.; Winkoun, D.P. (Elsevier Science Ltd. www.elsevier.com/locate/saa, 2003)
      Please refer to the attached article for an ABSTRACT. The abstract was not uploaded here due to formula appearance problem in UBRISA.
    • Photon interaction cross sections in the low energy region in Mg and V 

      Murty, V.R.K.; Devan, K.R.S. (Elsevier Science Ltd. www.elsevier.com/locate/radphyschem, 2004)
      The interaction of photons with matter has been extensively studied over the past several decades in view of its importance in basic radiation physics research, medical, industrial and other applied fields. Large amounts ...
