Dimitra Niaouri (Ph.D. ETIS - CY Agora)
Title: Machine Learning for SUD (Socially Unacceptable Discourse) analysis: from Shallow Learning to Large Language Models to the rescue, where do we stand?
Abstract:
With the rapid growth of social media, the prevalence of Socially Unacceptable Discourse (SUD) on online platforms presents a substantial challenge. SUD encompasses offensive language, controversial narratives, and distinctive grammatical features, which call for a precise approach to its characterization and detection in online discourse.
In this talk we introduce a comprehensive corpus of manually annotated texts from various online sources, enabling a global assessment of SUD detection solutions across 12 different classes. Our investigation distinguishes between Masked Language Models (MLMs) and Causal Language Models (CLMs), exemplified by models such as BERT, Llama 2, Mistral, and MPT, and takes a step towards their interpretability. Traditional ML models, including Support Vector Machines (SVMs), are also examined.
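To give a concrete idea of the shallow-learning side of this comparison, the sketch below shows a generic TF-IDF + linear SVM baseline for multi-class SUD classification. It is an illustration under stated assumptions, not the talk's actual pipeline: the texts and labels arguments stand in for the annotated corpus and its 12 classes.

# Minimal sketch of a shallow-learning baseline for multi-class SUD
# classification (hypothetical setup; not the corpus or configuration
# presented in the talk).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report


def train_svm_baseline(texts, labels):
    """Train and evaluate a TF-IDF + linear SVM baseline.

    `texts` is a list of posts and `labels` holds one of the 12 SUD
    classes per post (both assumed to come from the annotated corpus).
    """
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, stratify=labels, random_state=42)
    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), min_df=2),  # word uni- and bigrams
        LinearSVC(C=1.0),                               # linear-kernel SVM
    )
    model.fit(X_train, y_train)
    # Per-class precision/recall/F1 makes class-imbalance effects visible.
    print(classification_report(y_test, model.predict(X_test)))
    return model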
We will show that SUD classification is promising, but its susceptibility to class imbalance highlights the need for improved discriminant power. Our analysis emphasizes the nuanced trade-off between bidirectional contextual awareness (favoring MLMs) and sequential dependency modeling (advantageous for CLMs). We further underscore the need for sustained efforts within the ML community, as well as the broader implications for linguistics, discourse analysis, and semantics, advocating for formal guidelines. By focusing on interpretability, we avoid the black-box effect: we can learn which specific features characterize different aspects of the discourse under study, which opens fruitful scientific interactions with research on CMC (computer-mediated communication) corpora in corpus linguistics. Such exchange is necessary for future annotation campaigns on SUD corpora or extremist narratives, to ensure good interoperability between projects and studies that require a precise approach to the characterization and detection of online discourse.
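As a side note on the class-imbalance issue mentioned above, one common and generic mitigation is to weight the training loss by inverse class frequency. The sketch below illustrates this in PyTorch; it is a generic recipe, not the talk's exact training setup, and `labels` is assumed to hold the 12-class annotations of the training split.

# Generic sketch: build a class-weighted loss to offset label imbalance
# when fine-tuning a transformer classifier (illustrative, not the talk's
# actual configuration).
import numpy as np
import torch
from sklearn.utils.class_weight import compute_class_weight


def weighted_loss(labels):
    """Return a CrossEntropyLoss whose per-class weights offset imbalance."""
    y = np.asarray(labels)
    classes = np.unique(y)  # the SUD classes observed in the training split
    weights = compute_class_weight("balanced", classes=classes, y=y)
    return torch.nn.CrossEntropyLoss(
        weight=torch.tensor(weights, dtype=torch.float))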
Bio:
As a newly enrolled Ph.D. candidate in the ETIS and AGORA labs, Dimitra has an academic background rooted in linguistics and natural language processing (NLP). She is driven by a passion for advancing the understanding of extremist narrative detection. Throughout her research, Dimitra is committed to exploring state-of-the-art machine learning and deep learning methodologies, with the aim of unveiling intricate narratives in the digital sphere. Her primary objective is to develop innovative tools that can effectively delineate extremist narratives within diverse corpora, spanning contexts such as social media and political discourse.