An Analysis of Gender Pronouns Using SparkNLP

For a small project, I tested whether the frequency of feminine versus masculine personal and possessive pronouns differs significantly between fiction and nonfiction, using classic literature in ASCII format from the 1980s. I processed the texts with SparkNLP (tokenization, part-of-speech tagging, dependency parsing, and named entity recognition), counted pronoun occurrences, and ran a chi-square test. Fiction contained a significantly higher frequency of masculine personal pronouns than nonfiction. This may reflect the character-driven nature of fiction and a potential gender bias in the texts, whereas nonfiction generally maintains a more neutral, objective style. You can view my code in Colab here and the presentation slide deck here.
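To make the counting and testing steps concrete, here is a minimal sketch in plain Python. It is not the SparkNLP pipeline from the project: it swaps in a simple regex tokenizer, and the pronoun lists, function names, and the 2x2 layout of the table (genre by pronoun gender) are assumptions made for illustration. The p-value uses the closed form for one degree of freedom, so it only applies to a 2x2 table.

```python
import math
import re
from collections import Counter

# Assumed pronoun lists for illustration; the project's exact lists are not stated.
FEMININE = {"she", "her", "hers", "herself"}
MASCULINE = {"he", "him", "his", "himself"}


def count_pronouns(text):
    """Return (feminine, masculine) pronoun counts using a naive regex tokenizer."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    feminine = sum(counts[word] for word in FEMININE)
    masculine = sum(counts[word] for word in MASCULINE)
    return feminine, masculine


def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table.

    No continuity correction is applied. Returns (statistic, p_value).
    """
    (a, b), (c, d) = table
    total = a + b + c + d
    row_sums = [a + b, c + d]
    col_sums = [a + c, b + d]
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / total
            statistic += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


if __name__ == "__main__":
    # Toy strings standing in for the fiction and nonfiction corpora.
    fiction = "He told her that his brother had seen him. She laughed."
    nonfiction = "She published her findings after he reviewed the draft."
    table = [list(count_pronouns(fiction)), list(count_pronouns(nonfiction))]
    statistic, p_value = chi_square_2x2(table)
    print(table, statistic, p_value)
```

On real data, each row of the table would hold the summed counts for one genre, and a small p-value would indicate that the feminine-to-masculine split differs between fiction and nonfiction. Counts this small would not support a meaningful test; the toy strings only show the data flow.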

This project strengthened my data processing and analysis skills. Using SparkNLP for text processing and a chi-square test for hypothesis testing gave me hands-on experience with natural language data, from tokenization and part-of-speech tagging through to statistical analysis, and reinforced my ability to draw well-supported conclusions from a large text corpus.
