COVID-19 Literature Clustering
🎯 BRIEF
This project organizes and visualizes COVID-19 literature by combining k-means clustering, t-SNE dimensionality reduction, and Latent Dirichlet Allocation (LDA) topic modeling. K-means groups the papers into thematic clusters, and t-SNE reduces the dataset's dimensionality for plotting; each reveals relationships between papers independently of the other. LDA then adds detail to each cluster by identifying its most prevalent keywords. The results were evaluated by examining the plot, reviewing papers, and testing a classification model. Although manual inspection has its limits, the tool helps health professionals find related publications efficiently. Future work could build a user-friendly interface on top of these techniques.
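A minimal sketch of the clustering and embedding steps described above, using scikit-learn. The toy corpus, the parameter values, and the function name `cluster_and_embed` are assumptions for illustration, not the project's actual code or settings. TF-IDF features and a PCA step are included because the project lists them among its techniques.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

# Hypothetical toy corpus standing in for the cleaned paper texts.
docs = [
    "vaccine trial shows strong immune response",
    "antibody levels after vaccine booster dose",
    "vaccine efficacy against new viral variants",
    "immune response and antibody production in patients",
    "hospital ventilator capacity during the outbreak",
    "intensive care staffing and ventilator shortage",
    "hospital admissions and bed capacity planning",
    "intensive care outcomes for ventilated patients",
]


def cluster_and_embed(texts, n_clusters=2, variance=0.95, seed=42):
    """Return (cluster labels, 2D coordinates) for a list of documents."""
    # TF-IDF turns each document into a weighted term vector.
    features = TfidfVectorizer(stop_words="english").fit_transform(texts).toarray()

    # A float n_components keeps just enough components to explain
    # that share of the variance, discarding the rest as noise.
    reduced = PCA(n_components=variance).fit_transform(features)

    # K-means assigns each document to one of n_clusters groups.
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(reduced)

    # t-SNE requires perplexity < number of samples; keep it small for tiny corpora.
    perplexity = min(30, max(1, (len(texts) - 1) // 2))
    xy = TSNE(
        n_components=2, perplexity=perplexity, init="random", random_state=seed
    ).fit_transform(reduced)
    return labels, xy


labels, xy = cluster_and_embed(docs)
print(labels)    # one cluster id per document
print(xy.shape)  # (number of documents, 2)
```

K-means and t-SNE both take the PCA-reduced vectors as input here, but neither sees the other's output, so the cluster labels can be used to color the t-SNE plot as an independent check.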
🔧 TOOLS
Python, pandas, re, scikit-learn, Matplotlib, seaborn, spaCy, Bokeh
🤝 CONTRIBUTION
Created an interactive tool using Natural Language Processing (NLP) and clustering to simplify the COVID-19 literature search for researchers.
Applied TF-IDF to generate feature vectors, t-SNE to embed them in a 2D plane, and PCA to remove noise.
Used K-means and LDA to label clusters and discover keywords, then enhanced usability with Bokeh for filtering, searching, and accessing publications via keywords.
🏆 ACHIEVEMENTS
Revealed hidden connections in literature using unsupervised learning.
Implemented a portable, downloadable HTML tool, ensuring easy access, offline usage, and failover safety.
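One way to get a portable, offline-capable HTML file with Bokeh is to save the plot with inlined resources, so the page does not need to fetch BokehJS from a CDN. The sketch below assumes that approach; the coordinates, titles, output path, and the function name `export_cluster_plot` are made up for illustration and are not taken from the project.

```python
import os
import tempfile

from bokeh.models import ColumnDataSource
from bokeh.plotting import figure, save
from bokeh.resources import INLINE

PAGE_TITLE = "COVID-19 literature clusters"


def export_cluster_plot(xy, labels, titles, path):
    """Write a standalone HTML scatter plot and return the saved file path."""
    source = ColumnDataSource(
        data=dict(
            x=[point[0] for point in xy],
            y=[point[1] for point in xy],
            cluster=[str(label) for label in labels],
            title=list(titles),
        )
    )
    plot = figure(
        title=PAGE_TITLE,
        tools="pan,wheel_zoom,reset,hover",
        # Hovering over a point shows the paper title and its cluster.
        tooltips=[("title", "@title"), ("cluster", "@cluster")],
    )
    plot.scatter("x", "y", source=source, size=10)

    # INLINE embeds the BokehJS library in the file, so it opens without network access.
    return save(plot, filename=path, resources=INLINE, title=PAGE_TITLE)


# Hypothetical 2D coordinates, cluster ids, and paper titles.
xy = [(0.1, 0.2), (0.3, 0.1), (2.0, 2.1), (2.2, 1.9)]
labels = [0, 0, 1, 1]
titles = ["Paper A", "Paper B", "Paper C", "Paper D"]

out_path = export_cluster_plot(
    xy, labels, titles, os.path.join(tempfile.mkdtemp(), "clusters.html")
)
print(out_path)
```

The resulting file is larger than a CDN-backed page because it carries the JavaScript library with it, which is the trade-off for being usable offline.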
📷 SCREENSHOTS
Final Clustering Output