COVID-19 Literature Clustering


This project organizes and visualizes COVID-19 literature by combining k-means clustering, t-SNE dimensionality reduction, and Latent Dirichlet Allocation (LDA) topic modeling. Together these steps reduce the dataset's dimensionality and identify thematic clusters of papers. K-means and t-SNE each reveal relationships between papers independently, while LDA characterizes each cluster by its most prevalent keywords. The results are evaluated by examining the plots, reviewing the papers within clusters, and testing a classification model. Although manual inspection has its limits, the tool helps health professionals find related publications efficiently. Future work could build a user-friendly interface on top of these analytical techniques.




Python, pandas, re, scikit-learn, Matplotlib, seaborn, spaCy, Bokeh




Final Clustering Output