The purpose of this project is to design and implement an automated, web-based framework that evaluates the quality of public GitHub repositories from a single URL input, producing a unified Repository Health Score without requiring any local installation, authentication, or repository cloning. Despite the presence of over 420 million public repositories on GitHub, no lightweight tool exists that combines API-driven metadata analysis with Natural Language Processing (NLP)-based documentation evaluation to provide a holistic, transparent quality score. The system integrates the GitHub REST API for real-time data retrieval, a Python NLP subprocess pipeline employing tokenisation, stop word removal, keyword analysis, and section completeness detection for README evaluation, and a rule-based AI engine for structural and activity assessment. A weighted scoring algorithm aggregates five dimension scores — documentation quality (30%), code structure organisation (20%), dependency management (20%), repository activity (15%), and repository popularity (15%) — into the final health score. The full-stack implementation uses React 18 with TypeScript on the frontend, Node.js and Express.js on the backend, and Python 3.10 for NLP processing. Empirical validation across ten diverse public repositories confirmed accurate, consistent scores within a ten-second response time, with 80% of user acceptance testers rating results as accurate or very accurate. The system addresses a documented gap by providing a zero-setup, URL-only, multi-dimensional evaluation platform accessible without technical overhead or financial cost.
Introduction
This paper describes a web-based system that automatically evaluates the quality and health of GitHub repositories using a multi-dimensional scoring approach. With the rapid growth of open-source repositories on platforms like GitHub, manual assessment has become impractical and inconsistent, creating demand for automated solutions. Existing tools are limited because they often rely on setup-heavy workflows and focus only on basic metrics such as stars or commit counts, ignoring deeper aspects such as documentation quality and code organisation.
To address this gap, the proposed system, a Web-Based Repository Health Analyser, provides a zero-setup interface where users simply submit a repository URL. The system uses the GitHub REST API, a Node.js backend, and a Python-based NLP module to analyse repository metadata and README documentation. The NLP component evaluates text quality using tokenisation, stop word removal, keyword analysis, and section completeness checks.
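The README checks named above can be illustrated with a minimal sketch. This is not the paper's pipeline: the stop-word list, the expected section names, and all function names are hypothetical, and only the Python standard library is used in place of whatever NLP toolkit the implementation relies on.

```python
import re
from collections import Counter

# Hypothetical stop-word list and section names, chosen for illustration only.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "for", "with", "it"}
EXPECTED_SECTIONS = ("installation", "usage", "contributing", "license")


def tokenise(text: str) -> list[str]:
    """Lowercase the text, extract alphabetic tokens, and drop stop words."""
    return [tok for tok in re.findall(r"[a-z]+", text.lower()) if tok not in STOP_WORDS]


def top_keywords(text: str, n: int = 5) -> list[tuple[str, int]]:
    """Return the n most frequent non-stop-word tokens with their counts."""
    return Counter(tokenise(text)).most_common(n)


def found_sections(readme: str) -> set[str]:
    """Return the expected sections whose name appears in a Markdown heading."""
    headings = re.findall(r"^#{1,6}[ \t]+(.+)$", readme, flags=re.MULTILINE)
    heading_text = " ".join(headings).lower()
    return {name for name in EXPECTED_SECTIONS if name in heading_text}


def section_completeness(readme: str) -> float:
    """Fraction of expected sections present, between 0.0 and 1.0."""
    return len(found_sections(readme)) / len(EXPECTED_SECTIONS)


readme = "# Demo\n\nA tool for the web.\n\n## Installation\n\npip install demo\n\n## Usage\n\nRun demo.\n"
print(sorted(found_sections(readme)))
print(section_completeness(readme))
```

A completeness fraction like this could feed into the documentation dimension, although how the actual system maps README features to a score is not specified here.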
The system computes a composite “Repository Health Score” as a weighted sum of five dimension scores: documentation quality (30%), code structure (20%), dependency management (20%), repository activity (15%), and popularity (15%).
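As a rough illustration of the weighted aggregation, the sketch below combines five dimension scores using the weights stated in the abstract. The dictionary keys, the function name, and the assumption that each dimension is scored on a 0 to 100 scale are hypothetical and not taken from the implementation.

```python
# Weights as stated in the abstract; the key names are illustrative.
WEIGHTS = {
    "documentation": 0.30,
    "structure": 0.20,
    "dependencies": 0.20,
    "activity": 0.15,
    "popularity": 0.15,
}


def health_score(dimension_scores: dict[str, float]) -> float:
    """Combine per-dimension scores (assumed 0-100) into one weighted score."""
    missing = WEIGHTS.keys() - dimension_scores.keys()
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    total = 0.0
    for name, weight in WEIGHTS.items():
        score = dimension_scores[name]
        if not 0 <= score <= 100:
            raise ValueError(f"{name} score out of range: {score}")
        total += weight * score
    return round(total, 2)


example = {
    "documentation": 80,
    "structure": 70,
    "dependencies": 60,
    "activity": 50,
    "popularity": 40,
}
print(health_score(example))  # 24 + 14 + 12 + 7.5 + 6 = 63.5
```

Because the weights sum to 1, the result stays on the same 0 to 100 scale as the inputs, and a strong documentation score moves the total twice as much as an equally strong popularity score.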
Experimental evaluation on ten diverse public GitHub repositories shows that the system produces meaningful and consistent scores that align with expert judgments, while responding within ten seconds. In user acceptance testing, 80% of testers rated the automated scores as accurate or very accurate. However, limitations include dependency on GitHub API rate limits, lack of language-specific code quality analysis, and inability to evaluate private repositories. Future improvements include authentication support and more advanced code analysis features.
Conclusion
This paper presented the design, implementation, and empirical validation of a Web-Based Repository Health Analyser providing multi-dimensional quality assessment of public GitHub repositories. Validation across ten repositories confirmed accurate, consistent health scores within a ten-second response window, with 80% of user acceptance testers rating results as accurate or very accurate.
The system addresses a documented gap in the repository analysis landscape by providing a transparent, multi-dimensional quality evaluation tool accessible without authentication, local installation, or financial cost. The weighted five-dimension scoring framework offers a more holistic quality assessment than single-metric approaches based on star or commit counts.
Future research directions include integration of language-specific static analysis, OAuth-based authenticated access for private repository support, longitudinal health tracking, and extension of the NLP pipeline for multi-language documentation analysis beyond English-language README content.