Codebase Analytics
This tutorial explains how codebase metrics are efficiently calculated using the codegen
library in the Codebase Analytics Dashboard. The metrics include indices of codebase maintainabilith and complexity.
View the full code and setup instructions in our codebase-analytics repository.
Complexity Metrics
Complexity metrics help quantify how easy or difficult a codebase is to understand and maintain. These metrics are calculated by analyzing various aspects of the code structure, including control flow, code volume, and inheritance patterns. The following metrics provide different perspectives on code complexity.
Cyclomatic Complexity
Cyclomatic Complexity measures the number of linearly independent paths through the codebase, making it a valuable indicator of how difficult code will be to test and maintain.
Calculation Method:
- Base complexity of 1
- +1 for each:
- if statement
- elif statement
- for loop
- while loop
- +1 for each boolean operator (and, or) in conditions
- +1 for each except block in try-catch statements
The calculate_cyclomatic_complexity()
function traverses the Codgen codebase object and uses the above rules to find statement objects within each function and calculate the overall cyclomatic complexity of the codebase.
Halstead Volume
Halstead Volume is a software metric which measures the complexity of a codebase by counting the number of unique operators and operands. It is calculated by multiplying the sum of unique operators and operands by the logarithm base 2 of the sum of unique operators and operands.
Halstead Volume: V = (N1 + N2) * log2(n1 + n2)
This calculation uses codegen’s expression types to make this calculation very efficient - these include BinaryExpression, UnaryExpression and ComparisonExpression. The function extracts operators and operands from the codebase object and calculated in calculate_halstead_volume()
function.
Depth of Inheritance (DOI)
Depth of Inheritance measures the length of inheritance chain for each class. It is calculated by counting the length of the superclasses list for each class in the codebase. The implementation is handled through a simple calculation using codegen’s class information in the calculate_doi()
function.
Maintainability Index
Maintainability Index is a software metric which measures how maintainable a codebase is. Maintainability is described as ease to support and change the code. This index is calculated as a factored formula consisting of SLOC (Source Lines Of Code), Cyclomatic Complexity and Halstead volume.
Maintainability Index: M = 171 - 5.2 * ln(HV) - 0.23 * CC - 16.2 * ln(SLOC)
This formula is then normalized to a scale of 0-100, where 100 is the maximum maintainability.
The implementation is handled through the calculate_maintainability_index()
function. The codegen codebase object is used to efficiently extract the Cyclomatic Complexity and Halstead Volume for each function and class in the codebase, which are then used to calculate the maintainability index.
Line Metrics
Line metrics provide insights into the size, complexity, and maintainability of a codebase. These measurements help determine the scale of a project, identify areas that may need refactoring, and track the growth of the codebase over time.
Lines of Code
Lines of Code refers to the total number of lines in the source code, including blank lines and comments. This is accomplished with a simple count of all lines in the source file.
Logical Lines of Code (LLOC)
LLOC is the amount of lines of code which contain actual functional statements. It excludes comments, blank lines, and other lines which do not contribute to the utility of the codebase. A high LLOC relative to total lines of code suggests dense, potentially complex code that may benefit from breaking into smaller functions or modules with more documentation.
Source Lines of Code (SLOC)
SLOC refers to the number of lines containing actual code, excluding blank lines. This includes programming language keywords and comments. While a higher SLOC indicates a larger codebase, it should be evaluated alongside other metrics like cyclomatic complexity and maintainability index to assess if the size is justified by the functionality provided.
Comment Density
Comment density is calculated by dividing the lines of code which contain comments by the total lines of code in the codebase. The formula is:
It measures the proportion of comments in the codebase and is a good indicator of how much code is properly documented. Accordingly, it can show how maintainable and easy to understand the codebase is.
General Codebase Statistics
The number of files is determined by traversing codegen’s FileNode objects in the parsed codebase. The number of functions is calculated by counting FunctionDef nodes across all parsed files. The number of classes is obtained by summing ClassDef nodes throughout the codebase.
The commit activity is calculated by using the git history of the repository. The number of commits is counted for each month in the last 12 months.
Using the Analysis Tool (Modal Server)
The tool is implemented as a FastAPI application wrapped in a Modal deployment. To analyze a repository:
- Send a POST request to
/analyze_repo
with the repository URL - The tool will:
- Clone the repository
- Parse the codebase using codegen
- Calculate all metrics
- Return a comprehensive JSON response with all metrics
This is the only endpoint in the FastAPI server, as it takes care of the entire analysis process. To run the FastAPI server locally, install all dependencies and run the server with modal serve modal_main.py
.
The server can be connected to the frontend dashboard. This web component is implemented as a Next.js application with appropriate comments and visualizations for the raw server data. To run the frontend locally, install all dependencies and run the server with npm run dev
. This can be connected to the FastAPI server by setting the URL in the request to the /analyze_repo
endpoint.
Was this page helpful?