MILTON Code Clustering Service
A language-agnostic code analysis and clustering service that analyzes source code repositories, extracts components, and groups them using the Leiden community detection algorithm.
Features
- Language Agnostic: Supports Python, JavaScript, TypeScript, Java, C, C++, and C#.
- Leiden Clustering: Uses the Leiden algorithm for community detection to group related components.
- REST API: FastAPI-based service for integration with the MILTON backend.
- Multi-Repository Support: Scan multiple repositories in a single request.
- Aspire Integration: Designed to run as a Docker container in .NET Aspire orchestration.
- Shared Volume: Shares a mounted volume with the MILTON API for accessing cloned repositories.
Installation
Local Development
- Install dependencies:
  pip install -r requirements.txt
  Note: You may need to install a C compiler for tree-sitter to build language bindings if pre-built wheels are not available for your platform.
- Run the API:
  python -m uvicorn api:app --host 0.0.0.0 --port 8000
Docker (via Aspire)
The service is automatically built and run by .NET Aspire. See the AppHost.cs in MILTON.AppHost for configuration.
# From the solution root, run the Aspire AppHost
cd MILTON.AppHost
dotnet run
API Endpoints
Health Check
GET /health
Returns service health status.
Readiness Check
GET /ready
Checks whether the service can access the repos directory.
List Repositories
GET /repos
Lists all repositories in the shared repos directory (organized by tenant/project).
Scan Single Repository
POST /scan
Content-Type: application/json
{
"path": "tenant-id/project-id/repo-name",
"ignore": ["node_modules", "venv", ".git"]
}
Returns clustered components for a single repository.
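A minimal client sketch for this endpoint, using only the Python standard library. The base URL `http://localhost:8000` matches the local uvicorn command but is an assumption for any other deployment, and `build_scan_request` and `scan_repository` are hypothetical helper names, not part of the service.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: local development host and port


def build_scan_request(path, ignore=None, base_url=BASE_URL):
    """Build (but do not send) a POST /scan request for one repository."""
    payload = {"path": path, "ignore": list(ignore or [])}
    return urllib.request.Request(
        url=f"{base_url}/scan",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def scan_repository(path, ignore=None, base_url=BASE_URL, timeout=60):
    """Send the request and decode the JSON response body."""
    request = build_scan_request(path, ignore, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Splitting request construction from sending keeps the payload easy to inspect without a running service.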
Scan Multiple Repositories
POST /scan/multi
Content-Type: application/json
{
"repositories": [
{"path": "tenant-id/project-id/repo1", "ignore": ["node_modules"]},
{"path": "tenant-id/project-id/repo2", "ignore": []}
],
"output_file": "tenant-id/project-id/clusters.json"
}
Scans multiple repositories and returns combined results. Optionally writes results to a JSON file.
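The request body can be assembled programmatically. The sketch below uses a hypothetical helper, `build_multi_scan_payload`, and assumes that leaving out the `output_file` key is how a caller skips the optional file output; the service may handle this differently.

```python
import json


def build_multi_scan_payload(repositories, output_file=None):
    """Return the JSON body for POST /scan/multi.

    `repositories` is an iterable of (path, ignore_list) pairs.
    """
    payload = {
        "repositories": [
            {"path": path, "ignore": list(ignore)} for path, ignore in repositories
        ]
    }
    # Assumption: omitting the key means "do not write a results file".
    if output_file is not None:
        payload["output_file"] = output_file
    return payload


if __name__ == "__main__":
    body = build_multi_scan_payload(
        [
            ("tenant-id/project-id/repo1", ["node_modules"]),
            ("tenant-id/project-id/repo2", []),
        ],
        output_file="tenant-id/project-id/clusters.json",
    )
    print(json.dumps(body, indent=2))
```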
Output Format
The output is a JSON array of cluster objects, each containing related components:
[
{
"cluster_id": 0,
"components": [
{
"id": "path/to/file::ClassName",
"path": "path/to/file.py",
"dependencies": ["OtherClass", "AnotherClass.method"],
"signatures": ["def method_name(self, arg1: str) -> bool"]
}
]
}
]
Architecture
- Repository Walker: Traverses the file system and identifies source files by language.
- Parser: Extracts AST facts using Tree-sitter for each supported language.
- Normalizer: Converts language-specific AST facts into a common Symbol format.
- Aggregator: Groups symbols into Components based on class membership.
- Clusterer: Builds a dependency graph and applies the Leiden algorithm for community detection.
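The last two stages can be illustrated with a small, dependency-free sketch. It is not the service's implementation: the input symbol dictionaries (`path`, `class_name`, `signature`, `dependencies`) are a hypothetical shape, the `path::ClassName` id format is inferred from the output example, and plain connected components are swapped in for the Leiden algorithm so the code runs without extra packages.

```python
import os
from collections import defaultdict


def aggregate(symbols):
    """Group symbol dicts into components keyed by 'path-without-ext::ClassName'."""
    components = {}
    for sym in symbols:
        stem = os.path.splitext(sym["path"])[0]
        comp_id = f"{stem}::{sym['class_name']}"
        comp = components.setdefault(
            comp_id,
            {"id": comp_id, "path": sym["path"], "dependencies": [], "signatures": []},
        )
        comp["signatures"].append(sym["signature"])
        for dep in sym["dependencies"]:
            if dep not in comp["dependencies"]:
                comp["dependencies"].append(dep)
    return components


def cluster(components):
    """Stand-in for Leiden: connected components of the dependency graph."""
    # Map class names to component ids so dependencies can be resolved.
    by_class = {cid.split("::", 1)[1]: cid for cid in components}
    neighbors = defaultdict(set)
    for cid, comp in components.items():
        for dep in comp["dependencies"]:
            target = by_class.get(dep.split(".", 1)[0])  # "Class.method" -> "Class"
            if target and target != cid:
                neighbors[cid].add(target)
                neighbors[target].add(cid)

    clusters, seen = [], set()
    for start in components:  # dicts preserve insertion order
        if start in seen:
            continue
        stack, members = [start], []
        seen.add(start)
        while stack:  # depth-first walk over undirected edges
            node = stack.pop()
            members.append(node)
            for nxt in sorted(neighbors[node]):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        clusters.append(
            {
                "cluster_id": len(clusters),
                "components": [components[m] for m in sorted(members)],
            }
        )
    return clusters
```

Connected components only separate code that shares no dependency edges at all. Leiden additionally splits a connected graph into densely linked communities, so real output will generally be finer-grained than this sketch produces.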
Environment Variables
| Variable | Default | Description |
|---|---|---|
| REPOS_PATH | /app/repos | Base path for repository storage |
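As an illustration of how this variable might be consumed, the sketch below reads `REPOS_PATH` with the documented default and resolves a request path against it. The helper names and the path-escape check are assumptions for illustration; the service's actual handling is not shown here.

```python
import os
from pathlib import Path


def repos_root():
    """Return the base directory, honoring REPOS_PATH with the documented default."""
    return Path(os.environ.get("REPOS_PATH", "/app/repos"))


def resolve_repo(relative_path):
    """Resolve a 'tenant/project/repo' path and reject escapes from the root."""
    root = repos_root().resolve()
    candidate = (root / relative_path).resolve()
    # Hypothetical safety check: the result must stay inside the repos root.
    if candidate != root and root not in candidate.parents:
        raise ValueError(f"path escapes repos root: {relative_path}")
    return candidate
```

For local development, the variable can point at any directory that holds the cloned repositories, for example by exporting `REPOS_PATH` before starting uvicorn.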
Integration with MILTON
The clustering service is called by the MILTON backend after repositories are cloned:
- MILTON API clones repositories to the shared repos volume
- MILTON API calls the clustering service with repository paths
- Clustering service analyzes code and returns clusters
- MILTON uses clusters for document generation and traceability
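How MILTON consumes the clusters is not described here. As a hypothetical consumer, the sketch below parses a response in the documented output format and builds a lookup from component id to cluster id, which is the kind of index a traceability step could use.

```python
import json


def index_clusters(clusters):
    """Map each component id to the cluster_id that contains it."""
    return {
        component["id"]: cluster["cluster_id"]
        for cluster in clusters
        for component in cluster["components"]
    }


if __name__ == "__main__":
    # Sample response taken from the Output Format section.
    response_text = """
    [
      {
        "cluster_id": 0,
        "components": [
          {
            "id": "path/to/file::ClassName",
            "path": "path/to/file.py",
            "dependencies": ["OtherClass", "AnotherClass.method"],
            "signatures": ["def method_name(self, arg1: str) -> bool"]
          }
        ]
      }
    ]
    """
    print(index_clusters(json.loads(response_text)))
```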