MILTON Code Clustering Service

A language-agnostic service that parses source code repositories, extracts components, and groups related components using the Leiden community detection algorithm.

Features

Language Agnostic: Supports Python, JavaScript, TypeScript, Java, C, C++, and C#.
Leiden Clustering: Uses the Leiden algorithm for community detection to group related components.
REST API: FastAPI-based service for integration with the MILTON backend.
Multi-Repository Support: Scan multiple repositories in a single request.
Aspire Integration: Designed to run as a Docker container in .NET Aspire orchestration.
Shared Volume: Shares a mounted volume with the MILTON API for accessing cloned repositories.

Installation

Local Development

Install dependencies:
```
pip install -r requirements.txt
```
Note: You may need to install a C compiler for tree-sitter to build language bindings if pre-built wheels are not available for your platform.

Run the API:

python -m uvicorn api:app --host 0.0.0.0 --port 8000

Docker (via Aspire)

The service is built and run automatically by .NET Aspire. See AppHost.cs in the MILTON.AppHost project for configuration.

# From the solution root, run the Aspire AppHost
cd MILTON.AppHost
dotnet run

API Endpoints

Health Check

GET /health

Returns service health status.

Readiness Check

GET /ready

Checks if the service can access the repos directory.
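
A readiness probe of this kind can be sketched as follows. This is only an illustration, not the service's actual handler: the function name `repos_dir_ready` is hypothetical, and it assumes that "can access" means the directory exists and can be listed.

```python
import os
from pathlib import Path


def repos_dir_ready(repos_path: str) -> bool:
    """Return True if repos_path is an existing directory that can be listed."""
    path = Path(repos_path)
    if not path.is_dir():
        return False
    try:
        # Listing fails if permissions are missing or the volume mount is broken.
        with os.scandir(path) as entries:
            next(entries, None)
    except OSError:
        return False
    return True
```

A handler for `/ready` could call such a function and report a non-ready status when it returns `False`.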

List Repositories

GET /repos

Lists all repositories in the shared repos directory (organized by tenant/project).
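
As a rough sketch of what such a listing involves, the snippet below walks a directory laid out as `<root>/<tenant>/<project>/<repo>`. The exact on-disk layout and the name `list_repos` are assumptions made for illustration; the service's real response shape is not shown here.

```python
from pathlib import Path


def list_repos(repos_root: str) -> list[str]:
    """Return 'tenant/project/repo' strings for each repository directory."""
    root = Path(repos_root)
    found = []
    # Assumed layout: <root>/<tenant>/<project>/<repo>
    for repo_dir in sorted(root.glob("*/*/*")):
        # Skip plain files such as a clusters.json stored next to the repos.
        if repo_dir.is_dir():
            found.append(repo_dir.relative_to(root).as_posix())
    return found
```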

Scan Single Repository

POST /scan
Content-Type: application/json
 
{
  "path": "tenant-id/project-id/repo-name",
  "ignore": ["node_modules", "venv", ".git"]
}

Returns clustered components for a single repository.
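
A minimal Python client for this endpoint might look like the sketch below. It uses only the standard library. The helper names `build_scan_request` and `scan_repository` are hypothetical, and the base URL (for example `http://localhost:8000`, matching the local uvicorn command) is an assumption about where the service is reachable.

```python
import json
import urllib.request


def build_scan_request(base_url: str, path: str, ignore: list[str]) -> urllib.request.Request:
    """Build the POST /scan request without sending it."""
    body = json.dumps({"path": path, "ignore": ignore}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/scan",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def scan_repository(base_url: str, path: str, ignore: list[str]):
    """Send the request and return the decoded JSON response."""
    request = build_scan_request(base_url, path, ignore)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Separating request construction from sending makes the payload easy to inspect or unit-test without a running service.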

Scan Multiple Repositories

POST /scan/multi
Content-Type: application/json
 
{
  "repositories": [
    {"path": "tenant-id/project-id/repo1", "ignore": ["node_modules"]},
    {"path": "tenant-id/project-id/repo2", "ignore": []}
  ],
  "output_file": "tenant-id/project-id/clusters.json"
}

Scans multiple repositories and returns the combined results. If `output_file` is provided, the results are also written to that JSON file.
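
The request body above can be assembled programmatically. The sketch below is an illustration under two assumptions: the helper name `build_multi_scan_payload` is hypothetical, and `output_file` is treated as optional and left out of the payload when not given.

```python
from typing import Optional


def build_multi_scan_payload(
    repositories: dict[str, list[str]],
    output_file: Optional[str] = None,
) -> dict:
    """Map {repo path: ignore list} to a /scan/multi request body."""
    payload: dict = {
        "repositories": [
            {"path": path, "ignore": list(ignore)}
            for path, ignore in repositories.items()
        ]
    }
    # Assumption: the field is simply omitted when no output file is wanted.
    if output_file is not None:
        payload["output_file"] = output_file
    return payload
```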

Output Format

The output is a JSON array of cluster objects, each containing related components:

[
  {
    "cluster_id": 0,
    "components": [
      {
        "id": "path/to/file::ClassName",
        "path": "path/to/file.py",
        "dependencies": ["OtherClass", "AnotherClass.method"],
        "signatures": ["def method_name(self, arg1: str) -> bool"]
      }
    ]
  }
]
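
To consume this output, a caller will often want to look up which cluster a component belongs to. The sketch below builds that index from the JSON shown above; the function name `index_components_by_cluster` is hypothetical, and only the `cluster_id`, `components`, and `id` fields from the example are relied on.

```python
import json


def index_components_by_cluster(clusters_json: str) -> dict[str, int]:
    """Map each component id to the cluster_id of the cluster containing it."""
    clusters = json.loads(clusters_json)
    index: dict[str, int] = {}
    for cluster in clusters:
        for component in cluster["components"]:
            index[component["id"]] = cluster["cluster_id"]
    return index
```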

Architecture

Repository Walker: Traverses the file system and identifies source files by language.
Parser: Extracts AST facts using Tree-sitter for each supported language.
Normalizer: Converts language-specific AST facts into a common Symbol format.
Aggregator: Groups symbols into Components based on class membership.
Clusterer: Builds a dependency graph and applies the Leiden algorithm for community detection.
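
To make the Clusterer's first step concrete, the sketch below derives dependency edges from components shaped like those in the Output Format section. It is an illustration only, not the service's implementation: the function name `build_dependency_edges` is hypothetical, and the name-matching rule (take the part of the id after `::`, and the part of a dependency before the first `.`) is an assumption. A Leiden implementation would then be run on the resulting graph.

```python
def build_dependency_edges(components: list[dict]) -> set[tuple[str, str]]:
    """Return directed (source id, target id) edges between known components."""
    # Assumption: the text after '::' in a component id is its class name.
    name_to_id = {c["id"].split("::")[-1]: c["id"] for c in components}
    edges: set[tuple[str, str]] = set()
    for component in components:
        for dependency in component.get("dependencies", []):
            # "AnotherClass.method" refers to the component "AnotherClass".
            target_name = dependency.split(".")[0]
            target_id = name_to_id.get(target_name)
            # Drop unresolved names and self-references.
            if target_id is not None and target_id != component["id"]:
                edges.add((component["id"], target_id))
    return edges
```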

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| `REPOS_PATH` | `/app/repos` | Base path for repository storage |
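
One way a request path such as `tenant-id/project-id/repo-name` could be resolved against `REPOS_PATH` is sketched below. The helper name `resolve_repo_path` is hypothetical, and the check that rejects paths escaping the base directory is an added safeguard assumed for illustration, not something the service is documented to do. Requires Python 3.9 or later.

```python
import os
from pathlib import Path
from typing import Mapping


def resolve_repo_path(relative_path: str, environ: Mapping[str, str] = os.environ) -> Path:
    """Resolve a request path against REPOS_PATH (default /app/repos)."""
    base = Path(environ.get("REPOS_PATH", "/app/repos")).resolve()
    candidate = (base / relative_path).resolve()
    # Assumed safeguard: refuse paths that leave the base directory via "..".
    if not candidate.is_relative_to(base):
        raise ValueError(f"path escapes repos directory: {relative_path}")
    return candidate
```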

Integration with MILTON

The clustering service is called by the MILTON backend after repositories are cloned:

1. MILTON API clones repositories to the shared repos volume.
2. MILTON API calls the clustering service with the repository paths.
3. The clustering service analyzes the code and returns clusters.
4. MILTON uses the clusters for document generation and traceability.
