MILTON Code Clustering Service

A language-agnostic code analysis service that scans source code repositories, extracts components, and groups them using the Leiden community detection algorithm.

Features

  • Language Agnostic: Supports Python, JavaScript, TypeScript, Java, C, C++, and C#.
  • Leiden Clustering: Uses the Leiden algorithm for community detection to group related components.
  • REST API: FastAPI-based service for integration with the MILTON backend.
  • Multi-Repository Support: Scan multiple repositories in a single request.
  • Aspire Integration: Designed to run as a Docker container in .NET Aspire orchestration.
  • Shared Volume: Shares a mounted volume with the MILTON API for accessing cloned repositories.

Installation

Local Development

  1. Install dependencies:

    pip install -r requirements.txt

    Note: If no pre-built tree-sitter wheels are available for your platform, you need a C compiler installed so that the language bindings can be built from source.

  2. Run the API:

    python -m uvicorn api:app --host 0.0.0.0 --port 8000

Docker (via Aspire)

The service is built and run automatically by .NET Aspire. See AppHost.cs in MILTON.AppHost for the configuration.

# From the solution root, run the Aspire AppHost
cd MILTON.AppHost
dotnet run

API Endpoints

Health Check

GET /health

Returns service health status.

Readiness Check

GET /ready

Checks if the service can access the repos directory.
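
A minimal sketch of the kind of check a readiness probe like this could perform. The function name is hypothetical and the service's actual implementation may differ; the sketch only verifies that the directory exists and can be read and traversed.

```python
import os
from pathlib import Path


def repos_dir_is_ready(repos_path: str) -> bool:
    """Return True if repos_path is an existing directory that can be read and traversed.

    Hypothetical helper for illustration; not taken from the service's code.
    """
    path = Path(repos_path)
    # R_OK: directory entries can be listed; X_OK: the directory can be entered.
    return path.is_dir() and os.access(path, os.R_OK | os.X_OK)
```

A FastAPI handler could call a helper like this and return a non-2xx status when it yields False.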

List Repositories

GET /repos

Lists all repositories in the shared repos directory (organized by tenant/project).
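
As an illustration of the tenant/project layout, the sketch below enumerates repositories three levels below the repos root. It assumes a <tenant>/<project>/<repo> directory structure, inferred from the example paths used by the scan endpoints; the function name is hypothetical.

```python
from pathlib import Path


def list_repositories(repos_root: str) -> list[str]:
    """Return repository paths relative to repos_root as 'tenant/project/repo'.

    Assumes a three-level <tenant>/<project>/<repo> layout (an assumption,
    not confirmed by the service's source).
    """
    root = Path(repos_root)
    found = []
    for candidate in sorted(root.glob("*/*/*")):
        # Only directories count as repositories; stray files are skipped.
        if candidate.is_dir():
            found.append(candidate.relative_to(root).as_posix())
    return found
```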

Scan Single Repository

POST /scan
Content-Type: application/json
 
{
  "path": "tenant-id/project-id/repo-name",
  "ignore": ["node_modules", "venv", ".git"]
}

Returns clustered components for a single repository.
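
For callers written in Python, the request above could be assembled with the standard library as sketched here. The base URL http://localhost:8000 matches the local uvicorn command; the function name is hypothetical.

```python
import json
import urllib.request


def build_scan_request(base_url: str, path: str, ignore: list[str]) -> urllib.request.Request:
    """Build (but do not send) a POST /scan request with a JSON body."""
    body = json.dumps({"path": path, "ignore": list(ignore)}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/scan",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_scan_request(
    "http://localhost:8000",
    "tenant-id/project-id/repo-name",
    ["node_modules", "venv", ".git"],
)
```

To send it, pass the request to urllib.request.urlopen and decode the response with json.load.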

Scan Multiple Repositories

POST /scan/multi
Content-Type: application/json
 
{
  "repositories": [
    {"path": "tenant-id/project-id/repo1", "ignore": ["node_modules"]},
    {"path": "tenant-id/project-id/repo2", "ignore": []}
  ],
  "output_file": "tenant-id/project-id/clusters.json"
}

Scans multiple repositories and returns combined results. Optionally writes results to a JSON file.
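
A small sketch of building the multi-repository payload programmatically. It assumes that the output_file key can simply be left out when no file should be written; the function name is hypothetical.

```python
import json
from typing import Optional


def build_multi_scan_payload(
    repositories: list[tuple[str, list[str]]],
    output_file: Optional[str] = None,
) -> dict:
    """Build the JSON body for POST /scan/multi from (path, ignore) pairs."""
    payload: dict = {
        "repositories": [
            {"path": path, "ignore": list(ignore)} for path, ignore in repositories
        ]
    }
    # Assumption: omitting "output_file" means results are only returned, not written.
    if output_file is not None:
        payload["output_file"] = output_file
    return payload


payload = build_multi_scan_payload(
    [
        ("tenant-id/project-id/repo1", ["node_modules"]),
        ("tenant-id/project-id/repo2", []),
    ],
    output_file="tenant-id/project-id/clusters.json",
)
body = json.dumps(payload)
```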

Output Format

The output is a JSON array of cluster objects, each containing related components:

[
  {
    "cluster_id": 0,
    "components": [
      {
        "id": "path/to/file::ClassName",
        "path": "path/to/file.py",
        "dependencies": ["OtherClass", "AnotherClass.method"],
        "signatures": ["def method_name(self, arg1: str) -> bool"]
      }
    ]
  }
]
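
A consumer will often want to look up the cluster a given component belongs to. The sketch below builds that index from the output shown above; the function name is hypothetical and the sample data is the example from this section.

```python
import json

SAMPLE_OUTPUT = """
[
  {
    "cluster_id": 0,
    "components": [
      {
        "id": "path/to/file::ClassName",
        "path": "path/to/file.py",
        "dependencies": ["OtherClass", "AnotherClass.method"],
        "signatures": ["def method_name(self, arg1: str) -> bool"]
      }
    ]
  }
]
"""


def index_components(clusters: list[dict]) -> dict[str, int]:
    """Map each component ID to the ID of the cluster that contains it."""
    index = {}
    for cluster in clusters:
        for component in cluster["components"]:
            index[component["id"]] = cluster["cluster_id"]
    return index


component_to_cluster = index_components(json.loads(SAMPLE_OUTPUT))
```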

Architecture

  1. Repository Walker: Traverses the file system and identifies source files by language.
  2. Parser: Extracts AST facts using Tree-sitter for each supported language.
  3. Normalizer: Converts language-specific AST facts into a common Symbol format.
  4. Aggregator: Groups symbols into Components based on class membership.
  5. Clusterer: Builds a dependency graph and applies the Leiden algorithm for community detection.
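
To make the last step more concrete, here is a rough sketch of how a dependency graph could be derived from components before clustering. It is an illustration under stated assumptions, not the service's actual code: a dependency string such as "OtherClass" or "AnotherClass.method" is assumed to refer to the component whose ID ends in "::OtherClass" or "::AnotherClass", and edges are undirected and weighted by the number of references. The Leiden step itself is left out because it needs a third-party graph library.

```python
from collections import Counter


def build_dependency_edges(components: list[dict]) -> dict[tuple[str, str], int]:
    """Return undirected, weighted edges between component IDs.

    Hypothetical helper: resolves each dependency by the class name that
    follows '::' in a component ID, which is an assumption about the ID format.
    """
    by_name = {component["id"].split("::")[-1]: component["id"] for component in components}
    edges: Counter = Counter()
    for component in components:
        for dependency in component["dependencies"]:
            # "AnotherClass.method" points at the component "AnotherClass".
            target = by_name.get(dependency.split(".")[0])
            if target is None or target == component["id"]:
                continue  # skip external symbols and self-references
            edge = tuple(sorted((component["id"], target)))
            edges[edge] += 1
    return dict(edges)


example_components = [
    {"id": "a.py::A", "dependencies": ["B", "B.run", "Missing"]},
    {"id": "b.py::B", "dependencies": ["A"]},
]
example_edges = build_dependency_edges(example_components)
```

The resulting edge weights could then be handed to a Leiden implementation, which partitions the graph into the clusters returned by the API.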

Environment Variables

Variable      Default       Description
REPOS_PATH    /app/repos    Base path for repository storage
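
The sketch below shows one way a relative path from a scan request could be resolved against REPOS_PATH, falling back to the documented default. The function name and the optional environ parameter are hypothetical and exist only to keep the example testable.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

DEFAULT_REPOS_PATH = "/app/repos"  # documented default for REPOS_PATH


def resolve_repo_path(relative_path: str, environ: Optional[Mapping[str, str]] = None) -> Path:
    """Join a request path such as 'tenant-id/project-id/repo-name' onto the repos base path."""
    env = os.environ if environ is None else environ
    base = Path(env.get("REPOS_PATH", DEFAULT_REPOS_PATH))
    return base / relative_path
```

A production version would also want to reject paths that escape the base directory, for example ones containing "..".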

Integration with MILTON

The clustering service is called by the MILTON backend after repositories are cloned:

  1. The MILTON API clones repositories to the shared repos volume.
  2. The MILTON API calls the clustering service with the repository paths.
  3. The clustering service analyzes the code and returns clusters.
  4. MILTON uses the clusters for document generation and traceability.
