Clustering Feature
The Clustering Feature acts as the gateway to the external Python-based static analysis microservice. It analyzes repository source code using the Leiden community detection algorithm and Tree-sitter parsing to discover natural architectural modules and domain boundaries within raw codebases. The resulting clusters are used downstream to bootstrap the generation of SVD (Software Version Description) and architecture documents without manual module mapping.
Business Logic Intent
When a repository is imported into MILTON, standard text chunks and embeddings are insufficient to understand the overall architecture. The Clustering module discovers structural dependencies by parsing code into ASTs (Abstract Syntax Trees), mapping imports, and running graph-theory community detection to group highly cohesive files together.
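The parse, map-imports, and group steps can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the microservice's implementation: it uses Python's built-in `ast` module in place of Tree-sitter, and it groups files by connected components, a much cruder stand-in for Leiden community detection. The function names and sample repository are hypothetical.

```python
import ast


def extract_imports(source: str) -> set[str]:
    """Return the top-level module names imported by a piece of Python source."""
    imports: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            imports.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            imports.add(node.module.split(".")[0])
    return imports


def build_import_graph(files: dict[str, str]) -> dict[str, set[str]]:
    """Link files that import one another (undirected adjacency sets).

    `files` maps a module name to its source text. Imports that point
    outside the repository are ignored.
    """
    graph: dict[str, set[str]] = {name: set() for name in files}
    for name, source in files.items():
        for target in extract_imports(source):
            if target in files and target != name:
                graph[name].add(target)
                graph[target].add(name)
    return graph


def group_files(graph: dict[str, set[str]]) -> list[list[str]]:
    """Group files by connected component (simplified stand-in for Leiden)."""
    seen: set[str] = set()
    groups: list[list[str]] = []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        groups.append(sorted(component))
    return groups


# Hypothetical four-file repository with two unrelated areas.
repo = {
    "auth": "import tokens\n",
    "tokens": "import hashlib\n",
    "orders": "from billing import charge\n",
    "billing": "import decimal\n",
}
clusters = group_files(build_import_graph(repo))
```

A real community detection algorithm such as Leiden would additionally split a single connected graph into densely linked sub-groups, which connected components cannot do.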
To make these mathematical clusters human-readable, this feature implements an AI naming phase: it passes the clustered file paths and key function signatures to the configured project LLM (e.g., Writer or Coder preset) to synthesize a descriptive software engineering module name (e.g., “Authentication Service”, “Data Access Layer”). To save time and tokens, results are cached at the repository level.
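As a rough sketch of the naming phase, the snippet below formats file paths and function signatures into a single prompt and maps the reply back onto the clusters. The prompt wording, the JSON-array reply format, the dictionary keys, and the fallback label are all assumptions made for illustration; the actual prompt and response contract used with the project LLM are not described here.

```python
import json


def build_naming_prompt(clusters: list[dict]) -> str:
    """Format clustered file paths and signatures into one naming prompt."""
    lines = [
        "Suggest a short software engineering module name for each cluster.",
        "Reply with a JSON array of strings, one name per cluster, in order.",
        "",
    ]
    for index, cluster in enumerate(clusters, start=1):
        lines.append(f"Cluster {index}:")
        lines.extend(f"  file: {path}" for path in cluster["files"])
        lines.extend(f"  signature: {sig}" for sig in cluster["signatures"])
    return "\n".join(lines)


def apply_names(clusters: list[dict], llm_reply: str) -> list[dict]:
    """Attach names from the reply; use a generic label if the reply is unusable."""
    try:
        names = json.loads(llm_reply)
    except json.JSONDecodeError:
        names = []
    if not isinstance(names, list) or len(names) != len(clusters):
        # Hypothetical fallback: keep the clusters usable without AI names.
        names = [f"Cluster {i}" for i in range(1, len(clusters) + 1)]
    return [{**cluster, "name": str(name)} for cluster, name in zip(clusters, names)]


# Hypothetical cluster and a canned reply standing in for the LLM call.
sample = [
    {
        "files": ["src/auth/login.py", "src/auth/tokens.py"],
        "signatures": ["def login(user, password)", "def issue_token(user)"],
    }
]
prompt = build_naming_prompt(sample)
named = apply_names(sample, '["Authentication Service"]')
```

Validating the reply length before applying names keeps a malformed LLM response from silently mislabelling clusters.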
Architecture & Integration
This module is primarily an orchestration proxy; it does not perform the Tree-sitter parsing itself. Instead:
- It exposes FastEndpoints to the React client.
- It interacts with the IClusteringService HTTP client (MILTON.Infrastructure.Clustering), which calls the Python microservice (discovered via Aspire).
- It accesses shared repository source files via a shared volume mapped to {baseReposPath}/{tenantId}/{projectId}/{repoName}.
- It utilizes IAIService and IEncryptionService to securely inject AI-generated descriptions for the discovered clusters.
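The shared-volume layout above can be illustrated with a small path helper. This is a sketch only: the function name and the segment validation are assumptions added for illustration, and the actual service may resolve and validate paths differently.

```python
from pathlib import PurePosixPath


def resolve_repo_path(
    base_repos_path: str, tenant_id: str, project_id: str, repo_name: str
) -> PurePosixPath:
    """Compose {baseReposPath}/{tenantId}/{projectId}/{repoName}.

    Rejects empty segments, "." and "..", and embedded separators so a
    caller-supplied value cannot escape the tenant's directory.
    """
    for segment in (tenant_id, project_id, repo_name):
        if not segment or segment in {".", ".."} or "/" in segment or "\\" in segment:
            raise ValueError(f"unsafe path segment: {segment!r}")
    return PurePosixPath(base_repos_path) / tenant_id / project_id / repo_name


# Hypothetical values for illustration.
path = resolve_repo_path("/data/repos", "tenant-1", "project-9", "billing-api")
```

Because the tenant and project identifiers are part of the path, validating each segment is one way to keep tenants isolated on a shared volume.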
Mermaid Sequence Diagram
The following sequence illustrates the ScanProjectRepoEndpoint workflow, highlighting how mathematical clusters are enriched with AI-generated semantic names.
    sequenceDiagram
        autonumber
        actor Client
        participant API as ClusteringEndpoints
        participant DB as AppDbContext
        participant Python as Python Clustering Service
        participant AI as IAIService
        Client->>API: POST /projects/{ProjectId}/clustering/scan
        API->>DB: Get Project Config & Repo Metadata
        DB-->>API: ProjectConfig (includes cached ClusterSnapshotJson)
        alt Cache is valid & !ForceRescan
            API-->>Client: Return cached Cluster Results
        else Needs Scan
            API->>Python: POST /scan (tenantId/projectId/repoName)
            Note over Python: Runs Tree-sitter + Leiden algorithm
            Python-->>API: List of mathematical clusters (files + dependencies)
            alt UseLlmNaming == true && LLM Configured
                API->>DB: Read & Decrypt Project LLM Presets
                API->>AI: Generate names for clusters (paths + signatures)
                AI-->>API: ["Auth Module", "Database Access", ...]
                API->>API: Map AI names to clusters
            end
            API->>DB: Save updated ClusterSnapshotJson
            API-->>Client: Return named Cluster Results
        end
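The control flow in the diagram can be condensed into a short sketch. It is written in Python for illustration only; the real endpoint is part of the API project and its code is not shown here. The in-memory `store`, the injected `scan` and `name_clusters` callables, and the rule that any stored snapshot counts as a valid cache are assumptions.

```python
from typing import Callable, Optional


def scan_project_repo(
    store: dict,
    repo_key: str,
    scan: Callable[[str], list[dict]],
    name_clusters: Optional[Callable[[list[dict]], list[str]]] = None,
    force_rescan: bool = False,
    use_llm_naming: bool = True,
) -> list[dict]:
    """Return cached clusters, or scan, optionally name, and cache them."""
    cached = store.get(repo_key)
    if cached is not None and not force_rescan:
        return cached  # Cache hit: skip the scan and the LLM call.

    clusters = scan(repo_key)  # Stand-in for POST /scan on the Python service.

    if use_llm_naming and name_clusters is not None:
        names = name_clusters(clusters)
        if len(names) == len(clusters):
            clusters = [{**c, "name": n} for c, n in zip(clusters, names)]

    store[repo_key] = clusters  # Stand-in for saving ClusterSnapshotJson.
    return clusters


# Fakes standing in for the clustering service and the LLM.
calls: list[str] = []


def fake_scan(repo_key: str) -> list[dict]:
    calls.append(repo_key)
    return [{"id": 0, "files": ["auth/login.py", "auth/tokens.py"]}]


def fake_namer(clusters: list[dict]) -> list[str]:
    return ["Authentication Service" for _ in clusters]


store: dict = {}
first = scan_project_repo(store, "tenant/project/repo", fake_scan, fake_namer)
second = scan_project_repo(store, "tenant/project/repo", fake_scan, fake_namer)
```

In this sketch the second call is served from the cache, so the scan runs once; passing `force_rescan=True` bypasses the cached snapshot.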
Modules and Components
Endpoints
- ClusteringHealthEndpoint: Exposes the health status of the upstream Python microservice.
- ListRepositoriesEndpoint: Lists all repositories currently available on the shared volume for the tenant.
- ScanRepositoryEndpoint: Direct proxy to scan a specific repository by its internal volume path.
- ScanMultipleRepositoriesEndpoint: Batch operation for scanning multiple repositories simultaneously.
- ScanProjectRepoEndpoint: The primary business endpoint. Resolves project configurations, reads cache, triggers scans, drives the LLM naming orchestration, and stores results back into AppDbContext.
Domain Models & DTOs
- ScanProjectRepoRequest: Client request specifying project, repo URL, ignore lists, and AI naming flags.
- ScanProjectClusterResult: The aggregated result containing the mathematical cluster (ID, components) and the human-readable AI-assigned Name.
- RepositoryInputDto, ScanMultipleRepositoriesRequest: DTOs for multi-repo processing.
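To make the shape of these models concrete, here is a rough sketch of the request and result as Python dataclasses. The actual DTOs are not shown in this document, so every field name below is an assumption inferred from the descriptions above (and from the ForceRescan and UseLlmNaming flags in the sequence diagram), not the real contract.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class ScanProjectRepoRequest:
    """Hypothetical shape of the scan request; field names are assumed."""

    project_id: str
    repo_url: str
    ignore_patterns: list[str] = field(default_factory=list)
    use_llm_naming: bool = True
    force_rescan: bool = False


@dataclass
class ScanProjectClusterResult:
    """Hypothetical shape of one cluster; `name` stays None until AI naming runs."""

    cluster_id: int
    components: list[str]
    name: Optional[str] = None


# Example values are illustrative only.
request = ScanProjectRepoRequest(
    project_id="project-9",
    repo_url="https://example.com/org/billing-api.git",
    ignore_patterns=["tests/", "node_modules/"],
)
payload = json.dumps(asdict(request))

result = ScanProjectClusterResult(
    cluster_id=0,
    components=["src/auth/login.py", "src/auth/tokens.py"],
    name="Authentication Service",
)
```

Keeping the name optional reflects that a cluster is still valid output when LLM naming is disabled or not configured.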