Clustering Feature

The Clustering Feature acts as the gateway to the external Python-based static analysis microservice. It analyzes repository source code using the Leiden community detection algorithm and Tree-sitter parsing to discover natural architectural modules and domain boundaries within raw codebases. The resulting clusters are used downstream to bootstrap the generation of SVD (Software Version Description) and architecture documents without manual module mapping.

Business Logic Intent

When a repository is imported into MILTON, text chunks and embeddings alone do not capture the overall architecture. The Clustering module discovers structural dependencies by parsing code into ASTs (Abstract Syntax Trees), mapping imports between files, and running graph community detection (the Leiden algorithm) to group highly cohesive files together.
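
The parse, map-imports, and group steps above can be sketched in a few lines. This is only an illustration under loud assumptions: Python's standard-library `ast` module stands in for Tree-sitter, and plain connected components stand in for the Leiden algorithm (a real community detection pass would typically split large components into denser groups). The function names and the in-memory `sources` mapping are hypothetical and not part of the microservice.

```python
import ast
from collections import defaultdict


def import_edges(sources):
    """Return undirected (a, b) edges between modules linked by an import.

    `sources` maps a module name to its source text (hypothetical input shape).
    """
    edges = set()
    for name, code in sources.items():
        for node in ast.walk(ast.parse(code)):
            targets = []
            if isinstance(node, ast.Import):
                targets = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module:
                targets = [node.module]
            for target in targets:
                # Keep only imports that resolve to another file in the repo.
                if target in sources and target != name:
                    edges.add(tuple(sorted((name, target))))
    return edges


def group_files(sources):
    """Group modules by connected components (stand-in for Leiden)."""
    adjacency = defaultdict(set)
    for a, b in import_edges(sources):
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen, clusters = set(), []
    for start in sorted(sources):
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:
            current = stack.pop()
            component.append(current)
            for neighbor in sorted(adjacency[current]):
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        clusters.append(sorted(component))
    return clusters
```

Given four toy modules where `auth` imports `tokens` and `db` imports `models`, `group_files` would return two groups, one per dependency island.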

To make these mathematical clusters human-readable, this feature implements an AI naming phase: it passes the clustered file paths and key function signatures to the configured project LLM (e.g., Writer or Coder preset) to synthesize a descriptive software engineering module name (e.g., “Authentication Service”, “Data Access Layer”). To save time and tokens, results are cached per repository and reused unless a rescan is forced.
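
As a rough sketch of the naming phase, the helpers below render clusters into a prompt and parse the model's reply. The prompt wording, the `files` and `signatures` keys, the JSON-array reply format, and the fallback names are all assumptions made for illustration; the actual prompt and response handling in MILTON may differ.

```python
import json


def build_naming_prompt(clusters):
    """Render clustered paths and signatures into a single naming prompt."""
    lines = [
        "Name each cluster with a short software engineering module name.",
        "Reply with a JSON array of strings, one name per cluster, in order.",
        "",
    ]
    for index, cluster in enumerate(clusters, start=1):
        lines.append(f"Cluster {index}:")
        lines.extend(f"  file: {path}" for path in cluster["files"])
        lines.extend(f"  signature: {sig}" for sig in cluster.get("signatures", []))
    return "\n".join(lines)


def parse_names(reply, expected_count):
    """Parse the reply; fall back to generic names if it is malformed."""
    try:
        names = json.loads(reply)
    except json.JSONDecodeError:
        names = None
    valid = (
        isinstance(names, list)
        and len(names) == expected_count
        and all(isinstance(n, str) and n.strip() for n in names)
    )
    if not valid:
        # Hypothetical fallback: keep the scan usable without AI names.
        return [f"Cluster {i}" for i in range(1, expected_count + 1)]
    return [n.strip() for n in names]
```

Validating the reply length against the cluster count keeps a truncated or chatty model response from shifting names onto the wrong clusters.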

Architecture & Integration

This module is primarily an orchestration proxy. It does not perform the Tree-sitter parsing itself. Instead:

  1. It exposes FastEndpoints to the React client.
  2. It interacts with the IClusteringService HTTP client (MILTON.Infrastructure.Clustering) which calls the Python microservice (discovered via Aspire).
  3. It accesses shared repository source files via a shared volume mapped to {baseReposPath}/{tenantId}/{projectId}/{repoName}.
  4. It uses IEncryptionService to decrypt the project's stored LLM presets and IAIService to generate descriptive names for the discovered clusters.
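
The shared-volume layout in step 3 can be illustrated with a small path helper. This is a sketch only: the function name and the segment validation are assumptions added for illustration (rejecting separators and `..` is a common precaution when identifiers become path segments), not a description of what the module actually does.

```python
from pathlib import PurePosixPath


def repo_volume_path(base_repos_path, tenant_id, project_id, repo_name):
    """Build {baseReposPath}/{tenantId}/{projectId}/{repoName}."""
    for segment in (tenant_id, project_id, repo_name):
        # Assumed safeguard: a segment must not escape its directory level.
        if not segment or segment in (".", "..") or "/" in segment or "\\" in segment:
            raise ValueError(f"unsafe path segment: {segment!r}")
    return PurePosixPath(base_repos_path) / tenant_id / project_id / repo_name
```

For example, `repo_volume_path("/data/repos", "t1", "p42", "milton-api")` yields `/data/repos/t1/p42/milton-api`, where all four values are made up for the example.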

Mermaid Sequence Diagram

The following sequence illustrates the ScanProjectRepoEndpoint workflow, highlighting how mathematical clusters are enriched with AI-generated semantic names.

sequenceDiagram
    autonumber
    actor Client
    participant API as ClusteringEndpoints
    participant DB as AppDbContext
    participant Python as Python Clustering Service
    participant AI as IAIService

    Client->>API: POST /projects/{ProjectId}/clustering/scan
    API->>DB: Get Project Config & Repo Metadata
    DB-->>API: ProjectConfig (includes cached ClusterSnapshotJson)
    
    alt Cache is valid & !ForceRescan
        API-->>Client: Return cached Cluster Results
    else Needs Scan
        API->>Python: POST /scan (tenantId/projectId/repoName)
        Note over Python: Runs Tree-sitter + Leiden algorithm
        Python-->>API: List of mathematical clusters (files + dependencies)
        
        alt UseLlmNaming == true && LLM Configured
            API->>DB: Read & Decrypt Project LLM Presets
            API->>AI: Generate names for clusters (paths + signatures)
            AI-->>API: ["Auth Module", "Database Access", ...]
            API->>API: Map AI names to clusters
        end
        
        API->>DB: Save updated ClusterSnapshotJson
        API-->>Client: Return named Cluster Results
    end
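
The branching in the diagram can be condensed into a short sketch. It assumes a dict-like project config, treats any stored snapshot as a valid cache (the real validity check is not described here), and takes the scan and naming steps as injected callables; every name except `ClusterSnapshotJson`, `ForceRescan`, and `UseLlmNaming` is hypothetical.

```python
import json


def scan_project_repo(config, scan, name_clusters=None,
                      force_rescan=False, use_llm_naming=True):
    """Return cached clusters, or scan, optionally name, and cache them."""
    cached = config.get("ClusterSnapshotJson")
    if cached and not force_rescan:
        # Cache hit: skip both the Python service and the LLM call.
        return json.loads(cached)

    clusters = scan()  # stands in for POST /scan on the Python service

    if use_llm_naming and name_clusters is not None:
        # Map AI names onto clusters by position.
        for cluster, name in zip(clusters, name_clusters(clusters)):
            cluster["name"] = name

    config["ClusterSnapshotJson"] = json.dumps(clusters)
    return clusters
```

Passing `name_clusters=None` models the "LLM not configured" case: the scan still completes and is cached, just without semantic names.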

Modules and Components

Endpoints

  • ClusteringHealthEndpoint: Exposes the health status of the upstream Python microservice.
  • ListRepositoriesEndpoint: Lists all repositories currently available on the shared volume for the tenant.
  • ScanRepositoryEndpoint: Direct proxy to scan a specific repository by its internal volume path.
  • ScanMultipleRepositoriesEndpoint: Batch operation for scanning multiple repositories simultaneously.
  • ScanProjectRepoEndpoint: The primary business endpoint. Resolves project configurations, reads cache, triggers scans, drives the LLM naming orchestration, and stores results back into AppDbContext.

Domain Models & DTOs

  • ScanProjectRepoRequest: Client request specifying project, repo URL, ignore lists, and AI naming flags.
  • ScanProjectClusterResult: The aggregated result containing the mathematical cluster (ID, components) and the human-readable AI-assigned Name.
  • RepositoryInputDto, ScanMultipleRepositoriesRequest: DTOs for multi-repo processing.
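
To make the shape of these DTOs concrete, here is a hedged sketch of the request and result as Python dataclasses. The field names and defaults are guesses derived from the descriptions above and the `ForceRescan` / `UseLlmNaming` flags in the sequence diagram; the actual C# types may name or structure them differently.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class ScanProjectRepoRequest:
    """Hypothetical mirror of the client request."""
    project_id: str
    repo_url: str
    ignore_patterns: List[str] = field(default_factory=list)
    use_llm_naming: bool = True
    force_rescan: bool = False


@dataclass
class ScanProjectClusterResult:
    """Hypothetical mirror of one named cluster."""
    cluster_id: int
    components: List[str]
    name: Optional[str] = None  # stays None when AI naming is skipped


request = ScanProjectRepoRequest(
    project_id="p42",
    repo_url="https://example.com/org/repo.git",
    ignore_patterns=["node_modules", "bin"],
)
payload = asdict(request)  # plain dict, ready to serialize as JSON
```

Keeping `name` optional reflects that the mathematical cluster exists independently of the AI naming step.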
