Programmatic Understanding of Large Code Repositories for Machines

Large code repositories encode significantly more information than is directly expressed in comments or documentation. Architectural constraints, design trade-offs, and implicit contracts are often embedded in dependency structures, call graphs, and usage patterns rather than natural language. Human maintainers gradually reconstruct this mental model through code reviews, production incidents, and long-term exposure. Machines do not.

Most current approaches to repository question answering apply retrieval-augmented generation (RAG) directly over raw source files or lightly chunked code. While effective for localized factual queries, these approaches fail to recover global structure, design intent, and semantic relevance. As a result, retrieved context is often either incomplete or misleading.

We propose reframing repository analysis as a program comprehension problem, borrowing techniques from static analysis, graph theory, and software architecture recovery. Our goal is to transform a repository into a structured, machine-readable representation that exposes its theory of operation before any generative modeling is applied.

Program Comprehension and Architecture Recovery

Program comprehension has been studied extensively in software engineering research, with surveys highlighting the importance of structural and dependency-based representations for understanding large systems [Storey et al.]. Architecture recovery techniques aim to reconstruct high-level views from source code using dependency graphs, clustering, and pattern detection.

Tools such as Joern and LLVM-based analyses represent code as graphs combining syntax, control flow, and data flow. These representations have proven effective for vulnerability detection and static reasoning, but are rarely leveraged for documentation or RAG preprocessing.

Repository Question Answering and Deep Wiki Systems

Systems such as Deep Wiki and repository chat tools typically rely on:

  • heuristics for file importance (e.g., README proximity, size)
  • embedding-based retrieval over chunks
  • limited structural awareness

While effective for exploratory browsing, these systems struggle to surface architectural decisions or to produce stable, reproducible explanations. Our approach differs by treating relevance as a graph-derived quantity rather than an embedding-only signal.

Given a repository already cloned at a known path ($REPO_PATH), we seek to:

  1. Construct a structured semantic representation of the repository.
  2. Identify key files and symbols based on quantitative relevance metrics.
  3. Extract and propagate architectural decisions encoded in code structure.
  4. Identify public interfaces intended for external or user-level consumption.
  5. Generate deterministic, human-readable documentation suitable for downstream RAG and synthetic query generation.

We explicitly exclude runtime instrumentation and focus on static analysis augmented by limited LLM-based semantic interpretation.

The system operates as a pipeline:

Repository Files
→ Static Parsing
→ Symbol & Dependency Graphs
→ Relevance Scoring
→ Decision Extraction
→ Public Interface Identification
→ Wiki Generation

Each stage produces artifacts consumed by later stages, with strong typing enforced throughout.
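As a sketch of how these stages might compose, assuming hypothetical stage functions and a shared graph artifact (all names below are illustrative, not an existing tool's API):

```python
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class RepoGraph:
    """Hypothetical typed artifact threaded through the pipeline."""
    files: dict = field(default_factory=dict)    # path -> File record
    symbols: dict = field(default_factory=dict)  # qualified name -> Symbol record
    edges: list = field(default_factory=list)    # (src, dst, kind) tuples

def parse(repo_path: Path) -> RepoGraph: ...                    # static parsing
def score_relevance(g: RepoGraph) -> dict: ...                  # graph-derived importance
def extract_decisions(g: RepoGraph, scores: dict) -> list: ...  # LLM-assisted inference
def find_public_interfaces(g: RepoGraph) -> list: ...           # public surface detection
def generate_wiki(g, scores, decisions, interfaces) -> str: ... # deterministic docs

def run_pipeline(repo_path: Path) -> str:
    g = parse(repo_path)
    scores = score_relevance(g)
    decisions = extract_decisions(g, scores)
    interfaces = find_public_interfaces(g)
    return generate_wiki(g, scores, decisions, interfaces)
```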

We model the repository using the following abstractions:

  • File: path, language, type (code, docs, config), metrics
  • Symbol: functions, classes, methods, constants, modules
  • Edges: imports, calls, implements, tests
  • Decision: inferred design constraints with evidence and confidence

This representation forms a multi-layer graph combining file-level and symbol-level relationships.
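One minimal encoding of these abstractions, assuming Python dataclasses (field names are illustrative):

```python
from dataclasses import dataclass, field
from enum import Enum

class EdgeKind(Enum):
    IMPORTS = "imports"
    CALLS = "calls"
    IMPLEMENTS = "implements"
    TESTS = "tests"

@dataclass
class File:
    path: str
    language: str
    kind: str                                    # "code" | "docs" | "config"
    metrics: dict[str, float] = field(default_factory=dict)

@dataclass
class Symbol:
    qualified_name: str
    kind: str                                    # "function" | "class" | "method" | ...
    file: str                                    # defining File.path
    span: tuple[int, int]                        # (start_line, end_line)

@dataclass
class Edge:
    src: str                                     # file path or qualified symbol name
    dst: str
    kind: EdgeKind

@dataclass
class Decision:
    statement: str                               # inferred design constraint
    evidence: list[tuple[str, tuple[int, int]]]  # (file, span) pairs
    confidence: float                            # in [0, 1]
```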

We rely on Tree-sitter for multi-language parsing due to its:

  • uniform query interface
  • precise source span tracking
  • extensibility across ecosystems

This enables consistent extraction of imports, symbol definitions, and call sites across heterogeneous repositories.
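For example, imports and function definitions can be located by walking the concrete syntax tree. A minimal sketch using the py-tree-sitter bindings (exact constructor signatures vary slightly across binding versions):

```python
# pip install tree-sitter tree-sitter-python
from tree_sitter import Language, Parser
import tree_sitter_python as tspython

PY_LANGUAGE = Language(tspython.language())
parser = Parser(PY_LANGUAGE)  # older bindings use parser.set_language(PY_LANGUAGE)

source = b"import os\n\ndef main():\n    return os.getcwd()\n"
tree = parser.parse(source)

def walk(node, found):
    """Collect imports and symbol definitions with precise source spans."""
    if node.type in ("import_statement", "import_from_statement"):
        found.append(("import", node.text.decode(), node.start_point))
    elif node.type == "function_definition":
        name = node.child_by_field_name("name")
        found.append(("def", name.text.decode(), node.start_point))
    for child in node.children:
        walk(child, found)

found = []
walk(tree.root_node, found)
# e.g. [('import', 'import os', ...), ('def', 'main', ...)]
```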

Relevance Metrics for Key File Identification

“Key files” are often described informally, but informal heuristics do not scale. We define file importance as a latent variable derived from graph structure.

We define file importance $I(f)$ as:

$$I(f) = \alpha\, D_{\mathrm{in}}(f) + \beta\, C(f) + \gamma\, P(f) + \delta\, T(f)$$

Where:

  • $D_{\mathrm{in}}(f)$: in-degree in import/call graphs
  • $C(f)$: centrality (PageRank or eigenvector centrality)
  • $P(f)$: contribution to public surface area
  • $T(f)$: transitive fan-out depth

PageRank-style algorithms are particularly effective here, as they naturally weight files that serve as architectural hubs. This approach aligns with prior work using eigenvector centrality to identify key classes in large systems.
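A sketch of this scoring over a file-level import graph, using networkx's PageRank for the centrality term; the weights and the proxies for $P(f)$ and $T(f)$ below are illustrative assumptions:

```python
import networkx as nx

def file_importance(g: nx.DiGraph, public_surface: dict[str, float],
                    alpha=0.3, beta=0.4, gamma=0.2, delta=0.1) -> dict[str, float]:
    """I(f) = alpha*D_in(f) + beta*C(f) + gamma*P(f) + delta*T(f)."""
    n = max(g.number_of_nodes() - 1, 1)
    d_in = {f: g.in_degree(f) / n for f in g}          # normalized in-degree
    c = nx.pagerank(g)                                 # centrality term C(f)
    t = {f: len(nx.descendants(g, f)) / n for f in g}  # fan-out proxy for T(f)
    return {f: alpha * d_in[f] + beta * c[f]
               + gamma * public_surface.get(f, 0.0) + delta * t[f]
            for f in g}

# Toy import graph: an edge points from the importer to the imported file.
g = nx.DiGraph([("cli.py", "core.py"), ("api.py", "core.py"), ("core.py", "util.py")])
scores = file_importance(g, public_surface={"api.py": 1.0})
print(max(scores, key=scores.get))  # core.py: the architectural hub
```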

Comparison to Deep Wiki–style Heuristics

Unlike Deep Wiki systems, which rely primarily on textual salience or proximity heuristics, our method:

  • is language-agnostic
  • is deterministic
  • captures indirect architectural importance

We define a decision as a stable constraint or intent that shapes system structure, such as:

  • enforcement of policy at a specific layer
  • centralization of cross-cutting concerns
  • abstraction boundaries chosen to enable extensibility

Decisions are not explicitly labeled in code and must be inferred.

For each symbol, we collect:

  • local documentation
  • signature information
  • callers and callees
  • error handling paths

We then prompt an LLM with bounded, schema-constrained questions to infer candidate decisions, attaching evidence spans and confidence scores.
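A sketch of the schema constraint, assuming a plain JSON Schema handed to whatever structured-output mechanism the LLM client provides (prompt wording and field names are illustrative):

```python
import json

DECISION_SCHEMA = {
    "type": "object",
    "properties": {
        "decision": {"type": "string"},  # the inferred constraint, one sentence
        "evidence": {                    # source spans supporting the inference
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "file": {"type": "string"},
                    "start_line": {"type": "integer"},
                    "end_line": {"type": "integer"},
                },
                "required": ["file", "start_line", "end_line"],
            },
        },
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["decision", "evidence", "confidence"],
}

def decision_prompt(symbol_context: str) -> str:
    """Bounded question over one symbol's docs, signature, callers, error paths."""
    return (
        "Given the following symbol context, state at most one design decision "
        "this code appears to enforce, with evidence spans and a confidence score. "
        f"Answer as JSON matching this schema:\n{json.dumps(DECISION_SCHEMA)}\n\n"
        f"Context:\n{symbol_context}"
    )
```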

Decisions propagate through the dependency graph. If multiple downstream symbols consistently rely on an upstream constraint, the decision is elevated and merged via semantic clustering. This yields a decision graph layered atop the symbol graph.
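Propagation can be sketched as counting consistent downstream support before elevating a decision. Exact label matching stands in here for the semantic clustering step, and the support threshold is an illustrative assumption:

```python
import networkx as nx

def elevate_decisions(dep_graph: nx.DiGraph, decisions: dict, min_support: int = 3):
    """dep_graph: edge u -> v means u depends on v.
    decisions: symbol -> set of decision labels inferred locally.
    A decision on an upstream symbol is elevated when enough transitive
    dependents were independently annotated with the same label."""
    elevated = []
    for symbol, labels in decisions.items():
        dependents = nx.ancestors(dep_graph, symbol)  # symbols depending on it
        for label in labels:
            support = sum(1 for d in dependents if label in decisions.get(d, set()))
            if support >= min_support:
                elevated.append((symbol, label, support))
    return elevated
```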

We define a public interface as a symbol that:

  1. Is externally visible.
  2. Is used outside its defining module.
  3. Is invoked primarily by entrypoints or tests simulating user behavior.

A notable heuristic is that exported symbols called only by tests often correspond to user-facing APIs. This reframes interface discovery as a graph cut problem rather than a matter of naming heuristics.
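Under these criteria, a candidate filter over the symbol graph might look like the following sketch, where the "exported", "module", and edge "kind" attributes are assumptions about how the graph is labeled:

```python
import networkx as nx

def public_interface_candidates(sym_graph: nx.DiGraph) -> list[str]:
    """Return exported symbols used across module boundaries, or exercised
    only via tests and entrypoints simulating user behavior."""
    candidates = []
    for s, attrs in sym_graph.nodes(data=True):
        if not attrs.get("exported"):
            continue                                  # criterion 1: visibility
        callers = list(sym_graph.predecessors(s))
        if not callers:
            continue
        cross_module = any(                           # criterion 2: external use
            sym_graph.nodes[c].get("module") != attrs.get("module")
            for c in callers
        )
        test_only = all(                              # criterion 3: test-only exports
            sym_graph.edges[c, s].get("kind") == "tests" for c in callers
        )
        if cross_module or test_only:
            candidates.append(s)
    return candidates
```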

For each public interface, documentation is generated via a constrained breadth-first traversal:

  • nodes are visited once
  • traversal halts at low-centrality utilities
  • each node contributes a bounded explanation

This produces deterministic, hierarchical documentation suitable for both humans and machines.
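A minimal version of the bounded traversal; the centrality floor and the per-node explanation hook are illustrative:

```python
from collections import deque

def bounded_bfs(adjacency, start, centrality, floor=0.01, explain=str):
    """Visit each node once; expand only above the centrality floor;
    collect one bounded explanation per visited node. Sorting neighbors
    keeps the traversal, and thus the generated docs, deterministic."""
    visited, sections = {start}, []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        sections.append(explain(node))       # bounded per-node contribution
        if centrality.get(node, 0.0) < floor:
            continue                         # halt at low-centrality utilities
        for nbr in sorted(adjacency.get(node, ())):
            if nbr not in visited:
                visited.add(nbr)
                queue.append(nbr)
    return sections
```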

We additionally track structural metrics over the generated representation:

  • Graph coverage
  • Decision density
  • Public API fan-in
  • Propagation depth

To evaluate readiness for RAG and HyDE-style (Hypothetical Document Embeddings) generation, we propose the following measures (a computational sketch follows the list):

  • Context completeness: proportion of required symbols retrieved for a task
  • Decision recall: presence of relevant architectural constraints
  • Retrieval stability: variance across runs
  • HyDE alignment: similarity between generated hypothetical queries and ground-truth maintainer queries
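The first three of these can be computed directly once a task defines its required symbols and decisions; the sketch below abstracts retrieval as precomputed result sets:

```python
from statistics import pvariance

def context_completeness(retrieved: set[str], required: set[str]) -> float:
    """Proportion of the task's required symbols actually retrieved."""
    return len(retrieved & required) / len(required) if required else 1.0

def decision_recall(retrieved: set[str], relevant: set[str]) -> float:
    """Fraction of relevant architectural constraints present in the context."""
    return len(retrieved & relevant) / len(relevant) if relevant else 1.0

def retrieval_stability(runs: list[set[str]]) -> float:
    """Lower is more stable: variance of pairwise Jaccard overlap across runs."""
    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 1.0
    overlaps = [jaccard(runs[i], runs[j])
                for i in range(len(runs)) for j in range(i + 1, len(runs))]
    return pvariance(overlaps) if len(overlaps) > 1 else 0.0
```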

Once the repository is represented as a decision-annotated graph, it becomes possible to generate high-quality HyDE pairs:

  • queries derived from public interfaces and decisions
  • documents constructed from minimal, relevant subgraphs

This enables RAG systems to retrieve conceptually coherent context rather than arbitrary chunks.
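Concretely, each public interface and its attached decisions can seed (query, document) pairs; the query templates and helper functions below are hypothetical:

```python
def hyde_pairs(interfaces, decisions_for, subgraph_doc):
    """interfaces: public symbol names; decisions_for(s): decision statements
    attached to s; subgraph_doc(s): documentation rendered from the minimal
    relevant subgraph around s. Yields (hypothetical query, grounding doc)."""
    for s in interfaces:
        yield (f"How do I use {s}, and what constraints does it assume?",
               subgraph_doc(s))
        for d in decisions_for(s):
            yield (f"Why does the design around {s} enforce: {d}?",
                   subgraph_doc(s))
```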

We argue that repository understanding is a first-class systems problem that must precede RAG. By grounding relevance, decision extraction, and documentation in graph-based program comprehension, we enable machines to recover not just what code does, but why it is structured the way it is.

This approach shifts repository AI tooling from text retrieval toward genuine architectural understanding.