Programmatic Understanding of Large Code Repositories for Machines

Large code repositories encode significantly more information than is directly expressed in comments or documentation. Architectural constraints, design trade-offs, and implicit contracts are often embedded in dependency structures, call graphs, and usage patterns rather than natural language. Human maintainers gradually reconstruct this mental model through code reviews, production incidents, and long-term exposure. Machines do not.

Most current approaches to repository question answering apply retrieval-augmented generation (RAG) directly over raw source files or lightly chunked code. While effective for localized factual queries, these approaches fail to recover global structure, design intent, and semantic relevance. As a result, retrieved context is often either incomplete or misleading.

We propose reframing repository analysis as a program comprehension problem, borrowing techniques from static analysis, graph theory, and software architecture recovery. Our goal is to transform a repository into a structured, machine-readable representation that exposes its theory of operation before any generative modeling is applied.

Program Comprehension and Architecture Recovery

Program comprehension has been studied extensively in software engineering research, with surveys highlighting the importance of structural and dependency-based representations for understanding large systems [Storey et al.]. Architecture recovery techniques aim to reconstruct high-level views from source code using dependency graphs, clustering, and pattern detection.

Tools such as Joern and LLVM-based analyses represent code as graphs combining syntax, control flow, and data flow. These representations have proven effective for vulnerability detection and static reasoning, but are rarely leveraged for documentation or RAG preprocessing.

Repository Question Answering and Deep Wiki Systems

Systems such as Deep Wiki and repository chat tools typically rely on:

  • heuristics for file importance (e.g., README proximity, size)
  • embedding-based retrieval over chunks
  • limited structural awareness

While effective for exploratory browsing, these systems struggle to surface architectural decisions or to produce stable, reproducible explanations. Our approach differs by treating relevance as a graph-derived quantity rather than an embedding-only signal.

Given a repository already cloned at a known path ($REPO_PATH), we seek to:

  1. Construct a structured semantic representation of the repository.
  2. Identify key files and symbols based on quantitative relevance metrics.
  3. Extract and propagate architectural decisions encoded in code structure.
  4. Identify public interfaces intended for external or user-level consumption.
  5. Generate deterministic, human-readable documentation suitable for downstream RAG and synthetic query generation.

We explicitly exclude runtime instrumentation and focus on static analysis augmented by limited LLM-based semantic interpretation.

The system operates as a pipeline:

Repository Files
→ Static Parsing
→ Symbol & Dependency Graphs
→ Relevance Scoring
→ Decision Extraction
→ Public Interface Identification
→ Wiki Generation

Each stage produces artifacts consumed by later stages, with strong typing enforced throughout.
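As a sketch of how these stages might compose, assuming hypothetical stage functions and a shared graph artifact (all names below are illustrative, not an existing tool's API):

```python
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class RepoGraph:
    """Hypothetical typed artifact threaded through the pipeline."""
    files: dict = field(default_factory=dict)    # path -> File record
    symbols: dict = field(default_factory=dict)  # qualified name -> Symbol record
    edges: list = field(default_factory=list)    # (src, dst, kind) tuples

def parse(repo_path: Path) -> RepoGraph: ...                    # static parsing
def score_relevance(g: RepoGraph) -> dict: ...                  # graph-derived importance
def extract_decisions(g: RepoGraph, scores: dict) -> list: ...  # LLM-assisted inference
def find_public_interfaces(g: RepoGraph) -> list: ...           # public surface detection
def generate_wiki(g, scores, decisions, interfaces) -> str: ... # deterministic docs

def run_pipeline(repo_path: Path) -> str:
    g = parse(repo_path)
    scores = score_relevance(g)
    decisions = extract_decisions(g, scores)
    interfaces = find_public_interfaces(g)
    return generate_wiki(g, scores, decisions, interfaces)
```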

We model the repository using the following abstractions:

  • File: path, language, type (code, docs, config), metrics
  • Symbol: functions, classes, methods, constants, modules
  • Edges: imports, calls, implements, tests
  • Decision: inferred design constraints with evidence and confidence

This representation forms a multi-layer graph combining file-level and symbol-level relationships.
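One minimal encoding of these abstractions, assuming Python dataclasses (field names are illustrative):

```python
from dataclasses import dataclass, field
from enum import Enum

class EdgeKind(Enum):
    IMPORTS = "imports"
    CALLS = "calls"
    IMPLEMENTS = "implements"
    TESTS = "tests"

@dataclass
class File:
    path: str
    language: str
    kind: str                                    # "code" | "docs" | "config"
    metrics: dict[str, float] = field(default_factory=dict)

@dataclass
class Symbol:
    qualified_name: str
    kind: str                                    # "function" | "class" | "method" | ...
    file: str                                    # defining File.path
    span: tuple[int, int]                        # (start_line, end_line)

@dataclass
class Edge:
    src: str                                     # file path or qualified symbol name
    dst: str
    kind: EdgeKind

@dataclass
class Decision:
    statement: str                               # inferred design constraint
    evidence: list[tuple[str, tuple[int, int]]]  # (file, span) pairs
    confidence: float                            # in [0, 1]
```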

We rely on Tree-sitter for multi-language parsing due to its:

  • uniform query interface
  • precise source span tracking
  • extensibility across ecosystems

This enables consistent extraction of imports, symbol definitions, and call sites across heterogeneous repositories.
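For example, imports and function definitions can be located by walking the concrete syntax tree. A minimal sketch using the py-tree-sitter bindings (exact constructor signatures vary slightly across binding versions):

```python
# pip install tree-sitter tree-sitter-python
from tree_sitter import Language, Parser
import tree_sitter_python as tspython

PY_LANGUAGE = Language(tspython.language())
parser = Parser(PY_LANGUAGE)  # older bindings use parser.set_language(PY_LANGUAGE)

source = b"import os\n\ndef main():\n    return os.getcwd()\n"
tree = parser.parse(source)

def walk(node, found):
    """Collect imports and symbol definitions with precise source spans."""
    if node.type in ("import_statement", "import_from_statement"):
        found.append(("import", node.text.decode(), node.start_point))
    elif node.type == "function_definition":
        name = node.child_by_field_name("name")
        found.append(("def", name.text.decode(), node.start_point))
    for child in node.children:
        walk(child, found)

found = []
walk(tree.root_node, found)
# e.g. [('import', 'import os', ...), ('def', 'main', ...)]
```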

Relevance Metrics for Key File Identification

“Key files” are often described informally, but informal heuristics do not scale. We define file importance as a latent variable derived from graph structure.

We define file importance $I(f)$ as:

$$I(f) = \alpha\, D_{\mathrm{in}}(f) + \beta\, C(f) + \gamma\, P(f) + \delta\, T(f)$$

Where:

  • $D_{\mathrm{in}}(f)$: in-degree in import/call graphs
  • $C(f)$: centrality (PageRank or eigenvector centrality)
  • $P(f)$: contribution to public surface area
  • $T(f)$: transitive fan-out depth

PageRank-style algorithms are particularly effective here, as they naturally weight files that serve as architectural hubs. This approach aligns with prior work using eigenvector centrality to identify key classes in large systems.
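A sketch of this scoring over a file-level import graph, using networkx's PageRank for the centrality term; the weights and the proxies for $P(f)$ and $T(f)$ below are illustrative assumptions:

```python
import networkx as nx

def file_importance(g: nx.DiGraph, public_surface: dict[str, float],
                    alpha=0.3, beta=0.4, gamma=0.2, delta=0.1) -> dict[str, float]:
    """I(f) = alpha*D_in(f) + beta*C(f) + gamma*P(f) + delta*T(f)."""
    n = max(g.number_of_nodes() - 1, 1)
    d_in = {f: g.in_degree(f) / n for f in g}          # normalized in-degree
    c = nx.pagerank(g)                                 # centrality term C(f)
    t = {f: len(nx.descendants(g, f)) / n for f in g}  # fan-out proxy for T(f)
    return {f: alpha * d_in[f] + beta * c[f]
               + gamma * public_surface.get(f, 0.0) + delta * t[f]
            for f in g}

# Toy import graph: an edge points from the importer to the imported file.
g = nx.DiGraph([("cli.py", "core.py"), ("api.py", "core.py"), ("core.py", "util.py")])
scores = file_importance(g, public_surface={"api.py": 1.0})
print(max(scores, key=scores.get))  # core.py: the architectural hub
```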

Comparison to Deep Wiki–style Heuristics

Unlike Deep Wiki systems, which rely primarily on textual salience or proximity heuristics, our method:

  • is language-agnostic
  • is deterministic
  • captures indirect architectural importance

We define a decision as a stable constraint or intent that shapes system structure, such as:

  • enforcement of policy at a specific layer
  • centralization of cross-cutting concerns
  • abstraction boundaries chosen to enable extensibility

Decisions are not explicitly labeled in code and must be inferred.

For each symbol, we collect:

  • local documentation
  • signature information
  • callers and callees
  • error handling paths

We then prompt an LLM with bounded, schema-constrained questions to infer candidate decisions, attaching evidence spans and confidence scores.
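A sketch of the schema constraint, assuming a plain JSON Schema handed to whatever structured-output mechanism the LLM client provides (prompt wording and field names are illustrative):

```python
import json

DECISION_SCHEMA = {
    "type": "object",
    "properties": {
        "decision": {"type": "string"},  # the inferred constraint, one sentence
        "evidence": {                    # source spans supporting the inference
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "file": {"type": "string"},
                    "start_line": {"type": "integer"},
                    "end_line": {"type": "integer"},
                },
                "required": ["file", "start_line", "end_line"],
            },
        },
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["decision", "evidence", "confidence"],
}

def decision_prompt(symbol_context: str) -> str:
    """Bounded question over one symbol's docs, signature, callers, error paths."""
    return (
        "Given the following symbol context, state at most one design decision "
        "this code appears to enforce, with evidence spans and a confidence score. "
        f"Answer as JSON matching this schema:\n{json.dumps(DECISION_SCHEMA)}\n\n"
        f"Context:\n{symbol_context}"
    )
```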

Decisions propagate through the dependency graph. If multiple downstream symbols consistently rely on an upstream constraint, the decision is elevated and merged via semantic clustering. This yields a decision graph layered atop the symbol graph.
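Propagation can be sketched as counting consistent downstream support before elevating a decision. Exact label matching stands in here for the semantic clustering step, and the support threshold is an illustrative assumption:

```python
import networkx as nx

def elevate_decisions(dep_graph: nx.DiGraph, decisions: dict, min_support: int = 3):
    """dep_graph: edge u -> v means u depends on v.
    decisions: symbol -> set of decision labels inferred locally.
    A decision on an upstream symbol is elevated when enough transitive
    dependents were independently annotated with the same label."""
    elevated = []
    for symbol, labels in decisions.items():
        dependents = nx.ancestors(dep_graph, symbol)  # symbols depending on it
        for label in labels:
            support = sum(1 for d in dependents if label in decisions.get(d, set()))
            if support >= min_support:
                elevated.append((symbol, label, support))
    return elevated
```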

We define a public interface as a symbol that:

  1. Is externally visible.
  2. Is used outside its defining module.
  3. Is invoked primarily by entrypoints or tests simulating user behavior.

A notable heuristic is that exported symbols called only by tests often correspond to user-facing APIs. This reframes interface discovery as a graph cut problem rather than a matter of naming heuristics.
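Under these criteria, a candidate filter over the symbol graph might look like the following sketch, where the "exported", "module", and edge "kind" attributes are assumptions about how the graph is labeled:

```python
import networkx as nx

def public_interface_candidates(sym_graph: nx.DiGraph) -> list[str]:
    """Return exported symbols used across module boundaries, or exercised
    only via tests and entrypoints simulating user behavior."""
    candidates = []
    for s, attrs in sym_graph.nodes(data=True):
        if not attrs.get("exported"):
            continue                                  # criterion 1: visibility
        callers = list(sym_graph.predecessors(s))
        if not callers:
            continue
        cross_module = any(                           # criterion 2: external use
            sym_graph.nodes[c].get("module") != attrs.get("module")
            for c in callers
        )
        test_only = all(                              # criterion 3: test-only exports
            sym_graph.edges[c, s].get("kind") == "tests" for c in callers
        )
        if cross_module or test_only:
            candidates.append(s)
    return candidates
```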

For each public interface, documentation is generated via a constrained breadth-first traversal:

  • nodes are visited once
  • traversal halts at low-centrality utilities
  • each node contributes a bounded explanation

This produces deterministic, hierarchical documentation suitable for both humans and machines.
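A minimal version of the bounded traversal; the centrality floor and the per-node explanation hook are illustrative:

```python
from collections import deque

def bounded_bfs(adjacency, start, centrality, floor=0.01, explain=str):
    """Visit each node once; expand only above the centrality floor;
    collect one bounded explanation per visited node. Sorting neighbors
    keeps the traversal, and thus the generated docs, deterministic."""
    visited, sections = {start}, []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        sections.append(explain(node))       # bounded per-node contribution
        if centrality.get(node, 0.0) < floor:
            continue                         # halt at low-centrality utilities
        for nbr in sorted(adjacency.get(node, ())):
            if nbr not in visited:
                visited.add(nbr)
                queue.append(nbr)
    return sections
```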

We additionally track structural metrics over the generated representation:

  • Graph coverage
  • Decision density
  • Public API fan-in
  • Propagation depth

To evaluate readiness for RAG and HyDE-style (Hypothetical Document Embeddings) generation, we propose the following measures (a computational sketch follows the list):

  • Context completeness: proportion of required symbols retrieved for a task
  • Decision recall: presence of relevant architectural constraints
  • Retrieval stability: variance across runs
  • HyDE alignment: similarity between generated hypothetical queries and ground-truth maintainer queries
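The first three of these can be computed directly once a task defines its required symbols and decisions; the sketch below abstracts retrieval as precomputed result sets:

```python
from statistics import pvariance

def context_completeness(retrieved: set[str], required: set[str]) -> float:
    """Proportion of the task's required symbols actually retrieved."""
    return len(retrieved & required) / len(required) if required else 1.0

def decision_recall(retrieved: set[str], relevant: set[str]) -> float:
    """Fraction of relevant architectural constraints present in the context."""
    return len(retrieved & relevant) / len(relevant) if relevant else 1.0

def retrieval_stability(runs: list[set[str]]) -> float:
    """Lower is more stable: variance of pairwise Jaccard overlap across runs."""
    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 1.0
    overlaps = [jaccard(runs[i], runs[j])
                for i in range(len(runs)) for j in range(i + 1, len(runs))]
    return pvariance(overlaps) if len(overlaps) > 1 else 0.0
```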

Once the repository is represented as a decision-annotated graph, it becomes possible to generate high-quality HyDE pairs:

  • queries derived from public interfaces and decisions
  • documents constructed from minimal, relevant subgraphs

This enables RAG systems to retrieve conceptually coherent context rather than arbitrary chunks.
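Concretely, each public interface and its attached decisions can seed (query, document) pairs; the query templates and helper functions below are hypothetical:

```python
def hyde_pairs(interfaces, decisions_for, subgraph_doc):
    """interfaces: public symbol names; decisions_for(s): decision statements
    attached to s; subgraph_doc(s): documentation rendered from the minimal
    relevant subgraph around s. Yields (hypothetical query, grounding doc)."""
    for s in interfaces:
        yield (f"How do I use {s}, and what constraints does it assume?",
               subgraph_doc(s))
        for d in decisions_for(s):
            yield (f"Why does the design around {s} enforce: {d}?",
                   subgraph_doc(s))
```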

We argue that repository understanding is a first-class systems problem that must precede RAG. By grounding relevance, decision extraction, and documentation in graph-based program comprehension, we enable machines to recover not just what code does, but why it is structured the way it is.

This approach shifts repository AI tooling from text retrieval toward genuine architectural understanding.