Programmatic Understanding of Large Code Repositories for Machines
Introduction
Large code repositories encode significantly more information than is directly expressed in comments or documentation. Architectural constraints, design trade-offs, and implicit contracts are often embedded in dependency structures, call graphs, and usage patterns rather than natural language. Human maintainers gradually reconstruct this mental model through code reviews, production incidents, and long-term exposure. Machines do not.
Most current approaches to repository question answering apply RAG directly over raw source files or lightly chunked code. While effective for localized factual queries, these approaches fail to recover global structure, design intent, and semantic relevance. As a result, retrieved context is often either incomplete or misleading.
We propose reframing repository analysis as a program comprehension problem, borrowing techniques from static analysis, graph theory, and software architecture recovery. Our goal is to transform a repository into a structured, machine-readable representation that exposes its theory of operation before any generative modeling is applied.
Related Work
Program Comprehension and Architecture Recovery
Program comprehension has been studied extensively in software engineering research, with surveys highlighting the importance of structural and dependency-based representations for understanding large systems [Storey et al.]. Architecture recovery techniques aim to reconstruct high-level views from source code using dependency graphs, clustering, and pattern detection.
Code Property Graphs and Semantic Graphs
Tools such as Joern and LLVM-based analyses represent code as graphs combining syntax, control flow, and data flow. These representations have proven effective for vulnerability detection and static reasoning, but are rarely leveraged for documentation or RAG preprocessing.
Repository Question Answering and Deep Wiki Systems
Systems such as Deep Wiki and repository chat tools typically rely on:
- heuristics for file importance (e.g., README proximity, size)
- embedding-based retrieval over chunks
- limited structural awareness
While effective for exploratory browsing, these systems struggle to surface architectural decisions or to produce stable, reproducible explanations. Our approach differs by treating relevance as a graph-derived quantity rather than an embedding-only signal.
Problem Statement
Given a repository already cloned at a known path ($REPO_PATH), we seek to:
- Construct a structured semantic representation of the repository.
- Identify key files and symbols based on quantitative relevance metrics.
- Extract and propagate architectural decisions encoded in code structure.
- Identify public interfaces intended for external or user-level consumption.
- Generate deterministic, human-readable documentation suitable for downstream RAG and synthetic query generation.
We explicitly exclude runtime instrumentation and focus on static analysis augmented by limited LLM-based semantic interpretation.
System Overview
Section titled “System Overview”The system operates as a pipeline:
Repository Files → Static Parsing → Symbol & Dependency Graphs → Relevance Scoring → Decision Extraction → Public Interface Identification → Wiki GenerationEach stage produces artifacts consumed by later stages, with strong typing enforced throughout.
Semantic Representation
Core Entities
We model the repository using the following abstractions:
- File: path, language, type (code, docs, config), metrics
- Symbol: functions, classes, methods, constants, modules
- Edges: imports, calls, implements, tests
- Decision: inferred design constraints with evidence and confidence
This representation forms a multi-layer graph combining file-level and symbol-level relationships.
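A plausible encoding of these entities as typed records; the field names are our own reading of the abstractions above, not a schema fixed by this document:

```python
from dataclasses import dataclass, field
from enum import Enum

class FileType(Enum):
    CODE = "code"
    DOCS = "docs"
    CONFIG = "config"

@dataclass
class File:
    path: str
    language: str
    type: FileType
    metrics: dict[str, float] = field(default_factory=dict)

@dataclass
class Symbol:
    name: str
    kind: str               # "function" | "class" | "method" | "constant" | "module"
    file: str               # path of the defining file
    span: tuple[int, int]   # (start_line, end_line)

class EdgeKind(Enum):
    IMPORTS = "imports"
    CALLS = "calls"
    IMPLEMENTS = "implements"
    TESTS = "tests"

@dataclass
class Edge:
    src: str        # file path or qualified symbol name
    dst: str
    kind: EdgeKind  # file-level and symbol-level edges share one type here

@dataclass
class Decision:
    statement: str                                # inferred design constraint
    evidence: list[tuple[str, tuple[int, int]]]   # (file, span) pairs
    confidence: float                             # 0.0 - 1.0
```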
Parsing Strategy
We rely on Tree-sitter for multi-language parsing due to its:
- uniform query interface
- precise source span tracking
- extensibility across ecosystems
This enables consistent extraction of imports, symbol definitions, and call sites across heterogeneous repositories.
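As an illustration, here is a minimal extraction pass using the py-tree-sitter bindings with the tree_sitter_python grammar package. The binding API has shifted across versions; this sketch assumes a 0.22+ style setup:

```python
from tree_sitter import Language, Parser
import tree_sitter_python as tspython

PY_LANGUAGE = Language(tspython.language())
parser = Parser()
parser.language = PY_LANGUAGE   # older bindings use parser.set_language(...)

source = b"import os\n\ndef main():\n    os.getcwd()\n"
tree = parser.parse(source)

INTERESTING = {"import_statement", "import_from_statement",
               "function_definition", "class_definition", "call"}

def collect(node, found):
    # Walk the concrete syntax tree, recording imports, definitions, and
    # call sites together with their precise source spans.
    if node.type in INTERESTING:
        found.append((node.type, node.start_point, node.end_point,
                      node.text.decode("utf-8")))
    for child in node.children:
        collect(child, found)

found = []
collect(tree.root_node, found)
for kind, start, end, text in found:
    print(kind, start, end, text.splitlines()[0])
```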
Relevance Metrics for Key File Identification
Motivation
“Key files” are often described informally, but informal heuristics do not scale. We define file importance as a latent variable derived from graph structure.
Relevance Function
We define file importance as:

$$R(f) = \alpha \, d_{\text{in}}(f) + \beta \, c(f) + \gamma \, s(f) + \delta \, t(f)$$

Where:
- $d_{\text{in}}(f)$: in-degree in import/call graphs
- $c(f)$: centrality (PageRank or eigenvector centrality)
- $s(f)$: contribution to public surface area
- $t(f)$: transitive fan-out depth

and $\alpha, \beta, \gamma, \delta$ are tunable weights.
PageRank-style algorithms are particularly effective here, as they naturally weight files that serve as architectural hubs. This approach aligns with prior work using eigenvector centrality to identify key classes in large systems.
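For example, with networkx the in-degree and centrality terms of $R(f)$ can be computed directly on an import graph whose edges point from importer to imported file (the file names and weights below are made up):

```python
import networkx as nx

# Edges point importer -> imported, so heavily-imported hub files
# accumulate both in-degree and PageRank mass.
g = nx.DiGraph([
    ("cli.py", "core/engine.py"),
    ("api.py", "core/engine.py"),
    ("workers/batch.py", "core/engine.py"),
    ("core/engine.py", "util/io.py"),
])

c = nx.pagerank(g)           # centrality term c(f)
d_in = dict(g.in_degree())   # in-degree term d_in(f)

alpha, beta = 0.5, 10.0      # illustrative weights for the R(f) combination
scores = {f: alpha * d_in[f] + beta * c[f] for f in g.nodes}
print(max(scores, key=scores.get))   # -> "core/engine.py", the architectural hub
```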
Comparison to Deep Wiki–style Heuristics
Unlike Deep Wiki systems, which rely primarily on textual salience or proximity heuristics, our method:
- is language-agnostic
- is deterministic
- captures indirect architectural importance
Decision Extraction
Section titled “Decision Extraction”Definition of a Decision
We define a decision as a stable constraint or intent that shapes system structure, such as:
- enforcement of policy at a specific layer
- centralization of cross-cutting concerns
- abstraction boundaries chosen to enable extensibility
Decisions are not explicitly labeled in code and must be inferred.
Hybrid Extraction Pipeline
For each symbol, we collect:
- local documentation
- signature information
- callers and callees
- error handling paths
We then prompt an LLM with bounded, schema-constrained questions to infer candidate decisions, attaching evidence spans and confidence scores.
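One way to realize the bounded, schema-constrained prompting step, sketched here with pydantic v2 for validation; `call_llm` is a stand-in for whatever completion API is actually used:

```python
import json
from pydantic import BaseModel, Field

class CandidateDecision(BaseModel):
    statement: str = Field(description="Inferred design constraint or intent")
    evidence_spans: list[str]                  # e.g. "auth/middleware.py:40-72"
    confidence: float = Field(ge=0.0, le=1.0)

def infer_decisions(symbol_context: str, call_llm) -> list[CandidateDecision]:
    """Ask one bounded question per symbol; reject replies that fail the schema."""
    prompt = (
        "From the documentation, signature, callers/callees, and error-handling "
        "paths below, list any design decisions this symbol implies. Reply with "
        "a JSON array matching this schema:\n"
        f"{json.dumps(CandidateDecision.model_json_schema())}\n\n"
        f"{symbol_context}"
    )
    raw = call_llm(prompt)   # hypothetical: any text-completion client fits here
    return [CandidateDecision.model_validate(item) for item in json.loads(raw)]
```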
Decision Propagation
Decisions propagate through the dependency graph. If multiple downstream symbols consistently rely on an upstream constraint, the decision is elevated and merged via semantic clustering. This yields a decision graph layered atop the symbol graph.
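A support-counting reading of this rule is sketched below; the edge orientation, the support threshold, and the merge step (reduced here to exact-statement grouping rather than semantic clustering) are all illustrative simplifications:

```python
from collections import defaultdict
import networkx as nx

def elevate_decisions(dep_graph: nx.DiGraph,
                      local_decisions: dict[str, list[str]],
                      min_support: int = 3) -> dict[str, set[str]]:
    """Elevate a decision once enough distinct downstream symbols depend on
    the symbol carrying it. Edges run upstream -> downstream here, so
    nx.descendants() yields each symbol's transitive dependents."""
    supporters: dict[str, set[str]] = defaultdict(set)
    for sym, statements in local_decisions.items():
        dependents = nx.descendants(dep_graph, sym)
        for stmt in statements:
            supporters[stmt] |= dependents
    # Keep only decisions whose supporting dependents clear the threshold;
    # a real system would also merge near-duplicate statements semantically.
    return {stmt: deps for stmt, deps in supporters.items()
            if len(deps) >= min_support}
```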
Public Interface Identification
We define a public interface as a symbol that:
- Is externally visible.
- Is used outside its defining module.
- Is invoked primarily by entrypoints or tests simulating user behavior.
A notable heuristic is that exported symbols called only by tests often correspond to user-facing APIs. This reframes interface discovery as a graph-cut problem rather than an exercise in naming heuristics.
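The three conditions compose into a single predicate over the symbol graph. In this sketch, `module_of` and `is_exported` are assumed to come from earlier pipeline stages, and the entrypoint prefixes are illustrative:

```python
import networkx as nx

def is_public_interface(sym: str, call_graph: nx.DiGraph,
                        module_of, is_exported,
                        entry_prefixes: tuple[str, ...] = ("tests/", "cli/", "main")) -> bool:
    if not is_exported(sym):                          # 1. externally visible
        return False
    callers = list(call_graph.predecessors(sym))      # edges run caller -> callee
    if not any(module_of(c) != module_of(sym) for c in callers):
        return False                                  # 2. used outside its own module
    entry_like = sum(module_of(c).startswith(entry_prefixes) for c in callers)
    return entry_like >= len(callers) / 2             # 3. mostly entrypoint/test callers
```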
Wiki Generation as Graph Traversal
For each public interface, documentation is generated via a constrained breadth-first traversal:
- nodes are visited once
- traversal halts at low-centrality utilities
- each node contributes a bounded explanation
This produces deterministic, hierarchical documentation suitable for both humans and machines.
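A minimal sketch of that traversal; `explain` stands in for the per-node summarizer, and sorting successors is one way to keep the walk deterministic:

```python
from collections import deque
import networkx as nx

def document_interface(root: str, graph: nx.DiGraph,
                       centrality: dict[str, float],
                       explain, min_centrality: float = 0.01):
    """Constrained BFS from a public interface: each node is visited once,
    expansion halts at low-centrality utilities, and each node contributes
    one bounded explanation."""
    seen, pages = {root}, []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        pages.append((node, explain(node)))           # bounded per-node summary
        if centrality.get(node, 0.0) < min_centrality:
            continue                                  # do not expand past utilities
        for nxt in sorted(graph.successors(node)):    # sorted for determinism
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return pages                                      # hierarchical by BFS depth
```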
Evaluation and Metrics
Structural Metrics
- Graph coverage
- Decision density
- Public API fan-in
- Propagation depth
Context Capture Metrics for RAG
To evaluate readiness for RAG and HYDE-style generation, we propose:
- Context completeness: proportion of required symbols retrieved for a task
- Decision recall: presence of relevant architectural constraints
- Retrieval stability: variance across runs
- HYDE alignment: similarity between generated hypothetical queries and ground-truth maintainer queries
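The first three of these reduce to straightforward set arithmetic and variance once a gold set of required symbols and decisions exists per task; the function names below are ours:

```python
from statistics import pvariance

def context_completeness(required: set[str], retrieved: set[str]) -> float:
    """Proportion of the symbols a task needs that retrieval actually returned."""
    return len(required & retrieved) / len(required) if required else 1.0

def decision_recall(relevant: set[str], surfaced: set[str]) -> float:
    """Share of the relevant architectural constraints present in the context."""
    return len(relevant & surfaced) / len(relevant) if relevant else 1.0

def retrieval_stability(overlaps: list[float]) -> float:
    """Variance of pairwise retrieval overlap (e.g. Jaccard) across repeated
    runs; lower means more stable."""
    return pvariance(overlaps)
```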
Applications to HYDE and RAG
Once the repository is represented as a decision-annotated graph, it becomes possible to generate high-quality HYDE pairs:
- queries derived from public interfaces and decisions
- documents constructed from minimal, relevant subgraphs
This enables RAG systems to retrieve conceptually coherent context rather than arbitrary chunks.
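Concretely, pair generation can be driven directly off the graph; the query template and the `minimal_subgraph` and `render` helpers are hypothetical stand-ins:

```python
def hyde_pairs(public_interfaces, decisions_for, minimal_subgraph, render):
    """Emit (hypothetical query, supporting document) pairs: one query per
    (interface, decision) combination, with the document rendered from the
    smallest subgraph that justifies the decision."""
    pairs = []
    for api in public_interfaces:
        for decision in decisions_for(api):
            query = f"Why does {api} enforce: {decision}?"   # illustrative template
            document = render(minimal_subgraph(api, decision))
            pairs.append((query, document))
    return pairs
```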
Conclusion
We argue that repository understanding is a first-class systems problem that must precede RAG. By grounding relevance, decision extraction, and documentation in graph-based program comprehension, we enable machines to recover not just what code does, but why it is structured the way it is.
This approach shifts repository AI tooling from text retrieval toward genuine architectural understanding.