This guide demonstrates how to use Codegen to generate high-quality training data for large language models (LLMs) by extracting function implementations along with their dependencies and usages. The approach is similar in spirit to word2vec or node2vec: given the context in which a function appears, a model learns to predict the function's implementation.

View the full code in our examples repository
This example works with both Python and TypeScript repositories without modification

Overview

The process involves three main steps:

  1. Finding all functions in the codebase
  2. Extracting their implementations, dependencies, and usages
  3. Generating structured training data

Let’s walk through each step using Codegen.

Step 1: Finding Functions and Their Context

For each function we perform a "graph expansion": we grab the function's own source, plus the full source of every one of its dependencies and usages.

See dependencies and usages to learn more about navigating the code graph

First, let’s import the types we need from Codegen:

import codegen
from codegen import Codebase
from codegen.sdk.core.external_module import ExternalModule
from codegen.sdk.core.import_resolution import Import
from codegen.sdk.core.symbol import Symbol

Here’s how we get the full context for each function:

def get_function_context(function) -> dict:
    """Get the implementation, dependencies, and usages of a function."""
    context = {
        "implementation": {"source": function.source, "filepath": function.filepath},
        "dependencies": [],
        "usages": [],
    }

    # Add dependencies
    for dep in function.dependencies:
        # Hop through imports to find the root symbol source
        if isinstance(dep, Import):
            dep = hop_through_imports(dep)

        context["dependencies"].append({"source": dep.source, "filepath": dep.filepath})

    # Add usages
    for usage in function.usages:
        context["usages"].append({
            "source": usage.usage_symbol.source,
            "filepath": usage.usage_symbol.filepath,
        })

    return context

Notice how we use hop_through_imports to resolve dependencies. When working with imports, symbols can be re-exported multiple times. For example, a helper function might be imported and re-exported through several files before being used. We need to follow this chain to find the actual implementation:

def hop_through_imports(imp: Import) -> Symbol | ExternalModule:
    """Finds the root symbol for an import."""
    if isinstance(imp.imported_symbol, Import):
        return hop_through_imports(imp.imported_symbol)
    return imp.imported_symbol
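
As a concrete illustration, suppose a helper is re-exported through a package __init__.py (the file and symbol names below are hypothetical):

# Hypothetical re-export chain:
#
#   src/utils/strings.py     ->  def slugify(text: str) -> str: ...
#   src/utils/__init__.py    ->  from .strings import slugify
#   src/api.py               ->  from utils import slugify
#
# A function in src/api.py that calls slugify initially depends on the Import
# in src/utils/__init__.py. hop_through_imports keeps following imported_symbol
# until it reaches the Symbol for the definition in src/utils/strings.py
# (or an ExternalModule if the chain leaves the repository).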

Together, get_function_context and hop_through_imports produce a structured representation of each function's context:

{
  "implementation": {
    "source": "def process_data(input: str) -> dict: ...",
    "filepath": "src/data_processor.py"
  },
  "dependencies": [
    {
      "source": "def validate_input(data: str) -> bool: ...",
      "filepath": "src/validators.py"
    }
  ],
  "usages": [
    {
      "source": "result = process_data(user_input)",
      "filepath": "src/api.py"
    }
  ]
}
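
Before processing an entire repository, it can help to sanity-check the extraction on a single function. Here is a minimal sketch (it assumes the imports and helpers defined above; the repository and printed fields are just illustrative):

codebase = Codebase.from_repo("fastapi/fastapi")

# Inspect the extracted context for one function
fn = codebase.functions[0]
context = get_function_context(fn)

print(f"Function: {fn.name} ({context['implementation']['filepath']})")
print(f"Dependencies: {len(context['dependencies'])}")
print(f"Usages: {len(context['usages'])}")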

Step 2: Processing the Codebase

Next, we process all functions in the codebase to generate our training data:

def run(codebase: Codebase):
    """Generate training data using a node2vec-like approach for code embeddings."""
    # Track all function contexts
    training_data = {
        "functions": [],
        "metadata": {
            "total_functions": len(codebase.functions),
            "total_processed": 0,
            "avg_dependencies": 0,
            "avg_usages": 0,
        },
    }

    # Process each function in the codebase
    for function in codebase.functions:
        # Skip if function is too small
        if len(function.source.split("\n")) < 2:
            continue

        # Get function context
        context = get_function_context(function)

        # Only keep functions with enough context
        if len(context["dependencies"]) + len(context["usages"]) > 0:
            training_data["functions"].append(context)

    # Update metadata
    training_data["metadata"]["total_processed"] = len(training_data["functions"])
    if training_data["functions"]:
        training_data["metadata"]["avg_dependencies"] = sum(
            len(f["dependencies"]) for f in training_data["functions"]
        ) / len(training_data["functions"])
        training_data["metadata"]["avg_usages"] = sum(
            len(f["usages"]) for f in training_data["functions"]
        ) / len(training_data["functions"])

    return training_data

Step 3: Running the Generator

Finally, we can run our training data generator on any codebase.

See parsing codebases to learn more

import json

if __name__ == "__main__":
    print("Initializing codebase...")
    codebase = Codebase.from_repo("fastapi/fastapi")

    print("Generating training data...")
    training_data = run(codebase)

    print("Saving training data...")
    with open("training_data.json", "w") as f:
        json.dump(training_data, f, indent=2)
    print("Training data saved to training_data.json")

This will:

  1. Load the target codebase
  2. Process all functions
  3. Save the structured training data to a JSON file

You can use any Git repository as your source codebase by passing the repository's full name (owner/repo) to Codebase.from_repo(…).
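
For example, to target a different open-source project (the repository below is chosen purely for illustration):

codebase = Codebase.from_repo("pallets/flask")
training_data = run(codebase)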

Using the Training Data

The generated data can be used to train LLMs in several ways:

  1. Masked Function Prediction: Hide a function’s implementation and predict it from dependencies and usages
  2. Code Embeddings: Generate embeddings that capture semantic relationships between functions
  3. Dependency Prediction: Learn to predict which functions are likely to be dependencies
  4. Usage Pattern Learning: Train models to understand common usage patterns

For example, to create a masked prediction task:

def create_training_example(function_data):
    """Create a masked prediction example from function data."""
    return {
        "context": {
            "dependencies": function_data["dependencies"],
            "usages": function_data["usages"]
        },
        "target": function_data["implementation"]
    }

# Create training examples
examples = [create_training_example(f) for f in training_data["functions"]]
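
The exact record format from here depends on your training framework. As one illustrative option (the prompt template and output file name are assumptions, not part of Codegen), you could flatten each example into a prompt/completion pair and write the results as JSONL:

import json

def to_prompt_completion(example: dict) -> dict:
    """Flatten a masked-prediction example into a prompt/completion pair (illustrative format)."""
    dependencies = "\n\n".join(d["source"] for d in example["context"]["dependencies"])
    usages = "\n\n".join(u["source"] for u in example["context"]["usages"])
    prompt = (
        "Dependencies:\n" + dependencies + "\n\n"
        "Usages:\n" + usages + "\n\n"
        "Implement the function that is used above."
    )
    return {"prompt": prompt, "completion": example["target"]["source"]}

# One JSON record per line, ready for a fine-tuning pipeline
with open("masked_prediction.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(to_prompt_completion(ex)) + "\n")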
