Semantic Code Search
Codegen provides semantic code search capabilities using embeddings. This allows you to search codebases using natural language queries and find semantically related code, even when the exact terms aren’t present.
Basic Usage
Here’s how to create and use a semantic code search index:
Searching Code
Once you have an index, you can perform semantic searches:
The search uses cosine similarity between embeddings to find the most semantically related files, regardless of exact keyword matches.
Available Indices
Codegen provides two types of semantic indices:
FileIndex
The FileIndex operates at the file level:
- Indexes entire files, splitting large files into chunks
- Best for finding relevant files or modules
- Simpler and faster to create/update
SymbolIndex (Experimental)
The SymbolIndex operates at the symbol level:
- Indexes individual functions, classes, and methods
- Better for finding specific code elements
- More granular search results
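To make the difference in granularity concrete, here is a minimal sketch of what file-level and symbol-level units could look like for a small Python file. It uses Python's standard `ast` module and is not Codegen's implementation; the helper names `file_level_units` and `symbol_level_units` and the sample file `auth.py` are hypothetical.

```python
import ast

# Hypothetical sample file used only for illustration
SOURCE = '''
def login(user, password):
    return check(user, password)

class Session:
    def refresh(self):
        pass
'''

def file_level_units(path, source):
    """One unit per file (chunking of large files is omitted here)."""
    return [(path, source)]

def symbol_level_units(path, source):
    """One unit per function, class, or method found in the file."""
    units = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Label each unit with its file and symbol name, and keep its source text
            label = f"{path}::{node.name}"
            units.append((label, ast.get_source_segment(source, node)))
    return units

file_units = file_level_units("auth.py", SOURCE)
symbol_units = symbol_level_units("auth.py", SOURCE)
```

A file-level index would embed the single unit in `file_units`, while a symbol-level index would embed each entry in `symbol_units` separately, which is why its results can point at a specific function or class.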
How It Works
The semantic indices:
- Process code at either file or symbol level
- Split large content into chunks that fit within token limits
- Use OpenAI’s text-embedding-3-small model to create embeddings
- Store embeddings efficiently for similarity search
- Support incremental updates when code changes
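The chunking step can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `estimate_tokens` is a crude stand-in that counts whitespace-separated words, where a real implementation would use the embedding model's tokenizer, and `chunk_lines` is a hypothetical helper name.

```python
def estimate_tokens(text):
    # Assumption: approximate the token count by counting words.
    return len(text.split())

def chunk_lines(text, max_tokens):
    """Group consecutive lines into chunks that stay within max_tokens.

    A single line that exceeds the budget on its own is kept as one
    oversized chunk rather than being split mid-line.
    """
    chunks, current, used = [], [], 0
    for line in text.splitlines(keepends=True):
        cost = estimate_tokens(line)
        # Start a new chunk when adding this line would exceed the budget
        if current and used + cost > max_tokens:
            chunks.append("".join(current))
            current, used = [], 0
        current.append(line)
        used += cost
    if current:
        chunks.append("".join(current))
    return chunks

text = "a b c\nd e f\ng h\n"
chunks = chunk_lines(text, max_tokens=5)
```

Splitting on line boundaries keeps each chunk readable as code, and joining the chunks back together reproduces the original text exactly.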
When searching:
- Your query is converted to an embedding
- Cosine similarity is computed with all stored embeddings
- The most similar items are returned with their scores
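The search steps above can be sketched in plain Python. The vectors below are toy 3-dimensional values standing in for real embeddings, the query is assumed to be embedded already, and `top_k` is a hypothetical helper rather than part of Codegen's API.

```python
import heapq
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)

def top_k(query_embedding, stored, k=5):
    """Score every stored embedding and return the k best as (name, score)."""
    scored = (
        (cosine_similarity(query_embedding, vec), name)
        for name, vec in stored.items()
    )
    best = heapq.nlargest(k, scored)
    return [(name, score) for score, name in best]

# Toy data: file names mapped to made-up embeddings
stored = {
    "auth/login.py": [0.9, 0.1, 0.0],
    "db/models.py": [0.1, 0.9, 0.1],
    "utils/strings.py": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]  # Stand-in for an embedded natural language query
results = top_k(query, stored, k=2)
```

Here `results` lists `auth/login.py` first and `db/models.py` second, because the query vector points most nearly in the same direction as the first file's vector.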
Creating embeddings requires an OpenAI API key with access to the embeddings endpoint.
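For reference, a request to the embeddings endpoint with the official `openai` Python package might look like the sketch below. The wrapper `embed_texts` is a hypothetical helper, not something Codegen exposes, and the client is passed in so the function can be exercised without network access.

```python
def embed_texts(client, texts, model="text-embedding-3-small"):
    """Return one embedding (a list of floats) per input text."""
    response = client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in response.data]

# Usage (needs `pip install openai` and a valid key; not executed here):
#   from openai import OpenAI
#   client = OpenAI()  # reads OPENAI_API_KEY from the environment
#   vectors = embed_texts(client, ["def login(user): ..."])
```

If the key is missing or lacks access to the embeddings endpoint, the client raises an error when it is constructed or when the request is made, so index creation cannot proceed.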
Example Searches
Here are some example semantic searches:
The semantic search can understand concepts and return relevant results even when the exact terms aren’t present in the code.