
Bito Inc. (c) 2025


Embeddings


Last updated 8 months ago


Bito leverages the power of embeddings to understand your entire codebase. But WTF are these embeddings, and how do they help Bito understand your code?

If you are curious to know, this guide is for you!

What is an Embedding?

Embeddings, at their essence, are like magic translators. They convert data—whether words, images, or, in Bito's case, code—into vectors in a dense numerical space. These vectors encapsulate meaning or semantics. Basically, these vectors help computers understand and work with data more efficiently.

Think of an embedding as a vector (a list) of floating-point numbers. If two vectors are close together, the data they represent is similar. If they're far apart, the data is different. Simple as that!

A vector embedding looks something like this: [0.02362240, -0.01716885, 0.00493248, ..., 0.01665339]
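To make "close" and "far apart" concrete, here is a minimal sketch of two common ways to compare vectors: cosine similarity and Euclidean distance. The vectors `v1`, `v2`, and `v3` are made up for illustration and are not real embeddings, and nothing here is meant to describe which measure Bito itself uses.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors (0.0 means identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy vectors, invented for this example only.
v1 = [1.0, 0.0]
v2 = [0.0, 1.0]
v3 = [2.0, 0.0]

print(cosine_similarity(v1, v3))             # 1.0 (same direction)
print(cosine_similarity(v1, v2))             # 0.0 (perpendicular)
print(round(euclidean_distance(v1, v2), 3))  # 1.414
```

Cosine similarity ignores vector length and only looks at direction, while Euclidean distance is sensitive to both. Either can serve as the notion of "closeness" described above.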

Why Embeddings?

In this section, we'll explore the most common and impactful ways embeddings are used in everyday tech and applications.

Word Similarity & Semantics: Word embeddings, like Word2Vec, map words to vectors such that semantically similar words are closer in the vector space. This allows algorithms to find synonyms and other related words based on their vector representations.

Sentiment Analysis: By converting text into embeddings, machine learning models can be trained to detect and classify the sentiment of a text, such as determining if a product review is positive or negative.

Recommendation Systems: Embeddings can represent items (like movies, books, or products) and users. By comparing these embeddings, recommendation systems can suggest items similar to a user's preferences. For example, by converting audio or video data into embeddings, systems can recommend content based on similarity in the embedded space, leading to personalized user recommendations.

Document Clustering & Categorization: Text documents can be turned into embeddings using models like Doc2Vec. These embeddings can then be used to cluster or categorize documents based on their content.

Translation & Language Models: Models like BERT and GPT use embeddings to understand the context within sentences. This contextual understanding aids in tasks like translation and text generation.

Image Recognition: Images can be converted into embeddings using convolutional neural networks (CNNs). These embeddings can then be used to recognize and classify objects within the images.

Anomaly Detection: By converting data points into embeddings, algorithms can identify outliers or anomalies by measuring the distance between data points in the embedded space.

Chatbots & Virtual Assistants: Conversational models turn user inputs into embeddings to understand intent and context, enabling more natural and relevant responses.

Search Engines: Text queries can be converted into embeddings, which are then used to find relevant documents or information in a database by comparing embeddings.
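As one illustration of the use cases above, here is a rough sketch of the anomaly detection idea: measure how far each point sits from the center of the data and flag the ones that are unusually far away. The points, the centroid approach, and the "twice the median distance" threshold are all assumptions chosen for this toy example, not a recommended production method.

```python
import math
import statistics

# Toy 2-D "embeddings", invented for illustration. The last one is the odd one out.
points = [
    [1.0, 1.0],
    [1.1, 0.9],
    [0.9, 1.1],
    [1.0, 1.05],
    [5.0, 5.0],
]

# Centroid: the average of all points, one coordinate at a time.
centroid = [statistics.mean(coords) for coords in zip(*points)]

# Distance of every point from the centroid.
distances = [math.dist(p, centroid) for p in points]

# Hypothetical rule: anything more than 2x the median distance is an outlier.
threshold = 2 * statistics.median(distances)
outliers = [i for i, d in enumerate(distances) if d > threshold]

print(outliers)  # [4]
```

Real systems work with far more dimensions and more robust statistics, but the core step is the same: distance in the embedded space stands in for "how unusual is this data point?".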

Let’s look at an example

Suppose you have two functions in your codebase:

Function # 1:

def add(x, y):
    return x + y

Function # 2:

def subtract(x, y):
    return x - y

Using embeddings, Bito might convert these functions into two vectors. Because these functions perform different operations, their embeddings would be at a certain distance apart. Now, if you had another function that also performed addition but with a slight variation, its embedding would be closer to the add function than the subtract function.

Let's oversimplify and imagine these embeddings visually:

Embedding for Function # 1 (add):

[0.9, 0.2, 0.1]

Embedding for Function # 2 (subtract):

[0.2, 0.9, 0.1]

Notice the numbers? The first positions in these lists are quite different: 0.9 for addition and 0.2 for subtraction. This difference signifies the varied operations these functions perform.

Now, let's add a twist. Suppose you wrote another addition function, but with an extra print statement:

Function # 3:

def add_and_print(x, y):
    result = x + y
    print(result)
    return result

Bito might give an embedding like:

[0.85, 0.3, 0.15]

If you compare, this new list is more similar to the add function's list than the subtract one, especially in the first position. But it's not exactly the same as the pure add function because of the added print operation.

This distance or difference between lists is what Bito uses to determine how similar functions or chunks of code are to one another. So, when you ask Bito about a piece of code, it quickly checks these number lists, finds the closest match, and guides you accordingly!
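The comparison above can be checked with a few lines of Python. This sketch uses the three oversimplified vectors from this example and plain Euclidean distance. Which distance measure Bito actually uses is not stated here, so treat that choice as an assumption.

```python
import math

# The oversimplified embeddings from the example above.
embeddings = {
    "add": [0.9, 0.2, 0.1],
    "subtract": [0.2, 0.9, 0.1],
}
add_and_print = [0.85, 0.3, 0.15]

# Euclidean distance from the new function to each existing one.
distances = {
    name: math.dist(vector, add_and_print)
    for name, vector in embeddings.items()
}
closest = min(distances, key=distances.get)

for name, distance in distances.items():
    print(f"{name}: {distance:.3f}")
print("closest:", closest)  # closest: add
```

The distance to `add` comes out at roughly 0.12, while the distance to `subtract` is roughly 0.89, which matches the intuition that `add_and_print` is "mostly addition".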

How Bito Uses Embeddings

When you ask Bito a question or seek assistance with a certain piece of code, Bito doesn't read the code the way we do. Instead, it refers to these vector representations (embeddings). By doing so, it can quickly find related pieces of code in your repository or understand the essence of your query.

For example, if you ask Bito, "Where did I implement addition logic?", Bito will convert your question into an embedding and then look for the most related (or closest) embeddings in its index. Since it already knows the add function's embedding represents addition, it can swiftly point you to that function.
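Here is a minimal sketch of that lookup flow: embed the code once to build an index, embed the question, then rank the indexed items by similarity. Everything in it is hypothetical. `toy_embed` is a crude keyword counter standing in for a real embedding model, and `CUES`, `index`, and `search` are illustrative names, not Bito's actual implementation.

```python
import math

# Hypothetical keyword cues, one group per vector dimension (illustration only).
CUES = [("add", "+"), ("subtract", "-"), ("print",)]

def toy_embed(text):
    """Stand-in for a real embedding model: counts cue occurrences per dimension."""
    lowered = text.lower()
    return [float(sum(lowered.count(cue) for cue in group)) for group in CUES]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

snippets = {
    "add": "def add(x, y):\n    return x + y",
    "subtract": "def subtract(x, y):\n    return x - y",
    "add_and_print": (
        "def add_and_print(x, y):\n"
        "    result = x + y\n"
        "    print(result)\n"
        "    return result"
    ),
}

# Build the index once: one vector per code snippet.
index = {name: toy_embed(source) for name, source in snippets.items()}

def search(question, top_k=2):
    """Embed the question and return the names of the closest snippets."""
    query = toy_embed(question)
    ranked = sorted(
        index,
        key=lambda name: cosine_similarity(query, index[name]),
        reverse=True,
    )
    return ranked[:top_k]

print(search("Where did I implement addition logic?"))  # ['add', 'add_and_print']
```

A real model captures meaning rather than counting keywords, but the shape of the process is the same: the question and the code live in the same vector space, and the nearest vectors win.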

Models for Generating Embeddings

When we talk about turning data into these nifty lists of numbers (embeddings), several models and techniques come into play. These models have been designed to extract meaningful patterns from vast amounts of data and represent them as compact vectors. Here are some of the standout models:

Word2Vec: One of the pioneers in the world of embeddings, this model, developed by researchers at Google, primarily focuses on words. Given a large amount of text, Word2Vec can produce a vector for each word, capturing its context and meaning.

Doc2Vec: An extension of Word2Vec, this model is designed to represent entire documents or paragraphs as vectors, making it suitable for larger chunks of text.

GloVe (Global Vectors for Word Representation): Developed at Stanford, GloVe is another method to generate word embeddings. It stands out because it combines global word co-occurrence statistics from the whole corpus with the local context information that models like Word2Vec rely on.

BERT (Bidirectional Encoder Representations from Transformers): A more recent and advanced model from Google, BERT captures context from both left and right (hence, bidirectional) of a word in all layers. This deep understanding allows for more accurate embeddings, especially in complex linguistic scenarios.

FastText: Developed by Facebook’s AI Research lab, FastText enhances Word2Vec by considering sub-words. This means it can generate embeddings even for misspelled words or words not seen during training by breaking them into smaller chunks.

ELMo (Embeddings from Language Models): This model dynamically generates embeddings based on the context in which words appear, allowing for richer representations.

Universal Sentence Encoder: This model, developed by Google, is designed to embed entire sentences, making it especially useful for tasks that deal with larger text chunks or require understanding the nuances of entire sentences.

GPT (Generative Pre-trained Transformer): Developed by OpenAI, GPT is a series of models (from GPT-1 to GPT-4o) that use the Transformer architecture to generate text. While GPT models are best known for generating text, OpenAI also offers dedicated embedding models. One of them, text-embedding-ada-002, can generate embeddings for text search, code search, sentence similarity, and text classification tasks.

These models, among many others, power a wide range of applications, from natural language processing tasks like sentiment analysis and machine translation to aiding assistants like Bito in understanding and processing code or any other form of data.
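To illustrate the sub-word idea mentioned for FastText above, here is a simplified sketch of how a word can be broken into character n-grams. It only shows the splitting step with a single n-gram length. The real model uses a range of lengths and learns a vector for each n-gram, and `char_ngrams` is a hypothetical helper, not part of the FastText library.

```python
def char_ngrams(word, n=3):
    """Split a word into overlapping character n-grams, with < and > marking its edges."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("where"))
# ['<wh', 'whe', 'her', 'ere', 're>']

# A misspelled word still shares most of its pieces with the correct spelling.
correct = set(char_ngrams("embedding"))
misspelled = set(char_ngrams("embeding"))
shared = correct & misspelled

print(len(shared), "of", len(misspelled), "n-grams are shared")
# 7 of 8 n-grams are shared
```

Because so many pieces overlap, a model built on sub-words can place the misspelled word near the correct one, even if it never saw the misspelling during training.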

Embeddings: More Than Just Numbers

While embeddings might seem like just another technical term or a mere list of numbers, they are crucial bridges that connect human logic and machine understanding. The ability to convert complex data, be it code, images, or even human language, into such vectors, and then use the 'distance' between these vectors to find relatedness, is nothing short of magic.

In the context of Bito, embeddings aren't just a feature. They are the core that powers its deep understanding of your code, making it an indispensable tool for developers. So, the next time you think of Bito's answers as magical, remember, it's the power of embeddings at work!

Bito uses text-embedding-ada-002 from OpenAI, and we’re also trying out some open-source embedding models for our AI that Understands Your Code feature.
