Text Embedding

Created: by Pradeep Gowda Updated: Apr 11, 2024 Tagged: llm

See also: llm

Text embedding models represent natural language as dense vectors, positioning semantically similar text near each other within the embedding space. Or in simple terms, text embedding models are like translators for computers. They take text and convert it into numbers in a way the computer can understand.

These numerical representations, also known as embeddings, capture semantic information about the words or sentences in the text. Because they let computers work with natural language numerically, embeddings power a wide range of downstream tasks, including document retrieval, sentence similarity, classification, and clustering.
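To make this concrete, here is a minimal sketch of embedding a few sentences and comparing them with cosine similarity. It assumes the sentence-transformers package and the all-MiniLM-L6-v2 model (an illustrative choice; any dense text embedding model works the same way).

```python
# Minimal sketch: embed sentences and compare them with cosine similarity.
# Assumes the sentence-transformers package and the all-MiniLM-L6-v2 model.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings

sentences = [
    "The cat sat on the mat.",
    "A kitten is resting on the rug.",
    "Quarterly revenue grew by 8%.",
]
embeddings = model.encode(sentences)  # shape: (3, 384)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(embeddings[0], embeddings[1]))  # high: both describe a cat resting
print(cosine(embeddings[0], embeddings[2]))  # low: unrelated topics
```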

An intuitive introduction to text embeddings - Stack Overflow blog

We can frame a lot of useful tasks in terms of text similarity.

  • Search: How similar is a query to a document in your database?
  • Spam filtering: How close is an email to examples of spam?
  • Content moderation: How close is a social media message to known examples of abuse?
  • Conversational agent: Which examples of known intents are closest to the user’s message?
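As a concrete instance of the first framing (search), the sketch below embeds a query and a handful of documents and ranks the documents by cosine similarity. It again assumes sentence-transformers and all-MiniLM-L6-v2; the documents and query are made up for illustration.

```python
# Sketch of similarity-based search: rank documents by closeness to a query.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "How to reset your account password",
    "Refund policy for annual subscriptions",
    "Troubleshooting two-factor authentication",
]
query = "I forgot my login password"

doc_vecs = model.encode(documents, normalize_embeddings=True)
query_vec = model.encode(query, normalize_embeddings=True)

# With unit-length vectors, cosine similarity is just a dot product.
scores = doc_vecs @ query_vec
for rank, idx in enumerate(np.argsort(-scores), start=1):
    print(rank, round(float(scores[idx]), 3), documents[idx])
```

The same pattern covers the other tasks in the list: swap the documents for known spam, known abusive messages, or example utterances per intent, and the closest matches become the prediction.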

Effective and fast text embedding methods transform textual input into a numeric form, which allows models such as GPT-4 to process immense volumes of data and show a remarkable level of natural language understanding.

A deep, intuitive understanding of text embeddings can help you follow the advances of these models, letting you effectively incorporate them into your own systems without combing through the technical specs of each new improvement as it emerges.


The Beginner’s Guide to Text Embeddings | deepset

The foundational technology for modern-day NLP is the text embedding, also known as a “vector”: without word and sentence vectors, there would be no cutting-edge translation apps, no automated summaries, and no semantic search engines as we know them. What’s more, embeddings can be used to represent other data types as well, like images and audio files.


Embeddings 101: The foundation of large language models

Gecko

Gecko is a text embedding model from Google; its 768-dimensional vectors are competitive with models that have 7x more parameters and 5x larger embeddings. Lee, Jinhyuk, Zhuyun Dai, Xiaoqi Ren, Blair Chen, Daniel Cer, Jeremy R. Cole, Kai Hui, et al. “Gecko: Versatile Text Embeddings Distilled from Large Language Models,” 2024. https://arxiv.org/abs/2403.20327.

Results: Gecko is best-in-class among models with BERT-sized embeddings, and thanks to its support for Matryoshka representation learning it outperforms all such models with just 256 dimensions. Furthermore, when using all 768 dimensions, it competes directly with titans such as E5-Mistral and GritLM, which have 7B parameters (vs. Gecko’s 1.2B) and 4096 embedding dimensions. Impressive! An interesting finding that gives credit to the use of synthetic data can be seen in the last row, where the model trained exclusively on LLM-generated data (instead of the typical mixture that includes MTEB datasets) remains quite performant for its size.

NOTE: likely not usable right now since the model weights are not available.
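Since Gecko’s weights are not public, here is a sketch of the Matryoshka idea itself rather than of Gecko: truncate a full embedding to its leading dimensions and re-normalize before computing similarities. The vectors below are random stand-ins purely to show the mechanics; with a Matryoshka-trained model, the truncated vectors remain useful for retrieval.

```python
# Sketch of Matryoshka-style truncation: keep only the first k dimensions of a
# full embedding, then re-normalize before computing similarities.
# Random vectors are used as stand-ins because Gecko's weights are not available.
import numpy as np

full_dim, truncated_dim = 768, 256

rng = np.random.default_rng(0)
full = rng.normal(size=(5, full_dim))           # stand-in for 768-dim embeddings
full /= np.linalg.norm(full, axis=1, keepdims=True)

truncated = full[:, :truncated_dim]             # keep the leading 256 dimensions
truncated /= np.linalg.norm(truncated, axis=1, keepdims=True)

# Similarities are now computed in the smaller space (cheaper storage and search).
sims = truncated @ truncated.T
print(sims.shape)  # (5, 5)
```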