Mon, Aug 21, 2023
Tweet from @jerryjliu0. RAG stands for Retrieval-Augmented Generation.
A compilation of takeaways from the LlamaIndex lecture, found on YouTube here
Here are 8 key considerations for building production-grade LLM apps over your data (RAG) 💡 (see 🧵):
1️⃣ Chunks used for retrieval shouldn’t necessarily be the same as chunks used for LLM synthesis ( @md_rumpf )
2️⃣ Embeddings should live in a different latent space than what you get from raw text, which can contain filler that biases the embeddings. Consider passing a transformed text representation or finetuning. ( @md_rumpf )
3️⃣ If retrieval isn’t returning the right context, you may need to dynamically load/update the data itself ( @md_rumpf , @bobvanluijt )
4️⃣ Design your pipeline for scalability. Latency in prototyping does not translate to production workloads. Start with easy-to-use, high-latency modules, but aggressively shrink latency component by component ( @bobvanluijt )
5️⃣ Store data in a hierarchical fashion: summaries of documents and chunks for each document ( @bobvanluijt )
6️⃣ Robust data pipelines in prod don’t matter as much if you only load the data once. But they are especially important when the source data is constantly changing ( @tuanacelik , @md_rumpf )
7️⃣ RAG isn’t just question-answering: for summaries, you may need ALL chunks; for question-answering, you may need only specific chunks. Your chunk sizes may differ depending on the use case ( @tuanacelik )
8️⃣ Embedding-based retrieval does not work well for entity lookup. Hybrid search combines benefits of keyword lookup with additional context ( @tuanacelik )
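
To make points 1️⃣ and 8️⃣ a bit more concrete, here is a minimal sketch in plain Python. It is not from the thread or from LlamaIndex: the corpus, the function names (`retrieve`, `char_trigrams`, and so on) and the `alpha` weighting are all hypothetical, and a character-trigram count vector stands in for a real embedding model. The idea is only to show retrieval over small units (sentences) while handing the larger parent chunk to the LLM, with a score that mixes a dense-style similarity and exact keyword overlap.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus. Each value is a "parent" chunk: the text an LLM
# would receive for synthesis.
PARENTS = {
    "doc1": "LlamaIndex is a data framework for LLM applications. "
            "It connects private data to large language models.",
    "doc2": "Weaviate is a vector database. "
            "It supports hybrid search that mixes keyword and vector scoring.",
}


def tokenize(text):
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_sentences(text):
    """Naive sentence splitter, good enough for the toy corpus."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def char_trigrams(text):
    """Toy stand-in for an embedding: counts of character trigrams."""
    s = " ".join(tokenize(text))
    return Counter(s[i:i + 3] for i in range(len(s) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def keyword_score(query, text):
    """Fraction of query tokens that appear verbatim in the text."""
    q = set(tokenize(query))
    return len(q & set(tokenize(text))) / len(q) if q else 0.0


# Point 1: index small units (sentences) for retrieval, but remember which
# parent chunk each one came from.
INDEX = [(pid, s) for pid, text in PARENTS.items() for s in split_sentences(text)]


def retrieve(query, alpha=0.5, top_k=1):
    """Rank sentences with a hybrid score, then return their parent chunks.

    alpha weights the dense-style score; (1 - alpha) weights keyword overlap
    (point 8). Both the formula and the default are illustrative assumptions.
    """
    q_vec = char_trigrams(query)
    scored = [
        (alpha * cosine(q_vec, char_trigrams(s))
         + (1 - alpha) * keyword_score(query, s), pid)
        for pid, s in INDEX
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)

    results, seen = [], set()
    for _, pid in scored:
        if pid not in seen:  # return each parent chunk at most once
            seen.add(pid)
            results.append((pid, PARENTS[pid]))
        if len(results) == top_k:
            break
    return results


if __name__ == "__main__":
    for pid, context in retrieve("What is Weaviate?"):
        print(pid, "->", context)
```

The query matches a single short sentence, yet the context passed on is the whole parent chunk, including the sentence about hybrid search that the query never mentioned. In a real pipeline the trigram vectors would be replaced by an embedding model and the keyword part by something like BM25.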