Preprocessing Different File Types

The Haystack elves live in the forest. Every year, after winter, Elf Bilge writes a detailed report on their winter preparations, food collection, memorable moments, and the lessons learned. Other curious elves seek her guidance yearly, asking questions like “Which foods should we collect?” or “What should we do against water scarcity?” 🌲

This year, Elf Bilge has an idea: build a generative system that stands in for her, so elves can fire off questions and get elf-style answers. As she plays with LLMs, she realizes these winter reports are too big to just throw at an LLM. Also, only some parts of a report are usually relevant to a given question. Being a Haystack elf, she knows how to solve this issue: PREPROCESSING! 💡

So, she comes up with a plan. Elf Bilge will convert all report files into Haystack Documents, break them into smaller bits, create semantic doodads (embeddings), and toss them into a document store. That way, she can later use these docs in her RAG pipeline for their generative system. 🌟

For this challenge, you must help Elf Bilge create a pipeline that preprocesses the documents and indexes them, together with their embeddings, into the document store.

🎯 Requirements:

Each split should contain 200 words, and consecutive splits should overlap by 50 words.

Use all winter reports (winter_report_one.txt, winter_report_two.pdf, winter_report_three.md).
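
To picture what the splitting requirement means, here is a minimal plain-Python sketch of word-based splitting with overlap. It is not Haystack's implementation: `split_words` is a hypothetical helper written for illustration, and the sample report is made-up filler text. How a real splitter treats the final, shorter chunk may differ.

```python
def split_words(text, split_length=200, split_overlap=50):
    """Split text into chunks of at most `split_length` words.

    Each chunk after the first starts with the last `split_overlap`
    words of the chunk before it. Hypothetical helper, for illustration only.
    """
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be >= 0 and smaller than split_length")

    words = text.split()
    step = split_length - split_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once the current window has reached the end of the text.
        if start + split_length >= len(words):
            break
    return chunks


# Made-up 500-word "report" so the arithmetic is easy to follow.
report = " ".join(f"word{i}" for i in range(500))
chunks = split_words(report)

print(len(chunks))                    # 3
print(chunks[1].split()[0])           # word150
```

With 200-word windows that advance 150 words at a time, a 500-word text yields windows starting at words 0, 150, and 300, so the last 50 words of each chunk reappear at the start of the next one.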

🧡 Some Hints:

Use FileTypeRouter to route files to the correct converters.

Use DocumentJoiner to join documents from multiple converters into one list of documents.

You have seen how to connect components in Day 1.
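
If the route-then-join idea in the hints feels abstract, the plain-Python sketch below mimics it without using Haystack at all. Everything in it is an assumption made for illustration: `MIME_BY_SUFFIX`, `route_by_type`, `fake_convert`, and `join_documents` are hypothetical stand-ins, and the "documents" are simple dictionaries rather than Haystack Documents. It does not solve the challenge; it only shows the shape of the data flow.

```python
from collections import defaultdict
from pathlib import Path

# Hypothetical lookup table: file extension -> MIME type.
MIME_BY_SUFFIX = {
    ".txt": "text/plain",
    ".pdf": "application/pdf",
    ".md": "text/markdown",
}


def route_by_type(paths):
    """Group file paths by MIME type, so each group can go to its own converter."""
    routed = defaultdict(list)
    for path in paths:
        mime = MIME_BY_SUFFIX.get(Path(path).suffix.lower(), "unclassified")
        routed[mime].append(path)
    return dict(routed)


def fake_convert(paths, converter_name):
    """Stand-in for a converter: returns one placeholder 'document' per file."""
    return [{"source": p, "converted_by": converter_name} for p in paths]


def join_documents(*document_lists):
    """Merge the outputs of several converters into one flat list."""
    joined = []
    for docs in document_lists:
        joined.extend(docs)
    return joined


files = ["winter_report_one.txt", "winter_report_two.pdf", "winter_report_three.md"]
routed = route_by_type(files)

all_docs = join_documents(
    fake_convert(routed.get("text/plain", []), "text converter"),
    fake_convert(routed.get("application/pdf", []), "pdf converter"),
    fake_convert(routed.get("text/markdown", []), "markdown converter"),
)

print(len(all_docs))  # 3
```

In the real pipeline, the router and joiner components play these two roles, and the joined list is what then flows on to splitting, embedding, and writing.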

💚 Here is the Starter Colab

Advent of Haystack