# ERLM - Edge Recursive Language Model
This program is an AI assistant designed to extract information from a massive, unseen text corpus (`RAW_CORPUS`) without reading the text directly. It operates within a persistent Python REPL environment and is constrained by a limited context window, so it must keep its printed output small.

Here's a breakdown of its core functionality:
1. **Blind Data Exploration:** The AI cannot view `RAW_CORPUS` directly. Instead, it must infer whether the corpus is structured, semi-structured, or unstructured by executing Python code against it.

2. **Data Engineering Approach:** The program follows a structured workflow:
   * **Shape Discovery:** It first analyzes `RAW_CORPUS` to determine its format (JSON, CSV, XML, log lines, etc.).
   * **Access Layer Creation:** It then builds a set of persistent Python variables (lists, dictionaries, etc.) that give efficient access to the data. This may involve splitting the text into manageable chunks, parsing log lines, or extracting relevant sections.
   * **Dense Execution:** Finally, it uses this access layer to perform targeted searches and extractions, so the entire corpus does not have to be scanned again for each query.
3. **Limited Output:** To manage the context window, the AI is restricted to printing small snippets of output (fewer than 1,000 characters) and is encouraged to keep summaries of its findings in Python variables.

4. **Iterative Process:** The program operates in a series of steps, each building on the previous one. It prioritizes creating reusable tools and verifying results at each stage.
5. **JSON Output:** All outputs are formatted as JSON, which keeps them consistent and machine-parsable.

In essence, this is a data extraction and analysis tool that mimics the workflow of a human data scientist: it carefully explores and structures a large dataset before running targeted queries.
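As a rough illustration, the workflow described above (shape discovery, access layer creation, dense execution, bounded JSON output) might look like the sketch below. This is not ERLM's actual code: the sample `RAW_CORPUS`, the helper names `discover_shape`, `build_log_index`, and `to_bounded_json`, and the log-line format are all assumptions made for the example.

```python
import json
import re
from collections import defaultdict

# Stand-in corpus for illustration only; in the real setting RAW_CORPUS
# already exists in the REPL and its contents are unknown.
RAW_CORPUS = "\n".join([
    "2024-01-01 12:00:00 INFO service started",
    "2024-01-01 12:00:05 ERROR disk quota exceeded",
    "2024-01-01 12:00:09 INFO request served",
    "2024-01-01 12:01:00 ERROR connection reset",
])

MAX_PRINT_CHARS = 1000  # output budget mentioned in the description


def to_bounded_json(obj, limit=MAX_PRINT_CHARS):
    """Serialize obj to JSON, staying under `limit` characters."""
    text = json.dumps(obj, ensure_ascii=False)
    if len(text) < limit:
        return text
    # Too long: return a small, still-valid JSON envelope with a preview.
    # limit // 4 leaves room for the escaping json.dumps adds to the preview.
    preview = text[: limit // 4]
    return json.dumps({"truncated": True, "preview": preview}, ensure_ascii=False)


def discover_shape(corpus, sample_chars=2000):
    """Shape discovery: guess the format from a small leading sample (heuristic)."""
    sample = corpus[:sample_chars].lstrip()
    first_line = sample.split("\n", 1)[0]
    if sample.startswith(("{", "[")):
        return "json"
    if sample.startswith("<"):
        return "xml"
    if re.match(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}", first_line):
        return "log"
    if first_line.count(",") >= 2:
        return "csv"
    return "unstructured"


# Assumed log layout: "<date> <time> <LEVEL> <message>"
LOG_LINE = re.compile(r"^(\S+ \S+) (\w+) (.*)$")


def build_log_index(corpus):
    """Access layer: parse each line once and group entries by log level."""
    by_level = defaultdict(list)
    for line in corpus.splitlines():
        match = LOG_LINE.match(line)
        if match:
            timestamp, level, message = match.groups()
            by_level[level].append({"ts": timestamp, "msg": message})
    return dict(by_level)


# Step 1: shape discovery. Step 2: build the persistent access layer.
shape = discover_shape(RAW_CORPUS)
index = build_log_index(RAW_CORPUS) if shape == "log" else {}

# Step 3: dense execution. Query the index instead of rescanning the corpus,
# keep the full result in a variable, and print only a bounded JSON summary.
errors = index.get("ERROR", [])
print(to_bounded_json({"shape": shape, "error_count": len(errors), "first_error": errors[:1]}))
```

Because `index` and `errors` persist in the REPL, later steps can reuse them for further queries, and only the short JSON summary counts against the context window.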