ERLM - Edge Recursive Language Model
This program is an AI assistant designed to extract information from a massive, unseen text corpus (RAW_CORPUS) without reading the text directly. It operates within a persistent Python REPL environment and is constrained by a limited context window, so it must keep its printed output small.
Here's a breakdown of its core functionality:
- Blind Data Exploration: The AI cannot directly view the RAW_CORPUS. Instead, it must infer its structure (structured, semi-structured, or unstructured) through Python code execution.
- Data Engineering Approach: The program follows a structured workflow:
  - Shape Discovery: It first analyzes the RAW_CORPUS to determine its format (JSON, CSV, XML, log lines, etc.).
  - Access Layer Creation: It then builds a system of persistent Python variables (lists, dictionaries, etc.) to efficiently access and manipulate the data. This involves splitting the text into manageable chunks, parsing log lines, or extracting relevant sections.
  - Dense Execution: Finally, it uses the created access layer to perform targeted searches and extractions, avoiding redundant scanning of the entire corpus.
- Limited Output: To manage the context window, the AI is restricted to printing small snippets of output (less than 1000 characters) and is encouraged to summarize findings in Python variables.
- Iterative Process: The program operates in a series of steps, each designed to build upon the previous one. It prioritizes creating reusable tools and verifying results at each stage.
- JSON Output: All outputs are formatted as JSON, ensuring consistent and parsable data.
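
The three-stage workflow listed above (shape discovery, access layer creation, dense execution) can be illustrated with a minimal sketch. Everything here other than the name RAW_CORPUS is an assumption made for illustration: the sample log lines, the helper names `discover_shape` and `build_access_layer`, and the log format are hypothetical, and the real program may structure these steps differently.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for the unseen corpus. In the real REPL, RAW_CORPUS
# is assumed to be a preloaded string that is never printed in full.
RAW_CORPUS = "\n".join([
    "2024-01-01 12:00:00 INFO service started",
    "2024-01-01 12:00:05 ERROR disk full on /dev/sda1",
    "2024-01-01 12:00:09 WARN retrying write",
    "2024-01-01 12:00:12 ERROR write failed",
])

def discover_shape(text, sample_size=2000, max_lines=5):
    """Stage 1: guess the format from a small prefix, using simple heuristics."""
    sample = text[:sample_size].lstrip()
    if sample.startswith(("{", "[")):
        return "json-like"
    if sample.startswith("<"):
        return "xml-like"
    lines = sample.splitlines()[:max_lines]
    if lines and all(re.match(r"\d{4}-\d{2}-\d{2} ", ln) for ln in lines):
        return "log"
    return "unstructured"

# Assumed log layout: date, time, upper-case level, free-text message.
LOG_LINE = re.compile(r"(?P<date>\S+) (?P<time>\S+) (?P<level>[A-Z]+) (?P<msg>.*)")

def build_access_layer(text):
    """Stage 2: parse once into persistent structures that later steps reuse."""
    records = []
    by_level = defaultdict(list)  # level -> indexes into records
    for line in text.splitlines():
        match = LOG_LINE.match(line)
        if match:
            rec = match.groupdict()
            by_level[rec["level"]].append(len(records))
            records.append(rec)
    return records, dict(by_level)

shape = discover_shape(RAW_CORPUS)
records, by_level = build_access_layer(RAW_CORPUS)

# Stage 3: answer a targeted question through the index, without rescanning.
error_messages = [records[i]["msg"] for i in by_level.get("ERROR", [])]
print(shape, len(records), error_messages)
```

Because `records` and `by_level` persist between REPL steps, later queries (for example, all WARN lines) are dictionary lookups rather than another pass over the corpus.
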
In essence, this is a data extraction and analysis tool that mimics how a human data scientist works: it explores and structures a large dataset before running targeted queries.
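
The output constraints described earlier (printed snippets under 1000 characters, results formatted as JSON) might be enforced with small helpers like the ones below. This is a sketch under stated assumptions: the names `preview` and `as_json`, the truncation marker, and the sample `findings` dictionary are hypothetical and do not come from the program itself.

```python
import json

MAX_CHARS = 1000  # the snippet limit named in the description

def preview(value, limit=MAX_CHARS):
    """Return a printable string that is always shorter than `limit` characters."""
    text = value if isinstance(value, str) else repr(value)
    if len(text) < limit:
        return text
    # Append a marker recording the full size so the truncation is visible.
    marker = f"... [{len(text)} chars total]"
    return text[: limit - len(marker) - 1] + marker

def as_json(findings):
    """Serialize findings as a single JSON string with a stable key order."""
    return json.dumps(findings, ensure_ascii=False, sort_keys=True)

# Findings live in a variable; only a bounded preview is ever printed.
findings = {"error_count": 2, "first_error": "disk full on /dev/sda1"}
large_value = "x" * 50_000

short = preview(large_value)
payload = as_json(findings)
print(len(short))
print(payload)
```

Keeping the full result in `findings` and printing only `preview(...)` output is one way to keep the context window from filling with raw corpus text.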