Integrate metadata documentation and jhalfs manifests

This commit is contained in:
m00d 2025-10-01 06:58:04 +02:00
parent 74bf8a32d6
commit 3ce470e019
34 changed files with 5544 additions and 240 deletions

docs/ARCHITECTURE.md Normal file

@@ -0,0 +1,117 @@
# Architecture Overview
This project is split into a reusable Rust library crate (`package_management`)
and several binaries that orchestrate day-to-day workflows. The sections below
outline the main entry points and how the supporting modules fit together.
## CLI entry points
| Binary | Location | Purpose |
| ------ | -------- | ------- |
| `lpkg` | `src/main.rs` | Primary command-line interface with workflow automation and optional TUI integration. |
| `metadata_indexer` | `src/bin/metadata_indexer.rs` | Harvests LFS/BLFS/GLFS package metadata, validates it against the JSON schema, and keeps `ai/metadata/index.json` up to date. |
### `lpkg` workflows
`lpkg` uses [Clap](https://docs.rs/clap) to expose multiple subcommands:
- `EnvCheck` fetches `<pre>` blocks from an LFS-style HTML page and runs the
embedded `ver_check` / `ver_kernel` scripts.
- `FetchManifests` downloads the book's canonical `wget-list` and `md5sums`
  files and writes them to disk.
- `BuildBinutils` parses the Binutils Pass 1 page, mirrors the documented
build steps, and executes them in a Tokio runtime.
- `ScaffoldPackage` generates a new module under `src/pkgs/by_name/` with
optimisation defaults (LTO/PGO/`-O3`) and persists metadata via the DB
helpers.
- `ImportMlfs` walks the MLFS catalogue, scaffolding definitions and storing
them in the database (with optional `--dry-run`, `--limit`, and `--overwrite`).
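The real CLI derives this structure with Clap macros; as a dependency-free illustration, the subcommand dispatch can be sketched with a plain enum and slice patterns (the enum variants mirror the list above, but the argument names are assumptions):

```rust
// Simplified, std-only sketch of lpkg's subcommand dispatch.
// The real binary uses Clap derive macros; the argument shapes here
// (page_url, name, dry_run) are illustrative, not the actual API.
#[derive(Debug, PartialEq)]
enum Command {
    EnvCheck { page_url: String },
    FetchManifests,
    BuildBinutils,
    ScaffoldPackage { name: String },
    ImportMlfs { dry_run: bool },
}

fn parse(args: &[&str]) -> Option<Command> {
    match args {
        ["env-check", url] => Some(Command::EnvCheck { page_url: url.to_string() }),
        ["fetch-manifests"] => Some(Command::FetchManifests),
        ["build-binutils"] => Some(Command::BuildBinutils),
        ["scaffold-package", name] => Some(Command::ScaffoldPackage { name: name.to_string() }),
        ["import-mlfs", rest @ ..] => Some(Command::ImportMlfs {
            dry_run: rest.contains(&"--dry-run"),
        }),
        _ => None, // unknown subcommand: the real CLI prints Clap's help here
    }
}
```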
When compiled with the `tui` feature flag, the CLI also exposes
`lpkg tui disk-manager`, which drops the user into the terminal UI defined in
`src/tui/`.
### `metadata_indexer`
The `metadata_indexer` binary is a companion tool for maintaining the JSON
artifacts under `ai/metadata/`:
- `validate` checks every `packages/**/*.json` file against
  `ai/metadata/schema.json` and reports schema or summary extraction issues.
- `index` revalidates the metadata and regenerates
`ai/metadata/index.json` (use `--compact` for single-line JSON).
- `harvest` fetches a given book page, extracts build metadata, and emits a
schema-compliant JSON skeleton. When direct HTML parsing does not locate the
source tarball, it falls back to the jhalfs `wget-list` data to populate
`source.urls`.
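The shape of the regenerated index can be sketched as follows. This is a hand-rolled, dependency-free illustration of the `--compact` vs. pretty distinction; the real `index` subcommand validates each record first and serializes with a JSON library, and the field names here are assumptions:

```rust
// Illustrative sketch of rendering an index from (package id, file path)
// pairs. Field names ("id", "file") are assumed for illustration; see
// ai/metadata/index.json for the real shape.
fn render_index(entries: &[(&str, &str)], compact: bool) -> String {
    let items: Vec<String> = entries
        .iter()
        .map(|(id, path)| format!(r#"{{"id":"{}","file":"{}"}}"#, id, path))
        .collect();
    if compact {
        // single-line payload, as produced by `--compact`
        format!("[{}]", items.join(","))
    } else {
        format!("[\n  {}\n]", items.join(",\n  "))
    }
}
```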
## Module layout
```
src/
ai/ // JSON loaders for repository personas, tasks, and bugs
db/ // Diesel database setup and models
html.rs // Lightweight HTML helpers (fetch + parse <pre> blocks)
ingest/ // Parsers for LFS / MLFS / BLFS / GLFS book content
md5_utils.rs // Fetches canonical md5sums from the book mirror
mirrors.rs // Lists official source mirrors for downloads
pkgs/ // Package scaffolding and metadata definition helpers
tui/ // Optional terminal UI (crossterm + tui)
version_check.rs // Executes ver_check / ver_kernel snippets
wget_list.rs // Fetches jhalfs-maintained wget-list manifests
bin/metadata_indexer.rs // AI metadata CLI described above
```
### Notable modules
- **`src/pkgs/scaffolder.rs`**
- Generates filesystem modules and `PackageDefinition` records based on a
`ScaffoldRequest`.
- Normalises directory layout (prefix modules, `mod.rs` entries) and applies
optimisation defaults (LTO, PGO, `-O3`).
- **`src/ingest/`**
- Provides HTML parsers tailored to each book flavour (LFS, MLFS, BLFS,
GLFS). The parsers emit `BookPackage` records consumed by the scaffolder
and metadata importer.
- **`src/db/`**
- Diesel models and schema for persisting package metadata. `lpkg` uses these
helpers when scaffolding or importing packages.
- **`src/tui/`**
- Houses the optional terminal interface (disk manager, main menu, settings,
downloader). The entry points are conditionally compiled behind the `tui`
cargo feature.
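The directory normalisation described for the scaffolder can be sketched as below. The two-character prefix scheme and the exact sanitisation rules are assumptions for illustration; `src/pkgs/scaffolder.rs` holds the real logic:

```rust
// Sketch of the kind of path normalisation the scaffolder performs when
// generating a module under src/pkgs/by_name/. The two-character prefix
// directory and the `-` -> `_` sanitisation are illustrative assumptions.
fn module_path(package: &str) -> String {
    // Rust module names cannot contain `-` or `.`, so sanitize first.
    let module: String = package
        .chars()
        .map(|c| if c.is_ascii_alphanumeric() { c } else { '_' })
        .collect();
    let prefix: String = module.chars().take(2).collect();
    format!("src/pkgs/by_name/{}/{}/mod.rs", prefix, module)
}
```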
## Data & metadata assets
The repository keeps long-lived artifacts under `ai/`:
- `ai/metadata/`: JSON schema (`schema.json`), package records, and a generated
  index (`index.json`). The `metadata_indexer` binary maintains these files.
- `ai/personas.json`, `ai/tasks.json`, `ai/bugs.json`: contextual data for
  automated assistance.
- `ai/notes.md`: scratchpad for future work (e.g., jhalfs integration).
`data/` currently contains catalogues derived from the MLFS book and can be
extended with additional book snapshots.
## Database and persistence
The Diesel setup uses SQLite (via the `diesel` crate with `sqlite` and `r2d2`
features enabled). Connection pooling lives in `src/db/mod.rs` and is consumed
by workflows that scaffold or import packages.
## Optional terminal UI
The TUI revolves around `DiskManager` (a crossterm + tui based interface for
GPT partition inspection and creation). Additional stubs (`main_menu.rs`,
`settings.rs`, `downloader.rs`) are present for future expansion. The main CLI
falls back to `DiskManager::run_tui()` whenever `lpkg` is invoked without a
subcommand and is compiled with `--features tui`.
---
For more operational details around metadata harvesting, refer to
[`docs/METADATA_PIPELINE.md`](./METADATA_PIPELINE.md).

docs/METADATA_PIPELINE.md Normal file

@@ -0,0 +1,83 @@
# Metadata Harvesting Pipeline
This repository tracks AI-friendly package metadata under `ai/metadata/`.
The `metadata_indexer` binary orchestrates validation and harvesting tasks.
This document explains the workflow and the supporting assets.
## Directory layout
- `ai/metadata/schema.json`: JSON Schema (Draft 2020-12) describing one
  package record.
- `ai/metadata/packages/<book>/<slug>.json`: harvested package metadata.
- `ai/metadata/index.json`: generated summary table linking package IDs to
  their JSON files.
- `ai/notes.md`: scratchpad for future improvements (e.g., jhalfs integration).
## `metadata_indexer` commands
| Command | Description |
| ------- | ----------- |
| `validate` | Loads every package JSON file and validates it against `schema.json`. Reports schema violations and summary extraction errors. |
| `index` | Re-runs validation and regenerates `index.json`. Use `--compact` to write a single-line JSON payload. |
| `harvest` | Fetches a book page, scrapes build instructions, and emits a draft metadata record (to stdout with `--dry-run` or into `ai/metadata/packages/`). |
### Harvesting flow
1. **Fetch HTML**: the requested page is downloaded with `reqwest` and parsed
   using `scraper` selectors.
2. **Heading metadata**: the `h1.sect1` title provides the chapter/section,
   canonical package name, version, and optional variant hints.
3. **Build steps**: `<pre class="userinput">` blocks become ordered `build`
   phases (`setup`, `configure`, `build`, `test`, `install`).
4. **Artifact stats**: `div.segmentedlist` entries supply SBU and disk usage.
5. **Source URLs**: the harvester tries two strategies:
   - Inline HTML links inside the page (common for BLFS articles).
   - Fallback to the jhalfs `wget-list` for the selected book (currently MLFS),
     using `package_management::wget_list::get_wget_list` to find matching
     `<package>-<version>` entries.
6. **Checksums**: integration with the book's `md5sums` mirror is pending;
   placeholder wiring exists (`src/md5_utils.rs`).
7. **Status**: unresolved items (missing URLs, anchors, etc.) are recorded in
   `status.issues` so reviewers can inspect or patch the draft before
   promoting it.
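The wget-list fallback in step 5 can be sketched as a scan of the manifest for a URL whose file name starts with `<package>-<version>`. The function name and the exact matching rule are illustrative; the real logic sits behind `package_management::wget_list::get_wget_list`:

```rust
// Sketch of the wget-list fallback: find the first manifest URL whose
// trailing file name begins with "<package>-<version>". Matching rule is
// an assumption; the real harvester may normalise further (pass stages,
// suffixes, etc.).
fn find_source_url<'a>(wget_list: &'a str, package: &str, version: &str) -> Option<&'a str> {
    let needle = format!("{}-{}", package, version);
    wget_list
        .lines()
        .map(str::trim)
        .filter(|l| !l.is_empty())
        .find(|url| {
            url.rsplit('/')
                .next()
                .map_or(false, |file| file.starts_with(&needle))
        })
}
```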
### Known gaps
- **Source links via tables**: some MLFS chapters list download links inside a
  “Package Information” table. The current implementation relies on the
  jhalfs `wget-list` fallback instead of parsing that table.
- **Checksums**: MD5 lookups from jhalfs are planned but not yet wired into
  the harvest pipeline.
- **Anchor discovery**: if the heading lacks an explicit `id` attribute, the
  scraper attempts to locate child anchors or scan the raw HTML. If none are
  found, a warning is recorded and `status.issues` contains a reminder.
## Using jhalfs manifests
The maintained `wget-list`/`md5sums` files hosted by jhalfs provide canonical
source URLs and hashes. The helper modules `src/wget_list.rs` and
`src/md5_utils.rs` download these lists for the multilib LFS book. The
harvester currently consumes the wget-list as a fallback; integrating the
`md5sums` file will let us emit `source.checksums` automatically.
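The planned `md5sums` integration amounts to parsing md5sum(1)-style two-column lines into a filename-to-hash map that the harvester can consult. A minimal sketch, assuming the standard `<md5>  <filename>` format (the function name is hypothetical):

```rust
use std::collections::HashMap;

// Sketch of parsing a jhalfs `md5sums` manifest into a filename -> hash
// map, the shape a `source.checksums` lookup would need. Assumes the
// standard two-column md5sum(1) output; blank/malformed lines are skipped.
fn parse_md5sums(manifest: &str) -> HashMap<String, String> {
    manifest
        .lines()
        .filter_map(|line| {
            let mut parts = line.split_whitespace();
            let hash = parts.next()?;
            let file = parts.next()?;
            Some((file.to_string(), hash.to_string()))
        })
        .collect()
}
```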
Planned enhancements (see `ai/notes.md` and `ai/bugs.json#metadata-harvest-no-source-urls`):
1. Abstract list fetching so BLFS/GLFS variants can reuse the logic.
2. Normalise the match criteria for package + version (handling pass stages,
suffixes, etc.).
3. Populate checksum entries alongside URLs.
## Manual review checklist
When a new metadata file is generated:
- `schema_version` should match `schema.json` (currently `v0.1.0`).
- `package.id` should be unique (format `<book>/<slug>`).
- `source.urls` must include at least one primary URL; add mirrors/patches as
needed.
- Clear any `status.issues` before promoting the record from `draft`.
- Run `cargo run --bin metadata_indexer -- --base-dir . index` to regenerate
the global index once the draft is finalised.
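The `package.id` uniqueness check from this list can be sketched as a duplicate scan over the collected ids (the function name is illustrative; the real check would run as part of `index` regeneration):

```rust
use std::collections::HashSet;

// Sketch of checking `package.id` uniqueness across metadata records.
// Ids follow the `<book>/<slug>` format; any id returned here should be
// resolved before regenerating the index.
fn duplicate_ids<'a>(ids: &[&'a str]) -> Vec<&'a str> {
    let mut seen = HashSet::new();
    ids.iter()
        .copied()
        .filter(|id| !seen.insert(*id)) // insert() is false on a repeat
        .collect()
}
```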
Refer to `README.md` for usage examples and to `docs/ARCHITECTURE.md` for a
broader overview of the crate layout.