Integrate metadata documentation and jhalfs manifests
parent
74bf8a32d6
commit
3ce470e019
34 changed files with 5544 additions and 240 deletions
117
docs/ARCHITECTURE.md
Normal file
@ -0,0 +1,117 @@
# Architecture Overview

This project is split into a reusable Rust library crate (`package_management`)
and several binaries that orchestrate day-to-day workflows. The sections below
outline the main entry points and how the supporting modules fit together.

## CLI entry points

| Binary | Location | Purpose |
| ------ | -------- | ------- |
| `lpkg` | `src/main.rs` | Primary command-line interface with workflow automation and optional TUI integration. |
| `metadata_indexer` | `src/bin/metadata_indexer.rs` | Harvests LFS/BLFS/GLFS package metadata, validates it against the JSON schema, and keeps `ai/metadata/index.json` up to date. |

### `lpkg` workflows

`lpkg` uses [Clap](https://docs.rs/clap) to expose multiple subcommands:

- `EnvCheck` – fetches `<pre>` blocks from an LFS-style HTML page and runs the
  embedded `ver_check` / `ver_kernel` scripts.
- `FetchManifests` – downloads the book’s canonical `wget-list` and `md5sums`
  files and writes them to disk.
- `BuildBinutils` – parses the Binutils Pass 1 page, mirrors the documented
  build steps, and executes them in a Tokio runtime.
- `ScaffoldPackage` – generates a new module under `src/pkgs/by_name/` with
  optimisation defaults (LTO/PGO/`-O3`) and persists metadata via the DB
  helpers.
- `ImportMlfs` – walks the MLFS catalogue, scaffolding definitions and storing
  them in the database (with optional `--dry-run`, `--limit`, and `--overwrite`).
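
The subcommand set above can be pictured as a plain enum. The following is a minimal, dependency-free sketch rather than the project's actual Clap definition: the type `Subcommand`, the helper `from_token`, and the kebab-case token spellings are all assumptions (the spellings follow Clap's default naming for derived subcommands, which the real CLI may or may not use).

```rust
/// Hypothetical, dependency-free mirror of the `lpkg` subcommands listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Subcommand {
    EnvCheck,
    FetchManifests,
    BuildBinutils,
    ScaffoldPackage,
    ImportMlfs,
}

impl Subcommand {
    /// Maps a command-line token to a variant. The kebab-case spellings are
    /// an assumption, not taken from the project's source.
    fn from_token(token: &str) -> Option<Self> {
        match token {
            "env-check" => Some(Self::EnvCheck),
            "fetch-manifests" => Some(Self::FetchManifests),
            "build-binutils" => Some(Self::BuildBinutils),
            "scaffold-package" => Some(Self::ScaffoldPackage),
            "import-mlfs" => Some(Self::ImportMlfs),
            _ => None,
        }
    }
}

fn main() {
    // Read the first CLI argument, if any, and try to resolve it.
    let token = std::env::args().nth(1);
    match token.as_deref().and_then(Subcommand::from_token) {
        Some(cmd) => println!("would dispatch {:?}", cmd),
        None => println!("no recognised subcommand"),
    }
}
```

In the real binary, Clap would also own argument parsing for flags such as `--dry-run` and `--limit`; this sketch only illustrates the dispatch surface.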

When compiled with the `tui` feature flag, the CLI also exposes
`lpkg tui disk-manager`, which drops the user into the terminal UI defined in
`src/tui/`.

### `metadata_indexer`

The `metadata_indexer` binary is a companion tool for maintaining the JSON
artifacts under `ai/metadata/`:

- `validate` – validates every `packages/**.json` file against
  `ai/metadata/schema.json` and reports schema or summary extraction issues.
- `index` – revalidates the metadata and regenerates
  `ai/metadata/index.json` (use `--compact` for single-line JSON).
- `harvest` – fetches a given book page, extracts build metadata, and emits a
  schema-compliant JSON skeleton. When direct HTML parsing does not locate the
  source tarball, it falls back to the jhalfs `wget-list` data to populate
  `source.urls`.

## Module layout

```
src/
  ai/               // JSON loaders for repository personas, tasks, and bugs
  db/               // Diesel database setup and models
  html.rs           // Lightweight HTML helpers (fetch + parse <pre> blocks)
  ingest/           // Parsers for LFS / MLFS / BLFS / GLFS book content
  md5_utils.rs      // Fetches canonical md5sums from the book mirror
  mirrors.rs        // Lists official source mirrors for downloads
  pkgs/             // Package scaffolding and metadata definition helpers
  tui/              // Optional terminal UI (crossterm + tui)
  version_check.rs  // Executes ver_check / ver_kernel snippets
  wget_list.rs      // Fetches jhalfs-maintained wget-list manifests
  bin/metadata_indexer.rs // AI metadata CLI described above
```

### Notable modules

- **`src/pkgs/scaffolder.rs`**
  - Generates filesystem modules and `PackageDefinition` records based on a
    `ScaffoldRequest`.
  - Normalises directory layout (prefix modules, `mod.rs` entries) and applies
    optimisation defaults (LTO, PGO, `-O3`).

- **`src/ingest/`**
  - Provides HTML parsers tailored to each book flavour (LFS, MLFS, BLFS,
    GLFS). The parsers emit `BookPackage` records consumed by the scaffolder
    and metadata importer.

- **`src/db/`**
  - Diesel models and schema for persisting package metadata. `lpkg` uses these
    helpers when scaffolding or importing packages.

- **`src/tui/`**
  - Houses the optional terminal interface (disk manager, main menu, settings,
    downloader). The entry points are conditionally compiled behind the `tui`
    cargo feature.

## Data & metadata assets

The repository keeps long-lived artifacts under `ai/`:

- `ai/metadata/` – JSON schema (`schema.json`), package records, and a generated
  index (`index.json`). The `metadata_indexer` binary maintains these files.
- `ai/personas.json`, `ai/tasks.json`, `ai/bugs.json` – contextual data for
  automated assistance.
- `ai/notes.md` – scratchpad for future work (e.g., jhalfs integration).

`data/` currently contains catalogues derived from the MLFS book and can be
extended with additional book snapshots.

## Database and persistence

The Diesel setup uses SQLite (via the `diesel` crate with `sqlite` and `r2d2`
features enabled). Connection pooling lives in `src/db/mod.rs` and is consumed
by workflows that scaffold or import packages.

## Optional terminal UI

The TUI revolves around `DiskManager` (a crossterm + tui based interface for
GPT partition inspection and creation). Additional stubs (`main_menu.rs`,
`settings.rs`, `downloader.rs`) are present for future expansion. The main CLI
falls back to `DiskManager::run_tui()` whenever `lpkg` is invoked without a
subcommand and is compiled with `--features tui`.
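
The fallback rule can be sketched as a small decision function. This is an illustration under stated assumptions, not the project's code: `Action` and `choose_action` are hypothetical names, and printing help when the `tui` feature is disabled is an assumed behaviour that the text above does not specify.

```rust
/// Hypothetical outcome of inspecting the command line.
#[derive(Debug, PartialEq, Eq)]
enum Action {
    RunSubcommand(String),
    RunTui,
    PrintHelp,
}

/// Picks what to do given an optional subcommand and whether the binary
/// was built with the `tui` feature.
fn choose_action(subcommand: Option<&str>, tui_enabled: bool) -> Action {
    match subcommand {
        Some(name) => Action::RunSubcommand(name.to_string()),
        // No subcommand: fall back to the TUI only when it was compiled in.
        None if tui_enabled => Action::RunTui,
        // Assumption: without the feature, the CLI would print help.
        None => Action::PrintHelp,
    }
}

fn main() {
    // `cfg!` evaluates at compile time; it is false unless the crate is
    // built with `--features tui`.
    let tui_enabled = cfg!(feature = "tui");
    let first = std::env::args().nth(1);
    println!("{:?}", choose_action(first.as_deref(), tui_enabled));
}
```

Passing `tui_enabled` as a parameter keeps the decision testable without rebuilding under different feature sets.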

---

For more operational details around metadata harvesting, refer to
[`docs/METADATA_PIPELINE.md`](./METADATA_PIPELINE.md).
83
docs/METADATA_PIPELINE.md
Normal file
@ -0,0 +1,83 @@
# Metadata Harvesting Pipeline

This repository tracks AI-friendly package metadata under `ai/metadata/`.
The `metadata_indexer` binary orchestrates validation and harvesting tasks.
This document explains the workflow and the supporting assets.

## Directory layout

- `ai/metadata/schema.json` – JSON Schema (Draft 2020-12) describing one
  package record.
- `ai/metadata/packages/<book>/<slug>.json` – harvested package metadata.
- `ai/metadata/index.json` – generated summary table linking package IDs to
  their JSON files.
- `ai/notes.md` – scratchpad for future improvements (e.g., jhalfs integration).
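
To make the layout concrete, the sketch below derives a record's file path from a package ID. It assumes that an ID of the form `<book>/<slug>` maps directly onto `ai/metadata/packages/<book>/<slug>.json`; the function name `package_path` and the sample ID are hypothetical.

```rust
use std::path::{Path, PathBuf};

/// Builds the expected JSON path for a package ID of the form `<book>/<slug>`.
/// Returns `None` when the ID does not have exactly two non-empty parts.
fn package_path(base_dir: &Path, package_id: &str) -> Option<PathBuf> {
    let (book, slug) = package_id.split_once('/')?;
    if book.is_empty() || slug.is_empty() || slug.contains('/') {
        return None;
    }
    Some(
        base_dir
            .join("ai")
            .join("metadata")
            .join("packages")
            .join(book)
            .join(format!("{}.json", slug)),
    )
}

fn main() {
    // "mlfs/binutils-pass-1" is an illustrative ID, not a real record.
    match package_path(Path::new("."), "mlfs/binutils-pass-1") {
        Some(path) => println!("{}", path.display()),
        None => println!("invalid package id"),
    }
}
```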

## `metadata_indexer` commands

| Command | Description |
| ------- | ----------- |
| `validate` | Loads every package JSON file and validates it against `schema.json`. Reports schema violations and summary extraction errors. |
| `index` | Re-runs validation and regenerates `index.json`. Use `--compact` to write a single-line JSON payload. |
| `harvest` | Fetches a book page, scrapes build instructions, and emits a draft metadata record (to stdout with `--dry-run` or into `ai/metadata/packages/`). |

### Harvesting flow

1. **Fetch HTML** – the requested page is downloaded with `reqwest` and parsed
   using `scraper` selectors.
2. **Heading metadata** – the `h1.sect1` title provides the chapter/section,
   canonical package name, version, and optional variant hints.
3. **Build steps** – `<pre class="userinput">` blocks become ordered `build`
   phases (`setup`, `configure`, `build`, `test`, `install`).
4. **Artifact stats** – `div.segmentedlist` entries supply SBU and disk usage.
5. **Source URLs** – the harvester tries two strategies:
   - Inline HTML links inside the page (common for BLFS articles).
   - Fallback to the jhalfs `wget-list` for the selected book (currently MLFS)
     using `package_management::wget_list::get_wget_list` to find matching
     `<package>-<version>` entries.
6. **Checksums** – integration with the book’s `md5sums` mirror is pending;
   placeholder wiring exists (`src/md5_utils.rs`).
7. **Status** – unresolved items (missing URLs, anchors, etc.) are recorded in
   `status.issues` so humans can inspect or patch the draft before
   promoting it.
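
The `wget-list` fallback in step 5 amounts to scanning a list of URLs for file names that start with `<package>-<version>`. The following is a minimal sketch of that idea, not the harvester's implementation: `find_source_urls` and `file_name` are hypothetical helpers, the list is passed in as text rather than fetched, and the `example.org` URLs are placeholders.

```rust
/// Returns the last path segment of a URL (the file name).
fn file_name(url: &str) -> &str {
    url.rsplit('/').next().unwrap_or(url)
}

/// Finds URLs whose file name starts with `<package>-<version>`,
/// compared case-insensitively.
fn find_source_urls(wget_list: &str, package: &str, version: &str) -> Vec<String> {
    let needle = format!("{}-{}", package, version).to_lowercase();
    wget_list
        .lines()
        .map(str::trim)
        .filter(|url| !url.is_empty())
        .filter(|url| {
            let name = file_name(url).to_lowercase();
            match name.strip_prefix(needle.as_str()) {
                // Reject "2.4" matching "2.41": the version must not
                // continue with another digit.
                Some(rest) => !rest.starts_with(|c: char| c.is_ascii_digit()),
                None => false,
            }
        })
        .map(String::from)
        .collect()
}

fn main() {
    let list = "https://example.org/src/binutils-2.41.tar.xz\n\
                https://example.org/src/binutils-2.4.tar.xz\n\
                https://example.org/src/gcc-13.2.0.tar.xz\n";
    for url in find_source_urls(list, "binutils", "2.41") {
        println!("{}", url);
    }
}
```

The digit check is a simple heuristic: it still accepts `2.4` against a file named `binutils-2.4.1.tar.xz`, so a production matcher would need stricter version comparison.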

### Known gaps

- **Source links via tables** – some MLFS chapters list download links inside a
  “Package Information” table. The current implementation relies on the
  jhalfs `wget-list` fallback instead of parsing that table.
- **Checksums** – MD5 lookups from jhalfs are planned but not yet wired into
  the harvest pipeline.
- **Anchor discovery** – if the heading lacks an explicit `id` attribute, the
  scraper attempts to locate child anchors or scan the raw HTML. If none are
  found, a warning is recorded and `status.issues` contains a reminder.

## Using jhalfs manifests

The maintained `wget-list`/`md5sums` files hosted by jhalfs provide canonical
source URLs and hashes. The helper modules `src/wget_list.rs` and
`src/md5_utils.rs` download these lists for the multilib LFS book. The
harvester currently consumes the wget-list as a fallback; integrating the
`md5sums` file will let us emit `source.checksums` automatically.
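
One way the pending checksum integration could work is to parse the `md5sums` text into a map keyed by file name and look up each source URL's file name in it. This sketch assumes the conventional `md5sum` output format (`<hash>  <filename>` per line); `parse_md5sums` and `checksum_for_url` are hypothetical names, and the hash shown is a dummy value.

```rust
use std::collections::HashMap;

/// Parses `md5sum`-style lines into a map from file name to lowercase hash.
/// Lines that do not start with a 32-character hex digest are skipped.
fn parse_md5sums(text: &str) -> HashMap<String, String> {
    text.lines()
        .filter_map(|line| {
            let mut fields = line.split_whitespace();
            let hash = fields.next()?;
            let file = fields.next()?;
            let is_md5 = hash.len() == 32 && hash.chars().all(|c| c.is_ascii_hexdigit());
            if !is_md5 {
                return None;
            }
            // Binary-mode entries prefix the file name with '*'.
            Some((file.trim_start_matches('*').to_string(), hash.to_lowercase()))
        })
        .collect()
}

/// Looks up the checksum for the file named by the last segment of `url`.
fn checksum_for_url<'a>(sums: &'a HashMap<String, String>, url: &str) -> Option<&'a str> {
    let file = url.rsplit('/').next()?;
    sums.get(file).map(String::as_str)
}

fn main() {
    // Dummy digest for illustration only.
    let sums = parse_md5sums("0123456789abcdef0123456789abcdef  binutils-2.41.tar.xz\n");
    let url = "https://example.org/src/binutils-2.41.tar.xz";
    println!("{:?}", checksum_for_url(&sums, url));
}
```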

Planned enhancements (see `ai/notes.md` and `ai/bugs.json#metadata-harvest-no-source-urls`):

1. Abstract list fetching so BLFS/GLFS variants can reuse the logic.
2. Normalise the match criteria for package + version (handling pass stages,
   suffixes, etc.).
3. Populate checksum entries alongside URLs.
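
As an illustration of item 2, the sketch below splits a heading-style title into a normalised name, a version, and an optional variant such as a pass stage. It assumes titles shaped like `Name-Version - Variant` with any section number already stripped; `HeadingParts` and `split_heading` are hypothetical, and real book headings may need more cases than this handles.

```rust
/// Hypothetical result of normalising a package heading.
#[derive(Debug, PartialEq, Eq)]
struct HeadingParts {
    name: String,
    version: String,
    variant: Option<String>,
}

/// Splits a title such as "Binutils-2.41 - Pass 1" into its parts.
/// Returns `None` when no `-<version>` suffix starting with a digit is found.
fn split_heading(title: &str) -> Option<HeadingParts> {
    // Anything after " - " is treated as a variant hint (e.g. a pass stage).
    let (head, variant) = match title.split_once(" - ") {
        Some((head, rest)) => (head.trim(), Some(rest.trim().to_string())),
        None => (title.trim(), None),
    };
    // The version is whatever follows the last '-' in the remaining text.
    let (name, version) = head.rsplit_once('-')?;
    if !version.starts_with(|c: char| c.is_ascii_digit()) {
        return None;
    }
    Some(HeadingParts {
        name: name.trim().to_lowercase(),
        version: version.to_string(),
        variant,
    })
}

fn main() {
    println!("{:?}", split_heading("Binutils-2.41 - Pass 1"));
    println!("{:?}", split_heading("Introduction"));
}
```

A normalised `(name, version)` pair like this could then feed the `wget-list` lookup so that pass stages do not prevent a match.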

## Manual review checklist

When a new metadata file is generated:

- `schema_version` should match `schema.json` (currently `v0.1.0`).
- `package.id` should be unique (format `<book>/<slug>`).
- `source.urls` must include at least one primary URL; add mirrors/patches as
  needed.
- Clear any `status.issues` before promoting the record from `draft`.
- Run `cargo run --bin metadata_indexer -- --base-dir . index` to regenerate
  the global index once the draft is finalised.
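
Most of the checklist above can be expressed as mechanical checks. The sketch below is one possible shape for such a pre-promotion check, not part of `metadata_indexer`: the `Draft` struct is a simplified stand-in for a real package record, and `review` is a hypothetical helper.

```rust
/// Simplified, hypothetical view of the fields the checklist inspects.
struct Draft<'a> {
    schema_version: &'a str,
    package_id: &'a str,
    source_urls: Vec<&'a str>,
    status_issues: Vec<&'a str>,
}

/// Returns a list of human-readable problems; empty means the draft passes.
fn review(draft: &Draft, known_ids: &[&str]) -> Vec<String> {
    let mut problems = Vec::new();
    if draft.schema_version != "v0.1.0" {
        problems.push(format!("unexpected schema_version {}", draft.schema_version));
    }
    // The ID must look like `<book>/<slug>` with both parts non-empty.
    let well_formed = matches!(
        draft.package_id.split_once('/'),
        Some((book, slug)) if !book.is_empty() && !slug.is_empty() && !slug.contains('/')
    );
    if !well_formed {
        problems.push(format!("package.id {} is not <book>/<slug>", draft.package_id));
    }
    if known_ids.iter().any(|id| *id == draft.package_id) {
        problems.push(format!("package.id {} already exists", draft.package_id));
    }
    if draft.source_urls.is_empty() {
        problems.push("source.urls is empty".to_string());
    }
    if !draft.status_issues.is_empty() {
        problems.push(format!("{} unresolved status.issues", draft.status_issues.len()));
    }
    problems
}

fn main() {
    let draft = Draft {
        schema_version: "v0.1.0",
        package_id: "mlfs/binutils",
        source_urls: vec![],
        status_issues: vec!["missing source URL"],
    };
    for problem in review(&draft, &[]) {
        println!("{}", problem);
    }
}
```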

Refer to `README.md` for usage examples and to `docs/ARCHITECTURE.md` for a
broader overview of the crate layout.