lpkg/docs/METADATA_PIPELINE.md

80 lines
4.1 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Metadata Harvesting Pipeline
This repository tracks AI-friendly package metadata under `ai/metadata/`.
The `metadata_indexer` binary orchestrates validation and harvesting tasks.
This document explains the workflow and the supporting assets.
## Directory layout
- `ai/metadata/schema.json` JSON Schema (Draft 2020-12) describing one
package record.
- `ai/metadata/packages/<book>/<slug>.json` harvested package metadata.
- `ai/metadata/index.json` generated summary table linking package IDs to
their JSON files.
- `ai/notes.md` scratchpad for future improvements (e.g., jhalfs integration).
## `metadata_indexer` commands
| Command | Description |
| ------- | ----------- |
| `validate` | Loads every package JSON file and validates it against `schema.json`. Reports schema violations and summary extraction errors. |
| `index` | Re-runs validation and regenerates `index.json`. Use `--compact` to write a single-line JSON payload. |
| `harvest` | Fetches a book page, scrapes build instructions, and emits a draft metadata record (to stdout with `--dry-run` or into `ai/metadata/packages/`). Falls back to jhalfs manifests when inline source links are absent. |
| `refresh` | Updates cached jhalfs manifests (`wget-list`, `md5sums`) under `ai/metadata/cache/`. Supports `--books` filtering and `--force` to bypass the cache. |
| `generate` | Translates harvested metadata into Rust modules under `src/pkgs/by_name` (or a specified directory), using the scaffolder to create `PackageDefinition` wrappers. |
### Harvesting flow
1. **Fetch HTML** the requested page is downloaded with `reqwest` and parsed
using `scraper` selectors.
2. **Heading metadata** the `h1.sect1` title provides the chapter/section,
canonical package name, version, and optional variant hints.
3. **Build steps** `<pre class="userinput">` blocks become ordered `build`
phases (`setup`, `configure`, `build`, `test`, `install`).
4. **Artifact stats** `div.segmentedlist` entries supply SBU and disk usage.
5. **Source URLs** the harvester tries two strategies:
- Inline HTML links inside the page (common for BLFS articles).
- Fallback to the cached jhalfs `wget-list` for the selected book to find
matching `<package>-<version>` entries.
6. **Checksums** the matching entry from the cached jhalfs `md5sums`
manifest populates `source.checksums` when the archive name is known.
7. **Status** unresolved items (missing URLs, anchors, etc.) are recorded in
`status.issues` so humans can interrogate or patch the draft before
promoting it.
### Known gaps
- **Source links via tables** some MLFS chapters list download links inside a
“Package Information” table. The current implementation relies on the
jhalfs `wget-list` fallback instead of parsing that table.
- **Anchor discovery** if the heading lacks an explicit `id` attribute, the
scraper attempts to locate child anchors or scan the raw HTML. If none are
found, a warning is recorded and `status.issues` contains a reminder.
## Using jhalfs manifests
The maintained `wget-list`/`md5sums` files hosted by jhalfs provide canonical
source URLs and hashes. The `metadata_indexer refresh` command keeps these
manifests cached under `ai/metadata/cache/`. Harvesting consumes the cached
copies to populate URLs and MD5 checksums.
Planned enhancements (see `ai/notes.md` and `ai/bugs.json#metadata-harvest-no-source-urls`):
1. Abstract list fetching so BLFS/GLFS variants can reuse the logic.
2. Normalise the match criteria for package + version (handling pass stages,
suffixes, etc.).
## Manual review checklist
When a new metadata file is generated:
- `schema_version` should match `schema.json` (currently `v0.1.0`).
- `package.id` should be unique (format `<book>/<slug>`).
- `source.urls` must include at least one primary URL; add mirrors/patches as
needed.
- Clear any `status.issues` before promoting the record from `draft`.
- Run `cargo run --bin metadata_indexer -- --base-dir . index` to regenerate
the global index once the draft is finalised.
Refer to `README.md` for usage examples and to `docs/ARCHITECTURE.md` for a
broader overview of the crate layout.