# Metadata Harvesting Pipeline
This repository tracks AI-friendly package metadata under `ai/metadata/`.
The `metadata_indexer` binary orchestrates validation and harvesting tasks.
This document explains the workflow and the supporting assets.
## Directory layout
- `ai/metadata/schema.json` – JSON Schema (Draft 2020-12) describing one package record.
- `ai/metadata/packages/<book>/<slug>.json` – harvested package metadata.
- `ai/metadata/index.json` – generated summary table linking package IDs to their JSON files.
- `ai/notes.md` – scratchpad for future improvements (e.g., jhalfs integration).
## `metadata_indexer` commands
| Command | Description |
|---|---|
| `validate` | Loads every package JSON file and validates it against `schema.json`. Reports schema violations and summary extraction errors. |
| `index` | Re-runs validation and regenerates `index.json`. Use `--compact` to write a single-line JSON payload. |
| `harvest` | Fetches a book page, scrapes build instructions, and emits a draft metadata record (to stdout with `--dry-run` or into `ai/metadata/packages/`). |
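As a sketch, the three commands might be invoked through `cargo run` like so. The `--base-dir`, `--compact`, and `--dry-run` flags appear elsewhere in this document; the positional page URL passed to `harvest` is an assumption, not a documented signature:

```shell
# Validate every package record against schema.json
cargo run --bin metadata_indexer -- --base-dir . validate

# Re-run validation and regenerate index.json as single-line JSON
cargo run --bin metadata_indexer -- --base-dir . index --compact

# Draft a record to stdout without writing into ai/metadata/packages/
# (the page-URL argument is hypothetical)
cargo run --bin metadata_indexer -- --base-dir . harvest --dry-run <page-url>
```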
## Harvesting flow
- **Fetch HTML** – the requested page is downloaded with `reqwest` and parsed using `scraper` selectors.
- **Heading metadata** – the `h1.sect1` title provides the chapter/section, canonical package name, version, and optional variant hints.
- **Build steps** – `<pre class="userinput">` blocks become ordered `build` phases (`setup`, `configure`, `build`, `test`, `install`).
- **Artifact stats** – `div.segmentedlist` entries supply SBU and disk usage.
- **Source URLs** – the harvester tries two strategies:
  - Inline HTML links inside the page (common for BLFS articles).
  - Fallback to the jhalfs `wget-list` for the selected book (currently MLFS) using `package-management::wget_list::get_wget_list` to find matching `<package>-<version>` entries.
- **Checksums** – integration with the book’s `md5sums` mirror is pending; placeholder wiring exists (`src/md5_utils.rs`).
- **Status** – unresolved items (missing URLs, anchors, etc.) are recorded in `status.issues` so humans can interrogate or patch the draft before promoting it.
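The "build steps" stage above can be sketched as a classifier over scraped `userinput` blocks. This is an illustrative stdlib-only sketch, not the harvester's actual implementation; the heuristics and names are assumptions:

```rust
/// The five ordered build phases named in the flow above.
#[derive(Debug, PartialEq)]
enum Phase {
    Setup,
    Configure,
    Build,
    Test,
    Install,
}

/// Assign a scraped <pre class="userinput"> command block to a phase.
/// Hypothetical heuristics: checked from most to least specific so that
/// "make install" is not misread as a plain "make" build step.
fn classify(block: &str) -> Phase {
    let b = block.trim_start();
    if b.contains("make install") {
        Phase::Install
    } else if b.starts_with("make check") || b.starts_with("make test") {
        Phase::Test
    } else if b.starts_with("./configure") || b.contains("cmake") {
        Phase::Configure
    } else if b.starts_with("make") {
        Phase::Build
    } else {
        Phase::Setup // anything else (mkdir, patch, sed, …) counts as setup
    }
}

fn main() {
    assert_eq!(classify("./configure --prefix=/usr"), Phase::Configure);
    assert_eq!(classify("make"), Phase::Build);
    assert_eq!(classify("make install"), Phase::Install);
    println!("ok");
}
```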
## Known gaps
- **Source links via tables** – some MLFS chapters list download links inside a
  “Package Information” table. The current implementation relies on the
  jhalfs `wget-list` fallback instead of parsing that table.
- **Checksums** – MD5 lookups from jhalfs are planned but not yet wired into the harvest pipeline.
- **Anchor discovery** – if the heading lacks an explicit `id` attribute, the scraper attempts to locate child anchors or scan the raw HTML. If none are found, a warning is recorded and `status.issues` contains a reminder.
## Using jhalfs manifests
The maintained `wget-list`/`md5sums` files hosted by jhalfs provide canonical
source URLs and hashes. The helper modules `src/wget_list.rs` and
`src/md5_utils.rs` download these lists for the multilib LFS book. The
harvester currently consumes the `wget-list` as a fallback; integrating the
`md5sums` file will let us emit `source.checksums` automatically.
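The fallback match against the `wget-list` could look like the following sketch. The real logic lives in `src/wget_list.rs`; the function name and the prefix heuristic here are assumptions:

```rust
/// Return every URL in a jhalfs wget-list whose file name starts with
/// "<package>-<version>" (e.g. "gcc-13.2.0"). Illustrative only: the
/// crate's actual matching may also handle pass stages and suffixes.
fn match_wget_entries<'a>(wget_list: &'a str, package: &str, version: &str) -> Vec<&'a str> {
    let needle = format!("{package}-{version}");
    wget_list
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .filter(|url| {
            // Compare against the final path component of the URL.
            url.rsplit('/')
                .next()
                .map_or(false, |name| name.starts_with(&needle))
        })
        .collect()
}

fn main() {
    let list = "https://example.org/gcc-13.2.0.tar.xz\n\
                https://example.org/binutils-2.41.tar.xz";
    // prints ["https://example.org/gcc-13.2.0.tar.xz"]
    println!("{:?}", match_wget_entries(list, "gcc", "13.2.0"));
}
```

Matching on the file-name prefix rather than the whole URL keeps mirror hostnames out of the comparison, which is why the planned normalisation work focuses on the `<package>-<version>` stem.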
Planned enhancements (see `ai/notes.md` and `ai/bugs.json#metadata-harvest-no-source-urls`):
- Abstract list fetching so BLFS/GLFS variants can reuse the logic.
- Normalise the match criteria for package + version (handling pass stages, suffixes, etc.).
- Populate checksum entries alongside URLs.
## Manual review checklist
When a new metadata file is generated:
- `schema_version` should match `schema.json` (currently `v0.1.0`).
- `package.id` should be unique (format `<book>/<slug>`).
- `source.urls` must include at least one primary URL; add mirrors/patches as needed.
- Clear any `status.issues` before promoting the record from `draft`.
- Run `cargo run --bin metadata_indexer -- --base-dir . index` to regenerate the global index once the draft is finalised.
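A hypothetical draft record illustrating the fields the checklist names. Only `schema_version`, `package.id`, `source.urls`, `status.issues`, and the `draft` state come from this document; the surrounding shape and the example values are assumptions, not the actual schema:

```json
{
  "schema_version": "v0.1.0",
  "package": {
    "id": "mlfs/gcc",
    "name": "gcc",
    "version": "13.2.0"
  },
  "source": {
    "urls": ["https://ftp.gnu.org/gnu/gcc/gcc-13.2.0/gcc-13.2.0.tar.xz"]
  },
  "status": {
    "state": "draft",
    "issues": []
  }
}
```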
Refer to `README.md` for usage examples and to `docs/ARCHITECTURE.md` for a
broader overview of the crate layout.