3 Commits

Author SHA1 Message Date
CC Worker
cdc105ae54 feat(seed): expand corpus to 1178 papers + download-only/unseed/granular reset
PRIMARY — corpus breadth (505->1178 papers, 18->60 specs, all URLs HEAD-verified):
- AQA (enumerated): Maths, English Lang/Lit, Geography, Computer Science, Business,
  Psychology, MFL (French/Spanish/German), GCSE + A-level, on top of round-1 sciences.
- Edexcel + OCR (confirmed direct URLs via research): Maths, English, Geography, History,
  Business, Computer Science, GCSE + A-level.
- generate_corpus_manifest.py: _subj/_mfl AQA builders, Edexcel/OCR spec+URL tables,
  derived exam_code (_mk_exam_code) matching the locked convention, concurrent re-verify.
Verified on dev .94: eb_specifications=60, eb_exams=1178, QP=469, doc_type all 'pdf',
seed idempotent (uploaded=673 new, skipped=505), failed=0.

SECONDARY:
- --download-only + persistent bucket-shaped local store (manifests/_corpus_store/, gitignored):
  download-once, seed-many, offline-repeatable; --store-dir/--no-store. (_store_path/_item_bytes/
  download_corpus). Verified: store populated, seed reads offline (download_cached).
- --unseed [--board/--spec]: inverse loader — storage objects (Storage API; protect_delete blocks
  raw SQL), first-sweep seed templates, eb_exams, eb_specifications. Verified reversible on .94.
- Granular admin reset: POST /admin/reset?scope=all|exam-corpus|timetable. reset_environment.reset(scope)
  adds EXAM_CORPUS_TABLES (10) + cc.examboards storage cleanup + TIMETABLE_TABLES (13); 'all' now also
  clears the exam subsystem the legacy reset missed. No schema migration required.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-07 23:33:20 +00:00
CC Worker
5750413f43 feat(seed): implement exam-corpus loader + filled 505-paper manifest
Implements the seed_exam_corpus.py skeleton TODOs against the real APIs and
fills the public exam corpus from official board sources.

Loader (run/initialization/seed_exam_corpus.py):
- _resolve_source_bytes: local path | url: fetch with on-disk cache + PDF validation
- upload_file: real StorageAdmin.upload_file, skip-if-exists+sha256 unless --force
- upsert_specification/upsert_paper: real upserts on spec_code/exam_code.
  Fix: QP/MS/INSERT/ER role -> eb_exams.type_code; doc_type set to 'pdf'
  (doc_type is CHECK-constrained to file formats; the skeleton wrote the role there).
- copy_user_test_subset: copy a QP subset into a test user's cc.users exam space + files rows
- first_sweep: auto_map + the /auto-map row mapper over seeded QPs -> system-owned
  exam_templates + questions/response_areas/boundaries/layout (idempotent)
- identity discovery via institute_memberships.profile_id

Manifest (run/initialization/manifests/):
- exam-corpus.yaml: 505 papers / 18 specs / AQA+Edexcel+OCR, every source URL HEAD-verified.
  AQA sciences GCSE 8461/8462/8463/8464 + AS/A-level 7401-7408, sessions JUN18-JUN24, QP+MS+ER, F+H.
- generate_corpus_manifest.py: regenerates + re-verifies all URLs from official hosts.

seed_curriculum.py: deprecation banner -> superseded by seed_exam_corpus.py; storage_loc
standardised on cc.examboards.

Verified on dev .94: full 505-paper seed (eb_specifications=18, eb_exams=505, QP=211),
idempotent re-runs, first-sweep + user-subset, 6/6 buckets provisioned.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-07 22:58:03 +00:00
CC Worker
9aabc12062 feat(seed): provision taxonomy buckets (infra) + exam-corpus loader skeleton
infra (buckets.py): add cc.public / cc.institutes / cc.admin to the bucket
provisioner alongside cc.examboards; make initialize_buckets idempotent
(already-exists treated as success). Bucket provisioning stays in infra init.

new (seed_exam_corpus.py): manifest-driven loader scaffold that USES the buckets
(does not create them) — validate -> upload to cc.examboards (canonical path) ->
upsert eb_specifications/eb_exams -> optional user test subset -> optional
--first-sweep auto-map pass. TODOs marked for the gathering task to complete.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-07 22:22:48 +00:00