en-US/about_AITriadIngestion.help.txt
TOPIC
    about_AITriadIngestion

SHORT DESCRIPTION
    The document ingestion and summarization pipeline: import, conversion,
    AI metadata extraction, POV summarization, chunking, conflict detection,
    and quality control.

LONG DESCRIPTION
    AI Triad processes external documents (PDFs, web pages, Word documents)
    through a multi-stage pipeline that converts them to Markdown, extracts
    metadata, generates multi-POV summaries, and detects factual conflicts
    across the corpus.

STAGE 1: DOCUMENT INGESTION (Import-AITriadDocument)
    The entry point for all documents. Accepts a URL or local file path.

    Input types and conversion chain:

        .pdf   → ConvertFrom-Pdf     (markitdown > pdftotext > mutool)
        .html  → ConvertFrom-Html    (pandoc > built-in converter)
        .docx  → ConvertFrom-Docx    (markitdown > pandoc > ZIP/XML fallback)
        .pptx  → ConvertFrom-Office  (markitdown only)
        .xlsx  → ConvertFrom-Office  (markitdown only)
        URL    → Invoke-WebRequest, then ConvertFrom-Html

    For each document, ingestion:

    1. Generates a slug-based doc-id via New-Slug + Resolve-DocId.
       Example: "The Future of AI Governance" → "future-ai-governance-2026"
    2. Creates the sources/<doc-id>/ directory.
    3. Saves the original file to sources/<doc-id>/raw/.
    4. Converts to Markdown and saves as sources/<doc-id>/snapshot.md
       with a provenance header (via Add-SnapshotHeader).
    5. Extracts metadata via AI (Get-AIMetadata): title, authors,
       POV tags, topic tags. Falls back to HTML heuristics (Get-HtmlMeta)
       if AI fails.
    6. Creates sources/<doc-id>/metadata.json (via New-Metadata).
    7. Optionally submits the URL to the Wayback Machine for archival.
    8. Enqueues the doc-id for summarization (via Add-ToSummaryQueue).
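
    The ">" in the conversion chain reads as a preference order: try the
    first tool and fall through to the next when it is unavailable. The
    sketch below illustrates that idea only. Select-FirstAvailableTool and
    its -IsAvailable parameter are hypothetical names, not part of the
    module, and probing PATH with Get-Command is an assumption about how
    availability might be detected.

```powershell
# Hypothetical helper (not part of AI Triad): returns the first tool in a
# preference-ordered list that is available, or $null when none is.
function Select-FirstAvailableTool {
    param(
        [Parameter(Mandatory)]
        [string[]] $Candidates,

        # Availability probe. Assumption: a tool counts as available when an
        # executable with that name can be found on PATH.
        [scriptblock] $IsAvailable = {
            param($Name)
            [bool](Get-Command -Name $Name -CommandType Application -ErrorAction SilentlyContinue)
        }
    )

    foreach ($tool in $Candidates) {
        if (& $IsAvailable $tool) {
            return $tool
        }
    }
    return $null
}

# Preference order for PDFs, as listed in the conversion chain.
$pdfTool = Select-FirstAvailableTool -Candidates 'markitdown', 'pdftotext', 'mutool'
if ($pdfTool) {
    "Would convert PDFs with: $pdfTool"
} else {
    'No PDF converter found on PATH.'
}
```

    Passing the probe as a script block keeps the selection logic testable
    without installing any of the converters.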

    Key parameters:

        -Url        Source URL to fetch and ingest
        -File       Local file path to ingest
        -Title      Override the AI-extracted title
        -PovTag     Pre-classify POV tags instead of AI extraction
        -TopicTag   Pre-classify topic tags
        -SkipAI     Skip AI metadata extraction (use heuristics only)

STAGE 2: POV SUMMARIZATION (Invoke-POVSummary / Invoke-BatchSummary)
    Generates structured multi-perspective summaries for each document.

    Single document:

        Invoke-POVSummary -DocId 'ai-safety-report-2026'

    Batch (all pending):

        Invoke-BatchSummary

    The summarization process:

    1. Loads snapshot.md and current taxonomy JSON.
    2. Estimates token count (~4 chars/token).
    3. If <= 20,000 tokens: single API call with density-scaled prompt.
       If > 20,000 tokens: chunked pipeline (see CHUNKING below).
    4. Parses the JSON response (with truncation repair if needed).
    5. Validates output density against per-field minimums.
    6. Resolves unmapped_concepts against existing taxonomy nodes.
    7. Writes summaries/<doc-id>.json.
    8. Updates metadata.json with summary_status='current'.
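
    The numeric decisions in steps 2, 3, and 5 can be sketched as follows.
    This is a minimal illustration, not the module's code: the function
    names are hypothetical, rounding the token estimate up and the word
    quotients down are assumptions, and the floor formulas are the ones
    listed under DENSITY SCALING in this topic.

```powershell
# Hypothetical helpers (not part of AI Triad); rounding rules are assumptions.

function Get-EstimatedTokenCount {
    param([Parameter(Mandatory)][AllowEmptyString()][string] $Text)
    # Step 2: roughly 4 characters per token, rounded up.
    [int][math]::Ceiling($Text.Length / 4)
}

function Get-SummaryStrategy {
    param(
        [Parameter(Mandatory)][int] $TokenCount,
        [int] $Threshold = 20000
    )
    # Step 3: at or below the threshold, one API call; above it, chunking.
    if ($TokenCount -le $Threshold) { 'single' } else { 'chunked' }
}

function Get-DensityFloor {
    param([Parameter(Mandatory)][int] $WordCount)
    # Step 5: per-field minimums that scale with document word count.
    [pscustomobject]@{
        KeyPointsPerCamp = [math]::Max(3, [int][math]::Floor($WordCount / 500))
        FactualClaims    = [math]::Max(3, [int][math]::Floor($WordCount / 800))
        UnmappedConcepts = [math]::Max(2, [int][math]::Floor($WordCount / 2000))
    }
}

# Usage with stand-in values (a 100,000-character snapshot of 6,000 words).
$text   = 'x' * 100000
$tokens = Get-EstimatedTokenCount -Text $text     # 25000
Get-SummaryStrategy -TokenCount $tokens           # chunked
Get-DensityFloor -WordCount 6000                  # 12 / 7 / 3
```

    For a 6,000-word document these assumptions give floors of 12 key
    points per camp, 7 factual claims, and 3 unmapped concepts.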

    Output schema (summaries/<doc-id>.json):

        doc_id              Document identifier
        taxonomy_version    Taxonomy version used for generation
        generated_at        ISO timestamp
        ai_model            Model used
        temperature         Sampling temperature
        pov_summaries       Per-POV analysis (see below)
        factual_claims      Extracted factual assertions
        unmapped_concepts   Concepts not in the current taxonomy

    Per-POV summary (pov_summaries.<pov>):

        key_points[]        Array of extracted points, each with:
          point             The key point text
          taxonomy_node_id  Mapped taxonomy node (or null if unmapped)
          stance            strongly_aligned|aligned|neutral|opposed|
                            strongly_opposed|not_applicable
          evidence          Supporting evidence from the document

    Factual claims (factual_claims[]):

        claim               The factual assertion
        claim_label         Short identifier
        source_pov          Which POV perspective the claim comes from
        confidence          Estimated confidence level
        temporal_scope      current_state|predictive|historical|timeless
        temporal_bound      Specific time reference if applicable

CHUNKING (LARGE DOCUMENTS)
    Documents over ~20,000 estimated tokens are processed in chunks:

    1. Split-DocumentChunks divides the Markdown into semantically
       coherent pieces, preferring heading boundaries (##, ###, ####),
       falling back to paragraph breaks. Default: 15,000 tokens/chunk,
       2,000 token minimum for the last chunk.
    2. Each chunk is summarized independently with the same taxonomy
       context and output schema.
    3. Merge-ChunkSummaries combines results:
       - key_points deduplicated by taxonomy_node_id + first 80 chars
       - factual_claims deduplicated by claim_label
       - unmapped_concepts deduplicated by suggested_label
    4. Density check runs on the merged result (warn only, no retry).
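
    The key_points rule in step 3 can be illustrated with a short sketch.
    Merge-KeyPoint is a hypothetical function, not the module's
    Merge-ChunkSummaries; keeping the first occurrence and comparing keys
    case-sensitively are assumptions, and the node IDs are made up.

```powershell
# Hypothetical sketch (not the module's implementation) of deduplicating
# key_points by taxonomy_node_id plus the first 80 characters of the point.
function Merge-KeyPoint {
    param([Parameter(Mandatory)][object[]] $KeyPoints)

    # Assumption: keys are compared case-sensitively (HashSet default).
    $seen = [System.Collections.Generic.HashSet[string]]::new()

    foreach ($kp in $KeyPoints) {
        $text   = [string]$kp.point
        $prefix = $text.Substring(0, [math]::Min(80, $text.Length))
        $key    = '{0}|{1}' -f $kp.taxonomy_node_id, $prefix

        # Add() returns $false for a key already seen, so only the first
        # occurrence of each key is emitted.
        if ($seen.Add($key)) { $kp }
    }
}

# Key points gathered from several chunks (illustrative data only).
$fromChunks = @(
    [pscustomobject]@{ point = 'Compute thresholds should trigger audits.'; taxonomy_node_id = 'node-a' }
    [pscustomobject]@{ point = 'Compute thresholds should trigger audits.'; taxonomy_node_id = 'node-a' }
    [pscustomobject]@{ point = 'Compute thresholds should trigger audits.'; taxonomy_node_id = $null }
)

$merged = @(Merge-KeyPoint -KeyPoints $fromChunks)
$merged.Count   # 2: the exact duplicate is dropped, the unmapped copy is kept
```

    The same pattern would apply to factual_claims and unmapped_concepts by
    swapping the key for claim_label or suggested_label.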

DENSITY SCALING
    To ensure summaries are proportionally detailed for document length,
    the system computes minimum output counts based on word count:

        Field               Floor Formula
        -----------------   --------------------------------
        key_points/camp     max(3, words / 500)
        factual_claims      max(3, words / 800)
        unmapped_concepts   max(2, words / 2000)

    If a summary falls below these floors, the system retries once with an
    explicit nudge prompt identifying the specific shortfalls. Chunked
    summaries warn but do not retry (individual chunks are already smaller).

STAGE 3: CONFLICT DETECTION (Find-Conflict)
    Compares factual_claims across all summaries to identify contradictions.

    Find-Conflict reads the factual_claims from each summary and looks for:

    - Direct contradictions (opposing claims about the same topic)
    - Temporal conflicts (claims valid at different times)
    - Perspective-dependent conflicts (same evidence, different conclusions)

    Conflicts are written to conflicts/*.json with:

        id                  Unique conflict identifier
        claim_a, claim_b    The two conflicting claims
        source_a, source_b  Document IDs for each claim
        conflict_type       Type of conflict detected
        severity            Estimated severity
        attack_type         rebut|undercut|undermine (optional, AIF-aligned)
        target_claim        Which claim is being attacked (optional)
        counter_evidence    Evidence against the target (optional)
        verdict             Resolution verdict from debate harvest (optional)

STAGE 4: HEALTH AND QUALITY
    Several cmdlets monitor corpus quality:

    Get-TaxonomyHealth
        Analyzes coverage across the taxonomy: which nodes have summaries
        mapped to them, which are orphaned, where gaps exist.

    Measure-TaxonomyBaseline
        Captures a point-in-time snapshot of taxonomy metrics (node counts,
        edge counts, summary coverage, conflict density) for trend tracking.

    Test-TaxonomyIntegrity
        Validates structural consistency: orphaned references, missing
        files, schema compliance, edge validity.

    Get-TopicFrequency
        Clusters nodes by embedding similarity to identify overrepresented
        and underrepresented topic areas.

    Get-IngestionPriority
        Ranks potential documents by how much they would improve taxonomy
        coverage.

UNMAPPED CONCEPT RESOLUTION
    When the AI maps a key_point to a taxonomy node that does not exist, or
    identifies a concept not in the taxonomy, it goes into unmapped_concepts.

    During finalization, Resolve-UnmappedConcepts attempts fuzzy matching
    against existing nodes using label similarity. Successfully matched
    concepts are removed from unmapped_concepts; the rest remain for manual
    review or taxonomy expansion.

    Repair-UnmappedConcepts can be run independently to re-resolve unmapped
    concepts across all summaries after taxonomy changes.

SNAPSHOT MANAGEMENT
    Snapshots can go stale when conversion tools improve or when source
    documents are updated.

        Update-Snapshot -DocId 'ai-safety-report-2026'

    Regenerates snapshot.md from the raw/ source using the current best
    conversion tool.

        Update-Snapshot -All

    Regenerates all snapshots. Alias: Redo-Snapshots.

ARCHIVAL
    Save-WaybackUrl submits URLs to the Internet Archive's Wayback Machine.
    This runs automatically during ingestion (unless -SkipArchive is set)
    and can be invoked manually:

        Save-WaybackUrl -Url 'https://example.com/important-paper.pdf'

PII AUDIT
    Invoke-PIIAudit scans snapshots and summaries for personally
    identifiable information (email addresses, phone numbers, etc.) that
    should not be stored or published.

EXAMPLE: FULL PIPELINE
        # 1. Ingest a new document
        Import-AITriadDocument -Url 'https://example.com/ai-report.pdf' `
            -PovTag @('safetyist','cross-cutting')

        # 2. Summarize it
        Invoke-POVSummary -DocId 'ai-report-2026'

        # 3. Detect conflicts with existing corpus
        Find-Conflict

        # 4. Check overall health
        Get-TaxonomyHealth

        # 5. Review in the desktop editor
        Show-TaxonomyEditor

EXAMPLE: BATCH OPERATIONS
        # Summarize all pending documents
        Invoke-BatchSummary -Model 'gemini-2.5-flash' -Temperature 0.1

        # Re-resolve unmapped concepts after adding taxonomy nodes
        Repair-UnmappedConcepts

        # Regenerate all snapshots with improved conversion
        Update-Snapshot -All

SEE ALSO
    about_AITriad
    about_AITriadTaxonomy
    Import-AITriadDocument
    Invoke-POVSummary
    Invoke-BatchSummary
    Find-Conflict
    Get-TaxonomyHealth
    Update-Snapshot
    Save-WaybackUrl
    Invoke-PIIAudit
    Repair-UnmappedConcepts