en-US/about_AITriadIngestion.help.txt

TOPIC
    about_AITriadIngestion

SHORT DESCRIPTION
    The document ingestion and summarization pipeline — import, conversion,
    AI metadata extraction, POV summarization, chunking, conflict detection,
    and quality control.

LONG DESCRIPTION
    AI Triad processes external documents (PDFs, web pages, Word documents)
    through a multi-stage pipeline that converts them to Markdown, extracts
    metadata, generates multi-POV summaries, and detects factual conflicts
    across the corpus.

  STAGE 1: DOCUMENT INGESTION (Import-AITriadDocument)
    The entry point for all documents. Accepts a URL or local file path.

    Input types and conversion chain:
      .pdf → ConvertFrom-Pdf (markitdown > pdftotext > mutool)
      .html → ConvertFrom-Html (pandoc > built-in converter)
      .docx → ConvertFrom-Docx (markitdown > pandoc > ZIP/XML fallback)
      .pptx → ConvertFrom-Office (markitdown only)
      .xlsx → ConvertFrom-Office (markitdown only)
      URL → Invoke-WebRequest, then ConvertFrom-Html
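
    Each "a > b > c" chain means the first available tool is tried and the
    next is used on failure. As a sketch only (the helper below is
    illustrative, not the module's actual implementation, and real CLI
    flags differ per tool):

    # Hypothetical sketch: try each converter CLI in order until one succeeds
    function Convert-WithFallback {
        param([string]$Path, [string[]]$Tools)
        foreach ($tool in $Tools) {
            if (Get-Command $tool -ErrorAction SilentlyContinue) {
                $md = & $tool $Path 2>$null
                if ($LASTEXITCODE -eq 0 -and $md) { return $md -join "`n" }
            }
        }
        throw "No converter succeeded for $Path"
    }

    Convert-WithFallback -Path 'report.pdf' -Tools @('markitdown','pdftotext','mutool')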

    For each document, ingestion:
    1. Generates a slug-based doc-id via New-Slug + Resolve-DocId.
       Example: "The Future of AI Governance" → "future-ai-governance-2026"
    2. Creates sources/<doc-id>/ directory.
    3. Saves the original file to sources/<doc-id>/raw/.
    4. Converts to Markdown and saves as sources/<doc-id>/snapshot.md with
       a provenance header (via Add-SnapshotHeader).
    5. Extracts metadata via AI (Get-AIMetadata) — title, authors, POV tags,
       topic tags. Falls back to HTML heuristics (Get-HtmlMeta) if AI fails.
    6. Creates sources/<doc-id>/metadata.json (via New-Metadata).
    7. Optionally submits the URL to the Wayback Machine for archival.
    8. Enqueues the doc-id for summarization (via Add-ToSummaryQueue).
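
    The slug derivation in step 1 amounts to lowercasing, stripping
    stopwords, and hyphenating (illustrative only; the real New-Slug and
    Resolve-DocId may differ in detail):

    $title = 'The Future of AI Governance'
    $slug  = ($title.ToLower() -replace '[^a-z0-9\s-]', '' -split '\s+' |
              Where-Object { $_ -notin @('the','of','a','an','and') }) -join '-'
    # $slug is now 'future-ai-governance'; Resolve-DocId then appends the
    # year and deduplicates against existing sources/ directories.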

    Key parameters:
      -Url        Source URL to fetch and ingest
      -File       Local file path to ingest
      -Title      Override the AI-extracted title
      -PovTag     Pre-classify POV tags instead of AI extraction
      -TopicTag   Pre-classify topic tags
      -SkipAI     Skip AI metadata extraction (use heuristics only)
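
    For example, to ingest a local PDF with a fixed title, a pre-classified
    topic tag, and no AI metadata call (the file path and tag value here
    are placeholders):

    Import-AITriadDocument -File '.\reports\governance.pdf' `
        -Title 'The Future of AI Governance' `
        -TopicTag @('governance') -SkipAI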

  STAGE 2: POV SUMMARIZATION (Invoke-POVSummary / Invoke-BatchSummary)
    Generates structured multi-perspective summaries for each document.

    Single document: Invoke-POVSummary -DocId 'ai-safety-report-2026'
    Batch (all pending): Invoke-BatchSummary

    The summarization process:
    1. Loads snapshot.md and current taxonomy JSON.
    2. Estimates token count (~4 chars/token).
    3. If <= 20,000 tokens: single API call with density-scaled prompt.
       If > 20,000 tokens: chunked pipeline (see CHUNKING below).
    4. Parses the JSON response (with truncation repair if needed).
    5. Validates output density against per-field minimums.
    6. Resolves unmapped_concepts against existing taxonomy nodes.
    7. Writes summaries/<doc-id>.json.
    8. Updates metadata.json with summary_status='current'.
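
    The size check in steps 2-3 amounts to the following (sketch; the real
    estimator and threshold live inside Invoke-POVSummary):

    $markdown = Get-Content "sources/$DocId/snapshot.md" -Raw
    $tokens   = [math]::Ceiling($markdown.Length / 4)   # ~4 chars/token
    if ($tokens -le 20000) {
        # single API call with density-scaled prompt
    } else {
        # chunked pipeline (see CHUNKING below)
    }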

    Output schema (summaries/<doc-id>.json):
      doc_id             Document identifier
      taxonomy_version   Taxonomy version used for generation
      generated_at       ISO timestamp
      ai_model           Model used
      temperature        Sampling temperature
      pov_summaries      Per-POV analysis (see below)
      factual_claims     Extracted factual assertions
      unmapped_concepts  Concepts not in the current taxonomy

    Per-POV summary (pov_summaries.<pov>):
      key_points[]  Array of extracted points, each with:
        point             The key point text
        taxonomy_node_id  Mapped taxonomy node (or null if unmapped)
        stance            strongly_aligned|aligned|neutral|opposed|
                          strongly_opposed|not_applicable
        evidence          Supporting evidence from the document

    Factual claims (factual_claims[]):
      claim           The factual assertion
      claim_label     Short identifier
      source_pov      Which POV perspective the claim comes from
      confidence      Estimated confidence level
      temporal_scope  current_state|predictive|historical|timeless
      temporal_bound  Specific time reference if applicable
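
    A minimal illustrative fragment of summaries/<doc-id>.json (all field
    values below are invented; only the field names come from the schema):

    {
      "doc_id": "ai-safety-report-2026",
      "pov_summaries": {
        "safetyist": {
          "key_points": [
            {
              "point": "Frontier labs lack external audit requirements.",
              "taxonomy_node_id": "governance.audits",
              "stance": "aligned",
              "evidence": "Section 3 notes no binding audit regime exists."
            }
          ]
        }
      },
      "factual_claims": [
        {
          "claim": "No binding external audit regime existed as of 2026.",
          "claim_label": "no-audit-regime-2026",
          "source_pov": "safetyist",
          "confidence": "medium",
          "temporal_scope": "current_state",
          "temporal_bound": "2026"
        }
      ],
      "unmapped_concepts": []
    }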

  CHUNKING (LARGE DOCUMENTS)
    Documents over ~20,000 estimated tokens are processed in chunks:

    1. Split-DocumentChunks divides the Markdown into semantically coherent
       pieces, preferring heading boundaries (##, ###, ####), falling back to
       paragraph breaks. Default: 15,000 tokens/chunk, 2,000 token minimum
       for the last chunk.

    2. Each chunk is summarized independently with the same taxonomy context
       and output schema.

    3. Merge-ChunkSummaries combines results:
       - key_points deduplicated by taxonomy_node_id + first 80 chars
       - factual_claims deduplicated by claim_label
       - unmapped_concepts deduplicated by suggested_label

    4. Density check runs on the merged result (warn only, no retry).
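
    The key_points deduplication in step 3 can be pictured as follows
    (sketch; the real Merge-ChunkSummaries may normalize text differently):

    $seen   = @{}
    $merged = foreach ($p in $allKeyPoints) {
        $prefix = $p.point.Substring(0, [math]::Min(80, $p.point.Length))
        $key    = "$($p.taxonomy_node_id)|$prefix"
        if (-not $seen.ContainsKey($key)) { $seen[$key] = $true; $p }
    }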

  DENSITY SCALING
    To keep summary detail proportional to document length, the system
    computes minimum output counts from the document's word count:

    Field                 Floor formula
    --------------------  --------------------
    key_points (per POV)  max(3, words / 500)
    factual_claims        max(3, words / 800)
    unmapped_concepts     max(2, words / 2000)

    If a summary falls below these floors, the system retries once with an
    explicit nudge prompt identifying the specific shortfalls. Chunked
    summaries warn but do not retry (individual chunks are already smaller).
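
    The floors can be computed directly from the word count (sketch):

    $words  = ($markdown -split '\s+').Count
    $floors = @{
        key_points        = [math]::Max(3, [math]::Floor($words / 500))
        factual_claims    = [math]::Max(3, [math]::Floor($words / 800))
        unmapped_concepts = [math]::Max(2, [math]::Floor($words / 2000))
    }
    # e.g. a 4,000-word document requires at least 8 key points per POV,
    # 5 factual claims, and 2 unmapped concepts.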

  STAGE 3: CONFLICT DETECTION (Find-Conflict)
    Compares factual_claims across all summaries to identify contradictions.

    Find-Conflict reads the factual_claims from each summary and looks for:
    - Direct contradictions (opposing claims about the same topic)
    - Temporal conflicts (claims valid at different times)
    - Perspective-dependent conflicts (same evidence, different conclusions)

    Conflicts are written to conflicts/*.json with:
      id                  Unique conflict identifier
      claim_a, claim_b    The two conflicting claims
      source_a, source_b  Document IDs for each claim
      conflict_type       Type of conflict detected
      severity            Estimated severity
      attack_type         rebut|undercut|undermine (optional, AIF-aligned)
      target_claim        Which claim is being attacked (optional)
      counter_evidence    Evidence against the target (optional)
      verdict             Resolution verdict from debate harvest (optional)
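
    Since conflicts land on disk as JSON, reviewing them can be as simple
    as (property names are from the schema above):

    Get-ChildItem conflicts/*.json | ForEach-Object {
        $c = Get-Content $_ -Raw | ConvertFrom-Json
        '{0}: {1} vs {2} ({3})' -f $c.id, $c.source_a, $c.source_b, $c.conflict_type
    }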

  STAGE 4: HEALTH AND QUALITY
    Several cmdlets monitor corpus quality:

    Get-TaxonomyHealth
      Analyzes coverage across the taxonomy — which nodes have summaries
      mapped to them, which are orphaned, where gaps exist.

    Measure-TaxonomyBaseline
      Captures a point-in-time snapshot of taxonomy metrics (node counts,
      edge counts, summary coverage, conflict density) for trend tracking.

    Test-TaxonomyIntegrity
      Validates structural consistency: orphaned references, missing files,
      schema compliance, edge validity.

    Get-TopicFrequency
      Clusters nodes by embedding similarity to identify overrepresented
      and underrepresented topic areas.

    Get-IngestionPriority
      Ranks potential documents by how much they would improve taxonomy
      coverage.
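
    A typical quality-check loop using these cmdlets:

    Measure-TaxonomyBaseline   # capture metrics before changes
    Invoke-BatchSummary        # process pending documents
    Test-TaxonomyIntegrity     # verify structure is still sound
    Get-TaxonomyHealth         # compare coverage against the baseline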

  UNMAPPED CONCEPT RESOLUTION
    When the AI maps a key_point to a taxonomy node that does not exist, or
    identifies a concept not in the taxonomy, it goes into unmapped_concepts.
    During finalization, Resolve-UnmappedConcepts attempts fuzzy matching
    against existing nodes using label similarity. Successfully matched
    concepts are removed from unmapped_concepts; the rest remain for manual
    review or taxonomy expansion.
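
    Label similarity of this kind is often a normalized overlap score with
    a match threshold; as a sketch (the real Resolve-UnmappedConcepts may
    use a different metric entirely):

    # Sketch: score two labels by the fraction of words they share
    function Get-LabelSimilarity {
        param([string]$A, [string]$B)
        $wa = $A.ToLower() -split '\W+' | Where-Object { $_ }
        $wb = $B.ToLower() -split '\W+' | Where-Object { $_ }
        $shared = ($wa | Where-Object { $wb -contains $_ }).Count
        $shared / [math]::Max($wa.Count, $wb.Count)
    }

    Get-LabelSimilarity 'model audits' 'external model audits'   # ~0.67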

    Repair-UnmappedConcepts can be run independently to re-resolve unmapped
    concepts across all summaries after taxonomy changes.

  SNAPSHOT MANAGEMENT
    Snapshots can go stale when conversion tools improve or when source
    documents are updated.

    Update-Snapshot -DocId 'ai-safety-report-2026'
      Regenerates snapshot.md from the raw/ source using the current best
      conversion tool.

    Update-Snapshot -All
      Regenerates all snapshots. Alias: Redo-Snapshots.

  ARCHIVAL
    Save-WaybackUrl submits URLs to the Internet Archive's Wayback Machine.
    This runs automatically during ingestion (unless -SkipArchive is set)
    and can be invoked manually:

    Save-WaybackUrl -Url 'https://example.com/important-paper.pdf'

  PII AUDIT
    Invoke-PIIAudit scans snapshots and summaries for personally identifiable
    information (email addresses, phone numbers, etc.) that should not be
    stored or published.
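
    A minimal scan of this kind looks like the following (patterns are
    illustrative; the real cmdlet's coverage is broader):

    $patterns = @(
        '[\w.+-]+@[\w-]+\.[\w.-]+'                              # email addresses
        '\+?\d{1,2}[\s.-]?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}'  # US-style phone numbers
    )
    Get-ChildItem sources -Recurse -Filter snapshot.md |
        Select-String -Pattern $patterns |
        Select-Object Path, LineNumber, Line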

  EXAMPLE: FULL PIPELINE
    # 1. Ingest a new document
    Import-AITriadDocument -Url 'https://example.com/ai-report.pdf' `
        -PovTag @('safetyist','cross-cutting')

    # 2. Summarize it
    Invoke-POVSummary -DocId 'ai-report-2026'

    # 3. Detect conflicts with existing corpus
    Find-Conflict

    # 4. Check overall health
    Get-TaxonomyHealth

    # 5. Review in the desktop editor
    Show-TaxonomyEditor

  EXAMPLE: BATCH OPERATIONS
    # Summarize all pending documents
    Invoke-BatchSummary -Model 'gemini-2.5-flash' -Temperature 0.1

    # Re-resolve unmapped concepts after adding taxonomy nodes
    Repair-UnmappedConcepts

    # Regenerate all snapshots with improved conversion
    Update-Snapshot -All

SEE ALSO
    about_AITriad
    about_AITriadTaxonomy
    Import-AITriadDocument
    Invoke-POVSummary
    Invoke-BatchSummary
    Find-Conflict
    Get-TaxonomyHealth
    Update-Snapshot
    Save-WaybackUrl
    Invoke-PIIAudit
    Repair-UnmappedConcepts