# VBAF 5.0.0 -- VBAF.RL.Environment.ps1

#Requires -Version 5.1

<#

.SYNOPSIS

    Standardized Environment Interface for VBAF RL Algorithms

.DESCRIPTION

    Provides an OpenAI Gym-like environment interface for all VBAF RL algorithms.

    WHAT YOU ARE LEARNING HERE:

    ============================

    In reinforcement learning, the ENVIRONMENT is everything the agent

    interacts with. It defines:

      - What the agent can OBSERVE (the state space)

      - What the agent can DO (the action space)

      - What HAPPENS when the agent acts (the transition function)

      - How much REWARD the agent gets (the reward function)

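    This file standardises environments so that ANY agent (DQN, PPO, A3C,

    Q-learning) can work with ANY environment without code changes, as long

    as the agent supports that environment's state and action spaces

    (tabular Q-learning, for example, needs discrete or discretised states).

    This is the same design as OpenAI Gym (now maintained as Gymnasium),

    the de facto standard interface for RL environments.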

    THE GYM INTERFACE -- THREE METHODS:

    =====================================

    Every environment in VBAF implements the same three methods:

      Reset()         -- start a new episode, return initial state

      Step(action)    -- apply action, return a hashtable with keys

                         NextState, Reward and Done

      GetState()      -- return current state as a double array

    This is the standard contract. Any agent that calls these three methods

    works with any environment. Swap CartPole for GridWorld -- same agent code.
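
    A sketch of that swap follows. It assumes this file has been dot-sourced,

    and a random action stands in for a trained agent's Predict($state) call:

      foreach ($name in "CartPole", "GridWorld", "RandomWalk") {
          $env   = New-VBAFEnvironment -Name $name
          $state = $env.Reset()
          do {
              $action = Get-Random -Maximum $env.ActionSpace.Size
              $result = $env.Step($action)
              $state  = $result.NextState
          } while (-not $result.Done)
          "{0}: {1} steps, total reward {2}" -f $name, $env.Steps, $env.TotalReward
      }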

    STATE SPACE vs ACTION SPACE:

    ============================

    State space: what the agent can observe

      Continuous: real-valued numbers (position, velocity, angle)

      Discrete:   integer categories (grid cell, one of N options)

    Action space: what the agent can do

      Discrete:   a fixed set of choices (push left, push right)

      Continuous: a real-valued command (force between -1 and +1)

    VBAF currently uses discrete actions -- simpler to implement

    and sufficient for all examples in this framework.

    THE THREE ENVIRONMENTS:

    =======================

    CartPole:   balance a pole on a cart (classic control problem)

                State: [position, velocity, angle, angular_velocity]

                Actions: 0=push left, 1=push right

    GridWorld:  navigate a grid to reach a goal (spatial reasoning)

                State: [agent_row, agent_col, goal_row, goal_col]

                Actions: 0=up, 1=right, 2=down, 3=left

    RandomWalk: move along a 1D line to reach the center (simplest possible)

                State: [position]

                Actions: 0=left, 1=right

                Use this for quick sanity checks when debugging an agent.

    INHERITANCE:

    ============

    CartPoleEnvironment, GridWorldEnvironment and RandomWalkEnvironment

    all inherit from VBAFEnvironment. They share the same interface

    but implement different physics and reward functions.

    This is polymorphism in action -- the same agent code works with

    all three environments because they all respond to the same methods.
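
    Every environment IS a VBAFEnvironment, so a function can type its

    parameter as the base class and still accept any subclass. A sketch

    (Measure-RandomEpisode is a hypothetical helper, not part of VBAF):

      function Measure-RandomEpisode([VBAFEnvironment]$Environment) {
          $null = $Environment.Reset()
          do {
              $action = Get-Random -Maximum $Environment.ActionSpace.Size
              $result = $Environment.Step($action)
          } until ($result.Done)
          return $Environment.Steps
      }
      Measure-RandomEpisode ([CartPoleEnvironment]::new())
      Measure-RandomEpisode ([GridWorldEnvironment]::new())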

    THEORY REFERENCE:

    =================

    Brockman, G. et al. (2016). "OpenAI Gym."

    arXiv:1606.01540.

    The OpenAI Gym paper established the standard environment interface

    that this file implements. Most major RL libraries follow this pattern.

.NOTES

    Part of VBAF (Visual AI & Reinforcement Learning Framework)

    Educational use -- swap environments to see how agents generalise.

#>

$basePath = $PSScriptRoot

# ============================================================

# VBAFSPACE -- describes a state or action space

# ============================================================

#

# WHAT IS A SPACE

# ----------------

# A space defines the valid values for states or actions.

#

# Discrete space (Type="discrete", Size=2):

#   Actions are integers 0, 1, ..., Size-1

#   Example: CartPole has 2 actions (0=left, 1=right)

#

# Continuous space (Type="continuous", Size=4, Low=-4.8, High=4.8):

#   States are arrays of Size real numbers, nominally in [Low, High]

#   (one shared bound for every dimension; values are not clamped to it)

#   Example: CartPole state has 4 values (position, velocity, angle, angular velocity)

#

# Knowing the space lets agents and environments work together without

# hardcoding sizes -- an agent can ask the environment "how many actions

# do you have" and configure itself accordingly.
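#
# EXAMPLE (sketch, assumes this file has been dot-sourced):
#   $env = New-VBAFEnvironment -Name "GridWorld"
#   $env.ObservationSpace.Size      # 4 -> e.g. the input width of a Q-network
#   $env.ActionSpace.Size           # 4 -> how many actions the agent chooses from
#   $env.ActionSpace.ToString()     # discrete(4) [0, 3]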

class VBAFSpace {

    [string] $Type    # "discrete" or "continuous"

    [int]    $Size    # number of actions OR number of state dimensions

    [double] $Low     # minimum value (for continuous spaces)

    [double] $High    # maximum value (for continuous spaces)

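    VBAFSpace([string]$type, [int]$size) {

        $this.Type = $type

        $this.Size = $size

        if ($type -eq "discrete") {

            # Discrete actions are the integers 0 .. Size-1

            $this.Low  = 0.0

            $this.High = $size - 1

        } else {

            $this.Low  = -1.0

            $this.High =  1.0

        }

    }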

    VBAFSpace([string]$type, [int]$size, [double]$low, [double]$high) {

        $this.Type = $type

        $this.Size = $size

        $this.Low  = $low

        $this.High = $high

    }

    [string] ToString() {

        return "$($this.Type)($($this.Size)) [$($this.Low), $($this.High)]"

    }

}

# ============================================================

# VBAFENVIRONMENT -- base class for all environments

# ============================================================

#

# BASE CLASS PATTERN:

# -------------------

# VBAFEnvironment defines the INTERFACE that all environments share.

# Subclasses (CartPole, GridWorld, RandomWalk) override Reset(), GetState() and Step()

# with their own physics and reward functions.

#

# The base class provides:

#   - Common properties (Name, Steps, MaxSteps, TotalReward, EpisodeCount)

#   - PrintInfo() -- works for all environments automatically

#   - Placeholder Reset/GetState/Step implementations (return zeros -- override these)

#

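# This is an abstract-base-class pattern -- define the shared structure here,

# fill in the details in subclasses. PowerShell classes have no "abstract"

# keyword, so the placeholder methods stand in for abstract ones.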

class VBAFEnvironment {

    [string]    $Name

    [VBAFSpace] $ObservationSpace   # Defines what the agent can observe

    [VBAFSpace] $ActionSpace        # Defines what the agent can do

    [int]       $Steps              # Steps taken in current episode

    [int]       $MaxSteps           # Episode ends after this many steps

    [double]    $TotalReward        # Cumulative reward this episode

    [int]       $EpisodeCount       # Total episodes started

    VBAFEnvironment([string]$name, [int]$maxSteps) {

        $this.Name         = $name

        $this.MaxSteps     = $maxSteps

        $this.Steps        = 0

        $this.TotalReward  = 0.0

        $this.EpisodeCount = 0

    }

    # Override in subclass -- return initial state array

    [double[]] Reset()   { return @(0.0) }

    # Override in subclass -- return current state array

    [double[]] GetState() { return @(0.0) }

    # Override in subclass -- apply action, return (NextState, Reward, Done)

    [hashtable] Step([int]$action) {

        return @{ NextState = @(0.0); Reward = 0.0; Done = $true }

    }

    # Print environment summary -- works for all subclasses automatically

    [void] PrintInfo() {

        Write-Host "  Environment : $($this.Name)"                          -ForegroundColor Cyan

        Write-Host "  Obs Space   : $($this.ObservationSpace.ToString())"  -ForegroundColor Cyan

        Write-Host "  Act Space   : $($this.ActionSpace.ToString())"       -ForegroundColor Cyan

        Write-Host "  Max Steps   : $($this.MaxSteps)"                     -ForegroundColor Cyan

    }

}
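
# EXTENDING THE BASE CLASS (a sketch; CountdownEnvironment is a hypothetical
# example, not part of VBAF). Following the pattern described above, a new
# environment sets its spaces in the constructor and overrides Reset(),
# GetState() and Step():
#
#   class CountdownEnvironment : VBAFEnvironment {
#       [int] $Counter
#       CountdownEnvironment() : base("Countdown", 20) {
#           $this.ObservationSpace = [VBAFSpace]::new("continuous", 1, 0.0, 1.0)
#           $this.ActionSpace      = [VBAFSpace]::new("discrete",   2, 0.0, 1.0)
#           $this.Reset()
#       }
#       [double[]] Reset() {
#           $this.Counter     = 5
#           $this.Steps       = 0
#           $this.TotalReward = 0.0
#           $this.EpisodeCount++
#           return $this.GetState()
#       }
#       [double[]] GetState() { return @($this.Counter / 5.0) }
#       [hashtable] Step([int]$action) {
#           $this.Steps++
#           if ($action -eq 1) { $this.Counter-- }        # action 1 counts down
#           $reachedZero = ($this.Counter -le 0)
#           $done        = $reachedZero -or ($this.Steps -ge $this.MaxSteps)
#           $reward      = if ($reachedZero) { 1.0 } else { 0.0 }
#           $this.TotalReward += $reward
#           return @{ NextState = $this.GetState(); Reward = $reward; Done = $done }
#       }
#   }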

# ============================================================

# CARTPOLE ENVIRONMENT

# ============================================================

#

# THE PROBLEM:

# ------------

# A pole is attached to a cart that slides left and right.

# The agent pushes the cart left or right to keep the pole balanced.

# The episode ends when the pole falls too far or the cart goes off track.

#

# State: [position, velocity, angle, angular_velocity]

#   position: cart position on the track (-2.4 to +2.4)

#   velocity: cart speed (negative = moving left)

#   angle:    pole angle from vertical in radians (0 = upright)

#   angular_velocity: how fast the pole is rotating

#

# Reward: +1 for every step the pole stays balanced

# Episode ends if:

#   |position| > 2.4 (cart went off track)

#   |angle| > 0.21 radians (~12 degrees -- pole fell)

#   steps >= MaxSteps (success -- survived the full episode)

#

# A random agent survives ~10-20 steps.

# A trained DQN survives 200 steps (the maximum).

# This gap measures how much the agent learned.

#

# INHERITANCE NOTE:

# -----------------

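# "class CartPoleEnvironment : VBAFEnvironment" declares the inheritance;

# ": base("CartPole", $maxSteps)" on a constructor calls the parent class

# constructor first. Both are standard PowerShell class syntax (5.0 and later).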

class CartPoleEnvironment : VBAFEnvironment {

    [double] $Position

    [double] $Velocity

    [double] $Angle

    [double] $AngularVelocity

    hidden [System.Random] $Rng

    CartPoleEnvironment() : base("CartPole", 200) {

        $this.ObservationSpace = [VBAFSpace]::new("continuous", 4, -4.8, 4.8)

        $this.ActionSpace      = [VBAFSpace]::new("discrete",   2,  0.0, 1.0)

        $this.Rng              = [System.Random]::new()

        $this.Reset()

    }

    CartPoleEnvironment([int]$maxSteps) : base("CartPole", $maxSteps) {

        $this.ObservationSpace = [VBAFSpace]::new("continuous", 4, -4.8, 4.8)

        $this.ActionSpace      = [VBAFSpace]::new("discrete",   2,  0.0, 1.0)

        $this.Rng              = [System.Random]::new()

        $this.Reset()

    }

    CartPoleEnvironment([int]$maxSteps, [int]$seed) : base("CartPole", $maxSteps) {

        $this.ObservationSpace = [VBAFSpace]::new("continuous", 4, -4.8, 4.8)

        $this.ActionSpace      = [VBAFSpace]::new("discrete",   2,  0.0, 1.0)

        $this.Rng              = [System.Random]::new($seed)   # Fixed seed for reproducibility

        $this.Reset()

    }

    # Reset to a new random starting state.

    # Small perturbations (+-0.05) keep the agent from memorising one fixed action sequence.

    [double[]] Reset() {

        $this.Position        = ($this.Rng.NextDouble() - 0.5) * 0.1

        $this.Velocity        = ($this.Rng.NextDouble() - 0.5) * 0.1

        $this.Angle           = ($this.Rng.NextDouble() - 0.5) * 0.1

        $this.AngularVelocity = ($this.Rng.NextDouble() - 0.5) * 0.1

        $this.Steps           = 0

        $this.TotalReward     = 0.0

        $this.EpisodeCount++

        return $this.GetState()

    }

    [double[]] GetState() {

        return @($this.Position, $this.Velocity, $this.Angle, $this.AngularVelocity)

    }

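    # Apply physics equations for one time step (dt = 0.02 seconds).

    # These are the classic cart-pole equations also used by OpenAI Gym's CartPole,

    # integrated with one forward Euler step, but with a weaker push (force 1.0

    # vs Gym's 10.0) and a shorter pole (half-length 0.25 vs Gym's 0.5).

    # Treat as a black box -- the important thing is the interface, not the physics.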

    [hashtable] Step([int]$action) {

        $this.Steps++

        $force     = if ($action -eq 1) { 1.0 } else { -1.0 }

        $gravity   = 9.8

        $cartMass  = 1.0

        $poleMass  = 0.1

        $totalMass = $cartMass + $poleMass

        $halfLen   = 0.25

        $dt        = 0.02

        $cosA = [Math]::Cos($this.Angle)

        $sinA = [Math]::Sin($this.Angle)

        $temp = ($force + $poleMass * $halfLen * $this.AngularVelocity * $this.AngularVelocity * $sinA) / $totalMass

        $aAcc = ($gravity * $sinA - $cosA * $temp) / ($halfLen * (4.0/3.0 - $poleMass * $cosA * $cosA / $totalMass))

        $acc  = $temp - $poleMass * $halfLen * $aAcc * $cosA / $totalMass

        $this.Position        += $dt * $this.Velocity

        $this.Velocity        += $dt * $acc

        $this.Angle           += $dt * $this.AngularVelocity

        $this.AngularVelocity += $dt * $aAcc

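        # The episode FAILS if the cart leaves the track or the pole falls;

        # reaching MaxSteps ends it as a SUCCESS.

        $failed = ([Math]::Abs($this.Position) -gt 2.4) -or

                  ([Math]::Abs($this.Angle)    -gt 0.21)

        $done   = $failed -or ($this.Steps -ge $this.MaxSteps)

        # +1 for every step the pole stays balanced, including the final step

        # of a full-length episode, so the maximum return equals MaxSteps

        $reward = if (-not $failed) { 1.0 } else { 0.0 }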

        $this.TotalReward += $reward

        return @{ NextState = $this.GetState(); Reward = $reward; Done = $done }

    }

}
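
# REPRODUCIBILITY (sketch, assumes this file has been dot-sourced): the seeded
# constructor makes start states repeatable. The physics itself is
# deterministic, so two environments built with the same seed produce
# identical episodes when given the same actions:
#   $a = [CartPoleEnvironment]::new(200, 42)
#   $b = [CartPoleEnvironment]::new(200, 42)
#   ($a.Reset() -join ",") -eq ($b.Reset() -join ",")   # True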

# ============================================================

# GRIDWORLD ENVIRONMENT

# ============================================================

#

# THE PROBLEM:

# ------------

# A grid of Size x Size cells. The agent starts at a random cell

# and must navigate to a goal cell, which is also random.

#

# This tests whether an agent can learn SPATIAL REASONING --

# moving toward a target rather than away from it.

#

# State: [agent_row, agent_col, goal_row, goal_col] (normalised 0 to 1)

#   Normalisation makes all values the same scale for the neural network.

#   Raw row/col values (0-4 for a 5x5 grid) would be fine for Q-learning

#   but neural networks train better on normalised inputs.
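#
#   Worked example on the default 5x5 grid, where GetState() divides each
#   coordinate by GridSize - 1 = 4 (sketch, assumes this file is dot-sourced):
#     $gw = [GridWorldEnvironment]::new()
#     $gw.AgentRow = 2; $gw.AgentCol = 4; $gw.GoalRow = 0; $gw.GoalCol = 1
#     $gw.GetState()        # 0.5, 1, 0, 0.25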

#

# Actions: 0=up, 1=right, 2=down, 3=left

#   Agent cannot move off the grid (clamped to boundary).

#

# Reward:

#   +10 for reaching the goal

#   -0.1 for each step (encourages finding the SHORT path)

#   -1.0 if episode ends without reaching goal (timeout)

#

# REWARD SHAPING NOTE:

# --------------------

# The -0.1 step penalty is reward shaping -- we add domain knowledge

# to help the agent learn faster. Without it (and without discounting),

# a 50-step wander that stumbles onto the goal earns the same +10 as the

# shortest path, so the agent gets no signal that efficient paths are better.
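#
# Worked example: an episode that reaches the goal in n steps returns
# 10 - 0.1 * (n - 1), because every step except the last costs 0.1. So:
#   3-step path :  10 - 0.1 * 2 = 9.8
#   10-step path:  10 - 0.1 * 9 = 9.1
# and the shorter path is strictly preferred.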

class GridWorldEnvironment : VBAFEnvironment {

    [int] $GridSize

    [int] $AgentRow

    [int] $AgentCol

    [int] $GoalRow

    [int] $GoalCol

    hidden [System.Random] $Rng

    GridWorldEnvironment() : base("GridWorld", 100) {

        $this.GridSize         = 5

        $this.ObservationSpace = [VBAFSpace]::new("continuous", 4, 0.0, 1.0)

        $this.ActionSpace      = [VBAFSpace]::new("discrete",   4, 0.0, 3.0)

        $this.Rng              = [System.Random]::new()

        $this.Reset()

    }

    GridWorldEnvironment([int]$gridSize, [int]$maxSteps) : base("GridWorld", $maxSteps) {

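        # Need at least 2 cells: GetState() divides by GridSize - 1, and Reset()

        # needs room to place the agent and the goal in different cells

        if ($gridSize -lt 2) { throw "GridSize must be at least 2 (got $gridSize)" }

        $this.GridSize         = $gridSize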

        $this.ObservationSpace = [VBAFSpace]::new("continuous", 4, 0.0, 1.0)

        $this.ActionSpace      = [VBAFSpace]::new("discrete",   4, 0.0, 3.0)

        $this.Rng              = [System.Random]::new()

        $this.Reset()

    }

    [double[]] Reset() {

        $this.AgentRow    = $this.Rng.Next(0, $this.GridSize)

        $this.AgentCol    = $this.Rng.Next(0, $this.GridSize)

        $this.GoalRow     = $this.Rng.Next(0, $this.GridSize)

        $this.GoalCol     = $this.Rng.Next(0, $this.GridSize)

        # Ensure agent and goal are not in the same cell

        while ($this.AgentRow -eq $this.GoalRow -and $this.AgentCol -eq $this.GoalCol) {

            $this.GoalRow = $this.Rng.Next(0, $this.GridSize)

            $this.GoalCol = $this.Rng.Next(0, $this.GridSize)

        }

        $this.Steps       = 0

        $this.TotalReward = 0.0

        $this.EpisodeCount++

        return $this.GetState()

    }

    # Normalise positions to [0, 1] range for neural network compatibility

    [double[]] GetState() {

        [int]      $g   = $this.GridSize - 1

        [double[]] $arr = @(0.0, 0.0, 0.0, 0.0)

        $arr[0] = $this.AgentRow / $g

        $arr[1] = $this.AgentCol / $g

        $arr[2] = $this.GoalRow  / $g

        $arr[3] = $this.GoalCol  / $g

        return $arr

    }

    [hashtable] Step([int]$action) {

        $this.Steps++

        $newRow = $this.AgentRow

        $newCol = $this.AgentCol

        switch ($action) {

            0 { $newRow-- }   # up

            1 { $newCol++ }   # right

            2 { $newRow++ }   # down

            3 { $newCol-- }   # left

        }

        # Clamp to grid boundaries -- agent cannot walk off the edge

        $newRow = [Math]::Max(0, [Math]::Min($this.GridSize - 1, $newRow))

        $newCol = [Math]::Max(0, [Math]::Min($this.GridSize - 1, $newCol))

        $this.AgentRow = $newRow

        $this.AgentCol = $newCol

        $atGoal = ($this.AgentRow -eq $this.GoalRow -and $this.AgentCol -eq $this.GoalCol)

        $done   = $atGoal -or ($this.Steps -ge $this.MaxSteps)

        $reward = if ($atGoal) { 10.0 } elseif ($done) { -1.0 } else { -0.1 }

        $this.TotalReward += $reward

        return @{ NextState = $this.GetState(); Reward = $reward; Done = $done }

    }

}

# ============================================================

# RANDOM WALK ENVIRONMENT

# ============================================================

#

# THE PROBLEM:

# ------------

# The simplest possible RL problem -- a 1D number line.

# Agent starts at a random position and must reach position 0 (center).

#

# State:  [position / range]  -- one number between -1 and +1

# Actions: 0=move left, 1=move right

#

# Reward: +10 for reaching 0, else -(distance from center) * 0.1

#   The distance-based penalty guides the agent toward the center.

#

# USE THIS ENVIRONMENT FOR:

# -------------------------

# Debugging a new agent implementation before trying CartPole.

# If an agent cannot solve RandomWalk, it will not solve anything harder.

# Much faster to train -- useful for quick sanity checks.
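#
# EXAMPLE (sketch, assumes this file has been dot-sourced): a hand-coded
# "step toward 0" policy solves RandomWalk in at most Range (10) steps. That
# makes it a useful reference point when checking whether a new agent learns.
#   $rw    = [RandomWalkEnvironment]::new()
#   $state = $rw.Reset()
#   do {
#       $action = if ($state[0] -gt 0) { 0 } else { 1 }   # 0 = left, 1 = right
#       $result = $rw.Step($action)
#       $state  = $result.NextState
#   } until ($result.Done)
#   $rw.Steps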

class RandomWalkEnvironment : VBAFEnvironment {

    [int] $Position

    [int] $Range

    hidden [System.Random] $Rng

    RandomWalkEnvironment() : base("RandomWalk", 50) {

        $this.Range            = 10

        $this.ObservationSpace = [VBAFSpace]::new("continuous", 1, -1.0, 1.0)

        $this.ActionSpace      = [VBAFSpace]::new("discrete",   2,  0.0, 1.0)

        $this.Rng              = [System.Random]::new()

        $this.Reset()

    }

    [double[]] Reset() {

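        # Next(min, max) excludes max, so use Range + 1 to allow +Range; redraw 0

        # so the episode never starts at the goal

        do {

            $this.Position = $this.Rng.Next(-$this.Range, $this.Range + 1)

        } while ($this.Position -eq 0)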

        $this.Steps       = 0

        $this.TotalReward = 0.0

        $this.EpisodeCount++

        return $this.GetState()

    }

    [double[]] GetState() {

        return @([double]$this.Position / $this.Range)

    }

    [hashtable] Step([int]$action) {

        $this.Steps++

        if ($action -eq 0) { $this.Position-- } else { $this.Position++ }

        $this.Position = [Math]::Max(-$this.Range, [Math]::Min($this.Range, $this.Position))

        $atCenter = ($this.Position -eq 0)

        $done     = $atCenter -or ($this.Steps -ge $this.MaxSteps)

        $reward   = if ($atCenter) { 10.0 } else { -[Math]::Abs($this.Position) * 0.1 }

        $this.TotalReward += $reward

        return @{ NextState = $this.GetState(); Reward = $reward; Done = $done }

    }

}

# ============================================================

# ENVIRONMENT FACTORY

# ============================================================

#

# New-VBAFEnvironment is a FACTORY FUNCTION -- it creates the right

# environment type based on a name string.

#

# This lets you write code like:

#   $env = New-VBAFEnvironment -Name "CartPole"

# and swap environments by changing just the name -- no code restructuring needed.

#

# Factory functions are a common design pattern when you have multiple

# classes that share the same interface but different implementations.
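#
# An unknown name prints the available environments and returns $null, so a
# caller can check the result (sketch; "MountainCar" is simply a name this
# factory does not know):
#   $env = New-VBAFEnvironment -Name "MountainCar"
#   if ($null -eq $env) { $env = New-VBAFEnvironment -Name "CartPole" -Seed 42 }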

function New-VBAFEnvironment {

    param(

        [string] $Name     = "CartPole",   # Which environment to create

        [int]    $MaxSteps = 200,           # Episode length (RandomWalk ignores this and uses 50)

        [int]    $GridSize = 5,             # GridWorld only: grid dimensions

        [int]    $Seed     = -1             # CartPole only: fixed random seed (-1 = random)

    )

    switch ($Name) {

        "CartPole" {

            if ($Seed -ge 0) { return [CartPoleEnvironment]::new($MaxSteps, $Seed) }

            return [CartPoleEnvironment]::new($MaxSteps)

        }

        "GridWorld"  { return [GridWorldEnvironment]::new($GridSize, $MaxSteps) }

        "RandomWalk" { return [RandomWalkEnvironment]::new() }

        default {

            Write-Host "  Unknown environment: $Name" -ForegroundColor Red

            Write-Host "  Available: CartPole, GridWorld, RandomWalk" -ForegroundColor Yellow

            return $null

        }

    }

}

# ============================================================

# REWARD SHAPER WRAPPER

# ============================================================

#

# A wrapper that modifies rewards from any environment.

# Useful for experimenting with reward engineering without

# changing the environment itself.

#

# Scale:       multiply all rewards by this factor

# Clip:        clamp rewards to [-Clip, +Clip] (0 = no clipping)

# StepPenalty: subtract this from every reward (encourages shorter episodes)

#

# REWARD CLIPPING NOTE:

# ---------------------

# DeepMind clipped rewards in the Nature DQN paper (Mnih et al., 2015):

# positive rewards became +1 and negative rewards -1. This helped the same

# hyperparameters work across all 49 Atari games despite very different

# score scales.
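#
# Worked example: wrapping GridWorld with -Clip 1 leaves the -0.1 step reward
# unchanged but clips the +10 goal reward to +1. The goal is then worth 10
# step penalties instead of 100, so clipping changes the balance between
# rewards, not just their overall scale:
#   [Math]::Max(-1.0, [Math]::Min(1.0, 10.0 * 1.0 - 0.0))    # 1
#   [Math]::Max(-1.0, [Math]::Min(1.0, -0.1 * 1.0 - 0.0))    # -0.1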

function New-RewardShaper {

    param(

        [object] $Environment,

        [double] $Scale       = 1.0,

        [double] $Clip        = 0.0,    # 0 = no clipping

        [double] $StepPenalty = 0.0     # penalty per step

    )

    return @{

        Env         = $Environment

        Scale       = $Scale

        Clip        = $Clip

        StepPenalty = $StepPenalty

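        # GetNewClosure() captures $Environment, $Scale, $Clip and $StepPenalty;

        # without it the script blocks cannot see them once this function returns.

        # Invoke with the call operator, e.g.:  & $shaper.Step 1

        Reset       = { $Environment.Reset() }.GetNewClosure()

        GetState    = { $Environment.GetState() }.GetNewClosure()

        Step        = {

            param([int]$action)

            $result = $Environment.Step($action)

            $r      = $result.Reward * $Scale - $StepPenalty

            if ($Clip -gt 0) { $r = [Math]::Max(-$Clip, [Math]::Min($Clip, $r)) }

            return @{ NextState = $result.NextState; Reward = $r; Done = $result.Done }

        }.GetNewClosure()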

    }

}

# ============================================================

# BENCHMARKING UTILITY

# ============================================================

#

# Invoke-VBAFBenchmark runs a trained agent (or a random baseline)

# on an environment for N episodes and reports average performance.

#

# WHY BENCHMARK

# --------------

# Training reward is noisy -- exploration adds random actions.

# Benchmark reward is cleaner -- Predict() should return the agent's best (greedy)

# action, though random start states still add some variance, so use enough episodes.

# Compare benchmark reward BEFORE and AFTER training to measure improvement.

#

# RANDOM BASELINE:

# ----------------

# Pass Agent=$null to get a random agent baseline.

# If your trained agent does not beat random, something is wrong.

# Random baseline for CartPole is typically 10-20 reward.

# A trained DQN should reach 150-200 reward.
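#
# AGENT CONTRACT (sketch, assumes this file has been dot-sourced): any object
# with a Predict(state) method that returns an action index can be benchmarked.
# Here a hand-written RandomWalk policy is attached to a PSCustomObject with
# Add-Member and compared with the random baseline:
#   $env    = New-VBAFEnvironment -Name "RandomWalk"
#   $policy = [pscustomobject]@{}
#   $policy | Add-Member -MemberType ScriptMethod -Name Predict -Value {
#       param([double[]]$state)
#       if ($state[0] -gt 0) { 0 } else { 1 }      # always step toward 0
#   }
#   $scripted = Invoke-VBAFBenchmark -Agent $policy -Environment $env -Episodes 20 -Label "Scripted"
#   $random   = Invoke-VBAFBenchmark -Agent $null   -Environment $env -Episodes 20 -Label "Random"
#   $scripted.Avg - $random.Avg      # a positive gap means the policy beats random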

function Invoke-VBAFBenchmark {

    param(

        [object] $Agent,

        [object] $Environment,

        [ValidateRange(1, [int]::MaxValue)]

        [int]    $Episodes = 10,            # at least 1: ms/episode divides by it

        [string] $Label    = "Benchmark"

    )

    Write-Host ""

    Write-Host "  $Label" -ForegroundColor Yellow

    Write-Host "   Episodes : $Episodes" -ForegroundColor Cyan

    $rewards    = [System.Collections.Generic.List[double]]::new()

    $timer      = [System.Diagnostics.Stopwatch]::StartNew()

    $rng        = [System.Random]::new()

    $actionSize = $Environment.ActionSpace.Size

    $useRandom  = ($null -eq $Agent)

    if ($useRandom) {

        Write-Host "   Agent    : Random baseline" -ForegroundColor DarkYellow

    } else {

        Write-Host "   Agent    : $($Agent.GetType().Name)" -ForegroundColor DarkYellow

    }

    for ($ep = 1; $ep -le $Episodes; $ep++) {

        $state       = $Environment.Reset()

        $totalReward = 0.0

        $done        = $false

        while (-not $done) {

            if ($useRandom) {

                $action = $rng.Next(0, $actionSize)

            } else {

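                # If Predict() throws, fall back to a random action so the run

                # continues (note: this can make a broken agent look like random)

                try   { $action = $Agent.Predict($state) }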

                catch { $action = $rng.Next(0, $actionSize) }

            }

            $result       = $Environment.Step($action)

            $state        = $result.NextState

            $totalReward += $result.Reward

            $done         = $result.Done

        }

        $rewards.Add($totalReward)

    }

    $timer.Stop()

    $avg = ($rewards | Measure-Object -Average).Average

    $max = ($rewards | Measure-Object -Maximum).Maximum

    $min = ($rewards | Measure-Object -Minimum).Minimum

    $ms  = $timer.ElapsedMilliseconds

    Write-Host ""

    Write-Host "  +--------------------------------------+" -ForegroundColor Yellow

    Write-Host ("  |  {0,-36}|" -f $Label)                  -ForegroundColor Yellow

    Write-Host "  +--------------------------------------+" -ForegroundColor Yellow

    Write-Host ("  |  Avg Reward : {0,-23}|" -f [Math]::Round($avg, 2)) -ForegroundColor White

    Write-Host ("  |  Max Reward : {0,-23}|" -f [Math]::Round($max, 2)) -ForegroundColor Green

    Write-Host ("  |  Min Reward : {0,-23}|" -f [Math]::Round($min, 2)) -ForegroundColor White

    Write-Host ("  |  Time (ms)  : {0,-23}|" -f $ms)                    -ForegroundColor Cyan

    Write-Host ("  |  ms/episode : {0,-23}|" -f [Math]::Round($ms / $Episodes, 1)) -ForegroundColor Cyan

    Write-Host "  +--------------------------------------+" -ForegroundColor Yellow

    Write-Host ""

    return @{ Avg = $avg; Max = $max; Min = $min; TimeMs = $ms }

}

# ============================================================

# QUICK REFERENCE

# ============================================================

#

# CREATE AN ENVIRONMENT:

#   $env = New-VBAFEnvironment -Name "CartPole"   -MaxSteps 200

#   $env = New-VBAFEnvironment -Name "GridWorld"  -GridSize 5 -MaxSteps 100

#   $env = New-VBAFEnvironment -Name "RandomWalk"

#

# INSPECT THE ENVIRONMENT:

#   $env.PrintInfo()

#

# RUN ONE EPISODE MANUALLY:

#   $state = $env.Reset()

#   while ($true) {

#       $action = 0   # or use an agent: $agent.Predict($state)

#       $result = $env.Step($action)

#       $state  = $result.NextState

#       if ($result.Done) { break }

#   }

#

# BENCHMARK A TRAINED AGENT:

#   $dqn = (Invoke-DQNTraining -Episodes 100 -FastMode)[-1]

#   $env = New-VBAFEnvironment -Name "CartPole"

#   Invoke-VBAFBenchmark -Agent $dqn  -Environment $env -Episodes 20 -Label "DQN"

#   Invoke-VBAFBenchmark -Agent $null -Environment $env -Episodes 20 -Label "Random"

#

# SEE ALSO:

#   VBAF.RL.DQN.ps1            -- DQN agent that uses CartPoleEnvironment

#   VBAF.RL.QLearningAgent.ps1 -- Q-learning agent (good for GridWorld)

# ============================================================

Write-Host "  VBAF.RL.Environment.ps1 loaded" -ForegroundColor Green

Write-Host "   Classes      : VBAFSpace, VBAFEnvironment" -ForegroundColor Cyan

Write-Host "   Environments : CartPole, GridWorld, RandomWalk" -ForegroundColor Cyan

Write-Host "   Functions    : New-VBAFEnvironment, New-RewardShaper, Invoke-VBAFBenchmark" -ForegroundColor Cyan

Write-Host ""

Write-Host "   Quick start:" -ForegroundColor Yellow

Write-Host '   $env = New-VBAFEnvironment -Name "CartPole" -MaxSteps 200' -ForegroundColor White

Write-Host '   $env.PrintInfo()' -ForegroundColor White

Write-Host ""