Planner

Abstract base class for planners

This module contains the abstract base class “Planner” for planning algorithms, i.e. algorithms for computing the optimal state-action value function (and thus the optimal policy) for a given model (i.e. state transition and expected reward function). Subclasses of Planner must implement the “plan” method.

class resources.planner.planner.Planner(stateSpace, functionApproximator, gamma, actions, **kwargs)

Abstract base class for planners

Planning algorithms compute the optimal state-action value function (and thus the optimal policy) for a given model (i.e. state transition and expected reward function). Subclasses of Planner must implement the “plan” method.

New in version 0.9.9.

static create(plannerSpec, stateSpace, functionApproximator, gamma, actions, epsilon=0.0)

Factory method that creates a planner based on a spec-dictionary.

static getPlannerDict()

Returns a dict that maps planner names to planner classes.

setStates(states)

Sets the discrete states on which dynamic programming is performed.

Removes Q-values of state-action pairs that are no longer required.

NOTE: Setting states is only meaningful for discrete state sets where the TabularStorage function approximator is used.
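
The create and getPlannerDict methods together implement a simple spec-dictionary factory. The following self-contained sketch only illustrates that pattern; the "name" key, the stub planner class, and its registration are assumptions made for the example and do not reproduce the library’s actual spec format:

    class _StubPlanner:
        """Stands in for a concrete Planner subclass in this sketch."""
        def __init__(self, stateSpace, functionApproximator, gamma, actions, **kwargs):
            self.gamma = gamma
            self.actions = actions

        def plan(self):
            """Planner is abstract; concrete subclasses must implement plan()."""
            pass

    # Plays the role of getPlannerDict(): planner name -> planner class.
    _plannerDict = {"stub": _StubPlanner}

    def create(plannerSpec, stateSpace, functionApproximator, gamma, actions, epsilon=0.0):
        """Plays the role of Planner.create(): look up the class named in the
        spec-dictionary and instantiate it with the remaining arguments."""
        plannerClass = _plannerDict[plannerSpec["name"]]   # "name" key is an assumption
        return plannerClass(stateSpace, functionApproximator, gamma, actions, epsilon=epsilon)

    planner = create({"name": "stub"}, stateSpace=None, functionApproximator=None,
                     gamma=0.99, actions=[0, 1])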

MBDPS Planner

A planner module which is used in the MBDPS agent.

class resources.planner.mbdps_planner.MBDPSPlanner(gamma, planningEpisodes, evalsPerIndividual, fitnessFunction=estimatePolicyOutcome, **kwargs)

A planner module which is used in the MBDPS agent.

The planner uses a policy search method to search in the space of policies. Policies are evaluated based on the return they achieve in trajectories sampled from a supplied model. A policy’s fitness is set to the average return it obtains in several episodes starting from potentially different start states.

Parameters:
  • fitnessFunction The fitness function used to evaluate the policy. Defaults to the module’s function estimatePolicyOutcome.

New in version 0.9.9.

CONFIG DICT
gamma : The discounting factor.
planningEpisodes : The number of simulated episodes that can be conducted before planning is stopped.
evalsPerIndividual : The number of episodes over which each policy is evaluated.
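
As a rough, self-contained sketch of the fitness evaluation described above (average return over several episodes sampled from the model): the policy and model interfaces, the parameter names, and the episode-length cap below are assumptions for illustration and do not reproduce the actual signature of estimatePolicyOutcome:

    import random

    def estimatePolicyFitness(policy, sampleModel, startStates, gamma,
                              evalsPerIndividual, maxEpisodeLength=100):
        """Average discounted return of a policy over several episodes that are
        sampled from the supplied model, in the spirit of estimatePolicyOutcome."""
        returns = []
        for _ in range(evalsPerIndividual):
            state = random.choice(startStates)      # potentially different start states
            episodeReturn, discount = 0.0, 1.0
            for _ in range(maxEpisodeLength):
                action = policy(state)                  # assumed: policy maps state -> action
                state, reward, terminal = sampleModel(state, action)  # assumed model interface
                episodeReturn += discount * reward
                discount *= gamma
                if terminal:
                    break
            returns.append(episodeReturn)
        # The policy's fitness is the average return over all evaluation episodes.
        return sum(returns) / len(returns)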

Prioritized sweeping

Planning based on prioritized sweeping.

class resources.planner.prioritized_sweeping.PrioritizedSweepingPlanner(stateSpace, functionApproximator, gamma, actions, epsilon, minSweepDelta, updatesPerStep, **kwargs)

Planning based on prioritized sweeping.

A planner based on the prioritized sweeping algorithm that allows computing the optimal state-action value function (and thus the optimal policy) for a given distribution model (i.e. state transition and expected reward function). It is assumed that the MDP is finite and that the available actions are defined explicitly.

The following parameters must be passed to the constructor:
  • stateSpace The state space of the agent (must be finite)

  • functionApproximator The function approximator which handles storing the Q-Function

  • gamma The discount factor of the MDP

  • actions The actions available to the agent

New in version 0.9.9.

CONFIG DICT
minSweepDelta : The minimal TD error that is still processed during prioritized sweeping. If no change larger than minSweepDelta remains, the sweep is stopped.
updatesPerStep : The maximal number of updates that can be performed in one sweep.
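
The following self-contained sketch shows how such a sweep can be organised around a priority queue keyed by TD error, with minSweepDelta and updatesPerStep as stopping criteria. The dictionary-based Q-table, model, and predecessor structures are assumptions for the example; the actual planner stores the Q-function in the supplied function approximator:

    import heapq, itertools

    def prioritizedSweep(Q, model, predecessors, actions, gamma,
                         minSweepDelta, updatesPerStep):
        """One sweep of prioritized sweeping over a finite MDP.

        Q            : dict (state, action) -> value
        model        : dict (state, action) -> list of (probability, nextState, reward)
        predecessors : dict state -> set of (state, action) pairs that can lead to it
        """
        def backup(s, a):
            # Full expected backup of Q(s, a) under the distribution model.
            return sum(p * (r + gamma * max(Q[(s2, a2)] for a2 in actions))
                       for p, s2, r in model[(s, a)])

        counter = itertools.count()          # tie-breaker for the priority queue
        queue = []
        for (s, a) in Q:
            delta = abs(backup(s, a) - Q[(s, a)])
            if delta > minSweepDelta:
                heapq.heappush(queue, (-delta, next(counter), (s, a)))

        updates = 0
        while queue and updates < updatesPerStep:
            _, _, (s, a) = heapq.heappop(queue)
            Q[(s, a)] = backup(s, a)
            updates += 1
            # Re-prioritise predecessors whose backup targets may have changed.
            for (sp, ap) in predecessors.get(s, ()):
                delta = abs(backup(sp, ap) - Q[(sp, ap)])
                if delta > minSweepDelta:
                    heapq.heappush(queue, (-delta, next(counter), (sp, ap)))
        return Q

Backing up the state-action pairs with the largest remaining TD error first is the defining design choice of prioritized sweeping.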

Trajectory sampling

Planning based on trajectory sampling.

This module contains a planner based on the trajectory sampling algorithm that allows computing an improved state-action value function (and thus policy) for a given sample model. The MDP’s state space need not be discrete and finite, but it is assumed that there is only a finite number of actions that are defined explicitly.

class resources.planner.trajectory_sampling.TrajectorySamplingPlanner(stateSpace, functionApproximator, gamma, actions, epsilon, maxTrajectoryLength, updatesPerStep, onPolicy, **kwargs)

Planning based on trajectory sampling.

A planner based on the trajectory sampling algorithm that allows computing an improved state-action value function (and thus policy) for a given sample model. The MDP’s state space need not be discrete and finite, but it is assumed that there is only a finite number of actions that are defined explicitly.

The following parameters must be passed to the constructor:
  • stateSpace The state space of the agent

  • functionApproximator The function approximator which handles storing the Q-Function

  • gamma The discount factor of the MDP

  • actions The actions available to the agent

New in version 0.9.9.

CONFIG DICT
maxTrajectoryLength : The maximal length of a trajectory before a new trajectory is started.
updatesPerStep : The maximal number of updates that can be performed in one planning call.
onPolicy : Whether the trajectory is sampled from the on-policy distribution.
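
A rough, self-contained sketch of one planning call in this spirit is given below. The sample-model interface, the dictionary-based Q-table, the learning rate, and the interpretation of onPolicy (epsilon-greedy versus uniformly random action selection) are assumptions for illustration; the actual planner trains the supplied function approximator instead:

    import random

    def trajectorySamplingStep(Q, sampleModel, startState, actions, gamma, epsilon,
                               maxTrajectoryLength, updatesPerStep,
                               onPolicy=True, learningRate=0.1):
        """Sample trajectories from the model and update Q along the visited states.

        Q           : dict (state, action) -> value
        sampleModel : callable (state, action) -> (nextState, reward, terminal)
        """
        updates = 0
        while updates < updatesPerStep:
            state, steps = startState, 0
            while steps < maxTrajectoryLength and updates < updatesPerStep:
                if onPolicy and random.random() >= epsilon:
                    # Greedy part of the epsilon-greedy policy derived from Q.
                    action = max(actions, key=lambda a: Q.get((state, a), 0.0))
                else:
                    # Exploratory choice: uniformly random action.
                    action = random.choice(actions)
                nextState, reward, terminal = sampleModel(state, action)
                # Sample-based update towards the one-step greedy target.
                target = reward if terminal else reward + gamma * max(
                    Q.get((nextState, a), 0.0) for a in actions)
                oldValue = Q.get((state, action), 0.0)
                Q[(state, action)] = oldValue + learningRate * (target - oldValue)
                updates += 1
                steps += 1
                if terminal:
                    break
                state = nextState
        return Q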

Value iteration

Planning based on value iteration.

This module contains a planner based on the value iteration algorithm that allows computing the optimal state-action value function (and thus the optimal policy) for a given distribution model (i.e. state transition and expected reward function). It is assumed that the MDP is finite and that the available actions are defined explicitly.

class resources.planner.value_iteration.ValueIterationPlanner(stateSpace, functionApproximator, gamma, actions, epsilon, minimalBellmanError, maxIterations, **kwargs)

Planning based on value iteration.

A planner based on the value iteration algorithm that allows computing the optimal state-action value function (and thus the optimal policy) for a given distribution model (i.e. state transition and expected reward function). It is assumed that the MDP is finite and that the available actions are defined explicitly.

The following parameters must be passed to the constructor:
  • stateSpace The state space of the agent (must be finite)

  • functionApproximator The function approximator which handles storing the Q-Function

  • gamma The discount factor of the MDP

  • actions The actions available to the agent

New in version 0.9.9.

CONFIG DICT
minimalBellmanError : The minimal Bellman error (sum of TD errors over all state-action pairs) that enforces another iteration. If the Bellman error falls below this threshold, value iteration is stopped.
maxIterations : The maximum number of iterations before value iteration is stopped.
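
A minimal, self-contained sketch of Q-value iteration with these two stopping criteria, assuming a dictionary-based Q-table and a distribution model given as a callable (the actual planner stores the Q-function in the supplied function approximator):

    def valueIteration(states, actions, model, gamma,
                       minimalBellmanError, maxIterations):
        """Q-value iteration on a finite MDP.

        model : callable (state, action) -> list of (probability, nextState, reward)
        Returns a dict (state, action) -> Q-value.
        """
        Q = {(s, a): 0.0 for s in states for a in actions}
        for _ in range(maxIterations):
            bellmanError = 0.0
            for s in states:
                for a in actions:
                    # Expected one-step backup under the distribution model.
                    target = sum(p * (r + gamma * max(Q[(s2, a2)] for a2 in actions))
                                 for p, s2, r in model(s, a))
                    bellmanError += abs(target - Q[(s, a)])
                    Q[(s, a)] = target
            # Stop once the summed TD error over all state-action pairs is small enough.
            if bellmanError < minimalBellmanError:
                break
        return Q

The sketch updates Q in place (Gauss-Seidel style) rather than keeping a separate copy of the old values, which typically does not hurt and often speeds up convergence.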