Note
This list of agents has been automatically generated from the source code.
Package that contains all available MMLF agents.
Agent ‘Model-based Direct Policy Search‘ from module mbdps_agent implemented in class MBDPS_Agent.
An agent that uses the observed state-action-reward-successor-state transitions to learn a model of the environment. It performs direct policy search in this model (similar to the Direct Policy Search agent, using a black-box optimization algorithm to optimize the parameters of a parameterized policy) in order to optimize a criterion defined by a fitness function. This fitness function can be, e.g., the estimated accumulated reward obtained by the policy in the model environment. To enforce exploration, the model is wrapped to exhibit RMax-like behavior: it returns the reward RMax for all states that have not been sufficiently explored. RMax should be an upper bound on the actually achievable return in order to enforce optimism in the face of uncertainty.
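A minimal sketch of the RMax-style model wrapping described above (the class and method names, such as visitCount and predictReward, are hypothetical and not the actual MMLF API):

    class RMaxModelWrapper(object):
        """Wraps a learned model and returns the optimistic reward RMax
        for state-action pairs that have not been sufficiently explored."""

        def __init__(self, model, rMax, minExplorationCount=5):
            self.model = model                       # learned environment model
            self.rMax = rMax                         # upper bound on achievable return
            self.minExplorationCount = minExplorationCount

        def predictReward(self, state, action):
            # Optimism in the face of uncertainty: insufficiently explored
            # state-action pairs look maximally rewarding to the planner.
            if self.model.visitCount(state, action) < self.minExplorationCount:
                return self.rMax
            return self.model.predictReward(state, action)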
Agent ‘Monte-Carlo‘ from module monte_carlo_agent implemented in class MonteCarloAgent.
An agent which uses Monte Carlo policy evaluation to optimize its behavior in a given environment.
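A minimal every-visit Monte Carlo estimate of Q-values from complete episodes, illustrating the idea only (the episode format is an assumption, not the MMLF interface):

    from collections import defaultdict

    def monte_carlo_q_estimate(episodes, gamma=1.0):
        # Each episode is assumed to be a list of (state, action, reward) tuples.
        returns = defaultdict(list)
        for episode in episodes:
            G = 0.0
            # Walk the episode backwards, accumulating the discounted return.
            for state, action, reward in reversed(episode):
                G = reward + gamma * G
                returns[(state, action)].append(G)
        # Q(s, a) is the average of the returns observed after (s, a).
        return {sa: sum(gs) / len(gs) for sa, gs in returns.items()}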
Agent ‘Dyna TD‘ from module dyna_td_agent implemented in class DynaTDAgent.
Dyna-TD combines temporal difference learning with learning a model of the environment and planning in this model.
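The planning component can be illustrated with a Dyna-Q-style sketch on a tabular Q-function (simplified; the model format is an assumption and this is not the actual DynaTDAgent code):

    import random

    def dyna_planning(q, model, alpha=0.5, gamma=0.9, planning_steps=10):
        # `model` is assumed to map observed (state, action) pairs to a
        # (reward, next_state, next_actions) outcome learned from experience,
        # where next_actions are the actions available in next_state.
        for _ in range(planning_steps):
            state, action = random.choice(list(model.keys()))
            reward, next_state, next_actions = model[(state, action)]
            # Q-learning backup on simulated experience from the model.
            best_next = max(q.get((next_state, a), 0.0) for a in next_actions)
            current = q.get((state, action), 0.0)
            q[(state, action)] = current + alpha * (reward + gamma * best_next - current)
        return q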
Agent ‘Temporal Difference + Eligibility‘ from module td_lambda_agent implemented in class TDLambdaAgent.
An agent that uses temporal difference learning (e.g. Sarsa) with eligibility traces and function approximation (e.g. a linear tile-coding CMAC) to optimize its behavior in a given environment.
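A tabular Sarsa(lambda) update with accumulating eligibility traces, as a minimal illustration of the idea (the actual agent additionally supports function approximators such as a CMAC; the function shown is not the MMLF API):

    def sarsa_lambda_update(q, traces, s, a, r, s_next, a_next,
                            alpha=0.1, gamma=0.99, lam=0.9):
        # `q` and `traces` are assumed to be dicts mapping (state, action)
        # pairs to floats (e.g. collections.defaultdict(float)).
        delta = r + gamma * q.get((s_next, a_next), 0.0) - q.get((s, a), 0.0)
        traces[(s, a)] = traces.get((s, a), 0.0) + 1.0     # accumulating trace
        for key in list(traces):
            q[key] = q.get(key, 0.0) + alpha * delta * traces[key]
            traces[key] *= gamma * lam                     # decay all traces
        return q, traces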
Agent ‘Policy Replay‘ from module policy_replay_agent implemented in class PolicyReplayAgent.
Agent which loads a stored policy and follows it without improving it.
Agent ‘Random‘ from module random_agent implemented in class RandomAgent.
Agent ‘Actor Critic‘ from module actor_critic_agent implemented in class ActorCriticAgent.
This agent learns based on the actor-critic architecture. It uses standard TD(lambda) to learn the value function of the critic and therefore subclasses TDLambdaAgent. The main difference to TD(lambda) lies in action selection: instead of deriving an epsilon-greedy policy from its Q-function, it learns an explicit stochastic policy. To this end, it maintains preferences for each action in each state. These preferences are updated after each action execution according to the following rule:
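A standard form of such a preference update (the classical actor-critic formulation; the exact rule used by the MMLF agent may differ in its details) is:

    p(s_t, a_t) \leftarrow p(s_t, a_t) + \beta \, \delta_t,
    \qquad \delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)

where beta is a learning rate and delta_t is the critic's TD error; in the classical formulation, actions are then drawn from a Gibbs/softmax distribution over these preferences.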
Agent ‘RoundRobin‘ from module example_agent implemented in class ExampleAgent.
Agent ‘Direct Policy Search‘ from module dps_agent implemented in class DPS_Agent.
This agent uses a black-box optimization algorithm to optimize the parameters of a parametrized policy such that the accumulated (undiscounted) reward of the policy is maximized.
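The overall search loop can be sketched as follows (the optimizer interface shown, getParameters/tellFitness, and the evaluate_policy helper are hypothetical, not the MMLF API):

    def direct_policy_search(evaluate_policy, optimizer, num_evaluations=100):
        # `evaluate_policy(params)` is assumed to roll out the parameterized
        # policy in the environment and return its accumulated reward.
        best_params, best_fitness = None, float("-inf")
        for _ in range(num_evaluations):
            params = optimizer.getParameters()    # candidate policy parameters
            fitness = evaluate_policy(params)     # accumulated (undiscounted) reward
            optimizer.tellFitness(fitness)        # feedback for the black-box optimizer
            if fitness > best_fitness:
                best_params, best_fitness = params, fitness
        return best_params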
Agent ‘Fitted-RMax‘ from module fitted_r_max_agent implemented in class FittedRMaxAgent.
Fitted R-Max is a model-based RL algorithm that uses the RMax heuristic for exploration control, a fitted function approximator (although this can be configured differently), and dynamic programming (accelerated by prioritized sweeping) to derive a value function from the model. Fitted R-Max is usually very sample-efficient (i.e. it learns a good policy from only a few interactions with the environment) but requires a large amount of computational resources.
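The planning step can be illustrated with a simplified prioritized-sweeping sketch on a tabular model (this only conveys the idea; Fitted R-Max itself works with a fitted, function-approximated model, and the data structures shown are assumptions):

    import heapq

    def prioritized_sweeping(values, model, predecessors, changed_state,
                             gamma=0.95, theta=1e-3, max_updates=1000):
        # `model[s]` is assumed to map each action to a (reward, next_state)
        # pair and `predecessors[s]` to list the states leading into s.
        queue = [(0.0, changed_state)]
        updates = 0
        while queue and updates < max_updates:
            _, s = heapq.heappop(queue)
            outcomes = model.get(s, {})
            if not outcomes:
                continue
            old_value = values.get(s, 0.0)
            # Bellman backup over the learned model.
            values[s] = max(r + gamma * values.get(s2, 0.0)
                            for r, s2 in outcomes.values())
            updates += 1
            change = abs(values[s] - old_value)
            if change > theta:
                # Predecessor values may now be stale; queue them with a
                # priority proportional to the change (negated for heapq).
                for p in predecessors.get(s, ()):
                    heapq.heappush(queue, (-change, p))
        return values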
Note
This list of environments has been automatically generated from the source code.
Package that contains all available MMLF world environments.
Environment ‘Maze Cliff‘ from module maze_cliff_environment implemented in class MazeCliffEnvironment.
In this maze, there are two alternative ways from the start to the goal state: a short way that leads along a dangerous cliff and a long but safe way. If the agent happens to step into the cliff area, it gets a large negative reward (configurable via cliffPenalty) and is reset to the start state. By default, the maze is deterministic, i.e. the agent always moves in the direction it chooses. However, the parameter stochasticity allows controlling the stochasticity of the environment: for instance, when stochasticity is set to 0.01, the agent performs a random move instead of the chosen one with probability 0.01.
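The effect of the stochasticity parameter can be sketched like this (illustrative only, not the actual implementation):

    import random

    def apply_stochasticity(chosen_action, all_actions, stochasticity=0.01):
        # With probability `stochasticity` a random action replaces the
        # action chosen by the agent; otherwise the choice is kept.
        if random.random() < stochasticity:
            return random.choice(all_actions)
        return chosen_action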
Environment ‘Pinball 2D‘ from module pinball_maze_environment implemented in class PinballMazeEnvironment.
The pinball maze environment class.
Environment ‘Linear Markov Chain‘ from module linear_markov_chain_environment implemented in class LinearMarkovChainEnvironment.
The agent starts in the middle of this linear Markov chain. It can either move right or left. The chain is deterministic, i.e. when the agent chooses to move right, the state is decreased by 1 with probability 1; when it chooses to move left, the state is increased by 1 accordingly.
Environment ‘Maze 2D‘ from module maze2d_environment implemented in class Maze2dEnvironment.
A 2D maze world, in which the agent occupies at each moment in time a certain field (specified by its (row, column) coordinate) and can move up, down, left, or right. The structure of the maze can be configured via a text-based config file.
Environment ‘Double Pole Balancing‘ from module double_pole_balancing_environment implemented in class DoublePoleBalancingEnvironment.
In the double pole balancing environment, the task of the agent is to control a cart such that two poles which are mounted on the cart stay in a nearly vertical position (to balance them). At the same time, the cart has to stay in a confined region.
Environment ‘Partial Observable Double Pole Balancing‘ from module po_double_pole_balancing_environment implemented in class PODoublePoleBalancingEnvironment.
In the partially observable double pole balancing environment, the task of the agent is to control a cart such that two poles which are mounted on the cart stay in a nearly vertical position (to balance them). At the same time, the cart has to stay in a confined region. In contrast to the fully observable double pole balancing environment, the agent only observes the current positions of the cart and the two poles but not their velocities. This renders the problem non-Markovian.
Environment ‘Mountain Car‘ from module mcar_env implemented in class MountainCarEnvironment.
In the mountain car environment, the agent has to control a car that is situated somewhere in a valley between two hills. The goal of the agent is to reach the top of the right hill. Unfortunately, the car's engine is not strong enough to reach the top of the hill directly from many start states. Thus, the car first has to drive in the opposite direction to gather enough potential energy.
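The classic mountain-car dynamics (Sutton and Barto's formulation; the MMLF environment may use different constants) show why swinging back first is necessary: the engine term 0.001*action is much weaker than the gravity term.

    import math

    def mountain_car_step(position, velocity, action):
        # action is -1 (full throttle backwards), 0 (coast) or +1 (full
        # throttle forwards); constants follow the classic formulation.
        velocity += 0.001 * action - 0.0025 * math.cos(3 * position)
        velocity = max(-0.07, min(0.07, velocity))
        position += velocity
        if position < -1.2:                 # inelastic collision at the left wall
            position, velocity = -1.2, 0.0
        return position, velocity           # goal reached when position >= 0.5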
Environment ‘Single Pole Balancing‘ from module single_pole_balancing_environment implemented in class SinglePoleBalancingEnvironment.
In the single pole balancing environment, the task of the agent is to control a cart such that a pole which is mounted on the cart stays in a nearly vertical position (to balance it). At the same time, the cart has to stay in a confined region.
Environment ‘Seventeen and Four‘ from module seventeen_and_four implemented in class SeventeenAndFourEnvironment.
This environment implements a simplified form of the card game seventeen & four, in which the agent takes the role of the player and plays against a hard-coded dealer.