Package that contains all available MMLF agents.
Agent "Model-based Direct Policy Search" from module mbdps_agent, implemented in class MBDPS_Agent.
An agent that uses the state-action-reward-successor_state transitions to learn a model of the environment. It performs direct policy search (similar to the direct policy search agent, using a black-box optimization algorithm to optimize the parameters of a parameterized policy) in the model in order to optimize a criterion defined by a fitness function. This fitness function can be, e.g., the estimated accumulated reward obtained by this policy in the model environment. In order to enforce exploration, the model is wrapped for an RMax-like behavior so that it returns the reward RMax for all states that have not been sufficiently explored. RMax should be an upper bound on the actual achievable return in order to enforce optimism in the face of uncertainty.
Agent "Monte-Carlo" from module monte_carlo_agent, implemented in class MonteCarloAgent.
An agent which uses Monte Carlo policy evaluation to optimize its behavior in a given environment.
Agent "Dyna TD" from module dyna_td_agent, implemented in class DynaTDAgent.
Dyna-TD uses temporal difference learning along with learning a model of the environment and planning in it.
Agent "Temporal Difference + Eligibility" from module td_lambda_agent, implemented in class TDLambdaAgent.
An agent that uses temporal difference learning (e.g. Sarsa) with eligibility traces and function approximation (e.g. linear tile coding CMAC) to optimize its behavior in a given environment.
Agent "Policy Replay" from module policy_replay_agent, implemented in class PolicyReplayAgent.
Agent which loads a stored policy and follows it without improving it.
Agent "Random" from module random_agent, implemented in class RandomAgent.
Agent "Actor Critic" from module actor_critic_agent, implemented in class ActorCriticAgent.
This agent learns based on the actor-critic architecture. It uses standard TD(lambda) to learn the value function of the critic. For this reason, it subclasses TDLambdaAgent. The main difference from TD(lambda) is the means of action selection. Instead of deriving an epsilon-greedy policy from its Q-function, it learns an explicit stochastic policy. To this end, it maintains preferences for each action in each state, which are updated after each action execution based on the TD error (the exact update rule and action-selection distribution are given in the actor_critic_agent module documentation below).
Agent "RoundRobin" from module example_agent, implemented in class ExampleAgent.
Agent "Direct Policy Search" from module dps_agent, implemented in class DPS_Agent.
This agent uses a black-box optimization algorithm to optimize the parameters of a parametrized policy such that the accumulated (undiscounted) reward of the policy is maximized.
Agent "Fitted-RMax" from module fitted_r_max_agent, implemented in class FittedRMaxAgent.
Fitted R-Max is a model-based RL algorithm that uses the RMax heuristic for exploration control, a fitted function approximator (though this can be configured differently), and Dynamic Programming (boosted by prioritized sweeping) for deriving a value function from the model. Fitted R-Max is usually very sample-efficient (meaning that a good policy is learned with only a few interactions with the environment) but requires a large amount of computational resources.
Module for MMLF interface for agents
This module contains the AgentBase class that specifies the interface that all MMLF agents have to implement.
MMLF interface for agents
Each agent that should be used in the MMLF must be derived from this class and implement the following methods:
Interface Methods
setStateSpace:
    Informs the agent of the environment's state space
setActionSpace:
    Informs the agent of the environment's action space
setState:
    Informs the agent of the environment's current state
giveReward:
    Provides a reward to the agent
getAction:
    Requests the next action the agent wants to execute
nextEpisodeStarted:
    Informs the agent that the current episode has terminated and a new one has started.
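To make the interface concrete, here is a minimal schematic agent in Python. It is only a sketch: the actual AgentBase constructor arguments and the structure of the state, action, and space objects are not shown in this excerpt, so everything beyond the six method names is an assumption.

    import random


    class SketchAgent(object):
        """Schematic agent implementing the AgentBase interface methods.

        A real agent would derive from AgentBase; the data layout used here
        (a plain list of candidate actions) is assumed for illustration and
        is not the actual MMLF API.
        """

        def __init__(self, candidateActions):
            self.candidateActions = candidateActions
            self.state = None
            self.lastReward = None

        def setStateSpace(self, stateSpace):
            # Remember the environment's state space.
            self.stateSpace = stateSpace

        def setActionSpace(self, actionSpace):
            # Remember the environment's action space.
            self.actionSpace = actionSpace

        def setState(self, state):
            # The environment announces its current state.
            self.state = state

        def giveReward(self, reward):
            # Reward obtained for the most recently executed action.
            self.lastReward = reward

        def getAction(self):
            # Return the next action to execute; here: uniformly at random.
            return random.choice(self.candidateActions)

        def nextEpisodeStarted(self):
            # The previous episode terminated; reset per-episode bookkeeping.
            self.lastReward = None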
Requests the next action the agent wants to execute
Returns the optimal, greedy policy the agent has found so far
Provides a reward to the agent
Informs the agent that a new episode has started.
Informs the agent about the action space of the environment
More information about action spaces can be found in State and Action Spaces
Informs the agent of the environment’s current state.
If normalizeState is True, each state dimension is scaled to the value range (0, 1). More information about (valid) states can be found in State and Action Spaces
Informs the agent about the state space of the environment
More information about state spaces can be found in State and Action Spaces
Stores the agent’s policy in the given file by pickling it.
Pickles the agent’s policy and stores it in the file filePath. If the agent is based on value functions, a value function policy wrapper is used to obtain a policy object which is then stored.
If optimal==True, the agent stores the best policy it has found so far; if optimal==False, it stores its current (exploitation) policy.
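A brief usage sketch (the variable agent stands for any trained MMLF agent; the parameter names filePath and optimal are taken from the description above, while the exact signature and the file name are assumptions):

    # Store the best policy found so far, e.g. for later use by the
    # Policy Replay agent; the current exploitation policy would be
    # stored with optimal=False instead.
    agent.storePolicy(filePath="best_policy.pickle", optimal=True)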
Agent that learns based on the actor-critic architecture.
This module contains an agent that learns based on the actor-critic architecture. It uses standard TD(lambda) to learn the value function of the critic and updates the preferences of the actor based on the TD error.
Agent that learns based on the actor-critic architecture.
This agent learns based on the actor-critic architecture. It uses standard TD(lambda) to learn the value function of the critic. For this reason, it subclasses TDLambdaAgent. The main difference from TD(lambda) is the means of action selection. Instead of deriving an epsilon-greedy policy from its Q-function, it learns an explicit stochastic policy. To this end, it maintains preferences for each action in each state. These preferences are updated after each action execution according to the following rule:
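In the standard actor-critic formulation this update rule reads (the exact MMLF variant, e.g. an additional learning-rate factor, may differ):

    p(s_t, a_t) \leftarrow p(s_t, a_t) + \delta_t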
where delta is the TD error
Action selection is based on a Gibbs softmax distribution:
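Assuming the usual Gibbs/Boltzmann form over the preferences p(s, a):

    \pi(a \mid s) = \frac{\exp(p(s, a) / \tau)}{\sum_{b} \exp(p(s, b) / \tau)}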
where tau is a temperature parameter.
Note that even though preferences are stored in a function approximator, so that in principle action preferences could be generalized over the state space, continuous state spaces are not yet supported.
New in version 0.9.9: Added Actor-Critic agent
gamma:
    The discount factor for computing the return given the rewards
lambda:
    The eligibility trace decay rate
tau:
    Temperature parameter used in the Gibbs softmax distribution for action selection
minTraceValue:
    The minimum value of an entry in a trace that is considered to be relevant. If the eligibility falls below this value, it is set to 0 and the entry is no longer updated
update_rule:
    Whether the learning is on-policy or off-policy. Can be either “SARSA” (on-policy) or “WatkinsQ” (off-policy)
stateDimensionResolution:
    The default “resolution” the agent uses for every state dimension. Can be either an int (same resolution for each dimension) or a dict mapping dimension name to its resolution.
actionDimensionResolution:
    By default, the agent discretizes a continuous action space into this number of discrete actions
function_approximator:
    The function approximator used for representing the Q value function
preferences_approximator:
    The function approximator used for representing the action preferences (i.e. the policy)
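The following minimal sketch shows Gibbs softmax action selection over tabular preferences, purely to illustrate how tau influences action selection; it is not the MMLF implementation, which stores preferences in a function approximator:

    import math
    import random


    def gibbs_softmax_action(preferences, state, actions, tau=1.0):
        """Sample an action from exp(p(s,a)/tau) / sum_b exp(p(s,b)/tau).

        `preferences` is assumed to be a dict mapping (state, action) pairs
        to preference values (tabular case); missing entries default to 0.
        Larger tau makes the choice more uniform, smaller tau more greedy.
        """
        scaled = [preferences.get((state, a), 0.0) / tau for a in actions]
        max_scaled = max(scaled)                      # for numerical stability
        weights = [math.exp(s - max_scaled) for s in scaled]
        return random.choices(actions, weights=weights, k=1)[0]


    # Usage sketch with hypothetical preferences:
    prefs = {("s0", "left"): 1.0, ("s0", "right"): 0.5}
    action = gibbs_softmax_action(prefs, "s0", ["left", "right"], tau=0.5)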
Agent that performs direct search in the policy space to find a good policy
This agent uses a black-box optimization algorithm to optimize the parameters of a parametrized policy such that the accumulated (undiscounted) reward of the policy is maximized.
Agent that performs direct search in the policy space to find a good policy
This agent uses a black-box optimization algorithm to optimize the parameters of a parametrized policy such that the accumulated (undiscounted) reward of the policy is maximized.
policy_search:
    The method used for search of an optimal policy in the policy space. Defines policy parametrization and internally used black box optimization algorithm.
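As a schematic illustration of the idea (not MMLF's actual policy_search module), the sketch below optimizes policy parameters by black-box random search on the episode return; run_episode is a hypothetical stand-in for a rollout of the parameterized policy in the environment:

    import random


    def direct_policy_search(run_episode, num_params, iterations=100, noise=0.1):
        """Black-box random search: perturb the policy parameters and keep a
        perturbation whenever it improves the (undiscounted) episode return.

        `run_episode(params)` is assumed to execute one episode with a policy
        parameterized by `params` and return the accumulated reward.
        """
        best_params = [0.0] * num_params
        best_return = run_episode(best_params)
        for _ in range(iterations):
            candidate = [p + random.gauss(0.0, noise) for p in best_params]
            candidate_return = run_episode(candidate)
            if candidate_return > best_return:
                best_params, best_return = candidate, candidate_return
        return best_params, best_return


    # Usage sketch with a dummy evaluation function (stands in for running an
    # episode in the environment with the parameterized policy):
    def run_episode(params):
        return -sum((p - 0.3) ** 2 for p in params)   # toy objective

    params, ret = direct_policy_search(run_episode, num_params=3)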
The Dyna-TD agent module
This module contains the Dyna-TD agent class. It uses temporal difference learning along with learning a model of the environment and is based on the Dyna architecture.
Agent that learns based on the DYNA architecture.
Dyna-TD uses temporal difference learning along with learning a model of the environment and doing planning in it.
gamma:
    The discount factor for computing the return given the rewards
epsilon:
    Exploration rate. The probability that an action is chosen non-greedily, i.e. uniformly at random among all available actions
lambda:
    The eligibility trace decay rate
minTraceValue:
    The minimum value of an entry in a trace that is considered to be relevant. If the eligibility falls below this value, it is set to 0 and the entry is no longer updated
update_rule:
    Whether the learning is on-policy or off-policy. Can be either “SARSA” (on-policy) or “WatkinsQ” (off-policy)
stateDimensionResolution:
    The default “resolution” the agent uses for every state dimension. Can be either an int (same resolution for each dimension) or a dict mapping dimension name to its resolution.
actionDimensionResolution:
    By default, the agent discretizes a continuous action space into this number of discrete actions
planner:
    The algorithm used for planning, i.e. for optimizing the policy based on a learned model
model:
    The algorithm used for learning a model of the environment
function_approximator:
    The function approximator used for representing the Q value function
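The following tabular Dyna-Q-style sketch illustrates the interplay of TD learning, model learning, and planning; the actual Dyna-TD agent additionally uses eligibility traces, a configurable planner and model, and function approximation:

    import random
    from collections import defaultdict


    def dyna_step(Q, model, transition, actions, alpha=0.1, gamma=0.9,
                  planning_steps=10):
        """One Dyna-style step: a direct TD update from the real transition,
        model learning, and several planning updates from remembered
        transitions. Q is a defaultdict(float) keyed by (state, action);
        model maps (state, action) to the last observed (reward, next_state).
        A Q-learning-style (off-policy) update is used for simplicity.
        """
        def td_update(s, a, r, s2):
            best_next = max(Q[(s2, b)] for b in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

        state, action, reward, next_state = transition
        td_update(state, action, reward, next_state)      # learn from reality
        model[(state, action)] = (reward, next_state)     # learn the model
        for _ in range(planning_steps):                   # plan in the model
            (s, a), (r, s2) = random.choice(list(model.items()))
            td_update(s, a, r, s2)


    # Usage sketch with hypothetical states/actions:
    Q, model = defaultdict(float), {}
    dyna_step(Q, model, ("s0", 0, 1.0, "s1"), actions=[0, 1])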
Fitted R-Max agent
Fitted R-Max is a model-based RL algorithm that uses the RMax heuristic for exploration control, a fitted function approximator (though this can be configured differently), and Dynamic Programming (boosted by prioritized sweeping) for deriving a value function from the model. Fitted R-Max is usually very sample-efficient (meaning that a good policy is learned with only a few interactions with the environment) but requires a large amount of computational resources.
Fitted R-Max agent
Fitted R-Max is a model-based RL algorithm that uses the RMax heuristic for exploration control, a fitted function approximator (though this can be configured differently), and Dynamic Programming (boosted by prioritized sweeping) for deriving a value function from the model. Fitted R-Max is usually very sample-efficient (meaning that a good policy is learned with only a few interactions with the environment) but requires a large amount of computational resources.
See also
Nicholas K. Jong and Peter Stone, “Model-based function approximation in reinforcement learning”, in “Proceedings of the 6th International Joint Conference on Autonomous Agents and Multiagent Systems” Honolulu, Hawaii: ACM, 2007, 1-8, http://portal.acm.org/citation.cfm?id=1329125.1329242.
gamma:
    The discount factor for computing the return given the rewards
min_exploration_value:
    The agent explores in a state until the given exploration value (approx. number of exploratory actions in the proximity of the state-action pair) is reached for all actions
RMax:
    An upper bound on the achievable return an agent can obtain in a single episode
planner:
    The algorithm used for planning, i.e. for optimizing the policy based on a learned model
model:
    The algorithm used for learning a model of the environment
function_approximator:
    The function approximator used for representing the Q value function
actionDimensionResolution:
    By default, the agent discretizes a continuous action space into this number of discrete actions
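The exploration heuristic itself can be summarized in a few lines; this is a simplified tabular view of the idea, not the fitted version that the agent applies within its model and planner:

    def rmax_optimistic_value(q_estimate, exploration_value,
                              min_exploration_value, RMax):
        """R-Max heuristic: a state-action pair whose exploration value is
        still below min_exploration_value is valued optimistically with the
        upper bound RMax, which drives the planner to explore it; otherwise
        the model-based estimate is used."""
        if exploration_value < min_exploration_value:
            return RMax          # optimism in the face of uncertainty
        return q_estimate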
The Model-based Direct Policy Search agent
This module contains an agent that uses the state-action-reward-successor_state transitions to learn a model of the environment. It then performs direct policy search (similar to the direct policy search agent, using a black-box optimization algorithm to optimize the parameters of a parameterized policy) in the model in order to optimize a criterion defined by a fitness function. This fitness function can be, e.g., the estimated accumulated reward obtained by this policy in the model environment. In order to enforce exploration, the model is wrapped for an RMax-like behavior so that it returns the reward RMax for all states that have not been sufficiently explored.
The Model-based Direct Policy Search agent
An agent that uses the state-action-reward-successor_state transitions to learn a model of the environment. It performs direct policy search (similar to the direct policy search agent, using a black-box optimization algorithm to optimize the parameters of a parameterized policy) in the model in order to optimize a criterion defined by a fitness function. This fitness function can be, e.g., the estimated accumulated reward obtained by this policy in the model environment. In order to enforce exploration, the model is wrapped for an RMax-like behavior so that it returns the reward RMax for all states that have not been sufficiently explored. RMax should be an upper bound on the actual achievable return in order to enforce optimism in the face of uncertainty.
gamma:
    The discount factor for computing the return given the rewards
planning_episodes:
    The number of internally simulated episodes that are performed in one planning step
policy_search:
    The method used for search of an optimal policy in the policy space. Defines policy parametrization and internally used black box optimization algorithm.
model:
    The algorithm used for learning a model of the environment
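The sketch below shows how a fitness function of the kind described above could estimate a policy's quality from episodes simulated in the learned model, with optimistic RMax rewards for insufficiently explored states. All callables are hypothetical stand-ins, not MMLF APIs:

    def estimate_policy_fitness(policy, sample_initial_state, sample_transition,
                                planning_episodes, max_steps, gamma, RMax,
                                is_explored):
        """Fitness of a parameterized policy: mean discounted return over
        `planning_episodes` episodes simulated in the learned model.

        `sample_initial_state()` draws a start state, `sample_transition(s, a)`
        returns a (reward, next_state, terminal) triple, and `is_explored(s)`
        says whether a state has been sufficiently explored; unexplored states
        yield the optimistic reward RMax, as described above.
        """
        total_return = 0.0
        for _ in range(planning_episodes):
            state = sample_initial_state()
            discount = 1.0
            for _ in range(max_steps):
                action = policy(state)
                reward, next_state, terminal = sample_transition(state, action)
                if not is_explored(next_state):
                    reward = RMax            # RMax-like optimistic wrapping
                total_return += discount * reward
                discount *= gamma
                state = next_state
                if terminal:
                    break
        return total_return / planning_episodes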
Monte-Carlo learning agent
This module defines an agent which uses Monte Carlo policy evaluation to optimize its behavior in a given environment
Agent that learns based on monte-carlo samples of the Q-function
An agent which uses Monte Carlo policy evaluation to optimize its behavior in a given environment.
gamma:
    The discount factor for computing the return given the rewards
epsilon:
    Exploration rate. The probability that an action is chosen non-greedily, i.e. uniformly at random among all available actions
visit:
    Whether first-visit (“first”) or every-visit (“every”) updates are used in Monte-Carlo policy evaluation
defaultQ:
    The initially assumed Q-value for each state-action pair. Allows controlling initial exploration via optimistic initialization
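A tabular sketch of the Monte-Carlo update performed at the end of an episode, illustrating the return computation and the first-visit / every-visit distinction; the actual agent additionally handles epsilon-greedy action selection and defaultQ initialization:

    from collections import defaultdict


    def monte_carlo_update(Q, counts, episode, gamma, visit="first"):
        """Update Q-estimates from one finished episode by averaging returns.

        `episode` is a list of (state, action, reward) tuples in the order
        they occurred; Q and counts are defaultdicts keyed by (state, action).
        With visit="first", only the first occurrence of a state-action pair
        contributes its return; with visit="every", all occurrences do.
        """
        # Backwards pass: compute the return following each time step.
        returns = []
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, action, G))
        returns.reverse()

        seen = set()
        for state, action, G in returns:
            if visit == "first":
                if (state, action) in seen:
                    continue
                seen.add((state, action))
            # Incremental average of the observed returns.
            counts[(state, action)] += 1
            Q[(state, action)] += (G - Q[(state, action)]) / counts[(state, action)]


    # Usage sketch with a hypothetical two-step episode:
    Q, counts = defaultdict(float), defaultdict(int)
    monte_carlo_update(Q, counts, [("s0", 0, 1.0), ("s1", 1, 0.0)], gamma=0.9)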
MMLF agent that acts randomly
This module defines a simple agent that can interact with an environment. It chooses all available actions with the same probability.
This module also serves as an example of how to implement an MMLF agent.
Agent that chooses uniformly randomly among the available actions.
Agent based on temporal difference learning
This module defines a base agent for all kinds of agents based on temporal difference learning. Most of these agents can reuse most methods of this agent and need to modify only small parts.
Note: The TDAgent cannot be instantiated by itself; it is an abstract base class!
A base agent for all kinds of agents based on temporal difference learning. Most of these agents can reuse most methods of this agent and need to modify only small parts.
Note: The TDAgent cannot be instantiated by itself; it is an abstract base class!
Agent based on temporal difference learning with eligibility traces
This module defines an agent that uses temporal difference learning (e.g. Sarsa) with eligibility traces and function approximation (e.g. linear tile coding CMAC) to optimize its behavior in a given environment
Agent that implements TD(lambda) RL
An agent that uses temporal difference learning (e.g. Sarsa) with eligibility traces and function approximation (e.g. linear tile coding CMAC) to optimize its behavior in a given environment
update_rule:
    Whether the learning is on-policy or off-policy. Can be either “SARSA” (on-policy) or “WatkinsQ” (off-policy)
gamma:
    The discount factor for computing the return given the rewards
epsilon:
    Exploration rate. The probability that an action is chosen non-greedily, i.e. uniformly at random among all available actions
epsilonDecay:
    Decay factor for the exploration rate. The exploration rate is multiplied with this value after each episode.
lambda:
    The eligibility trace decay rate
minTraceValue:
    The minimum value of an entry in a trace that is considered to be relevant. If the eligibility falls below this value, it is set to 0 and the entry is no longer updated
replacingTraces:
    Whether replacing or accumulating traces are used.
stateDimensionResolution:
    The default “resolution” the agent uses for every state dimension. Can be either an int (same resolution for each dimension) or a dict mapping dimension name to its resolution.
actionDimensionResolution:
    By default, the agent discretizes a continuous action space into this number of discrete actions
function_approximator:
    The function approximator used for representing the Q value function
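A tabular Sarsa(lambda) sketch of the eligibility-trace mechanism controlled by lambda, minTraceValue, and replacingTraces; the actual agent operates on a function approximator such as a CMAC rather than a table:

    from collections import defaultdict


    def sarsa_lambda_step(Q, traces, transition, alpha=0.1, gamma=0.9,
                          lam=0.9, minTraceValue=0.01, replacingTraces=True):
        """One tabular Sarsa(lambda) update with eligibility traces.

        `transition` is a (state, action, reward, next_state, next_action)
        tuple; Q and traces are defaultdict(float) keyed by (state, action).
        """
        state, action, reward, next_state, next_action = transition
        # TD error for the on-policy (SARSA) update rule.
        delta = reward + gamma * Q[(next_state, next_action)] - Q[(state, action)]

        # Update the eligibility of the current state-action pair.
        if replacingTraces:
            traces[(state, action)] = 1.0
        else:
            traces[(state, action)] += 1.0

        # Propagate the TD error along all eligible pairs and decay the traces.
        for key in list(traces):
            Q[key] += alpha * delta * traces[key]
            traces[key] *= gamma * lam
            if traces[key] < minTraceValue:
                del traces[key]     # drop entries that are no longer relevant


    # Usage sketch with hypothetical states/actions:
    Q, traces = defaultdict(float), defaultdict(float)
    sarsa_lambda_step(Q, traces, ("s0", 0, 1.0, "s1", 1))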