Agents¶
Sequence Agents¶
SeqPGAgent¶
class texar.tf.agents.SeqPGAgent(samples, logits, sequence_length, trainable_variables=None, learning_rate=None, sess=None, hparams=None)[source]¶

Policy Gradient agent for sequence prediction.

This is a wrapper of the training process that trains a model with policy gradient. The agent itself does not create new trainable variables.
Parameters:
- samples – An int Tensor of shape [batch_size, max_time] containing sampled sequences from the model.
- logits – A float Tensor of shape [batch_size, max_time, vocab_size] containing the logits of samples from the model.
- sequence_length – A Tensor of shape [batch_size]. Time steps beyond the respective sequence lengths are masked out.
- trainable_variables (optional) – Trainable variables of the model to update during training. If None, all trainable variables in the graph are used.
- learning_rate (optional) – Learning rate for policy optimization. If not given, the learning rate is determined from hparams. See get_train_op() for more details.
- sess (optional) – A tf session. Can be None here and set later with agent.sess = session.
- hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
Example
## Train a decoder with policy gradient
decoder = BasicRNNDecoder(...)
outputs, _, sequence_length = decoder(
    decoding_strategy='infer_sample', ...)

sess = tf.Session()
agent = SeqPGAgent(
    samples=outputs.sample_id,
    logits=outputs.logits,
    sequence_length=sequence_length,
    sess=sess)

while training:
    # Generate samples
    vals = agent.get_samples()
    # Evaluate reward
    sample_text = tx.utils.map_ids_to_strs(vals['samples'], vocab)
    reward_bleu = []
    for y, y_ in zip(ground_truth, sample_text):
        reward_bleu.append(tx.evals.sentence_bleu(y, y_))
    # Update
    agent.observe(reward=reward_bleu)
static default_hparams()[source]¶

Returns a dictionary of hyperparameters with default values:
{
    'discount_factor': 0.95,
    'normalize_reward': False,
    'entropy_weight': 0.,
    'loss': {
        'average_across_batch': True,
        'average_across_timesteps': False,
        'sum_over_batch': False,
        'sum_over_timesteps': True,
        'time_major': False
    },
    'optimization': default_optimization_hparams(),
    'name': 'pg_agent',
}
Here:

- “discount_factor”: float
  The discount factor of reward.
- “normalize_reward”: bool
  Whether to normalize the discounted reward, by (discounted_reward - mean) / std. Here mean and std are over all time steps and all samples in the batch.
- “entropy_weight”: float
  The weight of the entropy loss on the sample distribution, to encourage maximizing the Shannon entropy. Set to 0 to disable the loss.
- “loss”: dict
  Extra keyword arguments for pg_loss_with_logits(), including the reduce arguments (e.g., average_across_batch) and time_major.
- “optimization”: dict
  Hyperparameters of optimization for updating the policy net. See default_optimization_hparams() for details.
- “name”: str
  Name of the agent.
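For illustration, the following sketch overrides a few of these defaults when constructing the agent. It assumes the decoder outputs (outputs, sequence_length) from the class-level example above; the specific values are illustrative, not recommended settings.

# Sketch: overriding selected default hyperparameters of SeqPGAgent.
hparams = {
    'discount_factor': 1.0,   # no per-step discounting within a sequence
    'entropy_weight': 0.01,   # small entropy bonus on the sample distribution
    'loss': {
        'average_across_timesteps': True,
        'sum_over_timesteps': False,
    },
}
agent = SeqPGAgent(
    samples=outputs.sample_id,
    logits=outputs.logits,
    sequence_length=sequence_length,
    hparams=hparams)   # unspecified keys keep their default values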
get_samples(extra_fetches=None, feed_dict=None)[source]¶

Returns sequence samples and extra results.
Parameters:
- extra_fetches (dict, optional) – Extra tensors to fetch values for, besides samples and sequence_length. Same as the fetches argument of tf.Session.run and tf.Session.partial_run.
- feed_dict (dict, optional) – A dict that maps tensors to values. Note that all placeholder values used in get_samples() and subsequent observe() calls should be fed here.

Returns:
A dict with keys “samples” and “sequence_length” containing the fetched values of samples and sequence_length, as well as other fetched values as specified in extra_fetches.

Example
extra_fetches = {'truth_ids': data_batch['text_ids']}
vals = agent.get_samples(extra_fetches=extra_fetches)
sample_text = tx.utils.map_ids_to_strs(vals['samples'], vocab)
truth_text = tx.utils.map_ids_to_strs(vals['truth_ids'], vocab)
reward = reward_fn_in_python(truth_text, sample_text)
observe(reward, train_policy=True, compute_loss=True)[source]¶

Observes the reward, and updates the policy or computes the loss accordingly.
Parameters:
- reward – A Python array/list of shape [batch_size] containing the reward for the samples generated in the last call of get_samples().
- train_policy (bool) – Whether to update the policy model according to the reward.
- compute_loss (bool) – If train_policy is False, whether to compute the policy gradient loss (without updating the policy).

Returns:
If train_policy or compute_loss is True, returns the loss (a Python float scalar). Otherwise returns None.
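Since observe() returns the loss even when the policy is not updated, it can also be used to monitor the policy-gradient loss on held-out samples. A minimal sketch, assuming eval_rewards is a hypothetical list of per-sample rewards computed for the batch returned by the preceding get_samples() call:

# Compute the PG loss for monitoring only; the policy is not updated.
eval_loss = agent.observe(
    reward=eval_rewards, train_policy=False, compute_loss=True)
print('held-out policy gradient loss: %.4f' % eval_loss)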
sess¶
The tf session.

pg_loss¶
The scalar tensor of policy gradient loss.

sequence_length¶
The tensor of sample sequence length, of shape [batch_size].

samples¶
The tensor of sequence samples.

hparams¶
A HParams instance. The hyperparameters of the module.

logits¶
The tensor of sequence logits.

name¶
The name of the module (not uniquified).

variable_scope¶
The variable scope of the agent.
Episodic Agents¶
EpisodicAgentBase¶
class texar.tf.agents.EpisodicAgentBase(env_config, hparams=None)[source]¶

Base class inherited by episodic RL agents.

An agent is a wrapper of the training process that trains a model with RL algorithms. The agent itself does not create new trainable variables.
An episodic RL agent typically provides three interfaces, namely reset(), get_action() and observe(), and is used as in the following example.

Example
env = SomeEnvironment(...)
agent = PGAgent(...)

while True:
    # Starts one episode
    agent.reset()
    observ = env.reset()
    while True:
        action = agent.get_action(observ)
        next_observ, reward, terminal = env.step(action)
        agent.observe(reward, terminal)
        observ = next_observ
        if terminal:
            break
Parameters:
- env_config – An instance of EnvConfig specifying the action space, observation space, reward range, etc. Use get_gym_env_config() to create an EnvConfig from a gym environment.
- hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
static default_hparams()[source]¶

Returns a dictionary of hyperparameters with default values.
{ "name": "agent" }
observe(reward, terminal, train_policy=True, feed_dict=None)[source]¶

Observes experience from the environment.
Parameters:
- reward – Reward of the action. The configuration (e.g., shape) of the reward is defined in env_config.
- terminal (bool) – Whether the episode is terminated.
- train_policy (bool) – Whether to update the policy for this step.
- feed_dict (dict, optional) – Any values fed when running the training operator.
get_action(observ, feed_dict=None)[source]¶

Gets action according to observation.

Parameters: observ – Observation from the environment.

Returns: Action from the policy.
env_config¶
Environment configuration.

hparams¶
A HParams instance. The hyperparameters of the module.

name¶
The name of the module (not uniquified).

variable_scope¶
The variable scope of the agent.
PGAgent¶
class texar.tf.agents.PGAgent(env_config, sess=None, policy=None, policy_kwargs=None, policy_caller_kwargs=None, learning_rate=None, hparams=None)[source]¶

Policy gradient agent for the episodic setting. This agent supports un-batched training, i.e., each step it generates one action, takes one observation, and updates the policy.
The policy must take in an observation of shape [1] + observation_shape, where the first dimension 1 is the batch dimension, and output a dict containing:

- Key “action” whose value is a Tensor of shape [1] + action_shape containing a single action.
- One of the keys “log_prob” or “dist”:
  - “log_prob”: A Tensor of shape [1], the log probability of the “action”.
  - “dist”: A tf.distributions.Distribution with the log_prob interface, such that log_prob = dist.log_prob(outputs[“action”]).
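For illustration, below is a minimal sketch of a policy callable satisfying this contract. It assumes a flat observation and a discrete action space of a hypothetical size num_actions; it is not part of the library.

import tensorflow as tf

def toy_policy(inputs):
    # `inputs` has shape [1] + observation_shape (a flat observation here).
    num_actions = 4                                # assumed, for illustration
    logits = tf.layers.dense(inputs, num_actions)  # shape [1, num_actions]
    dist = tf.distributions.Categorical(logits=logits)
    action = dist.sample()                         # shape [1]
    # Returning "dist" alone is sufficient; "log_prob" is included for clarity.
    return {'action': action,
            'dist': dist,
            'log_prob': dist.log_prob(action)}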
Parameters:
- env_config – An instance of EnvConfig specifying the action space, observation space, reward range, etc. Use get_gym_env_config() to create an EnvConfig from a gym environment.
- sess (optional) – A tf session. Can be None here and set later with agent.sess = session.
- policy (optional) – A policy net that takes in observations and outputs actions and probabilities. If not given, a policy network is created based on hparams.
- policy_kwargs (dict, optional) – Keyword arguments for the policy constructor. Note that the hparams argument for the network constructor is specified in the “policy_hparams” field of hparams and should not be included in policy_kwargs. Ignored if policy is given.
- policy_caller_kwargs (dict, optional) – Keyword arguments for calling the policy to get actions. The policy is called with outputs=policy(inputs=observation, **policy_caller_kwargs).
- learning_rate (optional) – Learning rate for policy optimization. If not given, the learning rate is determined from hparams. See get_train_op() for more details.
- hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
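As a usage sketch, the following constructs a PGAgent for a gym environment and then runs the episodic loop shown under EpisodicAgentBase. The environment name and learning rate are illustrative.

import gym
import tensorflow as tf
import texar.tf as tx

env = gym.make('CartPole-v0')                     # illustrative environment
env_config = tx.agents.get_gym_env_config(env)
agent = tx.agents.PGAgent(env_config, learning_rate=1e-3)

with tf.Session() as sess:
    agent.sess = sess
    sess.run(tf.global_variables_initializer())
    # ... run the reset / get_action / observe loop shown above ...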
static default_hparams()[source]¶

Returns a dictionary of hyperparameters with default values:
{
    'policy_type': 'CategoricalPolicyNet',
    'policy_hparams': None,
    'discount_factor': 0.95,
    'normalize_reward': False,
    'optimization': default_optimization_hparams(),
    'name': 'pg_agent',
}
Here:

- “policy_type”: str or class or instance
  The policy net. Can be a class, its name or module path, or a class instance. If a class name is given, the class must be from module texar.tf.modules or texar.tf.custom. Ignored if a policy is given to the agent constructor.
- “policy_hparams”: dict, optional
  Hyperparameters for the policy net. With the policy_kwargs argument to the constructor, a network is created with policy_class(**policy_kwargs, hparams=policy_hparams).
- “discount_factor”: float
  The discount factor of reward.
- “normalize_reward”: bool
  Whether to normalize the discounted reward, by (discounted_reward - mean) / std.
- “optimization”: dict
  Hyperparameters of optimization for updating the policy net. See default_optimization_hparams() for details.
- “name”: str
  Name of the agent.
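As a sketch, the policy net can also be chosen and tuned entirely through hparams rather than by passing a policy instance; the values below are illustrative overrides of the defaults.

agent_hparams = {
    'policy_type': 'CategoricalPolicyNet',  # the default policy class
    'discount_factor': 0.99,
    'normalize_reward': True,
}
# Unspecified keys (e.g., 'optimization') keep their default values.
agent = tx.agents.PGAgent(env_config, hparams=agent_hparams)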
sess¶
The tf session.

env_config¶
Environment configuration.

get_action(observ, feed_dict=None)¶

Gets action according to observation.

Parameters: observ – Observation from the environment.

Returns: Action from the policy.
hparams¶
A HParams instance. The hyperparameters of the module.

name¶
The name of the module (not uniquified).
observe(reward, terminal, train_policy=True, feed_dict=None)¶

Observes experience from the environment.

Parameters:
- reward – Reward of the action. The configuration (e.g., shape) of the reward is defined in env_config.
- terminal (bool) – Whether the episode is terminated.
- train_policy (bool) – Whether to update the policy for this step.
- feed_dict (dict, optional) – Any values fed when running the training operator.
policy¶
The policy model.

reset()¶
Resets the states to begin a new episode.

variable_scope¶
The variable scope of the agent.
DQNAgent¶
class texar.tf.agents.DQNAgent(env_config, sess=None, qnet=None, target=None, qnet_kwargs=None, qnet_caller_kwargs=None, replay_memory=None, replay_memory_kwargs=None, exploration=None, exploration_kwargs=None, hparams=None)[source]¶

Deep Q learning agent for the episodic setting.
A Q learning algorithm consists of several components:

- A Q-net that takes in a state and returns Q-values for action sampling. See CategoricalQNet for an example Q-net class and required interface.
- A replay memory that manages past experience for Q-net updates. See DequeReplayMemory for an example replay memory class and required interface.
- An exploration that specifies the exploration strategy used to train the Q-net. See EpsilonLinearDecayExploration for an example class and required interface.
Parameters:
- env_config – An instance of EnvConfig specifying the action space, observation space, reward range, etc. Use get_gym_env_config() to create an EnvConfig from a gym environment.
- sess (optional) – A tf session. Can be None here and set later with agent.sess = session.
- qnet (optional) – A Q network that predicts Q values given states. If not given, a Q network is created based on hparams.
- target (optional) – A target network to compute target Q values.
- qnet_kwargs (dict, optional) – Keyword arguments for the qnet constructor. Note that the hparams argument for the network constructor is specified in the “qnet_hparams” field of hparams and should not be included in qnet_kwargs. Ignored if qnet is given.
- qnet_caller_kwargs (dict, optional) – Keyword arguments for calling the qnet to get Q values. The qnet is called with outputs=qnet(inputs=observation, **qnet_caller_kwargs).
- replay_memory (optional) – A replay memory instance. If not given, a replay memory is created based on hparams.
- replay_memory_kwargs (dict, optional) – Keyword arguments for the replay_memory constructor. Ignored if replay_memory is given.
- exploration (optional) – An exploration instance used in the algorithm. If not given, an exploration instance is created based on hparams.
- exploration_kwargs (dict, optional) – Keyword arguments for the exploration class constructor. Ignored if exploration is given.
- hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
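As a usage sketch, the following constructs a DQNAgent with the default Q-net, replay memory, and exploration components while overriding a few scheduling hyperparameters; the environment and values are illustrative.

import gym
import tensorflow as tf
import texar.tf as tx

env = gym.make('CartPole-v0')                   # illustrative environment
env_config = tx.agents.get_gym_env_config(env)
agent = tx.agents.DQNAgent(
    env_config,
    hparams={
        'cold_start_steps': 500,   # collect experience before training starts
        'sample_batch_size': 64,   # minibatch size drawn from replay memory
        'update_period': 200,      # sync the target net every 200 steps
    })

with tf.Session() as sess:
    agent.sess = sess
    sess.run(tf.global_variables_initializer())
    # ... run the reset / get_action / observe loop shown under EpisodicAgentBase ...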
static default_hparams()[source]¶

Returns a dictionary of hyperparameters with default values:
{
    'qnet_type': 'CategoricalQNet',
    'qnet_hparams': None,
    'replay_memory_type': 'DequeReplayMemory',
    'replay_memory_hparams': None,
    'exploration_type': 'EpsilonLinearDecayExploration',
    'exploration_hparams': None,
    'optimization': opt.default_optimization_hparams(),
    'target_update_strategy': 'copy',
    'cold_start_steps': 100,
    'sample_batch_size': 32,
    'update_period': 100,
    'discount_factor': 0.95,
    'name': 'dqn_agent'
}
Here:

- “qnet_type”: str or class or instance
  Q-value net. Can be a class, its name or module path, or a class instance. If a class name is given, the class must be from module texar.tf.modules or texar.tf.custom. Ignored if a qnet is given to the agent constructor.
- “qnet_hparams”: dict, optional
  Hyperparameters for the Q net. With the qnet_kwargs argument to the constructor, a network is created with qnet_class(**qnet_kwargs, hparams=qnet_hparams).
- “replay_memory_type”: str or class or instance
  Replay memory class. Can be a class, its name or module path, or a class instance. If a class name is given, the class must be from module texar.tf.core or texar.tf.custom. Ignored if a replay_memory is given to the agent constructor.
- “replay_memory_hparams”: dict, optional
  Hyperparameters for the replay memory. With the replay_memory_kwargs argument to the constructor, a replay memory is created with replay_memory_class(**replay_memory_kwargs, hparams=replay_memory_hparams).
- “exploration_type”: str or class or instance
  Exploration class. Can be a class, its name or module path, or a class instance. If a class name is given, the class must be from module texar.tf.core or texar.tf.custom. Ignored if an exploration is given to the agent constructor.
- “exploration_hparams”: dict, optional
  Hyperparameters for the exploration class. With the exploration_kwargs argument to the constructor, an exploration instance is created with exploration_class(**exploration_kwargs, hparams=exploration_hparams).
- “optimization”: dict
  Hyperparameters of optimization for updating the Q-net. See default_optimization_hparams() for details.
- “cold_start_steps”: int
  The Q-net is not trained during the first “cold_start_steps” steps.
- “sample_batch_size”: int
  The number of samples drawn from the replay memory for each training update.
- “target_update_strategy”: str
  If “copy”, the target network is assigned the parameters of the Q-net every “update_period” steps. If “tau”, the target is updated softly as (1 - 1/update_period) * target + 1/update_period * qnet.
- “update_period”: int
  Frequency of updating the target network, i.e., the target is updated once every “update_period” steps.
- “discount_factor”: float
  The discount factor of reward.
- “name”: str
  Name of the agent.
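For illustration, the two target-update strategies can be written out on a single toy parameter value (plain Python arithmetic, not texar code):

update_period = 100
qnet_param = 1.0      # a Q-net parameter value
target_param = 0.0    # the corresponding target-net parameter value

# 'copy': every `update_period` steps the target takes the Q-net's value.
target_copy = qnet_param                     # -> 1.0

# 'tau': a soft update with weight 1/update_period toward the Q-net value.
target_tau = (1.0 - 1.0 / update_period) * target_param \
    + (1.0 / update_period) * qnet_param     # -> 0.01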
env_config¶
Environment configuration.

get_action(observ, feed_dict=None)¶

Gets action according to observation.

Parameters: observ – Observation from the environment.

Returns: Action from the policy.

hparams¶
A HParams instance. The hyperparameters of the module.

name¶
The name of the module (not uniquified).
observe(reward, terminal, train_policy=True, feed_dict=None)¶

Observes experience from the environment.

Parameters:
- reward – Reward of the action. The configuration (e.g., shape) of the reward is defined in env_config.
- terminal (bool) – Whether the episode is terminated.
- train_policy (bool) – Whether to update the policy for this step.
- feed_dict (dict, optional) – Any values fed when running the training operator.
reset()¶
Resets the states to begin a new episode.

sess¶
The tf session.

variable_scope¶
The variable scope of the agent.
ActorCriticAgent¶
class texar.tf.agents.ActorCriticAgent(env_config, sess=None, actor=None, actor_kwargs=None, critic=None, critic_kwargs=None, hparams=None)[source]¶

Actor-critic agent for the episodic setting.
An actor-critic algorithm consists of several components:

- An actor, the policy to optimize. As a temporary implementation, by default we use a PGAgent instance that wraps a policy net and provides the proper interfaces to perform the role of an actor.
- A critic that provides learning signals to the actor. Again, as a temporary implementation, by default we use a DQNAgent instance that wraps a Q net and provides the proper interfaces to perform the role of a critic.
Parameters:
- env_config – An instance of EnvConfig specifying the action space, observation space, reward range, etc. Use get_gym_env_config() to create an EnvConfig from a gym environment.
- sess (optional) – A tf session. Can be None here and set later with agent.sess = session.
- actor (optional) – An instance of PGAgent that acts as the actor in the algorithm. If not provided, an actor is created based on hparams.
- actor_kwargs (dict, optional) – Keyword arguments for the actor constructor. Note that the hparams argument for the actor constructor is specified in the “actor_hparams” field of hparams and should not be included in actor_kwargs. Ignored if actor is given.
- critic (optional) – An instance of DQNAgent that acts as the critic in the algorithm. If not provided, a critic is created based on hparams.
- critic_kwargs (dict, optional) – Keyword arguments for the critic constructor. Note that the hparams argument for the critic constructor is specified in the “critic_hparams” field of hparams and should not be included in critic_kwargs. Ignored if critic is given.
- hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
static default_hparams()[source]¶

Returns a dictionary of hyperparameters with default values:
{
    'actor_type': 'PGAgent',
    'actor_hparams': None,
    'critic_type': 'DQNAgent',
    'critic_hparams': None,
    'name': 'actor_critic_agent'
}
Here:

- “actor_type”: str or class or instance
  The actor. Can be a class, its name or module path, or a class instance. If a class name is given, the class must be from module texar.tf.agents or texar.tf.custom. Ignored if an actor is given to the agent constructor.
- “actor_hparams”: dict, optional
  Hyperparameters for the actor class. With the actor_kwargs argument to the constructor, an actor is created with actor_class(**actor_kwargs, hparams=actor_hparams).
- “critic_type”: str or class or instance
  The critic. Can be a class, its name or module path, or a class instance. If a class name is given, the class must be from module texar.tf.agents or texar.tf.custom. Ignored if a critic is given to the agent constructor.
- “critic_hparams”: dict, optional
  Hyperparameters for the critic class. With the critic_kwargs argument to the constructor, a critic is created with critic_class(**critic_kwargs, hparams=critic_hparams).
- “name”: str
  Name of the agent.
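As a sketch, the nested actor and critic hyperparameters can be supplied through hparams; the values below are illustrative overrides of the PGAgent and DQNAgent defaults documented above.

agent = tx.agents.ActorCriticAgent(
    env_config,
    hparams={
        'actor_hparams': {'discount_factor': 0.99},   # forwarded to PGAgent
        'critic_hparams': {'sample_batch_size': 64},  # forwarded to DQNAgent
    })
with tf.Session() as sess:
    agent.sess = sess
    sess.run(tf.global_variables_initializer())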
get_action(observ, feed_dict=None)[source]¶

Gets action according to observation.

Parameters: observ – Observation from the environment.

Returns: Action from the policy.

env_config¶
Environment configuration.

hparams¶
A HParams instance. The hyperparameters of the module.

name¶
The name of the module (not uniquified).
observe(reward, terminal, train_policy=True, feed_dict=None)¶

Observes experience from the environment.

Parameters:
- reward – Reward of the action. The configuration (e.g., shape) of the reward is defined in env_config.
- terminal (bool) – Whether the episode is terminated.
- train_policy (bool) – Whether to update the policy for this step.
- feed_dict (dict, optional) – Any values fed when running the training operator.
reset()¶
Resets the states to begin a new episode.

sess¶
The tf session.

variable_scope¶
The variable scope of the agent.
Agent Utils¶
Space¶
class texar.tf.agents.Space(shape=None, low=None, high=None, dtype=None)[source]¶

Observation and action spaces. Describes valid actions and observations. Similar to gym.Space.
Parameters:
- shape (optional) – Shape of the space, a tuple. If not given, inferred from low and high.
- low (optional) – Lower bound (inclusive) of each dimension of the space. Must have the shape specified by shape, and be of the same shape as high (if given). If None, set to -inf for each dimension.
- high (optional) – Upper bound (inclusive) of each dimension of the space. Must have the shape specified by shape, and be of the same shape as low (if given). If None, set to inf for each dimension.
- dtype (optional) – Data type of elements in the space. If not given, inferred from low (if given) or set to float.
Example
s = Space(low=0, high=10, dtype=np.int32)
# s.contains(2) == True
# s.contains(10) == True
# s.contains(11) == False
# s.shape == ()

s2 = Space(shape=(2, 2), high=np.ones([2, 2]), dtype=np.float)
# s2.low == [[-inf, -inf], [-inf, -inf]]
# s2.high == [[1., 1.], [1., 1.]]
shape¶
Shape of the space.

low¶
Lower bound of the space.

high¶
Upper bound of the space.

dtype¶
Data type of the elements.