Agents
Sequence Agents
SeqPGAgent
-
class texar.tf.agents.SeqPGAgent(samples, logits, sequence_length, trainable_variables=None, learning_rate=None, sess=None, hparams=None)
Policy gradient agent for sequence prediction.
This is a wrapper of the training process that trains a model with policy gradient. The agent itself does not create new trainable variables.
Parameters:
- samples – An int Tensor of shape [batch_size, max_time] containing sampled sequences from the model.
- logits – A float Tensor of shape [batch_size, max_time, vocab_size] containing the logits of samples from the model.
- sequence_length – A Tensor of shape [batch_size]. Time steps beyond the respective sequence lengths are masked out.
- trainable_variables (optional) – Trainable variables of the model to update during training. If None, all trainable variables in the graph are used.
- learning_rate (optional) – Learning rate for policy optimization. If not given, the learning rate is determined from hparams. See get_train_op() for more details.
- sess (optional) – A tf session. Can be None here and set later with agent.sess = session.
- hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
Example
    import tensorflow as tf
    import texar.tf as tx
    from texar.tf.agents import SeqPGAgent
    from texar.tf.modules import BasicRNNDecoder

    # Train a decoder with policy gradient
    decoder = BasicRNNDecoder(...)
    outputs, _, sequence_length = decoder(
        decoding_strategy='infer_sample', ...)

    sess = tf.Session()
    agent = SeqPGAgent(
        samples=outputs.sample_id,
        logits=outputs.logits,
        sequence_length=sequence_length,
        sess=sess)

    while training:
        # Generate samples
        vals = agent.get_samples()
        # Evaluate reward
        sample_text = tx.utils.map_ids_to_strs(vals['samples'], vocab)
        reward_bleu = []
        for y, y_ in zip(ground_truth, sample_text):
            reward_bleu.append(tx.evals.sentence_bleu(y, y_))
        # Update
        agent.observe(reward=reward_bleu)
-
static default_hparams()
Returns a dictionary of hyperparameters with default values:
    {
        'discount_factor': 0.95,
        'normalize_reward': False,
        'entropy_weight': 0.,
        'loss': {
            'average_across_batch': True,
            'average_across_timesteps': False,
            'sum_over_batch': False,
            'sum_over_timesteps': True,
            'time_major': False
        },
        'optimization': default_optimization_hparams(),
        'name': 'pg_agent',
    }
Here:
- “discount_factor”: float
  The discount factor of the reward.
- “normalize_reward”: bool
  Whether to normalize the discounted reward by (discounted_reward - mean) / std. Here mean and std are computed over all time steps and all samples in the batch.
- “entropy_weight”: float
  The weight of the entropy loss of the sample distribution, which encourages maximizing the Shannon entropy. Set to 0 to disable the loss.
- “loss”: dict
  Extra keyword arguments for pg_loss_with_logits(), including the reduce arguments (e.g., average_across_batch) and time_major.
- “optimization”: dict
  Hyperparameters of optimization for updating the policy net. See default_optimization_hparams() for details.
- “name”: str
  Name of the agent.
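To make "discount_factor" and "normalize_reward" concrete, the following is a minimal pure-Python sketch. It assumes that the single per-sequence reward is credited to the last valid time step and discounted backward toward the start, that steps beyond sequence_length are zeroed, and that the normalization statistics exclude those padded steps. The helper names discount_sequence_reward and normalize are hypothetical and are not part of the library.

```python
import math


def discount_sequence_reward(rewards, sequence_length, max_time, discount=0.95):
    """Spread one reward per sequence over its time steps (hypothetical helper).

    Assumption: step t of a sequence of length n receives
    reward * discount ** (n - 1 - t); padded steps receive 0.
    """
    batch = []
    for reward, length in zip(rewards, sequence_length):
        row = []
        for t in range(max_time):
            if t < length:
                row.append(reward * discount ** (length - 1 - t))
            else:
                row.append(0.0)
        batch.append(row)
    return batch


def normalize(disc, sequence_length):
    """Apply (x - mean) / std using all valid steps of all samples."""
    valid = [x for row, n in zip(disc, sequence_length) for x in row[:n]]
    mean = sum(valid) / len(valid)
    std = math.sqrt(sum((x - mean) ** 2 for x in valid) / len(valid))
    std = std or 1.0  # guard against a constant batch
    return [
        [(x - mean) / std if t < n else 0.0 for t, x in enumerate(row)]
        for row, n in zip(disc, sequence_length)
    ]


disc = discount_sequence_reward([1.0, 2.0], [3, 2], max_time=3, discount=0.5)
norm = normalize(disc, [3, 2])
```

With discount=0.5, the first sequence (length 3, reward 1.0) yields per-step values 0.25, 0.5, 1.0, so earlier tokens receive a smaller share of the final reward.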
-
get_samples(extra_fetches=None, feed_dict=None)
Returns sequence samples and extra results.
Parameters:
- extra_fetches (dict, optional) – Extra tensors whose values to fetch, besides samples and sequence_length. Same as the fetches argument of tf.Session.run and tf.Session.partial_run.
- feed_dict (dict, optional) – A dict that maps tensors to values. Note that all placeholder values used in get_samples() and subsequent observe() calls should be fed here.
Returns: A dict with keys “samples” and “sequence_length” containing the fetched values of samples and sequence_length, as well as other fetched values as specified in extra_fetches.
Example
    extra_fetches = {'truth_ids': data_batch['text_ids']}
    # Pass extra_fetches so that 'truth_ids' is available in the result
    vals = agent.get_samples(extra_fetches=extra_fetches)
    sample_text = tx.utils.map_ids_to_strs(vals['samples'], vocab)
    truth_text = tx.utils.map_ids_to_strs(vals['truth_ids'], vocab)
    reward = reward_fn_in_python(truth_text, sample_text)
-
observe(reward, train_policy=True, compute_loss=True)
Observes the reward, and updates the policy or computes the loss accordingly.
Parameters:
- reward – A Python array/list of shape [batch_size] containing the reward for the samples generated in the last call of get_samples().
- train_policy (bool) – Whether to update the policy model according to the reward.
- compute_loss (bool) – If train_policy is False, whether to compute the policy gradient loss (without updating the policy).
Returns: If train_policy or compute_loss is True, returns the loss (a Python float scalar). Otherwise returns None.
-
sess
The tf session.
-
pg_loss
The scalar tensor of policy gradient loss.
-
sequence_length
The tensor of sample sequence length, of shape [batch_size].
-
samples
The tensor of sequence samples.
-
hparams
A HParams instance. The hyperparameters of the module.
-
logits
The tensor of sequence logits.
-
name
The name of the module (not uniquified).
-
variable_scope
The variable scope of the agent.
Episodic Agents
EpisodicAgentBase
-
class texar.tf.agents.EpisodicAgentBase(env_config, hparams=None)
Base class inherited by episodic RL agents.
An agent is a wrapper of the training process that trains a model with RL algorithms. The agent itself does not create new trainable variables.
An episodic RL agent typically provides three interfaces, namely reset(), get_action(), and observe(), and is used as in the following example.
Example
    env = SomeEnvironment(...)
    agent = PGAgent(...)

    while True:
        # Starts one episode
        agent.reset()
        observ = env.reset()
        while True:
            action = agent.get_action(observ)
            next_observ, reward, terminal = env.step(action)
            agent.observe(reward, terminal)
            observ = next_observ
            if terminal:
                break
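For readers who want to run the reset / get_action / observe protocol without TensorFlow, here is a self-contained toy sketch. CoinEnv and RandomAgent are hypothetical stand-ins written for this illustration; they only mimic the call pattern and do not learn anything.

```python
import random


class CoinEnv:
    """Hypothetical toy environment: guess a coin flip, 3 steps per episode."""

    def reset(self):
        self.t = 0
        return 0  # dummy observation

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == random.randint(0, 1) else 0.0
        terminal = self.t >= 3
        return 0, reward, terminal


class RandomAgent:
    """Hypothetical agent exposing the three episodic interfaces."""

    def __init__(self):
        self.episode_rewards = []

    def reset(self):
        # Clear per-episode state before a new episode starts.
        self.episode_rewards = []

    def get_action(self, observ):
        return random.randint(0, 1)

    def observe(self, reward, terminal):
        # A learning agent would update its policy here.
        self.episode_rewards.append(reward)


def run_episode(env, agent):
    """Run one episode and return the number of steps taken."""
    agent.reset()
    observ = env.reset()
    steps = 0
    while True:
        action = agent.get_action(observ)
        observ, reward, terminal = env.step(action)
        agent.observe(reward, terminal)
        steps += 1
        if terminal:
            return steps


random.seed(0)
env, agent = CoinEnv(), RandomAgent()
n_steps = run_episode(env, agent)
```

The sketch runs a single episode rather than looping forever, so it terminates after three steps.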
Parameters:
- env_config – An instance of EnvConfig specifying action space, observation space, and reward range, etc. Use get_gym_env_config() to create an EnvConfig from a gym environment.
- hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
-
static default_hparams()
Returns a dictionary of hyperparameters with default values:
    {
        "name": "agent"
    }
-
observe(reward, terminal, train_policy=True, feed_dict=None)
Observes experience from the environment.
Parameters:
- reward – Reward of the action. The configuration (e.g., shape) of the reward is defined in env_config.
- terminal (bool) – Whether the episode is terminated.
- train_policy (bool) – Whether to update the policy for this step.
- feed_dict (dict, optional) – Any values to feed when running the training operator.
-
get_action(observ, feed_dict=None)
Gets an action according to the observation.
Parameters:
- observ – Observation from the environment.
Returns: The action from the policy.
-
env_config
Environment configuration.
-
hparams
A HParams instance. The hyperparameters of the module.
-
name
The name of the module (not uniquified).
-
variable_scope
The variable scope of the agent.
PGAgent
-
class texar.tf.agents.PGAgent(env_config, sess=None, policy=None, policy_kwargs=None, policy_caller_kwargs=None, learning_rate=None, hparams=None)
Policy gradient agent for episodic setting. This agent supports un-batched training, i.e., each time it generates one action, takes one observation, and updates the policy.
The policy must take in an observation of shape [1] + observation_shape, where the first dimension 1 stands for the batch dimension, and output a dict containing:
- Key “action”, whose value is a Tensor of shape [1] + action_shape containing a single action.
- One of the keys “log_prob” or “dist”:
  - “log_prob”: A Tensor of shape [1], the log probability of the “action”.
  - “dist”: A tf.distributions.Distribution with the log_prob interface, such that log_prob = dist.log_prob(outputs[“action”]).
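The output contract described above can be illustrated with a small pure-Python stand-in. This is only a sketch: toy_categorical_policy is a hypothetical function that uses nested lists in place of Tensors, ignores the observation, and samples from fixed logits. It returns the “action” and “log_prob” keys with a leading batch dimension of size 1.

```python
import math
import random


def toy_categorical_policy(observation, logits=(0.0, 1.0, 2.0)):
    """Hypothetical policy: samples one action from fixed logits."""
    # The observation carries a batch dimension of size 1.
    assert len(observation) == 1

    # Numerically stable softmax over the logits.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    probs = [e / z for e in exps]

    action = random.choices(range(len(logits)), weights=probs, k=1)[0]
    return {
        "action": [action],                      # shape [1] for a scalar action
        "log_prob": [math.log(probs[action])],   # shape [1]
    }


outputs = toy_categorical_policy([[0.1, 0.2]])
```

A real policy would return TensorFlow Tensors and condition on the observation; only the dict layout carries over.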
Parameters:
- env_config – An instance of EnvConfig specifying action space, observation space, and reward range, etc. Use get_gym_env_config() to create an EnvConfig from a gym environment.
- sess (optional) – A tf session. Can be None here and set later with agent.sess = session.
- policy (optional) – A policy net that takes in an observation and outputs actions and probabilities. If not given, a policy network is created based on hparams.
- policy_kwargs (dict, optional) – Keyword arguments for the policy constructor. Note that the hparams argument for the network constructor is specified in the “policy_hparams” field of hparams and should not be included in policy_kwargs. Ignored if policy is given.
- policy_caller_kwargs (dict, optional) – Keyword arguments for calling the policy to get actions. The policy is called with outputs = policy(inputs=observation, **policy_caller_kwargs).
- learning_rate (optional) – Learning rate for policy optimization. If not given, the learning rate is determined from hparams. See get_train_op() for more details.
- hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
-
static default_hparams()
Returns a dictionary of hyperparameters with default values:
    {
        'policy_type': 'CategoricalPolicyNet',
        'policy_hparams': None,
        'discount_factor': 0.95,
        'normalize_reward': False,
        'optimization': default_optimization_hparams(),
        'name': 'pg_agent',
    }
Here:
- “policy_type”: str or class or instance
  Policy net. Can be a class, its name or module path, or a class instance. If a class name is given, the class must be from module texar.tf.modules or texar.tf.custom. Ignored if a policy is given to the agent constructor.
- “policy_hparams”: dict, optional
  Hyperparameters for the policy net. With the policy_kwargs argument to the constructor, a network is created with policy_class(**policy_kwargs, hparams=policy_hparams).
- “discount_factor”: float
  The discount factor of the reward.
- “normalize_reward”: bool
  Whether to normalize the discounted reward by (discounted_reward - mean) / std.
- “optimization”: dict
  Hyperparameters of optimization for updating the policy net. See default_optimization_hparams() for details.
- “name”: str
  Name of the agent.
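In the episodic setting, “discount_factor” is commonly applied as a reward-to-go computation over the per-step rewards of one episode, G_t = r_t + discount * G_{t+1}. The sketch below shows that standard recursion in plain Python; that the agent computes its returns exactly this way is an assumption, and discounted_returns is a hypothetical helper name.

```python
def discounted_returns(rewards, discount=0.95):
    """Compute G_t = r_t + discount * G_{t+1} for one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backward so each step can reuse the return of the next step.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + discount * running
        returns[t] = running
    return returns


returns = discounted_returns([1.0, 0.0, 2.0], discount=0.5)
# returns == [1.5, 1.0, 2.0]
```

A discount below 1 makes rewards that arrive later in the episode count less toward earlier actions.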
-
sess
The tf session.
-
env_config
Environment configuration.
-
get_action(observ, feed_dict=None)
Gets an action according to the observation.
Parameters:
- observ – Observation from the environment.
Returns: The action from the policy.
-
hparams
A HParams instance. The hyperparameters of the module.
-
name
The name of the module (not uniquified).
-
observe(reward, terminal, train_policy=True, feed_dict=None)
Observes experience from the environment.
Parameters:
- reward – Reward of the action. The configuration (e.g., shape) of the reward is defined in env_config.
- terminal (bool) – Whether the episode is terminated.
- train_policy (bool) – Whether to update the policy for this step.
- feed_dict (dict, optional) – Any values to feed when running the training operator.
-
policy
The policy model.
-
reset()
Resets the states to begin a new episode.
-
variable_scope
The variable scope of the agent.
DQNAgent
-
class texar.tf.agents.DQNAgent(env_config, sess=None, qnet=None, target=None, qnet_kwargs=None, qnet_caller_kwargs=None, replay_memory=None, replay_memory_kwargs=None, exploration=None, exploration_kwargs=None, hparams=None)
Deep Q learning agent for episodic setting.
A Q learning algorithm consists of several components:
- A Q-net takes in a state and returns Q-values for action sampling. See CategoricalQNet for an example Q-net class and the required interface.
- A replay memory manages past experience for Q-net updates. See DequeReplayMemory for an example replay memory class and the required interface.
- An exploration that specifies the exploration strategy used to train the Q-net. See EpsilonLinearDecayExploration for an example class and the required interface.
Parameters:
- env_config – An instance of EnvConfig specifying action space, observation space, and reward range, etc. Use get_gym_env_config() to create an EnvConfig from a gym environment.
- sess (optional) – A tf session. Can be None here and set later with agent.sess = session.
- qnet (optional) – A Q network that predicts Q values given states. If not given, a Q network is created based on hparams.
- target (optional) – A target network to compute target Q values.
- qnet_kwargs (dict, optional) – Keyword arguments for the qnet constructor. Note that the hparams argument for the network constructor is specified in the “qnet_hparams” field of hparams and should not be included in qnet_kwargs. Ignored if qnet is given.
- qnet_caller_kwargs (dict, optional) – Keyword arguments for calling qnet to get Q values. The qnet is called with outputs = qnet(inputs=observation, **qnet_caller_kwargs).
- replay_memory (optional) – A replay memory instance. If not given, a replay memory is created based on hparams.
- replay_memory_kwargs (dict, optional) – Keyword arguments for the replay_memory constructor. Ignored if replay_memory is given.
- exploration (optional) – An exploration instance used in the algorithm. If not given, an exploration instance is created based on hparams.
- exploration_kwargs (dict, optional) – Keyword arguments for the exploration class constructor. Ignored if exploration is given.
- hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
-
static default_hparams()
Returns a dictionary of hyperparameters with default values:
    {
        'qnet_type': 'CategoricalQNet',
        'qnet_hparams': None,
        'replay_memory_type': 'DequeReplayMemory',
        'replay_memory_hparams': None,
        'exploration_type': 'EpsilonLinearDecayExploration',
        'exploration_hparams': None,
        'optimization': opt.default_optimization_hparams(),
        'target_update_strategy': 'copy',
        'cold_start_steps': 100,
        'sample_batch_size': 32,
        'update_period': 100,
        'discount_factor': 0.95,
        'name': 'dqn_agent'
    }
Here:
- “qnet_type”: str or class or instance
  Q-value net. Can be a class, its name or module path, or a class instance. If a class name is given, the class must be from module texar.tf.modules or texar.tf.custom. Ignored if a qnet is given to the agent constructor.
- “qnet_hparams”: dict, optional
  Hyperparameters for the Q net. With the qnet_kwargs argument to the constructor, a network is created with qnet_class(**qnet_kwargs, hparams=qnet_hparams).
- “replay_memory_type”: str or class or instance
  Replay memory class. Can be a class, its name or module path, or a class instance. If a class name is given, the class must be from module texar.tf.core or texar.tf.custom. Ignored if a replay_memory is given to the agent constructor.
- “replay_memory_hparams”: dict, optional
  Hyperparameters for the replay memory. With the replay_memory_kwargs argument to the constructor, a replay memory is created with replay_memory_class(**replay_memory_kwargs, hparams=replay_memory_hparams).
- “exploration_type”: str or class or instance
  Exploration class. Can be a class, its name or module path, or a class instance. If a class name is given, the class must be from module texar.tf.core or texar.tf.custom. Ignored if an exploration is given to the agent constructor.
- “exploration_hparams”: dict, optional
  Hyperparameters for the exploration class. With the exploration_kwargs argument to the constructor, an exploration instance is created with exploration_class(**exploration_kwargs, hparams=exploration_hparams).
- “optimization”: dict
  Hyperparameters of optimization for updating the Q-net. See default_optimization_hparams() for details.
- “cold_start_steps”: int
  The number of initial steps during which the Q-net is not trained.
- “sample_batch_size”: int
  The number of samples drawn from the replay memory when training.
- “target_update_strategy”: str
  - If “copy”, the target network is assigned the parameters of the Q-net every “update_period” steps.
  - If “tau”, the target is updated by assigning (1 - 1/update_period) * target + 1/update_period * qnet.
- “update_period”: int
  Frequency of updating the target network, i.e., the target is updated once every “update_period” steps.
- “discount_factor”: float
  The discount factor of the reward.
- “name”: str
  Name of the agent.
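The two “target_update_strategy” options can be sketched on plain lists of floats standing in for network parameters. This is an illustration of the formulas stated above, not the library's code: update_target is a hypothetical helper, and how often the “tau” blend is applied during training is left to the caller here.

```python
def update_target(target, qnet, strategy="copy", update_period=100):
    """Return new target parameters (hypothetical helper on flat float lists)."""
    if strategy == "copy":
        # Hard update: the target becomes an exact copy of the Q-net.
        return list(qnet)
    if strategy == "tau":
        # Soft update: move the target a fraction 1/update_period toward the Q-net.
        tau = 1.0 / update_period
        return [(1.0 - tau) * t + tau * q for t, q in zip(target, qnet)]
    raise ValueError("Unknown target_update_strategy: %r" % (strategy,))


copied = update_target([0.0, 0.0], [1.0, 2.0], strategy="copy")
blended = update_target([0.0, 0.0], [1.0, 2.0], strategy="tau", update_period=4)
# copied == [1.0, 2.0]; blended == [0.25, 0.5]
```

The hard copy makes the target jump to the current Q-net values, while the blend changes it gradually, which is the usual trade-off between the two strategies.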
-
env_config
Environment configuration.
-
get_action(observ, feed_dict=None)
Gets an action according to the observation.
Parameters:
- observ – Observation from the environment.
Returns: The action from the policy.
-
hparams
A HParams instance. The hyperparameters of the module.
-
name
The name of the module (not uniquified).
-
observe(reward, terminal, train_policy=True, feed_dict=None)
Observes experience from the environment.
Parameters:
- reward – Reward of the action. The configuration (e.g., shape) of the reward is defined in env_config.
- terminal (bool) – Whether the episode is terminated.
- train_policy (bool) – Whether to update the policy for this step.
- feed_dict (dict, optional) – Any values to feed when running the training operator.
-
reset()
Resets the states to begin a new episode.
-
sess
The tf session.
-
variable_scope
The variable scope of the agent.
ActorCriticAgent
-
class texar.tf.agents.ActorCriticAgent(env_config, sess=None, actor=None, actor_kwargs=None, critic=None, critic_kwargs=None, hparams=None)
Actor-critic agent for episodic setting.
An actor-critic algorithm consists of several components:
- The actor is the policy to optimize. As a temporary implementation, by default a PGAgent instance is used, which wraps a policy net and provides the proper interfaces to perform the role of an actor.
- The critic provides learning signals to the actor. Again, as a temporary implementation, by default a DQNAgent instance is used, which wraps a Q net and provides the proper interfaces to perform the role of a critic.
Parameters:
- env_config – An instance of EnvConfig specifying action space, observation space, and reward range, etc. Use get_gym_env_config() to create an EnvConfig from a gym environment.
- sess (optional) – A tf session. Can be None here and set later with agent.sess = session.
- actor (optional) – An instance of PGAgent that acts as the actor in the algorithm. If not provided, an actor is created based on hparams.
- actor_kwargs (dict, optional) – Keyword arguments for the actor constructor. Note that the hparams argument for the actor constructor is specified in the “actor_hparams” field of hparams and should not be included in actor_kwargs. Ignored if actor is given.
- critic (optional) – An instance of DQNAgent that acts as the critic in the algorithm. If not provided, a critic is created based on hparams.
- critic_kwargs (dict, optional) – Keyword arguments for the critic constructor. Note that the hparams argument for the critic constructor is specified in the “critic_hparams” field of hparams and should not be included in critic_kwargs. Ignored if critic is given.
- hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
-
static default_hparams()
Returns a dictionary of hyperparameters with default values:
    {
        'actor_type': 'PGAgent',
        'actor_hparams': None,
        'critic_type': 'DQNAgent',
        'critic_hparams': None,
        'name': 'actor_critic_agent'
    }
Here:
- “actor_type”: str or class or instance
  Actor. Can be a class, its name or module path, or a class instance. If a class name is given, the class must be from module texar.tf.agents or texar.tf.custom. Ignored if an actor is given to the agent constructor.
- “actor_hparams”: dict, optional
  Hyperparameters for the actor class. With the actor_kwargs argument to the constructor, an actor is created with actor_class(**actor_kwargs, hparams=actor_hparams).
- “critic_type”: str or class or instance
  Critic. Can be a class, its name or module path, or a class instance. If a class name is given, the class must be from module texar.tf.agents or texar.tf.custom. Ignored if a critic is given to the agent constructor.
- “critic_hparams”: dict, optional
  Hyperparameters for the critic class. With the critic_kwargs argument to the constructor, a critic is created with critic_class(**critic_kwargs, hparams=critic_hparams).
- “name”: str
  Name of the agent.
-
get_action(observ, feed_dict=None)
Gets an action according to the observation.
Parameters:
- observ – Observation from the environment.
Returns: The action from the policy.
-
env_config
Environment configuration.
-
hparams
A HParams instance. The hyperparameters of the module.
-
name
The name of the module (not uniquified).
-
observe(reward, terminal, train_policy=True, feed_dict=None)
Observes experience from the environment.
Parameters:
- reward – Reward of the action. The configuration (e.g., shape) of the reward is defined in env_config.
- terminal (bool) – Whether the episode is terminated.
- train_policy (bool) – Whether to update the policy for this step.
- feed_dict (dict, optional) – Any values to feed when running the training operator.
-
reset()
Resets the states to begin a new episode.
-
sess
The tf session.
-
variable_scope
The variable scope of the agent.
Agent Utils
Space
-
class texar.tf.agents.Space(shape=None, low=None, high=None, dtype=None)
Observation and action spaces. Describes valid actions and observations. Similar to gym.Space.
Parameters:
- shape (optional) – Shape of the space, a tuple. If not given, it is inferred from low and high.
- low (optional) – Lower bound (inclusive) of each dimension of the space. Must have the shape specified by shape, and the same shape as high (if given). If None, set to -inf for each dimension.
- high (optional) – Upper bound (inclusive) of each dimension of the space. Must have the shape specified by shape, and the same shape as low (if given). If None, set to inf for each dimension.
- dtype (optional) – Data type of elements in the space. If not given, it is inferred from low (if given) or set to float.
Example
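    import numpy as np
    from texar.tf.agents import Space

    s = Space(low=0, high=10, dtype=np.int32)
    # s.contains(2) == True
    # s.contains(10) == True
    # s.contains(11) == False
    # s.shape == ()

    # np.float64 replaces the np.float alias, which newer NumPy releases removed
    s2 = Space(shape=(2, 2), high=np.ones([2, 2]), dtype=np.float64)
    # s2.low == [[-inf, -inf], [-inf, -inf]]
    # s2.high == [[1., 1.], [1., 1.]]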
-
shape
Shape of the space.
-
low
Lower bound of the space.
-
high
Upper bound of the space.
-
dtype
Data type of the element.