Agents

Sequence Agents

SeqPGAgent

class texar.agents.SeqPGAgent(samples, logits, sequence_length, trainable_variables=None, learning_rate=None, sess=None, hparams=None)

Policy Gradient agent for sequence prediction.

This is a wrapper of the training process that trains a model with policy gradient. The agent itself does not create new trainable variables.

Parameters:
  • samples – An int Tensor of shape [batch_size, max_time] containing sampled sequences from the model.
  • logits – A float Tensor of shape [batch_size, max_time, vocab_size] containing the logits of samples from the model.
  • sequence_length – A Tensor of shape [batch_size]. Time steps beyond the respective sequence lengths are masked out.
  • trainable_variables (optional) – Trainable variables of the model to update during training. If None, all trainable variables in the graph are used.
  • learning_rate (optional) – Learning rate for policy optimization. If not given, the learning rate is determined from hparams. See get_train_op() for more details.
  • sess (optional) – A tf session. Can be None here and set later with agent.sess = session.
  • hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.

Example

import tensorflow as tf
import texar as tx
from texar.agents import SeqPGAgent
from texar.modules import BasicRNNDecoder

# Train a decoder with policy gradient
decoder = BasicRNNDecoder(...)
outputs, _, sequence_length = decoder(
    decoding_strategy='infer_sample', ...)

sess = tf.Session()
agent = SeqPGAgent(
    samples=outputs.sample_id,
    logits=outputs.logits,
    sequence_length=sequence_length,
    sess=sess)
while training:
    # Generate samples
    vals = agent.get_samples()
    # Evaluate reward: one BLEU score per sampled sequence
    sample_text = tx.utils.map_ids_to_strs(vals['samples'], vocab)
    reward_bleu = []
    for y, y_ in zip(ground_truth, sample_text):
        reward_bleu.append(tx.evals.sentence_bleu(y, y_))
    # Update the policy with the observed rewards
    agent.observe(reward=reward_bleu)
static default_hparams()

Returns a dictionary of hyperparameters with default values:

{
    'discount_factor': 0.95,
    'normalize_reward': False,
    'entropy_weight': 0.,
    'loss': {
        'average_across_batch': True,
        'average_across_timesteps': False,
        'sum_over_batch': False,
        'sum_over_timesteps': True,
        'time_major': False
    },
    'optimization': default_optimization_hparams(),
    'name': 'pg_agent',
}

Here:

“discount_factor” : float
The discount factor of reward.
“normalize_reward” : bool
Whether to normalize the discounted reward, by (discounted_reward - mean) / std. Here mean and std are over all time steps and all samples in the batch.
“entropy_weight” : float
The weight of entropy loss of the sample distribution, to encourage maximizing the Shannon entropy. Set to 0 to disable the loss.
“loss” : dict
Extra keyword arguments for pg_loss_with_logits(), including the reduce arguments (e.g., average_across_batch) and time_major.
“optimization” : dict
Hyperparameters of optimization for updating the policy net. See default_optimization_hparams() for details.
“name” : str
Name of the agent.
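
As a rough illustration of how “discount_factor” and “normalize_reward” interact, the sketch below spreads one scalar reward per sequence back over its time steps and then standardizes the result over all valid steps in the batch. It is plain Python written for this page, not the library's implementation: the function names, the assumption that the reward belongs to the last valid step, and the small eps term are all illustrative.

```python
import math


def discount_rewards(rewards, sequence_lengths, discount=0.95):
    """Spreads one reward per sequence over its time steps.

    Assumes the reward belongs to the last valid step, so step t receives
    reward * discount ** (length - 1 - t). Padded steps get 0.
    """
    max_time = max(sequence_lengths)
    discounted = []
    for reward, length in zip(rewards, sequence_lengths):
        row = [
            reward * discount ** (length - 1 - t) if t < length else 0.0
            for t in range(max_time)
        ]
        discounted.append(row)
    return discounted


def normalize_rewards(discounted, sequence_lengths, eps=1e-8):
    """Applies (x - mean) / std over all valid steps of all samples."""
    valid = [x for row, n in zip(discounted, sequence_lengths) for x in row[:n]]
    mean = sum(valid) / len(valid)
    std = math.sqrt(sum((x - mean) ** 2 for x in valid) / len(valid))
    return [
        [(x - mean) / (std + eps) if t < n else 0.0 for t, x in enumerate(row)]
        for row, n in zip(discounted, sequence_lengths)
    ]


# Two sequences of lengths 3 and 2, padded to max_time = 3.
disc = discount_rewards([1.0, 2.0], [3, 2], discount=0.5)
# disc == [[0.25, 0.5, 1.0], [1.0, 2.0, 0.0]]
norm = normalize_rewards(disc, [3, 2])
```

Earlier steps receive a smaller share of the reward, and padded positions stay at zero so they contribute nothing to the loss.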
get_samples(extra_fetches=None, feed_dict=None)

Returns sequence samples and extra results.

Parameters:
  • extra_fetches (dict, optional) – Extra tensors to fetch values for, besides samples and sequence_length. Same as the fetches argument of tf.Session.run and tf.Session.partial_run.
  • feed_dict (dict, optional) – A dict that maps tensor to values. Note that all placeholder values used in get_samples() and subsequent observe() calls should be fed here.
Returns:

A dict with keys “samples” and “sequence_length” containing the fetched values of samples and sequence_length, as well as other fetched values as specified in extra_fetches.

Example

extra_fetches = {'truth_ids': data_batch['text_ids']}
# Pass extra_fetches so that 'truth_ids' is available in the returned dict
vals = agent.get_samples(extra_fetches=extra_fetches)
sample_text = tx.utils.map_ids_to_strs(vals['samples'], vocab)
truth_text = tx.utils.map_ids_to_strs(vals['truth_ids'], vocab)
reward = reward_fn_in_python(truth_text, sample_text)
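
The example leaves reward_fn_in_python unspecified. As a minimal sketch of what such a function could look like, the block below scores each sample by token-overlap F1 against its ground truth. The choice of F1 and the whitespace tokenization are assumptions made for illustration; any function that returns one reward per sample would fit the same slot.

```python
from collections import Counter


def reward_fn_in_python(truth_text, sample_text):
    """Returns one token-overlap F1 score per (truth, sample) pair.

    Hypothetical stand-in for a task-specific reward. Both inputs are
    equal-length sequences of whitespace-tokenizable strings.
    """
    rewards = []
    for truth, sample in zip(truth_text, sample_text):
        truth_tokens, sample_tokens = truth.split(), sample.split()
        # Multiset intersection counts each shared token at most as often
        # as it appears in both strings.
        overlap = sum((Counter(truth_tokens) & Counter(sample_tokens)).values())
        if overlap == 0:
            rewards.append(0.0)
            continue
        precision = overlap / len(sample_tokens)
        recall = overlap / len(truth_tokens)
        rewards.append(2 * precision * recall / (precision + recall))
    return rewards


rewards = reward_fn_in_python(["a b c d", "x y"], ["a b c d", "p q"])
# rewards == [1.0, 0.0]
```

The result is a Python list with one entry per sample, which matches the [batch_size] shape that observe() expects for its reward argument.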
observe(reward, train_policy=True, compute_loss=True)

Observes the reward, and updates the policy or computes loss accordingly.

Parameters:
  • reward – A Python array/list of shape [batch_size] containing the reward for the samples generated in the last call of get_samples().
  • train_policy (bool) – Whether to update the policy model according to the reward.
  • compute_loss (bool) – If train_policy is False, whether to compute the policy gradient loss (without updating the policy).
Returns:

If train_policy or compute_loss is True, returns the loss (a python float scalar). Otherwise returns None.

sess

The tf session.

pg_loss

The scalar tensor of policy gradient loss.

sequence_length

The tensor of sample sequence length, of shape [batch_size].

samples

The tensor of sequence samples.

hparams

A HParams instance. The hyperparameters of the module.

logits

The tensor of sequence logits.

name

The name of the module (not uniquified).

variable_scope

The variable scope of the agent.

Episodic Agents

EpisodicAgentBase

class texar.agents.EpisodicAgentBase(env_config, hparams=None)

Base class inherited by episodic RL agents.

An agent is a wrapper of the training process that trains a model with RL algorithms. The agent itself does not create new trainable variables.

An episodic RL agent typically provides three interfaces, namely reset(), get_action(), and observe(), and is used as in the following example.

Example

env = SomeEnvironment(...)
agent = PGAgent(...)

while True:
    # Starts one episode
    agent.reset()
    observ = env.reset()
    while True:
        action = agent.get_action(observ)
        next_observ, reward, terminal = env.step(action)
        agent.observe(reward, terminal)
        observ = next_observ
        if terminal:
            break
Parameters:
  • env_config – An instance of EnvConfig specifying action space, observation space, and reward range, etc. Use get_gym_env_config() to create an EnvConfig from a gym environment.
  • hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
static default_hparams()

Returns a dictionary of hyperparameters with default values.

{
    "name": "agent"
}
reset()

Resets the states to begin a new episode.

observe(reward, terminal, train_policy=True, feed_dict=None)

Observes experience from environment.

Parameters:
  • reward – Reward of the action. The configuration (e.g., shape) of the reward is defined in env_config.
  • terminal (bool) – Whether the episode is terminated.
  • train_policy (bool) – Whether to update the policy for this step.
  • feed_dict (dict, optional) – Any values to feed when running the training operator.
get_action(observ, feed_dict=None)

Gets action according to observation.

Parameters: observ – Observation from the environment.
Returns: Action from the policy.
env_config

Environment configuration.

hparams

A HParams instance. The hyperparameters of the module.

name

The name of the module (not uniquified).

variable_scope

The variable scope of the agent.
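
The reset() / get_action() / observe() contract described for EpisodicAgentBase can be exercised without any RL library. The sketch below is a self-contained toy: CountdownEnv and RandomAgent are hypothetical classes invented for this illustration, and the agent only records rewards where a real agent would update its policy.

```python
import random


class CountdownEnv:
    """Hypothetical toy environment: the episode ends after `horizon` steps."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0
        terminal = self.t >= self.horizon
        return self.t, reward, terminal


class RandomAgent:
    """Hypothetical agent exposing reset(), get_action() and observe()."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self.episode_rewards = []
        self.finished_episodes = 0

    def reset(self):
        # Clear per-episode state before a new episode starts.
        self.episode_rewards = []

    def get_action(self, observ):
        return self._rng.choice([0, 1])

    def observe(self, reward, terminal):
        self.episode_rewards.append(reward)
        if terminal:
            # A learning agent would update its policy here.
            self.finished_episodes += 1


env = CountdownEnv(horizon=5)
agent = RandomAgent(seed=0)

for _ in range(3):  # run three episodes
    agent.reset()
    observ = env.reset()
    while True:
        action = agent.get_action(observ)
        next_observ, reward, terminal = env.step(action)
        agent.observe(reward, terminal)
        observ = next_observ
        if terminal:
            break
```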

PGAgent

class texar.agents.PGAgent(env_config, sess=None, policy=None, policy_kwargs=None, policy_caller_kwargs=None, learning_rate=None, hparams=None)

Policy gradient agent for the episodic setting. This agent supports un-batched training, i.e., at each step it generates one action, receives one observation, and updates the policy.

The policy must take in an observation of shape [1] + observation_shape, where the first dimension 1 stands for batch dimension, and output a dict containing:

  • Key “action” whose value is a Tensor of shape [1] + action_shape containing a single action.

  • One of keys “log_prob” or “dist”:

    • “log_prob”: A Tensor of shape [1], the log probability of the “action”.
    • “dist”: A tf.distributions.Distribution with the log_prob interface, such that log_prob = dist.log_prob(outputs["action"]).
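
To make the output contract concrete, the sketch below mimics it with plain Python lists in place of Tensors. The function toy_categorical_policy and its linear weights are hypothetical. The point is only the shape of the result: a batch of one observation goes in, and a dict with "action" and "log_prob" entries of length 1 comes out.

```python
import math
import random


def toy_categorical_policy(inputs, weights, rng):
    """Maps a batch of one observation to {"action", "log_prob"}.

    inputs:  nested list of shape [1, obs_dim] (batch dimension of 1).
    weights: nested list of shape [num_actions, obs_dim]; hypothetical
             linear scoring parameters.
    """
    observation = inputs[0]
    logits = [sum(w * x for w, x in zip(row, observation)) for row in weights]
    # Numerically stable log-softmax over the action logits.
    max_logit = max(logits)
    log_norm = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    log_probs = [l - log_norm for l in logits]
    probs = [math.exp(lp) for lp in log_probs]
    # Sample one action index according to the action probabilities.
    action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return {"action": [action], "log_prob": [log_probs[action]]}


outputs = toy_categorical_policy(
    inputs=[[1.0, 0.0]],
    weights=[[0.5, 0.0], [0.5, 0.0], [-1.0, 0.0]],
    rng=random.Random(0),
)
```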
Parameters:
  • env_config – An instance of EnvConfig specifying action space, observation space, and reward range, etc. Use get_gym_env_config() to create an EnvConfig from a gym environment.
  • sess (optional) – A tf session. Can be None here and set later with agent.sess = session.
  • policy (optional) – A policy net that takes in observation and outputs actions and probabilities. If not given, a policy network is created based on hparams.
  • policy_kwargs (dict, optional) – Keyword arguments for policy constructor. Note that the hparams argument for network constructor is specified in the “policy_hparams” field of hparams and should not be included in policy_kwargs. Ignored if policy is given.
  • policy_caller_kwargs (dict, optional) – Keyword arguments for calling the policy to get actions. The policy is called with outputs=policy(inputs=observation, **policy_caller_kwargs).
  • learning_rate (optional) – Learning rate for policy optimization. If not given, the learning rate is determined from hparams. See get_train_op() for more details.
  • hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
static default_hparams()

Returns a dictionary of hyperparameters with default values:

{
    'policy_type': 'CategoricalPolicyNet',
    'policy_hparams': None,
    'discount_factor': 0.95,
    'normalize_reward': False,
    'optimization': default_optimization_hparams(),
    'name': 'pg_agent',
}

Here:

“policy_type” : str or class or instance
Policy net. Can be class, its name or module path, or a class instance. If class name is given, the class must be from module texar.modules or texar.custom. Ignored if a policy is given to the agent constructor.
“policy_hparams” : dict, optional
Hyperparameters for the policy net. With the policy_kwargs argument to the constructor, a network is created with policy_class(**policy_kwargs, hparams=policy_hparams).
“discount_factor” : float
The discount factor of reward.
“normalize_reward” : bool
Whether to normalize the discounted reward, by (discounted_reward - mean) / std.
“optimization” : dict
Hyperparameters of optimization for updating the policy net. See default_optimization_hparams() for details.
“name” : str
Name of the agent.
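
The rule that missing hyperparameters fall back to their defaults can be pictured as a recursive dictionary merge. The helper below, merge_with_defaults, is a hypothetical sketch of that idea and not the HParams class itself, which may apply additional checks. The defaults shown are a subset of the dictionary listed above.

```python
import copy


def merge_with_defaults(hparams, defaults):
    """Returns a copy of defaults overridden by the entries in hparams."""
    merged = copy.deepcopy(defaults)
    for key, value in (hparams or {}).items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested dictionaries key by key instead of replacing them.
            merged[key] = merge_with_defaults(value, merged[key])
        else:
            merged[key] = value
    return merged


# Subset of the PGAgent defaults; "optimization" is left out because its
# contents come from default_optimization_hparams().
pg_defaults = {
    "policy_type": "CategoricalPolicyNet",
    "policy_hparams": None,
    "discount_factor": 0.95,
    "normalize_reward": False,
    "name": "pg_agent",
}

user_hparams = {"discount_factor": 0.99, "normalize_reward": True}
effective = merge_with_defaults(user_hparams, pg_defaults)
```

Only the two overridden fields change in effective; every other key keeps its default, and pg_defaults itself is left untouched.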
sess

The tf session.

env_config

Environment configuration.

get_action(observ, feed_dict=None)

Gets action according to observation.

Parameters: observ – Observation from the environment.
Returns: Action from the policy.
hparams

A HParams instance. The hyperparameters of the module.

name

The name of the module (not uniquified).

observe(reward, terminal, train_policy=True, feed_dict=None)

Observes experience from environment.

Parameters:
  • reward – Reward of the action. The configuration (e.g., shape) of the reward is defined in env_config.
  • terminal (bool) – Whether the episode is terminated.
  • train_policy (bool) – Whether to update the policy for this step.
  • feed_dict (dict, optional) – Any values to feed when running the training operator.
policy

The policy model.

reset()

Resets the states to begin a new episode.

variable_scope

The variable scope of the agent.

DQNAgent

class texar.agents.DQNAgent(env_config, sess=None, qnet=None, target=None, qnet_kwargs=None, qnet_caller_kwargs=None, replay_memory=None, replay_memory_kwargs=None, exploration=None, exploration_kwargs=None, hparams=None)

Deep Q learning agent for episodic setting.

A Q learning algorithm consists of several components:

  • A Q-net takes in a state and returns Q-values for action sampling. See CategoricalQNet for an example Q-net class and required interface.
  • A replay memory manages past experience for Q-net updates. See DequeReplayMemory for an example replay memory class and required interface.
  • An exploration specifies the exploration strategy used to train the Q-net. See EpsilonLinearDecayExploration for an example class and required interface.
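
The roles of the replay memory and the exploration strategy can be sketched in plain Python. ToyReplayMemory, linear_epsilon, and epsilon_greedy below are hypothetical names for illustration only. They are not the interfaces of DequeReplayMemory or EpsilonLinearDecayExploration, and their parameter names are assumptions.

```python
import random
from collections import deque


class ToyReplayMemory:
    """Hypothetical fixed-capacity memory; the oldest experience is dropped first."""

    def __init__(self, capacity):
        self._buffer = deque(maxlen=capacity)

    def add(self, experience):
        self._buffer.append(experience)

    def sample(self, batch_size, rng):
        # Sample without replacement, capped at the number of stored items.
        return rng.sample(list(self._buffer), min(batch_size, len(self._buffer)))

    def __len__(self):
        return len(self._buffer)


def linear_epsilon(step, start_step, end_step, initial_eps, final_eps):
    """Decays epsilon linearly from initial_eps to final_eps between two steps."""
    if step <= start_step:
        return initial_eps
    if step >= end_step:
        return final_eps
    fraction = (step - start_step) / (end_step - start_step)
    return initial_eps + fraction * (final_eps - initial_eps)


def epsilon_greedy(q_values, epsilon, rng):
    """Picks a random action with probability epsilon, else the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


rng = random.Random(0)
memory = ToyReplayMemory(capacity=3)
for t in range(5):
    memory.add({"observ": t, "action": 0, "reward": 1.0, "terminal": t == 4})

batch = memory.sample(batch_size=2, rng=rng)
eps_mid = linear_epsilon(step=50, start_step=0, end_step=100,
                         initial_eps=1.0, final_eps=0.0)
greedy_action = epsilon_greedy([0.1, 0.7, 0.2], epsilon=0.0, rng=rng)
```

With a capacity of 3, only the three most recent experiences survive, so every sampled batch is drawn from recent steps.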
Parameters:
  • env_config – An instance of EnvConfig specifying action space, observation space, and reward range, etc. Use get_gym_env_config() to create an EnvConfig from a gym environment.
  • sess (optional) – A tf session. Can be None here and set later with agent.sess = session.
  • qnet (optional) – A Q network that predicts Q values given states. If not given, a Q network is created based on hparams.
  • target (optional) – A target network to compute target Q values.
  • qnet_kwargs (dict, optional) – Keyword arguments for qnet constructor. Note that the hparams argument for network constructor is specified in the “qnet_hparams” field of hparams and should not be included in qnet_kwargs. Ignored if qnet is given.
  • qnet_caller_kwargs (dict, optional) – Keyword arguments for calling qnet to get Q values. The qnet is called with outputs=qnet(inputs=observation, **qnet_caller_kwargs).
  • replay_memory (optional) – A replay memory instance. If not given, a replay memory is created based on hparams.
  • replay_memory_kwargs (dict, optional) – Keyword arguments for replay_memory constructor. Ignored if replay_memory is given.
  • exploration (optional) – An exploration instance used in the algorithm. If not given, an exploration instance is created based on hparams.
  • exploration_kwargs (dict, optional) – Keyword arguments for exploration class constructor. Ignored if exploration is given.
  • hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
static default_hparams()

Returns a dictionary of hyperparameters with default values:

{
    'qnet_type': 'CategoricalQNet',
    'qnet_hparams': None,
    'replay_memory_type': 'DequeReplayMemory',
    'replay_memory_hparams': None,
    'exploration_type': 'EpsilonLinearDecayExploration',
    'exploration_hparams': None,
    'optimization': opt.default_optimization_hparams(),
    'target_update_strategy': 'copy',
    'cold_start_steps': 100,
    'sample_batch_size': 32,
    'update_period': 100,
    'discount_factor': 0.95,
    'name': 'dqn_agent'
}

Here:

“qnet_type” : str or class or instance
Q-value net. Can be class, its name or module path, or a class instance. If class name is given, the class must be from module texar.modules or texar.custom. Ignored if a qnet is given to the agent constructor.
“qnet_hparams” : dict, optional
Hyperparameters for the Q net. With the qnet_kwargs argument to the constructor, a network is created with qnet_class(**qnet_kwargs, hparams=qnet_hparams).
“replay_memory_type” : str or class or instance
Replay memory class. Can be class, its name or module path, or a class instance. If class name is given, the class must be from module texar.core or texar.custom. Ignored if a replay_memory is given to the agent constructor.
“replay_memory_hparams” : dict, optional
Hyperparameters for the replay memory. With the replay_memory_kwargs argument to the constructor, a replay memory is created with replay_memory_class(**replay_memory_kwargs, hparams=replay_memory_hparams).
“exploration_type” : str or class or instance
Exploration class. Can be class, its name or module path, or a class instance. If class name is given, the class must be from module texar.core or texar.custom. Ignored if an exploration is given to the agent constructor.
“exploration_hparams” : dict, optional
Hyperparameters for the exploration class. With the exploration_kwargs argument to the constructor, an exploration instance is created with exploration_class(**exploration_kwargs, hparams=exploration_hparams).
“optimization” : dict
Hyperparameters of optimization for updating the Q-net. See default_optimization_hparams() for details.
“cold_start_steps” : int
The number of initial steps during which the Q-net is not trained.
“sample_batch_size” : int
The number of samples drawn from the replay memory when training.
“target_update_strategy” : str
How the target network is updated:

  • If “copy”, the target network is assigned the parameters of the Q-net every “update_period” steps.
  • If “tau”, the target network is updated as target = (1 - 1/update_period) * target + 1/update_period * qnet.

“update_period” : int
Frequency of updating the target network, i.e., the target is updated once every “update_period” steps.
“discount_factor” : float
The discount factor of reward.
“name” : str
Name of the agent.
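
The two values of “target_update_strategy” can be compared with a small sketch that treats network parameters as lists of floats. The function update_target is hypothetical. In particular, applying the “tau” blend on every call is an assumption of this sketch; the description above only gives the blending formula.

```python
def update_target(target, qnet, step, strategy="copy", update_period=100):
    """Returns the new target parameters (lists of floats) after one step.

    "copy": overwrite the target with the Q-net every update_period steps.
    "tau":  blend with weight 1/update_period. Blending on every call is an
            assumption of this sketch.
    """
    if strategy == "copy":
        if step % update_period == 0:
            return list(qnet)
        return list(target)
    if strategy == "tau":
        tau = 1.0 / update_period
        return [(1.0 - tau) * t + tau * q for t, q in zip(target, qnet)]
    raise ValueError("Unknown strategy: {}".format(strategy))


copied = update_target([0.0, 0.0], [1.0, 2.0], step=100, strategy="copy")
kept = update_target([0.0, 0.0], [1.0, 2.0], step=101, strategy="copy")
blended = update_target([0.0, 0.0], [1.0, 2.0], step=1, strategy="tau",
                        update_period=4)
# copied == [1.0, 2.0], kept == [0.0, 0.0], blended == [0.25, 0.5]
```

A larger update_period makes “copy” refresh the target less often and makes “tau” move the target more slowly toward the Q-net.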
env_config

Environment configuration.

get_action(observ, feed_dict=None)

Gets action according to observation.

Parameters: observ – Observation from the environment.
Returns: Action from the policy.
hparams

A HParams instance. The hyperparameters of the module.

name

The name of the module (not uniquified).

observe(reward, terminal, train_policy=True, feed_dict=None)

Observes experience from environment.

Parameters:
  • reward – Reward of the action. The configuration (e.g., shape) of the reward is defined in env_config.
  • terminal (bool) – Whether the episode is terminated.
  • train_policy (bool) – Whether to update the policy for this step.
  • feed_dict (dict, optional) – Any values to feed when running the training operator.
reset()

Resets the states to begin a new episode.

sess

The tf session.

variable_scope

The variable scope of the agent.

ActorCriticAgent

class texar.agents.ActorCriticAgent(env_config, sess=None, actor=None, actor_kwargs=None, critic=None, critic_kwargs=None, hparams=None)

Actor-critic agent for episodic setting.

An actor-critic algorithm consists of several components:

  • Actor is the policy to optimize. As a temporary implementation, here by default we use a PGAgent instance that wraps a policy net and provides proper interfaces to perform the role of an actor.
  • Critic provides learning signals to the actor. Again, as a temporary implementation, here by default we use a DQNAgent instance that wraps a Q net and provides proper interfaces to perform the role of a critic.
Parameters:
  • env_config – An instance of EnvConfig specifying action space, observation space, and reward range, etc. Use get_gym_env_config() to create an EnvConfig from a gym environment.
  • sess (optional) – A tf session. Can be None here and set later with agent.sess = session.
  • actor (optional) – An instance of PGAgent that performs as actor in the algorithm. If not provided, an actor is created based on hparams.
  • actor_kwargs (dict, optional) – Keyword arguments for actor constructor. Note that the hparams argument for actor constructor is specified in the “actor_hparams” field of hparams and should not be included in actor_kwargs. Ignored if actor is given.
  • critic (optional) – An instance of DQNAgent that performs as critic in the algorithm. If not provided, a critic is created based on hparams.
  • critic_kwargs (dict, optional) – Keyword arguments for critic constructor. Note that the hparams argument for critic constructor is specified in the “critic_hparams” field of hparams and should not be included in critic_kwargs. Ignored if critic is given.
  • hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
static default_hparams()

Returns a dictionary of hyperparameters with default values:

{
    'actor_type': 'PGAgent',
    'actor_hparams': None,
    'critic_type': 'DQNAgent',
    'critic_hparams': None,
    'name': 'actor_critic_agent'
}

Here:

“actor_type” : str or class or instance
Actor. Can be class, its name or module path, or a class instance. If class name is given, the class must be from module texar.agents or texar.custom. Ignored if an actor is given to the agent constructor.
“actor_hparams” : dict, optional
Hyperparameters for the actor class. With the actor_kwargs argument to the constructor, an actor is created with actor_class(**actor_kwargs, hparams=actor_hparams).
“critic_type” : str or class or instance
Critic. Can be class, its name or module path, or a class instance. If class name is given, the class must be from module texar.agents or texar.custom. Ignored if a critic is given to the agent constructor.
“critic_hparams” : dict, optional
Hyperparameters for the critic class. With the critic_kwargs argument to the constructor, a critic is created with critic_class(**critic_kwargs, hparams=critic_hparams).
“name” : str
Name of the agent.
get_action(observ, feed_dict=None)

Gets action according to observation.

Parameters: observ – Observation from the environment.
Returns: Action from the policy.
env_config

Environment configuration.

hparams

A HParams instance. The hyperparameters of the module.

name

The name of the module (not uniquified).

observe(reward, terminal, train_policy=True, feed_dict=None)

Observes experience from environment.

Parameters:
  • reward – Reward of the action. The configuration (e.g., shape) of the reward is defined in env_config.
  • terminal (bool) – Whether the episode is terminated.
  • train_policy (bool) – Whether to update the policy for this step.
  • feed_dict (dict, optional) – Any values to feed when running the training operator.
reset()

Resets the states to begin a new episode.

sess

The tf session.

variable_scope

The variable scope of the agent.

Agent Utils

Space

class texar.agents.Space(shape=None, low=None, high=None, dtype=None)

Observation and action spaces. Describes valid actions and observations. Similar to gym.Space.

Parameters:
  • shape (optional) – Shape of the space, a tuple. If not given, infers from low and high.
  • low (optional) – Lower bound (inclusive) of each dimension of the space. Must have the shape specified by shape, and the same shape as high (if given). If None, set to -inf for each dimension.
  • high (optional) – Upper bound (inclusive) of each dimension of the space. Must have the shape specified by shape, and the same shape as low (if given). If None, set to inf for each dimension.
  • dtype (optional) – Data type of elements in the space. If not given, infers from low (if given) or set to float.

Example

s = Space(low=0, high=10, dtype=np.int32)
# s.contains(2) == True
# s.contains(10) == True
# s.contains(11) == False
# s.shape == ()

# np.float64 replaces the np.float alias, which newer NumPy releases removed
s2 = Space(shape=(2, 2), high=np.ones([2, 2]), dtype=np.float64)
# s2.low == [[-inf, -inf], [-inf, -inf]]
# s2.high == [[1., 1.], [1., 1.]]
contains(x)

Checks if x is contained in the space. Returns a bool.
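
The inclusive-bound check can be pictured with a scalar-only sketch. ScalarSpace below is a hypothetical stand-in written for this page, not the Space class: it ignores dtype and array-shaped bounds, and only shows how missing bounds default to -inf and inf.

```python
import math


class ScalarSpace:
    """Hypothetical scalar-only analogue of Space with inclusive bounds."""

    def __init__(self, low=None, high=None):
        # Missing bounds default to -inf / inf.
        self.low = -math.inf if low is None else low
        self.high = math.inf if high is None else high
        self.shape = ()

    def contains(self, x):
        """Returns True if low <= x <= high."""
        return self.low <= x <= self.high


s = ScalarSpace(low=0, high=10)
unbounded_below = ScalarSpace(high=1.0)
```

Here s.contains(10) is True because the upper bound is inclusive, while unbounded_below accepts any value up to and including 1.0.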

shape

Shape of the space.

low

Lower bound of the space.

high

Upper bound of the space.

dtype

Data type of the element.

EnvConfig

class texar.agents.EnvConfig(action_space, observ_space, reward_range)

Configurations of an environment.

Parameters:
  • action_space – An instance of Space or gym.Space, the action space.
  • observ_space – An instance of Space or gym.Space, the observation space.
  • reward_range – A tuple corresponding to the min and max possible rewards, e.g., reward_range=(-1.0, 1.0).

convert_gym_space

get_gym_env_config