Agents¶
Sequence Agents¶
SeqPGAgent¶
class texar.tf.agents.SeqPGAgent(samples, logits, sequence_length, trainable_variables=None, learning_rate=None, sess=None, hparams=None)[source]¶

Policy Gradient agent for sequence prediction.

This is a wrapper of the training process that trains a model with policy gradient. The agent itself does not create new trainable variables.
Parameters:
- samples – An int Tensor of shape [batch_size, max_time] containing sampled sequences from the model.
- logits – A float Tensor of shape [batch_size, max_time, vocab_size] containing the logits of samples from the model.
- sequence_length – A Tensor of shape [batch_size]. Time steps beyond the respective sequence lengths are masked out.
- trainable_variables (optional) – Trainable variables of the model to update during training. If None, all trainable variables in the graph are used.
- learning_rate (optional) – Learning rate for policy optimization. If not given, the learning rate is determined from hparams. See get_train_op() for more details.
- sess (optional) – A tf session. Can be None here and set later with agent.sess = session.
- hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
Example
## Train a decoder with policy gradient
decoder = BasicRNNDecoder(...)
outputs, _, sequence_length = decoder(
    decoding_strategy='infer_sample', ...)

sess = tf.Session()
agent = SeqPGAgent(
    samples=outputs.sample_id,
    logits=outputs.logits,
    sequence_length=sequence_length,
    sess=sess)

while training:
    # Generate samples
    vals = agent.get_samples()
    # Evaluate reward
    sample_text = tx.utils.map_ids_to_strs(vals['samples'], vocab)
    reward_bleu = []
    for y, y_ in zip(ground_truth, sample_text):
        reward_bleu.append(tx.evals.sentence_bleu(y, y_))
    # Update
    agent.observe(reward=reward_bleu)
static default_hparams()[source]¶

Returns a dictionary of hyperparameters with default values:
{
    'discount_factor': 0.95,
    'normalize_reward': False,
    'entropy_weight': 0.,
    'loss': {
        'average_across_batch': True,
        'average_across_timesteps': False,
        'sum_over_batch': False,
        'sum_over_timesteps': True,
        'time_major': False
    },
    'optimization': default_optimization_hparams(),
    'name': 'pg_agent',
}
Here:

- “discount_factor”: float
  The discount factor of reward.
- “normalize_reward”: bool
  Whether to normalize the discounted reward, by (discounted_reward - mean) / std. Here mean and std are over all time steps and all samples in the batch.
- “entropy_weight”: float
  The weight of the entropy loss on the sample distribution, to encourage maximizing the Shannon entropy. Set to 0 to disable the loss.
- “loss”: dict
  Extra keyword arguments for pg_loss_with_logits(), including the reduce arguments (e.g., average_across_batch) and time_major.
- “optimization”: dict
  Hyperparameters of optimization for updating the policy net. See default_optimization_hparams() for details.
- “name”: str
  Name of the agent.
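For illustration, the following sketch overrides a few of these defaults when constructing the agent. It assumes the decoder outputs (outputs, sequence_length) from the class-level example above; the specific values are illustrative, not recommended settings.

# Sketch: overriding selected default hyperparameters of SeqPGAgent.
hparams = {
    'discount_factor': 1.0,   # no per-step discounting within a sequence
    'entropy_weight': 0.01,   # small entropy bonus on the sample distribution
    'loss': {
        'average_across_timesteps': True,
        'sum_over_timesteps': False,
    },
}
agent = SeqPGAgent(
    samples=outputs.sample_id,
    logits=outputs.logits,
    sequence_length=sequence_length,
    hparams=hparams)   # unspecified keys keep their default values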
get_samples(extra_fetches=None, feed_dict=None)[source]¶

Returns sequence samples and extra results.
Parameters:
- extra_fetches (dict, optional) – Extra tensors to fetch values for, besides samples and sequence_length. Same as the fetches argument of tf.Session.run and tf.Session.partial_run.
- feed_dict (dict, optional) – A dict that maps tensors to values. Note that all placeholder values used in get_samples() and subsequent observe() calls should be fed here.

Returns:
A dict with keys “samples” and “sequence_length” containing the fetched values of samples and sequence_length, as well as other fetched values as specified in extra_fetches.

Example
extra_fetches = {'truth_ids': data_batch['text_ids']}
vals = agent.get_samples(extra_fetches=extra_fetches)
sample_text = tx.utils.map_ids_to_strs(vals['samples'], vocab)
truth_text = tx.utils.map_ids_to_strs(vals['truth_ids'], vocab)
reward = reward_fn_in_python(truth_text, sample_text)
observe(reward, train_policy=True, compute_loss=True)[source]¶

Observes the reward, and updates the policy or computes the loss accordingly.
Parameters:
- reward – A Python array/list of shape [batch_size] containing the reward for the samples generated in the last call of get_samples().
- train_policy (bool) – Whether to update the policy model according to the reward.
- compute_loss (bool) – If train_policy is False, whether to compute the policy gradient loss (without updating the policy).

Returns:
If train_policy or compute_loss is True, returns the loss (a Python float scalar). Otherwise returns None.
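Since observe() returns the loss even when the policy is not updated, it can also be used to monitor the policy-gradient loss on held-out samples. A minimal sketch, assuming eval_rewards is a hypothetical list of per-sample rewards computed for the batch returned by the preceding get_samples() call:

# Compute the PG loss for monitoring only; the policy is not updated.
eval_loss = agent.observe(
    reward=eval_rewards, train_policy=False, compute_loss=True)
print('held-out policy gradient loss: %.4f' % eval_loss)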
sess¶
The tf session.

pg_loss¶
The scalar tensor of policy gradient loss.

sequence_length¶
The tensor of sample sequence length, of shape [batch_size].

samples¶
The tensor of sequence samples.

hparams¶
A HParams instance. The hyperparameters of the module.

logits¶
The tensor of sequence logits.

name¶
The name of the module (not uniquified).

variable_scope¶
The variable scope of the agent.
Episodic Agents¶
EpisodicAgentBase¶
class texar.tf.agents.EpisodicAgentBase(env_config, hparams=None)[source]¶

Base class inherited by episodic RL agents.

An agent is a wrapper of the training process that trains a model with RL algorithms. The agent itself does not create new trainable variables.
An episodic RL agent typically provides three interfaces, namely reset(), get_action() and observe(), and is used as in the following example.

Example
env = SomeEnvironment(...)
agent = PGAgent(...)

while True:
    # Starts one episode
    agent.reset()
    observ = env.reset()
    while True:
        action = agent.get_action(observ)
        next_observ, reward, terminal = env.step(action)
        agent.observe(reward, terminal)
        observ = next_observ
        if terminal:
            break
Parameters:
- env_config – An instance of EnvConfig specifying the action space, observation space, reward range, etc. Use get_gym_env_config() to create an EnvConfig from a gym environment.
- hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
static default_hparams()[source]¶

Returns a dictionary of hyperparameters with default values.
{ "name": "agent" }
observe(reward, terminal, train_policy=True, feed_dict=None)[source]¶

Observes experience from the environment.
Parameters:
- reward – Reward of the action. The configuration (e.g., shape) of the reward is defined in env_config.
- terminal (bool) – Whether the episode is terminated.
- train_policy (bool) – Whether to update the policy for this step.
- feed_dict (dict, optional) – Any values fed when running the training operator.
get_action(observ, feed_dict=None)[source]¶

Gets action according to observation.

Parameters: observ – Observation from the environment.

Returns: Action from the policy.
env_config¶
Environment configuration.

hparams¶
A HParams instance. The hyperparameters of the module.

name¶
The name of the module (not uniquified).

variable_scope¶
The variable scope of the agent.
PGAgent¶
class texar.tf.agents.PGAgent(env_config, sess=None, policy=None, policy_kwargs=None, policy_caller_kwargs=None, learning_rate=None, hparams=None)[source]¶

Policy gradient agent for the episodic setting. This agent supports un-batched training, i.e., each step it generates one action, takes one observation, and updates the policy.
The policy must take in an observation of shape [1] + observation_shape, where the first dimension 1 is the batch dimension, and output a dict containing:

- Key “action” whose value is a Tensor of shape [1] + action_shape containing a single action.
- One of the keys “log_prob” or “dist”:
  - “log_prob”: A Tensor of shape [1], the log probability of the “action”.
  - “dist”: A tf.distributions.Distribution with the log_prob interface, such that log_prob = dist.log_prob(outputs[“action”]).
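For illustration, below is a minimal sketch of a policy callable satisfying this contract. It assumes a flat observation and a discrete action space of a hypothetical size num_actions; it is not part of the library.

import tensorflow as tf

def toy_policy(inputs):
    # `inputs` has shape [1] + observation_shape (a flat observation here).
    num_actions = 4                                # assumed, for illustration
    logits = tf.layers.dense(inputs, num_actions)  # shape [1, num_actions]
    dist = tf.distributions.Categorical(logits=logits)
    action = dist.sample()                         # shape [1]
    # Returning "dist" alone is sufficient; "log_prob" is included for clarity.
    return {'action': action,
            'dist': dist,
            'log_prob': dist.log_prob(action)}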
Parameters:
- env_config – An instance of EnvConfig specifying the action space, observation space, reward range, etc. Use get_gym_env_config() to create an EnvConfig from a gym environment.
- sess (optional) – A tf session. Can be None here and set later with agent.sess = session.
- policy (optional) – A policy net that takes in observations and outputs actions and probabilities. If not given, a policy network is created based on hparams.
- policy_kwargs (dict, optional) – Keyword arguments for the policy constructor. Note that the hparams argument for the network constructor is specified in the “policy_hparams” field of hparams and should not be included in policy_kwargs. Ignored if policy is given.
- policy_caller_kwargs (dict, optional) – Keyword arguments for calling the policy to get actions. The policy is called with outputs=policy(inputs=observation, **policy_caller_kwargs).
- learning_rate (optional) – Learning rate for policy optimization. If not given, the learning rate is determined from hparams. See get_train_op() for more details.
- hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
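As a usage sketch, the following constructs a PGAgent for a gym environment and then runs the episodic loop shown under EpisodicAgentBase. The environment name and learning rate are illustrative.

import gym
import tensorflow as tf
import texar.tf as tx

env = gym.make('CartPole-v0')                     # illustrative environment
env_config = tx.agents.get_gym_env_config(env)
agent = tx.agents.PGAgent(env_config, learning_rate=1e-3)

with tf.Session() as sess:
    agent.sess = sess
    sess.run(tf.global_variables_initializer())
    # ... run the reset / get_action / observe loop shown above ...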
static default_hparams()[source]¶

Returns a dictionary of hyperparameters with default values:
{
    'policy_type': 'CategoricalPolicyNet',
    'policy_hparams': None,
    'discount_factor': 0.95,
    'normalize_reward': False,
    'optimization': default_optimization_hparams(),
    'name': 'pg_agent',
}
Here:

- “policy_type”: str or class or instance
  The policy net. Can be a class, its name or module path, or a class instance. If a class name is given, the class must be from module texar.tf.modules or texar.tf.custom. Ignored if a policy is given to the agent constructor.
- “policy_hparams”: dict, optional
  Hyperparameters for the policy net. With the policy_kwargs argument to the constructor, a network is created with policy_class(**policy_kwargs, hparams=policy_hparams).
- “discount_factor”: float
  The discount factor of reward.
- “normalize_reward”: bool
  Whether to normalize the discounted reward, by (discounted_reward - mean) / std.
- “optimization”: dict
  Hyperparameters of optimization for updating the policy net. See default_optimization_hparams() for details.
- “name”: str
  Name of the agent.
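As a sketch, the policy net can also be chosen and tuned entirely through hparams rather than by passing a policy instance; the values below are illustrative overrides of the defaults.

agent_hparams = {
    'policy_type': 'CategoricalPolicyNet',  # the default policy class
    'discount_factor': 0.99,
    'normalize_reward': True,
}
# Unspecified keys (e.g., 'optimization') keep their default values.
agent = tx.agents.PGAgent(env_config, hparams=agent_hparams)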
sess¶
The tf session.

env_config¶
Environment configuration.

get_action(observ, feed_dict=None)¶

Gets action according to observation.

Parameters: observ – Observation from the environment.

Returns: Action from the policy.
hparams¶
A HParams instance. The hyperparameters of the module.

name¶
The name of the module (not uniquified).
observe(reward, terminal, train_policy=True, feed_dict=None)¶

Observes experience from the environment.

Parameters:
- reward – Reward of the action. The configuration (e.g., shape) of the reward is defined in env_config.
- terminal (bool) – Whether the episode is terminated.
- train_policy (bool) – Whether to update the policy for this step.
- feed_dict (dict, optional) – Any values fed when running the training operator.
policy¶
The policy model.

reset()¶
Resets the states to begin a new episode.

variable_scope¶
The variable scope of the agent.
DQNAgent¶
class texar.tf.agents.DQNAgent(env_config, sess=None, qnet=None, target=None, qnet_kwargs=None, qnet_caller_kwargs=None, replay_memory=None, replay_memory_kwargs=None, exploration=None, exploration_kwargs=None, hparams=None)[source]¶

Deep Q learning agent for the episodic setting.
A Q learning algorithm consists of several components:

- A Q-net that takes in a state and returns Q-values for action sampling. See CategoricalQNet for an example Q-net class and required interface.
- A replay memory that manages past experience for Q-net updates. See DequeReplayMemory for an example replay memory class and required interface.
- An exploration that specifies the exploration strategy used to train the Q-net. See EpsilonLinearDecayExploration for an example class and required interface.
Parameters:
- env_config – An instance of EnvConfig specifying the action space, observation space, reward range, etc. Use get_gym_env_config() to create an EnvConfig from a gym environment.
- sess (optional) – A tf session. Can be None here and set later with agent.sess = session.
- qnet (optional) – A Q network that predicts Q values given states. If not given, a Q network is created based on hparams.
- target (optional) – A target network to compute target Q values.
- qnet_kwargs (dict, optional) – Keyword arguments for the qnet constructor. Note that the hparams argument for the network constructor is specified in the “qnet_hparams” field of hparams and should not be included in qnet_kwargs. Ignored if qnet is given.
- qnet_caller_kwargs (dict, optional) – Keyword arguments for calling the qnet to get Q values. The qnet is called with outputs=qnet(inputs=observation, **qnet_caller_kwargs).
- replay_memory (optional) – A replay memory instance. If not given, a replay memory is created based on hparams.
- replay_memory_kwargs (dict, optional) – Keyword arguments for the replay_memory constructor. Ignored if replay_memory is given.
- exploration (optional) – An exploration instance used in the algorithm. If not given, an exploration instance is created based on hparams.
- exploration_kwargs (dict, optional) – Keyword arguments for the exploration class constructor. Ignored if exploration is given.
- hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
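As a usage sketch, the following constructs a DQNAgent with the default Q-net, replay memory, and exploration components while overriding a few scheduling hyperparameters; the environment and values are illustrative.

import gym
import tensorflow as tf
import texar.tf as tx

env = gym.make('CartPole-v0')                   # illustrative environment
env_config = tx.agents.get_gym_env_config(env)
agent = tx.agents.DQNAgent(
    env_config,
    hparams={
        'cold_start_steps': 500,   # collect experience before training starts
        'sample_batch_size': 64,   # minibatch size drawn from replay memory
        'update_period': 200,      # sync the target net every 200 steps
    })

with tf.Session() as sess:
    agent.sess = sess
    sess.run(tf.global_variables_initializer())
    # ... run the reset / get_action / observe loop shown under EpisodicAgentBase ...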
static default_hparams()[source]¶

Returns a dictionary of hyperparameters with default values:
{
    'qnet_type': 'CategoricalQNet',
    'qnet_hparams': None,
    'replay_memory_type': 'DequeReplayMemory',
    'replay_memory_hparams': None,
    'exploration_type': 'EpsilonLinearDecayExploration',
    'exploration_hparams': None,
    'optimization': opt.default_optimization_hparams(),
    'target_update_strategy': 'copy',
    'cold_start_steps': 100,
    'sample_batch_size': 32,
    'update_period': 100,
    'discount_factor': 0.95,
    'name': 'dqn_agent'
}
Here:

- “qnet_type”: str or class or instance
  Q-value net. Can be a class, its name or module path, or a class instance. If a class name is given, the class must be from module texar.tf.modules or texar.tf.custom. Ignored if a qnet is given to the agent constructor.
- “qnet_hparams”: dict, optional
  Hyperparameters for the Q net. With the qnet_kwargs argument to the constructor, a network is created with qnet_class(**qnet_kwargs, hparams=qnet_hparams).
- “replay_memory_type”: str or class or instance
  Replay memory class. Can be a class, its name or module path, or a class instance. If a class name is given, the class must be from module texar.tf.core or texar.tf.custom. Ignored if a replay_memory is given to the agent constructor.
- “replay_memory_hparams”: dict, optional
  Hyperparameters for the replay memory. With the replay_memory_kwargs argument to the constructor, a replay memory is created with replay_memory_class(**replay_memory_kwargs, hparams=replay_memory_hparams).
- “exploration_type”: str or class or instance
  Exploration class. Can be a class, its name or module path, or a class instance. If a class name is given, the class must be from module texar.tf.core or texar.tf.custom. Ignored if an exploration is given to the agent constructor.
- “exploration_hparams”: dict, optional
  Hyperparameters for the exploration class. With the exploration_kwargs argument to the constructor, an exploration instance is created with exploration_class(**exploration_kwargs, hparams=exploration_hparams).
- “optimization”: dict
  Hyperparameters of optimization for updating the Q-net. See default_optimization_hparams() for details.
- “cold_start_steps”: int
  The Q-net is not trained during the first “cold_start_steps” steps.
- “sample_batch_size”: int
  The number of samples drawn from the replay memory for each training update.
- “target_update_strategy”: str
  If “copy”, the target network is assigned the parameters of the Q-net every “update_period” steps. If “tau”, the target is updated softly as (1 - 1/update_period) * target + 1/update_period * qnet.
- “update_period”: int
  Frequency of updating the target network, i.e., the target is updated once every “update_period” steps.
- “discount_factor”: float
  The discount factor of reward.
- “name”: str
  Name of the agent.
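For illustration, the two target-update strategies can be written out on a single toy parameter value (plain Python arithmetic, not texar code):

update_period = 100
qnet_param = 1.0      # a Q-net parameter value
target_param = 0.0    # the corresponding target-net parameter value

# 'copy': every `update_period` steps the target takes the Q-net's value.
target_copy = qnet_param                     # -> 1.0

# 'tau': a soft update with weight 1/update_period toward the Q-net value.
target_tau = (1.0 - 1.0 / update_period) * target_param \
    + (1.0 / update_period) * qnet_param     # -> 0.01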
env_config¶
Environment configuration.

get_action(observ, feed_dict=None)¶

Gets action according to observation.

Parameters: observ – Observation from the environment.

Returns: Action from the policy.

hparams¶
A HParams instance. The hyperparameters of the module.

name¶
The name of the module (not uniquified).
observe(reward, terminal, train_policy=True, feed_dict=None)¶

Observes experience from the environment.

Parameters:
- reward – Reward of the action. The configuration (e.g., shape) of the reward is defined in env_config.
- terminal (bool) – Whether the episode is terminated.
- train_policy (bool) – Whether to update the policy for this step.
- feed_dict (dict, optional) – Any values fed when running the training operator.
reset()¶
Resets the states to begin a new episode.

sess¶
The tf session.

variable_scope¶
The variable scope of the agent.
ActorCriticAgent¶
class texar.tf.agents.ActorCriticAgent(env_config, sess=None, actor=None, actor_kwargs=None, critic=None, critic_kwargs=None, hparams=None)[source]¶

Actor-critic agent for the episodic setting.
An actor-critic algorithm consists of several components:

- An actor, the policy to optimize. As a temporary implementation, by default we use a PGAgent instance that wraps a policy net and provides the proper interfaces to perform the role of an actor.
- A critic that provides learning signals to the actor. Again, as a temporary implementation, by default we use a DQNAgent instance that wraps a Q net and provides the proper interfaces to perform the role of a critic.
Parameters:
- env_config – An instance of EnvConfig specifying the action space, observation space, reward range, etc. Use get_gym_env_config() to create an EnvConfig from a gym environment.
- sess (optional) – A tf session. Can be None here and set later with agent.sess = session.
- actor (optional) – An instance of PGAgent that acts as the actor in the algorithm. If not provided, an actor is created based on hparams.
- actor_kwargs (dict, optional) – Keyword arguments for the actor constructor. Note that the hparams argument for the actor constructor is specified in the “actor_hparams” field of hparams and should not be included in actor_kwargs. Ignored if actor is given.
- critic (optional) – An instance of DQNAgent that acts as the critic in the algorithm. If not provided, a critic is created based on hparams.
- critic_kwargs (dict, optional) – Keyword arguments for the critic constructor. Note that the hparams argument for the critic constructor is specified in the “critic_hparams” field of hparams and should not be included in critic_kwargs. Ignored if critic is given.
- hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters will be set to default values. See default_hparams() for the hyperparameter structure and default values.
static default_hparams()[source]¶

Returns a dictionary of hyperparameters with default values:
{
    'actor_type': 'PGAgent',
    'actor_hparams': None,
    'critic_type': 'DQNAgent',
    'critic_hparams': None,
    'name': 'actor_critic_agent'
}
Here:

- “actor_type”: str or class or instance
  The actor. Can be a class, its name or module path, or a class instance. If a class name is given, the class must be from module texar.tf.agents or texar.tf.custom. Ignored if an actor is given to the agent constructor.
- “actor_hparams”: dict, optional
  Hyperparameters for the actor class. With the actor_kwargs argument to the constructor, an actor is created with actor_class(**actor_kwargs, hparams=actor_hparams).
- “critic_type”: str or class or instance
  The critic. Can be a class, its name or module path, or a class instance. If a class name is given, the class must be from module texar.tf.agents or texar.tf.custom. Ignored if a critic is given to the agent constructor.
- “critic_hparams”: dict, optional
  Hyperparameters for the critic class. With the critic_kwargs argument to the constructor, a critic is created with critic_class(**critic_kwargs, hparams=critic_hparams).
- “name”: str
  Name of the agent.
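As a sketch, the nested actor and critic hyperparameters can be supplied through hparams; the values below are illustrative overrides of the PGAgent and DQNAgent defaults documented above.

agent = tx.agents.ActorCriticAgent(
    env_config,
    hparams={
        'actor_hparams': {'discount_factor': 0.99},   # forwarded to PGAgent
        'critic_hparams': {'sample_batch_size': 64},  # forwarded to DQNAgent
    })
with tf.Session() as sess:
    agent.sess = sess
    sess.run(tf.global_variables_initializer())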
get_action(observ, feed_dict=None)[source]¶

Gets action according to observation.

Parameters: observ – Observation from the environment.

Returns: Action from the policy.

env_config¶
Environment configuration.

hparams¶
A HParams instance. The hyperparameters of the module.

name¶
The name of the module (not uniquified).
observe(reward, terminal, train_policy=True, feed_dict=None)¶

Observes experience from the environment.

Parameters:
- reward – Reward of the action. The configuration (e.g., shape) of the reward is defined in env_config.
- terminal (bool) – Whether the episode is terminated.
- train_policy (bool) – Whether to update the policy for this step.
- feed_dict (dict, optional) – Any values fed when running the training operator.
reset()¶
Resets the states to begin a new episode.

sess¶
The tf session.

variable_scope¶
The variable scope of the agent.
Agent Utils¶
Space¶
class texar.tf.agents.Space(shape=None, low=None, high=None, dtype=None)[source]¶

Observation and action spaces. Describes valid actions and observations. Similar to gym.Space.
Parameters:
- shape (optional) – Shape of the space, a tuple. If not given, inferred from low and high.
- low (optional) – Lower bound (inclusive) of each dimension of the space. Must have the shape specified by shape, and be of the same shape as high (if given). If None, set to -inf for each dimension.
- high (optional) – Upper bound (inclusive) of each dimension of the space. Must have the shape specified by shape, and be of the same shape as low (if given). If None, set to inf for each dimension.
- dtype (optional) – Data type of elements in the space. If not given, inferred from low (if given) or set to float.
Example
s = Space(low=0, high=10, dtype=np.int32)
# s.contains(2) == True
# s.contains(10) == True
# s.contains(11) == False
# s.shape == ()

s2 = Space(shape=(2, 2), high=np.ones([2, 2]), dtype=np.float)
# s2.low == [[-inf, -inf], [-inf, -inf]]
# s2.high == [[1., 1.], [1., 1.]]
shape¶
Shape of the space.

low¶
Lower bound of the space.

high¶
Upper bound of the space.

dtype¶
Data type of the elements.