Data

Tokenizers

TokenizerBase

class texar.tf.data.TokenizerBase(hparams)[source]

Base class inherited by all tokenizer classes. This class handles downloading and loading pre-trained tokenizers, and adding tokens to the vocabulary.

Derived classes can set up a few special tokens to be used in common scripts and internals: bos_token, eos_token, unk_token, sep_token, pad_token, cls_token, mask_token, and additional_special_tokens.

An added_tokens_encoder is defined to add new tokens to the vocabulary without having to handle the specific vocabulary augmentation methods of the various underlying dictionary structures (BPE, SentencePiece, …).

classmethod load(pretrained_model_path: str, configs: Optional[Dict[KT, VT]] = None)[source]

Instantiate a tokenizer from the vocabulary files or the saved tokenizer files.

Parameters:
  • pretrained_model_path – The path to a vocabulary file or a folder that contains the saved pre-trained tokenizer files.
  • configs – Tokenizer configurations. You can overwrite the original tokenizer configurations saved in the configuration file by this dictionary.
Returns:

A tokenizer instance.

save(save_dir: str) → Tuple[str][source]

Save the tokenizer vocabulary files (with added tokens), the tokenizer configuration file, and a dictionary mapping special token class attributes (cls_token, unk_token, …) to their values (<unk>, <cls>, …) to a directory, so that the tokenizer can be re-loaded using load().

Parameters:save_dir – The path to a folder in which the tokenizer files will be saved.
Returns:The paths to the vocabulary file, added token file, special token mapping file, and the configuration file.
save_vocab(save_dir)[source]

Save the tokenizer vocabulary to a directory. This method does not save added tokens, special token mappings, and the configuration file.

Please use save() to save the full tokenizer state so that it can be reloaded using load().
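A minimal save/load round-trip sketch, assuming a concrete subclass such as BERTTokenizer and an existing, writable directory ./saved_tokenizer (not part of the original docs):

from texar.tf.data import BERTTokenizer

# Save the full tokenizer state (vocabulary, added tokens, special token
# mapping, and configuration).
tokenizer = BERTTokenizer(pretrained_model_name="bert-base-uncased")
tokenizer.save("./saved_tokenizer")

# Re-instantiate the tokenizer from the saved files.
tokenizer = BERTTokenizer.load("./saved_tokenizer")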

add_tokens(new_tokens: List[Optional[str]]) → int[source]

Add a list of new tokens to the tokenizer class. If the new tokens are not in the vocabulary, they are added to the added_tokens_encoder with indices starting from the last index of the current vocabulary.

Parameters:new_tokens – A list of new tokens.
Returns:Number of tokens added to the vocabulary which can be used to correspondingly increase the size of the associated model embedding matrices.
add_special_tokens(special_tokens_dict: Dict[str, str]) → int[source]

Add a dictionary of special tokens to the encoder and link them to class attributes. If the special tokens are not in the vocabulary, they are added to it and indexed starting from the last index of the current vocabulary.

Parameters:special_tokens_dict – A dictionary mapping special token class attributes (cls_token, unk_token, …) to their values (<unk>, <cls>, …).
Returns:Number of tokens added to the vocabulary which can be used to correspondingly increase the size of the associated model embedding matrices.
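A short sketch of extending the vocabulary, assuming a concrete subclass such as BERTTokenizer (the token strings are illustrative):

from texar.tf.data import BERTTokenizer

tokenizer = BERTTokenizer(pretrained_model_name="bert-base-uncased")

# New regular tokens receive ids following the current vocabulary.
num_added = tokenizer.add_tokens(["new_token_1", "new_token_2"])

# A special token is added (if absent) and linked to the bos_token attribute.
num_special = tokenizer.add_special_tokens({"bos_token": "<BOS>"})

# The returned counts can be used to enlarge the model's embedding matrix.
print(num_added + num_special)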
map_text_to_token(text: Optional[str], **kwargs) → List[str][source]

Maps a string to a sequence of tokens (strings), using the tokenizer. Splits into words for word-based vocabularies, or into sub-words for sub-word-based vocabularies (BPE/SentencePiece/WordPiece). This function also takes care of the added tokens.

Parameters:text – An input string.
Returns:A list of tokens.
map_token_to_id(tokens)[source]

Maps a single token to an integer id, or a sequence of tokens to a sequence of ids, using the vocabulary.

Parameters:tokens – A single token or a list of tokens.
Returns:A single token id or a list of token ids.
map_text_to_id(text: str) → List[int][source]

Maps a string to a sequence of ids (integer), using the tokenizer and vocabulary. Same as self.map_token_to_id(self.map_text_to_token(text)).

Parameters:text – An input string.
Returns:A list of token ids.
map_id_to_token(token_ids, skip_special_tokens=False)[source]

Maps a single id to a token, or a sequence of ids to a sequence of tokens, using the vocabulary and added tokens.

Parameters:
  • token_ids – A single token id or a list of token ids.
  • skip_special_tokens – Whether to skip the special tokens.
Returns:

A single token or a list of tokens.

map_token_to_text(tokens: List[str]) → str[source]

Maps a sequence of tokens (strings) to a single string. The simplest way to do this is ' '.join(tokens), but we often want to remove sub-word tokenization artifacts at the same time.

map_id_to_text(token_ids: List[int], skip_special_tokens: bool = False, clean_up_tokenization_spaces: bool = True) → str[source]

Maps a sequence of ids (integer) to a string, using the tokenizer and vocabulary with options to remove special tokens and clean up tokenization spaces.

Parameters:
  • token_ids – A list of token ids.
  • skip_special_tokens – Whether to skip the special tokens.
  • clean_up_tokenization_spaces – Whether to clean up simple English tokenization artifacts, such as spaces before punctuation and abbreviated forms.
encode_text(text_a: str, text_b: Optional[str] = None, max_seq_length: Optional[int] = None)[source]

Adds special tokens to a sequence or sequence pair and computes other information such as segment ids, input mask, and sequence length for specific tasks.

special_tokens_map

A dictionary mapping special token class attributes (cls_token, unk_token, …) to their values (<unk>, <cls>, …)

all_special_tokens

List all the special tokens (<unk>, <cls>, …) mapped to class attributes (cls_token, unk_token, …).

all_special_ids

List the vocabulary indices of the special tokens (<unk>, <cls>, …) mapped to class attributes (cls_token, unk_token, …).

static clean_up_tokenization(out_string: str) → str[source]

Clean up simple English tokenization artifacts, such as spaces before punctuation and abbreviated forms.
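An illustrative call (the exact output depends on the implementation's cleanup rules, but spaces before punctuation are expected to be removed, e.g., yielding "Hello, world!"):

from texar.tf.data import TokenizerBase

cleaned = TokenizerBase.clean_up_tokenization("Hello , world !")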

BERTTokenizer

class texar.tf.data.BERTTokenizer(pretrained_model_name: Optional[str] = None, cache_dir: Optional[str] = None, hparams=None)[source]

Pre-trained BERT Tokenizer.

Parameters:
  • pretrained_model_name (optional) – a str, the name of pre-trained model (e.g., bert-base-uncased). Please refer to PretrainedBERTMixin for all supported models. If None, the model name in hparams is used.
  • cache_dir (optional) – the path to a folder in which the pre-trained models will be cached. If None (default), a default directory (texar_data folder under user’s home directory) will be used.
  • hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameter will be set to default values. See default_hparams() for the hyperparameter structure and default values.
save_vocab(save_dir: str) → Tuple[str][source]

Save the tokenizer vocabulary to a directory or file.

map_token_to_text(tokens: List[str]) → str[source]

Maps a sequence of tokens (string) to a single string.

encode_text(text_a: str, text_b: Optional[str] = None, max_seq_length: Optional[int] = None) → Tuple[List[int], List[int], List[int]][source]

Adds special tokens to a sequence or sequence pair and computes the corresponding segment ids and input mask for BERT specific tasks. The sequence will be truncated if its length is larger than max_seq_length.

A BERT sequence has the following format: [cls_token] X [sep_token]

A BERT sequence pair has the following format: [cls_token] A [sep_token] B [sep_token]

Parameters:
  • text_a – The first input text.
  • text_b – The second input text.
  • max_seq_length – Maximum sequence length.
Returns:

A tuple of (input_ids, segment_ids, input_mask), where

  • input_ids: A list of input token ids with added special token ids.
  • segment_ids: A list of segment ids.
  • input_mask: A list of mask ids. The mask has 1 for real tokens and 0 for padding tokens. Only real tokens are attended to.
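A short usage sketch (the token ids depend on the vocabulary, so outputs are described in comments rather than shown):

from texar.tf.data import BERTTokenizer

tokenizer = BERTTokenizer(pretrained_model_name="bert-base-uncased")

# Tokenize, map to ids, and map back to a cleaned-up string.
tokens = tokenizer.map_text_to_token("a sentence to encode")
token_ids = tokenizer.map_token_to_id(tokens)
text = tokenizer.map_id_to_text(token_ids)

# Encode a sequence pair for a BERT-style task.
input_ids, segment_ids, input_mask = tokenizer.encode_text(
    text_a="first sentence", text_b="second sentence", max_seq_length=16)
# input_mask has 1 for real tokens and 0 for padding positions.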

static default_hparams() → Dict[str, Any][source]

Returns a dictionary of hyperparameters with default values.

  • The tokenizer is determined by the constructor argument pretrained_model_name if it’s specified. In this case, hparams are ignored.
  • Otherwise, the tokenizer is determined by hparams[‘pretrained_model_name’] if it’s specified. All other configurations in hparams are ignored.
  • If the above two are None, the tokenizer is defined by the configurations in hparams.
{
    "pretrained_model_name": "bert-base-uncased",
    "vocab_file": None,
    "max_len": 512,
    "unk_token": "[UNK]",
    "sep_token": "[SEP]",
    "pad_token": "[PAD]",
    "cls_token": "[CLS]",
    "mask_token": "[MASK]",
    "tokenize_chinese_chars": True,
    "do_lower_case": True,
    "do_basic_tokenize": True,
    "non_split_tokens": None,
    "name": "bert_tokenizer",
}

Here:

“pretrained_model_name”: str or None
The name of the pre-trained BERT model.
“vocab_file”: str or None
The path to a one-wordpiece-per-line vocabulary file.
“max_len”: int
The maximum sequence length that this model might ever be used with.
“unk_token”: str
Unknown token.
“sep_token”: str
Separation token.
“pad_token”: str
Padding token.
“cls_token”: str
Classification token.
“mask_token”: str
Masking token.
“tokenize_chinese_chars”: bool
Whether to tokenize Chinese characters.
“do_lower_case”: bool
Whether to lower-case the input. Only has an effect when do_basic_tokenize=True.
“do_basic_tokenize”: bool
Whether to do basic tokenization before wordpiece.
“non_split_tokens”: list
A list of tokens that will never be split during tokenization. Only has an effect when do_basic_tokenize=True.
“name”: str
Name of the tokenizer.

XLNetTokenizer

class texar.tf.data.XLNetTokenizer(pretrained_model_name: Optional[str] = None, cache_dir: Optional[str] = None, hparams=None)[source]

Pre-trained XLNet Tokenizer.

Parameters:
  • pretrained_model_name (optional) – a str, the name of pre-trained model (e.g., xlnet-base-cased). Please refer to PretrainedXLNetMixin for all supported models. If None, the model name in hparams is used.
  • cache_dir (optional) – the path to a folder in which the pre-trained models will be cached. If None (default), a default directory (texar_data folder under user’s home directory) will be used.
  • hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameter will be set to default values. See default_hparams() for the hyperparameter structure and default values.
save_vocab(save_dir: str) → Tuple[str][source]

Save the sentencepiece vocabulary (by copying the original file) to a directory.

map_token_to_text(tokens: List[str]) → str[source]

Maps a sequence of tokens (strings) to a single string.

encode_text(text_a: str, text_b: Optional[str] = None, max_seq_length: Optional[int] = None) → Tuple[List[int], List[int], List[int]][source]

Adds special tokens to a sequence or sequence pair and computes the corresponding segment ids and input mask for XLNet specific tasks. The sequence will be truncated if its length is larger than max_seq_length.

An XLNet sequence has the following format: X [sep_token] [cls_token]

An XLNet sequence pair has the following format: A [sep_token] B [sep_token] [cls_token]

Parameters:
  • text_a – The first input text.
  • text_b – The second input text.
  • max_seq_length – Maximum sequence length.
Returns:

A tuple of (input_ids, segment_ids, input_mask), where

  • input_ids: A list of input token ids with added special token ids.
  • segment_ids: A list of segment ids.
  • input_mask: A list of mask ids. The mask has 1 for real tokens and 0 for padding tokens. Only real tokens are attended to.

static default_hparams() → Dict[str, Any][source]

Returns a dictionary of hyperparameters with default values.

  • The tokenizer is determined by the constructor argument pretrained_model_name if it’s specified. In this case, hparams are ignored.
  • Otherwise, the tokenizer is determined by hparams[‘pretrained_model_name’] if it’s specified. All other configurations in hparams are ignored.
  • If the above two are None, the tokenizer is defined by the configurations in hparams.
{
    "pretrained_model_name": "xlnet-base-cased",
    "vocab_file": None,
    "max_len": None,
    "bos_token": "<s>",
    "eos_token": "</s>",
    "unk_token": "<unk>",
    "sep_token": "<sep>",
    "pad_token": "<pad>",
    "cls_token": "<cls>",
    "mask_token": "<mask>",
    "additional_special_tokens": ["<eop>", "<eod>"],
    "do_lower_case": False,
    "remove_space": True,
    "keep_accents": False,
}

Here:

“pretrained_model_name”: str or None
The name of the pre-trained XLNet model.
“vocab_file”: str or None
The path to a sentencepiece vocabulary file.
“max_len”: int or None
The maximum sequence length that this model might ever be used with.
“bos_token”: str
Beginning of sentence token.
“eos_token”: str
End of sentence token.
“unk_token”: str
Unknown token.
“sep_token”: str
Separation token.
“pad_token”: str
Padding token.
“cls_token”: str
Classification token.
“mask_token”: str
Masking token.
“additional_special_tokens”: list
A list of additional special tokens.
“do_lower_case”: bool
Whether to lower-case the text.
“remove_space”: bool
Whether to remove the space in the text.
“keep_accents”: bool
Whether to keep the accents in the text.
“name”: str
Name of the tokenizer.

Vocabulary

SpecialTokens

class texar.tf.data.SpecialTokens[source]

Special tokens, including PAD, BOS, EOS, UNK. These tokens will by default have token ids 0, 1, 2, 3, respectively.

Vocab

class texar.tf.data.Vocab(filename, pad_token='<PAD>', bos_token='<BOS>', eos_token='<EOS>', unk_token='<UNK>')[source]

Vocabulary class that loads the vocabulary from a file and maintains mapping tables between token strings and indexes.

Each line of the vocab file should contain one vocabulary token, e.g.,:

vocab_token_1
vocab token 2
vocab       token | 3 .
...
Parameters:
  • filename (str) – Path to the vocabulary file where each line contains one token.
  • bos_token (str) – A special token that will be added to the beginning of sequences.
  • eos_token (str) – A special token that will be added to the end of sequences.
  • unk_token (str) – A special token that will replace all unknown tokens (tokens not included in the vocabulary).
  • pad_token (str) – A special token that is used to do padding.
load(filename)[source]

Loads the vocabulary from the file.

Parameters:filename (str) – Path to the vocabulary file.
Returns:A tuple of TF and python mapping tables between word string and index, (id_to_token_map, token_to_id_map, id_to_token_map_py, token_to_id_map_py), where id_to_token_map and token_to_id_map are TF HashTable instances, and id_to_token_map_py and token_to_id_map_py are python defaultdict instances.
map_ids_to_tokens(ids)[source]

Maps ids into text tokens.

The returned tokens are a Tensor.

Parameters:ids – An int tensor of token ids.
Returns:A tensor of text tokens of the same shape.
map_tokens_to_ids(tokens)[source]

Maps text tokens into ids.

The returned ids are a Tensor.

Parameters:tokens – A tensor of text tokens.
Returns:A tensor of token ids of the same shape.
map_ids_to_tokens_py(ids)[source]

Maps ids into text tokens.

The input ids and returned tokens are both python arrays or lists.

Parameters:ids – An int numpy array or (possibly nested) list of token ids.
Returns:A numpy array of text tokens of the same shape as ids.
map_tokens_to_ids_py(tokens)[source]

Maps text tokens into ids.

The input tokens and returned ids are both python arrays or lists.

Parameters:tokens – A numpy array or (possibly nested) list of text tokens.
Returns:A numpy array of token ids of the same shape as tokens.
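A minimal usage sketch, assuming a file vocab.txt with one token per line (the actual ids depend on the file contents):

from texar.tf.data import Vocab

vocab = Vocab('vocab.txt')

# Python-side mapping; no TF session is needed.
ids = vocab.map_tokens_to_ids_py([['hello', 'world']])
tokens = vocab.map_ids_to_tokens_py(ids)

# Tokens not in the vocabulary are mapped to vocab.unk_token_id.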
id_to_token_map

The HashTable instance that maps from token index to the string form.

token_to_id_map

The HashTable instance that maps from token string to the index.

id_to_token_map_py

The python defaultdict instance that maps from token index to the string form.

token_to_id_map_py

The python defaultdict instance that maps from token string to the index.

size

The vocabulary size.

bos_token

A string of the special token indicating the beginning of sequence.

bos_token_id

The int index of the special token indicating the beginning of sequence.

eos_token

A string of the special token indicating the end of sequence.

eos_token_id

The int index of the special token indicating the end of sequence.

unk_token

A string of the special token indicating unknown token.

unk_token_id

The int index of the special token indicating unknown token.

pad_token

A string of the special token used for padding ('<PAD>' by default).

pad_token_id

The int index of the special token indicating padding token.

special_tokens

The list of special tokens [pad_token, bos_token, eos_token, unk_token].

Embedding

Embedding

class texar.tf.data.Embedding(vocab, hparams=None)[source]

Embedding class that loads token embedding vectors from file. Token embeddings not in the embedding file are initialized as specified in hparams.

Parameters:
  • vocab (dict) – A dictionary that maps token strings to integer index.
  • read_fn – Callable that takes (filename, vocab, word_vecs) and returns the updated word_vecs. E.g., load_word2vec() and load_glove().
static default_hparams()[source]

Returns a dictionary of hyperparameters with default values:

{
    "file": "",
    "dim": 50,
    "read_fn": "load_word2vec",
    "init_fn": {
        "type": "numpy.random.uniform",
        "kwargs": {
            "low": -0.1,
            "high": 0.1,
        }
    },
}

Here:

“file”: str
Path to the embedding file. If not provided, all embeddings are initialized with the initialization function.
“dim”: int
Dimension size of each embedding vector.
“read_fn”: str or callable

Function to read the embedding file. This can be the function, or its string name or full module path. E.g.,

"read_fn": texar.tf.data.load_word2vec
"read_fn": "load_word2vec"
"read_fn": "texar.tf.data.load_word2vec"
"read_fn": "my_module.my_read_fn"

If a function string name is used, the function must be in one of the modules: texar.tf.data or texar.tf.custom.

The function must have the same signature as load_word2vec().

“init_fn”: dict

Hyperparameters of the initialization function used to initialize embedding of tokens missing in the embedding file.

The function must accept an argument named size or shape to specify the output shape, and return a numpy array of that shape.

The dict has the following fields:

“type”: str or callable
The initialization function. Can be either the function, or its string name or full module path.
“kwargs”: dict
Keyword arguments for calling the function. The function is called with init_fn(size=[.., ..], **kwargs).
word_vecs

2D numpy array of shape [vocab_size, embedding_dim].

vector_size

The embedding dimension size.
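A minimal sketch of loading pre-trained vectors, assuming vocab.txt and a GloVe-format file glove.txt exist (file names and dimension are illustrative):

from texar.tf.data import Vocab, Embedding

vocab = Vocab('vocab.txt')

emb_hparams = {
    "file": "glove.txt",
    "dim": 50,
    "read_fn": "load_glove",
}
embedding = Embedding(vocab.token_to_id_map_py, emb_hparams)

# embedding.word_vecs is a [vocab_size, 50] numpy array that can serve as
# the initial value of an embedding variable.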

load_word2vec

texar.tf.data.load_word2vec(filename, vocab, word_vecs)[source]

Loads embeddings in the word2vec binary format, which has a header line containing the number of vectors and their dimensionality (two integers), followed by number-of-vectors lines, each of which is formatted as ‘<word-string> <embedding-vector>’.

Parameters:
  • filename (str) – Path to the embedding file.
  • vocab (dict) – A dictionary that maps token strings to integer index. Tokens not in vocab are not read.
  • word_vecs – A 2D numpy array of shape [vocab_size, embed_dim] which is updated as reading from the file.
Returns:

The updated word_vecs.

load_glove

texar.tf.data.load_glove(filename, vocab, word_vecs)[source]

Loads embeddings in the glove text format in which each line is ‘<word-string> <embedding-vector>’. Dimensions of the embedding vector are separated with whitespace characters.

Parameters:
  • filename (str) – Path to the embedding file.
  • vocab (dict) – A dictionary that maps token strings to integer index. Tokens not in vocab are not read.
  • word_vecs – A 2D numpy array of shape [vocab_size, embed_dim] which is updated as reading from the file.
Returns:

The updated word_vecs.
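A hedged sketch of calling the loader directly (the file name and vocabulary are illustrative; rows for tokens not found in the file keep their initial values):

import numpy as np
from texar.tf.data import load_glove

vocab = {'the': 0, 'of': 1, 'and': 2}
word_vecs = np.random.uniform(-0.1, 0.1, size=[len(vocab), 50])

# Fills the rows of word_vecs for tokens found in the GloVe file.
word_vecs = load_glove('glove.6B.50d.txt', vocab, word_vecs)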

Data

DataBase

class texar.tf.data.DataBase(hparams)[source]

Base class inherited by all data classes.

static default_hparams()[source]

Returns a dictionary of default hyperparameters.

{
    "num_epochs": 1,
    "batch_size": 64,
    "allow_smaller_final_batch": True,
    "shuffle": True,
    "shuffle_buffer_size": None,
    "shard_and_shuffle": False,
    "num_parallel_calls": 1,
    "prefetch_buffer_size": 0,
    "max_dataset_size": -1,
    "seed": None,
    "name": "data",
}

Here:

“num_epochs”: int

Number of times the dataset should be repeated. An OutOfRangeError signal will be raised after the whole repeated dataset has been iterated through.

E.g., for training data, set it to 1 (default) so that you will get the signal after each epoch of training. Set to -1 to repeat the dataset indefinitely.

“batch_size”: int
Batch size, i.e., the number of consecutive elements of the dataset to combine in a single batch.
“allow_smaller_final_batch”: bool
Whether to allow the final batch to be smaller if there are insufficient elements left. If False, the final batch is discarded if it is smaller than the batch size. Note that, if False, output_shapes of the resulting dataset will have a static batch_size dimension equal to “batch_size”.
“shuffle”: bool
Whether to randomly shuffle the elements of the dataset.
“shuffle_buffer_size”: int

The buffer size for data shuffling. The larger the buffer, the more thoroughly the resulting data is mixed.

If None (default), the buffer size is set to the size of the whole dataset (i.e., shuffling is maximally effective).

“shard_and_shuffle”: bool

Whether to first shard the dataset and then shuffle each block respectively. Useful when the whole data is too large to be loaded efficiently into memory.

If True, shuffle_buffer_size must be specified to determine the size of each shard.

“num_parallel_calls”: int
Number of elements from the datasets to process in parallel.
“prefetch_buffer_size”: int
The maximum number of elements that will be buffered when prefetching.
“max_dataset_size”: int
Maximum number of instances to include in the dataset. If set to -1 or greater than the size of the dataset, all instances will be included. This constraint is imposed after data shuffling and filtering.
“seed”: int, optional

The random seed for shuffling.

Note that if a seed is set, the shuffle order will be exactly the same every time the (repeated) dataset is iterated through.

For example, consider a dataset with elements [1, 2, 3], with “num_epochs”=2 and some fixed seed; the resulting sequence can be: 2 1 3, 1 3 2 | 2 1 3, 1 3 2, … That is, the orders differ between the epochs within num_epochs, but each pass through the repeated dataset is the same.

“name”: str
Name of the data.
num_epochs

Number of epochs.

batch_size

The batch size.

hparams

A HParams instance of the data hyperparameters.

name

Name of the module.

MonoTextData

class texar.tf.data.MonoTextData(hparams)[source]

Text data processor that reads a single set of text files. This can be used for, e.g., language models, auto-encoders, etc.

Parameters:hparams – A dict or instance of HParams containing hyperparameters. See default_hparams() for the defaults.

By default, the processor reads raw data files, performs tokenization, batching and other pre-processing steps, and results in a TF Dataset whose element is a python dict including three fields:

  • “text”:
    A string Tensor of shape [batch_size, max_time] containing the raw text tokens. max_time is the length of the longest sequence in the batch. Short sequences in the batch are padded with empty strings. BOS and EOS tokens are added as per hparams. Out-of-vocabulary tokens are NOT replaced with UNK.
  • “text_ids”:
    An int64 Tensor of shape [batch_size, max_time] containing the token indexes.
  • “length”:
    An int Tensor of shape [batch_size] containing the length of each sequence in the batch (including BOS and EOS if added).

If 'variable_utterance' is set to True in hparams, the resulting dataset has elements with four fields:

  • “text”:
    A string Tensor of shape [batch_size, max_utterance, max_time], where max_utterance is either the maximum number of utterances in each element of the batch, or max_utterance_cnt as specified in hparams.
  • “text_ids”:
    An int64 Tensor of shape [batch_size, max_utterance, max_time] containing the token indexes.
  • “length”:
    An int Tensor of shape [batch_size, max_utterance] containing the length of each sequence in the batch.
  • “utterance_cnt”:
    An int Tensor of shape [batch_size] containing the number of utterances of each element in the batch.

The above field names can be accessed through text_name, text_id_name, length_name, and utterance_cnt_name, respectively.

Example

hparams={
    'dataset': { 'files': 'data.txt', 'vocab_file': 'vocab.txt' },
    'batch_size': 1
}
data = MonoTextData(hparams)
iterator = DataIterator(data)
batch = iterator.get_next()

iterator.switch_to_dataset(sess) # initializes the dataset
batch_ = sess.run(batch)
# batch_ == {
#    'text': [['<BOS>', 'example', 'sequence', '<EOS>']],
#    'text_ids': [[1, 5, 10, 2]],
#    'length': [4]
# }
static default_hparams()[source]

Returns a dictionary of default hyperparameters:

{
    # (1) Hyperparams specific to text dataset
    "dataset": {
        "files": [],
        "compression_type": None,
        "vocab_file": "",
        "embedding_init": {},
        "delimiter": " ",
        "max_seq_length": None,
        "length_filter_mode": "truncate",
        "pad_to_max_seq_length": False,
        "bos_token": "<BOS>"
        "eos_token": "<EOS>"
        "other_transformations": [],
        "variable_utterance": False,
        "utterance_delimiter": "|||",
        "max_utterance_cnt": 5,
        "data_name": None,
    },
    # (2) General hyperparams
    "num_epochs": 1,
    "batch_size": 64,
    "allow_smaller_final_batch": True,
    "shuffle": True,
    "shuffle_buffer_size": None,
    "shard_and_shuffle": False,
    "num_parallel_calls": 1,
    "prefetch_buffer_size": 0,
    "max_dataset_size": -1,
    "seed": None,
    "name": "mono_text_data",
    # (3) Bucketing
    "bucket_boundaries": [],
    "bucket_batch_sizes": None,
    "bucket_length_fn": None,
}

Here:

  1. For the hyperparameters in the "dataset" field:

    “files”: str or list

    A (list of) text file path(s).

    Each line contains a single text sequence.

    “compression_type”: str, optional

    One of “” (no compression), “ZLIB”, or “GZIP”.

    “vocab_file”: str

    Path to vocabulary file. Each line of the file should contain one vocabulary token.

    Used to create an instance of Vocab.

    “embedding_init”: dict

    The hyperparameters for pre-trained embedding loading and initialization.

    The structure and default values are defined in texar.tf.data.Embedding.default_hparams().

    “delimiter”: str

    The delimiter to split each line of the text files into tokens.

    “max_seq_length”: int, optional

    Maximum length of output sequences. Data samples exceeding the length will be truncated or discarded according to "length_filter_mode". The length does not include any added "bos_token" or "eos_token". If None (default), no filtering is performed.

    “length_filter_mode”: str

    Either “truncate” or “discard”. If “truncate” (default), tokens exceeding the "max_seq_length" will be truncated. If “discard”, data samples longer than the "max_seq_length" will be discarded.

    “pad_to_max_seq_length”: bool

    If True, pad all data instances to length "max_seq_length". Raises an error if "max_seq_length" is not provided.

    “bos_token”: str

    The Begin-Of-Sequence token prepended to each sequence.

    Set to an empty string to avoid prepending.

    “eos_token”: str

    The End-Of-Sequence token appended to each sequence.

    Set to an empty string to avoid appending.

    “other_transformations”: list

    A list of transformation functions or function names/paths to further transform each single data instance.

    (More documentation to be added.)

    “variable_utterance”: bool

    If True, each line of the text file is considered to contain multiple sequences (utterances) separated by "utterance_delimiter".

    For example, in dialog data, each line can contain a series of dialog history utterances. See the example in examples/hierarchical_dialog for a use case.

    “utterance_delimiter”: str

    The delimiter used to split at the utterance level. Should not be the same as "delimiter". Used only when "variable_utterance" is True.

    “max_utterance_cnt”: int

    Maximum number of utterances allowed in a data instance. Extra utterances are truncated.

    “data_name”: str

    Name of the dataset.

2. For the general hyperparameters, see texar.tf.data.DataBase.default_hparams() for details.

3. Bucketing groups elements of the dataset together by length, then pads and batches them (see bucket_by_sequence_length for more; a configuration sketch is given after this list). For the bucketing hyperparameters:

“bucket_boundaries”: list

An int list containing the upper length boundaries of the buckets.

Set to an empty list (default) to disable bucketing.

“bucket_batch_sizes”: list

An int list containing batch size per bucket. Length should be len(bucket_boundaries) + 1.

If None, every bucket will have the same batch size specified in batch_size.

“bucket_length_fn”: str or callable

A function that maps a dataset element to a tf.int32 scalar, determining the length of the element.

This can be a function, or the name or full module path to the function. If function name is given, the function must be in the texar.tf.custom module.

If None (default), length is determined by the number of tokens (including BOS and EOS if added) of the element.
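As referenced above, a hedged bucketing configuration sketch (file names and bucket values are illustrative):

from texar.tf.data import MonoTextData

hparams = {
    'dataset': {'files': 'data.txt', 'vocab_file': 'vocab.txt'},
    'batch_size': 64,
    # Elements are grouped into 4 buckets by length: <=10, <=20, <=30, >30.
    'bucket_boundaries': [10, 20, 30],
    # One batch size per bucket: len(bucket_boundaries) + 1 entries.
    'bucket_batch_sizes': [128, 64, 32, 16],
}
data = MonoTextData(hparams)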

list_items()[source]

Returns the list of item names that the data can produce.

Returns:A list of strings.
dataset

The dataset, an instance of TF dataset.

dataset_size()[source]

Returns the number of data instances in the data files.

Note that this is the total data count in the raw files, before any filtering and truncation.

vocab

The vocabulary, an instance of Vocab.

embedding_init_value

The Tensor containing the embedding value loaded from file. None if embedding is not specified.

text_name

The name of text tensor, “text” by default.

length_name

The name of length tensor, “length” by default.

text_id_name

The name of text index tensor, “text_ids” by default.

utterance_cnt_name

The name of utterance count tensor, “utterance_cnt” by default.

batch_size

The batch size.

hparams

A HParams instance of the data hyperparameters.

name

Name of the module.

num_epochs

Number of epochs.

PairedTextData

class texar.tf.data.PairedTextData(hparams)[source]

Text data processor that reads parallel source and target text. This can be used in, e.g., seq2seq models.

Parameters:hparams (dict) – Hyperparameters. See default_hparams() for the defaults.

By default, the processor reads raw data files, performs tokenization, batching and other pre-processing steps, and results in a TF Dataset whose element is a python dict including six fields:

  • “source_text”:
    A string Tensor of shape [batch_size, max_time] containing the raw text tokens of source sequences. max_time is the length of the longest sequence in the batch. Short sequences in the batch are padded with empty strings. By default only the EOS token is appended to each sequence. Out-of-vocabulary tokens are NOT replaced with UNK.
  • “source_text_ids”:
    An int64 Tensor of shape [batch_size, max_time] containing the token indexes of source sequences.
  • “source_length”:
    An int Tensor of shape [batch_size] containing the length of each source sequence in the batch (including BOS and/or EOS if added).
  • “target_text”:
    A string Tensor as “source_text” but for target sequences. By default both BOS and EOS are added.
  • “target_text_ids”:
    An int64 Tensor as “source_text_ids” but for target sequences.
  • “target_length”:
    An int Tensor of shape [batch_size] as “source_length” but for target sequences.

If 'variable_utterance' is set to True in 'source_dataset' and/or 'target_dataset' of hparams, the corresponding fields “source_*” and/or “target_*” are respectively changed to contain variable utterance text data, as in MonoTextData.

The above field names can be accessed through source_text_name, source_text_id_name, source_length_name, source_utterance_cnt_name, and those prefixed with target_, respectively.

Example

hparams={
    'source_dataset': {'files': 's', 'vocab_file': 'vs'},
    'target_dataset': {'files': ['t1', 't2'], 'vocab_file': 'vt'},
    'batch_size': 1
}
data = PairedTextData(hparams)
iterator = DataIterator(data)
batch = iterator.get_next()

iterator.switch_to_dataset(sess) # initializes the dataset
batch_ = sess.run(batch)
# batch_ == {
#    'source_text': [['source', 'sequence', '<EOS>']],
#    'source_text_ids': [[5, 10, 2]],
#    'source_length': [3]
#    'target_text': [['<BOS>', 'target', 'sequence', '1', '<EOS>']],
#    'target_text_ids': [[1, 6, 10, 20, 2]],
#    'target_length': [5]
# }
static default_hparams()[source]

Returns a dictionary of default hyperparameters.

{
    # (1) Hyperparams specific to text dataset
    "source_dataset": {
        "files": [],
        "compression_type": None,
        "vocab_file": "",
        "embedding_init": {},
        "delimiter": " ",
        "max_seq_length": None,
        "length_filter_mode": "truncate",
        "pad_to_max_seq_length": False,
        "bos_token": None,
        "eos_token": "<EOS>",
        "other_transformations": [],
        "variable_utterance": False,
        "utterance_delimiter": "|||",
        "max_utterance_cnt": 5,
        "data_name": "source",
    },
    "target_dataset": {
        # ...
        # Same fields are allowed as in "source_dataset" with the
        # same default values, except the
        # following new fields/values:
        "bos_token": "<BOS>"
        "vocab_share": False,
        "embedding_init_share": False,
        "processing_share": False,
        "data_name": "target"
    },
    # (2) General hyperparams
    "num_epochs": 1,
    "batch_size": 64,
    "allow_smaller_final_batch": True,
    "shuffle": True,
    "shuffle_buffer_size": None,
    "shard_and_shuffle": False,
    "num_parallel_calls": 1,
    "prefetch_buffer_size": 0,
    "max_dataset_size": -1,
    "seed": None,
    "name": "paired_text_data",
    # (3) Bucketing
    "bucket_boundaries": [],
    "bucket_batch_sizes": None,
    "bucket_length_fn": None,
}

Here:

1. Hyperparameters in the "source_dataset" and attr:”target_dataset” fields have the same definition as those in texar.tf.data.MonoTextData.default_hparams(), for source and target text, respectively.

For the new hyperparameters in “target_dataset”:

“vocab_share”: bool
Whether to share the vocabulary of the source dataset. If True, the vocab file of the target is ignored. (A configuration sketch is given after this list.)
“embedding_init_share”: bool

Whether to share the embedding initial value of the source dataset. If True, "embedding_init" of the target is ignored.

"vocab_share" must be True to share the embedding initial value.

“processing_share”: bool
Whether to share the processing configurations of the source dataset, including “delimiter”, “bos_token”, “eos_token”, and “other_transformations”.

2. For the general hyperparameters, see texar.tf.data.DataBase.default_hparams() for details.

3. For bucketing hyperparameters, see texar.tf.data.MonoTextData.default_hparams() for details, except that the default bucket_length_fn is the maximum sequence length of source and target sequences.
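As referenced above, a hedged configuration sketch of sharing between source and target (file names are illustrative):

from texar.tf.data import PairedTextData

hparams = {
    'source_dataset': {'files': 'src.txt', 'vocab_file': 'vocab.txt'},
    'target_dataset': {
        'files': 'tgt.txt',
        # Re-use the source vocabulary (and its embedding initialization,
        # if any); "vocab_share" must be True for the latter.
        'vocab_share': True,
        'embedding_init_share': True,
    },
    'batch_size': 32,
}
data = PairedTextData(hparams)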

list_items()[source]

Returns the list of item names that the data can produce.

Returns:A list of strings.
dataset

The dataset.

dataset_size()[source]

Returns the number of data instances in the dataset.

Note that this is the total data count in the raw files, before any filtering and truncation.

vocab

A pair of Vocab instances that are the source and target vocabs, respectively.

source_vocab

The source vocab, an instance of Vocab.

target_vocab

The target vocab, an instance of Vocab.

source_embedding_init_value

The Tensor containing the embedding value of source data loaded from file. None if embedding is not specified.

target_embedding_init_value

The Tensor containing the embedding value of target data loaded from file. None if embedding is not specified.

embedding_init_value()[source]

A pair of Tensors containing the embedding values of source and target data loaded from file.

source_text_name

The name of the source text tensor, “source_text” by default.

source_length_name

The name of the source length tensor, “source_length” by default.

source_text_id_name

The name of the source text index tensor, “source_text_ids” by default.

source_utterance_cnt_name

The name of the source text utterance count tensor, “source_utterance_cnt” by default.

target_text_name

The name of the target text tensor, “target_text” by default.

target_length_name

The name of the target length tensor, “target_length” by default.

target_text_id_name

The name of the target text index tensor, “target_text_ids” by default.

target_utterance_cnt_name

The name of the target text utterance count tensor, “target_utterance_cnt” by default.

text_name

The name of text tensor, “text” by default.

length_name

The name of length tensor, “length” by default.

text_id_name

The name of text index tensor, “text_ids” by default.

utterance_cnt_name

The name of the text utterance count tensor, “utterance_cnt” by default.

batch_size

The batch size.

hparams

A HParams instance of the data hyperparameters.

name

Name of the module.

num_epochs

Number of epochs.

ScalarData

class texar.tf.data.ScalarData(hparams)[source]

Scalar data where each line of the files is a scalar (int or float), e.g., a data label.

Parameters:hparams (dict) – Hyperparameters. See default_hparams() for the defaults.

The processor reads and processes raw data and results in a TF dataset whose element is a python dict including one field. The field name is specified in hparams["dataset"]["data_name"]. If not specified, the default name is “data”. The field name can be accessed through data_name.

This field is a Tensor of shape [batch_size] containing a batch of scalars, of either int or float type as specified in hparams.

Example

hparams={
    'dataset': { 'files': 'data.txt', 'data_name': 'label' },
    'batch_size': 2
}
data = ScalarData(hparams)
iterator = DataIterator(data)
batch = iterator.get_next()

iterator.switch_to_dataset(sess) # initializes the dataset
batch_ = sess.run(batch)
# batch_ == {
#     'label': [2, 9]
# }
static default_hparams()[source]

Returns a dictionary of default hyperparameters.

{
    # (1) Hyperparams specific to scalar dataset
    "dataset": {
        "files": [],
        "compression_type": None,
        "data_type": "int",
        "other_transformations": [],
        "data_name": None,
    },
    # (2) General hyperparams
    "num_epochs": 1,
    "batch_size": 64,
    "allow_smaller_final_batch": True,
    "shuffle": True,
    "shuffle_buffer_size": None,
    "shard_and_shuffle": False,
    "num_parallel_calls": 1,
    "prefetch_buffer_size": 0,
    "max_dataset_size": -1,
    "seed": None,
    "name": "scalar_data",
}

Here:

  1. For the hyperparameters in the "dataset" field:

    “files”: str or list

    A (list of) file path(s).

    Each line contains a single scalar number.

    “compression_type”: str, optional

    One of “” (no compression), “ZLIB”, or “GZIP”.

    “data_type”: str

    The scalar type. Currently supports “int” and “float”.

    “other_transformations”: list

    A list of transformation functions or function names/paths to further transform each single data instance.

    (More documentation to be added.)

    “data_name”: str

    Name of the dataset.

2. For the general hyperparameters, see texar.tf.data.DataBase.default_hparams() for details.

list_items()[source]

Returns the list of item names that the data can produce.

Returns:A list of strings.
dataset

The dataset.

dataset_size()[source]

Returns the number of data instances in the dataset.

Note that this is the total data count in the raw files, before any filtering and truncation.

data_name

The name of the data tensor, “data” by default if not specified in hparams.

batch_size

The batch size.

hparams

A HParams instance of the data hyperparameters.

name

Name of the module.

num_epochs

Number of epochs.

TFRecordData

class texar.tf.data.TFRecordData(hparams)[source]

TFRecord data which loads and processes TFRecord files.

This module can be used to process image data, features, etc.

Parameters:hparams (dict) – Hyperparameters. See default_hparams() for the defaults.

The module reads and restores data from TFRecord files and results in a TF Dataset whose element is a Python dict that maps feature names to feature values. The feature names and dtypes are specified in hparams["dataset"]["feature_original_types"].

The module also provides simple processing options for image data, such as image resize.

Example

# Read data from TFRecord file
hparams={
    'dataset': {
        'files': 'image1.tfrecord',
        'feature_original_types': {
            'height': ['tf.int64', 'FixedLenFeature'],
            'width': ['tf.int64', 'FixedLenFeature'],
            'label': ['tf.int64', 'FixedLenFeature'],
            'image_raw': ['tf.string', 'FixedLenFeature']
        }
    },
    'batch_size': 1
}
data = TFRecordData(hparams)
iterator = DataIterator(data)
batch = iterator.get_next()

iterator.switch_to_dataset(sess) # initializes the dataset
batch_ = sess.run(batch)
# batch_ == {
#    'data': {
#        'height': [239],
#        'width': [149],
#        'label': [1],
#
#        # 'image_raw' is a list of image data bytes in this
#        # example.
#        'image_raw': [...],
#    }
# }
# Read image data from TFRecord file and do resizing
hparams={
    'dataset': {
        'files': 'image2.tfrecord',
        'feature_original_types': {
            'label': ['tf.int64', 'FixedLenFeature'],
            'image_raw': ['tf.string', 'FixedLenFeature']
        },
        'image_options': {
            'image_feature_name': 'image_raw',
            'resize_height': 512,
            'resize_width': 512,
        }
    },
    'batch_size': 1
}
data = TFRecordData(hparams)
iterator = DataIterator(data)
batch = iterator.get_next()

iterator.switch_to_dataset(sess) # initializes the dataset
batch_ = sess.run(batch)
# batch_ == {
#    'data': {
#        'label': [1],
#
#        # "image_raw" is a list of a "numpy.ndarray" image
#        # in this example. Each image has a width of 512 and
#        # height of 512.
#        'image_raw': [...]
#    }
# }
static default_hparams()[source]

Returns a dictionary of default hyperparameters.

{
    # (1) Hyperparams specific to TFRecord dataset
    'dataset': {
        'files': [],
        'feature_original_types': {},
        'feature_convert_types': {},
        'image_options': {},
        "num_shards": None,
        "shard_id": None,
        "other_transformations": [],
        "data_name": None,
    },
    # (2) General hyperparams
    "num_epochs": 1,
    "batch_size": 64,
    "allow_smaller_final_batch": True,
    "shuffle": True,
    "shuffle_buffer_size": None,
    "shard_and_shuffle": False,
    "num_parallel_calls": 1,
    "prefetch_buffer_size": 0,
    "max_dataset_size": -1,
    "seed": None,
    "name": "tfrecord_data",
}

Here:

  1. For the hyperparameters in the "dataset" field:

    “files”: str or list

    A (list of) TFRecord file path(s).

    “feature_original_types”: dict

    The feature names (str) with their data types and length types, given as key-value pairs of the form feature_name: [dtype, feature_len_type, len] (a sketch of writing a matching TFRecord file is given after this list), where:

    • dtype is a TF Dtype such as tf.string and tf.int32, or its string name such as ‘tf.string’ and ‘tf.int32’. The feature will be read from the files and parsed into this dtype.
    • feature_len_type is of type str, and can be either ‘FixedLenFeature’ or ‘VarLenFeature’ for fixed length features and non-fixed length features, respectively.
    • len is an int and is optional. It is the length for ‘FixedLenFeature’. Ignored if ‘VarLenFeature’ is used.

    Example:

    feature_original_types = {
        "input_ids": ["tf.int64", "FixedLenFeature", 128],
        "label_ids": ["tf.int64", "FixedLenFeature"],
        "name_lists": ["tf.string", "VarLenFeature"],
    }
    
    “feature_convert_types”: dict, optional

    Specifies dtype conversion after reading the data files. This dict maps feature names to desired data dtypes. For example, you can first read a feature into dtype tf.float64 by specifying it in “feature_original_types” above, and then convert the feature to dtype “tf.int64” by specifying it here. Features not specified here are not converted.

    • dtype is a TF Dtype such as tf.string and tf.int32, or its string name such as ‘tf.string’ and ‘tf.int32’.

    Note that this conversion takes place after all the data are read, so "feature_original_types" must be specified first.

    Example:

    feature_convert_types = {
        "input_ids": "tf.int32",
        "label_ids": "tf.int32",
    }
    
    “image_options”: dict, optional

    Specifies the image feature name and performs image resizing; it includes three fields:

    • “image_feature_name”:
      A str, the name of the feature which contains the image data. If set, the image data will be restored in format numpy.ndarray.
    • “resize_height”:
      An int, the height of the image after resizing.
    • “resize_width”:
      An int, the width of the image after resizing.

    If either resize_height or resize_width is not set, the image data will be restored with its original shape.

    “num_shards”: int, optional

    The number of data shards in distributed mode. Usually set to the number of processes in distributed computing. Used in combination with "shard_id".

    “shard_id”: int, optional

    Sets the unique id to identify a shard. The module will process only the corresponding shard of the whole data. Used in combination with "num_shards".

    E.g., in a case of distributed computing on 2 GPUs, the hparams of the data module for the two processes can be as below, respectively.

    For gpu 0:

    dataset: {
        ...
        "num_shards": 2,
        "shard_id": 0
    }
    

    For gpu 1:

    dataset: {
        ...
        "num_shards": 2,
        "shard_id": 1
    }
    

    Also refer to examples/bert for a use case.

    “other_transformations”: list

    A list of transformation functions or function names/paths to further transform each single data instance.

    “data_name”: str

    Name of the dataset.

2. For the general hyperparameters, see texar.tf.data.DataBase.default_hparams() for details.
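As referenced above, a hedged sketch (not part of the original docs) of writing a TFRecord file whose features match the "feature_original_types" example; the file name and feature values are illustrative:

import tensorflow as tf

def _int64_feature(values):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=values))

def _bytes_feature(values):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=values))

with tf.python_io.TFRecordWriter("data.tfrecord") as writer:
    example = tf.train.Example(features=tf.train.Features(feature={
        "input_ids": _int64_feature([0] * 128),      # FixedLenFeature, len 128
        "label_ids": _int64_feature([1]),            # FixedLenFeature
        "name_lists": _bytes_feature([b"a", b"b"]),  # VarLenFeature
    }))
    writer.write(example.SerializeToString())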

list_items()[source]

Returns the list of item names that the data can produce.

Returns:A list of strings.
feature_names

A list of feature names.

batch_size

The batch size.

hparams

A HParams instance of the data hyperparameters.

name

Name of the module.

num_epochs

Number of epochs.

MultiAlignedData

class texar.tf.data.MultiAlignedData(hparams)[source]

Data consisting of multiple aligned parts.

Parameters:hparams (dict) – Hyperparameters. See default_hparams() for the defaults.

The processor reads any number of parallel fields as specified in the “datasets” list of hparams, and results in a TF Dataset whose element is a python dict containing data fields from each of the specified datasets. Fields from a text dataset or TFRecord dataset have names prefixed by its “data_name”. Fields from a scalar dataset are named by its “data_name”.

Example

hparams={
    'datasets': [
        {'files': 'a.txt', 'vocab_file': 'v.a', 'data_name': 'x'},
        {'files': 'b.txt', 'vocab_file': 'v.b', 'data_name': 'y'},
        {'files': 'c.txt', 'data_type': 'int', 'data_name': 'z'}
    ],
    'batch_size': 1
}
data = MultiAlignedData(hparams)
iterator = DataIterator(data)
batch = iterator.get_next()

iterator.switch_to_dataset(sess) # initializes the dataset
batch_ = sess.run(batch)
# batch_ == {
#    'x_text': [['<BOS>', 'x', 'sequence', '<EOS>']],
#    'x_text_ids': [[1, 5, 10, 2]],
#    'x_length': [4]
#    'y_text': [['<BOS>', 'y', 'sequence', '1', '<EOS>']],
#    'y_text_ids': [[1, 6, 10, 20, 2]],
#    'y_length': [5],
#    'z': [1000],
# }
...

hparams={
    'datasets': [
        {'files': 'd.txt', 'vocab_file': 'v.d', 'data_name': 'm'},
        {
            'files': 'd.tfrecord',
            'data_type': 'tf_record',
            "feature_original_types": {
                'image': ['tf.string', 'FixedLenFeature']
            },
            'image_options': {
                'image_feature_name': 'image',
                'resize_height': 512,
                'resize_width': 512,
            },
            'data_name': 't',
        }
    ],
    'batch_size': 1
}
data = MultiAlignedData(hparams)
iterator = DataIterator(data)
batch = iterator.get_next()

iterator.switch_to_dataset(sess) # initializes the dataset
batch_ = sess.run(batch)
# batch_ == {
#    'm_text': [['<BOS>', 'NewYork', 'City', 'Map', '<EOS>']],
#    'm_text_ids': [[1, 100, 80, 65, 2]],
#    'm_length': [5],
#
#    # "t_image" is a list of a "numpy.ndarray" image
#    # in this example. Its width equals to 512 and
#    # its height equals to 512.
#    't_image': [...]
# }
static default_hparams()[source]

Returns a dicitionary of default hyperparameters.

{
    # (1) Hyperparams specific to text dataset
    "datasets": []
    # (2) General hyperparams
    "num_epochs": 1,
    "batch_size": 64,
    "allow_smaller_final_batch": True,
    "shuffle": True,
    "shuffle_buffer_size": None,
    "shard_and_shuffle": False,
    "num_parallel_calls": 1,
    "prefetch_buffer_size": 0,
    "max_dataset_size": -1,
    "seed": None,
    "name": "multi_aligned_data",
}

Here:

1. “datasets” is a list of dicts, each of which specifies a dataset that can be text, scalar, or TFRecord. The "data_name" field of each dataset is used as the name prefix of the data fields from the respective dataset, and must be unique across datasets. (A configuration sketch is given after this list.)

  • For scalar datasets, the allowed hyperparameters and default values are the same as the “dataset” field of texar.tf.data.ScalarData.default_hparams(). Note that "data_type" must be explicitly specified (either “int” or “float”).

  • For TFRecord datasets, the allowed hyperparameters and default values are the same as the “dataset” field of texar.tf.data.TFRecordData.default_hparams(). Note that "data_type" must be explicitly specified as “tf_record”.

  • For text datasets, the allowed hyperparameters and default values are the same as the “dataset” field of texar.tf.data.MonoTextData.default_hparams(), with several extra hyperparameters:

    “data_type”: str

    The type of the dataset, one of {“text”, “int”, “float”, “tf_record”}. If set to “int” or “float”, the dataset is considered to be a scalar dataset. If set to “tf_record”, the dataset is considered to be a TFRecord dataset. If not specified or set to “text”, the dataset is considered to be a text dataset.

    “vocab_share_with”: int, optional

    Share the vocabulary of a preceding text dataset with the specified index in the list (starting from 0). The specified dataset must be a text dataset, and must have an index smaller than the current dataset.

    If specified, the vocab file of current dataset is ignored. Default is None which disables the vocab sharing.

    “embedding_init_share_with”: int, optional

    Share the embedding initial value of a preceding text dataset with the specified index in the list (starting from 0). The specified dataset must be a text dataset, and must have an index smaller than the current dataset.

    If specified, the "embedding_init" field of the current dataset is ignored. Default is None which disables the initial value sharing.

    “processing_share_with”: int, optional

    Share the processing configurations of a preceding text dataset with the specified index in the list (starting from 0). The specified dataset must be a text dataset, and must have an index smaller than the current dataset.

    If specified, relevant field of the current dataset are ignored, including “delimiter”, “bos_token”, “eos_token”, and “other_transformations”. Default is None which disables the processing sharing.

2. For the general hyperparameters, see texar.tf.data.DataBase.default_hparams() for details.
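As referenced above, a hedged configuration sketch (file names are illustrative) in which the second text dataset shares the vocabulary and processing configurations of the first:

from texar.tf.data import MultiAlignedData

hparams = {
    'datasets': [
        {'files': 'src.txt', 'vocab_file': 'vocab.txt', 'data_name': 'src'},
        {'files': 'tgt.txt', 'data_name': 'tgt',
         'vocab_share_with': 0,        # index of the dataset to share with
         'processing_share_with': 0},
    ],
    'batch_size': 32,
}
data = MultiAlignedData(hparams)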

list_items()[source]

Returns the list of item names that the data can produce.

Returns:A list of strings.
dataset

The dataset.

dataset_size()[source]

Returns the number of data instances in the dataset.

Note that this is the total data count in the raw files, before any filtering and truncation.

vocab(name_or_id)[source]

Returns the Vocab of text dataset by its name or id. None if the dataset is not of text type.

Parameters:name_or_id (str or int) – Data name or the index of text dataset.
embedding_init_value(name_or_id)[source]

Returns the Tensor of embedding init value of the dataset by its name or id. None if the dataset is not of text type.

text_name(name_or_id)[source]

The name of the text tensor of the text dataset by its name or id. If the dataset is not of text type, returns None.

length_name(name_or_id)[source]

The name of length tensor of text dataset by its name or id. If the dataset is not of text type, returns None.

text_id_name(name_or_id)[source]

The name of the text index tensor of the text dataset by its name or id. If the dataset is not of text type, returns None.

utterance_cnt_name(name_or_id)[source]

The name of utterance count tensor of text dataset by its name or id. If the dataset is not variable utterance text data, returns None.

data_name(name_or_id)[source]

The name of the data tensor of the scalar dataset by its name or id. If the dataset is not a scalar dataset, returns None.

batch_size

The batch size.

hparams

A HParams instance of the data hyperparameters.

name

Name of the module.

num_epochs

Number of epochs.

TextDataBase

class texar.tf.data.TextDataBase(hparams)[source]

Base class inherited by all text data classes.

static default_hparams()[source]

Returns a dictionary of default hyperparameters.

See the specific subclasses for the details.

Data Iterators

DataIteratorBase

class texar.tf.data.DataIteratorBase(datasets)[source]

Base class for all data iterator classes to inherit. A data iterator is a wrapper of tf.data.Iterator, and can switch between and iterate through multiple datasets.

Parameters:datasets

Datasets to iterate through. This can be a single instance of tf.data.Dataset or of a subclass of DataBase, or a dict mapping dataset names to such instances.

num_datasets

Number of datasets.

dataset_names

A list of dataset names.

DataIterator

class texar.tf.data.DataIterator(datasets)[source]

Data iterator that switches and iterates through multiple datasets.

This is a wrapper of the TF reinitializable iterator.

Parameters:datasets

Datasets to iterate through. This can be a single instance of tf.data.Dataset or of a subclass of DataBase, or a dict mapping dataset names to such instances (as in the example below).

Example

train_data = MonoTextData(hparams_train)
test_data = MonoTextData(hparams_test)
iterator = DataIterator({'train': train_data, 'test': test_data})
batch = iterator.get_next()

sess = tf.Session()

for _ in range(200): # Run 200 epochs of train/test
    # Starts iterating through training data from the beginning
    iterator.switch_to_dataset(sess, 'train')
    while True:
        try:
            train_batch_ = sess.run(batch)
        except tf.errors.OutOfRangeError:
            print("End of training epoch.")
            break
    # Starts iterating through test data from the beginning
    iterator.switch_to_dataset(sess, 'test')
    while True:
        try:
            test_batch_ = sess.run(batch)
        except tf.errors.OutOfRangeError:
            print("End of test epoch.")
            break
switch_to_dataset(sess, dataset_name=None)[source]

Re-initializes the iterator of a given dataset and starts iterating over the dataset (from the beginning).

Parameters:
  • sess – The current tf session.
  • dataset_name (optional) – Name of the dataset. If not provided, there must be only one Dataset.
get_next()[source]

Returns the next element of the activated dataset.

TrainTestDataIterator

class texar.tf.data.TrainTestDataIterator(train=None, val=None, test=None)[source]

Data iterator that alternates between train, val, and test datasets.

train, val, and test can each be an instance of either tf.data.Dataset or a subclass of DataBase. At least one of them must be provided.

This is a wrapper of DataIterator.

Parameters:
  • train (optional) – Training data.
  • val (optional) – Validation data.
  • test (optional) – Test data.

Example

train_data = MonoTextData(hparams_train)
val_data = MonoTextData(hparams_val)
iterator = TrainTestDataIterator(train=train_data, val=val_data)
batch = iterator.get_next()

sess = tf.Session()

for _ in range(200): # Run 200 epochs of train/val
    # Starts iterating through training data from the beginning
    iterator.switch_to_train_data(sess)
    while True:
        try:
            train_batch_ = sess.run(batch)
        except tf.errors.OutOfRangeError:
            print("End of training epoch.")
            break
    # Starts iterating through val data from the beginning
    iterator.switch_to_val_data(sess)
    while True:
        try:
            val_batch_ = sess.run(batch)
        except tf.errors.OutOfRangeError:
            print("End of val epoch.")
            break
switch_to_train_data(sess)[source]

Starts to iterate through training data (from the beginning).

Parameters:sess – The current tf session.
switch_to_val_data(sess)[source]

Starts to iterate through val data (from the beginning).

Parameters:sess – The current tf session.
switch_to_test_data(sess)[source]

Starts to iterate through test data (from the beginning).

Parameters:sess – The current tf session.
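A test pass mirrors the train/val loops in the example above. A minimal sketch, reusing iterator, batch, and sess from that example and assuming a test dataset was passed to the constructor:

iterator.switch_to_test_data(sess)   # start test data from the beginning
while True:
    try:
        test_batch_ = sess.run(batch)
    except tf.errors.OutOfRangeError:
        print("End of test epoch.")
        break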

FeedableDataIterator

class texar.tf.data.FeedableDataIterator(datasets)[source]

Data iterator that iterates through multiple datasets and switches between datasets.

The iterator can switch to a dataset and resume from where it left off the last time the dataset was visited. This is a wrapper of a TF feedable iterator.

Parameters:datasets – Datasets to iterate through. This can be a single instance of tf.data.Dataset or of a DataBase subclass, or a dict mapping dataset names to such instances.

Example

train_data = MonoTextData(hparams={'num_epochs': 200, ...})
test_data = MonoTextData(hparams_test)
iterator = FeedableDataIterator({'train': train_data,
                                 'test': test_data})
batch = iterator.get_next()

sess = tf.Session()

def _eval_epoch(): # Iterate through test data for one epoch
    # Initialize and start from beginning of test data
    iterator.initialize_dataset(sess, 'test')
    while True:
        try:
            feed_dict = { # Read from test data
                iterator.handle: iterator.get_handle(sess, 'test')
            }
            test_batch_ = sess.run(batch, feed_dict=feed_dict)
        except tf.errors.OutOfRangeError:
            print("End of test epoch.")
            break

# Initialize and start from beginning of training data
iterator.initialize_dataset(sess, 'train')
step = 0
while True:
    try:
        feed_dict = { # Read from training data
            iterator.handle: iterator.get_handle(sess, 'train')
        }
        train_batch_ = sess.run(batch, feed_dict=feed_dict)

        step += 1
        if step % 200 == 0: # Evaluate periodically
            _eval_epoch()
    except tf.errors.OutOfRangeError:
        print("End of training.")
        break
get_handle(sess, dataset_name=None)[source]

Returns a dataset handle used to feed the handle placeholder to fetch data from the dataset.

Parameters:
  • sess – The current tf session.
  • dataset_name (optional) – Name of the dataset. If not provided, there must be only one Dataset.
Returns:

A string handle to be fed to the handle placeholder.

Example

next_element = iterator.get_next()
train_handle = iterator.get_handle(sess, 'train')
# Gets the next training element
ne_ = sess.run(next_element,
               feed_dict={iterator.handle: train_handle})
restart_dataset(sess, dataset_name=None)[source]

Restarts datasets so that next iteration will fetch data from the beginning of the datasets.

Parameters:
  • sess – The current tf session.
  • dataset_name (optional) – A dataset name or a list of dataset names that specifies which dataset(s) to restart. If None, all datasets are restarted.
initialize_dataset(sess, dataset_name=None)[source]

Initializes datasets. A dataset must be initialized before being used.

Parameters:
  • sess – The current tf session.
  • dataset_name (optional) – A dataset name or a list of dataset names that specifies which dataset(s) to initialize. If None, all datasets are initialized.
get_next()[source]

Returns the next element of the activated dataset.

handle

The handle placeholder that can be fed with a dataset handle to fetch data from the dataset.

TrainTestFeedableDataIterator

class texar.tf.data.TrainTestFeedableDataIterator(train=None, val=None, test=None)[source]

Feedable data iterator that alternates between train, val, and test datasets.

This is a wrapper of FeedableDataIterator. The iterator can switch to a dataset and resume from where it left off the last time it was visited.

train, val, and test can each be an instance of tf.data.Dataset or of a DataBase subclass. At least one of them must be provided.

Parameters:
  • train (optional) – Training data.
  • val (optional) – Validation data.
  • test (optional) – Test data.

Example

train_data = MonoTextData(hparams={'num_epochs': 200, ...})
test_data = MonoTextData(hparams_test)
iterator = TrainTestFeedableDataIterator(train=train_data,
                                         test=test_data)
batch = iterator.get_next()

sess = tf.Session()

def _eval_epoch(): # Iterate through test data for one epoch
    # Initialize and start from beginning of test data
    iterator.initialize_test_dataset(sess)
    while True:
        try:
            feed_dict = { # Read from test data
                iterator.handle: iterator.get_test_handle(sess)
            }
            test_batch_ = sess.run(batch, feed_dict=feed_dict)
        except tf.errors.OutOfRangeError:
            print("End of test epoch.")
            break

# Initialize and start from beginning of training data
iterator.initialize_train_dataset(sess)
step = 0
while True:
    try:
        feed_dict = { # Read from training data
            iterator.handle: iterator.get_train_handle(sess)
        }
        train_batch_ = sess.run(batch, feed_dict=feed_dict)

        step += 1
        if step % 200 == 0: # Evaluate periodically
            _eval_epoch()
    except tf.errors.OutOfRangeError:
        print("End of training.")
        break
get_train_handle(sess)[source]

Returns the handle of the training dataset. The handle can be used to feed the handle placeholder to fetch training data.

Parameters:sess – The current tf session.
Returns:A string handle to be fed to the handle placeholder.

Example

next_element = iterator.get_next()
train_handle = iterator.get_train_handle(sess)
# Gets the next training element
ne_ = sess.run(next_element,
               feed_dict={iterator.handle: train_handle})
get_val_handle(sess)[source]

Returns the handle of the validation dataset. The handle can be used to feed the handle placeholder to fetch validation data.

Parameters:sess – The current tf session.
Returns:A string handle to be fed to the handle placeholder.
get_test_handle(sess)[source]

Returns the handle of the test dataset. The handle can be used to feed the handle placeholder to fetch test data.

Parameters:sess – The current tf session.
Returns:A string handle to be fed to the handle placeholder.
restart_train_dataset(sess)[source]

Restarts the training dataset so that next iteration will fetch data from the beginning of the training dataset.

Parameters:sess – The current tf session.
restart_val_dataset(sess)[source]

Restarts the validation dataset so that next iteration will fetch data from the beginning of the validation dataset.

Parameters:sess – The current tf session.
restart_test_dataset(sess)[source]

Restarts the test dataset so that next iteration will fetch data from the beginning of the test dataset.

Parameters:sess – The current tf session.
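A validation pass follows the same pattern as the test epoch in the example above. A minimal sketch, reusing iterator, batch, and sess from that example and assuming a val dataset was passed to the constructor:

iterator.restart_val_dataset(sess)            # start from the beginning of val data
val_handle = iterator.get_val_handle(sess)
while True:
    try:
        val_batch_ = sess.run(batch,
                              feed_dict={iterator.handle: val_handle})
    except tf.errors.OutOfRangeError:
        print("End of val epoch.")
        break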

Data Utils

random_shard_dataset

texar.tf.data.random_shard_dataset(dataset_size, shard_size, seed=None)[source]

Returns a dataset transformation function that randomly shards a dataset.
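Since the returned value is a dataset transformation function (a callable that maps a tf.data.Dataset to a tf.data.Dataset), it can presumably be passed to tf.data.Dataset.apply. A minimal sketch with a placeholder file path and placeholder sizes:

import tensorflow as tf
import texar.tf as tx

dataset = tf.data.TextLineDataset('data/train.txt')     # placeholder dataset
shard_fn = tx.data.random_shard_dataset(
    dataset_size=10000, shard_size=1000, seed=42)        # placeholder sizes
dataset = dataset.apply(shard_fn)   # data is read shard-by-shard in random order (assumed behavior)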

maybe_tuple

texar.tf.data.maybe_tuple(data)[source]

Returns tuple(data) if data contains more than one element.

Used to wrap map_func inputs.
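A minimal sketch of the wrapping use case: tf.data.Dataset.map unpacks tuple elements into separate arguments, while a transformation may expect a single data argument that is a tuple (or, presumably, the lone component when there is only one). The transform function below is hypothetical:

import tensorflow as tf
import texar.tf as tx

def transform(data):            # hypothetical transformation taking one argument
    return data

# Re-pack the unpacked map arguments so 'transform' receives a single argument.
dataset = tf.data.Dataset.from_tensor_slices(([1, 2], [3, 4]))
dataset = dataset.map(lambda *args: transform(tx.data.maybe_tuple(args)))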

make_partial

texar.tf.data.make_partial(fn, *args, **kwargs)[source]

Returns a new function with a single argument by freezing the other arguments of fn.
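This appears analogous to functools.partial, with the remaining single argument supplied at call time. A minimal sketch with a hypothetical two-argument function (the exact argument order of the frozen call is an assumption):

import texar.tf as tx

def scale(x, factor):           # hypothetical two-argument function
    return x * factor

scale_by_2 = tx.data.make_partial(scale, factor=2)
y = scale_by_2(3)               # presumably equivalent to scale(3, factor=2)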

maybe_download

texar.tf.data.maybe_download(urls, path, filenames=None, extract=False)[source]

Downloads a set of files.

Parameters:
  • urls – A (list of) urls to download files.
  • path (str) – The destination path to save the files.
  • filenames – A (list of) strings of the file names. If given, must have the same length as urls. If None, filenames are extracted from urls.
  • extract (bool) – Whether to extract compressed files.
Returns:

A list of paths to the downloaded files.
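A minimal sketch; the URL and destination below are placeholders rather than real resources:

import texar.tf as tx

urls = ['https://example.com/corpus.zip']   # placeholder URL
paths = tx.data.maybe_download(urls, path='./data', extract=True)
# 'paths' lists the local paths of the downloaded files; files already present
# are assumed not to be re-downloaded.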

read_words

texar.tf.data.read_words(filename, newline_token=None)[source]

Reads words from a file.

Parameters:
  • filename (str) – Path to the file.
  • newline_token (str, optional) – The token to replace the original newline token “\n”. For example, newline_token=tx.data.SpecialTokens.EOS. If None, no replacement is performed.
Returns:

A list of words.
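A minimal sketch (the file path is a placeholder), replacing newlines with the EOS special token as suggested by the parameter description above:

import texar.tf as tx

words = tx.data.read_words('data/train.txt',
                           newline_token=tx.data.SpecialTokens.EOS)
print(words[:10])               # first few tokens, with "\n" replaced by the EOS token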

make_vocab

texar.tf.data.make_vocab(filenames, max_vocab_size=-1, newline_token=None, return_type='list', return_count=False)[source]

Builds the vocabulary from the files.

Parameters:
  • filenames (str) – A (list of) files.
  • max_vocab_size (int) – Maximum size of the vocabulary. Low-frequency words that exceed the limit will be discarded. Set to -1 (default) if no truncation is wanted.
  • newline_token (str, optional) – The token to replace the original newline token “\n”. For example, newline_token=tx.data.SpecialTokens.EOS. If None, no replacement is performed.
  • return_type (str) – Either “list” or “dict”. If “list” (default), this function returns a list of words sorted by frequency. If “dict”, this function returns a dict mapping words to their index sorted by frequency.
  • return_count (bool) – Whether to return word counts. If True and return_type is “dict”, then a count dict is returned, which is a mapping from words to their frequency.
Returns:

  • If return_count is False, returns a list or dict containing the vocabulary words.
  • If return_count is True, returns a pair (a, b), where a is a list or dict containing the vocabulary words, and b is a list or dict containing the word counts.
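A minimal sketch (file paths are placeholders), building a frequency-sorted word-to-index dict together with counts:

import texar.tf as tx

vocab, counts = tx.data.make_vocab(
    ['data/train.txt', 'data/val.txt'],   # placeholder files
    max_vocab_size=10000,
    return_type='dict',
    return_count=True)
# vocab: word -> index, sorted by frequency; counts: word -> frequency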

count_file_lines

texar.tf.data.count_file_lines(filenames)[source]

Counts the number of lines in the file(s).

make_chained_transformation

texar.tf.data.make_chained_transformation(tran_fns, *args, **kwargs)[source]

Returns a dataset transformation function that applies a list of transformations sequentially.

Parameters:
  • tran_fns (list) – A list of dataset transformation functions.
  • *args – Extra arguments for each of the transformation functions.
  • **kwargs – Extra keyword arguments for each of the transformation functions.
Returns:

A transformation function to be used in tf.data.Dataset.map.
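A minimal sketch with two hypothetical per-element transformations applied in sequence over a toy dataset:

import tensorflow as tf
import texar.tf as tx

def add_one(x):     # hypothetical transformation
    return x + 1

def double(x):      # hypothetical transformation
    return x * 2

chained = tx.data.make_chained_transformation([add_one, double])
dataset = tf.data.Dataset.range(5).map(chained)   # each element becomes (x + 1) * 2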

make_combined_transformation

texar.tf.data.make_combined_transformation(tran_fns, name_prefix=None, *args, **kwargs)[source]

Returns a dataset transformation function that applies transformations to each component of the data.

The data to be transformed must be a tuple of the same length as tran_fns.

Parameters:
  • tran_fns (list) – A list of elements where each element is a transformation function or a list of transformation functions.
  • name_prefix (list, optional) – Prefixes to the field names of each component of the data, to prevent fields with the same name in different components from overriding each other. If not None, must be of the same length as tran_fns.
  • *args – Extra arguments for each of the transformation function.
  • **kwargs – Extra keyword arguments for each of the transformation function.
Returns:

A transformation function to be used in tf.data.Dataset.map.
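A minimal sketch for a two-component tuple dataset. The per-component transformations below are hypothetical and return dicts of named fields (the form implied by the name_prefix description); the exact prefixed field naming is an assumption:

import tensorflow as tf
import texar.tf as tx

def process_src(src):           # hypothetical transformation for component 0
    return {'text': src}

def process_tgt(tgt):           # hypothetical transformation for component 1
    return {'text': tgt}

combined = tx.data.make_combined_transformation(
    [process_src, process_tgt], name_prefix=['src', 'tgt'])

# A toy dataset of (src, tgt) pairs; Dataset.map unpacks the tuple, so re-pack
# the components before calling the combined transformation.
dataset = tf.data.Dataset.from_tensor_slices((['a', 'b'], ['c', 'd']))
dataset = dataset.map(lambda *components: combined(components))
# The result is assumed to carry prefixed field names such as 'src_text' and 'tgt_text'.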