Data
Tokenizers
TokenizerBase
class texar.tf.data.TokenizerBase(hparams)
Base class inherited by all tokenizer classes. This class handles downloading and loading pre-trained tokenizers and adding tokens to the vocabulary.
Derived classes can set up a few special tokens to be used in common scripts and internals: bos_token, eos_token, unk_token, sep_token, pad_token, cls_token, mask_token, and additional_special_tokens.
An added_tokens_encoder is defined so that new tokens can be added to the vocabulary without having to handle the specific vocabulary augmentation methods of the various underlying dictionary structures (BPE, sentencepiece …).

classmethod load(pretrained_model_path: str, configs: Optional[Dict] = None)
Instantiate a tokenizer from the vocabulary files or the saved tokenizer files.
Parameters:
- pretrained_model_path – The path to a vocabulary file or a folder that contains the saved pre-trained tokenizer files.
- configs – Tokenizer configurations. Entries in this dictionary override the original tokenizer configurations saved in the configuration file.
Returns: A tokenizer instance.

save(save_dir: str) → Tuple[str]
Save the tokenizer vocabulary files (with added tokens), the tokenizer configuration file, and a dictionary mapping special token class attributes (cls_token, unk_token, …) to their values (<cls>, <unk>, …) to a directory, so that the tokenizer can be reloaded using load().
Parameters: save_dir – The path to a folder in which the tokenizer files will be saved.
Returns: The paths to the vocabulary file, added token file, special token mapping file, and the configuration file.

save_vocab(save_dir)
Save the tokenizer vocabulary to a directory. This method does not save added tokens, special token mappings, or the configuration file.
Please use save() to save the full tokenizer state so that it can be reloaded using load().

add_tokens(new_tokens: List[Optional[str]]) → int
Add a list of new tokens to the tokenizer class. Tokens that are not already in the vocabulary are added to the added_tokens_encoder, with indices starting from the last index of the current vocabulary.
Parameters: new_tokens – A list of new tokens.
Returns: The number of tokens added to the vocabulary, which can be used to increase the size of the associated model embedding matrices correspondingly.
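
The index assignment described for add_tokens can be illustrated with a small stand-alone sketch. The class AddedTokenTable below is hypothetical and is not part of Texar; it assumes that existing ids run contiguously from 0, so the next free id equals the current vocabulary size.

```python
from typing import Dict, List


class AddedTokenTable:
    """Hypothetical stand-in for a base vocabulary plus an added-token table."""

    def __init__(self, base_vocab: Dict[str, int]) -> None:
        self.base_vocab = dict(base_vocab)
        self.added_tokens_encoder: Dict[str, int] = {}

    def __len__(self) -> int:
        return len(self.base_vocab) + len(self.added_tokens_encoder)

    def add_tokens(self, new_tokens: List[str]) -> int:
        """Add unseen tokens and return how many were actually added."""
        added = 0
        for token in new_tokens:
            if token in self.base_vocab or token in self.added_tokens_encoder:
                continue  # already known, nothing to add
            # New ids continue after the ids already in use (assumption).
            self.added_tokens_encoder[token] = len(self)
            added += 1
        return added


table = AddedTokenTable({"[PAD]": 0, "[UNK]": 1, "hello": 2})
num_added = table.add_tokens(["hello", "<speaker1>", "<speaker2>"])
```

The returned count is the number by which an embedding matrix tied to this vocabulary would need to grow.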

add_special_tokens(special_tokens_dict: Dict[str, str]) → int
Add a dictionary of special tokens to the encoder and link them to class attributes. Special tokens that are not already in the vocabulary are added to it and indexed starting from the last index of the current vocabulary.
Parameters: special_tokens_dict – A dictionary mapping special token class attributes (cls_token, unk_token, …) to their values (<cls>, <unk>, …).
Returns: The number of tokens added to the vocabulary, which can be used to increase the size of the associated model embedding matrices correspondingly.

map_text_to_token(text: Optional[str], **kwargs) → List[str]
Maps a string to a sequence of tokens (strings), using the tokenizer. The text is split into words for word-based vocabularies, or into sub-words for sub-word-based vocabularies (BPE/SentencePiece/WordPiece). This function also takes care of the added tokens.
Parameters: text – An input string.
Returns: A list of tokens.

map_token_to_id(tokens)
Maps a single token or a sequence of tokens to an integer id (resp. a sequence of ids), using the vocabulary.
Parameters: tokens – A single token or a list of tokens.
Returns: A single token id or a list of token ids.

map_text_to_id(text: str) → List[int]
Maps a string to a sequence of ids (integers), using the tokenizer and vocabulary. Same as self.map_token_to_id(self.map_text_to_token(text)).
Parameters: text – An input string.
Returns: A list of token ids.

map_id_to_token(token_ids, skip_special_tokens=False)
Maps a single id or a sequence of ids to a token (resp. a sequence of tokens), using the vocabulary and added tokens.
Parameters:
- token_ids – A single token id or a list of token ids.
- skip_special_tokens – Whether to skip the special tokens.
Returns: A single token or a list of tokens.

map_token_to_text(tokens: List[str]) → str
Maps a sequence of tokens (strings) to a single string. The simplest way to do this is ' '.join(tokens), but we often want to remove sub-word tokenization artifacts at the same time.

map_id_to_text(token_ids: List[int], skip_special_tokens: bool = False, clean_up_tokenization_spaces: bool = True) → str
Maps a sequence of ids (integers) to a string, using the tokenizer and vocabulary, with options to remove special tokens and clean up tokenization spaces.
Parameters:
- token_ids – A list of token ids.
- skip_special_tokens – Whether to skip the special tokens.
- clean_up_tokenization_spaces – Whether to clean up simple English tokenization artifacts, such as spaces before punctuation and abbreviated forms.
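
As a rough illustration of what cleaning up tokenization spaces can mean, the sketch below removes spaces before punctuation and before a few English contractions. The function name and the replacement list are assumptions made for this example; they are not taken from the Texar source.

```python
def clean_up_tokenization(text: str) -> str:
    """Remove spaces before punctuation and common English contractions."""
    # Illustrative replacement list (assumption, not the library's own list).
    replacements = [
        (" .", "."), (" ?", "?"), (" !", "!"), (" ,", ","),
        (" n't", "n't"), (" 'm", "'m"), (" 's", "'s"),
        (" 've", "'ve"), (" 're", "'re"),
    ]
    for old, new in replacements:
        text = text.replace(old, new)
    return text


cleaned = clean_up_tokenization("he did n't know , did he ?")
```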

encode_text(text_a: str, text_b: Optional[str] = None, max_seq_length: Optional[int] = None)
Adds special tokens to a sequence or sequence pair and computes other information such as segment ids, input mask, and sequence length for specific tasks.

special_tokens_map
A dictionary mapping special token class attributes (cls_token, unk_token, …) to their values (<cls>, <unk>, …).

all_special_tokens
Lists all the special tokens (<cls>, <unk>, …) mapped to class attributes (cls_token, unk_token, …).

all_special_ids
Lists the vocabulary indices of the special tokens (<cls>, <unk>, …) mapped to class attributes (cls_token, unk_token, …).
BERTTokenizer
class texar.tf.data.BERTTokenizer(pretrained_model_name: Optional[str] = None, cache_dir: Optional[str] = None, hparams=None)
Pre-trained BERT Tokenizer.
Parameters:
- pretrained_model_name (optional) – A str, the name of the pre-trained model (e.g., bert-base-uncased). Please refer to PretrainedBERTMixin for all supported models. If None, the model name in hparams is used.
- cache_dir (optional) – The path to a folder in which the pre-trained models will be cached. If None (default), a default directory (the texar_data folder under the user's home directory) is used.
- hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters are set to default values. See default_hparams() for the hyperparameter structure and default values.

save_vocab(save_dir: str) → Tuple[str]
Save the tokenizer vocabulary to a directory or file.

map_token_to_text(tokens: List[str]) → str
Maps a sequence of tokens (strings) to a single string.
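
For a WordPiece vocabulary, joining tokens usually also means merging continuation pieces back into words. The sketch below assumes the common WordPiece convention that continuation pieces carry a "##" prefix; join_wordpieces is a hypothetical helper, not the library method.

```python
from typing import List


def join_wordpieces(tokens: List[str]) -> str:
    """Join tokens with spaces, then glue '##' continuation pieces to the
    preceding piece."""
    text = " ".join(tokens)
    return text.replace(" ##", "").strip()


joined = join_wordpieces(["un", "##believ", "##able", "results"])
```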

encode_text(text_a: str, text_b: Optional[str] = None, max_seq_length: Optional[int] = None) → Tuple[List[int], List[int], List[int]]
Adds special tokens to a sequence or sequence pair and computes the corresponding segment ids and input mask for BERT-specific tasks. The sequence will be truncated if its length is larger than max_seq_length.
A BERT sequence has the following format: [cls_token] X [sep_token]
A BERT sequence pair has the following format: [cls_token] A [sep_token] B [sep_token]
Parameters:
- text_a – The first input text.
- text_b – The second input text.
- max_seq_length – Maximum sequence length.
Returns: A tuple of (input_ids, segment_ids, input_mask), where
- input_ids: A list of input token ids with added special token ids.
- segment_ids: A list of segment ids.
- input_mask: A list of mask ids. The mask has 1 for real tokens and 0 for padding tokens. Only real tokens are attended to.
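
The following sketch shows how the three returned lists relate to the BERT formats above. It works on token ids directly rather than text, and several details are assumptions made for illustration: the function name encode_ids, padding up to max_seq_length with id 0, and truncating the longer of the two sequences first (as in the original BERT reference code). It also assumes max_seq_length is at least 3.

```python
from typing import List, Optional, Tuple


def encode_ids(ids_a: List[int], ids_b: Optional[List[int]] = None, *,
               cls_id: int, sep_id: int, max_seq_length: int,
               pad_id: int = 0) -> Tuple[List[int], List[int], List[int]]:
    ids_a = list(ids_a)
    ids_b = list(ids_b) if ids_b else []
    if ids_b:
        # Reserve room for [CLS], [SEP], [SEP]; trim the longer side first.
        while len(ids_a) + len(ids_b) > max_seq_length - 3:
            longer = ids_a if len(ids_a) > len(ids_b) else ids_b
            longer.pop()
    else:
        # Reserve room for [CLS] and [SEP].
        ids_a = ids_a[: max_seq_length - 2]

    input_ids = [cls_id] + ids_a + [sep_id]
    segment_ids = [0] * len(input_ids)
    if ids_b:
        input_ids += ids_b + [sep_id]
        segment_ids += [1] * (len(ids_b) + 1)

    # 1 marks real tokens, 0 marks padding.
    input_mask = [1] * len(input_ids)
    padding = max_seq_length - len(input_ids)
    input_ids += [pad_id] * padding
    segment_ids += [0] * padding
    input_mask += [0] * padding
    return input_ids, segment_ids, input_mask


pair = encode_ids([7, 8, 9], [20, 21], cls_id=101, sep_id=102,
                  max_seq_length=10)
single = encode_ids([7, 8, 9, 10, 11], cls_id=101, sep_id=102,
                    max_seq_length=5)
```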

static default_hparams() → Dict[str, Any]
Returns a dictionary of hyperparameters with default values.
- The tokenizer is determined by the constructor argument pretrained_model_name if it's specified. In this case, hparams are ignored.
- Otherwise, the tokenizer is determined by hparams['pretrained_model_name'] if it's specified. All other configurations in hparams are ignored.
- If the above two are None, the tokenizer is defined by the configurations in hparams.

    {
        "pretrained_model_name": "bert-base-uncased",
        "vocab_file": None,
        "max_len": 512,
        "unk_token": "[UNK]",
        "sep_token": "[SEP]",
        "pad_token": "[PAD]",
        "cls_token": "[CLS]",
        "mask_token": "[MASK]",
        "tokenize_chinese_chars": True,
        "do_lower_case": True,
        "do_basic_tokenize": True,
        "non_split_tokens": None,
        "name": "bert_tokenizer",
    }

Here:
- "pretrained_model_name": str or None
  The name of the pre-trained BERT model.
- "vocab_file": str or None
  The path to a one-wordpiece-per-line vocabulary file.
- "max_len": int
  The maximum sequence length that this model might ever be used with.
- "unk_token": str
  Unknown token.
- "sep_token": str
  Separation token.
- "pad_token": str
  Padding token.
- "cls_token": str
  Classification token.
- "mask_token": str
  Masking token.
- "tokenize_chinese_chars": bool
  Whether to tokenize Chinese characters.
- "do_lower_case": bool
  Whether to lower-case the input. Only has an effect when do_basic_tokenize=True.
- "do_basic_tokenize": bool
  Whether to do basic tokenization before wordpiece.
- "non_split_tokens": list
  List of tokens which will never be split during tokenization. Only has an effect when do_basic_tokenize=True.
- "name": str
  Name of the tokenizer.
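
The three precedence rules at the top of default_hparams() can be summarized as a small resolution function. This is only a sketch of the documented behaviour; resolve_tokenizer_source and the dictionaries it returns are hypothetical and do not mirror Texar internals.

```python
from typing import Any, Dict, Optional


def resolve_tokenizer_source(pretrained_model_name: Optional[str],
                             hparams: Dict[str, Any]) -> Dict[str, Any]:
    """Decide what defines the tokenizer, following the documented order."""
    # 1. The constructor argument wins; hparams are ignored.
    if pretrained_model_name is not None:
        return {"source": "pretrained", "name": pretrained_model_name}
    # 2. Otherwise use hparams["pretrained_model_name"]; the rest is ignored.
    if hparams.get("pretrained_model_name") is not None:
        return {"source": "pretrained",
                "name": hparams["pretrained_model_name"]}
    # 3. Otherwise the remaining hparams define the tokenizer.
    return {"source": "hparams", "config": dict(hparams)}


from_arg = resolve_tokenizer_source(
    "bert-large-uncased", {"pretrained_model_name": "bert-base-uncased"})
from_hparams = resolve_tokenizer_source(
    None, {"pretrained_model_name": "bert-base-uncased"})
from_config = resolve_tokenizer_source(
    None, {"pretrained_model_name": None, "vocab_file": "vocab.txt"})
```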
XLNetTokenizer
class texar.tf.data.XLNetTokenizer(pretrained_model_name: Optional[str] = None, cache_dir: Optional[str] = None, hparams=None)
Pre-trained XLNet Tokenizer.
Parameters:
- pretrained_model_name (optional) – A str, the name of the pre-trained model (e.g., xlnet-base-cased). Please refer to PretrainedXLNetMixin for all supported models. If None, the model name in hparams is used.
- cache_dir (optional) – The path to a folder in which the pre-trained models will be cached. If None (default), a default directory (the texar_data folder under the user's home directory) is used.
- hparams (dict or HParams, optional) – Hyperparameters. Missing hyperparameters are set to default values. See default_hparams() for the hyperparameter structure and default values.

save_vocab(save_dir: str) → Tuple[str]
Save the sentencepiece vocabulary (a copy of the original file) to a directory.

map_token_to_text(tokens: List[str]) → str
Maps a sequence of tokens (strings) to a single string.

encode_text(text_a: str, text_b: Optional[str] = None, max_seq_length: Optional[int] = None) → Tuple[List[int], List[int], List[int]]
Adds special tokens to a sequence or sequence pair and computes the corresponding segment ids and input mask for XLNet-specific tasks. The sequence will be truncated if its length is larger than max_seq_length.
An XLNet sequence has the following format: X [sep_token] [cls_token]
An XLNet sequence pair has the following format: A [sep_token] B [sep_token] [cls_token]
Parameters:
- text_a – The first input text.
- text_b – The second input text.
- max_seq_length – Maximum sequence length.
Returns: A tuple of (input_ids, segment_ids, input_mask), where
- input_ids: A list of input token ids with added special token ids.
- segment_ids: A list of segment ids.
- input_mask: A list of mask ids. The mask has 1 for real tokens and 0 for padding tokens. Only real tokens are attended to.

static default_hparams() → Dict[str, Any]
Returns a dictionary of hyperparameters with default values.
- The tokenizer is determined by the constructor argument pretrained_model_name if it's specified. In this case, hparams are ignored.
- Otherwise, the tokenizer is determined by hparams['pretrained_model_name'] if it's specified. All other configurations in hparams are ignored.
- If the above two are None, the tokenizer is defined by the configurations in hparams.

    {
        "pretrained_model_name": "xlnet-base-cased",
        "vocab_file": None,
        "max_len": None,
        "bos_token": "<s>",
        "eos_token": "</s>",
        "unk_token": "<unk>",
        "sep_token": "<sep>",
        "pad_token": "<pad>",
        "cls_token": "<cls>",
        "mask_token": "<mask>",
        "additional_special_tokens": ["<eop>", "<eod>"],
        "do_lower_case": False,
        "remove_space": True,
        "keep_accents": False,
    }

Here:
- "pretrained_model_name": str or None
  The name of the pre-trained XLNet model.
- "vocab_file": str or None
  The path to a sentencepiece vocabulary file.
- "max_len": int or None
  The maximum sequence length that this model might ever be used with.
- "bos_token": str
  Beginning of sentence token.
- "eos_token": str
  End of sentence token.
- "unk_token": str
  Unknown token.
- "sep_token": str
  Separation token.
- "pad_token": str
  Padding token.
- "cls_token": str
  Classification token.
- "mask_token": str
  Masking token.
- "additional_special_tokens": list
  A list of additional special tokens.
- "do_lower_case": bool
  Whether to lower-case the text.
- "remove_space": bool
  Whether to remove the space in the text.
- "keep_accents": bool
  Whether to keep the accents in the text.
- "name": str
  Name of the tokenizer.
Vocabulary
SpecialTokens
Vocab
class texar.tf.data.Vocab(filename, pad_token='<PAD>', bos_token='<BOS>', eos_token='<EOS>', unk_token='<UNK>')
Vocabulary class that loads a vocabulary from file and maintains mapping tables between token strings and indexes.
Each line of the vocab file should contain one vocabulary token, e.g.:

    vocab_token_1
    vocab token 2
    vocab token | 3 .
    ...

Parameters:
- filename (str) – Path to the vocabulary file where each line contains one token.
- bos_token (str) – A special token that will be added to the beginning of sequences.
- eos_token (str) – A special token that will be added to the end of sequences.
- unk_token (str) – A special token that will replace all unknown tokens (tokens not included in the vocabulary).
- pad_token (str) – A special token that is used to do padding.

load(filename)
Loads the vocabulary from the file.
Parameters: filename (str) – Path to the vocabulary file.
Returns: A tuple of TF and python mapping tables between word string and index, (id_to_token_map, token_to_id_map, id_to_token_map_py, token_to_id_map_py), where id_to_token_map and token_to_id_map are TF HashTable instances, and id_to_token_map_py and token_to_id_map_py are python defaultdict instances.
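
The Python side of these mapping tables can be sketched without TensorFlow. The helper load_vocab below is hypothetical. It assumes that the four special tokens are placed before the tokens read from the file (in the order PAD, BOS, EOS, UNK) and that unknown lookups fall back to the UNK entry; the TF HashTable counterparts are omitted.

```python
import io
from collections import defaultdict
from typing import DefaultDict, Iterable, Tuple


def load_vocab(lines: Iterable[str], pad_token: str = "<PAD>",
               bos_token: str = "<BOS>", eos_token: str = "<EOS>",
               unk_token: str = "<UNK>"
               ) -> Tuple[DefaultDict[int, str], DefaultDict[str, int]]:
    specials = [pad_token, bos_token, eos_token, unk_token]
    # One token per line; tokens may contain spaces, so only strip newlines.
    tokens = [line.rstrip("\n") for line in lines]
    tokens = [t for t in tokens if t and t not in specials]
    vocab = specials + tokens  # assumed ordering: specials first
    unk_id = vocab.index(unk_token)
    token_to_id = defaultdict(lambda: unk_id,
                              {tok: i for i, tok in enumerate(vocab)})
    id_to_token = defaultdict(lambda: unk_token, enumerate(vocab))
    return id_to_token, token_to_id


vocab_file = io.StringIO("hello\nvocab token 2\nworld\n")
id_to_token_map_py, token_to_id_map_py = load_vocab(vocab_file)
```

Using defaultdict means an out-of-vocabulary lookup returns the UNK id (or UNK string) instead of raising KeyError.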

map_ids_to_tokens(ids)
Maps ids into text tokens.
The returned tokens are a Tensor.
Parameters: ids – An int tensor of token ids.
Returns: A tensor of text tokens of the same shape.

map_tokens_to_ids(tokens)
Maps text tokens into ids.
The returned ids are a Tensor.
Parameters: tokens – A tensor of text tokens.
Returns: A tensor of token ids of the same shape.

map_ids_to_tokens_py(ids)
Maps ids into text tokens.
The input ids and returned tokens are both python arrays or lists.
Parameters: ids – An int numpy array or (possibly nested) list of token ids.
Returns: A numpy array of text tokens of the same shape as ids.

map_tokens_to_ids_py(tokens)
Maps text tokens into ids.
The input tokens and returned ids are both python arrays or lists.
Parameters: tokens – A numpy array or (possibly nested) list of text tokens.
Returns: A numpy array of token ids of the same shape as tokens.

id_to_token_map_py
The python defaultdict instance that maps from token index to the string form.

token_to_id_map_py
The python defaultdict instance that maps from token string to the index.

size
The vocabulary size.

bos_token
A string of the special token indicating the beginning of sequence.

bos_token_id
The int index of the special token indicating the beginning of sequence.

eos_token
A string of the special token indicating the end of sequence.

eos_token_id
The int index of the special token indicating the end of sequence.

unk_token
A string of the special token indicating unknown token.

unk_token_id
The int index of the special token indicating unknown token.

pad_token
A string of the special token indicating padding token. The default padding token is '<PAD>', as given in the constructor signature.

pad_token_id
The int index of the special token indicating padding token.
Embedding
Embedding
class texar.tf.data.Embedding(vocab, hparams=None)
Embedding class that loads token embedding vectors from file. Token embeddings not in the embedding file are initialized as specified in hparams.
Parameters:
- vocab (dict) – A dictionary that maps token strings to integer index.
- read_fn – Callable that takes (filename, vocab, word_vecs) and returns the updated word_vecs. E.g., load_word2vec() and load_glove().

static default_hparams()
Returns a dictionary of hyperparameters with default values:

    {
        "file": "",
        "dim": 50,
        "read_fn": "load_word2vec",
        "init_fn": {
            "type": "numpy.random.uniform",
            "kwargs": {
                "low": -0.1,
                "high": 0.1,
            }
        },
    }

Here:
- "file": str
  Path to the embedding file. If not provided, all embeddings are initialized with the initialization function.
- "dim": int
  Dimension size of each embedding vector.
- "read_fn": str or callable
  Function to read the embedding file. This can be the function, or its string name or full module path. E.g.,

    "read_fn": texar.tf.data.load_word2vec
    "read_fn": "load_word2vec"
    "read_fn": "texar.tf.data.load_word2vec"
    "read_fn": "my_module.my_read_fn"

  If the function's string name is used, the function must be in one of the modules texar.tf.data or texar.tf.custom.
  The function must have the same signature as load_word2vec().
- "init_fn": dict
  Hyperparameters of the initialization function used to initialize the embeddings of tokens missing from the embedding file.
  The function must accept an argument named size or shape to specify the output shape, and return a numpy array of that shape.
  The dict has the following fields:
  - "type": str or callable
    The initialization function. Can be either the function, or its string name or full module path.
  - "kwargs": dict
    Keyword arguments for calling the function. The function is called with init_fn(size=[.., ..], **kwargs).
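
The interplay of "init_fn" and the embedding file can be pictured as follows: every row is first drawn from the initializer, and rows for tokens found in the file are then overwritten. The sketch is hypothetical and uses the standard-library random.uniform in place of numpy.random.uniform so that it runs without NumPy; it also assumes vocabulary ids run from 0 to len(vocab) - 1.

```python
import random
from typing import Dict, List, Optional, Sequence


def init_word_vecs(vocab: Dict[str, int], dim: int,
                   loaded: Dict[str, Sequence[float]],
                   low: float = -0.1, high: float = 0.1,
                   seed: Optional[int] = None) -> List[List[float]]:
    rng = random.Random(seed)
    # Start with every row drawn from the initializer.
    word_vecs = [[rng.uniform(low, high) for _ in range(dim)]
                 for _ in range(len(vocab))]
    # Overwrite the rows of tokens that appear in the embedding file.
    for token, vector in loaded.items():
        if token in vocab:
            word_vecs[vocab[token]] = list(vector)
    return word_vecs


vocab = {"<UNK>": 0, "cat": 1, "dog": 2}
file_vectors = {"cat": [0.5, 0.5], "fish": [1.0, 1.0]}  # "fish" is ignored
word_vecs = init_word_vecs(vocab, dim=2, loaded=file_vectors, seed=0)
```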

word_vecs
2D numpy array of shape [vocab_size, embedding_dim].

vector_size
The embedding dimension size.
load_word2vec
texar.tf.data.load_word2vec(filename, vocab, word_vecs)
Loads embeddings in the word2vec binary format, which has a header line containing the number of vectors and their dimensionality (two integers), followed by number-of-vectors lines, each of which is formatted as '<word-string> <embedding-vector>'.
Returns: The updated word_vecs.
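
A reader for this kind of file might look like the sketch below. It is not the Texar implementation: read_word2vec_binary is a hypothetical name, and the sketch assumes the common word2vec binary layout in which each word is followed by a space and then the vector as little-endian float32 values, optionally terminated by a newline.

```python
import io
import struct
from typing import BinaryIO, Dict, List


def read_word2vec_binary(stream: BinaryIO, vocab: Dict[str, int],
                         word_vecs: List[List[float]]) -> List[List[float]]:
    # Header line: "<number of vectors> <dimensionality>".
    header = stream.readline().split()
    num_vectors, dim = int(header[0]), int(header[1])
    vector_bytes = 4 * dim  # float32 components (assumed layout)
    for _ in range(num_vectors):
        chars = []
        while True:
            ch = stream.read(1)
            if ch == b" " or ch == b"":
                break
            if ch != b"\n":  # some writers end each record with a newline
                chars.append(ch)
        word = b"".join(chars).decode("utf-8", errors="ignore")
        raw = stream.read(vector_bytes)
        if len(raw) < vector_bytes:
            break  # truncated file
        if word in vocab:
            word_vecs[vocab[word]] = list(struct.unpack("<%df" % dim, raw))
    return word_vecs


payload = (b"2 2\n"
           + b"cat " + struct.pack("<2f", 1.0, 2.0) + b"\n"
           + b"bird " + struct.pack("<2f", 3.0, 4.0) + b"\n")
updated = read_word2vec_binary(io.BytesIO(payload), {"cat": 0, "dog": 1},
                               [[0.0, 0.0], [0.0, 0.0]])
```

Only words present in vocab update word_vecs; all other records are skipped.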
Data
DataBase
class texar.tf.data.DataBase(hparams)
Base class inherited by all data classes.

static default_hparams()
Returns a dictionary of default hyperparameters.

    {
        "num_epochs": 1,
        "batch_size": 64,
        "allow_smaller_final_batch": True,
        "shuffle": True,
        "shuffle_buffer_size": None,
        "shard_and_shuffle": False,
        "num_parallel_calls": 1,
        "prefetch_buffer_size": 0,
        "max_dataset_size": -1,
        "seed": None,
        "name": "data",
    }

Here:
- "num_epochs": int
  Number of times the dataset should be repeated. An OutOfRangeError signal will be raised after the whole repeated dataset has been iterated through.
  E.g., for training data, set it to 1 (default) so that you will get the signal after each epoch of training. Set to -1 to repeat the dataset indefinitely.
- "batch_size": int
  Batch size, i.e., the number of consecutive elements of the dataset to combine in a single batch.
- "allow_smaller_final_batch": bool
  Whether to allow the final batch to be smaller if there are insufficient elements left. If False, the final batch is discarded if it is smaller than the batch size. Note that, if False, output_shapes of the resulting dataset will have a static batch_size dimension equal to "batch_size".
- "shuffle": bool
  Whether to randomly shuffle the elements of the dataset.
- "shuffle_buffer_size": int
  The buffer size for data shuffling. The larger the buffer, the better the resulting data is mixed.
  If None (default), the buffer size is set to the size of the whole dataset (i.e., the shuffling is maximally effective).
- "shard_and_shuffle": bool
  Whether to first shard the dataset and then shuffle each block separately. Useful when the whole dataset is too large to be loaded efficiently into memory.
  If True, shuffle_buffer_size must be specified to determine the size of each shard.
- "num_parallel_calls": int
  Number of elements from the datasets to process in parallel.
- "prefetch_buffer_size": int
  The maximum number of elements that will be buffered when prefetching.
- "max_dataset_size": int
  Maximum number of instances to include in the dataset. If set to -1 or greater than the size of the dataset, all instances will be included. This constraint is imposed after data shuffling and filtering.
- "seed": int, optional
  The random seed for shuffling.
  Note that if a seed is set, the shuffle order will be exactly the same every time the (repeated) dataset is iterated through.
  For example, consider a dataset with elements [1, 2, 3], with "num_epochs"=2 and some fixed seed. The resulting sequence can be: 2 1 3, 1 3 2 | 2 1 3, 1 3 2, … That is, the orders are different within every num_epochs, but are the same across the num_epochs.
- "name": str
  Name of the data.
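
Two of these hyperparameters are easy to misread, so here is a plain-Python sketch of their effect. Both helpers are hypothetical and only mimic the documented behaviour; they do not use tf.data.

```python
import random
from typing import List, Sequence


def make_batches(elements: Sequence[int], batch_size: int,
                 allow_smaller_final_batch: bool = True) -> List[List[int]]:
    """Group consecutive elements; optionally drop a short final batch."""
    batches = [list(elements[i:i + batch_size])
               for i in range(0, len(elements), batch_size)]
    if (batches and len(batches[-1]) < batch_size
            and not allow_smaller_final_batch):
        batches.pop()
    return batches


def repeated_shuffle(elements: Sequence[int], num_epochs: int,
                     seed: int) -> List[List[int]]:
    """Shuffle once per epoch from a single seeded generator, so a full pass
    over the repeated dataset is reproducible for a fixed seed."""
    rng = random.Random(seed)
    epochs = []
    for _ in range(num_epochs):
        order = list(elements)
        rng.shuffle(order)
        epochs.append(order)
    return epochs


kept = make_batches(list(range(5)), batch_size=2)
dropped = make_batches(list(range(5)), batch_size=2,
                       allow_smaller_final_batch=False)
first_pass = repeated_shuffle([1, 2, 3], num_epochs=2, seed=7)
second_pass = repeated_shuffle([1, 2, 3], num_epochs=2, seed=7)
```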

num_epochs
Number of epochs.

batch_size
The batch size.

name
Name of the module.
MonoTextData
class texar.tf.data.MonoTextData(hparams)
Text data processor that reads a single set of text files. This can be used for, e.g., language models, auto-encoders, etc.
Parameters: hparams – A dict or instance of HParams containing hyperparameters. See default_hparams() for the defaults.
By default, the processor reads raw data files, performs tokenization, batching and other pre-processing steps, and results in a TF Dataset whose element is a python dict including three fields:
- "text":
  A string Tensor of shape [batch_size, max_time] containing the raw text tokens. max_time is the length of the longest sequence in the batch. Short sequences in the batch are padded with empty string. BOS and EOS tokens are added as per hparams. Out-of-vocabulary tokens are NOT replaced with UNK.
- "text_ids":
  An int64 Tensor of shape [batch_size, max_time] containing the token indexes.
- "length":
  An int Tensor of shape [batch_size] containing the length of each sequence in the batch (including BOS and EOS if added).
If 'variable_utterance' is set to True in hparams, the resulting dataset has elements with four fields:
- "text":
  A string Tensor of shape [batch_size, max_utterance, max_time], where max_utterance is either the maximum number of utterances in each element of the batch, or max_utterance_cnt as specified in hparams.
- "text_ids":
  An int64 Tensor of shape [batch_size, max_utterance, max_time] containing the token indexes.
- "length":
  An int Tensor of shape [batch_size, max_utterance] containing the length of each sequence in the batch.
- "utterance_cnt":
  An int Tensor of shape [batch_size] containing the number of utterances of each element in the batch.
The above field names can be accessed through text_name, text_id_name, length_name, and utterance_cnt_name, respectively.
Example

    # Assumes `sess` is an active tf.Session.
    from texar.tf.data import DataIterator, MonoTextData

    hparams = {
        'dataset': {'files': 'data.txt', 'vocab_file': 'vocab.txt'},
        'batch_size': 1
    }
    data = MonoTextData(hparams)
    iterator = DataIterator(data)
    batch = iterator.get_next()

    iterator.switch_to_dataset(sess)  # initializes the dataset
    batch_ = sess.run(batch)
    # batch_ == {
    #     'text': [['<BOS>', 'example', 'sequence', '<EOS>']],
    #     'text_ids': [[1, 5, 10, 2]],
    #     'length': [4]
    # }
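
How a single line of a text file turns into the "text" field can be sketched in plain Python. The function process_line is hypothetical; it follows the "dataset" hyperparameters documented under default_hparams() ("delimiter", "max_seq_length", "length_filter_mode", "bos_token", "eos_token") and ignores vocabulary lookup, padding, and batching.

```python
from typing import List, Optional


def process_line(line: str, delimiter: str = " ",
                 max_seq_length: Optional[int] = None,
                 length_filter_mode: str = "truncate",
                 bos_token: str = "<BOS>",
                 eos_token: str = "<EOS>") -> Optional[List[str]]:
    """Return the token list for one line, or None if it is discarded."""
    tokens = [tok for tok in line.strip().split(delimiter) if tok]
    # The length limit does not count the added BOS/EOS tokens.
    if max_seq_length is not None and len(tokens) > max_seq_length:
        if length_filter_mode == "discard":
            return None
        tokens = tokens[:max_seq_length]
    if bos_token:  # an empty string disables prepending
        tokens = [bos_token] + tokens
    if eos_token:  # an empty string disables appending
        tokens = tokens + [eos_token]
    return tokens


plain = process_line("example sequence")
truncated = process_line("a b c d", max_seq_length=2)
discarded = process_line("a b c d", max_seq_length=2,
                         length_filter_mode="discard")
no_bos = process_line("a b", bos_token="")
```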

static default_hparams()
Returns a dictionary of default hyperparameters:

    {
        # (1) Hyperparams specific to text dataset
        "dataset": {
            "files": [],
            "compression_type": None,
            "vocab_file": "",
            "embedding_init": {},
            "delimiter": " ",
            "max_seq_length": None,
            "length_filter_mode": "truncate",
            "pad_to_max_seq_length": False,
            "bos_token": "<BOS>",
            "eos_token": "<EOS>",
            "other_transformations": [],
            "variable_utterance": False,
            "utterance_delimiter": "|||",
            "max_utterance_cnt": 5,
            "data_name": None,
        },
        # (2) General hyperparams
        "num_epochs": 1,
        "batch_size": 64,
        "allow_smaller_final_batch": True,
        "shuffle": True,
        "shuffle_buffer_size": None,
        "shard_and_shuffle": False,
        "num_parallel_calls": 1,
        "prefetch_buffer_size": 0,
        "max_dataset_size": -1,
        "seed": None,
        "name": "mono_text_data",
        # (3) Bucketing
        "bucket_boundaries": [],
        "bucket_batch_sizes": None,
        "bucket_length_fn": None,
    }

Here:
1. For the hyperparameters in the "dataset" field:
- "files": str or list
  A (list of) text file path(s).
  Each line contains a single text sequence.
- "compression_type": str, optional
  One of "" (no compression), "ZLIB", or "GZIP".
- "vocab_file": str
  Path to the vocabulary file. Each line of the file should contain one vocabulary token.
  Used to create an instance of Vocab.
- "embedding_init": dict
  The hyperparameters for pre-trained embedding loading and initialization.
  The structure and default values are defined in texar.tf.data.Embedding.default_hparams().
- "delimiter": str
  The delimiter to split each line of the text files into tokens.
- "max_seq_length": int, optional
  Maximum length of output sequences. Data samples exceeding the length will be truncated or discarded according to "length_filter_mode". The length does not include any added "bos_token" or "eos_token". If None (default), no filtering is performed.
- "length_filter_mode": str
  Either "truncate" or "discard". If "truncate" (default), tokens exceeding the "max_seq_length" will be truncated. If "discard", data samples longer than the "max_seq_length" will be discarded.
- "pad_to_max_seq_length": bool
  If True, pad all data instances to length "max_seq_length". Raises an error if "max_seq_length" is not provided.
- "bos_token": str
  The Begin-Of-Sequence token prepended to each sequence.
  Set to an empty string to avoid prepending.
- "eos_token": str
  The End-Of-Sequence token appended to each sequence.
  Set to an empty string to avoid appending.
- "other_transformations": list
  A list of transformation functions or function names/paths to further transform each single data instance.
  (More documentation to be added.)
- "variable_utterance": bool
  If True, each line of the text file is considered to contain multiple sequences (utterances) separated by "utterance_delimiter".
  For example, in dialog data, each line can contain a series of dialog history utterances. See the example in examples/hierarchical_dialog for a use case.
- "utterance_delimiter": str
  The delimiter to split over utterance level. Should not be the same as "delimiter". Used only when "variable_utterance" is True.
- "max_utterance_cnt": int
  Maximum allowed number of utterances in a data instance. Extra utterances are truncated.
- "data_name": str
  Name of the dataset.
2. For the general hyperparameters, see texar.tf.data.DataBase.default_hparams() for details.
3. Bucketing groups elements of the dataset together by length and then pads and batches them. (See more at bucket_by_sequence_length.) For bucketing hyperparameters:
- "bucket_boundaries": list
  An int list containing the upper length boundaries of the buckets.
  Set to an empty list (default) to disable bucketing.
- "bucket_batch_sizes": list
  An int list containing the batch size per bucket. Its length should be len(bucket_boundaries) + 1.
  If None, every bucket will have the same batch size, as specified in batch_size.
- "bucket_length_fn": str or callable
  A function that maps a dataset element to a tf.int32 scalar, which determines the length of the element.
  This can be a function, or the name or full module path of the function. If a function name is given, the function must be in the texar.tf.custom module.
  If None (default), the length is determined by the number of tokens (including BOS and EOS if added) of the element.
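
The bucketing hyperparameters can be illustrated with a small pure-Python sketch. The helpers below are hypothetical and do not call TensorFlow; they assume the convention of TensorFlow's bucket_by_sequence_length, where a sequence whose length equals a boundary falls into the next bucket, and they pad with the empty string.

```python
import bisect
from typing import Dict, List, Optional, Sequence


def pad_batch(batch: List[List[str]], pad: str = "") -> List[List[str]]:
    """Pad every sequence in the batch to the longest one."""
    width = max(len(seq) for seq in batch)
    return [list(seq) + [pad] * (width - len(seq)) for seq in batch]


def bucket_and_batch(sequences: Sequence[List[str]],
                     bucket_boundaries: List[int], batch_size: int,
                     bucket_batch_sizes: Optional[List[int]] = None
                     ) -> List[List[List[str]]]:
    if bucket_batch_sizes is None:
        # Every bucket uses the general batch size.
        bucket_batch_sizes = [batch_size] * (len(bucket_boundaries) + 1)
    pending: Dict[int, List[List[str]]] = {}
    batches = []
    for seq in sequences:
        # Buckets are [0, b0), [b0, b1), ..., [b_last, inf).
        bucket = bisect.bisect_right(bucket_boundaries, len(seq))
        pending.setdefault(bucket, []).append(seq)
        if len(pending[bucket]) == bucket_batch_sizes[bucket]:
            batches.append(pad_batch(pending.pop(bucket)))
    # Emit the leftover, smaller batches.
    for bucket in sorted(pending):
        batches.append(pad_batch(pending[bucket]))
    return batches


data = [["a"], ["b", "c", "d"], ["e", "f"], ["g", "h", "i", "j"]]
batches = bucket_and_batch(data, bucket_boundaries=[3], batch_size=2)
```

Grouping by length before padding keeps the amount of padding per batch small, which is the motivation for bucketing.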

list_items()
Returns the list of item names that the data can produce.
Returns: A list of strings.

dataset
The dataset, an instance of TF dataset.

dataset_size()
Returns the number of data instances in the data files.
Note that this is the total data count in the raw files, before any filtering and truncation.

embedding_init_value
The Tensor containing the embedding value loaded from file. None if embedding is not specified.

text_name
The name of text tensor, "text" by default.

length_name
The name of length tensor, "length" by default.

text_id_name
The name of text index tensor, "text_ids" by default.

utterance_cnt_name
The name of utterance count tensor, "utterance_cnt" by default.

batch_size
The batch size.

name
Name of the module.

num_epochs
Number of epochs.
PairedTextData
class texar.tf.data.PairedTextData(hparams)
Text data processor that reads parallel source and target text. This can be used in, e.g., seq2seq models.
Parameters: hparams (dict) – Hyperparameters. See default_hparams() for the defaults.
By default, the processor reads raw data files, performs tokenization, batching and other pre-processing steps, and results in a TF Dataset whose element is a python dict including six fields:
- "source_text":
  A string Tensor of shape [batch_size, max_time] containing the raw text tokens of source sequences. max_time is the length of the longest sequence in the batch. Short sequences in the batch are padded with empty string. By default only the EOS token is appended to each sequence. Out-of-vocabulary tokens are NOT replaced with UNK.
- "source_text_ids":
  An int64 Tensor of shape [batch_size, max_time] containing the token indexes of source sequences.
- "source_length":
  An int Tensor of shape [batch_size] containing the length of each source sequence in the batch (including BOS and/or EOS if added).
- "target_text":
  A string Tensor as "source_text" but for target sequences. By default both BOS and EOS are added.
- "target_text_ids":
  An int64 Tensor as "source_text_ids" but for target sequences.
- "target_length":
  An int Tensor of shape [batch_size] as "source_length" but for target sequences.
If 'variable_utterance' is set to True in 'source_dataset' and/or 'target_dataset' of hparams, the corresponding fields "source_*" and/or "target_*" are respectively changed to contain variable utterance text data, as in MonoTextData.
The above field names can be accessed through source_text_name, source_text_id_name, source_length_name, source_utterance_cnt_name, and those prefixed with target_, respectively.
Example

    # Assumes `sess` is an active tf.Session.
    from texar.tf.data import DataIterator, PairedTextData

    hparams = {
        'source_dataset': {'files': 's', 'vocab_file': 'vs'},
        'target_dataset': {'files': ['t1', 't2'], 'vocab_file': 'vt'},
        'batch_size': 1
    }
    data = PairedTextData(hparams)
    iterator = DataIterator(data)
    batch = iterator.get_next()

    iterator.switch_to_dataset(sess)  # initializes the dataset
    batch_ = sess.run(batch)
    # batch_ == {
    #     'source_text': [['source', 'sequence', '<EOS>']],
    #     'source_text_ids': [[5, 10, 2]],
    #     'source_length': [3],
    #     'target_text': [['<BOS>', 'target', 'sequence', '1', '<EOS>']],
    #     'target_text_ids': [[1, 6, 10, 20, 2]],
    #     'target_length': [5]
    # }
-
static
default_hparams
()[source]¶ Returns a dicitionary of default hyperparameters.
{ # (1) Hyperparams specific to text dataset "source_dataset": { "files": [], "compression_type": None, "vocab_file": "", "embedding_init": {}, "delimiter": " ", "max_seq_length": None, "length_filter_mode": "truncate", "pad_to_max_seq_length": False, "bos_token": None, "eos_token": "<EOS>", "other_transformations": [], "variable_utterance": False, "utterance_delimiter": "|||", "max_utterance_cnt": 5, "data_name": "source", }, "target_dataset": { # ... # Same fields are allowed as in "source_dataset" with the # same default values, except the # following new fields/values: "bos_token": "<BOS>" "vocab_share": False, "embedding_init_share": False, "processing_share": False, "data_name": "target" } # (2) General hyperparams "num_epochs": 1, "batch_size": 64, "allow_smaller_final_batch": True, "shuffle": True, "shuffle_buffer_size": None, "shard_and_shuffle": False, "num_parallel_calls": 1, "prefetch_buffer_size": 0, "max_dataset_size": -1, "seed": None, "name": "paired_text_data", # (3) Bucketing "bucket_boundaries": [], "bucket_batch_sizes": None, "bucket_length_fn": None, }
Here:
1. Hyperparameters in the "source_dataset" and "target_dataset" fields have the same definition as those in texar.tf.data.MonoTextData.default_hparams(), for source and target text, respectively.
For the new hyperparameters in “target_dataset”:
- “vocab_share”: bool
Whether to share the vocabulary of source. If True, the vocab file of target is ignored.
- “embedding_init_share”: bool
Whether to share the embedding initial value of source. If True, "embedding_init" of target is ignored. "vocab_share" must be True to share the embedding initial value.
- “processing_share”: bool
Whether to share the processing configurations of source, including “delimiter”, “bos_token”, “eos_token”, and “other_transformations”.
2. For the general hyperparameters, see texar.tf.data.DataBase.default_hparams() for details.
3. For bucketing hyperparameters, see texar.tf.data.MonoTextData.default_hparams() for details, except that the default bucket_length_fn is the maximum sequence length of source and target sequences.
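The default bucket_length_fn for paired data, as described above, can be sketched in plain Python. This is only an illustration: the helper names paired_bucket_length and assign_bucket are hypothetical, and the bucket convention (boundaries b0, b1, ... define buckets [0, b0), [b0, b1), ..., [bn, inf)) is an assumption rather than something this page states.

```python
from bisect import bisect_right


def paired_bucket_length(source_length, target_length):
    """Length used for bucketing a (source, target) pair: the longer of the two."""
    return max(source_length, target_length)


def assign_bucket(length, bucket_boundaries):
    """Return the bucket index for `length` (assumed half-open bucket convention)."""
    return bisect_right(bucket_boundaries, length)


boundaries = [10, 20]                 # hypothetical "bucket_boundaries"
pairs = [(3, 5), (12, 7), (9, 25)]    # (source_length, target_length)
bucket_ids = [
    assign_bucket(paired_bucket_length(s, t), boundaries) for s, t in pairs
]
print(bucket_ids)
```

Because the longer side of each pair decides the bucket, a short source paired with a long target still ends up in the bucket for long sequences.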
-
list_items
()[source]¶ Returns the list of item names that the data can produce.
Returns: A list of strings.
-
dataset
¶ The dataset.
-
dataset_size
()[source]¶ Returns the number of data instances in the dataset.
Note that this is the total data count in the raw files, before any filtering and truncation.
-
source_embedding_init_value
¶ The Tensor containing the embedding value of source data loaded from file. None if embedding is not specified.
-
target_embedding_init_value
¶ The Tensor containing the embedding value of target data loaded from file. None if embedding is not specified.
-
embedding_init_value
()[source]¶ A pair of Tensors containing the embedding values of source and target data loaded from file.
-
source_text_name
¶ The name of the source text tensor, “source_text” by default.
-
source_length_name
¶ The name of the source length tensor, “source_length” by default.
-
source_text_id_name
¶ The name of the source text index tensor, “source_text_ids” by default.
-
source_utterance_cnt_name
¶ The name of the source text utterance count tensor, “source_utterance_cnt” by default.
-
target_text_name
¶ The name of the target text tensor, “target_text” by default.
-
target_length_name
¶ The name of the target length tensor, “target_length” by default.
-
target_text_id_name
¶ The name of the target text index tensor, “target_text_ids” by default.
-
target_utterance_cnt_name
¶ The name of the target text utterance count tensor, “target_utterance_cnt” by default.
-
text_name
¶ The name of text tensor, “text” by default.
-
length_name
¶ The name of length tensor, “length” by default.
-
text_id_name
¶ The name of text index tensor, “text_ids” by default.
-
utterance_cnt_name
¶ The name of the text utterance count tensor, “utterance_cnt” by default.
-
batch_size
¶ The batch size.
-
name
¶ Name of the module.
-
num_epochs
¶ Number of epochs.
ScalarData¶
-
class
texar.tf.data.
ScalarData
(hparams)[source]¶ Scalar data where each line of the files is a scalar (int or float), e.g., a data label.
Parameters: hparams (dict) – Hyperparameters. See default_hparams() for the defaults.
The processor reads and processes raw data and results in a TF dataset whose element is a python dict including one field. The field name is specified in hparams["dataset"]["data_name"]. If not specified, the default name is “data”. The field name can be accessed through data_name.
This field is a Tensor of shape [batch_size] containing a batch of scalars, of either int or float type as specified in hparams.
Example
hparams = {
    'dataset': {'files': 'data.txt', 'data_name': 'label'},
    'batch_size': 2
}
data = ScalarData(hparams)
iterator = DataIterator(data)
batch = iterator.get_next()

iterator.switch_to_dataset(sess)  # initializes the dataset
batch_ = sess.run(batch)
# batch_ == {
#     'label': [2, 9]
# }
-
static
default_hparams
()[source]¶ Returns a dictionary of default hyperparameters.
{
    # (1) Hyperparams specific to scalar dataset
    "dataset": {
        "files": [],
        "compression_type": None,
        "data_type": "int",
        "other_transformations": [],
        "data_name": None,
    },
    # (2) General hyperparams
    "num_epochs": 1,
    "batch_size": 64,
    "allow_smaller_final_batch": True,
    "shuffle": True,
    "shuffle_buffer_size": None,
    "shard_and_shuffle": False,
    "num_parallel_calls": 1,
    "prefetch_buffer_size": 0,
    "max_dataset_size": -1,
    "seed": None,
    "name": "scalar_data",
}
Here:
1. For the hyperparameters in the "dataset" field:
- “files”: str or list
A (list of) file path(s).
Each line contains a single scalar number.
- “compression_type”: str, optional
One of “” (no compression), “ZLIB”, or “GZIP”.
- “data_type”: str
The scalar type. Currently supports “int” and “float”.
- “other_transformations”: list
A list of transformation functions or function names/paths to further transform each single data instance.
(More documentation to be added.)
- “data_name”: str
Name of the dataset.
2. For the general hyperparameters, see texar.tf.data.DataBase.default_hparams() for details.
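To make the effect of "data_type", "data_name", "batch_size" and "allow_smaller_final_batch" concrete, here is a minimal plain-Python sketch of what a scalar data pipeline produces. It is not the library's implementation: read_scalar_batches is a hypothetical helper, it works on in-memory lines instead of files, and it ignores shuffling (which is on by default in the real hyperparameters, so actual batch order would differ).

```python
import io


def read_scalar_batches(lines, data_type="int", data_name="data",
                        batch_size=64, allow_smaller_final_batch=True):
    """Parse one scalar per line and group the values into dict batches."""
    cast = {"int": int, "float": float}[data_type]
    values = [cast(line.strip()) for line in lines if line.strip()]
    batches = []
    for start in range(0, len(values), batch_size):
        chunk = values[start:start + batch_size]
        # Drop an incomplete final batch when smaller batches are not allowed.
        if len(chunk) < batch_size and not allow_smaller_final_batch:
            break
        batches.append({data_name: chunk})
    return batches


fake_file = io.StringIO("2\n9\n4\n")  # stands in for 'data.txt'
batches = read_scalar_batches(
    fake_file, data_type="int", data_name="label", batch_size=2)
print(batches[0])
```

Each batch is a dict with a single field named after "data_name", which matches the shape of the batch shown in the ScalarData example.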
-
list_items
()[source]¶ Returns the list of item names that the data can produce.
Returns: A list of strings.
-
dataset
¶ The dataset.
-
dataset_size
()[source]¶ Returns the number of data instances in the dataset.
Note that this is the total data count in the raw files, before any filtering and truncation.
-
batch_size
¶ The batch size.
-
name
¶ Name of the module.
-
num_epochs
¶ Number of epochs.
TFRecordData¶
-
class
texar.tf.data.
TFRecordData
(hparams)[source]¶ TFRecord data which loads and processes TFRecord files.
This module can be used to process image data, features, etc.
Parameters: hparams (dict) – Hyperparameters. See default_hparams() for the defaults.
The module reads and restores data from TFRecord files and results in a TF Dataset whose element is a Python dict that maps feature names to feature values. The feature names and dtypes are specified in hparams["dataset"]["feature_original_types"].
The module also provides simple processing options for image data, such as image resize.
Example
# Read data from TFRecord file
hparams = {
    'dataset': {
        'files': 'image1.tfrecord',
        'feature_original_types': {
            'height': ['tf.int64', 'FixedLenFeature'],
            'width': ['tf.int64', 'FixedLenFeature'],
            'label': ['tf.int64', 'FixedLenFeature'],
            'image_raw': ['tf.string', 'FixedLenFeature']
        }
    },
    'batch_size': 1
}
data = TFRecordData(hparams)
iterator = DataIterator(data)
batch = iterator.get_next()

iterator.switch_to_dataset(sess)  # initializes the dataset
batch_ = sess.run(batch)
# batch_ == {
#     'data': {
#         'height': [239],
#         'width': [149],
#         'label': [1],
#
#         # 'image_raw' is a list of image data bytes in this
#         # example.
#         'image_raw': [...],
#     }
# }
# Read image data from TFRecord file and do resizing
hparams = {
    'dataset': {
        'files': 'image2.tfrecord',
        'feature_original_types': {
            'label': ['tf.int64', 'FixedLenFeature'],
            'image_raw': ['tf.string', 'FixedLenFeature']
        },
        'image_options': {
            'image_feature_name': 'image_raw',
            'resize_height': 512,
            'resize_width': 512,
        }
    },
    'batch_size': 1
}
data = TFRecordData(hparams)
iterator = DataIterator(data)
batch = iterator.get_next()

iterator.switch_to_dataset(sess)  # initializes the dataset
batch_ = sess.run(batch)
# batch_ == {
#     'data': {
#         'label': [1],
#
#         # "image_raw" is a list of a "numpy.ndarray" image
#         # in this example. Each image has a width of 512 and
#         # height of 512.
#         'image_raw': [...]
#     }
# }
-
static
default_hparams
()[source]¶ Returns a dictionary of default hyperparameters.
{
    # (1) Hyperparams specific to TFRecord dataset
    "dataset": {
        "files": [],
        "feature_original_types": {},
        "feature_convert_types": {},
        "image_options": {},
        "num_shards": None,
        "shard_id": None,
        "other_transformations": [],
        "data_name": None,
    },
    # (2) General hyperparams
    "num_epochs": 1,
    "batch_size": 64,
    "allow_smaller_final_batch": True,
    "shuffle": True,
    "shuffle_buffer_size": None,
    "shard_and_shuffle": False,
    "num_parallel_calls": 1,
    "prefetch_buffer_size": 0,
    "max_dataset_size": -1,
    "seed": None,
    "name": "tfrecord_data",
}
Here:
1. For the hyperparameters in the "dataset" field:
- “files”: str or list
A (list of) TFRecord file path(s).
- “feature_original_types”: dict
Maps each feature name (str) to its data type and length type, in the form feature_name: [dtype, feature_len_type, len], where:
- dtype is a TF Dtype such as tf.string and tf.int32, or its string name such as ‘tf.string’ and ‘tf.int32’. The feature will be read from the files and parsed into this dtype.
- feature_len_type is of type str, and can be either ‘FixedLenFeature’ or ‘VarLenFeature’ for fixed length features and non-fixed length features, respectively.
- len is an int and is optional. It is the length for ‘FixedLenFeature’. Ignored if ‘VarLenFeature’ is used.
Example:
feature_original_types = { "input_ids": ["tf.int64", "FixedLenFeature", 128], "label_ids": ["tf.int64", "FixedLenFeature"], "name_lists": ["tf.string", "VarLenFeature"], }
- “feature_convert_types”: dict, optional
Specifies dtype conversion after reading the data files. This dict maps feature names to desired data dtypes. For example, you can first read a feature into dtype tf.float64 by specifying it in “feature_original_types” above, and then convert the feature to dtype “tf.int64” by specifying it here. Features not specified here are not converted.
- dtype is a TF Dtype such as tf.string and tf.int32, or its string name such as ‘tf.string’ and ‘tf.int32’.
Note that this conversion happens after all the data are restored, so feature_original_types must be set first.
Example:
feature_convert_types = { "input_ids": "tf.int32", "label_ids": "tf.int32", }
- “image_options”: dict, optional
Specifies the image feature name and performs image resizing. Includes three fields:
- “image_feature_name”: A str, the name of the feature which contains the image data. If set, the image data will be restored in format numpy.ndarray.
- “resize_height”: An int, the height of the image after resizing.
- “resize_width”: An int, the width of the image after resizing.
If either resize_height or resize_width is not set, image data will be restored with original shape.
- “num_shards”: int, optional
The number of data shards in distributed mode. Usually set to the number of processes in distributed computing. Used in combination with "shard_id".
- “shard_id”: int, optional
Sets the unique id to identify a shard. The module will process only the corresponding shard of the whole data. Used in combination with "num_shards".
E.g., in a case of distributed computing on 2 GPUs, the hparams of the data module for the two processes can be as below, respectively.
For gpu 0:
dataset: { ... "num_shards": 2, "shard_id": 0 }
For gpu 1:
dataset: { ... "num_shards": 2, "shard_id": 1 }
Also refer to examples/bert for a use case.
- “other_transformations”: list
A list of transformation functions or function names/paths to further transform each single data instance.
- “data_name”: str
Name of the dataset.
2. For the general hyperparameters, see texar.tf.data.DataBase.default_hparams() for details.
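The interplay of "num_shards" and "shard_id" can be illustrated with a small plain-Python sketch. It assumes the same semantics as tf.data.Dataset.shard (a worker keeps the records whose index modulo num_shards equals its shard_id); whether the module shards in exactly this way is an assumption, and shard_records is a hypothetical helper, not part of the library.

```python
def shard_records(records, num_shards, shard_id):
    """Keep only the records that belong to the given shard."""
    if not 0 <= shard_id < num_shards:
        raise ValueError("shard_id must be in [0, num_shards)")
    return [r for i, r in enumerate(records) if i % num_shards == shard_id]


records = ["rec0", "rec1", "rec2", "rec3", "rec4"]

# Two processes (e.g. one per GPU), each with its own shard_id.
gpu0 = shard_records(records, num_shards=2, shard_id=0)
gpu1 = shard_records(records, num_shards=2, shard_id=1)
print(gpu0)
print(gpu1)
```

Under this assumption the shards are disjoint and together cover the whole dataset, which is why every process must use the same "num_shards" and a distinct "shard_id".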
-
list_items
()[source]¶ Returns the list of item names that the data can produce.
Returns: A list of strings.
-
feature_names
¶ A list of feature names.
-
batch_size
¶ The batch size.
-
name
¶ Name of the module.
-
num_epochs
¶ Number of epochs.
MultiAlignedData¶
-
class
texar.tf.data.
MultiAlignedData
(hparams)[source]¶ Data consisting of multiple aligned parts.
Parameters: hparams (dict) – Hyperparameters. See default_hparams() for the defaults.
The processor can read any number of parallel fields as specified in the “datasets” list of hparams, and result in a TF Dataset whose element is a python dict containing data fields from each of the specified datasets. Fields from a text dataset or TFRecord dataset have names prefixed by its “data_name”. Fields from a scalar dataset are specified by its “data_name”.
Example
hparams = {
    'datasets': [
        {'files': 'a.txt', 'vocab_file': 'v.a', 'data_name': 'x'},
        {'files': 'b.txt', 'vocab_file': 'v.b', 'data_name': 'y'},
        {'files': 'c.txt', 'data_type': 'int', 'data_name': 'z'}
    ],
    'batch_size': 1
}
data = MultiAlignedData(hparams)
iterator = DataIterator(data)
batch = iterator.get_next()

iterator.switch_to_dataset(sess)  # initializes the dataset
batch_ = sess.run(batch)
# batch_ == {
#     'x_text': [['<BOS>', 'x', 'sequence', '<EOS>']],
#     'x_text_ids': [[1, 5, 10, 2]],
#     'x_length': [4],
#     'y_text': [['<BOS>', 'y', 'sequence', '1', '<EOS>']],
#     'y_text_ids': [[1, 6, 10, 20, 2]],
#     'y_length': [5],
#     'z': [1000],
# }

...

hparams = {
    'datasets': [
        {'files': 'd.txt', 'vocab_file': 'v.d', 'data_name': 'm'},
        {
            'files': 'd.tfrecord',
            'data_type': 'tf_record',
            'feature_original_types': {
                'image': ['tf.string', 'FixedLenFeature']
            },
            'image_options': {
                'image_feature_name': 'image',
                'resize_height': 512,
                'resize_width': 512,
            },
            'data_name': 't',
        }
    ],
    'batch_size': 1
}
data = MultiAlignedData(hparams)
iterator = DataIterator(data)
batch = iterator.get_next()

iterator.switch_to_dataset(sess)  # initializes the dataset
batch_ = sess.run(batch)
# batch_ == {
#     'm_text': [['<BOS>', 'NewYork', 'City', 'Map', '<EOS>']],
#     'm_text_ids': [[1, 100, 80, 65, 2]],
#     'm_length': [5],
#
#     # "t_image" is a list of a "numpy.ndarray" image
#     # in this example. Its width equals 512 and
#     # its height equals 512.
#     't_image': [...]
# }
-
static
default_hparams
()[source]¶ Returns a dictionary of default hyperparameters.
{
    # (1) Hyperparams specific to text dataset
    "datasets": [],
    # (2) General hyperparams
    "num_epochs": 1,
    "batch_size": 64,
    "allow_smaller_final_batch": True,
    "shuffle": True,
    "shuffle_buffer_size": None,
    "shard_and_shuffle": False,
    "num_parallel_calls": 1,
    "prefetch_buffer_size": 0,
    "max_dataset_size": -1,
    "seed": None,
    "name": "multi_aligned_data",
}
Here:
1. “datasets” is a list of dict each of which specifies a dataset which can be text, scalar or TFRecord. The "data_name" field of each dataset is used as the name prefix of the data fields from the respective dataset. The "data_name" fields of different datasets must be distinct.
For scalar dataset, the allowed hyperparameters and default values are the same as the “dataset” field of texar.tf.data.ScalarData.default_hparams(). Note that "data_type" must be explicitly specified (either “int” or “float”).
For TFRecord dataset, the allowed hyperparameters and default values are the same as the “dataset” field of texar.tf.data.TFRecordData.default_hparams(). Note that "data_type" must be explicitly specified (“tf_record”).
For text dataset, the allowed hyperparameters and default values are the same as the “dataset” field of texar.tf.data.MonoTextData.default_hparams(), with several extra hyperparameters:
- “data_type”: str
The type of the dataset, one of {“text”, “int”, “float”, “tf_record”}. If set to “int” or “float”, the dataset is considered to be a scalar dataset. If set to “tf_record”, the dataset is considered to be a TFRecord dataset. If not specified or set to “text”, the dataset is considered to be a text dataset.
- “vocab_share_with”: int, optional
Share the vocabulary of a preceding text dataset with the specified index in the list (starting from 0). The specified dataset must be a text dataset, and must have an index smaller than the current dataset.
If specified, the vocab file of current dataset is ignored. Default is None which disables the vocab sharing.
- “embedding_init_share_with”: int, optional
Share the embedding initial value of a preceding text dataset with the specified index in the list (starting from 0). The specified dataset must be a text dataset, and must have an index smaller than the current dataset.
If specified, the "embedding_init" field of the current dataset is ignored. Default is None which disables the initial value sharing.
- “processing_share_with”: int, optional
Share the processing configurations of a preceding text dataset with the specified index in the list (starting from 0). The specified dataset must be a text dataset, and must have an index smaller than the current dataset.
If specified, relevant fields of the current dataset are ignored, including “delimiter”, “bos_token”, “eos_token”, and “other_transformations”. Default is None which disables the processing sharing.
2. For the general hyperparameters, see texar.tf.data.DataBase.default_hparams() for details.
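The constraints on "vocab_share_with" stated above (the referenced dataset must be a text dataset with a smaller index) can be checked with a short plain-Python sketch. The helpers is_text_dataset and resolve_vocab_sharing are hypothetical and only mirror the rules as documented here; how the library itself validates or resolves sharing is not shown on this page.

```python
def is_text_dataset(dataset_hparams):
    """A dataset is text when "data_type" is missing or set to "text"."""
    return dataset_hparams.get("data_type", "text") == "text"


def resolve_vocab_sharing(datasets):
    """For each dataset, return the index of the dataset whose vocab it uses.

    Non-text datasets have no vocabulary and map to None.
    """
    owners = []
    for i, ds in enumerate(datasets):
        if not is_text_dataset(ds):
            owners.append(None)
            continue
        j = ds.get("vocab_share_with")
        if j is None:
            owners.append(i)  # uses its own vocab file
            continue
        if not 0 <= j < i:
            raise ValueError("vocab_share_with must point to a preceding dataset")
        if not is_text_dataset(datasets[j]):
            raise ValueError("vocab_share_with must point to a text dataset")
        owners.append(owners[j])  # follow chains of sharing
    return owners


datasets = [
    {"files": "a.txt", "vocab_file": "v.a", "data_name": "x"},
    {"files": "b.txt", "vocab_share_with": 0, "data_name": "y"},
    {"files": "c.txt", "data_type": "int", "data_name": "z"},
]
owners = resolve_vocab_sharing(datasets)
print(owners)
```

The same checks would apply, under the same assumptions, to "embedding_init_share_with" and "processing_share_with".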
-
list_items
()[source]¶ Returns the list of item names that the data can produce.
Returns: A list of strings.
-
dataset
¶ The dataset.
-
dataset_size
()[source]¶ Returns the number of data instances in the dataset.
Note that this is the total data count in the raw files, before any filtering and truncation.
-
vocab
(name_or_id)[source]¶ Returns the Vocab of text dataset by its name or id. None if the dataset is not of text type.
Parameters: name_or_id (str or int) – Data name or the index of text dataset.
-
embedding_init_value
(name_or_id)[source]¶ Returns the Tensor of embedding init value of the dataset by its name or id. None if the dataset is not of text type.
-
text_name
(name_or_id)[source]¶ The name of text tensor of text dataset by its name or id. If the dataset is not of text type, returns None.
-
length_name
(name_or_id)[source]¶ The name of length tensor of text dataset by its name or id. If the dataset is not of text type, returns None.
-
text_id_name
(name_or_id)[source]¶ The name of text index tensor of text dataset by its name or id. If the dataset is not of text type, returns None.
-
utterance_cnt_name
(name_or_id)[source]¶ The name of utterance count tensor of text dataset by its name or id. If the dataset is not variable utterance text data, returns None.
-
data_name
¶ The name of the data tensor of scalar dataset by its name or id. If the dataset is not a scalar dataset, returns None.
-
batch_size
¶ The batch size.
-
name
¶ Name of the module.
-
num_epochs
¶ Number of epochs.
Data Iterators¶
DataIteratorBase¶
-
class
texar.tf.data.
DataIteratorBase
(datasets)[source]¶ Base class for all data iterator classes to inherit. A data iterator is a wrapper of tf.data.Iterator, and can switch between and iterate through multiple datasets.
Parameters: datasets – Datasets to iterate through. This can be:
- A single instance of tf.data.Dataset or instance of subclass of DataBase.
- A dict that maps dataset name to instance of tf.data.Dataset or subclass of DataBase.
- A list of instances of subclasses of texar.tf.data.DataBase. The name of instances (texar.tf.data.DataBase.name) must be unique.
-
num_datasets
¶ Number of datasets.
-
dataset_names
¶ A list of dataset names.
DataIterator¶
-
class
texar.tf.data.
DataIterator
(datasets)[source]¶ Data iterator that switches and iterates through multiple datasets.
This is a wrapper of the TF reinitializable iterator.
Parameters: datasets – Datasets to iterate through. This can be:
- A single instance of tf.data.Dataset or instance of subclass of DataBase.
- A dict that maps dataset name to instance of tf.data.Dataset or subclass of DataBase.
- A list of instances of subclasses of texar.tf.data.DataBase. The name of instances (texar.tf.data.DataBase.name) must be unique.
Example
train_data = MonoTextData(hparams_train)
test_data = MonoTextData(hparams_test)
iterator = DataIterator({'train': train_data, 'test': test_data})
batch = iterator.get_next()

sess = tf.Session()

for _ in range(200):  # Run 200 epochs of train/test
    # Starts iterating through training data from the beginning
    iterator.switch_to_dataset(sess, 'train')
    while True:
        try:
            train_batch_ = sess.run(batch)
        except tf.errors.OutOfRangeError:
            print("End of training epoch.")
            break  # leave the loop once the epoch is exhausted
    # Starts iterating through test data from the beginning
    iterator.switch_to_dataset(sess, 'test')
    while True:
        try:
            test_batch_ = sess.run(batch)
        except tf.errors.OutOfRangeError:
            print("End of test epoch.")
            break
TrainTestDataIterator¶
-
class
texar.tf.data.
TrainTestDataIterator
(train=None, val=None, test=None)[source]¶ Data iterator that alternates between train, val, and test datasets.
train, val, and test can each be an instance of either tf.data.Dataset or a subclass of DataBase. At least one of them must be provided.
This is a wrapper of DataIterator.
Parameters: - train (optional) – Training data.
- val (optional) – Validation data.
- test (optional) – Test data.
Example
train_data = MonoTextData(hparams_train)
val_data = MonoTextData(hparams_val)
iterator = TrainTestDataIterator(train=train_data, val=val_data)
batch = iterator.get_next()

sess = tf.Session()

for _ in range(200):  # Run 200 epochs of train/val
    # Starts iterating through training data from the beginning
    iterator.switch_to_train_data(sess)
    while True:
        try:
            train_batch_ = sess.run(batch)
        except tf.errors.OutOfRangeError:
            print("End of training epoch.")
            break  # leave the loop once the epoch is exhausted
    # Starts iterating through val data from the beginning
    iterator.switch_to_val_data(sess)
    while True:
        try:
            val_batch_ = sess.run(batch)
        except tf.errors.OutOfRangeError:
            print("End of val epoch.")
            break
-
switch_to_train_data
(sess)[source]¶ Starts to iterate through training data (from the beginning).
Parameters: sess – The current tf session.
FeedableDataIterator¶
-
class
texar.tf.data.
FeedableDataIterator
(datasets)[source]¶ Data iterator that iterates through multiple datasets and switches between datasets.
The iterator can switch to a dataset and resume from where we left off last time we visited the dataset. This is a wrapper of TF feedable iterator.
Parameters: datasets – Datasets to iterate through. This can be:
- A single instance of tf.data.Dataset or instance of subclass of DataBase.
- A dict that maps dataset name to instance of tf.data.Dataset or subclass of DataBase.
- A list of instances of subclasses of texar.tf.data.DataBase. The name of instances (texar.tf.data.DataBase.name) must be unique.
Example
train_data = MonoTextData(hparams={'num_epochs': 200, ...})
test_data = MonoTextData(hparams_test)
iterator = FeedableDataIterator({'train': train_data, 'test': test_data})
batch = iterator.get_next()

sess = tf.Session()

def _eval_epoch():  # Iterate through test data for one epoch
    # Initialize and start from beginning of test data
    iterator.initialize_dataset(sess, 'test')
    while True:
        try:
            feed_dict = {  # Read from test data
                iterator.handle: iterator.get_handle(sess, 'test')
            }
            test_batch_ = sess.run(batch, feed_dict=feed_dict)
        except tf.errors.OutOfRangeError:
            print("End of test epoch.")
            break

# Initialize and start from beginning of training data
iterator.initialize_dataset(sess, 'train')
step = 0
while True:
    try:
        feed_dict = {  # Read from training data
            iterator.handle: iterator.get_handle(sess, 'train')
        }
        train_batch_ = sess.run(batch, feed_dict=feed_dict)
        step += 1
        if step % 200 == 0:  # Evaluate periodically
            _eval_epoch()
    except tf.errors.OutOfRangeError:
        print("End of training.")
        break
-
get_handle
(sess, dataset_name=None)[source]¶ Returns a dataset handle used to feed the handle placeholder to fetch data from the dataset.
Parameters: - sess – The current tf session.
- dataset_name (optional) – Name of the dataset. If not provided, there must be only one Dataset.
Returns: A string handle to be fed to the handle placeholder.
Example
next_element = iterator.get_next()
train_handle = iterator.get_handle(sess, 'train')
# Gets the next training element
ne_ = sess.run(next_element, feed_dict={iterator.handle: train_handle})
-
restart_dataset
(sess, dataset_name=None)[source]¶ Restarts datasets so that next iteration will fetch data from the beginning of the datasets.
Parameters: - sess – The current tf session.
- dataset_name (optional) – A dataset name or a list of dataset names that specifies which dataset(s) to restart. If None, all datasets are restarted.
-
initialize_dataset
(sess, dataset_name=None)[source]¶ Initializes datasets. A dataset must be initialized before being used.
Parameters: - sess – The current tf session.
- dataset_name (optional) – A dataset name or a list of dataset names that specifies which dataset(s) to initialize. If None, all datasets are initialized.
-
handle
¶ The handle placeholder that can be fed with a dataset handle to fetch data from the dataset.
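The difference between initializing (or restarting) a dataset and merely switching to it through a handle can be illustrated with ordinary Python iterators. This is only an analogy, not TensorFlow or Texar code: the class ResumableIterators and its methods are hypothetical, and StopIteration plays the role that tf.errors.OutOfRangeError plays in the examples above.

```python
class ResumableIterators:
    """Keeps one independent iterator per named dataset."""

    def __init__(self, datasets):
        self._datasets = dict(datasets)
        self._iters = {}

    def initialize_dataset(self, name):
        # (Re)start the named dataset from its beginning.
        self._iters[name] = iter(self._datasets[name])

    def get_next(self, name):
        # Fetch from the named dataset, resuming where it left off.
        return next(self._iters[name])


it = ResumableIterators({"train": [1, 2, 3, 4], "test": ["a", "b"]})
it.initialize_dataset("train")
seen = [it.get_next("train"), it.get_next("train")]

it.initialize_dataset("test")
seen.append(it.get_next("test"))   # switch to test data
seen.append(it.get_next("train"))  # back to train: resumes, does not restart

it.initialize_dataset("train")     # explicit restart
seen.append(it.get_next("train"))
print(seen)
```

In the analogy, passing a dataset name to get_next corresponds to feeding that dataset's handle, while initialize_dataset corresponds to initialize_dataset or restart_dataset.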
TrainTestFeedableDataIterator¶
-
class
texar.tf.data.
TrainTestFeedableDataIterator
(train=None, val=None, test=None)[source]¶ Feedable data iterator that alternates between train, val, and test datasets.
This is a wrapper of FeedableDataIterator. The iterator can switch to a dataset and resume from where it left off the last time that dataset was visited.
train, val, and test can each be an instance of either tf.data.Dataset or a subclass of DataBase. At least one of them must be provided.
Parameters: - train (optional) – Training data.
- val (optional) – Validation data.
- test (optional) – Test data.
Example
train_data = MonoTextData(hparams={'num_epochs': 200, ...})
test_data = MonoTextData(hparams_test)
iterator = TrainTestFeedableDataIterator(train=train_data, test=test_data)
batch = iterator.get_next()

sess = tf.Session()

def _eval_epoch():  # Iterate through test data for one epoch
    # Initialize and start from beginning of test data
    iterator.initialize_test_dataset(sess)
    while True:
        try:
            feed_dict = {  # Read from test data
                iterator.handle: iterator.get_test_handle(sess)
            }
            test_batch_ = sess.run(batch, feed_dict=feed_dict)
        except tf.errors.OutOfRangeError:
            print("End of test epoch.")
            break

# Initialize and start from beginning of training data
iterator.initialize_train_dataset(sess)
step = 0
while True:
    try:
        feed_dict = {  # Read from training data
            iterator.handle: iterator.get_train_handle(sess)
        }
        train_batch_ = sess.run(batch, feed_dict=feed_dict)
        step += 1
        if step % 200 == 0:  # Evaluate periodically
            _eval_epoch()
    except tf.errors.OutOfRangeError:
        print("End of training.")
        break
-
get_train_handle
(sess)[source]¶ Returns the handle of the training dataset. The handle can be used to feed the handle placeholder to fetch training data.
Parameters: sess – The current tf session.
Returns: A string handle to be fed to the handle placeholder.
Example
next_element = iterator.get_next()
train_handle = iterator.get_train_handle(sess)
# Gets the next training element
ne_ = sess.run(next_element, feed_dict={iterator.handle: train_handle})
-
get_val_handle
(sess)[source]¶ Returns the handle of the validation dataset. The handle can be used to feed the handle placeholder to fetch validation data.
Parameters: sess – The current tf session.
Returns: A string handle to be fed to the handle placeholder.
-
get_test_handle
(sess)[source]¶ Returns the handle of the test dataset. The handle can be used to feed the handle placeholder to fetch test data.
Parameters: sess – The current tf session.
Returns: A string handle to be fed to the handle placeholder.
-
restart_train_dataset
(sess)[source]¶ Restarts the training dataset so that next iteration will fetch data from the beginning of the training dataset.
Parameters: sess – The current tf session.
Data Utils¶
random_shard_dataset¶
maybe_tuple¶
make_partial¶
maybe_download¶
read_words¶
make_vocab¶
-
texar.tf.data.
make_vocab
(filenames, max_vocab_size=-1, newline_token=None, return_type='list', return_count=False)[source]¶ Builds vocab of the files.
Parameters: - filenames (str) – A (list of) files.
- max_vocab_size (int) – Maximum size of the vocabulary. Low-frequency words that exceed the limit will be discarded. Set to -1 (default) if no truncation is wanted.
- newline_token (str, optional) – The token to replace the original newline token “\n”. For example, newline_token=tx.data.SpecialTokens.EOS. If None, no replacement is performed.
- return_type (str) – Either “list” or “dict”. If “list” (default), this function returns a list of words sorted by frequency. If “dict”, this function returns a dict mapping words to their index sorted by frequency.
- return_count (bool) – Whether to return word counts. If True and return_type is “dict”, then a count dict is returned, which is a mapping from words to their frequency.
Returns: - If return_count is False, returns a list or dict containing the vocabulary words.
- If return_count is True, returns a pair of list or dict (a, b), where a is a list or dict containing the vocabulary words, and b is a list or dict containing the word counts.
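The behaviour described for make_vocab can be approximated in plain Python as follows. This is a sketch under stated assumptions, not the library's implementation: make_vocab_sketch is a hypothetical name, it takes in-memory lines rather than file names, it splits on whitespace, it appends newline_token once per line, and it breaks frequency ties alphabetically (the real tie-breaking rule is not documented here).

```python
from collections import Counter


def make_vocab_sketch(lines, max_vocab_size=-1, newline_token=None,
                      return_type="list", return_count=False):
    """Build a frequency-sorted vocabulary from an iterable of text lines."""
    counter = Counter()
    for line in lines:
        words = line.split()
        if newline_token is not None:
            words.append(newline_token)  # stands in for the trailing "\n"
        counter.update(words)

    # Most frequent first; ties are broken alphabetically (an assumption).
    items = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    if max_vocab_size >= 0:
        items = items[:max_vocab_size]

    words = [w for w, _ in items]
    if return_type == "list":
        counts = [c for _, c in items]
        return (words, counts) if return_count else words
    if return_type == "dict":
        word_to_id = {w: i for i, w in enumerate(words)}
        return (word_to_id, dict(items)) if return_count else word_to_id
    raise ValueError("return_type must be 'list' or 'dict'")


lines = ["a b a", "b a c"]
vocab_list = make_vocab_sketch(lines)
vocab_dict, vocab_counts = make_vocab_sketch(
    lines, max_vocab_size=2, return_type="dict", return_count=True)
print(vocab_list)
print(vocab_dict, vocab_counts)
```

With max_vocab_size=2 the least frequent word is discarded, which matches the truncation rule described in the parameter list.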
count_file_lines¶
make_chained_transformation¶
-
texar.tf.data.
make_chained_transformation
(tran_fns, *args, **kwargs)[source]¶ Returns a dataset transformation function that applies a list of transformations sequentially.
Parameters: - tran_fns (list) – A list of dataset transformation functions.
- *args – Extra arguments for each of the transformation functions.
- **kwargs – Extra keyword arguments for each of the transformation functions.
Returns: A transformation function to be used in tf.data.Dataset.map.
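The idea of chaining transformations can be sketched in a few lines of plain Python. The function chain_transformations below is a hypothetical stand-in that follows the description above (apply each function in order, passing the same extra arguments to each); it is not the library's code, and the two example transformations are made up for illustration.

```python
def chain_transformations(tran_fns, *args, **kwargs):
    """Return a function that applies `tran_fns` one after another."""
    def _chained(data):
        for fn in tran_fns:
            data = fn(data, *args, **kwargs)
        return data
    return _chained


# Two toy transformations on a dict-shaped data instance.
def lowercase(example):
    return {**example, "text": example["text"].lower()}


def tokenize(example):
    return {**example, "text": example["text"].split()}


transform = chain_transformations([lowercase, tokenize])
result = transform({"text": "Hello World"})
print(result)
```

Order matters: each function receives the output of the previous one, so tokenize here sees the already lowercased text.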
make_combined_transformation¶
-
texar.tf.data.
make_combined_transformation
(tran_fns, name_prefix=None, *args, **kwargs)[source]¶ Returns a dataset transformation function that applies transformations to each component of the data.
The data to be transformed must be a tuple of the same length as tran_fns.
Parameters: - tran_fns (list) – A list of elements where each element is a transformation function or a list of transformation functions.
- name_prefix (list, optional) – Prefix to the field names of each component of the data, to prevent fields with the same name in different components from overriding each other. If not None, must be of the same length as tran_fns.
- *args – Extra arguments for each of the transformation functions.
- **kwargs – Extra keyword arguments for each of the transformation functions.
Returns: A transformation function to be used in tf.data.Dataset.map.
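A plain-Python sketch of the combining behaviour described above is given here. It rests on several assumptions that this page does not confirm: each data component is a dict, the results are merged into one dict, and a prefix is joined to a field name with an underscore (as in names like “x_text” elsewhere in this document). combine_transformations and add_length are hypothetical names.

```python
def combine_transformations(tran_fns, name_prefix=None, *args, **kwargs):
    """Return a function that transforms each component of a data tuple."""
    if name_prefix is not None and len(name_prefix) != len(tran_fns):
        raise ValueError("name_prefix must have the same length as tran_fns")

    def _combined(*data):
        if len(data) != len(tran_fns):
            raise ValueError("data must have the same length as tran_fns")
        merged = {}
        for i, (component, fns) in enumerate(zip(data, tran_fns)):
            # An element of tran_fns is one function or a list of functions.
            if not isinstance(fns, (list, tuple)):
                fns = [fns]
            for fn in fns:
                component = fn(component, *args, **kwargs)
            for key, value in component.items():
                name = key if name_prefix is None else f"{name_prefix[i]}_{key}"
                merged[name] = value
        return merged

    return _combined


def add_length(example):
    return {**example, "length": len(example["text"])}


transform = combine_transformations(
    [add_length, [add_length]], name_prefix=["source", "target"])
out = transform({"text": ["a", "b"]}, {"text": ["c"]})
print(out)
```

Without the prefixes, the "text" and "length" fields of the second component would overwrite those of the first, which is the collision that name_prefix is meant to prevent.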