API
Simple Imputer
DataWig SimpleImputer: Uses some simple default encoders and featurizers that usually yield decent imputation quality
-
class
datawig.simple_imputer.
SimpleImputer
(input_columns: List[str], output_column: str, output_path: str = '', num_hash_buckets: int = 32768, num_labels: int = 100, tokens: str = 'chars', numeric_latent_dim: int = 100, numeric_hidden_layers: int = 1, is_explainable: bool = False)[source]¶ SimpleImputer model based on n-grams of concatenated strings of input columns and concatenated numerical features, if provided.
Given a data frame with string columns, a model is trained to predict observed values in label column using values observed in other columns.
The model can then be used to impute missing values.
Parameters: - input_columns – list of input column names (as strings)
- output_column – output column name (as string)
- output_path – path to store model and metrics
- num_hash_buckets – number of hash buckets used for the n-gram hashing vectorizer, only used for non-numerical input columns, ignored otherwise
- num_labels – number of imputable values (labels) considered; only used for non-numerical input columns, ignored otherwise
- tokens – string, ‘chars’ or ‘words’ (default ‘chars’), determines tokenization strategy for n-grams, only used for non-numerical input columns, ignored otherwise
- numeric_latent_dim – int, number of latent dimensions for hidden layer of NumericalFeaturizers; only used for numerical input columns, ignored otherwise
- numeric_hidden_layers – number of numeric hidden layers
- is_explainable – if this is True, a stateful tf-idf encoder is used that allows explaining classes and single instances
Example usage:
import os
import pandas as pd
from datawig.simple_imputer import SimpleImputer

# datawig_test_path must point to the DataWig test directory
fn_train = os.path.join(datawig_test_path, "resources", "shoes", "train.csv.gz")
fn_test = os.path.join(datawig_test_path, "resources", "shoes", "test.csv.gz")

df_train = pd.read_csv(fn_train)
df_test = pd.read_csv(fn_test)

output_path = "imputer_model"

# set up imputer model
imputer = SimpleImputer(input_columns=['item_name', 'bullet_point'], output_column='brand', output_path=output_path)

# train the imputer model
imputer = imputer.fit(df_train)

# obtain imputations
imputations = imputer.predict(df_test)
-
check_data_types
(data_frame: pandas.core.frame.DataFrame) → None[source]¶ Checks whether a column contains string or numeric data
Parameters: data_frame – Returns:
-
check_for_label_shift
(target_data: pandas.core.frame.DataFrame) → dict[source]¶ Detect label shift in the validation data
Parameters: - test_data – data frame that contains labels
- target_data – unlabelled data for which predictions are to be generated
Returns: dictionary with labels as keys and weights as values.
-
static
complete
(data_frame: pandas.core.frame.DataFrame, precision_threshold: float = 0.0, inplace: bool = False, hpo: bool = False, verbose: int = 0, num_epochs: int = 100, iterations: int = 1, output_path: str = '.')[source]¶ Given a dataframe with missing values, this function detects all imputable columns, trains an imputation model on all other columns and imputes values for each missing value. Several imputation iterations can be run.
Imputable columns are either numeric columns or non-numeric categorical columns. To determine whether a column is categorical (as opposed to a plain text column), the following heuristic is used: a non-numeric categorical column should have at least 10 times as many rows as unique values.
If an imputation model does not reach the precision specified in the precision_threshold parameter for a given imputation value, that value is not imputed; thus, depending on the precision_threshold, the returned dataframe can still contain some missing values. For numeric columns, no filtering by accuracy is applied.
Parameters: - data_frame – original dataframe
- precision_threshold – precision threshold for categorical imputations (default: 0.0)
- inplace – whether or not to perform imputations inplace (default: False)
- hpo – whether or not to perform hyperparameter optimization (default: False)
- verbose – verbosity level, values > 0 log to stdout (default: 0)
- num_epochs – number of epochs for each imputation model training (default: 100)
- iterations – number of iterations for iterative imputation (default: 1)
- output_path – path to store model and metrics
Returns: dataframe with imputations
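The categorical-column heuristic described for complete can be sketched in plain Python. This is a minimal illustration and not DataWig's implementation: the function name looks_categorical and the min_rows_per_value parameter are hypothetical, and ignoring missing entries when counting rows is an assumption.

```python
from typing import Iterable, Optional


def looks_categorical(values: Iterable[Optional[str]], min_rows_per_value: int = 10) -> bool:
    """Return True if a non-numeric column has at least
    `min_rows_per_value` times as many rows as unique values.

    Assumption: missing entries (None) are ignored when counting.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        return False
    return len(observed) >= min_rows_per_value * len(set(observed))


# 30 observed rows spread over 3 distinct brands: treated as categorical
brands = ["acme", "zeta", "nova", None] * 10
# 30 rows that are all different: treated as plain text
free_text = [f"review number {i}" for i in range(30)]

print(looks_categorical(brands))
print(looks_categorical(free_text))
```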
-
explain
(label: str, k: int = 10, label_column: str = None) → dict[source]¶ Return dictionary with a list of tuples for each explainable input column. Each tuple denotes one of the top k features with highest correlation to the label.
Parameters: - label – label value to explain
- k – number of explanations for each input encoder to return. If not given, return top 10 explanations.
- label_column – name of label column to be explained (optional, defaults to the first available column.)
-
explain_instance
(instance: pandas.core.series.Series, k: int = 10, label_column: str = None, label: str = None) → dict[source]¶ Return dictionary with list of tuples for each explainable input column of the given instance. Each entry shows the most highly correlated features to the given label (or the top predicted label if not provided).
Parameters: - instance – row of data frame (or dictionary)
- k – number of explanations (ngrams) for text inputs
- label_column – name of label column to be explained (optional)
- label – explain why instance is classified as label, otherwise explain top-label per input
-
fit
(train_df: pandas.core.frame.DataFrame, test_df: pandas.core.frame.DataFrame = None, ctx: mxnet.context = [cpu(0)], learning_rate: float = 0.004, num_epochs: int = 100, patience: int = 5, test_split: float = 0.1, weight_decay: float = 0.0, batch_size: int = 16, final_fc_hidden_units: List[int] = None, calibrate: bool = True, class_weights: dict = None, instance_weights: list = None) → Any[source]¶ Trains and stores imputer model
Parameters: - train_df – training data as dataframe
- test_df – test data as dataframe; if not provided, a fraction test_split of the training data is used as test data
- ctx – list of mxnet contexts (defaults to [mx.cpu()] if no GPUs are available). A list of GPUs can also be passed, e.g. [mx.gpu(0), mx.gpu(2), mx.gpu(4)]
- learning_rate – learning rate for stochastic gradient descent (default 4e-3)
- num_epochs – maximal number of training epochs (default 100)
- patience – used for early stopping; after [patience] epochs with no improvement, training is stopped. (default 5)
- test_split – if no test_df is provided this is the ratio of test data to be held separate for determining model convergence
- weight_decay – regularizer (default 0)
- batch_size – batch size (default 16)
- final_fc_hidden_units – list of dimensions for the FC layers after the final concatenation
- calibrate – controls automatic model calibration
- class_weights – dictionary with labels as keys and weights as values. Weighs each instance’s contribution to the likelihood based on the corresponding class.
- instance_weights – list of weights for each instance in train_df.
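The early-stopping rule controlled by patience can be illustrated with a short, library-free sketch. It assumes that "improvement" means a strictly lower loss on the test split; the metric DataWig actually monitors is not stated here, and the function name early_stopping_epoch is hypothetical.

```python
from typing import List


def early_stopping_epoch(test_losses: List[float], patience: int = 5) -> int:
    """Return the number of epochs run before training stops.

    Training stops once `patience` consecutive epochs bring no
    improvement over the best test loss seen so far; otherwise it
    runs through all provided epochs.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(test_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(test_losses)


losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.60, 0.63, 0.59]
print(early_stopping_epoch(losses, patience=3))  # 6
print(early_stopping_epoch(losses, patience=5))  # 8
```

A larger patience tolerates longer plateaus, which is why the second call runs through all eight epochs and still sees the late improvement.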
-
fit_hpo
(train_df: pandas.core.frame.DataFrame, test_df: pandas.core.frame.DataFrame = None, hps: dict = None, num_evals: int = 10, max_running_hours: float = 96.0, hpo_run_name: str = None, user_defined_scores: list = None, num_epochs: int = None, patience: int = None, test_split: float = 0.2, weight_decay: List[float] = None, batch_size: int = 16, num_hash_bucket_candidates: List[float] = [4096, 32768, 262144], tokens_candidates: List[str] = ['words', 'chars'], numeric_latent_dim_candidates: List[int] = None, numeric_hidden_layers_candidates: List[int] = None, final_fc_hidden_units: List[List[int]] = None, learning_rate_candidates: List[float] = None, normalize_numeric: bool = True, hpo_max_train_samples: int = None, ctx: mxnet.context = [cpu(0)]) → Any[source]¶ Fits an imputer model with hyperparameter optimization. The parameter ranges are searched randomly.
Grids are specified using the *_candidates arguments (old) or with more flexibility via the dictionary hps.
Parameters: - train_df – training data as dataframe
- test_df – test data as dataframe; if not provided, a ratio of test_split of the training data are used as test data
- hps – nested dictionary where hps[global][parameter_name] is list of parameters. Similarly, hps[column_name][parameter_name] is a list of parameter values for each input column. Further, hps[column_name][‘type’] is in [‘numeric’, ‘categorical’, ‘string’] and is inferred if not provided.
- num_evals – number of evaluations for random search
- max_running_hours – Time before the hpo run is terminated in hours.
- hpo_run_name – string to identify the current hpo run.
- user_defined_scores – list with entries (Callable, str), where callable is a function accepting **kwargs true, predicted, confidence. Allows custom scoring functions.
Below are parameters of the old implementation, kept to ensure backwards compatibility.
Parameters: - num_epochs – maximal number of training epochs (default 10)
- patience – used for early stopping; after [patience] epochs with no improvement, training is stopped. (default 3)
- test_split – if no test_df is provided this is the ratio of test data to be held separate for determining model convergence
- weight_decay – regularizer (default 0)
- batch_size – batch size (default 16)
- num_hash_bucket_candidates – candidates for gridsearch hyperparameter optimization (default [2**12, 2**15, 2**18])
- tokens_candidates – candidates for tokenization (default [‘words’, ‘chars’])
- numeric_latent_dim_candidates – candidates for latent dimensionality of numerical features (default [10, 50, 100])
- numeric_hidden_layers_candidates – candidates for number of hidden layers of numerical features (default [0, 1, 2])
- final_fc_hidden_units – list of lists w/ dimensions for FC layers after the final concatenation (NOTE: for HPO, this expects a list of lists)
- learning_rate_candidates – candidates for learning rate (default [1e-1, 1e-2, 1e-3])
- normalize_numeric – boolean indicating whether or not to normalize numeric values
- hpo_max_train_samples – training set size for hyperparameter optimization. Use is deprecated.
- ctx – list of mxnet contexts (defaults to [mx.cpu()] if no GPUs are available). A list of GPUs can also be passed, e.g. [mx.gpu(0), mx.gpu(2), mx.gpu(4)]. This parameter is deprecated.
Returns: pd.DataFrame with hyper-parameter configurations and results
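Random search over candidate lists, as fit_hpo is described to do, can be sketched as follows. This is an illustration of the general technique only: sample_configurations, the seed argument and the dictionary keys are hypothetical names, and DataWig's own sampling and scoring logic may differ.

```python
import random
from typing import Any, Dict, List


def sample_configurations(
    candidates: Dict[str, List[Any]], num_evals: int, seed: int = 0
) -> List[Dict[str, Any]]:
    """Draw `num_evals` random configurations, one value per parameter."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in candidates.items()}
        for _ in range(num_evals)
    ]


# Candidate values taken from the signature and parameter list above;
# the key names themselves are illustrative.
candidates = {
    "num_hash_buckets": [4096, 32768, 262144],
    "tokens": ["words", "chars"],
    "learning_rate": [1e-1, 1e-2, 1e-3],
}
for config in sample_configurations(candidates, num_evals=3):
    print(config)
```

Each sampled configuration would then be trained and scored, and the results collected into one row per configuration.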
-
static
load
(output_path: str) → Any[source]¶ Loads model from output path
Parameters: output_path – output_path field of trained SimpleImputer model Returns: SimpleImputer model
-
load_hpo_model
(hpo_name: int = None)[source]¶ Load model after hyperparameter optimisation has run. Overwrites local artifacts of self.imputer.
Parameters: hpo_name – index of the model to be loaded. By default, the model with the highest weighted precision or mean squared error is loaded. Returns: imputer object
-
load_metrics
() → Dict[str, Any][source]¶ Loads various metrics of the internal imputer model, returned as dictionary
Returns: Dict[str, Any]
-
predict
(data_frame: pandas.core.frame.DataFrame, precision_threshold: float = 0.0, imputation_suffix: str = '_imputed', score_suffix: str = '_imputed_proba', inplace: bool = False)[source]¶ Imputes the most likely value if it is above a certain precision threshold determined on the validation set. Precision is calculated as part of the datawig.evaluate_and_persist_metrics function.
Returns the original dataframe with imputations and respective likelihoods, as estimated by the imputation model, in additional columns. Names of imputation columns are that of the label suffixed with imputation_suffix; names of the respective likelihood columns are suffixed with score_suffix.
Parameters: - data_frame – data frame (pandas)
- precision_threshold – double between 0 and 1 indicating precision threshold
- imputation_suffix – suffix for imputation columns
- score_suffix – suffix for imputation score columns
- inplace – if True, adds a column with imputed values and a column with confidence scores to data_frame and returns the modified object; if False, creates a copy of data_frame with the additional columns and leaves the input unmodified.
Returns: the original dataframe with imputations and likelihoods in additional columns
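The column naming and the inplace behaviour of predict can be illustrated without DataWig. The sketch below uses a list of dictionaries as a stand-in for a pandas DataFrame; add_imputation_columns is a hypothetical helper, and the predictions are supplied by hand rather than computed by a model.

```python
import copy
from typing import Dict, List, Tuple

Row = Dict[str, object]


def add_imputation_columns(
    rows: List[Row],
    predictions: List[Tuple[str, float]],
    output_column: str,
    imputation_suffix: str = "_imputed",
    score_suffix: str = "_imputed_proba",
    inplace: bool = False,
) -> List[Row]:
    """Attach predicted value and score to each row under suffixed names."""
    target = rows if inplace else copy.deepcopy(rows)
    for row, (value, score) in zip(target, predictions):
        row[output_column + imputation_suffix] = value
        row[output_column + score_suffix] = score
    return target


rows = [{"item_name": "running shoe", "brand": None}]
result = add_imputation_columns(rows, [("acme", 0.87)], output_column="brand")
print(result[0]["brand_imputed"], result[0]["brand_imputed_proba"])
print("brand_imputed" in rows[0])  # False: the input was copied, not modified
```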
Imputer
DataWig Imputer: Imputes missing values in tables
-
class
datawig.imputer.
Imputer
(data_encoders: List[datawig.column_encoders.ColumnEncoder], data_featurizers: List[datawig.mxnet_input_symbols.Featurizer], label_encoders: List[datawig.column_encoders.ColumnEncoder], output_path='')[source]¶ Imputer model based on deep learning trained with MxNet
Given a data frame with string columns, a model is trained to predict observed values in one or more column using values observed in other columns. The model can then be used to impute missing values.
Parameters: - data_encoders – list of datawig.column_encoders.ColumnEncoder; output_column name must match field_name of data_featurizers
- data_featurizers – list of Featurizer
- label_encoders – list of CategoricalEncoder or NumericalEncoder
- output_path – path to store model and metrics
-
calibrate
(test_iter: datawig.iterators.ImputerIterDf)[source]¶ Checks model calibration and fits temperature scaling. If the fit improves model calibration, the temperature parameter is assigned as property to self and used for all further predictions in self.predict_mxnet_iter(). Saves calibration information to dictionary.
Parameters: test_iter – iterator, see ImputerIter in iterators.py Returns: None
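Temperature scaling divides the logits by a single scalar before the softmax. The sketch below shows the idea in plain Python; it picks the temperature from a small grid by minimising the negative log-likelihood on held-out data, which is an assumed stand-in for whatever optimiser DataWig uses. All function names here are hypothetical.

```python
import math
from typing import List, Sequence


def softmax_with_temperature(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Softmax over logits divided by a temperature.

    temperature > 1 flattens the distribution (less confident),
    temperature < 1 sharpens it; temperature == 1 is plain softmax.
    """
    scaled = [x / temperature for x in logits]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mean_nll(all_logits: Sequence[Sequence[float]], labels: Sequence[int], temperature: float) -> float:
    """Mean negative log-likelihood of the true labels."""
    losses = [
        -math.log(softmax_with_temperature(logits, temperature)[label])
        for logits, label in zip(all_logits, labels)
    ]
    return sum(losses) / len(losses)


def fit_temperature(all_logits, labels, grid: Sequence[float]) -> float:
    """Pick the grid value with the lowest held-out NLL (grid search is an assumption)."""
    return min(grid, key=lambda t: mean_nll(all_logits, labels, t))


# An overconfident model: it always predicts class 0 with high confidence,
# but is right only three times out of four.
held_out_logits = [[5.0, 0.0]] * 4
held_out_labels = [0, 0, 1, 0]
temperature = fit_temperature(held_out_logits, held_out_labels, [0.5, 1.0, 2.0, 4.0, 8.0])
print(temperature)
print(softmax_with_temperature([5.0, 0.0], temperature))
```

A fitted temperature above 1 indicates that the uncalibrated model was overconfident, so its probabilities are pulled toward the observed accuracy.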
-
explain
(label: str, k: int = 10, label_column: str = None) → dict[source]¶ Return dictionary with a list of tuples for each explainable input column. Each tuple denotes one of the top k features with highest correlation to the label.
Parameters: - label – label value to explain
- k – number of explanations for each input encoder to return. If not given, return top 10 explanations.
- label_column – name of label column to be explained (optional, defaults to the first available column.)
-
explain_instance
(instance: pandas.core.series.Series, k: int = 10, label_column: str = None, label: str = None) → dict[source]¶ Return dictionary with list of tuples for each explainable input column of the given instance. Each entry shows the most highly correlated features to the given label (or the top predicted label if not provided).
Parameters: - instance – row of data frame (or dictionary)
- k – number of explanations (ngrams) for text inputs
- label_column – name of label column to be explained (optional)
- label – explain why instance is classified as label, otherwise explain top-label per input
-
fit
(train_df: pandas.core.frame.DataFrame, test_df: pandas.core.frame.DataFrame = None, ctx: mxnet.context = [cpu(0)], learning_rate: float = 0.001, num_epochs: int = 100, patience: int = 3, test_split: float = 0.1, weight_decay: float = 0.0, batch_size: int = 16, final_fc_hidden_units: List[int] = None, calibrate: bool = True)[source]¶ Trains and stores imputer model
Parameters: - train_df – training data as dataframe
- test_df – test data as dataframe; if not provided, a fraction test_split of the training data is used as test data
- ctx – list of mxnet contexts (defaults to [mx.cpu()] if no GPUs are available). A list of GPUs can also be passed, e.g. [mx.gpu(0), mx.gpu(2), mx.gpu(4)]
- learning_rate – learning rate for stochastic gradient descent (default 1e-3)
- num_epochs – maximal number of training epochs (default 100)
- patience – used for early stopping; after [patience] epochs with no improvement, training is stopped. (default 3)
- test_split – if no test_df is provided this is the ratio of test data to be held separate for determining model convergence
- weight_decay – regularizer (default 0)
- batch_size – default 16
- final_fc_hidden_units – list of dimensions for the final fully connected layer.
- calibrate – whether to calibrate predictions
Returns: trained imputer model
-
static
load
(output_path: str) → Any[source]¶ Loads model from output path
Parameters: output_path – output_path field of trained Imputer model Returns: imputer model
-
predict
(data_frame: pandas.core.frame.DataFrame, precision_threshold: float = 0.0, imputation_suffix: str = '_imputed', score_suffix: str = '_imputed_proba', inplace: bool = False) → pandas.core.frame.DataFrame[source]¶ Computes imputations for numerical or categorical values
For categorical imputations, the most likely values are imputed if they are above a certain precision threshold computed on the validation set. Precision is calculated as part of the datawig.evaluate_and_persist_metrics function.
For numerical imputations, no thresholding is applied.
Returns original dataframe with imputations and respective likelihoods as estimated by imputation model in additional columns; names of imputation columns are that of the label suffixed with imputation_suffix, names of respective likelihood columns are suffixed with score_suffix
Parameters: - data_frame – pandas data_frame
- precision_threshold – double between 0 and 1 indicating precision threshold for each imputation
- imputation_suffix – suffix for imputation columns
- score_suffix – suffix for imputation score columns
- inplace – if True, adds a column with imputed values and a column with confidence scores to data_frame and returns the modified object; if False, creates a copy of data_frame with the additional columns and leaves the input unmodified.
Returns: dataframe with imputations and their likelihoods in additional columns
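One way to turn a precision threshold into a decision rule is to find, on validation data, the lowest score cutoff at which predictions for a label are still precise enough. The sketch below is an assumed simplification, not the logic of datawig.evaluate_and_persist_metrics; score_threshold_for_precision is a hypothetical name, and ties between scores are not treated specially.

```python
from typing import List, Optional, Tuple


def score_threshold_for_precision(
    validation: List[Tuple[float, bool]], precision_threshold: float
) -> Optional[float]:
    """Find the lowest score cutoff whose validation precision meets the target.

    `validation` holds (score, was_correct) pairs for one label.
    Returns None if no cutoff reaches the requested precision.
    """
    ranked = sorted(validation, key=lambda pair: pair[0], reverse=True)
    best_cutoff = None
    correct = 0
    for kept, (score, was_correct) in enumerate(ranked, start=1):
        correct += int(was_correct)
        # Precision when keeping every prediction scored at least `score`.
        if correct / kept >= precision_threshold:
            best_cutoff = score
    return best_cutoff


validation = [(0.95, True), (0.9, True), (0.8, False), (0.7, True), (0.6, False)]
print(score_threshold_for_precision(validation, 0.75))  # 0.7
print(score_threshold_for_precision(validation, 1.0))   # 0.9
```

At prediction time, a categorical value would only be imputed when its score is at or above the cutoff for the predicted label. Numerical imputations skip this step because no thresholding is applied to them.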
-
predict_above_precision
(data_frame: pandas.core.frame.DataFrame, precision_threshold=0.95) → dict[source]¶ Returns the probabilities for each class, filtering out predictions below the precision threshold.
Parameters: - data_frame – data frame
- precision_threshold – don’t predict if predicted class probability is below this precision threshold
Returns: dict of {‘column_name’: array}, array is a numpy array of shape samples-by-labels
-
predict_proba
(data_frame: pandas.core.frame.DataFrame) → dict[source]¶ Returns the probabilities for each class
Parameters: data_frame – data frame Returns: dict of {‘column_name’: array}, array is a numpy array of shape samples-by-labels
-
predict_proba_top_k
(data_frame: pandas.core.frame.DataFrame, top_k: int = 5) → dict[source]¶ Returns tuples of (label, probability) for the top_k most likely predicted classes
Parameters: - data_frame – pandas data frame
- top_k – number of most likely predictions to return
Returns: dict of {‘column_name’: list} where list is a list of (label, probability) tuples
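The shape of the top-k output for a single row can be shown with a small sketch. The helper top_k_labels and the probability values are illustrative only; the real method operates on model outputs for a whole data frame.

```python
from typing import Dict, List, Tuple


def top_k_labels(probabilities: Dict[str, float], top_k: int = 5) -> List[Tuple[str, float]]:
    """Return the `top_k` (label, probability) pairs, most likely first."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


row_probabilities = {"acme": 0.6, "nova": 0.3, "zeta": 0.1}
print(top_k_labels(row_probabilities, top_k=2))  # [('acme', 0.6), ('nova', 0.3)]
```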
Column Encoders
Column Encoders: used for translating values of a table into numerical representation such that Featurizers can operate on them
-
class
datawig.column_encoders.
BowEncoder
(input_columns: Any, output_column: str = None, max_tokens: int = 262144, tokens: str = 'chars', ngram_range: tuple = None, prefixed_concatenation: bool = True)[source]¶ Bag-of-Words encoder for text data, using sklearn’s HashingVectorizer
Parameters: - input_columns – List[str] with column names to be used as input for this ColumnEncoder
- output_column – Name of output field, used as field name in downstream MxNet iterator
- max_tokens – Number of hash buckets (dimensionality of sparse ngram vector). default 2**18
- tokens – How to tokenize the input data, supports ‘words’ and ‘chars’.
- ngram_range – length of ngrams to use as features
- prefixed_concatenation – whether or not to prefix values with column name before concat
-
decode
(col: pandas.core.series.Series) → pandas.core.series.Series[source]¶ Raises NotImplementedError; hashed bag-of-words cannot be decoded due to hash collisions
Parameters: col – Returns:
-
fit
(data_frame: pandas.core.frame.DataFrame)[source]¶ Does nothing, HashingVectorizers do not need to be fit.
Parameters: data_frame – Returns:
-
is_fitted
() → bool[source]¶ Returns true if the column encoder does not require fitting (anymore or at all)
Returns: True if the encoder is fitted
-
transform
(data_frame: pandas.core.frame.DataFrame) → numpy.core.multiarray.array[source]¶ Transforms one or more string columns into bag-of-words vectors, hashed into a max_tokens-dimensional feature space. NaNs and missing values are replaced by zero vectors.
Parameters: data_frame – pandas DataFrame with text columns Returns: numpy array (rows by max_tokens)
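The hashing trick behind a bag-of-n-grams encoder can be sketched with the standard library. This is not the BowEncoder implementation: sklearn's HashingVectorizer uses MurmurHash3 and, by default, signed and normalised values, whereas this sketch uses MD5 as a stand-in hash and raw counts. The function names and the exact way column names are prefixed are assumptions.

```python
import hashlib
from typing import Dict, List, Optional


def char_ngrams(text: str, n: int) -> List[str]:
    """All contiguous character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def hashed_bow(
    row: Dict[str, Optional[str]],
    input_columns: List[str],
    max_tokens: int = 16,
    n: int = 3,
    prefixed_concatenation: bool = True,
) -> List[int]:
    """Count character n-grams of the concatenated columns into hash buckets."""
    parts = []
    for column in input_columns:
        value = row.get(column)
        if value is None:
            continue  # missing values contribute nothing
        parts.append(f"{column} {value}" if prefixed_concatenation else value)
    vector = [0] * max_tokens
    for gram in char_ngrams(" ".join(parts), n):
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % max_tokens
        vector[bucket] += 1  # different n-grams may collide in one bucket
    return vector


print(hashed_bow({"item_name": "running shoe"}, ["item_name"]))
print(hashed_bow({"item_name": None}, ["item_name"]))  # all zeros for a missing value
```

Because several n-grams can share a bucket, the mapping cannot be inverted, which is why decode raises NotImplementedError for this encoder.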
-
class
datawig.column_encoders.
CategoricalEncoder
(input_columns: Any, output_column: str = None, token_to_idx: Dict[str, int] = None, max_tokens: int = 10000)[source]¶ Transforms categorical variable from string representation into number
Parameters: - input_columns – List[str] with column names to be used as input for this ColumnEncoder
- output_column – Name of output field, used as field name in downstream MxNet iterator
- token_to_idx – token to index mapping, 0 is reserved for missing tokens, 1 … max_tokens for most to least frequent tokens
- max_tokens – maximum number of tokens
-
decode
(col: pandas.core.series.Series) → pandas.core.series.Series[source]¶ Decodes a pandas Series of token indices
Parameters: col – pandas Series of token indices Returns: pandas Series of tokens
-
decode_token
(token_idx: int) → str[source]¶ Decodes a token index into a token
Parameters: token_idx – token index Returns: token
-
fit
(data_frame: pandas.core.frame.DataFrame)[source]¶ Fits a CategoricalEncoder by extracting the value histogram of a column and capping it at max_tokens. Issues a warning if fewer than 100 values were observed.
Parameters: data_frame – pandas data frame
-
is_fitted
()[source]¶ Checks if ColumnEncoder (still) needs to be fitted to data
Returns: True if the column encoder does not require fitting (anymore or at all)
-
transform
(data_frame: pandas.core.frame.DataFrame) → numpy.core.multiarray.array[source]¶ Transforms string column of pandas dataframe into categoricals
Parameters: data_frame – pandas data frame Returns: numpy array (rows by 1)
-
static
transform_func_categorical
(col: pandas.core.series.Series, token_to_idx: Dict[str, int], missing_token_idx: int) → Any[source]¶ Transforms categorical values into their indices
Parameters: - col – pandas Series with categorical values
- token_to_idx – Dict[str, int] with mapping from token to token index
- missing_token_idx – index for missing symbol
Returns:
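The token-to-index scheme described for CategoricalEncoder (index 0 for missing tokens, 1 to max_tokens for most to least frequent values) can be sketched as below. The helper names are hypothetical, and mapping unseen values to the missing index is an assumption.

```python
from collections import Counter
from typing import Dict, List, Optional

MISSING_IDX = 0  # index 0 is reserved for missing tokens


def fit_token_to_idx(values: List[Optional[str]], max_tokens: int) -> Dict[str, int]:
    """Map the `max_tokens` most frequent values to indices 1..max_tokens."""
    counts = Counter(v for v in values if v is not None)
    most_common = counts.most_common(max_tokens)
    return {token: idx for idx, (token, _) in enumerate(most_common, start=1)}


def encode_categories(values: List[Optional[str]], token_to_idx: Dict[str, int]) -> List[int]:
    """Replace each value by its index; missing or unseen values get MISSING_IDX."""
    return [token_to_idx.get(v, MISSING_IDX) for v in values]


column = ["acme", "acme", "acme", "nova", "nova", "zeta", None]
mapping = fit_token_to_idx(column, max_tokens=2)
print(mapping)                             # {'acme': 1, 'nova': 2}
print(encode_categories(column, mapping))  # [1, 1, 1, 2, 2, 0, 0]
```

Capping the histogram at max_tokens means rare values such as "zeta" above fall back to the missing index.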
-
class
datawig.column_encoders.
ColumnEncoder
(input_columns: List[str], output_column=None, output_dim=1)[source]¶ Abstract super class of column encoders. Transforms value representation of columns (e.g. strings) into numerical representations to be fed into MxNet.
Options for ColumnEncoders are:
- SequentialEncoder: for sequences of symbols (e.g. characters or words)
- BowEncoder: bag-of-word representation, as sparse vectors
- CategoricalEncoder: for categorical variables
- NumericalEncoder: for numerical values
Parameters: - input_columns – List[str] with column names to be used as input for this ColumnEncoder
- output_column – Name of output field, used as field name in downstream MxNet iterator
- output_dim – dimensionality of encoded column values (1 for categorical, vocabulary size for sequential and BoW)
-
decode
(col: pandas.core.series.Series) → pandas.core.series.Series[source]¶ Decodes a pandas Series of token indices
Parameters: col – pandas Series of token indices Returns: pandas Series of tokens
-
fit
(data_frame: pandas.core.frame.DataFrame)[source]¶ Fits a ColumnEncoder if needed (i.e. vocabulary/alphabet)
Parameters: data_frame – pandas DataFrame Returns:
-
exception
datawig.column_encoders.
NotFittedError
[source]¶ Error thrown when unfitted encoder is used
-
class
datawig.column_encoders.
NumericalEncoder
(input_columns: Any, output_column: str = None, normalize=True)[source]¶ Numerical encoder; concatenates the columns in input_columns into one vector and fills NaNs with the mean of each column
Parameters: - input_columns – List[str] with column names to be used as input for this ColumnEncoder
- output_column – Name of output field, used as field name in downstream MxNet iterator
- normalize – whether to normalize by the standard deviation or not, default True
-
decode
(col: pandas.core.series.Series) → pandas.core.series.Series[source]¶ Undoes the normalization, scales by scale and adds the mean
Parameters: col – pandas Series (normalized) Returns: pandas Series (unnormalized)
-
fit
(data_frame: pandas.core.frame.DataFrame)[source]¶ Does nothing or fits the normalizer, if normalization is specified
Parameters: data_frame – DataFrame with numerical columns specified when instantiating NumericalEncoder
-
is_fitted
()[source]¶ Returns true if the column encoder does not require fitting (anymore or at all)
Returns: True if the encoder is fitted
-
transform
(data_frame: pandas.core.frame.DataFrame) → numpy.core.multiarray.array[source]¶ Concatenates the numerical columns specified when instantiating the NumericalEncoder Normalizes features if specified in the NumericalEncoder
Parameters: data_frame – DataFrame with numerical columns specified in NumericalEncoder Returns: np.array with numerical features (rows by number of numerical columns)
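The fit, transform and decode steps of a numerical encoder can be sketched for a single column. This is an assumed simplification: it subtracts the mean, divides by the population standard deviation, and fills missing entries with the mean. Whether DataWig uses the population or sample standard deviation is not stated here, and all helper names are hypothetical.

```python
import statistics
from typing import List, Optional, Tuple


def fit_scaler(values: List[Optional[float]]) -> Tuple[float, float]:
    """Compute mean and scale from the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    scale = statistics.pstdev(observed) or 1.0  # avoid dividing by zero
    return mean, scale


def encode_numbers(values: List[Optional[float]], mean: float, scale: float) -> List[float]:
    """Fill missing entries with the mean, then standardize."""
    return [((mean if v is None else v) - mean) / scale for v in values]


def decode_numbers(encoded: List[float], mean: float, scale: float) -> List[float]:
    """Undo the normalization: multiply by the scale and add the mean."""
    return [x * scale + mean for x in encoded]


prices = [10.0, 20.0, None, 30.0]
mean, scale = fit_scaler(prices)
encoded = encode_numbers(prices, mean, scale)
print(encoded)
print(decode_numbers(encoded, mean, scale))
```

After decoding, the missing entry comes back as the column mean, since that is the value it was filled with before scaling.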
-
class
datawig.column_encoders.
SequentialEncoder
(input_columns: Any, output_column: str = None, token_to_idx: Dict[str, int] = None, max_tokens: int = 1000, seq_len: int = 500)[source]¶ Transforms sequence of characters into sequence of numbers
Parameters: - input_columns – List[str] with column names to be used as input for this ColumnEncoder
- output_column – Name of output field, used as field name in downstream MxNet iterator
- token_to_idx – token to index mapping; 0 is reserved for missing tokens, 1 … max_tokens-1 for most to least frequent tokens
- max_tokens – maximum number of tokens
- seq_len – length of sequence; shorter sequences are padded to seq_len symbols, longer sequences are truncated at seq_len symbols
-
decode
(col: pandas.core.series.Series) → pandas.core.series.Series[source]¶ Decodes a pandas Series of token indices
Parameters: col – pandas Series of token index iterables Returns: pd.Series of strings
-
decode_seq
(token_index_sequence: Iterable[int]) → str[source]¶ Decodes a sequence of token indices into a string
Parameters: token_index_sequence – an iterable of token indices Returns: str the decoded string
-
fit
(data_frame: pandas.core.frame.DataFrame)[source]¶ Fits a SequentialEncoder by extracting the character value histogram of a column and capping it at max_tokens
Parameters: data_frame – pandas data frame
-
is_fitted
() → bool[source]¶ Checks if ColumnEncoder (still) needs to be fitted to data
Returns: True if the column encoder does not require fitting (anymore or at all)
-
transform
(data_frame: pandas.core.frame.DataFrame) → numpy.core.multiarray.array[source]¶ Transforms column of pandas dataframe into sequence of tokens
Parameters: data_frame – pandas DataFrame Returns: numpy array (rows by seq_len)
-
static
transform_func_seq_single
(string: str, token_to_idx: Dict[str, int], seq_len: int, missing_token_idx: int) → List[int][source]¶ Transforms a single string into a sequence of token ids
Parameters: - string – a sequence of symbols as string
- token_to_idx – Dict[str, int] with mapping from token to token index
- seq_len – length of sequence
- missing_token_idx – index for missing symbol
Returns: List[int] with transformed values
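Padding and truncation to a fixed seq_len can be illustrated with a short sketch. It assumes that padding is appended on the right and uses the missing-token index as the padding value; encode_sequence is a hypothetical name and not the library function.

```python
from typing import Dict, List


def encode_sequence(
    string: str, token_to_idx: Dict[str, int], seq_len: int, missing_token_idx: int = 0
) -> List[int]:
    """Map characters to indices, truncate to `seq_len`, pad the rest."""
    indices = [token_to_idx.get(ch, missing_token_idx) for ch in string[:seq_len]]
    return indices + [missing_token_idx] * (seq_len - len(indices))


token_to_idx = {"a": 1, "b": 2, "c": 3}
print(encode_sequence("abcx", token_to_idx, seq_len=6))    # [1, 2, 3, 0, 0, 0]
print(encode_sequence("abcabc", token_to_idx, seq_len=4))  # [1, 2, 3, 1]
```

Every row therefore yields exactly seq_len indices, which matches the "rows by seq_len" shape returned by transform.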
-
class
datawig.column_encoders.
TfIdfEncoder
(input_columns: Any, output_column: str = None, max_tokens: int = 262144, tokens: str = 'chars', ngram_range: tuple = None, prefixed_concatenation: bool = True)[source]¶ TfIdf bag of word encoder for text data, using sklearn’s TfidfVectorizer
Parameters: - input_columns – List[str] with column names to be used as input for this ColumnEncoder
- output_column – Name of output field, used as field name in downstream MxNet iterator
- max_tokens – Number of feature buckets (dimensionality of sparse ngram vector). default 2**18
- tokens – How to tokenize the input data, supports ‘words’ and ‘chars’.
- ngram_range – length of ngrams to use as features
- prefixed_concatenation – whether or not to prefix values with column name before concat
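Unlike the hashed bag-of-words, a tf-idf encoder keeps an explicit vocabulary, so each feature can be traced back to an n-gram. The sketch below computes tf-idf weights over character bigrams with the smoothed idf formula and L2 normalisation that sklearn's TfidfVectorizer uses by default. It is a simplified illustration with hypothetical helper names, not the TfIdfEncoder code.

```python
import math
from collections import Counter
from typing import Dict, List


def char_ngrams(text: str, n: int) -> List[str]:
    """All contiguous character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def fit_idf(documents: List[str], n: int = 2) -> Dict[str, float]:
    """Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1."""
    doc_freq: Counter = Counter()
    for doc in documents:
        doc_freq.update(set(char_ngrams(doc, n)))
    total = len(documents)
    return {gram: math.log((1 + total) / (1 + df)) + 1.0 for gram, df in doc_freq.items()}


def tfidf_vector(document: str, idf: Dict[str, float], n: int = 2) -> Dict[str, float]:
    """Term count times idf, L2-normalised; n-grams unseen during fit are dropped."""
    counts = Counter(g for g in char_ngrams(document, n) if g in idf)
    weights = {gram: count * idf[gram] for gram, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
    return {gram: w / norm for gram, w in weights.items()}


idf = fit_idf(["red shoe", "blue shoe", "red hat"])
vector = tfidf_vector("blue shoe", idf)
# n-grams that occur in fewer documents receive larger weights
print(sorted(vector.items(), key=lambda item: item[1], reverse=True)[:3])
```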