API

Simple Imputer

DataWig SimpleImputer: Uses some simple default encoders and featurizers that usually yield decent imputation quality

class datawig.simple_imputer.SimpleImputer(input_columns: List[str], output_column: str, output_path: str = '', num_hash_buckets: int = 32768, num_labels: int = 100, tokens: str = 'chars', numeric_latent_dim: int = 100, numeric_hidden_layers: int = 1, is_explainable: bool = False)[source]

SimpleImputer model based on n-grams of concatenated strings of input columns and concatenated numerical features, if provided.

Given a data frame with string columns, a model is trained to predict observed values in label column using values observed in other columns.

The model can then be used to impute missing values.

Parameters:
  • input_columns – list of input column names (as strings)
  • output_column – output column name (as string)
  • output_path – path to store model and metrics
  • num_hash_buckets – number of hash buckets used for the n-gram hashing vectorizer, only used for non-numerical input columns, ignored otherwise
  • num_labels – number of imputable values (labels) considered, only used for non-numerical input columns, ignored otherwise
  • tokens – string, ‘chars’ or ‘words’ (default ‘chars’), determines tokenization strategy for n-grams, only used for non-numerical input columns, ignored otherwise
  • numeric_latent_dim – int, number of latent dimensions for hidden layer of NumericalFeaturizers; only used for numerical input columns, ignored otherwise
  • numeric_hidden_layers – number of numeric hidden layers
  • is_explainable – if this is True, a stateful tf-idf encoder is used that allows explaining classes and single instances

Example usage:

import os
import pandas as pd

from datawig.simple_imputer import SimpleImputer

fn_train = os.path.join(datawig_test_path, "resources", "shoes", "train.csv.gz")
fn_test = os.path.join(datawig_test_path, "resources", "shoes", "test.csv.gz")

df_train = pd.read_csv(fn_train)
df_test = pd.read_csv(fn_test)

output_path = "imputer_model"

# set up imputer model
imputer = SimpleImputer(
    input_columns=['item_name', 'bullet_point'],
    output_column='brand',
    output_path=output_path)

# train the imputer model
imputer = imputer.fit(df_train)

# obtain imputations
imputations = imputer.predict(df_test)

check_data_types(data_frame: pandas.core.frame.DataFrame) → None[source]

Checks whether each column contains string or numeric data

Parameters:data_frame – data frame whose column types are checked
Returns:
check_for_label_shift(target_data: pandas.core.frame.DataFrame) → dict[source]

Detect label shift in the validation data

Parameters:target_data – unlabelled data for which predictions are to be generated
Returns:

dictionary with labels as keys and weights as values.
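A minimal sketch of how the returned weights might be used, assuming a fitted SimpleImputer named imputer, its training frame df_train, and an unlabelled target frame df_target (all hypothetical names):

# detect label shift between training and target data
weights = imputer.check_for_label_shift(df_target)

# one plausible follow-up: re-train, reweighing instances by the estimated class weights
imputer.fit(train_df=df_train, class_weights=weights)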

static complete(data_frame: pandas.core.frame.DataFrame, precision_threshold: float = 0.0, inplace: bool = False, hpo: bool = False, verbose: int = 0, num_epochs: int = 100, iterations: int = 1, output_path: str = '.')[source]

Given a dataframe with missing values, this function detects all imputable columns, trains an imputation model on all other columns and imputes values for each missing value. Several imputation iterations can be run. Imputable columns are either numeric columns or non-numeric categorical columns; to determine whether a column is categorical (as opposed to a plain text column), the following heuristic is used: a non-numeric categorical column should have at least 10 times as many rows as it has unique values.

If an imputation model did not reach the precision specified in the precision_threshold parameter for a given imputation value, that value will not be imputed; thus, depending on the precision_threshold, the returned dataframe can still contain some missing values. For numeric columns, no precision filtering is applied.

Parameters:
  • data_frame – original dataframe
  • precision_threshold – precision threshold for categorical imputations (default: 0.0)
  • inplace – whether or not to perform imputations inplace (default: False)
  • hpo – whether or not to perform hyperparameter optimization (default: False)
  • verbose – verbosity level, values > 0 log to stdout (default: 0)
  • num_epochs – number of epochs for each imputation model training (default: 100)
  • iterations – number of iterations for iterative imputation (default: 1)
  • output_path – path to store model and metrics
Returns:

dataframe with imputations
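A short example of the complete workflow; the CSV path is hypothetical, any dataframe with missing values works:

import pandas as pd
from datawig import SimpleImputer

df = pd.read_csv("data_with_missing_values.csv")  # hypothetical input file

# impute all imputable columns; keep only categorical imputations whose
# model reached at least 90% precision on held-out data
df_completed = SimpleImputer.complete(df, precision_threshold=0.9, num_epochs=50)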

explain(label: str, k: int = 10, label_column: str = None) → dict[source]

Return dictionary with a list of tuples for each explainable input column. Each tuple denotes one of the top k features with highest correlation to the label.

Parameters:
  • label – label value to explain
  • k – number of explanations for each input encoder to return. If not given, return top 10 explanations.
  • label_column – name of label column to be explained (optional, defaults to the first available column.)
explain_instance(instance: pandas.core.series.Series, k: int = 10, label_column: str = None, label: str = None) → dict[source]

Return dictionary with list of tuples for each explainable input column of the given instance. Each entry shows the features most highly correlated with the given label (or with the top predicted label, if none is provided).

Parameters:
  • instance – row of data frame (or dictionary)
  • k – number of explanations (ngrams) for text inputs
  • label_column – name of label column to be explained (optional)
  • label – explain why instance is classified as label, otherwise explain top-label per input
fit(train_df: pandas.core.frame.DataFrame, test_df: pandas.core.frame.DataFrame = None, ctx: mxnet.context = [mx.cpu()], learning_rate: float = 0.004, num_epochs: int = 100, patience: int = 5, test_split: float = 0.1, weight_decay: float = 0.0, batch_size: int = 16, final_fc_hidden_units: List[int] = None, calibrate: bool = True, class_weights: dict = None, instance_weights: list = None) → Any[source]

Trains and stores imputer model

Parameters:
  • train_df – training data as dataframe
  • test_df – test data as dataframe; if not provided, a ratio of test_split of the training data are used as test data
  • ctx – list of mxnet contexts (if no GPUs are available, defaults to [mx.cpu()]); user can also pass in a list of GPUs to be used, e.g. [mx.gpu(0), mx.gpu(2), mx.gpu(4)]
  • learning_rate – learning rate for stochastic gradient descent (default 4e-3)
  • num_epochs – maximal number of training epochs (default 100)
  • patience – used for early stopping; after [patience] epochs with no improvement, training is stopped. (default 5)
  • test_split – if no test_df is provided this is the ratio of test data to be held separate for determining model convergence
  • weight_decay – regularizer (default 0)

  • batch_size – default 16
  • final_fc_hidden_units – list of dimensions for FC layers after the final concatenation
  • calibrate – control automatic model calibration
  • class_weights – dictionary with labels as keys and weights as values; weighs each instance’s contribution to the likelihood based on the corresponding class
  • instance_weights – list of weights for each instance in train_df
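A minimal sketch of fit with an explicit held-out frame and class weights; df_train, df_test, and the labels are hypothetical:

imputer = SimpleImputer(
    input_columns=['item_name', 'bullet_point'],
    output_column='brand',
    output_path='imputer_model')

imputer.fit(
    train_df=df_train,
    test_df=df_test,           # otherwise a test_split fraction is held out
    num_epochs=50,
    patience=3,
    class_weights={'nike': 1.0, 'adidas': 2.0})  # hypothetical labels/weights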
fit_hpo(train_df: pandas.core.frame.DataFrame, test_df: pandas.core.frame.DataFrame = None, hps: dict = None, num_evals: int = 10, max_running_hours: float = 96.0, hpo_run_name: str = None, user_defined_scores: list = None, num_epochs: int = None, patience: int = None, test_split: float = 0.2, weight_decay: List[float] = None, batch_size: int = 16, num_hash_bucket_candidates: List[float] = [4096, 32768, 262144], tokens_candidates: List[str] = ['words', 'chars'], numeric_latent_dim_candidates: List[int] = None, numeric_hidden_layers_candidates: List[int] = None, final_fc_hidden_units: List[List[int]] = None, learning_rate_candidates: List[float] = None, normalize_numeric: bool = True, hpo_max_train_samples: int = None, ctx: mxnet.context = [mx.cpu()]) → Any[source]

Fits an imputer model with hyperparameter optimization. The parameter ranges are searched randomly.

Grids are specified using the *_candidates arguments (old interface) or, with more flexibility, via the hps dictionary.

Parameters:
  • train_df – training data as dataframe
  • test_df – test data as dataframe; if not provided, a ratio of test_split of the training data are used as test data
  • hps – nested dictionary where hps['global'][parameter_name] is a list of values for global parameters. Similarly, hps[column_name][parameter_name] is a list of parameter values for each input column. Further, hps[column_name]['type'] is one of ['numeric', 'categorical', 'string'] and is inferred if not provided.
  • num_evals – number of evaluations for random search
  • max_running_hours – Time before the hpo run is terminated in hours.
  • hpo_run_name – string to identify the current hpo run.
  • user_defined_scores – list with entries (Callable, str), where the callable is a function accepting the keyword arguments true, predicted, and confidence; allows custom scoring functions.

Below are parameters of the old implementation, kept for backwards compatibility.

Parameters:
  • num_epochs – maximal number of training epochs (default 10)
  • patience – used for early stopping; after [patience] epochs with no improvement, training is stopped. (default 3)
  • test_split – if no test_df is provided this is the ratio of test data to be held separate for determining model convergence
  • weight_decay – regularizer (default 0)
  • batch_size – default 16
  • num_hash_bucket_candidates – candidates for gridsearch hyperparameter optimization (default [2**12, 2**15, 2**18])
  • tokens_candidates – candidates for tokenization (default [‘words’, ‘chars’])
  • numeric_latent_dim_candidates – candidates for latent dimensionality of numerical features (default [10, 50, 100])
  • numeric_hidden_layers_candidates – candidates for number of hidden layers of numerical features (default [0, 1, 2])
  • final_fc_hidden_units – list of lists with dimensions for FC layers after the final concatenation (NOTE: for HPO, this expects a list of lists)
  • learning_rate_candidates – candidates for learning rate (default [1e-1, 1e-2, 1e-3])
  • normalize_numeric – boolean indicating whether or not to normalize numeric values
  • hpo_max_train_samples – training set size for hyperparameter optimization; use is deprecated
  • ctx – list of mxnet contexts (if no GPUs are available, defaults to [mx.cpu()]); user can also pass in a list of GPUs to be used, e.g. [mx.gpu(0), mx.gpu(2), mx.gpu(4)]. This parameter is deprecated.
Returns:

pd.DataFrame with hyper-parameter configurations and results
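A sketch of the newer hps-dictionary interface described above; the concrete grid values are illustrative only:

hps = {}
hps['global'] = {
    'learning_rate': [1e-3, 1e-4],
    'num_epochs': [50],
}
hps['item_name'] = {
    'type': ['string'],
    'tokens': [['chars'], ['words']],  # each entry is one candidate setting
    'max_tokens': [2 ** 15],
}

results = imputer.fit_hpo(train_df=df_train, hps=hps, num_evals=5)
# results is a pd.DataFrame with one row per evaluated configuration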

static load(output_path: str) → Any[source]

Loads model from output path

Parameters:output_path – output_path field of trained SimpleImputer model
Returns:SimpleImputer model
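For instance, assuming a SimpleImputer was previously trained with output_path='imputer_model':

restored = SimpleImputer.load('imputer_model')
imputations = restored.predict(df_test)  # df_test is hypothetical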
load_hpo_model(hpo_name: int = None)[source]

Load model after hyperparameter optimisation has run. Overwrites local artifacts of self.imputer.

Parameters:hpo_name – index of the model to be loaded. By default, the model with the best weighted precision or mean squared error is loaded.
Returns:imputer object
load_metrics() → Dict[str, Any][source]

Loads various metrics of the internal imputer model, returned as dictionary

Returns:Dict[str, Any]

predict(data_frame: pandas.core.frame.DataFrame, precision_threshold: float = 0.0, imputation_suffix: str = '_imputed', score_suffix: str = '_imputed_proba', inplace: bool = False)[source]
Imputes the most likely value if it is above a certain precision threshold determined on the validation set.

Precision is calculated as part of the datawig.evaluate_and_persist_metrics function.

Returns the original dataframe with imputations and respective likelihoods, as estimated by the imputation model, in additional columns; names of imputation columns are that of the label suffixed with imputation_suffix, names of the respective likelihood columns are suffixed with score_suffix.

Parameters:
  • data_frame – data frame (pandas)
  • precision_threshold – double between 0 and 1 indicating precision threshold
  • imputation_suffix – suffix for imputation columns
  • score_suffix – suffix for imputation score columns
  • inplace – add column with imputed values and column with confidence scores to data_frame, returns the modified object (True). Create copy of data_frame with additional columns, leave input unmodified (False).
Returns:

original dataframe with imputations and likelihoods in additional columns
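A brief sketch: with the default suffixes and output_column='brand', the imputations and confidences land in 'brand_imputed' and 'brand_imputed_proba':

# only impute where estimated precision on the validation set exceeds 95%
df_imputed = imputer.predict(df_test, precision_threshold=0.95)
print(df_imputed[['brand_imputed', 'brand_imputed_proba']].head())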

save()[source]

Saves model to disk; mxnet module and imputer are stored separately

Imputer

DataWig Imputer: Imputes missing values in tables

class datawig.imputer.Imputer(data_encoders: List[datawig.column_encoders.ColumnEncoder], data_featurizers: List[datawig.mxnet_input_symbols.Featurizer], label_encoders: List[datawig.column_encoders.ColumnEncoder], output_path='')[source]

Imputer model based on deep learning trained with MxNet

Given a data frame with string columns, a model is trained to predict observed values in one or more column using values observed in other columns. The model can then be used to impute missing values.

Parameters:
  • data_encoders – list of datawig.mxnet_input_symbol.ColumnEncoders, output_column name must match field_name of data_featurizers
  • data_featurizers – list of Featurizer;
  • label_encoders – list of CategoricalEncoder or NumericalEncoder
  • output_path – path to store model and metrics
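A construction sketch following the DataWig example pattern; column names and dataframes (df_train, df_test) are hypothetical, and each featurizer's field name matches its encoder's output column:

from datawig import Imputer
from datawig.column_encoders import BowEncoder, CategoricalEncoder
from datawig.mxnet_input_symbols import BowFeaturizer

data_encoder_cols = [BowEncoder('item_name')]
data_featurizer_cols = [BowFeaturizer('item_name')]
label_encoder_cols = [CategoricalEncoder('brand')]

imputer = Imputer(
    data_encoders=data_encoder_cols,
    data_featurizers=data_featurizer_cols,
    label_encoders=label_encoder_cols,
    output_path='imputer_model')

imputer.fit(train_df=df_train)
predictions = imputer.predict(df_test)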
calibrate(test_iter: datawig.iterators.ImputerIterDf)[source]

Checks model calibration and fits temperature scaling. If the fit improves model calibration, the temperature parameter is assigned as a property of self and used for all further predictions in self.predict_mxnet_iter(). Saves calibration information to a dictionary.

Parameters:test_iter – iterator, see ImputerIter in iterators.py
Returns:None
explain(label: str, k: int = 10, label_column: str = None) → dict[source]

Return dictionary with a list of tuples for each explainable input column. Each tuple denotes one of the top k features with highest correlation to the label.

Parameters:
  • label – label value to explain
  • k – number of explanations for each input encoder to return. If not given, return top 10 explanations.
  • label_column – name of label column to be explained (optional, defaults to the first available column.)
explain_instance(instance: pandas.core.series.Series, k: int = 10, label_column: str = None, label: str = None) → dict[source]

Return dictionary with list of tuples for each explainable input column of the given instance. Each entry shows the features most highly correlated with the given label (or with the top predicted label, if none is provided).

Parameters:
  • instance – row of data frame (or dictionary)
  • k – number of explanations (ngrams) for text inputs
  • label_column – name of label column to be explained (optional)
  • label – explain why instance is classified as label, otherwise explain top-label per input
fit(train_df: pandas.core.frame.DataFrame, test_df: pandas.core.frame.DataFrame = None, ctx: mxnet.context = [mx.cpu()], learning_rate: float = 0.001, num_epochs: int = 100, patience: int = 3, test_split: float = 0.1, weight_decay: float = 0.0, batch_size: int = 16, final_fc_hidden_units: List[int] = None, calibrate: bool = True)[source]

Trains and stores imputer model

Parameters:
  • train_df – training data as dataframe
  • test_df – test data as dataframe; if not provided, a ratio of test_split of the training data are used as test data
  • ctx – list of mxnet contexts (if no GPUs are available, defaults to [mx.cpu()]); user can also pass in a list of GPUs to be used, e.g. [mx.gpu(0), mx.gpu(2), mx.gpu(4)]
  • learning_rate – learning rate for stochastic gradient descent (default 1e-3)
  • num_epochs – maximal number of training epochs (default 100)
  • patience – used for early stopping; after [patience] epochs with no improvement, training is stopped. (default 3)
  • test_split – if no test_df is provided this is the ratio of test data to be held separate for determining model convergence
  • weight_decay – regularizer (default 0)
  • batch_size – default 16
  • final_fc_hidden_units – list of dimensions for the final fully connected layer.
  • calibrate – whether to calibrate predictions
Returns:

trained imputer model

static load(output_path: str) → Any[source]

Loads model from output path

Parameters:output_path – output_path field of trained Imputer model
Returns:imputer model
predict(data_frame: pandas.core.frame.DataFrame, precision_threshold: float = 0.0, imputation_suffix: str = '_imputed', score_suffix: str = '_imputed_proba', inplace: bool = False) → pandas.core.frame.DataFrame[source]

Computes imputations for numerical or categorical values

For categorical imputations, the most likely values are imputed if they are above a certain precision threshold computed on the validation set. Precision is calculated as part of the datawig.evaluate_and_persist_metrics function.

For numerical imputations, no thresholding is applied.

Returns the original dataframe with imputations and respective likelihoods, as estimated by the imputation model, in additional columns; names of imputation columns are that of the label suffixed with imputation_suffix, names of the respective likelihood columns are suffixed with score_suffix.

Parameters:
  • data_frame – pandas data_frame
  • precision_threshold – double between 0 and 1 indicating precision threshold for each imputation
  • imputation_suffix – suffix for imputation columns
  • score_suffix – suffix for imputation score columns
  • inplace – add column with imputed values and column with confidence scores to data_frame, returns the modified object (True). Create copy of data_frame with additional columns, leave input unmodified (False).
Returns:

dataframe with imputations and their likelihoods in additional columns

predict_above_precision(data_frame: pandas.core.frame.DataFrame, precision_threshold=0.95) → dict[source]

Returns the probabilities for each class, filtering out predictions below the precision threshold.

Parameters:
  • data_frame – data frame
  • precision_threshold – don’t predict if predicted class probability is below this precision threshold
Returns:

dict of {‘column_name’: array}, array is a numpy array of shape samples-by-labels

predict_proba(data_frame: pandas.core.frame.DataFrame) → dict[source]

Returns the probabilities for each class

Parameters:data_frame – data frame
Returns:dict of {‘column_name’: array}, array is a numpy array of shape samples-by-labels

predict_proba_top_k(data_frame: pandas.core.frame.DataFrame, top_k: int = 5) → dict[source]

Returns tuples of (label, probability) for the top_k most likely predicted classes

Parameters:
  • data_frame – pandas data frame
  • top_k – number of most likely predictions to return
Returns:

dict of {‘column_name’: list} where list is a list of (label, probability) tuples
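A small usage sketch, assuming a fitted Imputer with label column 'brand':

top_k = imputer.predict_proba_top_k(df_test, top_k=3)
# top_k['brand'] holds (label, probability) tuples for the most likely classes
print(top_k['brand'])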

save()[source]

Saves model to disk, except mxnet module which is stored separately during fit

transform(data_frame: pandas.core.frame.DataFrame) → dict[source]

Imputes values for the given data frame (converted internally into an mxnet iterator; see iterators)

Parameters:data_frame – pandas data frame
Returns:dict of {‘column_name’: list} where list contains the string predictions

transform_and_compute_metrics(data_frame: pandas.core.frame.DataFrame, metrics_path=None) → dict[source]

Returns predictions and metrics (average and per class)

Parameters:
  • data_frame – data frame
  • metrics_path – if not None and exists, metrics are serialized as json to this path.
Returns:

Column Encoders

Column Encoders: used for translating values of a table into numerical representation such that Featurizers can operate on them

class datawig.column_encoders.BowEncoder(input_columns: Any, output_column: str = None, max_tokens: int = 262144, tokens: str = 'chars', ngram_range: tuple = None, prefixed_concatenation: bool = True)[source]

Bag-of-Words encoder for text data, using sklearn’s HashingVectorizer

Parameters:
  • input_columns – List[str] with column names to be used as input for this ColumnEncoder
  • output_column – Name of output field, used as field name in downstream MxNet iterator
  • max_tokens – Number of hash buckets (dimensionality of sparse ngram vector). default 2**18
  • tokens – How to tokenize the input data, supports ‘words’ and ‘chars’.
  • ngram_range – length of ngrams to use as features
  • prefixed_concatenation – whether or not to prefix values with column name before concat
decode(col: pandas.core.series.Series) → pandas.core.series.Series[source]

Raises NotImplementedError, hashed bag-of-words cannot be decoded due to hash collisions

Parameters:col – pandas Series of token indices
Returns:
fit(data_frame: pandas.core.frame.DataFrame)[source]

Does nothing, HashingVectorizers do not need to be fit.

Parameters:data_frame
Returns:
is_fitted() → bool[source]

Returns true if the column encoder does not require fitting (anymore or at all)

Parameters:self
Returns:True if the encoder is fitted
transform(data_frame: pandas.core.frame.DataFrame) → numpy.core.multiarray.array[source]

Transforms one or more string columns into bag-of-words vectors, hashed into a max_features-dimensional feature space. NaNs and missing values will be replaced by zero vectors.

Parameters:data_frame – pandas DataFrame with text columns
Returns:numpy array (rows by max_features)
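A small sketch of stand-alone use with hypothetical data; since hashing is stateless, no fit is required:

import pandas as pd
from datawig.column_encoders import BowEncoder

df = pd.DataFrame({'item_name': ['red running shoe', 'blue sneaker', None]})

encoder = BowEncoder('item_name', max_tokens=2 ** 12)
vectors = encoder.transform(df)  # shape (3, 4096); the None row becomes a zero vector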
class datawig.column_encoders.CategoricalEncoder(input_columns: Any, output_column: str = None, token_to_idx: Dict[str, int] = None, max_tokens: int = 10000)[source]

Transforms categorical variable from string representation into number

Parameters:
  • input_columns – List[str] with column names to be used as input for this ColumnEncoder
  • output_column – Name of output field, used as field name in downstream MxNet iterator
  • token_to_idx – token to index mapping, 0 is reserved for missing tokens, 1 … max_tokens for most to least frequent tokens
  • max_tokens – maximum number of tokens
decode(col: pandas.core.series.Series) → pandas.core.series.Series[source]

Decodes a pandas Series of token indices

Parameters:col – pandas Series of token indices
Returns:pandas Series of tokens
decode_token(token_idx: int) → str[source]

Decodes a token index into a token

Parameters:token_idx – token index
Returns:token
fit(data_frame: pandas.core.frame.DataFrame)[source]

Fits a CategoricalEncoder by extracting the value histogram of a column and capping it at max_tokens. Issues a warning if fewer than 100 values were observed.

Parameters:data_frame – pandas data frame
is_fitted()[source]

Checks if ColumnEncoder (still) needs to be fitted to data

Returns:True if the column encoder does not require fitting (anymore or at all)
transform(data_frame: pandas.core.frame.DataFrame) → numpy.core.multiarray.array[source]

Transforms string column of pandas dataframe into categoricals

Parameters:data_frame – pandas data frame
Returns:numpy array (rows by 1)
static transform_func_categorical(col: pandas.core.series.Series, token_to_idx: Dict[str, int], missing_token_idx: int) → Any[source]

Transforms categorical values into their indices

Parameters:
  • col – pandas Series with categorical values
  • token_to_idx – Dict[str, int] with mapping from token to token index
  • missing_token_idx – index for missing symbol
Returns:
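A round-trip sketch with hypothetical values (note the warning mentioned above when fewer than 100 values are observed):

import pandas as pd
from datawig.column_encoders import CategoricalEncoder

df = pd.DataFrame({'brand': ['nike', 'adidas', 'nike', 'puma']})

encoder = CategoricalEncoder('brand', max_tokens=100)
encoder.fit(df)                  # builds the token_to_idx mapping from the value histogram
indices = encoder.transform(df)  # numpy array of shape (4, 1)
decoded = encoder.decode(pd.Series(indices.flatten()))  # back to tokens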

class datawig.column_encoders.ColumnEncoder(input_columns: List[str], output_column=None, output_dim=1)[source]

Abstract super class of column encoders. Transforms value representation of columns (e.g. strings) into numerical representations to be fed into MxNet.

Options for ColumnEncoders are:

  • SequentialEncoder: for sequences of symbols (e.g. characters or words)
  • BowEncoder: bag-of-words representation, as sparse vectors
  • CategoricalEncoder: for categorical variables
  • NumericalEncoder: for numerical values
Parameters:
  • input_columns – List[str] with column names to be used as input for this ColumnEncoder
  • output_column – Name of output field, used as field name in downstream MxNet iterator
  • output_dim – dimensionality of encoded column values (1 for categorical, vocabulary size for sequential and BoW)
decode(col: pandas.core.series.Series) → pandas.core.series.Series[source]

Decodes a pandas Series of token indices

Parameters:col – pandas Series of token indices
Returns:pandas Series of tokens
fit(data_frame: pandas.core.frame.DataFrame)[source]

Fits a ColumnEncoder if needed (i.e. vocabulary/alphabet)

Parameters:data_frame – pandas DataFrame
Returns:
is_fitted()[source]

Checks if ColumnEncoder (still) needs to be fitted to data

Returns:True if the column encoder does not require fitting (anymore or at all)
transform(data_frame: pandas.core.frame.DataFrame) → numpy.core.multiarray.array[source]

Transforms values in one or more columns of DataFrame into a numpy array that can be fed into a Featurizer

Parameters:data_frame
Returns:List of integers
exception datawig.column_encoders.NotFittedError[source]

Error thrown when unfitted encoder is used

class datawig.column_encoders.NumericalEncoder(input_columns: Any, output_column: str = None, normalize=True)[source]

Numerical encoder: concatenates the columns given in input_columns into one vector and fills NaNs with the mean of the column

Parameters:
  • input_columns – List[str] with column names to be used as input for this ColumnEncoder
  • output_column – Name of output field, used as field name in downstream MxNet iterator
  • normalize – whether to normalize by the standard deviation or not, default True
decode(col: pandas.core.series.Series) → pandas.core.series.Series[source]

Undoes the normalization, scales by scale and adds the mean

Parameters:col – pandas Series (normalized)
Returns:pandas Series (unnormalized)
fit(data_frame: pandas.core.frame.DataFrame)[source]

Does nothing or fits the normalizer, if normalization is specified

Parameters:data_frame – DataFrame with numerical columns specified when instantiating NumericalEncoder
is_fitted()[source]

Returns true if the column encoder does not require fitting (anymore or at all)

Parameters:self
Returns:True if the encoder is fitted
transform(data_frame: pandas.core.frame.DataFrame) → numpy.core.multiarray.array[source]

Concatenates the numerical columns specified when instantiating the NumericalEncoder. Normalizes features if specified in the NumericalEncoder.

Parameters:data_frame – DataFrame with numerical columns specified in NumericalEncoder
Returns:np.array with numerical features (rows by number of numerical columns)
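A minimal sketch with a hypothetical numeric column:

import pandas as pd
from datawig.column_encoders import NumericalEncoder

df = pd.DataFrame({'price': [10.0, 20.0, None, 40.0]})

encoder = NumericalEncoder('price', normalize=True)
encoder.fit(df)                   # estimates the statistics used for normalization
features = encoder.transform(df)  # NaN replaced by the column mean, then normalized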
class datawig.column_encoders.SequentialEncoder(input_columns: Any, output_column: str = None, token_to_idx: Dict[str, int] = None, max_tokens: int = 1000, seq_len: int = 500)[source]

Transforms sequence of characters into sequence of numbers

Parameters:
  • input_columns – List[str] with column names to be used as input for this ColumnEncoder
  • output_column – Name of output field, used as field name in downstream MxNet iterator
  • token_to_idx – token to index mapping 0 is reserved for missing tokens, 1 … max_tokens-1 for most to least frequent tokens
  • max_tokens – maximum number of tokens
  • seq_len – maximum sequence length; shorter sequences get padded and longer sequences truncated at seq_len symbols
decode(col: pandas.core.series.Series) → pandas.core.series.Series[source]

Decodes a pandas Series of token indices

Parameters:col – pandas Series of token index iterables
Returns:pd.Series of strings
decode_seq(token_index_sequence: Iterable[int]) → str[source]

Decodes a sequence of token indices into a string

Parameters:token_index_sequence – an iterable of token indices
Returns:str the decoded string
fit(data_frame: pandas.core.frame.DataFrame)[source]

Fits a SequentialEncoder by extracting the character value histogram of a column and capping it at max_tokens

Parameters:data_frame – pandas data frame
is_fitted() → bool[source]

Checks if ColumnEncoder (still) needs to be fitted to data

Returns:True if the column encoder does not require fitting (anymore or at all)
transform(data_frame: pandas.core.frame.DataFrame) → numpy.core.multiarray.array[source]

Transforms column of pandas dataframe into sequence of tokens

Parameters:data_frame – pandas DataFrame
Returns:numpy array (rows by seq_len)
static transform_func_seq_single(string: str, token_to_idx: Dict[str, int], seq_len: int, missing_token_idx: int) → List[int][source]

Transforms a single string into a sequence of token ids

Parameters:
  • string – a sequence of symbols as string
  • token_to_idx – Dict[str, int] with mapping from token to token index
  • seq_len – length of sequence
  • missing_token_idx – index for missing symbol
Returns:

List[int] with transformed values
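A minimal sketch with hypothetical strings:

import pandas as pd
from datawig.column_encoders import SequentialEncoder

df = pd.DataFrame({'item_name': ['shoe', 'sneaker']})

encoder = SequentialEncoder('item_name', max_tokens=100, seq_len=10)
encoder.fit(df)               # builds the character histogram
seqs = encoder.transform(df)  # numpy array of shape (2, 10), padded/truncated to seq_len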

class datawig.column_encoders.TfIdfEncoder(input_columns: Any, output_column: str = None, max_tokens: int = 262144, tokens: str = 'chars', ngram_range: tuple = None, prefixed_concatenation: bool = True)[source]

TfIdf bag of word encoder for text data, using sklearn’s TfidfVectorizer

Parameters:
  • input_columns – List[str] with column names to be used as input for this ColumnEncoder
  • output_column – Name of output field, used as field name in downstream MxNet iterator
  • max_tokens – Number of feature buckets (dimensionality of sparse ngram vector). default 2**18
  • tokens – How to tokenize the input data, supports ‘words’ and ‘chars’.
  • ngram_range – length of ngrams to use as features
  • prefixed_concatenation – whether or not to prefix values with column name before concat
decode(col: pandas.core.series.Series) → pandas.core.series.Series[source]

Given a series of indices, decode it to input tokens

Parameters:col
Returns:pd.Series of tokens
fit(data_frame: pandas.core.frame.DataFrame)[source]

Fits the TfIdf vectorizer to the text columns of the data frame

Parameters:data_frame
Returns:
is_fitted() → bool[source]

Returns true if the column encoder does not require fitting (anymore or at all)

Parameters:self
Returns:True if the encoder is fitted
transform(data_frame: pandas.core.frame.DataFrame) → numpy.core.multiarray.array[source]

Transforms one or more string columns into Bag-of-words vectors.

Parameters:data_frame – pandas DataFrame with text columns
Returns:numpy array (rows by max_features)
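A minimal sketch with hypothetical data; unlike the hashing-based BowEncoder, this encoder is stateful and must be fit before transform (it is the encoder used when SimpleImputer is created with is_explainable=True):

import pandas as pd
from datawig.column_encoders import TfIdfEncoder

df = pd.DataFrame({'item_name': ['red shoe', 'blue shoe', 'green sneaker']})

encoder = TfIdfEncoder('item_name', tokens='words')
encoder.fit(df)                  # fits the underlying TfidfVectorizer
vectors = encoder.transform(df)  # numpy array (rows by max_features)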