Friesian Feature API¶
friesian.feature.table¶
- class zoo.friesian.feature.table.FeatureTable(df)[source]¶
Bases:
zoo.friesian.feature.table.Table
- add_feature(item_cols, feature_tbl, default_value)[source]¶
Look up the category (or another field) for each item from a separate FeatureTable that acts as a map.
- Parameters
item_cols – list[string]
feature_tbl – FeatureTable with two columns [category, item]
default_value – default value for the category if the key does not exist
- Returns
FeatureTable
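The lookup described above can be sketched in plain Python, without Friesian or Spark. This is only an illustration of the "map with a default" idea: the helper `add_feature_sketch`, the output column name `category`, and the sample data are all hypothetical and not part of the documented API.

```python
def add_feature_sketch(rows, item_col, item_to_category, default_value,
                       out_col="category"):
    """Return copies of rows with out_col looked up from item_to_category.

    Items missing from the mapping receive default_value.
    out_col is a hypothetical name chosen for this sketch.
    """
    return [
        {**row, out_col: item_to_category.get(row[item_col], default_value)}
        for row in rows
    ]


rows = [{"item": 1}, {"item": 2}, {"item": 99}]
item_to_category = {1: 10, 2: 20}  # plays the role of feature_tbl
result = add_feature_sketch(rows, "item", item_to_category, default_value=0)
# Item 99 is not in the mapping, so it receives the default value 0.
```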
- add_hist_seq(user_col, cols, sort_col='time', min_len=1, max_len=100)[source]¶
Generate a list of item visits in history
- Parameters
user_col – string, user column.
cols – list of string, columns to be aggregated
sort_col – string, sort by sort_col
min_len – int, minimal length of a history list
max_len – int, maximal length of a history list
- Returns
FeatureTable
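One plausible reading of the parameters above is sketched below in plain Python: visits are grouped by user, sorted by `sort_col`, and each visit receives the list of that user's earlier values, capped at `max_len` and dropped when shorter than `min_len`. The exact semantics of the library method are not stated here, so treat this as an assumption; the helper `hist_seq_sketch` and the `_hist_seq` column suffix are hypothetical.

```python
from collections import defaultdict


def hist_seq_sketch(rows, user_col, col, sort_col="time", min_len=1, max_len=100):
    """Attach to each visit the list of the same user's earlier values of col."""
    by_user = defaultdict(list)
    for row in rows:
        by_user[row[user_col]].append(row)

    out = []
    for user_rows in by_user.values():
        user_rows.sort(key=lambda r: r[sort_col])
        for i, row in enumerate(user_rows):
            # Keep at most max_len of the most recent earlier visits.
            history = [r[col] for r in user_rows[max(0, i - max_len):i]]
            if len(history) >= min_len:
                out.append({**row, col + "_hist_seq": history})
    return out


visits = [
    {"user": "u1", "item": 5, "time": 3},
    {"user": "u1", "item": 7, "time": 1},
    {"user": "u1", "item": 9, "time": 2},
    {"user": "u2", "item": 4, "time": 1},
]
result = hist_seq_sketch(visits, "user", "item", max_len=2)
# u1's first visit and u2's only visit have no history and are dropped (min_len=1).
```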
- add_length(col_name)[source]¶
Generate the length of a column
- Parameters
col_name – string.
- Returns
FeatureTable
- add_neg_hist_seq(item_size, item_history_col, neg_num)[source]¶
Generate a list of negative samples for each item in item_history_col
- Parameters
item_size – int, max of item.
item2cat – FeatureTable with a dataframe of item to category mapping
item_history_col – string, this column should be a list of visits in history
neg_num – int, for each positive record, add neg_num of negative samples
- Returns
FeatureTable
- add_negative_samples(item_size, item_col='item', label_col='label', neg_num=1)[source]¶
Generate negative item visits for each positive item visit
- Parameters
item_size – integer, max of item.
item_col – string, name of item column
label_col – string, name of label column
neg_num – integer, for each positive record, add neg_num of negative samples
- Returns
FeatureTable
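A minimal plain-Python sketch of negative sampling as described above: every positive visit is kept with label 1, and `neg_num` copies are added with a randomly drawn different item and label 0. Drawing item ids uniformly from 1 to `item_size` is an assumption, as are the helper name `add_negative_samples_sketch` and the `seed` argument, which are not part of the documented API.

```python
import random


def add_negative_samples_sketch(rows, item_size, item_col="item",
                                label_col="label", neg_num=1, seed=0):
    """Emit each positive row with label 1, followed by neg_num negatives."""
    if item_size < 2:
        raise ValueError("item_size must be at least 2 to draw a different item")
    rng = random.Random(seed)  # seeded only to make the sketch repeatable
    out = []
    for row in rows:
        out.append({**row, label_col: 1})
        for _ in range(neg_num):
            neg = rng.randint(1, item_size)
            while neg == row[item_col]:  # redraw until it differs from the positive
                neg = rng.randint(1, item_size)
            out.append({**row, item_col: neg, label_col: 0})
    return out


positives = [{"user": 1, "item": 3}, {"user": 2, "item": 1}]
samples = add_negative_samples_sketch(positives, item_size=5, neg_num=2)
```

With `neg_num=2`, each positive record is followed by two negative records, so the output has three times as many rows as the input.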
- encode_string(columns, indices)[source]¶
Encode columns with provided list of StringIndex
- Parameters
columns – str or a list of str, target columns to be encoded.
indices – StringIndex or a list of StringIndex, StringIndexes of target columns. The StringIndex should at least have two columns: id and the corresponding categorical column.
- Returns
A new FeatureTable which transforms categorical features into unique integer values with provided StringIndexes.
- gen_ind2ind(cols, indices)[source]¶
Generate a mapping between indices
- Parameters
cols – a list of str, target columns to generate StringIndex.
indices – list of StringIndex
- Returns
FeatureTable
- gen_string_idx(columns, freq_limit)[source]¶
Generate unique index value of categorical features
- Parameters
columns – str or a list of str, target columns to generate StringIndex.
freq_limit – int, dict or None. Categories with a count/frequency below freq_limit will be omitted from the encoding. Can be an integer, a dict, or None, for instance 15 or {'col_4': 10, 'col_5': 2}. None means all categories that appear will be encoded.
- Returns
List of StringIndex
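The effect of freq_limit, and how a generated index is then used for encoding, can be sketched in plain Python. This is an illustration under assumptions, not the library's implementation: ids starting at 1, ordering by descending frequency, and mapping unseen or omitted categories to 0 are all choices made for this sketch, and `gen_string_idx_sketch` is a hypothetical helper.

```python
from collections import Counter


def gen_string_idx_sketch(values, freq_limit=None):
    """Map each sufficiently frequent category to a unique integer id."""
    counts = Counter(values)
    kept = [
        category
        for category, n in counts.most_common()
        if freq_limit is None or n >= freq_limit  # below the limit: omitted
    ]
    return {category: i + 1 for i, category in enumerate(kept)}


values = ["a", "b", "a", "c", "a", "b"]
index = gen_string_idx_sketch(values, freq_limit=2)

# Encoding with the index; categories without an id fall back to 0 here.
encoded = [index.get(v, 0) for v in values]
```

Here "c" appears only once, so with `freq_limit=2` it is left out of the index. A dict-valued freq_limit would apply the same filter with a separate threshold per column.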
- join(table, on=None, how=None)[source]¶
Join this FeatureTable with another FeatureTable. This is a wrapper of the Spark DataFrame join.
- Parameters
table – FeatureTable
on – string, join on this column
how – string, the join type
- Returns
FeatureTable
- mask(mask_cols, seq_len=100)[source]¶
Mask mask_cols columns
- Parameters
mask_cols – list of string, columns to be masked with 1s and 0s.
seq_len – int, length of masked column
- Returns
FeatureTable
- mask_pad(padding_cols, mask_cols, seq_len=100)[source]¶
Mask and pad columns
- Parameters
padding_cols – list of string, columns to be padded with 0s.
mask_cols – list of string, columns to be masked with 1s and 0s.
seq_len – int, length of masked column
- Returns
FeatureTable
- pad(padding_cols, seq_len=100)[source]¶
Post-pad the columns in padding_cols
- Parameters
padding_cols – list of string, columns to be padded with 0s.
seq_len – int, length of padded column
- Returns
FeatureTable
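The padding and masking described for mask, mask_pad and pad can be sketched on a single sequence in plain Python. Post-padding appends 0s after the existing values, and the mask marks real positions with 1 and padded positions with 0. Truncating longer sequences by keeping their first seq_len elements is an assumption of this sketch, and `pad_sketch` and `mask_sketch` are hypothetical helpers.

```python
def pad_sketch(seq, seq_len=100):
    """Post-pad seq with 0s up to seq_len (longer sequences are truncated)."""
    seq = list(seq)[:seq_len]
    return seq + [0] * (seq_len - len(seq))


def mask_sketch(seq, seq_len=100):
    """Return 1 for each real position and 0 for each padded position."""
    n = min(len(seq), seq_len)
    return [1] * n + [0] * (seq_len - n)


history = [3, 4, 5]
padded = pad_sketch(history, seq_len=5)   # [3, 4, 5, 0, 0]
mask = mask_sketch(history, seq_len=5)    # [1, 1, 1, 0, 0]
```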
- class zoo.friesian.feature.table.StringIndex(df, col_name)[source]¶
Bases:
zoo.friesian.feature.table.Table
- classmethod read_parquet(paths, col_name=None)[source]¶
Loads Parquet files, returning the result as a StringIndex.
- Parameters
paths – str or a list of str. The path/paths to Parquet file(s).
col_name – str. The column name of the corresponding categorical column. If col_name is None, the file name will be used as col_name.
- Returns
A StringIndex.
- write_parquet(path, mode='overwrite')[source]¶
Write StringIndex to Parquet file
- Parameters
path – str. The path to the folder of the Parquet file. Note that the col_name will be used as basename of the Parquet file.
mode – str. append, overwrite, error or ignore. append: Append contents of this StringIndex to existing data. overwrite: Overwrite existing data. error: Throw an exception if data already exists. ignore: Silently ignore this operation if data already exists.
- class zoo.friesian.feature.table.Table(df)[source]¶
Bases:
object
- clip(columns, min=0)[source]¶
Clips continuous values so that they are no lower than a minimum bound. For instance, by setting min to 0, all negative values in columns will be replaced with 0.
- Parameters
columns – list of str, the target columns to be clipped.
min – int, the minimum value to clip values to: values less than this will be replaced with this value.
- Returns
A new Table in which values less than min are replaced with min.
- count()[source]¶
Returns the number of rows in this Table.
- Returns
The number of rows in current Table
- distinct()[source]¶
A wrapper of the DataFrame distinct method.
- Returns
A new Table that only has distinct rows
- drop(*cols)[source]¶
Returns a new Table that drops the specified column. This is a no-op if schema doesn’t contain the given column name(s).
- Parameters
cols – the name of the column to drop, or a list of names of the columns to drop.
- Returns
A new Table that drops the specified column.
- dropna(how='any', thresh=None, subset=None)[source]¶
Drop null values. A wrapper of the DataFrame dropna method.
- Returns
A new Table with the rows containing null values dropped, as selected by how, thresh and subset.
- fillna(value, columns)[source]¶
Replace null values.
- Parameters
value – int, long, float, string, or boolean. Value to replace null values with.
columns – list of str, the target columns to be filled. If columns=None and value is int, all columns of integer type will be filled. If columns=None and value is long, float, string or boolean, all columns will be filled.
- Returns
A new Table in which null values are replaced with the specified value.
- log(columns, clipping=True)[source]¶
Calculates the log of continuous columns.
- Parameters
columns – list of str, the target columns to calculate log.
clipping – boolean, if clipping=True, the negative values in columns will be clipped to 0 and log(x+1) will be calculated. If clipping=False, log(x) will be calculated.
- Returns
A new Table in which the values in columns are replaced with their logged values.
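The two modes of the clipping flag can be sketched on a plain list of numbers. With clipping=True, negative values are first clipped to 0 and log(x+1) is taken; with clipping=False, log(x) is taken directly. Using the natural logarithm is an assumption of this sketch, and `log_sketch` is a hypothetical helper, not part of the documented API.

```python
import math


def log_sketch(values, clipping=True):
    """Apply log(max(x, 0) + 1) when clipping is on, else plain log(x)."""
    if clipping:
        return [math.log(max(v, 0) + 1) for v in values]
    return [math.log(v) for v in values]


logged = log_sketch([-3, 0, math.e - 1])
# -3 is clipped to 0, so the first two results are both log(1) = 0.0.
```

Without clipping, zero or negative inputs have no real logarithm (math.log raises ValueError for them), which is why the clipped log(x+1) form is the safer default for count-like features.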
- merge_cols(columns, target)[source]¶
Merge column values as a list to a new col.
- Parameters
columns – list of str, the target columns to be merged.
target – str, the new column name of the merged column.
- Returns
A new Table in which columns are replaced by a new target column holding the merged list values.
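As an illustration of the merge described above, the following plain-Python sketch works on a single row represented as a dict: the source columns are removed and their values are collected, in order, into a list under the target name. The helper `merge_cols_sketch` and the sample row are hypothetical.

```python
def merge_cols_sketch(row, columns, target):
    """Replace the given columns of row with one list-valued target column."""
    merged = {k: v for k, v in row.items() if k not in columns}
    merged[target] = [row[c] for c in columns]
    return merged


row = {"id": 1, "a": 10, "b": 20}
merged = merge_cols_sketch(row, ["a", "b"], "ab")
# merged == {"id": 1, "ab": [10, 20]}
```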
- rename(columns)[source]¶
Rename columns with new column names
- Parameters
columns – dict. Name pairs. For instance, {'old_name1': 'new_name1', 'old_name2': 'new_name2'}.
- Returns
A new Table with new column names.
- show(n=20, truncate=True)[source]¶
Prints the first n rows to the console.
- Parameters
n – int, number of rows to show.
truncate – If set to True, truncate strings longer than 20 chars by default. If set to a number greater than one, truncates long strings to length truncate and align cells right.