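Understanding Feature Engineering

The second stumbling block when getting started with TensorFlow is feature engineering, that is, the tf.feature_column.xxx calls that show up throughout the sample code. To understand it, I went through some references and code; below is my summary.

What is feature engineering

Training a model always starts from raw data: N rows in which every column carries some meaning. Feature engineering is the process of converting the columns of that raw data into features.

What is a feature

In machine learning and pattern recognition, a feature is an individual, measurable property of a phenomenon being observed. Choosing informative, discriminating, and independent features is a key step in pattern recognition, classification, and regression problems.

Why raw data columns are not features

Features are derived from raw data columns. In terms of mapping, the relationship may be one-to-one or many-to-one (for example, the longitude and latitude columns combined into a single feature), and raw data may also contain redundant columns, such as age and date of birth. In terms of data type, raw data can be of any type, but a feature must be a number or a boolean. In terms of values, raw data can be anything at all, while features may need to be formatted or normalized. In terms of code, a feature should be expressed in the type that the TensorFlow framework prescribes, a feature_column. So a raw data column and a feature are not the same thing.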
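To make the difference concrete, here is a minimal sketch in plain Python (no TensorFlow) of turning one raw row into features. The column names, the sample values, and the height range used for scaling are all made up for illustration; they are not from any real dataset.

```python
from datetime import date

# Hypothetical raw record; every column name and value here is invented.
raw_row = {
    "birth_date": date(1990, 5, 17),
    "age": 30,            # redundant with birth_date, so it is dropped below
    "is_member": True,
    "height_cm": 172.0,
}


def min_max(value, low, high):
    """Scale value from the range [low, high] to [0.0, 1.0]."""
    return (value - low) / (high - low)


def to_features(row, height_range=(100.0, 220.0)):
    """Turn one raw row into purely numeric features."""
    return {
        "birth_year": float(row["birth_date"].year),        # date -> number
        "is_member": 1.0 if row["is_member"] else 0.0,      # bool -> 0/1
        "height_scaled": min_max(row["height_cm"], *height_range),
    }


features = to_features(raw_row)
print(features)
```

The redundant age column disappears, the date and the boolean become numbers, and the height is normalized, which are exactly the kinds of differences listed above.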

How to do feature engineering

st=>start: Start
e=>end: End
op1=>operation: Feature extraction
op2=>operation: Feature selection
op3=>operation: Feature construction
op4=>operation: Evaluate the model
cond=>condition: Good enough?
st->op1(right)->op2(right)->op3(right)->op4(right)->cond
cond(yes)->e
cond(no)->op2

Feature construction

Building numeric features with numeric_column

numeric_column takes the following parameters:

numeric_column(
    key,
    shape=(1,),
    default_value=None,
    dtype=tf.float32,
    normalizer_fn=None
)

Here are some examples:

import tensorflow as tf

numeric_feature_column = tf.feature_column.numeric_column(key="SepalLength")
# The default dtype is tf.float32; it can be set explicitly to tf.float64
numeric_feature_column = tf.feature_column.numeric_column(key="SepalLength", dtype=tf.float64)
# The shape of the feature can be specified with shape
vector_feature_column = tf.feature_column.numeric_column(key="Bowling", shape=[10])
matrix_feature_column = tf.feature_column.numeric_column(key="MyMatrix", shape=[10, 5])
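The normalizer_fn parameter is not exercised in the examples above. As a rough sketch of the kind of function one might pass there, the snippet below defines a z-score normalizer in plain Python. The mean and standard deviation are hypothetical numbers I picked for illustration; in practice they would be computed from the training data.

```python
# Hypothetical statistics for SepalLength, chosen only for illustration.
SEPAL_LENGTH_MEAN = 5.8
SEPAL_LENGTH_STD = 0.8


def zscore(x, mean=SEPAL_LENGTH_MEAN, std=SEPAL_LENGTH_STD):
    """Standardize x to zero mean and unit variance.

    Only arithmetic operators are used, so the same function should also
    work when x is a tensor rather than a Python float.
    """
    return (x - mean) / std


# With TensorFlow installed, the function would be passed like this:
# tf.feature_column.numeric_column(key="SepalLength", normalizer_fn=zscore)

print(zscore(5.8))
print(zscore(6.6))
```

A value equal to the mean maps to zero, and a value one standard deviation above the mean maps to roughly one.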

Building bucketized features with bucketized_column

bucketized_column takes the following parameters:

bucketized_column(
    source_column,  # must be a numeric_column
    boundaries
)

Here is an example:

# A numeric column for the raw input.
numeric_feature_column = tf.feature_column.numeric_column("Year")
# Bucketize the numeric column on the years 1960, 1980, and 2000
bucketized_feature_column = tf.feature_column.bucketized_column(
    source_column = numeric_feature_column,
    boundaries = [1960, 1980, 2000])

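Before creating the bucketized column, we first create a numeric column to represent the raw year.
We then pass that numeric column as the first argument to tf.feature_column.bucketized_column().
Specifying a three-element boundaries vector creates a four-element bucketized vector: one bucket each for years before 1960, from 1960 up to (but not including) 1980, from 1980 up to (but not including) 2000, and 2000 or later.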
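The bucketing rule can be sketched in plain Python with the standard bisect module. This is only an illustration of the mapping from a year to a bucket index and a one-hot vector, not TensorFlow's implementation; the helper names bucket_index and one_hot are my own.

```python
from bisect import bisect_right

BOUNDARIES = [1960, 1980, 2000]


def bucket_index(value, boundaries=BOUNDARIES):
    """Return the bucket a value falls into.

    Each bucket includes its left boundary and excludes its right one, so
    3 boundaries give 4 buckets:
    (-inf, 1960), [1960, 1980), [1980, 2000), [2000, +inf).
    """
    return bisect_right(boundaries, value)


def one_hot(index, depth):
    """Encode an index as a one-hot list of length depth."""
    return [1 if i == index else 0 for i in range(depth)]


for year in (1959, 1960, 1985, 2000):
    idx = bucket_index(year)
    print(year, idx, one_hot(idx, len(BOUNDARIES) + 1))
```

Note that a year exactly on a boundary, such as 1960, goes into the bucket that starts at that boundary.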

Building vocabulary-based categorical features with categorical_column_with_vocabulary_list

categorical_column_with_vocabulary_list takes the following parameters:

categorical_column_with_vocabulary_list(
    key,
    vocabulary_list,
    dtype=None,
    default_value=-1,
    num_oov_buckets=0
)

key: the name of the feature.
vocabulary_list: the list of categories used for the conversion; each value is mapped to its index in this list.
dtype: only string and integer types are supported; other types cannot be used with this column.
default_value: the ID returned for values that are not in vocabulary_list. It can only be specified when num_oov_buckets is 0.
num_oov_buckets: handles values that are not in vocabulary_list. If it is 0, default_value is used; if it is greater than 0, out-of-vocabulary values are hashed to an ID in the range [len(vocabulary_list), len(vocabulary_list)+num_oov_buckets).

Here is an example:

# Given input "feature_name_from_input_fn" which is a string,
# create a categorical feature to our model by mapping the input to one of
# the elements in the vocabulary list.
vocabulary_feature_column = (
    tf.feature_column.categorical_column_with_vocabulary_list(
        key="feature_name_from_input_fn",
        vocabulary_list=["kitchenware", "electronics", "sports"]))
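The lookup rules for default_value and num_oov_buckets can be sketched in plain Python as follows. This is only my illustration of the behavior described in the parameter list: vocabulary_id is a hypothetical helper, and zlib.crc32 is a stand-in hash chosen because it is deterministic, not the hash TensorFlow uses.

```python
import zlib

VOCABULARY = ["kitchenware", "electronics", "sports"]


def vocabulary_id(value, vocabulary=VOCABULARY, default_value=-1,
                  num_oov_buckets=0):
    """Map a string to a category ID following the rules described above."""
    if value in vocabulary:
        # In-vocabulary values map to their position in the list.
        return vocabulary.index(value)
    if num_oov_buckets == 0:
        # No out-of-vocabulary buckets: fall back to default_value.
        return default_value
    # Otherwise hash into [len(vocabulary), len(vocabulary) + num_oov_buckets).
    # zlib.crc32 is an assumption of this sketch, not TensorFlow's hash.
    bucket = zlib.crc32(value.encode("utf-8")) % num_oov_buckets
    return len(vocabulary) + bucket


print(vocabulary_id("electronics"))
print(vocabulary_id("toys"))
print(vocabulary_id("toys", num_oov_buckets=2))
```

With num_oov_buckets left at 0, every unknown word collapses to the same default ID; with a positive value, unknown words are spread over the extra buckets placed after the vocabulary.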

When the vocabulary is long, typing vocabulary_list by hand means a lot of manual input. So there is a variant of this method that reads the vocabulary from a file, as follows:

# Given input "feature_name_from_input_fn" which is a string,
# create a categorical feature to our model by mapping the input to one of
# the elements in the vocabulary file.
vocabulary_feature_column = (
    tf.feature_column.categorical_column_with_vocabulary_file(
        key="feature_name_from_input_fn",
        vocabulary_file="product_class.txt",
        vocabulary_size=3))

# product_class.txt should have one line per vocabulary element, in our case:
#   kitchenware
#   electronics
#   sports

Building hash-bucket features with categorical_column_with_hash_bucket

categorical_column_with_hash_bucket takes the following parameters:

categorical_column_with_hash_bucket(
    key,
    hash_bucket_size,
    dtype=tf.string
)

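Here is an example:

# Create categorical output for input "feature_name_from_input_fn".
# Category becomes: hash_value(<input value>) % hash_bucket_size
hashed_feature_column = (
    tf.feature_column.categorical_column_with_hash_bucket(
        key="feature_name_from_input_fn",
        hash_bucket_size=100))  # The number of categories

The idea behind hash buckets differs from the columns above. Those all start from a predefined partition and then build the feature, whereas with hash buckets I look at the data first, decide how many categories I want, and set hash_bucket_size to that number. The larger the size, the less likely it is that different values collide in the same bucket, so the bucketing is more precise, but the cost is also higher.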
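The trade-off between bucket count and collisions can be seen in a small plain-Python sketch. It uses zlib.crc32 as a deterministic stand-in for TensorFlow's hash function (an assumption of this sketch), and the category strings are made up.

```python
import zlib


def hash_bucket(value, hash_bucket_size):
    """Map a string to a bucket ID in [0, hash_bucket_size).

    zlib.crc32 is only a stand-in; TensorFlow uses a different hash.
    """
    return zlib.crc32(value.encode("utf-8")) % hash_bucket_size


# Hypothetical category values, invented for illustration.
categories = ["kitchenware", "electronics", "sports", "toys", "books"]

for size in (3, 100):
    ids = {c: hash_bucket(c, size) for c in categories}
    collisions = len(categories) - len(set(ids.values()))
    print("hash_bucket_size =", size, ids, "collisions:", collisions)
```

With five categories and only three buckets, at least two categories are forced to share a bucket; with a larger hash_bucket_size, collisions become less likely but the resulting feature gets wider.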

Building ID categorical features with categorical_column_with_identity

categorical_column_with_identity takes the following parameters:

categorical_column_with_identity(
    key,
    num_buckets,
    default_value=None
)

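This column is used in much the same way as the hash-bucket column, except that no hashing takes place: the integer input value is itself used as the category ID, so it must fall in [0, num_buckets). Here is an example: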

# Create a categorical output for input "feature_name_from_input_fn",
# which must be of integer type. Value is expected to be >= 0 and < num_buckets
identity_feature_column = tf.feature_column.categorical_column_with_identity(
    key='feature_name_from_input_fn', 
    num_buckets=4) # Values [0, 4)

# The 'feature_name_from_input_fn' above needs to match an integer key that is 
# returned from input_fn (see below). So for this case, 'Integer_1' or
# 'Integer_2' would be valid strings instead of 'feature_name_from_input_fn'.
# For more information, please check out Part 1 of this blog series.
def input_fn():
    ...<code>...
    return ({ 'Integer_1':[values], ..<etc>.., 'Integer_2':[values] },
            [Label_values])
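As a rough plain-Python sketch of the identity mapping: the integer value is used directly as the category ID and then one-hot encoded. The helpers identity_id and one_hot are hypothetical, and the sketch simply raises for any out-of-range value when no default_value is given, which simplifies TensorFlow's actual handling of such values.

```python
def identity_id(value, num_buckets, default_value=None):
    """Use the integer itself as the category ID.

    Simplification of this sketch: without default_value, any value
    outside [0, num_buckets) raises an error.
    """
    if 0 <= value < num_buckets:
        return value
    if default_value is None:
        raise ValueError(
            "value %d is outside [0, %d)" % (value, num_buckets))
    return default_value


def one_hot(index, depth):
    """Encode an index as a one-hot list of length depth."""
    return [1 if i == index else 0 for i in range(depth)]


for raw in (0, 3):
    print(raw, one_hot(identity_id(raw, num_buckets=4), 4))

# An out-of-range value falls back to default_value when one is given.
print(identity_id(7, num_buckets=4, default_value=0))
```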

Building weighted categorical features with weighted_categorical_column

weighted_categorical_column takes the following parameters:

weighted_categorical_column(
    categorical_column,
    weight_feature_key,
    dtype=tf.float32
)

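Here is a code sample:

import tensorflow as tf
from tensorflow import feature_column

color_data = {'color': [['R'], ['G'], ['B'], ['A']],
              'weight': [[1.0], [2.0], [4.0], [8.0]]}  # 4 sample rows

# 'A' is not in the vocabulary, so it maps to default_value (-1).
color_column = feature_column.categorical_column_with_vocabulary_list(
    'color', ['R', 'G', 'B'], dtype=tf.string, default_value=-1
)

# Use the values under the 'weight' key as per-category weights.
color_weight_categorical_column = feature_column.weighted_categorical_column(
    color_column, 'weight')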
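The idea behind the weighted column is that the weight takes the place of the 1 in a one-hot encoding. The plain-Python sketch below applies that idea to the same color_data. It is an illustration rather than TensorFlow's implementation: weighted_vector is a hypothetical helper, and treating the out-of-vocabulary value 'A' as contributing nothing is an assumption of this sketch.

```python
VOCABULARY = ["R", "G", "B"]


def weighted_vector(colors, weights, vocabulary=VOCABULARY):
    """Build one dense row: each category slot holds its summed weight."""
    vec = [0.0] * len(vocabulary)
    for color, weight in zip(colors, weights):
        # Assumption of this sketch: out-of-vocabulary entries are skipped.
        if color in vocabulary:
            vec[vocabulary.index(color)] += weight
    return vec


color_data = {"color": [["R"], ["G"], ["B"], ["A"]],
              "weight": [[1.0], [2.0], [4.0], [8.0]]}

rows = [weighted_vector(c, w)
        for c, w in zip(color_data["color"], color_data["weight"])]
for row in rows:
    print(row)
```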

Building crossed features with crossed_column

crossed_column takes the following parameters:

tf.feature_column.crossed_column(
    keys,
    hash_bucket_size,
    hash_key=None
)

Two features are combined to form a new feature. Here is a code sample:

import tensorflow as tf
from tensorflow import feature_column


def test_crossed_column():
    """Exercise crossed_column (uses the TF 1.x Session API)."""
    features = {
        'price': [['A', 'A'], ['B', 'D'], ['C', 'A']],
        'color': [['R', 'R'], ['G', 'G'], ['B', 'B']]
    }

    price = feature_column.categorical_column_with_vocabulary_list(
        'price', ['A', 'B', 'C', 'D'])
    color = feature_column.categorical_column_with_vocabulary_list(
        'color', ['R', 'G', 'B'])
    # Cross the two categorical columns and hash the result into 16 buckets.
    p_x_c = feature_column.crossed_column([price, color], 16)

    # Wrap the crossed column in an indicator column to get a dense tensor.
    p_x_c_indicator = feature_column.indicator_column(p_x_c)

    p_x_c_dense_tensor = feature_column.input_layer(features, [p_x_c_indicator])

    with tf.Session() as session:
        session.run(tf.global_variables_initializer())

        # The vocabulary lookup tables must be initialized as well.
        session.run(tf.tables_initializer())

        print('use input_layer' + '_' * 40)
        print(session.run([p_x_c_dense_tensor]))
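To show what crossing amounts to, here is a plain-Python sketch: within each row, every price value is paired with every color value, and each pair is hashed into one of hash_bucket_size buckets. The helper names, the "_X_" separator, and the use of zlib.crc32 are assumptions of this sketch, so the bucket IDs will not match what TensorFlow produces.

```python
import zlib
from itertools import product


def cross_ids(price_values, color_values, hash_bucket_size):
    """Hash every (price, color) pair of one row into a bucket ID."""
    ids = []
    for p, c in product(price_values, color_values):
        # The separator and crc32 are stand-ins chosen for this sketch.
        key = (p + "_X_" + c).encode("utf-8")
        ids.append(zlib.crc32(key) % hash_bucket_size)
    return ids


def multi_hot(ids, depth):
    """Count how often each bucket ID occurs in a row."""
    vec = [0] * depth
    for i in ids:
        vec[i] += 1
    return vec


features = {
    "price": [["A", "A"], ["B", "D"], ["C", "A"]],
    "color": [["R", "R"], ["G", "G"], ["B", "B"]],
}

for prices, colors in zip(features["price"], features["color"]):
    ids = cross_ids(prices, colors, 16)
    print(ids, multi_hot(ids, 16))
```

In the first row all four pairs are ('A', 'R'), so they land in the same bucket, while the other rows contain two distinct pairs each.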
