Friday, April 20, 2012

Informative features

Let's say you want to generate a synthetic dataset to play around with for classification, and you set

n_samples = 100
n_features = 1000



and you generate the following data:

import numpy as np
import matplotlib.pyplot as plt

# two Gaussian clusters: the second one is shifted by 5 in every feature
X1 = np.random.randn(n_samples // 2, n_features)
X2 = np.random.randn(n_samples // 2, n_features) + 5
X = np.append(X1, X2, axis=0)
np.random.shuffle(X)

# look at the data in the first two features
plt.scatter(X[:, 0], X[:, 1])
plt.show()
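
As a quick sanity check (this bit is my own addition, not part of the original snippet), the shape and the mean of the first column tell you roughly what came out:

print(X.shape)         # (100, 1000): n_samples rows, n_features columns
print(X[:, 0].mean())  # roughly 2.5, since half the rows are centred around 0 and half around 5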




For a binary classification, the function which determines our labels is \[y = \operatorname{sign}(X \cdot \omega),\] where \(\omega\) is our vector of coefficients.
For now, let's set all of the coefficients to zero:
coef = np.zeros(n_features)
If we want, say, 10 informative features, we can set 10 of the coefficients to a non-zero value. When we take the dot product with our data X, only the features paired with non-zero coefficients influence the labels, so those are the informative features; the rest are multiplied by zero and carry no information about the label.

So,

coef[:10] = 1
y = np.sign(np.dot(X, coef))


will give us our corresponding labels such that we have 10 informative features.
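
A quick way to inspect what came out (my own addition, not part of the original post):

# one label per sample, -1 or +1 depending on the sign of the dot product
print(y.shape)                           # (100,)
print(np.unique(y, return_counts=True))  # how many samples ended up in each class
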
A way to visualise this is to use scikit-learn's f_classif function.
If you have scikit-learn installed, do the following:

from sklearn.feature_selection import f_classif

# f_classif returns, for each feature, the ANOVA F-score and the corresponding p-value
F, pval = f_classif(X, y)
plt.plot(F)
plt.show()



Here you can see that the first 10 features get the highest F-scores and are rated as the most informative.
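
If you would rather read off the indices than eyeball the plot (this snippet is my own addition, not part of the original post), np.argsort gives the features ranked by F-score:

# feature indices sorted from highest to lowest F-score;
# the informative features (0 through 9 here) should sit at or near the front
ranking = np.argsort(F)[::-1]
print(ranking[:10])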
