Friday, April 20, 2012

Informative features

Let's say you want to generate a synthetic dataset to play around with for classification, and you set

n_samples = 100
n_features = 1000



and you generate the following data:

import numpy as np
import matplotlib.pyplot as plt

# two Gaussian clusters: the second one is shifted by 5 in every feature
X1 = np.random.randn(n_samples // 2, n_features)
X2 = np.random.randn(n_samples // 2, n_features) + 5
X = np.append(X1, X2, axis=0)
np.random.shuffle(X)

# look at the data in the first two features
plt.scatter(X[:, 0], X[:, 1])
plt.show()
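
As a quick sanity check (this bit is my own addition, not part of the original snippet), the shape and the mean of the first column tell you roughly what came out:

print(X.shape)         # (100, 1000): n_samples rows, n_features columns
print(X[:, 0].mean())  # roughly 2.5, since half the rows are centred around 0 and half around 5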




For a binary classification, the function which determines our labels is \[y = \operatorname{sign}(X \cdot \omega),\] where \(\omega\) is our vector of coefficients.
For now, let's set all of the coefficients to zero:
coef = np.zeros(n_features)
If we want, say, 10 informative features, we can set 10 of the coefficients to a non-zero value. When we take the dot product with our data X, only the features paired with non-zero coefficients influence the labels, so those are the informative features; the rest are multiplied by zero and carry no information about the label.

So,

coef[:10] = 1
y = np.sign(np.dot(X, coef))


will give us our corresponding labels such that we have 10 informative features.
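
A quick way to inspect what came out (my own addition, not part of the original post):

# one label per sample, -1 or +1 depending on the sign of the dot product
print(y.shape)                           # (100,)
print(np.unique(y, return_counts=True))  # how many samples ended up in each class
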
A way to visualise this is to use scikit-learn's f_classif function.
If you have scikit-learn installed, do the following:

from sklearn.feature_selection import f_classif

# f_classif returns, for each feature, the ANOVA F-score and the corresponding p-value
F, pval = f_classif(X, y)
plt.plot(F)
plt.show()



Here you can see that the first 10 features get the highest F-scores and are rated as the most informative.
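
If you would rather read off the indices than eyeball the plot (this snippet is my own addition, not part of the original post), np.argsort gives the features ranked by F-score:

# feature indices sorted from highest to lowest F-score;
# the informative features (0 through 9 here) should sit at or near the front
ranking = np.argsort(F)[::-1]
print(ranking[:10])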
