and you set
n_samples = 100
n_features = 1000
and you generate the following data:
import numpy as np
import matplotlib.pyplot as plt
X1 = np.random.randn(n_samples // 2, n_features)  # first cluster, centred at 0
X2 = np.random.randn(n_samples // 2, n_features) + 5  # second cluster, centred at 5
X = np.append(X1, X2, axis=0)
For binary classification, the function that determines our labels is \[y = \operatorname{sign}(X \cdot \omega)\]
where \(\omega\) is our vector of coefficients, one per feature.
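As a small worked example of this formula, the sketch below uses a made-up 3x2 data matrix and an arbitrary coefficient vector. Both are chosen purely for illustration and are not taken from the data above:

```python
import numpy as np

# Hypothetical toy data: 3 samples, 2 features (values chosen for illustration).
X_toy = np.array([[1.0, -2.0],
                  [0.5, 3.0],
                  [-1.0, -1.0]])
omega = np.array([2.0, -1.0])  # arbitrary illustrative coefficients

scores = X_toy @ omega   # one weighted sum per sample: [4.0, -2.0, -1.0]
y_toy = np.sign(scores)  # labels in {-1, +1}: [1.0, -1.0, -1.0]
print(y_toy)
```

Each label is simply the sign of the weighted sum of that sample's features.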
For now, let's set our coefficients equal to a bunch of zeros:
coef = np.zeros(n_features)
If we want, say, 10 informative features, we can set 10 of our coefficients to a non-zero value. When we take the dot product of our data, X, with the coefficients, only those 10 features contribute to the result. The remaining features are multiplied by zero, so they have no influence on the labels and are not informative.
coef[:10] = 1
y = np.sign(np.dot(X, coef))
This gives us labels that depend on only the 10 informative features.
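To see that the zero coefficients really do switch features off, the sketch below rebuilds the data with a seeded generator (the seed and the names rng, n_informative and X_scrambled are assumptions added for this example), overwrites every feature except the first 10 with fresh noise, and checks whether the labels change:

```python
import numpy as np

n_samples, n_features, n_informative = 100, 1000, 10
rng = np.random.default_rng(0)  # seed is an arbitrary choice for reproducibility

# Two clusters of samples, as in the data above.
X1 = rng.standard_normal((n_samples // 2, n_features))
X2 = rng.standard_normal((n_samples // 2, n_features)) + 5
X = np.concatenate([X1, X2], axis=0)

# Only the first 10 coefficients are non-zero.
coef = np.zeros(n_features)
coef[:n_informative] = 1
y = np.sign(X @ coef)

# Replace all uninformative columns with unrelated noise and relabel.
X_scrambled = X.copy()
X_scrambled[:, n_informative:] = rng.standard_normal(
    (n_samples, n_features - n_informative)
)
y_scrambled = np.sign(X_scrambled @ coef)

labels_unchanged = np.array_equal(y, y_scrambled)
print(labels_unchanged)
```

Because the scrambled columns are multiplied by zero, they cannot affect the weighted sum, so the two label vectors agree.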
One way to inspect this is to use scikit-learn's f_classif function, which computes an ANOVA F-score and a p-value for each feature.
If you have scikit-learn installed, run the following:
from sklearn.feature_selection import f_classif
F, pval = f_classif(X, y)  # per-feature F-scores and p-values
Looking at the F-scores, the first 10 features should tend to score higher than the rest, which marks them as the informative ones.
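f_classif scores each feature with a one-way ANOVA F-statistic. As a rough illustration of what that score measures and how features can be ranked by it, here is a minimal NumPy-only sketch. The helper name anova_f and the seed are hypothetical and introduced only for this example, and the helper is a stand-in for scikit-learn's implementation, not a copy of it:

```python
import numpy as np

def anova_f(X, y):
    """One-way ANOVA F-statistic for each column of X, grouped by label y."""
    classes = np.unique(y)
    n, k = X.shape[0], classes.size
    grand_mean = X.mean(axis=0)
    ss_between = np.zeros(X.shape[1])
    ss_within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        class_mean = Xc.mean(axis=0)
        # Spread of the class means around the overall mean.
        ss_between += Xc.shape[0] * (class_mean - grand_mean) ** 2
        # Spread of the samples around their own class mean.
        ss_within += ((Xc - class_mean) ** 2).sum(axis=0)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Same setup as above, with an arbitrary seed for reproducibility.
n_samples, n_features = 100, 1000
rng = np.random.default_rng(0)
X = np.concatenate([
    rng.standard_normal((n_samples // 2, n_features)),
    rng.standard_normal((n_samples // 2, n_features)) + 5,
], axis=0)
coef = np.zeros(n_features)
coef[:10] = 1
y = np.sign(X @ coef)

F_scores = anova_f(X, y)
ranking = np.argsort(F_scores)[::-1]  # feature indices, highest score first
print(ranking[:10])
print(F_scores[:10].mean(), F_scores[10:].mean())
```

Comparing the mean score of the first 10 features with the mean of the rest is a quick way to check whether the informative features stand out.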