In the Fairness and Explainability with SageMaker Clarify example, I am running a bias analysis on the 'Sex' facet, where the facet value is 0 and the label is 0:
bias_config = clarify.BiasConfig(label_values_or_threshold=[0],
                                 facet_name='Sex',
                                 facet_values_or_threshold=[0],
                                 group_name='Age')
This raises 2 questions:
How would I use it to detect bias on a multi-label dataset? (I tried label_values_or_threshold=[0, 1], but it didn't work.) Would I need to re-run the job, each time with a different label?
Similarly, if I want to detect bias for multiple facets (e.g., 'Sex' and 'Age'), would I need to run the bias detection job once for each facet_name?
How would we use it to detect bias on a multi-label dataset? (I tried label_values_or_threshold=[0, 1] but it didn't work.) Would we need to re-run the job, each time with a different label?
By "multi-label" do you mean "categorical" label or "multi-tags" label?
Clarify supports categorical labels. For example, if the label value is one of the enumerated values "Dog", "Cat", or "Fish", then you can specify label_values_or_threshold=["Dog", "Cat"] and Clarify will split the dataset into an advantaged group (samples with label value "Dog" or "Cat") and a disadvantaged group (samples with label value "Fish").
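As a minimal sketch (the facet settings here are simply carried over from the question for illustration, not part of the example notebook), that configuration could look like:

bias_config = clarify.BiasConfig(label_values_or_threshold=["Dog", "Cat"],  # advantaged label values
                                 facet_name='Sex',
                                 facet_values_or_threshold=[0])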
Clarify doesn't support multi-tag labels. By multi-tag I mean, for example, a dataset like the one below, where:
the features are N sentences extracted from a web page
the label is a set of tags describing what the web page is about. For example:
feature1, feature2, feature3, ..., label
"pop", "beatles", "jazz", ..., "music, beatles"
"iphone", "android", "browser", ..., "computer, internet, design"
"php", "python", "java", ..., "programming, java, web, internet"
Similarly, if we wanted to detect bias for multiple facets (e.g., 'Sex' and 'Age'), would we need to run the bias detection job for each facet_name?
Clarify supports multiple facets in a single run, although the configuration is not exposed by the SageMaker Python SDK API.
If you use the Processing Job API and compose the analysis_config.json yourself, you can provide a list of facet objects in the facet configuration entry (see Configure the Analysis). For example:
...
"facet": [
    {
        "name_or_index": "Sex",
        "value_or_threshold": [0]
    },
    {
        "name_or_index": "Age",
        "value_or_threshold": [40]
    }
],
...
If you have to use the SageMaker Python SDK API, a workaround is to append additional facets to the analysis config (not recommended, but currently there is no better way):
bias_config = clarify.BiasConfig(label_values_or_threshold=[0],
                                 facet_name='Sex',
                                 facet_values_or_threshold=[0])
# Append a second facet directly to the underlying analysis config.
bias_config.analysis_config['facet'].append({
    'name_or_index': 'Age',
    'value_or_threshold': [40],
})
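For completeness, a hedged sketch of how that patched bias_config could then be passed to a Clarify processing job (the role, session, S3 paths, and headers are placeholders, not from the original example):

from sagemaker import clarify

# Placeholders -- substitute your own role, session, S3 paths, and headers.
clarify_processor = clarify.SageMakerClarifyProcessor(role=role,
                                                      instance_count=1,
                                                      instance_type='ml.m5.xlarge',
                                                      sagemaker_session=session)

data_config = clarify.DataConfig(s3_data_input_path='s3://my-bucket/train.csv',
                                 s3_output_path='s3://my-bucket/clarify-output',
                                 label='Label',
                                 headers=['Label', 'Sex', 'Age'],
                                 dataset_type='text/csv')

# Runs the pre-training bias metrics using the bias_config patched above.
clarify_processor.run_pre_training_bias(data_config=data_config,
                                        data_bias_config=bias_config)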
The Google built-in object detection documentation/reference says that the num_classes argument should be set as follows:
E.g., for num_classes=5, the range of image/class/label in input tf.Example needs to be [0, 4].
Yet, most other resources (e.g., here) on how to create your own dataset in the object detection API world say that labels should start at 1, that is, for 5 classes they should be in [1, 5].
My questions are:
Is the example in the reference documentation correct, that is, should I use [0,4] for 5 classes?
Does it matter at all, i.e., can this break the training procedure?
Is the "built-in object detection" algorithm special in other ways or can I follow the "using your own dataset" function to create my TFrecord files?
It seems the labels should be in [1, 5].
The documentation has changed :)
See the updated documentation
under --> Hyperparameters --> num_classes
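For what it's worth, here is a hedged sketch of how one ground-truth box with a 1-based class label could be written into a tf.Example. The feature keys follow the common object-detection TFRecord convention; double-check them against whatever the built-in algorithm actually expects:

import tensorflow as tf

def _int64_list(values):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=values))

def _float_list(values):
    return tf.train.Feature(float_list=tf.train.FloatList(value=values))

# One object of class 3 (labels counted from 1) with a normalized bounding box.
example = tf.train.Example(features=tf.train.Features(feature={
    'image/object/class/label': _int64_list([3]),
    'image/object/bbox/xmin': _float_list([0.1]),
    'image/object/bbox/xmax': _float_list([0.4]),
    'image/object/bbox/ymin': _float_list([0.2]),
    'image/object/bbox/ymax': _float_list([0.5]),
}))
serialized = example.SerializeToString()  # write this to a TFRecord file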
We have a huge set of data in CSV format, containing a few numeric elements, like this:
Year,BinaryDigit,NumberToPredict,JustANumber, ...other stuff
1954,1,762,16, ...other stuff
1965,0,142,16, ...other stuff
1977,1,172,16, ...other stuff
The thing here is that there is a strong correlation between the third column and the columns before that. So I have pre-processed the data and it's now available in a format I think is perfect:
1954,1,762
1965,0,142
1977,1,172
What I want is a prediction of the value in the third column, using the first two as input. So in the case above, I want the input 1965,0 to return 142. In real life this file is thousands of rows, but since there's a pattern, I'd like to retrieve the most likely value.
So far I've set up a training job on the CSV file using the Linear Learner algorithm, with the following settings:
label_size = 1
feature_dim = 2
predictor_type = regression
I've also created a model from it, and set up an endpoint. When I invoke it, I get a score in return.
response = runtime.invoke_endpoint(EndpointName=ENDPOINT_NAME,
                                   ContentType='text/csv',
                                   Body=payload)
My goal here is to get the third column prediction instead. How can I achieve that? I have read a lot of the documentation regarding this, but since I'm not very familiar with AWS, I may well have used the wrong algorithm for what I am trying to do.
(Please feel free to edit this question to better suit AWS terminology)
For CSV input, the label should be in the first column, as mentioned here. So you should preprocess your data to put the label (the column you want to predict) on the left.
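A minimal sketch of that preprocessing step with pandas (the file names are placeholders; the column names are taken from the question):

import pandas as pd

df = pd.read_csv('data.csv')  # placeholder path

# Put the target column first; Linear Learner CSV training data has no header row.
df[['NumberToPredict', 'Year', 'BinaryDigit']].to_csv('train.csv',
                                                      header=False, index=False)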
Next, you need to decide whether this is a regression problem or a classification problem.
If you want to predict a number that's as close as possible to the true number, that's regression. For example, the truth might be 4, and the model might predict 4.15. If you need an integer prediction, you could round the model's output.
If you want the prediction to be one of a few categories, then you have a classification problem. For example, we might encode 'North America' = 0, 'Europe' = 1, 'Africa' = 2, and so on. In this case, a fractional prediction wouldn't make sense.
For regression, use 'predictor_type' = 'regressor' and for classification with more than 2 classes, use 'predictor_type' = 'multiclass_classifier' as documented here.
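As an illustration, a hedged sketch of setting those hyperparameters on a Linear Learner estimator with the SageMaker Python SDK (role, session, region, and S3 paths are placeholders):

import sagemaker
from sagemaker import image_uris

container = image_uris.retrieve('linear-learner', region='us-east-1')  # placeholder region

linear = sagemaker.estimator.Estimator(container,
                                       role=role,
                                       instance_count=1,
                                       instance_type='ml.m5.xlarge',
                                       output_path='s3://my-bucket/output',
                                       sagemaker_session=session)

# 'regressor' for a numeric target, 'multiclass_classifier' for >2 classes.
linear.set_hyperparameters(feature_dim=2,
                           predictor_type='regressor',
                           mini_batch_size=100)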
The output of regression will contain only a 'score' field, which is the model's prediction. The output of multiclass classification will contain a 'predicted_label' field, which is the model's prediction, as well as a 'score' field, which is a vector of probabilities representing the model's confidence. The index with the highest probability will be the one that's predicted as the 'predicted_label'. The output formats are documented here.
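As an illustration, a hedged sketch of invoking the endpoint and pulling the prediction out of the JSON response for the regression case (the endpoint name and payload are placeholders; the response layout is the documented Linear Learner format):

import json
import boto3

runtime = boto3.client('sagemaker-runtime')

response = runtime.invoke_endpoint(EndpointName=ENDPOINT_NAME,  # placeholder
                                   ContentType='text/csv',
                                   Body='1965,0')  # features only, no label

result = json.loads(response['Body'].read())
# Regression output looks like: {"predictions": [{"score": 142.3}]}
prediction = result['predictions'][0]['score']
print(prediction)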
A predictor_type of 'regressor' is not able to return a predicted label, according to the linear learner documentation:
For inference, the linear learner algorithm supports the application/json, application/x-recordio-protobuf, and text/csv formats. For binary classification models, it returns both the score and the predicted label. For regression, it returns only the score.
For more information on input and output file formats, see Linear Learner Response Formats for inference, and the Linear Learner Sample Notebooks.
I'm looking to perform feature selection with a multi-label dataset using sklearn. I want to get the final set of features across labels, which I will then use in another machine learning package. I was planning to use the method I saw here, which selects relevant features for each label separately.
from sklearn.svm import LinearSVC
from sklearn.feature_selection import chi2, SelectKBest
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

clf = Pipeline([('chi2', SelectKBest(chi2, k=1000)),
                ('svm', LinearSVC())])
multi_clf = OneVsRestClassifier(clf)
I then plan to extract the indices of the included features, per label, using this:
selected_features = []
for i in multi_clf.estimators_:  # available after calling multi_clf.fit(X, Y)
    selected_features += list(i.named_steps["chi2"].get_support(indices=True))
Now, my question is, how do I choose which selected features to include in my final model? I could use every unique feature (which would include features that were only relevant for one label), or I could do something to select features that were relevant for more labels.
My initial idea is to create a histogram of the number of labels a given feature was selected for, and to identify a threshold based on visual inspection. My concern is that this method is subjective. Is there a more principled way of performing feature selection for multilabel datasets using sklearn?
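For concreteness, a minimal sketch of that counting idea (the threshold of 3 labels is an arbitrary placeholder):

from collections import Counter

# selected_features is the flat list of per-label feature indices from above.
counts = Counter(selected_features)

# Keep features that were selected for at least `min_labels` labels.
min_labels = 3  # placeholder threshold
final_features = sorted(idx for idx, c in counts.items() if c >= min_labels)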
According to the conclusions in this paper:
[...] rank features according to the average or the maximum
Chi-squared score across all labels, led to most of the best
classifiers while using less features.
Then, in order to select a good subset of features you just need to do (something like) this:
import numpy as np
from sklearn.feature_selection import chi2, SelectKBest

selected_features = []
for label in labels:
    selector = SelectKBest(chi2, k='all')
    selector.fit(X, Y[label])
    selected_features.append(list(selector.scores_))

# MeanCS
selected_features = np.mean(selected_features, axis=0) > threshold
# MaxCS
selected_features = np.max(selected_features, axis=0) > threshold
Note: in the code above I'm assuming that X is the output of some text vectorizer (the vectorized version of the texts) and Y is a pandas dataframe with one column per label (so I can select the column Y[label]). Also, there is a threshold variable that should be fixed beforehand.
http://scikit-learn.org/stable/modules/feature_selection.html
There is a multitude of options, but SelectKBest and Recursive feature elimination are two reasonably popular ones.
RFE works by iteratively leaving uninformative features out of the model, retraining, and comparing the results, so that the features left at the end are the ones that enable the best prediction accuracy.
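A minimal sketch of RFE with scikit-learn (the estimator, step size, and number of features to keep are arbitrary choices for illustration; for multi-label data you would fit it per label column):

from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

# Drop 10% of the remaining features per iteration until 100 are left.
rfe = RFE(estimator=LinearSVC(), n_features_to_select=100, step=0.1)
rfe.fit(X, y)  # y is a single label column here

selected_indices = rfe.get_support(indices=True)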
What is best is highly dependent on your data and use case.
Aside from what can loosely be described as cross-validation approaches to feature selection, you can look at Bayesian model selection, which is a more theoretical approach that tends to favor simpler models over complex ones.
I have a date and 3 other elements for each job, which Python reads from a txt file, and now I want to use this information to create a diagram with Bokeh.
How can I use the date format (x-axis: start and end time of each job) and string formats (y-axis: 3 elements for each job) in Bokeh?
Does anyone know of a working example for the Step Line chart type which shows how to build the necessary data structure?
EDIT: the original answer below is obsolete, as the ggplot compat layer was removed from Bokeh many years ago. However, Bokeh now has its own built-in Step glyph:
https://docs.bokeh.org/en/latest/docs/user_guide/plotting.html#step-lines
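A minimal sketch using that glyph (the data values are made up; a datetime x-axis works as long as x_axis_type='datetime' is set on the figure):

from datetime import datetime
from bokeh.plotting import figure, show

# Made-up job start times and a numeric stand-in for the y categories.
x = [datetime(2023, 1, 1, 8), datetime(2023, 1, 1, 10), datetime(2023, 1, 1, 13)]
y = [1, 3, 2]

p = figure(x_axis_type='datetime', width=600, height=300)
p.step(x, y, mode='after', line_width=2)
show(p)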
OBSOLETE ANSWER:
I'm not sure if this is what you are asking for or not, but you can
use ggplot.py to generate a step-chart and then output to Bokeh:
http://docs.bokeh.org/docs/gallery/step.html
There will probably be a native step chart in Bokeh.charts later this
year (or sooner, if an outside contributor pitches in).
You would need to add your own data to this part:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": range(100),
    "y": np.random.choice([-1, 1], 100)
})
I have two data sets, one for training and one for testing.
I am going to predict the values of a numerical column in the test data set. In order to predict the value of an instance, I have to find the k nearest neighbors of that instance in the training data set and calculate the average of their values (weighting can also be used).
For example:
column0    column1    column2
a          b          10
a          b          12
c          d          16
a          b          ?
I need a method of data mining to give me the result = (10+12)/2 = 11
Which method should I use to get such a result?
And do you know any good document which explains how to use that method?
KNN in Weka is implemented as IBk. It is capable of predicting numerical and nominal values.
If you are using the Weka Explorer (GUI), you can find it by looking for the "Choose" button under the Classify tab. Once there, navigate the folders:
classifiers -> lazy -> IBk
Once you select IBk, click on the box immediately to the right of the button. This will open up a large number of options. If you then click on the "More" button in the options window, you will see all of the options explained. If you need more of an explanation of the classifier, they even list the academic paper that the classifier is based on. You can do this for all of the classifiers to obtain additional information.
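If it helps, the same averaging can also be reproduced outside Weka; here is a hedged sketch with scikit-learn's KNeighborsRegressor (the one-hot encoding of the categorical columns is my own assumption, not part of the question):

from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import OneHotEncoder

# Training rows from the question: (column0, column1) -> column2
X_train = [['a', 'b'], ['a', 'b'], ['c', 'd']]
y_train = [10, 12, 16]

encoder = OneHotEncoder()
X_enc = encoder.fit_transform(X_train).toarray()

# k=2: the prediction is the average of the two nearest neighbours' values.
knn = KNeighborsRegressor(n_neighbors=2)
knn.fit(X_enc, y_train)

print(knn.predict(encoder.transform([['a', 'b']]).toarray()))  # -> [11.]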