In data mining, what is a class label? Please give an example - data-mining

I don't understand what it means.
In a database, does a tuple mean a field value and an attribute mean a table field?
Am I correct?
And what is a class label in data mining?

Very short answer: a class label is the discrete attribute whose value you want to predict based on the values of other attributes. (Do read the rest of the answer.)
The term class label is usually used in the context of supervised machine learning, and in classification in particular, where one is given a set of examples of the form (attribute values, classLabel) and the goal is to learn a rule that computes the label from the attribute values. The class label always takes on a finite (as opposed to infinite) number of different values.
For a concrete example, we might be given a set of adults and want to predict whether each of them is homeless or not. Suppose the attributes are the highest educational level achieved and origin, so the examples are of the form (origin, educationalLevel; isHomeless):
(Manhattan, PhD; no)
(Brooklyn, Primary school; yes)
...
In this particular case, isHomeless is the class label. The goal is to learn a function that computes whether a person with given attribute values is homeless or not. (More specifically, to learn a function that makes as few mistakes as possible under a certain way of counting mistakes.)
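To make the (attribute values, class label) form concrete, here is a minimal Python sketch. The rows beyond the two shown above are made up for illustration, and `split_example` and `learn_rule` are hypothetical helper names; the "rule" is just a per-value majority vote on one attribute, not a serious learning algorithm.

```python
from collections import Counter, defaultdict

# Examples in the form (origin, educational_level, is_homeless).
# The last element is the class label. Rows 3 and 4 are invented for illustration.
examples = [
    ("Manhattan", "PhD", "no"),
    ("Brooklyn", "Primary school", "yes"),
    ("Brooklyn", "PhD", "no"),
    ("Manhattan", "Primary school", "yes"),
]

def split_example(example):
    """Separate the attribute values from the class label."""
    *attributes, label = example
    return tuple(attributes), label

def learn_rule(rows, attribute_index):
    """For each value of one attribute, predict the label seen most often with it."""
    counts = defaultdict(Counter)
    for row in rows:
        attributes, label = split_example(row)
        counts[attributes[attribute_index]][label] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}

# Learn a rule that looks only at the educational level (attribute index 1).
rule = learn_rule(examples, attribute_index=1)
print(rule)
```

The learned dictionary maps each educational level to a predicted isHomeless value, which is the "function from attribute values to class label" described above in its simplest possible form.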
The Wikipedia article Supervised learning gives a good description.
Regarding the other question: no, a tuple is the whole set of attribute values in a given row. For example, if you had a table person(id, name, surname), then a tuple representing the first row could be (0, 'Akhil', 'Mohan').

Basically, a class label (in classification) can be compared to a response variable (in regression): a value we want to predict in terms of other (independent) variables.
The difference is that a class label is usually a discrete/categorical variable (e.g. Yes/No, 0/1, etc.), whereas a response variable is normally a continuous/real-valued variable.
You can find more about regression and classification in relation to response variables and class labels at https://math.stackexchange.com/questions/141381/regression-vs-classification.

Take the example of an email spam filter. It classifies whether an email is spam or not, for which we define 2 classes: spam (class 1) and not spam (class 2). Both of these are class labels. In other words, if an email has certain attributes, then it belongs either to the spam class or to the not spam class.

Related

Confusion matrix in Weka

I want to calculate confusion matrix, f1 score, roc etc. But the Weka output is showing this. How can I get the confusion matrix, f1 score, roc, etc?
First of all, your dataset seems to have a numeric class attribute. Correlation coefficient is a statistic generated for regression models. A confusion matrix (which you want) is only computed for classification models.
Secondly, you are using ZeroR as classifier, which is not a very useful classifier (only for determining a baseline). ZeroR either predicts the mean class value (numeric class attribute) or the majority class (nominal class attribute).
Solutions:
Ensure that you are using the right attribute for your class. Assuming that you are using the Weka Explorer, check that the combobox on the Classify panel has the right attribute selected. On the command-line, use the -c flag to specify the index of the class attribute (1-based index, first and last can be used as well).
If you imported your data from a CSV file and the class attribute column contains only numeric values, then Weka will have left it as numeric (it doesn't know that this column represents a nominal attribute). In that case, make sure that you convert your class attribute to a nominal one, e.g., by using the NumericToNominal filter in the Preprocess panel.
Choose a different classifier, like RandomForest or J48, which tend to generate reasonable models with just the default parameters.
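The behaviour described above can be sketched outside Weka. The following is plain Python, not Weka's API: `zero_r` and `confusion_matrix` are hypothetical helpers that mimic a ZeroR-style baseline and count (actual, predicted) pairs. It also shows why numerically coded classes end up being treated as a regression target until they are converted to nominal values.

```python
from collections import Counter

def zero_r(train_labels):
    """ZeroR-style baseline: ignore all attributes and always predict one value."""
    if all(isinstance(v, (int, float)) for v in train_labels):
        mean = sum(train_labels) / len(train_labels)  # numeric class: predict the mean
        return lambda _row: mean
    majority = Counter(train_labels).most_common(1)[0][0]  # nominal class: majority
    return lambda _row: majority

def confusion_matrix(actual, predicted):
    """Rows are actual classes, columns are predicted classes."""
    labels = sorted(set(actual) | set(predicted))
    counts = Counter(zip(actual, predicted))
    return labels, [[counts[(a, p)] for p in labels] for a in labels]

# Classes coded as numbers are treated as numeric, so the "prediction" is a mean.
numeric_coded = [0, 1, 1, 1]
print(zero_r(numeric_coded)(None))

# After converting the codes to nominal (string) values, the majority class is predicted
# and a confusion matrix becomes meaningful.
nominal = [str(v) for v in numeric_coded]
predict = zero_r(nominal)
predicted = [predict(None) for _ in nominal]
labels, matrix = confusion_matrix(nominal, predicted)
print(labels, matrix)
```

Because the baseline predicts only one class, every count in the matrix falls into a single column, which is why ZeroR is only useful as a reference point.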

AWS Personalize items attributes

I'm trying to implement personalization and am having problems with the Items schema.
Imagine I'm Amazon: I have products, their brands and their categories. In what kind of Items schema should I include this information?
Should I include the brand name as a string categorical field? Or should I rather include the brand ID, as a string or as a numeric value? Or should I include both?
What about categories? I have the same questions.
Metadata Fields
Metadata includes string or non-string fields that aren't required or don't use a reserved keyword. Metadata schemas have the following restrictions:
Users and Items schemas require at least one metadata field.
Users and Interactions datasets can contain up to five metadata fields. An Items dataset can contain up to 50 metadata fields.
If you add your own metadata field of type string, it must include the categorical attribute. Otherwise, Amazon Personalize won't use the field when training a model.
https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html
There are simply 2 ways to include your metadata in Items/Users datasets:
If it can be represented as a numeric value, then provide the actual value if it makes sense.
If it can be represented as a string, then provide the string value and make sure that categorical is set to true.
But let's take a look at the question "Why do they need me to categorize my string metadata?". The answer is pretty simple.
Let's start with an example.
If your Items were Amazon.com products and you wanted to provide a ratings metadata field, then:
You could take all of the ratings, including the full review text sent by customers, and simply put that in as a metadata field.
You could take just the star ratings, calculate the average and put that in as a metadata field.
The second one probably makes more sense in general. Having random, long product reviews as metadata changes pretty much nothing. Personalize doesn't understand whether the review itself is good or bad, or whether the author also recommends another product, so it doesn't really add anything to the recommendations.
However, if you simply "cut" your dataset and calculate the average rating, as in the second option, then it makes a lot more sense. Maybe some of our customers like crappy products? Maybe they want to buy them because they are famous YouTubers and they create videos about them? Based on their previous interactions and much more, Personalize will be able to perform slightly better, because now it knows that this product has a rating of 5/5 or 3/5.
I wanted to show you that in some cases providing Items metadata as a string makes no sense. That's why your string metadata must be categorical. It means that it should come from a finite set of values, so that it adds some knowledge for Personalize about a given Item and about why some people might want to interact with it.
Going back to your question:
Should I include brand name as string as categorical field? Should I rather include brand ID as string or numeric? or should I include both?
I would simply go with the brand ID as a string. You could also go with the brand name, but a brand can be renamed while still being the same brand, so the ID is more stable. Also, two different brands could have the same name because they are present in different markets, and using the ID solves that as well.
The "categorical": true switch in your schema just tells Personalize:
Hey, do you see that string field? It's categorised, finite set of values. If you train a model for me, please include this one during the training, it's important!
And as the documentation says, if you provide a string metadata field which is not marked as categorical, then Personalize will "think" that:
Hmm.. this field is a string, it has pretty random values and it's not marked as categorical. It's probably just a leftover from Items export job. Let's ignore that.
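As a rough illustration of the advice above, here is what such an Items schema could look like, written as a Python dictionary and serialized to JSON. The overall layout follows the Avro-style schema format in the linked documentation as I understand it; the metadata field names BRAND_ID, CATEGORY_ID and AVERAGE_RATING are my own assumptions, and `uncategorized_string_fields` is a hypothetical helper, not part of any AWS SDK.

```python
import json

# Sketch of an Items schema. ITEM_ID is the required item identifier;
# the other field names are made up for this example.
items_schema = {
    "type": "record",
    "name": "Items",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "ITEM_ID", "type": "string"},
        {"name": "BRAND_ID", "type": "string", "categorical": True},
        {"name": "CATEGORY_ID", "type": "string", "categorical": True},
        {"name": "AVERAGE_RATING", "type": "float"},
    ],
    "version": "1.0",
}

def uncategorized_string_fields(schema, reserved=("ITEM_ID",)):
    """List string metadata fields that are not marked categorical."""
    return [
        field["name"]
        for field in schema["fields"]
        if field["type"] == "string"
        and field["name"] not in reserved
        and not field.get("categorical", False)
    ]

print(json.dumps(items_schema, indent=2))
print(uncategorized_string_fields(items_schema))
```

Running the check before uploading a schema is a cheap way to spot string fields that, per the quoted documentation, would otherwise not be used during training.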

Convert String attributes to numeric values in WEKA

I am new to Weka. My data contains a column of student names. I want to convert these names to numeric values over the whole column.
E.g.: suppose there are 10 names: abcd, cdef, xyz, etc. I want to preprocess the data so that each name corresponds to a distinct numeric value, so that abcd changes to 1, cdef changes to 2, etc.
Also, two or more rows can have the same name. In that case, the same name should have the same value.
Please help me...
Weka supports 4 non-relational attribute types: nominal, numeric, string and date. You can find out more about them in the Weka Manual (it can be found in the same folder where you downloaded Weka), chapter "The ARFF Header Section".
You should find out what is the type of the "student's name" attribute (probably string, but could be nominal), and decide what should be the type of the attribute with converted values (numeric, nominal, or string).
There can be 2 scenarios:
(1) If types of the existing and desired attributes are the same (string-string or nominal-nominal, i.e. you only want to change values, not attribute type), you could do so
(a) manually - open the data file in Weka Explorer, and click Edit... button, or
(b) write a small program using Weka's Attribute class functions value and setValue.
(2) Types are different - Weka attribute types cannot be converted, so you will have to create and insert a new attribute with the converted values, and delete the old attribute. An example of how to create a new attribute can be found at
http://weka.wikispaces.com/Programmatic+Use#Step.
As far as I understand, strictly converting names into a "numeric" type doesn't seem like the best approach within the context of WEKA. WEKA treats numeric attributes differently than it does "string" or "nominal" attributes (for example, to run certain "attribute selection" algorithms you cannot use "numeric" types; they need to be "discretized" or converted into nominal form).
So, for your case, I think you can convert your "string" names into the "nominal" type using the StringToNominal class (this class acts as a WEKA "filter" that converts a given "string" attribute into an attribute of type "nominal"). This will also take care of the repeating names: the list of "nominal" values for the names (generated after you apply this filter) will contain any given name only once, no matter how many times it appears.
"Nominal" attributes also have the advantage that implicitly, they do have a numeric representation (the index of the value within the set of values; similar to how the "enums" in Java have a numeric index). So, you can utilize that as the "numeric" information corresponding to the names (though as I said earlier, it's probably best to just use it as "nominal" attribute; really depends on your particular use case).
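The idea of a nominal value's index can be sketched in plain Python. This is not Weka code: `nominal_encode` is a hypothetical helper, and assigning indices in order of first appearance is an assumption of this sketch rather than a statement about how the filter orders its values.

```python
def nominal_encode(values):
    """Map each distinct value to an index; repeated values get the same index."""
    index_of = {}
    codes = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(index_of)  # next unused index
        codes.append(index_of[value])
    return list(index_of), codes

names = ["abcd", "cdef", "xyz", "abcd", "cdef"]
distinct, codes = nominal_encode(names)
print(distinct)
print(codes)
```

Each distinct name appears once in `distinct`, and `codes` holds the index for every row, so two rows with the same name always share the same number. Add 1 to each code if the numbering should start at 1 as in the question.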
I had the same problem as the one mentioned in the question, and I could "address" it in the following way.
I first applied the StringToNominal filter as mentioned before (don't forget to change the attribute range from "last" to "first-last"). Once that was done, I saved the dataset in LibSVM format, which changes the nominal values to numeric ones.
Then, if you close Weka and open the saved file again, you will have the same dataset with the same number of features, but they will be numeric. Now some changes should be made: first, normalize all the numeric values in the dataset using the Normalize filter. After that, apply the NumericToNominal filter to the last attribute.
Then, you will have a similar dataset with numeric values.
Hope this helps.

Designing sets of data and support class extension OO approach, in c++

I'm currently working on a project and a question came up about the design.
I have a class named Key which is composed of several Fields. Field is a base class, and its derived classes such as Age, Name, etc. implement Field. Inside the Key class there's an attribute which is an array of Fields, to hold different kinds of Fields.
class Key {
private:
    Field* fieldList;  // array of Fields held by this Key
};
I'm working on a team, and a design choice came up that I couldn't defend because I didn't know how to answer the following problem... or maybe the lack of one? I trust that you'll be able to open my mind on this.
The purpose of this Key class is to hold several fields. The class exists because I'm going to handle data of this kind:
(Name, Age....)
This is how i thought it would look already implemented:
Key myKey = Key();
Age newAge = Age(50);
myKey.add(newAge);
This is what the prototype of the add method of the Key class would look like:
void Key::add(Field);
As you may have assumed, since the Key class has an array of Fields, this method receives a Field, and since Age is also a Field because of inheritance, this works like a charm. The same can be said of the Name class and other classes that could come up in the future.
This is the same idea as in a database where you have rows with data and the columns belong to the attributes, so a same column has the same type of attribute.
We would also like to compare 2 Key's only by one of the Fields, for example:
Let's say i have 2 Key's with this data:
(John, 50) <- myKey1
(Paul, 60) <- myKey2
My method to do this would look like this:
myKey1.compareTo(myKey2, 2)
This would answer whether the 2nd attribute of myKey1 is bigger than, equal to or less than the one in myKey2.
There's a problem with this. When I used the add method, I added Fields of different types in arbitrary order, say Age first, then Name second, etc., to the Key object. My Key object added them to its internal array in order of appearance.
So when I use the compareTo method, nothing assures me that the 2nd elements of both objects' arrays hold the same kind of Field, say the Name Field. If they don't, it could be comparing a Name with an Age, because internally it only holds an array of Fields, which are all the same type as far as the Key class knows.
This was my approach, but what I couldn't answer is how to fix this problem.
Another member of my team proposed, that we implement a method for the key class for each of the existing fields, that is:
myKey.addAge(newAge);
myKey.addName(newName);
Inside it would still have the Field array, but this time the class can assure you that Age will go in the 1st position of the array and that Name will go in the 2nd position, because each method would make sure of it.
The obvious problem with this is that I would have to add a method for each type of Field that exists. That means that if in the future I wish to add, say, "birth date", and so create a new Date class, I'll have to add a method addDate, and so on and so on...
Another reason my team member gave when pointing out why my approach was bad is that "we can't trust an external user to add the Fields in the order they're supposed to be in".
So to conclude:
With the first approach, the Key class depends on the programmer who adds the Fields to make sure they are in the order they should be, but as a benefit there is no need to add a method for each type of Field.
With the second approach, the Key class makes sure the order is the right one by implementing a method for each type of Field that exists, but then with each new type of Field the class would grow bigger and bigger.
Any ideas on this? Is there a workaround?
Thanks in advance, and I apologize if I wasn't clear. I'll add new details if needed.
Expanding on @tp1's excellent idea of an ID field in the Field class and an enum, you can actually make it very flexible. If you are comfortable limiting the number of field types to 32, you could even take a set of flags as the ID in compareTo. Then you could compare multiple fields at the same time. Does that approach make sense?
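Here is a minimal sketch of that idea, written in Python for brevity rather than C++ (in C++ the same thing could be an enum of bit values tested with bitwise AND). All names here (`FieldId`, `field_id`, `compare_to`) are hypothetical, and the comparison order when several flags are set is an assumption of this sketch.

```python
from enum import IntFlag

class FieldId(IntFlag):
    """One bit per field type, so several IDs can be combined into one flag set."""
    NAME = 1
    AGE = 2

class Field:
    def __init__(self, field_id, value):
        self.field_id = field_id
        self.value = value

class Key:
    def __init__(self):
        self._fields = {}  # looked up by ID, so insertion order no longer matters

    def add(self, field):
        self._fields[field.field_id] = field

    def compare_to(self, other, ids):
        """Compare on every field whose flag is set in ids; return -1, 0 or 1."""
        for field_id in FieldId:
            if field_id in ids:
                a = self._fields[field_id].value
                b = other._fields[field_id].value
                if a != b:
                    return -1 if a < b else 1
        return 0

# The two keys add their fields in different orders on purpose.
my_key1 = Key()
my_key1.add(Field(FieldId.AGE, 50))
my_key1.add(Field(FieldId.NAME, "John"))

my_key2 = Key()
my_key2.add(Field(FieldId.NAME, "Paul"))
my_key2.add(Field(FieldId.AGE, 60))

print(my_key1.compare_to(my_key2, FieldId.AGE))
print(my_key1.compare_to(my_key2, FieldId.NAME | FieldId.AGE))
```

A single generic add still works for every Field type, and adding a new type only means adding one enum member rather than a new method on Key.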

Django: Query with F() into an object not behaving as expected

I am trying to navigate into the Price model to compare prices, but met with an unexpected result.
My model:
class ProfitableBooks(models.Model):
    price = models.ForeignKey('Price', primary_key=True)
In my view:
foo = ProfitableBooks.objects.filter(price__buy__gte=F('price__sell'))
Producing this error:
'ProfitableBooks' object has no attribute 'sell'
Is this your actual model or a simplification? I think the problem may lie in having a model whose only field is a primary key that is also a foreign key. If I try to parse that out, it seems to imply that it's essentially a field acting as a proxy for a queryset: you could never have more profitable books than prices because of the nature of primary keys. It also would seem to mean that your elided books field must have no overlap in prices due to the implied uniqueness constraints.
If I understand correctly, you're trying to compare two values in another model: price.buy vs. price.sell, and you want to know if this unpictured Book model is profitable or not. While I'm not sure exactly how the F() object breaks down here, my intuition is that F() is intended to facilitate a kind of efficient querying and updating where you're comparing or adjusting a model value based on another value in the database. It may not be equipped to deal with a 'shell' model like this which has no fields except a joint primary/foreign key and a comparison of two values both external to the model from which the query is conducted (and also distinct from the Book model which has the identifying info about books, I presume).
The documentation says you can use a join in an F() object as long as you are filtering and not updating, and I assume your price model has a buy and a sell field, so it seems to qualify. So I'm not 100% sure where this breaks down behind the scenes. But from a practical perspective, if you want to accomplish exactly the result implied here, you could just do a simple query on your price model, because again, there's no distinct data in the ProfitableBooks model (it only returns prices), and you're also implying that each price.buy and price.sell have exactly one corresponding book. So Price.objects.filter(buy__gte=F('sell')) gives the result you've requested in your snippet.
If you want to get results which are book objects, you should do a query like the one you've got here, but start from your Book model instead. You could put that query in a queryset manager called "profitable_books" or something, if you wanted to substantiate it in some way.
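To show what comparing two columns of the same row means at the database level, here is a small standalone sketch using Python's built-in sqlite3 module instead of Django. The table name, column names and sample rows are assumptions for illustration, and the SQL is only roughly what a filter like buy__gte=F('sell') corresponds to, not Django's exact output.

```python
import sqlite3

# In-memory database with a hypothetical price table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE price (id INTEGER PRIMARY KEY, buy REAL, sell REAL)")
conn.executemany(
    "INSERT INTO price (id, buy, sell) VALUES (?, ?, ?)",
    [(1, 10.0, 8.0), (2, 5.0, 9.0), (3, 7.0, 7.0)],  # made-up sample rows
)

# The comparison is done by the database, row by row, between two columns.
rows = conn.execute("SELECT id FROM price WHERE buy >= sell ORDER BY id").fetchall()
ids = [row[0] for row in rows]
print(ids)
conn.close()
```

The point of F() is that this column-to-column comparison happens inside the query, so no rows need to be loaded into Python just to compare their values.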