Train and test sets are not compatible error in Weka

I'm trying to do text classification in Weka, but I'm having a lot of trouble getting the test set to work. Here's my training set (it's short, as I'm just starting to learn Weka!):
@relation sentiment
@attribute phrase string
@attribute value {pos, neg}
@data
'That was really unlucky', neg
'The car crashed horribly', neg
'The culpirit got away',neg
'Fortunally everyone made it out', pos
'She was glad noone was hurt',pos
'And the sun was at least shining',pos
I then apply StringToWordVector to the set, followed by NumericToBinary. This is the end result of the training set:
@relation 'sentiment-weka.filters.unsupervised.attribute.StringToWordVector-R1-W1000-prune-rate-1.0-N0-stemmerweka.core.stemmers.NullStemmer-M1-tokenizerweka.core.tokenizers.WordTokenizer -delimiters \" \\r\\n\\t.,;:\\\'\\\"()?!\"-weka.filters.unsupervised.attribute.NumericToBinary'
@attribute value {pos,neg}
@attribute And_binarized {0,1}
@attribute Fortunally_binarized {0,1}
@attribute She_binarized {0,1}
@attribute at_binarized {0,1}
@attribute everyone_binarized {0,1}
@attribute glad_binarized {0,1}
@attribute hurt_binarized {0,1}
@attribute it_binarized {0,1}
@attribute least_binarized {0,1}
@attribute made_binarized {0,1}
@attribute noone_binarized {0,1}
@attribute out_binarized {0,1}
@attribute shining_binarized {0,1}
@attribute sun_binarized {0,1}
@attribute the_binarized {0,1}
@attribute was_binarized {0,1}
@attribute That_binarized {0,1}
@attribute The_binarized {0,1}
@attribute away_binarized {0,1}
@attribute car_binarized {0,1}
@attribute crashed_binarized {0,1}
@attribute culpirit_binarized {0,1}
@attribute got_binarized {0,1}
@attribute horribly_binarized {0,1}
@attribute really_binarized {0,1}
@attribute unlucky numeric
@data
{0 neg,16 1,17 1,25 1,26 1}
{0 neg,18 1,20 1,21 1,24 1}
{0 neg,18 1,19 1,22 1,23 1}
{2 1,5 1,8 1,10 1,12 1}
{3 1,6 1,7 1,11 1,16 1}
{1 1,4 1,9 1,13 1,14 1,15 1,16 1}
I now start working on the testing set, which is:
@relation sentiment
@attribute phrase string
@data
'That was really unlucky'
'The car crashed horribly'
'The culpirit got away'
My hope is that Weka can classify this text as 'neg'. To make the sets compatible, I use the same filters as I did on the training set (StringToWordVector and NumericToBinary). This is the end result of the test set:
@relation 'sentiment-weka.filters.unsupervised.attribute.StringToWordVector-R1-W1000-prune-rate-1.0-N0-stemmerweka.core.stemmers.NullStemmer-M1-O-tokenizerweka.core.tokenizers.WordTokenizer -delimiters \" \\r\\n\\t.,;:\\\'\\\"()?!\"-weka.filters.unsupervised.attribute.NumericToBinary'
@attribute That_binarized {0,1}
@attribute The_binarized {0,1}
@attribute away_binarized {0,1}
@attribute car_binarized {0,1}
@attribute crashed_binarized {0,1}
@attribute culpirit_binarized {0,1}
@attribute got_binarized {0,1}
@attribute horribly_binarized {0,1}
@attribute really_binarized {0,1}
@attribute unlucky_binarized {0,1}
@attribute was numeric
@data
{0 1,8 1,9 1,10 1}
{1 1,3 1,4 1,7 1}
{1 1,2 1,5 1,6 1}
However, Weka gives me the error that the training set and test set are not compatible, and I can't figure out why. Intuitively, this seems like something Weka should be able to handle.
Thanks for any help!

Your training and test sets must have the same header, and right now they are different. Because you ran StringToWordVector on each file independently, the filter built a separate dictionary from each one, so the resulting attributes differ in name, order, and number. Instead, run the filters in batch mode so the test set is transformed using the dictionary learned from the training data, or wrap the filters and classifier in a FilteredClassifier, which handles this automatically.
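As a concrete illustration, here is a minimal Java sketch of the FilteredClassifier approach (the file names and the NaiveBayes choice are placeholders, not from the question). Note that the raw test file must declare the same two attributes as the training file, with ? in place of the unknown class value:
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.MultiFilter;
import weka.filters.unsupervised.attribute.NumericToBinary;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class SentimentExample {
    public static void main(String[] args) throws Exception {
        // Load the raw (unfiltered) sets; both declare phrase and value.
        Instances train = new DataSource("train.arff").getDataSet();
        Instances test = new DataSource("test.arff").getDataSet();
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        // Chain both filters; their dictionary comes from the training data only.
        MultiFilter filters = new MultiFilter();
        filters.setFilters(new Filter[] { new StringToWordVector(), new NumericToBinary() });

        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(filters);
        fc.setClassifier(new NaiveBayes());
        fc.buildClassifier(train);

        // Each test instance is transformed using the training header,
        // so the "not compatible" error cannot occur.
        for (int i = 0; i < test.numInstances(); i++) {
            double label = fc.classifyInstance(test.instance(i));
            System.out.println(test.classAttribute().value((int) label));
        }
    }
}
Alternatively, the filters can be run in batch mode from the command line; the -b, -r and -s options keep the test header aligned with the training header:
java weka.filters.unsupervised.attribute.StringToWordVector -b -i train.arff -o train_vec.arff -r test.arff -s test_vec.arff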

Related

Converting a single .txt file to an .arff file automatically

I have a single .txt file including a lot of Arabic text, and I want to convert this file to an .arff file automatically, so I can use it in Weka to get rules from it.
As my professor requested, I need to have 30 attributes, and each attribute's domain should contain all the words in the text file. Each line of data will hold a real sentence, separated into words using commas, and if a sentence includes fewer than 30 words, the remaining slots will be filled with ?.
The arff file should look like the following:
@relation RelName
@attribute 'x1'{*will include all words in the text file*}
@attribute 'x2'{*will include all words in the text file*}
.
.
.
@attribute 'x30'{*will include all words in the text file*}
@data
Wordx,Wordy,Wordz,Wordq,Wordw,?,?,?,?,?...................,? // up to 30 words
.
.
.
.
and so on
So, is there any way to generate an .arff file in this format from a single .txt file automatically? Thank you for your help.
You can use the arff 0.9 package, which works well with Python 2.x and 3.x.
For example:
import arff
# Each inner list is one data row.
data = [[1, 2, 3], [10, 20, 30]]
arff.dump('result.arff', data, relation="test", names=['one', 'two', 'three'])
This creates a relation named test with three attributes, 'one', 'two' and 'three'. The first column will contain 1 and 10, the second 2 and 20, and the third 3 and 30.
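Note that the snippet above writes numeric columns only; it will not emit the 30 identical nominal domains the question asks for. Below is a rough Java sketch of doing that by hand (the input.txt and output.arff names and the one-sentence-per-line input format are assumptions; words containing spaces or commas would additionally need quoting, and sentences longer than 30 words are truncated):
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class TxtToArff {
    public static void main(String[] args) throws Exception {
        // Assumes one sentence per line in input.txt (read as UTF-8).
        List<String> lines = Files.readAllLines(Paths.get("input.txt"));

        // Collect the vocabulary: every distinct word in the file.
        Set<String> vocab = new LinkedHashSet<>();
        for (String line : lines)
            for (String w : line.trim().split("\\s+"))
                if (!w.isEmpty()) vocab.add(w);
        String domain = "{" + String.join(",", vocab) + "}";

        try (PrintWriter out = new PrintWriter("output.arff", "UTF-8")) {
            out.println("@relation RelName");
            // All 30 attributes share the same nominal domain.
            for (int i = 1; i <= 30; i++)
                out.println("@attribute 'x" + i + "' " + domain);
            out.println("@data");
            for (String line : lines) {
                String[] words = line.trim().split("\\s+");
                StringBuilder row = new StringBuilder();
                for (int i = 0; i < 30; i++) {
                    if (i > 0) row.append(',');
                    // Pad short sentences with ? (the ARFF missing value).
                    row.append(i < words.length ? words[i] : "?");
                }
                out.println(row);
            }
        }
    }
}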

Weka - Is there a good way to handle (a lot of) numeric attributes for classifying a nominal value?

I have a lot of numeric values and, at the end, I want to predict a result.
My result can have the nominal values '0', '1' or 'x'.
What I'd like to know is how I can get the best results.
Can some classifiers handle numeric attributes better than others?
Sometimes it also seems that a classifier focuses on a less interesting attribute...
Also, at the moment h. means home team and a. means away team. Would it be better if I split this out into a separate attribute, @attribute location {'h','a'}, with 0 becoming 1 and vice versa?
@relation estimation
@attribute h.teamSize numeric
@attribute h.lineUpTeamFormation {'5-2-0-3-1' ... '6-2-0-4-1'}
@attribute h.teamRatingAVG numeric
@attribute h.teamRatingHighest numeric
@attribute h.teamRatingLowest numeric
@attribute h.teamRatingMed numeric
@attribute h.teamRatingMedRating numeric
@attribute h.lineUpTeamRating.att numeric
@attribute h.lineUpTeamRating.attMid numeric
@attribute h.lineUpTeamRating.mid numeric
@attribute h.lineUpTeamRating.defMid numeric
@attribute h.lineUpTeamRating.def numeric
@attribute h.lineUpTeamRatingAVG.att numeric
@attribute h.lineUpTeamRatingAVG.attMid numeric
@attribute h.lineUpTeamRatingAVG.mid numeric
@attribute h.lineUpTeamRatingAVG.defMid numeric
@attribute h.lineUpTeamRatingAVG.def numeric
@attribute h.lineUpTeamRatingHighest.att numeric
@attribute h.lineUpTeamRatingHighest.attMid numeric
@attribute h.lineUpTeamRatingHighest.mid numeric
@attribute h.lineUpTeamRatingHighest.defMid numeric
@attribute h.lineUpTeamRatingHighest.def numeric
@attribute h.lineUpTeamRatingLowest.att numeric
@attribute h.lineUpTeamRatingLowest.attMid numeric
@attribute h.lineUpTeamRatingLowest.mid numeric
@attribute h.lineUpTeamRatingLowest.defMid numeric
@attribute h.lineUpTeamRatingLowest.def numeric
@attribute a.teamSize numeric
@attribute a.lineUpTeamFormation {'5-2-0-3-1' ... '6-2-0-4-1'}
@attribute a.teamRatingAVG numeric
@attribute a.teamRatingHighest numeric
@attribute a.teamRatingLowest numeric
@attribute a.teamRatingMed numeric
@attribute a.teamRatingMedRating numeric
@attribute a.lineUpTeamRating.att numeric
@attribute a.lineUpTeamRating.attMid numeric
@attribute a.lineUpTeamRating.mid numeric
@attribute a.lineUpTeamRating.defMid numeric
@attribute a.lineUpTeamRating.def numeric
@attribute a.lineUpTeamRatingAVG.att numeric
@attribute a.lineUpTeamRatingAVG.attMid numeric
@attribute a.lineUpTeamRatingAVG.mid numeric
@attribute a.lineUpTeamRatingAVG.defMid numeric
@attribute a.lineUpTeamRatingAVG.def numeric
@attribute a.lineUpTeamRatingHighest.att numeric
@attribute a.lineUpTeamRatingHighest.attMid numeric
@attribute a.lineUpTeamRatingHighest.mid numeric
@attribute a.lineUpTeamRatingHighest.defMid numeric
@attribute a.lineUpTeamRatingHighest.def numeric
@attribute a.lineUpTeamRatingLowest.att numeric
@attribute a.lineUpTeamRatingLowest.attMid numeric
@attribute a.lineUpTeamRatingLowest.mid numeric
@attribute a.lineUpTeamRatingLowest.defMid numeric
@attribute a.lineUpTeamRatingLowest.def numeric
@attribute result {'0','1','x'}
@data
11.0,"4-1-1-4-1",1563.0046902930617,1716.018383910481,1493.642106150469,1542.5395864396032,1604.830245030475,1594.8952627985404,6230.782838756112,1552.485746007047,1716.018383910481,6098.869361751494,1594.8952627985404,1557.695709689028,1552.485746007047,1716.018383910481,1524.7173404378734,1594.8952627985404,1617.8284702417561,1552.485746007047,1716.018383910481,1542.4611979096933,1594.8952627985404,1493.642106150469,1552.485746007047,1716.018383910481,1510.4250125761928,11.0,"5-1-1-2-2",1588.961662996073,1747.6289170494754,1508.4062919834894,1565.5233012334515,1628.0176045164824,3459.80148294728,3079.552081457912,1542.4682316024448,1576.1754548839763,7820.5810420651915,1729.90074147364,1539.776040728956,1542.4682316024448,1576.1754548839763,1564.1162084130383,1747.6289170494754,1549.4953619285486,1542.4682316024448,1576.1754548839763,1613.8600439857894,1712.1725658978046,1530.0567195293636,1542.4682316024448,1576.1754548839763,1508.4062919834894,"x"
11.0,"4-2-2-2-1",1475.8094913912312,1502.0682887709222,1444.990021885439,1483.7603435487183,1473.5291553281807,1490.639636207262,2978.5093856157946,2950.4346148352724,2892.2037554297044,5922.117013215507,1490.639636207262,1489.2546928078973,1475.2173074176362,1446.1018777148522,1480.5292533038767,1490.639636207262,1492.9037337533382,1502.0682887709222,1447.2137335442653,1496.2886114276891,1490.639636207262,1485.6056518624566,1448.3663260643502,1444.990021885439,1460.927921231502,11.0,"4-1-2-2-2",1484.7390000692892,1512.2300048742143,1453.444107111614,1486.4669707831615,1482.837055992914,3013.771836727523,2964.5776806684476,2961.501146916992,1453.444107111614,5938.834229337606,1506.8859183637614,1482.2888403342238,1480.750573458496,1453.444107111614,1484.7085573344016,1512.2300048742143,1501.9409533482967,1493.2838448180084,1453.444107111614,1502.7776443004382,1501.5418318533088,1462.6367273201508,1468.2173020989835,1453.444107111614,1464.7837448131381,"1"
11.0,"6-0-1-2-2",1445.77970697302,1506.5657818615387,1393.7116666209088,1430.4622334716257,1450.1387242412238,2937.7942649521,3010.9183806060323,1402.8170557672368,0.0,8552.047075377852,1468.89713247605,1505.4591903030162,1402.8170557672368,NaN,1425.341179229642,1483.5459383871223,1506.5657818615387,1402.8170557672368,-1.0,1465.0738948215799,1454.248326564978,1504.3525987444937,1402.8170557672368,2.147483647E9,1393.7116666209088,11.0,"4-2-2-2-1",1430.4629022453128,1474.4893525633652,1404.2919287564614,1426.6619540429597,1439.3906406599133,1404.2919287564614,2864.6817220202643,2906.4018234232753,2831.550186683904,5728.166263814535,1404.2919287564614,1432.3408610101321,1453.2009117116377,1415.775093341952,1432.0415659536338,1404.2919287564614,1452.1579439472125,1474.4893525633652,1426.6619540429597,1458.4115214984754,1404.2919287564614,1412.5237780730517,1431.9124708599102,1404.8882326409444,1413.8219682802633,"x"
11.0,"6-1-1-2-1",1455.2875865157116,1533.8148260877508,1408.8080092768812,1454.6219157957269,1471.311417682316,1440.5588774260157,2975.472084744947,1454.6219157957269,1489.241573073469,8648.269000632668,1440.5588774260157,1487.7360423724735,1454.6219157957269,1489.241573073469,1441.3781667721114,1440.5588774260157,1533.8148260877508,1454.6219157957269,1489.241573073469,1475.4245410744663,1440.5588774260157,1441.6572586571963,1454.6219157957269,1489.241573073469,1408.8080092768812,11.0,"7-1-1-1-1",1478.6812699237746,1573.5345947486803,1376.2807543215677,1487.4841795952277,1474.907674535124,1573.5345947486803,1438.3659332206364,1510.946520366525,1376.2807543215677,10366.36616650411,1573.5345947486803,1438.3659332206364,1510.946520366525,1376.2807543215677,1480.90945235773,1573.5345947486803,1438.3659332206364,1510.946520366525,1376.2807543215677,1501.6224047599273,1573.5345947486803,1438.3659332206364,1510.946520366525,1376.2807543215677,1421.1718685458247,"0"
...
I hope that someone with experience can give me some advice.
In short, I'm looking for:
A good way to deal with numeric data
A good way to deal with lots of attributes
(I know that there isn't such a thing as the best way, but I'm already happy with a good way :)
Kind regards
Trial and error may be the best way to determine which classifier is 'best'. It really comes down to a number of factors, such as the layout and preprocessing of the data, the amount of data, and the fit of the problem to the classifier.
At a quick glance, you might be able to try J48, Neural Networks or SVMs. The only part that might need changing is the Formation attributes (split them into 5 attributes perhaps?). Besides this, a lot of classifiers would then be able to predict the nominal output based on the numeric information supplied.
As for the home vs away part, it looks good as it stands, and it would likely be better to leave out the extra attribute. These types of problems typically favour the home team, but the attributes already encode who is home and who is away, so an extra attribute shouldn't really add much to the model.
Have a play with what's available and see how you go. The results may surprise you!
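To make the trial and error systematic, you could cross-validate several candidates on the same file and compare accuracies. A minimal Java sketch under assumed names (estimation.arff is a placeholder; SMO is Weka's SVM and MultilayerPerceptron its neural network):
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.MultilayerPerceptron;
import weka.classifiers.functions.SMO;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareClassifiers {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("estimation.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // result is the last attribute

        Classifier[] candidates = { new J48(), new SMO(), new MultilayerPerceptron() };
        for (Classifier c : candidates) {
            // 10-fold cross-validation with a fixed seed for repeatable results.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1));
            System.out.printf("%s: %.2f%% correct%n",
                    c.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}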

R: How to extract some large numbers, but not others, from data frame

I have tried using gsub to solve this, but it is proving too difficult: I do not know how to tell the function to return only certain numbers and not others.
My problem:
I have a large data frame with a test.comments column containing one entry per performed test. Each entry is a large chunk of text, out of which only certain numbers are of interest to me.
Example:
** POSITIVE FOR BK VIRUS ** INTERPRETATION: A calculated 18,900,000,000 BKV genome equivalents per ml urine were detected in this patient specimen....... ............................................................................................................................................ 1 out of 10 test samples... Call 555-122-634 with Questions
What I would like to do is put the value 18,900,000,000 (but not the phone number and other random numbers) in a separate column.
Sometimes, the number is surrounded by _______:
** POSITIVE FOR BK VIRUS ** INTERPRETATION: A CALCULATED__33,400,000____BK VIRUS (BKV) GENOME EQUIVALENTS PER ML WERE DETECTED
In some cases, the number is also small:
A calculated 900 BK virus (BKV) genome equivalents per ml were detected in this patient specimen
or
** POSITIVE FOR BK VIRUS ** INTERPRETATION: A calculated__<250__________BK virus (BKV) genome equivalents per ml were detected in this patient specimen.
What I am hoping to have is a robust command that would return
18,900,000,000
33,400,000
900
<250
It would also help me to have a command that just returns numbers > 1,000 and I could manually edit the other cases.
But there must be a more elegant solution?!
edit:
Thank you for your help everyone, Sven's solution worked best for me!
Here's a possible solution with sub:
sub(".*?([<>]?[,0-9]+)[ _]+BK.*", "\\1", vec)
# [1] "18,900,000,000" "33,400,000" "900" "<250"
where vec is a vector containing the 4 examples.
Two approaches so far; neither is completely robust, and I'm not sure how to fix them since I'm not a great regexer:
p1 <- "** POSITIVE FOR BK VIRUS ** INTERPRETATION: A calculated 18,900,000,000 BKV genome equivalents per ml urine were detected in this patient specimen....... ............................................................................................................................................ 1 out of 10 test samples... Call 555-122-634 with Questions"
p2 <- "** POSITIVE FOR BK VIRUS ** INTERPRETATION: A CALCULATED__33,400,000____BK VIRUS (BKV) GENOME EQUIVALENTS PER ML WERE DETECTED"
p3 <- "A calculated 900 BK virus (BKV) genome equivalents per ml were detected in this patient specimen
** POSITIVE FOR BK VIRUS ** INTERPRETATION: A calculated__<250__________BK virus (BKV) genome equivalents per ml were detected in this patient specimen."
This first one doesn't grab the 900 in the third example string
pattern <- '(?:\\s+)*[\\d<>]((?:[\\d,])*(?![\\s-\\d]))'
regmatches(p1, gregexpr(pattern, p1, perl = TRUE))
# [[1]]
# [1] " 18,900,000"
regmatches(p2, gregexpr(pattern, p2, perl = TRUE))
# [[1]]
# [1] "33,400,000"
regmatches(p3, gregexpr(pattern, p3, perl = TRUE))
# [[1]]
# [1] "<250"
This second one grabs extra number strings in the first example but does grab the 900 in the third example
pattern <- "[\\d<>]((?:[\\d,])*)"
regmatches(p1, gregexpr(pattern, p1, perl = TRUE))
# [[1]]
# [1] "18,900,000,000" "1" "10" "555"
# [5] "122" "634"
regmatches(p2, gregexpr(pattern, p2, perl = TRUE))
# [[1]]
# [1] "33,400,000"
regmatches(p3, gregexpr(pattern, p3, perl = TRUE))
# [[1]]
# [1] "900" "<250"
This will pull out the targets in those examples (added fourth case):
dput(test)
c("** POSITIVE FOR BK VIRUS ** INTERPRETATION: A calculated 18,900,000,000 BKV genome equivalents per ml urine were detected in this patient specimen....... ............................................................................................................................................ 1 out of 10 test samples... Call 555-122-634 with Questions",
"** POSITIVE FOR BK VIRUS ** INTERPRETATION: A CALCULATED__33,400,000____BK VIRUS (BKV) GENOME EQUIVALENTS PER ML WERE DETECTED",
"A calculated 900 BK virus (BKV) genome equivalents per ml were detected in this patient specimen",
"** POSITIVE FOR BK VIRUS ** INTERPRETATION: A calculated__<250__________BK virus (BKV) genome equivalents per ml were detected in this patient specimen."
)
I'd need a better example if this doesn't work well:
> gsub("(^[^>_0-9]+)([0-9,]{14}|[_]+[<0-9,]+[_]+|[,0-9]+ BK)(.+$)",
"\\2", test)
[1] "18,900,000,000 BK" "__33,400,000____" "900 BK"
[4] "__<250__________"
Then you can just remove the underscores and commas. The logic is that the reports seem to reserve a preset width for the value: it is all digits and commas when it is 14 characters long, and otherwise it is padded on either side with underscores.

Handling % sign in the string attribute in weka

I have an .arff file which looks like this:
@relation training_set
@attribute URL string
@attribute DOI numeric
@attribute ISBN numeric
@attribute Conclusions numeric
@attribute Source_Type {Scientific, Non_Scientific}
@data
http://www.nejm.org/doi/full/10.1056/nejmra1002842 , 0 , 0 , 1 , 0 , Scientific
http://www.plosone.org/article/info%3adoi%2f10.1371%2fjournal.pone.0014270#pone-0014270-t003 , 1 , 0 , 1 , 0 , Scientific
I have a problem loading this file into Weka because there is a "%" sign in the URL data. I know that % marks a comment in ARFF, but is there a way to read this kind of string? I am not making URL a nominal attribute because it is an identifier in the training set.
By wrapping the string in single-quotes, I was able to load your file successfully into Weka (I also added another attribute to match the structure of your data):
@relation training_set
@attribute URL string
@attribute DOI numeric
@attribute ISBN numeric
@attribute Conclusions numeric
@attribute Binary numeric
@attribute Source_Type {Scientific, Non_Scientific}
@data
'http://www.nejm.org/doi/full/10.1056/nejmra1002842' , 0 , 0 , 1 , 0 , Scientific
'http://www.plosone.org/article/info%3adoi%2f10.1371%2fjournal.pone.0014270#pone-0014270-t003' , 1 , 0 , 1 , 0 , Scientific
Hope this helps!
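If you want to confirm the fix programmatically, a small Java check (the file name is assumed) will throw a parse exception on a malformed file and print a summary on a good one:
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadCheck {
    public static void main(String[] args) throws Exception {
        // Loading fails with an exception if the ARFF is malformed.
        Instances data = new DataSource("training_set.arff").getDataSet();
        System.out.println(data.numInstances() + " instances, "
                + data.numAttributes() + " attributes");
    }
}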

Unable to determine as arff (Reason: java.io.IOException: premature end of line, read Token [EOL], line 1182)

I have some data that I am processing and converting into a .arff file, as follows:
............
@attribute murdered_to numeric
@attribute envy.although_it numeric
@attribute vampire_that numeric
@attribute list_without numeric
@attribute award_at numeric
@attribute #% numeric
@attribute the_addict numeric
@attribute the_drag numeric
@attribute card_against numeric
@attribute communications_mainly numeric
@attribute clue_for numeric
@attribute justified.a numeric
@attribute superb_learning numeric
@attribute ford_escape numeric
@attribute a_life-changing numeric
.
.
.
This is just part of the attribute list. I need to open the .arff file in Weka, but it throws the error mentioned in the title, pointing to the line:
@attribute the_addict numeric
I cannot find what in the file is causing the error.
I'm pretty sure the error is actually in the line before the one you quoted:
@attribute #% numeric
The name of your attribute is invalid: it must start with an alphabetic character, as specified in the ARFF documentation that etov pointed to.
The format for the @attribute statement is:
@attribute <attribute-name> <datatype>
where the <attribute-name> must start with an alphabetic character. If spaces are to be included in the name, then the entire name must be quoted.
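For example, renaming the attribute to something that starts with a letter (the name below is just a hypothetical stand-in for whatever #% was meant to count) lets the file parse:
@attribute pct_special numeric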