I have a single .txt file containing a lot of Arabic text, and I want to convert this file to an .arff file automatically, so I can use it in Weka to extract rules from it.
As my professor requested, I need 30 attributes, and each attribute should list all the words in the text file. Each line of data will contain a real sentence, split into comma-separated words; if the sentence has fewer than 30 words, the remaining fields will be filled with ?.
The arff file should look like the following:
@relation RelName
@attribute 'x1' {*will include all words in the text file*}
@attribute 'x2' {*will include all words in the text file*}
.
.
.
@attribute 'x30' {*will include all words in the text file*}
@data
Wordx,Wordy,Wordz,Wordq,Wordw,?,?,?,?,?...................,? (up to 30 words)
.
.
.
.
and so on
So is there any way to generate an .arff file in this format from a single .txt file automatically? Thank you for your help.
You can use the arff 0.9 package. It works with both Python 2.x and 3.x.
EG:
import arff
data = [[1,2,3], [10, 20, 30]]
arff.dump('result.arff', data, relation="test", names=['one', 'two', 'three'])
This call creates a relation named test with three attributes 'one', 'two' and 'three'. The first column will contain 1 and 10, the second 2 and 20, and the third 3 and 30.
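If you need the exact 30-word layout described in the question (every attribute listing the whole vocabulary, short sentences padded with ?), it may be easier to write the header and data out by hand. Here is a minimal sketch; the whitespace tokenizer, the one-sentence-per-line assumption, and the file names are my own assumptions:

```python
# Sketch: build the 30-attribute ARFF layout from a list of sentences.
# In practice the sentences would come from your .txt file; an inline
# example is used here so the sketch is self-contained.
N_ATTRS = 30

def to_arff(sentences, relation="RelName", n_attrs=N_ATTRS):
    # The nominal domain of every attribute is the full vocabulary.
    # ARFF treats ? as missing, so it is not listed in the domain.
    words = sorted({w for s in sentences for w in s.split()})
    vocab = ",".join(words)
    lines = ["@relation " + relation]
    for i in range(1, n_attrs + 1):
        lines.append("@attribute 'x%d' {%s}" % (i, vocab))
    lines.append("@data")
    for s in sentences:
        toks = s.split()[:n_attrs]
        toks += ["?"] * (n_attrs - len(toks))  # pad short sentences with ?
        lines.append(",".join(toks))
    return "\n".join(lines)

sentences = ["Wordx Wordy Wordz Wordq Wordw"]
with open("result.arff", "w", encoding="utf-8") as f:
    f.write(to_arff(sentences))
```

To use it on your file, read the sentences first, e.g. `sentences = [line.strip() for line in open("input.txt", encoding="utf-8") if line.strip()]`.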
Best
I have a lot of numeric values and, at the end, I want to predict a result.
My result can have the nominal values '0', '1' or 'x'.
What I would like to know is how I can get the best results.
Can some classifiers handle numeric attributes better than another?
And sometimes it seems that a classifier has a focus on a less interesting attribute...
Also, at the moment h. means home team and a. means away team. Would it be better to split this and add a location attribute, @attribute location {'h', 'a'}, so that 0 becomes 1 and vice versa?
@relation estimation
@attribute h.teamSize numeric
@attribute h.lineUpTeamFormation {'5-2-0-3-1' ... '6-2-0-4-1'}
@attribute h.teamRatingAVG numeric
@attribute h.teamRatingHighest numeric
@attribute h.teamRatingLowest numeric
@attribute h.teamRatingMed numeric
@attribute h.teamRatingMedRating numeric
@attribute h.lineUpTeamRating.att numeric
@attribute h.lineUpTeamRating.attMid numeric
@attribute h.lineUpTeamRating.mid numeric
@attribute h.lineUpTeamRating.defMid numeric
@attribute h.lineUpTeamRating.def numeric
@attribute h.lineUpTeamRatingAVG.att numeric
@attribute h.lineUpTeamRatingAVG.attMid numeric
@attribute h.lineUpTeamRatingAVG.mid numeric
@attribute h.lineUpTeamRatingAVG.defMid numeric
@attribute h.lineUpTeamRatingAVG.def numeric
@attribute h.lineUpTeamRatingHighest.att numeric
@attribute h.lineUpTeamRatingHighest.attMid numeric
@attribute h.lineUpTeamRatingHighest.mid numeric
@attribute h.lineUpTeamRatingHighest.defMid numeric
@attribute h.lineUpTeamRatingHighest.def numeric
@attribute h.lineUpTeamRatingLowest.att numeric
@attribute h.lineUpTeamRatingLowest.attMid numeric
@attribute h.lineUpTeamRatingLowest.mid numeric
@attribute h.lineUpTeamRatingLowest.defMid numeric
@attribute h.lineUpTeamRatingLowest.def numeric
@attribute a.teamSize numeric
@attribute a.lineUpTeamFormation {'5-2-0-3-1' ... '6-2-0-4-1'}
@attribute a.teamRatingAVG numeric
@attribute a.teamRatingHighest numeric
@attribute a.teamRatingLowest numeric
@attribute a.teamRatingMed numeric
@attribute a.teamRatingMedRating numeric
@attribute a.lineUpTeamRating.att numeric
@attribute a.lineUpTeamRating.attMid numeric
@attribute a.lineUpTeamRating.mid numeric
@attribute a.lineUpTeamRating.defMid numeric
@attribute a.lineUpTeamRating.def numeric
@attribute a.lineUpTeamRatingAVG.att numeric
@attribute a.lineUpTeamRatingAVG.attMid numeric
@attribute a.lineUpTeamRatingAVG.mid numeric
@attribute a.lineUpTeamRatingAVG.defMid numeric
@attribute a.lineUpTeamRatingAVG.def numeric
@attribute a.lineUpTeamRatingHighest.att numeric
@attribute a.lineUpTeamRatingHighest.attMid numeric
@attribute a.lineUpTeamRatingHighest.mid numeric
@attribute a.lineUpTeamRatingHighest.defMid numeric
@attribute a.lineUpTeamRatingHighest.def numeric
@attribute a.lineUpTeamRatingLowest.att numeric
@attribute a.lineUpTeamRatingLowest.attMid numeric
@attribute a.lineUpTeamRatingLowest.mid numeric
@attribute a.lineUpTeamRatingLowest.defMid numeric
@attribute a.lineUpTeamRatingLowest.def numeric
@attribute result {'0','1','x'}
@data
11.0,"4-1-1-4-1",1563.0046902930617,1716.018383910481,1493.642106150469,1542.5395864396032,1604.830245030475,1594.8952627985404,6230.782838756112,1552.485746007047,1716.018383910481,6098.869361751494,1594.8952627985404,1557.695709689028,1552.485746007047,1716.018383910481,1524.7173404378734,1594.8952627985404,1617.8284702417561,1552.485746007047,1716.018383910481,1542.4611979096933,1594.8952627985404,1493.642106150469,1552.485746007047,1716.018383910481,1510.4250125761928,11.0,"5-1-1-2-2",1588.961662996073,1747.6289170494754,1508.4062919834894,1565.5233012334515,1628.0176045164824,3459.80148294728,3079.552081457912,1542.4682316024448,1576.1754548839763,7820.5810420651915,1729.90074147364,1539.776040728956,1542.4682316024448,1576.1754548839763,1564.1162084130383,1747.6289170494754,1549.4953619285486,1542.4682316024448,1576.1754548839763,1613.8600439857894,1712.1725658978046,1530.0567195293636,1542.4682316024448,1576.1754548839763,1508.4062919834894,"x"
11.0,"4-2-2-2-1",1475.8094913912312,1502.0682887709222,1444.990021885439,1483.7603435487183,1473.5291553281807,1490.639636207262,2978.5093856157946,2950.4346148352724,2892.2037554297044,5922.117013215507,1490.639636207262,1489.2546928078973,1475.2173074176362,1446.1018777148522,1480.5292533038767,1490.639636207262,1492.9037337533382,1502.0682887709222,1447.2137335442653,1496.2886114276891,1490.639636207262,1485.6056518624566,1448.3663260643502,1444.990021885439,1460.927921231502,11.0,"4-1-2-2-2",1484.7390000692892,1512.2300048742143,1453.444107111614,1486.4669707831615,1482.837055992914,3013.771836727523,2964.5776806684476,2961.501146916992,1453.444107111614,5938.834229337606,1506.8859183637614,1482.2888403342238,1480.750573458496,1453.444107111614,1484.7085573344016,1512.2300048742143,1501.9409533482967,1493.2838448180084,1453.444107111614,1502.7776443004382,1501.5418318533088,1462.6367273201508,1468.2173020989835,1453.444107111614,1464.7837448131381,"1"
11.0,"6-0-1-2-2",1445.77970697302,1506.5657818615387,1393.7116666209088,1430.4622334716257,1450.1387242412238,2937.7942649521,3010.9183806060323,1402.8170557672368,0.0,8552.047075377852,1468.89713247605,1505.4591903030162,1402.8170557672368,NaN,1425.341179229642,1483.5459383871223,1506.5657818615387,1402.8170557672368,-1.0,1465.0738948215799,1454.248326564978,1504.3525987444937,1402.8170557672368,2.147483647E9,1393.7116666209088,11.0,"4-2-2-2-1",1430.4629022453128,1474.4893525633652,1404.2919287564614,1426.6619540429597,1439.3906406599133,1404.2919287564614,2864.6817220202643,2906.4018234232753,2831.550186683904,5728.166263814535,1404.2919287564614,1432.3408610101321,1453.2009117116377,1415.775093341952,1432.0415659536338,1404.2919287564614,1452.1579439472125,1474.4893525633652,1426.6619540429597,1458.4115214984754,1404.2919287564614,1412.5237780730517,1431.9124708599102,1404.8882326409444,1413.8219682802633,"x"
11.0,"6-1-1-2-1",1455.2875865157116,1533.8148260877508,1408.8080092768812,1454.6219157957269,1471.311417682316,1440.5588774260157,2975.472084744947,1454.6219157957269,1489.241573073469,8648.269000632668,1440.5588774260157,1487.7360423724735,1454.6219157957269,1489.241573073469,1441.3781667721114,1440.5588774260157,1533.8148260877508,1454.6219157957269,1489.241573073469,1475.4245410744663,1440.5588774260157,1441.6572586571963,1454.6219157957269,1489.241573073469,1408.8080092768812,11.0,"7-1-1-1-1",1478.6812699237746,1573.5345947486803,1376.2807543215677,1487.4841795952277,1474.907674535124,1573.5345947486803,1438.3659332206364,1510.946520366525,1376.2807543215677,10366.36616650411,1573.5345947486803,1438.3659332206364,1510.946520366525,1376.2807543215677,1480.90945235773,1573.5345947486803,1438.3659332206364,1510.946520366525,1376.2807543215677,1501.6224047599273,1573.5345947486803,1438.3659332206364,1510.946520366525,1376.2807543215677,1421.1718685458247,"0"
...
I hope that someone with experience can give me some advice.
Thus:
A good way to deal with numeric data
A good way to deal with lots of attributes
(I know there is no such thing as the best way, but I would already be happy with a good way.)
Kind regards
Trial and error may be the best way to determine which classifier is 'best'. It really comes down to a number of factors, such as the layout and preprocessing of the data, the amount of data, and how well the problem fits the classifier.
At a quick glance, you could try J48, neural networks or SVMs. The only part that might need changing is the formation attributes (perhaps split them into 5 numeric attributes?). Beyond that, many classifiers should be able to predict the nominal output from the numeric information supplied.
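If you do split the formation attribute, the preprocessing could be sketched like this outside Weka (the function name is my own):

```python
# Sketch: split a formation string such as "4-1-1-4-1" into 5 numeric
# fields so a classifier can treat each line of the formation
# (def / defMid / mid / attMid / att) as its own attribute.
def split_formation(formation):
    # Strip any surrounding quotes from the ARFF value, then split.
    return [int(part) for part in formation.strip('"').split("-")]

print(split_formation('"4-1-1-4-1"'))  # -> [4, 1, 1, 4, 1]
```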
As for the home vs away part, it looks fine as it stands, and it would likely be better to leave out the extra attribute. These types of problems typically favour the home team, but you already know who is home and who is away, so the extra attribute shouldn't add much to the model.
Have a play with what's available and see how you go. The results may surprise you!
I have tried using gsub to solve this, but it is proving too difficult: I do not know how to tell the function to return only certain numbers and not others.
My problem:
I have a large data frame with one column of test.comments for every performed test. Each entry is a large chunk of text, out of which only certain numbers are of interest to me.
Example:
** POSITIVE FOR BK VIRUS ** INTERPRETATION: A calculated 18,900,000,000 BKV genome equivalents per ml urine were detected in this patient specimen....... ............................................................................................................................................ 1 out of 10 test samples... Call 555-122-634 with Questions
What I would like to do is add the value 18,900,000,000 (but not the phone number and other random numbers) in a separate column.
Sometimes, the number is surrounded by _______:
** POSITIVE FOR BK VIRUS ** INTERPRETATION: A CALCULATED__33,400,000____BK VIRUS (BKV) GENOME EQUIVALENTS PER ML WERE DETECTED
In some cases, the number is also small:
A calculated 900 BK virus (BKV) genome equivalents per ml were detected in this patient specimen
or
** POSITIVE FOR BK VIRUS ** INTERPRETATION: A calculated__<250__________BK virus (BKV) genome equivalents per ml were detected in this patient specimen.
What I am hoping for is a robust command that would return
18,900,000,000
33,400,000
900
<250
It would also help to have a command that just returns numbers > 1,000; I could then manually edit the other cases.
But surely there is a more elegant solution?
edit:
Thank you for your help everyone, Sven's solution worked best for me!
Here's a possible solution with sub:
sub(".*?([<>]?[,0-9]+)[ _]+BK.*", "\\1", vec)
# [1] "18,900,000,000" "33,400,000" "900" "<250"
where vec is a vector containing the 4 examples.
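For anyone doing this extraction outside R, the same regex translates almost directly to Python's re module (the sample strings below are abbreviated versions of the ones in the question):

```python
import re

# Same idea as the R sub call: lazily skip to the value, capture an
# optional < or > plus a run of digits/commas, and require a run of
# spaces or underscores followed by "BK" right after it.
pattern = re.compile(r".*?([<>]?[,0-9]+)[ _]+BK.*", re.DOTALL)

samples = [
    "A calculated 18,900,000,000 BKV genome equivalents per ml urine were detected",
    "A CALCULATED__33,400,000____BK VIRUS (BKV) GENOME EQUIVALENTS PER ML WERE DETECTED",
    "A calculated 900 BK virus (BKV) genome equivalents per ml were detected",
    "A calculated__<250__________BK virus (BKV) genome equivalents per ml were detected.",
]
print([pattern.sub(r"\1", s) for s in samples])
# -> ['18,900,000,000', '33,400,000', '900', '<250']
```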
Two approaches so far; neither is completely robust, and I'm not sure how to fix them since I'm not much of a regexer.
p1 <- "** POSITIVE FOR BK VIRUS ** INTERPRETATION: A calculated 18,900,000,000 BKV genome equivalents per ml urine were detected in this patient specimen....... ............................................................................................................................................ 1 out of 10 test samples... Call 555-122-634 with Questions"
p2 <- "** POSITIVE FOR BK VIRUS ** INTERPRETATION: A CALCULATED__33,400,000____BK VIRUS (BKV) GENOME EQUIVALENTS PER ML WERE DETECTED"
p3 <- "A calculated 900 BK virus (BKV) genome equivalents per ml were detected in this patient specimen
** POSITIVE FOR BK VIRUS ** INTERPRETATION: A calculated__<250__________BK virus (BKV) genome equivalents per ml were detected in this patient specimen."
This first one doesn't grab the 900 in the third example string
pattern <- '(?:\\s+)*[\\d<>]((?:[\\d,])*(?![\\s-\\d]))'
regmatches(p1, gregexpr(pattern, p1, perl = TRUE))
# [[1]]
# [1] " 18,900,000"
regmatches(p2, gregexpr(pattern, p2, perl = TRUE))
# [[1]]
# [1] "33,400,000"
regmatches(p3, gregexpr(pattern, p3, perl = TRUE))
# [[1]]
# [1] "<250"
This second one grabs extra number strings in the first example but does grab the 900 in the third example
pattern <- "[\\d<>]((?:[\\d,])*)"
regmatches(p1, gregexpr(pattern, p1, perl = TRUE))
# [[1]]
# [1] "18,900,000,000" "1" "10" "555"
# [5] "122" "634"
regmatches(p2, gregexpr(pattern, p2, perl = TRUE))
# [[1]]
# [1] "33,400,000"
regmatches(p3, gregexpr(pattern, p3, perl = TRUE))
# [[1]]
# [1] "900" "<250"
This will pull out the targets in those examples (I added a fourth case):
dput(test)
c("** POSITIVE FOR BK VIRUS ** INTERPRETATION: A calculated 18,900,000,000 BKV genome equivalents per ml urine were detected in this patient specimen....... ............................................................................................................................................ 1 out of 10 test samples... Call 555-122-634 with Questions",
"** POSITIVE FOR BK VIRUS ** INTERPRETATION: A CALCULATED__33,400,000____BK VIRUS (BKV) GENOME EQUIVALENTS PER ML WERE DETECTED",
"A calculated 900 BK virus (BKV) genome equivalents per ml were detected in this patient specimen",
"** POSITIVE FOR BK VIRUS ** INTERPRETATION: A calculated__<250__________BK virus (BKV) genome equivalents per ml were detected in this patient specimen."
)
If this is not working well, I'd need better examples:
> gsub("(^[^>_0-9]+)([0-9,]{14}|[_]+[<0-9,]+[_]+|[,0-9]+ BK)(.+$)",
"\\2", test)
[1] "18,900,000,000 BK" "__33,400,000____" "900 BK"
[4] "__<250__________"
Then you can just remove the underscores and commas. The logic is that the reports seem to reserve a fixed-width field for the value: either it is 14 characters of digits and commas, or, if not all digits, it is padded on either side with underscores.
I have an .arff file which looks like this:
@relation training_set
@attribute URL string
@attribute DOI numeric
@attribute ISBN numeric
@attribute Conclusions numeric
@attribute Source_Type {Scientific, Non_Scientific}
@data
http://www.nejm.org/doi/full/10.1056/nejmra1002842 , 0 , 0 , 1 , 0 , Scientific
http://www.plosone.org/article/info%3adoi%2f10.1371%2fjournal.pone.0014270#pone-0014270-t003 , 1 , 0 , 1 , 0 , Scientific
I have a problem loading this file into Weka because the URL data contains a "%" sign. I know that % starts a comment in Weka, but is there a way to read in this kind of string? I am not making URL a nominal attribute because it is an identifier in the training set.
By wrapping the string in single quotes, I was able to load your file into Weka successfully (I also added another attribute to match the structure of your data):
@relation training_set
@attribute URL string
@attribute DOI numeric
@attribute ISBN numeric
@attribute Conclusions numeric
@attribute Binary numeric
@attribute Source_Type {Scientific, Non_Scientific}
@data
'http://www.nejm.org/doi/full/10.1056/nejmra1002842' , 0 , 0 , 1 , 0 , Scientific
'http://www.plosone.org/article/info%3adoi%2f10.1371%2fjournal.pone.0014270#pone-0014270-t003' , 1 , 0 , 1 , 0 , Scientific
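If the file is generated programmatically, the quoting can be automated. A small sketch in Python (the helper name is mine, and the backslash escaping of embedded single quotes follows Weka's quoted-string convention as I understand it):

```python
# Sketch: quote an ARFF string value so characters such as % (comment
# marker), commas and spaces are taken literally by Weka's loader.
def arff_quote(value):
    # Escape backslashes first, then embedded single quotes.
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"

url = "http://www.plosone.org/article/info%3adoi%2f10.1371%2fjournal.pone.0014270"
print(arff_quote(url) + ",1,0,1,0,Scientific")
```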
Hope this helps!