How to convert a text file into ARFF format? - weka

I'm using WEKA tool for text classification, and I have to convert plain text files into ARFF format. However, I don't know how to do that. Can anyone please help me to convert a text file into ARFF format?
Thank you, Renklauf, for your response.
I didn't understand this point: "Since text editors like Notepad only allow a limited number of columns, you'll need to get something like Notepad++ to fit everything on one line." Can you please explain it briefly?
Suppose the text data is like a simple sport article like
" Basketball is a team sport, the objective being to shoot a ball through a basket horizontally positioned to score points while following a set of rules. Usually, two teams of five players play on a marked rectangular court with a basket at each width end. Basketball is one of the world's most popular and widely viewed sports" ...
This is my text document, and I want to convert it to ARFF format. After that, I need to use the ARFF file for SVM text classification.

For a document classification task, each document is one instance: its text is the value of a string attribute and must be enclosed in quotes. Suppose you have a corpus of 10 sports articles tagged as either pro-Yankees or pro-Red Sox, and you want a classifier that assigns one of those two labels automatically. You need to take each document, enclose it in quotes, place it on a single line, and then put its class value (yankees or red_sox) after the quoted string, separated by a comma.
@relation yankeesOrRedSox
@attribute article string
@attribute yankeesOrSox { yankees, red_sox }
@data
"text of article 1 here", yankees
.
.
.
"text of article 10 here", red_sox
It's key that the article is placed on a single line. When I began using Weka for text classification, this is a point that caused me a lot of frustration at first. Since text editors like Notepad only allow a limited number of columns, you'll need to get something like Notepad++ to fit everything on one line. Notepad++ has a Join Lines function that allows you to place a lot of text on a single line.
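A small script can handle the quoting and the single-line requirement instead of a text editor. The following is a minimal Python sketch, not part of the original answer: the function names `arff_quote` and `build_arff` are invented for illustration, and it assumes that Weka's ARFF reader accepts backslash-escaped double quotes inside a quoted string.

```python
def arff_quote(text):
    """Collapse a document onto one line and quote it as an ARFF string value."""
    # Newlines, tabs and runs of spaces all become single spaces.
    one_line = " ".join(text.split())
    # Assumption: backslash escaping is accepted for quotes inside the string.
    escaped = one_line.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def build_arff(documents, relation="yankeesOrRedSox", classes=("yankees", "red_sox")):
    """documents is a list of (text, label) pairs; returns the ARFF file content."""
    lines = [
        "@relation " + relation,          # ARFF header keywords start with '@'
        "",
        "@attribute article string",
        "@attribute yankeesOrSox {" + ",".join(classes) + "}",
        "",
        "@data",
    ]
    for text, label in documents:
        if label not in classes:
            raise ValueError("unknown class label: " + label)
        lines.append(arff_quote(text) + "," + label)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = [
        ("Basketball is a team sport,\nthe objective being to shoot a ball.", "yankees"),
        ('He said "go Sox" twice.', "red_sox"),
    ]
    print(build_arff(sample))
```

Reading each article from its own file and passing the contents to `build_arff` would produce the whole dataset in one go; the labels in the sample are placeholders.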
Hope this helps.


Where can I find pre trained word embeddings (English) in word2vec format of 50 dimensions?

Preferably it should be a txt file rather than a binary file. All of the pre-trained word embeddings I found were of 300+ dimensions.
Thank you
http://nlp.stanford.edu/data/glove.6B.zip
Download this file, which is in GloVe format, and convert it to word2vec format using this script: https://github.com/jroakes/glove-to-word2vec
It is plausibly the case that any 50 dimensions of a 300-dimensional model are still useful. So you could conceivably take a 300-dimensional set, in text, and patch the file to specify 50-dimensions and discard the last 250 dimensions of each line.
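Both ideas above (converting GloVe text to word2vec text, and keeping only the first 50 dimensions of a larger model) can be sketched in a few lines of Python. This is an illustration, not the linked script: the function name `glove_to_word2vec_lines` is hypothetical, and it assumes the only difference between the two text formats is the word2vec header line giving the vocabulary size and the number of dimensions.

```python
def glove_to_word2vec_lines(glove_lines, keep_dims=None):
    """Return word2vec-text lines (header first) built from GloVe-text lines.

    keep_dims=None keeps every dimension; an integer keeps only the first
    keep_dims values of each vector and discards the rest.
    """
    rows = []
    for line in glove_lines:
        parts = line.rstrip("\n").split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, values = parts[0], parts[1:]
        if keep_dims is not None:
            values = values[:keep_dims]
        rows.append(word + " " + " ".join(values))
    if not rows:
        return []
    dims = len(rows[0].split(" ")) - 1
    # Assumed word2vec text header: "<vocabulary size> <dimensions>"
    header = "%d %d" % (len(rows), dims)
    return [header] + rows


if __name__ == "__main__":
    demo = ["the 0.1 0.2 0.3\n", "cat 0.4 0.5 0.6\n"]
    print("\n".join(glove_to_word2vec_lines(demo, keep_dims=2)))
```

For a real file, the lines would come from opening the extracted text file with UTF-8 encoding. Whether the truncated vectors remain useful for a given task is something to check empirically.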

Regex expression for searching spaced/broken words in OCR PDFs (goo d ni g ht)

I need to search lots of OCR PDFs. I noticed that the words and sentences look perfect visually, but if I copy and paste the content, there are spaces which shouldn't be there!
I can see in the text: good night
If I copy and paste somewhere: goo d ni g ht
I would appreciate advice on handling this situation with a regular expression, considering:
a) The simple example for short words as \bgood night\b for goo d ni g ht
b) When there is a line break in the sentence. The regex cannot search from one line to the next in the PDF, even when both lines belong to the same paragraph. For example, I am looking for
\bthe sun set and the night comes\b, but the PDF content looks like this when pasted:
line 1: t he sun set an d th e
line 2: nig ht co m es
Many thanks,
Cadu
This random occurrence of spaces in the middle of words can happen in PDF.
The reason behind it is the complex format that PDF actually is.
You see, a PDF document is actually a container of instructions for rendering the text in a viewer.
Imagine instructions like:
go to position 50, 50.
draw the character 'G'
go to position 56, 50.
draw the character 'O'
etc
Whenever you select something in a viewer (for instance Adobe), the program has to figure out what content overlaps with your selection (already this is not an easy problem). If it's text, it then needs to decide where to add spaces and line-breaks. Different viewers (or software) might use different metrics for this. A typical one for instance is "insert a space if two characters are further apart than the width of the space character in the same font"
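The spacing metric quoted above can be written down as a short Python sketch. It is only an illustration of that one heuristic: the glyph tuples, the function name `join_glyphs`, and the numbers in the example are invented, and real viewers use more elaborate rules.

```python
def join_glyphs(glyphs, space_width):
    """Rebuild a line of text from positioned glyphs.

    glyphs is a list of (char, x_start, x_end) tuples on one line, sorted by
    x_start. A space is inserted when the gap to the previous glyph is wider
    than space_width.
    """
    text = []
    previous_end = None
    for char, x_start, x_end in glyphs:
        if previous_end is not None and x_start - previous_end > space_width:
            text.append(" ")
        text.append(char)
        previous_end = x_end
    return "".join(text)


if __name__ == "__main__":
    # The 'd' sits 4 units after the second 'o' (made-up coordinates).
    word = [("g", 0, 5), ("o", 5, 10), ("o", 10, 15), ("d", 19, 24)]
    print(join_glyphs(word, space_width=3))  # threshold smaller than the gap
    print(join_glyphs(word, space_width=5))  # threshold larger than the gap
```

The same glyph positions yield different text depending on the threshold, which is why two programs can disagree about where the spaces are.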
The point is, getting text out of a PDF document is always kind of guesswork. And if you add the fact that it's an OCR PDF, you are adding a further layer of difficulties.
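If the text has already been extracted or pasted, one possible workaround for the search itself is a pattern that tolerates whitespace between any two characters. The sketch below is an assumption-laden illustration rather than a complete fix: `spaced_pattern` is a hypothetical helper, it works on extracted text rather than inside a PDF viewer, and it drops the word boundaries, so "good night" will also match "goodnight".

```python
import re


def spaced_pattern(phrase):
    """Build a regex that matches phrase even if whitespace was inserted in it."""
    # Ignore the spaces in the search phrase itself; escape everything else.
    chars = [re.escape(c) for c in phrase if not c.isspace()]
    # \s* between every pair of characters tolerates stray spaces and line breaks.
    return re.compile(r"\s*".join(chars), re.IGNORECASE)


if __name__ == "__main__":
    pasted = "t he sun set an d th e\nnig ht co m es"
    match = spaced_pattern("the sun set and the night comes").search(pasted)
    print(match is not None)
```

Because `\s` also matches newlines, the same pattern covers case b), where the sentence continues on the next line.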

how to use weka in keyphrase extraction from text arguments

I am working on a project, "key phrase extraction from text arguments". I first cleaned the input and then determined a list of candidate phrases (around 300 in total) using the Stanford parser (POS tagging). Then I computed the feature values of every phrase. I followed these steps for every document in my dataset. How should I proceed now, i.e., how do I use WEKA to find keyphrases? How should I store the phrases and their feature values (TF-IDF) in WEKA? And how do I measure the efficiency of the final project?
WEKA does an excellent job, with little effort, on text classification tasks (like text categorization and clustering), in which the instances are relatively long pieces of text (anything from tweets to documents), and the classes (when available) are non-overlapping tags (e.g. thematic classes like economy/sports/..., spam/legitimate email, positive/negative in sentiment analysis, etc.).
However, WEKA does not directly fit term classification tasks like Part-of-Speech Tagging, Word Sense Disambiguation, Named Entity Recognition, or, in your case, keyphrase extraction. To apply WEKA, you need not only your original texts and the manually extracted keyphrases, but also a decision about the attributes that make those pieces of text actual keyphrases. You have to inspect your examples and decide, for instance, that the part of speech of the words in a keyphrase and of the surrounding words is important for guessing that a piece of text is a keyphrase.
I strongly recommend you take a look at the representation used in the datasets of the CoNLL NER shared tasks (CoNLL 2002 and 2003). Each word in a named entity is treated independently and marked as being at the start, in the middle, or at the end of the named entity. Additionally, the features you can use are the actual words, the surrounding words, and their parts of speech.
For instance, in the example of the NER 2003 dataset:
U.N. NNP I-NP I-ORG
official NN I-NP O
Ekeus NNP I-NP I-PER
heads VBZ I-VP O
for IN I-PP O
You have that the word "Ekeus" is an NNP, it is inside a Noun Phrase (I-NP), and it is a named entity of type "person" (I-PER). You can process this format to get an instance file in which you use the POS tag and the actual words in a two-word window:
@attribute word-2 string
@attribute word-1 string
@attribute word string
@attribute word+1 string
@attribute word+2 string
% Each postag attribute lists the full set of available POS tags
@attribute postag-2 {NNP, NN, ....}
@attribute postag-1 {NNP, NN, ....}
% ../..
% The class attribute lists the full set of possible NE tags
@attribute named-entity-class {O, I-PER, I-ORG, ...}
@data
"U.N.","official","Ekeus","heads","for",NNP,NN,NNP,VBZ,IN,I-PER
../..
As you can see, you have to decide the attributes you need for each word and to build windows with the attributes.
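Building those windows is mechanical once the CoNLL-style lines are parsed. Here is a minimal Python sketch of the idea, using the five sample lines above; the helper names (`parse_conll`, `window_rows`, `to_arff_row`) and the `_` padding value for positions outside the sentence are assumptions made for illustration.

```python
def parse_conll(lines):
    """Parse 'word POS chunk NE' lines, keeping (word, POS, NE tag)."""
    tokens = []
    for line in lines:
        fields = line.split()
        if len(fields) == 4:
            tokens.append((fields[0], fields[1], fields[3]))
    return tokens


def window_rows(tokens, size=2, pad="_"):
    """Return one row per token: window words, window POS tags, then the NE class."""
    rows = []
    n = len(tokens)
    for i in range(n):
        words, tags = [], []
        for j in range(i - size, i + size + 1):
            if 0 <= j < n:
                words.append(tokens[j][0])
                tags.append(tokens[j][1])
            else:
                # Hypothetical placeholder for positions before/after the sentence.
                words.append(pad)
                tags.append(pad)
        rows.append(words + tags + [tokens[i][2]])
    return rows


def to_arff_row(row, n_words):
    """Quote the string attributes (the words); leave nominal values bare."""
    quoted = ['"' + value.replace('"', '\\"') + '"' for value in row[:n_words]]
    return ",".join(quoted + row[n_words:])


if __name__ == "__main__":
    sample = [
        "U.N. NNP I-NP I-ORG",
        "official NN I-NP O",
        "Ekeus NNP I-NP I-PER",
        "heads VBZ I-VP O",
        "for IN I-PP O",
    ]
    for row in window_rows(parse_conll(sample)):
        print(to_arff_row(row, n_words=5))
```

If a padding value is used, it also has to appear in the nominal value list of each POS tag attribute. For keyphrase extraction, the class column would hold your own keyphrase labels instead of NE tags.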

How to prevent OpenOffice/LibreOffice Calc from changing what you input (data, numbers,...)

Basically, I want LibreOffice Calc to do what I tell it, not what it wants.
For example:
when I input 1.1.12, I want to have 1.1.12 in that cell, not 01.01.2012 or whatever.
when I input 001, I want to have 001 in that cell, not 1
and so on and so forth
I want it to never ever touch my data until I explicitly tell it to. Is that possible at all?
I know I can set format of a cell to text. It doesn't help at all. Example:
Input 1.1.12, it gets displayed as 01.01.12, format as text, it becomes "40909", original input is lost
Format empty cells as text. Paste "000 001 002 ..." separated by line breaks. Displays "0 1 2 ..."
I know I can write ' in front of anything for it to be forced text. Again it doesn't help, because when I paste in text, I cannot have ' auto-appended to it.
I hope this is possible. I tried googling for different problems and never found a good answer.
If you want your input to be interpreted as text and want to prevent Calc from doing fancy (and annoying) things with it, you have to change the cell format before entering any value.
Select the cells/columns/rows.
Right-click 'Format Cells...'
Select the tab 'Numbers'
In the list 'Category', select 'Text' (the last option)
Select the format '#' (it is the only one in this category)
Click on 'Ok'
You may need to tweak the 'autocorrect' options as well. Go to 'Tools > AutoCorrect Options...'. Here is a link that may help: https://help.libreoffice.org/Calc/Deactivating_Automatic_Changes
I understand your problem with pasting pure unformatted text. This may be more work than you would like (we can try to automate that later), but when I paste data from Notepad, I am prompted with a text import dialog. Select the column header(s) there and then select Column type: Text. This should solve your paste/import problem. An alternative is to handle this with an AutoHotKey script.
Oh b.t.w. the # is the format type for text, just like you have HH for 24 hour or ddd for weekdays...
When you are importing, you're given a bunch of options. Select "Quoted field as text" so that anything inside quotes is treated as text. LibreOffice regards text as sacred and does not modify it the way it modifies something it identifies as a number.
When you have your data in the clipboard click Edit -> Paste as... in main menu. In next window choose "Paste as text". All your data will be pasted as is.
I initially arrived at this page with a very similar (but not identical) problem. I am posting the solution here for the benefit of those who might be visiting with the same issue.
Every time I would save, close, and then re-open my .XLSX spreadsheet in OpenOffice, it would delete the spaces I had entered in between text. For example:
"Did not attend" would become "Didnotattend".
"John DOE" would become "JohnDOE", etc.
Specifying "text" (#) as the format (as recommended above) did not help me, unfortunately.
What ultimately did solve it was saving it as an .ODS file instead of .XLSX.
Simply put the character ' before the text, e.g. '0.1.16, and Calc will interpret it as text data.
My issue was with currency: a properly formatted amount would change to a much larger number whenever the digits entered could represent a date, such as 4.22 becoming $42,482. I discovered that adding a trailing zero solved the problem.
I had pasted numbers from another site and it kept coming up with dates. I just messed around and hit the arrow that's on the paste board to give me the option of unformatted text or HTML format. I selected unformatted, a window opened to show me the text I wanted so I pressed o.k.

Use cases for regular expression find/replace

I recently discussed editors with a co-worker. He uses one of the less popular editors and I use another (I won't say which ones since it's not relevant and I want to avoid an editor flame war). I was saying that I didn't like his editor as much because it doesn't let you do find/replace with regular expressions.
He said he's never wanted to do that, which was surprising since it's something I find myself doing all the time. However, off the top of my head I wasn't able to come up with more than one or two examples. Can anyone here offer some examples of times when they've found regex find/replace useful in their editor? Here's what I've been able to come up with since then as examples of things that I've actually had to do:
Strip the beginning of a line off of every line in a file that looks like:
Line 25634 :
Line 632157 :
Taking a few dozen files with a standard header which is slightly different for each file and stripping the first 19 lines from all of them all at once.
Piping the result of a MySQL select statement into a text file, then removing all of the formatting junk and reformatting it as a Python dictionary for use in a simple script.
In a CSV file with no escaped commas, replace the first character of the 8th column of each row with a capital A.
Given a bunch of GDB stack traces with lines like
#3 0x080a6d61 in _mvl_set_req_done (req=0x82624a4, result=27158) at ../../mvl/src/mvl_serv.c:850
strip out everything from each line except the function names.
Does anyone else have any real-life examples? The next time this comes up, I'd like to be more prepared to list good examples of why this feature is useful.
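For reference, three of the tasks listed above can be expressed as short regex substitutions. The sketch below uses Python's `re` module rather than any particular editor, the function names are made up, and the patterns assume the input looks exactly like the samples in the question.

```python
import re


def strip_line_prefix(text):
    """Remove a leading 'Line <number> :' from every line."""
    return re.sub(r"^Line \d+ :[ \t]*", "", text, flags=re.MULTILINE)


def capital_a_in_column_8(row):
    """Replace the first character of the 8th comma-separated field with 'A'."""
    # Assumes no escaped or quoted commas, as stated in the question.
    return re.sub(r"^((?:[^,]*,){7})[^,]", r"\1A", row)


def gdb_function_names(trace):
    """Keep only the function name from each GDB frame line."""
    # Frame number, an optional '0x... in ', then the name before ' ('.
    frame = re.compile(r"^#\d+\s+(?:0x[0-9a-fA-F]+ in )?(\S+) \(", re.MULTILINE)
    return frame.findall(trace)


if __name__ == "__main__":
    print(strip_line_prefix("Line 25634 : first\nLine 632157 : second"))
    print(capital_a_in_column_8("a,b,c,d,e,f,g,hello,i"))
    print(gdb_function_names(
        "#3 0x080a6d61 in _mvl_set_req_done (req=0x82624a4, result=27158) "
        "at ../../mvl/src/mvl_serv.c:850"
    ))
```

In an editor, the same patterns would go in the find field with the replacement in the replace field; the group reference syntax (`\1` or `$1`) varies by editor.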
Just last week, I used regex find/replace to convert a CSV file to an XML file.
Simple enough to do really, just chop up each field (luckily it didn't have any escaped commas) and push it back out with the appropriate tags in place of the commas.
Regexes make it easy to replace whole words using word boundaries.
(\b\w+\b)
So you can replace unwanted words in your file without disturbing words like Scunthorpe
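A minimal Python sketch of a whole-word replacement follows; the helper name `replace_word` and the sample words are only for illustration.

```python
import re


def replace_word(text, word, replacement):
    """Replace word only where it stands alone, not inside a longer word."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.sub(pattern, replacement, text)


if __name__ == "__main__":
    print(replace_word("the cat sat on the catalog", "cat", "dog"))
```

Without the `\b` anchors, the same substitution would also rewrite the first three letters of "catalog".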
Yesterday I took a create table statement I made for an Oracle table and converted the fields to setString() method calls using JDBC and PreparedStatements. The table's field names were mapped to my class properties, so regex search and replace was the perfect fit.
Create Table text:
...
field_1 VARCHAR2(100) NULL,
field_2 VARCHAR2(10) NULL,
field_3 NUMBER(8) NULL,
field_4 VARCHAR2(100) NULL,
....
My Regex Search:
/([a-z0-9_]+) .*/
My Replacement:
pstmt.setString(1, \1);
The result:
...
pstmt.setString(1, field_1);
pstmt.setString(1, field_2);
pstmt.setString(1, field_3);
pstmt.setString(1, field_4);
....
I then went through and manually set the position int for each call and changed the method to setInt() (and others) where necessary, but that worked handy for me. I actually used it three or four times for similar field to method call conversions.
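The manual renumbering step is something a plain editor replace cannot do, but a few lines of script can. The following Python sketch is one possible way to do it, assuming column lines of the form shown above; `columns_to_setters` is a hypothetical name, and choosing between setString(), setInt() and the other setters is still left to the reader.

```python
import re
from itertools import count


def columns_to_setters(ddl_lines):
    """Turn 'name TYPE ...,' column lines into numbered pstmt.setString calls."""
    position = count(1)  # JDBC parameter indexes start at 1
    column = re.compile(r"^\s*([a-z_][a-z0-9_]*)\s+\S.*$")
    calls = []
    for line in ddl_lines:
        match = column.match(line)
        if match:  # lines such as "..." are skipped
            calls.append("pstmt.setString(%d, %s);" % (next(position), match.group(1)))
    return calls


if __name__ == "__main__":
    ddl = [
        "field_1 VARCHAR2(100) NULL,",
        "field_2 VARCHAR2(10) NULL,",
        "field_3 NUMBER(8) NULL,",
        "field_4 VARCHAR2(100) NULL,",
    ]
    print("\n".join(columns_to_setters(ddl)))
```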
I like to use regexps to reformat lists of items like this:
int item1
double item2
to
public void item1(int item1){
}
public void item2(double item2){
}
This can be a big time saver.
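As a rough illustration of that reformatting, here is the substitution written with Python's `re` module. The pattern assumes each line holds exactly a type and a name separated by spaces or tabs, and `to_method_stubs` is an invented name.

```python
import re


def to_method_stubs(declarations):
    """Rewrite 'type name' lines as empty method stubs."""
    return re.sub(
        r"^(\w+)[ \t]+(\w+)$",
        r"public void \2(\1 \2){\n}",   # \1 is the type, \2 the name
        declarations,
        flags=re.MULTILINE,
    )


if __name__ == "__main__":
    print(to_method_stubs("int item1\ndouble item2"))
```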
I use it all the time when someone sends me a list of patient visit numbers in a column (say 100-200) and I need them in a '0000000444','000000004445' format. works wonders for me!
I also use it to pull out email addresses in an email. I send out group emails often and all the bounced returns come back in one email. So, I regex to pull them all out and then drop them into a string var to remove from the database.
I even wrote a little dialog prog to apply regex to my clipboard. It grabs the contents applies the regex and then loads it back into the clipboard.
One thing I use it for in web development all the time is stripping some text of its HTML tags. This might need to be done to sanitize user input for security, or for displaying a preview of a news article. For example, if you have an article with lots of HTML tags for formatting, you can't just do LEFT(article_text,100) + '...' (plus a "read more" link) and render that on a page at the risk of breaking the page by splitting apart an HTML tag.
Also, I've had to strip img tags in database records that link to images that no longer exist. And let's not forget web form validation. If you want to make sure a user has entered a correct email address (syntactically speaking) into a web form, this is about the only way of checking it thoroughly.
I've just pasted a long character sequence into a string literal, and now I want to break it up into a concatenation of shorter string literals so it doesn't wrap. I also want it to be readable, so I want to break only after spaces. I select the whole string (minus the quotation marks) and do an in-selection-only replace-all with this regex:
/.{20,60} /
...and this replacement:
/$0"¶ + "/
...where the pilcrow is an actual newline, and the number of spaces varies from one incident to the next. Result:
String s = "I recently discussed editors with a co-worker. He uses one "
+ "of the less popular editors and I use another (I won't say "
+ "which ones since it's not relevant and I want to avoid an "
+ "editor flame war). I was saying that I didn't like his "
+ "editor as much because it doesn't let you do find/replace "
+ "with regular expressions.";
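The same split can be reproduced outside an editor. This Python sketch applies the `.{20,60} ` pattern with a function as the replacement; `split_literal` and its `indent` parameter are hypothetical, and it assumes the text contains no double quotes that would need escaping.

```python
import re


def split_literal(text, indent="    "):
    """Break text after spaces into concatenated string literals."""
    # Each match is 20 to 60 characters followed by a space; close the literal
    # after it and open a new one on the next line.
    broken = re.sub(
        r".{20,60} ",
        lambda m: m.group(0) + '"\n' + indent + '+ "',
        text,
    )
    return '"' + broken + '"'


if __name__ == "__main__":
    sample = ("I recently discussed editors with a co-worker. He uses one of "
              "the less popular editors and I use another.")
    print("String s = " + split_literal(sample) + ";")
```

Because the quantifier is greedy, each chunk is as long as possible while still ending at a space, which keeps the lines close to the upper bound.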
The first thing I do with any editor is try to figure out its regex oddities. I use it all the time. Nothing really crazy, but it's handy when you've got to copy/paste stuff between different types of text - SQL <-> PHP is the one I do most often - and you don't want to fart around making the same change 500 times.
Regex is very handy any time I am trying to replace a value that spans multiple lines. Or when I want to replace a value with something that contains a line break.
I also like that you can match things in a regular expression and not replace the full match using the $# syntax to output the portion of the match you want to maintain.
I agree with you on points 3, 4, and 5 but not necessarily points 1 and 2.
In some cases 1 and 2 are easier to achieve using an anonymous keyboard macro.
By this I mean doing the following:
Position the cursor on the first line
Start a keyboard macro recording
Modify the first line
Position the cursor on the next line
Stop record.
Now all that is needed to modify the next line is to repeat the macro.
I could live without support for regex but could not live without anonymous keyboard macros.