Stanford NLP: how to preprocess text by replacing words

I have a sentence like this "The people working in #walman are not good"
I have a preprocessed text file which contains the mappings, similar to the following two lines:
#walman Walman
#text Test
For the above sentence I have to read through the text file and replace the word with any matching word found in the text file.
The above sentence will change to "The people working in Walman are not good"
I am looking for an API available in Stanford NLP to read the input text file and replace the text.

The only NLP-related part here is tokenization. Read your text file into a map (e.g. a HashMap in Java). Then, for each new sentence, tokenize it (e.g. with the Stanford tokenizer) and check whether each token is present in the map: if it is, replace the token with the value found in the map; if not, leave the token unchanged.
Sample code for tokenization (taken from the link above):
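import java.io.StringReader;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;

String arg = "The people working in #walman is not good";
// Tokenize the sentence with the default options ("")
PTBTokenizer<CoreLabel> ptbt = new PTBTokenizer<CoreLabel>(new StringReader(arg),
        new CoreLabelTokenFactory(), "");
for (CoreLabel label; ptbt.hasNext(); ) {
    label = ptbt.next();
    System.out.println(label);
}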
So, label.toString() gives you the token without any suffixes; label.word() also returns the token text as a String.
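The map-based replacement described above can be sketched in a few lines. This is a minimal Python sketch rather than Stanford NLP code: it assumes the mapping file holds one whitespace-separated pair per line, and it splits the sentence on whitespace instead of using a real tokenizer. The names `load_mappings` and `replace_tokens` are made up for illustration.

```python
def load_mappings(lines):
    """Build a dict from lines such as '#walman Walman' (assumed format)."""
    mappings = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:          # skip blank or malformed lines
            key, value = parts
            mappings[key] = value
    return mappings


def replace_tokens(sentence, mappings):
    """Replace every whitespace-separated token found in the mapping."""
    tokens = sentence.split()        # stand-in for a real tokenizer
    return " ".join(mappings.get(token, token) for token in tokens)


# In practice the lines would come from the mapping file, e.g. open(path).
mappings = load_mappings(["#walman Walman", "#text Test"])
result = replace_tokens("The people working in #walman are not good", mappings)
print(result)
```

A token with punctuation attached, such as "#walman,", would not match in this sketch; that is the case a proper tokenizer is meant to handle.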

Related

Making a text file which will contain my list items and applying regular expression to it

I am supposed to write code that reads a text file containing words with some common linguistic features, applies a regular expression to each word, and writes a new file containing the changed words.
For now let's say my text file named abcd.txt has these words
king
sing
ping
cling
booked
looked
cooked
packed
My first question starts from here. In my simple text file how to write these words to get the above mentioned results. Shall I write them line-separated or comma separated?
This is the code provided by user palvarez.
import re
with open("new_abcd", "w+") as new, open("abcd") as original:
    for word in original:
        new_word = re.sub("ing$", "xyz", word)
        new.write(new_word)
Can I add something like -
with open("new_abcd", "w+") as new, open("abcd") as original:
    for word in original:
        new_aword = re.sub("ed$", "abcd", word)
        new.write(new_aword)
in the same code file? I want something like -
kabc
sabc
pabc
clabc
bookxyz
lookxyz
cookxyz
packxyz
PS - I don't know whether mentioning this is necessary or not, but I am supposed to do this for a Unicode supported script Devanagari. I didn't use it here in my examples because many of us here can't read the script. Additionally that script uses some diacritics. eg. 'का' has one consonant character 'क' and one vowel symbol 'ा' which together make 'का'. In my regular expression I need to condition the diacritics.
I think your approach of one word per line is better, since you don't have to trouble yourself with delimiters and stripping.
With a file like this:
king
sing
ping
cling
booked
looked
cooked
packed
And code like this, using re.sub to replace a pattern:
import re
with open("new_abcd.txt", "w") as new, open("abcd.txt") as original:
    for word in original:
        new_word = re.sub("ing$", "xyz", word)
        new_word = re.sub("ed$", "abcd", new_word)
        new.write(new_word)
It creates a resulting file:
kxyz
sxyz
pxyz
clxyz
bookabcd
lookabcd
cookabcd
packabcd
I tried out with the diacritic you gave us and it seems to work fine:
print(re.sub("ा$", "ing", "का"))
>>> कing
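To make the diacritic handling easier to see, here is a small sketch that inspects the code points of 'का' and then applies the same end-of-word substitution. It assumes the text is in its usual composed form (consonant followed by vowel sign), and the replacement rule itself is only an illustration, not a real linguistic rule.

```python
import re
import unicodedata

word = "\u0915\u093e"  # the syllable 'का' written as explicit code points

# The syllable is two code points: a consonant followed by a vowel sign.
for ch in word:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

# Illustrative rule: replace a word-final vowel sign AA (U+093E) with "ing".
# Writing the code point as an escape keeps the diacritic visible in the pattern.
final_aa = re.compile("\u093e$")
replaced = final_aa.sub("ing", word)
print(replaced)
```

Because the vowel sign is a separate code point, a pattern can target it on its own, or exclude it with a character class, in the same way as any other character.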
EDIT: added multiple replacements. You can put your replacements in a list and iterate over it, calling re.sub for each one, as follows.
import re
# List where first is pattern and second is replacement string
replacements = [("ing$", "xyz"), ("ed$", "abcd")]
with open("new_abcd.txt", "w") as new, open("abcd.txt") as original:
    for word in original:
        new_word = word
        for pattern, replacement in replacements:
            new_word = re.sub(pattern, replacement, word)
            if new_word != word:
                break
        new.write(new_word)
This limits each word to one modification: only the first replacement that changes the word is applied.
For starters, I recommend using the with context manager to open your file, so you do not need to close the file explicitly once you are done with it.
Another advantage is that you can process the file line by line, which is very useful when working with larger data sets. Whether to write the words on separate lines or in CSV format depends on what your output requires and how you want to process it further.
As an example, to read from a file and say substitute a substring, you can use re.sub.
import re
with open('abcd.txt', 'r') as f:
    for line in f:
        #do something here
        print(re.sub("ing$",'ring',line.strip()))
>>
kring
sring
pring
clring
Another nifty trick is to manage both the input and output utilizing the same context manager like:
import re
with open('abcd.txt', 'r') as f, open('out_abcd.txt', 'w') as o:
    for line in f:
        #notice that we add '\n' to write each output to a newline
        o.write(re.sub("ing$",'ring',line.strip())+'\n')
This creates an output file with your new contents in a very memory-efficient way.
If you'd like to write to a CSV file or any other specific format, I highly suggest you spend some time understanding Python's input and output functions here. If linguistics in text is what you are going for, then learn how different languages are encoded and study Python's regex operations further.
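As a rough illustration of the CSV option, the sketch below writes each original word next to its modified form using the standard csv module. The function name `write_word_pairs`, the column names, and the default pattern are assumptions made for the example; for non-ASCII scripts you would most likely also want to pass encoding="utf-8" when opening the real files.

```python
import csv
import io
import re


def write_word_pairs(words, out_file, pattern=r"ing$", replacement="ring"):
    """Write (original, modified) rows as CSV to an open text file."""
    writer = csv.writer(out_file)
    writer.writerow(["original", "modified"])
    for word in words:
        word = word.strip()
        if word:                      # skip blank lines
            writer.writerow([word, re.sub(pattern, replacement, word)])


# In-memory demo; with real files you would use something like
# open("abcd.txt", encoding="utf-8") and
# open("out_abcd.csv", "w", encoding="utf-8", newline="").
buffer = io.StringIO()
write_word_pairs(["king\n", "booked\n"], buffer)
print(buffer.getvalue())
```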

Easiest way to cross-reference a CSV file with a text file for common strings

I have a list of strings in a CSV file, and another text file that I would like to search for these strings. The CSV file has just the strings that I am interested in, but the text file has a bunch of other text interspersed among the strings of interest (the strings I am interested in are ID numbers for a database of proteins). What would the easiest way of going about this be? I want to check the text file for the presence of every string in the CSV file. I am working in a research lab at a top university, so you would be aiding cutting-edge research!
Thanks :)
I would use Python for this. To print the matching lines, you could do this:
import csv
with open("strings.csv", newline="") as csvfile:
    reader = csv.reader(csvfile)
    searchstrings = {row[0] for row in reader if row}  # set of keywords; skips empty rows
with open("text.txt") as txtfile:
    for number, line in enumerate(txtfile, start=1):  # 1-based line numbers
        for needle in searchstrings:
            if needle in line:
                print("Line {0}: {1}".format(number, line.strip()))
                break  # print each line once, even if several strings match
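Since the question asks whether every string from the CSV file occurs in the text, a complementary sketch is to report which IDs were found and which were not. The function name `check_ids` and the sample IDs below are made up for illustration, and the whole text is assumed to fit in memory.

```python
def check_ids(ids, text):
    """Split the IDs into those that occur in the text and those that do not."""
    found = {i for i in ids if i in text}
    return found, set(ids) - found


# Made-up IDs and text, for illustration only.
ids = {"P12345", "Q67890", "A11111"}
text = "sample P12345 binds strongly, control Q67890 shows no change"
found, missing = check_ids(ids, text)
print("found:", sorted(found))
print("missing:", sorted(missing))
```

This uses plain substring matching, so an ID that is a prefix of a longer ID would also count as found; matching on word boundaries with a regular expression would be stricter.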

How do I parse a txt file which erases all the lines except what I need?

I want to parse a text file to get what I want and create another txt file in c++.
I have a text file which looks something like this.
User :
Group :
Comment1 :
Comment2 :
*** Label ***
ID : Nick
PASS : sky123
Number ID : 9402
*** End of Label ***
######################################
And goes on.
I basically want to create a new txt file that keeps every line containing a colon (:) and erases the rest, such as "*** Label ***", and save the result in a new txt file.
The answer to that txt file would be
User :
Group :
Comment1 :
Comment2 :
ID : Nick
PASS : sky123
Number ID : 9402
How do I do this in a simple way?
Thank you very much.
In C++ with fstreams:
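#include <fstream>
#include <string>
using namespace std;

int main() {
    ifstream input("input.txt");
    ofstream output("output.txt");
    string line;
    // Copy only the lines that contain a colon
    while (getline(input, line)) {
        if (line.find(":") != string::npos) {
            output << line << "\n";
        }
    }
}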
I have not tested this, but you get the idea.
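void foo(ifstream& ifs, ofstream& ofs)
{
    string line;
    // Loop on getline itself; testing eof() first processes a failed read
    while (getline(ifs, line))
    {
        if (line.find(":") != string::npos)
            ofs << line << '\n';  // getline strips the newline, so add it back
    }
}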
The simplest approach I can think of would be to read in the text file line by line, storing only the lines you want. Once the read is complete, open a separate file for writing and write the stored lines. It's not C++, but I've written out some pseudocode to illustrate.
while(line in source file)
{
    if(wantline)
        store the line
}
for(stored lines)
    write line to destination file
Broadly, all you have to do is loop over the lines and check whether each one contains the character (:); if it does, append it to a string. At the end, just save that string to a new text file.
Here's a tutorial on how to manage files
Well I don't write C++ and I'm not sure if you're looking for help with the language, but here's a simple approach in the abstract:
Load the text file into a string and then split the string into an array on the newline character so that you have 1 line of the text file per array value. Then, iterate over each element of the array. Using this method you can examine the contents of an individual line and use string comparison, regex matching, etc. to determine whether you want to keep/modify that line. Once you're done you can simply save the new string to a text file.
Alternately, you can use global find/replace with string comparison, regex etc. to apply changes to the entire document at once, but that requires more advanced knowledge and application of regex, and may not be practical given the size of the document.
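The split-and-filter approach described above can be sketched as follows. The sketch is in Python rather than C++, and the function name `keep_lines_with_colon` is made up for illustration; it assumes the whole file fits in memory as one string.

```python
def keep_lines_with_colon(text):
    """Return only the lines of `text` that contain a colon."""
    kept = [line for line in text.splitlines() if ":" in line]
    # Re-join with newlines; an empty result stays an empty string.
    return "\n".join(kept) + ("\n" if kept else "")


sample = """User :
*** Label ***
ID : Nick
*** End of Label ***
######################################
"""
filtered = keep_lines_with_colon(sample)
print(filtered, end="")
```

For a real file, the input string would come from reading the source file and the result would be written to the destination file.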

looking for a regular expression to extract all text outputs to user from js file

I have some huge JS files containing texts and messages that are output for a human being. The problem is that they don't all go through the same method.
But I want to find them all to refactor the code.
Now I am searching for a regular expression to find those messages.
...his.submit_register = function(){
    if(!this.agb_accept.checked) {
        out_message("This is a Messge tot the User in English." , "And the Title of the Box. In English as well");
        return false;
    }
    this.valida...
what i want to find is all the strings which are not source code.
in this case i want as return:
This is a Messge tot the User in
English. And the Title of the Box. In
English as well
I tried something like: /\"(\S+\s{1})+\S\"/, but this won't work ...
thanks for help
It's not possible to fully parse JavaScript source code using regular expressions, because JavaScript is not a regular language. You can write a regular expression that works most of the time:
/"(.*?)"/
The ? means that the match is not greedy.
Note: this will not correctly handle strings that contain escaped quotes.
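If escaped quotes do matter, one common workaround is a pattern that treats a backslash plus the following character as a unit. The sketch below shows this in Python; the sample source line is invented, and the pattern still only covers double-quoted literals, so it ignores single-quoted strings, template literals, and quotes inside comments.

```python
import re

# A double-quoted literal: any run of non-quote, non-backslash characters
# or backslash escapes, between two double quotes.
STRING_LITERAL = re.compile(r'"((?:[^"\\]|\\.)*)"')

# Invented example line containing escaped quotes inside a message.
source = r'out_message("He said \"hi\".", "Title");'
literals = STRING_LITERAL.findall(source)
print(literals)
```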
A simple Java regex that solves your problem (assuming that the message doesn't contain a " character):
Pattern p = Pattern.compile("\"(.+?)\"");
The extraction code :
Matcher m;
for(String line : lines) {
    m = p.matcher(line);
    while(m.find()) {
        System.out.println(m.group(1));
    }
}

C++: search text in boolean mode

I basically have two questions.
1. Is there a c++ library that would do full text boolean search just like in mysql. E.g.,
Let's say I have:
string text = "this is my phrase keywords test with boolean query.";
string booleanQuery = "\"my phrase\" boolean -test -\"keywords test\" OR ";
booleanQuery += "\"boolean search\" -mysql -sql -java -php";
//where quotes ("") contain phrases, (-) is NOT keyword and OR is logical OR.
If answer to first is no, then;
2. Is it possible to search a phrase in text. e.g.,
string text =//same as previous
string keyword = "\"my phrase\"";
//here what's the best way to search for my phrase in the text?
TR1 has a regex class (derived from Boost::regex). It's not quite like you've used above, but reasonably close. Boost::phoenix and Boost::Spirit also provide similar capabilities, but for a first attempt the Boost/TR1 regex class is probably a better choice.
As to the 2nd point: string class does have a method find, see http://www.cppreference.com/wiki/string/find
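To show how far plain substring search can go, here is a small sketch of a boolean matcher built on it. It is written in Python rather than C++ and covers only part of the syntax in the question: quoted phrases, bare words, and a leading `-` for exclusion. OR is not implemented, matching is case-insensitive substring matching (so "test" also matches "testing"), and the names `parse_query` and `matches` are made up for illustration.

```python
import re

# One token is an optional '-' followed by a quoted phrase or a bare word.
TOKEN = re.compile(r'(-?)(?:"([^"]+)"|(\S+))')


def parse_query(query):
    """Split a query into required and excluded terms (no OR support)."""
    required, excluded = [], []
    for neg, phrase, word in TOKEN.findall(query):
        term = (phrase or word).lower()
        (excluded if neg else required).append(term)
    return required, excluded


def matches(text, query):
    """True if all required terms occur in text and no excluded term does."""
    haystack = text.lower()
    required, excluded = parse_query(query)
    return (all(term in haystack for term in required)
            and not any(term in haystack for term in excluded))


text = "this is my phrase keywords test with boolean query."
print(matches(text, '"my phrase" boolean'))
print(matches(text, '"my phrase" -"keywords test"'))
```

The phrase check itself is just a substring test, which is what string::find provides in C++.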
Sure there is, try Spirit:
http://boost-spirit.com/home/