Regex with Replace String Python - regex

I have a sentence with misplaced dots (.) that I need to process:
sentence = 'Hi. Long time no see .how are you ?can you follow .#abcde?'
I am trying to normalize this sentence. As you can see, some parts are wrongly formatted (.how, ?can, and .#abcde). I am thinking of using regex to handle this because the sentence keeps changing. This is my code so far:
import re
character = ['.','?','#']
sentence = 'Hi. Long time no see .how are you ?can you follow .#abcde?'
sentence = str(sentence)
for i in character:
    charac = str(i)
    charac_after = re.findall(r'\\'+charac+r'\S*', sentence)
    if charac_after:
        print("Exist")
        sentence = sentence.replace(charac, charac+' ')
print(sentence)
The result somehow skips the dot (.) and the hash (#); only the question mark (?) is processed. This is the result:
Exist
Hi. Long time no see .how are you ? can you follow .#abcde?
It is supposed to be "Hi. Long time no see . how are you ? can you follow . # abcde?". I don't know whether the double backslash in r'\\'+charac+r'\S*' is wrong. Did I miss something?
How can I process all the characters? Please help.

Without any knowledge of Python, I think you need to do it like this:
(as per suggestion from @Sebastian Proske)
import re
character = ['.','?','#']
sentence = str('Hi. Long time no see .how are you ?can you follow .#abcde?')
# Build a character class from the escaped characters and add a space
# after each one that is directly followed by a non-space character.
sentence = re.sub(r'([' + ''.join(map(re.escape, character)) + r'])(?=\S)', r'\1 ', sentence)
print(sentence)
I am not sure about the code, only about the regex. See here:
https://regex101.com/r/HXdeuK/2
see demo here https://repl.it/Fw5b/3
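To see why the loop in the question only reacts to the question mark, here is a small sketch (the variable names are illustrative and not from the original posts). In a raw string, r'\\' is a regex for one literal backslash, so the character appended after it is left unescaped; re.escape builds the pattern that was probably intended.

```python
import re

sentence = 'Hi. Long time no see .how are you ?can you follow .#abcde?'

# Pattern built as in the question: a literal backslash, then any character.
# The sentence contains no backslash, so nothing matches.
broken_dot = r'\\' + '.' + r'\S*'
dot_hits = re.findall(broken_dot, sentence)

# With '?', the backslash becomes optional and \S* may match the empty
# string, so findall always returns something and "Exist" is printed.
broken_qmark = r'\\' + '?' + r'\S*'
qmark_hits = re.findall(broken_qmark, sentence)

# re.escape produces a pattern for a literal dot instead.
fixed_dot = re.escape('.') + r'\S*'
fixed_hits = re.findall(fixed_dot, sentence)

print(dot_hits)          # []
print(bool(qmark_hits))  # True
print(fixed_hits)        # ['.', '.how', '.#abcde?']
```

The same reasoning applies to '#': the broken pattern looks for a backslash followed by a hash, which never occurs in the sentence.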

Related

Replacing a word with another in a string if a condition is met

I am trying to get some help with a function that replaces two words in a string with another word if a condition is true.
The condition is: if the word 'poor' follows 'not', replace the whole substring 'not ... poor' with 'rich'. The problem is that I don't know how to write the function; more specifically, how to check whether the word 'poor' follows 'not', and then how to make the replacement. I am pretty new to Python, so maybe it is a stupid question, but I hope someone will help me.
I want the function to do something like this:
string = 'I am not that poor'
new_string = 'I am rich'
Doubtless the regular expression pattern could be improved, but a quick and dirty way to do this is with Python's re module:
import re
# Raw string so that \s reaches the regex engine unchanged.
patt = r'not\s+(.+\s)?poor'
s = 'I am not that poor'
sub_s = re.sub(patt, 'rich', s)
print(s, '->', sub_s)
s2 = 'I am not poor'
sub_s2 = re.sub(patt, 'rich', s2)
print(s2, '->', sub_s2)
s3 = 'I am poor not'
sub_s3 = re.sub(patt, 'rich', s3)
print(s3, '->', sub_s3)
Output:
I am not that poor -> I am rich
I am not poor -> I am rich
I am poor not -> I am poor not
The regular expression pattern patt matches the word not, followed by whitespace, then (optionally) other characters followed by a whitespace character, and then the word poor.
Step One: Determine where the 'not' and 'poor' are inside your string (check out https://docs.python.org/2.7/library/stdtypes.html#string-methods)
Step Two: Compare the locations of 'not' and 'poor' that you just found. Does 'poor' come after 'not'? How could you tell? Are there any extra edge cases you should account for?
Step Three: If your conditions are not met, do nothing. If they are, everything between and including 'not' and 'poor' must be replaced by 'rich'. I'll leave you to decide how to do that, given the above documentation link.
Good luck, and happy coding!
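The three steps above can be sketched with plain string methods. This is only one possible reading of them, not the answerer's own solution: the function name replace_not_poor is hypothetical, and the sketch searches for substrings, so it would also react to words such as 'nothing'.

```python
def replace_not_poor(text):
    # Step one: locate 'not' (str.find returns -1 when it is absent).
    not_pos = text.find('not')
    if not_pos == -1:
        return text
    # Step two: look for 'poor' only after the end of 'not'.
    poor_pos = text.find('poor', not_pos + len('not'))
    if poor_pos == -1:
        return text
    # Step three: replace everything from 'not' through 'poor' with 'rich'.
    return text[:not_pos] + 'rich' + text[poor_pos + len('poor'):]


print(replace_not_poor('I am not that poor'))  # I am rich
print(replace_not_poor('I am poor not'))       # I am poor not
```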
This is something I came up with. Works for your example, but will need tweaks (what if there is more than 1 word between not and poor).
my_string = 'I am not that poor'
print(my_string)
my_list = my_string.split(' ')
poor_pos = my_list.index('poor')
# Check whether 'not' is one or two words before 'poor'.
if 'not' in my_list[max(poor_pos - 2, 0):poor_pos]:
    not_pos = my_list.index('not')
    # Replace everything from 'not' through 'poor' with 'rich', in place.
    my_list[not_pos:poor_pos + 1] = ['rich']
print(" ".join(my_list))
Output:
I am not that poor
I am rich

findall function grabbing the wrong info

I am trying to write a piece of Python to read my files. The code is below:
import re, os
captureLevel = [] # capture read scale.
captureQID = [] # capture questionID.
captureDesc = [] # capture description.
file=open(r'E:\Grad\LIS\LIS590 Text mining\Final_Project\finalproject_data.csv','rt')
newfile=open('finalwordlist.csv','w')
mytext=file.read()
for row in mytext.split('\n'):
    grabLevel=re.findall(r'(\d{1})+\n',row)
    captureLevel.append(grabLevel)
    grabQID=re.findall(r'(\w{1}\d{5})',row)
    captureQID.append(grabQID) #ERROR LINE.
    grabDesc=re.findall(r'\,+\s+(\w.+)',row)
    captureDesc.append(grabDesc)
    lineCount = 0
    wordCount = 0
    lines = ''.join(grabDesc).split('.')
    for line in lines:
        lineCount +=1
        for word in line.split(' '):
            wordCount +=1
            newfile.write(''.join(grabLevel) + '|' + ''.join(grabQID) + '|' + str(lineCount) + '|' + str(wordCount) + '|' + word + '\n')
newfile.close()
Here are three lines of my data:
a00004," another oakstr eetrequest, helped student request item",2
a00005, asked retiree if he used journal on circ list,2
a00006, asked scientist about owner of some archival notes,2
Here is the result:
22|a00002|1|1|a00002,
22|a00002|1|2|
22|a00002|1|3|scientist
22|a00002|1|4|looking
22|a00002|1|5|for
The first column of the result should be just one number, but why is it printing out a two digit number?
Any idea what is the problem here? Thanks.
It is the tab-versus-space difference again. You need to be careful with this, especially in Python: spaces are not treated as equivalent to tabs. Here is a helpful link on the difference: http://legacy.python.org/dev/peps/pep-0008/. In brief, that guide recommends spaces for indentation. However, I find tabs work fine for indentation too. What matters is keeping the indentation consistent, so if you use tabs, use them throughout.
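As a small illustration of why consistency matters, the sketch below compiles two tiny snippets, one indented only with spaces and one mixing a tab with spaces. It assumes Python 3, which rejects the mixed version with a TabError (a subclass of IndentationError); the helper name compiles is made up for this example.

```python
def compiles(source):
    """Return True if the source compiles, False on an indentation error."""
    try:
        compile(source, "<example>", "exec")
        return True
    except IndentationError:  # TabError is a subclass of IndentationError
        return False


# Both body lines indented with four spaces.
consistent = "for i in range(2):\n    x = i\n    y = i\n"
# First body line indented with a tab, second with eight spaces.
mixed = "for i in range(2):\n\tx = i\n        y = i\n"

print(compiles(consistent))  # True
print(compiles(mixed))       # False
```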

Python regex to get string in front of hyphen and plus signs

I have a string as shown in the code. I want to get the final result as: ['AA', 'BB','CC'].
But what I have got here is ['AA', 'BB']. Could you please give me some suggestion? Thank you.
s = "AA-ZZ, BB+ZZ, CC"
a = re.findall(r'(\w+)[-|\\+\\]\w',s)
Use lookahead to see whether the string is in front of +, - or at the end of string.
a = re.findall(r'(\w+)(?=[-+]|$)',s)
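A quick side-by-side run of both patterns, as a sketch. The original pattern requires a separator character and a following word character, so the trailing 'CC' can never match; the lookahead version also accepts the end of the string.

```python
import re

s = "AA-ZZ, BB+ZZ, CC"

# Pattern from the question: needs one of - | \ + and then a word character.
original = re.findall(r'(\w+)[-|\\+\\]\w', s)

# Pattern from the answer: the word must be followed by -, + or end of string.
with_lookahead = re.findall(r'(\w+)(?=[-+]|$)', s)

print(original)        # ['AA', 'BB']
print(with_lookahead)  # ['AA', 'BB', 'CC']
```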

Regex to find segment of string searching from end

I'm in Java and have a string that will always be in this format:
;<b>gerg(1314)</b><br> (KC)<br>
This number 461610734 will change and may be any length.. I'd like to pick that number out and use it. As you can see the number is next to a ' (the first one working backwards) and a hash # (again, the first one working backwards).
I can find the numbers after the hash by using ([^\#]+$) and I can find up to the last ' by using ([^\']+$) (but this would be on the wrong side of the '...)
I'm lost... Anyone know how to join these two together and nudge the ' along one to the left to just get the numbers?
Actually, I believe that you could simply extract "the digits that immediately follow a #".
You could then use the following regex: (?<=#)\d+.
On the other hand, if you really want to specify that your digits are following a # and followed by a ', you could (should?) make use of the look-arounds.
The following regex should be what you're looking for:
(?<=#)\d+(?=')
You can see it live by clicking this link.
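The two look-around patterns can be tried out quickly. The sketch below uses Python rather than Java only to keep it short; the look-behind and look-ahead syntax is the same in both. The sample string is an assumption pieced together from the number and the onClick fragment mentioned in this thread, not the asker's full input.

```python
import re

# Assumed sample input containing '#<digits>' followed by a single quote.
sample = "onClick=\"return CCL(this,'#461610734')\""

# Digits that immediately follow a '#'.
after_hash = re.search(r"(?<=#)\d+", sample)

# Digits that follow a '#' and are themselves followed by a "'".
between = re.search(r"(?<=#)\d+(?=')", sample)

print(after_hash.group())  # 461610734
print(between.group())     # 461610734
```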
Try this:
String str = ";<b>gerg(1314)</b><br> (KC)<br>";
Pattern pattern = Pattern.compile("onClick=\"return CCL\\(this,'#([0-9]+)'");
Matcher matcher = pattern.matcher(str);
if (matcher.find()) {
    System.out.println(matcher.group(1)); // Prints 461610734
}

Regular expression any character with dynamic size

I want to use a regular expression that does the following (I extracted the part I am having trouble with, to simplify):
any character for the first 1 to 5 characters, then an underscore, then some digits, then an underscore, then some digits or dots.
With a restriction on the underscore, it gives something like this:
^([^_]{1,5})_([\\d]{2,3})_([\\d\\.]*)$
But I want to allow "_" in the first 1 to 5 characters as long as the rest of the regular expression still matches, for example if I had something like:
to_to_123_12.56
I think this is linked to a greediness problem in the regex engine; nevertheless, I tried some lazy matching like explained here, but without success.
Any idea?
I used the following regex and it appeared to work fine for your task. I've simply replaced your initial [^_] with a dot (.).
^.{1,5}_\d{2,3}_[\d\.]*$
It's probably best to replace your final * with + too, unless you allow nothing after the final '_'. And note your final part allows multiple '.' (I don't know if that's what you want or not).
For the record, here's a quick Python script I used to verify the regex:
import re
strs = [ "a_12_1",
         "abc_12_134",
         "abcd_123_1.",
         "abcde_12_1",
         "a_123_123.456.7890.",
         "a_12_1",
         "ab_de_12_1",
       ]
myre = r"^.{1,5}_\d{2,3}_[\d\.]+$"
for s in strs:  # 's' avoids shadowing the built-in str
    m = re.match(myre, s)
    if m:
        print("Yes:", end=" ")
        # Report whether the match covers the whole string.
        if m.group(0) == s:
            print("ALL", end=" ")
    else:
        print("No:", end=" ")
    print(s)
Output is:
Yes: ALL a_12_1
Yes: ALL abc_12_134
Yes: ALL abcd_123_1.
Yes: ALL abcde_12_1
Yes: ALL a_123_123.456.7890.
Yes: ALL a_12_1
Yes: ALL ab_de_12_1
^(.{1,5})_(\d{2,3})_([\d.]*)$
works for your example. The result doesn't change whether you use a lazy quantifier or not.
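A short check of that claim, as a sketch in Python (the question's doubled backslashes suggest a Java string literal, but the pattern itself is the same). Both the greedy and the lazy form of the first group end up capturing 'to_to', because it is the only split that lets the rest of the pattern match.

```python
import re

text = "to_to_123_12.56"

greedy = re.match(r'^(.{1,5})_(\d{2,3})_([\d.]*)$', text)
lazy = re.match(r'^(.{1,5}?)_(\d{2,3})_([\d.]*)$', text)

print(greedy.groups())  # ('to_to', '123', '12.56')
print(lazy.groups())    # ('to_to', '123', '12.56')
```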
While answering the comment (writing the lazy expression), I saw that I had made a mistake... if I simply use the following classic regex, it works:
^(.{1,5})_([\\d]{2,3})_([\\d\\.]*)$
Thank you.