Python 2.7 RE Search by condition - python-2.7

I am having a problem with re.search.
For example:
a = '<span class="chapternum">1 </span>abc,def.</span>'
How can I extract the number '1'?
In other words, how do I match digits that come right after ">" and are followed by whitespace?
I tried:
test = re.search('(^>)(\d+)(\s$)', a)
print test
>> None
It fails to get the number "1".

^ and $ indicate the beginning and the end of the string. If you get rid of them you have your answer:
>>> test = re.search('(>)(\d+)(\s)', a)
>>> test.groups()
('>', '1', ' ')
You probably don't need the first and last groups, though (parentheses create capturing groups):
>>> a = '<span class="chapternum">23 </span>abc,def.</span>'
>>> test = re.search('>(\d+)\s', a)
>>> test.group(1)
'23'
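To see the effect of the anchors side by side, here is a minimal sketch. The helper name chapter_number is hypothetical and only used for illustration; it is not part of the original answer.

```python
import re

a = '<span class="chapternum">1 </span>abc,def.</span>'

# Anchored pattern from the question: ^ requires ">" to be the very first
# character and $ requires the whitespace to be the very last one,
# so nothing in `a` can match.
print(re.search(r'(^>)(\d+)(\s$)', a))  # None

def chapter_number(html):
    """Return the digits between '>' and the following whitespace, or None."""
    match = re.search(r'>(\d+)\s', html)
    return match.group(1) if match else None

print(chapter_number(a))  # 1
```

Checking the result of re.search before calling group() avoids an AttributeError when nothing matches.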

Related

Regex match string where symbol is not repeated

I have like this strings:
group items % together into% FALSE
characters % that can match any single TRUE
How can I match only the sentences in which the % symbol is not repeated?
I tried the pattern below, but it also matches the first sentence, which contains % twice:
[%]{1}
You can use this regex in Python; it fails to match on lines that contain more than one %:
^(?!([^%]*%){2}).+
(?!([^%]*%){2}) is a negative lookahead that fails the match if % is found twice after line start.
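As a quick check of the lookahead pattern, the sketch below applies it to the two sample strings from the question. The constant name AT_MOST_ONE_PERCENT is an assumption made for this example. Note that the pattern also accepts lines with no % at all.

```python
import re

# Pattern from the answer: the negative lookahead rejects the line
# as soon as a second '%' can be found after the start.
AT_MOST_ONE_PERCENT = re.compile(r'^(?!([^%]*%){2}).+')

items = ['group items % together into%',
         'characters % that can match any single']

results = [bool(AT_MOST_ONE_PERCENT.search(item)) for item in items]
print(results)  # [False, True]
```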
You could use re.search as follows:
import re

items = ['group items % together into%', 'characters % that can match any single']
for item in items:
    output = item
    # two or more '%' anywhere in the line means the symbol is repeated
    if re.search(r'^.*%.*%.*$', item):
        output = output + ' FALSE'
    else:
        output = output + ' TRUE'
    print(output)
This prints:
group items % together into% FALSE
characters % that can match any single TRUE
Just count them (Python):
>>> s = 'blah % blah %'
>>> s.count('%') == 1
False
>>> s = 'blah % blah'
>>> s.count('%') == 1
True
With regex:
>>> re.match('[^%]*%[^%]*$','gfdg%fdgfgfd%')
>>> re.match('[^%]*%[^%]*$','blah % blah % blah')
>>> re.match('[^%]*%[^%]*$','blah % blah blah')
<re.Match object; span=(0, 16), match='blah % blah blah'>
re.match only matches at the start of the string. re.search can match anywhere in the string, so add ^ to anchor the pattern at the start when using it:
>>> re.search('^[^%]*%[^%]*$','gfdg%fdgfgfd%')
>>> re.search('^[^%]*%[^%]*$','gfdg%fdgfgfd')
<re.Match object; span=(0, 12), match='gfdg%fdgfgfd'>
I am assuming that "sentence" in your question is the same as a line in the input text. With that assumption, you can use the following:
^[^%\r\n]*(%[^%\r\n]*)?$
This, along with the multi-line and global flags, will match all lines in the input string that contain 0 or 1 '%' symbols.
^ matches the start of a line
[^%\r\n]* matches 0 or more characters that are not '%' or a new line
(...)? matches 0 or 1 instance of the contents in parentheses
% matches '%' literally
$ matches the end of a line
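In Python, the multi-line flag is re.MULTILINE, and the role of a "global" flag is played by re.finditer (or re.findall), which returns every match. The following is a minimal sketch under that assumption; the sample text and the constant name are made up for the example.

```python
import re

# re.MULTILINE makes ^ and $ match at the start and end of every line.
LINE_WITH_AT_MOST_ONE_PERCENT = re.compile(r'^[^%\r\n]*(%[^%\r\n]*)?$', re.MULTILINE)

text = ("group items % together into%\n"
        "characters % that can match any single\n"
        "no percent sign here")

# finditer yields every matching line; group(0) is the whole line.
matching_lines = [m.group(0) for m in LINE_WITH_AT_MOST_ONE_PERCENT.finditer(text)]
print(matching_lines)
```

re.finditer is used instead of re.findall because the pattern contains a capturing group, and findall would return the group contents rather than the whole line.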

Extracting Numbers from a String Without Regular Expressions

I am trying to extract all the numbers from a string composed of digits, symbols and letters.
Multi-digit numbers have to be extracted whole: for example, from "shsgd89shs2011%%5swts" I have to pull out the numbers as they appear (89, 2011 and 5).
What I have so far loops through the string and accumulates the digits, which I like, but I cannot figure out how to make it stop after finishing one run of digits:
def StringThings(strng):
    nums = []
    number = ""
    for each in range(len(strng)):
        if strng[each].isdigit():
            number += strng[each]
        else:
            continue
        nums.append(number)
    return nums
Running it on "6wtwyw66hgsgs" returns ['6', '66', '666'].
What simple way is there of breaking out of the loop once I have gotten what I needed?
Using your function, just use a temp variable to concat each sequence of digits, yielding the groups each time you encounter a non-digit if the temp variable is not an empty string:
def string_things(strng):
    temp = ""
    for ele in strng:
        if ele.isdigit():
            temp += ele
        elif temp:  # if we have a sequence
            yield temp
            temp = ""  # reset temp
    if temp:  # catch ending sequence
        yield temp
Output
In [9]: s = "shsgd89shs2011%%5swts"
In [10]: list(string_things(s))
Out[10]: ['89', '2011', '5']
In [11]: s ="67gobbledegook95"
In [12]: list(string_things(s))
Out[12]: ['67', '95']
Or you could translate the string, replacing letters and punctuation with spaces, then split (string.maketrans is Python 2 only; Python 3 uses str.maketrans instead):
from string import ascii_letters, punctuation, maketrans
s = "shsgd89shs2011%%5swts"
replace = ascii_letters+punctuation
tbl = maketrans(replace," " * len(replace))
print(s.translate(tbl).split())
['89', '2011', '5']
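Another option is itertools.groupby:
from itertools import groupby

L2 = []
file_Name1 = 'shsgd89shs2011%%5swts'
# groupby splits the string into runs of digits (k is True) and non-digits (k is False)
for k, g in groupby(file_Name1, str.isdigit):
    if k:
        L2.append("".join(g))
print(L2)
Result: ['89', '2011', '5']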
Updated to account for trailing numbers:
def StringThings(strng):
    nums = []
    number = ""
    for each in range(len(strng)):
        if strng[each].isdigit():
            number += strng[each]
            # the string ends with a digit: flush the last number
            if each == len(strng) - 1:
                if number != '':
                    nums.append(number)
        if each != 0:
            # a non-digit right after a digit closes the current number
            if not strng[each].isdigit():
                if strng[each - 1].isdigit():
                    nums.append(number)
                    number = ""
                    continue
    return nums
print(StringThings("shsgd89shs2011%%5swts34"))
# returns ['89', '2011', '5', '34']
So, when we reach a character which is not a number, and if the previously observed character was a number, append the contents of number to nums and then simply empty our temporary container number, to avoid it containing all the old stuff.
Note, I don't know Python so the solution may not be very pythonic.
Alternatively, save yourself all the work and just do:
import re
print(re.findall(r'\d+', 'shsgd89shs2011%%5swts'))
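If integers are wanted rather than strings, the re.findall result can be converted directly. This is a small sketch that goes slightly beyond the original answers; the function name extract_numbers is hypothetical.

```python
import re

def extract_numbers(text):
    """Return every run of digits in `text` as an int, in order of appearance."""
    return [int(run) for run in re.findall(r'\d+', text)]

print(extract_numbers('shsgd89shs2011%%5swts'))  # [89, 2011, 5]
```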

Python - Check If string Is In bigger String

I'm working with Python 2.7, and I'm trying to find out whether a word occurs in a string.
For example, I have a string and the word I want to find:
str = "ask and asked, ask are different ask. ask"
word = "ask"
How should I write the code so that the result doesn't include matches that are part of other words? In the example above I want every "ask" except the one inside "asked".
I have tried the following code, but it doesn't work:
def exact_Match(str1, word):
    match = re.findall(r"\\b" + word + "\\b", str1, re.I)
    if len(match) > 0:
        return True
    return False
Can someone please explain how I can do it?
You can use the following function :
>>> test_str = "ask and asked, ask are different ask. ask"
>>> word = "ask"
>>> def finder(s,w):
...     return re.findall(r'\b{}\b'.format(w),s,re.U)
...
>>> finder(test_str,word)
['ask', 'ask', 'ask', 'ask']
Note that the pattern needs \b on both sides so that it only matches at word boundaries.
Or you can use the following function to return the indices of the word in the list of words extracted from the string:
>>> def finder(s,w):
...     return [i for i,j in enumerate(re.findall(r'\b\w+\b',s,re.U)) if j==w]
...
>>> finder(test_str,word)
[0, 3, 6, 7]
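The question's attempt fails because r"\\b" in a raw string is a literal backslash followed by the letter b, not a word boundary. A minimal corrected sketch follows. The use of re.escape is an addition that is not in the original answers; it guards against regex metacharacters in the word.

```python
import re

def exact_match(text, word):
    """Return True if `word` occurs in `text` as a whole word (case-insensitive)."""
    # re.escape (an addition for safety) neutralises characters such as '.' or '+'.
    pattern = r'\b' + re.escape(word) + r'\b'
    return re.search(pattern, text, re.I) is not None

print(exact_match("ask and asked", "ask"))    # True
print(exact_match("she asked twice", "ask"))  # False
```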

Python regexp: my small program cannot distinguish letters from numbers

I am given the coordinates of a polygon and have to validate the input string that contains them.
Here is my code:
import re
t = "(0,0),(0,2),(2,2),(2,0),(0,1)"
#tt = "(0,0),(0,2),(2,2),(2,0),(0,'a')"
p='((\([0-9]+.?[0-9]*(\s)*,(\s)*[0-9]+.?[0-9]*(\s)*\)(\s)*,?(\s)*)+)'
b=re.search(p,t)
if b:
    print "found"
else:
    print "not found"
In both cases (t and tt) it prints "found". Why is that?
Just add anchors. Without them, re.search can match anywhere in your string:
p='^((\([0-9]+.?[0-9]*(\s)*,(\s)*[0-9]+.?[0-9]*(\s)*\)(\s)*,?(\s)*)+)$'
>>> import re
>>> t = "(0,0),(0,2),(2,2),(2,0),(0,1)"
>>> p='((\([0-9]+.?[0-9]*(\s)*,(\s)*[0-9]+.?[0-9]*(\s)*\)(\s)*,?(\s)*)+)'
>>> re.search(p, t)
<_sre.SRE_Match object at 0x01AE7E20>
>>> tt = "(0,0),(0,2),(2,2),(2,0),(0,'a')"
>>> re.search(p, tt)
<_sre.SRE_Match object at 0x01AE7E90>
>>> p='^((\([0-9]+.?[0-9]*(\s)*,(\s)*[0-9]+.?[0-9]*(\s)*\)(\s)*,?(\s)*)+)$'
>>> re.search(p, tt)
>>> #no matching!
The ^ matches the start of the string and the $ matches the end. This makes the pattern match only if the whole string, from beginning to end, consists of number pairs.
That big subexpression is supposed to match an ordered pair (with an optional comma at the end), and I think it does. The + just means "one or more"; tt has four of them, and four is more than one, so the expression gets matched (with those four points as the match). If you want your pattern to match the whole string, then you need begin and end anchors in there, i.e. ^ and $.
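One further detail: the unescaped . in [0-9]+.?[0-9]* matches any character, so a value such as 0x0 would also be accepted. The sketch below is a stricter variant, not the original pattern. It assumes that pairs must be separated by commas and that no trailing comma is allowed; the names NUMBER, PAIR, POINTS and is_point_list are made up for the example.

```python
import re

# One coordinate: digits with an optional fractional part (the dot is escaped).
NUMBER = r"[0-9]+(?:\.[0-9]+)?"
# One pair such as "(0, 2.5)", with optional whitespace around the parts.
PAIR = r"\(\s*" + NUMBER + r"\s*,\s*" + NUMBER + r"\s*\)"
# A comma-separated list of pairs, anchored at both ends of the string.
POINTS = re.compile(r"^" + PAIR + r"(?:\s*,\s*" + PAIR + r")*$")

def is_point_list(text):
    """Return True if `text` consists only of comma-separated numeric pairs."""
    return POINTS.search(text) is not None

print(is_point_list("(0,0),(0,2),(2,2),(2,0),(0,1)"))    # True
print(is_point_list("(0,0),(0,2),(2,2),(2,0),(0,'a')"))  # False
```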

How can I skip blank lines, when comparing two text files, using Python?

I used the following code to compare two text files:
import difflib
with open("D:/Dataset1/data/1/hy/0/Info.txt") as f, open("D:/Dataset1/data/2/hy/0/Info.txt") as g:
    flines = f.readlines()
    glines = g.readlines()
d = difflib.Differ()
diff = d.compare(flines, glines)
print("\n".join(diff))
and I got this result:
- Local Config: HKEY_CURRENT_USER\Software\Microsoft\Uwxa\Kavi
? ^^^ ^^^
+ Local Config: HKEY_CURRENT_USER\Software\Microsoft\Otgad\Hyikqomi
? ^^^ + ^^^^^^^
Any idea how to skip the blank lines?
The result of difflib.Differ.compare already contains newlines.
>>> import difflib
>>> list(difflib.Differ().compare(['1\n', '2\n'], ['1\n', '3\n']))
['  1\n', '- 2\n', '+ 3\n']
>>> print ''.join(difflib.Differ().compare(['1\n', '2\n'], ['1\n', '3\n']))
  1
- 2
+ 3
Joining the result with \n adds extra newlines.
Replace the following line:
print("\n".join(diff))
with (joining with empty string instead of newline):
print("".join(diff))
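If the input files themselves contain blank lines that should not take part in the comparison, one possible approach is to filter them out before calling compare. This is a sketch under that assumption; the function name diff_skipping_blank_lines and the sample lists are hypothetical.

```python
import difflib

def diff_skipping_blank_lines(a_lines, b_lines):
    """Compare two lists of lines, ignoring empty or whitespace-only lines."""
    a = [line for line in a_lines if line.strip()]
    b = [line for line in b_lines if line.strip()]
    return list(difflib.Differ().compare(a, b))

old = ['1\n', '\n', '2\n']
new = ['1\n', '3\n', '\n']
# Each diff line already ends with a newline, so join with an empty string.
print("".join(diff_skipping_blank_lines(old, new)))
```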
I couldn't do it with the linejunk argument of difflib.ndiff (I think that would have been a better solution), but I ended up using strip(), which works for me:
diff = difflib.ndiff(file1.readlines(), file2.readlines())
for x in diff:
    # a diff line that is only "+" or "-" after stripping came from a blank line
    if x.strip() == "+" or x.strip() == "-":
        print("Blank Line... Ignore")
    else:
        print("Non Blank")