Python Regex pattern matching with a variable name not working - python-2.7

The below code does not return True for the match. I am wondering why? Any help is appreciated.
Note:
id_list = ['YYY-100', 'YYYMM1640ASS20', 'Cruzer', 'SSDSC2BA20', 'BBBPEDMD40']
'drives.txt' contains lines like this (and does contain above IDs in some lines).
'RED SSDSC2BA200G4R 200 GB 2.5 SATA 6G Class E: 30,000-100,000 writes per second'
So I would assume that id 'SSDSC2BA20' will match the second word in above line, but below match does not return True.
For double-checking, I tried 'if match: print match.group()' but that returns nothing as well. What am I missing?
import re
with open('drives.txt', 'r') as fr:
    for id in id_list:
        for line in fr:
            match = re.search(r'%s' % id, line, re.I)
            if match:
                print 'True'
Note that instead of above regex, I tried the below also, but that did not work either.
my_regex = r".?" + re.escape(id) + r".?"
match = re.search(my_regex, line, re.I)

fr is a file object, which is an iterator over the file's lines. With your current approach, you try to iterate over the lines multiple times, once for each ID. Don't do this. Every time you read a line, the file position advances, and the inner loop keeps reading until it reaches the end of the file. This happens during the first iteration of the outer loop, so on every later iteration the inner loop has no lines left to read and its body never runs.
One fix for this is to call fr.seek(0, 0) after each inner loop, which I don't recommend because it rereads the whole file for every ID. The other fix is to reorder your loops so that you iterate over your file only once. Here's how you do that:
with open('drives.txt', 'r') as fr:
    for line in fr:
        for id in id_list:
            match = re.search(r'%s' % id, line, re.I)
            if match:
                print id, 'matches for line:', line
Also, I should mention that using id as a variable name shadows the builtin id() function, so I recommend you change it.
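The exhaustion of the file object can be reproduced without a real file. The following is a minimal sketch in Python 3 syntax (the question itself uses Python 2.7); io.StringIO stands in for drives.txt, and the second sample line, the function name find_matches, the variable name drive_id, and the use of re.escape are assumptions made for illustration, not part of the original code.

```python
import io
import re

id_list = ['YYY-100', 'YYYMM1640ASS20', 'Cruzer', 'SSDSC2BA20', 'BBBPEDMD40']

# Hypothetical stand-in for the contents of drives.txt.
SAMPLE = (
    "RED SSDSC2BA200G4R 200 GB 2.5 SATA 6G Class E: 30,000-100,000 writes per second\n"
    "BLUE Cruzer Blade 16 GB USB\n"
)


def find_matches(fr, ids):
    """Read the file once and test every ID against each line."""
    found = []
    for line in fr:
        for drive_id in ids:
            # re.escape keeps characters such as '-' or '.' literal.
            if re.search(re.escape(drive_id), line, re.I):
                found.append((drive_id, line.strip()))
    return found


# A file object is a one-shot iterator: a second pass yields nothing.
fr = io.StringIO(SAMPLE)
first_pass = list(fr)
second_pass = list(fr)

matches = find_matches(io.StringIO(SAMPLE), id_list)
```

Here second_pass is empty, which is what the inner loop of the original code sees for every ID after the first one.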

Related

Conditionally extracting the beginning of a regex pattern

I have a list of strings containing the names of actors in a movie that I want to extract. In some cases, the actor's character name is also included which must be ignored.
Here are a couple of examples:
# example 1
input = 'Levan Gelbakhiani as Merab\nAna Javakishvili as Mary\nAnano Makharadze'
expected_output = ['Levan Gelbakhiani', 'Ana Javakishvili', 'Anano Makharadze']
# example 2
input = 'Yoosuf Shafeeu\nAhmed Saeed\nMohamed Manik'
expected_output = ['Yoosuf Shafeeu', 'Ahmed Saeed', 'Mohamed Manik']
Here is what I've tried to no avail:
import re
output = re.findall(r'(?:\\n)?([\w ]+)(?= as )?', input)
output = re.findall(r'(?:\\n)?([\w ]+)(?: as )?', input)
output = re.findall(r'(?:\\n)?([\w ]+)(?:(?= as )|(?! as ))', input)
The \n in the input string are new line characters. We can make use of this fact in our regex.
Essentially, each line always begins with the actor's name. After the actor's name, there is either the word as or the end of the line.
Using this info, we can write the regex like this:
^(?:[\w ]+?)(?:(?= as )|$)
First, we assert that we must be at the start of the line ^. Then we match some word characters and spaces lazily [\w ]+?, until we see (?:(?= as )|$), either as or the end of the line.
In code,
output = re.findall(r'^(?:[\w ]+?)(?:(?= as )|$)', input, re.MULTILINE)
Remember to use the multiline option. That is what makes ^ and $ mean "start/end of line".
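As a quick check, the pattern can be run against both examples from the question. This is a small sketch; the names text1, text2, and ACTOR are chosen here for illustration (the question uses input, which shadows a builtin).

```python
import re

text1 = 'Levan Gelbakhiani as Merab\nAna Javakishvili as Mary\nAnano Makharadze'
text2 = 'Yoosuf Shafeeu\nAhmed Saeed\nMohamed Manik'

# With re.MULTILINE, ^ and $ anchor at every line boundary. The lazy
# [\w ]+? stops at the first position followed by ' as ' or the line end.
ACTOR = re.compile(r'^(?:[\w ]+?)(?:(?= as )|$)', re.MULTILINE)

actors1 = ACTOR.findall(text1)
actors2 = ACTOR.findall(text2)
```

Because the pattern has no capturing groups, findall returns the full match for each line, which is the actor's name.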
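You can do this without a regular expression as well.
Here is the code (splitting on ' as ' with both spaces, so that a name part that merely begins with "as" is not cut off):
output = [x.split(' as ')[0] for x in input.split('\n')]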
I guess you can combine the values obtained from two regex alternatives:
re.findall(r'(?:\n)?(.+)(?:\W[a][s].*?)|(?:\n)?(.+)$', input)
gives
[('Levan Gelbakhiani', ''), ('Ana Javakishvili', ''), ('', 'Anano Makharadze')]
from which you filter the empty strings out
output = list(map(lambda x : list(filter(len, x))[0], output))
gives
['Levan Gelbakhiani', 'Ana Javakishvili', 'Anano Makharadze']

Regex Python concatenate lines if some text is found the line below

import re
output = open("teste-out.txt","w")
input = open("teste.txt")
for line in input:
    output.write(re.sub(r"\n\r03110", r"|03110", line))
input.close()
output.close()
Why isn't this code working? Can anyone help me fix it? I want to read from a text file and, if a line starts with 03110, merge only that line with the previous line, adding | before the merged part.
I've tried \n03110, \r03110 and other options, but none of them works. In Notepad++ I can do this with regular expressions by searching for \R++03110 and replacing it with |03110, but I want a Python solution to optimize the job.
Input
01000|0107160
02000|1446
03100|01|316,00
03110|||316,00|0|0|7|
03100|29|135,00
03110|||135,00|0|0|0|
99999|83
00000|00350235201512001|01071603100090489
02000|4720,905|1967,05|0
03100|31|705,26
03100|32|6073,00
03110|||6073,00|0|0|0,00|8
99999|23
Output
01000|0107160
02000|1446
03100|01|316,00|03110|||316,00|0|0|7|
03100|29|135,00|03110|||135,00|0|0|0|
99999|83
00000|00350235201512001|01071603100090489
02000|4720,905|1967,05|0
03100|31|705,26
03100|32|6073,00|03110|||6073,00|0|0|0,00|8
99999|23
I'm using Python on Windows.
2nd EDIT: sorry - I guess I didn't read carefully enough...
Well, merging lines based on how the second line begins is also possible, but perhaps not as beautifully clean:
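with open('teste.txt') as fin, open('teste-out.txt', 'w') as fout:
    # rstrip('\n') is safe even if the last line has no trailing newline
    fout.write(next(fin).rstrip('\n'))
    for line in fin:
        if line.startswith('03110'):
            fout.write('|' + line.rstrip('\n'))  # append to the previous line
        else:
            fout.write('\n' + line.rstrip('\n'))  # start a new output line
    fout.write('\n')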
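The same idea can be sketched as a small function that works on any iterable of lines, which makes it easy to try on the tricky part of the input (two consecutive 03100 lines). The function name merge_continuations and its prefix parameter are hypothetical, introduced only for this illustration.

```python
import io


def merge_continuations(lines, prefix='03110'):
    """Join every line starting with `prefix` onto the previous line with '|'."""
    merged = []
    for raw in lines:
        line = raw.rstrip('\r\n')
        if line.startswith(prefix) and merged:
            merged[-1] += '|' + line  # continuation: glue to the previous line
        else:
            merged.append(line)       # ordinary line: start a new entry
    return merged


sample = io.StringIO(
    "03100|31|705,26\n"
    "03100|32|6073,00\n"
    "03110|||6073,00|0|0|0,00|8\n"
    "99999|23\n"
)
result = merge_continuations(sample)
```

Only the second 03100 line picks up the 03110 line; the first one is left alone, as in the requested output.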
EDIT: solution working with files:
with open('teste.txt') as fin, open('teste-out.txt', 'w') as fout:
    for line in fin:
        if line.startswith('03100'):
            fout.write(line[:-1] + '|' + next(fin))
        else:
            fout.write(line)
Just in case it is of interest, this is no job for re, imho:
s_in = '''01000|0107160
02000|1446
03100|01|316,00
03110|||316,00|0|0|7|
03100|29|135,00
03110|||135,00|0|0|0|
99999|83
00000|00350235201512001|01071603100090489'''
from io import StringIO
with StringIO(s_in) as fin:
    for line in fin:
        if line.startswith('03100'):
            print(line[:-1] + '|' + next(fin), end='')
        else:
            print(line, end='')
results in the requested output:
01000|0107160
02000|1446
03100|01|316,00|03110|||316,00|0|0|7|
03100|29|135,00|03110|||135,00|0|0|0|
99999|83
00000|00350235201512001|01071603100090489
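For comparison, a regex can still do the job if it is applied to the whole text rather than to one line at a time. The sketch below, with a shortened sample and variable names chosen for illustration, also suggests why the re.sub call in the question never matched: each line handed to it ends with its newline, so a newline followed by 03110 never occurs inside a single line. The optional \r is an assumption to tolerate Windows line endings that were not translated on reading.

```python
import re

text = (
    "03100|01|316,00\n"
    "03110|||316,00|0|0|7|\n"
    "99999|83\n"
)

# Line by line: the newline is the last character, so nothing follows it
# and the pattern cannot match.
per_line = [
    re.sub(r"\r?\n03110", "|03110", line)
    for line in text.splitlines(keepends=True)
]

# Whole text: the newline and the next line's prefix are adjacent.
joined = re.sub(r"\r?\n03110", "|03110", text)
```

This reads the entire file into memory, which is fine for small files; the line-based loops above avoid that.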
For those who like sed, this is a very short solution (not that efficient, though, as it reads all lines into the pattern space before printing anything):
< input_file sed ':a;N;$!ba;s/\n03110/|03110/g'
The following sed script is a more efficient solution:
#!/usr/bin/sed -f
:h
N
s/\n03110/|03110/
t h
h
s/\n.*//
p
g
D
For the casual reader who really likes sed like I do, here's a short explanation:
the 4 lines from :h to t h are essentially a "do-while" loop in which we append a new line to the pattern space (N), and we keep doing so (t h is a "goto"), as long as the substitution command (s) is successful in changing the embedded newline \n to a |;
as soon as the s command is unsuccessful, we "save" the multiline pattern space by copying it into the hold space (h), safely delete the \n and whatever is after it (s/\n.*//), and finally print what remains (p), which is the lines that we've been successfully joining;
it's now time to get back the last line we appended which did not start by 03110: we get (g) the multiline back from the hold space, delete \n together with whatever precedes it and go to the top without printing (D).
we are back to the top of the script with a line which is not printed yet, just like we started.

Regex in python trouble

I have a text file that I would like to search to see how many times a certain word occurs in it. I'm getting the wrong count for the words.
File is here
code:
import re
with open('SysLog.txt', 'rt') as myfile:
    for line in myfile:
        m = re.search('guest', line, re.M|re.I)
        if m is not None:
            m.group(0)
            print( "Found it.")
            print('Found',len(m.group()), m.group(),'s')
            break
    for line in myfile:
        n = re.search('Worm', line)
        if n is not None:
            n.group(0)
            print("\n\tNext Match.")
            print('Found', len(n.group()), n.group(), 's')
            break
    for line in myfile:
        o = re.search('anonymous', line)
        if o is not None:
            o.group(0)
            print("\n\tNext Match.")
            print('Found', len(o.group()), o.group(), 's')
            break
There is no need to use a regex; you can use str.count() to make the process much simpler:
with open('SysLog.txt', 'rt') as myfile:
    text = myfile.read()
    for word in ('guest', 'Worm', 'anonymous'):
        print("\n\tNext Match.")
        print('Found', text.count(word), word, 's')
To test this, I downloaded the file and ran the code above, and got the output:
Next Match.
Found 4 guest s
Next Match.
Found 91 Worm s
Next Match.
Found 18 anonymous s
which is correct if you do a find on the document in a text editor!
*As a sidenote, I'm not sure why you want to print a tab (\t) before 'Next Match' each time as it just looks weird in the output but it doesn't matter :)
There are multiple problems with your code:
re.search will only give you the first match, if any; this does not have to be a problem, though, as it seems like the word is only expected to appear once per line; otherwise, use re.findall
the line n.group(0) does not do anything without an assignment
len(n.group()) does not give you the number of matches, but the length of the matched string
you break after the first line in the file
myfile is an iterator, so once the first for line in myfile loop has finished, the other two won't have any lines left to loop (it will never finish because of the break anyway, though)
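as already noted, you do not need regular expressions at all
One (among many) possible ways of doing this would be the following (not tested); it counts the lines that contain each word:
counts = {"worm": 0, "guest": 0, "anonymous": 0}
with open('SysLog.txt', 'rt') as myfile:
    for line in myfile:
        lowered = line.lower()  # the keys are lowercase, so compare in lowercase
        for word in counts:
            if word in lowered:
                counts[word] += 1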
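The difference between case-sensitive and case-insensitive counting can be seen in a small sketch. The log excerpt below is made up for illustration and is not the real SysLog.txt; the names exact and any_case are likewise assumptions.

```python
import re

# Hypothetical log excerpt standing in for SysLog.txt.
text = (
    "Guest login failed for guest account\n"
    "Worm detected: Worm.Win32\n"
    "anonymous FTP session opened\n"
)

words = ('guest', 'Worm', 'anonymous')

# str.count is case-sensitive and counts non-overlapping occurrences.
exact = {w: text.count(w) for w in words}

# re.findall with re.IGNORECASE counts occurrences regardless of case;
# re.escape keeps any special characters in the word literal.
any_case = {
    w: len(re.findall(re.escape(w), text, re.IGNORECASE))
    for w in words
}
```

With this sample, 'guest' is counted once by str.count but twice when case is ignored, because "Guest" at the start of the line differs only in case.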

Python - using raw_input() to search a text document

I am trying to write a simple script in which a user can enter what he/she wants to search for in a specified txt file. If the word they are searching for is found, it should be printed to a new text file. This is what I've got so far.
import re
import os
os.chdir("C:\Python 2016 Training")
patterns = open("rtr.txt", "r")
what_directory_am_i_in = os.getcwd()
print what_directory_am_i_in
search = raw_input("What you looking for? ")
for line in patterns:
    re.findall("(.*)search(.*)", line)
    fo = open("test", "wb")
    fo.write(line)
    fo.close
This successfully creates a file called test, but the output is nothing close to what word was entered into the search variable.
Any advice appreciated.
First of all, note what open() returns:
patterns = open("rtr.txt", "r")
This is a file object and not the content of the file. Iterating over it with a for loop already yields one line at a time; if you want all the lines in a list, use
patterns.readlines()
Secondly, re.findall returns a list of matched strings, so you need to store its result. Your regex is also not correct, as pointed out by Hani: "search" inside the quotes is the literal word, not your variable. It should be
matched = re.findall("(.*)" + search + "(.*)", line)
or rather, if you want the complete line,
matched = re.findall(".*" + search + ".*", line)
or simply
matched = line if search in line else None
Thirdly, you don't need to keep opening your output file inside the for loop. You are overwriting the file on every iteration, so it will contain only the last line written. Also remember to actually call the close method on the files: fo.close without parentheses does nothing.
Hope this helps
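Put together, the three points above could look like the following sketch. It is written in Python 3 syntax, uses a plain substring test instead of a regex, and writes to a temporary path; the function name copy_matching_lines, the sample lines, and the output file name are assumptions for illustration only.

```python
import os
import tempfile


def copy_matching_lines(lines, search, out_path):
    """Write every line containing `search` to out_path; return the count."""
    count = 0
    # Open the output once, outside the loop; `with` closes it afterwards.
    with open(out_path, "w") as fo:
        for line in lines:
            if search in line:
                fo.write(line)
                count += 1
    return count


# Hypothetical stand-ins for the lines of rtr.txt and the user's input.
sample_lines = ["interface Gi0/1\n", "router ospf 1\n", "interface Gi0/2\n"]
out_path = os.path.join(tempfile.mkdtemp(), "test.txt")
n_written = copy_matching_lines(sample_lines, "interface", out_path)
```

In the real script, lines would be the open file object and search the value returned by raw_input() (input() in Python 3).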
Here you are searching for all lines that contain the literal word "search".
You need to get the lines that contain the text you entered in the shell,
so change this line
re.findall("(.*)search(.*)", line)
to
re.findall("(.*)"+search+"(.*)", line)

Python and Regex with special characters

I can't get my regex to work as desired in my Python 3 code.
I am trying to parse a file to find a specific pattern (the exact pattern is Total Optimized).
I am doing this because the file can contain lines which say """Total Optimization (Active)""" and other permutations. I have tried the following lines. None work
PkOp = re.compile(r'Total Optimized\t\d')
PkOp = re.compile(r'Total Optimized\t[^(Active)]')
My basic code (which is simplified here) is meant to just print the matching line out. If I got that working, I would then choose the array item I wanted, such as
PkOp = PkOpArray[4]
App = re.compile(r'Appliance\s(Active)')
PkOp = re.compile(r"Total Optimized\t\d")
with open("SteelheadMetric2.txt","r") as f:
    with open("mydumbfile.txt","w") as o:
        for line in f:
            line = line.lstrip()
            matches = PkOp.findall(line)
            for firestick in matches:
                PkOpArray = line.split()
                PkOp = PkOpArray
                print(PkOp)
Mostly I get this error
matches = PkOp.findall(line)
AttributeError: 'list' object has no attribute 'findall'
If I remove the slash characters I can get it to show lines with 'Total Optimization' or 'Appliance' whatever. I just can't be more specific in what I want.
What am I missing? It works fine if I just compile a text string but to use special regex like whitespace, tab digit it fails. The regex checks out in notepad++
When you write PkOp = PkOpArray you have just changed your regex into a list.
If you delete that line, and change your print(PkOp) to print(PkOpArray), it should fix your problem, assuming the rest of your code is correct.
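A minimal sketch of the corrected loop follows. The sample lines are invented, since the real layout of SteelheadMetric2.txt is not shown, and the names pk_op_re and fields_per_match are hypothetical; the point is only that the compiled pattern and the split fields live under different names.

```python
import io
import re

# Hypothetical sample; the real file layout is an assumption.
sample = io.StringIO(
    "Total Optimization (Active)\n"
    "  Total Optimized\t1234\t56\n"
)

pk_op_re = re.compile(r"Total Optimized\t\d")  # the pattern keeps its own name

fields_per_match = []
for line in sample:
    line = line.lstrip()
    if pk_op_re.search(line):
        # Store the split fields under a different name than the regex,
        # so the next iteration can still call pk_op_re.search().
        fields_per_match.append(line.split())
```

Since the regex is never rebound, the AttributeError from the question cannot occur on later lines.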