Regex Python concatenate lines if some text is found the line below - regex

import re
output = open("teste-out.txt","w")
input = open("teste.txt")
for line in input:
output.write(re.sub(r"\n\r03110", r"|03110", line))
input.close()
output.close()
Why this code isn´t working, anyone can help me fix it? I wanna read from a txt and if the line starts with 03110 I wanna merge only this line with the previous line and add | before the merge
I´ve tried \n03110 \r03110 and other options, but none is working. In notepad++ I can do this using \R++03110 and replace with |03110 using regular expressions, but I wanna a python solution to optimize the job.
Input
01000|0107160
02000|1446
03100|01|316,00
03110|||316,00|0|0|7|
03100|29|135,00
03110|||135,00|0|0|0|
99999|83
00000|00350235201512001|01071603100090489
02000|4720,905|1967,05|0
03100|31|705,26
03100|32|6073,00
03110|||6073,00|0|0|0,00|8
99999|23
Output
01000|0107160
02000|1446
03100|01|316,00|03110|||316,00|0|0|7|
03100|29|135,00|03110|||135,00|0|0|0|
99999|83
00000|00350235201512001|01071603100090489
02000|4720,905|1967,05|0
03100|31|705,26
03100|32|6073,00|03110|||6073,00|0|0|0,00|8
99999|23
I´m using python at windows.

2nd EDIT: sorry - I guess I didn't read carefully enough...
Well, to merge lines with regards to the beginning of the second line is also possible, but perhaps not as beautifully clean:
with open('teste.txt') as fin, open('teste-out.txt', 'w') as fout:
fout.write(next(fin)[:-1])
for line in fin:
if line.startswith('03110'):
fout.write(f'|{line[:-1]}')
else:
fout.write(f'\n{line[:-1]}')
fout.write('\n')
EDIT: solution working with files:
with open('teste.txt') as fin, open('teste-out.txt', 'w') as fout:
for line in fin:
if line.startswith('03100'):
fout.write(line[:-1] + '|' + next(fin))
else:
fout.write(line)
Just for the case of interest - this is no re job imho:
s_in = '''01000|0107160
02000|1446
03100|01|316,00
03110|||316,00|0|0|7|
03100|29|135,00
03110|||135,00|0|0|0|
99999|83
00000|00350235201512001|01071603100090489'''
from io import StringIO
with StringIO(s_in) as fin:
for line in fin:
if line.startswith('03100'):
print(line[:-1] + '|' + next(fin), end='')
else:
print(line, end='')
results in requested
01000|0107160
02000|1446
03100|01|316,00|03110|||316,00|0|0|7|
03100|29|135,00|03110|||135,00|0|0|0|
99999|83
00000|00350235201512001|01071603100090489

For those who like sed, this is a very short solution (not that efficient, though, as it reads all lines before printing anything):
< input_file sed '$!N;s/\n03110/03110/g'
The following sed script is a more efficient solution:
#!/usr/bin/sed -f
:h
N
s/\n03110/|03110/
t h
h
s/\n.*//
p
g
D
For the casual reader who really likes sed like I do, here's a short explanation:
the 4 lines from :h to t h are essentially a "do-while" loop in which we append a new line to the pattern space (N), and we keep doing so (t h is a "goto"), as long as the substitution command (s) is successful in changing the embedded newline \n to a |;
as soon as the s command is unsuccessful, we "save" the multiline pattern space copying it into the hold space (h), safely delete the \n and whatever is after it (s/\n.*//), and finally print the what remains (p), which is the lines that we've been successfully joining;
it's now time to get back the last line we appended which did not start by 03110: we get (g) the multiline back from the hold space, delete \n together with whatever precedes it and go to the top without printing (D).
we are back to the top of the script with a line which is not printed yet, just like we started.

Related

Regex in python trouble

I have a text file that I would like to search through it to see how many of a certain word is in it. I'm getting the wrong count for the words.
File is here
code:
import re
with open('SysLog.txt', 'rt') as myfile:
for line in myfile:
m = re.search('guest', line, re.M|re.I)
if m is not None:
m.group(0)
print( "Found it.")
print('Found',len(m.group()), m.group(),'s')
break
for line in myfile:
n = re.search('Worm', line)
if n is not None:
n.group(0)
print("\n\tNext Match.")
print('Found', len(n.group()), n.group(), 's')
break
for line in myfile:
o = re.search('anonymous', line)
if o is not None:
o.group(0)
print("\n\tNext Match.")
print('Found', len(o.group()), o.group(), 's')
break
There is no need to use a regex, you can use str.count() to make the process much more simple:
with open('SysLog.txt', 'rt') as myfile:
text = myfile.read()
for word in ('guest', 'Worm', 'anonymous'):
print("\n\tNext Match.")
print('Found', text.count(word), word, 's')
To test this, I downloaded the file and ran the code above, and got the output:
Next Match.
Found 4 guest s
Next Match.
Found 91 Worm s
Next Match.
Found 18 anonymous s
which is correct if you do a find on the document in a text editor!
*As a sidenote, I'm not sure why you want to print a tab (\t) before 'Next Match' each time as it just looks weird in the output but it doesn't matter :)
There are multiple problems with your code:
re.search will only give you the first match, if any; this does not have to be a problem, though, as it seems like the word is only expected to appear once per line; otherwise, use re.findall
the line n.group(0) does not do anything without an assignment
len(n.group()) does not give you the number of matches, but the length of the matched string
you break after the first line in the file
myfile is an iterator, so once the first for line in myfile loop has finished, the other two won't have any lines left to loop (it will never finish because of the break anyway, though)
as already noted, you do not need regular expression at all
One (among many) possible ways of doing this would be this (not tested):
counts = {"worm": 0, "guest": 0, "anonymous": 0}
for line in myfile:
for word in counts:
if word in line:
counts[word] += 1

Find and remove specific string from a line

I am hoping to receive some feedback on some code I have written in Python 3 - I am attempting to write a program that reads an input file which has page numbers in it. The page numbers are formatted as: "[13]" (this means you are on page 13). My code right now is:
pattern='\[\d\]'
for line in f:
if pattern in line:
re.sub('\[\d\]',' ')
re.compile(line)
output.write(line.replace('\[\d\]', ''))
I have also tried:
for line in f:
if pattern in line:
re.replace('\[\d\]','')
re.compile(line)
output_file.write(line)
When I run these programs, a blank file is created, rather than a file containing the original text minus the page numbers. Thank you in advance for any advice!
Your if statement won't work because not doing a regex match, it's looking for the literal string \[\d\] in line.
for line in f:
# determine if the pattern is found in the line
if re.match(r'\[\d\]', line):
subbed_line = re.sub(r'\[\d\]',' ')
output_file.writeline(subbed_line)
Additionally, you're using the re.compile() incorrectly. The purpose of it is to pre-compile your pattern into a function. This improves performance if you use the pattern a lot because you only evaluate the expression once, rather than re-evaluating each time you loop.
pattern = re.compile(r'\[\d\]')
if pattern.match(line):
# ...
Lastly, you're getting a blank file because you're using output_file.write() which writes a string as the entire file. Instead, you want to use output_file.writeline() to write lines to the file.
You don't write unmodified lines to your output.
Try something like this
if pattern in line:
#remove page number stuff
output_file.write(line) # note that it's not part of the if block above
That's why your output file is empty.

Python - using raw_input() to search a text document

I am trying to write a simple script that a user can enter what he/she wants to search in a specified txt file. If the word they searching is found then print it to a new text file. This is what I got so far.
import re
import os
os.chdir("C:\Python 2016 Training")
patterns = open("rtr.txt", "r")
what_directory_am_i_in = os.getcwd()
print what_directory_am_i_in
search = raw_input("What you looking for? ")
for line in patterns:
re.findall("(.*)search(.*)", line)
fo = open("test", "wb")
fo.write(line)
fo.close
This successfully creates a file called test, but the output is nothing close to what word was entered into the search variable.
Any advice appreciated.
First of all, you have not read a file
patterns = open("rtr.txt", "r")
this is a file object and not the content of file, to read the file contents you need to use
patterns.readlines()
secondly, re.findall returns a list of matched strings, so you would want to store that. You regex is also not correct as pointed by Hani, It should be
matched = re.findall("(.*)" + search + "(.*)", line)
rather it should be :
if you want the complete line
matched = re.findall(".*" + search + ".*", line)
or simply
matched = line if search in line else None
Thirdly, you don't need to keep opening your output file in the for loop. You are overwriting your file everytime in the loop so it will capture only the last result. Also remember to call the close method on the files.
Hope this helps
you are searching here for all lines that has "search" word in it
you need to get the lines that has the text you entered in the shell
so change this line
re.findall("(.*)search(.*)", line)
to
re.findall("(.*)"+search+"(.*)", line)

How do i delete first 2 lines which match with a text given by me ( using sed )?

How do i delete first 2 lines which match with a text given by me ( using sed ! )
E.g :
#file.txt contains following lines :
abc
def
def
abc
abc
def
And i want to delete first 2 "abc"
Using "sed"
While #EdMorton has pointed out that sed is not the best tool for this job (if you wonder why exactly, see my answer below and compare it to the awk code), my research showed that the solution to the generalized problem
Delete occurences "N" through "M" of a line matching a given pattern using sed
indeed is a very tricky one in my opinion. There seem to be many suggestions for how to replace the "N"th occurence of a matching pattern with sed, but I found that deleting a specific matching line (or a range of lines) is a much more complex undertaking.
While the generalized problem with arbitrary values for N, M, and the pattern would probably be solved best by writing a "sed script generator" on the basis of a Finite State Machine, the solution to the special case asked by the OP is still simple enough to be coded by hand. I must admit that I wasn't very familiar with the obfuscated intricacies of the sed command syntax before, but I found this challenge to be quite useful for gaining more experience with non-trivial sed usage.
Anyway, here's my solution for deleting the first two occurences of a line containing "abc" in a file. If there's a simpler approach, I'm eager to learn about it, as this has taken me some time now.
A final caveat: this assumes GNU sed, as I was unable to find a solution with POSIX sed:
sed -n ':1;/abc/{n;b2;};p;$b4;n;b1;:2;/abc/{n;b3;};p;$b4;n;b2;:3;p;$b4;n;b3;:4;q' file
or, in more verbose syntax:
sed -n '
# BEGIN - look for first match
:first;
/abc/ {
# First match found. Skip line and jump to second section
n; bsecond;
};
# Line does not match. Print it and quit if end-of-file reached
p; $bend;
# Advance to next line and start over
n; bfirst;
# END - look for first match
# BEGIN - look for second match
:second;
/abc/ {
# Second match found. Skip line and jump to final section
n; bfinal;
}
# Line does not match. Print it and quit if end-of-file reached
p; $bend;
# Advance to next line and start over
n; bsecond;
# END - look for second match
# BEGIN - both matches found; print remaining lines
:final;
# Print line and quit if end-of-file reached
p; $bend;
# Advance to next line and start over
n; bfinal;
# END - print remaining lines
# QUIT
:end;
q;
' file
sed is for simple substitutions on individual lines, that is all. For anything else you should be using awk:
$ awk '!(/abc/ && ++c<3)' file
def
def
abc
def

Regular Expression over multiple lines

I'm stuck with this for several hours now and cycled through a wealth of different tools to get the job done. Without success. It would be fantastic, if someone could help me out with this.
Here is the problem:
I have a very large CSV file (400mb+) that is not formatted correctly. Right now it looks something like this:
This is a long abstract describing something. What follows is the tile for this sentence."
,Title1
This is another sentence that is running on one line. On the next line you can find the title.
,Title2
As you can probably see the titles ",Title1" and ",Title2" should actually be on the same line as the foregoing sentence. Then it would look something like this:
This is a long abstract describing something. What follows is the tile for this sentence.",Title1
This is another sentence that is running on one line. On the next line you can find the title.,Title2
Please note that the end of the sentence can contain quotes or not. In the end they should be replaced too.
Here is what I came up with so far:
sed -n '1h;1!H;${;g;s/\."?.*,//g;p;}' out.csv > out1.csv
This should actually get the job done of matching the expression over multiple lines. Unfortunately it doesn't :)
The expression is looking for the dot at the end of the sentence and the optional quotes plus a newline character that I'm trying to match with .*.
Help much appreciated. And it doesn't really matter what tool gets the job done (awk, perl, sed, tr, etc.).
Multiline in sed isn't necessarily tricky per se, it's just that it uses commands most people aren't familiar with and have certain side effects, like delimiting the current line from the next line with a '\n' when you use 'N' to append the next line to the pattern space.
Anyway, it's much easier if you match on a line that starts with a comma to decide whether or not to remove the newline, so that's what I did here:
sed 'N;/\n,/s/"\? *\n//;P;D' title_csv
Input
$ cat title_csv
don't touch this line
don't touch this line either
This is a long abstract describing something. What follows is the tile for this sentence."
,Title1
seriously, don't touch this line
This is another sentence that is running on one line. On the next line you can find the title.
,Title2
also, don't touch this line
Output
$ sed 'N;/\n,/s/"\? *\n//;P;D' title_csv
don't touch this line
don't touch this line either
This is a long abstract describing something. What follows is the tile for this sentence.,Title1
seriously, don't touch this line
This is another sentence that is running on one line. On the next line you can find the title.,Title2
also, don't touch this line
Yours works with a couple of small changes:
sed -n '1h;1!H;${;g;s/\."\?\n,//g;p;}' inputfile
The ? needs to be escaped and . doesn't match newlines.
Here's another way to do it which doesn't require using the hold space:
sed -n '${p;q};N;/\n,/{s/"\?\n//p;b};P;D' inputfile
Here is a commented version:
sed -n '
$ # for the last input line
{
p; # print
q # and quit
};
N; # otherwise, append the next line
/\n,/ # if it starts with a comma
{
s/"\?\n//p; # delete an optional comma and the newline and print the result
b # branch to the end to read the next line
};
P; # it doesn't start with a comma so print it
D # delete the first line of the pair (it's just been printed) and loop to the top
' inputfile