Extract rows from csv file using regex substring?

I have a csv file that looks like this (obviously < anystring > means just that).
<anystring>tony_UPearly_start,1,2,3
<anystring>tony_UPlate_start,4,5,6
<anystring>tony_UP<anystring>_start,7,8,9
<anystring>jane_UPearly_start,1,2,3
<anystring>jane_UPlate_start,4,5,6
<anystring>jane_UP<anystring>_start,7,8,9
I am trying to extract the UP<anystring> rows (rows 3 and 6 in this example) using a negative lookahead to exclude rows 1, 2, 4 and 5.
import re
import csv
search = re.compile(r'.*_UP(?!early|late)')
output = []
with open('test.csv', mode='r', encoding='utf-8') as f:
    csvfile = csv.reader(f)
    for row in csvfile:
        if row[0] == search:
            output.append(row)
print(output)
>>>[]
when what I am after is
print (output)
[<anystring>tony_UP<anystring>_start,7,8,9, <anystring>jane_UP<anystring>_start,7,8,9]
The regex search works when I test on a regex platform but not in python?
Thanks for the comments: the search code now looks like
search = re.compile(r'^.*?_UP(?!early|late).*$')
output = []
with open('test.csv', mode='r', encoding='utf-8') as f:
    csvfile = csv.reader(f)
    for row in csvfile:
        search.search(row[0])  # I think this needs an "if ... == True", but it won't accept a boolean here?
        output.append(row)
This now returns all rows (i.e. it filters nothing, whereas before it filtered everything).

You want to return a list of rows that contain _UP not followed by early or late.
The pattern should look like
search = re.compile(r'_UP(?!early|late)')
You do not need any ^, .*, etc. because when you use re.search, you are looking for a pattern match anywhere inside a string.
Then, all you need is to test each row for the regex match. With csv.reader every row is a list of fields, so test the first field:
if search.search(row[0]):
    output.append(row)
See the Python demo:
import re
csvfile="""<anystring>tony_UPearly_start,1,2,3
<anystring>tony_UPlate_start,4,5,6
<anystring>tony_UP<anystring>_start,7,8,9
<anystring>jane_UPearly_start,1,2,3
<anystring>jane_UPlate_start,4,5,6
<anystring>jane_UP<anystring>_start,7,8,9""".splitlines()
search = re.compile(r'_UP(?!early|late)')
output = []
for row in csvfile:
    if search.search(row):
        output.append(row)
print(output)
And the output is your expected list:
['<anystring>tony_UP<anystring>_start,7,8,9', '<anystring>jane_UP<anystring>_start,7,8,9']
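The demo above iterates over plain strings. As a minimal sketch of the same filter applied through csv.reader, where each row is a list and the text to test is in row[0], something like the following should work. The helper name filter_rows and the in-memory sample (with "x" standing in for <anystring>) are assumptions made for illustration, not part of the original code.

```python
import csv
import io
import re

# Same pattern as above: "_UP" not immediately followed by "early" or "late".
search = re.compile(r'_UP(?!early|late)')

def filter_rows(lines):
    """Return the csv rows whose first field matches the pattern."""
    output = []
    for row in csv.reader(lines):
        # csv.reader yields each row as a list of strings, so test row[0].
        if row and search.search(row[0]):
            output.append(row)
    return output

# Hypothetical stand-in for open('test.csv', ...); "x" replaces <anystring>.
sample = io.StringIO(
    "xtony_UPearly_start,1,2,3\n"
    "xtony_UPlate_start,4,5,6\n"
    "xtony_UPmid_start,7,8,9\n"
)
rows = filter_rows(sample)
print(rows)  # [['xtony_UPmid_start', '7', '8', '9']]
```

The search.search(...) call returns a match object or None, so it can be used directly as the condition of the if statement.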

Related

How to format this txt file using Regex

Have a .txt file with data folded up into a single column, looking to turn it into a .csv so I can import it into a DB table.
Source file:
1000
AAAAAAAAAA
100,000.00
2000
BBBBBBBBBB
200,000.00
3000
CCCCCCCCCC
300,000.00
4000
DDDDDDDDDD
400,000.00
Looking to turn it into:
1000,AAAAAAAAA,100,000.00
2000,BBBBBBBBB,200,000.00
3000,CCCCCCCCC,300,000.00
4000,DDDDDDDDD,400,000.00
I've tried this so far and am stuck there:
find - ^(\d+)(\s)
substitute - $1,
That gets me this output:
1000,AAAAAAAAA
100,000.00
2000,BBBBBBBBB
200,000.00
3000,CCCCCCCCC
300,000.00
4000,DDDDDDDDD
400,000.00
Would love any pointers to move ahead.
Thanks,
CH
Try the following find and replace:
Find: (.*)\r?\n(.*)\r?\n(.*)(?:\r?\n|$)
Replace: $1|$2|$3\n
This approach captures each of three successive lines, and then concatenates together into a single line using pipe as the separator. Note carefully that it is not acceptable to use comma as a separator here, because some of your numeric data already uses comma.
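If the same three-line fold has to be done in a script rather than in an editor, a rough Python equivalent could look like this. It is only a sketch: the function name fold_records and the sample text are made up for illustration, and it assumes the text uses "\n" line endings (which is what Python gives you when a file is read in text mode).

```python
import re

# Three consecutive lines; the last one may end the text without a newline.
three_lines = re.compile(r'(.*)\n(.*)\n(.*)(?:\n|$)')

def fold_records(text):
    """Join every run of three lines into one pipe-separated line."""
    # In Python's re.sub, groups are written \1, \2, \3 rather than $1, $2, $3.
    return three_lines.sub(r'\1|\2|\3\n', text)

# Hypothetical sample in the shape of the source file.
source = "1000\nAAAAAAAAAA\n100,000.00\n2000\nBBBBBBBBBB\n200,000.00\n"
folded = fold_records(source)
print(folded)
```

The pipe keeps the amounts such as 100,000.00 unambiguous, for the same reason given above.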
If every record consists of 3 lines, maybe try reading the txt file line by line, collecting three lines at a time, and then writing each group as one row of a csv file?
For example in python:
import csv

def write_to_csv(record):
    # Append one three-field record as a single CSV row.
    # csv.writer quotes fields such as 100,000.00 that contain a comma.
    with open('new.csv', 'a', newline='') as write_file:
        csv.writer(write_file).writerow(record)

result = []
with open('yourfile.txt', 'r') as txtfile:
    for line in txtfile:
        result.append(line.strip())
        # Every third line completes a record.
        if len(result) == 3:
            write_to_csv(result)
            result = []
You can use a regex like this:
(\d+)\n(\w+)\n([\d,.]+)
With this replacement string:
$1,$2,$3

How to find and replace the pattern ");" in a text file?

I have a text file which contains special characters. I want to replace ");" with "firstdata);seconddata".
The catch is that both ")" and ";" should be together and then replaced with "firstdata);seconddata".
I have the following code.
import re
string = open('trial.txt').read()
new_str = re.sub('[);]', 'firstdata);seconddata', string)
open('b.txt', 'w').write(new_str)
Please suggest how to change my code to get the right output.
This should do:
import re
with open("file.txt", "r") as rfile:
    s = rfile.read()
# Escape ")" so it is taken literally; the pattern only matches ")" followed by ";".
rplce = re.sub(r'\);', 'firstdata);seconddata', s)
with open("file.txt", "w") as wfile:
    wfile.write(rplce)
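The reason the pattern in the question misbehaves is that square brackets form a character class: '[);]' matches a single ")" or a single ";" wherever either one appears, not the two characters together. A small sketch of the difference, using a made-up sample string:

```python
import re

# Hypothetical sample text, not taken from trial.txt.
text = "call(a); b; (c)"

# Character class: every ")" and every ";" is replaced on its own.
either = re.sub(r'[);]', 'X', text)

# Escaped sequence: only ")" immediately followed by ";" is replaced.
together = re.sub(r'\);', 'X', text)

print(either)    # call(aXX bX (cX
print(together)  # call(aX b; (c)
```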
You can use the built in str.replace() method in Python
string = "foobar);"
string.replace(");", 'firstdata);seconddata') # -> 'foobarfirstdata);seconddata'
Here are the docs for common string operations like this in Python
https://docs.python.org/3/library/string.html
You can also do it more simply, line by line, without regex:
with open('input_file.txt', 'r') as input_file:
    with open('output_file.txt', 'w') as output_file:
        for line in input_file:
            x = line.replace('findtext', 'replacetext')
            output_file.write(x)

list using rows values from nth column till end of line

I have a comma-separated text file with rows like the ones below, and I want to create a list from the 6th column up to the last comma:
FILE :-
*>,1.66.0.0/22,202.79.200.1,200,0,64515,4445,4445,64697,64697,64697,64697,i
*,14.0.184.0/24,202.79.200.64,200,0,64515,3491,9444,64574,?
Output Expected:-
List[1] = "64515,4445,4445,64697,64697,64697,64697"
List[2] = "64515,3491,9444,64574"
I have tried the code below, but it returns all values from the first comma instead of starting from the 6th column. I also need the values enclosed in "" as shown above:
for line in txtfile:
    line.split(',')
You're best off using other libraries for this (csv or pandas, for example), but if you want to do it without them, you'd look at something like this:
data = []
with open('file.ext', 'r') as f:
    for line in f:
        data.append(','.join(line.split(',')[5:-1]))
Split each line on commas, keep the fields from the sixth one onward except the last, then join them again:
lst = []
with open("input.txt") as f:
    for line in f:
        lst.append(','.join(line.split(',')[5:][:-1]))
print(lst)
Note this is a simple split and join approach.
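To see what the slice does, here is a small self-contained sketch run on the two sample rows from the question. The helper name as_path is made up for illustration; index 5 is the sixth field, and -1 drops the trailing "i" or "?" field.

```python
def as_path(line):
    """Return fields 6..second-to-last of a comma-separated line, re-joined."""
    fields = line.rstrip('\n').split(',')
    # fields[5] is the 6th column; the -1 bound leaves out the last field.
    return ','.join(fields[5:-1])

# The two sample rows from the question.
sample = [
    "*>,1.66.0.0/22,202.79.200.1,200,0,64515,4445,4445,64697,64697,64697,64697,i\n",
    "*,14.0.184.0/24,202.79.200.64,200,0,64515,3491,9444,64574,?\n",
]
paths = [as_path(line) for line in sample]
print(paths)
```

Each element is already a Python string, so no extra quoting step is needed; the quotes in the expected output are just how strings are displayed.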

pyspark not working with regex

I've made RDD from a file with list of urls:
url_data = sc.textFile("url_list.txt")
Now I'm trying to make another RDD with all rows that contain 'net.com' where this string is preceded by a character that is not a letter or digit. I mean: include lines with .net.com or \tnet.com and exclude internet.com or cnet.com.
filtered_data = url_data.filter(lambda x: '[\W]net\.com' in x)
But this line gives no results.
How can I make the pyspark shell work with regex?
Why not define a function in Python that uses the re or re2 (much faster) package and returns True if there is a match?
import re

# Compile once; REGEX_PATTERN is a placeholder for your own pattern.
pattern = re.compile(r'REGEX_PATTERN')

def url_filter(url):
    # search() looks for the pattern anywhere in the string;
    # match() would only test the start of it.
    return pattern.search(url) is not None
Then just pass it to the filter function: url_data.filter(url_filter)
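The line in the question returns nothing because the in operator does a plain substring test: it looks for the literal text '[\W]net\.com', which no URL contains. As a rough sketch of the regex version, the predicate below uses the asker's pattern and is tried on an ordinary Python list so it runs without Spark. The names net_com and urls and the sample URLs are assumptions for illustration.

```python
import re

# A non-word character (anything other than a letter, digit or underscore)
# followed by the literal text "net.com".
net_com = re.compile(r'\Wnet\.com')

def url_filter(url):
    """True if 'net.com' appears right after a non-word character."""
    return net_com.search(url) is not None

# Hypothetical sample lines standing in for the RDD contents.
urls = ["www.net.com/home", "\tnet.com", "internet.com", "cnet.com"]
kept = list(filter(url_filter, urls))
print(kept)
```

With an RDD the same function can be passed directly, as in url_data.filter(url_filter). Note that \W needs a character in front of "net.com", so a line that begins with "net.com" is not matched by this pattern.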

Regular expression syntax in python

I am trying to write a Python script to analyse a data txt file. I want the script to do the following:
find all the time values in one line and compare them. But this is my first time writing RE syntax, so I wrote a small script first.
My script is:
import sys
txt = open('1.txt','r')
a = []
for eachLine in txt:
    a.append(eachLine)
import re
pattern = re.compile('\d{2}:\d{2}:\d{2}')
for i in xrange(len(a)):
    print pattern.match(a[i])
#print a
and the output is always None.
my txt is just like the picture:
What's the problem? Please help me, thanks a lot.
My Python is 2.7.2 and my OS is Windows XP SP3.
Didn't you miss one of the ":" in your regex? I think you meant
re.compile('\d{2}:\d{2}:\d{2}')
The other problems are:
First, if you want to search in the whole text, use search instead of match, because match only tests the beginning of the string. Second, to access your result you need to call group() on the match object returned by your search.
Try it:
import sys
txt = open('1.txt','r')
a = []
for eachLine in txt:
    a.append(eachLine)
import re
pattern = re.compile('\d{2}:\d{2}:\d{2}')
for i in xrange(len(a)):
    match = pattern.search(a[i])
    # search() returns None for lines without a time, so check before calling group().
    if match:
        print match.group()
#print a
I think you're missing the colons and dots in your regex. Also try using re.search or re.findall on the entire text instead. Like this:
import re, sys
text = open("./1.txt", "r").read() # or readlines() to make a list of lines
pattern = re.compile('\d{2}:\d{2}:\d{2}')
matches = pattern.findall(text)
for i in matches:
    print(i)
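The difference between match, search and findall is easiest to see side by side. The sketch below is written for Python 3 (print as a function), and the sample line is invented, since the contents of 1.txt are not shown.

```python
import re

pattern = re.compile(r'\d{2}:\d{2}:\d{2}')

# Hypothetical line; the real data file is not available here.
line = "start 12:30:05 end 12:45:10"

at_start = pattern.match(line)     # only tests the beginning of the string
first = pattern.search(line)       # first occurrence anywhere in the string
all_times = pattern.findall(line)  # every non-overlapping occurrence

print(at_start)       # None, because the line starts with "start"
print(first.group())  # 12:30:05
print(all_times)      # ['12:30:05', '12:45:10']
```

Since the goal is to compare all the times on one line, findall is probably the closest fit, as it returns them as a list of strings.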