Need help in re module of python - regex

I'm trying to use the re module to parse through a file. I tried three versions of the code: the first two versions don't retrieve any output, and the third retrieves only one line. Can someone please have a look?
Version 1:
import re
file = open('sample.txt', 'r')
x = file.readline()
while x:
    var = re.findall(r'(?:\*|\*>)\s+(\d+.\d+.\d+.\d+\/\d+\s+)?(\S+)\s+\d+\s+(\d+\s+.+)[ie]', x)
    x = file.readline()
print(var)
file.close()
Version 2:
import re
file = open('sample.txt', 'r')
x = file.read()
var = re.findall(r'(?:\*|\*>)\s+(\d+.\d+.\d+.\d+\/\d+\s+)?(\S+)\s+\d+\s+(\d+\s+.+)[ie]',x)
print(var)
file.close()
Version 3:
import re
file = open('sample.txt', 'r')
x = file.readline()
while x:
    var = re.search(r'(?:\*|\*>)\s+(\d+.\d+.\d+.\d+\/\d+\s+)?(\S+)\s+\d+\s+(\d+\s+.+)[ie]', x, re.M)
    x = file.readline()
print(var.group(0))
file.close()
The data in sample.txt is shown below. The Network column is blank after the first line, and when I run these statements individually in the Python shell the regex works.
Oregon Exchange BGP Route Viewer
route-views.oregon-ix.net / route-views.routeviews.org
This hardware is part of a grant by the NSF.
Please contact help#routeviews.org if you have questions, or
if you wish to contribute your view.
Network Next Hop Metric LocPrf Weight Path
* 64.48.0.0/16 173.205.57.234 0 53364 3257 2828 i
* 202.232.0.2 0 2497 2828 i
* 93.104.209.174 0 58901 51167 1299 2828 i
* 193.0.0.56 0 3333 2828 i
* 103.197.104.1 0 134708 3491 2828 i
* 132.198.255.253 0 1351 6939 2828 i

I think this will do the trick:
import re
thelist = [
    "* 64.48.0.0/16 173.205.57.234 0 53364 3257 2828 i",
    "* 93.104.209.174 0 58901 51167 1299 2828 i",
    "* 193.0.0.56 0 3333 2828 i",
]
# Raw string avoids invalid-escape warnings in the pattern.
regex = re.compile(r"\*\s+(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\/\d{1,3})?\s+([\S]+)\s+([^i]+i)")
for text in thelist:
    match = re.search(regex, text)
    if match:
        print("yuppers")
        print(match.group(1))
        print(match.group(2))
        print(match.group(3))
        print("\n")
    else:
        print("nope")
Results:
yuppers
64.48.0.0/16
173.205.57.234
0 53364 3257 2828 i
yuppers
None
93.104.209.174
0 58901 51167 1299 2828 i
yuppers
None
193.0.0.56
0 3333 2828 i
Depending on what you actually want to do with each match, you can fiddle with the results. The regex makes the network group optional, which might have been tripping you up. Hopefully this gets you pointed in the right direction!
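If you also need a network value on the continuation lines (the route server leaves the prefix column blank when it repeats), one sketch is to carry the last seen prefix forward. The sample lines and column spacing here are assumptions based on the data above:

```python
import re

# "*" or "*>", an optional prefix column, the next hop, then the rest of the row.
pattern = re.compile(
    r"\*>?\s+(\d{1,3}(?:\.\d{1,3}){3}/\d{1,2})?\s*(\d{1,3}(?:\.\d{1,3}){3})\s+(.*)")

lines = [
    "*  64.48.0.0/16     173.205.57.234   0  53364 3257 2828 i",
    "*                   202.232.0.2      0  2497 2828 i",
]

rows = []
last_prefix = None
for line in lines:
    m = pattern.match(line)
    if m:
        prefix = m.group(1) or last_prefix  # reuse the prefix when the column is blank
        last_prefix = prefix
        rows.append((prefix, m.group(2), m.group(3)))
print(rows)
```

The second tuple keeps 64.48.0.0/16 even though its own line had no prefix.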

Related

Extract strings based on pattern in python and writing them into pandas dataframe columns

I have text data inside a column of a dataset, as shown below.
Record Note 1
1 Amount: $43,385.23
Mode: Air
LSP: Panalpina
2 Amount: $1,149.32
Mode: Ocean
LSP: BDP
3 Amount: $1,149.32
LSP: BDP
Mode: Road
4 Amount: U$ 3,234.01
Mode: Air
5 No details
I need to extract each of the details inside the text data and write them into new columns as shown below. How do I do this in Python?
Expected Output
Record Amount Mode LSP
1 $43,385.23 Air Panalpina
2 $1,149.32 Ocean BDP
3 $1,149.32 Road BDP
4 $3,234.01 Air
5
Is this possible? How can it be done?
Write a custom function and then use DataFrame.apply():
def parse_rec(x):
    note = x['Note']
    details = note.split('\n')
    x['Amount'] = None
    x['Mode'] = None
    x['LSP'] = None
    if len(details) > 1:
        for detail in details:
            if 'Amount' in detail:
                x['Amount'] = detail.split(':')[1].strip()
            if 'Mode' in detail:
                x['Mode'] = detail.split(':')[1].strip()
            if 'LSP' in detail:
                x['LSP'] = detail.split(':')[1].strip()
    return x

df = df.apply(parse_rec, axis=1)
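A quick way to sanity-check that approach on a tiny frame; the two Note strings are assumptions modelled on the question:

```python
import pandas as pd

def parse_rec(x):
    # Same idea as above: split the note into lines and pick out each field.
    details = x['Note'].split('\n')
    x['Amount'] = x['Mode'] = x['LSP'] = None
    for detail in details:
        for field in ('Amount', 'Mode', 'LSP'):
            if field in detail:
                x[field] = detail.split(':')[1].strip()
    return x

df = pd.DataFrame({'Record': [1, 5], 'Note': [
    'Amount: $43,385.23\nMode: Air\nLSP: Panalpina',
    'No details',
]})
df = df.apply(parse_rec, axis=1)
print(df[['Record', 'Amount', 'Mode', 'LSP']])
```

Rows without details simply keep None in every field.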
import re

Amount = []
Mode = []
LSP = []

def extract_info(txt):
    Amount_lst = re.findall(r"amounts?\s*:\s*(.*)", txt, re.I)
    Mode_lst = re.findall(r"Modes?\s*:\s*(.*)", txt, re.I)
    LSP_lst = re.findall(r"LSP\s*:\s*(.*)", txt, re.I)
    Amount.append(Amount_lst[0].strip() if Amount_lst else "No details")
    Mode.append(Mode_lst[0].strip() if Mode_lst else "No details")
    LSP.append(LSP_lst[0].strip() if LSP_lst else "No details")

df["Note"].apply(lambda x: extract_info(x))
# Assign the accumulated lists, not the loop-local *_lst variables.
df["Amount"] = Amount
df["Mode"] = Mode
df["LSP"] = LSP
df = df[["Record", "Amount", "Mode", "LSP"]]
Using regex we can extract the information as in the code above and write it to separate columns.
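The same extraction can also be done without side-effect lists by using pandas' vectorised Series.str.extract; the stand-in frame below is an assumption mirroring the question's "Note" column, and missing fields come back as NaN rather than "No details":

```python
import re
import pandas as pd

# Stand-in frame mirroring the question's "Note" column.
df = pd.DataFrame({"Record": [1, 5], "Note": [
    "Amount: $43,385.23\nMode: Air\nLSP: Panalpina",
    "No details",
]})

# One vectorised pass per field; rows without the field get NaN.
for col in ["Amount", "Mode", "LSP"]:
    df[col] = df["Note"].str.extract(rf"{col}\s*:\s*(.*)", flags=re.I, expand=False)

print(df[["Record", "Amount", "Mode", "LSP"]])
```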

output from regex python

I have this data:
result =
Operator #Hosts Avg Time Max Time #Rows Est. #Rows Peak Mem Est. Peak Mem Detail
--------------------------------------------------------------------------------------------------------------
01:EXCHANGE 1 136.668ms 136.668ms 1.02K -1 0 0 UNPARTITIONED
00:SCAN HDFS 1 115.097ms 115.097ms 36.86K -1 99.97 MB 960.00 MB edw.dw_loan_int_amt
I came up with this regex to get the information I need from "Peak Mem" (the output in this case is 99.97 MB):
r".*?([0-9]+\.[0-9]+\ .B).*?[0-9]+\.[0-9]+\ .?B.*"
What I'm trying to do: if result > 90 MB then #do this
Any help appreciated.
This is what I have so far, but I'm getting None:
result = sum_data['summary']
print result
m = re.match(r".*?([0-9]+\.[0-9]+\ .B).*?[0-9]+\.[0-9]+\ .?B.*", result)
print m
You could split it into lines and then split each line with a simple regular expression like \s{2,} (meaning at least two whitespace characters, possibly more).
In Python:
import re
data = """
Operator #Hosts Avg Time Max Time #Rows Est. #Rows Peak Mem Est. Peak Mem Detail
--------------------------------------------------------------------------------------------------------------
01:EXCHANGE 1 136.668ms 136.668ms 1.02K -1 0 0 UNPARTITIONED
00:SCAN HDFS 1 115.097ms 115.097ms 36.86K -1 99.97 MB 960.00 MB edw.dw_loan_int_amt
"""
rx = re.compile(r'\s{2,}')
for line in data.split('\n'):
    parts = rx.split(line)
    if len(parts) > 2:
        print(parts[6])
This yields
Peak Mem
0
99.97 MB
Or - if you prefer a list comprehension:
memory_peaks = [parts[6]
                for line in data.split('\n')
                for parts in [rx.split(line)]
                if len(parts) > 2]
print(memory_peaks)
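To get from the extracted string to the asker's "if result > 90 MB" check, the value still has to be parsed into a number. A small sketch; limiting the unit table to KB/MB/GB is my assumption:

```python
def mem_to_mb(text):
    """Convert strings like '99.97 MB' or '1.00 GB' to a float in megabytes."""
    value, unit = text.split()
    factor = {"KB": 1.0 / 1024, "MB": 1.0, "GB": 1024.0}[unit.upper()]
    return float(value) * factor

peak = mem_to_mb("99.97 MB")
if peak > 90:
    print("peak memory above 90 MB:", peak)
```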

How to generate child csv file from parent csv file using python script?

I have one csv file, a raw report, and I want only the few rows from it that contain a specific string.
Parent file:
cols: A B C D E F G H I J K L M N O
-----------------------------------------------------------------------
abc def ghi jkl mno pqr stu vwx yz aaa bbb X 0 0 ajsjsvdjchbiyu ======kjdkjfk
abe drf gti jul muo pir stu vwx yz aaa bbb X 0 0 ajsjsvdjchbiyu ======kjdkjfk
abe drf gti j8l 7uo pir stu vwx yz aaa bbb Y 0 0 ajsjsvdjchbiyu ======kjdkjfk
abe drf gti j8l 7uo pir stu vwx yz aga btb Y 0 0 ajsjsvdjchbiyu ======kjdkjfk
Child file should be (I need only the rows below, which contain Y in column L):
cols: A B C D E F G H I J K L M N O
abe drf gti j8l 7uo pir stu vwx yz aaa bbb Y 0 0 ajsjsvdjchbiyu ======kjdkjfk
abe drf gti j8l 7uo pir stu vwx yz aga btb Y 0 0 ajsjsvdjchbiyu ======kjdkjfk
I have written the script below to do that:
import sys

fs = open("compliance_report.csv", 'r')
fe = open("failed_controls_report.csv", 'w')
count = 0
lDict = {}
fe.write("\n")
print "\nCleaning un-wanted lines from raw report...."
for l in fs:
    if 'Y' in l:
        fe.write(l)
    else:
        continue
    count = count + 1
fs.close()
fe.close()
We have text in the "0" column, so when I use this script I get the wrong rows in the result; it works without the "0" column.
You need to use the csv module to actually parse the lines into fields. With the code you have now you're just searching the entire line for any Y character which obviously is not what you want. You can know your code cannot possibly be correct because it never mentions "column L" at all, despite that column being part of the problem statement.
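A sketch of that csv-based approach; the comma delimiter and a header row containing a literal "L" column are assumptions about the report's layout, and io.StringIO stands in for the real file so the example runs as-is:

```python
import csv
import io

# Stand-in for compliance_report.csv.
raw = io.StringIO(
    "A,B,C,D,E,F,G,H,I,J,K,L,M,N,O\n"
    "abc,def,ghi,jkl,mno,pqr,stu,vwx,yz,aaa,bbb,X,0,0,ajsjsvdjchbiyu\n"
    "abe,drf,gti,j8l,7uo,pir,stu,vwx,yz,aga,btb,Y,0,0,ajsjsvdjchbiyu\n"
)

reader = csv.reader(raw)
header = next(reader)
col_l = header.index("L")  # find column L by name instead of guessing the position
kept = [row for row in reader if row[col_l] == "Y"]
print(kept)
```

Only field 11 is tested, so a "Y" anywhere else in the row no longer causes false matches.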
An alternative way would be to use the Pandas library. The procedure with pandas would look something like this:
import pandas as pd

# Read csv
df = pd.read_csv("pathtocsv")
# Keep only rows where column L equals "Y"
df = df[df["L"] == "Y"]
# Write to csv again
df.to_csv("newcsvpath")

Trying to extract values from text -

I am trying to get acc, accel and rx from the text below, but only when the accounting value is true.
dataRx: 21916 drx: 1743625
ota: 191791 orx: 74164489
dataDropped: 14 dropped:1134
id: 65535 waitress BE nginxid: 0 kbps: 0.000
accounting: false
drop : 1
rx : 48392 bytes: 483920
id: 65533 waitress BE nginxid: 1 kbps: 0.000
accounting: false
drop : 4
rx : 122914 bytes: 70081939
id: 4232 nginx BE nginxid: 3 kbps: 0.000
accounting: false
drop : 0
rx : 3084 bytes: 94357
id: 10482 server BE nginxid: 4 kbps: 0.000
accounting: false
drop : 0
rx : 15 bytes: 2477
id: 20344 serve BE nginxid: 10 kbps: 62914.560
accounting: true
drop : 2
rx : 2217 bytes: 309637
accel : 482 bytes: 264318
acc :349 bytes: 225181
The Python code below gets the accounting and accel values using this regex:
accounting:\s*((?P<accounting>\S*)[\S\s]*?accel:[\S\s]*?bytes:\s*(?P<accel>\S*)[\S\s]*?)
for match in re.finditer(re_exp, text):
    group = match.groupdict()
    print group
Output:
{"accounting": false, "accel": 264318}
But, the expected output should be
{"accounting": true, "accel": 264318}
Need help with regex expression. Any help would be greatly appreciated.
Also, is there a regex way to group all the data fields under id?
Thanks
Try this
content = open("acc.txt",'r')
ar = content.read()
import re
getdata = re.findall(r"accounting: (true).+?accel.+?bytes:\s(\d+)",ar,re.S)
print getdata
Try the following to group all the data under their corresponding id:
content = open("acc.txt", 'r')
ar = content.readlines()
arv = []
flag = 0
m = ""
for j in ar:
    if "id:" in j:
        arv.append(m)
        m = ""
        flag = 1
    if flag == 1:
        m += j
arv.append(m)  # flush the last id block, which the loop never appends
for j in arv:
    print j
Iterate through lines one by one:
import re

rx = ""
accel = ""
acc = ""
lines = text.split("\n")
for i in range(len(lines)):
    line = lines[i]
    if line == "accounting: true":
        match = re.search(r"rx\s*:\s+(\d*)", lines[i+2])
        if match:
            rx = match.group(1)
        match = re.search(r"accel\s*:\s+(\d*)", lines[i+3])
        if match:
            accel = match.group(1)
        match = re.search(r"acc\s*:\s*(\d*)", lines[i+4])
        if match:
            acc = match.group(1)
print("'accounting': true, 'rx': {}, 'accel': {}, 'acc': {}".format(rx, accel, acc))
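For the follow-up question ("is there a regex way to group all the data fields under id?"), one sketch is to split the dump at each line starting with id: and harvest key/value pairs per block. The trimmed sample text is an assumption based on the question; note that a plain dict keeps only the last occurrence of a duplicated key such as bytes:

```python
import re

text = """id: 65535 waitress BE nginxid: 0 kbps: 0.000
accounting: false
rx : 48392 bytes: 483920
id: 20344 serve BE nginxid: 10 kbps: 62914.560
accounting: true
rx : 2217 bytes: 309637
accel : 482 bytes: 264318
acc :349 bytes: 225181"""

# Zero-width split at every line beginning with "id:" keeps each block whole.
blocks = re.split(r"(?m)^(?=id:)", text)
records = []
for block in blocks:
    if not block.startswith("id:"):
        continue  # skip any preamble before the first id
    # Duplicate keys (e.g. repeated "bytes") collapse to the last occurrence.
    records.append(dict(re.findall(r"(\w+)\s*:\s*(\S+)", block)))

true_recs = [r for r in records if r.get("accounting") == "true"]
print(true_recs)
```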

Extracting Specific Columns from Multiple Files & Writing to File Python

I have seven tab-delimited files; each file has exactly the same number and names of columns, but different data. Below is a sample of how any of the seven files looks:
test_id gene_id gene locus sample_1 sample_2 status value_1 value_2 log2(fold_change)
000001 000001 ZZ 1:1 01 01 NOTEST 0 0 0 0 1 1 no
I am trying to read all seven files, extract the third, fourth and tenth columns (gene, locus, log2(fold_change)), and write those columns to a new file, so that the file looks something like this:
gene name locus log2(fold_change) log2(fold_change) log2(fold_change) log2(fold_change) log2(fold_change) log2(fold_change) log2(fold_change)
ZZ 1:1 0 0 0 0
All the log2(fold_change) values are obtained from the tenth column of each of the seven files.
What I have so far is below, and I need help constructing a more efficient, Pythonic way to accomplish the task. Note that the code does not yet accomplish the task explained above and needs some work.
dicti = defaultdict(list)
filetag = []

def read_data(file, base):
    with open(file, 'r') as f:
        reader = csv.reader(f, delimiter='\t')
        for row in reader:
            if 'test_id' not in row[0]:
                dicti[row[2]].append((base, row))

name_of_fold = raw_input("Folder name to stored output files in: ")
for file in glob.glob("*.txt"):
    base = file[0:3] + "-log2(fold_change)"
    filetag.append(base)
    read_data(file, base)
with open("output.txt", "w") as out:
    out.write("gene name" + "\t" + "locus" + "\t" + "\t".join(sorted(filetag)) + "\n")
    for k, v in dicti.items():
        out.write(k + "\t" + v[1][1][3] + "\t" + "".join([int(z[0][0:3]) * "\t" + z[1][9] for z in v]) + "\n")
So, the code above runs, but it is not what I am looking for, and the problem is the output. I write a tab-delimited output file with the gene (k) in the first column and v[1][1][3], the locus of that gene, in the second. The part I am having a tough time coding is this piece of the output line:
"".join([int(z[0][0:3]) * "\t" + z[1][9] for z in v])
I am trying to list the fold change from each of the seven files for that particular gene and locus and write each value into the correct column. I multiply the file's column number by "\t" so the value lands in the right column, but when the next file's value comes along, writing continues from where the previous write left off instead of restarting from the beginning of the line. Here is what I mean:
gene name locus log2(fold change) from file 1 .... log2(fold change) from file 7
ZZ 1:3 0
0
The first log2 value is recorded correctly at, say, column 2 (column number times "\t"), but the seventh file's value then fails to land in column 7 because writing resumed after the last write.
Here is my first approach:
import glob
import numpy as np

with open('output.txt', 'w') as out:
    fns = glob.glob('*.txt')  # Here you can change the pattern of the file (e.g. 'file_experiment_*.txt')
    # Title row:
    titles = ['gene_name', 'locus'] + [str(file + 1) + '_log2(fold_change)' for file in range(len(fns))]
    out.write('\t'.join(titles) + '\n')
    # Data row:
    data = []
    for idx, fn in enumerate(fns):
        file = np.genfromtxt(fn, skip_header=1, usecols=(2, 3, 9), dtype=np.str, autostrip=True)
        if idx == 0:
            data.extend([file[0], file[1]])
        data.append(file[2])
    out.write('\t'.join(data))
Content of the created file output.txt (Note: I created just three files for testing):
gene_name locus 1_log2(fold_change) 2_log2(fold_change) 3_log2(fold_change)
ZZ 1:1 0 0 0
I am using re instead of csv. The main problem with your code is the for loop that writes the output to the file. I am posting the complete code; hope this solves your problem.
import collections
import glob
import re

dicti = collections.defaultdict(list)
filetag = []

def read_data(file, base):
    with open(file, 'r') as f:
        for row in f:
            r = re.compile(r'([^\s]*)\s*')
            row = r.findall(row.strip())[:-1]
            print row
            if 'test_id' not in row[0]:
                dicti[row[2]].append((base, row))

def main():
    name_of_fold = raw_input("Folder name to stored output files in: ")
    for file in glob.glob("*.txt"):
        base = file[0:3] + "-log2(fold_change)"
        filetag.append(base)
        read_data(file, base)
    with open("output", "w") as out:
        data = ("genename" + "\t" + "locus" + "\t" + "\t".join(sorted(filetag)) + "\n")
        r = re.compile(r'([^\s]*)\s*')
        data = r.findall(data.strip())[:-1]
        out.write('{0[1]:<30}{0[2]:<30}{0[3]:<30}{0[4]:<30}{0[5]:<30} {0[6]:<30}{0[7]:<30}{0[8]:<30}'.format(data))
        out.write('\n')
        for key in dicti:
            print 'locus = ' + str(dicti[key][1])
            data = (key + "\t" + dicti[key][1][1][3] + "\t" + "".join([len(z[0][0:3]) * "\t" + z[1][9] for z in dicti[key]]) + "\n")
            data = r.findall(data.strip())[:-1]
            out.write('{0[0]:<30}{0[1]:<30}{0[2]:<30}{0[3]:<30}{0[4]:<30}{0[5]:<30}{0[6]:<30}{0[7]:<30}{0[8]:<30}'.format(data))
            out.write('\n')

if __name__ == '__main__':
    main()
I changed the name of the output file from output.txt to output, as the former may interfere with the code, since the code considers all .txt files. I am attaching the output I got, which I assume is the format you wanted.
Thanks
gene name locus 1.t-log2(fold_change) 2.t-log2(fold_change) 3.t-log2(fold_change) 4.t-log2(fold_change) 5.t-log2(fold_change) 6.t-log2(fold_change) 7.t-log2(fold_change)
ZZ 1:1 0 0 0 0 0 0 0
Remember to append \n to the end of each line to create a line break. This method is very memory efficient, as it processes just one row at a time.
import csv
import os
import glob

# Your folder location where the input files are saved.
name_of_folder = '...'
output_filename = 'output.txt'
input_files = glob.glob(os.path.join(name_of_folder, '*.txt'))

with open(os.path.join(name_of_folder, output_filename), 'w') as file_out:
    headers_read = False
    for input_file in input_files:
        if input_file == os.path.join(name_of_folder, output_filename):
            # If the output file is in the list of input files, ignore it.
            continue
        with open(input_file, 'r') as fin:
            reader = csv.reader(fin)
            if not headers_read:
                # Read column headers just once.
                headers = reader.next()[0].split()
                headers = headers[2:4] + [headers[9]]  # Zero based indexing.
                file_out.write("\t".join(headers + ['\n']))
                headers_read = True
            else:
                _ = reader.next()  # Ignore header row.
            for line in reader:
                if line:  # Ignore blank lines.
                    line_out = line[0].split()
                    file_out.write("\t".join(line_out[2:4] + [line_out[9]] + ['\n']))
>>> !cat output.txt
gene locus log2(fold_change)
ZZ 1:1 0
ZZ 1:1 0
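As a further alternative under the same assumptions (tab-delimited files, identical headers), pandas can build the wide one-column-per-file layout directly; here two in-memory "files" stand in for the seven real ones:

```python
import io
import pandas as pd

header = ("test_id\tgene_id\tgene\tlocus\tsample_1\tsample_2\tstatus"
          "\tvalue_1\tvalue_2\tlog2(fold_change)\n")
files = {
    "f1.txt": header + "000001\t000001\tZZ\t1:1\t01\t01\tNOTEST\t0\t0\t0\n",
    "f2.txt": header + "000001\t000001\tZZ\t1:1\t01\t01\tNOTEST\t0\t0\t0.5\n",
}

frames = []
for name, content in files.items():
    df = pd.read_csv(io.StringIO(content), sep="\t",
                     usecols=["gene", "locus", "log2(fold_change)"])
    # One fold-change column per file, keyed on gene and locus.
    df = df.rename(columns={"log2(fold_change)": name + "-log2(fold_change)"})
    frames.append(df.set_index(["gene", "locus"]))

merged = pd.concat(frames, axis=1).reset_index()
print(merged)
```

With real files, replace the dict with a glob over the folder and pd.read_csv(path, ...) per file; concat aligns the rows on (gene, locus), so the column-position bookkeeping from the question disappears.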