How to parse/pull specific data out of a file with Python - regex

I have an interesting issue I am trying to solve and I have taken a good stab at it, but I need a little help. I have a squishy file that contains some Lua code. I am trying to read this file and build a file path out of it. However, depending on where this file was generated, it may contain some information or be missing some. Here is an example of the squishy file I need to parse:
Module "foo1"
Module "foo2"
Module "common.command" "common/command.lua"
Module "common.common" "common/common.lua"
Module "common.diagnostics" "common/diagnostics.lua"
Here is the code I have written to read the file and search for the lines containing Module. You will see that there are three different sections or columns to this file. If you look at line 3 of the example, you have "Module" in column 1, "common.command" in column 2 and "common/command.lua" in column 3.
Taking column 3 as an example: if there is data in the 3rd column then I just need to strip the quotes off and grab the data, in this case common/command.lua. If there is no data in column 3 then I need to take the data from column 2, replace each period (.) with os.path.sep, and tack a .lua extension onto it. Using "common.common" as an example, it would become common/common.lua.
squishyContent = []
if os.path.isfile(root + os.path.sep + "squishy"):
    self.Log("Parsing Squishy")
    with open(root + os.path.sep + "squishy") as squishyFile:
        lines = squishyFile.readlines()
    for line in lines:
        if line.startswith("Module "):
            path = line.replace('Module "', '').replace('"', '').replace("\n", '').replace(".", "/") + ".lua"
Just need some examples/help in getting through this.

This might sound silly, but the easiest approach is to convert everything you told us about your task to code.
for line in lines:
    # if the line doesn't start with "Module ", ignore it
    if not line.startswith('Module '):
        continue
    # As you said, there are 3 columns separated by blanks, so split the
    # text into columns; split() without arguments also drops the newline.
    line = line.split()
    # if there are more than 2 columns, use the 3rd column's text (and remove the quotes "")
    if len(line) > 2:
        line = line[2][1:-1]
    # otherwise, ...
    else:
        line = line[1]                         # use the 2nd column's text
        line = line[1:-1]                      # remove the quotes ""
        line = line.replace('.', os.path.sep)  # replace . with the path separator
        line += '.lua'                         # and add .lua
    print line  # prove it works.
With a simple problem like this, it's easy to make the program do exactly what you yourself would do if you did the task manually.
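Since the title mentions regex, here is a minimal alternative sketch using the re module; it assumes the squishy file is named squishy in the current directory and that the quoted fields are laid out exactly as in the sample above:
import os
import re

# One mandatory quoted module name, plus one optional quoted path.
MODULE_RE = re.compile(r'^Module\s+"([^"]+)"(?:\s+"([^"]+)")?\s*$')

def module_path(line):
    match = MODULE_RE.match(line)
    if match is None:
        return None  # not a Module line
    name, path = match.groups()
    if path:
        return path  # column 3 was present: use it as-is
    # column 3 missing: build the path from the dotted module name
    return name.replace('.', os.path.sep) + '.lua'

for line in open('squishy'):
    path = module_path(line)
    if path:
        print path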

Related

In python insert one space after every 5th Character in each line of a text file

I am reading a text file in Python (500 rows) and it looks like:
File Input:
0082335401
0094446049
01008544409
01037792084
01040763890
I wanted to ask: is it possible to insert one space after the 5th character in each line?
Desired Output:
00823 35401
00944 46049
01008 544409
01037 792084
01040 763890
I have tried the code below
st = " ".join(st[i:i + 5] for i in range(0, len(st), 5))
but it returned the following output:
00823 35401
0094 44604 9
010 08544 409
0 10377 92084
0104 07638 90
I am a novice in Python. Any help would make a difference.
There seem to be two issues here. Judging by your provided code, you are reading the file into one single string, so the 5-character groups run across line boundaries, which is why the spacing drifts in your output. It would be much preferable (in your case) to read the file in as a list of strings, like the following (assuming your input file is input_data.txt):
# Initialize a list for the data to be stored
data = []
# Iterate through your file to read the data
with open("input_data.txt") as f:
    for line in f.readlines():
        # Use .rstrip() to get rid of the newline character at the end
        data.append(line.rstrip("\r\n"))
Then, to operate on the data you now have in a list, you can use a list comprehension similar to the one you tried:
# Assumes that data is the result from the above code
data = [i[:5] + " " + i[5:] if len(i) > 5 else i for i in data]
Hope this helped!
If your only requirement is to insert a space after the fifth character, then you could use the following simple version:
#!/usr/bin/env python
with open("input_data") as data:
    for line in data.readlines():
        line = line.rstrip()
        if len(line) > 5:
            print(line[0:5] + " " + line[5:])
        else:
            print(line)
If you don't mind lines with five or fewer characters getting a space appended at the end, you could even omit the if-else statement and keep just the print call from the if branch:
#!/usr/bin/env python
with open("input_data") as data:
    for line in data.readlines():
        line = line.rstrip()
        print(line[0:5] + " " + line[5:])
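If you eventually want the result in a new file rather than printed, a small sketch combining the reading and the slicing might look like this (input_data.txt and output_data.txt are assumed names):
# Read, transform, and write in one pass; file names are assumed.
with open("input_data.txt") as src, open("output_data.txt", "w") as dst:
    for line in src:
        line = line.rstrip("\r\n")
        if len(line) > 5:
            line = line[:5] + " " + line[5:]
        dst.write(line + "\n")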

Python 2.7 - Append processed file contents from multiple files to one large CSV file with original filename headers separating

I have not done any programming in about 12 years and have been asked by one of my colleagues to help with what is apparently a basic Python 2.7 script. My question is very similar to what this person asked (though it has not been answered):
Python - Batch combine Multiple large CSV, filter data, skip header, appending vertically into a single CSV
I need to prompt the user for the folder path, read in each file from that folder (there are hundreds of CSV files), conduct processing, and then write each file's processed output into a single CSV file, with each file's output headed by the filename it was read from and separated from the next by a blank line.
It would result in something like this:
CHEM_0_5
etc etc
etc etc
etc etc
LAW_4_1
etc etc
etc etc
LAW_7_3
etc etc
etc etc
Currently the script has to be edited with the name of the file to read, saved, and then run. Then the contents of the output file have to be manually copied into a new CSV file. It is very tedious and time-consuming.
This is what I currently have. Please note I have removed some of the processing from the example.
import time
import datetime

x = 0
stamp = 0
compare = 1
values = []

## INSERT NAME OF FILE YOU WANT TO CLEAN
g = open('CHEM_0_5.csv','r')
for line in g:
    lis = [line.split() for line in g]
    lis.pop(0)
    lis.pop(0)
timestamps = []
results = []
x = 0
# (cl and ts come from the processing that was removed from this example)
for i in cl:
    ## INSERT WHAT YOU WANT TO SAVE THE FILE AS
    fd = open('new.csv','a')
    fd.write(str(ts[x]) + "," + str(i) + "\n")
    fd.close()
    x = x + 1
g.close()
I have been trying to re-learn Python in the process of searching for answers, but given that I don't really know what I'm doing, I feel that this could be something to do after I've completed the task for my colleague.
Thank you for taking the time to read my submission!
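A minimal sketch of the batch workflow described above might look like the following: prompt for the folder, loop over every CSV inside it, and write each file's processed rows under a filename header with a blank line after each block. The process_rows stub is hypothetical, standing in for the processing that was removed from the question:
import glob
import os

folder = raw_input("Folder path: ")  # Python 2.7, as in the question
csv_files = sorted(glob.glob(os.path.join(folder, "*.csv")))
out_path = os.path.join(folder, "combined_output.csv")  # assumed output name

def process_rows(lines):
    # Hypothetical stub: the real per-file processing was removed from
    # the question, so rows pass through unchanged here.
    return lines

with open(out_path, "w") as out:
    for csv_path in csv_files:
        name = os.path.splitext(os.path.basename(csv_path))[0]
        out.write(name + "\n")  # filename header, e.g. CHEM_0_5
        with open(csv_path) as f:
            for row in process_rows(f.readlines()):
                out.write(row)
        out.write("\n")  # blank line between files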

Writing multiple header lines in pandas.DataFrame.to_csv

I am putting my data into NASA's ICARTT format for archival. This is a comma-separated file with multiple header lines, and it has commas in the header lines. Something like:
46, 1001
lastname, firstname
location
instrument
field mission
1, 1
2011, 06, 21, 2012, 02, 29
0
Start_UTC, seconds, number_of_seconds_from_0000_UTC
14
1, 1
-999, -999
measurement name, units
measurement name, units
column1 label, column2 label, column3 label, column4 label, etc.
I have to make a separate file for each day that data were collected, so I will end up creating around thirty files in all. When I create a CSV file via pandas.DataFrame.to_csv I cannot (as far as I know) simply write the header lines to the file before writing the data, so I have had to trick it into doing what I want via
# assuming <df> is a pandas dataframe
df.to_csv('dst.ict',na_rep='-999',header=True,index=True,index_label=header_lines)
where "header_lines" is the header string
What this gives me is exactly what I want, except that "header_lines" is bracketed by double quotes. Is there any way to write text to the head of a CSV file using to_csv, or to remove the double quotes? I have already tried setting quotechar='' and doublequote=False in to_csv(), but the double quotes still come up.
What I am doing now (it works for now, but I would like to move to something better) is simply opening a file via open('dst.ict','w') and printing to it line by line, which is quite slow.
You can, indeed, just write the header lines before the data. pandas.DataFrame.to_csv takes a path_or_buf as its first argument, not just a pathname:
pandas.DataFrame.to_csv(path_or_buf, *args, **kwargs)
path_or_buf : string or file handle, default None
File path or object, if None is provided the result is returned as a string.
Here's an example:
#!/usr/bin/python2
import pandas as pd
import numpy as np
import sys

# Make an example data frame.
df = pd.DataFrame(np.random.randint(100, size=(5, 5)),
                  columns=['a', 'b', 'c', 'd', 'e'])

header = '\n'.join(
    # I like to make sure the header lines are at least utf8-encoded.
    [unicode(line, 'utf8') for line in
        ['1001',
         'Daedalus, Stephen',
         'Dublin, Ireland',
         'Keys to the tower',
         'MINOS',
         '1,1',
         '1904,06,16,1922,02,02',
         'time_since_8am',  # Ends up being the header name for the index.
         ]
     ]
)

with open(sys.argv[1], 'w') as ict:
    # Write the header lines, including the index variable for the last
    # one if you're letting Pandas produce that for you (see above).
    # Note there is no trailing newline: to_csv's first line (",a,b,c,d,e")
    # completes the index-label line.
    ict.write(header)
    # Just write the data frame to the file object instead of
    # to a filename. Pandas will do the right thing and realize
    # it's already been opened.
    df.to_csv(ict)
The result is just what you wanted: the header lines are written first, then .to_csv() writes the data:
$ python example.py test && cat test
1001
Daedalus, Stephen
Dublin, Ireland
Keys to the tower
MINOS
1,1
1904,06,16,1922,02,02
time_since_8am,a,b,c,d,e
0,67,85,66,18,32
1,47,4,41,82,84
2,24,50,39,53,13
3,49,24,17,12,61
4,91,5,69,2,18
Sorry if this is too late to be useful. I work in archiving these files (and use Python), so feel free to drop me a line if you have future questions.
Even though it's been some years and ndt's answer is quite nice, another possibility would be to write the header first and then use to_csv() with mode='a' (append):
# write the header
header = '46, 1001\nlastname, firstname\n,...'
with open('test.csv', 'w') as fp:
    fp.write(header)
# write the rest
df.to_csv('test.csv', header=True, mode='a')
It's maybe less efficient due to the two write operations, though...
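If the two writes bother you, note that to_csv() returns the CSV as a string when you pass it no path, so one possible variant (a sketch, assuming the header string ends with a newline) is to build the whole file in memory and write it once:
# Build the entire file contents in memory, then write once.
# Assumes `header` from above already ends with a newline.
csv_text = header + df.to_csv(header=True)
with open('test.csv', 'w') as fp:
    fp.write(csv_text)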

Loop through multiple csv files and write one column into new output csv

I have 251 CSV files in a folder, named like "returned UDTs 1-12-13.csv" and "returned UDTs 1-13-13.csv". The dates are not consecutive, however; holidays and weekends may be missing, so the next file may be "returned UDTs 1-17-13.csv". Each file has one column of data. I need to extract each column and append them all into one column in one new output CSV file. I want to write a Python script to do so. In a dummy folder with 3 dummy CSV files (csv1.csv, csv2.csv, and csv3.csv) I created the following script, which works:
import csv, os, sys

out_csv = r"C:\OutCSV\csvtest.csv"
path = r"C:\CSV_test"
fout = open(out_csv, "a")
# first file:
for line in open(path + "\csv1.csv"):
    fout.write(line)
# now the rest:
for num in range(2, 4):
    f = open(path + "\csv" + str(num) + ".csv")
    f.next()  # skip the header
    for line in f:
        fout.write(line)
    f.close()  # dont know if needed
fout.close()
The issue is the date in the file name and how to deal with it. Any help would be appreciated.
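One hedged way to deal with the dates is not to construct them at all: let glob match whatever files exist, then sort them chronologically by parsing the date out of each name. A sketch, assuming the naming pattern and folders from the question:
import glob
import os
from datetime import datetime

path = r"C:\CSV_test"
out_csv = r"C:\OutCSV\csvtest.csv"

def file_date(fname):
    # "returned UDTs 1-12-13.csv" -> datetime(2013, 1, 12)
    stamp = os.path.basename(fname).replace("returned UDTs ", "").replace(".csv", "")
    return datetime.strptime(stamp, "%m-%d-%y")

files = sorted(glob.glob(os.path.join(path, "returned UDTs *.csv")), key=file_date)

with open(out_csv, "w") as fout:
    for i, fname in enumerate(files):
        with open(fname) as f:
            if i > 0:
                f.next()  # skip the header in all but the first file (Python 2)
            for line in f:
                fout.write(line)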

Python 2.7.3: Search/Count txt file for string, return full line with final occurrence of that string

I'm trying to create a WiFi Log Scanner. Currently we go through logs manually using CTRL+F and our keywords. I just want to automate that process. i.e. bang in a .txt file and receive an output.
I've got the bones of the code and can work on making it pretty later, but I'm running into a small issue. I want the scanner to search the file (done), count instances of that string (done) and output the number of occurrences (done), followed by the full line where that string occurred last, including the line number (the line number is not essential, it just makes it easier to guesstimate which is the more recent issue if there are multiple).
Currently I'm getting an output of every line with the string in it. I know why this is happening; I just can't think of a way to output only the last line.
Here is my code:
import os
from Tkinter import Tk
from tkFileDialog import askopenfilename

def file_len(filename):
    #Count Number of Lines in File and Output Result
    with open(filename) as f:
        for i, l in enumerate(f):
            pass
    print('There are ' + str(i+1) + ' lines in ' + os.path.basename(filename))

def file_scan(filename):
    #All Issues to Scan will go here
    print ("DHCP was found " + str(filename.count('No lease, failing')) + " time(s).")
    for line in filename:
        if 'No lease, failing' in line:
            print line.strip()
    DNS = (filename.count('Host name lookup failure:res_nquery failed') + filename.count('HTTP query failed')) / 2
    print ("DNS Failure was found " + str(DNS) + " time(s).")
    for line in filename:
        if 'Host name lookup failure:res_nquery failed' in line or 'HTTP query failed' in line:
            print line.strip()
    print ("PSK= was found " + str(testr.count('psk=')) + " time(s).")
    for line in ln:
        if 'psk=' in line:
            print 'The length(s) of the PSK used is ' + str(line.count('*'))

Tk().withdraw()
filename = askopenfilename()
abspath = os.path.abspath(filename)  #So that doesn't matter if File in Python Dir
dname = os.path.dirname(abspath)     #So that doesn't matter if File in Python Dir
os.chdir(dname)                      #So that doesn't matter if File in Python Dir
print ('Report for ' + os.path.basename(filename))
file_len(filename)
file_scan(filename)
That's pretty much going to be my working code (I just have to add a few more issue searches). I have a version that searches a string instead of a text file here. This outputs the following:
Total Number of Lines: 38
DHCP was found 2 time(s).
dhcp
dhcp
PSK= was found 2 time(s).
The length(s) of the PSK used is 14
The length(s) of the PSK used is 8
I only have general stuff there, modified for it being a string rather than a txt file, but the string I'm scanning will be what's in the txt files.
Don't worry too much about PSK; I want all examples of that listed. I'll see if I can tidy them up into one line at a later stage.
As a side note, a lot of this is jumbled together from previous searches, so I have a good idea that there are probably neater ways of doing this. That is not my current concern, but if you do have a suggestion on this side of things, please provide an explanation (or a link to one) as to why your way is better. I'm fairly new to Python, so I'm mainly dealing with stuff I currently understand. :)
Thanks in advance for any help, if you need any further info, please let me know.
Joe
To search for and count the string occurrences, I solved it in the following way:
'''---------------------Function--------------------'''
#Counting the "string" occurrence in a file
def count_string_occurrence():
    string = "test"
    f = open("result_file.txt")
    contents = f.read()
    f.close()
    print "Number of '" + string + "' in file", contents.count(string)
    #we are searching for the "test" string in the file "result_file.txt"
I can't comment on questions yet, but I think I can answer more specifically with some more information: which line do you want only one of?
For example, you can do something like:
search_str = 'find me'
count = 0
for line in file:
    if search_str in line:
        last_line = line
        count += 1
print '{0} occurrences of this line:\n{1}'.format(count, last_line)
I notice that in file_scan you are iterating twice through file. You can surely condense it into one iteration :).
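To make that concrete, here is a hedged sketch of a single pass that counts several patterns at once and remembers the last matching line (with its line number) for each; the pattern list and the wifi.log filename are illustrative, taken from the question:
# One pass over the log: count each pattern and keep its last match.
patterns = ['No lease, failing',
            'Host name lookup failure:res_nquery failed',
            'HTTP query failed']
counts = {p: 0 for p in patterns}
last_match = {}  # pattern -> (line number, line text)

with open('wifi.log') as f:  # assumed log file name
    for num, line in enumerate(f, 1):
        for p in patterns:
            if p in line:
                counts[p] += 1
                last_match[p] = (num, line.strip())

for p in patterns:
    print '%s was found %d time(s).' % (p, counts[p])
    if p in last_match:
        print 'Last occurrence (line %d): %s' % last_match[p]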