Writing multiple header lines in pandas.DataFrame.to_csv - python-2.7

I am putting my data into NASA's ICARTT format for archival. This is a comma-separated file with multiple header lines, and it has commas in the header lines. Something like:
46, 1001
lastname, firstname
location
instrument
field mission
1, 1
2011, 06, 21, 2012, 02, 29
0
Start_UTC, seconds, number_of_seconds_from_0000_UTC
14
1, 1
-999, -999
measurement name, units
measurement name, units
column1 label, column2 label, column3 label, column4 label, etc.
I have to make a separate file for each day that data were collected, so I will end up creating around thirty files in all. When I create a csv file via pandas.DataFrame.to_csv I cannot (as far as I know) simply write the header lines to the file before writing the data, so I have had to trick it into doing what I want via
# assuming <df> is a pandas dataframe
df.to_csv('dst.ict',na_rep='-999',header=True,index=True,index_label=header_lines)
where "header_lines" is the header string
What this give me is exactly what I want, except "header_lines" is bracketed by double-quotes. Is there any way to write text to the head of a csv file using to_csv or remove the double quotes? I have already tried setting quotechar='' and doublequote=False in to_csv(), but the double quotes still come up.
What I am doing now (and it works for now, but I would like to move to something better) is simply opening a file via open('dst.ict','w') and printing to that line by line, which is quite slow.

You can, indeed, just write the header lines before the data. pandas.DataFrame.to_csv takes a path_or_buf as its first argument, not just a pathname:
pandas.DataFrame.to_csv(path_or_buf, *args, **kwargs)
path_or_buf : string or file handle, default None
File path or object, if None is provided the result is returned as a string.
Here's an example:
#!/usr/bin/python2
import pandas as pd
import numpy as np
import sys

# Make an example data frame.
df = pd.DataFrame(np.random.randint(100, size=(5, 5)),
                  columns=['a', 'b', 'c', 'd', 'e'])

header = '\n'.join(
    # I like to make sure the header lines are at least utf8-encoded.
    [unicode(line, 'utf8') for line in
        ['1001',
         'Daedalus, Stephen',
         'Dublin, Ireland',
         'Keys to the tower',
         'MINOS',
         '1, 1',
         '1904, 06, 16, 1922, 02, 02',
         'time_since_8am',  # Ends up being the header name for the index.
         ]
     ]
)

with open(sys.argv[1], 'w') as ict:
    # Write the header lines, including the index variable for
    # the last one if you're letting Pandas produce that for you
    # (see above).
    ict.write(header)
    # Just write the data frame to the file object instead of
    # to a filename. Pandas will do the right thing and realize
    # it's already been opened.
    df.to_csv(ict)
The result is just what you wanted: the header lines are written first, and then .to_csv() writes the data after them:
$ python example.py test && cat test
1001
Daedalus, Stephen
Dublin, Ireland
Keys to the tower
MINOS
1, 1
1904, 06, 16, 1922, 02, 02
time_since_8am,a,b,c,d,e
0,67,85,66,18,32
1,47,4,41,82,84
2,24,50,39,53,13
3,49,24,17,12,61
4,91,5,69,2,18
Sorry if this is too late to be useful. I work in archiving these files (and use Python), so feel free to drop me a line if you have future questions.

Even though it's been some years and ndt's answer is quite nice, another possibility would be to write the header first and then use to_csv() with mode='a' (append):
# write the header
header = '46, 1001\nlastname, firstname\n,...'
with open('test.csv', 'w') as fp:
    fp.write(header)

# write the rest
df.to_csv('test.csv', header=True, mode='a')
It's maybe less efficient due to the two write operations, though...
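If the double open bothers you, the two steps can also be folded into a single with block by handing to_csv() the already-open file handle (which is essentially ndt's approach above). A minimal sketch, with a stand-in data frame:

import pandas as pd

# stand-in for your real data frame
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

header = '46, 1001\nlastname, firstname\n'  # note the trailing newline before the data

with open('test.csv', 'w') as fp:
    fp.write(header)
    df.to_csv(fp, header=True)  # pandas writes to the already-open handle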

Related

Python 2.7 - Append processed file contents from multiple files to one large CSV file with original filename headers separating

I have not done any programming in about 12 years and have been asked by one of my colleagues to help with what is apparently a basic Python 2.7 script. My question is very similar to what this person asked (though it has not been answered):
Python - Batch combine Multiple large CSV, filter data, skip header, appending vertically into a single CSV
I need to prompt the user for the folder path, read in each file from that folder (there are hundreds of CSV files), conduct processing, and then write the processed output from each file into a single CSV file, with each file's output separated by a blank line and the filename of the file it was read from.
It would result in something like this:
CHEM_0_5
etc etc
etc etc
etc etc
LAW_4_1
etc etc
etc etc
LAW_7_3
etc etc
etc etc
Currently the script has to be edited with the name of the file it has to read, saved, and then run. Then the contents of the output file have to be manually copied into a new csv file. It is very tedious and time consuming.
This is what I currently have. Please note I have removed some of the processing from the example.
import time
import datetime

x = 0
stamp = 0
compare = 1
values = []

## INSERT NAME OF FILE YOU WANT TO CLEAN
g = open('CHEM_0_5.csv', 'r')

for line in g:
    lis = [line.split() for line in g]
lis.pop(0)
lis.pop(0)

timestamps = []
results = []
x = 0

# (the processing that builds the ts and cl lists has been removed from this example)
for i in cl:
    ## INSERT WHAT YOU WANT TO SAVE THE FILE AS
    fd = open('new.csv', 'a')
    fd.write(str(ts[x]) + "," + str(i) + "\n")
    fd.close()
    x = x + 1
g.close()
I have been trying to re-learn Python while searching for answers, but given that I don't really know what I'm doing, I feel that this is something to tackle after I've completed the task for my colleague.
Thank you for taking the time to read my submission!
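Since this one never got an answer here, a rough sketch of the batch approach described above (prompt for the folder, loop over every CSV in it, and write each file's column into one output file preceded by a blank line and the source filename) might look like the following; the per-file processing is left as a placeholder, and all paths and names are illustrative:

import csv
import glob
import os

folder = raw_input("Folder containing the CSV files: ")  # Python 2.7 prompt
out_path = os.path.join(folder, "combined_output.csv")

with open(out_path, "wb") as out_file:
    writer = csv.writer(out_file)
    for csv_path in sorted(glob.glob(os.path.join(folder, "*.csv"))):
        if os.path.abspath(csv_path) == os.path.abspath(out_path):
            continue  # don't re-read our own output file
        name = os.path.splitext(os.path.basename(csv_path))[0]
        writer.writerow([])      # blank separator line
        writer.writerow([name])  # filename header, e.g. CHEM_0_5
        with open(csv_path, "rb") as in_file:
            for row in csv.reader(in_file):
                # per-file processing would go here
                writer.writerow(row[:1])  # keep the single column of data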

BadDataError when editing a .dbf file using dbf package

I have recently produced several thousand shapefile outputs and accompanying .dbf files from an atmospheric model (HYSPLIT) on a unix system. The converter txt2dbf is used to convert shapefile attribute tables (text file) to a .dbf.
Unfortunately, something has gone wrong (probably a separator/field length error) because there are 2 problems with the output .dbf files, as follows:
Some fields of the dbf contain data that should not be there. This data has "spilled over" from neighbouring fields.
An additional field has been added that should not be there (it actually comes from a section of the first record of the text file, "1000 201").
This is an example of the first record in the output dbf (retrieved using dbview unix package):
Trajnum : 1001 2
Yyyymmdd : 0111231 2
Time : 300
Level : 0.
1000 201:
Here's what I expected:
Trajnum : 1000
Yyyymmdd : 20111231
Time : 2300
Level : 0.
Separately, I'm looking at how to prevent this from happening again, but ideally I'd like to be able to repair the existing .dbf files. Unfortunately the text files are removed for each model run, so "fixing" the .dbf files is the only option.
My approaches to the above problems are:
Extract the information from the fields that do exist to a new variable using dbf.add_fields and dbf.write (python package dbf), then delete the old incorrect fields using dbf.delete_fields.
Delete the unwanted additional field.
This is what I've tried:
with dbf.Table(db) as db:
    db.add_fields("TRAJNUMc C(4)")  # create new fields
    db.add_fields("YYYYMMDDc C(8)")
    db.add_fields("TIMEc C(4)")
    for record in db:  # extract data from fields
        dbf.write(TRAJNUMc=int(str(record.Trajnum)[:4]))
        dbf.write(YYYYMMDDc=int(str(record.Trajnum)[-1:] + str(record.Yyyymmdd)[:7]))
        dbf.write(TIMEc=record.Yyyymmdd[-1:] + record.Time[:])
    db.delete_fields('Trajnum')  # delete the incorrect fields
    db.delete_fields('Yyyymmdd')
    db.delete_fields('Time')
    db.delete_fields('1000 201')  # delete the unwanted field
    db.pack()
But this produces the following error:
dbf.ver_2.BadDataError: record data is not the correct length (should be 31, not 30)
Given the apparent problem that there has been with the txt2dbf conversion, I'm not surprised to find an error in the record data length. However, does this mean that the file is completely corrupted and that I can't extract the information that I need (frustrating because I can see that it exists)?
EDIT:
Rather than attempting to edit the 'bad' .dbf files, it seems a better approach to 1. extract the required data from the bad files to a text file and then 2. write it to a new dbf. (See Ethan Furman's comments/answer below.)
EDIT:
An example of a faulty .dbf file that I need to fix/recover data from can be found here:
https://www.dropbox.com/s/9y92f7m88a8g5y4/p0001120110.dbf?dl=0
An example .txt file from which the faulty dbf files were created can be found here:
https://www.dropbox.com/s/d0f2c0zehsyy8ab/attTEST.txt?dl=0
To fix the data and recreate the original text file, this snippet should help:
import dbf

table = dbf.Table('/path/to/scramble/table.dbf')
with table:
    fixed_data = []
    for record in table:
        # convert to str/bytes while skipping delete flag
        data = record._data[1:].tostring()
        trajnum = data[:4]
        ymd = data[4:12]
        time = data[12:16]
        level = data[16:].strip()
        fixed_data.append([trajnum, ymd, time, level])

new_file = open('repaired_data.txt', 'w')
for line in fixed_data:
    new_file.write(','.join(line) + '\n')
new_file.close()
Assuming all your data files look like your sample (the big IF being the data has no embedded commas), then this rough code should help translate your text files into dbfs:
raw_data = open('some_text_file.txt').read().split('\n')

final_table = dbf.Table(
    'dest_table.dbf',
    'trajnum C(4); yyyymmdd C(8); time C(4); level C(9)',
    )

with final_table:
    for line in raw_data:
        fields = line.split(',')
        final_table.append(tuple(fields))
# table has been populated and closed
# table has been populated and closed
Of course, you could get fancier and use actual date, and number fields if you want to:
# the dbf field spec becomes
'trajnum N; yyyymmdd D; time C(4); level N'

# and the appending loop becomes
for line in raw_data:
    trajnum, ymd, time, level = line.split(',')
    trajnum = int(trajnum)
    ymd = dbf.Date(ymd[:4], ymd[4:6], ymd[6:])
    level = int(level)
    final_table.append((trajnum, ymd, time, level))
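Since there are several thousand files to repair, the two steps above could be combined and run over a whole directory. A rough sketch, assuming every damaged file shares the layout of the sample (the slicing and field specs are copied from the snippets above; the paths and output naming are illustrative):

import glob
import os

import dbf

for bad_path in glob.glob('/path/to/bad/dbfs/*.dbf'):
    # 1. pull the raw record bytes out of the damaged table
    rows = []
    with dbf.Table(bad_path) as bad_table:
        for record in bad_table:
            data = record._data[1:].tostring()  # skip the delete flag
            rows.append((data[:4], data[4:12], data[12:16], data[16:].strip()))

    # 2. write a fresh table alongside the damaged one
    fixed_path = os.path.splitext(bad_path)[0] + '_fixed.dbf'
    fixed_table = dbf.Table(fixed_path,
                            'trajnum C(4); yyyymmdd C(8); time C(4); level C(9)')
    with fixed_table:
        for row in rows:
            fixed_table.append(row)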

Python - Sort files based on timestamp

I have a list of file names that I want to sort based on the timestamp that is embedded in each file name.
Note: in a file name such as Hello_Hi_2015-02-20T084521_1424543480.tar.gz, the part 2015-02-20T084521 is "year-month-dayTHHMMSS" (this is what I want to sort on).
Input file below:
file_list = ['Hello_Hi_2015-02-20T084521_1424543480.tar.gz',
'Hello_Hi_2015-02-20T095845_1424543481.tar.gz',
'Hello_Hi_2015-02-20T095926_1424543481.tar.gz',
'Hello_Hi_2015-02-20T100025_1424543482.tar.gz',
'Hello_Hi_2015-02-20T111631_1424543483.tar.gz',
'Hello_Hi_2015-02-20T111718_1424543483.tar.gz',
'Hello_Hi_2015-02-20T112502_1424543483.tar.gz',
'Hello_Hi_2015-02-20T112633_1424543484.tar.gz',
'Hello_Hi_2015-02-20T113427_1424543484.tar.gz',
'Hello_Hi_2015-02-20T113456_1424543484.tar.gz',
'Hello_Hi_2015-02-20T113608_1424543484.tar.gz',
'Hello_Hi_2015-02-20T113659_1424543485.tar.gz',
'Hello_Hi_2015-02-20T113809_1424543485.tar.gz',
'Hello_Hi_2015-02-20T113901_1424543485.tar.gz',
'Hello_Hi_2015-02-20T113955_1424543485.tar.gz',
'Hello_Hi_2015-03-20T114122_1424543485.tar.gz',
'Hello_Hi_2015-02-20T114532_1424543486.tar.gz',
'Hello_Hi_2015-02-20T120045_1424543487.tar.gz',
'Hello_Hi_2015-02-20T120146_1424543487.tar.gz',
'Hello_WR_2015-02-20T084709_1424543480.tar.gz',
'Hello_WR_2015-02-20T113016_1424543486.tar.gz']
Output should be:
file_list = ['Hello_Hi_2015-02-20T084521_1424543480.tar.gz',
'Hello_WR_2015-02-20T084709_1424543480.tar.gz',
'Hello_Hi_2015-02-20T095845_1424543481.tar.gz',
'Hello_Hi_2015-02-20T095926_1424543481.tar.gz',
'Hello_Hi_2015-02-20T100025_1424543482.tar.gz',
'Hello_Hi_2015-02-20T111631_1424543483.tar.gz',
'Hello_Hi_2015-02-20T111718_1424543483.tar.gz',
'Hello_Hi_2015-02-20T112502_1424543483.tar.gz',
'Hello_Hi_2015-02-20T112633_1424543484.tar.gz',
'Hello_WR_2015-02-20T113016_1424543486.tar.gz',
'Hello_Hi_2015-02-20T113427_1424543484.tar.gz',
'Hello_Hi_2015-02-20T113456_1424543484.tar.gz',
'Hello_Hi_2015-02-20T113608_1424543484.tar.gz',
'Hello_Hi_2015-02-20T113659_1424543485.tar.gz',
'Hello_Hi_2015-02-20T113809_1424543485.tar.gz',
'Hello_Hi_2015-02-20T113901_1424543485.tar.gz',
'Hello_Hi_2015-02-20T113955_1424543485.tar.gz',
'Hello_Hi_2015-02-20T114532_1424543486.tar.gz',
'Hello_Hi_2015-02-20T120045_1424543487.tar.gz',
'Hello_Hi_2015-02-20T120146_1424543487.tar.gz',
'Hello_Hi_2015-03-20T114122_1424543485.tar.gz']
Below is the code which I have tried.
def sort(dir):
    os.chdir(dir)
    file_list = glob.glob('Hello_*')
    file_list.sort(key=os.path.getmtime)
    print("\n".join(file_list))
    return 0
Thanks in advance!!
This worked for me to sort files that did not have the timestamp in the name, using the filesystem modification time:
import os

files = [name for name in os.listdir(".") if name.lower().endswith('.gz')]
files.sort(key=os.path.getmtime)
for name in files:
    print(name)
Would this work?
You could write list contents to a file line by line and read the file:
lines = sorted(open(open_file).readlines(),
               key=lambda line: line.split("_")[2])
Further, you could print out lines.
Your code is trying to sort based on the filesystem-stored modified time, not the filename time.
Since your filename encoding is slightly sane :-), if you want to sort based on the filename alone, you may use:
sorted(os.listdir(dir), key=lambda s: s[9:])
That will do, but only because the timestamp encoding in the filename is sane: fixed-length prefix, zero-padded, constant-width numbers, going in sequence from biggest time reference (year) to the lowest one (second).
If your prefix is not fixed, you can try something with RegExp like this (which will sort by the value after the second underscore):
import re
pat = re.compile('_.*?(_)')
sorted(os.listdir(dir), key=lambda s: s[pat.search(s).end():])
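For illustration, the same idea applied to the in-memory file_list from the question, sorting on the timestamp chunk between the second and third underscores:

# sort the question's list by the embedded timestamp
sorted_list = sorted(file_list, key=lambda name: name.split('_')[2])
for name in sorted_list:
    print(name)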

How to parse/pull specific data out of a file with Python

I have an interesting issue I am trying to solve and I have taken a good stab at it but need a little help. I have a squishy file that contains some lua code. I am trying to read this file and build a file path out of it. However, depending on where this file was generated from, it may contain some information or it might miss some. Here is an example of the squishy file I need to parse.
Module "foo1"
Module "foo2"
Module "common.command" "common/command.lua"
Module "common.common" "common/common.lua"
Module "common.diagnostics" "common/diagnostics.lua"
Here is the code I have written to read the file and search for the lines containing Module. You will see that there are three different sections or columns to this file. If you look at line 3 you will have "Module" for column1, "common.command" for column2 and "common/command.lua" for column3.
Taking Column3 as an example: if there is data in the 3rd column, then I just need to strip the quotes off and grab the data in Column3. In this case it would be common/command.lua. If there is no data in Column3, then I need to get the data out of Column2, replace the period (.) with an os.path.sep, and then tack a .lua extension onto the file. If, say, line 4 had no third column, I would need to pull out common.common and make it common/common.lua.
squishyContent = []
if os.path.isfile(root + os.path.sep + "squishy"):
    self.Log("Parsing Squishy")
    with open(root + os.path.sep + "squishy") as squishyFile:
        lines = squishyFile.readlines()
    squishyFile.close()
    for line in lines:
        if line.startswith("Module "):
            path = line.replace('Module "', '').replace('"', '').replace("\n", '').replace(".", "/") + ".lua"
Just need some examples/help in getting through this.
This might sound silly, but the easiest approach is to convert everything you told us about your task to code.
for line in lines:
    # if the line doesn't start with "Module ", ignore it
    if not line.startswith('Module '):
        continue
    # As you said, there are 3 columns separated by a blank, so split
    # the text into (up to) 3 columns, dropping the trailing newline.
    line = line.strip().split(' ')
    # if there are more than 2 columns, use the 3rd column's text (and remove the quotes "")
    if len(line) > 2:
        line = line[2][1:-1]
    # otherwise, ...
    else:
        line = line[1]  # use the 2nd column's text
        line = line[1:-1]  # remove the quotes ""
        line = line.replace('.', os.path.sep)  # replace . with the path separator
        line += '.lua'  # and add .lua
    print line  # prove it works.
With a simple problem like this, it's easy to make the program do exactly what you yourself would do if you did the task manually.
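As a quick sanity check, the five sample lines from the question run through that logic (wrapped in a small helper purely for illustration) come out as foo1.lua, foo2.lua, common/command.lua, common/common.lua and common/diagnostics.lua (with / being os.path.sep on a unix-like system):

import os

def module_path(line):
    # return the resolved .lua path for a "Module ..." line, or None otherwise
    if not line.startswith('Module '):
        return None
    parts = line.strip().split(' ')
    if len(parts) > 2:
        return parts[2][1:-1]  # third column, quotes stripped
    name = parts[1][1:-1]      # second column, quotes stripped
    return name.replace('.', os.path.sep) + '.lua'

sample = [
    'Module "foo1"',
    'Module "foo2"',
    'Module "common.command" "common/command.lua"',
    'Module "common.common" "common/common.lua"',
    'Module "common.diagnostics" "common/diagnostics.lua"',
]
for line in sample:
    print module_path(line)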

Loop through multiple csv files and write one column into new output csv

I have 251 CSV files in a folder. They are named "returned UDTs 1-12-13.csv", "returned UDTs 1-13-13.csv", and so on. The dates are not consecutive, however. For example, holidays and weekends may have missing dates, so the next file may be "returned UDTs 1-17-13.csv". Each file has one column of data. I need to extract each column and append it into one column in one new output csv file. I want to write a python script to do so. In a dummy folder with 3 dummy csv files (csv1.csv, csv2.csv, and csv3.csv) I created the following script that works:
import csv, os, sys

out_csv = r"C:\OutCSV\csvtest.csv"
path = r"C:\CSV_test"

fout = open(out_csv, "a")

# first file:
for line in open(path + "\csv1.csv"):
    fout.write(line)

# now the rest:
for num in range(2, 4):
    f = open(path + "\csv" + str(num) + ".csv")
    f.next()  # skip the header
    for line in f:
        fout.write(line)
    f.close()  # don't know if needed
fout.close()
The issue is the date in the file name and how to deal with it. Any help would be appreciated.
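Since the remaining issue is the date embedded in the file name, one way to handle it (a rough sketch, untested against the real data; the name pattern is assumed to be "returned UDTs M-D-YY.csv") is to glob for the files, parse the date out of each name with datetime.strptime, and process them in date order:

import glob
import os
from datetime import datetime

path = r"C:\CSV_test"
out_csv = r"C:\OutCSV\csvtest.csv"

def file_date(filename):
    # "returned UDTs 1-12-13.csv" -> datetime(2013, 1, 12)
    stem = os.path.splitext(os.path.basename(filename))[0]
    date_part = stem.replace("returned UDTs ", "")
    return datetime.strptime(date_part, "%m-%d-%y")

files = sorted(glob.glob(os.path.join(path, "returned UDTs *.csv")),
               key=file_date)

fout = open(out_csv, "w")
for i, name in enumerate(files):
    f = open(name)
    if i > 0:
        f.next()  # skip the header in all but the first file
    for line in f:
        fout.write(line)
    f.close()
fout.close()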