Abbreviate the import of multiple files with loadtxt (Python) - python-2.7

I want to abbreviate the way I import multiple files with loadtxt. At the moment I do the following:
rc1 =loadtxt("20120701_Gp_xr_5m.txt", skiprows=19)
rc2 =loadtxt("20120702_Gp_xr_5m.txt", skiprows=19)
rc3 =loadtxt("20120703_Gp_xr_5m.txt", skiprows=19)
rc4 =loadtxt("20120704_Gp_xr_5m.txt", skiprows=19)
rc5 =loadtxt("20120705_Gp_xr_5m.txt", skiprows=19)
rc6 =loadtxt("20120706_Gp_xr_5m.txt", skiprows=19)
rc7 =loadtxt("20120707_Gp_xr_5m.txt", skiprows=19)
rc8 =loadtxt("20120708_Gp_xr_5m.txt", skiprows=19)
rc9 =loadtxt("20120709_Gp_xr_5m.txt", skiprows=19)
rc10 =loadtxt("20120710_Gp_xr_5m.txt", skiprows=19)
Then I concatenate them using:
GOES = concatenate((rc1, rc2, rc3, rc4, rc5, rc6, rc7, rc8, rc9, rc10), axis=0)
But my question is: can I reduce all of this, maybe with a for loop or something like that, since the files are a sequence of dates (strings)?
I was thinking of doing something like this:
day = ####  # I don't know how to define a string going from 01 to 31, for example
data = "201207" + day + "_Gp_xr_5m.txt"
and then doing this, but I don't think it is correct:
GOES = loadtxt(data, skiprows=19)

Yes, you can easily get your sub-arrays with a for-loop, or with an equivalent list comprehension. Use the glob module to get the desired file names:
import numpy as np # you probably don't need this line
from glob import glob
fnames = glob('path/to/dir')
arrays = [np.loadtxt(f, skiprows=19) for f in fnames]
final_array = np.concatenate(arrays)
If memory use becomes a problem, you can also iterate over all files line by line by chaining them and feeding that generator to np.loadtxt.
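For example, here is a minimal sketch of that chaining idea (assuming the same filename pattern as above): np.loadtxt accepts a generator of lines, so you can skip each file's 19 header lines and feed the remaining lines from all files as one stream:
import itertools
import numpy as np
from glob import glob

def data_lines(fname, skip=19):
    # yield only the data lines of one file, skipping its header
    with open(fname) as f:
        for line in itertools.islice(f, skip, None):
            yield line

fnames = sorted(glob('201207*_Gp_xr_5m.txt'))
all_lines = itertools.chain.from_iterable(data_lines(f) for f in fnames)
GOES = np.loadtxt(all_lines)  # no skiprows here, the headers were already skipped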
edit after OP's comment
My example with glob wasn't very clear.
You can use "wildcards" (*) to match files, e.g. glob('*') to get a list of all files in the current directory. Part of the code above could therefore be written better as:
fnames = glob('path/to/dir/201207*_Gp_xr_5m.txt')
Or if your program already runs from the right directory:
fnames = glob('201207*_Gp_xr_5m.txt')
I forgot this earlier, but you should also sort the list of filenames, because glob does not guarantee any particular order:
fnames.sort()
A slightly different approach, more in the direction of what you were thinking, is the following. When the variable day contains the day number, you can put it in the filename like so:
daystr = str(day).zfill(2)
fname = '201207' + daystr + '_Gp_xr_5m.txt'
Or using a clever format specifier:
fname = '201207{:02}_Gp_xr_5m.txt'.format(day)
Or the "old" way:
fname = '201207%02i_Gp_xr_5m.txt' % day
Then simply use this in a for-loop:
arrays = []
for day in range(1, 32):
    daystr = str(day).zfill(2)
    fname = '201207' + daystr + '_Gp_xr_5m.txt'
    a = np.loadtxt(fname, skiprows=19)
    arrays.append(a)
final_array = np.concatenate(arrays)


Python create folders structure based on current date

I'm using the following variable in my script to send the output to one directory:
output = "/opt/output"
I want to adjust it so the output is relative to the current date at the time the script is triggered. It should be structured like this:
output = "/opt/output/year/month/day"
I'm not sure I'm doing this the correct way; I used the following approach:
output = "/opt/output/" + today.strftime('%Y%m%d')
Any tips here?
I recommend you use the full timestamp instead of just using the date:
import os
mydir = os.path.join(output, datetime.datetime.now().strftime('%Y/%m/%d_%H-%M-%S'))
It's recommended to do it this way, because what would happen if your script runs more than once a day? You should at least add a counter or something (if you don't want the full timestamp) that increments if the folder already exists.
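For example, a minimal sketch of that counter idea (the incrementing-suffix naming scheme is an assumption, not from the question):
import os
import datetime

output = "/opt/output"
today = datetime.datetime.now()

base = os.path.join(output, today.strftime('%Y/%m/%d'))
mydir = base
counter = 1
while os.path.exists(mydir):            # a folder from an earlier run today?
    mydir = '%s_%d' % (base, counter)   # append an incrementing suffix
    counter += 1
os.makedirs(mydir)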
You can read more about os.path.join here
As for creating the folder, you can do it like this:
if not os.path.exists(mydir):
    os.makedirs(mydir)
I figured it out with:
today = datetime.datetime.now()
year = today.strftime("%Y")
month = today.strftime("%m")
day = today.strftime("%d")
output = "/opt/output/" + year + "/" + month + "/" + day
That worked fine for me.
I suggest using os.path.join and os.path.sep:
import os
.
.
.
full_dir = os.path.join(base_dir, today.strftime('%Y{0}%m{0}%d').format(os.path.sep))
today.strftime('%Y%m%d') would print today's date as 20170607, but I guess you want it printed as 2017/06/07. You could explicitly add the slashes and build it something like this:
output = "/opt/output/" + str(today.year) + "/" + str(today.month).zfill(2) + "/" + str(today.day).zfill(2)

deleting semicolons in a column of csv in python

I have a column of different times and I want to find the values in between two different times but can't figure out how. For example: 09:04:00 through 09:25:00, and then just use the values in between those times.
I was going to just delete the colons separating hours:minutes:seconds and do it that way, but I really don't know how to do that. I do know how to find a value in a column, so I figured that way would be easier.
Here is the csv I'm working with.
DATE,TIME,OPEN,HIGH,LOW,CLOSE,VOLUME
02/03/1997,09:04:00,3046.00,3048.50,3046.00,3047.50,505
02/03/1997,09:05:00,3047.00,3048.00,3046.00,3047.00,162
02/03/1997,09:06:00,3047.50,3048.00,3047.00,3047.50,98
02/03/1997,09:07:00,3047.50,3047.50,3047.00,3047.50,228
02/03/1997,09:08:00,3048.00,3048.00,3047.50,3048.00,136
02/03/1997,09:09:00,3048.00,3048.00,3046.50,3046.50,174
02/03/1997,09:10:00,3046.50,3046.50,3045.00,3045.00,134
02/03/1997,09:11:00,3045.50,3046.00,3044.00,3045.00,43
02/03/1997,09:12:00,3045.00,3045.50,3045.00,3045.00,214
02/03/1997,09:13:00,3045.50,3045.50,3045.50,3045.50,8
02/03/1997,09:14:00,3045.50,3046.00,3044.50,3044.50,152
02/03/1997,09:15:00,3044.00,3044.00,3042.50,3042.50,126
02/03/1997,09:16:00,3043.50,3043.50,3043.00,3043.00,128
02/03/1997,09:17:00,3042.50,3043.50,3042.50,3043.50,23
02/03/1997,09:18:00,3043.50,3044.50,3043.00,3044.00,51
02/03/1997,09:19:00,3044.50,3044.50,3043.00,3043.00,18
02/03/1997,09:20:00,3043.00,3045.00,3043.00,3045.00,23
02/03/1997,09:21:00,3045.00,3045.00,3044.50,3045.00,51
02/03/1997,09:22:00,3045.00,3045.00,3045.00,3045.00,47
02/03/1997,09:23:00,3045.50,3046.00,3045.00,3045.00,77
02/03/1997,09:24:00,3045.00,3045.00,3045.00,3045.00,131
02/03/1997,09:25:00,3044.50,3044.50,3043.50,3043.50,138
02/03/1997,09:26:00,3043.50,3043.50,3043.50,3043.50,6
02/03/1997,09:27:00,3043.50,3043.50,3043.00,3043.00,56
02/03/1997,09:28:00,3043.00,3044.00,3043.00,3044.00,32
02/03/1997,09:29:00,3044.50,3044.50,3044.50,3044.50,63
02/03/1997,09:30:00,3045.00,3045.00,3045.00,3045.00,28
02/03/1997,09:31:00,3045.00,3045.50,3045.00,3045.50,75
02/03/1997,09:32:00,3045.50,3045.50,3044.00,3044.00,54
02/03/1997,09:33:00,3043.50,3044.50,3043.50,3044.00,96
02/03/1997,09:34:00,3044.00,3044.50,3044.00,3044.50,27
02/03/1997,09:35:00,3044.50,3044.50,3043.50,3044.50,44
02/03/1997,09:36:00,3044.00,3044.00,3043.00,3043.00,61
02/03/1997,09:37:00,3043.50,3043.50,3043.50,3043.50,18
Thanks for your time.
If you just want to replace the colons with commas you can use the built-in string replace function.
line = '02/03/1997,09:04:00,3046.00,3048.50,3046.00,3047.50,505'
line = line.replace(':', ',')
print(line)
Output
02/03/1997,09,04,00,3046.00,3048.50,3046.00,3047.50,505
Then split on commas to separate the data.
line.split(',')
If you only want the numerical values you could also do the following (using a regular expression):
import re
line = '02/03/1997,09:04:00,3046.00,3048.50,3046.00,3047.50,505'
values = [float(x) for x in re.sub(r'[^\w.]+', ',', line).split(',')]
print values
Which gives you a list of numerical values that you can process.
[2.0, 3.0, 1997.0, 9.0, 4.0, 0.0, 3046.0, 3048.5, 3046.0, 3047.5, 505.0]
Use the csv module! :)
>>> import csv
>>> with open('myFile.csv', newline='') as csvfile:
...     myCsvreader = csv.reader(csvfile, delimiter=',', quotechar='|')
...     for row in myCsvreader:
...         for item in row:
...             item.split(':')  # splits e.g. '09:04:00' on the colons
Once you have extracted the different time stamps, you can use the datetime module, for example:
from datetime import datetime, date, time
x = time(hour=9, minute=30, second=30)
y = time(hour=9, minute=30, second=42)
diff = datetime.combine(date.today(), y) - datetime.combine(date.today(), x)
print diff.total_seconds()
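If the actual goal is to select the rows between two times rather than to strip the colons, here is a minimal sketch that parses the TIME column and keeps only the rows in the 09:04:00 to 09:25:00 window (the file name myFile.csv is an assumption):
import csv
from datetime import datetime

start = datetime.strptime('09:04:00', '%H:%M:%S').time()
end = datetime.strptime('09:25:00', '%H:%M:%S').time()

selected = []
with open('myFile.csv') as csvfile:
    reader = csv.DictReader(csvfile)  # uses the DATE,TIME,... header row
    for row in reader:
        t = datetime.strptime(row['TIME'], '%H:%M:%S').time()
        if start <= t <= end:
            selected.append(row)

for row in selected:
    print(row['TIME'], row['CLOSE'])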

Python read text file based on partial name and file timestamp

I'm trying to pull two of the same files into python in different dataframes, with the end goal of comparing what was added in the new file and removed from the old. So far, I've got code that looks like this:
In [1]: path = r'\\Documents\FileList'
        files = os.listdir(path)
In [2]: files_txt = [f for f in files if f[-3:] == 'txt']
In [3]: for f in files_txt:
            data = pd.read_excel(path + r'\\' + f)
            df = df.append(data)
I've also set a variable to equal the current date minus a certain number of days, which I want to use to pull the file that has a date equal to that variable:
d7 = dt.datetime.today() - timedelta(7)
As of now, I'm unsure of how to do this, as the first part of the filename always stays the same but a date is appended at the end (e.g. file_03232016, then file_03302016). I want to parse through the directory for the beginning part of the filename and add the file to a dataframe if it matches the date parameter I set.
EDIT: I forgot to add that sometimes I also need to look at the file's system creation timestamp, as the date in the file name isn't always there.
Here are some modifications to your original code to get a list of files containing your target date. You need to use strftime.
import os
import datetime as dt
from datetime import timedelta
import pandas as pd

d7 = dt.datetime.today() - timedelta(7)
target_date_str = d7.strftime('_%m%d%Y')
files_txt = [f for f in files if f[-13:] == target_date_str + '.txt']

>>> target_date_str + '.txt'
'_03232016.txt'

data = []
for f in files_txt:
    data.append(pd.read_excel(os.path.join(path, f)))
df = pd.concat(data, ignore_index=True)
Use strftime to represent your datetime variable as a string with the desired format, and glob to search for files by file mask in the directory:
import datetime as dt
import glob
fmask = r'\\Documents\FileList\*' + (dt.datetime.today() - dt.timedelta(7)).strftime('%m%d%Y') + '*.txt'
files_txt = glob.glob(fmask)
# concatenate all CSV/txt files into one data frame
df = pd.concat([pd.read_csv(f) for f in files_txt], ignore_index=True)
P.S. I guess you want to use read_csv instead of read_excel when working with txt files, unless you really have Excel files with a txt extension?
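Since the question also mentions falling back to the filesystem timestamp when the filename has no date, here is a minimal sketch of that part (the path is the one from the question; whether you want modification or creation time is an assumption, see the comment):
import os
import datetime as dt

path = r'\\Documents\FileList'
target_day = (dt.datetime.today() - dt.timedelta(7)).date()

matches = []
for f in os.listdir(path):
    if not f.endswith('.txt'):
        continue
    full = os.path.join(path, f)
    # getmtime is the last-modified time; use os.path.getctime instead if you
    # want the Windows creation time
    file_day = dt.datetime.fromtimestamp(os.path.getmtime(full)).date()
    if file_day == target_day:
        matches.append(full)

print(matches)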

Python - Sort files based on timestamp

I have a list of file names that I want to sort based on a timestamp which is embedded in each file name.
Note: in Hello_Hi_2015-02-20T084521_1424543480.tar.gz, the part 2015-02-20T084521 represents "year-month-dayTHHMMSS" (this is what I want to sort on).
Input list:
file_list = ['Hello_Hi_2015-02-20T084521_1424543480.tar.gz',
'Hello_Hi_2015-02-20T095845_1424543481.tar.gz',
'Hello_Hi_2015-02-20T095926_1424543481.tar.gz',
'Hello_Hi_2015-02-20T100025_1424543482.tar.gz',
'Hello_Hi_2015-02-20T111631_1424543483.tar.gz',
'Hello_Hi_2015-02-20T111718_1424543483.tar.gz',
'Hello_Hi_2015-02-20T112502_1424543483.tar.gz',
'Hello_Hi_2015-02-20T112633_1424543484.tar.gz',
'Hello_Hi_2015-02-20T113427_1424543484.tar.gz',
'Hello_Hi_2015-02-20T113456_1424543484.tar.gz',
'Hello_Hi_2015-02-20T113608_1424543484.tar.gz',
'Hello_Hi_2015-02-20T113659_1424543485.tar.gz',
'Hello_Hi_2015-02-20T113809_1424543485.tar.gz',
'Hello_Hi_2015-02-20T113901_1424543485.tar.gz',
'Hello_Hi_2015-02-20T113955_1424543485.tar.gz',
'Hello_Hi_2015-03-20T114122_1424543485.tar.gz',
'Hello_Hi_2015-02-20T114532_1424543486.tar.gz',
'Hello_Hi_2015-02-20T120045_1424543487.tar.gz',
'Hello_Hi_2015-02-20T120146_1424543487.tar.gz',
'Hello_WR_2015-02-20T084709_1424543480.tar.gz',
'Hello_WR_2015-02-20T113016_1424543486.tar.gz']
Output should be:
file_list = ['Hello_Hi_2015-02-20T084521_1424543480.tar.gz',
'Hello_WR_2015-02-20T084709_1424543480.tar.gz',
'Hello_Hi_2015-02-20T095845_1424543481.tar.gz',
'Hello_Hi_2015-02-20T095926_1424543481.tar.gz',
'Hello_Hi_2015-02-20T100025_1424543482.tar.gz',
'Hello_Hi_2015-02-20T111631_1424543483.tar.gz',
'Hello_Hi_2015-02-20T111718_1424543483.tar.gz',
'Hello_Hi_2015-02-20T112502_1424543483.tar.gz',
'Hello_Hi_2015-02-20T112633_1424543484.tar.gz',
'Hello_WR_2015-02-20T113016_1424543486.tar.gz',
'Hello_Hi_2015-02-20T113427_1424543484.tar.gz',
'Hello_Hi_2015-02-20T113456_1424543484.tar.gz',
'Hello_Hi_2015-02-20T113608_1424543484.tar.gz',
'Hello_Hi_2015-02-20T113659_1424543485.tar.gz',
'Hello_Hi_2015-02-20T113809_1424543485.tar.gz',
'Hello_Hi_2015-02-20T113901_1424543485.tar.gz',
'Hello_Hi_2015-02-20T113955_1424543485.tar.gz',
'Hello_Hi_2015-02-20T114532_1424543486.tar.gz',
'Hello_Hi_2015-02-20T120045_1424543487.tar.gz',
'Hello_Hi_2015-02-20T120146_1424543487.tar.gz',
'Hello_Hi_2015-03-20T114122_1424543485.tar.gz']
Below is the code I have tried:
def sort(dir):
    os.chdir(dir)
    file_list = glob.glob('Hello_*')
    file_list.sort(key=os.path.getmtime)
    print("\n".join(file_list))
    return 0
Thanks in advance!!
This worked for me; it sorted the files by modification time, even though they did not have the timestamp in the name:
import os
files = [f for f in os.listdir(".") if f.lower().endswith('.gz')]
for f in sorted(files, key=os.path.getmtime):
    print(f)
Would this work?
You could write the list contents to a file line by line, then read that file back and sort the lines:
lines = sorted(open(open_file).readlines(), key=lambda line: line.split("_")[2])
Then you can print out lines.
Your code is trying to sort based on the filesystem-stored modified time, not the filename time.
Since your filename encoding is slightly sane :-), if you want to sort based on the filename alone you may use:
sorted(os.listdir(dir), key=lambda s: s[9:])
That will do, but only because the timestamp encoding in the filename is sane: fixed-length prefix, zero-padded, constant-width numbers, going in sequence from biggest time reference (year) to the lowest one (second).
If your prefix is not fixed, you can try something with RegExp like this (which will sort by the value after the second underscore):
import re
pat = re.compile('_.*?(_)')
sorted(os.listdir(dir), key=lambda s: s[pat.search(s).end():])
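Another option (a sketch, not from the answers above) is to pull the timestamp out with a regular expression and sort on the parsed datetime, so the prefix length does not matter at all:
import re
from datetime import datetime

ts_pat = re.compile(r'\d{4}-\d{2}-\d{2}T\d{6}')

def timestamp_key(name):
    # extract e.g. '2015-02-20T084521' and parse it into a datetime
    return datetime.strptime(ts_pat.search(name).group(), '%Y-%m-%dT%H%M%S')

sorted_files = sorted(file_list, key=timestamp_key)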

Python: Copy several files with one column into one file with multi-column

I have the following question in Python 2.7:
I have 20 different txt files, each with exactly one column of numbers. As output, I would like to have one file with all those columns side by side. How can I concatenate one-column files in Python? I was thinking about using the fileinput module, but I fear I would have to open all my different txt files at once.
My idea:
filenames = ['input1.txt', 'input2.txt', ..., 'input20.txt']
import fileinput
with open('/path/output.txt', 'w') as outfile:
    for line in fileinput.input(filenames):
        outfile.write(line)
Any suggestions on that?
Thanks for any help!
A very simple (naive?) solution is:
filenames = ['a.txt', 'b.txt', 'c.txt', 'd.txt']
columns = []
for filename in filenames:
    lines = []
    for line in open(filename):
        lines.append(line.strip('\n'))
    columns.append(lines)
rows = zip(*columns)
with open('output.txt', 'w') as outfile:
    for row in rows:
        outfile.write("\t".join(row))
        outfile.write("\n")
But on *nix (including OS X terminal and Cygwin), it's easier to
$ paste a.txt b.txt c.txt d.txt
from the command line.
My suggestion: a slightly functional approach. Use a list comprehension to zip the file being read with the accumulated columns, then join them back into strings, one column (file) at a time:
filenames = ['input1.txt', 'input2.txt', 'input20.txt']
outputfile = 'output.txt'
# maybe you need to separate each column:
separator = " "
separator_list = []
output_list = []
for f in filenames:
    with open(f, 'r') as inputfile:
        if len(output_list) == 0:
            # first file: its lines (without trailing newlines) start the accumulator
            output_list = [line.rstrip('\n') for line in inputfile]
            separator_list = [separator for x in range(0, len(output_list))]
        else:
            input_list = [line.rstrip('\n') for line in inputfile]
            output_list = [''.join(x) for x in
                           [list(y) for y in zip(output_list, separator_list, input_list)]]
with open(outputfile, 'w') as output:
    output.writelines([line + '\n' for line in output_list])
It keeps the accumulator for the result (output_list) in memory, plus one file at a time (the one being read, which is also the only file open for reading). It may be a little slower and, of course, it is not fail-proof.
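As another option (a sketch, not from the answers above): you can open all the files at once and merge them row by row with zip, so only one output row is built at a time on Python 3 (on Python 2, zip materialises the full list, so use itertools.izip there if the files are large):
filenames = ['input1.txt', 'input2.txt', 'input20.txt']

handles = [open(f) for f in filenames]
try:
    with open('output.txt', 'w') as outfile:
        for row in zip(*handles):  # stops at the shortest file
            outfile.write(" ".join(col.strip() for col in row) + "\n")
finally:
    for h in handles:
        h.close()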