I'm trying to pull two of the same files into python in different dataframes, with the end goal of comparing what was added in the new file and removed from the old. So far, I've got code that looks like this:
In[1] path = r'\\Documents\FileList'
files = os.listdir(path)
In[2] files_txt = [f for f in files if f[-3:] == 'txt']
In[3] for f in files_txt:
data = pd.read_excel(path + r'\\' + f)
df = df.append(data)
I've also set a variable to equal the current date minus a certain number of days, which I want to use to pull the file that has a date equal to that variable:
d7 = dt.datetime.today() - timedelta(7)
As of now, I'm unsure of how to do this, as the first part of the filename always remains the same but they add numbers at the end (eg. file_03232016 then file_03302016). I want to parse through the directory for the beginning part of the filename and add it to a dataframe if it matches the date parameter I set.
EDIT: I forgot to add that sometimes I also need to look at the system date created timestamp, as the text date in the file name isn't always there.
Here are some modifications to your original code to get a list of files containing your target date. You need to use strftime.
import os
from datetime import timedelta
d7 = dt.datetime.today() - timedelta(7)
target_date_str = d7.strftime('_%m%d%Y')
files_txt = [f for f in files if f[-13:] == target_date_str + '.txt']
>>> target_date_str + '.txt'
'_03232016.txt'
data = []
for f in files_txt:
data.append(pd.read_excel(os.path.join(path, f))
df = pd.concat(data, ignore_index=True)
Use strftime in order to represent your datetime variable as a string with desired format and glob for searching files by file mask in the directory:
import datetime as dt
import glob
fmask = r'\\Documents\FileList\*' + (dt.datetime.today() - dt.timedelta(7)).strftime('%m%d%Y') + '*.txt'
files_txt = glob.glob(fmask)
# concatenate all CSV/txt files into one data frame
df = pd.concat([pd.read_csv(f) for f in files_txt], ignore_index=True)
PS I guess you want to use read_csv instead of read_excel when working with txt files unless you really have excel files with txt extension?
Related
I have 1000 of subdirectories (error1 - error1000) with three different csv files (rand.csv, run_error.csv, swe_error.csv). Each vsc has index row. I need to merge the csv files that have the same filename, so I end up with e.g. rand_merge.csv with index row and 1000 rows of data.
I followed Merge multiple csv files with same name in 10 different subdirectory, which gets me
KeyError: 'filename'
I can't figure out how to fix it, so any help is appreciated.
Thx
Update: Here's the exact code, which came from linked post above:
import pandas as pd
import glob
CONCAT_DIR = "./error/files_concat/"
# Use glob module to return all csv files under root directory. Create DF from this.
files = pd.DataFrame([file for file in glob.glob("error/*/*")], columns=["fullpath"])
# Split the full path into directory and filename
files_split = files['fullpath'].str.rsplit("\\", 1, expand=True).rename(columns={0: 'path', 1:'filename'})
# Join these into one DataFrame
files = files.join(files_split)
# Iterate over unique filenames; read CSVs, concat DFs, save file
for f in files['filename'].unique():
paths = files[files['filename'] == f]['fullpath'] # Get list of fullpaths from unique filenames
dfs = [pd.read_csv(path, header=None) for path in paths] # Get list of dataframes from CSV file paths
concat_df = pd.concat(dfs) # Concat dataframes into one
concat_df.to_csv(CONCAT_DIR + f) # Save dataframe
I found my mistake. I needed a "/" after rsplit, not "\"
files_split = files['fullpath'].str.rsplit("/", 1, expand=True).rename(columns={0: 'path', 1:'filename'})
I have a script which generates CSV files and names them as per time stamp
-rw-rw-r-- 1 9949 Oct 13 11:57 2018-10-13-11:57:10.796516.csv
-rw-rw-r-- 1 9649 Oct 13 12:58 2018-10-13-12:58:12.907835.csv
-rw-rw-r-- 1 9649 Oct 13 13:58 2018-10-13-13:58:10.502635.csv
I need to pick column C from these sheets and write to a new CSV file. However order of columns in new sheet should be as per the name of existing sheets.
Example, Column C from the file generated at 11:57 should be in column A , from 12:58 in Column B and 13:38 in Column C of new sheet.
EDIT -- Code tried based on Bilal Input. It does move the C column from all existing sheets to a new sheet, however not in proper order. It just picks them randomly and keeps adding in columns on new file.
import os
import re
import pandas as pd
newCSV = pd.DataFrame.from_dict({})
# get a list of your csv files and put them files
files = [f for f in os.listdir('.') if os.path.isfile(f)]
results = []
for f in files:
if re.search('.csv', f):
results += [f]
for file in results:
df = pd.read_csv(file,usecols=[2])
newCSV = pd.concat((newCSV, df), axis=1)
newCSV.to_csv("new.csv")
EDIT -- Final Code that worked, Thanks Bilal
import os
import re
import pandas as pd
newCSV = pd.DataFrame.from_dict({})
files = [f for f in os.listdir('.') if os.path.isfile(f)]
# get a list of your csv files and put them files
results = []
for f in files:
if re.search('.csv', f):
results += [f]
result1=sorted(results)
for file in result1:
df = pd.read_csv(file,usecols=[2])
newCSV = pd.concat((newCSV, df), axis=1)
newCSV.to_csv("new.csv")
import pandas as pd
newCSV = pd.DataFrame.from_dict({})
# get a list of your csv files and put them files
for f in files:
df = pd.read_csv(f)
newCSV = pd.concat((newCSV, df.colum_name), axis=1)
newCSV.to_csv("new.csv")
See if that works for you.
If you don't know how to find all the files with a specific extension, look here.
I have n-files in a folder like
source_dir
abc_2017-07-01.tar
abc_2017-07-02.tar
abc_2017-07-03.tar
pqr_2017-07-02.tar
Lets consider for a single pattern now 'abc'
(but I get this pattern randomly from Database, so need double filtering,one for pattern and one for last day)
And I want to extract file of last day ie '2017-07-02'
Here I can get common files but not exact last_day files
Code
pattern = 'abc'
allfiles=os.listdir(source_dir)
m_files=[f for f in allfiles if str(f).startswith(pattern)]
print m_files
output:
[ 'abc_2017-07-01.tar' , 'abc_2017-07-02.tar' , 'abc_2017-07-03.tar' ]
This gives me all files related to abc pattern, but how can filter out only last day file of that pattern
Expected :
[ 'abc_2017-07-02.tar' ]
Thanks
just a minor tweak in your code can get you the desired result.
import os
from datetime import datetime, timedelta
allfiles=os.listdir(source_dir)
file_date = datetime.now() + timedelta(days=-1)
pattern = 'abc_' +str(file_date.date())
m_files=[f for f in allfiles if str(f).startswith(pattern)]
Hope this helps!
latest = max(m_files, key=lambda x: x[-14:-4])
will find the filename with latest date among filenames in m_files.
use python regex package like :
import re
import os
files = os.listdir(source_dir)
for file in files:
match = re.search('abc_2017-07-(\d{2})\.tar', file)
day = match.group(1)
and then you can work with day in the loop to do what ever you want. Like create that list:
import re
import os
def extract_day(name):
match = re.search('abc_2017-07-(\d{2})\.tar', file)
day = match.group(1)
return day
files = os.listdir(source_dir)
days = [extract_day(file) for file in files]
if the month is also variable you can substitute '07' with '\d\d' or also '\d{2}'. Be carefull if you have files that dont match with the pattern at all, then match.group() will cause an error since match is of type none. Then use :
def extract_day(name):
match = re.search('abc_2017-07-(\d{2})\.tar', file)
try:
day = match.group(1)
except :
day = None
return day
As a part of my learning. After i successfully split with help, in my next step, wanted to know if i can split the names of files when the month name is found in the name of the file that matches with the name of the month given in this list below ---
Months=['January','February','March','April','May','June','July','August','September','October','November','December'].
When my file name is like this
1.Non IVR Entries Transactions December_16_2016_07_49_22 PM.txt
2.Denied_Calls_SMS_Sent_December_14_2016_05_33_41 PM.txt
Please note that the names of files is not same..i.e why i need to split it like
Non IVR Entries Transactions as one part and December_16_2016_07_49_22 PM as another.
import os
import os.path
import csv
path = 'C:\\Users\\akhilpriyatam.k\\Desktop\\tes'
text_files = [os.path.splitext(f)[0] for f in os.listdir(path)]
for v in text_files:
print (v[0:9])
print (v[10:])
os.chdir('C:\\Users\\akhilpriyatam.k\\Desktop\\tes')
with open('file.csv', 'wb') as csvfile:
thedatawriter = csv.writer(csvfile,delimiter=',')
for v in text_files:
s = (v[0:9])
t = (v[10:])
thedatawriter.writerow([s,t])
import re
import calendar
fullname = 'Non IVR Entries Transactions December_16_2016_07_49_22 PM.txt'
months = list(calendar.month_name[1:])
regex = re.compile('|'.join(months))
iter = re.finditer(regex, fullname)
if iter:
idx = [it for it in iter][0].start()
filename, timestamp = fullname[:idx],fullname[idx:-4]
print filename, timestamp
else:
print "Month not found"
Assuming that you want the filename and timestamp as splits and the month occurs only once in the string, I hope the following code solves your problem.
I have a list which contains list of file names, i wanted to sort based on timestamp, which ( i.e timestamp ) is inbuild in each file name.
Note: In file, Hello_Hi_2015-02-20T084521_1424543480.tar.gz --> 2015-02-20T084521 represents as "year-moth-dayTHHMMSS" ( Based on this i wanted to sort )
Input file below:
file_list = ['Hello_Hi_2015-02-20T084521_1424543480.tar.gz',
'Hello_Hi_2015-02-20T095845_1424543481.tar.gz',
'Hello_Hi_2015-02-20T095926_1424543481.tar.gz',
'Hello_Hi_2015-02-20T100025_1424543482.tar.gz',
'Hello_Hi_2015-02-20T111631_1424543483.tar.gz',
'Hello_Hi_2015-02-20T111718_1424543483.tar.gz',
'Hello_Hi_2015-02-20T112502_1424543483.tar.gz',
'Hello_Hi_2015-02-20T112633_1424543484.tar.gz',
'Hello_Hi_2015-02-20T113427_1424543484.tar.gz',
'Hello_Hi_2015-02-20T113456_1424543484.tar.gz',
'Hello_Hi_2015-02-20T113608_1424543484.tar.gz',
'Hello_Hi_2015-02-20T113659_1424543485.tar.gz',
'Hello_Hi_2015-02-20T113809_1424543485.tar.gz',
'Hello_Hi_2015-02-20T113901_1424543485.tar.gz',
'Hello_Hi_2015-02-20T113955_1424543485.tar.gz',
'Hello_Hi_2015-03-20T114122_1424543485.tar.gz',
'Hello_Hi_2015-02-20T114532_1424543486.tar.gz',
'Hello_Hi_2015-02-20T120045_1424543487.tar.gz',
'Hello_Hi_2015-02-20T120146_1424543487.tar.gz',
'Hello_WR_2015-02-20T084709_1424543480.tar.gz',
'Hello_WR_2015-02-20T113016_1424543486.tar.gz']
Output should be:
file_list = ['Hello_Hi_2015-02-20T084521_1424543480.tar.gz',
'Hello_WR_2015-02-20T084709_1424543480.tar.gz',
'Hello_Hi_2015-02-20T095845_1424543481.tar.gz',
'Hello_Hi_2015-02-20T095926_1424543481.tar.gz',
'Hello_Hi_2015-02-20T100025_1424543482.tar.gz',
'Hello_Hi_2015-02-20T111631_1424543483.tar.gz',
'Hello_Hi_2015-02-20T111718_1424543483.tar.gz',
'Hello_Hi_2015-02-20T112502_1424543483.tar.gz',
'Hello_Hi_2015-02-20T112633_1424543484.tar.gz',
'Hello_WR_2015-02-20T113016_1424543486.tar.gz',
'Hello_Hi_2015-02-20T113427_1424543484.tar.gz',
'Hello_Hi_2015-02-20T113456_1424543484.tar.gz',
'Hello_Hi_2015-02-20T113608_1424543484.tar.gz',
'Hello_Hi_2015-02-20T113659_1424543485.tar.gz',
'Hello_Hi_2015-02-20T113809_1424543485.tar.gz',
'Hello_Hi_2015-02-20T113901_1424543485.tar.gz',
'Hello_Hi_2015-02-20T113955_1424543485.tar.gz',
'Hello_Hi_2015-02-20T114532_1424543486.tar.gz',
'Hello_Hi_2015-02-20T120045_1424543487.tar.gz',
'Hello_Hi_2015-02-20T120146_1424543487.tar.gz',
'Hello_Hi_2015-03-20T114122_1424543485.tar.gz']
Below is the code which i have tried.
def sort( dir ):
os.chdir( dir )
file_list = glob.glob('Hello_*')
file_list.sort(key=os.path.getmtime)
print("\n".join(file_list))
return 0
Thanks in advance!!
So this worked for me and it sorted files by created time that did not have the time stamp in the name;
import os
import re
files = [file for file in os.listdir(".") if (file.lower().endswith('.gz'))]
files.sort(key=os.path.getmtime)
for file in sorted(files,key=os.path.getmtime):
print(file)
Would this work?
You could write list contents to a file line by line and read the file:
lines = sorted(open(open_file).readlines(), key = lambda line :
line.split("_")[2])
Further, you could print out lines.
Your code is trying to sort based on the filesystem-stored modified time, not the filename time.
Since your filename encoding is slightly sane :-) if you want to sort based on filename alone, you may use:
sorted(os.listdir(dir), key=lambda s: s[9:]))
That will do, but only because the timestamp encoding in the filename is sane: fixed-length prefix, zero-padded, constant-width numbers, going in sequence from biggest time reference (year) to the lowest one (second).
If your prefix is not fixed, you can try something with RegExp like this (which will sort by the value after the second underscore):
import re
pat = re.compile('_.*?(_)')
sorted(os.listdir(dir), key=lambda s: s[pat.search(s).end():])