merge 1000s of csv with same name in different subdirectories - python-2.7

I have 1000 subdirectories (error1 - error1000), each with three different CSV files (rand.csv, run_error.csv, swe_error.csv). Each CSV has an index row. I need to merge the CSV files that have the same filename, so I end up with e.g. rand_merge.csv with the index row and 1000 rows of data.
I followed "Merge multiple csv files with same name in 10 different subdirectory", which gets me
KeyError: 'filename'
I can't figure out how to fix it, so any help is appreciated.
Thanks
Update: Here's the exact code, which came from the linked post above:
import pandas as pd
import glob
CONCAT_DIR = "./error/files_concat/"
# Use glob module to return all csv files under root directory. Create DF from this.
files = pd.DataFrame([file for file in glob.glob("error/*/*")], columns=["fullpath"])
# Split the full path into directory and filename
files_split = files['fullpath'].str.rsplit("\\", 1, expand=True).rename(columns={0: 'path', 1:'filename'})
# Join these into one DataFrame
files = files.join(files_split)
# Iterate over unique filenames; read CSVs, concat DFs, save file
for f in files['filename'].unique():
    paths = files[files['filename'] == f]['fullpath'] # Get list of fullpaths from unique filenames
    dfs = [pd.read_csv(path, header=None) for path in paths] # Get list of dataframes from CSV file paths
    concat_df = pd.concat(dfs) # Concat dataframes into one
    concat_df.to_csv(CONCAT_DIR + f) # Save dataframe

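I found my mistake. I needed to split on "/" in rsplit, not "\". The paths returned by glob on my system contain no backslash, so splitting on "\" produced a single column (0). The rename then only created 'path', no 'filename' column existed, and files['filename'] raised the KeyError.
files_split = files['fullpath'].str.rsplit("/", 1, expand=True).rename(columns={0: 'path', 1:'filename'})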
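
A way to avoid hard-coding the separator at all is to let os.path do the splitting. The following is only a minimal sketch, not the code from the linked post: it uses the standard library instead of pandas, the function names group_by_filename and merge_group are made up for illustration, and it assumes every CSV starts with exactly one header line.

```python
import glob
import os
from collections import defaultdict


def group_by_filename(root):
    """Map each CSV basename to the sorted list of paths sharing it under root/*/."""
    groups = defaultdict(list)
    for path in sorted(glob.glob(os.path.join(root, "*", "*.csv"))):
        # os.path.basename uses the separator of the current OS,
        # so no hard-coded "/" or "\\" is needed
        groups[os.path.basename(path)].append(path)
    return dict(groups)


def merge_group(paths, out_path):
    """Write the header of the first file once, then the data rows of every file."""
    with open(out_path, "w") as out:
        for i, path in enumerate(paths):
            with open(path) as src:
                header = src.readline()  # assumption: one header line per file
                if i == 0:
                    out.write(header.rstrip("\n") + "\n")
                for line in src:
                    out.write(line.rstrip("\n") + "\n")
```

Calling group_by_filename("error") and then merge_group(groups["rand.csv"], "rand_merge.csv") would give one header row followed by the data rows of every rand.csv. The output should be written outside the error/*/ subdirectories, otherwise a second run would pick up the merged files as input.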

Related

Parse files in a directory that contain a match to a set of strings - pull line with match to new file

I need to parse through a directory of multiple excel files to find matches to a set of 500+ strings (that I currently have in a set).
If there is a match to one of the strings in an excel file, I need to pull that row out into a new file.
Please let me know if you can assist! Thank you in advance for the help!
The directory is called: All_Data
The set is from a list of strings in a file (MRN_file_path)
My code:
MRN = set()
with open(MRN_file_path) as MRN_file:
    for line in MRN_file:
        if line.strip():
            MRN.add(line.strip())
for root, dires, files in os.walk('path/All_Data'):
    for name in files:
        if name.endswith('.xlsx'):
            filepath = os.path.join(root, name)
            with open(search_results_path, "w") as search_results:
                if MRN in filepath:
                    search_results.write(line)

Your code doesn't actually read the .xlsx files; it only tests the file path against the set. As far as I know, the Python standard library has nothing for reading .xlsx files. However, you can check out openpyxl and see if that helps. Here's a solution which reads all the .xlsx files in the specified directory and writes every row containing a match into a single tab-delimited txt file.
import os
from openpyxl import load_workbook
# MRN_file_path, search_results_path and path are assumed to be defined as in the question
MRN = set()
with open(MRN_file_path) as MRN_file:
    for line in MRN_file:
        if line.strip():
            MRN.add(line.strip())
with open(search_results_path, "w") as outfile:
    for root, dires, files in os.walk(path):
        for name in files:
            if name.endswith('.xlsx'):
                filepath = os.path.join(root, name)
                # load in the .xlsx workbook
                wb = load_workbook(filename=filepath, read_only=True)
                # assuming we select the worksheet which is active
                ws = wb.active
                # iterate through each row in the worksheet
                for row in ws.rows:
                    # iterate over each cell
                    for cell in row:
                        if cell.value in MRN:
                            # collect all cell values in the matching row as strings;
                            # the 'None' check turns empty cells into empty strings
                            # so the values can be joined into a tab-delimited row
                            arr = [str(c.value) if c.value is not None else "" for c in row]
                            outfile.write("\t".join(arr) + "\n")
                            # stop at the first match so the row is written only once
                            break
If a tab-delimited output isn't what you're looking for, then you can adjust the outfile.write(...) line to whatever fits your needs.
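
One thing to watch for: the strings read from the MRN file are text, while numeric cells in a workbook usually come back as numbers, so a direct membership test can miss them. Below is a small sketch of a comparison that converts each cell value to text first. The helper names cell_text, row_matches and format_row are hypothetical, and the rows are plain tuples of values (for example what ws.iter_rows(values_only=True) yields in recent openpyxl versions), so the snippet itself needs no third-party library.

```python
def cell_text(value):
    """Convert a cell value to the text form used for matching."""
    if value is None:
        return ""
    # assumption: whole-number floats such as 12345.0 should match "12345"
    if isinstance(value, float) and value.is_integer():
        value = int(value)
    return str(value).strip()


def row_matches(values, wanted):
    """Return True if any non-empty cell, as text, is in the wanted set."""
    return any(v is not None and cell_text(v) in wanted for v in values)


def format_row(values, sep="\t"):
    """Join the cell values of one row into a delimited line (no newline)."""
    return sep.join(cell_text(v) for v in values)
```

With these helpers, the inner loop reduces to checking row_matches(row_values, MRN) and writing format_row(row_values) + "\n" for each matching row.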

create dataframe by randomly sampling from multiple files

I have a folder with several 20 million record tab delimited files in it. I would like to create a pandas dataframe where I randomly sample say 20 thousand records from each file, and then append them together in the dataframe. Does anyone know how to do that?

You could read in all the text files in a particular folder. Then you could make use of pandas DataFrame.sample.
I've provided a fully reproducible example with two example .txt files created with 200 rows each. I then take a random sample of ten rows from each file and combine the samples into a final dataframe.
import pandas as pd
import numpy as np
import glob
# Change the path for the directory
directory = r'C:\some\example\folder'
# I create two test .txt files for demonstration purposes with 200 rows each
df_test = pd.DataFrame(np.random.randn(200, 2), columns=list('AB'))
df_test.to_csv(directory + r'\test_1.txt', sep='\t', index=False)
df_test.to_csv(directory + r'\test_2.txt', sep='\t', index=False)
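# Collect one sample per file, then combine them in a single concat
# (DataFrame.append was removed in pandas 2.0)
samples = []
for filename in glob.glob(directory + r'\*.txt'):
    df_full = pd.read_csv(filename, sep='\t')
    samples.append(df_full.sample(n=10))
df = pd.concat(samples, ignore_index=True)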
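
The example above loads each file completely before sampling, which may be heavy for files with 20 million records. As an alternative that is not part of the answer above, here is a rough sketch of reservoir sampling (Algorithm R) with the standard library: it reads each file line by line and keeps only k lines in memory. The function name sample_lines and the seed parameter are made up for illustration, and the sketch assumes each file starts with one header line.

```python
import random


def sample_lines(path, k, seed=None):
    """Return (header, lines): the header line and k uniformly sampled data lines."""
    rng = random.Random(seed)
    reservoir = []
    with open(path) as f:
        header = f.readline()  # assumption: the first line is a header
        for i, line in enumerate(f):
            if i < k:
                # fill the reservoir with the first k data lines
                reservoir.append(line)
            else:
                # keep the new line with probability k / (i + 1)
                j = rng.randint(0, i)
                if j < k:
                    reservoir[j] = line
    return header, reservoir
```

The sampled lines can then be turned into a dataframe, for instance by passing io.StringIO(header + "".join(lines)) to pd.read_csv with sep='\t', and the per-file frames combined with pd.concat. If a file has fewer than k data lines, all of them are returned.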

How to rename a file name with reference to a csv file

I have files like this:
1.stream0106.wav
2.stream0205.wav
3.steram0304.wav
I need to rename "01" in a file name to "_C" and "06" to "_LFE1". I have these new names in a CSV file.
Can you please guide me on this?

I'm not sure if you want the "01" to be replaced or appended. The CSV titles make it confusing.
I would first make the CSV file start in column A and row 1 to make reading it in easier for you.
If you want the trailing digits of each file name replaced by the new name, this should work:
import os
import csv
# Assuming files are just in current directory
wav_files = [f for f in os.listdir('.') if f.endswith('.wav')]
with open('your_file.csv') as csv_file:
    # skip the header row, then keep the (old, new) pairs
    mappings = [row for row in list(csv.reader(csv_file))[1:] if len(row) == 2]
for f in wav_files:
    stem, ext = os.path.splitext(f)
    for digit, name in mappings:
        if stem.endswith(digit):
            # replace only the trailing code, not every occurrence in the name
            new_name = stem[:-len(digit)] + name + ext
            os.rename(f, new_name)
            break
EDIT: The CSV file should look like this:
Old Name,New Name
00,_0
01,_C
02,_L
03,_R
04,_Ls
05,_Rs
06,_LFE1
07,_Cs
This can be achieved by just having them in Excel starting at column A, row 1.
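
If both two-digit codes in a name should be replaced (so that stream0106.wav becomes stream_C_LFE1.wav), one possible approach is to strip known codes off the end of the name repeatedly. This is only a sketch under assumptions: the helper names load_mapping and apply_codes are hypothetical, the CSV is assumed to have the "Old Name,New Name" header shown above, and the codes are assumed to sit at the very end of the name, right before the extension.

```python
import csv
import os


def load_mapping(csv_path):
    """Read the 'Old Name' -> 'New Name' pairs from the CSV into a dict."""
    with open(csv_path, newline="") as f:
        return {row["Old Name"]: row["New Name"] for row in csv.DictReader(f)}


def apply_codes(filename, mapping):
    """Replace every known code at the end of the file stem with its new name."""
    stem, ext = os.path.splitext(filename)
    labels = []
    while True:
        for code, label in mapping.items():
            if code and stem.endswith(code):
                labels.append(label)
                stem = stem[:-len(code)]  # drop the code that was just matched
                break
        else:
            break  # no code matched the end of the stem any more
    # labels were collected right to left, so reverse them
    return stem + "".join(reversed(labels)) + ext
```

A rename loop would then call os.rename(f, apply_codes(f, mapping)) for each .wav file. It may be worth printing the old and new names first and checking them before renaming anything.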

Bulk Search/replacing of filenames using python

I have:
An Excel file with data in A1:B2.
A folder with 200 jpeg files.
I'm trying to search the filenames in the folder for the value in column A and, if found, replace it with the value in column B without changing the extensions of the files in the folder.
I'm stuck; I have tried various scripts to do this but failed. Here's my code:
import os
import xlrd
path = r'c:\users\c_thv\desktop\x.xls'
#collect the files in folder
path1 = r'c:\users\c_thv\desktop'
data = []
for name in os.listdir(path1):
    if os.path.isfile(os.path.join(path1, name)):
        fileName, fileExtension = os.path.splitext(name)
        if fileExtension == '.py':
            data.append(fileName)
#print data
#collect the filenames for changing
book = xlrd.open_workbook(path)
sheet = book.sheet_by_index(0)
cell = sheet.cell(0,0)
cells = sheet.row_slice(rowx=0,start_colx=0,end_colx=2)
excel = []
#collect the workable data in an list
for cell in cells:
    excel.append(cell)
#print excel
#compare list for matches
for i,j in enumerate(excel):
    if j in data[:]:
        os.rename(excel[i],data[i])

Try a print "Match found" after if j in data[:]: just to check if the condition is ever met. My guess is there will be no match, because the list data is full of Python filenames (if fileExtension == '.py') and you are looking for jpeg files in the excel list.
Besides, old is not defined.
EDIT:
If I understand correctly, this may help:
import os, xlrd
path = 'c:/users/c_thv/desktop' #path to jpg files
path1 = 'c:/users/c_thv/desktop/x.xls'
data =[] #list of jpg filenames in folder
#lets create a filenames list without the jpg extension
for name in os.listdir(path):
    fileName, fileExtension = os.path.splitext(name)
    if fileExtension =='.jpg':
        data.append(fileName)
#lets create a list of old filenames in the excel column a
book = xlrd.open_workbook(path1)
sheet = book.sheet_by_index(0)
oldNames =[]
for row in range(sheet.nrows):
    oldNames.append(sheet.cell_value(row,0))
#lets create a list with the new names in column b
newNames =[]
for row in range(sheet.nrows):
    newNames.append(sheet.cell_value(row,1))
#now create a dictionary with the old name in a and the corresponding new name in b
fileNames = dict(zip(oldNames,newNames))
print(fileNames)
#lastly rename your jpg files
for f in data:
    if f in fileNames:
        os.rename(path+'/'+f+'.jpg', path+'/'+fileNames[f]+'.jpg')

Loop through multiple csv files and write one column into new output csv

I have 251 CSV files in a folder. They are named "returned UDTs 1-12-13.csv", "returned UDTs 1-13-13.csv", and so on. The dates are not consecutive, however. For example, dates may be missing for holidays and weekends, so the next file may be "returned UDTs 1-17-13.csv". Each file has one column of data. I need to extract each column and append them into one column in one new output csv file. I want to write a python script to do so. In a dummy folder with 3 dummy csv files (csv1.csv, csv2.csv, and csv3.csv) I created the following script that works:
import csv, os, sys
out_csv = r"C:\OutCSV\csvtest.csv"
path = r"C:\CSV_test"
fout=open(out_csv,"a")
# first file:
for line in open(path + "\csv1.csv"):
    fout.write(line)
# now the rest:
for num in range(2,4):
    f = open(path + "\csv"+str(num)+".csv")
    f.next() # skip the header
    for line in f:
        fout.write(line)
    f.close() # dont know if needed
fout.close()
The issue is the date in the file name and how to deal with it. Any help would be appreciated.
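
One possible way to handle the dates is to stop generating the file names and instead list whatever files exist with glob, then sort them by the date parsed from each name. The sketch below is not a tested solution for the real data: the function names file_date and combine_by_date are made up, the M-D-YY date format is inferred from the example names, and, like the script above, it assumes every file starts with a header line.

```python
import glob
import os
import re
from datetime import datetime


def file_date(path):
    """Parse the M-D-YY date at the end of the file name, or return None."""
    match = re.search(r"(\d{1,2}-\d{1,2}-\d{2})\.csv$", os.path.basename(path))
    return datetime.strptime(match.group(1), "%m-%d-%y") if match else None


def combine_by_date(folder, out_csv, pattern="returned UDTs *.csv"):
    """Append all matching files, oldest first, keeping only the first header."""
    dated = [(file_date(p), p) for p in glob.glob(os.path.join(folder, pattern))]
    paths = [p for d, p in sorted((d, p) for d, p in dated if d is not None)]
    with open(out_csv, "w") as fout:
        for i, path in enumerate(paths):
            with open(path) as f:
                header = f.readline()  # assumption: one header line per file
                if i == 0:
                    fout.write(header.rstrip("\n") + "\n")
                for line in f:
                    fout.write(line.rstrip("\n") + "\n")
    return paths
```

Sorting by the parsed date matters because a plain alphabetical sort would put "12-1-12" after "1-17-13". Missing dates need no special handling, since only files that actually exist are listed.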
