I am trying to work with 12 different CSV files that are stored in datafolder. I have a function (openFile) that opens each file separately and performs a specific calculation on the data, and I then want to apply a further function to each file. The file names all look similar to this: UOG_001-AC_TOP-Accelerometer-2017-07-22T112654.csv. The code below shows how I was planning to read the files into the openFile function:
for file in os.listdir(DATA_PATH + 'datafolder/'):
    if file.endswith('.csv'):
        abs_path = os.path.abspath(DATA_PATH + 'datafolder/' + file)
        print(abs_path)
        data = openFile(abs_path)
        data2 = someFunction(data)
I need to merge specific files that have the same two letters in the file name; at the end I should have 6 files instead of 12. The files are not stored in datafolder in the order they need to be merged, since this will eventually be used for a much larger number of files. The files all have the same header.
Am I able to supply a list of the two-letter keywords from the file names to use in a regex? e.g.
list = ['AC', 'FO', 'CK', 'OR', 'RS', 'IK']
Any suggestions on how I can achieve this with or without regex?
You could walk through the file tree and then, based on the two letters in the file name, collect each pair of files that needs to be merged.
fileList = {'AC': [], 'FO': [], 'CK': [], 'OR': [], 'RS': [], 'IK': []}

for file in os.listdir(DATA_PATH + 'datafolder/'):
    if file.endswith('.csv'):  # Ensure we are looking at a csv
        # Add the file to its correct bucket based on the letters in its name
        fileList[extractTwoLettersFromFileName(file)].append(file)

for twoLetters, files in fileList.items():
    mergeFiles(files)
I did not provide implementations for extracting the letters or merging the files, but from your question you seem to already have those.
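In case it helps, here is a minimal sketch of what those two helpers might look like; both are assumptions on my part. The extraction assumes the two letters always sit between the first hyphen and the following underscore (as in UOG_001-AC_TOP-...), and the merge simply concatenates the CSVs, keeping the header from the first file only, since you say all files share the same header:

import csv
import re

def extractTwoLettersFromFileName(filename):
    # e.g. 'UOG_001-AC_TOP-Accelerometer-2017-07-22T112654.csv' -> 'AC'
    m = re.search(r'-([A-Z]{2})_', filename)
    return m.group(1) if m else None

def mergeFiles(files, out_path):
    # Concatenate the CSV files, writing the shared header only once
    with open(out_path, 'w', newline='') as out:
        writer = csv.writer(out)
        for i, path in enumerate(files):
            with open(path, newline='') as f:
                reader = csv.reader(f)
                header = next(reader)
                if i == 0:
                    writer.writerow(header)
                writer.writerows(reader)

You would then call it per bucket with a distinct output name, e.g. mergeFiles(files, twoLetters + '_merged.csv'), and pass full paths rather than the bare names returned by os.listdir.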
You can first do a simple substring check and then, based on that, classify the filenames into groups:
letters_list = ['AC', 'FO', 'CK', 'OR', 'RS', 'IK']
filename_dict = {}

for filename in filenames:  # filenames from e.g. os.listdir()
    for letters in letters_list:
        if letters in filename:
            filename_list = filename_dict.get(letters, list())
            filename_list.append(filename)
            filename_dict[letters] = filename_list
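For what it's worth, collections.defaultdict expresses the same grouping a little more compactly by creating the empty lists for you (assuming the same filenames iterable):

from collections import defaultdict

filename_dict = defaultdict(list)
for filename in filenames:
    for letters in letters_list:
        if letters in filename:
            filename_dict[letters].append(filename)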
Here I use the Path object from pathlib to create a list of files whose names end with '.csv'. Then I use a function that applies a regex to each file's name, looking for one or another of the strings you mentioned, so that I can build a list of pairs of these strings with their associated filenames. Notice that the length of this list of pairs is 12, and that the filenames can be recovered from it.
Having made that list, I can use groupby from itertools to create two-element lists of the files that share a string from the file_kinds list. (Note that groupby only groups consecutive items, so the list has to be sorted by that key first.) You can then merge the items in these lists.
>>> import re
>>> from itertools import groupby
>>> from operator import itemgetter
>>> from pathlib import Path
>>> file_kinds = ['AC', 'FO', 'CK', 'OR', 'RS', 'IK']
>>> def try_match(filename):
...     m = re.search('(%s)' % '|'.join(file_kinds), filename)
...     if m:
...         return m.group()
...     else:
...         return None
...
>>> all_files_list = [(item, try_match(item.name)) for item in list(Path(r'C:/scratch/datafolder').glob('*.csv')) if try_match(item.name)]
>>> len(all_files_list)
12
Expression for extracting full paths from all_files_list:
[str(_[0]) for _ in all_files_list]
>>> all_files_list.sort(key=itemgetter(1))  # groupby needs its input sorted by the key
>>> for kind, files_list in groupby(all_files_list, key=itemgetter(1)):
...     kind, [str(_[0]) for _ in list(files_list)]
...
('AC', ['C:\\scratch\\datafolder\\UOG_001-AC_A_TOP-Accelerometer-2017-07-22T112654.csv', 'C:\\scratch\\datafolder\\UOG_001-AC__B_TOP-Accelerometer-2017-07-22T112654.csv'])
('CK', ['C:\\scratch\\datafolder\\UOG_001-CK_A_TOP-Accelerometer-2017-07-22T112654.csv', 'C:\\scratch\\datafolder\\UOG_001-CK__B_TOP-Accelerometer-2017-07-22T112654.csv'])
('FO', ['C:\\scratch\\datafolder\\UOG_001-FO_A_TOP-Accelerometer-2017-07-22T112654.csv', 'C:\\scratch\\datafolder\\UOG_001-FO__B_TOP-Accelerometer-2017-07-22T112654.csv'])
('IK', ['C:\\scratch\\datafolder\\UOG_001-IK_A_TOP-Accelerometer-2017-07-22T112654.csv', 'C:\\scratch\\datafolder\\UOG_001-IK__B_TOP-Accelerometer-2017-07-22T112654.csv'])
('OR', ['C:\\scratch\\datafolder\\UOG_001-OR_A_TOP-Accelerometer-2017-07-22T112654.csv', 'C:\\scratch\\datafolder\\UOG_001-OR__B_TOP-Accelerometer-2017-07-22T112654.csv'])
('RS', ['C:\\scratch\\datafolder\\UOG_001-RS_A_TOP-Accelerometer-2017-07-22T112654.csv', 'C:\\scratch\\datafolder\\UOG_001-RS__B_TOP-Accelerometer-2017-07-22T112654.csv'])
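To actually merge each group, one possibility (an assumption on my part, since the question doesn't show what openFile does) is to concatenate each pair with pandas, given that the files share a header:

import pandas as pd

for kind, files_list in groupby(all_files_list, key=itemgetter(1)):
    paths = [str(item[0]) for item in files_list]
    merged = pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)
    merged.to_csv('merged_%s.csv' % kind, index=False)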
I am trying to find some missing files, but those files are in a pair.
As an example, we have files like:
file1_LEFT
file1_RIGHT
file2_LEFT
file2_RIGHT
file3_LEFT
file4_RIGHT
...
The idea is that the name is the same but each file has a LEFT/RIGHT pair. Normally we have thousands of files, but somewhere in there we'll find some files without a pair, e.g. file99_LEFT is present but the RIGHT one is missing (or vice versa for sides).
I'm trying to write a script in Python 2.7 (yes, I'm using an old Python for personal reasons... unfortunately) but I have no clue how this can be done.
Ideas tried:
- Verify them 2 by 2 and check if we have RIGHT in the current file and LEFT in the previous one; print "Ok", else print the file that's not matching. But after the first unpaired file is printed, all the following checks fail, because from that point on the LEFT and RIGHT files are no longer next to each other and the pairing is shifted.
- Create separate lists for LEFT and RIGHT and compare them, but again only the first mismatch is found and it won't work for the others.
Code I've used until now:
import os
import fnmatch, re

path = raw_input('Enter files path:')
for path, dirname, filenames in os.walk(path):
    for fis in filenames:
        print fis
    print len(filenames)

for i in range(1, len(filenames), 2):
    print filenames[i]
    if "RIGHT" in filenames[i] and "LEFT" in filenames[i-1]:
        print "Ok"
    else:
        print "file >" + filenames[i] + "< has no pair"
        f = open(r"D:\rec.txt", "a")
        f.writelines(filenames[i] + "\n")
        f.close()
Thanks for your time!
We can use glob to list the files in a given path, filtered by a search pattern.
If we consider one set of all LEFT filenames, and another set of all RIGHT filenames, can we say you are looking for the elements not in the intersection of these two sets?
That is called the "symmetric difference" of those two sets.
import glob

# Get a list of all _LEFT filenames (excluding the _LEFT part of the name),
# e.g. ['file1', 'file2', ... ]. Ditto for the _RIGHT filenames.
# Note: glob.glob() will look in the current directory where this script is running.
left_list = [x.replace('_LEFT', '') for x in glob.glob('*_LEFT')]
right_list = [x.replace('_RIGHT', '') for x in glob.glob('*_RIGHT')]

# Print the symmetric difference between the two lists
symmetric_difference = list(set(left_list) ^ set(right_list))
print symmetric_difference

# If you'd like to save the names of missing pairs to file
with open('rec.txt', 'w') as f:
    for pairname in symmetric_difference:
        print >> f, pairname

# If you'd like to print which file (LEFT or RIGHT) is missing its pair
for filename in symmetric_difference:
    if filename in left_list:
        print "file >" + filename + "_LEFT< has no pair"
    if filename in right_list:
        print "file >" + filename + "_RIGHT< has no pair"
I just started with Python and I am still a newbie. I want to create a function that grabs the parts of filenames corresponding to a certain pattern; these files are stored in an S3 bucket.
So in my case, let's say I have 5 .txt files
Transfarm_DAT_005995_20190911_0300.txt
Transfarm_SupplierDivision_058346_20190911_0234.txt
Transfarm_SupplierDivision_058346_20200702_0245.txt
Transfarm_SupplierDivision_058346_20200703_0242.txt
Transfarm_SupplierDivision_058346_20200704_0241.txt
I want the script to go through these filenames and grab the category (e.g. "Transfarm_DAT") and the date (e.g. "20190911") that appear before the filename extension.
Can you point me in the direction to which Python modules and possibly guides that could assist me?
Check out the split and join functions if your filenames are always like this. Otherwise, regex is another avenue.
files_list = ['Transfarm_DAT_005995_20190911_0300.txt', 'Transfarm_SupplierDivision_058346_20190911_0234.txt',
              'Transfarm_SupplierDivision_058346_20200702_0245.txt', 'Transfarm_SupplierDivision_058346_20200703_0242.txt',
              'Transfarm_SupplierDivision_058346_20200704_0241.txt']

category_list = []
date_list = []
for f in files_list:
    date = f.split('.')[0].split('_', 2)[2]
    category = '_'.join([f.split('.')[0].split('_')[0], f.split('.')[0].split('_')[1]])
    # print(category, date)
    category_list.append(category)
    date_list.append(date)

print(category_list, date_list)
Output lists:
['Transfarm_DAT', 'Transfarm_SupplierDivision', 'Transfarm_SupplierDivision', 'Transfarm_SupplierDivision', 'Transfarm_SupplierDivision'] ['005995_20190911_0300', '058346_20190911_0234', '058346_20200702_0245', '058346_20200703_0242', '058346_20200704_0241']
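Since regex was mentioned as the other avenue, here is a minimal sketch of that approach. It assumes the names always follow the pattern category_id_date_time.txt with a two-word category, which holds for the five names above but is an assumption beyond them:

import re

# group 1: the two-word category; group 2: the 8-digit date before the time
pattern = re.compile(r'^([A-Za-z]+_[A-Za-z]+)_\d+_(\d{8})_\d+\.txt$')

for f in files_list:
    m = pattern.match(f)
    if m:
        category, date = m.groups()
        print(category, date)  # e.g. Transfarm_DAT 20190911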
I am trying to build a tool that can convert .csv files into .yaml files for further use. I found a handy bit of code that does the job nicely from the link below:
Convert CSV to YAML, with Unicode?
which states that the line will take the dict created by opening a .csv file and dump it to a .yaml file:
out_file.write(ry.safe_dump(dict_example, allow_unicode=True))
However, one small kink I have noticed is that when it is run once, the generated .yaml file is typically incomplete by a line or two. In order to have the .csv file exhaustively read through to create a complete .yaml file, the code must be run two or even three times. Does anybody know why this could be?
UPDATE
Per request, here is the code I use to parse my .csv file, which is two columns long (with a string in the first column and a list of two strings in the second column) and will typically be 50 rows long (or maybe more). Also note that it is designed to remove any '\n' or spaces that could potentially cause problems later on in the code.
csv_contents = {}

with open("example1.csv", "rU") as csvfile:
    green = csv.reader(csvfile, dialect='excel')
    for line in green:
        candidate_number = line[0]
        first_sequence = line[1].replace(' ', '').replace('\r', '').replace('\n', '')
        second_sequence = line[2].replace(' ', '').replace('\r', '').replace('\n', '')
        csv_contents[candidate_number] = [first_sequence, second_sequence]

csv_contents.pop('Header name', None)
Ultimately, it is not that important that I maintain the order of the rows from the original dict, just that all the information within the rows is properly structured.
I am not sure what the cause could be, but you might be running out of memory, since you create the YAML document in memory first and then write it out. It is much better to stream it out directly.
You should also note that the code in the question you link to doesn't preserve the order of the original columns, something easily circumvented by using round_trip_dump instead of safe_dump.
You probably want to make a top-level sequence (list) as in the desired output of the linked question, with each element being a mapping (dict).
The following parses the CSV, taking the first line as keys for mappings created for each following line:
import sys
import csv
import ruamel.yaml as ry
import dateutil.parser  # pip install python-dateutil

def process_line(line):
    """convert lines, trying int, float, date"""
    ret_val = []
    for elem in line:
        try:
            res = int(elem)
            ret_val.append(res)
            continue
        except ValueError:
            pass
        try:
            res = float(elem)
            ret_val.append(res)
            continue
        except ValueError:
            pass
        try:
            res = dateutil.parser.parse(elem)
            ret_val.append(res)
            continue
        except ValueError:
            pass
        ret_val.append(elem.strip())
    return ret_val

csv_file_name = 'xyz.csv'
data = []
header = None
with open(csv_file_name) as inf:
    for line in csv.reader(inf):
        d = process_line(line)
        if header is None:
            header = d
            continue
        data.append(ry.comments.CommentedMap(zip(header, d)))

ry.round_trip_dump(data, sys.stdout, allow_unicode=True)
with input xyz.csv:
id, title_english, title_russian
1, A Title in English, Название на русском
2, Another Title, Другой Название
this generates:
- id: 1
  title_english: A Title in English
  title_russian: Название на русском
- id: 2
  title_english: Another Title
  title_russian: Другой Название
The process_line function is just some sugar that tries to convert the strings in the CSV file to more useful types, and to strings without leading spaces (resulting in far fewer quotes in your output YAML file).
I have tested the above on files with 1000 rows, without any problems (I won't post the output though).
The above was done using Python 3 as well as Python 2.7, starting with a UTF-8 encoded file xyz.csv. If you are using Python 2, you can try unicodecsv if you need to handle Unicode input and things don't work out as well as they did for me.
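For completeness, a minimal sketch of that Python 2 fallback, assuming the unicodecsv package (which mirrors the csv module's interface and accepts an encoding argument):

import unicodecsv  # pip install unicodecsv

with open('xyz.csv', 'rb') as inf:  # binary mode under Python 2
    for line in unicodecsv.reader(inf, encoding='utf-8'):
        print line  # each element is already a unicode object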
I'm writing a bot in Python using tweepy for Python 2.7. I'm stumped on how to approach what I am looking to do. Currently the bot finds the tweet id and appends it to a text file. On later runs I want to use regex to search that file for a match and only write if there is no match within the text file. The intent is to avoid adding duplicate tweet ids to the text file, which could contain a large number of ids, each followed by a newline.
Any help is appreciated!
/edit: when I try the code below, the IDE says match can't be seen, and I get a syntax error as a result.
import re, codecs, tweepy

qName = Queue.txt
tweets = api.search(q=searchQuery, count=tweet_count, result_type="recent")
with codecs.open(qName, 'a', encoding='utf-8') as f:
    for tweet in tweets:
        tweetId = tweet.id_str
        match = re.findall(tweedId), qName)
        # if match = false then do write, else discard and move on
        f.write(tweetId + '\n')
If I understand you correctly, you need not bother with regex etc.; let the special containers do the work for you. I would proceed with a non-duplicate container like a dictionary or a set: read all the data from the file into a dictionary or set, extend it with the new ids, and finally write the dictionary or set back to the file.
e.g.
>>>data = set()
>>>for i in list('asddddddddddddfgggggg'):
data.add(i)
>>>data
>>>set(['a', 's', 'd', 'g', 'f']) ## see one d and g
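Applied to your tweet ids, a minimal sketch of that idea might look like this (Queue.txt and the tweets variable are taken from your question; the rest is an assumption):

import codecs

qName = 'Queue.txt'

# Read the ids we have already seen into a set
try:
    with codecs.open(qName, 'r', encoding='utf-8') as f:
        seen_ids = set(line.strip() for line in f)
except IOError:  # first run: the file does not exist yet
    seen_ids = set()

# Append only ids we have not seen before
with codecs.open(qName, 'a', encoding='utf-8') as f:
    for tweet in tweets:
        tweetId = tweet.id_str
        if tweetId not in seen_ids:
            f.write(tweetId + '\n')
            seen_ids.add(tweetId)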
I was (unsuccessfully) trying to figure out how to create a list of compound letters using loops. I am a beginner programmer and have been learning Python for a few months. Fortunately, I later found a solution to this problem - Genearte a list of strings compound of letters from other list in Python - see the first answer.
So I took that code and added a little to it for my needs: I randomized the list and turned it into a comma-separated file. This is the code:
from string import ascii_lowercase as al
from itertools import product
import random
list = ["".join(p) for i in xrange(1,6) for p in product(al, repeat = i)]
random.shuffle(list)
joined = ",".join(list)
f = open("double_letter_generator_output.txt", 'w')
print >> f, joined
f.close()
What I need to do now is split that massive file "double_letter_generator_output.txt" into smaller files. Each file needs to consist of 200 'words'. So it will need to split into many files. The files of course do not exist yet and will need to be created by the program also. How can I do that?
Here's how I would do it, but I'm not sure why you're splitting this into smaller files. I would normally do it all at once, but I'm assuming the file is too big to be stored in working memory, so I'm traversing it one character at a time.
Let bigfile.txt contain
1,2,3,4,5,6,7,8,9,10,11,12,13,14
MAX_NUM_ELEMS = 2  # you'll want this to be 200
nameCounter = 1
numElemsCounter = 0

with open('bigfile.txt', 'r') as bigfile:
    outputFile = open('output' + str(nameCounter) + '.txt', 'a')
    for letter in bigfile.read():
        if letter == ',':
            numElemsCounter += 1
        if numElemsCounter == MAX_NUM_ELEMS:
            numElemsCounter = 0
            outputFile.close()
            nameCounter += 1
            outputFile = open('output' + str(nameCounter) + '.txt', 'a')
        else:
            outputFile.write(letter)
    outputFile.close()
now output1.txt is 1,2, output2.txt is 3,4, output3.txt is 5,6, etc.
$ cat output7.txt
13,14
This is a little sloppy; you should wrap it in a nice function and format it the way you like!
FYI, if you want to write to a bunch of different files, there's no reason to write to one big file first. Write to the little files right off the bat.
Either way, the last file might have fewer than MAX_NUM_ELEMS elements.
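Following that advice, here is a minimal sketch that writes the small files directly, with no intermediate big file; it reuses the shuffled list from your question and keeps its Python 2 style (the chunk size and output names are assumptions):

from string import ascii_lowercase as al
from itertools import product
import random

CHUNK_SIZE = 200

words = ["".join(p) for i in xrange(1, 6) for p in product(al, repeat=i)]
random.shuffle(words)

# Write each chunk of CHUNK_SIZE words to its own comma-separated file
for n in xrange(0, len(words), CHUNK_SIZE):
    chunk = words[n:n + CHUNK_SIZE]
    f = open("output%d.txt" % (n // CHUNK_SIZE + 1), 'w')
    print >> f, ",".join(chunk)
    f.close()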