Python: find missing files from pairs - list

I am trying to find some missing files, but those files come in pairs.
For example, we have files like:
file1_LEFT
file1_RIGHT
file2_LEFT
file2_RIGHT
file3_LEFT
file4_RIGHT
...
The idea is that the name is the same, but each file has a LEFT/RIGHT counterpart. Normally we have thousands of files, but somewhere in there we'll find some files without a pair, e.g. file99_LEFT is present but file99_RIGHT is missing (or vice versa).
I'm trying to write a script in Python 2.7 (yes, I'm using an old Python for personal reasons... unfortunately), but I have no clue how this can be done.
Ideas I've tried:
-verify them two by two: check whether the current file contains RIGHT and the previous one contains LEFT; if so, print "Ok", otherwise print the file that doesn't match. But after the first unpaired file is printed, all the following checks fail, because from that point on the LEFT and RIGHT files are no longer next to each other.
-create separate lists for LEFT and RIGHT files and compare them, but again this finds the first mismatch and then fails for the rest.
The code I've used so far:
import os
import fnmatch, re

path = raw_input('Enter files path:')
for path, dirname, filenames in os.walk(path):
    for fis in filenames:
        print fis
    print len(filenames)
    for i in range(1, len(filenames), 2):
        print filenames[i]
        if "RIGHT" in filenames[i] and "LEFT" in filenames[i - 1]:
            print "Ok"
        else:
            print "file >" + filenames[i] + "< has no pair"
            f = open(r"D:\rec.txt", "a")
            f.writelines(filenames[i] + "\n")
            f.close()
Thanks for your time!

We can use glob to list the files in a given path, filtered by a search pattern.
If we consider one set of all LEFT filenames, and another set of all RIGHT filenames, can we say you are looking for the elements not in the intersection of these two sets?
That is called the "symmetric difference" of those two sets.
import glob
# Get a list of all _LEFT filenames (excluding the _LEFT part of the name)
# Eg: ['file1', 'file2' ... ].
# Ditto for the _RIGHT filenames
# Note: glob.glob() will look in the current directory where this script is running.
left_list = [x.replace('_LEFT', '') for x in glob.glob('*_LEFT')]
right_list = [x.replace('_RIGHT', '') for x in glob.glob('*_RIGHT')]
# Print the symmetric difference between the two lists
symmetric_difference = list(set(left_list) ^ set(right_list))
print symmetric_difference
# If you'd like to save the names of missing pairs to file
with open('rec.txt', 'w') as f:
    for pairname in symmetric_difference:
        print >> f, pairname
# If you'd like to print which file (LEFT or RIGHT) is missing a pair
for filename in symmetric_difference:
    if filename in left_list:
        print "file >" + filename + "_LEFT< has no pair"
    if filename in right_list:
        print "file >" + filename + "_RIGHT< has no pair"
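Since the question uses os.walk (glob alone only looks at a single directory), here is a minimal sketch of the same set-based idea applied to a whole directory tree. It assumes the suffixes are exactly _LEFT/_RIGHT at the end of each filename, and it runs unchanged on Python 2.7 and 3:

```python
import os

def find_unpaired(root):
    """Walk root and return {base_name: suffix_present} for files
    that have a _LEFT or a _RIGHT but not both."""
    lefts, rights = set(), set()
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith('_LEFT'):
                lefts.add(name[:-len('_LEFT')])
            elif name.endswith('_RIGHT'):
                rights.add(name[:-len('_RIGHT')])
    # Symmetric difference: base names present on one side only
    return dict((base, '_LEFT' if base in lefts else '_RIGHT')
                for base in lefts ^ rights)
```

The suffix that IS present tells you which side survived, so you can print the opposite one as missing.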


Automate process for merging CSV files in Python

I am trying to work with 12 different csv files, that are stored in datafolder. I have a function that opens each file separately (openFile) and performs a specific calculation on the data. I then want to be able to apply a function to each file. The names of the files are all similar to this: UOG_001-AC_TOP-Accelerometer-2017-07-22T112654.csv . The code below shows how I was planning to read the files into the openFile function:
for file in os.listdir(DATA_PATH + 'datafolder/'):
    if file.endswith('.csv'):
        abs_path = os.path.abspath(DATA_PATH + 'datafolder/' + file)
        print(abs_path)
        data = openFile(abs_path)
        data2 = someFunction(data)
I need to merge specific files that share the same two letters in the file name; at the end I should have 6 files instead of 12. The files in datafolder are not stored in the order they need to be merged in, and this will eventually be used for a larger number of files. The files all have the same header.
Am I able to supply a list of the two letters that are the key words in the file to use in regex? e.g.
list = ['AC', 'FO', 'CK', 'OR', 'RS', 'IK']
Any suggestions on how I can achieve this with or without regex?
You could walk through the file tree and then, based on the first two letters of the file name, save each group of files that needs to be merged.
fileList = {'AC': [], 'FO': [], 'CK': [], 'OR': [], 'RS': [], 'IK': []}
for file in os.listdir(DATA_PATH + 'datafolder/'):
    if file.endswith('.csv'):  # Ensure we are looking at a csv
        # Add the file to its correct bucket based off of the letters in its name
        fileList[extractTwoLettersFromFileName(file)].append(file)

for twoLetters, files in fileList.items():
    mergeFiles(files)
I did not provide implementations for extracting the letters and merging the files, but from your question you seem to already have those.
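For completeness, here is a small sketch of what that could look like end to end. extract_two_letters is a hypothetical helper that assumes the two-letter code always follows the first '-' in names like UOG_001-AC_TOP-...:

```python
def extract_two_letters(filename):
    # Hypothetical helper: assumes the code is the two characters
    # right after the first '-', e.g. 'UOG_001-AC_TOP-....csv' -> 'AC'
    return filename.split('-')[1][:2]

def bucket_files(filenames, codes):
    """Group filenames into one bucket per two-letter code."""
    buckets = dict((code, []) for code in codes)
    for name in filenames:
        code = extract_two_letters(name)
        if code in buckets:
            buckets[code].append(name)
    return buckets
```

Each bucket then holds the files to pass to your merge function.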
You can first do a simple substring check, and then, based on that, classify the filenames into groups:
letters_list = ['AC', 'FO', 'CK', 'OR', 'RS', 'IK']
for letters in letters_list:
    if letters in filename:
        filename_list = filename_dict.get(letters, list())
        filename_list.append(filename)
        filename_dict[letters] = filename_list
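The .get()/re-assign bookkeeping can also be folded away with collections.defaultdict. A small sketch of the same grouping (the break assumes each filename matches exactly one code):

```python
from collections import defaultdict

def group_by_code(filenames, letters_list):
    filename_dict = defaultdict(list)  # missing keys start as an empty list
    for filename in filenames:
        for letters in letters_list:
            if letters in filename:
                filename_dict[letters].append(filename)
                break  # assume each file matches exactly one code
    return dict(filename_dict)
```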
Here I use the Path object from pathlib to create a list of files whose names end with '.csv'. Then I use a function that examines each file's name, with a regex, for the presence of one of the strings you mentioned, so that I can build a list of pairs of these strings with their associated filenames. Notice that the length of this list of pairs is 12, and that the filenames can be recovered from it.
Having made that list, I can use groupby from itertools to create two-element lists of the files that share each string in the file_kinds list. You can then merge the items in these lists.
>>> import re
>>> from pathlib import Path
>>> from itertools import groupby
>>> from operator import itemgetter
>>> file_kinds = ['AC', 'FO', 'CK', 'OR', 'RS', 'IK']
>>> def try_match(filename):
...     m = re.search('(%s)' % '|'.join(file_kinds), filename)
...     if m:
...         return m.group()
...     else:
...         return None
...
>>> all_files_list = [(item, try_match(item.name)) for item in list(Path(r'C:/scratch/datafolder').glob('*.csv')) if try_match(item.name)]
>>> len(all_files_list)
12
Expression for extracting full paths from all_files_list:
[str(_[0]) for _ in all_files_list]
>>> for kind, files_list in groupby(all_files_list, key=itemgetter(1)):
...     kind, [str(_[0]) for _ in list(files_list)]
...
('AC', ['C:\\scratch\\datafolder\\UOG_001-AC_A_TOP-Accelerometer-2017-07-22T112654.csv', 'C:\\scratch\\datafolder\\UOG_001-AC__B_TOP-Accelerometer-2017-07-22T112654.csv'])
('CK', ['C:\\scratch\\datafolder\\UOG_001-CK_A_TOP-Accelerometer-2017-07-22T112654.csv', 'C:\\scratch\\datafolder\\UOG_001-CK__B_TOP-Accelerometer-2017-07-22T112654.csv'])
('FO', ['C:\\scratch\\datafolder\\UOG_001-FO_A_TOP-Accelerometer-2017-07-22T112654.csv', 'C:\\scratch\\datafolder\\UOG_001-FO__B_TOP-Accelerometer-2017-07-22T112654.csv'])
('IK', ['C:\\scratch\\datafolder\\UOG_001-IK_A_TOP-Accelerometer-2017-07-22T112654.csv', 'C:\\scratch\\datafolder\\UOG_001-IK__B_TOP-Accelerometer-2017-07-22T112654.csv'])
('OR', ['C:\\scratch\\datafolder\\UOG_001-OR_A_TOP-Accelerometer-2017-07-22T112654.csv', 'C:\\scratch\\datafolder\\UOG_001-OR__B_TOP-Accelerometer-2017-07-22T112654.csv'])
('RS', ['C:\\scratch\\datafolder\\UOG_001-RS_A_TOP-Accelerometer-2017-07-22T112654.csv', 'C:\\scratch\\datafolder\\UOG_001-RS__B_TOP-Accelerometer-2017-07-22T112654.csv'])
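One caveat worth noting: itertools.groupby only groups consecutive items, so the above relies on glob returning the kinds already clustered. Sorting by the key first makes the grouping robust regardless of directory order; a small sketch with made-up filenames:

```python
from itertools import groupby
from operator import itemgetter

pairs = [('f1.csv', 'AC'), ('f2.csv', 'FO'), ('f3.csv', 'AC')]
pairs.sort(key=itemgetter(1))  # cluster equal kinds before groupby
grouped = dict((kind, [name for name, _ in items])
               for kind, items in groupby(pairs, key=itemgetter(1)))
```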

Hello, I have code that prints what I need in Python, but I'd like it to write that result to a new file

The file looks like a series of lines with IDs:
aaaa
aass
asdd
adfg
aaaa
I'd like to get, in a new file, each ID and its number of occurrences in the old file, in the form:
aaaa 2
asdd 1
aass 1
adfg 1
With the two elements separated by a tab.
The code I have prints what I want, but doesn't write to a new file:
with open("Only1ID.txt", "r") as file:
    file = [item.lower().replace("\n", "") for item in file.readlines()]
    for item in sorted(set(file)):
        print item.title(), file.count(item)
As you use Python 2, the simplest approach to convert your console output to file output is by using the print chevron (>>) syntax which redirects the output to any file-like object:
with open("filename", "w") as f:  # open a file in write mode
    print >> f, "some data"       # print 'into the file'
Your code could look like this after simply adding another open to open the output file and adding the chevron to your print statement:
with open("Only1ID.txt", "r") as file, open("output.txt", "w") as out_file:
    file = [item.lower().replace("\n", "") for item in file.readlines()]
    for item in sorted(set(file)):
        print >> out_file, item.title(), file.count(item)
However, your code has a few other more or less bad things which one should not do or could improve:
Do not use the same variable name file for both the file object returned by open and your processed list of strings. This is confusing, just use two different names.
You can directly iterate over the file object, which works like a generator that returns the file's lines as strings. Generators process requests for the next element just in time, that means it does not first load the whole file into your memory like file.readlines() and processes them afterwards, but only reads and stores one line at a time, whenever the next line is needed. That way you improve the code's performance and resource efficiency.
If you write a list comprehension, but you don't need its result necessarily as list because you simply want to iterate over it using a for loop, it's more efficient to use a generator expression (same effect as the file object's line generator described above). The only syntactical difference between a list comprehension and a generator expression are the brackets. Replace [...] with (...) and you have a generator. The only downside of a generator is that you neither can find out its length, nor can you access items directly using an index. As you don't need any of these features, the generator is fine here.
There is a simpler way to remove trailing newline characters from a line: line.rstrip() removes all trailing whitespaces. If you want to keep e.g. spaces, but only want the newline to be removed, pass that character as argument: line.rstrip("\n").
However, it could possibly be even easier and faster to just not add another implicit line break during the print call instead of removing it first to have it re-added later. You would suppress the line break of print in Python 2 by simply adding a comma at the end of the statement:
print >> out_file, item.title(), file.count(item),
There is a type Counter to count occurrences of elements in a collection, which is faster and easier than writing it yourself, because you don't need the additional count() call for every element. The Counter behaves mostly like a dictionary with your items as keys and their count as values. Simply import it from the collections module and use it like this:
from collections import Counter
c = Counter(lines)
for item in c:
    print item, c[item]
With all those suggestions (except the one not to remove the line breaks) applied and the variables renamed to something more clear, the optimized code looks like this:
from collections import Counter

with open("Only1ID.txt") as in_file, open("output.txt", "w") as out_file:
    counter = Counter(line.lower().rstrip("\n") for line in in_file)
    for item in sorted(counter):
        print >> out_file, item.title(), counter[item]

Facing issue with for loop

I am trying to get this function to read an input file and output the lines from the input file into a new file. PyCharm keeps saying 'item' is not being used, or that it was already used in the first for loop. I don't see why 'item' is a problem. It also won't create the new file.
input_list = 'persist_output_input_file_test.txt'

def persist_output(input_list):
    input_file = open(input_list, 'rb')
    lines = input_file.readlines()
    input_file.close()
    for item in input_list:
        write_new_file = open('output_word.txt', 'wb')
        for item in lines:
            print>>input_list, item
        write_new_file.close()
You have a few things going wrong in your program.
input_list seems to be a string denoting the name of a file. Currently you are iterating over the characters in the string with for item in input_list.
You shadow the already created variable item in your second for loop. I recommend you change that.
In Python, depending on which version you use, the correct syntax for printing to the screen is print text (Python 2) or print(text) (Python 3), unlike C++'s std::cout << text << endl;. In ordinary expressions, << and >> are bitwise operators in Python that shift bits to the left or right.
There are a few issues in your implementation. Refer to the following code for what you intend to do:
def persist_output(input_list):
    input_file = open(input_list, 'rb')
    lines = input_file.readlines()
    input_file.close()
    write_new_file = open('output_word.txt', 'wb')
    for item in lines:
        print item
        write_new_file.write(item)
    write_new_file.close()
The issues with your earlier implementation are as follows:
In the first loop you are iterating over the characters of the input file name. If you intend input_list to be a list of input files to be read, then you will also have to open each of them.
You are opening the output file inside a loop, so only the last write operation will be successful. You would have to move the file-opening operation outside the loop (ref: above code snippet) or change the mode to append. This can be done as follows:
write_new_file = open('output_word.txt', 'a')
There is a syntax error in the way you are using the print command.
f = open('yourfilename', 'r').read()
f1 = f.split('\n')
p = open('outputfilename', 'w')
for i in range(len(f1)):
    p.write(str(f1[i]) + '\n')
p.close()
Hope this helps.

Length of Python dictionary created doesn't match length from input file

I'm currently trying to create a dictionary from the following input file:
1776344_at 1779734_at 0.755332745 1.009570769 -0.497209846
1776344_at 1771911_at 0.931592828 0.830039019 2.28101445
1776344_at 1777458_at 0.746306282 0.753624146 3.709120716
...
...
There are a total of 12552 lines in this file.
What I wanted to do is to create a dictionary where the first 2 columns are the keys and the rest are the values. This I've successfully done and it looks something like this:
1770449_s_at;1777263_at:0.825723773;1.188969175;-2.858979578
1772892_at;1772051_at:-0.743866602;-1.303847456;26.41464414
1777227_at;1779218_s_at:0.819554413;0.677758609;4.51390617
But here's THE THING: I ran my Python script from the ms-dos cmd, and the generated output not only doesn't keep the same order as the input file (i.e. the 1st line of output is the 34th line of input), but the whole file only has 739 lines.
Can someone enlighten me on what's going on? Is it something to do with memory? Because the last time I checked I still had 305GB of disk space.
The script I wrote is as follow:
import sys
import os

input_file = sys.argv[1]
infile = open(input_file, 'r')
model_dict = {}
for line in infile:
    key = ';'.join(line.split('\t')[0:2]).rstrip(os.linesep)
    value = ';'.join(line.split('\t')[2:]).rstrip(os.linesep)
    print 'keys are:', key, '\n', 'values are:', value
    model_dict[key] = value
print model_dict

outfile = open('model_dict', 'w')
for key, value in model_dict.items():
    print key, value
    outfile.write('%s:%s\n' % (key, value))
outfile.close()
Based on the information given, and since each dictionary key is unique, I suspect your input file contains lines that generate the same key. In that case the dictionary only holds the last value associated with each key.
Python dictionaries are unordered collections of key: value pairs, so when you print their elements to the output file, don't expect the order to be preserved.
Another problem I see in your script is the loop that writes the output file: it shouldn't be inside the loop that reads from the input file.
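To confirm the duplicate-key suspicion, you can count how many times each (column 1, column 2) pair occurs before building the dictionary. A sketch assuming tab-separated columns as in the question:

```python
from collections import Counter

def duplicate_keys(lines):
    """Return {(col1, col2): count} for keys occurring more than once;
    these are the lines a plain dict silently overwrites."""
    counts = Counter()
    for line in lines:
        cols = line.rstrip('\n').split('\t')
        if len(cols) >= 2:
            counts[(cols[0], cols[1])] += 1
    return dict((k, n) for k, n in counts.items() if n > 1)
```

If the counts here add up to the 12552 - 739 missing lines, duplicate keys are the whole story.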

Python 2.7 - Split comma separated text file into smaller text files

I was (unsuccessfully) trying to figure out how to create a list of compound letters using loops. I am a beginner programmer and have been learning Python for a few months. Fortunately, I later found a solution to this problem - Generate a list of strings compound of letters from other list in Python - see the first answer.
So I took that code and added a little to it for my needs. I randomized the list, turned the list into a comma separated file. This is the code:
from string import ascii_lowercase as al
from itertools import product
import random
list = ["".join(p) for i in xrange(1,6) for p in product(al, repeat = i)]
random.shuffle(list)
joined = ",".join(list)
f = open("double_letter_generator_output.txt", 'w')
print >> f, joined
f.close()
What I need to do now is split that massive file "double_letter_generator_output.txt" into smaller files. Each file needs to consist of 200 'words'. So it will need to split into many files. The files of course do not exist yet and will need to be created by the program also. How can I do that?
Here's how I would do it, though I'm not sure why you're splitting this into smaller files. I would normally do it all at once, but I'm assuming the file is too big to fit in working memory, so I'm traversing it one character at a time.
Let bigfile.txt contain
1,2,3,4,5,6,7,8,9,10,11,12,13,14
MAX_NUM_ELEMS = 2  # you'll want this to be 200
nameCounter = 1
numElemsCounter = 0
with open('bigfile.txt', 'r') as bigfile:
    outputFile = open('output' + str(nameCounter) + '.txt', 'a')
    for letter in bigfile.read():
        if letter == ',':
            numElemsCounter += 1
            if numElemsCounter == MAX_NUM_ELEMS:
                numElemsCounter = 0
                outputFile.close()
                nameCounter += 1
                outputFile = open('output' + str(nameCounter) + '.txt', 'a')
            else:
                outputFile.write(letter)
        else:
            outputFile.write(letter)
    outputFile.close()
Now output1.txt is 1,2, output2.txt is 3,4, output3.txt is 5,6, etc.
$ cat output7.txt
13,14
This is a little sloppy; you should wrap it in a nice function and format it the way you like!
FYI, if you want to write to a bunch of different files, there's no reason to write to one big file first. Write to the little files right off the bat.
This way, the last file might have fewer than MAX_NUM_ELEMS elements.
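Following that last suggestion, here is a minimal sketch that slices the shuffled word list straight into comma-separated files of 200 words each, skipping the intermediate big file (the output name pattern is an assumption):

```python
def write_chunks(words, chunk_size, name_pattern='output{0}.txt'):
    """Write each slice of chunk_size words to its own comma-separated
    file and return the list of file names created."""
    created = []
    for i in range(0, len(words), chunk_size):
        fname = name_pattern.format(i // chunk_size + 1)
        with open(fname, 'w') as f:
            f.write(','.join(words[i:i + chunk_size]))
        created.append(fname)
    return created
```

The last file simply holds whatever is left over, so it may contain fewer than chunk_size words.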