I'm working with two large files, approximately 100K+ rows each. I want to search CSV file #1 for a string contained in CSV file #2, then join another string from file #1 to the matching row in file #2. Here's an example of the data I'm working with and my expected output:
File#1: the string to be matched in file#2 is the 2nd field; the 1st field is to be appended to each matched row in file#2.
row 1:
3604430123,mta0000cadd503c.mta.net
row 2:
3604434567,mta0000CADD5638.MTA.NET
row 3:
3606304758,mta00069234e9a51.DT.COM
File#2:
row 1:
4246,211-015617,mta0000cadd503c.mta.net,old,NW MG2,BBand2 ESA,Active
row 2:
7251,ACCOUNT,mta0000CADD5638.MTA.NET,FQDN ,NW MG2,BBand2 ESA,Active
row 3:
536887946,874-22558501,mta00069234e9a51.DT.COM,"P",NW MG2,BBand2 ESA,Active
Desired output, joining the integer string from file#1 to the entire matching row in file#2, based on the string match between file#1 and file#2:
row 1:
4246,211-015617,mta0000cadd503c.mta.net,old,NW MG2,BBand2 ESA,Active,3604430123
row 2:
7251,ACCOUNT,mta0000CADD5638.MTA.NET,FQDN ,NW MG2,BBand2 ESA,Active,3604434567
row 3:
536887946,874-22558501,mta00069234e9a51.DT.COM,"P",NW MG2,BBand2 ESA,Active,3606304758
There are many instances where the case of the match string in file#1 doesn't match the case in file#2 even though the characters match, so case should be ignored for the match criteria. The character case does need to be preserved in file#2 after the integer string from file#1 is appended.
I'm a Python newb, I've been at this for a while, and I've scoured posts on SE, but I can't seem to come up with working code that gets me to the point where I can just print out a line from file#2 that has been matched on the string in file#1. I've tried a few other methods, such as writing to a dictionary and using DictReader, but haven't been able to clear what appear to be simple errors in those methods, so I stripped this down to simple lists, hoping to get to the point where I can use a list comprehension to combine the data and then write that back to a CSV file. Any help or suggestions would be greatly appreciated.
import csv

sg = []
fqdn = []
output = []

with open(r'file2.csv', 'rb') as src:
    read = csv.reader(src, delimiter=',')
    for row in read:
        sg.append(row)

with open(r'file1.csv', 'rb') as src1:
    read1 = csv.reader(src1, delimiter=',')
    for row in read1:
        fqdn.append(row)

output = output.append([s[0] for s in sg if fqdn[1] in sg])
print output
Result after running this is:
None
Process finished with exit code 0
You should use a dictionary for file#1 rather than just a list, as matching is easier. Just turn fqdn into a dict, and in your loop reading file#1, set key-value pairs on the dict. I would call .lower() on the match key; this stores the key in lower case, so later you only have to check whether the lower-cased version of the field in file#2 is a key in the dictionary:
import csv

sg = []
fqdn = {}
output = []

with open(r'file2.csv', 'rb') as src:
    read = csv.reader(src, delimiter=',')
    for dataset in read:
        sg.append(dataset)

with open(r'file1.csv', 'rb') as src1:
    read1 = csv.reader(src1, delimiter=',')
    for to_append, to_match in read1:
        fqdn[to_match.lower()] = to_append

for dataset in sg:
    to_append = fqdn.get(dataset[2].lower())  # if the key matched, this is the string to append; else None
    if to_append:
        dataset.append(to_append)  # append the field
        output.append(dataset)     # append the row to the result list

print(output)
You can then use csv.writer to create a csv file from the result.
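For example, a minimal sketch (assuming the result list is named output as above, and Python 2 file modes to match the question's code):
import csv

with open(r'output.csv', 'wb') as dst:  # 'wb' matches the Python 2 style used above
    writer = csv.writer(dst, delimiter=',')
    for dataset in output:
        writer.writerow(dataset)  # one matched, extended row per line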
Here's a brute-force solution to the problem: for every line of the first file, search through every line of the second file until you find a match. Each matched line is written to output.csv in the format you specified, using the csv writer.
import csv

with open('file1.csv', 'r') as file1:
    with open('file2.csv', 'r') as file2:
        with open('output.csv', 'w') as outfile:
            writer = csv.writer(outfile)
            reader1 = csv.reader(file1)
            reader2 = csv.reader(file2)
            for row in reader1:
                if not row:
                    continue
                for other_row in reader2:
                    if not other_row:
                        continue
                    # if we found a match, write it to the csv file with the id appended
                    if row[1].lower() == other_row[2].lower():
                        new_row = other_row
                        new_row.append(row[0])
                        writer.writerow(new_row)
                        continue
                # reset file pointer to beginning of file
                file2.seek(0)
You might be tempted to store the information in a data structure before writing it out to a file. In my experience, you always end up with larger files in the future and may run into memory issues, so I like to write matches out to the file as I find them in order to avoid this problem.
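If you want the speed of the dictionary lookup from the answer above without holding all results in memory, the two ideas combine naturally. A sketch, assuming the same file names and column positions as the question:
import csv

fqdn = {}
with open('file1.csv', 'r') as src:
    for row in csv.reader(src):
        if row:  # skip blank lines
            fqdn[row[1].lower()] = row[0]  # lower-cased key for case-insensitive matching

with open('file2.csv', 'r') as src, open('output.csv', 'w') as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        to_append = fqdn.get(row[2].lower())
        if to_append:
            writer.writerow(row + [to_append])  # write each match immediately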
Related
I have gone through similar questions but am having trouble fitting the answers to my needs. I am reading a CSV, creating a list, and appending the list to a separate CSV.
import csv

with open('in_table.csv', 'rb') as vo:
    next(vo)  # skip header row
    reader = csv.reader(vo)
    vo_list = list(reader)
print vo_list

with open('out_table.csv', 'ab') as f:
    cf = csv.writer(f)
    for row in vo_list:
        cf.writerow(row)
I need to write the list starting at the second column, not the first, as the first column will contain separate information. What is the simplest way to do this?
Realistically, I have another input CSV exactly like the first one, and I need to put them both into the output file, giving 4 columns after the first, like so:
Column1, join_count1, grid_id1, join_count2, grid_id2
Blah, 0, U24, 3, U24
I would go with the built-in csv package. Also, you are opening your CSV files as binary files; was that intentional? CSVs are text files by definition, but if yours really are binary, please correct the flags below:
import csv

with open("out_table.csv", "a+") as out_file:
    writer = csv.writer(out_file)
    with open("in_table.csv") as in_file:
        reader = csv.reader(in_file)
        next(reader)  # skip the header
        for oid, join_count, grid_id in reader:
            writer.writerow([join_count, grid_id])
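For the 4-extra-column layout you describe, you could read both inputs in parallel with zip. A sketch, assuming the second input is named in_table2.csv (a hypothetical name) and both files list the same features in the same order:
import csv

with open("in_table.csv") as in1, open("in_table2.csv") as in2, \
        open("out_table.csv", "w") as out_file:
    reader1 = csv.reader(in1)
    reader2 = csv.reader(in2)
    next(reader1)  # skip both headers
    next(reader2)
    writer = csv.writer(out_file)
    writer.writerow(["Column1", "join_count1", "grid_id1", "join_count2", "grid_id2"])
    for (oid, jc1, gid1), (_, jc2, gid2) in zip(reader1, reader2):
        writer.writerow([oid, jc1, gid1, jc2, gid2])  # first column comes from file 1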
I would like to format multiple CSV files; some of them have summaries before the raw data. The raw data can start at any row, but if "colname" is found in some row, then the raw data starts there. I am using the standard library csv module to read each file, check whether "colname" exists, and extract the data from there. With the code below, print(data) always gives me data from the first row of the file, but I want to pull the data starting from where "colname" is found. If "colname" is not found, I don't want to read the data at all.
import os
import csv

root_dir = r"folder1"
for fname in os.listdir(root_dir):
    file_path = os.path.join(root_dir, fname)
    if fname.endswith('.csv'):
        n = 0
        with open(file_path, 'rU') as fp:
            csv_reader = csv.reader(fp)
            while True:
                for line in csv_reader:
                    if line == " colname": continue
                    n = n + 1
                    data = line
                print(data)
Your code's logic only skips lines that are exactly " colname", which has 2 problems:
You want to skip lines until AFTER you have seen "colname"; you could use a boolean variable to distinguish between these two situations (see the sketch below).
It's not clear your test for "colname" is correct: if there isn't exactly one leading space, or the line has a trailing end-of-line character, that would trip it up. Note also that csv.reader yields each line as a list of fields, not a string, so the comparison to " colname" can never be true.
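A minimal sketch of the boolean-flag approach, assuming the marker is the first field of the row where the raw data starts (adjust if it can appear in another column):
import csv

def read_raw_data(file_path):
    """Return the rows that follow the 'colname' row; empty if it was never found."""
    rows = []
    with open(file_path) as fp:
        in_data = False  # flips to True once the marker row is seen
        for line in csv.reader(fp):
            if not in_data:
                # strip whitespace so ' colname' and 'colname\n' both match
                if line and line[0].strip() == "colname":
                    in_data = True
                continue  # still in the summary section (or on the marker row)
            rows.append(line)
    return rows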
I want to find and replace all of the managerial positions in a CSV file with the number 3. The list contains different positions, from a simple ",Manager," to ",Construction Project Manager and Project Superintendent,", but all of them are placed between two commas. I wrote this to find them all:
[,\s]?([A-Za-z. '\s/()\"]+)?(Manager|manager)([A-Za-z. '\s/()]+)?,
The problem is that sometimes a comma is shared between two adjacent managerial positions, so I need to include the comma when finding the positions but exclude it when replacing the position with 3. How can I do that with a regular expression in Python?
Here is the CSV file.
I suggest using Python's built-in csv module instead. Let's not reinvent the wheel here; consider handling CSV a solved problem.
Here is some sample code that demonstrates how it can be done. The csv module is responsible for reading and writing the file with the correct delimiter and quote character.
re.search is used to search individual cells/columns for your keyword: if "manager" is found, put a 3; otherwise keep the original content, then write the row back when done.
import csv, sys, re

infile = r'in.csv'
outfile = r'out.csv'

o = open(outfile, 'w', newline='')
csvwri = csv.writer(o, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)

with open(infile, newline='') as f:
    reader = csv.reader(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    try:
        for row in reader:
            newrow = []
            for col in row:
                if re.search("manager", col, re.I):  # case-insensitive match
                    newrow.append("3")
                else:
                    newrow.append(col)
            csvwri.writerow(newrow)
    except csv.Error as e:
        sys.exit('file {}, line {}: {}'.format(infile, reader.line_num, e))

o.flush()
o.close()
Straightforward and clean, I would say.
If you insist on using a regex, here's an improved pattern:
[,\s]?([A-Za-z. '\s/()\"]+)?(Manager|manager)([A-Za-z. '\s/()]+)?(?=,)
Replace with 3, as shown in the demo.
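For reference, here's a sketch of how such a replacement could look with re.sub; it uses a slightly adjusted pattern that captures the leading delimiter instead of consuming it, so adjacent positions that share a comma both survive:
import re

# (^|,) keeps the leading delimiter in group 1; (?=,|$) leaves the trailing comma alone
pattern = r"(^|,)\s*[A-Za-z. '/()\"]*[Mm]anager[A-Za-z. '/()]*(?=,|$)"

line = "1,Construction Project Manager and Project Superintendent,Sales Manager,NY"
print(re.sub(pattern, r"\g<1>3", line))  # \g<1> restores the captured comma
# -> 1,3,3,NY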
However, I believe you are still better off with the csv lib approach.
Here is my problem: I want to get the corresponding value in a CSV file for every element of a list. For example, I have a list like
namelist=[1,2]
and a CSV like
name id value
1 a aaa
2 b bbb
3 c ccc
and I want to use every element in the list to find the corresponding value in the CSV, such as: 1-aaa; 2-bbb. This is what I have tried so far:
import csv

with open('1.csv', 'rb') as f:
    reader = csv.DictReader(f)
    for i in namelist:
        for row in reader:
            if row['name'] == namelist[i]:
                print row['value']
But I got nothing. How can I fix it? Thanks in advance!
A couple of things:
csv.DictReader reads items into a dictionary of string:string, not string:int, so I changed your namelist to a list of strings. Alternatively, you could convert row['name'] to an integer, but I figured converting namelist this way is more versatile.
It is also much faster to just check if row['name'] in namelist: than to loop over the entire CSV file for every element of namelist. (Note that your nested loops can't work anyway: the reader is exhausted after the first pass, so later elements of namelist see no rows.)
Code:
import csv

namelist = ['1', '2']

with open('1.csv', 'rb') as f:
    reader = csv.DictReader(f)
    for row in reader:
        if row['name'] in namelist:
            print row['value']
Output:
aaa
bbb
First of all, you should indent the code under the "with" clause. Another thing that can cause a problem is this part of the code:
for i in namelist: # if namelist = [1, 2], then i will be 1 or 2, but as an index you need 0, 1
Also, you can try this solution if the order of the columns is always the same (i.e. name, id, value):
import csv

namelist = [1, 2]

with open('1.csv', 'rb') as f:
    reader = list(csv.reader(f))
    for row in reader:
        for i in range(len(namelist)):  # or: if row[0] in namelist
            if row[0] == str(namelist[i]):  # csv fields are strings, so compare as strings
                print row[2]
I need to create a list of tuples from a .csv file. On another post a member suggested using this code:
import csv

with open('movieCatalogue.csv') as f:
    data = [tuple(line) for line in csv.reader(f)]
data.pop(0)
print(data)
This is almost perfect, except the first column in the .csv file contains the product id, which I do not want in the tuples. Is there a way to prevent certain columns in each line from being copied?
First, I suppose you're dropping the title line with data.pop(0); you can save a list dealloc/move by skipping it while reading instead.
Then, when you compose your tuple, just drop the first element using slice syntax line[start:stop:step], here line[1:] to start at index 1:
import csv

with open('movieCatalogue.csv') as f:
    cr = csv.reader(f)
    # drop the title line: next(cr) is better than data.pop(0),
    # and it works even if the title line is multi-line
    next(cr)
    data = [tuple(line[1:]) for line in cr]  # drop the first column of each line
print(data)