I want to find and replace all of the managerial positions in a CSV file with the number 3. The list contains different positions, from a simple ",Manager," to ",Construction Project Manager and Project Superintendent,", but all of them are placed between two commas. I wrote this to find them all:
[,\s]?([A-Za-z. '\s/()\"]+)?(Manager|manager)([A-Za-z. '\s/()]+)?,
The problem is that sometimes a comma is shared between two adjacent managerial positions. So I need to include the comma when finding the positions, but I need to exclude it when replacing a position with 3! How can I do that with a regular expression in Python?
Here is the CSV file.
I suggest using Python's built-in csv module instead. Let's not reinvent the wheel here; handling CSV is a solved problem.
Here is some sample code that demonstrates how it can be done: the csv module is responsible for reading and writing the file with the correct delimiter and quote character.
re.search is used to check each cell/column for your keyword. If "manager" is found, write a 3; otherwise keep the original content, and write the row back when done.
import csv, sys, re

infile = r'in.csv'
outfile = r'out.csv'

o = open(outfile, 'w', newline='')
csvwri = csv.writer(o, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)

with open(infile, newline='') as f:
    reader = csv.reader(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    try:
        for row in reader:
            newrow = []
            for col in row:
                if re.search("manager", col, re.I):
                    newrow.append("3")
                else:
                    newrow.append(col)
            csvwri.writerow(newrow)
    except csv.Error as e:
        sys.exit('file {}, line {}: {}'.format(infile, reader.line_num, e))

o.flush()
o.close()
Straightforward and clean, I would say.
If you insist on using a regex, here's an improved pattern:
[,\s]?([A-Za-z. '\s/()\"]+)?(Manager|manager)([A-Za-z. '\s/()]+)?(?=,)
The lookahead (?=,) requires a trailing comma without consuming it, so that comma remains available as the leading delimiter of the next position. Replace each match with 3.
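For instance, a rough sketch with re.sub; this variant also uses a lookbehind so that neither delimiting comma is consumed (the sample line is made up):

import re

# Lookbehind/lookahead match the delimiting commas without consuming them,
# so both commas survive the substitution.
line = "John Doe,Manager,Construction Project Manager and Project Superintendent,NY"
pattern = r"(?<=,)[A-Za-z. '\s/()\"]*[Mm]anager[A-Za-z. '\s/()]*(?=,)"
print(re.sub(pattern, "3", line))
# -> John Doe,3,3,NY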
However, I believe you are still better off with the csv lib approach.
I have gone through similar questions but am having trouble fitting this to my needs. I am reading a CSV, creating a list, and appending the list to a separate CSV.
import csv

with open('in_table.csv', 'rb') as vo:
    next(vo)  # skip header row
    reader = csv.reader(vo)
    vo_list = list(reader)
print vo_list

with open('out_table.csv', 'ab') as f:
    cf = csv.writer(f)
    for row in vo_list:
        cf.writerow(row)
I need to write the list starting at the second column and not the first, as the first column will contain separate information. What is the simplest way to do this?
Realistically I have another input CSV exactly like the first one, and I need to put them both into the output file, for a total of 4 data columns, like so:
Column1, join_count1, grid_id1, join_count2, grid_id2
Blah, 0, U24, 3, U24
I would go with the built-in csv package. Also, you are opening the CSV files as binary files; was that intentional? CSVs should be text files by definition, but if yours really are binary, correct the flags below:
import csv

with open("out_table.csv", "a+") as out_file:
    writer = csv.writer(out_file)
    with open("in_table.csv") as in_file:
        reader = csv.reader(in_file)
        next(reader)  # skip the header
        for oid, join_count, grid_id in reader:
            writer.writerow([join_count, grid_id])
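If you also need the second input file merged in, to get the 4-column layout from your example, here is a rough sketch; the second file's name in_table2.csv and the source of the first column are assumptions:

import csv

# Sketch only: assumes a hypothetical second file "in_table2.csv" with the
# same layout and row order as "in_table.csv".
with open("in_table.csv") as f1, open("in_table2.csv") as f2, \
        open("out_table.csv", "w") as out_file:
    reader1, reader2 = csv.reader(f1), csv.reader(f2)
    next(reader1)  # skip both headers
    next(reader2)
    writer = csv.writer(out_file)
    for (_, jc1, gid1), (_, jc2, gid2) in zip(reader1, reader2):
        first_col = "Blah"  # placeholder for your separate first-column info
        writer.writerow([first_col, jc1, gid1, jc2, gid2])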
I am trying to extract a dynamic value (static characters) from a csv file in a specific column and output the value to another csv.
The data element I am trying to extract is '12385730561818101591' from the value 'callback=B~12385730561818101591' located in a specific column.
I have written the Python script below, but the output results are always blank. The regex '=(~[0-9]+)' was validated on www.regex101.com and successfully pulled out the '12385730561818101591' value.
When I use it in Python, no results are displayed in the output file. I have a feeling the '~' is causing the error. When I tried searching for '~' in the original CSV file, no results were found, but it is there!
Can the community help me with the following:
(1) Determine the root cause of the missing output and confirm whether the '~' is the problem. Could the problem also be the way I'm splitting the rows? I'm not sure whether the rows should be split on ';' instead of ','.
import csv
import sys
import ast
import re

filename1 = open("example.csv", "w")
with open('example1.csv') as csvfile:
    data = None
    patterns = '=(~[0-9]+)'
    data1 = csv.reader(csvfile)
    for row in data1:
        var1 = row[57]
        for item in var1.split(','):
            if re.search(patterns, item):
                for data in item:
                    if 'common' in data:
                        filename1.write(data + '\n')
filename1.close()
Here is some sample code I have tried to write; hope this helps you solve the problem:

import re

s = "callback=B~12385730561818101591"
rc = re.match(r'.*=B~([0-9]+)', s)
print(rc.group(1))
Your regex is wrong for your example:
=(~[0-9]+) will never match callback=B~12385730561818101591 because of the B after the = and before the ~.
Also, you include the ~ in the capturing group.
Not exactly sure what your goal is, but this should work. Give more details if you have more restrictions.
=.+~([0-9]+)
EDIT

Following the newly provided information:

patterns = '=.+~([0-9]+)'
...
result = re.search(patterns, item)
if result:
    number = result.group(1)  # group(1) is the digits alone; group(0) would include the '=B~' prefix
    filename1.write(number + '\n')
...
Concerning your line split on \t (tab): you should show an example of a full line.
I've read the other threads on this site but haven't quite grasped how to accomplish what I want. I'd like a method like .splitlines() to assign the first two lines of a multiline string to two separate variables, then group the rest of the text in the string into another variable.
The purpose is to have consistent data sets to write to a .csv, using the three variables as data for separate columns.
Title of a string
Description of the string
There are multiple lines under the second line in the string!
There are multiple lines under the second line in the string!
There are multiple lines under the second line in the string!
Any guidance on the pythonic way to do this would be appreciated.
Using islice
In addition to normal list slicing you can use islice(), which avoids building intermediate list copies and can be more efficient when slicing larger inputs.
Code would look like this:
from itertools import islice

with open('input.txt') as f:
    data = f.readlines()

first_line_list = list(islice(data, 0, 1))
second_line_list = list(islice(data, 1, 2))
other_lines_list = list(islice(data, 2, None))

first_line_string = "".join(first_line_list)
second_line_string = "".join(second_line_list)
other_lines_string = "".join(other_lines_list)
However, you should keep in mind that the data source you read from must be long enough. If it is not, both islice() and normal list slicing will silently return fewer (or zero) items rather than raising an error, so the joined strings may simply end up empty.
Using regex
The OP additionally asked, in the comments, for an approach that avoids lists.
Since reading a file yields a string, and string handling leads to lists later on (or readlines() gives a list of lines directly), I suggested using a regex instead.
I cannot say anything about the performance of list/string handling compared to regex operations. However, this should do the job:
import re

# Non-greedy 'first' and 'second' stop at the first newline each;
# with re.DOTALL, 'rest' then captures everything that remains.
regex = r'(?P<first>.+?)\n(?P<second>.+?)\n(?P<rest>.+)'
preg = re.compile(regex, re.DOTALL)

with open('input.txt') as f:
    data = f.read()

match = preg.search(data)
first_line = match.group('first')
second_line = match.group('second')
rest_lines = match.group('rest')
If I understand correctly, you want to split a large string into lines
lines = input_string.splitlines()
After that, you want to assign the first and second line to variables and the rest to another variable
title = lines[0]
description = lines[1]
rest = lines[2:]
If you want 'rest' to be a string, you can achieve that by joining it with a newline character.
rest = '\n'.join(lines[2:])
A different, very fast option is:
lines = input_string.split('\n', maxsplit=2)  # this only splits off the first two lines
title = lines[0]
description = lines[1]
rest = lines[2]
I'm writing a bot in Python using tweepy for Python 2.7. I'm stumped on how to approach what I am looking to do. Currently the bot finds the tweet id and appends it to a text file. On later runs I want to search that file for a match and only write the id if there is no match already in the file. The intent is to avoid adding duplicate tweet ids to my text file, which could grow to a large number of ids, one per line.
Any help is appreciated!
/edit: when I try the code below, the IDE says match can't be seen, and a syntax error results.
import re, codecs, tweepy

qName = Queue.txt
tweets = api.search(q=searchQuery, count=tweet_count, result_type="recent")
with codecs.open(qName, 'a', encoding='utf-8') as f:
    for tweet in tweets:
        tweetId = tweet.id_str
        match = re.findall(tweedId), qName)
        # if match is false then write, else discard and move on
        f.write(tweetId + '\n')
If I understand you correctly, you don't need to bother with regex at all; let the right container do the work for you. I would go with a non-duplicating container like a dict or set: read all the data from the file into a set, add the new ids to that set, and then write the set back to the file.
e.g.

>>> data = set()
>>> for i in list('asddddddddddddfgggggg'):
...     data.add(i)
...
>>> data
set(['a', 's', 'd', 'g', 'f'])  # note: only one 'd' and one 'g'
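Applied to your queue file, a minimal sketch of that idea might look like this (assuming tweets comes from the api.search(...) call in your snippet and that Queue.txt holds one id per line):

import codecs

qName = 'Queue.txt'

# Load the ids already on disk into a set for O(1) membership tests.
try:
    with codecs.open(qName, 'r', encoding='utf-8') as f:
        seen = set(line.strip() for line in f)
except IOError:
    seen = set()  # first run: no queue file yet

# Append only the ids we have not seen before.
with codecs.open(qName, 'a', encoding='utf-8') as f:
    for tweet in tweets:  # tweets as returned by api.search(...)
        tweetId = tweet.id_str
        if tweetId not in seen:
            f.write(tweetId + '\n')
            seen.add(tweetId)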
I'm working with two large files of approximately 100K+ rows each. I want to search csv file#1 for a string contained in csv file#2, then join another string from file#1 to the matching row in file#2. Here's an example of the data I'm working with and my expected output:
File#1: the string to be matched in file#2 is the 2nd element; the 1st element is the integer to be appended to each matched row in file#2.
row 1:
3604430123,mta0000cadd503c.mta.net
row 2:
3604434567,mta0000CADD5638.MTA.NET
row 3:
3606304758,mta00069234e9a51.DT.COM
File#2:
row 1:
4246,211-015617,mta0000cadd503c.mta.net,old,NW MG2,BBand2 ESA,Active
row 2:
7251,ACCOUNT,mta0000CADD5638.MTA.NET,FQDN ,NW MG2,BBand2 ESA,Active
row 3:
536887946,874-22558501,mta00069234e9a51.DT.COM,"P",NW MG2,BBand2 ESA,Active
Desired output, joining the integer string from file#1 to the entire row in file#2, based on the string match between file#1 and file#2:
row 1:
4246,211-015617,mta0000cadd503c.mta.net,old,NW MG2,BBand2 ESA,Active,3604430123
row 2:
7251,ACCOUNT,mta0000CADD5638.MTA.NET,FQDN ,NW MG2,BBand2 ESA,Active,3604434567
row 3:
536887946,874-22558501,mta00069234e9a51.DT.COM,"P",NW MG2,BBand2 ESA,Active,3606304758
There are many instances where the case of the match string in file#1 doesn't match the case in file#2, though the characters themselves match, so case should be ignored for the match criteria. The character case does need to be preserved in file#2 after the integer string from file#1 is appended.
I'm a Python newb and I've been at this for a while, scouring posts on SE, but I can't seem to come up with working code that even gets me to the point of printing a line from file#2 that matched a string in file#1. I've tried a few other methods, such as writing to a dictionary and using DictReader, but couldn't clear what appear to be simple errors in those methods, so I stripped this down to simple lists, hoping to use a list comprehension to combine the data and write it back to a file named output, which will eventually become a csv file. Any help or suggestions would be greatly appreciated.
import csv

sg = []
fqdn = []
output = []

with open(r'file2.csv', 'rb') as src:
    read = csv.reader(src, delimiter=',')
    for row in read:
        sg.append(row)

with open(r'file1.csv', 'rb') as src1:
    read1 = csv.reader(src1, delimiter=',')
    for row in read1:
        fqdn.append(row)

output = output.append([s[0] for s in sg if fqdn[1] in sg])
print output
Result after running this is:
None
Process finished with exit code 0
You should use a dictionary for file#1 rather than just a list, as matching is then easier. Turn fqdn into a dict and, in your loop reading file#1, set key-value pairs on it. Use .lower() on the match key; this turns the key to lower case, so later you only have to check whether the lower-cased version of the field in file#2 is a key in the dictionary:
import csv

sg = []
fqdn = {}
output = []

with open(r'file2.csv', 'rb') as src:
    read = csv.reader(src, delimiter=',')
    for dataset in read:
        sg.append(dataset)

with open(r'file1.csv', 'rb') as src1:
    read1 = csv.reader(src1, delimiter=',')
    for to_append, to_match in read1:
        fqdn[to_match.lower()] = to_append

for dataset in sg:
    # If the key matched, to_append contains the string to append, else None
    to_append = fqdn.get(dataset[2].lower())
    if to_append:
        dataset.append(to_append)  # append the field
        output.append(dataset)     # append the row to the result list

print(output)
You can then use csv.writer to create a csv file from the result.
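For instance, a short sketch (assuming output is the list of rows built above; 'result.csv' is a made-up file name):

import csv

# 'wb' matches the Python 2 file modes used above; on Python 3 use 'w' with newline=''.
with open('result.csv', 'wb') as dst:
    writer = csv.writer(dst, delimiter=',')
    writer.writerows(output)  # one csv row per joined dataset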
Here's a brute-force solution to this problem. For every line of the first file, it searches through every line of the second file until it finds a match. The matched lines are written out to the output.csv file, in the format you specified, using the csv writer.
import csv

with open('file1.csv', 'r') as file1:
    with open('file2.csv', 'r') as file2:
        with open('output.csv', 'w') as outfile:
            writer = csv.writer(outfile)
            reader1 = csv.reader(file1)
            reader2 = csv.reader(file2)
            for row in reader1:
                if not row:
                    continue
                for other_row in reader2:
                    if not other_row:
                        continue
                    # if we found a match, let's write it to the csv file with the id appended
                    if row[1].lower() == other_row[2].lower():
                        new_row = other_row
                        new_row.append(row[0])
                        writer.writerow(new_row)
                        continue
                # reset file pointer to beginning of file
                file2.seek(0)
You might be tempted to store the information in a data structure before writing it out to a file. In my experience, you always end up getting larger files in the future and may run into memory issues. I like to write things out to file as I find the matches in order to avoid this problem.