I am reading in a CSV file and it works quite well, but some of the strings look like this:
u'Egg'
When I try to convert this to a string, I get the error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 0: ordinal not in range(128)
I have read various similar questions, but trying the solutions they propose resulted in the same error.
Strangely, when debugging (as you can see in the screenshot), the variable CITY holds the expected value, yet the code still crashes.
Below is my function:
def readData(filename, delimiter=";"):
    """
    Read in our data from a CSV file and create a dictionary of records,
    where the key is a unique record ID and each value is a dict
    """
    data = pd.read_csv(filename, delimiter=delimiter, encoding="UTF-8")
    data.set_index("TRNUID")
    returnValue = {}
    for index, row in data.iterrows():
        if index == 0:
            print row["CITY"]
        else:
            if math.isnan(row["DUNS"]) == True:
                DUNS = ""
            else:
                DUNS = str((int(row["DUNS"])))[:-2]
            NAME = str(row["NAME"]).encode("utf-8")
            STREET = str(row["STREET"]).encode("utf-8")
            CITY = row["CITY"]
            POSTAL = str(row["POSTAL"]).encode("utf-8")
            returnValue[row["TRNUID"]] = {
                "DUNS": DUNS,
                "NAME": NAME,
                "STREET": STREET,
                "CITY": CITY,
                "POSTAL": POSTAL
            }
    return returnValue
You're trying to convert to an ASCII string something that inherently cannot be represented in ASCII.
If you look up the Unicode character \xfc, it is a "u" with an umlaut (ü). Indeed, your screenshot of the variables shows "Egg a.d.Guntz" with an umlaut over the "u". The problem is therefore not with "Egg" but with the rest of the string.
You could address this by removing all diacritics from your characters (as in this question), but you will lose information.
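The underlying issue is that in Python 2, calling str() on a unicode object implicitly encodes it with the ASCII codec. A minimal sketch of both options, where u'Egg a.d.G\xfcntz' is my stand-in for the value in your screenshot:
# -*- coding: utf-8 -*-
import unicodedata

city = u'Egg a.d.G\xfcntz'

# Option 1: keep the unicode object and encode explicitly where bytes are needed.
# str(city) would implicitly use the ASCII codec and raise UnicodeEncodeError.
city_utf8 = city.encode("utf-8")

# Option 2: strip the diacritics (lossy: u'\xfc' becomes a plain 'u').
city_ascii = unicodedata.normalize("NFKD", city).encode("ascii", "ignore")

print city_utf8    # Egg a.d.Güntz
print city_ascii   # Egg a.d.Guntz
Option 1 preserves the data; Option 2 is the lossy diacritic removal mentioned above.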
Related
I need to do pretty much what 'grep -i str file' gives back, but I have been hitting my head against this issue for ages.
I have a function called 'siteLookup' to which I pass two parameters: a string 's' and a file handle 'f'.
I want to a) determine whether there is a (single) occurrence of the string (in this example site="XX001"),
and b) if found, take the line it was found in and return another field value extracted from that line back to the caller (it is a CSV lookup). I have had this working periodically, but then it stops working and I cannot see why.
I have tried all of the different 'open' options, including f.readlines etc.
#example line: 'XX001,-1,10.1.1.1/30,By+Location/CC/City Name/'
#example from lookupFile.csv: "XX001","Charleston","United States"
sf = open('lookupFile.csv')

def siteLookup(s, f):
    site = s.split(',')[0].strip().upper()
    if len(site) == 5:
        f.seek(0)
        for line in f:
            if line.find(site) >= 0:
                city = line.split(',')[1].strip('"').upper()
                return city
            # else site not found
            return -1
    else:  # len(site) != 5
        return -1

city = siteLookup(line, sf)
print(city)
sf.close()
I am getting zero matches in this code. (I have simplified this example code to a single search.) I am expecting to get back the name of the city that matches a five-character site code; the site code is the first field in the example "line".
Any assistance much appreciated.
Your return is wrongly indented: if the string you look for is not found in the first line, the function returns -1 and never looks further.
Use with open(...) as f: to make your code more robust, so the file is closed for you even on errors:
with open("lookupFile.csv","w") as f:
f.write("""#example from lookupFile.csv:
"XX001","Charleston","United States"
""")
def siteLookup(s, f):
site = s.split(',')[0].strip().upper()
if len(site) == 5:
f.seek(0)
for line in f:
if site in line: # if site in line is easier for checking
city = line.split(',')[1].strip('"').upper()
return city
# wrongly indented - will return if site not in line
# return -1
# if too short or not found, return -1 - no need for 2 returns
return -1
line = 'XX001,-1,10.1.1.1/30,By+Location/CC/City Name/'
with open('lookupFile.csv') as sf:
city = siteLookup(line, sf)
print(city)
Output:
CHARLESTON
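One more caveat: line.split(',') and the substring test work here only because no field contains an embedded comma and the site code never appears in another column. For general CSV data, the csv module handles the quoting, and an exact match on the first column is safer; a small sketch under the same file layout:
import csv

def siteLookup(s, f):
    site = s.split(',')[0].strip().upper()
    if len(site) != 5:
        return -1
    f.seek(0)
    for row in csv.reader(f):
        # exact match on the first column; csv strips the quotes for us
        if row and row[0].strip().upper() == site:
            return row[1].upper()
    return -1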
I have a file with millions of records like this
2017-07-24 18:34:23|CN:SSL|RESPONSETIME:23|BYTESIZE:1456|CLIENTIP:127.0.0.9|PROTOCOL:SSL-V1.2
Each record contains around 30 key-value pairs with a "|" delimiter. The position of a key-value pair is not constant.
I am trying to parse these records into Python dictionaries or lists.
Note: the first column is not in key-value format.
Your file is basically a |-separated CSV file holding first the timestamp, then cells that each contain a key and a value separated by ':'.
So you could use the csv module to read the cells, then pass the result of str.split to dict() in a generator comprehension to build the dictionary from every cell but the first.
Then update the dict with the timestamp:
import csv

list_of_dicts = []
with open("input.txt") as f:
    cr = csv.reader(f, delimiter="|")
    for row in cr:
        d = dict(v.split(":") for v in row[1:])
        d["date"] = row[0]
        list_of_dicts.append(d)
list_of_dicts contains dictionaries like
{'date': '2017-07-24 18:34:23', 'PROTOCOL': 'SSL-V1.2', 'RESPONSETIME': '23', 'CN': 'SSL', 'CLIENTIP': '127.0.0.9', 'BYTESIZE': '1456'}
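One caveat with dict(v.split(":") for ...): if a value itself contains a colon, the cell splits into more than two parts and dict() raises a ValueError. Limiting the split to the first colon avoids that; a small sketch with a hypothetical cell:
cell = "URL:http://example.com/x"   # hypothetical cell whose value contains ':'
key, value = cell.split(":", 1)     # maxsplit=1: split only at the first ':'
print(key)     # URL
print(value)   # http://example.com/x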
Repeat the process below for every line of your file. I am not clear about the datetime value, so I haven't included it in the input; you can add it based on your understanding.
import re

given = "CN:SSL|RESPONSETIME:23|BYTESIZE:1456|CLIENTIP:127.0.0.9|PROTOCOL:SSL-V1.2"
results = dict()
list_for_this_line = re.split(r'\|', given)
for i in range(len(list_for_this_line)):
    separated_k_v = re.split(':', list_for_this_line[i])
    results[separated_k_v[0]] = separated_k_v[1]
print results
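As a side note, the same parsing can be done without re or range(len(...)), since plain str.split covers both delimiters; a sketch equivalent to the loop above:
given = "CN:SSL|RESPONSETIME:23|BYTESIZE:1456|CLIENTIP:127.0.0.9|PROTOCOL:SSL-V1.2"
# one pass: split cells on '|', then split each cell at its first ':'
results = dict(cell.split(":", 1) for cell in given.split("|"))
print results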
Hope this helps!
I am trying to write a program that reads a text file and writes a converted version to another text file using the given variables, kind of like a homemade encryption. I want the program to read 2 bytes at a time until it has read the entire file. I am new to Python but enjoy working with it. Any help would be greatly appreciated.
a = 12
b = 34
c = 56
...and so on, up to 20 different variables.
file2 = open("textfile2.text", "w")
file = open("testfile.txt", "r")
file.read(2):
    if file.read(2) = 12 then;
        file2.write("a")
    else if file.read(2) = 34
        file2.write("b")
    else if file.read(2) = 56
        file2.write("c")
file.close()
file2.close()
Text file would look like:
1234567890182555
So the program would read 12 and write "a" to the other text file, then read 34 and write "b". I'm just having some logic issues.
I like your idea; here is how I would do it. Note that I convert everything to lowercase using lower(); however, if you understand what I am doing, it is quite simple to extend this to work on both lower- and uppercase:
import string

d = dict.fromkeys(string.ascii_lowercase, 0)  # create a dictionary of all the letters in the alphabet
updates = 0
while updates < 20:  # can only encode 20 characters
    letter = input("Enter a letter you want to encode or type encode to start encoding the file: ")
    if letter.lower() == "encode":  # check if the user typed "encode"
        break
    if len(letter) == 1 and letter.isalpha():  # input must be a single alphabetic character
        encode = input("What do want to encode %s to: " % letter.lower())  # ask what to encode that letter to
        d[letter.lower()] = encode
        updates += 1
    else:
        print("Please enter a letter...")

with open("data.txt") as f:
    content = list(f.read().lower())

for idx, val in enumerate(content):
    if val.isalpha():
        content[idx] = d[val]

with open("data.txt", 'w') as f:
    f.write(''.join(map(str, content)))

print("The file has been encoded!")
Example Usage:
Original data.txt:
The quick brown fox jumps over the lazy dog
Running the script:
Enter a letter you want to encode or type encode to start encoding the file: T
What do want to encode t to: 6
Enter a letter you want to encode or type encode to start encoding the file: H
What do want to encode h to: 8
Enter a letter you want to encode or type encode to start encoding the file: u
What do want to encode u to: 92
Enter a letter you want to encode or type encode to start encoding the file: 34
Please enter a letter...
Enter a letter you want to encode or type encode to start encoding the file: rt
Please enter a letter...
Enter a letter you want to encode or type encode to start encoding the file: q
What do want to encode q to: 9
Enter a letter you want to encode or type encode to start encoding the file: encode
The file has been encoded!
Encoded data.txt:
680 992000 00000 000 092000 0000 680 0000 000
I would read the source file and convert the items as you go into a string, then write the entire result string separately to the second file. This also lets you use the better with open construct for file handling, which has Python close the files for you.
The code below will not work as-is because it only reads the first two characters; you need to work out how to iterate over the file yourself, but here is the idea (without just handing you a solution):
with open("textfile.text","r") as f:
# you need to create a way to iterate over these two byte/char increments
code = f.read(2)
decoded = <figure out what code translates to>
results += decoded
# now you have a decoded string inside `results`
with open("testfile.txt","w") as f:
f.write(results)
The decoded = <figure out what code translates to> part can be done much better than with a chain of if/elifs.
Perhaps define a dictionary of the encodings:
codings = {
    "12": "a",
    "45": "b",
    # etc...
}
then you could just:
results += codings[code]
instead of the if statements (and it would be faster).
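Putting the pieces together, here is a minimal sketch of the whole loop under the question's assumptions (the file names and the three codes come from the examples above; the "?" fallback for unknown codes is my own placeholder):
codings = {"12": "a", "34": "b", "56": "c"}  # example mapping from the question

results = ""
with open("testfile.txt", "r") as f:
    while True:
        code = f.read(2)   # read two characters at a time
        if not code:       # an empty string means end of file
            break
        results += codings.get(code, "?")  # "?" marks codes missing from the mapping

with open("textfile2.text", "w") as f:
    f.write(results)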
I'm trying to find out how to stop an os.walk after it has walked through a particular file.
I have a directory of log files organized by date. I'm trying to replace grep searches, allowing a user to find IP addresses stored in a date range they specify.
The program will take the following arguments:
-i IPv4 or IPv6 address with subnet
-s start date, e.g. 2013/12/20 (matches the file structure)
-e end date
I'm assuming that, because of the topdown option, there is a way to declare an endpoint. What is the best way to do this? I'm thinking a while loop.
I apologize in advance if something is off with my question. Just checked blood sugar, it's low 56, gd type one.
Additional information
The file structure is situated in flows/index_border, as follows:
2013
--01
--02
----01
----...
----29
2014
Hope this is clear: the year folder contains month folders, which contain day folders, which contain hourly files. Dates increase downwards.
The end date will need to be inclusive (I didn't focus on it too much because I can just add code to move one day up).
I have been trying to make a date-range function; I was surprised not to see one in the datetime docs, since it seems like it would be useful.
import os, gzip, netaddr, datetime, argparse

startDir = '.'

def sdate_format(s):
    try:
        return (datetime.datetime.strptime(s, '%Y/%m/%d').date())
    except ValueError:
        msg = "Bad start date. Please use yyyy/mm/dd format."
        raise argparse.ArgumentTypeError(msg)

def edate_format(e):
    try:
        return (datetime.datetime.strptime(e, '%Y/%m/%d').date())
    except ValueError:
        msg = "Bad end date. Please use yyyy/mm/dd format."
        raise argparse.ArgumentTypeError(msg)

parser = argparse.ArgumentParser(description='Locate IP address in log files for a particular date or date range')
parser.add_argument('-s', '--start_date', action='store', type=sdate_format, dest='start_date', help='The first date in range of interest.')
parser.add_argument('-e', '--end_date', action='store', type=edate_format, dest='end_date', help='The last date in range of interest.')
parser.add_argument('-i', action='store', dest='net', help='IP address or address range, IPv4 or IPv6 with optional subnet accepted.', required=True)
results = parser.parse_args()

start = results.start_date
end = results.end_date
target_ip = results.net

startDir = '/flows/index_border/{0}/{1:02d}/{2:02d}'.format(start.year, start.month, start.day)

print('searching...')
for root, dirs, files in os.walk(startDir):
    for contents in files:
        if contents.endswith('.gz'):
            f = gzip.open(os.path.join(root, contents), 'r')
        else:
            f = open(os.path.join(root, contents), 'r')
        text = f.readlines()
        f.close()
        for line in text:
            for address_item in netaddr.IPNetwork(target_ip):  # was target_IP, which raises a NameError
                if str(address_item) in line:
                    print line,
You need to describe what works and what does not. The argparse part of your code looks fine, though I haven't done any testing. The use of type= is refreshingly correct. :) (Posters often misuse that parameter.)
But as for the stopping, I'm guessing you could do:
endDir = '/flows/index_border/{0}/{1:02d}/{2:02d}'.format(end.year, end.month, end.day)

for root, dirs, files in os.walk(startDir):
    for contents in files:
        ....
    if endDir in <something based on dirs and files>:
        break
I don't know enough about your file structure to be more specific. It's also been some time since I worked with os.walk. In any case, I think a conditional break is the way to stop the walk early.
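A caveat worth adding: break only exits the innermost for loop. A minimal sketch of stopping the whole walk once the end-date directory has been processed (the paths are hypothetical stand-ins for the startDir/endDir built above; with topdown=True you can also prune dirs[:] in place to skip subtrees without stopping entirely):
import os

startDir = '/flows/index_border'                              # walk the whole tree
endDir = os.path.normpath('/flows/index_border/2013/12/20')   # hypothetical end date

for root, dirs, files in os.walk(startDir, topdown=True):
    dirs.sort()   # visit the date-named folders in chronological order
    # ... process `files` for this directory here ...
    if os.path.normpath(root) == endDir:
        break     # abandons the os.walk generator; no further directories are visited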
#!/usr/bin/env python
import os, gzip, netaddr, datetime, argparse, sys

searchDir = '.'
searchItems = []

def sdate_format(s):
    try:
        return (datetime.datetime.strptime(s, '%Y/%m/%d').date())
    except ValueError:
        msg = "Bad start date. Please use yyyy/mm/dd format."
        raise argparse.ArgumentTypeError(msg)

def edate_format(e):
    try:
        return (datetime.datetime.strptime(e, '%Y/%m/%d').date())
    except ValueError:
        msg = "Bad end date. Please use yyyy/mm/dd format."
        raise argparse.ArgumentTypeError(msg)

parser = argparse.ArgumentParser(description='Locate IP address in log files for a particular date or date range')
parser.add_argument('-s', '--start_date', action='store', type=sdate_format, dest='start_date',
                    help='The first date in range of interest.', required=True)
parser.add_argument('-e', '--end_date', action='store', type=edate_format, dest='end_date',
                    help='The last date in range of interest.', required=True)
parser.add_argument('-i', action='store', dest='net',
                    help='IP address or address range, IPv4 or IPv6 with optional subnet accepted.', required=True)
results = parser.parse_args()

start = results.start_date
end = results.end_date + datetime.timedelta(days=1)  # make the end date inclusive
target_IP = results.net
dateRange = end - start

for addressOfInterest in (netaddr.IPNetwork(target_IP)):
    searchItems.append(str(addressOfInterest))

print('searching...')
for eachDay in range(dateRange.days):
    period = start + datetime.timedelta(days=eachDay)
    searchDir = '/flows/index_border/{0}/{1:02d}/{2:02d}'.format(period.year, period.month, period.day)
    for contents in os.listdir(searchDir):
        if contents.endswith('.gz'):
            f = gzip.open(os.path.join(searchDir, contents), 'rb')
        else:
            f = open(os.path.join(searchDir, contents), 'r')
        text = f.readlines()
        f.close()
        for addressOfInterest in searchItems:
            for line in text:
                if addressOfInterest in line:
                    print contents
                    print line,
I was banging my head because I thought I was printing a duplicate; it turns out the file I was given to test contains duplicates. I ended up removing os.walk due to the predictable nature of the file system, but @hpaulj did provide a correct solution. Much appreciated!
Hi, I am a newbie to Python. I am creating a small program which parses the load log file of a particular website and stores the valid data in particular fields of a database, but some of the fields have non-ASCII characters like '推广频道' (non-ASCII bytes such as '\xe6').
I also tried to apply the solution from this question:
Inserting unicode into sqlite?
but that codec is unable to decode, giving:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 84: character maps to <undefined>
def db_log_entry(filename):
    conn = sqlite3.connect('log.db')
    c = conn.cursor()
    """ creating log table
    Table structure:
    Name            Type
    id              integer
    identifier      text
    url             text
    user_ip         text
    datetime        text
    timezone_mod    text
    query_type      text
    status          text
    opt_id          text
    user_agent      text
    opt_ip          text
    opt_ip_port     text
    """
    c.execute('''CREATE TABLE log
        (id integer, identifier text, url text, user_ip text, datetime text, timezone_mod text, query_type text, query_url text, status text,
         opt_id text, user_agent text, opt_ip text, opt_ip_port text)''')
    f = codecs.open(filename, encoding='cp1252')  # opening file to read data
    loop_count = 0  # variable to keep record of successful loop iterations
    id_count = 0    # variable to be entered in the id field of the log table
    # Read each line of the log file, parse the valid data and make an entry in the database
    for log_line in f:
        loop_count = loop_count + 1
        id_count = id_count + 1
        db_list = []
        (txt1, txt2) = log_line.split('HTTP/')  # split log_line into txt1, txt2 for comparison with reg1, reg2
        reg1 = r'(\d{6}_\d{5})\s([\w.-]+)\s([\d.]+)\s-\s-\s\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2})\s([+-]?\d+)\]\s"(\w+)\s([\w.\-\/]+)'
        reg2 = r'[1.1"]?[1.0"]?\s(\d*)\s(\d*)([\s\"\-]*)([\w\.\%\/\s\;\(\:\+\)\?\S\"]+)"\s(\d{2,3}.\d{2,3}.\d{2,3}.\d{2,3})\:(\d+)'
        print 'starting loop ', loop_count
        match = re.search(reg1, txt1)
        if match:  # if a regex match is found between (reg1, txt1), append the data to db_list
            db_list.append(id_count)
            db_list.append(match.group(1))
            print match.group(1)
            db_list.append(match.group(2))
            print match.group(2)
            db_list.append(match.group(3))
            print match.group(3)
            db_list.append(match.group(4))
            print match.group(4)
            db_list.append(match.group(5))
            print match.group(5)
            db_list.append(match.group(6))
            print match.group(6)
            db_list.append(match.group(7))
            print match.group(7)
            print 'yes 1'
        else:
            print 'match not found in reg1'
        match = re.search(reg2, txt2)  # if a regex match is found between (reg2, txt2), append the data to db_list
        if match:
            db_list.append(match.group(1))
            print match.group(1)
            db_list.append(match.group(2))
            print match.group(2)
            # print match.group(3)
            db_list.append(match.group(4))
            print match.group(4)
            db_list.append(match.group(5))
            print match.group(5)
            db_list.append(match.group(6))
            print match.group(6)
            print 'yes 2'
        else:
            print 'match not found in reg2'
        print 'inserting values for loop', loop_count
        c.execute('INSERT INTO log VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?)', db_list)
        conn.commit()
        print 'success...'
    conn.commit()

def main():
    db_log_entry('access_log.txt')

if __name__ == '__main__':
    main()
That's because you're using the wrong character encoding to open the file.
You should be opening it with UTF-8, like this:
f=codecs.open(filename,encoding='utf-8')
This is because CP-1252 is an encoding for the Latin alphabet, and as such it cannot represent the Chinese characters in your data.
As I'm not sure what the original encoding (or language) is, UTF-8 is also a safe bet as it supports all languages.
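For completeness, a minimal sketch of the read-and-insert round trip (the table and file names are placeholders): codecs.open with encoding='utf-8' yields unicode objects, and sqlite3 accepts unicode values directly when you use parameterized queries:
import codecs
import sqlite3

conn = sqlite3.connect('log.db')
c = conn.cursor()
c.execute('CREATE TABLE IF NOT EXISTS demo (field text)')

with codecs.open('access_log.txt', encoding='utf-8') as f:
    for line in f:  # each line is a unicode object
        c.execute('INSERT INTO demo VALUES (?)', (line.strip(),))

conn.commit()
conn.close()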