pandas reading .csv files - python-2.7

I have a small script to read and print a .csv file using pandas generated from MS Excel.
import pandas as pd
data = pd.read_csv('./2010-11.csv')
print(data)
now this script runs in Python 2.7.8 but in Python 3.4.1 gives the following
error. Any ideas why this might be so? Thanks in advance for any help with this.
Traceback (most recent call last):
File "proc_csv_0-0.py", line 3, in <module>
data = pd.read_csv('./2010-11.csv')
File "/usr/lib64/python3.4/site-packages/pandas/io/parsers.py", line 474, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/lib64/python3.4/site-packages/pandas/io/parsers.py", line 260, in _read
return parser.read()
File "/usr/lib64/python3.4/site-packages/pandas/io/parsers.py", line 721, in read
ret = self._engine.read(nrows)
File "/usr/lib64/python3.4/site-packages/pandas/io/parsers.py", line 1170, in read
data = self._reader.read(nrows)
File "pandas/parser.pyx", line 769, in pandas.parser.TextReader.read (pandas/parser.c:7566)
File "pandas/parser.pyx", line 791, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7806)
File "pandas/parser.pyx", line 866, in pandas.parser.TextReader._read_rows (pandas/parser.c:8639)
File "pandas/parser.pyx", line 973, in pandas.parser.TextReader._convert_column_data (pandas/parser.c:9950)
File "pandas/parser.pyx", line 1033, in pandas.parser.TextReader._convert_tokens (pandas/parser.c:10737)
File "pandas/parser.pyx", line 1130, in pandas.parser.TextReader._convert_with_dtype (pandas/parser.c:12141)
File "pandas/parser.pyx", line 1150, in pandas.parser.TextReader._string_convert (pandas/parser.c:12355)
File "pandas/parser.pyx", line 1382, in pandas.parser._string_box_utf8 (pandas/parser.c:17679)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc9 in position 4: unexpected end of data

In Python3, when pd.read_csv is passed a file path (as opposed to a file buffer) it decodes the contents with the utf-8 codec by default.1 It appears your CSV file is using a different encoding. Since it was generated by MS Excel, it might be cp-1252:
In [25]: print('\xc9'.decode('cp1252'))
É
In [27]: import unicodedata as UDAT
In [28]: UDAT.name('\xc9'.decode('cp1252'))
Out[28]: 'LATIN CAPITAL LETTER E WITH ACUTE'
The error message
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc9
says that '\xc9'.decode('utf-8') raises a UnicodeDecodeError.
The above shows byte 0xc9 can be decoded with cp1252. It remains to be seen if the rest of the file can also be decoded with cp1252, and if it produces the desired result.
Unfortunately, given only a file, there is no surefire way to tell what
encoding (if any) was used. It depends entirely on the program used to generate
the file.
If cp1252 is the right encoding, then to load the file into a DataFrame use
data = pd.read_csv('./2010-11.csv', encoding='cp1252')
1 When pd.read_csv is passed a buffer, the buffer could have been opened with encoding already set:
# Python3
with open('/tmp/test.csv', 'r', encoding='cp1252') as f:
df = pd.read_csv(f)
print(df)
in which case pd.read_csv will not attempt to decode since the buffer f is already supplying decoded strings.

Related

Trying to run Word Embeddings Benchmarks and getting UnicodeDecodeError

I am trying to run the Word Embeddings Benchmarks from this Github: Word Embeddings Benchmarks Github on word2vec embeddings I've created. I've included a picture of what my embedding file looks like.
I keep getting this error:
Traceback (most recent call last):
File "./evaluate_on_all.py", line 75, in <module>
load_kwargs=load_kwargs)
File "/home/groups/gdarmsta/word-embeddings-benchmarks-master/scripts/web/embeddings.py", line 39, in load_embedding
w = Embedding.from_word2vec(fname, binary=False)
File "/home/groups/gdarmsta/word-embeddings-benchmarks-master/scripts/web/embedding.py", line 482, in from_word2vec
words, vectors = Embedding._from_word2vec_text(fname)
File "/home/groups/gdarmsta/word-embeddings-benchmarks-master/scripts/web/embedding.py", line 340, in _from_word2vec_text
header = fin.readline()
File "/share/software/user/open/python/3.6.1/lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 16: invalid start byte
I just want to be able to get the benchmarks to work properly with my embeddings.
Results of hexdump of header:
It looks like you're getting the error reading the very-first header line of the file (which suggests it's not something like a challenging word later on):
https://github.com/kudkudak/word-embeddings-benchmarks/blob/2b56c401ea4bba335ebfc0c8c5c4f8ba6394f2cd/web/embedding.py#L340
Are you sure that you're specifying the right plain-text file?
Might the file have extra hidden characters at the beginning, like the 'Byte Order Mark'? (Looking at hexdump -C YOUR_FILE_NAME | head could give a clue.)

apache beam 2.7.0 craches in utf-8 decoding french characters

I am trying to write a csv from a bucket of google cloud platform into datastore, containing french characters/accents but I have an error message regarding decoding.
After trying encoding and decoding from "latin-1" to "utf-8" without success (using unicode, unicodedata and codecs) I tried to change things manually...
The Os I am using, has the "ascii" encoding by default and I manually changed in "Anaconda3/envs/py27/lib/site.py" into utf-8.
def setencoding():
"""Set the string encoding used by the Unicode implementation. The
default is 'ascii', but if you're willing to experiment, you can
change this."""
encoding = "utf-8" # Default value set by _PyUnicode_Init()
sys.setdefaultencoding("utf-8")
I've tried locally with a test file, by printing and then writing a string with accents into a file, and it worked!
string='naïve café'
test_decode=codecs.utf_8_decode(string, "strict", True)[0]
print(test_decode)
with open('./test.txt', 'w') as outfile:
outfile.write(test_decode)
But no luck with apache_beam...
Then I've tried to manually change "/usr/lib/python2.7/encodings/utf_8.py" and put "ignore" instead of "strict" into codecs.utf_8_decode
def decode(input, errors='ignore'):
return codecs.utf_8_decode(input, errors, True)
but I've realized that apache_beam do not use this file or at least does not take it into account any changes
Any ideas how to deal with it?
Please find below the error message
Traceback (most recent call last):
File "etablissementsFiness.py", line 146, in <module>
dataflow(run_locally)
File "etablissementsFiness.py", line 140, in dataflow
| 'Write entities into Datastore' >> WriteToDatastore(PROJECT)
File "C:\Users\Georges\Anaconda3\envs\py27\lib\site-packages\apache_beam\pipel
ine.py", line 414, in __exit__
self.run().wait_until_finish()
File "C:\Users\Georges\Anaconda3\envs\py27\lib\site-packages\apache_beam\runne
rs\dataflow\dataflow_runner.py", line 1148, in wait_until_finish
(self.state, getattr(self._runner, 'last_error_msg', None)), self)
apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow
pipeline failed. State: FAILED, Error:
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py",
line 642, in do_work
work_executor.execute()
File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", lin
e 156, in execute
op.start()
File "dataflow_worker/native_operations.py", line 38, in dataflow_worker.nativ
e_operations.NativeReadOperation.start
def start(self):
File "dataflow_worker/native_operations.py", line 39, in dataflow_worker.nativ
e_operations.NativeReadOperation.start
with self.scoped_start_state:
File "dataflow_worker/native_operations.py", line 44, in dataflow_worker.nativ
e_operations.NativeReadOperation.start
with self.spec.source.reader() as reader:
File "dataflow_worker/native_operations.py", line 48, in dataflow_worker.nativ
e_operations.NativeReadOperation.start
for value in reader:
File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/textio.py", line 2
01, in read_records
yield self._coder.decode(record)
File "/usr/local/lib/python2.7/dist-packages/apache_beam/coders/coders.py", li
ne 307, in decode
return value.decode('utf-8')
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 190: invalid continuation byte
Try to write a CustomCoder class and "ignore" any errors while decoding:
from apache_beam.coders.coders import Coder
class CustomCoder(Coder):
"""A custom coder used for reading and writing strings as UTF-8."""
def encode(self, value):
return value.encode("utf-8", "replace")
def decode(self, value):
return value.decode("utf-8", "ignore")
def is_deterministic(self):
return True
Then, read and write the files using the coder=CustomCoder():
lines = p | "Read" >> ReadFromText("files/path/*.txt", coder=CustomCoder())
# More processing code here...
output | WriteToText("output/file/path", file_name_suffix=".txt", coder=CustomCoder())
This error: "UnicodeDecodeError: 'utf8' codec can't decode byte"
means, that you CSV file still contains some wrong bytes not recognized by the decoder as UTF characters.
The easiest solution for this, is to convert and validate csv input file to not contain UTF8 errors before submitting for Datastore. Simple online UTF8 validation can check it.
If you need to convert latin-1 to UTF8 in python, you can do it like that:
string.decode('iso-8859-1').encode('utf8')

Unable to read csv in pandas with sys.argv[]

I read a csv file in pandas by specifying the name directly. It works without trouble. But when I try to do the same with sys.argv[] it throws an error. This is what I tried.
import sys
import pandas as pd
filename = sys.argv[-1]
data = pd.read_csv(filename, delimiter=',')
print data
print data[data['logFC'] >= 2]
The error it throws back is:
Traceback (most recent call last): File "/Users/filter_cutoff_genexdata.py", line 7, in <module>
data = pd.read_csv(filename, delimiter=',') File "/Library/Python/2.7/site-packages/pandas/io/parsers.py", line 498, in parser_f
return _read(filepath_or_buffer, kwds) File "/Library/Python/2.7/site-packages/pandas/io/parsers.py", line 285, in
_read
return parser.read() File "/Library/Python/2.7/site-packages/pandas/io/parsers.py", line 747, in read
ret = self._engine.read(nrows) File "/Library/Python/2.7/site-packages/pandas/io/parsers.py", line 1197, in read
data = self._reader.read(nrows) File "pandas/parser.pyx", line 766, in pandas.parser.TextReader.read (pandas/parser.c:7988) File "pandas/parser.pyx", line 788, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:8244) File "pandas/parser.pyx", line 842, in pandas.parser.TextReader._read_rows (pandas/parser.c:8970) File "pandas/parser.pyx", line 829, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:8838) File "pandas/parser.pyx", line 1833, in pandas.parser.raise_parser_error (pandas/parser.c:22649) pandas.parser.CParserError: Error tokenizing data. C error: Expected 1 fields in line 7, saw 3
Can anyone please tell why it doesn't work ?

NLTK python tokenizing a CSV file

I have began to experiment with Python and NLTK.
I am experiencing a lengthy error message which I cannot find a solution to and would appreciate any insights you may have.
import nltk,csv,numpy
from nltk import sent_tokenize, word_tokenize, pos_tag
reader = csv.reader(open('Medium_Edited.csv', 'rU'), delimiter= ",",quotechar='|')
tokenData = nltk.word_tokenize(reader)
I'm running Python 2.7 and the latest nltk package on OSX Yosemite.
These are also two lines of code I attempted with no difference in results:
with open("Medium_Edited.csv", "rU") as csvfile:
tokenData = nltk.word_tokenize(reader)
These are the error messages I see:
Traceback (most recent call last):
File "nltk_text.py", line 11, in <module>
tokenData = nltk.word_tokenize(reader)
File "/Library/Python/2.7/site-packages/nltk/tokenize/__init__.py", line 101, in word_tokenize
return [token for sent in sent_tokenize(text, language)
File "/Library/Python/2.7/site-packages/nltk/tokenize/__init__.py", line 86, in sent_tokenize
return tokenizer.tokenize(text)
File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 1226, in tokenize
return list(self.sentences_from_text(text, realign_boundaries))
File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 1274, in sentences_from_text
return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 1265, in span_tokenize
return [(sl.start, sl.stop) for sl in slices]
File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 1304, in _realign_boundaries
for sl1, sl2 in _pair_iter(slices):
File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 310, in _pair_iter
prev = next(it)
File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 1278, in _slices_from_text
for match in self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or buffer
Thanks in advance
As you can read in the Python csv documentation, csv.reader "returns a reader object which will iterate over lines in the given csvfile". In other words, if you want to tokenize the text in your csv file, you will have to go through the lines and the fields in those lines:
for line in reader:
for field in line:
tokens = word_tokenize(field)
Also, when you import word_tokenize at the beginning of your script, you should call it as word_tokenize, and not as nltk.word_tokenize. This also means you can drop the import nltk statement.
It is giving error - expected string or buffer because you have forgotten to add str as
tokenData = nltk.word_tokenize(str(reader))

reading csv file in python?

i am working on a machine learning project where i am supposed to read a csv file to build a linear regression model and here is i read the csv file
data_test = pd.read_csv("/media/halawa/93B77F681EC1B4D2/GUC/Semster 8/CSEN 1022 Machine Learning/2/test.csv",delimiter=",", header=0)
but when i run i got this error
/usr/bin/python2.7 /home/halawa/PycharmProjects/ML/evergreen.py
Traceback (most recent call last):
File "/home/halawa/PycharmProjects/ML/evergreen.py", line 24, in <module>
data_test = pd.read_csv("/media/halawa/93B77F681EC1B4D2/GUC/Semster 8/CSEN 1022 Machine Learning/2/test.csv",delimiter=",", header=0)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 470, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 256, in _read
return parser.read()
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 715, in read
ret = self._engine.read(nrows)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1164, in read
data = self._reader.read(nrows)
File "pandas/parser.pyx", line 758, in pandas.parser.TextReader.read (pandas/parser.c:7411)
File "pandas/parser.pyx", line 780, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7651)
File "pandas/parser.pyx", line 833, in pandas.parser.TextReader._read_rows (pandas/parser.c:8268)
File "pandas/parser.pyx", line 820, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:8142)
File "pandas/parser.pyx", line 1758, in pandas.parser.raise_parser_error (pandas/parser.c:20728)
pandas.parser.CParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 8
Process finished with exit code 1
Your issue is that your CSV doesn't have a consistent number of fields on each line. For example, it appears that the first line has 3 fields
x,y,z
While the third line has 8
x,y,z,a,b,c,d,e
You will need to fix your source CSV file to avoid this error.
Alternatively, if you know that you have 8 fields max, and are ok with some lines missing fields you can use names:
data_test = pd.read_csv("/media/halawa/93B77F681EC1B4D2/GUC/Semster 8/CSEN 1022 Machine Learning/2/test.csv",delimiter=",", header=0, names=list('abcdefgh'))
This parameter tells the CSV reader how many fields to expect, and the rest are filled in with a default value.
EDIT:
If your null columns are marked with a ? then you should set the pandas na_values parameter like so:
data_test = pd.read_csv("/media/halawa/93B77F681EC1B4D2/GUC/Semster 8/CSEN 1022 Machine Learning/2/test.csv",delimiter=",", header=0, na_values=['?'])