Reading a CSV file in Python? - python-2.7

I am working on a machine learning project where I am supposed to read a CSV file to build a linear regression model. Here is how I read the CSV file:
data_test = pd.read_csv("/media/halawa/93B77F681EC1B4D2/GUC/Semster 8/CSEN 1022 Machine Learning/2/test.csv",delimiter=",", header=0)
But when I run it, I get this error:
/usr/bin/python2.7 /home/halawa/PycharmProjects/ML/evergreen.py
Traceback (most recent call last):
File "/home/halawa/PycharmProjects/ML/evergreen.py", line 24, in <module>
data_test = pd.read_csv("/media/halawa/93B77F681EC1B4D2/GUC/Semster 8/CSEN 1022 Machine Learning/2/test.csv",delimiter=",", header=0)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 470, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 256, in _read
return parser.read()
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 715, in read
ret = self._engine.read(nrows)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1164, in read
data = self._reader.read(nrows)
File "pandas/parser.pyx", line 758, in pandas.parser.TextReader.read (pandas/parser.c:7411)
File "pandas/parser.pyx", line 780, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7651)
File "pandas/parser.pyx", line 833, in pandas.parser.TextReader._read_rows (pandas/parser.c:8268)
File "pandas/parser.pyx", line 820, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:8142)
File "pandas/parser.pyx", line 1758, in pandas.parser.raise_parser_error (pandas/parser.c:20728)
pandas.parser.CParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 8
Process finished with exit code 1

Your issue is that your CSV doesn't have a consistent number of fields on each line. For example, it appears that the first line has 3 fields
x,y,z
While the third line has 8
x,y,z,a,b,c,d,e
You will need to fix your source CSV file to avoid this error.
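To find out which lines need fixing, something like the following may help. It is only a minimal sketch using the standard csv module; the helper name field_count_report and the sample data are made up for illustration, and it counts parsed rows, which can differ from physical line numbers if a quoted field spans several lines.

```python
import csv
import io
from collections import Counter


def field_count_report(lines, delimiter=","):
    """Return (most common field count, [(row_number, field_count), ...] for rows that differ)."""
    counts = [
        (number, len(row))
        for number, row in enumerate(csv.reader(lines, delimiter=delimiter), start=1)
    ]
    if not counts:
        return 0, []
    # Treat the most frequent field count as the expected one.
    expected = Counter(count for _, count in counts).most_common(1)[0][0]
    return expected, [(number, count) for number, count in counts if count != expected]


# Made-up sample that mirrors the error: row 3 has 8 fields instead of 3.
sample = io.StringIO("x,y,z\n1,2,3\n1,2,3,4,5,6,7,8\n4,5,6\n")
expected, odd_rows = field_count_report(sample)
print(expected, odd_rows)
```

For a real file, pass an open file object instead of the StringIO, e.g. open(path, newline="").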
Alternatively, if you know that you have 8 fields max, and are ok with some lines missing fields you can use names:
data_test = pd.read_csv("/media/halawa/93B77F681EC1B4D2/GUC/Semster 8/CSEN 1022 Machine Learning/2/test.csv",delimiter=",", header=0, names=list('abcdefgh'))
This parameter tells the CSV reader how many fields to expect; on lines with fewer fields, the missing trailing values are filled in with NaN.
EDIT:
If your null columns are marked with a ? then you should set the pandas na_values parameter like so:
data_test = pd.read_csv("/media/halawa/93B77F681EC1B4D2/GUC/Semster 8/CSEN 1022 Machine Learning/2/test.csv",delimiter=",", header=0, na_values=['?'])

Related

UnicodeEncodeError while transferring ".eml" file to Google Cloud Platform (gsutil v4.6.1 on Linux)

While transferring files from a Linux system to Google Cloud Platform with the gsutil cp command, the transfer fails on some old ".eml" files when gsutil tries to process their content (not just the file name!), which contains non-English characters that are not encoded in Unicode.
The command attempted was:
gsutil cp "/home/darsenlu/Home/mail/Pan/Fw_ japanese_lyrics.eml" gs://darsen_backup_monthly/
The error message was:
UnicodeEncodeError: 'ascii' codec can't encode character '\udca8' in position 22881: ordinal not in range(128)
gsutil rsync gives a very similar error. Position 22881 (0x5961) turns out to be towards the end of the multi-part e-mail source file. The following shows the hex-dumped file content:
00005960: 20a8 43a4 d1b3 a320 5961 686f 6f21 a95f .C.... Yahoo!._
00005970: bcaf 203e 2020 7777 772e 7961 686f 6f2e .. > www.yahoo.
00005980: 636f 6d2e 7477 0d0a com.tw..
We see byte "0xa8" at position 0x5961, which was the source of the problem as indicated by the error message. For some reason gsutil was trying to encode the text. When opening the file in a terminal that supports Chinese characters, we see this:
< 每天都 Yahoo!奇摩 > www.yahoo.com.tw
The first Chinese character "每" is 0xa843 when encoded in Big-5. A simple workaround was to rename the file extension to something other than ".eml", such as ".eml.bak", so that gsutil does not process the file content. Unfortunately, it is difficult to know in advance which files contain such non-English characters when doing a bulk transfer, so the whole process can be stopped multiple times.
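One way to get a list of candidate files before starting a bulk transfer is to scan for non-ASCII bytes. This is only a rough sketch: the function name find_non_ascii_files is made up, it reads each file fully into memory, and it may over-report, since it flags any byte >= 0x80 regardless of whether that file would actually trip gsutil.

```python
import os
import tempfile


def find_non_ascii_files(root, suffix=".eml"):
    """Yield (path, offset) for each matching file that contains a byte >= 0x80."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(suffix):
                continue
            path = os.path.join(dirpath, name)
            with open(path, "rb") as handle:
                data = handle.read()
            # Offset of the first non-ASCII byte, or None if the file is pure ASCII.
            offset = next((i for i, byte in enumerate(data) if byte >= 0x80), None)
            if offset is not None:
                yield path, offset


# Demo on two throwaway files; the second contains Big-5 bytes like the ones in the hex dump.
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "plain.eml"), "wb") as f:
        f.write(b"Subject: hello\r\n")
    with open(os.path.join(root, "big5.eml"), "wb") as f:
        f.write(b"< \xa8C\xa4\xd1\xb3\xa3 Yahoo! >\r\n")
    hits = [(os.path.basename(p), off) for p, off in find_non_ascii_files(root)]

print(hits)
```

The reported files could then be renamed (as in the workaround above) or handled separately before running the transfer.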
The following is the full error message:
darsenlu@devmodel:~/Home$ gsutil cp "/home/darsenlu/Home/mail/Pan/Fw_ japanese_lyrics.eml" gs://darsen_backup_monthly/
Copying file:///home/darsenlu/Home/mail/Pan/Fw_ japanese_lyrics.eml [Content-Type=message/rfc822]...
Traceback (most recent call last):
File "/usr/lib/google-cloud-sdk/platform/gsutil/gsutil", line 21, in <module>
gsutil.RunMain()
File "/usr/lib/google-cloud-sdk/platform/gsutil/gsutil.py", line 122, in RunMain
sys.exit(gslib.__main__.main())
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/__main__.py", line 444, in main
user_project=user_project)
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/__main__.py", line 780, in _RunNamedCommandAndHandleExceptions
_HandleUnknownFailure(e)
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/__main__.py", line 639, in _RunNamedCommandAndHandleExceptions
user_project=user_project)
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/command_runner.py", line 411, in RunNamedCommand
return_code = command_inst.RunCommand()
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/commands/cp.py", line 1124, in RunCommand
seek_ahead_iterator=seek_ahead_iterator)
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/command.py", line 1525, in Apply
arg_checker, should_return_results, fail_on_error)
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/command.py", line 1596, in _SequentialApply
worker_thread.PerformTask(task, self)
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/command.py", line 2316, in PerformTask
results = task.func(cls, task.args, thread_state=self.thread_gsutil_api)
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/commands/cp.py", line 709, in _CopyFuncWrapper
preserve_posix=cls.preserve_posix_attrs)
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/commands/cp.py", line 924, in CopyFunc
preserve_posix=preserve_posix)
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/utils/copy_helper.py", line 3957, in PerformCopy
gzip_encoded=gzip_encoded)
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/utils/copy_helper.py", line 2250, in _UploadFileToObject
parallel_composite_upload, logger)
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/utils/copy_helper.py", line 2066, in _DelegateUploadFileToObject
elapsed_time, uploaded_object = upload_delegate()
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/utils/copy_helper.py", line 2227, in CallNonResumableUpload
gzip_encoded=gzip_encoded_file)
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/utils/copy_helper.py", line 1762, in _UploadFileToObjectNonResumable
gzip_encoded=gzip_encoded)
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/cloud_api_delegator.py", line 388, in UploadObject
gzip_encoded=gzip_encoded)
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/gcs_json_api.py", line 1712, in UploadObject
gzip_encoded=gzip_encoded)
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/gcs_json_api.py", line 1534, in _UploadObject
global_params=global_params)
File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/third_party/storage_apitools/storage_v1_client.py", line 1182, in Insert
upload=upload, upload_config=upload_config)
File "/usr/lib/google-cloud-sdk/platform/gsutil/third_party/apitools/apitools/base/py/base_api.py", line 703, in _RunMethod
download)
File "/usr/lib/google-cloud-sdk/platform/gsutil/third_party/apitools/apitools/base/py/base_api.py", line 679, in PrepareHttpRequest
upload.ConfigureRequest(upload_config, http_request, url_builder)
File "/usr/lib/google-cloud-sdk/platform/gsutil/third_party/apitools/apitools/base/py/transfer.py", line 763, in ConfigureRequest
self.__ConfigureMultipartRequest(http_request)
File "/usr/lib/google-cloud-sdk/platform/gsutil/third_party/apitools/apitools/base/py/transfer.py", line 823, in __ConfigureMultipartRequest
g.flatten(msg_root, unixfrom=False)
File "/usr/lib/python3.6/email/generator.py", line 116, in flatten
self._write(msg)
File "/usr/lib/python3.6/email/generator.py", line 181, in _write
self._dispatch(msg)
File "/usr/lib/python3.6/email/generator.py", line 214, in _dispatch
meth(msg)
File "/usr/lib/python3.6/email/generator.py", line 272, in _handle_multipart
g.flatten(part, unixfrom=False, linesep=self._NL)
File "/usr/lib/python3.6/email/generator.py", line 116, in flatten
self._write(msg)
File "/usr/lib/python3.6/email/generator.py", line 181, in _write
self._dispatch(msg)
File "/usr/lib/python3.6/email/generator.py", line 214, in _dispatch
meth(msg)
File "/usr/lib/python3.6/email/generator.py", line 361, in _handle_message
payload = self._encode(payload)
File "/usr/lib/python3.6/email/generator.py", line 412, in _encode
return s.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode character '\udca8' in position 22881: ordinal not in range(128)
The Linux system is Ubuntu 18.04.4 LTS (GNU/Linux 4.15.0-76-generic x86_64).
I took your string with Chinese characters and was able to reproduce your error. I fixed it after updating to gsutil 4.62. Here's the merged PR and issue tracker as reference.
Update Cloud SDK by running:
gcloud components update
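For anyone wondering where '\udca8' in the error comes from: it is how Python's surrogateescape error handler represents the raw byte 0xa8. The traceback ends in email/generator.py calling s.encode('ascii'), so presumably the payload had been decoded with that handler earlier on. A small sketch that reproduces the same kind of failure with the bytes from the hex dump (this is an illustration, not gsutil's actual code path):

```python
# Bytes taken from offset 0x5961 of the hex dump in the question.
raw = b"\xa8C\xa4\xd1\xb3\xa3"

# Decoding as ASCII with surrogateescape keeps each undecodable byte as a lone surrogate.
escaped = raw.decode("ascii", errors="surrogateescape")
print(ascii(escaped[0]))  # the byte 0xa8 becomes U+DCA8

# A plain ASCII encode of that string fails, like the last frame of the traceback.
try:
    escaped.encode("ascii")
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    print(exc.reason)

# Encoding with the same handler restores the original bytes unchanged.
round_trip = escaped.encode("ascii", errors="surrogateescape")

# Decoding with the real encoding (Big-5) gives the characters seen in the terminal.
text = raw.decode("big5")
```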

How to fix 'ORA-19011' error in Python cx_Oracle

Hi, I want to query XML data from an Oracle DB with cx_Oracle, but it fails with an ORA-19011 error message. I think the queried data is larger than the string buffer, but I don't know how to solve this problem.
The Oracle DB is an external DB, not my own, so I can't access it directly. Therefore, I want to fix the problem in my code and print the query data in the Python terminal.
(my software version)
windows 10 64bit
python 2.7 64bit
oracle-instant client 19.3 64bit
cx_oracle 7.2.2
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import cx_Oracle
import sys
import csv
import codecs
printHeader = True
conn = cx_Oracle.connect('id/passwd@ip:port/orcl')
print(conn.version)
curs = conn.cursor()
curs.execute('SELECT * FROM tablename')
for record in curs:
    print(record)
The error occurred at line 18 (for record in curs), and here are the error messages:
11.2.0.4.0
We've got an error while stopping in unhandled exception: <class 'cx_Oracle.DatabaseError'>.
Traceback (most recent call last):
File "c:\Users\goo41\.vscode\extensions\ms-python.python-2019.8.30787\pythonFiles\lib\python\ptvsd\_vendored\pydevd\pydevd.py", line 1740, in do_stop_on_unhandled_exception
self.do_wait_suspend(thread, frame, 'exception', arg, is_unhandled_exception=True)
File "c:\Users\goo41\.vscode\extensions\ms-python.python-2019.8.30787\pythonFiles\lib\python\ptvsd\_vendored\pydevd\pydevd.py", line 1615, in do_wait_suspend
with self._threads_suspended_single_notification.notify_thread_suspended(thread_id, stop_reason):
File "C:\Python27\lib\contextlib.py", line 17, in __enter__
return self.gen.next()
File "c:\Users\goo41\.vscode\extensions\ms-python.python-2019.8.30787\pythonFiles\lib\python\ptvsd\_vendored\pydevd\pydevd.py", line 360, in notify_thread_suspended
with AbstractSingleNotificationBehavior.notify_thread_suspended(self, thread_id, stop_reason):
File "C:\Python27\lib\contextlib.py", line 17, in __enter__
return self.gen.next()
File "c:\Users\goo41\.vscode\extensions\ms-python.python-2019.8.30787\pythonFiles\lib\python\ptvsd\_vendored\pydevd\pydevd.py", line 308, in notify_thread_suspended
self.send_suspend_notification(thread_id, stop_reason)
File "c:\Users\goo41\.vscode\extensions\ms-python.python-2019.8.30787\pythonFiles\lib\python\ptvsd\_vendored\pydevd\pydevd.py", line 354, in send_suspend_notification
py_db.writer.add_command(py_db.cmd_factory.make_thread_suspend_single_notification(py_db, thread_id, stop_reason))
File "c:\Users\goo41\.vscode\extensions\ms-python.python-2019.8.30787\pythonFiles\lib\python\ptvsd\_vendored\pydevd\_pydevd_bundle\pydevd_net_command_factory_json.py", line 309, in make_thread_suspend_single_notification
return NetCommand(CMD_THREAD_SUSPEND_SINGLE_NOTIFICATION, 0, event, is_json=True)
File "c:\Users\goo41\.vscode\extensions\ms-python.python-2019.8.30787\pythonFiles\lib\python\ptvsd\_vendored\pydevd\_pydevd_bundle\pydevd_net_command.py", line 57, in __init__
text = json.dumps(as_dict)
File "C:\Python27\lib\json\__init__.py", line 244, in dumps
return _default_encoder.encode(obj)
File "C:\Python27\lib\json\encoder.py", line 207, in encode
chunks = self.iterencode(o, _one_shot=True)
File "C:\Python27\lib\json\encoder.py", line 270, in iterencode
return _iterencode(o, 0)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb9 in position 11: invalid start byte
Traceback (most recent call last):
File "c:\Users\goo41\.vscode\extensions\ms-python.python-2019.8.30787\pythonFiles\ptvsd_launcher.py", line 43, in <module>
main(ptvsdArgs)
File "c:\Users\goo41\.vscode\extensions\ms-python.python-2019.8.30787\pythonFiles\lib\python\ptvsd\__main__.py", line 432, in main
run()
File "c:\Users\goo41\.vscode\extensions\ms-python.python-2019.8.30787\pythonFiles\lib\python\ptvsd\__main__.py", line 316, in run_file
runpy.run_path(target, run_name='__main__')
File "C:\Python27\lib\runpy.py", line 252, in run_path
return _run_module_code(code, init_globals, run_name, path_name)
File "C:\Python27\lib\runpy.py", line 82, in _run_module_code
mod_name, mod_fname, mod_loader, pkg_name)
File "C:\Python27\lib\runpy.py", line 72, in _run_code
exec code in run_globals
File "c:\PythonWorkspace\oraclePrc\test1.py", line 18, in <module>
for record in curs:
cx_Oracle.DatabaseError: ORA-19011: Character string buffer too small
When you connect to the database, try using this code instead:
conn = cx_Oracle.connect('id/passwd@ip:port/orcl', encoding="UTF-8", nencoding="UTF-8")
This ensures that the connection uses a universal encoding, which may eliminate the first error (the UnicodeDecodeError) and possibly the second (ORA-19011) as well. If not, update the code sample and error messages in the question.
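Separately from the encoding suggestion, ORA-19011 is commonly reported when an XMLType column is selected directly and its text does not fit into a VARCHAR2 buffer. If that is the case here (an assumption, since the table definition is not shown), a workaround that is often suggested is to ask Oracle for a CLOB via getClobVal(). A minimal sketch, where fetch_xml_as_text, xml_col and the table name are hypothetical and the identifiers must come from trusted input because they are formatted into the SQL text:

```python
def fetch_xml_as_text(cursor, table, xml_column):
    """Select an XMLType column as a CLOB and return each value as a Python string."""
    # The table alias (t) is needed to call a method on an object column.
    query = "SELECT t.{col}.getClobVal() FROM {tab} t".format(col=xml_column, tab=table)
    cursor.execute(query)
    results = []
    for (value,) in cursor:
        # cx_Oracle hands CLOBs back as LOB objects with read(); NULL arrives as None.
        results.append(value.read() if hasattr(value, "read") else value)
    return results
```

With the connection from the question this would be called as fetch_xml_as_text(conn.cursor(), 'tablename', 'xml_col'), replacing xml_col with the real column name.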

Unable to read csv in pandas with sys.argv[]

I read a csv file in pandas by specifying the name directly. It works without trouble. But when I try to do the same with sys.argv[] it throws an error. This is what I tried.
import sys
import pandas as pd
filename = sys.argv[-1]
data = pd.read_csv(filename, delimiter=',')
print data
print data[data['logFC'] >= 2]
The error it throws back is:
Traceback (most recent call last):
File "/Users/filter_cutoff_genexdata.py", line 7, in <module>
data = pd.read_csv(filename, delimiter=',')
File "/Library/Python/2.7/site-packages/pandas/io/parsers.py", line 498, in parser_f
return _read(filepath_or_buffer, kwds)
File "/Library/Python/2.7/site-packages/pandas/io/parsers.py", line 285, in _read
return parser.read()
File "/Library/Python/2.7/site-packages/pandas/io/parsers.py", line 747, in read
ret = self._engine.read(nrows)
File "/Library/Python/2.7/site-packages/pandas/io/parsers.py", line 1197, in read
data = self._reader.read(nrows)
File "pandas/parser.pyx", line 766, in pandas.parser.TextReader.read (pandas/parser.c:7988)
File "pandas/parser.pyx", line 788, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:8244)
File "pandas/parser.pyx", line 842, in pandas.parser.TextReader._read_rows (pandas/parser.c:8970)
File "pandas/parser.pyx", line 829, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:8838)
File "pandas/parser.pyx", line 1833, in pandas.parser.raise_parser_error (pandas/parser.c:22649)
pandas.parser.CParserError: Error tokenizing data. C error: Expected 1 fields in line 7, saw 3
Can anyone please tell me why it doesn't work?

pandas reading .csv files

I have a small script that uses pandas to read and print a .csv file generated from MS Excel.
import pandas as pd
data = pd.read_csv('./2010-11.csv')
print(data)
This script runs in Python 2.7.8, but in Python 3.4.1 it gives the following error. Any ideas why this might be so? Thanks in advance for any help with this.
Traceback (most recent call last):
File "proc_csv_0-0.py", line 3, in <module>
data = pd.read_csv('./2010-11.csv')
File "/usr/lib64/python3.4/site-packages/pandas/io/parsers.py", line 474, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/lib64/python3.4/site-packages/pandas/io/parsers.py", line 260, in _read
return parser.read()
File "/usr/lib64/python3.4/site-packages/pandas/io/parsers.py", line 721, in read
ret = self._engine.read(nrows)
File "/usr/lib64/python3.4/site-packages/pandas/io/parsers.py", line 1170, in read
data = self._reader.read(nrows)
File "pandas/parser.pyx", line 769, in pandas.parser.TextReader.read (pandas/parser.c:7566)
File "pandas/parser.pyx", line 791, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7806)
File "pandas/parser.pyx", line 866, in pandas.parser.TextReader._read_rows (pandas/parser.c:8639)
File "pandas/parser.pyx", line 973, in pandas.parser.TextReader._convert_column_data (pandas/parser.c:9950)
File "pandas/parser.pyx", line 1033, in pandas.parser.TextReader._convert_tokens (pandas/parser.c:10737)
File "pandas/parser.pyx", line 1130, in pandas.parser.TextReader._convert_with_dtype (pandas/parser.c:12141)
File "pandas/parser.pyx", line 1150, in pandas.parser.TextReader._string_convert (pandas/parser.c:12355)
File "pandas/parser.pyx", line 1382, in pandas.parser._string_box_utf8 (pandas/parser.c:17679)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc9 in position 4: unexpected end of data
In Python 3, when pd.read_csv is passed a file path (as opposed to a file buffer), it decodes the contents with the utf-8 codec by default.[1] It appears your CSV file is using a different encoding. Since it was generated by MS Excel, it might be cp1252:
In [25]: print(b'\xc9'.decode('cp1252'))
É
In [27]: import unicodedata as UDAT
In [28]: UDAT.name(b'\xc9'.decode('cp1252'))
Out[28]: 'LATIN CAPITAL LETTER E WITH ACUTE'
The error message
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc9
says that b'\xc9'.decode('utf-8') raises a UnicodeDecodeError.
The above shows byte 0xc9 can be decoded with cp1252. It remains to be seen if the rest of the file can also be decoded with cp1252, and if it produces the desired result.
Unfortunately, given only a file, there is no surefire way to tell what
encoding (if any) was used. It depends entirely on the program used to generate
the file.
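Since the encoding cannot be detected reliably, one pragmatic option is to try a short list of likely candidates in order. The sketch below is only that: the helper name first_decodable and the candidate list are assumptions, and a decode that succeeds does not prove the encoding is the right one (latin-1, for instance, accepts any byte sequence), so the result still needs to be checked by eye.

```python
def first_decodable(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Return (encoding, text) for the first candidate that decodes raw without error."""
    for encoding in candidates:
        try:
            return encoding, raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Made-up sample: a capital E with acute accent written by a cp1252 program is byte 0xc9.
sample = "\u00c9cole,10\n".encode("cp1252")
encoding, text = first_decodable(sample)
print(encoding)
```

For a real file, read the bytes with open(path, 'rb').read() and pass the encoding that worked to pd.read_csv(..., encoding=encoding).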
If cp1252 is the right encoding, then to load the file into a DataFrame use
data = pd.read_csv('./2010-11.csv', encoding='cp1252')
[1] When pd.read_csv is passed a buffer, the buffer could have been opened with encoding already set:
# Python 3
import pandas as pd

with open('/tmp/test.csv', 'r', encoding='cp1252') as f:
    df = pd.read_csv(f)
    print(df)
in which case pd.read_csv will not attempt to decode since the buffer f is already supplying decoded strings.

Using a random forest to classify reviews, but getting a KeyError?

I have the following code in Python:
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators = 100)
forest = forest.fit( train_data_features, train["sentiment"] )
but I get a KeyError for "sentiment" and I don't know why. This is how train is loaded:
train = pd.read_csv("labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/frame.py", line 1780, in __getitem__
return self._getitem_column(key)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/frame.py", line 1787, in _getitem_column
return self._get_item_cache(key)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/generic.py", line 1068, in _get_item_cache
values = self._data.get(item)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/internals.py", line 2849, in get
loc = self.items.get_loc(item)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/index.py", line 1402, in get_loc
return self._engine.get_loc(_values_from_object(key))
File "pandas/index.pyx", line 134, in pandas.index.IndexEngine.get_loc (pandas/index.c:3807)
File "pandas/index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas/index.c:3687)
File "pandas/hashtable.pyx", line 696, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12310)
File "pandas/hashtable.pyx", line 704, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12261)
KeyError: 'sentiment'
Are you doing the Kaggle competition? https://www.kaggle.com/c/word2vec-nlp-tutorial/data
Are you sure you have downloaded and decompressed the file ok? The first part of the file reads:
id sentiment review
"5814_8" 1 "With all this stuff go
This works for me:
>>> train = pd.read_csv("labeledTrainData.tsv", delimiter="\t")
>>> train.columns
Index([u'id', u'sentiment', u'review'], dtype='object')
>>> train.head(3)
       id  sentiment                                             review
0  5814_8          1  With all this stuff going down at the moment w...
1  2381_9          1  \The Classic War of the Worlds\" by Timothy Hi...
2  7759_3          0  The film starts with a manager (Nicholas Bell)...
You should check that the columns are set up correctly in the train variable. You should have a sentiment column. That column seems to be missing in your dataframe.
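A quick way to check the file itself, independent of pandas, is to look at its first row. This is a minimal sketch with the standard csv module; header_columns is a made-up helper and the sample text just imitates the first lines quoted above. Note that quoting=3 in the question is the same value as csv.QUOTE_NONE.

```python
import csv
import io


def header_columns(lines, delimiter="\t"):
    """Return the stripped column names found in the first row."""
    reader = csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    first_row = next(reader, [])
    return [name.strip() for name in first_row]


# Made-up sample imitating the start of labeledTrainData.tsv.
sample = io.StringIO('id\tsentiment\treview\n"5814_8"\t1\t"With all this stuff"\n')
columns = header_columns(sample)
print(columns)
print("sentiment" in columns)
```

If the real file (opened with open('labeledTrainData.tsv', newline='')) does not list sentiment here, the download or decompression is the likely culprit rather than the random forest code.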