NLTK python tokenizing a CSV file - python-2.7

I have begun to experiment with Python and NLTK.
I am getting a lengthy error message that I cannot find a solution to, and I would appreciate any insights you may have.
import nltk,csv,numpy
from nltk import sent_tokenize, word_tokenize, pos_tag
reader = csv.reader(open('Medium_Edited.csv', 'rU'), delimiter= ",",quotechar='|')
tokenData = nltk.word_tokenize(reader)
I'm running Python 2.7 and the latest nltk package on OSX Yosemite.
These are also two lines of code I attempted with no difference in results:
with open("Medium_Edited.csv", "rU") as csvfile:
    tokenData = nltk.word_tokenize(reader)
These are the error messages I see:
Traceback (most recent call last):
File "nltk_text.py", line 11, in <module>
tokenData = nltk.word_tokenize(reader)
File "/Library/Python/2.7/site-packages/nltk/tokenize/__init__.py", line 101, in word_tokenize
return [token for sent in sent_tokenize(text, language)
File "/Library/Python/2.7/site-packages/nltk/tokenize/__init__.py", line 86, in sent_tokenize
return tokenizer.tokenize(text)
File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 1226, in tokenize
return list(self.sentences_from_text(text, realign_boundaries))
File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 1274, in sentences_from_text
return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 1265, in span_tokenize
return [(sl.start, sl.stop) for sl in slices]
File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 1304, in _realign_boundaries
for sl1, sl2 in _pair_iter(slices):
File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 310, in _pair_iter
prev = next(it)
File "/Library/Python/2.7/site-packages/nltk/tokenize/punkt.py", line 1278, in _slices_from_text
for match in self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or buffer
Thanks in advance

As you can read in the Python csv documentation, csv.reader "returns a reader object which will iterate over lines in the given csvfile". In other words, if you want to tokenize the text in your csv file, you will have to go through the lines and the fields in those lines:
for line in reader:
    for field in line:
        tokens = word_tokenize(field)
Also, when you import word_tokenize at the beginning of your script, you should call it as word_tokenize, and not as nltk.word_tokenize. This also means you can drop the import nltk statement.
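A minimal sketch of that loop, written for Python 3 and using an in-memory sample instead of Medium_Edited.csv. The helper name tokenize_fields and the sample rows are made up for illustration, and str.split stands in as a placeholder tokenizer so the snippet runs without NLTK data; passing word_tokenize as the tokenize argument should give NLTK's tokens instead.

```python
import csv
import io


def tokenize_fields(lines, tokenize=str.split):
    """Tokenize every field of every CSV row and return one flat token list.

    `tokenize` is any function mapping a string to a list of tokens,
    for example nltk.word_tokenize (str.split is only a placeholder).
    """
    tokens = []
    for row in csv.reader(lines):   # each row is a list of field strings
        for field in row:           # tokenize strings, never the reader itself
            tokens.extend(tokenize(field))
    return tokens


# Hypothetical sample data standing in for the real file.
sample = io.StringIO(u"id,text\n1,Hello world\n2,NLTK reads strings\n")
tokens = tokenize_fields(sample)
print(tokens)
```

The point of the sketch is only that the tokenizer receives one string at a time; how the tokens are grouped (per field, per row, or flat) is a design choice left open here.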

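The error "expected string or buffer" is raised because word_tokenize expects a string and reader is a csv.reader object. Wrapping it in str makes the error go away:
tokenData = nltk.word_tokenize(str(reader))
but this tokenizes the string representation of the reader object, not the contents of the file, so you still need to iterate over the rows and fields.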

Related

Read attribute data from an xml in python

I am trying to read data from an XML file at a URL using the requests module in Python:
import requests
from requests.auth import HTTPBasicAuth
import xml.etree.ElementTree as et
url ="https://sample.com/simple.xml"
response = requests.get(url,auth=HTTPBasicAuth(username,password))
xml_data = et.fromstring(response.text)
The error I am getting is:
Traceback (most recent call last):
File "C:\Python27\myfolder\Artifactory.py", line 156, in <module>
xml_data = et.fromstring(xml_response.text)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1311, in XML
parser.feed(text)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1657, in feed
self._parser.Parse(data, 0)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 8419: ordinal not in range(128)
So I changed the code to xml_data = et.parse(response.text)
and then the error is:
Traceback (most recent call last):
File "C:\Python27\myfolder\Artifactory.py", line 156, in <module>
xml_data = et.parse(xml_response.text)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1182, in parse
tree.parse(source, parser)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 647, in parse
source = open(source, "rb")
IOError: [Errno 2] No such file or directory: u'<?xml version="1.0" encoding="utf-8"?>
After this error, the XML data is printed.
Please help me with this issue.
et.parse requires a file path or file object, not the contents.
For et.fromstring, you need to encode the response as UTF-8 first:
xml_data = et.fromstring(response.text.encode('utf-8'))
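As a rough illustration, assuming the response body is UTF-8: the string below is a made-up stand-in for response.text and contains the same character (U+00AE) that triggered the error. Encoding it to bytes before parsing lets the parser handle the decoding itself.

```python
import xml.etree.ElementTree as et

# Hypothetical stand-in for response.text: decoded text with a non-ASCII character.
xml_text = u'<?xml version="1.0" encoding="utf-8"?><items><item name="Acme\u00ae"/></items>'

# Hand the parser UTF-8 bytes instead of a decoded string.
root = et.fromstring(xml_text.encode('utf-8'))

# Read an attribute from every <item> element.
names = [item.get('name') for item in root.iter('item')]
print(names)
```

With requests, response.content already holds the raw bytes, so et.fromstring(response.content) may be a simpler route when the declared XML encoding matches the bytes.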
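The first attempt looks like an encoding issue in Python 2.
Try adding this to your code between your last import and your url variable. Note that it only works in Python 2, changes the default encoding for the whole process, and is generally discouraged in favour of encoding explicitly: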
import sys
reload(sys)
sys.setdefaultencoding("utf8")

Preparing data for TfidfVectorizer use (scikitlearn)

I am trying to use TfidfVectorizer from sklearn. I am having trouble, probably because my input does not match what TfidfVectorizer expects. I have a bunch of JSON documents that I loaded and appended to a list, and I now want that list to be the corpus for TfidfVectorizer.
The code:
import json
import pandas
from sklearn.feature_extraction.text import TfidfVectorizer
train=pandas.read_csv("train.tsv", sep='\t')
documents=[]
for i,row in train.iterrows():
    data = json.loads(row['boilerplate'].lower())
    documents.append(data['body'])
vectorizer=TfidfVectorizer(min_df=1)
X = vectorizer.fit_transform(documents)
idf = vectorizer.idf_
print dict(zip(vectorizer.get_feature_names(), idf))
I am getting the following error:
Traceback (most recent call last):
File "<ipython-input-56-94a6b95b0745>", line 1, in <module>
runfile('C:/Users/Guinea Pig/Downloads/try.py', wdir='C:/Users/Guinea Pig/Downloads')
File "D:\Anaconda\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 585, in runfile
execfile(filename, namespace)
File "C:/Users/Guinea Pig/Downloads/try.py", line 19, in <module>
X = vectorizer.fit_transform(documents)
File "D:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 1219, in fit_transform
X = super(TfidfVectorizer, self).fit_transform(raw_documents)
File "D:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 780, in fit_transform
vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
File "D:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 715, in _count_vocab
for feature in analyze(doc):
File "D:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 229, in <lambda>
tokenize(preprocess(self.decode(doc))), stop_words)
File "D:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 195, in <lambda>
return lambda x: strip_accents(x.lower())
AttributeError: 'NoneType' object has no attribute 'lower'
I understand that the documents list consists of Unicode objects rather than string objects, but I can't seem to solve this issue. Any ideas?
Eventually I added the following:
str_docs=[]
for item in documents:
    str_docs.append(item.encode('utf-8'))
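The AttributeError in the traceback points at a None entry in documents rather than at Unicode versus str, which would happen if some record has a null or missing body. A possible guard, sketched in Python 3 with made-up JSON rows and a hypothetical helper name extract_bodies, replaces such entries with an empty string before the list reaches fit_transform:

```python
import json


def extract_bodies(json_rows):
    """Return the lowercased 'body' of each JSON record.

    Records whose body is null or missing yield an empty string, so a
    vectorizer never receives None. Assumes body is a string when present.
    """
    documents = []
    for raw in json_rows:
        data = json.loads(raw)
        body = data.get('body')  # None for a JSON null or an absent key
        documents.append(body.lower() if body is not None else u'')
    return documents


# Hypothetical rows standing in for train['boilerplate'].
rows = [
    '{"title": "a", "body": "Some Text"}',
    '{"title": "b", "body": null}',
    '{"title": "c"}',
]
documents = extract_bodies(rows)
print(documents)
```

Whether empty strings or dropped rows are the better choice depends on whether the row order must stay aligned with the labels; keeping placeholders preserves the alignment.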

Python, Networked programs from a book by Dr Charles Severance

I am a beginner in programming and started with Python. I am learning from Dr Charles Severance's materials. His book has this example:
import urllib
fhand = urllib.urlopen('http://www.py4inf.com/code/rom...
for line in fhand:
print line.strip()
When I copy and paste it into Python 2 (I use PyCharm 5.0.4), this appears:
Traceback (most recent call last):
File "D:/Python4yk/temprehg111.py", line 2, in <module>
fhand = urllib.urlopen('http://www.py4inf.com/code/rom...
File "C:\Python27\lib\urllib.py", line 87, in urlopen
return opener.open(url)
File "C:\Python27\lib\urllib.py", line 208, in open
return getattr(self, name)(url)
File "C:\Python27\lib\urllib.py", line 292, in open_http
import httplib
File "C:\Python27\lib\httplib.py", line 79, in <module>
import mimetools
File "C:\Python27\lib\mimetools.py", line 6, in <module>
import tempfile
File "C:\Python27\lib\tempfile.py", line 35, in <module>
from random import Random as _Random
File "random.py", line 3, in <module>
integers
NameError: name 'line' is not defined
When I type in another example, I get an error as well. What is wrong? I didn't even write a program; I just copied and pasted an example. I asked Dr Chuck but have had no answer yet.
Try this:
import urllib
fhand = urllib.urlopen('http://www.py4inf.com')
for line in fhand:
    print line.strip() # notice the indentation
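Separately from indentation, the last frame of the traceback is File "random.py" with no directory, while every other frame is under C:\Python27\lib. That suggests a file named random.py in the project folder is being imported in place of the standard library module. One way to check, sketched here with variable names chosen for illustration, is to compare where random was loaded from with the standard library directory:

```python
import os
import random

# Directory that holds the standard library (os.py lives there).
stdlib_dir = os.path.dirname(os.path.abspath(os.__file__))

# File the interpreter actually loaded for `import random`.
module_path = os.path.abspath(random.__file__)

# True when random came from somewhere other than the standard library.
is_shadowed = os.path.dirname(module_path) != stdlib_dir
print(module_path, is_shadowed)
```

If is_shadowed turns out to be True, renaming the local random.py (and deleting any leftover random.pyc) would be the likely fix.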

pandas reading .csv files

I have a small script that uses pandas to read and print a .csv file generated by MS Excel.
import pandas as pd
data = pd.read_csv('./2010-11.csv')
print(data)
This script runs in Python 2.7.8, but in Python 3.4.1 it gives the following error. Any ideas why this might be so? Thanks in advance for any help with this.
Traceback (most recent call last):
File "proc_csv_0-0.py", line 3, in <module>
data = pd.read_csv('./2010-11.csv')
File "/usr/lib64/python3.4/site-packages/pandas/io/parsers.py", line 474, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/lib64/python3.4/site-packages/pandas/io/parsers.py", line 260, in _read
return parser.read()
File "/usr/lib64/python3.4/site-packages/pandas/io/parsers.py", line 721, in read
ret = self._engine.read(nrows)
File "/usr/lib64/python3.4/site-packages/pandas/io/parsers.py", line 1170, in read
data = self._reader.read(nrows)
File "pandas/parser.pyx", line 769, in pandas.parser.TextReader.read (pandas/parser.c:7566)
File "pandas/parser.pyx", line 791, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7806)
File "pandas/parser.pyx", line 866, in pandas.parser.TextReader._read_rows (pandas/parser.c:8639)
File "pandas/parser.pyx", line 973, in pandas.parser.TextReader._convert_column_data (pandas/parser.c:9950)
File "pandas/parser.pyx", line 1033, in pandas.parser.TextReader._convert_tokens (pandas/parser.c:10737)
File "pandas/parser.pyx", line 1130, in pandas.parser.TextReader._convert_with_dtype (pandas/parser.c:12141)
File "pandas/parser.pyx", line 1150, in pandas.parser.TextReader._string_convert (pandas/parser.c:12355)
File "pandas/parser.pyx", line 1382, in pandas.parser._string_box_utf8 (pandas/parser.c:17679)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc9 in position 4: unexpected end of data
In Python 3, when pd.read_csv is passed a file path (as opposed to a file buffer), it decodes the contents with the utf-8 codec by default.¹ It appears your CSV file is using a different encoding. Since it was generated by MS Excel, it might be cp1252:
In [25]: print(b'\xc9'.decode('cp1252'))
É
In [27]: import unicodedata as UDAT
In [28]: UDAT.name(b'\xc9'.decode('cp1252'))
Out[28]: 'LATIN CAPITAL LETTER E WITH ACUTE'
The error message
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc9
says that b'\xc9'.decode('utf-8') raises a UnicodeDecodeError.
The above shows byte 0xc9 can be decoded with cp1252. It remains to be seen if the rest of the file can also be decoded with cp1252, and if it produces the desired result.
Unfortunately, given only a file, there is no surefire way to tell what
encoding (if any) was used. It depends entirely on the program used to generate
the file.
If cp1252 is the right encoding, then to load the file into a DataFrame use
data = pd.read_csv('./2010-11.csv', encoding='cp1252')
¹ When pd.read_csv is passed a buffer, the buffer could have been opened with encoding already set:
# Python3
with open('/tmp/test.csv', 'r', encoding='cp1252') as f:
    df = pd.read_csv(f)
    print(df)
in which case pd.read_csv will not attempt to decode since the buffer f is already supplying decoded strings.
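The check above on a single byte can be extended to the whole file. The following is only a sketch under assumptions: the helper name first_working_encoding, the candidate list, and the demo file contents are all made up for illustration. It returns the first candidate that decodes every byte, which does not prove that the candidate is the encoding the file was actually written in.

```python
import os
import tempfile


def first_working_encoding(path, candidates=('utf-8', 'cp1252')):
    """Return the first candidate that decodes the whole file, or None."""
    with open(path, 'rb') as f:
        raw = f.read()
    for encoding in candidates:
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # this codec cannot decode the file, try the next one
        return encoding
    return None


# Hypothetical demo file containing byte 0xC9, as in the error message.
with tempfile.NamedTemporaryFile(suffix='.csv', delete=False) as tmp:
    tmp.write(b'name,city\nAlice,\xc9vry\n')
    demo_path = tmp.name

encoding = first_working_encoding(demo_path)
os.remove(demo_path)
print(encoding)
```

The result could then be passed on as pd.read_csv(path, encoding=encoding), after inspecting a few decoded rows to confirm the text looks right.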

Using random forest to classify reviews, but getting a KeyError?

I have the following code in Python:
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators = 100)
forest = forest.fit( train_data_features, train["sentiment"] )
but I get a KeyError for "sentiment" and I don't know why. This is how train is loaded:
train = pd.read_csv("labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/frame.py", line 1780, in __getitem__
return self._getitem_column(key)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/frame.py", line 1787, in _getitem_column
return self._get_item_cache(key)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/generic.py", line 1068, in _get_item_cache
values = self._data.get(item)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/internals.py", line 2849, in get
loc = self.items.get_loc(item)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/index.py", line 1402, in get_loc
return self._engine.get_loc(_values_from_object(key))
File "pandas/index.pyx", line 134, in pandas.index.IndexEngine.get_loc (pandas/index.c:3807)
File "pandas/index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas/index.c:3687)
File "pandas/hashtable.pyx", line 696, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12310)
File "pandas/hashtable.pyx", line 704, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12261)
KeyError: 'sentiment'
Are you doing the Kaggle competition? https://www.kaggle.com/c/word2vec-nlp-tutorial/data
Are you sure you have downloaded and decompressed the file ok? The first part of the file reads:
id sentiment review
"5814_8" 1 "With all this stuff go
This works for me:
>>> train = pd.read_csv("labeledTrainData.tsv", delimiter="\t")
>>> train.columns
Index([u'id', u'sentiment', u'review'], dtype='object')
>>> train.head(3)
       id  sentiment  review
0  5814_8          1  With all this stuff going down at the moment w...
1  2381_9          1  \The Classic War of the Worlds\" by Timothy Hi...
2  7759_3          0  The film starts with a manager (Nicholas Bell)...
You should check that the columns are set up correctly in the train variable. You should have a sentiment column, and that column seems to be missing from your DataFrame.
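One way to check the header before fitting the model is sketched below, using only the standard library so it runs without pandas. The helper name header_columns and the sample rows are assumptions made for illustration; with pandas, the equivalent check would be whether 'sentiment' is in train.columns.

```python
import csv
import io


def header_columns(lines, delimiter='\t'):
    """Return the column names from the first row of a delimited file."""
    return next(csv.reader(lines, delimiter=delimiter))


# Hypothetical sample mimicking the first lines of labeledTrainData.tsv.
sample = io.StringIO(u'id\tsentiment\treview\n"5814_8"\t1\t"With all this stuff"\n')
columns = header_columns(sample)

# Columns the training code relies on but the file does not provide.
missing = [name for name in ('id', 'sentiment', 'review') if name not in columns]
print(columns, missing)
```

A non-empty missing list would point at a truncated download, a wrong delimiter, or a different file being read.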