I have the following code in Python:
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators = 100)
forest = forest.fit( train_data_features, train["sentiment"] )
but I get a KeyError for "sentiment" and I don't know why. train is loaded like this:
train = pd.read_csv("labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site--packages/pandas/core/frame.py", line 1780, in __getitem__
return self._getitem_column(key)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/frame.py", line 1787, in _getitem_column
return self._get_item_cache(key)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/generic.py", line 1068, in _get_item_cache
values = self._data.get(item)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/internals.py", line 2849, in get
loc = self.items.get_loc(item)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/index.py", line 1402, in get_loc
return self._engine.get_loc(_values_from_object(key))
File "pandas/index.pyx", line 134, in pandas.index.IndexEngine.get_loc (pandas/index.c:3807)
File "pandas/index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas/index.c:3687)
File "pandas/hashtable.pyx", line 696, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12310)
File "pandas/hashtable.pyx", line 704, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12261)
KeyError: 'sentiment'
Are you doing the Kaggle competition? https://www.kaggle.com/c/word2vec-nlp-tutorial/data
Are you sure you have downloaded and decompressed the file correctly? The first part of the file reads:
id sentiment review
"5814_8" 1 "With all this stuff go
This works for me:
>>> train = pd.read_csv("labeledTrainData.tsv", delimiter="\t")
>>> train.columns
Index([u'id', u'sentiment', u'review'], dtype='object')
>>> train.head(3)
id sentiment review
0 5814_8 1 With all this stuff going down at the moment w...
1 2381_9 1 \The Classic War of the Worlds\" by Timothy Hi...
2 7759_3 0 The film starts with a manager (Nicholas Bell)...
Check that the columns are set up correctly in your train variable: you should have a sentiment column, and that column seems to be missing from your dataframe.
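A quick way to verify, as a minimal sketch assuming labeledTrainData.tsv sits in your working directory:
>>> import pandas as pd
>>> train = pd.read_csv("labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)
>>> 'sentiment' in train.columns  # should be True if the file is intact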
Related
I'm training a CNN with word embeddings and for some reason I'm getting a FailedPreconditionError exception whenever I try to save a frozen version of the model for later use.
This is despite the fact that I call sess.run(tf.global_variables_initializer()) just before training and I have no problem training and checkpointing the model.
The problem occurs when I try to load a model from a checkpoint and save a frozen model. The function I'm using is as follows:
import tensorflow as tf
from tensorflow.python.framework import graph_util

def freeze_model(checkpoint_path, model_save_path, output_node_names):
    # Locate the latest checkpoint in the directory.
    checkpoint = tf.train.get_checkpoint_state(checkpoint_path)
    input_checkpoint = checkpoint.model_checkpoint_path

    # Recreate the graph from the .meta file saved alongside the checkpoint.
    saver = tf.train.import_meta_graph(input_checkpoint + '.meta', clear_devices=True)
    graph = tf.get_default_graph()
    input_graph_def = graph.as_graph_def()

    with tf.Session() as sess:
        # Restore variable values, then bake them into the graph as constants.
        saver.restore(sess, input_checkpoint)
        output_graph_def = graph_util.convert_variables_to_constants(
            sess,
            input_graph_def,
            output_node_names
        )
        with tf.gfile.GFile(model_save_path, "wb") as f:
            f.write(output_graph_def.SerializeToString())
The error I get is:
Traceback (most recent call last):
File "myproject/train.py", line 522, in <module>
tf.app.run()
File "/home/foo/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 44, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "myproject/train.py", line 518, in main
trainer.save_model(preprocessor)
File "myproject/train.py", line 312, in save_model
ut.freeze_model(self.checkpoint_dir, model_save_path, C.OUTPUT_NODE_NAMES)
File "/home/foo/anaconda2/lib/python2.7/site-packages/myproject/utils.py", line 224, in freeze_model
output_node_names
File "/home/foo/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/graph_util_impl.py", line 218, in convert_variables_to_constants
returned_variables = sess.run(variable_names)
File "/home/foo/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 767, in run
run_metadata_ptr)
File "/home/foo/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 965, in _run
feed_dict_string, options, run_metadata)
File "/home/foo/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1015, in _do_run
target_list, options, run_metadata)
File "/home/foo/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1035, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.FailedPreconditionError: Attempting to use uninitialized value embeddings/W
[[Node: embeddings/W/_20 = _Send[T=DT_FLOAT, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_30_embeddings/W", _device="/job:localhost/replica:0/task:0/gpu:0"](embeddings/W)]]
[[Node: conv_maxpool_4/W/_17 = _Recv[_start_time=0, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_26_conv_maxpool_4/W", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
It turns out I was constructing a Saver object before I made a Session, so nothing from the session was being saved.
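For reference, a minimal sketch of the ordering that works; the embedding variable here is a stand-in for the real model:
import tensorflow as tf

# Build the graph first: every variable must exist before the Saver is created.
with tf.variable_scope("embeddings"):
    W = tf.get_variable("W", shape=[100, 128])  # stand-in for the real embeddings

saver = tf.train.Saver()  # now it knows about all the variables above

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # ... training loop and checkpointing go here ...
    saver.save(sess, "model.ckpt")  # the checkpoint now contains embeddings/W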
I am trying to use sklearn's TfidfVectorizer. I am having trouble because my input probably doesn't match what TfidfVectorizer expects. I have a bunch of JSONs that I loaded and appended into a list, and I now want that list to be the corpus for TfidfVectorizer.
The code:
import json
import pandas
from sklearn.feature_extraction.text import TfidfVectorizer

train = pandas.read_csv("train.tsv", sep='\t')
documents = []
for i, row in train.iterrows():
    data = json.loads(row['boilerplate'].lower())
    documents.append(data['body'])

vectorizer = TfidfVectorizer(min_df=1)
X = vectorizer.fit_transform(documents)
idf = vectorizer.idf_
print dict(zip(vectorizer.get_feature_names(), idf))
I am getting the following error:
Traceback (most recent call last):
File "<ipython-input-56-94a6b95b0745>", line 1, in <module>
runfile('C:/Users/Guinea Pig/Downloads/try.py', wdir='C:/Users/Guinea Pig/Downloads')
File "D:\Anaconda\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 585, in runfile
execfile(filename, namespace)
File "C:/Users/Guinea Pig/Downloads/try.py", line 19, in <module>
X = vectorizer.fit_transform(documents)
File "D:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 1219, in fit_transform
X = super(TfidfVectorizer, self).fit_transform(raw_documents)
File "D:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 780, in fit_transform
vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
File "D:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 715, in _count_vocab
for feature in analyze(doc):
File "D:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 229, in <lambda>
tokenize(preprocess(self.decode(doc))), stop_words)
File "D:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 195, in <lambda>
return lambda x: strip_accents(x.lower())
AttributeError: 'NoneType' object has no attribute 'lower'
I understand that the documents list consists of Unicode objects, not string objects, but I can't seem to solve this issue. Any ideas?
Eventually I used:
str_docs = []
for item in documents:
    str_docs.append(item.encode('utf-8'))
as an additional step.
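The 'NoneType' in the traceback also suggests that some records have no 'body' at all, so it may help to skip those while building the corpus; a sketch of a defensive variant of the loading loop:
documents = []
for i, row in train.iterrows():
    data = json.loads(row['boilerplate'].lower())
    body = data.get('body')  # may be missing or None in some records
    if body:
        documents.append(body)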
I'm having problems reading an OpenStreetMap buildings (IMPOSM GeoJSON) file into a geopandas data frame object (Python 2.7). This is on Mac OS X 10.11.3. Here are the messages I'm getting:
>>> import geopandas as gpd
>>> df=gpd.read_file('san-francisco-bay_california_buildings.geojson')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/ewang/anaconda/lib/python2.7/site-packages/geopandas/io/file.py", line 28, in read_file
gdf = GeoDataFrame.from_features(f, crs=crs)
File "/Users/ewang/anaconda/lib/python2.7/site-packages/geopandas/geodataframe.py", line 193, in from_features
d = {'geometry': shape(f['geometry']) if f['geometry'] else None}
File "/Users/ewang/anaconda/lib/python2.7/site-packages/shapely/geometry/geo.py", line 34, in shape
return Polygon(ob["coordinates"][0], ob["coordinates"][1:])
File "/Users/ewang/anaconda/lib/python2.7/site-packages/shapely/geometry/polygon.py", line 229, in __init__
self._geom, self._ndim = geos_polygon_from_py(shell, holes)
File "/Users/ewang/anaconda/lib/python2.7/site-packages/shapely/geometry/polygon.py", line 508, in geos_polygon_from_py
geos_shell, ndim = geos_linearring_from_py(shell)
File "/Users/ewang/anaconda/lib/python2.7/site-packages/shapely/geometry/polygon.py", line 450, in geos_linearring_from_py
n = len(ob[0])
IndexError: list index out of range
The odd thing is that I can load OSM roads data from IMPOSM GeoJSON files with geopandas. Am I missing something obvious here? Thanks very much.
EDIT - link to the data below:
OSM data from mapzen
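One way to narrow this down is to look for Polygon features with an empty coordinates array, which is what the IndexError at n = len(ob[0]) points to; a diagnostic sketch, assuming the file fits in memory:
import json

with open('san-francisco-bay_california_buildings.geojson') as f:
    gj = json.load(f)

for i, feature in enumerate(gj['features']):
    geom = feature.get('geometry')
    if geom and geom['type'] == 'Polygon' and not geom['coordinates']:
        print i, 'has an empty Polygon coordinates array'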
I am using pandas v0.14.1 with Python 2.7.
I have a groupby object and I am trying to pull out a group identified by a particular key. The key is in fact in the groups dict:
>>> key in key_groups.groups.keys()
True
but when I try to make the get_group call it fails with a memory error:
>>>> key_groups.get_group(key)
*** MemoryError:
The full stacktrace is:
Traceback (most recent call last):
File "main.py", line 141, in <module>
main(num_days=arguments.days, num_variants=arguments.variants)
File "main.py", line 76, in main
problem, solution = Solver.Solve(request, num_variants)
File "/srv/compunctuator/src/Solver.py", line 49, in Solve
solution = attempt_minimization(t)
File "/srv/compunctuator/src/Solver.py", line 41, in attempt_minimization
t.scruple()
File "/srv/compunctuator/src/Compunctuator.py", line 136, in scruple
self.__iterate__()
File "/srv/compunctuator/src/Compunctuator.py", line 95, in __iterate__
self.__maximize_impressions__()
File "/srv/compunctuator/src/Compunctuator.py", line 583, in __maximize_impressions__
df = key_groups.get_group(key)
File "/srv/compunctuator/.virtualenvs/compunctuator/local/lib/python2.7/site-packages/pandas/core/groupby.py", line 573, in get_group
inds = self._get_index(name)
File "/srv/compunctuator/.virtualenvs/compunctuator/local/lib/python2.7/site-packages/pandas/core/groupby.py", line 429, in _get_index
sample = next(iter(self.indices))
File "/srv/compunctuator/.virtualenvs/compunctuator/local/lib/python2.7/site-packages/pandas/core/groupby.py", line 414, in indices
return self.grouper.indices
File "properties.pyx", line 34, in pandas.lib.cache_readonly.__get__ (pandas/lib.c:36380)
File "/srv/compunctuator/.virtualenvs/compunctuator/local/lib/python2.7/site-packages/pandas/core/groupby.py", line 1253, in indices
return _get_indices_dict(label_list, keys)
File "/srv/compunctuator/.virtualenvs/compunctuator/local/lib/python2.7/site-packages/pandas/core/groupby.py", line 3474, in _get_indices_dict
np.prod(shape))
File "algos.pyx", line 1997, in pandas.algos.groupsort_indexer (pandas/algos.c:37521) MemoryError
If I actually use the dictionary lookup I can get the indices out:
>>>> key_groups.groups[key]
[0, 2]
It seems like everything should work here.
I realize a similar question was asked here: pandas get_group causes memory error.
But it was never resolved, and I thought I could give more details if necessary.
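Since the plain dictionary lookup works, one possible workaround is to index the original frame directly; a sketch, where df stands for the frame the groupby was built from:
>>> df_group = df.loc[key_groups.groups[key]]  # selects rows [0, 2] shown above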
I am working on a machine learning project where I am supposed to read a CSV file to build a linear regression model. Here is how I read the CSV file:
data_test = pd.read_csv("/media/halawa/93B77F681EC1B4D2/GUC/Semster 8/CSEN 1022 Machine Learning/2/test.csv",delimiter=",", header=0)
but when I run it I get this error:
/usr/bin/python2.7 /home/halawa/PycharmProjects/ML/evergreen.py
Traceback (most recent call last):
File "/home/halawa/PycharmProjects/ML/evergreen.py", line 24, in <module>
data_test = pd.read_csv("/media/halawa/93B77F681EC1B4D2/GUC/Semster 8/CSEN 1022 Machine Learning/2/test.csv",delimiter=",", header=0)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 470, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 256, in _read
return parser.read()
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 715, in read
ret = self._engine.read(nrows)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1164, in read
data = self._reader.read(nrows)
File "pandas/parser.pyx", line 758, in pandas.parser.TextReader.read (pandas/parser.c:7411)
File "pandas/parser.pyx", line 780, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7651)
File "pandas/parser.pyx", line 833, in pandas.parser.TextReader._read_rows (pandas/parser.c:8268)
File "pandas/parser.pyx", line 820, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:8142)
File "pandas/parser.pyx", line 1758, in pandas.parser.raise_parser_error (pandas/parser.c:20728)
pandas.parser.CParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 8
Process finished with exit code 1
Your issue is that your CSV doesn't have a consistent number of fields on each line. For example, it appears that the first line has 3 fields:
x,y,z
while the third line has 8:
x,y,z,a,b,c,d,e
You will need to fix your source CSV file to avoid this error.
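To find the offending lines, a quick scan like this may help (a sketch, assuming plain comma-delimited text):
with open("/media/halawa/93B77F681EC1B4D2/GUC/Semster 8/CSEN 1022 Machine Learning/2/test.csv") as f:
    for lineno, line in enumerate(f, start=1):
        fields = line.rstrip("\n").split(",")
        if len(fields) != 3:  # 3 = the field count of the first line
            print lineno, len(fields)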
Alternatively, if you know that you have 8 fields max, and are OK with some lines missing fields, you can use names:
data_test = pd.read_csv("/media/halawa/93B77F681EC1B4D2/GUC/Semster 8/CSEN 1022 Machine Learning/2/test.csv",delimiter=",", header=0, names=list('abcdefgh'))
This parameter tells the CSV reader which column names (and hence how many fields) to expect; fields missing from a line are filled in with NaN.
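A tiny self-contained demo of that behaviour (hypothetical data, Python 2):
import pandas as pd
from StringIO import StringIO

csv = "1,2,3\n4,5,6,7,8,9,10,11\n"
df = pd.read_csv(StringIO(csv), names=list('abcdefgh'))
print df  # the short first row gets NaN in columns d-h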
EDIT:
If your null columns are marked with a ? then you should set the pandas na_values parameter like so:
data_test = pd.read_csv("/media/halawa/93B77F681EC1B4D2/GUC/Semster 8/CSEN 1022 Machine Learning/2/test.csv",delimiter=",", header=0, na_values=['?'])