Preparing data for TfidfVectorizer use (scikit-learn)


I am trying to use scikit-learn's TfidfVectorizer, but I am having trouble, probably because my input does not match what TfidfVectorizer expects. I loaded a bunch of JSON documents and appended them to a list, and I now want that list to be the corpus for TfidfVectorizer.
The code:
import json
import pandas
from sklearn.feature_extraction.text import TfidfVectorizer
train=pandas.read_csv("train.tsv", sep='\t')
documents=[]
for i,row in train.iterrows():
    data = json.loads(row['boilerplate'].lower())
    documents.append(data['body'])
vectorizer=TfidfVectorizer(min_df=1)
X = vectorizer.fit_transform(documents)
idf = vectorizer.idf_
print dict(zip(vectorizer.get_feature_names(), idf))
I am getting the following error:
Traceback (most recent call last):
File "<ipython-input-56-94a6b95b0745>", line 1, in <module>
runfile('C:/Users/Guinea Pig/Downloads/try.py', wdir='C:/Users/Guinea Pig/Downloads')
File "D:\Anaconda\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 585, in runfile
execfile(filename, namespace)
File "C:/Users/Guinea Pig/Downloads/try.py", line 19, in <module>
X = vectorizer.fit_transform(documents)
File "D:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 1219, in fit_transform
X = super(TfidfVectorizer, self).fit_transform(raw_documents)
File "D:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 780, in fit_transform
vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
File "D:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 715, in _count_vocab
for feature in analyze(doc):
File "D:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 229, in <lambda>
tokenize(preprocess(self.decode(doc))), stop_words)
File "D:\Anaconda\lib\site-packages\sklearn\feature_extraction\text.py", line 195, in <lambda>
return lambda x: strip_accents(x.lower())
AttributeError: 'NoneType' object has no attribute 'lower'
I understand that the documents list consists of unicode objects rather than str objects, but I can't seem to solve this issue. Any ideas?

Eventually I added the following, which encodes each document to a UTF-8 byte string:
str_docs=[]
for item in documents:
    str_docs.append(item.encode('utf-8'))
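The traceback ends in `'NoneType' object has no attribute 'lower'`, which suggests that at least one `data['body']` is `None` (a JSON `null`) or missing, rather than an encoding problem. A minimal sketch, assuming that is the cause; `extract_bodies` is a hypothetical helper, not part of scikit-learn:

```python
import json


def extract_bodies(json_strings):
    """Return one text per JSON string, using '' when 'body' is null or missing."""
    bodies = []
    for raw in json_strings:
        data = json.loads(raw.lower())
        body = data.get('body')  # None if the key is absent or the value is null
        bodies.append(body if body is not None else u'')
    return bodies
```

With this helper, `documents = extract_bodies(train['boilerplate'])` could replace the loop above, so that `vectorizer.fit_transform(documents)` only ever sees strings.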

Related

Error when I tried accessing pandas DataFrame index

I used two CSV files to create a DataFrame:
import matplotlib.pyplot as plt
import pandas as pd
#import data
df1= pd.read_csv(r"C:\Users\Meiji\Desktop\CNST 6308-python\Hourly_TTI.csv")
df2= pd.read_csv(r"C:\Users\Meiji\Desktop\CNST 6308-python\Weather.csv")
#standardize date format
df1['new_date']= pd.to_datetime(df1['Date'])
df2['new_date']= pd.to_datetime(df2['EST'])
#merge TTI and weather dataframe
df=pd.merge(df1,df2,on=['new_date'])
#plot
df[df["Events"]=='Rain-Hail-Thunderstorm'].groupby('Hour').mean()['TTI'].plot()
df[df["Events"]!='Rain-Hail-Thunderstorm'].groupby('Hour').mean()['TTI'].plot()
plt.ylabel('TTI')
plt.legend(['Rain-Hail-Thunderstorm','Others'])
plt.show()
Here is the error I'm getting:
Process started (PID=27504) >>>
Traceback (most recent call last):
File "C:\Users\Meiji\Desktop\CNST 6308-python\HW4\new 2.py", line 12, in <module>
df[df["Events"]=='Rain-Hail-Thunderstorm'].groupby('Hour').mean()['TTI'].plot()
File "c:\python27\lib\site-packages\pandas\core\frame.py", line 2927, in __getitem__
indexer = self.columns.get_loc(key)
File "c:\python27\lib\site-packages\pandas\core\indexes\base.py", line 2659, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas\_libs\index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 1601, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 1608, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Events'
<<< Process finished (PID=27504). (Exit code 1)
What am I missing? My friend has the same code, and it runs perfectly for him.
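`KeyError: 'Events'` means the merged frame has no column with exactly that name. One possible cause, which is only an assumption because the CSV headers are not shown, is a header that differs in whitespace or case (for example `' Events'`). A minimal sketch for checking this; `find_similar_columns` is a hypothetical helper:

```python
def find_similar_columns(columns, wanted):
    """Return column names that equal `wanted` once whitespace and case are ignored."""
    target = wanted.strip().lower()
    return [c for c in columns if str(c).strip().lower() == target]
```

Calling `find_similar_columns(df.columns, 'Events')` returns the near matches. If it returns something like `[' Events']`, then `df = df.rename(columns=lambda c: c.strip())` before plotting would make `df["Events"]` resolve. If it returns an empty list, the column is simply absent from this copy of the CSV files.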

TensorVariable to Array

I'm trying to evaluate a theano TensorVariable expression:
import pymc3
import numpy as np
with pymc3.Model():
    growth = pymc3.Normal('growth_%s' % 'some_name', 0, 10)
x = np.arange(4)
(x * growth).eval()
but get the error
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/danna/.virtualenvs/lib/python2.7/site-packages/theano/gof/graph.py", line 522, in eval
self._fn_cache[inputs] = theano.function(inputs, self)
File "/home/danna/.virtualenvs/lib/python2.7/site-packages/theano/compile/function.py", line 317, in function
output_keys=output_keys)
File "/home/danna/.virtualenvs/lib/python2.7/site-packages/theano/compile/pfunc.py", line 486, in pfunc
output_keys=output_keys)
File "/home/danna/.virtualenvs/lib/python2.7/site-packages/theano/compile/function_module.py", line 1839, in orig_function
name=name)
File "/home/danna/.virtualenvs/lib/python2.7/site-packages/theano/compile/function_module.py", line 1487, in __init__
accept_inplace)
File "/home/danna/.virtualenvs/lib/python2.7/site-packages/theano/compile/function_module.py", line 181, in std_fgraph
update_mapping=update_mapping)
File "/home/danna/.virtualenvs/lib/python2.7/site-packages/theano/gof/fg.py", line 175, in __init__
self.__import_r__(output, reason="init")
File "/home/danna/.virtualenvs/lib/python2.7/site-packages/theano/gof/fg.py", line 346, in __import_r__
self.__import__(variable.owner, reason=reason)
File "/home/danna/.virtualenvs/lib/python2.7/site-packages/theano/gof/fg.py", line 391, in __import__
raise MissingInputError(error_msg, variable=r)
theano.gof.fg.MissingInputError: Input 0 of the graph (indices start from 0), used to compute InplaceDimShuffle{x}(growth_some_name), was not provided and not given a value. Use the Theano flag exception_verbosity='high', for more information on this error.
I tried
Can someone please help me see what the theano variables actually output?
Thank you!
I'm using Python 2.7 and theano 1.0.3

While PyMC3 distributions are TensorVariable objects, they don't technically have any values to evaluate outside of sampling. If you want values, you have to at least run sampling on the model:
with pymc3.Model():
    growth = pymc3.Normal('growth', 0, 10)
    trace = pymc3.sample(10)
x = np.arange(4)
x[:, np.newaxis]*trace['growth']
If you want to view node values during sampling, you'd need to use theano.tensor.printing.Print objects. For more info, see the PyMC3 debugging tips.

Reading in OSM buildings geojson data into Python via geopandas

I'm having problems reading an OpenStreetMap buildings (IMPOSM GeoJSON) file into a geopandas GeoDataFrame (Python 2.7). This is on Mac OS X 10.11.3. Here are the messages I'm getting:
>>> import geopandas as gpd
>>> df=gpd.read_file('san-francisco-bay_california_buildings.geojson')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/ewang/anaconda/lib/python2.7/site-packages/geopandas/io/file.py", line 28, in read_file
gdf = GeoDataFrame.from_features(f, crs=crs)
File "/Users/ewang/anaconda/lib/python2.7/site-packages/geopandas/geodataframe.py", line 193, in from_features
d = {'geometry': shape(f['geometry']) if f['geometry'] else None}
File "/Users/ewang/anaconda/lib/python2.7/site-packages/shapely/geometry/geo.py", line 34, in shape
return Polygon(ob["coordinates"][0], ob["coordinates"][1:])
File "/Users/ewang/anaconda/lib/python2.7/site-packages/shapely/geometry/polygon.py", line 229, in __init__
self._geom, self._ndim = geos_polygon_from_py(shell, holes)
File "/Users/ewang/anaconda/lib/python2.7/site-packages/shapely/geometry/polygon.py", line 508, in geos_polygon_from_py
geos_shell, ndim = geos_linearring_from_py(shell)
File "/Users/ewang/anaconda/lib/python2.7/site-packages/shapely/geometry/polygon.py", line 450, in geos_linearring_from_py
n = len(ob[0])
IndexError: list index out of range
The odd thing is that I can load OSM roads data IMPOSM GEOJSON files with geopandas. Am I missing something obvious here? Thanks very much.
EDIT - link to the data below:
OSM data from mapzen
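The traceback fails at `len(ob[0])` while shapely builds a polygon's exterior ring, which would be consistent with a Polygon feature whose exterior ring is empty. That is an assumption, since the file contents are not shown. A minimal sketch that drops such features from an already-parsed GeoJSON dict; `has_usable_geometry` and `drop_empty_polygons` are hypothetical helpers:

```python
def has_usable_geometry(feature):
    """False for Polygon features with no rings or an empty exterior ring."""
    geom = feature.get('geometry')
    if not geom or geom.get('type') != 'Polygon':
        return True  # leave other geometry types and null geometries untouched
    coords = geom.get('coordinates') or []
    return len(coords) > 0 and len(coords[0]) > 0


def drop_empty_polygons(collection):
    """Return a shallow copy of a FeatureCollection dict without empty polygons."""
    cleaned = dict(collection)
    cleaned['features'] = [f for f in collection.get('features', [])
                           if has_usable_geometry(f)]
    return cleaned
```

The file could be parsed with `json.load`, passed through `drop_empty_polygons`, and the resulting `cleaned['features']` handed to `GeoDataFrame.from_features`, the same call that appears in the traceback. Comparing the feature counts before and after shows whether empty polygons were present at all.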

pandas get_group memory error

I am using pandas v0.14.1 with python 2.7
I have a groupby object and I am trying to pull out the group identified by a particular key. The key is in fact in the groups:
>>> key in key_groups.groups.keys()
True
but when I try to make the get_group call it fails with a memory error:
>>> key_groups.get_group(key)
*** MemoryError:
The full stacktrace is:
Traceback (most recent call last):
File "main.py", line 141, in <module>
main(num_days=arguments.days, num_variants=arguments.variants)
File "main.py", line 76, in main
problem, solution = Solver.Solve(request, num_variants)
File "/srv/compunctuator/src/Solver.py", line 49, in Solve
solution = attempt_minimization(t)
File "/srv/compunctuator/src/Solver.py", line 41, in attempt_minimization
t.scruple()
File "/srv/compunctuator/src/Compunctuator.py", line 136, in scruple
self.__iterate__()
File "/srv/compunctuator/src/Compunctuator.py", line 95, in __iterate__
self.__maximize_impressions__()
File "/srv/compunctuator/src/Compunctuator.py", line 583, in __maximize_impressions__
df = key_groups.get_group(key)
File "/srv/compunctuator/.virtualenvs/compunctuator/local/lib/python2.7/site-packages/pandas/core/groupby.py", line 573, in get_group
inds = self._get_index(name)
File "/srv/compunctuator/.virtualenvs/compunctuator/local/lib/python2.7/site-packages/pandas/core/groupby.py", line 429, in _get_index
sample = next(iter(self.indices))
File "/srv/compunctuator/.virtualenvs/compunctuator/local/lib/python2.7/site-packages/pandas/core/groupby.py", line 414, in indices
return self.grouper.indices
File "properties.pyx", line 34, in pandas.lib.cache_readonly.__get__ (pandas/lib.c:36380)
File "/srv/compunctuator/.virtualenvs/compunctuator/local/lib/python2.7/site-packages/pandas/core/groupby.py", line 1253, in indices
return _get_indices_dict(label_list, keys)
File "/srv/compunctuator/.virtualenvs/compunctuator/local/lib/python2.7/site-packages/pandas/core/groupby.py", line 3474, in _get_indices_dict
np.prod(shape))
File "algos.pyx", line 1997, in pandas.algos.groupsort_indexer (pandas/algos.c:37521)
MemoryError
If I actually use the dictionary lookup I can get the indices out:
>>> key_groups.groups[key]
[0, 2]
It seems like everything should work here.
I realize a similar question was asked before ("pandas get_group causes memory error"), but it was never resolved, and I thought I could give more details if necessary.

'Polygon' object does not support indexing

I am trying to render an SVG map using Kartograph.py, but it throws a TypeError. Here is the Python code:
import kartograph
from kartograph import Kartograph
import sys
from kartograph.options import read_map_config
css = open("stylesheet.css").read()
K = Kartograph()
cfg = read_map_config(open("config.json"))
K.generate(cfg, outfile='dd.svg', format='svg', stylesheet=css)
Here is the error it throws
Traceback (most recent call last):
File "<pyshell#33>", line 1, in <module>
K.generate(cfg, outfile='dd.svg', format='svg', stylesheet=css)
File "C:\Python27\lib\site-packages\kartograph.py-0.6.8-py2.7.egg\kartograph\kartograph.py", line 46, in generate
_map = Map(opts, self.layerCache, format=format)
File "C:\Python27\lib\site-packages\kartograph.py-0.6.8-py2.7.egg\kartograph\map.py", line 61, in __init__
layer.get_features()
File "C:\Python27\lib\site-packages\kartograph.py-0.6.8-py2.7.egg\kartograph\maplayer.py", line 81, in get_features
charset=layer.options['charset']
File "C:\Python27\lib\site-packages\kartograph.py-0.6.8-py2.7.egg\kartograph\layersource\shplayer.py", line 121, in get_features
geom = shape2geometry(shp, ignore_holes=ignore_holes, min_area=min_area, bbox=bbox, proj=self.proj)
File "C:\Python27\lib\site-packages\kartograph.py-0.6.8-py2.7.egg\kartograph\layersource\shplayer.py", line 153, in shape2geometry
geom = shape2polygon(shp, ignore_holes=ignore_holes, min_area=min_area, proj=proj)
File "C:\Python27\lib\site-packages\kartograph.py-0.6.8-py2.7.egg\kartograph\layersource\shplayer.py", line 217, in shape2polygon
poly = MultiPolygon(polygons)
File "C:\Python27\lib\site-packages\shapely\geometry\multipolygon.py", line 74, in __init__
self._geom, self._ndim = geos_multipolygon_from_polygons(polygons)
File "C:\Python27\lib\site-packages\shapely\geometry\multipolygon.py", line 30, in geos_multipolygon_from_polygons
N = len(ob[0][0][0])
TypeError: 'Polygon' object does not support indexing

I had a look at shapely and it seems like you are using an outdated version.
Update your current install:
pip install -U shapely
