Seaborn KDE visualisation, ValueError on dataset - python-2.7

I am attempting to draw a KDE plot with Seaborn, but am encountering an error when passing in the data.
The data is a set of scores ranging from 1 to 13 and is in the form of a NumPy array.
Below is the section of code I'm using:
import numpy as np
import pandas as pd
import seaborn as sns

query_CNM = 'SELECT SCORE from CNMATCH LIMIT 2000'
df = pd.read_sql(query_CNM, conn, index_col=None)  # conn is an existing DB connection
yy = np.array(df)  # note: this has shape (N, 1), not (N,)
plot = sns.kdeplot(yy)
Below is the full error that I'm receiving.
Traceback (most recent call last):
File "/Applications/PyCharm CE.app/Contents/helpers/pydev/pydevd.py", line 1758, in <module>
main()
File "/Applications/PyCharm CE.app/Contents/helpers/pydev/pydevd.py", line 1752, in main
globals = debugger.run(setup['file'], None, None, is_module)
File "/Applications/PyCharm CE.app/Contents/helpers/pydev/pydevd.py", line 1147, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "/Users/uni/Desktop/Proof_Of_Concept/PYQTKDE.py", line 66, in <module>
plot = sns.kdeplot(yy)
File "/Users/uni/.conda/envs/fing.py/lib/python2.7/site-packages/seaborn/distributions.py", line 664, in kdeplot
x, y = data.T
ValueError: need more than 1 value to unpack
I can't seem to find exactly how the data needs to be formatted for Seaborn in order to fit a KDE. If any insights can be provided on this, it would be greatly appreciated.
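The traceback gives a strong hint: the failing line x, y = data.T means kdeplot is treating the two-dimensional (N, 1) array as bivariate data and trying to unpack it into x and y coordinates. A minimal sketch of a likely fix, assuming the SCORE column is numeric, is to pass one-dimensional data instead:
# Flatten the (N, 1) array to shape (N,) so kdeplot treats it as univariate
plot = sns.kdeplot(yy.ravel())

# Or skip the intermediate array entirely and pass the column itself
plot = sns.kdeplot(df['SCORE'])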

Related

TensorVariable to Array

I'm trying to evaluate a Theano TensorVariable expression:
import pymc3
import numpy as np

with pymc3.Model():
    growth = pymc3.Normal('growth_%s' % 'some_name', 0, 10)
    x = np.arange(4)
    (x * growth).eval()
but get the error
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/danna/.virtualenvs/lib/python2.7/site-packages/theano/gof/graph.py", line 522, in eval
self._fn_cache[inputs] = theano.function(inputs, self)
File "/home/danna/.virtualenvs/lib/python2.7/site-packages/theano/compile/function.py", line 317, in function
output_keys=output_keys)
File "/home/danna/.virtualenvs/lib/python2.7/site-packages/theano/compile/pfunc.py", line 486, in pfunc
output_keys=output_keys)
File "/home/danna/.virtualenvs/lib/python2.7/site-packages/theano/compile/function_module.py", line 1839, in orig_function
name=name)
File "/home/danna/.virtualenvs/lib/python2.7/site-packages/theano/compile/function_module.py", line 1487, in __init__
accept_inplace)
File "/home/danna/.virtualenvs/lib/python2.7/site-packages/theano/compile/function_module.py", line 181, in std_fgraph
update_mapping=update_mapping)
File "/home/danna/.virtualenvs/lib/python2.7/site-packages/theano/gof/fg.py", line 175, in __init__
self.__import_r__(output, reason="init")
File "/home/danna/.virtualenvs/lib/python2.7/site-packages/theano/gof/fg.py", line 346, in __import_r__
self.__import__(variable.owner, reason=reason)
File "/home/danna/.virtualenvs/lib/python2.7/site-packages/theano/gof/fg.py", line 391, in __import__
raise MissingInputError(error_msg, variable=r)
theano.gof.fg.MissingInputError: Input 0 of the graph (indices start from 0), used to compute InplaceDimShuffle{x}(growth_some_name), was not provided and not given a value. Use the Theano flag exception_verbosity='high', for more information on this error.
Can someone please help me see what the Theano variables actually output? Thank you!
I'm using Python 2.7 and Theano 1.0.3.
While PyMC3 distributions are TensorVariable objects, they don't technically have any values to be evaluated outside of sampling. If you want values, you have to at least run sampling on the model:
with pymc3.Model():
    growth = pymc3.Normal('growth', 0, 10)
    trace = pymc3.sample(10)

x = np.arange(4)
x[:, np.newaxis] * trace['growth']
If you want to view node values during sampling, you'd need to use theano.printing.Print objects. For more info, see the PyMC3 debugging tips.
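As an illustration (a minimal sketch, not part of the original answer), the Print op wraps a variable so its value is written to stdout whenever it is computed during sampling; the wrapped variable has to stay in the graph, e.g. via a Deterministic, for the print to fire:
import pymc3
import theano

with pymc3.Model():
    growth = pymc3.Normal('growth', 0, 10)
    # Wrap the variable in a Print op and keep it in the graph
    growth_printed = theano.printing.Print('growth')(growth)
    pymc3.Deterministic('growth_watched', growth_printed)
    trace = pymc3.sample(10)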

pysparkDistributedKmodes lib error

I'm trying to run pyspark-distributed-kmodes example:
import numpy as np
from pyspark_kmodes import EnsembleKModes

data = np.random.choice(["a", "b", "c"], (50000, 10))
data2 = np.random.choice(["e", "f", "g"], (50000, 10))
data = list(data) + list(data2)

from random import shuffle
shuffle(data)

# Create a Spark RDD from our sample data and decrease partitions to max_partitions
max_partitions = 32
rdd = sc.parallelize(data)
rdd = rdd.coalesce(max_partitions)

for x in rdd.take(10):
    print x

method = EnsembleKModes(n_clusters, max_iter)
model = method.fit(df.rdd)

print(model.clusters)
print(method.mean_cost)

predictions = method.predictions
datapoints = method.indexed_rdd
combined = datapoints.zip(predictions)
print(combined.take(10))
model.predict(rdd).take(5)
I'm using Python 2.7, Apache Zeppelin 0.7.1 and Apache Spark 2.1.0.
This is the output error:
('Iteration ', 0)
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-1298251609305129154.py", line 349, in <module>
raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-1298251609305129154.py", line 337, in <module>
exec(code)
File "<stdin>", line 13, in <module>
File "/usr/local/lib/python2.7/dist-packages/pyspark_kmodes/pyspark_kmodes.py", line 430, in fit
self.n_clusters,self.max_dist_iter)
File "/usr/local/lib/python2.7/dist-packages/pyspark_kmodes/pyspark_kmodes.py", line 271, in k_modes_partitioned
clusters = check_for_empty_cluster(clusters, rdd)
File "/usr/local/lib/python2.7/dist-packages/pyspark_kmodes/pyspark_kmodes.py", line 317, in check_for_empty_cluster
random_element = random.choice(clusters[biggest_cluster].members)
File "/usr/lib/python2.7/random.py", line 275, in choice
return seq[int(self.random() * len(seq))] # raises IndexError if seq is empty
IndexError: list index out of range
The RDD used to fit the model is not empty; I've checked it. I think it is a version incompatibility problem between pyspark-distributed-kmodes and Spark, but I can't downgrade Spark.
Any idea how to fix it?
What is df? This doesn't look like a Spark error. The code from https://github.com/ThinkBigAnalytics/pyspark-distributed-kmodes is working for me under Spark 2.1.0. It even works when I change this line of your code:
method = EnsembleKModes(n_clusters, max_iter)
model = method.fit(rdd)
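For completeness, here is a runnable sketch under stated assumptions: n_clusters and max_iter are never defined in the question, so the values below are placeholders.
from pyspark_kmodes import EnsembleKModes

n_clusters = 2   # placeholder value, not from the question
max_iter = 10    # placeholder value, not from the question

method = EnsembleKModes(n_clusters, max_iter)
model = method.fit(rdd)  # fit directly on the RDD built above, not df.rdd
print(model.clusters)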

Memory error even though RAM is free

I am merging files together in 4 folders. Within those 4 folders I am merging 80 .dbf files, each of which is 35 megabytes. I am using the following code:
import os
import pandas as pd
from simpledbf import Dbf5

list1 = []
folders = r'F:\dbf_tables'
out = r'F:\merged'

if not os.path.isdir(out):
    os.mkdir(out)

for folder in os.listdir(folders):
    if not os.path.isdir(os.path.join(out, folder)):
        os.mkdir(os.path.join(out, folder))
    for f in os.listdir(os.path.join(folders, folder)):
        if '.xml' not in f:
            if '.cpg' not in f:
                table = Dbf5(os.path.join(folders, folder, f))
                df = table.to_dataframe()
                list1.append(df)
    dfs = reduce(lambda left, right: pd.merge(left, right, on=['POINTID'], how='outer'), list1)
    dfs.to_csv(os.path.join(out, folder, 'combined.csv'), index=False)
almost immediately after running the code I receive this error:
Traceback (most recent call last):
File "<ipython-input-1-77eb6fd0cda7>", line 1, in <module>
runfile('F:/python codes/prelim_codes/raster_to_point.py', wdir='F:/python codes/prelim_codes')
File "C:\Users\spotter\AppData\Local\Continuum\Anaconda_64\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 714, in runfile
execfile(filename, namespace)
File "C:\Users\spotter\AppData\Local\Continuum\Anaconda_64\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 74, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)
File "F:/python codes/prelim_codes/raster_to_point.py", line 66, in <module>
dfs = reduce(lambda left,right: pd.merge(left,right,on=['POINTID'],how='outer',),list1)
File "F:/python codes/prelim_codes/raster_to_point.py", line 66, in <lambda>
dfs = reduce(lambda left,right: pd.merge(left,right,on=['POINTID'],how='outer',),list1)
File "C:\Users\spotter\AppData\Local\Continuum\Anaconda_64\lib\site-packages\pandas\tools\merge.py", line 39, in merge
return op.get_result()
File "C:\Users\spotter\AppData\Local\Continuum\Anaconda_64\lib\site-packages\pandas\tools\merge.py", line 217, in get_result
join_index, left_indexer, right_indexer = self._get_join_info()
File "C:\Users\spotter\AppData\Local\Continuum\Anaconda_64\lib\site-packages\pandas\tools\merge.py", line 353, in _get_join_info
sort=self.sort, how=self.how)
File "C:\Users\spotter\AppData\Local\Continuum\Anaconda_64\lib\site-packages\pandas\tools\merge.py", line 559, in _get_join_indexers
return join_func(lkey, rkey, count, **kwargs)
File "pandas\src\join.pyx", line 160, in pandas.algos.full_outer_join (pandas\algos.c:61256)
MemoryError
but only 30% of my memory is being used, which is pretty much the baseline.
EDIT:
I picked out only 2 files and tried the merge using:
merge = pd.merge(df1, df2, on=['POINTID'], how='outer')
and still got a memory error, so something weird is going on.
When I run the same thing in 32-bit Anaconda I get ValueError: negative dimensions are not allowed
EDIT:
The entire problem stemmed from the solution given here:
Value Error: negative dimensions are not allowed when merging
EDITED based on comment:
Try this (it's enough to use a single if statement with the conditions combined by and):
import os
import pandas as pd
from simpledbf import Dbf5

folders = r'F:\dbf_tables'
out = r'F:\merged'

if not os.path.isdir(out):
    os.mkdir(out)

for folder in os.listdir(folders):
    if not os.path.isdir(os.path.join(out, folder)):
        os.mkdir(os.path.join(out, folder))

    # Initialize an empty dataframe per folder
    dfs = pd.DataFrame(columns=['POINTID'])

    for f in os.listdir(os.path.join(folders, folder)):
        if ('.xml' not in f) and ('.cpg' not in f):
            table = Dbf5(os.path.join(folders, folder, f))
            df = table.to_dataframe()
            # Merge the current dataframe into the running result
            dfs = dfs.merge(df, on=['POINTID'], how='outer')

    # Save the combined result for this folder
    dfs.to_csv(os.path.join(out, folder, 'combined.csv'), index=False)
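Merging each file into dfs as soon as it is read keeps only one running result per folder. It also sidesteps a bug in the original code: list1 was never reset between folders, so every folder after the first re-merged all previously read dataframes as well.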

Reading in OSM buildings geojson data into Python via geopandas

I'm having problems reading an OpenStreetMap buildings (IMPOSM GEOJSON) file into a geopandas dataframe object (Python 2.7). This is on Mac OS X 10.11.3. Here are the messages I'm getting:
>>> import geopandas as gpd
>>> df=gpd.read_file('san-francisco-bay_california_buildings.geojson')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/ewang/anaconda/lib/python2.7/site-packages/geopandas/io/file.py", line 28, in read_file
gdf = GeoDataFrame.from_features(f, crs=crs)
File "/Users/ewang/anaconda/lib/python2.7/site-packages/geopandas/geodataframe.py", line 193, in from_features
d = {'geometry': shape(f['geometry']) if f['geometry'] else None}
File "/Users/ewang/anaconda/lib/python2.7/site-packages/shapely/geometry/geo.py", line 34, in shape
return Polygon(ob["coordinates"][0], ob["coordinates"][1:])
File "/Users/ewang/anaconda/lib/python2.7/site-packages/shapely/geometry/polygon.py", line 229, in __init__
self._geom, self._ndim = geos_polygon_from_py(shell, holes)
File "/Users/ewang/anaconda/lib/python2.7/site-packages/shapely/geometry/polygon.py", line 508, in geos_polygon_from_py
geos_shell, ndim = geos_linearring_from_py(shell)
File "/Users/ewang/anaconda/lib/python2.7/site-packages/shapely/geometry/polygon.py", line 450, in geos_linearring_from_py
n = len(ob[0])
IndexError: list index out of range
The odd thing is that I can load OSM roads data IMPOSM GEOJSON files with geopandas. Am I missing something obvious here? Thanks very much.
EDIT - link to the data below:
OSM data from mapzen
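The IndexError comes from shape() receiving a polygon whose coordinate list is empty, which would explain why the roads file loads while the buildings file does not. As a workaround sketch (an assumption about the cause, not a confirmed fix), the file can be read with fiona directly, skipping features whose geometry fails to parse:
import fiona
import geopandas as gpd
from shapely.geometry import shape

records = []
with fiona.open('san-francisco-bay_california_buildings.geojson') as src:
    crs = src.crs
    for feat in src:
        try:
            geom = shape(feat['geometry']) if feat['geometry'] else None
        except (IndexError, ValueError):
            continue  # skip malformed geometries instead of aborting the whole load
        props = dict(feat['properties'])
        props['geometry'] = geom
        records.append(props)

df = gpd.GeoDataFrame(records, crs=crs)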

Using a random forest to classify reviews, but getting a KeyError?

I have the following code in Python:
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100)
forest = forest.fit(train_data_features, train["sentiment"])
but I get a KeyError for "sentiment" and I don't know why. The dataframe is loaded with:
train = pd.read_csv("labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site--packages/pandas/core/frame.py", line 1780, in __getitem__
return self._getitem_column(key)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/frame.py", line 1787, in _getitem_column
return self._get_item_cache(key)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/generic.py", line 1068, in _get_item_cache
values = self._data.get(item)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/internals.py", line 2849, in get
loc = self.items.get_loc(item)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/index.py", line 1402, in get_loc
return self._engine.get_loc(_values_from_object(key))
File "pandas/index.pyx", line 134, in pandas.index.IndexEngine.get_loc (pandas/index.c:3807)
File "pandas/index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas/index.c:3687)
File "pandas/hashtable.pyx", line 696, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12310)
File "pandas/hashtable.pyx", line 704, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12261)
KeyError: 'sentiment'
Are you doing the Kaggle competition? https://www.kaggle.com/c/word2vec-nlp-tutorial/data
Are you sure you have downloaded and decompressed the file ok? The first part of the file reads:
id sentiment review
"5814_8" 1 "With all this stuff go
This works for me:
>>> train = pd.read_csv("labeledTrainData.tsv", delimiter="\t")
>>> train.columns
Index([u'id', u'sentiment', u'review'], dtype='object')
>>> train.head(3)
id sentiment review
0 5814_8 1 With all this stuff going down at the moment w...
1 2381_9 1 \The Classic War of the Worlds\" by Timothy Hi...
2 7759_3 0 The film starts with a manager (Nicholas Bell)...
You should check that the columns are set up correctly in the train variable. You should have a sentiment column; that column seems to be missing from your dataframe.
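As a quick sanity check before fitting (a minimal sketch reusing the question's own read_csv call), print the parsed column names:
import pandas as pd

train = pd.read_csv("labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)
print(train.columns.tolist())  # expect ['id', 'sentiment', 'review']
assert 'sentiment' in train.columns, "sentiment column missing - check the file and delimiter"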