pysparkDistributedKmodes lib error - python-2.7

I'm trying to run the pyspark-distributed-kmodes example:
import numpy as np
data = np.random.choice(["a", "b", "c"], (50000, 10))
data2 = np.random.choice(["e", "f", "g"], (50000, 10))
data = list(data) + list(data2)
from random import shuffle
shuffle(data)
# Create a Spark RDD from our sample data and decrease partitions to max_partitions
max_partitions = 32
rdd = sc.parallelize(data)
rdd = rdd.coalesce(max_partitions)
for x in rdd.take(10):
    print x
method = EnsembleKModes(n_clusters, max_iter)
model = method.fit(df.rdd)
print(model.clusters)
print(method.mean_cost)
predictions = method.predictions
datapoints = method.indexed_rdd
combined = datapoints.zip(predictions)
print(combined.take(10))
model.predict(rdd).take(5)
I'm using Python 2.7, Apache Zeppelin 0.7.1 and Apache Spark 2.1.0.
This is the output error:
('Iteration ', 0)
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-1298251609305129154.py", line 349, in <module>
raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-1298251609305129154.py", line 337, in <module>
exec(code)
File "<stdin>", line 13, in <module>
File "/usr/local/lib/python2.7/dist-packages/pyspark_kmodes/pyspark_kmodes.py", line 430, in fit
self.n_clusters,self.max_dist_iter)
File "/usr/local/lib/python2.7/dist-packages/pyspark_kmodes/pyspark_kmodes.py", line 271, in k_modes_partitioned
clusters = check_for_empty_cluster(clusters, rdd)
File "/usr/local/lib/python2.7/dist-packages/pyspark_kmodes/pyspark_kmodes.py", line 317, in check_for_empty_cluster
random_element = random.choice(clusters[biggest_cluster].members)
File "/usr/lib/python2.7/random.py", line 275, in choice
return seq[int(self.random() * len(seq))] # raises IndexError if seq is empty
IndexError: list index out of range
The RDD used to fit the model is not empty; I've checked it. I think this is a version incompatibility between pyspark-distributed-kmodes and Spark, but I can't downgrade Spark.
Any idea how to fix it?

What is df? This doesn't look like a Spark error. The code from https://github.com/ThinkBigAnalytics/pyspark-distributed-kmodes works for me under Spark 2.1.0. It also works when I change this line of your code to:
method = EnsembleKModes(n_clusters, max_iter)
model = method.fit(rdd)
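For reference, here is a minimal end-to-end sketch of the same example fitted directly on the RDD. The import path is inferred from the traceback above, and n_clusters / max_iter are not defined in the original snippet, so the values below are purely illustrative:
import numpy as np
from random import shuffle
# Import path assumed from the traceback; depending on the package version it
# may instead be: from pyspark_kmodes.pyspark_kmodes import EnsembleKModes
from pyspark_kmodes import EnsembleKModes

# Same synthetic categorical data as above
data = np.random.choice(["a", "b", "c"], (50000, 10))
data2 = np.random.choice(["e", "f", "g"], (50000, 10))
data = list(data) + list(data2)
shuffle(data)

# sc is the SparkContext provided by Zeppelin / pyspark
rdd = sc.parallelize(data).coalesce(32)

n_clusters = 4   # illustrative value, not from the original snippet
max_iter = 10    # illustrative value, not from the original snippet

method = EnsembleKModes(n_clusters, max_iter)
model = method.fit(rdd)   # fit on the RDD itself, not df.rdd
print(model.clusters)
print(method.mean_cost)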

Related

Seaborn KDE visualisation, value error on dataset

I am attempting to create a KDE plot in Seaborn, but I'm encountering an error when passing in the data.
The data is a set of scores ranging from 1 to 13 and is in the form of a NumPy array.
Below is the section of code I'm using.
query_CNM = 'SELECT SCORE from CNMATCH LIMIT 2000'
df = pd.read_sql(query_CNM, conn, index_col = None)
yy = np.array(df)
plot = sns.kdeplot(yy)
Below is the full error that I'm receiving.
Traceback (most recent call last):
File "/Applications/PyCharm CE.app/Contents/helpers/pydev/pydevd.py", line 1758, in <module>
main()
File "/Applications/PyCharm CE.app/Contents/helpers/pydev/pydevd.py", line 1752, in main
globals = debugger.run(setup['file'], None, None, is_module)
File "/Applications/PyCharm CE.app/Contents/helpers/pydev/pydevd.py", line 1147, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "/Users/uni/Desktop/Proof_Of_Concept/PYQTKDE.py", line 66, in <module>
plot = sns.kdeplot(yy)
File "/Users/uni/.conda/envs/fing.py/lib/python2.7/site-packages/seaborn/distributions.py", line 664, in kdeplot
x, y = data.T
ValueError: need more than 1 value to unpack
I can't seem to find exactly how the data needs to be formatted for Seaborn in order to fit a KDE. If any insights can be provided on this, it would be greatly appreciated.
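For what it's worth, here is a minimal sketch of one way around this, assuming the problem is that np.array(df) on a single-column DataFrame yields a 2-D array of shape (2000, 1), which this seaborn version then tries to unpack as bivariate (x, y) data:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Stand-in for df['SCORE']; the real data comes from the SQL query above
scores = pd.Series([1, 5, 7, 13, 3, 8, 9, 2])

# Flatten to 1-D so kdeplot treats it as a single univariate distribution
sns.kdeplot(np.asarray(scores).ravel())
plt.show()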

NoneType has no attribute 'select' KerasDML SystemML

I'm having an issue while running the example Keras2DML code from this page. While running it, I get this error:
Traceback (most recent call last):
File "/home/fregy/kerasplayground/sysml/examplenn.py", line 12, in <module>
sysml_model = Keras2DML(spark, keras_model,input_shape=(3,224,224))
File "/usr/local/lib/python2.7/dist-packages/systemml/mllearn/estimators.py", line 909, in __init__
convertKerasToCaffeNetwork(keras_model, self.name + ".proto")
File "/usr/local/lib/python2.7/dist-packages/systemml/mllearn/keras2caffe.py", line 201, in convertKerasToCaffeNetwork
jsonLayers = list(chain.from_iterable(imap(lambda layer: _parseKerasLayer(layer), kerasModel.layers)))
File "/usr/local/lib/python2.7/dist-packages/systemml/mllearn/keras2caffe.py", line 201, in <lambda>
jsonLayers = list(chain.from_iterable(imap(lambda layer: _parseKerasLayer(layer), kerasModel.layers)))
File "/usr/local/lib/python2.7/dist-packages/systemml/mllearn/keras2caffe.py", line 137, in _parseKerasLayer
ret = { 'layer': { 'name':layer.name, 'type':supportedLayers[layerType], 'bottom':_getBottomLayers(layer), 'top':layer.name, paramName:param[paramName] } }
File "/usr/local/lib/python2.7/dist-packages/systemml/mllearn/keras2caffe.py", line 112, in _getBottomLayers
return [ bottomLayer.name for bottomLayer in _getInboundLayers(layer) ]
File "/usr/local/lib/python2.7/dist-packages/systemml/mllearn/keras2caffe.py", line 70, in _getInboundLayers
for node in layer.inbound_nodes: # get inbound nodes to current layer
AttributeError: 'Conv2D' object has no attribute 'inbound_nodes'
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/lib/python2.7/SocketServer.py", line 230, in serve_forever
r, w, e = _eintr_retry(select.select, [self], [], [],
AttributeError: 'NoneType' object has no attribute 'select'
I'm using TensorFlow-GPU 1.5 and Keras 2.1.3.
Thanks for trying out Keras2DML. The issue arises because the newer Keras versions renamed the attribute inbound_nodes to _inbound_nodes. This issue was fixed in yesterday's commit: https://github.com/apache/systemml/commit/9c3057a34c84d5bf1c698ad0a5c3c34d90412dbb.
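If you want to confirm which attribute name your installed Keras exposes before upgrading, here is a small sketch (the throwaway model is only for illustration):
from keras.layers import Input, Conv2D
from keras.models import Model

# Build a minimal model with a single Conv2D layer
inp = Input(shape=(224, 224, 3))
out = Conv2D(8, (3, 3))(inp)
model = Model(inp, out)

layer = model.layers[-1]
print(hasattr(layer, 'inbound_nodes'))    # True if this Keras version still uses the old name
print(hasattr(layer, '_inbound_nodes'))   # True if the attribute was renamed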
Since you are using TensorFlow-GPU, you may want to check (with nvidia-smi) whether TF grabs most of the GPU memory when the Keras model is compiled. If so, here are two easy workarounds:
a. Hide GPUs from TF:
import os
os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'
os.environ['CUDA_VISIBLE_DEVICES'] = ''
import tensorflow as tf
b. Or minimize the overhead due to TensorFlow:
import tensorflow as tf
from keras.backend.tensorflow_backend import set_session
tf_config = tf.ConfigProto()
tf_config.gpu_options.allow_growth = True
set_session(tf.Session(config=tf_config))

Google Vision Python 2.7 TypeError: construct_settings() got an unexpected keyword argument 'metrics_headers'

After installing the required packages using pip, downloading a JSON key, and setting the environment variable in the cmd window with set GOOGLE_APPLICATION_CREDENTIALS = 'C:\Users\ xxx .json', I followed the instructions for using the Google Vision API at https://googlecloudplatform.github.io/google-cloud-python/stable/vision-usage.html#authentication-and-configuration.
I then tried the following and got the error below. I have no idea how to solve it, so all suggestions are much appreciated:
>>> from google.cloud import vision
>>> client = vision.Client()
>>> print client
<google.cloud.vision.client.Client object at 0x08D414F0>
>>> image = client.image(filename='test2.jpg')
>>> print image
<google.cloud.vision.image.Image object at 0x0CBF68F0>
>>> text = image.detect_text()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\google\cloud\vision\image.py", line 289, in detect_text
annotations = self.detect(features)
File "C:\Python27\lib\site-packages\google\cloud\vision\image.py", line 143, in detect
return self._detect_annotation(images)
File "C:\Python27\lib\site-packages\google\cloud\vision\image.py", line 117, in _detect_annotation
return self.client._vision_api.annotate(images)
File "C:\Python27\lib\site-packages\google\cloud\vision\client.py", line 114, in _vision_api
self._vision_api_internal = _GAPICVisionAPI(self)
File "C:\Python27\lib\site-packages\google\cloud\vision\_gax.py", line 34, in __init__
lib_version=__version__)
File "C:\Python27\lib\site-packages\google\cloud\gapic\vision\v1\image_annotator_client.py", line 140, in __init__
metrics_headers=metrics_headers, )
TypeError: construct_settings() got an unexpected keyword argument 'metrics_headers'
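A construct_settings() TypeError like this usually points to a version mismatch between google-cloud-vision and its google-gax / gapic dependencies rather than anything in your code, so upgrading the google-cloud-vision package (which pulls in compatible dependency versions) is likely the first thing to try. Separately, note that in cmd, set GOOGLE_APPLICATION_CREDENTIALS = '...' keeps the surrounding spaces and quotes as part of the variable, so the client may not find the key file even once the TypeError is gone. A minimal sketch of setting the path from Python instead (the path below is a placeholder):
import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = r'C:\path\to\key.json'

from google.cloud import vision
client = vision.Client()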

Reading in OSM buildings geojson data into Python via geopandas

I'm having problems reading an OpenStreetMap buildings (IMPOSM GeoJSON) file into a geopandas GeoDataFrame (Python 2.7). This is on Mac OS X 10.11.3. Here are the messages I'm getting:
>>> import geopandas as gpd
>>> df=gpd.read_file('san-francisco-bay_california_buildings.geojson')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/ewang/anaconda/lib/python2.7/site-packages/geopandas/io/file.py", line 28, in read_file
gdf = GeoDataFrame.from_features(f, crs=crs)
File "/Users/ewang/anaconda/lib/python2.7/site-packages/geopandas/geodataframe.py", line 193, in from_features
d = {'geometry': shape(f['geometry']) if f['geometry'] else None}
File "/Users/ewang/anaconda/lib/python2.7/site-packages/shapely/geometry/geo.py", line 34, in shape
return Polygon(ob["coordinates"][0], ob["coordinates"][1:])
File "/Users/ewang/anaconda/lib/python2.7/site-packages/shapely/geometry/polygon.py", line 229, in __init__
self._geom, self._ndim = geos_polygon_from_py(shell, holes)
File "/Users/ewang/anaconda/lib/python2.7/site-packages/shapely/geometry/polygon.py", line 508, in geos_polygon_from_py
geos_shell, ndim = geos_linearring_from_py(shell)
File "/Users/ewang/anaconda/lib/python2.7/site-packages/shapely/geometry/polygon.py", line 450, in geos_linearring_from_py
n = len(ob[0])
IndexError: list index out of range
The odd thing is that I can load OSM roads IMPOSM GeoJSON files with geopandas. Am I missing something obvious here? Thanks very much.
EDIT - link to the data below:
OSM data from mapzen
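A sketch of one way to narrow this down, assuming the IndexError comes from building features whose geometry has an empty outer ring (which is what the failure in geos_linearring_from_py suggests): read the file with fiona, drop those features, and build the GeoDataFrame from the rest. The filter below assumes Polygon-type footprints:
import fiona
import geopandas as gpd

def has_outer_ring(geom):
    # Keep only features whose coordinate array and outer ring are non-empty
    if not geom:
        return False
    coords = geom.get('coordinates')
    return bool(coords) and bool(coords[0])

path = 'san-francisco-bay_california_buildings.geojson'
with fiona.open(path) as src:
    crs = src.crs
    features = [f for f in src if has_outer_ring(f['geometry'])]

df = gpd.GeoDataFrame.from_features(features, crs=crs)
print(len(df))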

NotSupportedError when trying to build primary index in N1QL in Couchbase Python SDK

I'm trying to get started with the new N1QL queries for Couchbase in Python.
I have my database set up in Couchbase 4.0.0.
My initial try was to retrieve all documents like this:
from couchbase.bucket import Bucket
bucket = Bucket('couchbase://localhost/default')
rv = bucket.n1ql_query('CREATE PRIMARY INDEX ON default').execute()
for row in bucket.n1ql_query('SELECT * FROM default'):
    print row
But this produces a NotSupportedError:
Traceback (most recent call last):
File "/Applications/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 2357, in <module>
globals = debugger.run(setup['file'], None, None, is_module)
File "/Applications/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 1777, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "/Users/my_user/python_tests/test_n1ql.py", line 9, in <module>
rv = bucket.n1ql_query('CREATE PRIMARY INDEX ON default').execute()
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/couchbase/n1ql.py", line 215, in execute
for _ in self:
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/couchbase/n1ql.py", line 235, in __iter__
self._start()
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/couchbase/n1ql.py", line 180, in _start
self._mres = self._parent._n1ql_query(self._params.encoded)
couchbase.exceptions.NotSupportedError: <RC=0x13[Operation not supported], Couldn't schedule n1ql query, C Source=(src/n1ql.c,82)>
Here the version numbers of everything I use:
Couchbase Server: 4.0.0
couchbase python library: 2.0.2
cbc: 2.5.1
python: 2.7.8
gcc: 4.2.1
Does anyone have an idea what might have gone wrong here? I could not find any solution to this problem so far.
There was another ticket for Node.js where the same issue happened. There was a proposal to enable N1QL for the specific bucket first. Is this also needed in Python?
It would seem you didn't configure any cluster nodes with the Query or Index services. As such, the error returned is one that indicates no nodes are available.
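A quick way to confirm that is to ask the cluster which services each node runs. A small sketch against the admin REST API (host, port 8091, and credentials below are placeholders):
import requests

resp = requests.get('http://localhost:8091/pools/default',
                    auth=('Administrator', 'password'))
for node in resp.json()['nodes']:
    # Each node lists its services, e.g. ['kv'], ['n1ql', 'index'], ...
    print('%s: %s' % (node['hostname'], node.get('services', [])))
If no node lists n1ql and index, the SDK cannot schedule the query, which matches the "Couldn't schedule n1ql query" message above.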
I also got a similar error while trying to create a primary index:
Create a primary index...
Traceback (most recent call last):
File "post-upgrade-test.py", line 45, in <module>
mgr.n1ql_index_create_primary(ignore_exists=True)
File "/usr/local/lib/python2.7/dist-packages/couchbase/bucketmanager.py", line 428, in n1ql_index_create_primary
'', defer=defer, primary=True, ignore_exists=ignore_exists)
File "/usr/local/lib/python2.7/dist-packages/couchbase/bucketmanager.py", line 412, in n1ql_index_create
return IxmgmtRequest(self._cb, 'create', info, **options).execute()
File "/usr/local/lib/python2.7/dist-packages/couchbase/_ixmgmt.py", line 160, in execute
return [x for x in self]
File "/usr/local/lib/python2.7/dist-packages/couchbase/_ixmgmt.py", line 144, in __iter__
self._start()
File "/usr/local/lib/python2.7/dist-packages/couchbase/_ixmgmt.py", line 132, in _start
self._cmd, index_to_rawjson(self._index), **self._options)
couchbase.exceptions.NotSupportedError: <RC=0x13[Operation not supported], Couldn't schedule ixmgmt operation, C Source=(src/ixmgmt.c,98)>
Adding a query and index node to the cluster solved the issue.
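For completeness, a minimal sketch of the same index creation once a node with the Query and Index services is present (the connection string and bucket name are illustrative):
from couchbase.bucket import Bucket

bucket = Bucket('couchbase://localhost/default')

# Same manager call as in the traceback above; succeeds once an index/query node exists
mgr = bucket.bucket_manager()
mgr.n1ql_index_create_primary(ignore_exists=True)

for row in bucket.n1ql_query('SELECT * FROM default LIMIT 5'):
    print(row)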