Python - Shapely large shapefiles - python-2.7

I am reading in a GeoJSON file that contains two simple polygon descriptions that I made and six complicated vectors from http://ryanmullins.org/blog/2015/8/18/land-area-vectors-for-geographic-combatant-commands.
I can read my own 4-8 point descriptions into Shapely Polygons. However, the more complicated descriptions from the website above fail with the error shown below. The code (simplified for testing):
import json
from shapely.geometry import Polygon

jsonFile = "path/to/file.json"
with open(jsonFile) as f:
    data = json.load(f)

for feature in data['features']:
    # This is not how I'm saving the polygons; it is only for testing purposes:
    myPoly = Polygon(feature['geometry']['coordinates'])
The error message:
File "/.../anaconda2/lib/python2.7/site-packages/shapely/geometry/polygon.py", line 229, in __init__
self._geom, self._ndim = geos_polygon_from_py(shell, holes)
File "/.../anaconda2/lib/python2.7/site-packages/shapely/geometry/polygon.py", line 508, in geos_polygon_from_py
geos_shell, ndim = geos_linearring_from_py(shell)
File "/.../anaconda2/lib/python2.7/site-packages/shapely/geometry/polygon.py", line 454, in geos_linearring_from_py
assert (n == 2 or n == 3)
AssertionError
The coordinates are read in as a list, with USAFRICOM's having length 113.
Is there a way to read these very long vectors into Shapely? I have tried Polygon, MultiPoint, and asMultiPoint. If not, could you suggest how to simplify this vector description into something Shapely can read?

Well it gets a bit more complicated than just throwing all coordinates at Shapely in one go.
According to the GeoJSON spec and Shapely's documentation regarding multipolygons: MultiPolygons consist of arrays of Polygons, and Polygons again consist of LinearRings which depict outer and inner areas / holes.
Here's my attempt at a MultiPolygon reader for your GeoJSON files; opening the output in QGIS shows it rendered correctly. Let me know if you have issues.
import json
from shapely.geometry import mapping
from shapely.geometry import Polygon
from shapely.geometry import LinearRing
from shapely.geometry import MultiPolygon

jsonFile = "USCENTCOM.json"
polygons = []

with open(jsonFile) as f:
    data = json.load(f)

for feature in data['features']:
    for multi_polygon in feature['geometry']['coordinates']:
        # collect coordinates (LinearRing coordinates) for the Polygon
        tmp_poly = []
        for polygon in multi_polygon:
            tmp_poly.append(polygon)
        if len(tmp_poly) > 1:
            # exterior LinearRing at [0], all following interior/"holes"
            polygons.append(Polygon(tmp_poly[0], tmp_poly[1:]))
        else:
            # there's just the exterior LinearRing
            polygons.append(Polygon(tmp_poly[0]))

# finally generate the MultiPolygon from our Polygons
mp = MultiPolygon(polygons)

# print GeoJSON string
print(json.dumps(mapping(mp)))
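As a side note (not part of the original answer): Shapely also ships a shape() helper that builds geometries straight from GeoJSON-like mappings, which avoids the manual ring handling. A minimal sketch, assuming data has already been loaded with json.load as above:
from shapely.geometry import shape

# shape() turns each GeoJSON geometry dict into a Polygon or MultiPolygon,
# including any interior rings/holes.
geoms = [shape(feature['geometry']) for feature in data['features']]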

Related

Saving data from traceplot in PyMC3

Below is the code for a simple Bayesian linear regression. After I obtain the trace and the parameter plots, is there a way to save the data behind the plots to a file, so that I can re-plot it later from that file rather than re-running the whole simulation?
import pymc3 as pm
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 9, 5)
y = 2*x + 5
yerr = np.random.rand(len(x))

def soln(x, p1, p2):
    return p1 + p2*x

with pm.Model() as model:
    # Define priors
    intercept = pm.Normal('Intercept', 15, sd=5)
    slope = pm.Normal('Slope', 20, sd=5)
    # Model solution
    sol = soln(x, intercept, slope)
    # Define likelihood
    likelihood = pm.Normal('Y', mu=sol, sd=yerr, observed=y)
    # Sampling
    trace = pm.sample(1000, nchains=1)

pm.traceplot(trace)
print pm.summary(trace, ['Slope'])
print pm.summary(trace, ['Intercept'])
plt.show()
There are two easy ways of doing this:
Use a version after 3.4.1 (currently this means installing from master, with pip install git+https://github.com/pymc-devs/pymc3). There is a new feature that allows saving and loading traces efficiently. Note that you need access to the model that created the trace:
...
pm.save_trace(trace, 'linreg.trace')

# later
with model:
    trace = pm.load_trace('linreg.trace')
Use cPickle (or pickle in Python 3). Note that pickle is at least a little insecure; don't unpickle data from untrusted sources:
import cPickle as pickle  # just `import pickle` on python 3
...
with open('trace.pkl', 'wb') as buff:
    pickle.dump(trace, buff)

# later
with open('trace.pkl', 'rb') as buff:
    trace = pickle.load(buff)
Update for anyone who, like me, is still coming across this question:
The load_trace and save_trace functions were removed. Since version 4.0 even the deprecation warning for these functions is gone.
The way to do it is now to use arviz:
import arviz
import pymc

with model:
    trace = pymc.sample(return_inferencedata=True)
trace.to_netcdf("filename.nc")
And it can be loaded with:
trace = arviz.from_netcdf("filename.nc")
This approach works for me:
# saving trace
pm.save_trace(trace=trace_nb, directory=r"c:\Users\xxx\Documents\xxx\traces\trace_nb")

# loading saved traces
with model_nb:
    t_nb = pm.load_trace(directory=r"c:\Users\xxx\Documents\xxx\traces\trace_nb")

Load HDF file into list of Python Dask DataFrames

I have an HDF5 file that I would like to load into a list of Dask DataFrames. I set this up using a loop, following an abbreviated version of the Dask pipeline approach. Here is the code:
import pandas as pd
from dask import compute, delayed
import dask.dataframe as dd
import os, h5py

@delayed
def load(d, k):
    ddf = dd.read_hdf(os.path.join(d, 'Cleaned.h5'), key=k)
    return ddf

if __name__ == '__main__':
    d = 'C:\Users\User\FileD'
    loaded = [load(d, '/DF' + str(i)) for i in range(1, 10)]
    ddf_list = compute(*loaded)
    print(ddf_list[0].head(), ddf_list[0].compute().shape)
I get this error message:
C:\Python27\lib\site-packages\tables\group.py:1187: UserWarning: problems loading leaf ``/DF1/table``::
HDF5 error back trace
File "..\..\hdf5-1.8.18\src\H5Dio.c", line 173, in H5Dread
can't read data
File "..\..\hdf5-1.8.18\src\H5Dio.c", line 543, in H5D__read
can't initialize I/O info
File "..\..\hdf5-1.8.18\src\H5Dchunk.c", line 841, in H5D__chunk_io_init
unable to create file chunk selections
File "..\..\hdf5-1.8.18\src\H5Dchunk.c", line 1330, in H5D__create_chunk_file_map_hyper
can't insert chunk into skip list
File "..\..\hdf5-1.8.18\src\H5SL.c", line 1066, in H5SL_insert
can't create new skip list node
File "..\..\hdf5-1.8.18\src\H5SL.c", line 735, in H5SL_insert_common
can't insert duplicate key
End of HDF5 error back trace
Problems reading the array data.
The leaf will become an ``UnImplemented`` node.
% (self._g_join(childname), exc))
The message mentions a duplicate key. I iterated over the first 9 keys to test out the code; in the loop, each iteration assembles a different key that I use with dd.read_hdf. Across all iterations the filename stays the same - only the key changes.
I need to use dd.concat(list,axis=0,...) in order to vertically concatenate the contents of the file. My approach was to load them into a list first and then concatenate them.
I have installed PyTables and h5py, and I have Dask version 0.14.3+2.
With Pandas 0.20.1, I seem to get this to work:
for i in range(1, 10):
    hdf = pd.HDFStore(os.path.join(d, 'Cleaned.h5'), mode='r')
    df = hdf.get('/DF{}'.format(i))
    print df.shape
    hdf.close()
Is there a way I can load this HDF5 file into a list of Dask DataFrames? Or is there another approach to vertically concatenate them together?
Dask.dataframe is already lazy, so there is no need to use dask.delayed to make it lazier. You can just call dd.read_hdf repeatedly:
ddfs = [dd.read_hdf(os.path.join(d, 'Cleaned.h5'), key=k)
        for k in keys]
ddf = dd.concat(ddfs)
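Here keys is just the list of HDF keys to read. A small sketch of how it could be built, assuming the same '/DF1' ... '/DF9' naming scheme as in the question:
# Build the key names used in the question before calling dd.read_hdf
keys = ['/DF{}'.format(i) for i in range(1, 10)]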

Invalid parameter error when doing python scikit-learn grid-search method

I am trying to learn how to find the optimal hyperparameters in decision trees classifier using the GridSearchCV() method from scikit-learn.
It works fine if I specify just one parameter's options, as in the following:
print(__doc__)

# Code source: Gael Varoquaux
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause

from sklearn import datasets
from sklearn.grid_search import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# define classifier
dt = DecisionTreeClassifier()

# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features.
y = iris.target

# define parameter values that should be searched
min_samples_split_options = range(2, 4)

# create a parameter grid: map the parameter names to the values that should be searched
param_grid_dt = dict(min_samples_split=min_samples_split_options)  # for DT

# instantiate the grid
grid = GridSearchCV(dt, param_grid_dt, cv=10, scoring='accuracy')

# fit the grid with param
grid.fit(X, y)

# view complete results
grid.grid_scores_

'''# examine results from first tuple
print grid.grid_scores_[0].parameters
print grid.grid_scores_[0].cv_validation_scores
print grid.grid_scores_[0].mean_validation_score'''

# examine the best model
print '*******Final results*********'
print grid.best_score_
print grid.best_params_
print grid.best_estimator_
Result:
None
*******Final results*********
0.68
{'min_samples_split': 3}
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
min_samples_split=3, min_weight_fraction_leaf=0.0,
presort=False, random_state=None, splitter='best')
But when I add another parameter's options into the mix, it gives me an "invalid parameter" error, as follows:
print(__doc__)

# Code source: Gael Varoquaux
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause

from sklearn import datasets
from sklearn.grid_search import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# define classifier
dt = DecisionTreeClassifier()

# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features.
y = iris.target

# define parameter values that should be searched
max_depth_options = range(10, 251)  # for DT
min_samples_split_options = range(2, 4)

# create a parameter grid: map the parameter names to the values that should be searched
param_grid_dt = dict(max_depth=max_depth_options, min_sample_split=min_samples_split_options)  # for DT

# instantiate the grid
grid = GridSearchCV(dt, param_grid_dt, cv=10, scoring='accuracy')

# fit the grid with param
grid.fit(X, y)

'''# view complete results
grid.grid_scores_
# examine results from first tuple
print grid.grid_scores_[0].parameters
print grid.grid_scores_[0].cv_validation_scores
print grid.grid_scores_[0].mean_validation_score
# examine the best model
print '*******Final results*********'
print grid.best_score_
print grid.best_params_
print grid.best_estimator_'''
Result:
None
Traceback (most recent call last):
File "C:\Users\KubiK\Desktop\GridSearch_ex6.py", line 31, in <module>
grid.fit(X, y)
File "C:\Users\KubiK\Anaconda2\lib\site-packages\sklearn\grid_search.py", line 804, in fit
return self._fit(X, y, ParameterGrid(self.param_grid))
File "C:\Users\KubiK\Anaconda2\lib\site-packages\sklearn\grid_search.py", line 553, in _fit
for parameters in parameter_iterable
File "C:\Users\KubiK\Anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.py", line 800, in __call__
while self.dispatch_one_batch(iterator):
File "C:\Users\KubiK\Anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.py", line 658, in dispatch_one_batch
self._dispatch(tasks)
File "C:\Users\KubiK\Anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.py", line 566, in _dispatch
job = ImmediateComputeBatch(batch)
File "C:\Users\KubiK\Anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.py", line 180, in __init__
self.results = batch()
File "C:\Users\KubiK\Anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.py", line 72, in __call__
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "C:\Users\KubiK\Anaconda2\lib\site-packages\sklearn\cross_validation.py", line 1520, in _fit_and_score
estimator.set_params(**parameters)
File "C:\Users\KubiK\Anaconda2\lib\site-packages\sklearn\base.py", line 270, in set_params
(key, self.__class__.__name__))
ValueError: Invalid parameter min_sample_split for estimator DecisionTreeClassifier. Check the list of available parameters with `estimator.get_params().keys()`.
[Finished in 0.3s]
There's a typo in your code: it should be min_samples_split, not min_sample_split.
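For reference, the corrected parameter grid would be as follows (the error message itself points the way: dt.get_params().keys() lists the valid parameter names):
# "samples" is plural in DecisionTreeClassifier's parameter name
param_grid_dt = dict(max_depth=max_depth_options,
                     min_samples_split=min_samples_split_options)  # for DT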

feed pre-computed estimates to TfidfVectorizer

I trained an instance of scikit-learn's TfidfVectorizer and I want to persist it to disk. I saved the IDF matrix (the idf_ attribute) to disk as a numpy array and I saved the vocabulary (vocabulary_) to disk as a JSON object (I'm avoiding pickle, for security and other reasons). I'm trying to do this:
import json
from idf import idf  # numpy array with the pre-computed IDFs
from sklearn.feature_extraction.text import TfidfVectorizer

# dirty trick so I can plug my pre-computed IDFs
# necessary because "vectorizer.idf_ = idf" doesn't work,
# it returns "AttributeError: can't set attribute."
class MyVectorizer(TfidfVectorizer):
    TfidfVectorizer.idf_ = idf

# instantiate vectorizer
vectorizer = MyVectorizer(lowercase=False,
                          min_df=2,
                          norm='l2',
                          smooth_idf=True)

# plug vocabulary
vocabulary = json.load(open('vocabulary.json', mode='rb'))
vectorizer.vocabulary_ = vocabulary

# test it
vectorizer.transform(['foo bar'])
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 1314, in transform
return self._tfidf.transform(X, copy=False)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 1014, in transform
check_is_fitted(self, '_idf_diag', 'idf vector is not fitted')
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/utils/validation.py", line 627, in check_is_fitted
raise NotFittedError(msg % {'name': type(estimator).__name__})
sklearn.utils.validation.NotFittedError: idf vector is not fitted
So, what am I doing wrong? I'm failing to fool the vectorizer object: somehow it knows that I'm cheating (i.e., passing it pre-computed data and not training it with actual text). I inspected the attributes of the vectorizer object but I can't find anything like 'istrained', 'isfitted', etc. So, how do I fool the vectorizer?
OK, I think I got it: the vectorizer instance has an attribute _tfidf, which in turn must have an attribute _idf_diag. The transform method calls a check_is_fitted function that checks whether that _idf_diag exists. (I had missed it because it's an attribute of an attribute.) So I inspected the TfidfVectorizer source code to see how _idf_diag is created, and then just added it to the _tfidf attribute:
import scipy.sparse as sp

# ... code ...

vectorizer._tfidf._idf_diag = sp.spdiags(idf,
                                         diags=0,
                                         m=len(idf),
                                         n=len(idf))
And now the vectorization works.
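One extra check worth adding (my suggestion, not part of the original answer): the number of pre-computed IDF weights should match the vocabulary size, otherwise transform() will silently produce misaligned features.
# idf is the saved numpy array, vocabulary the dict loaded from vocabulary.json above
assert len(idf) == len(vocabulary), "IDF array and vocabulary are out of sync"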

How to use list of strings as training data for svm using scikit.learn?

I am using scikit-learn to train an SVM on data where each observation (X) is a list of words. The tags for each observation (Y) are floating point values. I have tried following the example given in the scikit-learn documentation (http://scikit-learn.org/stable/modules/svm.html) for multi-class classification.
Here is my code:
from __future__ import division
from sklearn import svm
import os.path
import numpy
import re

'''
The stanford-postagger was included to see how it tags the words and to see if it would help in getting just the names
of the ingredients. Turns out it's pointless.
'''
#from nltk.tag.stanford import POSTagger

mainDirectory = './nyu/PROJECTS/Epicurious/DATA/ingredients'
#st = POSTagger('/usr/share/stanford-postagger/models/english-bidirectional-distsim.tagger','/usr/share/stanford-postagger/stanford-postagger.jar')

'''
This is where we read each line of the file and then run a regex match on it to get all the words before
the first tab. (These are the names of the ingredients. Some of them may have adjectives like fresh, peeled, cut etc.
Not sure what to do about them yet.)
'''
def getFileDetails(_filename, _fileDescriptor):
    rankingRegexMatch = re.match('([0-9](?:\_)[0-9]?)', _filename)
    if len(rankingRegexMatch.group(0)) == 2:
        ranking = float(rankingRegexMatch.group(0)[0])
    else:
        ranking = float(rankingRegexMatch.group(0)[0] + '.' + rankingRegexMatch.group(0)[2])
    _keywords = []
    for line in _fileDescriptor:
        m = re.match('(\w+\s*\w*)(?=\t[0-9])', line)
        if m:
            _keywords.append(m.group(0))
    return [_keywords, ranking]

'''
Open each file in the directory and pass the name and file descriptor to getFileDetails
'''
def this_is_it(files):
    _allKeywords = []
    _allRankings = []
    for eachFile in files:
        fullFilePath = mainDirectory + '/' + eachFile
        f = open(fullFilePath)
        XandYForThisFile = getFileDetails(eachFile, f)
        _allKeywords.append(XandYForThisFile[0])
        _allRankings.append(XandYForThisFile[1])
    #_allKeywords = numpy.array(_allKeywords, dtype=object)
    svm_learning(_allKeywords, _allRankings)

def svm_learning(x, y):
    clf = svm.SVC()
    clf.fit(x, y)

'''
This just prints the directory path and then calls the callback x on files
'''
def print_files(x, dir_path, files):
    print dir_path
    x(files)

'''
code starts here
'''
os.path.walk(mainDirectory, print_files, this_is_it)
When the svm_learning(x,y) method is called, it throws me an error:
Traceback (most recent call last):
File "scan for files.py", line 72, in <module>
os.path.walk(mainDirectory, print_files, this_is_it)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/posixpath.py", line 238, in walk
func(arg, top, names)
File "scan for files.py", line 68, in print_files
x(files)
File "scan for files.py", line 56, in this_is_it
svm_learning(_allKeywords,_allRankings)
File "scan for files.py", line 62, in svm_learning
clf.fit(x,y)
File "/Library/Python/2.7/site-packages/scikit_learn-0.14_git-py2.7-macosx-10.8-intel.egg/sklearn/svm/base.py", line 135, in fit
X = atleast2d_or_csr(X, dtype=np.float64, order='C')
File "/Library/Python/2.7/site-packages/scikit_learn-0.14_git-py2.7-macosx-10.8-intel.egg/sklearn/utils/validation.py", line 116, in atleast2d_or_csr
"tocsr")
File "/Library/Python/2.7/site-packages/scikit_learn-0.14_git-py2.7-macosx-10.8-intel.egg/sklearn/utils/validation.py", line 96, in _atleast2d_or_sparse
X = array2d(X, dtype=dtype, order=order, copy=copy)
File "/Library/Python/2.7/site-packages/scikit_learn-0.14_git-py2.7-macosx-10.8-intel.egg/sklearn/utils/validation.py", line 80, in array2d
X_2d = np.asarray(np.atleast_2d(X), dtype=dtype, order=order)
File "/Library/Python/2.7/site-packages/numpy-1.8.0.dev_bbcfcf6_20130307-py2.7-macosx-10.8-intel.egg/numpy/core/numeric.py", line 331, in asarray
return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence.
Can anyone help? I am new to scikit and could not find any help in the documentation.
You should take a look at Text feature extraction. You are going to want to use either a TfidfVectorizer, a CountVectorizer, or a HashingVectorizer (if your data is very large). These components take your text in and output feature matrices that are acceptable to classifiers. Be advised that they work on lists of strings, with one string per example, so if you have a list of lists of strings (i.e., you have already tokenized), you may need to either join() the tokens to get a list of strings or skip tokenization.
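A minimal sketch of that suggestion, using hypothetical stand-ins for the question's _allKeywords and _allRankings (swap svm.SVC for svm.SVR if the float tags are meant as continuous targets):
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical tokenized documents and float tags, standing in for
# _allKeywords and _allRankings from the question
tokenized_docs = [['fresh', 'basil', 'leaves'], ['peeled', 'garlic'], ['olive', 'oil']]
rankings = [4.5, 3.0, 4.0]

# The vectorizers expect one string per example, so join the tokens back together
docs = [' '.join(tokens) for tokens in tokenized_docs]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse feature matrix the SVM can consume

clf = svm.SVC()
clf.fit(X, rankings)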