Turi Create - Please use dropna() to drop rows - python-2.7

I am having issues with Apple Turi Create and image classifier. I have successfully created a model with 22 categories. I have recently added 5 more categories and console is giving me error warning
Please use dropna() to drop rows with missing target values.
The full console log looks like this:
[16:30:30] src/nnvm/legacy_json_util.cc:190: Loading symbol saved by previous version v0.8.0. Attempting to upgrade...
[16:30:30] src/nnvm/legacy_json_util.cc:198: Symbol successfully upgraded!
Resizing images...
Performing feature extraction on resized images...
Premature end of JPEG file
Completed 512/1883
Completed 1024/1883
Completed 1536/1883
Completed 1883/1883
PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
You can set ``validation_set=None`` to disable validation tracking.
[ERROR] turicreate.toolkits._main: Toolkit error: Target column has missing value.
Please use dropna() to drop rows with missing target values.
Traceback (most recent call last):
File "train.py", line 8, in <module>
model = tc.image_classifier.create(train_data, target='label', max_iterations=1000)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/turicreate/toolkits/image_classifier/image_classifier.py", line 132, in create
verbose=verbose)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/turicreate/toolkits/classifier/logistic_classifier.py", line 312, in create
seed=seed)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/turicreate/toolkits/_supervised_learning.py", line 397, in create
options, verbose)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/turicreate/toolkits/_main.py", line 75, in run
raise ToolkitError(str(message))
turicreate.toolkits._main.ToolkitError: Target column has missing value.
Please use dropna() to drop rows with missing target values.
I have upgraded turi and coremltools to the lates versions, but I don't know where I should implement the dropna() in the code. I only found this reference and followed the code.
It looks like this:
data.py
import turicreate as tc
image_data = tc.image_analysis.load_images('images', with_path=True)
labels = ['A', 'B', 'C', 'D']
def get_label(path, labels=labels):
for label in labels:
if label in path:
return label
image_data['label'] = image_data['path'].apply(get_label)
#import os
#image_data['label'] = image_data['path'].apply(lambda path: os.path.dirname(path).split('/')[-1])
image_data.save('boxes.sframe')
image_data.explore()
train.py
import turicreate as tc
data = tc.SFrame('boxes.sframe')
data.dropna()
train_data, test_data = data.random_split(0.8)
model = tc.image_classifier.create(train_data, target='label', max_iterations=1000)
predictions = model.classify(test_data)
results = model.evaluate(test_data)
print "Accuracy : %s" % results['accuracy']
print "Confusion Matrix : \n%s" % results['confusion_matrix']
model.save('boxes.model')
How do I drop all the empty columns and rows please? Does the max_iterations=1000 have also effect on the error?
Thank you for suggestions

data.dropna() isn't done in place, you need to write it: data = data.dropna()
See documentation here https://apple.github.io/turicreate/docs/api/generated/turicreate.SFrame.dropna.html

Related

BigQuery Storage API: the table has a storage format that is not supported

I've used the sample from the BQ documentation to read a BQ table into a pandas dataframe using this query:
query_string = """
SELECT
CONCAT(
'https://stackoverflow.com/questions/',
CAST(id as STRING)) as url,
view_count
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE tags like '%google-bigquery%'
ORDER BY view_count DESC
"""
dataframe = (
bqclient.query(query_string)
.result()
.to_dataframe(bqstorage_client=bqstorageclient)
)
print(dataframe.head())
url view_count
0 https://stackoverflow.com/questions/22879669 48540
1 https://stackoverflow.com/questions/13530967 45778
2 https://stackoverflow.com/questions/35159967 40458
3 https://stackoverflow.com/questions/10604135 39739
4 https://stackoverflow.com/questions/16609219 34479
However, the minute I try and use any other non-public data-set, I get the following error:
google.api_core.exceptions.FailedPrecondition: 400 there was an error creating the session: the table has a storage format that is not supported
Is there some setting I need to set in my table so that it can work with the BQ Storage API?
This works:
query_string = """SELECT funding_round_type, count(*) FROM `datadocs-py.datadocs.investments` GROUP BY funding_round_type order by 2 desc LIMIT 2"""
>>> bqclient.query(query_string).result().to_dataframe()
funding_round_type f0_
0 venture 104157
1 seed 43747
However, when I set it to use the bqstorageclient I get that error:
>>> bqclient.query(query_string).result().to_dataframe(bqstorage_client=bqstorageclient)
Traceback (most recent call last):
File "/Users/david/Desktop/V/lib/python3.6/site-packages/google/api_core/grpc_helpers.py", line 57, in error_remapped_callable
return callable_(*args, **kwargs)
File "/Users/david/Desktop/V/lib/python3.6/site-packages/grpc/_channel.py", line 533, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/Users/david/Desktop/V/lib/python3.6/site-packages/grpc/_channel.py", line 467, in _end_unary_response_blocking
raise _Rendezvous(state, None, None, deadline)
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
status = StatusCode.FAILED_PRECONDITION
details = "there was an error creating the session: the table has a storage format that is not supported"
debug_error_string = "{"created":"#1565047973.444089000","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"there was an error creating the session: the table has a storage format that is not supported","grpc_status":9}"
>
I experienced the same issue as of 06 Nov 2019 and it turns out that the error that you are getting is a known issue with the Read API as it cannot currently handle result sets smaller than 10MB. I came across this that shed some light on this problem:
GitHub.com - GoogleCloudPlatform/spark-bigquery-connector - FAILED_PRECONDITION: there was an error creating the session: the table has a storage format that is not supported #46
I have tested it with a query that returns a larger than 10MB result set and it seems to be working fine for me with an EU multi-regional location of the dataset that I am querying against.
Also, you will need to install fastavro in your environment for this functionality to work.

tensorflow object detection API: generate TF record of custom data set

I am trying to retrain the tensorflow object detection API with my own data
i have labelled my image with labelImg but when i am using the script create_pascal_tf_record.py which is included in the tensorflow/models/research, i got some errors and i dont really know why it happens
python object_detection/dataset_tools/create_pascal_tf_record.py --data_dir=/home/jim/Documents/tfAPI/workspace/training_cabbage/images/train/ --label_map_path=/home/jim/Documents/tfAPI/workspace/training_cabbage/annotations/label_map.pbtxt --output_path=/home/jim/Desktop/cabbage_pascal.record --set=train --annotations_dir=/home/jim/Documents/tfAPI/workspace/training_cabbage/images/train/ --year=merged
Traceback (most recent call last):
File "object_detection/dataset_tools/create_pascal_tf_record.py", line 185, in <module>
tf.app.run()
File "/home/jim/.virtualenvs/enrouteDeepDroneTF/local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "object_detection/dataset_tools/create_pascal_tf_record.py", line 167, in main
examples_list = dataset_util.read_examples_list(examples_path)
File "/home/jim/Documents/tfAPI/models/research/object_detection/utils/dataset_util.py", line 59, in read_examples_list
lines = fid.readlines()
File "/home/jim/.virtualenvs/enrouteDeepDroneTF/local/lib/python2.7/site-packages/tensorflow/python/lib/io/file_io.py", line 188, in readlines
self._preread_check()
File "/home/jim/.virtualenvs/enrouteDeepDroneTF/local/lib/python2.7/site-packages/tensorflow/python/lib/io/file_io.py", line 85, in _preread_check
compat.as_bytes(self.__name), 1024 * 512, status)
File "/home/jim/.virtualenvs/enrouteDeepDroneTF/local/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.NotFoundError: /home/jim/Documents/tfAPI/workspace/training_cabbage/images/train/VOC2007/ImageSets/Main/aeroplane_train.txt; No such file or directory
the train folder contains the xml and the jpg
the annotation folder contains my labelmap.pbtxt for my custom class
and i want to publish the TF record file on the desktop
it seems that it cant find a file in my images and annotations folder but i dont know why
If someone has idea, thank you in advance
This error happens because you use the code for PASCAL VOC, which requires certain data folders structure. Basically, you need to download and unpack VOCdevkit to make the script work. As user phd pointed you out, you need the file VOC2007/ImageSets/Main/aeroplane_train.txt.
I recommend you to write your own script for tfrecords creation, it's not difficult. You need just two key components:
Loop over your data that reads the images and annotations
A function that encodes the data into tf.train.Example. For that you can pretty much re-use the dict_to_tf_example
Inside the loop, having created the tf_example, pass it to TFRecordWriter:
writer.write(tf_example.SerializeToString())
OK for future references, this is how i add background images to the dataset allowing the model to train on it.
Functions used from: datitran/raccoon_dataset
Generate CSV file -> xml_to_csv.py
Generate TFRecord from CSV file -> generate_tfrecord.py
First Step - Creating XML file for it
Example of background image XML file
<annotation>
<folder>test/</folder>
<filename>XXXXXX.png</filename>
<path>your_path/test/XXXXXX.png</path>
<source>
<database>Unknown</database>
</source>
<size>
<width>640</width>
<height>640</height>
<depth>3</depth>
</size>
<segmented>0</segmented>
</annotation>
Basically you remove the entire <object> (i.e no annotation )
Second Step - Generate CSV file
Using the xml_to_csv.py I just add a little change, to consider the XML file that do not have any annotation (the background images) as so:
From the original:
https://github.com/datitran/raccoon_dataset/blob/93938849301895fb73909842ba04af9b602f677a/xml_to_csv.py#L12-L22
I add:
value = None
for member in root.findall('object'):
value = (root.find('filename').text,
int(root.find('size')[0].text),
int(root.find('size')[1].text),
member[0].text,
int(member[4][0].text),
int(member[4][1].text),
int(member[4][2].text),
int(member[4][3].text)
)
xml_list.append(value)
if value is None:
value = (root.find('filename').text,
int(root.find('size')[0].text),
int(root.find('size')[1].text),
'-1',
'-1',
'-1',
'-1',
'-1'
)
xml_list.append(value)
I'm just adding negative values to the coordinates of the bounding box if there is no in the XML file, which is the case for the background images, and it will be usefull when generating the TFRecords.
Third and Final Step - Generating the TFRecords
Now, when creating the TFRecords, if the the corresponding row/image has negative coordinates, i just add zero values to the record (before, this would not even be possible).
So from the original:
https://github.com/datitran/raccoon_dataset/blob/93938849301895fb73909842ba04af9b602f677a/generate_tfrecord.py#L60-L66
I add:
for index, row in group.object.iterrows():
if int(row['xmin']) > -1:
xmins.append(row['xmin'] / width)
xmaxs.append(row['xmax'] / width)
ymins.append(row['ymin'] / height)
ymaxs.append(row['ymax'] / height)
classes_text.append(row['class'].encode('utf8'))
classes.append(class_text_to_int(row['class']))
else:
xmins.append(0)
xmaxs.append(0)
ymins.append(0)
ymaxs.append(0)
classes_text.append('something'.encode('utf8')) # this doe not matter for the background
classes.append(5000)
To note that in the class_text (of the else statement), since for the background images there are no bounding boxes, you can replace the string with whatever you would like, for the background cases, this will not appear anywhere.
And lastly for the classes (of the else statement) you just need to add a number label that does not belong to neither of your own classes.
For those who are wondering, I've used this procedure many times, and currently works for my use cases.
Hope it helped in some way.

retriving data saved under HDF5 group as Carray

I am new to HDF5 file format and I have a data(images) saved in HDF5 format. The images are saved undere a group called 'data' which is under the root group as Carrays. what I want to do is to retrive a slice of the saved images. for example the first 400 or somthing like that. The following is what I did.
h5f = h5py.File('images.h5f', 'r')
image_grp= h5f['/data/'] #the image group (data) is opened
print(image_grp[0:400])
but I am getting the following error
Traceback (most recent call last):
File "fgf.py", line 32, in <module>
print(image_grp[0:40])
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
(/feedstock_root/build_artefacts/h5py_1496410723014/work/h5py-2.7.0/h5py/_objects.c:2846)
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
(/feedstock_root/build_artefacts/h5py_1496410723014/work/h5py
2.7.0/h5py/_objects.c:2804)
File "/..../python2.7/site-packages/h5py/_hl/group.py", line 169, in
__getitem__oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
File "/..../python2.7/site-packages/h5py/_hl/base.py", line 133, in _e name = name.encode('ascii')
AttributeError: 'slice' object has no attribute 'encode'
I am not sure why I am getting this error but I am not even sure if I can slice the images which are saved as individual datasets.
I know this is an old question, but it is the first hit when searching for 'slice' object has no attribute 'encode' and it has no solution.
The error happens because the "group" is a group which does not have the encoding attribute. You are looking for the dataset element.
You need to find/know the key for the item that contains the dataset.
One suggestion is to list all keys in the group, and then guess which one it is:
print(list(image_grp.keys()))
This will give you the keys in the group.
A common case is that the first element is the image, so you can do this:
image_grp= h5f['/data/']
image= image_grp(image_grp.keys[0])
print(image[0:400])
yesterday I had a similar error and wrote this little piece of code to take my desired slice of h5py file.
import h5py
def h5py_slice(h5py_file, begin_index, end_index):
slice_list = []
with h5py.File(h5py_file, 'r') as f:
for i in range(begin_index, end_index):
slice_list.append(f[str(i)][...])
return slice_list
and it can be used like
the_desired_slice_list = h5py_slice('images.h5f', 0, 400)

scheduler produces empty files

I'm using pythonanywhere for a simple scheduled task.
I want to download data from a link once a day and save csv files. Later once i have a decent time series I'll figure out how I actually want to manage the data. It's not much data so don't need anything fancy like a database.
My script takes the data from the google sheets link, adds a log column and a time column, then writes a csv with the date in the filename.
It works exactly as I want it to when I run it manually in pythonanywhere, but the scheduler is just creating empty csv files albeit with the correct name.
Any ideas what's up? I don't understand the log file. Surely the error should happen when it is run manually?
script:
import pandas as pd
import time
import datetime
def write_today(df):
date = time.strftime("%Y-%m-%d")
df.to_csv('Properties_'+date+'.csv')
url = 'https://docs.google.com/spreadsheets/d/19h2GmLN-2CLgk79gVxcazxtKqS6rwW36YA-qvuzEpG4/export?format=xlsx'
df = pd.read_excel(url, header=1).rename(columns={'Unnamed: 1':'code'})
source = pd.read_excel(url).columns[0]
df['source'] = source
df['time'] = datetime.datetime.now()
write_today(df)
the scheduler is set up as so:
log file:
Traceback (most recent call last):
File "/home/abmoore/load_data.py", line 24, in <module>
write_today(df)
File "/home/abmoore/load_data.py", line 16, in write_today
df.to_csv('Properties_'+date+'.csv')
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 1344, in to_csv
formatter.save()
File "/usr/local/lib/python2.7/dist-packages/pandas/formats/format.py", line 1551, in save
self._save()
File "/usr/local/lib/python2.7/dist-packages/pandas/formats/format.py", line 1638, in _save
self._save_header()
File "/usr/local/lib/python2.7/dist-packages/pandas/formats/format.py", line 1634, in _save_header
writer.writerow(encoded_labels)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0: ordinal not in range(128)
Your problem there is the UnicodeDecodeError -- you have some non-ascii data in your spreadsheet, and the pandas to_csv function defaults to ascii encoding. try specifying utf8 instead:
def write_today(df):
filename = 'Properties_{date}.csv'.format(date=time.strftime("%Y-%m-%d"))
df.to_csv(filename, encoding='utf8')
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html

How to load retrained_graph.pb and retrained_label.txt using pycharm editor

Using pete warden tutorials i had trained the inception network and training of which i am getting two files
1.retrained_graph.pb
2.retrained_label.txt
Using this i wanted to classify the flower image.
I had install pycharm and linked all the tensorflow library , i had also test the sample tensorflow code it is working fine.
Now when i run the label_image.py program which is
import tensorflow as tf, sys
image_path = sys.argv[1]
# Read in the image_data
image_data = tf.gfile.FastGFile(image_path, 'rb').read()
# Loads label file, strips off carriage return
label_lines = [line.rstrip() for line
in tf.gfile.GFile("/tf_files/retrained_labels.txt")]
# Unpersists graph from file
with tf.gfile.FastGFile("/tf_files/retrained_graph.pb", 'rb') as f:
graph_def = tf.GraphDef()
graph_def.ParseFromString(f.read())
_ = tf.import_graph_def(graph_def, name='')
with tf.Session() as sess:
# Feed the image_data as input to the graph and get first prediction
softmax_tensor = sess.graph.get_tensor_by_name('final_result:0')
predictions = sess.run(softmax_tensor, \
{'DecodeJpeg/contents:0': image_data})
# Sort to show labels of first prediction in order of confidence
top_k = predictions[0].argsort()[-len(predictions[0]):][::-1]
for node_id in top_k:
human_string = label_lines[node_id]
score = predictions[0][node_id]
print('%s (score = %.5f)' % (human_string, score))
i am getting this error message
/home/chandan/Tensorflow/bin/python /home/chandan/PycharmProjects/tf/tf_folder/tf_files/label_image.py
Traceback (most recent call last):
File "/home/chandan/PycharmProjects/tf/tf_folder/tf_files/label_image.py", line 7, in <module>
image_path = sys.argv[1]
IndexError: list index out of range
Could any one please help me with this issue.
You are getting this error because it is expecting image name (with path) as an argument.
In pycharm go to View->Tool windows->Terminal.
It is same as opening separate terminal. And run
python label_image.py /image_path/image_name.jpg
You are trying to get the command line argument by calling sys.argv[1]. So you need to give command line arguments to satisfy it. Looks like the argument required is a test image, you should pass its location as a parameter.
Pycharm should have a script parameters and interpreter options dialog which you can use to enter the required parameters.
Or you can call the script from a command line and enter the parameter via;
>python my_python_script.py my_python_parameter.jpg
EDIT:
According to the documents (I don't have pycharm installed on this computer), you should go to Run/Debug configuration menu and edit the configurations for your script. Add the absolute path of your file into Script Parameters box in quotes.
Or alternatively if you just want to skip the parameter thing completely, just get the path as raw_input (input in python3) or just simply give it to image_path = r"absolute_image_path.jpg"