Problem reading the images of the MJSynth dataset - computer-vision

Recently I have been trying to train a text recognition network. I started training by feeding the MJSynth dataset to the network. However, there seem to be some images in the dataset which are blank, so if I feed the data to the network directly, an error is raised while reading such an image and training stops. Does anyone know the list of blank images in the MJSynth dataset, so that I can remove them before training?

After trying many things, I ended up running a pretty long experiment that read almost 9 million images of the MJSynth dataset and collected the images which are corrupted or blank. I found 12 corrupted images which stop model training when the MJSynth data is fed to the model directly, without any verification. Here are the code and the invalid images it found, so you can remove these images from the MJSynth dataset before starting model training.
import os

import cv2
import numpy as np

rootdir = './mjsynth/mnt/ramdisk/max/90kDICT32px'
invalid_images = []

# Walk the whole dataset and try to decode every image;
# cv2.imread returns None for files it cannot read.
for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        im_path = os.path.join(subdir, file)
        im = cv2.imread(im_path)
        if not isinstance(im, np.ndarray):
            invalid_images.append(im_path)

print('invalid_images = {}'.format(invalid_images))
# output
invalid_images =
['./mjsynth/mnt/ramdisk/max/90kDICT32px\\1863/4/223_Diligently_21672.jpg',
'./mjsynth/mnt/ramdisk/max/90kDICT32px\\913/4/231_randoms_62372.jpg',
'./mjsynth/mnt/ramdisk/max/90kDICT32px\\2025/2/364_SNORTERS_72304.jpg',
'./mjsynth/mnt/ramdisk/max/90kDICT32px\\495/6/81_MIDYEAR_48332.jpg',
'./mjsynth/mnt/ramdisk/max/90kDICT32px\\869/4/234_TRIASSIC_80582.jpg',
'./mjsynth/mnt/ramdisk/max/90kDICT32px\\173/2/358_BURROWING_10395.jpg',
'./mjsynth/mnt/ramdisk/max/90kDICT32px\\2013/2/370_refract_63890.jpg',
'./mjsynth/mnt/ramdisk/max/90kDICT32px\\368/4/232_friar_30876.jpg',
'./mjsynth/mnt/ramdisk/max/90kDICT32px\\1881/4/225_Marbling_46673.jpg',
'./mjsynth/mnt/ramdisk/max/90kDICT32px\\1817/2/363_actuating_904.jpg',
'./mjsynth/mnt/ramdisk/max/90kDICT32px\\275/6/96_hackle_34465.jpg',
'./mjsynth/mnt/ramdisk/max/90kDICT32px\\2069/4/192_whittier_86389.jpg']
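Once you have this list, the cleanup itself is trivial. A minimal sketch, assuming the scan above has already populated invalid_images:

import os

# Delete the corrupted files found by the scan so that training never sees them.
for im_path in invalid_images:
    if os.path.exists(im_path):
        os.remove(im_path)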

Related

Google Cloud Vision not automatically splitting images for training/test

It's weird; for some reason GCP Vision won't allow me to train my model. I have met the minimum of 10 images per label, have no unlabeled images, and tried uploading a CSV pointing to 3 of this label's images as VALIDATION images. Yet I get this error:
Some of your labels (e.g. ‘Label1’) do not have enough images assigned to your Validation sets. Import another CSV file and assign those images to those sets.
Any ideas would be appreciated.
This error generally occurs when you have not labelled all the images: AutoML divides your images, including the unlabelled ones, into the categories, and this error is triggered when the unlabelled images go to the VALIDATION set.
According to the documentation, 1,000 images per label are recommended. However, the minimum is 10 images per label, or 50 for complex cases. In addition:
The model works best when there are at most 100x more images for the most common label than for the least common label. We recommend removing very low frequency labels.
Furthermore, AutoML Vision uses 80% of your content documents for training, 10% for validating, and 10% for testing. Since your images were not divided into these three categories, you should manually assign them to TRAIN, VALIDATION and TEST. You can do that by uploading your images to a GCS bucket and referencing each labelled image in a .csv file, as follows:
TRAIN,gs://my_bucket/image1.jpeg,cat
As you can see above, it follows the format [SET],[GCS image path],[Label]. Note that you will be dividing your dataset manually, and the split should respect the percentages already mentioned, so that you have enough data in each category. You can follow the steps for preparing your training data here and here.
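For illustration, here is a minimal sketch of writing such a CSV with the 80/10/10 split; the bucket path, the label, and the images list are all hypothetical placeholders:

import csv
import random

# Hypothetical inputs: (GCS path, label) pairs already uploaded to a bucket.
images = [('gs://my_bucket/image{}.jpeg'.format(i), 'cat') for i in range(100)]

random.shuffle(images)
n = len(images)
rows = ([('TRAIN', path, label) for path, label in images[:int(0.8 * n)]] +
        [('VALIDATION', path, label) for path, label in images[int(0.8 * n):int(0.9 * n)]] +
        [('TEST', path, label) for path, label in images[int(0.9 * n):]])

with open('import.csv', 'w') as f:
    csv.writer(f).writerows(rows)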
Note: please be aware that your .csv file is case sensitive.
Lastly, in order to validate your dataset and inspect labelled/unlabelled images, you can export the created dataset and check the exported .csv file, as described in the documentation. After exporting, download it and verify each SET (TRAIN, VALIDATION and TEST).

Choropleth in Folium - having issues

I am quite new to coding and need to create some type of map to display my data.
I have data in an Excel sheet with a list of countries and the number of certain crimes in each. I want to show this in a choropleth map.
I have seen lots of different ways to code this but can't seem to get any of them to correctly read in the data. Will I need to import country codes into my df?
I have my world map from GitHub, downloaded in its raw format to my computer and loaded into Jupyter Notebook.
My dataframe, read from the Excel sheet, is also loaded in Jupyter Notebook.
What are the first steps I need to take to load this into a map?
This is the code I have had the most success with:
import folium
import pandas as pd

df = pd.read_excel('UK - Nationality and Type.xlsx')
state_geo = 'countries.json'

m1 = folium.Map(location=[55, 4], zoom_start=3)
m1.choropleth(
    geo_data=state_geo,
    data=df,
    columns=['Claimed Nationality', 'Labour Exploitation'],
    key_on='feature.id',
    fill_color='YlGn',
    fill_opacity=0.5,
    line_opacity=0.2,
    legend_name='h',
    highlight=True,
)
m1.save("my_map.html")
But I just get a big world map, all in the same shade of grey.
This is what the countries.json looks like:
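An all-grey choropleth usually means that none of the values in the data column match the GeoJSON keys that key_on points at. A quick hedged check, reusing the file and column names from the snippet above and assuming each feature carries a top-level id (as in the common world-countries file):

import json
import pandas as pd

df = pd.read_excel('UK - Nationality and Type.xlsx')
with open('countries.json') as f:
    geo = json.load(f)

# key_on='feature.id' joins on each feature's top-level "id";
# any data value without a matching id is drawn in the default grey.
geo_ids = {feature['id'] for feature in geo['features']}
print(set(df['Claimed Nationality']) - geo_ids)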

Number of examples in each tfrecord

Running the sample.sh script in Google Cloud Shell to call the below preprocessing on a set of images, following the steps of the flowers example:
https://github.com/GoogleCloudPlatform/cloudml-samples/blob/master/flowers/trainer/preprocess.py
Preprocessing was successful on both the eval set and the train set, but the generated .tfrecord.gz files do not seem to match the image counts in eval/train_set.csv.
I.e. eval-00000-of-00157.tfrecord.gz says there are 158 tfrecords, while there are 35227 rows in eval_set.csv. Each record includes a valid image_url (all of them are uploaded to Storage) and each record has a valid label tagged.
I would like to know if there is a way to monitor and control the number of images per tfrecord in the preprocess.py config.
Thanks
Update: got this worked out right:
import os

import tensorflow as tf
from tensorflow.python.lib.io import file_io

options = tf.python_io.TFRecordOptions(
    compression_type=tf.python_io.TFRecordCompressionType.GZIP)

# Count every record across all compressed TFRecord shards.
path = 'gs://my_bucket/my_dir'  # directory holding the .tfrecord.gz files
print(sum(1 for f in file_io.get_matching_files(os.path.join(path, '*.tfrecord.gz'))
          for example in tf.python_io.tf_record_iterator(f, options=options)))
The filename eval-00000-of-00157.tfrecord.gz means that this is the first of 157 files: the -of-00157 suffix is the total shard count, so the shards are numbered 00000 through 00156. Within each file, there can be any number of records.
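As a quick illustration of that naming convention (the filename is hypothetical and parsed with a plain regex):

import re

# Recover the shard index and total shard count from a Dataflow-style name.
name = 'eval-00000-of-00157.tfrecord.gz'
index, total = map(int, re.search(r'-(\d+)-of-(\d+)\.', name).groups())
print(index, total)  # 0 157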
If you want to manually count each record, try something like:
import os

import tensorflow as tf
from tensorflow.python.lib.io import file_io

# GZIP options are needed to iterate compressed .tfrecord.gz files.
options = tf.python_io.TFRecordOptions(
    compression_type=tf.python_io.TFRecordCompressionType.GZIP)

files = os.path.join('gs://my_bucket/my_dir', 'eval-*.tfrecord.gz')
print(sum(1 for f in file_io.get_matching_files(files)
          for record in tf.python_io.tf_record_iterator(f, options=options)))
Note that Dataflow makes no guarantees about the relationship between the number of input and output files, nor about the ordering of records (inter- and intra-file) between them. However, the record counts should be the same.

What's Caffe's input format?

I'm trying to use Caffe for audio recognition, but can't find a document describing its input format.
I want to use leveldb, thus I must create a key and a value for each record, which is a pair of a label string and a data byte array.
It seems that no document describes this, and after I found that the value is written by Datum.SerializeToString(), I couldn't find where Datum is defined and got lost.
Does anyone know how to convert non-image records into leveldb records for caffe? Thanks!
leveldb, lmdb and HDF5 are currently the main formats for feeding data into Caffe. The MemoryData layer enables in-memory input as well, so it's possible to use whatever input format you like and use Caffe's Python or C++ interfaces to populate the data blobs.
If you're already set on leveldb, this discussion on caffe issues could be useful.
Below is an example of populating a leveldb with Python. It requires pycaffe and plyvel. It's adapted from Caffe's GitHub issues, posted by Zackory. It's not specific to images, as long as you represent each example as a C×H×W array, where any or all of the dimensions can be equal to 1:
import caffe
import plyvel

db = plyvel.DB('train_leveldb/', create_if_missing=True,
               error_if_exists=True, write_buffer_size=268435456)
wb = db.write_batch()
count = 0
for file in dataset:
    mat = ...  # load a numpy array (C x H x W) from file
    # Load the matrix into a Datum object
    datum = caffe.io.array_to_datum(mat)
    wb.put('%08d_%s' % (count, file), datum.SerializeToString())
    count += 1
    # Write to the db at regular intervals
    if count % 1000 == 0:
        # Write this batch of images to the database
        wb.write()
        del wb
        wb = db.write_batch()
# Write the last batch of images
if count % 1000 != 0:
    wb.write()
I find constructing an lmdb a lot simpler; there's an lmdb example here.
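For comparison, here is a minimal lmdb sketch under the same assumption that dataset yields C×H×W arrays, here paired with an integer label (the lmdb package is required):

import caffe
import lmdb
import numpy as np

env = lmdb.open('train_lmdb', map_size=1 << 40)  # generous maximum DB size
with env.begin(write=True) as txn:
    # dataset is assumed: an iterable of (uint8 array, int label) pairs.
    for i, (mat, label) in enumerate(dataset):
        datum = caffe.io.array_to_datum(mat.astype(np.uint8), label)
        txn.put('{:08d}'.format(i).encode('ascii'), datum.SerializeToString())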
The Datum object is defined with protobuf. See here:
https://github.com/BVLC/caffe/blob/master/src/caffe/proto/caffe.proto#L30-L41
Building Caffe generates a file caffe.pb.h in .build_release/src/caffe/proto containing the Datum class. You can have a look there to understand how this object works.
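For instance, pycaffe exposes the generated class directly; a hedged sketch, with raw_value standing in for the bytes read back from the database:

import caffe

datum = caffe.proto.caffe_pb2.Datum()
raw_value = ...  # bytes previously read from leveldb/lmdb
datum.ParseFromString(raw_value)
arr = caffe.io.datum_to_array(datum)  # back to a C x H x W numpy array
print(datum.channels, datum.height, datum.width, datum.label)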

Monitor training/validation process in Caffe

I'm training the Caffe reference model for classifying images.
My work requires me to monitor the training process by drawing a graph of the model's accuracy every 1000 iterations, on the entire training set and validation set, which have 100K and 50K images respectively.
Right now I'm taking the naive approach: making snapshots every 1000 iterations and running the C++ classification code, which reads a raw JPEG image, forwards it through the net, and outputs the predicted label. However, this takes too much time on my machine (with a GeForce GTX 560 Ti).
Is there any faster way to get the graph of accuracy of the snapshot models on both the training and validation sets?
I was thinking about using the LMDB format instead of raw images; however, I cannot find documentation/code about doing classification in C++ using the LMDB format.
1) You can use the NVIDIA DIGITS app to monitor your networks. It provides a GUI including dataset preparation, model selection, and learning-curve visualization. Moreover, it uses a Caffe distribution allowing multi-GPU training.
2) Or, you can simply use the log parser inside Caffe:
/pathtocaffe/build/tools/caffe train --solver=solver.prototxt 2>&1 | tee lenet_train.log
This saves the training log into "lenet_train.log". Then, by using:
python /pathtocaffe/tools/extra/parse_log.py lenet_train.log .
you parse your training log into two csv files, containing the train and test loss. You can then plot them using the following Python script:
import pandas as pd
import matplotlib.pyplot as plt

train_log = pd.read_csv("./lenet_train.log.train")
test_log = pd.read_csv("./lenet_train.log.test")

_, ax1 = plt.subplots(figsize=(15, 10))
ax2 = ax1.twinx()
ax1.plot(train_log["NumIters"], train_log["loss"], alpha=0.4)
ax1.plot(test_log["NumIters"], test_log["loss"], 'g')
ax2.plot(test_log["NumIters"], test_log["acc"], 'r')
ax1.set_xlabel('iteration')
ax1.set_ylabel('train loss')
ax2.set_ylabel('test accuracy')
plt.savefig("./train_test_image.png")  # save image as png
Caffe creates logs each time you try to train something; they are located in the tmp folder (on both Linux and Windows).
I also wrote a plotting script in Python which you can easily use to visualize your loss/accuracy.
Just place your training logs with the .log extension next to the script and double click on it.
You can use the command prompt as well, but for ease of use, when executed it loads all the logs (*.log) it can find in the current directory.
It also shows the top 4 accuracies and the iterations at which they were achieved.
You can find it here: https://gist.github.com/Coderx7/03f46cb24dcf4127d6fa66d08126fa3b
python /pathtocaffe/tools/extra/parse_log.py lenet_train.log
This command produces the following error:
usage: parse_log.py [-h] [--verbose] [--delimiter DELIMITER]
logfile_path output_dir
parse_log.py: error: too few arguments
Solution:
For "parse_log.py" to execute successfully, we need to pass two arguments:
the log file
the path of the output directory
So the correct command is as follows:
python /pathtocaffe/tools/extra/parse_log.py lenet_train.log output_dir