I am running the sample.sh script in Google Cloud Shell to call the preprocessing below on a set of images, following the steps of the flowers example.
https://github.com/GoogleCloudPlatform/cloudml-samples/blob/master/flowers/trainer/preprocess.py
Preprocessing completed successfully on both the eval set and the train set, but the generated .tfrecord.gz files do not seem to match the number of images in eval/train_set.csv.
i.e. eval-00000-of-00157.tfrecord.gz suggests there are only 157 tfrecord files, while there are 35227 rows in eval_set.csv. Each record includes a valid image_url (all of the images are uploaded to Storage) and each record has a valid label tagged.
I would like to know if there is a way to monitor and control the number of images per tfrecord in the preprocess.py config.
Thanks
Update: I got this worked out with the following:
import os
import tensorflow as tf
from tensorflow.python.lib.io import file_io

# The shards are gzip-compressed, so the record iterator needs GZIP options.
options = tf.python_io.TFRecordOptions(
    compression_type=tf.python_io.TFRecordCompressionType.GZIP)

output_dir = 'gs://my_bucket/my_dir'  # directory holding the .tfrecord.gz shards
print(sum(1 for f in file_io.get_matching_files(os.path.join(output_dir, '*.tfrecord.gz'))
          for example in tf.python_io.tf_record_iterator(f, options=options)))
The filename eval-00000-of-00157.tfrecord.gz means that this is the first of 157 files (numbered 00000 through 00156), so there should be 156 other similarly named files. Within each file, there can be any number of records.
If you want to manually count each record, try something like:
import os
import tensorflow as tf
from tensorflow.python.lib.io import file_io

# GZIP options are needed because the shards are compressed.
options = tf.python_io.TFRecordOptions(
    compression_type=tf.python_io.TFRecordCompressionType.GZIP)
files = os.path.join('gs://my_bucket/my_dir', 'eval-*.tfrecord.gz')
print(sum(1 for f in file_io.get_matching_files(files)
          for _ in tf.python_io.tf_record_iterator(f, options=options)))
Note that Dataflow makes no guarantees about the relationship between the number of input and output files, nor about the ordering of records (inter- and intra-file) between input and output. However, the record counts should be the same.
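If you do want to pin the number of output files (and hence roughly the number of images per file), Beam's WriteToTFRecord accepts a num_shards argument that could be applied to the write step in preprocess.py. A minimal, self-contained sketch under that assumption (the paths, data and step names here are made up, and forcing a shard count can limit Dataflow's write parallelism):
import apache_beam as beam

# num_shards pins the number of output files; the default of 0 lets the
# runner (e.g. Dataflow) choose the shard count itself.
with beam.Pipeline() as p:
    _ = (p
         | 'CreateExamples' >> beam.Create([b'serialized-example-1',
                                            b'serialized-example-2'])
         | 'SaveEval' >> beam.io.WriteToTFRecord(
             'gs://my_bucket/preproc/eval',
             file_name_suffix='.tfrecord.gz',
             num_shards=10))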
I'm trying to follow the documentation at the GCP link below to prepare my video training data. The doc says that if you want to use GCP to label videos, you can use the UNASSIGNED feature.
I have my videos uploaded to a bucket.
I have a traffic_video_labels.csv with below rows:
gs://video_intel/1.mp4
gs://video_intel/2.mp4
Now, in my Video Intelligence Import section, I want to use a CSV called check.csv that has the row below, as it references back to the video locations. Using the UNASSIGNED value should let me use the labelling feature within GCP.
UNASSIGNED,gs://video_intel/traffic_video_labels.csv
However, when I try to import check.csv, I get the error:
Has critical error in root level csv gs://video_intel/check.csv line 1: Expected 2 columns, but found 1 columns only.
Can anyone please help with this? Thanks!
https://cloud.google.com/video-intelligence/automl/object-tracking/docs/prepare
For the error message "Expected 2 columns, but found 1 columns only.", try to fix the format of your CSV file: open the file in a text editor of your choice (such as Cloud Shell, Sublime, Atom, etc.) to inspect the file format.
When opening a CSV file in Google Sheets or a similar product, you won't be able to see formatting problems (e.g. empty values from trailing commas) due to limitations of the user interface, but in a text editor you should not run into those issues.
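For example, here is a minimal sketch that counts the columns the csv module sees on each line of a local copy of the file (the filename is illustrative; copy the file down first, e.g. with gsutil cp):
import csv

# Print the number of columns on each line, which makes missing commas,
# stray quotes or hidden characters easy to spot.
with open('check.csv', newline='') as f:
    for i, row in enumerate(csv.reader(f), start=1):
        print('line {}: {} columns -> {}'.format(i, len(row), row))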
If this does not work, please share your CSV file so I can run a test with it myself.
Recently I have been trying to train a text recognition network. I started training by feeding the mjsynth dataset to the network; however, there seem to be some images in the dataset which are blank. While training, if I feed the data directly to the network, reading such an image raises an error and training stops. Does anyone know the list of blank images in the mjsynth dataset, so that I can remove them from the dataset?
After trying many things, I ended up running a pretty long experiment reading almost 9 million images of the mjsynth dataset and collecting the images which are corrupted or blank. I found that there are 12 corrupted images which stop model training when the mjsynth data is fed directly to the model without any verification. Here are the code and the invalid images it found, so you can remove these images from the mjsynth dataset before starting model training.
import os
import cv2
import numpy as np

rootdir = './mjsynth/mnt/ramdisk/max/90kDICT32px'

invalid_images = []
for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        im_path = os.path.join(subdir, file)
        im = cv2.imread(im_path)
        # cv2.imread returns None (not an ndarray) for unreadable/corrupted files
        if type(im) != np.ndarray:
            invalid_images.append(im_path)

print('invalid_images = {}'.format(invalid_images))
# output
invalid_images =
['./mjsynth/mnt/ramdisk/max/90kDICT32px\\1863/4/223_Diligently_21672.jpg',
'./mjsynth/mnt/ramdisk/max/90kDICT32px\\913/4/231_randoms_62372.jpg',
'./mjsynth/mnt/ramdisk/max/90kDICT32px\\2025/2/364_SNORTERS_72304.jpg',
'./mjsynth/mnt/ramdisk/max/90kDICT32px\\495/6/81_MIDYEAR_48332.jpg',
'./mjsynth/mnt/ramdisk/max/90kDICT32px\\869/4/234_TRIASSIC_80582.jpg',
'./mjsynth/mnt/ramdisk/max/90kDICT32px\\173/2/358_BURROWING_10395.jpg',
'./mjsynth/mnt/ramdisk/max/90kDICT32px\\2013/2/370_refract_63890.jpg',
'./mjsynth/mnt/ramdisk/max/90kDICT32px\\368/4/232_friar_30876.jpg',
'./mjsynth/mnt/ramdisk/max/90kDICT32px\\1881/4/225_Marbling_46673.jpg',
'./mjsynth/mnt/ramdisk/max/90kDICT32px\\1817/2/363_actuating_904.jpg',
'./mjsynth/mnt/ramdisk/max/90kDICT32px\\275/6/96_hackle_34465.jpg',
'./mjsynth/mnt/ramdisk/max/90kDICT32px\\2069/4/192_whittier_86389.jpg']
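If you would rather drop them automatically than by hand, here is a minimal sketch that deletes the files collected above (remember to also remove the matching entries from whatever annotation list your training pipeline reads):
import os

for im_path in invalid_images:
    if os.path.exists(im_path):
        os.remove(im_path)
        print('removed {}'.format(im_path))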
I found this apache-parquet ticket https://issues.apache.org/jira/browse/PARQUET-686 which is marked as resolved for parquet-mr 1.8.2. The feature I want is the calculated min/max in the parquet metadata for a (string or BINARY) column.
And referencing it is this email https://lists.apache.org/thread.html/%3CCANPCBc2UPm+oZFfP9oT8gPKh_v0_BF0jVEuf=Q3d-5=ugxSFbQ#mail.gmail.com%3E
which uses Java (parquet-mr) instead of pyspark as an example:
Configuration conf = new Configuration();
conf.set("parquet.strings.signed-min-max.enabled", "true");
Path inputPath = new Path(input);
FileStatus inputFileStatus =
    inputPath.getFileSystem(conf).getFileStatus(inputPath);
List<Footer> footers = ParquetFileReader.readFooters(conf, inputFileStatus, false);
I've been unable to set this value in pyspark (perhaps I'm setting it in the wrong place?)
Example dataframe:
import random
import string
from pyspark.sql.types import StringType

r = []
for x in range(2000):
    r.append(u''.join(random.choice(string.ascii_uppercase + string.digits)
                      for _ in range(10)))

df = spark.createDataFrame(r, StringType())
I've tried a few different ways of setting this option:
df.write.format("parquet").option("parquet.strings.signed-min-max.enabled", "true").save("s3a://test.bucket/option")
df.write.option("parquet.strings.signed-min-max.enabled", "true").parquet("s3a://test.bucket/option")
df.write.option("parquet.strings.signed-min-max.enabled", True).parquet("s3a://test.bucket/option")
But all of the saved parquet files are missing the ST/STATS for the BINARY column. Here is an example output of the metadata from one of the parquet files:
creator: parquet-mr version 1.8.3 (build aef7230e114214b7cc962a8f3fc5aeed6ce80828)
extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"value","type":"string","nullable":true,"metadata":{}}]}
file schema: spark_schema
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
value: OPTIONAL BINARY O:UTF8 R:0 D:1
row group 1: RC:33 TS:515
---------------------------------------------------------------------------------------------------
Also, based on this email chain https://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%3C9DEF4C39-DFC2-411B-8987-5B9C33842974#videoamp.com%3E and question: Specify Parquet properties pyspark
I tried sneaking the config in through the pyspark private API:
spark.sparkContext._jsc.hadoopConfiguration().setBoolean("parquet.strings.signed-min-max.enabled", True)
So I am still unable to set this conf, parquet.strings.signed-min-max.enabled, in parquet-mr (or it is set, but something else has gone wrong).
Is it possible to configure parquet-mr from pyspark?
Does pyspark 2.3.x support BINARY column stats?
How do I take advantage of the PARQUET-686 feature to add min/max metadata for string columns in a parquet file?
Since historically Parquet writers wrote wrong min/max values for UTF-8 strings, new Parquet implementations skip those stats during reading, unless parquet.strings.signed-min-max.enabled is set. So this setting is a read option that tells the Parquet library to trust the min/max values in spite of their known deficiency. The only case when this setting can be safely enabled is if the strings only contain ASCII characters, because the corresponding bytes for those will never be negative.
Since you use parquet-tools for dumping the statistics and parquet-tools itself uses the Parquet library, it will ignore string min/max statistics by default. Although it seems that there are no min/max values in the file, in reality they are there, but get ignored.
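If you want to convince yourself that the values are physically present in the footer, you could inspect the file with a different reader, for example pyarrow. A sketch under that assumption (the local filename is illustrative, and whether the min/max are surfaced also depends on how your pyarrow version handles the same signed/unsigned caveat):
import pyarrow.parquet as pq

# Read only the footer metadata, not the data pages.
meta = pq.ParquetFile('part-00000.snappy.parquet').metadata
stats = meta.row_group(0).column(0).statistics  # may be None if no stats were written
print(stats.has_min_max, stats.min, stats.max)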
The proper solution for this problem is PARQUET-1025, which introduces new statistics fields min-value and max-value. These handle UTF-8 strings correctly.
I have been reading through some Stack Overflow questions and could not find what I was looking for, or at least I didn't think so when I read the various posts.
I have some training data set up as described here.
So I am using sklearn.datasets.load_files to read it, as that was a perfect match for the setup.
But my files are TSVs that are already bags of words (i.e. each line is a word and its frequency count, separated by a tab).
To be honest, I am not sure how to proceed. The data pulled in by load_files is set up as a list where each element is the contents of a file, including the newline characters. I am not even 100% sure how the Bunch data type is tracking which files belong to which classifier folder.
I have worked with scikit-learn and TSVs before, but it was a single TSV file that had all the data, so I used pandas to read it in and then used numpy.array to fetch what I needed from it, which is one of the things I attempted here. But I am not sure how to do that with multiple files where the classifier is the folder name, since in that single TSV file each line of training data was individually labelled.
Some help on getting the data into a format that is usable for training classifiers would be appreciated.
You could loop over the files and read them, to create a list of dictionaries where each dictionary will contain the features and the frequencies of each document. Assume the file 1.txt:
import codecs
from sklearn.feature_extraction import DictVectorizer

corpus = []
# make a loop over the files here and repeat the following
f = codecs.open("1.txt", encoding='utf8').read().splitlines()
# cast the count to int so DictVectorizer treats it as a numeric frequency
corpus.append({line.split("\t")[0]: int(line.split("\t")[1]) for line in f})
# exit the loop here

vec = DictVectorizer()
X = vec.fit_transform(corpus)
You can find more about DictVectorizer here.
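To also keep the class labels (the folder names), one option is to walk the category folders yourself instead of using load_files. A sketch, assuming a layout like container_folder/category_name/*.tsv with one word<TAB>count pair per line (the root path is illustrative):
import os
import codecs
from sklearn.feature_extraction import DictVectorizer

root = "container_folder"  # illustrative path to the training data
corpus, labels = [], []
for category in sorted(os.listdir(root)):
    category_dir = os.path.join(root, category)
    if not os.path.isdir(category_dir):
        continue
    for filename in os.listdir(category_dir):
        lines = codecs.open(os.path.join(category_dir, filename),
                            encoding='utf8').read().splitlines()
        counts = {}
        for line in lines:
            word, freq = line.split("\t")
            counts[word] = int(freq)      # word -> frequency for this document
        corpus.append(counts)
        labels.append(category)           # the folder name is the class label

vec = DictVectorizer()
X = vec.fit_transform(corpus)  # feature matrix
y = labels                     # targets, aligned with the rows of X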
I'm training the Caffe Reference Model for classifying images.
My work requires me to monitor the training process by drawing a graph of the model's accuracy every 1000 iterations on the entire training set and validation set, which have 100K and 50K images respectively.
Right now I'm taking the naive approach: make snapshots every 1000 iterations and run the C++ classification code, which reads a raw JPEG image, forwards it through the net, and outputs the predicted label. However, this takes too much time on my machine (with a GeForce GTX 560 Ti).
Is there any faster way to get the graph of accuracy of the snapshot models on both the training and validation sets?
I was thinking about using the LMDB format instead of raw images. However, I cannot find documentation/code about doing classification in C++ using the LMDB format.
1) You can use the NVIDIA DIGITS app to monitor your networks. It provides a GUI including dataset preparation, model selection, and learning curve visualization. Moreover, it uses a Caffe distribution that allows multi-GPU training.
2) Alternatively, you can simply use the log parser inside Caffe.
/pathtocaffe/build/tools/caffe train --solver=solver.prototxt 2>&1 | tee lenet_train.log
This saves the training log to "lenet_train.log". Then, by using:
python /pathtocaffe/tools/extra/parse_log.py lenet_train.log .
you parse your training log into two csv files, containing the train and test loss. You can then plot them using the following Python script:
import pandas as pd
import matplotlib.pyplot as plt

train_log = pd.read_csv("./lenet_train.log.train")
test_log = pd.read_csv("./lenet_train.log.test")

_, ax1 = plt.subplots(figsize=(15, 10))
ax2 = ax1.twinx()
ax1.plot(train_log["NumIters"], train_log["loss"], alpha=0.4)
ax1.plot(test_log["NumIters"], test_log["loss"], 'g')
ax2.plot(test_log["NumIters"], test_log["acc"], 'r')
ax1.set_xlabel('iteration')
ax1.set_ylabel('train loss')
ax2.set_ylabel('test accuracy')
plt.savefig("./train_test_image.png")  # save the figure as a PNG
Caffe creates logs each time you try to train something, and they are located in the tmp folder (on both Linux and Windows).
I also wrote a plotting script in Python which you can easily use to visualize your loss/accuracy.
Just place your training logs with the .log extension next to the script and double click on it.
You can use the command prompt as well, but for ease of use, when executed it loads all the logs (*.log) it can find in the current directory.
It also shows the top 4 accuracies and the iterations at which they were achieved.
You can find it here: https://gist.github.com/Coderx7/03f46cb24dcf4127d6fa66d08126fa3b
Running
python /pathtocaffe/tools/extra/parse_log.py lenet_train.log
produces the following error:
usage: parse_log.py [-h] [--verbose] [--delimiter DELIMITER]
logfile_path output_dir
parse_log.py: error: too few arguments
Solution:
For successful execution of the parse_log.py command, we should pass two arguments:
log file
path of the output directory
So the correct command is as follows:
python /pathtocaffe/tools/extra/parse_log.py lenet_train.log output_dir