Pickled scipy sparse matrix as input data? - google-cloud-ml

I am working on a multiclass classification problem that consists of classifying resumes.
I used sklearn and its TfidfVectorizer to get a big scipy sparse matrix that I feed into a TensorFlow model after pickling it. On my local machine, I load it, convert a small batch to dense numpy arrays and fill a feed dictionary. Everything works great.
Now I would like to do the same thing on Cloud ML. My pickle is stored at gs://my-bucket/path/to/pickle but when I run my trainer, the pickle file can't be found at this URI (IOError: [Errno 2] No such file or directory). I am using pickle.load(open('gs://my-bucket/path/to/pickle', 'rb')) to extract my data. I suspect that this is not the right way to open a file on GCS, but I'm totally new to Google Cloud and I can't find the proper way to do so.
Also, I read that one must use TFRecords or a CSV format for input data, but I don't understand why my method could not work. CSV is excluded since the dense representation of the matrix would be too big to fit in memory. Can TFRecords efficiently encode sparse data like that? And is it possible to read data from a pickle file?

You are correct that Python's "open" won't work with GCS out of the box. Given that you're using TensorFlow, you can use the file_io library instead, which will work both with local files as well as files on GCS.
import pickle
from tensorflow.python.lib.io import file_io
data = pickle.loads(file_io.read_file_to_string('gs://my-bucket/path/to/pickle'))
NB: pickle.load(file_io.FileIO('gs://..', 'r')) does not appear to work.
You are welcome to use whatever data format works for you and are not limited to CSV or TFRecord (do you mind pointing to the place in the documentation that makes that claim?). If the data fits in memory, then your approach is sensible.
If the data doesn't fit in memory, you will likely want to use TensorFlow's reader framework, the most convenient of which tend to be CSV or TFRecords. TFRecord is simply a container of byte strings. Most commonly, it contains serialized tf.Example data which does support sparse data (it is essentially a map). See tf.parse_example for more information on parsing tf.Example data.
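For example, here is a minimal sketch (assuming the TensorFlow 1.x API, a SciPy CSR matrix X and a label array y, all hypothetical names) of writing one sparse row per tf.Example and declaring the matching parse spec:
import tensorflow as tf

def write_sparse_rows(X, y, path):
    """Store only the non-zero column indices and values of each row."""
    with tf.python_io.TFRecordWriter(path) as writer:
        for i in range(X.shape[0]):
            row = X.getrow(i).tocoo()
            example = tf.train.Example(features=tf.train.Features(feature={
                'indices': tf.train.Feature(int64_list=tf.train.Int64List(value=row.col.tolist())),
                'values': tf.train.Feature(float_list=tf.train.FloatList(value=row.data.tolist())),
                'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[int(y[i])])),
            }))
            writer.write(example.SerializeToString())

# Parse spec: variable-length features come back as tf.SparseTensor objects.
features = {
    'indices': tf.VarLenFeature(tf.int64),
    'values': tf.VarLenFeature(tf.float32),
    'label': tf.FixedLenFeature([1], tf.int64),
}
# parsed = tf.parse_example(serialized_batch, features)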


Word2Vec model output types

When a Word2Vec model is trained, three output files are created:
model
model.wv.syn0
model.syn1neg
I have a couple of questions regarding these files.
How are these outputs essentially different from each other?
Which one should I look at if I want to access the trained results?
Thanks in advance!
Those are 3 files created by the gensim Word2Vec .save() function. The model file is a Python pickle of the main model; the other files are some of the over-large numpy arrays stored separately for efficiency. The syn0 happens to contain the raw word vectors, and the syn1neg the model's internal weights – but neither are cleanly interpretable without the other data.
So, the only support for re-loading them is to use the matching .load() function, with all three available. A successful re-load() will result in a model object just like the one you save()d, and you'd access the results via that loaded object.
(If you only need the raw word-vectors, you can also use the .save_word2vec_format() method, which writes in a format compatible with the original Google-released word2vec.c code. But that format has strictly less information than gensim's native save, so you'd only use it if you absolutely need it for compatibility with other software. Working with the gensim native files ensures you can always save to the other format later, while you can't go the other way.)
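As a minimal sketch (assuming a gensim 1.x+ API and a model previously saved to the hypothetical path 'model', with all three files in the same directory):
from gensim.models import Word2Vec

# Re-loading needs 'model', 'model.wv.syn0' and 'model.syn1neg' side by side.
model = Word2Vec.load('model')

# Access the trained results through the loaded object.
vector = model.wv['king']                  # raw word vector ('king' is a hypothetical vocabulary word)
neighbours = model.wv.most_similar('king')

# Optional export of the raw vectors only, in word2vec.c-compatible format.
model.wv.save_word2vec_format('vectors.bin', binary=True)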

Preprocessing data in EMR

I want to crunch 10 PB of data. The input data is in some proprietary format (stored in S3) and the first preprocessing step is to convert this proprietary data to CSV and move it back to S3. Due to some constraints, I can't couple the preprocessing step with the Map task. What would be the correct way to do that?
I'm planning to use AWS EMR for this. One way would be to run a separate EMR job with no reduce task and upload the data to S3 in the Map phase. Is there any better way to do this? Running a map-reduce job without a reduce task just for preprocessing feels like a hacky solution.
It would seem you have at least two options:
Convert the data into a format you find easier to work with. You might want to look at formats such as Parquet or Avro. Using a map-only task for this is an appropriate method (see the mapper sketch after this list); you would only use a reducer in this case if you wanted to control the number of files produced, i.e. to combine lots of small files into larger ones.
Create a custom InputFormat and just read the data directly. There are lots of resources on the net about how to do this. Depending on what this proprietary format looks like, you might need to do this anyway to achieve #1.
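For the first option, a map-only conversion can be a plain Hadoop Streaming job run with zero reducers. A minimal sketch, assuming a hypothetical line-oriented proprietary format with '\x01'-delimited fields (the real parser depends on your actual format):
#!/usr/bin/env python
# mapper.py -- convert one proprietary record per input line to a CSV row.
# Run map-only, e.g.:
#   hadoop jar hadoop-streaming.jar -D mapreduce.job.reduces=0 \
#     -input s3://my-bucket/raw -output s3://my-bucket/csv -mapper mapper.py
import csv
import sys

writer = csv.writer(sys.stdout)
for line in sys.stdin:
    # Placeholder parsing: replace with the real decoder for the proprietary format.
    fields = line.rstrip('\n').split('\x01')
    writer.writerow(fields)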
A few things for you to think about are:
Is the proprietary format space-efficient compared with other formats?
How easy is the format to work with? Would converting it to CSV make your processing jobs simpler?
Is the original data ever updated or added to? Would you continually need to convert it to another format or update already-converted data?

Implementing Data Frames in OCaml

I have been learning OCaml on my own and I've been really impressed with the language. I wanted to develop a small machine learning library for practice but I've been presented with a problem.
In Python one can use Pandas to load data files and then pass them to a library like Scikit-Learn very easily. I would like to emulate the same process in OCaml. However, there doesn't seem to be any data frame library in OCaml. I've checked 'ocaml-csv' but it doesn't really seem to do what I want. I also looked into 'Frames' from Haskell, but it uses Template Haskell; I believe a simpler way to do the same thing should be possible, since Pandas can simply load a data file into memory without compile-time metaprogramming.
Does anyone know how data frames are implemented in Pandas or R? A quick search on Google doesn't seem to return useful links.
Is it possible to use a parser generator such as Menhir to parse CSV files? Also, I'm unsure how static typing works with data frames.
Do you have a reference describing the format of data frames? It may not be so hard to add to ocaml-csv if CSV is the underlying representation. The best way forward is to open an issue with a request and the needed information.
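For orientation only: this is not Pandas' actual internals (Pandas manages blocks of NumPy arrays under the hood), but the core idea of a data frame is columnar storage of equal-length, homogeneously typed arrays, with column types inferred at load time rather than at compile time. A toy Python sketch with a hypothetical read_frame helper:
import csv
import numpy as np

def read_frame(path):
    """Toy column-oriented 'data frame': a dict of equal-length typed arrays."""
    with open(path, newline='') as f:
        rows = list(csv.reader(f))
    header, body = rows[0], rows[1:]
    columns = {}
    for j, name in enumerate(header):
        raw = [row[j] for row in body]
        try:
            columns[name] = np.array(raw, dtype=float)   # numeric column
        except ValueError:
            columns[name] = np.array(raw, dtype=object)  # fall back to strings
    return columns

# frame = read_frame('data.csv'); frame['age'].mean()
The same structure maps naturally onto OCaml as a map from column names to a typed-column variant, which is where the static typing question comes in.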

Store a file metadata in an extra file

I have a bunch of image files (mostly .jpg). I would like to store metadata about these files (e.g. dominant color, color distribution, maximum gradient flow field, interest points, ...). These data fields are not fixed and are not available in all images.
Right now I am storing the metadata for each file as a separate file with the same name but a different extension. The format is just text:
metadataFieldName1 metadataFieldValue1
metadataFieldName2 metadataFieldValue2
This gets me wondering, is there a better/easier way to store this metadata? I thought of ProtocolBuffer since I need to be able to read and write this information in both C++ and Python. But how do I support the case where some metadata are not available?
I would suggest that you store such metadata within the image files themselves.
Most image formats support storing metadata. I think that .jpeg supports it through Exif.
If you're on Windows you can use the WIC to store and retrieve metadata in a unified manner.
Why protocol buffers and not XML or INI files or whatever text-ish format? Just choose some format...
And what do you mean by "metadata not available"? It is up to your application to respond to such error situations... what does this have to do with the storage format?
Look at http://www.yaml.org. YAML is less verbose than XML and more human-friendly to read.
There are YAML libraries for C++, Python, and many other languages.
Example:
import yaml

data = {"field1": "value1",
        "field2": "value2"}

# default_flow_style=False emits one "key: value" pair per line.
serializedData = yaml.dump(data, default_flow_style=False)
with open("datafile", "w") as f:
    f.write(serializedData)
I thought long on this matter and went with ProtocolBuffer to store metadata for my images. For each image, e.g. Image00012.jpg, I store the metadata in Image00012.jpg.pbmd. Once I had my .proto file set up, the Python and C++ classes were auto-generated. It works very well and requires me to spend little time on parsing (clearly better than writing a custom reader for YAML files).
RestRisiko brings up a good point about how I should handle metadata that is not available. The good thing about ProtocolBuffer is that it supports optional/required fields. This solves my problem on this front.
The reason I think XML and INI are not good for this purpose is that many of my metadata fields are complex (color distribution, ...) and require a bit of storage customization. ProtocolBuffer allows me to nest proto declarations. Plus, the size of the metadata file and the parsing speed are clearly superior to my hand-rolled XML reading/writing.
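As a minimal sketch of the reading side, assuming a hypothetical ImageMetadata message with proto2 optional fields, compiled by protoc into a module named image_metadata_pb2 (both names are illustrative, not from the post):
import image_metadata_pb2  # hypothetical module generated by protoc

meta = image_metadata_pb2.ImageMetadata()
with open('Image00012.jpg.pbmd', 'rb') as f:
    meta.ParseFromString(f.read())

# Optional fields make "metadata not available" explicit.
if meta.HasField('dominant_color'):
    print(meta.dominant_color)
else:
    print('dominant color not recorded for this image')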

DICOM File compression

My line of work requires the use of DICOM files. Each DICOM dataset consists of many .dcm files in a single directory. I am required to send these files over the network, a process which is somewhat slow due to the massive size of the files.
I am also a programmer and I was wondering what is the ideal way to compress such files? I'm talking about compression that is performed on the local computer and later decompressed on the destination computer (i.e. the compression is solely for speeding up the over-the-network transfer of the files). Is there a simple way to crop the DICOM files? (The files contain imaging of an entire head, whereas I'm only interested in a small part of it.)
Thanks!
In a medical context, lossy compression is somewhere between not encouraged and forbidden. If you insist on cropping existing datasets, the standard demands that you at least generate new image & series UIDs. The standard does allow lossless compression in the form of JPEG 2000, but it is quite rare - if I had to bet, I'd say your dataset is uncompressed altogether.
In my experience it is significantly better to compress a medical dataset as a solid archive - that is, to unify all the images into a single stream. This makes a lot of sense, as there is typically a lot of similarity between nearby images, and a solid archive is the way to take advantage of that similarity (a unified compression dictionary). This is available as a command-line option in both the rar and gzip compressors.
Solution:
gdcmconv --jpeg uncompressed.dcm compressed.dcm
or for better compression ratio:
gdcmconv --jpegls uncompressed.dcm compressed.dcm
See:
http://gdcm.sourceforge.net/html/gdcmconv.html
I would also recommend against lossy compression; you would need to be a DICOM wizard to do it properly (see the derivation mechanism in the DICOM standard). I would also recommend against cropping the image (you would need to regenerate UIDs, get the Frame of Reference updated...)
HTH
You could use something simple like lzma compression on one end to pack up the files and send them over. This is the easiest solution, since you can grab something like gzip and pack/unpack the files programmatically. This may help considerably, because modern computers prefer transmitting/receiving one large file over many small files (a single 1GB file will transfer much faster than 10000 100KB files).
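A minimal sketch of that approach using only the Python 3 standard library (the paths are hypothetical), packing the whole study directory into one LZMA-compressed archive before transfer and unpacking it on the other side:
import tarfile

# Sender: pack every .dcm file in the study directory into a single .tar.xz archive.
# A solid archive compresses all slices in one stream, which also exploits
# the similarity between neighbouring slices.
with tarfile.open('study.tar.xz', 'w:xz') as archive:
    archive.add('path/to/study_directory', arcname='study')

# Receiver: unpack into a local directory.
with tarfile.open('study.tar.xz', 'r:xz') as archive:
    archive.extractall('received_study')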
As for actually reducing the aggregate size, each .dcm file is probably a slice (if you're looking at something like MRI or CT data), and the viewer you are using reconstructs the slices into the 3d image. Cropping them isn't impossible, but parsing the DICOM format is a bit tricky. I'm not aware of any free programs that will help you parse the DICOM files, but I haven't looked for some time.
Since DICOM is a container format, the image data you are after is usually stored in a common format (such as JPEG), so if you are able to grab the relevant part of the file to extract the image data, you can use any of the loads of image processing tools available to crop the image to whatever dimensions you choose.
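If you do decide to crop despite the caveats raised above, a DICOM library such as pydicom can handle the parsing. A rough sketch, assuming uncompressed single-frame pixel data and hypothetical crop bounds, and regenerating UIDs as noted in the earlier answer:
import pydicom
from pydicom.uid import generate_uid

ds = pydicom.dcmread('slice.dcm')
cropped = ds.pixel_array[100:300, 100:300]   # hypothetical region of interest

ds.Rows, ds.Columns = cropped.shape
ds.PixelData = cropped.tobytes()
# Cropping produces a derived image, so fresh UIDs are required.
ds.SOPInstanceUID = generate_uid()
ds.SeriesInstanceUID = generate_uid()
ds.save_as('slice_cropped.dcm')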
We have a compression router called "DICOM Shrinkinator" that can do this as it transmits the study to PACS:
http://fluxinc.ca/medical/dicom-shrinkinator/