AWS ec2 to run a python program using latex and OpenCV - amazon-web-services

A friend and I are working on a machine learning project together. We've managed to collect about 5,000 tex documents (we hope to get up to around 100,000 soon). We have a python script that we run on each document to do some text manipulation, extract particular parts of the tex code, compile the parts, convert the compiled parts to cropped PNG images, and search a converted PNG of the full tex for the cropped images using OpenCV. The code takes between 30 seconds and 2 minutes on the documents we've tried so far, so we really need to speed it up.
I've been tasked with gaining access to a computer cluster and figuring out how to implement our code on such a cluster. Someone suggested I look into using AWS, so I've made an account and have been trying to figure out how to use EC2 for the past few hours. Am I on the right track, or is there some other part of AWS or something else entirely that would be better suited to my task?
Whatever I use, it has to have access to the various python libraries in our code and to pdflatex and the full set of tex packages. Is this possible on EC2? I have almost no idea how to go about using EC2 (I've managed to start some instances, but how do I use them to run my script? and do I need to change my python script to accomodate the parallel processing, or does EC2 take care of that somehow? is it as easy as starting a linux instance and installing the programs I need like I would on any other linux machine?). None of the tutorials are immediately useful, and I'm still not even sure if EC2 is capable of doing what I'm looking for. Any advice is appreciated.

I wouldn't normally answer this kind of question but it sounds like you are doing something interesting. So let's have a go
Q1.
"We have a python script that we run on each document to do some text
manipulation, extract particular parts of the tex code, compile the
parts, convert the compiled parts to cropped PNG images, and search a
converted PNG of the full tex for the cropped images using OpenCV.. we
really need to speed it up"
Probably you could split the 100,000 documents into 10 parts and set up
10 instances of the processing software and do the run in parallel.
To set up 10 instances the same, there are many methods but one of the simpler ways is to set up one machine as desired, take a snapshot, make an AMI and then
use the AMI to launch many more copies.
There might be an extra step with putting the results of the search into some
kind of central database.
I don't know anything about OpenCV but there are several suggestions that with a G3 instance type (this has a GPU) it might go faster. Google for "Open CV on AWS"
Q2.
"trying to figure out how to use EC2 for the past few hours. Am I on
the right track, or is there some other part of AWS or something else
entirely that would be better suited to my task?"
EC2 is a general purpose virtual machine, so if you already have code that runs on
some other machine it is easy to move it to EC2
EC2 has many features but one you might find interesting is "spot instances", these are short lived but cheap ( typically 10% of the price ) instance launch
Q3.
Whatever I use, it has to have access to the various python libraries
in our code and to pdflatex and the full set of tex packages. Is this
possible on EC2?
Yes, they will pip install or install from packages just like any other system
Q4.
how do I use them to run my script? and do I need to change my python
script to accomodate the parallel processing, or does EC2 take care of
that somehow? is it as easy as starting a linux instance and
installing the programs I need like I would on any other linux
machine?
As described above your basic task seems to scale well, you may need a step to
collate the results. Yes it is basically the same as any other linux machine

Related

How do you clear the persistent storage for a notebook instance on AWS SageMaker?

So I'm running into the following error on AWS SageMaker when trying to save:
Unexpected error while saving file: untitled.ipynb [Errno 28] No space left on device
If I remove my notebook, create a new identical one and run it, everything works fine. However, I'm suspecting the Jupyter checkpoint takes up too much space if I save the notebook while it's running and therefore I'm running out of space. Sadly, getting more storage is not an option for me, so I'm wondering if there's any command I can use to clear the storage before running my notebook?
More specifically, clearing the persistent storage in the beginning and at the end of the training process.
I have googled like a maniac but there is no suggestion aside from "just increase the amount of storage bro" and that's why I'm asking the question here.
Thanks in advance!
If you don't want your data to be persistent across multiple notebook runs, just store them in /tmp which is not persistent. You have at least 10GB. More details here.
I had the exact same problem, and was not unable to find a decent answer to it online. However, I was fortunately able to resolve the issue.
I use an R kernel, so the solution might be slightly different.
You can check the storage going in the terminal and typing db -kh
You are likely mounted on the /home/ec2-user/SageMaker and can see its "Size" "Used" "Avail" and "Use%".
There are hidden folders that function as a recycle bin. When I use R command list.dirs() it reveals a folder named ./.Trash-1000/ which kept a lot of random things that had been supposedly removed from the storage.
I just deleted the folder unlink('./.Trash-1000/', recursive = T) and it the entire storage was freed.
Hope it helps.

How to convert food-101 dataset into usable format for AWS SageMaker

I'm still very new to the world of machine learning and am looking for some guidance for how to continue a project that I've been working on. Right now I'm trying to feed in the Food-101 dataset into the Image Classification algorithm in SageMaker and later deploy this trained model onto an AWS deeplens to have food detection capabilities. Unfortunately the dataset comes with only the raw image files organized in sub folders as well as a .h5 file (not sure if I can just directly feed this file type into sageMaker?). From what I've gathered neither of these are suitable ways to feed in this dataset into SageMaker and I was wondering if anyone could help point me in the right direction of how I might be able to prepare the dataset properly for SageMaker i.e convert to a .rec or something else. Apologies if the scope of this question is very broad I am still a beginner to all of this and I'm simply stuck and do not know how to proceed so any help you guys might be able to provide would be fantastic. Thanks!
if you want to use the built-in algo for image classification, you can either use Image format or RecordIO format, re: https://docs.aws.amazon.com/sagemaker/latest/dg/image-classification.html#IC-inputoutput
Image format is straightforward: just build a manifest file with the list of images. This could be an easy solution for you, since you already have images organized in folders.
RecordIO requires that you build files with the 'im2rec' tool, re: https://mxnet.incubator.apache.org/versions/master/faq/recordio.html.
Once your data set is ready, you should be able to adapt the sample notebooks available at https://github.com/awslabs/amazon-sagemaker-examples/tree/master/introduction_to_amazon_algorithms

How to parse freebase quad dump using Amazon mapreduce

Im trying to extract movie informations from freebase, i just need name of the movie, name and id of the director and of the actors.
I found it hard to do so using freebases topic dumps, because there is no reference to the director ID, just directors name.
What is the right approach for this task? Do i need to parse somehow whole quad dump using amazons cloud? Or is there some esy way?
You do need to use the quad dump, but it is under 4 GB and shouldn't require Hadoop, MapReduce, or any cloud processing to do. A decent laptop should be fine. On a couple year old laptop, this simple-minded command:
time bzgrep '/film/' freebase-datadump-quadruples.tsv.bz2 | wc -l
10394545
real 18m56.968s
user 19m30.101s
sys 0m56.804s
extracts and counts everything referencing the film domain in under 20 minutes. Even if you have to make multiple passes through the file (which is likely), you'll be able to complete your whole task in less than an hour, which should mean there's no need for beefy computing resources.
You'll need to traverse an intermediary node (CVT in Freebase-speak) to get the actors, but rest of your information should be connected directly to the subject film node.
Tom
First of all, I completely share Tom's point of view and his suggestion. I often use UNIX command line tools to take 'interesting' slices of data out of Freebase data dump.
However, an alternative would be to load Freebase data into a 'graph' storage system locally and use APIs and/or the query language available from that system to interact with the data for further processing.
I use RDF, since the data model is quite similar and it is very easy to convert the Freebase data dump into RDF (see: https://github.com/castagna/freebase2rdf). I then load it into Apache Jena's TDB store (http://incubator.apache.org/jena/documentation/tdb/) and use the Jena APIs or SPARQL for further processing.
Another reasonable and scalable approach would be to implement what you need to do in MapReduce, but this makes sense only if the amount of processing you do is touching a large fraction of Freebase data and it is not as trivial as counting lines. This is more expensive than using your own machine, you need an Hadoop cluster or you need to use Amazon EMR. (I should probably write a MapReduce version of freebase2rdf ;-))
My 2 cents.

How to measure the amount of data transmitted by my MPI program?

I'm experimenting my distributed clustering algorithm (implemented with MPI) on 24 computers that I set up as a cluster using BCCD (Bootable Cluster CD) that can be downloaded at http://bccd.net/.
I've written a batch program to run my experiment that consists in running my algorithm several times varying the number of nodes and the size of the input data.
I want to know the amount of data used in the MPI communications for each run of my algorithm so I can see how the amount of data changes when varying the previous mentioned parameters. And I want to do all this automatically using a batch program.
Someone told me to use tcpdump, but I found some difficulties in this approach.
First, I don't know how to call tcpdump in my batch program (which is written in C++ using the command system for making calls) before each run of my algorithm, since tcpdump requires another terminal to run in parallel with my application. And I can't run tcpdump in another computer since the network uses a switch. So I need to run it on the master node.
Second, I saw the traffic with tcpdump while my experiment was going on and I couldn't figure out what was the port used by MPI. It seems to use many ports. I wanted to know that for filtering the packages.
Third, I tried capturing whole packages and saving it to a file using tcpdump and in a few seconds the file was 3,5MB. But my whole experiment takes 2 days. So the final log file will be huge if I follow this approach.
The ideal approach would be to capture just the size field in the header of the packages and sum this up to obtain the total amount of data transmitted. In that way the logfile would be much smaller than if I were capturing the whole package. But I don't know how to do it.
Another restriction is that I don't have access to the computer disc. So I only have the RAM and my 4GB USB Flash drive. So I can't have huge logfiles.
I have already thought about using some MPI tracing or profiling tool such as those mentioned at http://www.open-mpi.org/faq/?category=perftools. I have only tested Sun Performance Analyzer until now. The problem is that I guess it will be difficult to install those tools on BCCD and maybe even impossible. In addtion to that, this tool will make my experiment take longer to end, sice it adds overhead. But if someone is familiar with BCCD and think it is a good choice to use one of those tools, so please let me know.
Hope someone have a solution.
Implementations like tcpdump won't work if there are multi-core nodes which use shard memory to communicate, anyway.
Using something like MPE is almost certainly the way to go. Those tools add very little overhead, and some overhead is always going to be necessary if you want to count messages. You can use mpitrace to write out every MPI call, and parse the resulting text file yourself. By the way, note that MPE is explicitly discussed on the bccd website. MPICH2 comes with MPE built in, but it can be compiled for any implementation. I've only found a very modest overhead for MPE.
IPM is another nice tool that does counting of messages and sizes; you should be able either parse the XML output, or use the postprocessing tools and just manually integrate the graphs (say either bytes_rx/bytes_tx by rank, or the message buffer size/count graph). The overhead for IPM is even less than for MPE, and mostly comes after the program's finished running to do the file I/O.
If you were really super worried about the overhead with either of these approaches, you could always write your own MPI wrappers using the profiling interface that wrapped MPI_Send, MPI_Recv, etc, and just counted # of bytes sent and recieved for each process, and output only that total at the end.

Comparing DLL sets from the same app on two different machines

Is there a good way to compare the DLLs loaded between two machines running the same app. (And to replicate the process between N other machines, two at a time?)
Background: I am trying to track down a configuration/setup issue. It's the age-old, DLL-hell-type problem where an app will run on one machine but not on another.
I have eliminated our installer as an issue; it's stable but there are differences between the target systems. Different Windows flavors, MDAC versions etc.
I have tried: exporting EXE snapshots with Proc Explorer to a delimited file and using Excel to do the comparison. But this is very time-consuming and error prone. (I'm not ruling out Excel as a possibility, i just don't know enough tricks to use it to my ends.)
I'd recommend you take a look at EasyHook, using it, you can create a detour on all calls to LoadLibraryA and LoadLibraryW. This way you can monitor all files that gets loaded, and get the path to them. After that, you can compare the files in whatever way you'd like. If you need help using EasyHook, let me know, and I'll cook up an example.