I am trying to follow the example here and create my own dataset for training using MXnet. My data is organized as specified in the example:
/data
    yes/
        file1.png
        file2.png
        ...
    no/
        file1.png
        file2.png
        ...
The tutorial says the first step is to run im2rec.py to create a .lst file, then run im2rec.py again (different options) to create the .rec file. To create the .lst file I type:
> python tools/im2rec.py my_data /data --list True --recursive True --train-ratio .75 --exts .png
After doing this, two files are created (as expected), my_data_train.lst and my_data_val.lst. The total number of lines in the two files is the same as the number of files in my yes/ and no/ directories combined. Then, I attempt to run im2rec a second time to create the .rec file using:
> python tools/im2rec.py my_data /data --resize 227 --num-thread 16
This runs for a few seconds and then (silently) crashes. In the process it creates 4 empty files: my_data_train.idx, my_data_train.rec, my_data_val.idx, and my_data_val.rec.
Question: What do I need to do differently to be able to create a proper .rec file containing my own .png images?
Extra Details:
I am working inside a docker container (mxnet/python:gpu) provided by dmlc on docker hub; they also provided the example on their github page. The data is available through a shared directory in the container, so it is possible that this is a docker issue. What makes me slightly worried that it is a docker issue is that I had to pip install opencv-python in order for im2rec to be able to import cv2... I would have hoped that the people providing the container would have taken care of this.
You are right that the image is missing opencv for python. Instead of installing via pip, please do apt-get install python-opencv.
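Inside the running container, that would be something along these lines (a sketch; not tested in the exact mxnet/python:gpu image):

apt-get update
apt-get install -y python-opencv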
PR posted here: Using im2rec in MXnet to create dataset with png images
What is the idiomatic way to write a docker file for building against many different versions of the same compiler?
I have a project which tests against a wide range of versions of different compilers like gcc and clang as part of a CI job. At some point, the agents for the CI tasks were updated/changed, resulting in newer jobs failing -- and so I've started looking into dockerizing these builds to try to guarantee better reliability and stability.
However, I'm having some difficulty understanding what a proper and idiomatic approach is to producing build images like this without introducing a large amount of duplication across layers.
For example, let's say I want to build using the following toolset:
gcc 4.8, 4.9, 5.1, ... (various versions)
cmake (latest)
ninja-build
I could write something like:
# syntax=docker/dockerfile:1.3-labs
# Parameterizing here possible, but would cause bloat from duplicated
# layers defined after this
FROM gcc:4.8
ENV DEBIAN_FRONTEND noninteractive
# Set the work directory
WORKDIR /home/dev
COPY . /home/dev/
# Install tools (cmake, ninja, etc)
# this will cause bloat if the FROM layer changes
RUN <<EOF
apt update
apt install -y cmake ninja-build
rm -rf /var/lib/apt/lists/*
EOF
# Default command is to use CMake
CMD ["cmake"]
However, the installation of tools like ninja-build and cmake occurs after the base image, which changes per compiler version. Since these layers are built off of a different parent layer, this would (as far as I'm aware) result in layer duplication for each different compiler version that is used.
One alternative to avoid this duplication could hypothetically be using a smaller base image like alpine with separate installations of the compiler instead. The tools could be installed first so the layers remain shared, and only the compiler changes as the last layer -- however this presents its own difficulties, since it's often the case that certain compiler versions may require custom steps, such as installing certain keyrings.
What is the idiomatic way of accomplishing this? Would this typically be done through multiple docker files, or a single docker file with parameters? Any examples would be greatly appreciated.
I would separate preparing the compiler from running the build, so the source doesn't become part of the docker container.
Prepare Compiler
For preparing the compiler I would take the ARG approach, but without copying the data into the container. If you want fast retries and have enough resources, you can spin up multiple instances at the same time.
ARG COMPILER=gcc:4.8
FROM ${COMPILER}
ENV DEBIAN_FRONTEND noninteractive
# Install tools (cmake, ninja, etc)
# this will cause bloat if the FROM layer changes
RUN <<EOF
apt update
apt install -y cmake ninja-build
rm -rf /var/lib/apt/lists/*
EOF
# Set the work directory
VOLUME /src
WORKDIR /src
CMD ["cmake"]
Build it
Here you have a few options. You could either prepare a volume with the sources or use bind mounts together with docker run like this:
#bash style
for compiler in gcc:4.9 gcc:4.8 gcc:5.1
do
docker build -t mytag-${compiler} --build-arg COMPILER=${compiler} .
# place to clean the target folder
docker run -v $(pwd)/src:/src mytag-${compiler}
done
And because the source is not part of the docker image, you don't have bloat. You can also have two mounts, one for a read-only source tree and one for the output files.
Note: If you remove the CMake command you could also spin up the docker containers in parallel and use docker exec to start the build. The downside of this is that you have to take care of out-of-source builds to avoid clashes on the output folder.
Put an ARG before the FROM and then invoke the ARG as the FROM,
so:
ARG COMPILER=gcc:4.8
FROM ${COMPILER}
# rest goes here
then you
docker build . -t test/clang-8 --build-arg COMPILER=clang-8
or similar.
If you want to automate it, just make a list of compilers and a bash script that loops over the lines in the file, passing each line as the image tag and the COMPILER build arg, as sketched below.
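A minimal sketch of that loop, assuming a hypothetical compilers.txt with one base image per line (e.g. gcc:4.8, gcc:4.9):

while read -r compiler; do
    # colons are not allowed in image tags, so replace them with dashes
    docker build . -t "test/${compiler//:/-}" --build-arg COMPILER="${compiler}"
done < compilers.txt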
As for CMake, I'd just do:
RUN wget -qO- "https://cmake.org/files/v3.23/cmake-3.23.1-linux-"$(uname -m)".tar.gz" | tar --strip-components=1 -xz -C /usr/local
When copying, I find it cleaner to do
WORKDIR /app/build
COPY . .
As far as I know, there is no way to do that easily and safely. You could use a RUN --mount=type=cache, but the documentation clearly says that:
Contents of the cache directories persist between builder invocations without invalidating the instruction cache. Cache mounts should only be used for better performance. Your build should work with any contents of the cache directory as another build may overwrite the files or GC may clean it if more storage space is needed.
I have not tried it, but I guess the layers are duplicated anyway; you just save time, assuming the cache is not emptied.
The other possible solution you have is similar to the one you mention in the question: starting with the tools installation and then customizing it with the gcc image. Instead of starting with an alpine image, you could start FROM scratch. scratch is basically the empty image; you could COPY the files generated by
RUN <<EOF
apt update
apt install -y cmake ninja-build
rm -rf /var/lib/apt/lists/*
EOF
Then you COPY the entire gcc filesystem. However, I am not sure it will work because the order of the initial layers is now reversed. This means that some files that were in the upper layer (coming from tools) now are in the lower layer and could be overwritten. In the comments, I asked you for a working Dockerfile because I wanted to try this out before answering. If you want, you can try this method and let us know. Anyway, the first step is extracting the files created from the tools layer.
How to extract changes from a layer?
Let's consider this Dockerfile and build it with docker build -t test .:
FROM debian:10
RUN apt update && apt install -y cmake && ( echo "test" > test.txt )
RUN echo "new test" > test.txt
Now that we have built the test image, it should consist of 3 layers: the debian base plus one for each RUN instruction. You mainly have 2 ways to extract the changes from each layer:
the first is to docker inspect the image and then find the ids of the layers under /var/lib/docker, assuming you are on Linux. Each layer has a diff subfolder containing the changes. Actually, I think it is more complex than this, which is why I would opt for...
skopeo: you can install it with apt install skopeo and it is a very useful tool for operating on docker images. The command you are interested in is copy, which extracts the layers of an image and exports them as .tar files:
skopeo copy docker-daemon:{image_name}:latest "dir:/home/test_img"
where image_name is test in this case.
Extracting layer content with Skopeo
In the specified folder, you should find some tar files and a configuration file (look at the skopeo copy command output and you will know which one it is). Then extract each {layer}.tar into a different folder and you are done.
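For example, something along these lines (untested; the actual file names depend on what skopeo produced):

# unpack every layer tar into its own subfolder
mkdir -p layers
for layer in /home/test_img/*.tar; do
    name=$(basename "$layer" .tar)
    mkdir -p "layers/$name"
    tar -xf "$layer" -C "layers/$name"
done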
Note: to find the layer containing your tools just open the configuration file (maybe using jq because it is json) and take the diff_id that corresponds to the RUN instruction you find in the history property. You should understand it once you open the JSON configuration. This is unnecessary if you have a small image that has, for example, debian as parent image and a single RUN instruction containing the tools you want to install.
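If the configuration file follows the usual Docker/OCI image config layout, a quick way to see the history and the layer ids side by side could be (config.json is a placeholder name here):

jq '{history: .history, diff_ids: .rootfs.diff_ids}' config.json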
Get GCC image content
Now that we have the tools layer content, we need to extract the gcc filesystem. We don't need skopeo for this one; docker export is enough:
create a container from gcc (with the tag you need):
docker create --name gcc4.8 gcc:4.8
export it as tar:
docker export -o gcc4.8.tar gcc4.8
finally extract the tar file.
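For example (the folder name is just an illustration, pick whatever matches your layout):

mkdir -p gcc_4.8
tar -xf gcc4.8.tar -C gcc_4.8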
Putting all together
The final Dockerfile could be something like:
FROM scratch
COPY ./tools_layer/ /
COPY ./gcc_4.x/ /
In this way, the tools layer is always reused (unless you change the content of that folder, of course), but you can parameterize the gcc_4.x with the ARG instruction for example.
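A minimal sketch of that parameterization, assuming the extracted folders are named as in the example above:

FROM scratch
# which extracted gcc tree to copy; folder names are illustrative
ARG GCC_DIR=gcc_4.8
COPY ./tools_layer/ /
COPY ./${GCC_DIR}/ /

You would then build each variant with docker build . --build-arg GCC_DIR=gcc_4.9 and so on.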
Read carefully: all of this is not tested but you might encounter 2 issues:
the gcc image overwrites some files you have changed in the tools layer. You could check if this happens by computing the diff between the gcc layer folder and the tools layer folder. If it happens, you can keep track of those files and add them in the Dockerfile after the COPY ./gcc ... with another COPY.
When a file is removed in an upper layer, docker marks it with a .wh. prefix (not sure if it is different with skopeo). If in the tools layer you delete a file that exists in the gcc layer, then that file will not be deleted using the above Dockerfile (the COPY ./gcc ... instruction would overwrite the .wh. marker). In this case too, you would need to add an additional RUN rm ... instruction.
This is probably not the correct approach if you have a more complex image than the one you are showing us. In my opinion, you could give this a try and just see if it works out with a single Dockerfile. Obviously, if you have many compilers, each one having its own tool set, the maintainability of this approach could be a real burden. Instead, if the Dockerfile is more or less linear for all the compilers, this might be good (after all, you do not do this every day).
Now the question is: is avoiding layer replication so important that you are willing to complicate the image-building process this much?
I am running the following line:
wget -P "C:\My Web Sites\REGEX" -r --no-parent -A jpg,jpeg https://www.mywebsite.com/directory1/directory2/
and it stops (no errors) without returning more than a small amount of the website (two files). I am then running this:
wget -P "C:\My Web Sites\REGEX" https://www.mywebsite.com/directory1/directory2/ -m
and expecting to see data only from the directory. As a start, I found out that the script downloaded everything from the website as if I gave the https://www.mywebsite.com/ url. Also, the images are returned with an additional string in the extension (e.g. instead of .jpg I get something like .jpg#f=l=q)
Is there anything wrong in my code that causes that? I only want to get the images from the links that are shown in the directory given initially.
If there is nothing I can change, then I want to only download the files that contain .jpg in their names. I have a prepared script in Python that can rename the files back to the original extension. Worst case, I can try Python (page scraping) instead of wget from the Windows CMD.
Note that --no-parent doesn't work in this case because the images are saved in a different directory. --accept-regex can be used if there is no way to get the correct extension.
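Untested, but a sketch of that --accept-regex option with the URL from the question would be (the regex matches anywhere in the full URL, so it also catches the .jpg#f=l=q style names):

wget -P "C:\My Web Sites\REGEX" -r --accept-regex ".*\.jpe?g.*" https://www.mywebsite.com/directory1/directory2/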
PS: I do this thing in order to learn more about the wget options and protect my future hobby website.
UPD: Any suggestions regarding a Python script are welcome.
I am on a Mac and am trying to import a virtual machine image (.ova file). I try to import the file to create a VM and get the following error.
Could not find a storage controller named 'SCSI Controller'
Are there any existing solutions for this problem?
I got a clue to the answer from here: https://ctors.net/2014/07/17/vmware_to_virtualbox
Basically you need to change the virtual disk controller, e.g. change ddb.adapterType from "buslogic" or "lsilogic" to "ide".
However, if you don't have VMware to boot the original image, remove the VMware tools, and remove the hard disk, you can hack the .ovf file in the .ova file to switch the virtual SCSI controller to an IDE controller.
Here's how.
First open the .ova archive; let's assume it's in the current dir and called vm.ova:
mkdir ./temp
cd temp
tar -xvf ../vm.ova
This will extract 3 files: an *.ovf file, a virtual disk *.vmdk file, and a manifest *.mf file.
Edit the .ovf file and find the SCSI reference; it will be lsilogicsas, "buslogic", or "lsilogic". Replace that word with ide.
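If you prefer doing that from the terminal, an untested one-liner (adjust the pattern to whatever your .ovf actually contains) could be:

# replace the SCSI controller reference with ide, keeping a .bak backup of the original
sed -i.bak -E 's/lsilogicsas|lsilogic|buslogic/ide/g' vm.ovf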
While you are at it you may want to rename all the files so that they don't have spaces or strange chars in the name; this makes it more UNIX friendly. Of course, if you rename the files you need to modify the references in the .ovf and .mf files.
Because you've modified the files, you need to recompute the sha1 values in the .mf file, e.g. run sha1sum to get the value and replace the old ones in the .mf file.
$ sha1sum vm.ovf
4806ebc2630d9a1325ed555a396c00eadfc72248 vm.ovf
Now that you've swapped the disk controller and fixed up the manifest's sha1 values, you can pack the .ova back up. The files have to be in order inside the archive, so do this (use your own file names):
tar -cvf ../vm-new.ova ./vm.ovf
tar -rvf ../vm-new.ova ./vm.vmdk
tar -rvf ../vm-new.ova ./vm.mf
Done. Now you can open VirtualBox and click File -> Import Appliance, then point it at the vm-new.ova file. Once done, you should be able to start the VM.
hope that helps.
Cheers Karl
I ran into a similar problem, and I just extracted the .ova file and created a new VM with my own settings using the .vmdk file.
tar -xvf vm.ova
vm.ovf
vm.vmdk
vm.mf
So I'm trying to install a module, specifically xlutils. I've read through the resources that I've linked at the bottom, but none of those resources have allowed me to successfully install and import the module. I'm running Windows 8 and using Python 2.7.
I downloaded the .tar.gz file containing xlutils and unpacked it to C:\Python, which left me with a .tar file, so I unpacked that to the same folder. This created a folder, xlutils, which looked like it contained what I need. I also read somewhere that these should be stored in site-packages, so I moved it there.
But when I run import commands, they don't work; they just tell me the module couldn't be found. When I look at the path browser, it doesn't see the folder, but I'm certain it's in there. That leads me to wonder: do I need to do something to manually update what the path browser can view?
Note that I've also already tried going to the command line, navigating to the folder containing the module, and typing python setup.py install but that just tells me that the term "python" is not found. In general, my command line always does this though. Usually I have to type .\python instead to run Python from the command line, but I also tried doing that here (i.e navigating to the folder and typing .\python setup.py install but it still says the same thing).
Also note that I can import numpy and scipy just fine, and I can see them in the path browser--not sure why those work while this one doesn't.
Resources I've already read but hasn't solved my problem:
(... Well, I tried providing the resources I've already viewed, but I can't post so many links with such low reputation. Basically, I've read the first ten links on a Google search and two or three past Stack questions and answers.)
Solutions I see:
You can use the absolute path C:\Python27\python.exe setup.py install
You can add the Python directory C:\Python27\ to your path variable before running python setup.py install
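For example, from a regular command prompt in the folder containing setup.py (assuming the default C:\Python27 install location; adjust if yours differs):

:: option 1: call the interpreter by its full path
C:\Python27\python.exe setup.py install

:: option 2: add Python to PATH for the current session, then run the usual command
set PATH=%PATH%;C:\Python27
python setup.py install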
I was using VirtualBox on my PC (Windows 7).
I managed to view some files in my .VDI file.
How can I open or view the contents of my .vdi file and retrieve the files from there?
I had a corrupted VDI file (according to countless VDI-viewer programs I've used with cryptic errors like invalid handle, no file selected, please format disk) and I was not able to open the file, even with VirtualBox. I tried to convert it using the VirtualBox command line tools, with no success. I tried mounting it to a new virtual machine, tried mounting it with ImDisk, no dice. I read four Microsoft TechNet articles, downloaded their utilities and tried countless things; no success.
However, when I tried 7Zip (https://www.7-zip.org/download.html) I was able to view all of the files, and extract them selectively. Here's how I did it:
install 7zip (make sure that you also install the context-menu items, if prompted.)
right-click on the VDI file, select "Open Archive"
when the window appears, find the largest file in the archive (there should be two files: one is "Basic Microsoft Data Partition" and the other is something else, called system or similar; the file size is listed to the right of each file, in bytes). Right-click the largest one and click "Open Inside".
you should see all of the files inside of the archive. You can drag files that you'd like to extract right to your desktop. You can double click on folders to view inside them too.
If 7zip gives you a cryptic error after extracting the files, it means that you closed the folder's window that you are copying files to in Windows Explorer.
If you didn't close the window and you're still getting an error, try extracting each sub-folder individually. Also make sure that you have enough local hard drive space to copy the files to, even if you are copying them just to an external disk, as 7zip copies them first to your local disk. If the files are highly compressible, you might be able to get away with using NTFS compression for the AppData/temp folder so that when 7zip extracts the files locally, it'll compress them so that it can copy them over to your other disk.
You can mount partitions from .vdi images using qemu-nbd:
sudo apt install qemu-utils
sudo modprobe nbd
vdi="/path/to/your.vdi" # <<== Edit this
sudo qemu-nbd -c /dev/nbd0 "$vdi"
# view partitions and select the one you want to mount.
# Using parted here, but you can also use cfdisk, fdisk, etc.
sudo parted /dev/nbd0 print
part=nbd0p2 # <<== partition you want to mount
sudo mkdir /mnt/vdi
sudo mount /dev/$part /mnt/vdi
Some users seem to need to add a parameter to the modprobe command. I didn't with Ubuntu 16.04, but if it doesn't work for you, try adding max_part=16 :
sudo modprobe nbd max_part=16
When done:
sudo umount /dev/$part
sudo qemu-nbd --disconnect /dev/nbd0
Try out VMXray.
You can explore your vmdk image right inside your browser. Select the files that you want to extract and extract them to the desired location. Not just vmdk: you can use VMXRay for looking into and extracting files from RAW, QEMU/KVM QCOW2, VirtualBox VDI, and ISO images. ext2, ext3, FAT and NTFS are the currently supported file systems. You can also use it to recover deleted photos from raw dumps of your camera's SD card, for example.
And, do not worry, no data from your files is ever sent over the network. Data never leaves your machine. VMXRay works completely inside your browser.
As a first approach you can simply try any archive viewer to open the .vdi file.
I tried 7zip to open an Ubuntu MATE .vdi file and it showed the whole Linux file system.
An easy way is to attach the VDI as a second disk in another Virtual Machine.
The drive does not appear immediately; in Windows, go to Disk Management, bring the disk online, and assign it a drive letter.
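From the command line, the attaching step can be done with VBoxManage, roughly like this (the VM name, controller name, and path are placeholders for your own setup):

VBoxManage storageattach "OtherVM" --storagectl "SATA" --port 1 --device 0 --type hdd --medium "C:\path\to\disk.vdi"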
You can use ImDisk to mount a VDI file as a local drive in Windows; follow this VirtualBox forum thread and become happy )) You can also convert the VDI to a VHD and use the default Windows Disk Management tool to mount the VHD (described here).
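If you go the VHD route, the conversion itself can be done with VBoxManage, for example (file names are placeholders; on older VirtualBox versions the subcommand is clonehd):

VBoxManage clonemedium disk "disk.vdi" "disk.vhd" --format VHD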