How to move some vmdk files to a larger disk partition? - vmware

Using VMware Workstation 10.
The vmdk files are exhausting my "D:\" space, so I want to move some of the vmdk files to a larger disk partition (e.g. "C:\").
I tried moving some of the slice vmdk files into "C:\Ubuntu-1-vmd-extended" and modified the main .vmdk descriptor file as below:
# Extent description
RW 4192256 SPARSE "C:\Ubuntu-1-vmd-extended\Ubuntu 64 位-s001.vmdk"
RW 4192256 SPARSE "C:\Ubuntu-1-vmd-extended\Ubuntu 64 位-s002.vmdk"
RW 4192256 SPARSE "C:\Ubuntu-1-vmd-extended\Ubuntu 64 位-s003.vmdk"
RW 4192256 SPARSE "C:\Ubuntu-1-vmd-extended\Ubuntu 64 位-s004.vmdk"
RW 4192256 SPARSE "C:\Ubuntu-1-vmd-extended\Ubuntu 64 位-s005.vmdk"
RW 4192256 SPARSE "Ubuntu 64 位-s006.vmdk"
RW 4192256 SPARSE "Ubuntu 64 位-s007.vmdk"
RW 4192256 SPARSE "Ubuntu 64 位-s008.vmdk"
RW 4192256 SPARSE "Ubuntu 64 位-s009.vmdk"
RW 4192256 SPARSE "Ubuntu 64 位-s010.vmdk"
RW 4192256 SPARSE "Ubuntu 64 位-s011.vmdk"...
But it doesn't work; the VM cannot start now.

I have found a solution: use a symbolic link in Windows 7.
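For example, a rough sketch of that workaround (the original VM folder "D:\Ubuntu-1" below is only a placeholder for wherever the descriptor actually lives): move a slice file to C:\, then recreate it at its old location as a symbolic link from an elevated command prompt, so the descriptor does not need to be edited at all:
mklink "D:\Ubuntu-1\Ubuntu 64 位-s001.vmdk" "C:\Ubuntu-1-vmd-extended\Ubuntu 64 位-s001.vmdk"
Repeat for each slice you moved (mklink /D can link a whole directory instead), then start the VM normally.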


Ultrafast 2x lossy audio/image compression algorithm?

I'm looking for an audio or image compression algorithm that can compress a torrent of 16-bit samples
by a fairly predictable amount (2-3x)
at very high speed (say, 60 cycles per sample at most: >100MB/s)
with lossiness being acceptable but, of course, undesirable
My data has characteristics of both images and audio (it is 2-dimensional, correlated in both dimensions, and audio-like in one dimension), so algorithms for audio or images might both be appropriate.
An obvious thing to try would be this one-dimensional algorithm:
break up the data into segments of 64 samples
measure the range of values among those samples (as an example, the samples might be between 3101 and 9779 in one segment, a difference of 6678)
use 2 to 4 additional bytes to encode the range
linearly downsample each 16-bit sample to 8 bits in that segment.
For example, I could store the base 3101 in 16 bits and a scaling factor ceil(6678/256) = 27 in 8 bits, then convert each 16-bit sample to 8 bits as s8 = (s16 - base) / scale, where base = 3101 + (27 >> 1) and scale = 27, with the obvious decompression "algorithm" of s16 = s8 * 27 + 3101. Compression ratio: 128/67 = 1.91.
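For concreteness, a rough sketch of that block-scaling scheme in C++ (the function name and the 4-byte header layout are my own choices; the 3-byte header described above gives the quoted 1.91x, this version spends one extra byte on the scale so that every possible 16-bit range fits):
#include <algorithm>
#include <cstdint>
#include <vector>

// One block: 2-byte base + 2-byte scale header, then 64 8-bit offsets
// (68 bytes out for 128 bytes in, about 1.88x).
void compress_block(const int16_t in[64], std::vector<uint8_t>& out) {
    int lo = in[0], hi = in[0];
    for (int i = 1; i < 64; ++i) { lo = std::min<int>(lo, in[i]); hi = std::max<int>(hi, in[i]); }
    int scale = (hi - lo) / 256 + 1;                    // guarantees (s16 - lo) / scale <= 255
    out.push_back(uint8_t(lo));      out.push_back(uint8_t(lo >> 8));      // base, little-endian
    out.push_back(uint8_t(scale));   out.push_back(uint8_t(scale >> 8));   // scale
    for (int i = 0; i < 64; ++i)
        out.push_back(uint8_t((in[i] - lo) / scale));   // lossy 16 -> 8 bit downsample
}
// Decompression is s16 = s8 * scale + base (add scale/2 to center the rounding error).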
I've played with some ideas to avoid the division operation, but hasn't someone by now invented a superfast algorithm that could preserve fidelity better than this one?
Note: this page says that FLAC compresses at 22 million samples per second (44 MB/s) at -q6, which is pretty darn good (assuming its implementation is still single-threaded), if not quite enough for my application. Another page reports similar performance for FLAC (40 MB/s on a 3.4 GHz i3-3240, -q5) and three other codecs, depending on quality level.
Take a look at the PNG filters for examples of how to tease out your correlations. The most obvious filter is "sub", which simply subtracts successive samples. The differences should be more clustered around zero. You can then run that through a fast compressor like lz4. Other filter choices may result in even better clustering around zero, if they can find advantage in the correlations in your other dimension.
For lossy compression, you can decimate the differences before compressing them, dropping a few low bits until you get the compression you want, and still retain the character of the data that you would like to preserve.
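As a rough sketch of that pipeline, assuming liblz4 is available (LZ4_compressBound and LZ4_compress_default are its real C API; the filtering wrapper itself is only an illustration):
#include <cstdint>
#include <vector>
#include <lz4.h>   // liblz4

// PNG-style "sub" filter: replace each sample by its difference from the
// previous one (wraps mod 2^16, still invertible), then LZ4-compress.
std::vector<char> sub_filter_and_compress(const std::vector<int16_t>& samples) {
    std::vector<int16_t> residual(samples.size());
    int16_t prev = 0;
    for (size_t i = 0; i < samples.size(); ++i) {
        residual[i] = int16_t(samples[i] - prev);   // differences cluster around zero
        prev = samples[i];
    }
    const char* src = reinterpret_cast<const char*>(residual.data());
    const int src_size = int(residual.size() * sizeof(int16_t));
    std::vector<char> dst(LZ4_compressBound(src_size));
    const int written = LZ4_compress_default(src, dst.data(), src_size, int(dst.size()));
    dst.resize(written > 0 ? written : 0);          // 0 means compression failed
    return dst;
}
For the lossy variant, drop a few low bits of each residual before the compress call; to keep the error from drifting, compute each difference against the value the decoder will actually reconstruct rather than against the original previous sample.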

Meaning of the CL_DEVICE... parameters

I've implemented a function that retrieves some information about my OpenCL devices; specifically, I have this device:
1. Vendor NVIDIA Corporation
1. Device: GeForce GTX 1070
1.1 Hardware Version: OpenCL 1.2 CUDA
1.2 Software Version: 391.24
1.3 OpenCL C version: OpenCL C 1.2
1.4 Address bits: 64
1.5 Max Work Item Dimensions: 3
1.6 Work Item Sizes 1024 1024 64
1.7 Work group size: 1024
1.8 Parallel compute units 15
And I need to be sure I understand some of them (specifically work groups/items).
Given that I have Work Item Sizes: 1024 1024 64, does this mean that when I launch a kernel I can use a total of 2^26 work items? The Work group size: 1024 means, I guess, the maximum number of work items per work group (useful in case I need to use barriers etc.). I'm not sure about the Parallel compute units, because given the name it seems they should already be covered somehow by the work items, so:
What's the meaning of Parallel compute units (CL_DEVICE_MAX_COMPUTE_UNITS)?
How does the parameter above relate with Work items?
And one more question
Is there any relationship between Address bits and Work items?
Thank you
What's the meaning of Parallel compute units
On CPUs, this is the number of logical processors. On NVIDIA GPUs, it is the number of "Streaming Multiprocessors"; on AMD GPUs they are actually called "Compute Units".
The point of having these in OpenCL is that on some devices you can "carve them up" by their compute units and launch kernels independently on those sub-devices.
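For illustration, a hedged sketch of that carving-up via OpenCL 1.2 device fission (clCreateSubDevices with CL_DEVICE_PARTITION_EQUALLY; many GPUs, including NVIDIA's, do not support partitioning, so the call may simply fail):
#include <CL/cl.h>
#include <vector>

// Try to split `device` into sub-devices of 4 compute units each.
std::vector<cl_device_id> split_by_compute_units(cl_device_id device) {
    const cl_device_partition_property props[] = { CL_DEVICE_PARTITION_EQUALLY, 4, 0 };
    cl_uint n = 0;
    std::vector<cl_device_id> subs;
    if (clCreateSubDevices(device, props, 0, nullptr, &n) == CL_SUCCESS && n > 0) {
        subs.resize(n);
        clCreateSubDevices(device, props, n, subs.data(), nullptr);  // each can get its own queue
    }
    return subs;
}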
Given that I have Work Item Sizes: 1024 1024 64, does this mean that when I launch a kernel I can use a total of 2^26 work items?
Incorrect. Those are the maximums for each individual dimension of a work group. The "Work group size" limit applies to the product of the dimensions. In other words, with a maximum work group size of 1024 you could use a local size of e.g. [1024,1,1], [128,8,1], or [4,16,4], but [2000,1,1] or [100,100,1] will fail. Go ahead and try it.
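A quick way to convince yourself is to query those limits with clGetDeviceInfo and check a candidate work-group size against them (standard OpenCL calls, error handling omitted; note that a particular kernel may have an even smaller CL_KERNEL_WORK_GROUP_SIZE limit):
#include <CL/cl.h>
#include <cstdio>

// Print the relevant device limits and report whether `local` is a legal work-group size.
void check_local_size(cl_device_id dev, const size_t local[3]) {
    cl_uint cus = 0;
    size_t max_wg = 0, max_item[3] = {0, 0, 0};
    clGetDeviceInfo(dev, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(cus), &cus, nullptr);
    clGetDeviceInfo(dev, CL_DEVICE_MAX_WORK_ITEM_SIZES, sizeof(max_item), max_item, nullptr);
    clGetDeviceInfo(dev, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(max_wg), &max_wg, nullptr);
    // Every dimension must respect its per-axis limit AND the product of the
    // dimensions must not exceed CL_DEVICE_MAX_WORK_GROUP_SIZE.
    const bool ok = local[0] <= max_item[0] && local[1] <= max_item[1] && local[2] <= max_item[2]
                 && local[0] * local[1] * local[2] <= max_wg;
    std::printf("compute units: %u, max work-group size: %zu -> [%zu,%zu,%zu] is %s\n",
                cus, max_wg, local[0], local[1], local[2], ok ? "valid" : "invalid");
}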
The reason for such a (usually) small limit is related to barriers, but also to local memory size (which is relatively tiny on most GPUs).
Also, it's actually explained in the documentation for clEnqueueNDRangeKernel:
local_work_size
Points to an array of work_dim unsigned values that describe the number of
work-items that make up a work-group (also referred to as the size of the
work-group) that will execute the kernel specified by kernel. The total
number of work-items in a work-group is computed as local_work_size[0]
*... * local_work_size[work_dim - 1]. The total number of work-items
in the work-group must be less than or equal to the CL_DEVICE_MAX_WORK_GROUP_SIZE value

Store OpenCV boolean matrix on disk

I have a 1024x1024 float matrix and I want to store the sign of this matrix in a file. For this purpose, I want to keep the sign matrix as a matrix of booleans, which I have failed to do.
Assume my matrix is:
2.312, 0.232, -2.132
5.754, -4.34, -3.23
-4.34, -1.23, 7.9453
My output should be
1,1,0
1,0,0
0,0,1
Since a float is 4 bytes and my matrix has 2^20 (≈1M) elements, its size is 4 MB; a boolean is 1 bit, so for 1M elements I expect the bool Mat to be around 1 Mbit = 128 KB. However, when I use the threshold method in OpenCV, my output file is 1 MB, which means it is saved as uchar (8 bits per element).
I tried to use imwrite but it didn't work.
EDIT: I realized that I didn't mention that speed is also an important factor in my tests. I'm loading approximately 10 million 1K x 1K matrices from disk.
Thanks in advance
In OpenCV you can write
Mat input;             // your float matrix (e.g. CV_32F)
Mat A = (input >= 0);  // element-wise comparison: CV_8U mask with values 0 or 255
Now the problem is that OpenCV has no bitmap (1-bit) data type. So the best you can get is Mat1b (unsigned char).
If you want to save space in your storage, you need to do it on your own. For example, you can use libpng to write out a PNG file of bit depth 1. Unfortunately, imwrite does not support setting that bit depth (it can write PNGs with bit depths 8 and 16).
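If a plain binary dump is acceptable (no image container at all), here is a minimal sketch of doing the packing yourself, assuming the 0/255 CV_8U mask produced by the comparison above (the function name and file layout are mine):
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>
#include <opencv2/opencv.hpp>

// Pack a 0/255 mask into 1 bit per pixel and write it raw:
// a 1024x1024 mask becomes 1024*1024/8 = 131072 bytes (128 KB).
void writePackedMask(const cv::Mat1b& mask, const std::string& path) {
    std::vector<uint8_t> packed((mask.total() + 7) / 8, 0);
    for (int r = 0; r < mask.rows; ++r)
        for (int c = 0; c < mask.cols; ++c) {
            const size_t i = size_t(r) * mask.cols + c;
            if (mask(r, c)) packed[i / 8] |= uint8_t(1u << (i % 8));
        }
    std::ofstream f(path, std::ios::binary);
    f.write(reinterpret_cast<const char*>(packed.data()), std::streamsize(packed.size()));
}
Since you mention loading millions of these matrices, a raw bit-packed dump also avoids the PNG encode/decode cost entirely.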
If you want to write a compressed PNG with bit depth 8, you can use imwrite:
std::vector<int> flags;
flags.push_back(CV_IMWRITE_PNG_COMPRESSION);
flags.push_back(9); // [0-9] 9 being max compression, default is 3
cv::imwrite("output.png", A, flags);
This will result in the best compression effort. Now you can use ImageMagick to compare the file size against the same image stored with bit depth 1:
convert output.png -type Bilevel -define "png:bit-depth=1" -define "png:compression-level=9" output-1b.png
I tested with a random example image (see below).
8 bit, compressed PNG: 24,732 bytes
1 bit, compressed PNG: 20,529 bytes
8 bit, uncompressed PGM: 270,015 bytes
1 bit, uncompressed PBM: 34,211 bytes
As you can see, a compressed 8bit storage still beats uncompressed 1bit storage in this example.

L2 data and instruction cache decreased suddenly

I am working on the performance of parallel algorithms on a multicore machine. I did an experiment on matrix multiplication with the loop-reordering (ikj) technique.
The serial execution results are shown in the images below. The L1 data cache hit rate for loop orders ikj and kij is near 100% for every size of n×n matrix (Image 1, boxes 1 & 2), but as you can see, for loop order ikj at sizes 2048 and 4096 the L2 data cache hit rate suddenly drops by 50% (Image 2, boxes 1 & 2), and the same is true for the L2 instruction cache hit rate. Meanwhile, the L1 data cache hit rate for these two sizes is still about 100%, just like for the other sizes (256, 512, 1024). I could not find any reasonable explanation for this drop in both the instruction and data cache hit rates. Could anyone give me a clue on how to find the reason(s)?
Do you think the unified L2 cache has any effect in exacerbating the problem? Still, what causes this reduction, and what characteristics of the algorithm and its performance should I profile to find the reason?
The experimental machine is an Intel E4500 with a 2 MB L2 cache and 64-byte cache lines; the OS is Fedora 17 x64, and the code is compiled with GCC 4.7 with no compiler optimization.
Abridged & complete question:
Why does this sudden drop of about 50% in both L2 data and instruction cache hit rates happen only for the ikj and kij algorithms (boxed and numbered 1 & 2 in the images), and not for the other loop variations?
[Images 1-5: cache hit-rate plots for the different loop orders and matrix sizes, referenced above]
Despite the aforementioned problem, there is no increase in the running time of the ikj and kij algorithms; they are in fact faster than the others.
ikj and kij are two variations of the loop-reordering technique.
kij Algorithm
for (k = 0; k < n; k++)
    for (i = 0; i < n; i++) {
        r = A[i][k];
        for (j = 0; j < n; j++)
            C[i][j] += r * B[k][j];
    }
ikj Algorithm
for (i = 0; i < n; i++)
    for (k = 0; k < n; k++) {
        r = A[i][k];
        for (j = 0; j < n; j++)
            C[i][j] += r * B[k][j];
    }
thanks
I bet this happens because of the super-alignment issues discussed in the answers to the following questions:
Why is my program slow when looping over exactly 8192 elements?
Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?
Matrix multiplication: Small difference in matrix size, large difference in timings
I hope it is understandable that I don't like to copy&paste from those answers.
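If that is indeed the cause, the usual mitigation suggested in those answers is to keep the row stride away from a large power of two, e.g. by padding each row; a hedged sketch (the padding of 16 floats, one cache line, is an arbitrary illustrative choice):
#include <vector>

// n x n matrix with a padded row stride so that consecutive rows do not all map
// to the same cache/TLB sets when n is a large power of two (2048, 4096).
struct PaddedMatrix {
    int n, stride;
    std::vector<float> data;
    explicit PaddedMatrix(int size) : n(size), stride(size + 16), data(size_t(size) * size_t(size + 16), 0.0f) {}
    float* row(int i) { return data.data() + size_t(i) * stride; }
};

// The ikj loop over padded rows; the arithmetic is unchanged, only the stride differs.
void multiply_ikj(PaddedMatrix& C, PaddedMatrix& A, PaddedMatrix& B) {
    for (int i = 0; i < C.n; ++i)
        for (int k = 0; k < C.n; ++k) {
            float r = A.row(i)[k];
            for (int j = 0; j < C.n; ++j)
                C.row(i)[j] += r * B.row(k)[j];
        }
}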

Quantization of DCT image for steganography

I have a greyscale image. I split it into 8x8 blocks and computed the DCT of each block. I want to quantize the DCT coefficients and then replace their LSBs with my secret message bits. How exactly do I quantize the coefficients? Should I use the quantization matrix used by JPEG? How are the values of such a quantization matrix determined?
You will probably want to set the quality level to the highest (smallest values in the quantization matrix) so that the modified LSB of each coefficient perturbs the image data the least.
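For reference, quantization is just an element-wise divide-and-round of each 8x8 block of DCT coefficients by a quantization table; the sketch below uses the example luminance table from Annex K of the JPEG standard (roughly quality 50). Encoders scale this table for other quality settings, and at the highest quality most entries approach 1.
#include <cmath>
#include <cstdint>

// Example luminance quantization table from the JPEG standard (Annex K).
static const int kLumaQuant[8][8] = {
    {16, 11, 10, 16,  24,  40,  51,  61},
    {12, 12, 14, 19,  26,  58,  60,  55},
    {14, 13, 16, 24,  40,  57,  69,  56},
    {14, 17, 22, 29,  51,  87,  80,  62},
    {18, 22, 37, 56,  68, 109, 103,  77},
    {24, 35, 55, 64,  81, 104, 113,  92},
    {49, 64, 78, 87, 103, 121, 120, 101},
    {72, 92, 95, 98, 112, 100, 103,  99}};

// Quantize one 8x8 block of DCT coefficients; dequantization multiplies back.
void quantize(const double dct[8][8], int16_t out[8][8]) {
    for (int u = 0; u < 8; ++u)
        for (int v = 0; v < 8; ++v)
            out[u][v] = int16_t(std::lround(dct[u][v] / kLumaQuant[u][v]));
}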
For encoding:
You will need access to the DCT values after quantization and before entropy coding. There you can modify the LSBs. You should probably only modify the non-zero coefficient values or you will make the compressed image file much larger and more distorted. This way, you will probably be able to encode 20-30 bits per DCT block.
For decoding:
You will need to do the reverse and get access to the DCT values immediately after the entropy decode and before the dequantization step.
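As an illustration of both directions, a minimal sketch operating on the 64 quantized coefficients of one block; it skips the DC term and, to keep the embedder and extractor synchronized, also skips coefficients equal to 0 or 1 (a 1 could otherwise become 0 and desynchronize the decoder). The names and exact policy are my own:
#include <cstdint>
#include <vector>

// Embed message bits into the LSBs of usable quantized coefficients of one block.
// Returns the number of message bits consumed.
int embedBlock(int16_t coeff[64], const std::vector<bool>& bits, int pos) {
    int used = 0;
    for (int i = 1; i < 64 && pos + used < int(bits.size()); ++i) {
        if (coeff[i] == 0 || coeff[i] == 1) continue;              // skipped by the extractor too
        coeff[i] = int16_t((coeff[i] & ~1) | (bits[pos + used] ? 1 : 0));
        ++used;
    }
    return used;
}

// Extract: read the LSB of every coefficient the embedder would have used.
void extractBlock(const int16_t coeff[64], std::vector<bool>& bits) {
    for (int i = 1; i < 64; ++i)
        if (coeff[i] != 0 && coeff[i] != 1) bits.push_back((coeff[i] & 1) != 0);
}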
To calculate the total number of bits available for your message, use the following example:
For a VGA sized image (640x480) which is encoded as 4:2:0 (subsampled color in both dimensions), you will have 40 x 30 = 1200 MCUs. Each MCU has 6 DCT blocks (4Y, 1Cr, 1Cb). This is a total of 7200 DCT blocks. If each block encodes an average of 25 coefficients (a reasonable quality level), then your message can be a total of 7200x25 = 180000 bits.