I have a machine with a 500 MHz CPU and 256MB of RAM, running 32-bit Linux.
I have a large number of files around 300KB in size. I need to compress them very fast. I have set up the compression level for zlib at Z_BEST_SPEED. Is there any other measure I could take?
Is it possible to compress 25-30 files like this in a second on such a machine?
You are essentially talking about a 10MB/sec speed (25-30 files of about 300KB each is 7.5 to 9MB every second). Even if you were only to copy the files from one place to another, I doubt that your slow hardware could do it. So, for compression I would vote No - it's not possible "to compress 25-30 files like this in a second on such a machine".
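Rather than guessing, the throughput can be measured on the target machine itself. Here is a minimal sketch in Python, assuming its standard `zlib` module (whose `Z_BEST_SPEED` constant is the same level 1 as in the C library). The function name and the synthetic payload are made up for illustration, and real files may compress faster or slower than this sample.

```python
import time
import zlib

FILE_SIZE = 300 * 1024     # ~300KB per file, as in the question
FILES_PER_SECOND = 30      # upper end of the 25-30 target

# Bytes per second the compressor has to sustain.
required = FILE_SIZE * FILES_PER_SECOND


def compression_throughput(data: bytes,
                           level: int = zlib.Z_BEST_SPEED,
                           repeats: int = 5) -> float:
    """Return the measured compression throughput in input bytes per second."""
    start = time.perf_counter()
    for _ in range(repeats):
        zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    return len(data) * repeats / elapsed


if __name__ == "__main__":
    # Synthetic, highly repetitive payload (an assumption; use a real file instead).
    sample = (b"some repetitive sample text 0123456789 " * 10000)[:FILE_SIZE]
    measured = compression_throughput(sample)
    print(f"required: {required / 1e6:.1f} MB/s, "
          f"measured: {measured / 1e6:.1f} MB/s")
```

If the measured figure is below the required one even for in-memory data, no amount of I/O tuning will close the gap.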
I have a large collection of ISO files (around 1GB each) that have shared 'runs of data' between them. So, for example, one of the audio tracks may be the same (same length and content across 5 isos), but it may not necessarily have the same name or location in each.
Is there some compression technique I can apply that will detect and losslessly deduplicate this information across multiple files?
For anyone reading this, after some experimentation it turns out that by putting all the similar ISO or CHD files in a single 7zip archive (Solid archive, with maximum dictionary size of 1536MB), I was able to achieve extremely high compression via deduplication on already compressed data.
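The effect of a solid archive with a large dictionary can be reproduced at small scale. The sketch below uses Python's standard `lzma` module as a stand-in for 7zip, and seeded pseudo-random bytes as a stand-in for already compressed data; the helper names and sizes are illustrative assumptions only.

```python
import lzma
import random
from typing import List, Tuple


def pseudo_random_bytes(n: int, seed: int) -> bytes:
    """Deterministic, incompressible filler standing in for compressed data."""
    return random.Random(seed).getrandbits(8 * n).to_bytes(n, "little")


def separate_vs_solid(files: List[bytes]) -> Tuple[int, int]:
    """Return (total size with each file compressed alone,
    size with all files compressed as one stream)."""
    separate = sum(len(lzma.compress(f)) for f in files)
    solid = len(lzma.compress(b"".join(files)))
    return separate, solid


# Two "ISOs" that share one 200KB run at different offsets.
shared_track = pseudo_random_bytes(200_000, seed=1)
iso_a = pseudo_random_bytes(50_000, seed=2) + shared_track
iso_b = shared_track + pseudo_random_bytes(50_000, seed=3)

separate, solid = separate_vs_solid([iso_a, iso_b])
print(f"separate: {separate} bytes, solid: {solid} bytes")
```

The shared run is only found when both copies fall inside the compressor's dictionary window, which is why the dictionary size matters so much for files around 1GB each.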
The lrzip program is designed for this kind of thing. It is available through the package managers of most Linux/BSD systems, or via Cygwin for Windows.
It uses an extended version of rzip to first de-duplicate the source files, and then compresses them. Because it uses mmap() it does not have issues with the size of your RAM, like 7zip does.
In my tests lrzip was able to massively de-duplicate similar ISOs, bringing a 32GB set of OS installation discs down to around 5GB.
Can you keep sending the output of BZip2 (or any compression software) back through the compression process over and over again to make the output files smaller and smaller? Can you compress a file using one software (BZip2) that was already compressed using another method (Snappy)?
No and no. (For lossless compression.)
If the original file was extremely redundant, like megabytes of nothing but zeros, then the first, and maybe the second recompression will result in compression. But at some point there will be no gain from recompression, and instead a small increase in file size. For normal files, the first recompression will result in no gain.
This is true regardless of how you might mix lossless compressors.
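This is easy to check empirically. A small sketch, assuming Python's standard `zlib` module as the compressor (any `bytes -> bytes` function such as `bz2.compress` can be passed instead); the function name is made up for illustration.

```python
import zlib
from typing import Callable, List


def recompress_sizes(data: bytes,
                     compress: Callable[[bytes], bytes],
                     rounds: int = 6) -> List[int]:
    """Feed a compressor its own output repeatedly.

    Returns the size after each round; index 0 is the original input size.
    """
    sizes = [len(data)]
    for _ in range(rounds):
        data = compress(data)
        sizes.append(len(data))
    return sizes


# Extremely redundant input: a megabyte of nothing but zeros.
sizes = recompress_sizes(b"\0" * 1_000_000, zlib.compress)
print(sizes)
```

The printed sizes drop sharply at first and then start growing again, because each further round only adds the container overhead to data that is already close to incompressible.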
I installed MonetDB and imported an (uncompressed) 291 GB TSV MySQL dump. It worked like a charm and the database is really fast, but the database needs more than 542 GB on the disk. It seems like MonetDB is also able to use compression, but I was not able to find out how to enable (or even force) it. How can I do so? I don't know if it really speeds up execution, but I would like to try it.
There is no user-controllable compression scheme available in the official MonetDB release. The predominant compression scheme is dictionary encoding for string valued columns. In general, a compression scheme reduces the disk/network footprint by spending more CPU cycles.
To speed up queries, it might be better to first look at the TRACE of the SQL queries for simple hints on where the time is actually spent. This often gives hints on 'liberal' use of column types. For example, a BIGINT is overkill if the actual value range is known to fit in 32 bits.
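To make the column-type point concrete, here is a rough sketch in Python that picks the narrowest integer type for a known value range and estimates the raw column size. It assumes plain two's-complement ranges and fixed-width storage; a particular engine may reserve a value for NULL or add its own overhead, so treat the numbers as approximate. The helper names are hypothetical.

```python
from typing import Optional

# Widths in bytes for common SQL integer types (signed two's complement).
INT_TYPES = [("TINYINT", 1), ("SMALLINT", 2), ("INT", 4), ("BIGINT", 8)]


def smallest_int_type(lo: int, hi: int) -> Optional[str]:
    """Return the narrowest listed type whose signed range contains [lo, hi]."""
    for name, width in INT_TYPES:
        bits = 8 * width
        if -(2 ** (bits - 1)) <= lo and hi <= 2 ** (bits - 1) - 1:
            return name
    return None  # does not fit in 64 bits


def column_bytes(rows: int, type_name: str) -> int:
    """Raw storage for a fixed-width column, ignoring any engine overhead."""
    return rows * dict(INT_TYPES)[type_name]


rows = 1_000_000_000
saved = column_bytes(rows, "BIGINT") - column_bytes(rows, "INT")
print(smallest_int_type(0, 2_000_000_000), f"saves {saved / 1e9:.0f} GB")
```

For a billion rows, narrowing one column from 64 to 32 bits removes about 4 GB of raw data that would otherwise have to be stored and scanned.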
I'm optimizing our web service, and heard about gzip.
It would be good if we could reduce the network load using gzip, but I'm a little worried about how much unpacking overhead it will bring to the client.
In particular, our service uses JavaScript very often, which means that page rendering in the web browser already costs CPU time.
I'm not sure that spending CPU time on decompressing gzip responses (instead of on running JavaScript) would still have a positive effect on our service.
Things like HTML and JavaScript libraries, particularly static files, are good candidates for compression. Images aren't - they're already compressed.
Decompression of gzip-compressed data is very fast compared to most internet connections - a quick test on my PC (AMD Phenom, 2.8GHz) results in decompression of about 170MB/second on a single core. So a ~200KB JavaScript file would be decompressed by a modern browser on a modern PC in about 2 milliseconds, and JavaScript typically compresses to about 25% of its original size (~35% if it is already minified).
Of course, just what proportion of your network load is made up of decompressed javascript is another matter.
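A quick test like the one described above can be repeated with a few lines of Python, assuming the standard `gzip` module. The function name and the fake script content are placeholders; repeated text compresses far better than real code, so substitute an actual JavaScript file for meaningful numbers.

```python
import gzip
import time


def decompression_stats(raw: bytes, repeats: int = 20):
    """Return (compressed/original size ratio,
    decompression throughput in output bytes per second)."""
    packed = gzip.compress(raw)
    start = time.perf_counter()
    for _ in range(repeats):
        unpacked = gzip.decompress(packed)
    elapsed = time.perf_counter() - start
    if unpacked != raw:
        raise ValueError("round trip did not reproduce the input")
    return len(packed) / len(raw), len(raw) * repeats / elapsed


if __name__ == "__main__":
    # Stand-in for a ~200KB script (37 bytes x 5500 lines).
    fake_js = b"function add(a, b) { return a + b; }\n" * 5500
    ratio, speed = decompression_stats(fake_js)
    print(f"ratio: {ratio:.2f}, decompression: {speed / 1e6:.0f} MB/s")
```

Comparing the decompression time against the transfer time saved on a typical client connection shows whether gzip pays off for a given file.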
Is there any way to determine a removable drive's speed in Windows without actually reading in a file? And if I do have to read in a file, how much needs to be read to get a semi-accurate speed (e.g. determine whether a device is USB2 or USB1)?
EDIT: Just to clarify, USB2 and USB1 were an example. These could be Compact Flash, could be SSD, could be a removable drive. And I am trying to determine this as fast as possible as it has a real effect on the responsiveness of the application.
EDIT: Should also clarify, this has to be done programmatically. It will probably be done in C++.
EDIT: Boost answer is kind of what I was looking for (though I haven't written any WMI in C++). But I need to know what properties I have to check to determine relative speed. I don't need exact speed (like I said about the difference in speed between USB1 and USB2), but I need to know if it is going to be SLLOOOOWWW.
WMI - Physical Disks Properties is an article I found which would at least help you figure out what you have connected. I foresee things heading toward tables equating particular manufacturers and models to speeds, which is not as simple a solution as you may have hoped for.
You may have better results querying the operating system for information about the hardware rather than trying to reverse engineer it from data transfer timing information.
For example, identical transfer speeds don't necessarily mean the same technology is being used by two devices, although other factors such as seek times would improve the accuracy, if such information is available to your application.
In order to keep the application responsive while this work is done, try doing the calls asynchronously and provide some sort of progress indicator to the user. As an example, take a look at how WinDirStat handles this progress indication (I love the pac-man animation as each directory is analyzed).
Several megabytes, I'd say. Transfer speeds can start out slow, and then speed up as the transfer progresses. There are also variations because of file sizes (a single 1GB file will transfer much faster than 1GB of smaller files).
The best way to do that would be to copy a file to/from the device, and time how long it takes with your code. USB1 speed is 12Mb/s at full speed (1.5Mb/s for low-speed devices), and USB2 is 480Mb/s (note those are numbers for the whole bus, not each port, so multiple devices on the same bus will change the actual numbers).
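Timing a read of a few megabytes could look roughly like the sketch below. It is written in Python for brevity (the same loop translates directly to C++), the function names and the 8MB default are assumptions, and the operating system's file cache can inflate the result if the file was read recently.

```python
import time


def mbit_to_mbyte_per_s(megabits_per_s: float) -> float:
    """Convert a nominal bus rate from megabits/s to megabytes/s.

    Ignores protocol overhead, so real throughput is lower.
    """
    return megabits_per_s / 8


def measure_read_speed(path: str,
                       max_bytes: int = 8 * 1024 * 1024,
                       chunk: int = 1024 * 1024) -> float:
    """Read up to max_bytes from path; return the observed bytes per second."""
    total = 0
    start = time.perf_counter()
    # buffering=0 disables Python's own buffer, not the OS cache.
    with open(path, "rb", buffering=0) as f:
        while total < max_bytes:
            block = f.read(chunk)
            if not block:
                break
            total += len(block)
    elapsed = time.perf_counter() - start
    return total / elapsed if elapsed > 0 else float("inf")


# Nominal ceilings to compare a measurement against.
print(mbit_to_mbyte_per_s(12), mbit_to_mbyte_per_s(480))
```

A measured rate near 1MB/s suggests a full-speed USB1 link, while anything well above 1.5MB/s cannot be USB1.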
Try TeraCopy: copy one large file (around 400-500MB) from the device and to the device, and you'll see the speed.
In Windows you can determine if a connected USB device is USB2 by selecting View -> "Devices by Connection" from the Device Manager and then checking to see if the device is under a USB2 controller (USB2 Enhanced Host Controller).
Note that this doesn't mean your device will actually perform at the higher speeds, though; you would still need actual throughput tests for that. The SiSoftware Sandra benchmarking software lists removable hard drives as supported in its feature list.
EDIT: Due to clarification in original question, I have submitted a new answer.
Consider the number of things that could affect data transfer speed:
The speed of the bus used to connect the device to the system. This is unlikely to be your bounding factor unless it's connected via USB1.
For hard drives, rotational speed and seek time matter. 7200 RPM drives will read and write blocks of data faster than 5400 RPM drives.
Optical and magnetic drives usually spin down when not in use, so the first access will take orders of magnitude more than the second access.
The filesystem used on the particular device.
Caching of data and filesystem metadata. The less metadata is cached, the more a magnetic or optical drive has to seek to figure out where the data is.
Data access pattern. Accessing a small number of large, contiguous files is almost always faster than accessing a large number of small files scattered around the disk.
File system fragmentation.
You might be able to work up some heuristics based on the various characteristics of the devices you expect to see, but in general there's no good way to figure out transfer speed for a particular combination of bus, media, filesystem, and data access pattern without actually measuring it. If you decide to measure, try to simulate your final access pattern as closely as possible.
I'm going to borrow Raymond Chen's crystal ball and say that you really don't want this. You probably want to use asynchronous I/O. If you do not get the result of your I/O within a second, check what fraction of it has completed. The inverse of that fraction is a good estimate, in seconds, of the total time to quote to the user.
If nothing happened after a second, you may be in for a surprise. But even that can happen. For instance, a hard disk may need a second to spin up. Just poll every second until something has happened.
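The polling idea can be sketched as follows. This is an illustration only: it is written in Python, a worker thread stands in for true asynchronous (overlapped) I/O, and the class and function names are hypothetical.

```python
import threading
import time
from typing import Optional


def estimate_total_seconds(done: int, total: int,
                           elapsed: float) -> Optional[float]:
    """Extrapolate the total duration from the progress so far.

    Returns None if nothing has completed yet (keep polling).
    """
    if done <= 0:
        return None
    return elapsed * total / done


class BackgroundRead:
    """Reads a file on a worker thread and exposes a byte counter to poll."""

    def __init__(self, path: str, chunk: int = 64 * 1024):
        self.path = path
        self.chunk = chunk
        self.done = 0          # bytes read so far, updated by the worker
        self.started = 0.0
        self.thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self.started = time.perf_counter()
        self.thread.start()

    def _run(self) -> None:
        with open(self.path, "rb") as f:
            for block in iter(lambda: f.read(self.chunk), b""):
                self.done += len(block)
```

The calling thread would start a `BackgroundRead`, sleep for a second, and pass `reader.done`, the file size, and the elapsed time to `estimate_total_seconds`, repeating until it gets something other than `None`.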