How to enable (force) compression in MonetDB?

I installed MonetDB and imported a (uncompressed) 291 GB TSV MySQL dump. It worked like a charm and the database is really fast, but the database needs more than 542 GB on the disk. It seems like MonetDB is also able to use compression, but I was not able to find out how to enable (or even force) it. How can I do so? I don't know if it really speeds up execution, but I would like to try it.

There is no user-controllable compression scheme available in the official MonetDB release. The predominant compression scheme is dictionary encoding for string-valued columns. In general, a compression scheme reduces the disk/network footprint at the cost of more CPU cycles.
To speed up queries, it might be better to first look at the TRACE of the SQL queries for simple hints on where the time is actually spent. This often reveals 'liberal' use of column types. For example, BIGINT is overkill if the actual value range is known to fit in 32 bits.
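As a rough illustration of the column-type point, the sketch below picks the narrowest integer type that can hold a known value range. The function name and the type table are hypothetical helpers, not part of MonetDB. The range check conservatively excludes the most negative value of each width, on the assumption that the database reserves it to represent NULL.

```python
# Bit widths of the usual SQL integer types, narrowest first.
INTEGER_TYPES = [
    ("TINYINT", 8),
    ("SMALLINT", 16),
    ("INT", 32),
    ("BIGINT", 64),
]


def narrowest_integer_type(min_value, max_value):
    """Return the narrowest integer type name that holds [min_value, max_value]."""
    for name, bits in INTEGER_TYPES:
        high = 2 ** (bits - 1) - 1
        low = -high  # assumption: the most negative value is reserved for NULL
        if low <= min_value and max_value <= high:
            return name
    raise ValueError("range does not fit in a 64-bit integer")
```

Feeding it the result of a `SELECT MIN(col), MAX(col)` query per column gives a quick list of candidates for narrowing.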

Related

rough idea on compression

I have an old EMC NAS (about 12 years old, with no compression and no de-dup) that I'm looking to replace. Now that pretty much everything has built-in compression, de-dup, etc., I want to get an idea of what size I need to look into.
My NAS has been used as file storage: text files, Word docs, Excel files, database files, audio files, images, and so on.
Is there a utility out there that I can have scan my NAS, look at all the files, then give me a report telling me how much space I would actually need if I was to get a system with built-in compression?
I don't expect it to be perfect, but a rough idea would be nice.

Best way to query Sqlite DB in Win32 C

I added the SQLite3 source to my project and compiled it. My file size is huge (~400KB).
I need my file to be as small as possible. What is the best way to do SQLite queries in C++?
When I say best, I mean the smallest possible file size. Are there any other lightweight SQLite libs for C++?
From the SQLite "About" page: "If optional features are omitted, the size of the SQLite library can be reduced below 300KiB."
I guess it will be hard to go lower, and I don't think there are alternative implementations doing less. 400 KB is a lot, but SQLite does a lot too. Even a small database will be more than 50 MB. You may go lower by dynamically linking against something like Microsoft ADO, but with many potential install or security problems (and no SQLite file support). My final words: 400 KB sounds like a lot, but by today's standards it is pretty small. Many homepages are more than 1 MB, and that's even crazier.

Redis is slow to get large strings

I'm kind of a newb with Redis, so I apologize if this is a stupid question.
I'm using Django with Redis as a cache.
I'm pickling a collection of ~200 objects and storing it in Redis.
When I request the collection from Redis, Django Debug Toolbar is informing me that the request to Redis is taking ~3 seconds. I must be doing something horribly wrong.
The server has 3.5 GB of RAM, and it looks like Redis is currently using only ~50 MB, so I'm pretty sure it's not running out of memory.
When I get the key using redis-cli, it takes just as long as when I do it from Django.
Running strlen on the key from redis-cli, I'm informed that the length is ~20 million. (Is this too large?)
What can I do to have Redis return the data faster? If this seems unusual, what might be some common pitfalls? I've seen this page on latency problems, but nothing has really jumped out at me yet.
I'm not sure if it's a really bad idea to store a large amount of data in one key, or if there's just something wrong with my configuration. Any help or suggestions or things to read would be greatly appreciated.
Redis is not designed to store very large objects. You are not supposed to store your entire collection in a single string in Redis, but rather use a Redis list or set as a container for your objects.
Furthermore, the pickle format is not optimized for space, so you would need a more compact format. Protocol Buffers, MessagePack, or even plain JSON are probably better for this. You should also consider applying a light compression algorithm before storing your data (Snappy, LZO, QuickLZ, LZF, etc.).
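A minimal sketch of the "compress before storing" idea follows. It uses `zlib` at a low compression level as a stand-in for Snappy or LZF, only because `zlib` ships with the Python standard library. The `pack` and `unpack` names are hypothetical, and the Redis call itself is left out.

```python
import pickle
import zlib


def pack(obj, level=1):
    """Serialize obj and compress it lightly before it is sent to the cache."""
    raw = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    # Level 1 favors speed over ratio, in the spirit of a "light" codec.
    return zlib.compress(raw, level)


def unpack(blob):
    """Reverse of pack(): decompress, then deserialize."""
    return pickle.loads(zlib.decompress(blob))
```

The value returned by `pack` is what would be stored under the key, and `unpack` is applied to whatever comes back. How much this saves depends entirely on how repetitive the pickled data is.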
Finally, the performance is probably network bound. On my machine, retrieving a 20 MB object from Redis takes 85 ms (not 3 seconds). Now, if I run the same test using a remote server, it takes 1.781 seconds, which is expected on this 100 Mbit/s network. The duration is fully dependent on the network bandwidth.
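The network-bound figure can be sanity-checked with simple arithmetic. The helper below is a hypothetical illustration that ignores protocol overhead and latency, so it gives a lower bound on the transfer time.

```python
def transfer_seconds(size_bytes, link_bits_per_second):
    """Lower bound on the time to move size_bytes over a link of the given speed."""
    return size_bytes * 8 / link_bits_per_second


# 20 MB over a 100 Mbit/s link
estimate = transfer_seconds(20_000_000, 100_000_000)
```

For a 20 MB value on a 100 Mbit/s link this comes to about 1.6 seconds, which is in line with the 1.781 seconds measured above once overhead is added.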
Last point: be sure to use a recent Redis version, since a number of optimizations have been made to deal with large objects.
It's most likely just the size of the string. I'd look at whether your objects are being serialized efficiently.

Hadoop, how to compress mapper output but not the reducer output

I have a map-reduce Java program in which I try to compress only the mapper output but not the reducer output. I thought that this would be possible by setting the following properties in the Configuration instance as listed below. However, when I run my job, the output generated by the reducer is still compressed, since the generated file is part-r-00000.gz. Has anyone successfully compressed just the mapper data but not the reducer output? Is that even possible?
//Compress mapper output
conf.setBoolean("mapred.output.compress", true);
conf.set("mapred.output.compression.type", CompressionType.BLOCK.toString());
conf.setClass("mapred.output.compression.codec", GzipCodec.class, CompressionCodec.class);
mapred.compress.map.output: This controls compression of the data between the mapper and the reducer. If you use the Snappy codec, this will most likely increase read/write speed and reduce network overhead. Don't worry about splitting here. These files are not stored in HDFS. They are temp files that exist only for the map-reduce job.
mapred.map.output.compression.codec: I would use Snappy.
mapred.output.compress: This boolean flag defines whether the whole map/reduce job will output compressed data. I would always set this to true as well: faster read/write speeds and less disk space used.
mapred.output.compression.type: I use BLOCK. This makes the compressed output splittable for all compression formats (gzip, Snappy, and bzip2), as long as you're using a splittable file format such as SequenceFile, RCFile, or Avro.
mapred.output.compression.codec: This is the compression codec for the map/reduce job output. I mostly use one of these three: Snappy (fastest read/write, 2x-3x compression), gzip (normal read, fast write, 5x-8x compression), bzip2 (slow read/write, 8x-12x compression).
Also remember, when compressing mapred output, that because of splitting the compression ratio will differ based on your sort order. The closer together similar data is, the better the compression.
With MR2, we should now set:
// Configuration.set() takes String values, so use setBoolean() for flags
conf.setBoolean("mapreduce.map.output.compress", true);
conf.setBoolean("mapreduce.output.fileoutputformat.compress", false);
For more details, refer: http://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
"output compression" will compress your final output. To compress map-outputs only, use something like this:
conf.set("mapred.compress.map.output", "true");
conf.set("mapred.output.compression.type", "BLOCK");
conf.set("mapred.map.output.compression.codec", "org.apache.hadoop.io.compress.GzipCodec");
You need to set "mapred.compress.map.output" to true.
Optionally you can choose your compression codec by setting "mapred.map.output.compression.codec".
NOTE1: mapred output compression should never be BLOCK. See the following JIRA for detail:
https://issues.apache.org/jira/browse/HADOOP-1194
NOTE2: GZIP and BZ2 are CPU intensive. If you have a slow network and GZIP or BZ2 gives a better compression ratio, it may justify spending the CPU cycles. Otherwise, consider the LZO or Snappy codec.
NOTE3: If you want to use map output compression, consider installing the native codec, which is invoked via JNI and gives you better performance.
If you use MapR's distribution for Hadoop, you can get the benefits of compression without all the folderol with the codecs.
MapR compresses natively at the file system level so that the application doesn't need to know or care. Compression can be turned on or off at the directory level so you can compress inputs, but not outputs or whatever you like. Generally, the compression is so fast (it uses an algorithm similar to snappy by default) that most applications see a performance boost when using native compression. If your files are already compressed, that is detected very quickly and compression is turned off automatically so you don't see a penalty there, either.

Any way to determine speed of a removable drive in windows?

Is there any way to determine a removable drive speed in Windows without actually reading in a file. And if I do have to read in a file, how much needs to be read to get a semi accurate speed (e.g. determine whether a device is USB2 or USB1)?
EDIT: Just to clarify, USB2 and USB1 were an example. These could be Compact Flash, could be SSD, could be a removable drive. And I am trying to determine this as fast as possible as it has a real effect on the responsiveness of the application.
EDIT: I should also clarify that this has to be done programmatically. It will probably be done in C++.
EDIT: Boost answer is kind of what I was looking for (though I haven't written any WMI in C++). But I need to know what properties I have to check to determine relative speed. I don't need exact speed (like I said about the difference in speed between USB1 and USB2), but I need to know if it is going to be SLLOOOOWWW.
WMI - Physical Disks Properties is an article I found which would at least help you figure out what you have connected. I foresee things heading toward tables equating particular manufacturers and models to speeds, which is not as simple a solution as you may have hoped for.
You may have better results querying the operating system for information about the hardware rather than trying to reverse engineer it from data transfer timing information.
For example, identical transfer speeds don't necessarily mean the same technology is being used by two devices, although other factors such as seek times would improve the accuracy, if such information is available to your application.
In order to keep the application responsive while this work is done, try doing the calls asynchronously and provide some sort of progress indicator to the user. As an example, take a look at how WinDirStat handles this progress indication (I love the pac-man animation as each directory is analyzed).
Several megabytes, I'd say. Transfer speeds can start out slow, and then speed up as the transfer progresses. There are also variations because of file sizes (a single 1GB file will transfer much faster than 1GB of smaller files).
The best way to do that would be to copy a file to/from the device and time how long it takes with your code. USB 1.x full speed is 12 Mbit/s, and USB 2.0 high speed is 480 Mbit/s (note that those numbers are for the whole bus, not each port, so multiple devices on the same bus will change the actual numbers).
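The timing approach can be sketched as below. This is only an illustration in Python rather than the C++ the question mentions, and the function name, the 8 MB sample size, and the 1 MB chunk size are all assumptions. Note that the operating system's file cache can make a second read of the same file look far faster than the device really is.

```python
import time


def measure_read_throughput(path, max_bytes=8 * 1024 * 1024, chunk_size=1024 * 1024):
    """Read up to max_bytes from path and return (bytes_read, seconds, bytes_per_second)."""
    total = 0
    start = time.perf_counter()
    # buffering=0 avoids Python's own buffer; the OS cache may still apply.
    with open(path, "rb", buffering=0) as f:
        while total < max_bytes:
            chunk = f.read(min(chunk_size, max_bytes - total))
            if not chunk:
                break  # reached end of file before max_bytes
            total += len(chunk)
    elapsed = time.perf_counter() - start
    rate = total / elapsed if elapsed > 0 else float("inf")
    return total, elapsed, rate
```

Comparing the returned rate against the bus figures above (12 Mbit/s is at most 1.5 MB/s) gives a coarse "slow or not" answer without needing an exact speed.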
Try TeraCopy: copy one large file (roughly 400 to 500 MB) from the device and to the device, and you'll see the speed.
In Windows you can determine if a connected USB device is USB2 by selecting View -> "Devices by Connection" from the Device Manager and then checking to see if the device is under a USB2 controller (USB2 Enhanced Host Controller).
Note that this doesn't mean your device will actually perform at the higher speeds, though. You would still need actual throughput tests for that. The SiSoftware Sandra benchmarking software lists removable hard drives as supported in its feature list.
EDIT: Due to clarification in original question, I have submitted a new answer.
Consider the number of things that could affect data transfer speed:
The speed of the bus used to connect the device to the system. This is unlikely to be your bounding factor unless it's connected via USB1.
For hard drives, rotational speed and seek time matter. 7200 RPM drives will read and write blocks of data faster than 5400 RPM drives.
Optical and magnetic drives usually spin down when not in use, so the first access will take orders of magnitude more than the second access.
The filesystem used on the particular device.
Caching of data and filesystem metadata. The less metadata is cached, the more a magnetic or optical drive has to seek to figure out where the data is.
Data access pattern. Accessing a small number of large, contiguous files is almost always faster than accessing a large number of small files scattered around the disk.
File system fragmentation
You might be able to work up some heuristics based on the various characteristics of the devices you expect to see, but in general there's no good way to figure out transfer speed for a particular combination of bus, media, filesystem, and data access pattern without actually measuring it. If you decide to measure, try to simulate your final access pattern as closely as possible.
I'm going to borrow Raymond Chen's crystal ball and say that you really don't want this. You probably want to use asynchronous I/O. If you do not get the result of your I/O within a second, check how much of it did happen. Take the inverse of that fraction, and you have a good estimate of the total time in seconds to quote to the user.
If nothing happened after a second, you may be in for a surprise. But even that can happen. For instance, a harddisk may need a second to spin up. Just poll every second until something has happened.
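The estimate described here amounts to scaling the elapsed time by the inverse of the fraction completed. The helper below is a hypothetical sketch of that calculation only. The asynchronous I/O itself is not shown, and `None` stands for the "nothing happened yet, poll again" case.

```python
def estimate_total_seconds(bytes_done, elapsed_seconds, total_bytes):
    """Extrapolate the total transfer time from the progress made so far."""
    if bytes_done <= 0:
        return None  # no progress yet (e.g. disk still spinning up): poll again
    fraction_done = bytes_done / total_bytes
    return elapsed_seconds / fraction_done
```

For example, if 5 MB of a 50 MB read completed in the first second, the fraction done is 0.1 and the estimate is 10 seconds in total.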