I'm kind of a newb with Redis, so I apologize if this is a stupid question.
I'm using Django with Redis as a cache.
I'm pickling a collection of ~200 objects and storing it in Redis.
When I request the collection from Redis, Django Debug Toolbar is informing me that the request to Redis is taking ~3 seconds. I must be doing something horribly wrong.
The server has 3.5 GB of RAM, and it looks like Redis is currently using only ~50 MB, so I'm pretty sure it's not running out of memory.
When I get the key using redis-cli, it takes just as long as when I do it from Django.
Running STRLEN on the key from redis-cli, I'm told the length is ~20 million (is this too large?).
What can I do to have Redis return the data faster? If this seems unusual, what might be some common pitfalls? I've seen this page on latency problems, but nothing has really jumped out at me yet.
I'm not sure if it's a really bad idea to store a large amount of data in one key, or if there's just something wrong with my configuration. Any help or suggestions or things to read would be greatly appreciated.
Redis is not designed to store very large objects. You are not supposed to store your entire collection in a single string in Redis, but rather use a Redis list or set as a container for your objects.
Furthermore, the pickle format is not optimized for space... you would need a more compact format. Protocol Buffers, MessagePack, or even plain JSON are probably better for this. You should also consider applying a light compression algorithm before storing your data (Snappy, LZO, QuickLZ, LZF, etc.).
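For illustration, here is a minimal sketch of that approach using redis-py, plain JSON, and zlib (zlib standing in for a lighter compressor such as Snappy or LZF); the hash name "customers" and the "id" field are placeholders, not anything from your code:

```python
import json
import zlib

import redis  # redis-py client

r = redis.Redis(host="localhost", port=6379)

def save_collection(objects):
    """Store each object as its own hash field instead of one ~20 MB string."""
    pipe = r.pipeline()
    for obj in objects:
        # Compact serialization plus light compression keeps each value small.
        payload = zlib.compress(json.dumps(obj).encode("utf-8"))
        pipe.hset("customers", obj["id"], payload)
    pipe.execute()

def load_one(obj_id):
    """Fetch and decode a single object without pulling the whole collection."""
    raw = r.hget("customers", obj_id)
    return json.loads(zlib.decompress(raw)) if raw is not None else None
```

With ~200 objects at roughly 100 KB each (going by your STRLEN), reading only the ones you need (or HGETALL when you really do want them all) avoids shipping 20 MB over the wire on every request.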
Finally, the performance is probably network bound. On my machine, retrieving a 20 MB object from Redis takes 85 ms (not 3 seconds). Now, if I run the same test using a remote server, it takes 1.781 seconds, which is expected on this 100 Mbit/s network. The duration is fully dependent on the network bandwidth.
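To see where your three seconds actually go, you could time the raw GET separately from the unpickling. A rough sketch, assuming redis-py and a placeholder key name "my_collection":

```python
import pickle
import time

import redis

r = redis.Redis(host="localhost", port=6379)

start = time.perf_counter()
blob = r.get("my_collection")   # Redis server time + network transfer
fetched = time.perf_counter()
objects = pickle.loads(blob)    # deserialization only (assumes the key exists)
done = time.perf_counter()

print(f"GET: {fetched - start:.3f}s  unpickle: {done - fetched:.3f}s  "
      f"payload: {len(blob) / 1e6:.1f} MB")
```

If the GET dominates, bandwidth (or sheer payload size) is your problem; if the unpickling dominates, serialization is.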
Last point: be sure to use a recent Redis version - a number of optimizations have been made to deal with large objects.
It's most likely just the size of the string. I'd look at whether your objects are being serialized efficiently.
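As a quick check, you could compare the size of your current pickle against more compact encodings; a small sketch, with `objects` standing in for your own collection:

```python
import json
import pickle
import zlib

# Stand-in data; substitute the collection you are actually caching.
objects = [{"id": i, "name": f"customer {i}", "notes": "x" * 100} for i in range(200)]

pickled = pickle.dumps(objects, protocol=pickle.HIGHEST_PROTOCOL)
as_json = json.dumps(objects).encode("utf-8")

print("pickle:        ", len(pickled), "bytes")
print("json:          ", len(as_json), "bytes")
print("pickle + zlib: ", len(zlib.compress(pickled)), "bytes")
print("json + zlib:   ", len(zlib.compress(as_json)), "bytes")
```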
I installed MonetDB and imported an (uncompressed) 291 GB TSV MySQL dump. It worked like a charm and the database is really fast, but it needs more than 542 GB on disk. It seems MonetDB is also able to use compression, but I was not able to find out how to enable (or even force) it. How can I do so? I don't know if it really speeds up execution, but I would like to try it.
There is no user-controllable compression scheme available in the official MonetDB release. The predominant compression scheme is dictionary encoding for string valued columns. In general, a compression scheme reduces the disk/network footprint by spending more CPU cycles.
To speed up queries, it might be better to first look at the TRACE of the SQL queries for simple hints on where the time is actually spent. This often gives hints about 'liberal' use of column types. For example, a BIGINT is overkill if the actual value range is known to fit in 32 bits.
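If you want to pull a TRACE from a script rather than from mclient, something like the sketch below should work with the pymonetdb client (connection settings, table, and query are placeholders; treat the exact trace output format as version dependent):

```python
import pymonetdb  # official MonetDB Python client

# Placeholder credentials; adjust to your own setup.
conn = pymonetdb.connect(username="monetdb", password="monetdb",
                         hostname="localhost", database="mydb")
cur = conn.cursor()

# Prefixing a statement with TRACE makes MonetDB return per-instruction
# timings instead of the normal result set.
cur.execute("TRACE SELECT count(*) FROM mytable WHERE some_column > 42")
for row in cur.fetchall():
    print(row)

conn.close()
```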
So I have two 200 MB JSON files. The first one takes 1.5 hours to load, and the second (which creates a bunch of many-to-many relationship models with the first) takes 24+ hours (since there were no updates in the console, I had no clue whether it was still going or had frozen, so I stopped it).
Since loaddata wasn't working that well, I wrote my own script that loaded the data while also outputting what had recently been saved to the db, but I noticed that the speed of the script (along with my computer) decayed the longer it ran. So I had to stop the script -> restart my computer -> resume at the section of data where I left off, which was faster than running the script straight through. This was a tedious process, since it took roughly 18 hours, with me restarting the computer every 4 hours, to get all the data fully loaded.
I'm wondering if there is a better solution for loading in large amounts of data?
EDIT: I realized there's an option to load in raw SQL, so I may try that, although I need to brush up on my SQL.
When you're loading large amounts of data, writing your own custom script is generally the fastest. Once you've got it loaded in once, you can use your database's import/export options, which will generally be very fast (e.g., pg_dump).
When you are writing your own script, though, there are two things that will drastically speed it up:
Loading data inside a transaction. By default the database is likely in autocommit mode, which causes an expensive commit after each insert. Instead, make sure that you begin a transaction before you insert anything, then commit it afterwards (importantly, though, don't forget to commit; nothing sucks like spending three hours importing data, only to realize you forgot to commit it).
Bypassing the Django ORM and using raw INSERT statements. There is some computational overhead to the ORM, and bypassing it will make things faster.
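A minimal sketch combining both points, assuming a hypothetical table app_customer with name and email columns (substitute your own schema):

```python
from django.db import connection, transaction

def bulk_insert(rows):
    """rows is an iterable of (name, email) tuples; table and columns are placeholders."""
    with transaction.atomic():               # one commit at the end, not one per row
        with connection.cursor() as cursor:  # raw SQL, bypassing the ORM
            cursor.executemany(
                "INSERT INTO app_customer (name, email) VALUES (%s, %s)",
                rows,
            )
```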
I want to write a C++ program that will put selected files from my LAN into a zip archive. But my problem is that I don't know how to limit the speed of that process. Do you have any idea how to do that?
Sorry for my bad English :P
Edit
Let's imagine a LAN with ~16 PCs, and you want to "back up" 5 GB from each to a server. While this "backup" is running, you want to check something on the web. Impossible, because the network is saturated.
What I want to accomplish is lowering the load on the LAN by specifying a speed in bytes. It doesn't even need to be exact; a precision of about 10-15% is fine.
"You don't want to limit zipping speed, but lower bandwidth usage. – bartimar" You're right.
The system will always try to execute operations as fast as possible. If you want to really slow down a process, you can make it sleep().
It does not really make sense, though, to slow down your application. Are you perhaps waiting on your data I/O instead?
In that case, use some sort of callback to compress data whenever enough is available.
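If throttling the transfer rate really is what you want, the usual trick is to read and send fixed-size chunks and sleep whenever you are ahead of your byte budget. Sketched below in Python for brevity (the rate, chunk size, and read/send callbacks are placeholders); the same loop translates directly to C++ with std::this_thread::sleep_for:

```python
import time

def copy_throttled(read_chunk, send_chunk,
                   max_bytes_per_sec=2_000_000, chunk_size=64 * 1024):
    """Copy data via send_chunk, capping throughput near max_bytes_per_sec.

    read_chunk(n) -> bytes and send_chunk(data) stand in for your own
    file/socket I/O; accuracy is well within the 10-15% asked for.
    """
    start = time.monotonic()
    sent = 0
    while True:
        data = read_chunk(chunk_size)
        if not data:
            break
        send_chunk(data)
        sent += len(data)
        # If we are ahead of the allowed schedule, sleep until it catches up.
        ahead = sent / max_bytes_per_sec - (time.monotonic() - start)
        if ahead > 0:
            time.sleep(ahead)
```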
If you're worried about negatively impacting overall system performance, set the priority of the thread or process to below normal or perhaps even idle priority.
So I have an app which needs to read lots of small pieces of data (e.g. an app that processes lots of customer records).
Normally, for server systems that do this kind of stuff, you can use a database, which handles a) caching the most recently used data, b) indexing it, and c) storing it for efficient retrieval from the file system.
Right now, for my app, I just have a std::map<> that maps each data ID to the data itself, which is pretty small (around 120 bytes each, or something close to that). The data IDs themselves are just mapped onto the file system directly.
However, this kind of system does not handle unloading of data (if the memory starts to run out) or efficient storage of the data (granted, iOS uses flash storage, and so the OS should handle the caching), but yeah.
Are there libraries or something like that that can handle this kind of stuff? Or would I have to build my own caching system from scratch? I'm not too bothered about indexing, only a) caching/unloading and c) storing for efficient retrieval. I'm feeling wary of storing thousands (or potentially hundreds of thousands) of files on the file system.
There is SQLite, which can be used on iOS, or Core Data, which is already available on the device.
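For what it's worth, the shape of such a store is roughly the sketch below; it uses Python's built-in sqlite3 module purely to show the idea (the table name and schema are made up), but the same layout works through the SQLite C API, FMDB, or Core Data on the device:

```python
import sqlite3
from functools import lru_cache

conn = sqlite3.connect("records.db")  # placeholder filename
conn.execute("CREATE TABLE IF NOT EXISTS records (id TEXT PRIMARY KEY, payload BLOB)")

@lru_cache(maxsize=10_000)  # keeps only the most recently used rows in memory
def get(record_id):
    row = conn.execute("SELECT payload FROM records WHERE id = ?",
                       (record_id,)).fetchone()
    return row[0] if row else None

def put(record_id, payload):
    with conn:  # commits on success
        conn.execute("INSERT OR REPLACE INTO records (id, payload) VALUES (?, ?)",
                     (record_id, payload))
    get.cache_clear()  # crude invalidation, fine for a sketch
```

On-device you would get the same effect by pairing NSCache (for the in-memory part) with SQLite or Core Data (for the on-disk part).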
Why don't you trust property list files and NSDictionary? They're exactly for this purpose.
I will be storing terabytes of information, before indexes, and after compression methods.
Should I code up a binary-tree database by hand using sort files etc., or use something like MongoDB, or even something like MySQL?
I am worried about the (space) cost per record with things like MySQL and the other DBs that are around. I also know that some databases allow compression, but they convert to read-only tables. These tables/records need to be accessed and overwritten with new data fairly often. I think that if I were to code something in C++, I'd be able to keep the cost of space per record to a minimum.
What should I do?
There are new non-relational databases becoming popular these days that specialize in managing large-scale data.
Check out Hadoop or Cassandra, both of these are at the Apache Project.