Django fixtures, loading in large amounts of data - django

So I have two 200 MB JSON files. The first one takes 1.5 hours to load, and the second (which creates a bunch of many-to-many relationship models with the first) takes 24+ hours. Since there are no updates via the console, I have no clue whether it was still going or had frozen, so I stopped it.
Since the loaddata wasn't working that well, I wrote my own script that loaded the data while also outputting what's been recently saved into the db, but I noticed the speed of the script (along with my computer) decayed the longer it went. So I had to stop the script -> restart my computer -> resume at the section of data where I left off, and that would be faster than running the script throughout. This was a tedious process since it took roughly 18 hrs with me restarting the computer every 4 hours to get all the data fully loaded.
I'm wondering if there is a better solution for loading in large amounts of data?
EDIT: I realized there's an option to load in raw SQL, so I may try that, although I need to brush up on my SQL.

When you're loading large amounts of data, writing your own custom script is generally the fastest approach. Once you've loaded it once, you can use your database's import/export tools, which will generally be very fast (e.g., pg_dump).
When you are writing your own script, though, two things will drastically speed things up:
Loading data inside a transaction. By default the database is likely in autocommit mode, which causes an expensive commit after each insert. Instead, make sure that you begin a transaction before you insert anything, then commit it afterwards (importantly, though, don't forget to commit; nothing sucks like spending three hours importing data, only to realize you forgot to commit it).
Bypassing the Django ORM and using raw INSERT statements. The ORM adds computational overhead, and bypassing it will make things faster.
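As a rough illustration of both points, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for whatever database you actually run. The `item` table, its columns, and the `bulk_load` helper are made up for the example. In a Django project you would probably wrap the same raw INSERTs in `transaction.atomic()` and run them through `connection.cursor()` from `django.db`.

```python
import sqlite3


def bulk_load(conn, rows, batch_size=1000):
    """Insert (id, name) rows with raw SQL inside one transaction.

    The table and column names are hypothetical; adapt them to your schema.
    """
    cur = conn.cursor()
    total = 0
    # "with conn" commits once when the block succeeds and rolls back on error,
    # so there is a single commit instead of one per inserted row.
    with conn:
        for start in range(0, len(rows), batch_size):
            batch = rows[start:start + batch_size]
            cur.executemany("INSERT INTO item (id, name) VALUES (?, ?)", batch)
            total += len(batch)
            # Progress output, so a long load never looks frozen.
            print(f"inserted {total}/{len(rows)} rows")
    return total


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT)")
rows = [(i, f"name-{i}") for i in range(5000)]
loaded = bulk_load(conn, rows)
```

Batching also keeps memory use flat, which may help with the slowdown described in the question.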

Related

How to keep the SQLite database file in RAM?

Okay, I know this question feels weird, but let me present my problem. I am building a Qt-based application with SQLite as my database, and I noticed a few things. Operations like manipulating rows one by one directly on the SQLite file seem slow, because each one does I/O on a file stored on the hard drive. When I use an SSD instead of an HDD, the speed improves considerably, because the SSD has higher I/O throughput. But if I load a table into a QSqlTableModel, make all the changes, and then save, the speed is good, because the data is fetched from the SQLite file in one query and held in RAM, so there are fewer I/O operations. That got me thinking: is it possible to load the SQLite file into RAM when my application launches, perform all my operations there, and then save the file to the HDD when the user closes the application? One might ask why I don't just use QSqlTableModel itself, but some of my cases involve creating and deleting tables, which Qt doesn't support out of the box; we need to execute queries for that. So if anyone can point me to a way to achieve this in Qt, that would be great!
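The load-at-launch, save-at-close workflow described above can be sketched in Python, whose sqlite3 module exposes SQLite's online backup API as `Connection.backup`. This is only an illustration of the idea, not Qt code: the file name, the tables, and the two helper functions are made up for the example, and how to reach the same backup API through Qt's SQL driver is not shown here.

```python
import os
import sqlite3
import tempfile


def load_into_memory(path):
    """Copy the on-disk database at `path` into a fresh in-memory database."""
    disk = sqlite3.connect(path)
    mem = sqlite3.connect(":memory:")
    disk.backup(mem)  # copies every page from disk into RAM
    disk.close()
    return mem


def save_to_disk(mem, path):
    """Write the in-memory database back to the file at `path`."""
    disk = sqlite3.connect(path)
    mem.backup(disk)
    disk.close()


# Demo: create a small database file (hypothetical schema).
path = os.path.join(tempfile.mkdtemp(), "app.db")
seed = sqlite3.connect(path)
seed.execute("CREATE TABLE note (id INTEGER PRIMARY KEY, body TEXT)")
seed.execute("INSERT INTO note (body) VALUES ('first')")
seed.commit()
seed.close()

# At launch: work entirely in RAM, including creating tables.
mem = load_into_memory(path)
mem.execute("CREATE TABLE extra (x INTEGER)")
mem.execute("INSERT INTO note (body) VALUES ('second')")
mem.commit()

# At close: persist everything in one go.
save_to_disk(mem, path)
mem.close()
```

The trade-off is that anything changed since the last save is lost if the application crashes before `save_to_disk` runs.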

Loading the dataset every time the kernel starts?

Do I need to load the dataset every time I shut down the kernel and start again the next day?
I keep getting a NameError saying that the train dataset is not defined.
Is there a better way than loading the dataset and running the same commands again?
There are a couple of things to try. You could create and load a pickle object. I've done this for large text based datasets.
Have a look at the %%cache magic at https://github.com/rossant/ipycache
This can be used to save your data in a pickle-like object between sessions.
Search Stack Overflow for 'how to cache Jupyter notebook'.
I'd have added this just as a comment as I am not supplying code examples, but I don't have the necessary reputation.
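For reference, the pickle approach mentioned above could look roughly like the sketch below. The cache path and the `load_dataset` function are placeholders for your own slow loading code. The variable still has to be assigned again in each new kernel session (that is why the NameError appears), but with a cache that step becomes a fast file read.

```python
import os
import pickle
import tempfile

# Hypothetical cache location; any path that survives between sessions works.
CACHE_PATH = os.path.join(tempfile.gettempdir(), "train_dataset.pkl")


def load_dataset():
    """Stand-in for the slow step (parsing files, preprocessing, ...)."""
    return [{"id": i, "label": i % 2} for i in range(1000)]


def get_dataset(cache_path=CACHE_PATH):
    """Return the dataset, reading the pickle cache if it already exists."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    data = load_dataset()
    with open(cache_path, "wb") as f:
        pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)
    return data


# Run this cell once per kernel session; only the first run is slow.
train = get_dataset()
```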

Should I store a list in memory or in a database and should I build a class to connect to DB?

I am writing a C++ program. I have a class that provides services for the rest of the classes in the program.
I am now writing the classes and the UML.
1) The class I refer to has a task list that changes over time, and conditions are checked against this list. I am thinking of keeping it in a database table where each row represents a task; that way, if the program crashes or stops working, I can restore the last state. The other option is to keep the task list in memory and keep a copy in the database.
The task list should be searched every second.
Which approach is recommended?
2) To read from and write to the database, I can call the database directly from the class or build a database communication class. If I write a database communication class, I need to provide specific operations and build a mini server for this,
e.g. write a row to the database, read a row from the database, update only the first column, etc.
What is the recommended approach for this?
Thanks.
First, if the database is obvious and easy, and there are no performance problems, just do that. You're talking about running a query once/second, and maybe marking a task done or adding a new one every so often; even sqlite on a slow SMB share should be able to handle that just fine.
If you do need to optimize it, then there are two approaches: either stick with the database and cache it in memory, or use memory as your primary storage and come up with a persistence mechanism that uses the database. But until you need to optimize it, don't.
Next, how should you do it? Your question makes it sound like you're thinking in terms of a whole three-tier system, with a "mini-server" sitting between the database server and your task list. There's really no need for that. What you want is a bespoke ORM, but that makes it sound more complicated than it is. All you're doing is writing a class that wraps a database connection and provides a handful of methods—get_due, mark_done, add, get_next_id—each of which maps SQL parameters to Task members. For example (with no error handling):
// Sketch only: db is the wrapped connection; execute() binds task.id to %s.
void mark_done(const Task& task) {
    db.execute("UPDATE Task SET done=true WHERE id=%s", task.id);
}
Three more methods like that, plus a constructor to connect to the database (including creating the Task table if it didn't already exist), and your class is done.
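To make the shape of that wrapper class concrete, here is a minimal sketch. It is written in Python with the built-in sqlite3 module purely for brevity; the same structure carries over to C++ with whatever database library you use. The `TaskStore` name and the `name` and `due` columns are assumptions made for the example.

```python
import sqlite3


class TaskStore:
    """Wraps one database connection and maps SQL rows to task data."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        # Create the Task table on first use.
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS Task ("
            "id INTEGER PRIMARY KEY, name TEXT NOT NULL, "
            "due REAL NOT NULL, done INTEGER NOT NULL DEFAULT 0)"
        )
        self.db.commit()

    def get_next_id(self):
        row = self.db.execute("SELECT COALESCE(MAX(id), 0) + 1 FROM Task").fetchone()
        return row[0]

    def add(self, name, due):
        with self.db:  # commit on success, roll back on error
            cur = self.db.execute(
                "INSERT INTO Task (name, due) VALUES (?, ?)", (name, due)
            )
        return cur.lastrowid

    def get_due(self, now):
        return self.db.execute(
            "SELECT id, name, due FROM Task WHERE done = 0 AND due <= ? ORDER BY due",
            (now,),
        ).fetchall()

    def mark_done(self, task_id):
        with self.db:
            self.db.execute("UPDATE Task SET done = 1 WHERE id = ?", (task_id,))


store = TaskStore()
a = store.add("backup", due=10.0)
b = store.add("report", due=50.0)
due_now = store.get_due(now=20.0)  # only the "backup" task is due
store.mark_done(a)
```

Everything that knows about SQL lives in this one class, so the rest of the program only deals with task data.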
The reason you don't want to write the database stuff directly into Task is that you don't really have anywhere to store shared information like the database connection object; either you need globals (or class attributes, which are effectively globals), or you need copies in every single Task instance (or, really, weak references—which you're going to fake with either a reference or a raw pointer, either way leading to shutdown problems somewhere down the line).
Finally, your whole reason for doing this is error recovery, and databases do a great job of journaling so nothing ever gets inconsistent, but you do have to make sure to structure your app to take advantage of that. For example, you may want to mark all the now-due tasks "in process", then process them, then mark them all "done"; that way, at recovery time, you know exactly which tasks may or may not have been done, and can act appropriately. The more steps you can commit to the database, the less data loss you have to deal with—but of course the more code you have to write, and the slower it gets. So, do as much as necessary, but no more.
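The "in process" idea can be sketched as follows, again in Python with sqlite3 for brevity. The `state` column and the three helper functions are assumptions made for the example, not a prescribed design.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE Task (id INTEGER PRIMARY KEY, due REAL, "
    "state TEXT NOT NULL DEFAULT 'pending')"
)
db.executemany("INSERT INTO Task (id, due) VALUES (?, ?)", [(1, 5.0), (2, 6.0), (3, 99.0)])
db.commit()


def claim_due(db, now):
    """Mark every due, pending task 'in_process' in one committed transaction."""
    with db:
        ids = [
            row[0]
            for row in db.execute(
                "SELECT id FROM Task WHERE state = 'pending' AND due <= ? ORDER BY id",
                (now,),
            )
        ]
        db.executemany(
            "UPDATE Task SET state = 'in_process' WHERE id = ?", [(i,) for i in ids]
        )
    return ids


def finish(db, ids):
    """Mark the given tasks 'done' once their work has completed."""
    with db:
        db.executemany("UPDATE Task SET state = 'done' WHERE id = ?", [(i,) for i in ids])


def interrupted(db):
    """At recovery time: tasks that were claimed but never marked done."""
    return [r[0] for r in db.execute("SELECT id FROM Task WHERE state = 'in_process' ORDER BY id")]


claimed = claim_due(db, now=10.0)
after_claim = interrupted(db)   # what recovery would see after a crash here
# ... the actual work on the claimed tasks would happen here ...
finish(db, claimed)
after_finish = interrupted(db)  # nothing left to recover
```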
Saving information in a database just to recover from a crash may be a bit of overkill.
Ideally, you would serialize the list and save it as binary, XML, or CSV. This can be done on a timer or on certain events in your application.
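A minimal sketch of that serialization approach, in Python for brevity and using JSON rather than the formats named above, might look like this. The file name and task fields are made up. Writing to a temporary file and renaming it over the old snapshot is one way to avoid ending up with a half-written file if the program dies mid-save.

```python
import json
import os
import tempfile


def save_tasks(tasks, path):
    """Write the task list to `path`, replacing the old snapshot in one step."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(tasks, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes reach the disk
    os.replace(tmp_path, path)  # swap the new snapshot into place


def load_tasks(path):
    """Return the last saved task list, or an empty list if none exists."""
    if not os.path.exists(path):
        return []
    with open(path) as f:
        return json.load(f)


snapshot = os.path.join(tempfile.mkdtemp(), "tasks.json")
tasks = [{"id": 1, "name": "backup", "done": False}]
save_tasks(tasks, snapshot)   # call this from a timer or after changes
restored = load_tasks(snapshot)
```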
A database may also be used if you can come up with a structure that maps cleanly onto tables, so that you can do a one-to-one mapping between objects and rows and write the SQL queries easily. But keep that in a separate layer for abstraction.

Why can't sqlite3 work with NFS?

I switched to using sqlite3 instead of MySQL because I had to run many jobs on a PBS system which doesn't have MySQL. Of course, my machine doesn't use NFS, while the PBS system does. After spending lots of time switching to sqlite3, I went to run many jobs and corrupted my database.
Of course, the sqlite3 FAQ does mention NFS further down, but I didn't even think about this when I started.
I can copy the database at the beginning of the job but it will turn into a merging nightmare!
I would never recommend sqlite to any of my colleagues for this simple reason: "sqlite doesn't work (on the machines that matter)"
I have read rants about NFS not being up to par and it being their fault.
I have tried a few workarounds, but as this post suggests, it is not possible.
Isn't there a workaround which sacrifices performance?
So what do I do? Try some other db software? Which one?
You are using the wrong tool. Saying "I would never recommend sqlite ..." based on this experience is a bit like saying "I would never recommend glass bottles" after they keep breaking when you use them to hammer in a nail.
You need to specify your problem more precisely. My attempt to read between the lines of your question gives me something like this:
You have many nodes that get work through some unspecified path, and produce output. The jobs do not interact because you say you can copy the database. The output from all the jobs can be merged after they are finished. How do you effectively produce the merged output?
Given that as the question, this is my advice:
Have each job produce its output in a structured file, unique to each job. After the jobs are finished, write a program to parse each file and insert it into an sqlite3 database. This uses NFS in a way it can handle (single process writing sequentially to a file) and uses sqlite3 in a way that is also sensible (single process writing to a database on a local filesystem). This avoids NFS locking issues while running the job, and should improve throughput because you don't have contention on the sqlite3 database.
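A rough sketch of that workflow in Python is below. The JSON-lines file format, the `result` table, and both function names are assumptions made for the example; any structured per-job format would do.

```python
import json
import os
import sqlite3
import tempfile


def write_job_output(out_dir, job_id, records):
    """Run by each job: write its results to its own file, one JSON object per line."""
    path = os.path.join(out_dir, f"job-{job_id}.jsonl")
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
    return path


def merge_outputs(out_dir, db_path):
    """Run once afterwards, by a single process, with db_path on a local disk."""
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE IF NOT EXISTS result (job_id INTEGER, key TEXT, value REAL)")
    with db:  # one transaction for the whole merge
        for name in sorted(os.listdir(out_dir)):
            if not name.endswith(".jsonl"):
                continue
            with open(os.path.join(out_dir, name)) as f:
                rows = [(r["job_id"], r["key"], r["value"]) for r in map(json.loads, f)]
            db.executemany("INSERT INTO result VALUES (?, ?, ?)", rows)
    count = db.execute("SELECT COUNT(*) FROM result").fetchone()[0]
    db.close()
    return count


# Demo with three pretend jobs writing into a shared output directory.
out_dir = tempfile.mkdtemp()
for job_id in range(3):
    write_job_output(out_dir, job_id, [{"job_id": job_id, "key": "score", "value": job_id * 1.5}])
merged = merge_outputs(out_dir, os.path.join(tempfile.mkdtemp(), "merged.db"))
```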

Redis is slow to get large strings

I'm kind of a newb with Redis, so I apologize if this is a stupid question.
I'm using Django with Redis as a cache.
I'm pickling a collection of ~200 objects and storing it in Redis.
When I request the collection from Redis, Django Debug Toolbar is informing me that the request to Redis is taking ~3 seconds. I must be doing something horribly wrong.
The server has 3.5GB of ram, and it looks like Redis is currently using only ~50mb, so I'm pretty sure it's not running out of memory.
When I get the key using the redis-cli it takes just as long as when I do it from Django
Running strlen on the key from redis-cli I'm informed that the length is ~20 million (Is this too large?)
What can I do to have Redis return the data faster? If this seems unusual, what might be some common pitfalls? I've seen this page on latency problems, but nothing has really jumped out at me yet.
I'm not sure if it's a really bad idea to store a large amount of data in one key, or if there's just something wrong with my configuration. Any help or suggestions or things to read would be greatly appreciated.
Redis is not designed to store very large objects. You are not supposed to store your entire collection in a single string in Redis, but rather use a Redis list or set as a container for your objects.
Furthermore, the pickle format is not optimized for space ... you would need a more compact format. Protocol Buffers, MessagePack, or even plain JSON are probably better for this. You should also consider applying a light compression algorithm before storing your data (like Snappy, LZO, QuickLZ, LZF, etc.).
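As a small illustration of compressing before storing, here is a sketch that uses zlib from the Python standard library as a stand-in for the lighter compressors named above (those need third-party packages). The sample collection and the `pack`/`unpack` helpers are made up; the bytes returned by `pack` are what you would hand to the cache instead of the raw pickle.

```python
import pickle
import zlib


def pack(obj):
    """Serialize and lightly compress an object before putting it in the cache."""
    raw = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    return zlib.compress(raw, 1)  # level 1: favors speed over ratio


def unpack(blob):
    """Reverse of pack(): decompress, then deserialize."""
    return pickle.loads(zlib.decompress(blob))


# Hypothetical collection of ~200 objects, similar in spirit to the question.
collection = [{"id": i, "title": f"Item {i}", "tags": ["a", "b", "c"]} for i in range(200)]

raw = pickle.dumps(collection, protocol=pickle.HIGHEST_PROTOCOL)
packed = pack(collection)
print(len(raw), len(packed))  # compare the two sizes for your own data
```

How much this saves depends entirely on the data, so it is worth measuring with your real objects.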
Finally, the performance is probably network bound. On my machine, retrieving a 20 MB object from Redis takes 85 ms (not 3 seconds). Now, if I run the same test using a remote server, it takes 1.781 seconds, which is expected on this 100 Mbit/s network. The duration is fully dependent on the network bandwidth.
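The remote figure is easy to sanity-check with a back-of-the-envelope calculation. The sketch below assumes the 20 MB object is 20 million bytes (matching the strlen result in the question) and ignores protocol overhead, so it gives a lower bound rather than a prediction.

```python
def min_transfer_seconds(size_bytes, link_bits_per_second):
    """Lower bound on transfer time: payload bits divided by link bandwidth."""
    return size_bytes * 8 / link_bits_per_second


# 20 million bytes over a 100 Mbit/s link.
estimate = min_transfer_seconds(20_000_000, 100_000_000)
print(f"{estimate:.2f} s")  # prints "1.60 s"
```

A lower bound of 1.6 seconds is consistent with the measured 1.781 seconds, which supports the conclusion that the transfer is limited by bandwidth.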
Last point: be sure to use a recent Redis version, since a number of optimizations have been made to deal with large objects.
It's most likely just the size of the string. I'd look at whether your objects are being serialized efficiently.