I am pulling a large amount of data from an Oracle database using cx_Oracle with the sample script below:
from cx_Oracle import connect
TABLEDATA = []
con = connect("user/password#host")
curs = con.cursor()
curs.execute("select * from TABLE where rownum < 100000")
for row in curs:
    TABLEDATA.append([str(col) for col in row])
curs.close()
con.close()
The problem with storing this in a list is that it ends up using about 800-900 MB of RAM.
I know I could save this to a file instead of keeping it in a list, but I am using the list to display a table with QTableView and QAbstractTableModel.
Is there an alternative or more efficient way to minimise the memory used to store this data while still using it to display my table?
I have tried multiple possibilities, and I don't think QSqlTableModel works for me. Though it loads data directly from the database, as you keep scrolling down it loads more and more data into the table, and hence memory usage keeps increasing.
What I think would ideally work is being able to load a set number of rows into the model. As you scroll down it loads new rows, but at the same time unloads what is already there, so that at any point in time only a set number of rows is loaded in the model.
If you don't want to store all the data in RAM, then you need to use a model for your table view that gets information from the database as needed. Fortunately, Qt natively supports this and can connect to Oracle databases.
You will want to look into:
http://qt-project.org/doc/qt-4.8/sql-driver.html
http://qt-project.org/doc/qt-4.8/sql-model.html
http://qt-project.org/doc/qt-4.8/qsqltablemodel.html
http://qt-project.org/doc/qt-4.8/qsqldatabase.html
Note this is C++ documentation, but it is fairly easy to translate to PyQt (I always use the C++ documentation despite never coding in C++). You may also want to subclass QSqlTableModel to provide slightly different behaviour to the standard interface!
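For reference, here is a minimal PyQt sketch of that approach. It has not been tested against a real Oracle instance: it assumes the QOCI driver is available in your Qt build, and the connection details and table name are placeholders taken from the question.

import sys
from PyQt4.QtGui import QApplication, QTableView
from PyQt4.QtSql import QSqlDatabase, QSqlTableModel

app = QApplication(sys.argv)

# Open the Oracle connection through Qt's SQL layer instead of cx_Oracle.
db = QSqlDatabase.addDatabase("QOCI")
db.setHostName("host")
db.setDatabaseName("service_name")   # placeholder: your SID or service name
db.setUserName("user")
db.setPassword("password")
if not db.open():
    raise RuntimeError(db.lastError().text())

# QSqlTableModel fetches rows incrementally for drivers that do not report the
# query size (typically 256 rows at a time), so the whole result set is never
# copied into a Python list.
model = QSqlTableModel()
model.setTable("TABLE")
model.select()

view = QTableView()
view.setModel(model)
view.show()
sys.exit(app.exec_())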
Related
I'm trying to build a web application where users can upload a file (specifically the MDF file format) and view the data in the form of various charts. The files can contain any number of time-based signals (various numeric data types), and users may name the signals arbitrarily.
My thought on saving the data involves 2 steps:
Maintain a master table as an index, to save such meta information as file names, who uploaded it, when, etc. Records (rows) are added each time a new file is uploaded.
Create a new table (I'll refer to these as data tables) for each uploaded file; within each table, every column will be one signal (the first column being timestamps).
This brings the problem that I can't pre-define the Model for the data tables because the number, name, and datatype of the fields will differ among virtually all uploaded files.
I'm aware of some libs that help to build runtime dynamic models but they're all dated and questions about them on SO basically get zero answers. So despite the effort to make it work, I'm not even sure my approach is the optimal way to do what I want to do.
I also came across this Postgres-specific model field which can take nested arrays (which I believe fits the 2-D time-based signal lists). In theory I could parse the raw uploaded file, construct such an array, and basically save all the data in one field. Not knowing the limits on data size, this could also be a nightmare for queries later on, since creating the charts usually takes only a few columns of signals at a time, out of a total of up to hundreds of signals.
So my question is:
Is there a better way to organize the storage of data? And how?
Any insight is greatly appreciated!
If the name, number and datatypes of the fields will differ for each user, then you do not need an ORM. What you need is a query builder or SQL string composition like Psycopg's. You will be programmatically creating a table for each combination of user and uploaded file (if they are different) and programmatically inserting the records.
Using PostgreSQL might be a good choice; you might also create a GIN index on the arrays to speed up queries.
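For example, programmatically creating one data table per uploaded file with Psycopg's SQL composition could look roughly like the sketch below; the table name, signal names and column types are illustrative, not part of any existing schema.

import psycopg2
from psycopg2 import sql

def create_data_table(conn, table_name, signal_names):
    # One timestamp column plus one numeric column per signal, with
    # identifiers safely quoted by psycopg2's sql module.
    columns = [sql.SQL("ts double precision")]
    columns += [sql.SQL("{} double precision").format(sql.Identifier(name))
                for name in signal_names]
    query = sql.SQL("CREATE TABLE {} ({})").format(
        sql.Identifier(table_name), sql.SQL(", ").join(columns))
    with conn.cursor() as cur:
        cur.execute(query)
    conn.commit()

If you store the signals in an array column instead, the GIN index is created with a statement along the lines of CREATE INDEX idx_signals ON measurements USING GIN (signal_array), where the table and column names are again placeholders.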
However, if you are primarily working with time-series data, then using a time-series database like InfluxDB or Prometheus makes more sense.
First of all, sorry for my English.
I have a C++ desktop application which gets rows from a database and, for each row, creates an object representing that row of that specific table. Each table has its corresponding class (I use ODB for that).
Once I've recovered the rows of a table, I show them in a table view, which can be sorted by columns. Each column has a "sort" icon which allows sorting the table entries by that column.
My question is: what do quality apps usually do? Make another query each time the table must be sorted, or sort the objects manually, using, for example, a std::set? Which is faster?
I think sorting the entries using a std::set is faster because we avoid communication with the MySQL server, but at the same time perhaps the MySQL optimizer does some magic if we reorder the same table multiple times, especially with indexes involved. I think it could even depend on the frequency of these sort operations.
Anyway, I want to know pros and cons of both approaches.
Many applications let the database perform most of the work.
When tables are created, the application tells the database which columns to set up for searching (indexing). The database will usually maintain an index structure for those columns, which makes searching faster because the data in the table itself does not need to be kept sorted.
The application then sends the database a query statement that selects the data in the needed order, and iterates over the results.
When displaying data in a GUI grid, many frameworks perform the sorting for you. You tell the GUI which column to use for sorting and have the GUI resort and then display the data. Real applications use existing libraries and frameworks as much as possible.
If there is enough memory for your table, read the data in and sort the table. Otherwise, tell the database to generate a new view and reload the table in the GUI (as necessary).
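To make the trade-off concrete, here is a small Python sketch of both approaches, using sqlite3 as a stand-in for MySQL; the items table and its columns are made-up names.

import sqlite3

conn = sqlite3.connect("app.db")

# Approach 1: let the database sort; an index on "name" keeps this cheap.
rows_sorted_by_db = conn.execute(
    "SELECT id, name, price FROM items ORDER BY name").fetchall()

# Approach 2: fetch once, keep the rows in memory, and re-sort locally
# whenever the user clicks a different column header.
rows = conn.execute("SELECT id, name, price FROM items").fetchall()
rows_sorted_by_price = sorted(rows, key=lambda r: r[2])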
Imagine an application that displays data from an SQLite database.
The app is making use of model/view programming.
It can have multiple views acting in parallel on different subsets of the same data (subsets made by filtering the required data types).
(Sidenote: I am using Qt, so there is no controller part, of course, but I did not find a more suitable tag.)
I am not sure which approach to take:
1a. Load all database data into one single model
1b. Then apply the model to all views, filtering the data inside the view with a proxy model
2a. One model for each view, but filtering done inside sqlite database.
Pros/Cons:
Idea 1:
(+) one model, makes use of model/view advantages (e.g. updating all connected views)
(-) memory usage could get huge because all data is loaded into a model, but only a subset is shown
Idea 2:
(+) theoretically lower memory usage, because only the filtered data is loaded from the database
(-) the views can have filters that lead to intersecting data, meaning the same data would be stored in more than one model; in practice, memory usage could even be bigger than in Idea 1
The data being loaded here is just case metadata, e.g. title, description, datetime and so on. Bigger data such as images and files is not loaded here. So although the database could indeed grow big (big for this kind of application, say 200 GB for power users), this does not affect the present question, because the metadata is much, much smaller and is proportional to the overall record count, not the data size.
Do you have practical experience with such a configuration and can suggest which one to use? It seems to me that Idea 1 is the way to go, but I am not sure about it.
In my experience, the less data is loaded from the database into memory, the better. It is not just the memory usage, but also startup time. If the data is delivered over the network, loading a few gigabytes can take forever.
So I would go for a variant of your second solution, where each table view has its own model. The model is an implementation of QAbstractItemModel that lazily fetches only the rows that currently need to be displayed. The models could, however, share a common cache. This will also make sure that they all display the same data where it intersects.
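As an illustration only, a lazily fetching model along those lines might look like the sketch below (PyQt, with a hypothetical SQLite table named cases; the shared cache is left out for brevity).

import sqlite3
from PyQt4.QtCore import QAbstractTableModel, QModelIndex, Qt

class LazyCaseModel(QAbstractTableModel):
    BATCH = 256  # rows fetched per scroll step
    COLUMNS = ("title", "description", "created")

    def __init__(self, db_path, where_clause, parent=None):
        super(LazyCaseModel, self).__init__(parent)
        self._conn = sqlite3.connect(db_path)
        self._where = where_clause          # fixed, application-defined filter
        self._rows = []
        self._total = self._conn.execute(
            "SELECT COUNT(*) FROM cases WHERE " + where_clause).fetchone()[0]

    def rowCount(self, parent=QModelIndex()):
        return len(self._rows)

    def columnCount(self, parent=QModelIndex()):
        return len(self.COLUMNS)

    def canFetchMore(self, parent=QModelIndex()):
        # The view calls this while scrolling; rows are loaded only on demand.
        return len(self._rows) < self._total

    def fetchMore(self, parent=QModelIndex()):
        start = len(self._rows)
        batch = self._conn.execute(
            "SELECT title, description, created FROM cases WHERE " + self._where +
            " LIMIT ? OFFSET ?", (self.BATCH, start)).fetchall()
        if not batch:
            return
        self.beginInsertRows(QModelIndex(), start, start + len(batch) - 1)
        self._rows.extend(batch)
        self.endInsertRows()

    def data(self, index, role=Qt.DisplayRole):
        if role == Qt.DisplayRole:
            return self._rows[index.row()][index.column()]
        return None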
I was wondering if you can help me out with my current problem, which is to insert data into multiple tables in my relational database using a single form. I am fairly new to APEX but do have a little bit of background in MySQL and PHP programming. In the past, I normally achieved such a task by creating a view of all the columns from the different tables that I want to populate and using simple insert commands, but doing the same thing in APEX gives me an error stating "ORA-01779: cannot modify a column which maps to a non key-preserved table".
In Oracle you cannot just update a view which has, e.g., a JOIN clause. Oracle will not map all columns back to the source tables: one table might be key-preserved while the others won't. This isn't an APEX problem: if you were to run an update against your view in the database you would get this error just as well.
If you want your APEX screen to remain as transparent as possible, you may want to consider using an INSTEAD OF trigger on the view. You will have to write the correct DML statements in this trigger, though, in order to ensure your data is pushed through correctly to all tables.
Another option is to use the view only to fetch, and use separate processes to push the data to the correct tables. Using data-layer packages might reduce the amount of code stored in APEX (having a lot of PL/SQL code in APEX itself is usually not favored; it is better kept in packages).
Create items, get all the item values, and use a PL/SQL process on the submit button.
E.g.: p1_party_Name, p2_Service_Name
BEGIN
    INSERT INTO par VALUES (par_party_uid_seq.nextval, :p1_Party_name);
    INSERT INTO ser VALUES (ser_service_uid_seq.nextval, :p2_Service_name);
END;
I'm planning to upload a billion records taken from ~750 files (each ~250MB) to a db using django's ORM.
Currently each file takes ~20min to process, and I was wondering if there's any way to accelerate this process.
I've taken the following measures:
Use @transaction.commit_manually and commit once every 5000 records
Set DEBUG=False so that django won't accumulate all the sql commands in memory
The loop that runs over records in a single file is completely contained in a single function (minimize stack changes)
Refrained from hitting the db for queries (used a local hash of objects already in the db instead of using get_or_create)
Set force_insert=True in the save() in hopes it will save django some logic
Explicitly set the id in hopes it will save django some logic
General code minimization and optimization
What else can I do to speed things up? Here are some of my thoughts:
Use some kind of Python compiler or version which is quicker (Psyco?)
Override the ORM and use SQL directly
Use some 3rd party code that might be better (1, 2)
Beg the django community to create a bulk_insert function
Any pointers regarding these items or any other idea would be welcome :)
Django 1.4 provides a bulk_create() method on the QuerySet object, see:
https://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.bulk_create
https://docs.djangoproject.com/en/dev/releases/1.4/
https://code.djangoproject.com/ticket/7596
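Combined with manual batching, a loader for this kind of job could look roughly like this; Record and its fields are placeholder names, not something from the question.

from itertools import islice
from myapp.models import Record  # hypothetical model

def load_file(parsed_rows, batch_size=5000):
    # parsed_rows is an iterable of dicts produced from one input file.
    rows = iter(parsed_rows)
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            break
        # One multi-row INSERT per batch instead of one INSERT per record.
        Record.objects.bulk_create([Record(**row) for row in batch])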
This is not specific to Django ORM, but recently I had to bulk insert >60 Million rows of 8 columns of data from over 2000 files into a sqlite3 database. And I learned that the following three things reduced the insert time from over 48 hours to ~1 hour:
Increase the cache size setting of your DB so it uses more RAM (the default is usually very small; I used 3 GB); in sqlite, this is done by PRAGMA cache_size = n_of_pages.
Do journalling in RAM instead of on disk (this does cause a slight problem if the system fails, but something I consider negligible given that you already have the source data on disk); in sqlite this is done by PRAGMA journal_mode = MEMORY.
Last and perhaps most important: do not build the index while inserting. This also means not declaring UNIQUE or other constraints that might cause the DB to build an index. Build the index only after you are done inserting.
As someone mentioned previously, you should also use cursor.executemany() (or just the shortcut conn.executemany()). To use it, do:
cursor.executemany('INSERT INTO mytable (field1, field2, field3) VALUES (?, ?, ?)', iterable_data)
The iterable_data could be a list or something similar, or even an open file reader.
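Putting the three settings and executemany together for sqlite3 gives roughly the following; the table, columns and pragma values are illustrative only.

import sqlite3

conn = sqlite3.connect("bulk.db")
conn.execute("PRAGMA cache_size = 1000000")    # in pages; raise it to use more RAM
conn.execute("PRAGMA journal_mode = MEMORY")   # keep the journal in RAM, not on disk

# No indexes or UNIQUE constraints yet - plain columns only.
conn.execute("CREATE TABLE IF NOT EXISTS mytable (field1, field2, field3)")

conn.executemany(
    "INSERT INTO mytable (field1, field2, field3) VALUES (?, ?, ?)",
    iterable_data)   # e.g. a generator reading the source files
conn.commit()

# Build the index only once the bulk load is finished.
conn.execute("CREATE INDEX idx_field1 ON mytable (field1)")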
Drop to DB-API and use cursor.executemany(). See PEP 249 for details.
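A minimal sketch of what that looks like from inside a Django project; the table and column names are placeholders, and %s is the parameter style for the MySQL/PostgreSQL backends (sqlite uses ?).

from django.db import connection

def raw_bulk_insert(rows):
    # rows: an iterable of (field1, field2) tuples already parsed from a file.
    cursor = connection.cursor()
    cursor.executemany(
        "INSERT INTO myapp_record (field1, field2) VALUES (%s, %s)",
        rows)
    # If you manage transactions manually, remember to commit afterwards.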
I ran some tests on Django 1.10 / Postgresql 9.4 / Pandas 0.19.0 and got the following timings:
Insert 3000 rows individually and get ids from populated objects using Django ORM: 3200ms
Insert 3000 rows with Pandas DataFrame.to_sql() and don't get IDs: 774ms
Insert 3000 rows with Django manager .bulk_create(Model(**df.to_records())) and don't get IDs: 574ms
Insert 3000 rows with to_csv to StringIO buffer and COPY (cur.copy_from()) and don't get IDs: 118ms
Insert 3000 rows with to_csv and COPY and get IDs via simple SELECT WHERE ID > [max ID before insert] (probably not threadsafe unless COPY holds a lock on the table preventing simultaneous inserts?): 201ms
from io import StringIO  # Python 3; on Python 2 use cStringIO.StringIO


def bulk_to_sql(df, columns, model_cls):
    """ Inserting 3000 takes 774ms avg """
    engine = ExcelImportProcessor._get_sqlalchemy_engine()
    df[columns].to_sql(model_cls._meta.db_table, con=engine,
                       if_exists='append', index=False)


def bulk_via_csv(df, columns, model_cls):
    """ Inserting 3000 takes 118ms avg """
    engine = ExcelImportProcessor._get_sqlalchemy_engine()
    connection = engine.raw_connection()
    output = StringIO()
    # Serialise the frame to a tab-separated buffer, then stream it in with COPY.
    df[columns].to_csv(output, sep='\t', header=False, index=False)
    output.seek(0)
    cur = connection.cursor()
    cur.copy_from(output, model_cls._meta.db_table, null="", columns=columns)
    connection.commit()
    cur.close()
The performance stats were all obtained on a table already containing 3,000 rows running on OS X (i7 SSD 16GB), average of ten runs using timeit.
I get my inserted primary keys back by assigning an import batch id and sorting by primary key, although I'm not 100% certain primary keys will always be assigned in the order the rows are serialized for the COPY command - would appreciate opinions either way.
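For what it's worth, the max-id variant from the timing list above could be sketched like this; the same caveat about concurrent inserts applies, and the id column name is an assumption.

def bulk_via_csv_with_ids(df, columns, model_cls):
    table = model_cls._meta.db_table
    engine = ExcelImportProcessor._get_sqlalchemy_engine()
    connection = engine.raw_connection()
    cur = connection.cursor()
    cur.execute("SELECT COALESCE(MAX(id), 0) FROM %s" % table)
    max_id_before = cur.fetchone()[0]
    output = StringIO()
    df[columns].to_csv(output, sep='\t', header=False, index=False)
    output.seek(0)
    cur.copy_from(output, table, null="", columns=columns)
    connection.commit()
    # Not safe if another process can insert into the table at the same time.
    cur.execute("SELECT id FROM %s WHERE id > %%s ORDER BY id" % table,
                (max_id_before,))
    new_ids = [row[0] for row in cur.fetchall()]
    cur.close()
    return new_ids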
Update 2020:
I tested the new to_sql(method="multi") functionality in Pandas >= 0.24, which puts all inserts into a single, multi-row insert statement. Surprisingly performance was worse than the single-row version, whether for Pandas versions 0.23, 0.24 or 1.1. Pandas single row inserts were also faster than a multi-row insert statement issued directly to the database. I am using more complex data in a bigger database this time, but to_csv and cursor.copy_from was still around 38% faster than the fastest alternative, which was a single-row df.to_sql, and bulk_import was occasionally comparable, but often slower still (up to double the time, Django 2.2).
There is also a bulk insert snippet at http://djangosnippets.org/snippets/446/.
This gives one INSERT command multiple value groups (INSERT INTO x (val1, val2) VALUES (1,2), (3,4), etc.). This should greatly improve performance.
It also appears to be heavily documented, which is always a plus.
Also, if you want something quick and simple, you could try this: http://djangosnippets.org/snippets/2362/. It's a simple manager I used on a project.
The other snippet wasn't as simple and was really focused on bulk inserts for relationships. This is just a plain bulk insert and just uses the same INSERT query.
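For a rough idea of what such a helper does under the hood, the core of it is building one INSERT with many VALUES groups; the sketch below does no identifier escaping, so it is only safe for trusted, hard-coded table and field names.

from django.db import connection

def naive_bulk_insert(table, fields, rows):
    # rows: a list of tuples, each matching the order of `fields`.
    row_placeholder = "(" + ", ".join(["%s"] * len(fields)) + ")"
    query = "INSERT INTO %s (%s) VALUES %s" % (
        table, ", ".join(fields), ", ".join([row_placeholder] * len(rows)))
    params = [value for row in rows for value in row]
    connection.cursor().execute(query, params)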
Development Django has gained bulk_create: https://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.bulk_create