Increasing mass import speed to MS SQL 2008 database from client application - c++

I have a Qt application that reads a special text file, parses it, and inserts about 100000 rows into a temporary table in a Firebird database. It then runs a stored procedure that processes the temporary table and applies changes to permanent tables. Inserting 100000 rows into the in-memory temporary table takes about 8 seconds on Firebird.
Now I need to implement the same behavior on MS SQL Server 2008. With simple serial inserts it takes about 76 seconds for 100000 rows, which is too slow. I have looked at the following options:
Temporary tables (# and ##). These are stored on disk in tempdb, so there is no speed increase.
Bulk Insert. Very good insertion speed, but it requires a shared folder on the client or the server.
Table variables. MSDN says: "Do not use table variables to store large amounts of data (more than 100 rows)."
So what is the right way to increase insertion speed from a client application to MS SQL Server 2008?
Thank you.

You can use the bulk copy operations available through OLE DB or ODBC interfaces.
This MSDN article seems to hold your hand through the process, for ODBC:
1. Allocate an environment handle and a connection handle.
2. Set SQL_COPT_SS_BCP and SQL_BCP_ON to enable bulk copy operations.
3. Connect to SQL Server.
4. Call bcp_init to set the following information:
   - The name of the table or view to bulk copy from or to.
   - Specify NULL for the name of the data file.
   - The name of a data file to receive any bulk copy error messages (specify NULL if you do not want a message file).
   - The direction of the copy: DB_IN from the application to the view or table, or DB_OUT to the application from the table or view.
5. Call bcp_bind for each column in the bulk copy to bind the column to a program variable.
6. Fill the program variables with data, and call bcp_sendrow to send a row of data.
7. After several rows have been sent, call bcp_batch to checkpoint the rows already sent. It is good practice to call bcp_batch at least once per 1000 rows.
8. After all rows have been sent, call bcp_done to complete the operation.
If you need a cross platform implementation of the bulk copy functions, take a look at FreeTDS.
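The send/checkpoint cadence described in the steps above can be sketched without the driver. This is only an illustration under stated assumptions: BulkSink and bulk_send are hypothetical names, and the three callbacks are stand-ins for bcp_sendrow, bcp_batch and bcp_done, which a real program would call on a bcp-enabled ODBC connection after bcp_init and bcp_bind.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-ins for the driver calls. In a real ODBC program these
// would wrap bcp_sendrow, bcp_batch and bcp_done on the connection handle.
struct BulkSink {
    std::function<bool()> send_row;    // stand-in for bcp_sendrow
    std::function<void()> checkpoint;  // stand-in for bcp_batch
    std::function<void()> finish;      // stand-in for bcp_done
};

// Copies each row into the bound variable, sends it, and checkpoints every
// batch_size rows. Stops at the first failed send. Returns the rows sent.
template <typename Row>
std::size_t bulk_send(const std::vector<Row>& rows,
                      Row& bound,  // the program variable the columns are bound to
                      const BulkSink& sink,
                      std::size_t batch_size = 1000)
{
    assert(batch_size > 0);
    std::size_t sent = 0;
    for (const Row& row : rows) {
        bound = row;                 // fill the bound program variable
        if (!sink.send_row()) {
            break;                   // give up on the first failed row
        }
        ++sent;
        if (sent % batch_size == 0) {
            sink.checkpoint();       // commit the rows sent so far
        }
    }
    sink.finish();                   // flush the remainder and end the copy
    return sent;
}
```

With 2500 rows and the default batch size, this sends 2500 rows, checkpoints twice inside the loop, and leaves the last 500 rows to the final finish call.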

Related

Which one is more performant in redshift - Truncate followed with Insert Into or Drop and Create Table As?

I have been working on AWS Redshift and kind of curious about which of the data loading (full reload) method is more performant.
Approach 1 (Using Truncate):
Truncate the existing table
Load the data using Insert Into Select statement
Approach 2 (Using Drop and Create):
Drop the existing table
Load the data using Create Table As Select statement
We have been using both in our ETL, but I am interested in understanding what's happening behind the scene on AWS side.
In my opinion - Drop and Create Table As statement should be more performant as it reduces the overhead of scanning/handling associated data blocks for table needed in Insert Into statement.
Moreover, truncate in AWS Redshift does not reseed identity columns - Redshift Truncate table and reset Identity?
Please share your thoughts.
Redshift operates on 1MB blocks as the base unit of storage and coherency. When changes are made to a table, it is these blocks that are "published" for all to see when the changes are committed. A table is just a list (data structure) of the block IDs that compose it, and there can be many versions of a table in flight at any time (if it is being changed while others are viewing it).
For the sake of this question, let's assume that the table in question is large (contains a lot of data), which I expect is true. These two statements end up doing a common action: unlinking and freeing all the blocks in the table. The blocks are where all the data lives, so you'd think that the speed of the two would be the same, and on idle systems they are close. Both automatically commit their results, so the command doesn't complete until the work is done. In this idle-system comparison I've seen DROP run faster, but then you need to CREATE the table again, so time is needed to recreate the table's data structure; this can be done in a transaction block, so do we need to include the COMMIT? The bottom line is that on an idle system these two approaches are quite close in runtime, and when I last measured them for a client the DROP approach was a bit faster. I would advise you to read on before making your decision.
However, in the real world Redshift clusters are rarely idle, and under load these two statements can behave quite differently. DROP requires exclusive control over the table since it does not run inside of a transaction block. All other uses of the table must be closed (committed or rolled back) before DROP can execute. So if you perform this DROP/recreate procedure on a table others are using, the DROP statement will be blocked until all of those uses complete, which can take an indeterminate amount of time. For ETL processing on "hidden" or "unpublished" tables the DROP/recreate method can work, but you need to be really careful about what other sessions are accessing the table in question.
TRUNCATE does run inside of a transaction but performs a commit upon completion. This means that it won't be blocked by others working with the table. It's just that one version of the table is full (for those who were looking at it before TRUNCATE ran) and one version is completely empty. The data structure of the table has a version for each session that has it open, and each session sees the blocks (or lack of blocks) that correspond to its version. I suspect that it is managing these data structures and propagating the changes through the commit queue (bookkeeping) that slows TRUNCATE down slightly. The upside of this bookkeeping is that TRUNCATE will not be blocked by other sessions reading the table.
The deciding factor in choosing between these approaches is often not performance but which one has the locking and coherency behavior that will work in your solution.

how's the database management software like navicat select a large number of data from table

I am writing a C++ program on Linux. It is something like a mini Navicat for Linux: it can connect to different databases (PostgreSQL, MySQL, MSSQL, Oracle) and execute SQL. The program starts an interface server (Thrift) so that clients can connect and execute SQL commands.
When I execute "select * from table" on a table with a lot of data (a million rows, 10 million, or more), my program is terminated by Linux before it returns the data to the client, because it runs out of memory.
I am curious how Navicat achieves this, and how I can achieve it in my program.
Hope I make my question clear.
Usually there is no need to retrieve (and hold in memory) all the data from a big table in one go. If you are displaying query results, you can fetch just enough rows to fill the table on the screen, and then download more as the user scrolls. If you are developing an analysis algorithm, you can still analyze the table data in chunks. See the documentation on scrollable cursors for your database engine.
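The chunked approach can be sketched as a small loop that holds at most one chunk in memory at a time. This is a minimal sketch under stated assumptions: for_each_chunk is a hypothetical helper, and fetch is a hypothetical callback that would wrap a cursor fetch or a paged query in a real client and is assumed to return a std::vector with up to limit rows starting at offset.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>  // fetch is assumed to return a std::vector of rows

// Repeatedly asks fetch(offset, limit) for the next chunk and hands it to
// consume. Only one chunk is alive at any time, so memory use is bounded by
// chunk_size rather than by the size of the table. Returns the rows seen.
template <typename Fetch, typename Consume>
std::size_t for_each_chunk(Fetch fetch, std::size_t chunk_size, Consume consume)
{
    assert(chunk_size > 0);
    std::size_t total = 0;
    for (;;) {
        auto chunk = fetch(total, chunk_size);
        if (chunk.empty()) {
            break;                      // nothing left to read
        }
        consume(chunk);                 // e.g. stream the rows to the client
        total += chunk.size();
        if (chunk.size() < chunk_size) {
            break;                      // short chunk: the source is exhausted
        }
    }
    return total;
}
```

In a server like the one described, consume would serialize each chunk to the client before the next one is fetched, so the full result set never sits in the process at once.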

Efficiency using triggers inside attached database with SQLite

Situation
I'm using multiple storage databases as attachments to one central "manager" DB.
The storage tables share one pseudo-AUTOINCREMENT index across all storage databases.
I need to iterate over the shared index frequently.
The final number and names of storage tables are not known on storage DB creation.
On some signal, a then-given range of entries will be deleted.
It is vital that no insertion fails and no entry gets deleted before its signal.
A power outage is possible, and data loss in that case is hardly, if ever, tolerable. Any solution that risks it (in-memory databases etc.) is not viable.
Database access is currently controlled using strands. This takes care of sequential access.
Due to the high frequency of INSERT transactions, I must trigger WAL checkpoints manually. I've seen journals of up to 2GB in size otherwise.
Current solution
I'm inserting datasets using parameter binding to a precreated statement.
INSERT INTO datatable VALUES (:idx, ...);
Doing that, I remember the start and end index. Next, I bind it to an insert statement into the registry table:
INSERT INTO regtable VALUES (:idx, datatable);
My query determines the datasets to return like this:
SELECT MIN(rowid), MAX(rowid), tablename
FROM (SELECT rowid,tablename FROM entryreg LIMIT 30000)
GROUP BY tablename;
After that, I query
SELECT * FROM datatable WHERE rowid >= :minid AND rowid <= :maxid;
where I use predefined statements for each datatable and bind both variables to the first query's results.
This is too slow. As soon as I create the registry table, my insertions slow down so much I can't meet benchmark speed.
Possible Solutions
There are several other ways I can imagine it can be done:
Create a view of all indices as a UNION or OUTER JOIN of all table indices. This can't be done persistently on attached databases.
Create triggers for INSERT/REMOVE on table creation that fill a registry table. This can't be done persistently on attached databases.
Create a trigger for CREATE TABLE on database creation that will create the triggers described above. Requires user functions.
Questions
Now, before I go and add user functions (something I've never done before), I'd like some advice if this has any chances of solving my performance issues.
Assuming I create the databases using a separate connection before attaching them. Can I create views and/or triggers on the database (as main schema) that will work later when I connect to the database via ATTACH?
From what it looks like, a trigger AFTER INSERT will fire after every single line of insert. If it inserts stuff into another table, does that mean I'm increasing my number of transactions from 2 to 1+N? Or is there a mechanism that speeds up triggered interaction? The first case would slow down things horribly.
Is there any chance that a FULL OUTER JOIN (I know that I need to create it from other JOIN commands) is faster than filling a registry with insertion transactions every time? We're talking roughly ten transactions per second with an average of 1000 elements (insert) vs. one query of 30000 every two seconds (query).
Open the SQLite databases in multi-thread mode and handle the insert/update/query/delete operations in separate threads. I prefer to copy the query result into an STL container for processing.

SAS/ACCESS and data step on external DB

I have the following concern regarding SAS/ACCESS facility.
Let's imagine that we have an external DB (i.e. Oracle), which we have assigned to a certain libname.
Next, we do a simple operation on one of the tables within this DB, i.e.
data db.table_new;
set db.table_old(keep=var1 var2 var3);
if var1>0 then new_var1=5;
run;
My question is the following:
Will the whole table table_old be pulled from external DB to SAS Server in order to process the data?
Will SAS/ACCESS transform the data step into DBMS operation or SQL so the whole processing will be performed outside SAS?
The documentation is unclear about it. See page 62.
Usually the rule of thumb is: if the SAS functions that are used in DATA step can be converted to native db sql functions, then SAS will let the DB server do the data processing. In your case, this seems to be the situation.
You can answer this question on any piece of code through a set of non-syntax-highlighted options that need to be simplified:
options sastrace=',,,d' sastraceloc=saslog nostsuffix;
When you run the data step, check the log. You will see information about whether SAS is able to successfully translate the code or not. If it was unsuccessful, you will see:
ACCESS ENGINE: SQL statement was not passed to the DBMS, SAS will do the processing.
If this occurs, SAS will usually send a select * to the server and pull everything before filtering. When you see that message, try explicit pass-through, or redesign your query so that everything can be done on the server. If the table is large enough, it is possible to bring down the SAS server or severely degrade performance on the Oracle server.
Some common functions you'll want to avoid using directly in the query, especially with Oracle:
datepart()
intnx()
intck()
today()
put()
input()
If I have to use any of those functions, I usually play it safe and create a macro variable of static ones beforehand (e.g. today()), filter the raw data at the lowest level first to get it into the SAS server, or use explicit SQL passthrough.
In summary, I would say it depends on your method. The second page of Chapter 1 of the SAS/ACCESS 9.2 document linked above describes two methods (besides the older DBLOAD procedure) of the SAS/ACCESS facility:
LIBNAME reference - assign SAS librefs to DBMS objects such as schemas and databases; you can then work with the table or view as you would with a SAS data set...You can use such SAS procedures as PROC SQL or DATA step programming on any libref that references DBMS data.
SQL Pass-through facility - to interact with a data source using its native SQL syntax without leaving your SAS session. SQL statements are passed directly to the data source for processing...The DBMS optimizer can take advantage of indexes on DBMS columns to process a query more quickly and efficiently
Hence, with the first method SAS handles the processing, and with the second method the DBMS handles it. As with most clients (Java, C#, a Python script, or a PHP webpage) that connect to external RDBMS sources, unless a direct ODBC/OLE DB or other API connection is explicitly employed and a request sent, processing is handled in the front end (e.g., calculating parameters) and the end result is written to the back end via transactions. All SAS libraries live in memory (or on temporary disk) during the session, and depending on the code SAS either handles the data itself and passes the results to the external source, or passes the data handling entirely to that source.
Comparative Example: Microsoft Access
One good comparative example is Microsoft Access, which like SAS provides both a linked table connection and a pass-through query for any ODBC-compliant RDBMS, including SQL Server, Oracle, MySQL, etc. It is often a misnomer to call Access a database when it is actually a GUI program and a collection of objects, one of which is the default Windows JET/ACE engine (a .dll file), which is not restricted to Access but available to all Office programs. Notice the word default, as this can be switched out for any ODBC database source.
Linked tables are essentially Access GUI objects (specifically, special tabledefs), not unlike SAS's libname refs, that are loaded into a JET/ACE table container with data pointing externally. One can then use a linked table like any other local Access table and use anything in the ACE SQL dialect. This special linked table (much as SAS's libname refs are established through ODBC or another connection type) points to the external source, and the driver translates the query command for the action on the remote side. Therefore, the exact same query may perform differently on an Access linked table than when run directly on the RDBMS.
Analogy
I imagine SAS behaves the same way and exists as a front end, with libname refs as local objects holding pointers to the back end. All data step handling is processed locally, and only the result set is imported or extracted by the engine. To use an analogy: the database is the home, and SAS is the garbage man, home decorator, or move-in helper. SAS (like Java's JDBC, PHP's PDO, Python's cursors, R's libraries) knocks on the door, which the database answers (annoyed by so many requests). "Hey buddy, we need to take out the garbage and here are the exact items...or we need to remodel the basement and here are the exact specs...or we have new furniture in the truck ready for drop-off...with credentials signed, please carry out immediately." In both cases, pass-through methods are requests carried out on the back-end engine. So SAS leaves instructions, maybe a note on the door (without exactness), for the homeowner to carry out.

C++ Builder - Multithreaded database update

I use ADO components in C++ Builder and I need to add about 200 000 records to my MS Access database. Adding those records one by one takes a lot of time, so I wanted to use threads. Each thread would create a TADOTable, connect to the database, and insert its own rows. But when I run the application, it is even slower than using only one thread!
So, how should I do it? I need to add many records to my Access database but want to avoid one-by-one insertion. Some code would be useful.
Thank you.
First of all, multithreading will not increase the speed of the inserts. It will slow them down because of context switching and the like. What you need is a way to do bulk inserts, that is, to send multiple rows in a single transaction.
Try searching for bulk inserts into Access tables. There is a lot of information out there.