I am trying to insert a large number of records into a SQLite database. I get a syntax error if I try to use the sqlite3_exec C API.
The code looks like this:
ret = sqlite3_exec(db_p, ".import file.txt table", NULL, NULL, NULL);
I know that .import is a command-line feature, but is there any way to do an extremely large insert of records that takes minimal time? I have read through previous bulk-insert code and attempted to make changes, but these are not producing the desired results.
Is there a way to insert the strings directly into the tables without intermediate APIs being called?
.import is a command of the sqlite3 command-line shell, not of the C API, so it is not available via sqlite3_exec(). However, there is one crucial thing that speeds up inserts: wrap them in a transaction.
BEGIN;
-- lots of INSERT statements here
COMMIT;
Without this, SQLite has to write to the file after each insert to uphold its ACID guarantees. The transaction lets it write everything to the file later, in bulk.
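In C, that pattern might look something like this minimal sketch; the table name, column, and records array are placeholders for illustration, and %q is used so embedded single quotes in the data are escaped:

#include <stdio.h>
#include <sqlite3.h>

/* Insert an array of strings inside a single transaction; a sketch
   with a placeholder table/column and minimal error handling. */
static int bulk_insert(sqlite3 *db, char **records, int n)
{
    char *err = NULL;
    if (sqlite3_exec(db, "BEGIN;", NULL, NULL, &err) != SQLITE_OK) {
        fprintf(stderr, "BEGIN failed: %s\n", err);
        sqlite3_free(err);
        return 1;
    }
    for (int i = 0; i < n; i++) {
        /* %q doubles any single quotes inside the value. */
        char *sql = sqlite3_mprintf(
            "INSERT INTO mytable (col1) VALUES ('%q');", records[i]);
        int rc = sqlite3_exec(db, sql, NULL, NULL, &err);
        sqlite3_free(sql);
        if (rc != SQLITE_OK) {
            fprintf(stderr, "INSERT failed: %s\n", err);
            sqlite3_free(err);
            sqlite3_exec(db, "ROLLBACK;", NULL, NULL, NULL);
            return 1;
        }
    }
    if (sqlite3_exec(db, "COMMIT;", NULL, NULL, &err) != SQLITE_OK) {
        fprintf(stderr, "COMMIT failed: %s\n", err);
        sqlite3_free(err);
        return 1;
    }
    return 0;
}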
The answer to the syntax error could well be that your strings are not enclosed in quotes in your SQL statement.
Related
I have a list containing a lot of data (it will be near 1000 items). I want to add it all to a table in one go. Is this as straightforward as a for loop over the list with multiple inserts? Multiple commits? Is this bad practice? Thanks.
I haven't tried it yet, as I am still setting up the table columns (of which there are many), so I need to know whether this is feasible. Thanks.
If you're using SQL to insert:
INSERT INTO tablename (column1, column2) VALUES
('data1', 'data2'),
('data1', 'data2'),
('data1', 'data2'),
('data1', 'data2');
If you're generating the query from code, build the statement above with a for loop and then run it once.
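For instance, if this is SQLite driven from C (multi-row VALUES needs SQLite 3.7.11 or later, and the sqlite3_str string builder needs 3.24 or later), the loop might look like this rough sketch; the people/name/age schema is made up for the example:

#include <stdio.h>
#include <sqlite3.h>

/* Build one multi-row INSERT from parallel arrays, then run it in a
   single call; %q doubles any single quotes inside the strings. */
static int insert_many(sqlite3 *db, char **names, int *ages, int n)
{
    sqlite3_str *s = sqlite3_str_new(db);
    sqlite3_str_appendall(s, "INSERT INTO people (name, age) VALUES ");
    for (int i = 0; i < n; i++)
        sqlite3_str_appendf(s, "%s('%q', %d)",
                            i ? ", " : "", names[i], ages[i]);
    char *sql = sqlite3_str_finish(s);   /* caller must sqlite3_free() */
    char *err = NULL;
    int rc = sqlite3_exec(db, sql, NULL, NULL, &err);
    if (rc != SQLITE_OK) {
        fprintf(stderr, "insert failed: %s\n", err);
        sqlite3_free(err);
    }
    sqlite3_free(sql);
    return rc;
}

Note that SQLite caps the length of a single statement (about 1 MB by default), so for very large batches you may need to chunk the rows.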
For a more efficient approach consider a union as shown in: Is it possible to insert multiple rows at a time in an SQLite database?
insert into tablename (column1, column2)
select data1 as column1, data2 as column2
union select data3, data4
union...
In SQLite you don't have network latency, so performance-wise it does not really matter that you issue many small queries to the engine. For more background, you can read this page from the official documentation: https://www.sqlite.org/np1queryprob.html
But in write mode (insert or update), each individual query has to pay the cost of an implicit transaction. To avoid that, you need to gather your insert queries into an explicit transaction. How you do that varies with your programming language. Here is a code sample showing how to do it in Go. I've simplified error handling to keep the gist clear.
tx, _ := db.Begin()
for _, item := range items {
    tx.Exec(`INSERT INTO testtable (col1, col2) VALUES (?, ?)`, item.Field1, item.Field2)
}
tx.Commit()
If you detect an error in your loop, call tx.Rollback() instead of tx.Commit() in order to cancel all previous writes to your database, so that the final state is as if no insert query had been issued at all.
I'm in the process of building new "ETL" pipelines with CTAS. Unfortunately, quite often the CTAS query is too intensive, which causes Athena to time out. As such, I use CTAS to create the initial table and populate it with a small sample. I then write a script that queries the same table the CTAS was generated from (which is in Parquet format) for the remaining days that the CTAS couldn't handle upfront. I write the output of these query results to the same directory that holds the results of the CTAS query, before repairing the table (to pick up the new data). However, this seems to be a pretty clunky process for a number of reasons:
1) Query results written out with standard SQL statements all end up as strings. For example, when I write out the number of DAUs (which is a count, cast to an int), the CSV output is a string, i.e. wrapped in "".
Is it possible to write out Athena query results (not the CTAS) as anything other than strings when in CSV format? The main problem is that the results can't be read back into the table produced by the CTAS, since those columns expect a bigint. This can, of course, be resolved with a Lambda function, but it seems like a big overhead for something that should be trivial.
2) Can you put query results (not from CTAS) directly into parquet instead of CSV?
3) Is there any way to prevent metadata from being generated with the query results (not from CTAS)? Again, it can be cleaned up with a Lambda function, but it's just additional nonsense I need to handle.
Thanks in advance!
The data type of the result depends on the SQL used to create it and also on how you consume it. Based on your question I'm going to assume that you're creating a table using CTAS and that the output is CSV, and that you're then looking at the CSV data directly.
That CSV is going to have quotes in it, but that doesn't mean that it's not possible to read integer values as integers, and so on. Athena uses a schema-on-read approach, and as long as the serde can interpret a value as a particular type, that type will work as the type of the column.
If you query the table created by your CTAS operation you should get back integers for the integer columns.
Using CTAS you can also create output of different types, like JSON, Avro, Parquet, and ORC, that keep the type information. Just use the format property to select the output type.
I am a bit confused about what you mean in your third question. With a normal query you get two files on S3, the data file and the metadata file, and they are written to the output location given in the StartQueryExecution API call; with a CTAS query, the output data goes to a different location (given in the SQL) than the metadata file.
Are you actually using CTAS, or are you talking about the regular query result files?
Update after the question got clarified:
1) Athena is unfortunately unable to properly read its own output in many situations. It really surprises me that they never considered this before launch. You might be able to set up a table that uses the regex serde.
2) No, unfortunately the only output of a regular query is CSV at this time.
3) No, the metadata is always written to the same prefix as the output.
I think your best bet is running multiple CTAS queries that select subsets of your source data; if there is a date column, for example, you could make one CTAS per month or some other time range that works. After the CTAS queries have completed, you can move the result files into the same directory on S3 and create a final table that has that directory as its location.
I am currently developing server software in C++ with a MySQL data backend. I am using the official MySQL/connector library from Oracle to work with MySQL. The connection itself is working and I'm not having any issues with that.
My problem is that the database and the table schemas tend to change every once in a while, because new tables and columns keep getting added. Existing columns may also be changed for the same reason. To make sure I recognize outdated server software quickly, I wanted to add a warning for when the database has changed.
My first idea was to hardcode how the database (and tables and such) should look and then check whether the current database matches the hardcoded data. But I have no clue how to achieve that.
In summary I want to be able to detect whether
A table has been added or removed
A column in a table has been altered
A column in a table has been added or removed
with as little C++ code as possible. Also it should be quite easy to maintain.
Additional information will be added when required.
I would suggest the following approach:
1) Fork and execute the mysql command-line client, setting up a pair of pipes to mysql's standard input and output.
2) At this point you should be able to execute simple commands by piping them to mysql via the standard input pipe, and read the output from the standard output pipe.
You will need to make careful notes as to the output format of each mysql command, so that you know when you finished reading its output, and you can send the next command.
3) As the first order of business, execute:
show tables;
The output that comes back will list all tables in the database. Parsing the output into a list of table names is trivial. Then execute, for each table:
show create table <tablename>;
The resulting output shows all fields in the table, its keys, and constraints. Pretty much all of this table's schema. Lather, rinse, repeat, for every table.
4) In this manner you can capture a basic schema of the entire database, for comparison purposes. If necessary, use the same approach to capture the triggers, and other objects. You'll likely need to do some minor massaging of the data, and exclude a few bits. "show create table", for example, will include the current AUTO_INCREMENT values, which you can ignore.
This general approach, of driving a mysql process via its standard input and output, is a bit wobbly, of course. With a little bit of work, you can use MySQL's native client library to execute all of these commands and capture their results directly. This should be more reliable.
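For instance, a rough sketch with the MySQL C API might look like this (error handling trimmed, and an already-open connection is assumed):

#include <stdio.h>
#include <mysql.h>

/* Print the CREATE TABLE statement for every table in the current
   database; a sketch assuming "conn" is an open MYSQL* connection. */
static void dump_schema(MYSQL *conn)
{
    if (mysql_query(conn, "SHOW TABLES") != 0)
        return;
    MYSQL_RES *tables = mysql_store_result(conn);
    MYSQL_ROW row;
    while ((row = mysql_fetch_row(tables)) != NULL) {
        char query[512];
        snprintf(query, sizeof query, "SHOW CREATE TABLE `%s`", row[0]);
        if (mysql_query(conn, query) != 0)
            continue;
        MYSQL_RES *create = mysql_store_result(conn);
        MYSQL_ROW crow = mysql_fetch_row(create);
        /* Column 0 is the table name, column 1 the CREATE statement. */
        printf("%s\n\n", crow[1]);
        mysql_free_result(create);
    }
    mysql_free_result(tables);
}

Comparing the captured text, minus the AUTO_INCREMENT clauses mentioned above, against a stored copy of the expected schema gives you the change detection.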
I am having an array of structure. I need to insert all the rows from that array to a table.
So I have simply used cfquery inside cfloop to insert into the database.
Some people suggested that I not use cfquery inside cfloop, because each iteration will make a new connection to the database.
But in my case, is there any way I can do this without using cfquery inside cfloop?
It's not so much about maintaining connections as about hitting the server with n requests to insert or update data, one for every iteration of the cfloop. This will seem OK when testing with a few records, but when you throw it into production and your client pushes your application to loop over a couple of hundred rows, you're going to hit the database server a couple of hundred times as well.
As Scott suggests, you should look at looping around to build a single query rather than making multiple hits to the database. Looping around inside the cfquery has the benefit that you can use cfqueryparam, but if you can trust the data, i.e. it has already been sanitised, you might find it easier to use something like cfsavecontent to build up your query and output the string inside the cfquery at the end.
I have used both the query inside loop and loop inside query method. While having the loop inside the query is theoretically faster, it is not always the case. You have to try each method and see what works best in your situation.
Here is the syntax for the loop inside the query, using Oracle for the sake of picking a database.
insert into table
(field1, field2, etc)
select null, null, etc
from dual
where 1 = 2
<cfloop>
union
select <cfqueryparam value="#value1#">
, <cfqueryparam value="#value2#">
etc
from dual
</cfloop>
Depending on the database, convert your array of structures to XML, then pass that as a single parameter to a stored procedure.
In the stored procedure, do an INSERT INTO SELECT, where the SELECT statement selects data from the XML packet. You could insert hundreds or thousands of records with a single INSERT statement this way.
Here's an example.
There is a limit to how many <cfquery><cfloop>... iterations you can do when using <cfqueryparam>. The limit is also vendor-specific. If you do not know how many records you will be generating, it is best to remove <cfqueryparam>, if it is safe to do so. Make sure your data is coming from trusted sources and has been sanitised. This approach can save huge amounts of processing time, because it makes only one call to the database server, unlike an outer loop.
I have an entire set of data I want to insert into a table. I am trying to have it insert/update everything OR roll back. I was going to do it in a transaction, but I wasn't sure if the sqlite3_exec() command did the same thing.
My goal was to iterate through the list.
Select on each iteration based on the primary key.
If a result was found:
    append an UPDATE to the string;
else:
    append an INSERT to the string;
Then, after iterating through the loop, I would have a giant string and would say:
sqlite3_exec(string);
sqlite3_close(db);
Is that how I should do it? I was going to execute on each iteration of the loop, but I didn't think I could get a global rollback if there was an error.
No, you should not append everything into one giant string. If you do, you will need to allocate a whole bunch of memory as you go, and it will be harder to produce good error messages for each individual statement, since you will just get a single error for the entire string. Why spend all of that effort constructing one big string when SQLite is just going to have to parse it back down into its individual statements again?
Instead, as @Chad suggests, you should just use sqlite3_exec() on a BEGIN statement, which will start a transaction. Then sqlite3_exec() each statement in turn, and finally sqlite3_exec() a COMMIT or ROLLBACK depending on how everything goes. All of the statements executed after the BEGIN will be within that transaction, and so committed or rolled back together. That's what the "A" in ACID stands for: atomic; all of the statements in the transaction will be committed or rolled back as if they were a single atomic operation.
Furthermore, you probably shouldn't use sqlite3_exec() if some of the data varies within each statement, such as data read from a file. If you do, a mistake could easily leave you with an SQL injection bug. For instance, if you construct your query by appending strings and you have a string like char *str = "it's a string" to insert, then if you don't quote it properly your statement could come out as INSERT INTO table VALUES ('it's a string');, which is an error. Or, if someone malicious could write data into this file, they could cause you to execute any SQL statement they want (imagine if the string were "'); DROP TABLE my_important_table; --"). You may think that no one malicious will ever provide input, but you can still have accidental problems when someone puts a character that confuses the SQL parser into a string.
Instead, you should use sqlite3_prepare_v2() and sqlite3_bind_...() (where ... is the type, like int or double or text). To do this, you write a statement like char *query = "INSERT INTO table VALUES (?)", substituting a ? where you want a parameter to go, prepare it using sqlite3_prepare_v2(db, query, -1, &stmt, NULL), bind the parameter using sqlite3_bind_text(stmt, 1, str, -1, SQLITE_STATIC), and then execute the statement with sqlite3_step(stmt). If the statement returns any data, you will get SQLITE_ROW, and can access the data using the various sqlite3_column_...() functions. Be sure to read the documentation carefully; some of the example parameters I gave may need to change depending on how you use this.
Yes, this is a bit more of a pain than calling sqlite3_exec(), but if your query includes any data loaded from external sources (files, user input), it is the only way to do it correctly. sqlite3_exec() is fine to call when the entire text of the query is contained within your source, such as the BEGIN and COMMIT or ROLLBACK statements, or pre-written queries with no parts coming from outside of your program; you need prepare/bind whenever there is any chance that an unexpected string could get in.
Finally, you don't need to query whether something is already in the database and then either insert or update it. You can do an INSERT OR REPLACE query, which will either insert a record or replace the one with a matching primary key; this is the equivalent of selecting and then doing an INSERT or an UPDATE, but much quicker and simpler. See the INSERT and "on conflict" documentation for more details.
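Putting those pieces together, a minimal sketch might look like the following; the table layout (mytable with an INTEGER PRIMARY KEY id and a TEXT name) and the record struct are hypothetical stand-ins for your data:

#include <stdio.h>
#include <sqlite3.h>

struct record { int id; const char *name; };

/* Upsert an array of records atomically: one transaction, one prepared
   INSERT OR REPLACE statement re-bound for each row. */
static int save_all(sqlite3 *db, const struct record *recs, int n)
{
    sqlite3_stmt *stmt = NULL;
    int rc = sqlite3_prepare_v2(db,
        "INSERT OR REPLACE INTO mytable (id, name) VALUES (?, ?);",
        -1, &stmt, NULL);
    if (rc != SQLITE_OK)
        return rc;

    sqlite3_exec(db, "BEGIN;", NULL, NULL, NULL);
    for (int i = 0; i < n; i++) {
        sqlite3_bind_int(stmt, 1, recs[i].id);
        sqlite3_bind_text(stmt, 2, recs[i].name, -1, SQLITE_STATIC);
        rc = sqlite3_step(stmt);          /* SQLITE_DONE on success */
        sqlite3_reset(stmt);              /* ready for the next row  */
        if (rc != SQLITE_DONE) {
            fprintf(stderr, "insert failed: %s\n", sqlite3_errmsg(db));
            sqlite3_exec(db, "ROLLBACK;", NULL, NULL, NULL);
            sqlite3_finalize(stmt);
            return rc;
        }
    }
    sqlite3_exec(db, "COMMIT;", NULL, NULL, NULL);
    sqlite3_finalize(stmt);
    return SQLITE_OK;
}

Because the statement is prepared once and only re-bound for each row, this also avoids re-parsing the SQL thousands of times.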