Better way to bulk insert in sqlite3 C++

I have many insert statements to execute in bulk. I can build a single SQL string containing all the insert statements and execute it once, like this:
std::string sql_string = "";
int i;
for(i=0;i<1000; i++) {
Transaction transaction = transactions[i];
std::string tx_sql = CREATE_SQL_STRING(transaction, lastTxId); // "INSERT INTO (...) VALUES (...);"
sql_string = sql_string+tx_sql;
lastTxId++;
}
sqlite3_exec(db, sql_string.c_str(), callback, 0, &zErrMsg);
Is there any other, better-performing way to bulk insert with the sqlite3 C++ API?
The insertions are not security critical; only performance matters. Should I use prepared statements?

Most important: put all the inserts into a single transaction.
For more detailed performance guidance, see the official documentation: https://sqlite.org/speed.html (a bit outdated).
For more recent information I'd recommend https://medium.com/@JasonWyatt/squeezing-performance-from-sqlite-insertions-971aff98eef2
As for prepared statements: I am not sure about the performance impact, but it is good practice to always use them.
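For reference, here is a minimal sketch combining both suggestions (one transaction plus a reused prepared statement). It assumes a hypothetical table transactions(id, data) and a get_data() accessor on Transaction, since the real schema isn't shown; error handling is omitted:

#include <sqlite3.h>
#include <string>
#include <vector>

void bulk_insert(sqlite3* db, const std::vector<Transaction>& transactions, long long lastTxId) {
    char* errMsg = nullptr;
    sqlite3_exec(db, "BEGIN TRANSACTION;", nullptr, nullptr, &errMsg);    // one transaction for the whole batch

    sqlite3_stmt* stmt = nullptr;
    sqlite3_prepare_v2(db, "INSERT INTO transactions (id, data) VALUES (?1, ?2);", -1, &stmt, nullptr);

    for (const Transaction& tx : transactions) {
        sqlite3_bind_int64(stmt, 1, lastTxId++);
        std::string data = tx.get_data();                                 // assumed accessor, not from the question
        sqlite3_bind_text(stmt, 2, data.c_str(), (int)data.size(), SQLITE_TRANSIENT);
        sqlite3_step(stmt);                                               // execute the insert
        sqlite3_reset(stmt);                                              // reuse the same statement
        sqlite3_clear_bindings(stmt);
    }

    sqlite3_finalize(stmt);
    sqlite3_exec(db, "COMMIT;", nullptr, nullptr, &errMsg);               // single commit, single fsync
}

This avoids re-parsing the SQL text for every row and keeps the journal from being flushed once per insert.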

Related

RocksDB - Double db size after 2 Put operations of same KEY-VALUEs

I have a program that uses RocksDB and tries to write a huge number of key-value pairs to the database:
#include <cassert>
#include <string>
#include "rocksdb/db.h"
#include "rocksdb/options.h"

using namespace rocksdb;

const std::string kDBPath = "/tmp/rocksdb_test"; // path assumed; not shown in the original snippet

int main() {
    DB* db;
    Options options;
    // Optimize RocksDB. This is the easiest way to get RocksDB to perform well.
    options.IncreaseParallelism(12);
    options.OptimizeLevelStyleCompaction();
    // Create the DB if it's not already present.
    options.create_if_missing = true;
    // Open the DB.
    Status s = DB::Open(options, kDBPath, &db);
    assert(s.ok());
    for (int i = 0; i < 1000000; i++) {
        // Put key-value.
        s = db->Put(WriteOptions(), "key" + std::to_string(i), "a hard-coded string here");
        assert(s.ok());
    }
    delete db;
    return 0;
}
When I ran the program for the first time, it generated about 2 GB of database. I then ran the same program several more times without any changes, and ended up with roughly N*2 GB of database, where N is the number of runs. Only after a certain number of runs did the database size start to shrink.
But what I expected was that each run would simply overwrite the previous batch, since the data doesn't change, so the database size should stay around 2 GB after each run.
QUESTION: is this an issue with RocksDB, and if not, what are the proper settings to keep the database size stable when the same key-value pairs are written repeatedly?
A full compaction can reduce the space usage, just add this line before delete db;:
db->CompactRange(CompactRangeOptions(), nullptr, nullptr);
Note: a full compaction does take some time, depending on the data size.
Space amplification is expected; all LSM-tree based databases have this issue: https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide#amplification-factors
Here is a great paper about space amplification research for rocksdb: http://cidrdb.org/cidr2017/papers/p82-dong-cidr17.pdf
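For context, a minimal sketch of where the compaction call fits in the original program (same variable names as the question; error handling kept to the question's assert style):

    // ... the Put loop from the question ...

    // Force a full compaction of the entire key range before closing,
    // so obsolete versions of the repeatedly written keys are dropped.
    s = db->CompactRange(CompactRangeOptions(), nullptr, nullptr);
    assert(s.ok());

    delete db;
    return 0;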

How to buffer efficiently when writing to 1000s of files in C++

I am quite inexperienced when it comes to C++ I/O operations, especially when dealing with buffers etc., so please bear with me.
I have a programme that has a vector of objects (1000s - 10,000s). At each time-step the state of the objects is updated. I want to be able to log a complete state time history for each of these objects.
Currently I have a function that loops through my vector of objects, updates the state, and then calls a logging function which opens the file (ASCII) for that object, writes the state to the file, and closes the file (using std::ofstream). The problem is this significantly slows down my run time.
I've been recommended a couple things to do to help speed this up:
Buffer my output to prevent extensive I/O calls to the disk
Write binary rather than ASCII files
My question mainly concerns 1. Specifically, how would I actually implement this? Would each object effectively require its own buffer? Or would this be a single buffer that somehow knows which file to send each bit of data to? If the latter, what is the best way to achieve this?
Thanks!
Maybe the simplest idea first: instead of logging to separate files, why not log everything to an SQLite database?
Given the following table structure:
create table iterations (
    id integer not null,
    iteration integer not null,
    value text not null
);
At the start of the program, prepare a statement once:
sqlite3_stmt *stmt;
sqlite3_prepare_v3(db, "insert into iterations values(?,?,?)", -1, SQLITE_PREPARE_PERSISTENT, &stmt, NULL);
The question marks here are placeholders for future values.
After every iteration of your simulation, you could walk your state vector and execute the stmt a number of times to actually insert rows into the database, like so:
for (int i = 0; i < objects.size(); i++) {
    sqlite3_reset(stmt);
    // Fill in the three placeholders and execute the query.
    sqlite3_bind_int(stmt, 1, i);
    sqlite3_bind_int(stmt, 2, current_iteration); // Could be done once, but here for illustration.
    std::string state = objects[i].get_state();
    sqlite3_bind_text(stmt, 3, state.c_str(), state.size(), SQLITE_STATIC); // SQLITE_STATIC: SQLite won't copy the buffer; safe here because the statement is stepped before state goes out of scope.
    sqlite3_step(stmt); // Execute the query.
}
You can then easily query the history of each individual object using the SQLite command-line tool or any database manager that understands SQLite.
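One addition worth making for speed (not shown above, but standard SQLite practice): wrap each iteration's batch of inserts in a transaction, so the database syncs to disk once per time-step rather than once per row. A minimal sketch, assuming the db handle and the stmt prepared above, plus a placeholder Object type:

#include <sqlite3.h>
#include <string>
#include <vector>

void log_iteration(sqlite3* db, sqlite3_stmt* stmt,
                   const std::vector<Object>& objects, int current_iteration) {
    sqlite3_exec(db, "BEGIN TRANSACTION;", nullptr, nullptr, nullptr);
    for (int i = 0; i < (int)objects.size(); i++) {
        sqlite3_reset(stmt);
        sqlite3_bind_int(stmt, 1, i);
        sqlite3_bind_int(stmt, 2, current_iteration);
        std::string state = objects[i].get_state();
        sqlite3_bind_text(stmt, 3, state.c_str(), (int)state.size(), SQLITE_TRANSIENT);
        sqlite3_step(stmt);
    }
    sqlite3_exec(db, "COMMIT;", nullptr, nullptr, nullptr); // one disk sync per time-step
}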

C++ Insert into MySQL database

I have an application in C++ and I'm using MySQL Connector/C++ (https://dev.mysql.com/downloads/connector/cpp/).
I need to save some logs in one table. Depending on the situation, I may have large amounts of data, on the order of thousands of rows (for example, 80,000).
I already implemented a function that iterates over my std::vector<std::string> and saves each std::string to my database.
For example:
std::vector<std::string> lines = explode(filedata, '\n');
for (int i = 0; i < lines.size(); i++)
{
    std::vector<std::string> elements = explode(lines[i], ';');
    ui64 timestamp = strtol(elements.at(0).c_str(), 0, 10);
    std::string pointId = elements.at(6);
    std::string pointName = elements.at(5);
    std::string data = elements.at(3);
    database->SetLogs(timestamp, pointId, pointName, data);
}
The logs come from a CSV file; I save all the fields to my vector. After that I parse the vector (with explode) and keep only the fields that I need to save.
But I have a problem: if I have e.g. 80,000 rows, I end up calling my save function 80,000 times. It works and saves all the data correctly, but it takes a lot of time.
Is there a way to save all the data with a single call instead of e.g. 80,000 calls, and thus optimize the time?
EDIT 1
I changed the insert code to this:
std::string insertLog = "INSERT INTO Logs (timestamp,pointId,pointName,data) VALUES (?,?,?,?)";
pstmt->setString(1, timestampString);
pstmt->setString(2, pointId);
pstmt->setString(3, pointName);
pstmt->setString(4, data);
pstmt->executeUpdate();
You could change the code to do bulk inserts, rather than insert one row at a time.
I would recommend doing it by generating an insert statement as a string and passing the string to mysqlpp::Query. I expect you can use a prepared statement to do bulk inserts in a similar way.
If you do one insert statement per row (which I assume is the case here, given the use of explode()), there's a lot more traffic between the client and the server, which must slow things down.
EDIT
I suspect the function explode() is also increasing the execution time, as you are going over the data twice by calling explode twice. It would be nice if you could post the code for explode().
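For illustration, a minimal sketch of the multi-row approach with Connector/C++. The table and column names match the EDIT; LogRow is a placeholder struct for the parsed fields, and the string values are assumed to be already escaped (a real implementation must escape them or fall back to placeholders):

#include <cppconn/connection.h>
#include <cppconn/statement.h>
#include <memory>
#include <sstream>
#include <string>
#include <vector>

struct LogRow {                 // placeholder for the parsed CSV fields
    unsigned long long timestamp;
    std::string pointId, pointName, data;
};

void save_logs_bulk(sql::Connection* con, const std::vector<LogRow>& rows) {
    std::ostringstream sql;
    sql << "INSERT INTO Logs (timestamp,pointId,pointName,data) VALUES ";
    for (size_t i = 0; i < rows.size(); i++) {
        if (i > 0) sql << ",";
        sql << "(" << rows[i].timestamp << ",'" << rows[i].pointId << "','"
            << rows[i].pointName << "','" << rows[i].data << "')";
    }
    std::unique_ptr<sql::Statement> stmt(con->createStatement());
    stmt->execute(sql.str());   // one round trip for the whole batch
}

With very large batches, keep the generated statement under MySQL's max_allowed_packet limit, e.g. by splitting the rows into chunks of a few thousand.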

MapReduce program to read data from Hive

I am new to Hadoop MapReduce and Hive.
I would like to read data from Hive using a MapReduce program (in Java) and compute the average.
I am not sure how to implement this in MapReduce. Please help me with a sample program.
I am using IBM BigInsights 64-bit to work with the Hadoop framework.
Also, I am unable to open the link below; I'm getting a "page cannot be found" error.
https://cwiki.apache.org/Hive/tutorial.html#Tutorial-Custommap%252Freducescripts
Is there a reason you are not simply using HQL and
select avg(my_col) from my_table?
If you really need to do it in Java, then you can use HiveClient and access Hive via the JDBC API.
Here is a sample code snippet (elaborated from the HiveClient docs):
Connection con = null;
Statement stmt = null;
ResultSet rs = null;
try {
    con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
    stmt = con.createStatement();
    rs = stmt.executeQuery("select avg(my_col) as my_avg from my_table");
    if (rs.next()) {
        Double avg = rs.getDouble("my_avg");
        // do something with it..
    }
} finally {
    // close rs, stmt, con in reverse order
}
For further info: https://cwiki.apache.org/confluence/display/Hive/HiveClient
Note: you do NOT need to put this code into your own map/reduce program; Hive takes care of creating the map/reduce job (and the associated benefits of parallelization) itself.

Writing BLOB data to a SQL Server Database using ADO

I need to write a BLOB to a varbinary column in a SQL Server database. Sounds easy, except that I have to do it in C++. I've been using ADO for the database operations (first question: is this the best technology to use?). So I've got the _Stream object and a Recordset object created, and the rest of the operation falls apart from there. If someone could provide a sample of how exactly to perform this seemingly simple operation, that would be great! My binary data is stored in an unsigned char array. Here is the codenstein that I've stitched together from what little I found on the internet:
_RecordsetPtr updSet;
updSet.CreateInstance(__uuidof(Recordset));
updSet->Open("SELECT TOP 1 * FROM [BShldPackets] Order by ChunkId desc",
             _conPtr.GetInterfacePtr(), adOpenDynamic, adLockOptimistic, adCmdText);

_StreamPtr pStream;                       // declare one first
pStream.CreateInstance(__uuidof(Stream)); // create it after

_variant_t varRecordset(updSet);
//pStream->Open(varRecordset, adModeReadWrite, adOpenStreamFromRecord, _bstr_t("n"), _bstr_t("n"));
_variant_t varOptional(DISP_E_PARAMNOTFOUND, VT_ERROR);
pStream->Open(
    varOptional,
    adModeUnknown,
    adOpenStreamUnspecified,
    _bstr_t(""),
    _bstr_t(""));

_variant_t bytes(_compressStreamBuffer);
pStream->Write(_compressStreamBuffer);
updSet.GetInterfacePtr()->Fields->GetItem("Chunk")->Value = pStream->Read(1000);
updSet.GetInterfacePtr()->Update();
pStream->Close();
As far as ADO being the best technology in this case ... I'm not really sure. I personally think using ADO from C++ is a painful process. But it is pretty generic if you need that. I don't have a working example of using streams to write data at that level (although, somewhat ironically, I have code that I wrote using streams at the OLE DB level. However, that increases the pain level many times).
If, though, your data is always going to be loaded entirely in memory, I think using AppendChunk would be a simpler route:
ret = updSet.GetInterfacePtr()->Fields->
          Item["Chunk"]->AppendChunk( L"some data" );