I'm attempting to read a plain text file line by line, construct an SQL INSERT statement, execute the query, and move on. Presently, I've got a solution that can complete about 200 lines per second on my ~4 year old desktop. However, I've got about 120 million lines to go through and was looking to implement this as a daily task. Taking a few hours to complete it would be fine, but taking nearly a week isn't an option.
The lines will contain one string and anywhere from 5 to 9 integers that range from boolean values (which I've encoded as TINYINT(1)) to microseconds since midnight (BIGINT).
Once read in from the file (via getline()) the lines are tokenized by this function:
#define MAX_TOKENS 10
#define MAX_TOKEN_LENGTH 32
char tokens[MAX_TOKENS][MAX_TOKEN_LENGTH];
//...
void split_line(const string &s)
{
    char raw_string[MAX_TOKENS * MAX_TOKEN_LENGTH];
    char *rest;
    char *token_string;
    strcpy(raw_string, s.c_str());
    if(tokens[0][0] != '\0')
    {
        fill(tokens[0], tokens[0]+(MAX_TOKENS*MAX_TOKEN_LENGTH), '\0');
    }
    for(uint32_t token = 0; token < MAX_TOKENS; token++)
    {
        if(token == 0) token_string = strtok_r(raw_string, " ", &rest);
        else token_string = strtok_r(nullptr, " ", &rest);
        if(token_string == nullptr) break;
        if(token >= 1)
        {
            //if it's not a number...
            if(token_string[0] < 48 || token_string[0] > 57)
            {
                if(token_string[0] != 45) //negative numbers are allowed
                {
                    clear_tokens();
                    break;
                }
            }
        }
        strcpy(tokens[token], token_string);
    }
}
I had tried a more STL derived version of that tokenizer, but that was proving too slow. It still ranks high in the callgraph, but not as high as it did with proper STL strings.
Anyway, the next step is to build the SQL query. For this, I've tried a few things. One option was stringstreams.
string insert_query = "INSERT INTO data_20170222";
stringstream values;
string query;
while(getline(input_stream, input_stream_line))
{
    split_line(input_stream_line);
    if(tokens[5][0] != '\0') //the smallest line will have six tokens
    {
        try
        {
            query = insert_query;
            uint32_t item_type = stoi(tokens[2]);
            switch(item_type)
            {
                case 0: //one type of item
                case 1: //another type of item
                {
                    values << " (valueA, valueB, valueC, valueD, valueE, valueF,"
                              " valueG, valueH) values('"
                           << tokens[0] << "', " << tokens[1] << ", "
                           << tokens[2] << ", " << tokens[3] << ", "
                           << tokens[4] << ", " << tokens[5] << ", "
                           << tokens[6] << ", " << tokens[7] << ")";
                    break;
                }
                //...
            }
            query.append(values.str());
            values.str(string());
            values.clear();
            if(mysql_query(conn, query.c_str()))
            {
                string error(mysql_error(conn));
                mysql_close(conn);
                throw runtime_error(error);
            }
        }
        catch(exception &ex)
        {
            cerr << "Error parsing line\n '" << input_stream_line
                 << "'\n" << " " << ex.what() << endl;
            throw;
        }
    }
} //end of the while loop
When I run this version, I see 30% of callgrind's samples being measured in std::operator<< within std::basic_ostream.
I originally tried doing this all with strings, à la:
string values;
values = " (valueA, valueB, valueC, valueD, valueE, valueF,"
" valueG, valueH) values('" +
string(tokens[0]) + "', " + tokens[1] + ", " +
tokens[2] + ", " + tokens[3] + ", " +
tokens[4] + ", " + tokens[5] + ", " +
tokens[6] + ", " + tokens[7] + ")";
That proves to be effectively the same speed, but this time with the 30% of samples being allocated to std::operator+ from std::basic_string.
And finally, I switched to straight sprintf().
char values[MAX_TOKENS * MAX_TOKEN_LENGTH];
sprintf(values, " (valueA, valueB, valueC, valueD, valueE, valueF,"
" valueG, valueH) values('%s', %s, %s, %s, %s, %s, %s, %s)",
tokens[0], tokens[1], tokens[2], tokens[3],
tokens[4], tokens[5], tokens[6], tokens[7]);
stringstream was slightly faster than string (though, well within a reasonable margin of error). sprintf() was about 10% faster than both, but that's still not fast enough.
Surely there's a well-established method for accomplishing this task with such large data sets. I'd be grateful for any guidance at this point.
EDIT
Oh wow. I commented out the call to mysql_query() on a whim. Turns out, despite what valgrind says, that's where all my slowdown is. Without that block, it jumps from 200 lines per second to 1.2 million lines per second. That's more like it! Too bad I need the data in a database...
I guess this has become a question about why MariaDB seems to be operating so slowly, now. I've got a good SSD in this system, 16GB RAM, etc. It strikes me as unlikely that my hardware is holding it back.
All the more curious. Thanks in advance for any help!
Batch INSERTing 100 rows per INSERT statement will run 10 times as fast.
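As a rough illustration of the batching idea, here is a minimal sketch that builds one multi-row INSERT from many parsed lines. The Row struct and build_batch_insert() are hypothetical names introduced for this example, not anything from the question, and the string column is embedded unescaped, so real code would need to escape it (for example with mysql_real_escape_string) or use a prepared statement.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical holder for one parsed input line: a string column followed
// by integer columns, kept as text exactly as they were tokenized.
struct Row {
    std::string name;
    std::vector<std::string> numbers;
};

// Builds "INSERT INTO t (cols) VALUES (...),(...)" for rows [first, last).
// NOTE: `name` is embedded as-is; escape it before using this for real.
std::string build_batch_insert(const std::string &table,
                               const std::string &columns,
                               const std::vector<Row> &rows,
                               std::size_t first, std::size_t last)
{
    std::string query = "INSERT INTO " + table + " (" + columns + ") VALUES ";
    for (std::size_t i = first; i < last; ++i) {
        if (i != first) query += ",";          // separate the row tuples
        query += "('" + rows[i].name + "'";
        for (const std::string &n : rows[i].numbers) {
            query += ", ";
            query += n;
        }
        query += ")";
    }
    return query;
}
```

The result would then be passed to mysql_query() once per batch (say, every 100 lines) instead of once per line.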
Do you need to insert 120M rows per day? That's roughly 1,400 per second (120,000,000 / 86,400). Have you calculated how soon you will run out of disk space?
Let's see SHOW CREATE TABLE. Don't use BIGINT (8 bytes) when INT (4 bytes) will suffice. Don't use INT when MEDIUMINT (3 bytes) will do. Etc.
What will you do with a zillion rows of data? Keep in mind that a poorly formed SELECT against a poorly indexed table will take a long time, even with SSDs.
Can the one string be normalized?
Think about packing a bunch of booleans into a SET (1 byte per 8 booleans) or some size int.
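For the "some size int" variant, the packing can be done on the client before the INSERT. This is only a sketch: pack_flags() and get_flag() are made-up helper names, and it assumes the packed value would be stored in a single small unsigned integer column.

```cpp
#include <cstdint>
#include <initializer_list>

// Packs up to 8 flags into one byte; the first flag lands in bit 0.
std::uint8_t pack_flags(std::initializer_list<bool> flags)
{
    std::uint8_t packed = 0;
    unsigned bit = 0;
    for (bool f : flags) {
        if (bit >= 8) break;                   // ignore anything past 8 flags
        if (f) packed |= static_cast<std::uint8_t>(1u << bit);
        ++bit;
    }
    return packed;
}

// Reads flag number `bit` (0 to 7) back out of a packed byte.
bool get_flag(std::uint8_t packed, unsigned bit)
{
    return ((packed >> bit) & 1u) != 0;
}
```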
Let's see SHOW CREATE TABLE and the main SELECTs.
What is the value of innodb_flush_log_at_trx_commit? Use 2.
I'm rather new to C++ and I'm struggling with the following problem:
I'm parsing syslog messages from iptables. Every message looks like:
192.168.1.1:20200:Dec 11 15:20:36 SRC=192.168.1.5 DST=8.8.8.8 LEN=250
And I need to quickly (since new messages are coming very fast) parse the string to get SRC, DST and LEN.
If it were a simple program, I'd use std::find to find the index of the SRC substring, then loop, adding each subsequent char to an array until I hit whitespace. Then I'd do the same for DST and LEN.
For example,
std::string x = "15:30:20 SRC=192.168.1.1 DST=15.15.15.15 LEN=255";
std::string substr;
std::cout << "Original string: \"" << x << "\"" << std::endl;
// Below "magic number" 4 means length of "SRC=" string
// which is the same for "DST=" and "LEN="
// For SRC
auto npos = x.find("SRC");
if (npos != std::string::npos) {
    substr = x.substr(npos + 4, x.find(" ", npos) - (npos+4));
    std::cout << "SRC: " << substr << std::endl;
}
// For DST
npos = x.find("DST");
if (npos != std::string::npos) {
    substr = x.substr(npos + 4, x.find(" ", npos) - (npos + 4));
    std::cout << "DST: " << substr << std::endl;
}
// For LEN
npos = x.find("LEN");
if (npos != std::string::npos) {
    substr = x.substr(npos + 4, x.find('\0', npos) - (npos + 4));
    std::cout << "LEN: " << substr << std::endl;
}
However, in my situation, I need to do this really quickly, ideally in one iteration.
Could you please give me some advice on this?
If your format is fixed and verified (you can accept undefined behavior as soon as the input string doesn't contain exactly the expected characters), then you might squeeze out some performance by writing larger parts by hand and skipping the string termination tests that are part of all standard functions.
// buf_ptr will be updated to point to the first character after the " SRC=x.x.x.x" sequence
unsigned long GetSRC(const char*& buf_ptr)
{
    // Don't search like this unless you have a trusted input format that's guaranteed to contain " SRC="!!!
    while (*buf_ptr != ' ' ||
           *(buf_ptr + 1) != 'S' ||
           *(buf_ptr + 2) != 'R' ||
           *(buf_ptr + 3) != 'C' ||
           *(buf_ptr + 4) != '=')
    {
        ++buf_ptr;
    }
    buf_ptr += 5;
    char* next;
    long part = std::strtol(buf_ptr, &next, 10);
    // part is now the first number of the IP. Depending on your requirements you may want to extract the string instead
    unsigned long result = (unsigned long)part << 24;
    // Don't use 'next + 1' like this unless you have a trusted input format!!!
    part = std::strtol(next + 1, &next, 10);
    // part is now the second number of the IP. Depending on your requirements ...
    result |= (unsigned long)part << 16;
    part = std::strtol(next + 1, &next, 10);
    // part is now the third number of the IP. Depending on your requirements ...
    result |= (unsigned long)part << 8;
    part = std::strtol(next + 1, &next, 10);
    // part is now the fourth number of the IP. Depending on your requirements ...
    result |= (unsigned long)part;
    // update the buf_ptr so searching for the next information ( DST=x.x.x.x) starts at the end of the currently parsed parts
    buf_ptr = next;
    return result;
}
Usage:
const char* x_str = x.c_str();
unsigned long srcIP = GetSRC(x_str);
// now x_str will point to " DST=15.15.15.15 LEN=255" for further processing
std::cout << "SRC=" << (srcIP >> 24) << "." << ((srcIP >> 16) & 0xff) << "." << ((srcIP >> 8) & 0xff) << "." << (srcIP & 0xff) << std::endl;
Note I decided to write the whole extracted source IP into a single 32 bit unsigned. You can decide on a completely different storage model if you want.
Even if you can't be optimistic about your format, using a pointer that is updated whenever a part is processed and continuing with the remaining string instead of starting at 0 might be a good idea to improve performance.
Of course, I suppose your std::cout << ... lines are just for development testing, because otherwise all micro-optimization becomes useless anyway.
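For input that cannot be trusted, the "keep a position and continue from there" idea can be sketched with bounds-checked std::string calls. next_field() is a hypothetical helper written for this illustration; it assumes fields are separated by a single space and appear in a known order.

```cpp
#include <cstddef>
#include <string>

// Extracts the value following `key` (e.g. "SRC="), searching from `pos` onward.
// On success `pos` is advanced past the value so the next search resumes there.
// Returns false and leaves `pos` untouched if the key is not found.
bool next_field(const std::string &line, const std::string &key,
                std::size_t &pos, std::string &value)
{
    const std::size_t start = line.find(key, pos);
    if (start == std::string::npos) return false;
    const std::size_t begin = start + key.size();
    std::size_t end = line.find(' ', begin);
    if (end == std::string::npos) end = line.size();   // last field on the line
    value.assign(line, begin, end - begin);
    pos = end;
    return true;
}
```

Calling it three times in a row with "SRC=", "DST=" and "LEN=" and the same pos variable walks the line once instead of restarting from index 0 for every key.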
"quickly, ideally in one iteration" - in reality, the speed of your program does not depend on the number of loops that are visible in your source code. Especially regex'es are a very good way to hide multiple nested loops.
Your solution is actually pretty good. It doesn't waste much time prior to finding "SRC", and doesn't search further than necessary to retrieve the IP address. Sure, when searching for "SRC" it has a false positive on the first "S" of "Sep", but that is resolved by the next compare. If you know for certain that the first occurrence of "SRC" is somewhere around column 20, you might gain a tiny bit of speed by skipping those first 20 characters. (Check your logs, I can't tell.)
You can use std::regex, e.g.:
std::string x = "15:30:20 SRC=192.168.1.1 DST=15.15.15.15 LEN=255";
std::regex const r(R"(SRC=(\S+) DST=(\S+) LEN=(\S+))");
std::smatch matches;
if(regex_search(x, matches, r)) {
    std::cout << "SRC " << matches.str(1) << '\n';
    std::cout << "DST " << matches.str(2) << '\n';
    std::cout << "LEN " << matches.str(3) << '\n';
}
Note that matches.str(idx) creates a new string with the match. Using matches[idx] you can get the iterators to the sub-string without creating a new string.
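To illustrate the no-copy route, here is a small sketch that turns a capture group into a std::string_view built from the sub-match iterators. group_view() is a hypothetical helper, it assumes C++17, and the returned view is only valid while the searched string is alive and unmodified.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <string_view>

// Returns a view over capture group `idx`; no characters are copied.
// An unmatched or empty group yields an empty view.
std::string_view group_view(const std::smatch &m, std::size_t idx)
{
    const std::ssub_match &sub = m[idx];
    if (!sub.matched || sub.length() == 0) return {};
    // sub.first points at the first matched character inside the original string
    return std::string_view(&*sub.first,
                            static_cast<std::size_t>(sub.length()));
}
```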
I'm working on an assignment where I need to print out the ASCII table in the table format exactly like the picture below.
http://i.gyazo.com/f1a8625aad1d55585df20f4dba920830.png
I currently can't get the special words/symbols to display (8, 9, 10, 13, 27, 32, 127).
Here it is running:
http://i.gyazo.com/80c8ad48ef2993e93ef9b8feb30e53af.png
Here is my current code:
#include <iomanip>
#include <iostream>
using namespace std;
int main()
{
cout<<"ASCII TABLE:"<<endl;
cout<<endl;
for (int i = 0; i < 128; i++)
{
if (i <= 32)
cout << "|" << setw(2)
<<i
<< setw(3)
<< "^" << char (64+i) <<"|";
if (i >= 33)
cout << "|" << setw(3)
<<i
<< setw(3)
<<char (i) << "|";
if((i+1)%8 == 0) cout << endl;
}
return 0;
}
8 Back Space
9 Horizontal Tab
10 New Line
13 carriage return
27 Escape (Esc)
32 Space
127 Del
As noted above, these ASCII characters don't display any visible or printed character. That's why you might think you are not getting these values.
I'm not sure what your real problem is, but you haven't had an answer yet about how to print the special codes.
Running your programme I see that you have some minor alignment problems. If that's the problem, note that setw(3) only applies to the next element:
cout << setw(3) << "^" << char (64+i); // prints "  ^A" (four characters) instead of " ^A".
If you try to correct it into
cout << setw(3) << "^"+ char (64+i); // ouch !!!!
you'll get undefined behaviour (garbage) because "^" is a pointer to a string and adding char(64+i) is understood as adding an offset of 64+i to this pointer. As this is a rather random address, you'll get garbage. Use a std::string instead.
The other difference I see between your programme's output and the expected result is that you don't print the code of the special chars. If that's the problem, either use a switch statement (very repetitive here), or a lot of if/else or use an associative map.
Here is an alternative proposal putting all this together:
map<char, string>special{ { 8, "BS " }, { 9, "\\t " }, { 10, "\\n " }, { 13, "CR " }, { 27, "ESC" }, { 32, "SP " }, { 127, "DEL" } };
cout << "ASCII TABLE:" << endl << endl;
for (int i = 0; i < 128; i++) {
    cout << "|" << setw(3) << i << setw(4);   // setw() only applies to next
    if (iscntrl(i) || isspace(i)) {            // if it's a control char or a space
        auto it = special.find(i);             // look if there's a special translation
        if (it != special.end())               // if yes, use it
            cout << it->second;
        else cout << string("^") + char(64 + i) + string(" ");   // if not, ^x, using strings
    }
    else if (isprint(i))   // Sorry, I'm paranoid: I always imagine that there could be a non-printable ctrl ;-)
        cout << char(i) + string(" ");         // print normal char
    cout << "|";
    if ((i + 1) % 8 == 0) cout << endl;
}
Now some additional advice:
take the effort to indent
instead of manual categorization of chars, use iscntrl(), isspace(), isprint(). As long as you only use ASCII, it's manageable to do it the way you did. But as soon as you move to internationalisation and wide chars it becomes increasingly cumbersome, whereas there are easy wide equivalents like iswcntrl(), iswspace(), iswprint().
also be rigorous with two consecutive ifs: if you know that only one of the two should apply, make the effort to write if ... else if. These four additional letters can save you hours of debugging later.
I am trying to print the value of a const but it is not working. I am making a return to C++ after years so I know casting is a possible solution but I can't get that working either.
The code is as follows:
//the number of blanks surrounding the greeting
const int pad = 0;
//the number of rows and columns to write
const int rows = pad * 2 + 3;
const string::size_type cols = greeting.size() + pad * 2 + 2;
cout << endl << "Rows : " + rows;
I am trying to print the value of 'rows' without success.
You want:
cout << endl << "Rows : " << rows;
Note this has nothing to do with const - C++ does not allow you to concatenate strings and numbers with the + operator. What you were actually doing was that mysterious thing called pointer arithmetic.
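To make the pointer arithmetic concrete: with pad = 0 the question's rows is 3, so "Rows : " + rows is a pointer three characters into the literal, and the original statement prints "s : ". The sketch below shows that offset next to an explicit conversion; offset_into() and labelled() are names invented for this illustration.

```cpp
#include <string>

// What `literal + n` really does: the array decays to a const char*, and +
// moves the pointer n characters forward. Nothing is appended.
// (n must stay within the literal, otherwise the behaviour is undefined.)
const char *offset_into(const char *literal, int n)
{
    return literal + n;
}

// If a single combined string is actually wanted, convert the number first.
std::string labelled(const std::string &label, int value)
{
    return label + std::to_string(value);
}
```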
You're almost there:
cout << endl << "Rows : " << rows;
The error arises because "Rows : " is a string literal, which is a constant character array; applying + to it does not append anything to it as you might expect.
Going slightly further, you likely used + (colloquially used as a concatenation operation) assuming you needed to build a string to give to the output stream. Instead operator << returns the output stream when it is done, allowing chaining.
// It is almost as if you did:
(((cout << endl) << "Rows : ") << rows);
I think you want:
std::cout << std::endl << "Rows : " << rows << std::endl;
I make this mistake all the time as I also work with Java a lot.
As others have pointed out, you need
std::cout << std::endl << "Rows : " << rows << std::endl;
The reason (or one of the reasons) is that "Rows : " decays to a const char*, and the + operator on a char* performs pointer arithmetic rather than concatenating strings, unlike the + for std::string and for strings in languages like Java and Python.
How can we split a std::string and a null terminated character array into two halves such that both have same length?
Please suggest an efficient method for the same. You may assume that the length of the original string/array is always an even number.
By efficient I mean using fewer bytes in both cases, since something using loops and a buffer is not what I am looking for.
std::string s = "string_split_example";
std::string half = s.substr(0, s.length()/2);
std::string otherHalf = s.substr(s.length()/2);
cout << s.length() << " : " << s << endl;
cout << half.length() << " : " << half << endl;
cout << otherHalf.length() << " : " << otherHalf << endl;
Output:
20 : string_split_example
10 : string_spl
10 : it_example
Online Demo : http://www.ideone.com/fmYrO
You've already received a C++ answer, but here's a C answer:
int len = strlen(strA);
char *strB = malloc(len/2+1);
strncpy(strB, strA+len/2, len/2+1);
strA[len/2] = '\0';
Obviously, this uses malloc() to allocate memory for the second string, which you will have to free() at some point.
I am using the following function to loop through a couple of open CDB hash tables. Sometimes the value for a given key is returned along with an additional character (specifically a CTRL-P (a DLE character/0x10/0o020)).
I have checked the cdb key/value pairs with a couple of different utilities and none of them show any additional characters appended to the values.
I get the character if I use cdb_read() or cdb_getdata() (the commented out code below).
If I had to guess I would say I am doing something wrong with the buffer I create to get the result from the cdb functions.
Any advice or assistance is greatly appreciated.
char* HashReducer::getValueFromDb(const string &id, vector <struct cdb *> &myHashFiles)
{
    unsigned char hex_value[BUFSIZ];
    size_t hex_len;
    //construct a real hex (not ascii-hex) value to use for database lookups
    atoh(id,hex_value,&hex_len);
    char *value = NULL;
    vector <struct cdb *>::iterator my_iter = myHashFiles.begin();
    vector <struct cdb *>::iterator my_end = myHashFiles.end();
    try
    {
        //while there are more databases to search and we have not found a match
        for(; my_iter != my_end && !value ; my_iter++)
        {
            //cerr << "\n looking for this MD5:" << id << " hex(" << hex_value << ") \n";
            if (cdb_find(*my_iter, hex_value, hex_len)){
                //cerr << "\n\nI found the key " << id << " and it is " << cdb_datalen(*my_iter) << " long\n\n";
                value = (char *)malloc(cdb_datalen(*my_iter));
                cdb_read(*my_iter,value,cdb_datalen(*my_iter),cdb_datapos(*my_iter));
                //value = (char *)cdb_getdata(*my_iter);
                //cerr << "\n\nThe value is:" << value << " len is:" << strlen(value)<< "\n\n";
            };
        }
    }
    catch (...){}
    return value;
}
First, I am not familiar with CDB and I don't believe you include enough details about your software environment here.
But assuming it is like other database libraries I've used...
The values probably don't have to be NUL-terminated. That means that casting to char* and printing it will not work. You should add a 0 byte yourself.
So malloc cdb_datalen + 1 and set the last character to 0. Then print it.
Better yet, use calloc and it will allocate memory already set to zero.
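A minimal sketch of that fix, kept independent of the CDB library: copy_terminated() is a hypothetical helper that stands in for the malloc plus cdb_read pair in the question, taking raw bytes and a length and returning a buffer that is one byte longer and zero-terminated.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Copies `len` raw bytes into a freshly allocated buffer of len + 1 bytes.
// calloc zero-fills the buffer, so the extra byte is already the terminator.
// The caller owns the result and must free() it. Returns nullptr on failure.
char *copy_terminated(const void *data, std::size_t len)
{
    char *out = static_cast<char *>(std::calloc(len + 1, 1));
    if (out == nullptr) return nullptr;
    if (len > 0) std::memcpy(out, data, len);
    return out;
}
```

In the question's function the same idea would mean allocating cdb_datalen(*my_iter) + 1 bytes and reading cdb_datalen(*my_iter) bytes into them, so that strlen() and stream output stop at the right place.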