How to quickly find and substring multiple items from string in C++? - c++

I'm rather new to C++ and I'm struggling with the following problem:
I'm parsing syslog messages from iptables. Every message looks like:
192.168.1.1:20200:Dec 11 15:20:36 SRC=192.168.1.5 DST=8.8.8.8 LEN=250
And I need to quickly (since new messages are coming very fast) parse the string to get SRC, DST and LEN.
If it was a simple program, I'd use std::string::find to find the index of the SRC substring, then in a loop add every following char to an array until I encounter a whitespace. Then I'd do the same for DST and LEN.
For example,
std::string x = "15:30:20 SRC=192.168.1.1 DST=15.15.15.15 LEN=255";
std::string substr;
std::cout << "Original string: \"" << x << "\"" << std::endl;
// Below "magic number" 4 means length of "SRC=" string
// which is the same for "DST=" and "LEN="
// For SRC
auto npos = x.find("SRC");
if (npos != std::string::npos) {
substr = x.substr(npos + 4, x.find(" ", npos) - (npos+4));
std::cout << "SRC: " << substr << std::endl;
}
// For DST
npos = x.find("DST");
if (npos != std::string::npos) {
substr = x.substr(npos + 4, x.find(" ", npos) - (npos + 4));
std::cout << "DST: " << substr << std::endl;
}
// For LEN
npos = x.find("LEN");
if (npos != std::string::npos) {
substr = x.substr(npos + 4, x.find('\0', npos) - (npos + 4));
std::cout << "LEN: " << substr << std::endl;
}
However, in my situation, I need to do this really quickly, ideally in one iteration.
Could you please give me some advice on this?

If your format is fixed and verified (you can accept undefined behavior as soon as the input string doesn't contain exactly the expected characters), then you might squeeze out some performance by writing larger parts by hand and skip the string termination tests that will be part of all standard functions.
// buf_ptr will be updated to point to the first character after the " SRC=x.x.x.x" sequence
unsigned long GetSRC(const char*& buf_ptr)
{
// Don't search like this unless you have a trusted input format that's guaranteed to contain " SRC="!!!
while (*buf_ptr != ' ' ||
*(buf_ptr + 1) != 'S' ||
*(buf_ptr + 2) != 'R' ||
*(buf_ptr + 3) != 'C' ||
*(buf_ptr + 4) != '=')
{
++buf_ptr;
}
buf_ptr += 5;
char* next;
long part = std::strtol(buf_ptr, &next, 10);
// part is now the first number of the IP. Depending on your requirements you may want to extract the string instead
unsigned long result = (unsigned long)part << 24;
// Don't use 'next + 1' like this unless you have a trusted input format!!!
part = std::strtol(next + 1, &next, 10);
// part is now the second number of the IP. Depending on your requirements ...
result |= (unsigned long)part << 16;
part = std::strtol(next + 1, &next, 10);
// part is now the third number of the IP. Depending on your requirements ...
result |= (unsigned long)part << 8;
part = std::strtol(next + 1, &next, 10);
// part is now the fourth number of the IP. Depending on your requirements ...
result |= (unsigned long)part;
// update the buf_ptr so searching for the next information ( DST=x.x.x.x) starts at the end of the currently parsed parts
buf_ptr = next;
return result;
}
Usage:
const char* x_str = x.c_str();
unsigned long srcIP = GetSRC(x_str);
// now x_str will point to " DST=15.15.15.15 LEN=255" for further processing
std::cout << "SRC=" << (srcIP >> 24) << "." << ((srcIP >> 16) & 0xff) << "." << ((srcIP >> 8) & 0xff) << "." << (srcIP & 0xff) << std::endl;
Note I decided to write the whole extracted source IP into a single 32 bit unsigned. You can decide on a completely different storage model if you want.
Even if you can't be optimistic about your format, using a pointer that is updated whenever a part is processed and continuing with the remaining string instead of starting at 0 might be a good idea to improve performance.
Of course, I suppose your std::cout << ... lines are just for development testing, because otherwise all micro-optimization becomes useless anyway.
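For input that cannot be trusted, the "keep a position and continue from there" idea can be sketched with the bounds-checked std::string API. This is only a sketch: next_field is a hypothetical helper name, and it assumes every value ends at the next space or at the end of the line.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: finds `key` at or after `pos`, copies the value that
// follows it (up to the next space or the end of the line) into `out`, and
// advances `pos` to the end of that value so the next search resumes there.
bool next_field(const std::string& line, const std::string& key,
                std::size_t& pos, std::string& out) {
    const std::size_t start = line.find(key, pos);
    if (start == std::string::npos) return false;  // key not present

    const std::size_t value_begin = start + key.size();
    std::size_t value_end = line.find(' ', value_begin);
    if (value_end == std::string::npos) value_end = line.size();

    out.assign(line, value_begin, value_end - value_begin);
    pos = value_end;  // never rescan what was already consumed
    return true;
}
```

Calling it three times in a row with "SRC=", "DST=" and "LEN=" and the same pos variable walks the line once from left to right.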

"quickly, ideally in one iteration" - in reality, the speed of your program does not depend on the number of loops that are visible in your source code. Regexes in particular are a very good way to hide multiple nested loops.
Your solution is actually pretty good. It doesn't waste much time prior to finding "SRC", and doesn't search further than necessary to retrieve the IP address. Sure, when searching for "SRC" it has a false positive on the first "S" of "Sep", but that is solved by the next compare. If you know for certain that the first occurrence of "SRC" is somewhere in column 20, you might save just a tiny bit of time by skipping those first 20 characters. (Check your logs, I can't tell.)

You can use std::regex, e.g.:
std::string x = "15:30:20 SRC=192.168.1.1 DST=15.15.15.15 LEN=255";
std::regex const r(R"(SRC=(\S+) DST=(\S+) LEN=(\S+))");
std::smatch matches;
if(regex_search(x, matches, r)) {
std::cout << "SRC " << matches.str(1) << '\n';
std::cout << "DST " << matches.str(2) << '\n';
std::cout << "LEN " << matches.str(3) << '\n';
}
Note that matches.str(idx) creates a new string with the match. Using matches[idx] you can get the iterators to the sub-string without creating a new string.
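As a rough illustration of working on the sub-match iterators directly, the sketch below parses LEN into an int without building a temporary string. parse_len is a hypothetical helper, the pattern is tightened to \d+ for the last group, and overflow on very long digit runs is not handled.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: reads the digits of the LEN field straight from the
// iterator range [m[3].first, m[3].second), so no substring is allocated.
bool parse_len(const std::string& line, int& len) {
    static const std::regex r(R"(SRC=(\S+) DST=(\S+) LEN=(\d+))");
    std::smatch m;
    if (!std::regex_search(line, m, r)) return false;

    int value = 0;
    for (auto it = m[3].first; it != m[3].second; ++it) {
        value = value * 10 + (*it - '0');  // group 3 only contains digits
    }
    len = value;
    return true;
}
```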

Related

Getting coefficients from a string

I have a project to write a program that receives a polynomial string from the user up to the 5th power (ex. x^3+6x^2+9x+24) and prints out all the real and imaginary roots. The coefficients should be stored in a dynamic array.
The problem is getting these coefficients from the string. One of the coefficients can be a 0 (ex. 2x^2-18) so I can't store the coefficients from left to right by using an increment, because in this case a=2, b=-18, and c has no value, which is wrong.
Another problem is if the coefficient is 1, because in this case nothing will be written beside the x for the program to read (ex. x^2-x+14). Another problem is if the user adds a space, several, or none (ex. x ^3 +4x^ 2- 12 x + 1 3).
I have been thinking of pseudocode for a long time now, but nothing is coming to mind. I thought of detecting numbers from left to right and reading numbers and stopping at x, but the first and second problems occur. I thought of finding each x and then checking the numbers before it, but the second problem occurs, and also I don't know how big the number the user inputs.
Here is another Regex that you can use to get your coefficients after deleting whitespace characters:
(\d*)(x?\^?)(\d*)
It uses groups (indicated by the brackets). Every match has 3 groups:
(1) Your coefficient
(2) x^n, x or nothing
(3) The exponent
If (1) is null (i.e. does not exist), it means your coefficient is 1.
If (2) and (3) are null, you have the last single number without x.
If only (3) is null, you have a single x without ^n.
You can try some examples on online regex sites like this one, where you can see the results on the right.
There are many tutorials online how to use Regex with C++.
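A minimal sketch of the group-based idea, under a few assumptions of mine: the pattern below is a variant that also captures the sign, parse_poly is a hypothetical helper, and coefficients are stored by exponent (coeffs[n] belongs to x^n) so that missing powers simply stay 0.

```cpp
#include <algorithm>
#include <cctype>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns coefficients indexed by exponent.
std::vector<int> parse_poly(std::string s, int max_degree = 5) {
    // Normalize: drop every whitespace character first.
    s.erase(std::remove_if(s.begin(), s.end(),
                           [](unsigned char c) { return std::isspace(c) != 0; }),
            s.end());

    std::vector<int> coeffs(max_degree + 1, 0);
    // Groups: (1) sign, (2) digits, (3) optional x, (4) optional exponent.
    static const std::regex term(R"(([+-]?)(\d*)(x?)(?:\^(\d+))?)");

    for (std::sregex_iterator it(s.begin(), s.end(), term), end; it != end; ++it) {
        const std::smatch& m = *it;
        if (m.length(0) == 0) continue;                // empty match, nothing to read
        const bool has_x = m[3].length() > 0;
        if (!has_x && m[2].length() == 0) continue;    // stray sign or caret

        int value = m[2].length() > 0 ? std::stoi(m[2].str()) : 1;  // "x" means 1x
        if (m[1].str() == "-") value = -value;

        const int exponent = !has_x ? 0 : (m[4].matched ? std::stoi(m[4].str()) : 1);
        if (exponent <= max_degree) coeffs[exponent] += value;
    }
    return coeffs;
}
```

Indexing by exponent sidesteps the "2x^2-18" problem from the question, because the constant lands in slot 0 regardless of how many terms are missing.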
You should normalize your input string, for example, remove all space then parse coefficients.
See my example below and adapt it to your case.
#include <iostream>
#include <regex>
#include <iterator>
#include <string>
#include <vector>
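#include <algorithm>
#include <cctype>
int main(int argc, char *argv[]) {
if (argc < 2) return 1; // the polynomial is expected as the first argument
std::string input {argv[1]};
// Strip all whitespace; the lambda avoids passing a negative char to std::isspace
input.erase(std::remove_if(input.begin(), input.end(), [](unsigned char c) { return std::isspace(c) != 0; }), input.end());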
std::cout << input << std::endl;
std::vector<int> coeffs;
std::regex poly_regex(R"(\s*\+?\-?\s*\d*\s*x*\^*\s*\d*)");
auto coeff_begin = std::sregex_iterator(input.begin(), input.end(), poly_regex);
auto coeff_end = std::sregex_iterator();
for (std::sregex_iterator i = coeff_begin; i != coeff_end; ++i) {
std::smatch match = *i;
std::string match_str = match.str();
// std::cout << " " << match_str << "\n";
std::size_t plus_pos = match_str.find('+');
std::size_t minus_pos = match_str.find('-');
std::size_t x_pos = match_str.find('x');
if (x_pos == std::string::npos) {
std::cout << match_str.substr(plus_pos + 1) << std::endl;
} else if (x_pos == 0) {
std::cout << 1 << std::endl;
} else if (minus_pos != std::string::npos) {
if (x_pos - minus_pos == 1) std::cout << -1 << std::endl;
else std::cout << match_str.substr(minus_pos, x_pos - minus_pos) << std::endl;
}
else {
std::cout << match_str.substr(plus_pos + 1, x_pos - plus_pos - 1) << std::endl;
}
}
for (auto i: coeffs) std::cout << i << " ";
return 0;
}

C++ String to byte

So I have a string like this: std::string MyString = "\\xce\\xc6";
When I print it like this: std::cout << MyString.c_str()[0] << std::endl;
I get this as output: \
And I want it to be like this: std::string MyDesiredString = "\xce\xc6";
So when I do:
std::cout << MyDesiredString.c_str()[0] << std::endl;
// OUTPUT: \xce (the whole byte)
So basically I want to identify the string (that represents bytes) and convert it to an array of real bytes.
I came up with a function like this:
// this is a pseudo code i'm sure it has a lot of bugs and may not even work
// just for example for what i think
char str_to_bytes(const char* MyStr) { // MyStr length == 4 (\\xc6)
std::map<char*, char> MyMap = { {"\\xce", '\xce'}, {"\\xc6", 'xc6'} } // and so on
return MyMap[MyStr]
}
//if the provided char* is "\\xc6" it should return the char '\xc6'
But I believe there must be a better way to do it.
As much as I have searched, I haven't found anything useful.
Thanks in advance.
Try something like this:
std::string teststr = "\\xce\\xc6";
std::string delimiter = "\\x";
size_t pos = 0;
std::string token;
std::string res;
while ((pos = teststr.find(delimiter)) != std::string::npos) {
token = teststr.substr(pos + delimiter.length(), 2);
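res.push_back(static_cast<char>(std::stol(token, nullptr, 16)));
std::cout << std::stol(token, nullptr, 16) << std::endl;
// erase() takes (position, count): remove the "\x" marker plus its two hex digits
teststr.erase(pos, delimiter.length() + 2);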
}
std::cout << res << std::endl;
Take your string, split it up by the literals indicating a hex. value is provided (\x) and then parse the two hex. characters with the stol function as Igor Tandetnik mentioned. You can then of course add those byte values to a string.
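The same idea can be sketched without modifying the input, by walking an index through the string. unescape_hex is a hypothetical name, and the sketch assumes every escape is exactly a backslash, an x and two hex digits; anything else is copied through unchanged.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: turns each literal "\xHH" sequence into the byte 0xHH.
std::string unescape_hex(const std::string& in) {
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const bool is_escape =
            i + 4 <= in.size() && in[i] == '\\' && in[i + 1] == 'x' &&
            std::isxdigit(static_cast<unsigned char>(in[i + 2])) &&
            std::isxdigit(static_cast<unsigned char>(in[i + 3]));
        if (is_escape) {
            // Convert the two hex digits and store them as a single byte.
            out.push_back(static_cast<char>(std::stoi(in.substr(i + 2, 2), nullptr, 16)));
            i += 4;
        } else {
            out.push_back(in[i]);  // ordinary character, copy as-is
            ++i;
        }
    }
    return out;
}
```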

SQL using C++ - Where is the bottleneck?

I'm attempting to read a plain text file line by line, construct an SQL INSERT statement, execute the query, and move on. Presently, I've got a solution that can complete about 200 lines per second on my ~4 year old desktop. However, I've got about 120 million lines to go through and was looking to implement this as a daily task. Taking a few hours to complete it would be fine, but taking nearly a week isn't an option.
The lines will contain one string and anywhere from 5 to 9 integers that range from boolean values (which I've encoded as TINYINT(1)) to microseconds since midnight (BIGINT).
Once read in from the file (via getline()) the lines are tokenized by this function:
#define MAX_TOKENS 10
#define MAX_TOKEN_LENGTH 32
char tokens[MAX_TOKENS][MAX_TOKEN_LENGTH];
//...
void split_line(const string &s)
{
char raw_string[MAX_TOKENS * MAX_TOKEN_LENGTH];
char *rest;
char *token_string;
strcpy(raw_string, s.c_str());
if(tokens[0][0] != '\0')
{
fill(tokens[0], tokens[0]+(MAX_TOKENS*MAX_TOKEN_LENGTH), '\0');
}
for(uint32_t token = 0; token < MAX_TOKENS; token++)
{
if(token == 0) token_string = strtok_r(raw_string, " ", &rest);
else token_string = strtok_r(nullptr, " ", &rest);
if(token_string == nullptr) break;
if(token >= 1)
{
//if it's not a number...
if(token_string[0] < 48 || token_string[0] > 57)
{
if(token_string[0] != 45) //negative numbers are allowed
{
clear_tokens();
break;
}
}
}
strcpy(tokens[token], token_string);
}
}
I had tried a more STL derived version of that tokenizer, but that was proving too slow. It still ranks high in the callgraph, but not as high as it did with proper STL strings.
Anyway, the next step is to build the SQL query. For this, I've tried a few things. One option was stringstreams.
string insert_query = "INSERT INTO data_20170222";
stringstream values;
string query;
while(getline(input_stream, input_stream_line))
{
split_line(input_stream_line);
if(tokens[5][0] != '\0') //the smallest line will have six tokens
{
try
{
query = insert_query;
uint32_t item_type = stoi(tokens[2]);
switch(item_type)
{
case 0: //one type of item
case 1: //another type of item
{
values << " (valueA, valueB, valueC, valueD, valueE, valueF,"
" valueG, valueH) values('"
<< tokens[0] << "', " << tokens[1] << ", "
<< tokens[2] << ", " << tokens[3] << ", "
<< tokens[4] << ", " << tokens[5] << ", "
<< tokens[6] << ", " << tokens[7] << ")";
break;
}
//...
}
query.append(values.str());
values.str(string());
values.clear();
if(mysql_query(conn, query.c_str()))
{
string error(mysql_error(conn));
mysql_close(conn);
throw runtime_error(error);
}
}
catch(exception &ex)
{
cerr << "Error parsing line\n '" << input_stream_line
<< "'\n" << " " << ex.what() << endl;
throw;
}
}
} // end of while(getline(...))
When I run this version, I see 30% of callgrind's samples being measured in std::operator<< within std::basic_ostream.
I originally tried doing this all with strings, ala:
string values;
values = " (valueA, valueB, valueC, valueD, valueE, valueF,"
" valueG, valueH) values('" +
string(tokens[0]) + "', " + tokens[1] + ", " +
tokens[2] + ", " + tokens[3] + ", " +
tokens[4] + ", " + tokens[5] + ", " +
tokens[6] + ", " + tokens[7] + ")";
That proves to be effectively the same speed, but this time with the 30% of samples being allocated to std::operator+ from std::basic_string.
And finally, I switched to straight sprintf().
char values[MAX_TOKENS * MAX_TOKEN_LENGTH];
sprintf(values, " (valueA, valueB, valueC, valueD, valueE, valueF,"
" valueG, valueH) values('%s', %s, %s, %s, %s, %s, %s, %s)",
tokens[0], tokens[1], tokens[2], tokens[3],
tokens[4], tokens[5], tokens[6], tokens[7]);
stringstream was slightly faster than string (though, well within a reasonable margin of error). sprintf() was about 10% faster than both, but that's still not fast enough.
Surely there's a well-established method for accomplishing this task with such large data sets. I'd be grateful for any guidance at this point.
EDIT
Oh wow. I commented out the call to mysql_query() on a whim. Turns out, despite what valgrind says, that's where all my slowdown is. Without that block, it jumps from 200 lines per second to 1.2 million lines per second. That's more like it! Too bad I need the data in a database...
I guess this has become a question about why MariaDB seems to be operating so slowly, now. I've got a good SSD in this system, 16GB RAM, etc. It strikes me as unlikely that my hardware is holding it back.
All the more curious. Thanks in advance for any help!
Batch INSERTing 100 rows per INSERT statement will run 10 times as fast.
Do you need to insert 120M rows per day? That's 1400 per second. Have you calculated how soon before you run out of disk space?
Let's see SHOW CREATE TABLE. Don't use BIGINT (8 bytes) when INT (4 bytes) will suffice. Don't use INT when MEDIUMINT (3 bytes) will do. Etc.
What will you do with a zillion rows of data? Keep in mind that a poorly formed SELECT against a poorly indexed table will take a long time, even with SSDs.
Can the one string be normalized?
Think about packing a bunch of booleans into a SET (1 byte per 8 booleans) or some size int.
Let's see SHOW CREATE TABLE and the main SELECTs.
What is the value of innodb_flush_log_at_trx_commit? Use 2.
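The batching advice above can be sketched roughly as follows. This is only an illustration: build_batch_insert is a made-up helper, the table and column text is a placeholder, and it assumes each tuple string has already been formatted and escaped.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: joins pre-formatted "(v1, v2, ...)" tuples into one
// multi-row INSERT statement instead of issuing one statement per row.
std::string build_batch_insert(const std::string& table_and_columns,
                               const std::vector<std::string>& tuples) {
    std::string query = "INSERT INTO " + table_and_columns + " VALUES ";
    for (std::size_t i = 0; i < tuples.size(); ++i) {
        if (i != 0) query += ", ";  // separate rows with commas
        query += tuples[i];
    }
    return query;
}
```

In the question's loop, one would collect the formatted tuples in a vector and call mysql_query(conn, query.c_str()) once per 100 or so rows, plus once more for the remainder at the end of the file.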

How to speed up counting the occurrences of a word in large files?

I need to count the occurrences of the string "<page>" in a 104gb file, for getting the number of articles in a given Wikipedia dump. First, I've tried this.
grep -F '<page>' enwiki-20141208-pages-meta-current.xml | uniq -c
However, grep crashes after a while. Therefore, I wrote the following program. However, it only processes 20mb/s of the input file on my machine which is about 5% workload of my HDD. How can I speed up this code?
#include <iostream>
#include <fstream>
#include <string>
int main()
{
// Open up file
std::ifstream in("enwiki-20141208-pages-meta-current.xml");
if (!in.is_open()) {
std::cout << "Could not open file." << std::endl;
return 0;
}
// Statistics counters
size_t chars = 0, pages = 0;
// Token to look for
const std::string token = "<page>";
size_t token_length = token.length();
// Read one char at a time
size_t matching = 0;
while (in.good()) {
// Read one char at a time
char current;
in.read(&current, 1);
if (in.eof())
break;
chars++;
// Continue matching the token
if (current == token[matching]) {
matching++;
// Reached full token
if (matching == token_length) {
pages++;
matching = 0;
// Print progress
if (pages % 1000 == 0) {
std::cout << pages << " pages, ";
std::cout << (chars / 1024 / 1024) << " mb" << std::endl;
}
}
}
// Start over again
else {
matching = 0;
}
}
// Print result
std::cout << "Overall pages: " << pages << std::endl;
// Cleanup
in.close();
return 0;
}
Assuming there are no insanely large lines in the file using something like
for (std::string line; std::getline(in, line); ) {
// find the number of "<page>" strings in line
}
is bound to be a lot faster! Reading each character as a string of one character is about the worst thing you can possibly do. It is really hard to get any slower. For each character, the stream will do something like this:
Check if there is a tie()ed stream which needs flushing (there isn't, i.e., that's pointless).
Check if the stream is in good shape (except when having reached the end it is but this check can't be omitted entirely).
Call xsgetn() on the stream's stream buffer.
This function first checks if there is another character in the buffer (that's similar to the eof check but different; in any case, doing the eof check only after the buffer was empty removes a lot of the eof checks)
Transfer the character to the read buffer.
Have the stream check if it reached all (1) characters and set stream flags as needed.
There is a lot of waste in there!
I can't really imagine why grep would fail except that some line blows massively over the expected maximum line length. Although the use of std::getline() and std::string() is likely to have a much bigger upper bound, it is still not effective to process huge lines. If the file may contain lines which are massive, it may be more reasonable to use something along the lines of this:
for (std::istreambuf_iterator<char> it(in), end;
(it = std::find(it, end, '<')) != end; ) {
// match "<page>" at the start of the sequence [it, end)
}
For a bad implementation of streams that's still doing too much. Good implementations will do the calls to std::find(...) very efficiently and will probably check multiple characters at once, adding a check and loop only for something like every 16th loop iteration. I'd expect the above code to turn your CPU-bound implementation into an I/O-bound implementation. Bad implementations may still be CPU-bound but it should still be a lot better.
In any case, remember to enable optimizations!
I'm using this file to test with: http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-meta-current1.xml-p000000010p000010000.bz2
It takes roughly 2.4 seconds versus 11.5 using your code. The total character count is slightly different due to not counting newlines, but I assume that's acceptable since it's only used to display progress.
void parseByLine()
{
// Open up file
std::ifstream in("enwiki-latest-pages-meta-current1.xml-p000000010p000010000");
if(!in)
{
std::cout << "Could not open file." << std::endl;
return;
}
size_t chars = 0;
size_t pages = 0;
const std::string token = "<page>";
std::string line;
while(std::getline(in, line))
{
chars += line.size();
size_t pos = 0;
for(;;)
{
pos = line.find(token, pos);
if(pos == std::string::npos)
{
break;
}
pos += token.size();
if(++pages % 1000 == 0)
{
std::cout << pages << " pages, ";
std::cout << (chars / 1024 / 1024) << " mb" << std::endl;
}
}
}
// Print result
std::cout << "Overall pages: " << pages << std::endl;
}
Here's an example that adds each line to a buffer and then processes the buffer when it reaches a threshold. It takes 2 seconds versus ~2.4 from the first version. I played with several different thresholds for the buffer size and also processing after a fixed number (16, 32, 64, 4096) of lines and it all seems about the same as long as there is some batching going on. Thanks to Dietmar for the idea.
int processBuffer(const std::string& buffer)
{
static const std::string token = "<page>";
int pages = 0;
size_t pos = 0;
for(;;)
{
pos = buffer.find(token, pos);
if(pos == std::string::npos)
{
break;
}
pos += token.size();
++pages;
}
return pages;
}
void parseByMB()
{
// Open up file
std::ifstream in("enwiki-latest-pages-meta-current1.xml-p000000010p000010000");
if(!in)
{
std::cout << "Could not open file." << std::endl;
return;
}
const size_t BUFFER_THRESHOLD = 16 * 1024 * 1024;
std::string buffer;
buffer.reserve(BUFFER_THRESHOLD);
size_t pages = 0;
size_t chars = 0;
size_t progressCount = 0;
std::string line;
while(std::getline(in, line))
{
buffer += line;
if(buffer.size() > BUFFER_THRESHOLD)
{
pages += processBuffer(buffer);
chars += buffer.size();
buffer.clear();
}
if((pages / 1000) > progressCount)
{
++progressCount;
std::cout << pages << " pages, ";
std::cout << (chars / 1024 / 1024) << " mb" << std::endl;
}
}
if(!buffer.empty())
{
pages += processBuffer(buffer);
chars += buffer.size();
std::cout << pages << " pages, ";
std::cout << (chars / 1024 / 1024) << " mb" << std::endl;
}
}

Anything like substr but instead of stopping at the byte you specified, it stops at a specific string [duplicate]

This question already has answers here:
How do you search a std::string for a substring in C++?
(6 answers)
Closed 8 years ago.
I have a client for a pre-existing server. Let's say I get some packets "MC123, 456!##".
I store these packets in a char called message. To print out a specific part of them, in this case the numbers part of them, I would do something like "cout << message.substr(3, 7) << endl;".
But what if I receive another message "MC123, 456, 789!##". "cout << message.substr(3,7)" would only print out "123, 456", whereas I want "123, 456, 789". How would I do this assuming I know that every message ends with "!##".
First - Sketch out the indexing.
std::string packet1 = "MC123, 456!##";
// 0123456789012345678
// ^------^ desired text
std::string packet2 = "MC123, 456, 789!##";
// 0123456789012345678
// ^-----------^ desired text
The other answers are OK. If you wish to use std::string find,
consider rfind and find_first_not_of, as in the following code:
// forward
void messageShow(std::string packet,
size_t startIndx = 2);
// /////////////////////////////////////////////////////////////////////////////
int main (int, char** )
{
// 012345678901234567
// |
messageShow("MC123, 456!##");
messageShow("MC123, 456, 789!##");
messageShow("MC123, 456, 789, 987, 654!##");
// error test cases
messageShow("MC123, 456, 789##!"); // missing !##
messageShow("MC123x 456, 789!##"); // extraneous char in packet
return(0);
}
void messageShow(std::string packet,
size_t startIndx) // default value 2
{
static size_t seq = 0;
seq += 1;
std::cout << packet.size() << " packet" << seq << ": '"
<< packet << "'" << std::endl;
do
{
size_t bangAtPound_Indx = packet.rfind("!##");
if(bangAtPound_Indx == std::string::npos){ // not found, can't do anything more
std::cerr << " '!##' not found in packet " << seq << std::endl;
break;
}
size_t printLength = bangAtPound_Indx - startIndx;
const std::string DIGIT_SPACE = "0123456789, ";
size_t allDigitSpace = packet.find_first_not_of(DIGIT_SPACE, startIndx);
if(allDigitSpace != bangAtPound_Indx) {
std::cerr << " extraneous char found in packet " << seq << std::endl;
break; // something extraneous in string
}
std::cout << bangAtPound_Indx << " message" << seq << ": '"
<< packet.substr(startIndx, printLength) << "'" << std::endl;
}while(0);
std::cout << std::endl;
}
This outputs
13 packet1: 'MC123, 456!##'
10 message1: '123, 456'
18 packet2: 'MC123, 456, 789!##'
15 message2: '123, 456, 789'
28 packet3: 'MC123, 456, 789, 987, 654!##'
25 message3: '123, 456, 789, 987, 654'
18 packet4: 'MC123, 456, 789##!'
'!##' not found in packet 4
18 packet5: 'MC123x 456, 789!##'
extraneous char found in packet 5
Note: String indexes start at 0. The index of the digit '1' is 2.
The correct approach is to look for existence / location of the "known termination" string, then take the substring up to (but not including) that substring.
Something like
std::string termination = "!#$";
std::size_t position = message.find(termination);
std::string importantBit = message.substr(0, position);
You could check the front of the string separately as well. Combining these, you could use regular expressions to make your code more robust, using a regex like
MC([0-9,]+)!#\$
This will return the bit between MC and !#$ but only if it consists entirely of numbers and commas. Obviously you can adapt this as needed.
UPDATE: you asked in your comment how to use the regular expression. Here is a very simple program. Note - this is using C++11: you need to make sure your compiler supports it.
#include <iostream>
#include <regex>
int main(void) {
std::string s ("ABC123,456,789!#$");
std::smatch m;
std::regex e ("ABC([0-9,]+)!#\\$"); // matches the kind of pattern you are looking for
if (std::regex_search (s,m,e)) {
std::cout << "match[0] = " << m[0] << std::endl;
std::cout << "match[1] = " << m[1] << std::endl;
}
}
On my Mac, I can compile the above program with
clang++ -std=c++0x -stdlib=libc++ match.cpp -o match
If instead of just digits and commas you want "anything" in your expression (but it's still got fixed characters in front and behind) you can simply do
std::regex e ("ABC(.*)!#\\$");
Here, .* means "zero or more of 'anything'" - but followed by !#$. The double backslash has to be there to "escape" the dollar sign, which has special meaning in regular expressions (it means "the end of the string").
The more accurately your regular expression reflects exactly what you expect, the better you will be able to trap any errors. This is usually a very good thing in programming. "Always check your inputs".
One more thing - I just noticed you mentioned that you might have "more stuff" in your string. This is where using regular expressions quickly becomes the best. You mentioned a string
MC123, 456!##*USRChester.
and wanted to extract 123, 456 and Chester. That is - stuff between MC and !#$, and more stuff after USR (if that is even there). Here is the code that shows how that is done:
#include <iostream>
#include <regex>
int main(void) {
std::string s1 ("MC123, 456!#$");
std::string s2 ("MC123, 456!#$USRChester");
std::smatch m;
std::regex e ("MC([0-9, ]+)!#\\$(?:USR)?(.*)$"); // matches the kind of pattern you are looking for
if (std::regex_search (s1,m,e)) {
std::cout << "match[0] = " << m[0] << std::endl;
std::cout << "match[1] = " << m[1] << std::endl;
std::cout << "match[2] = " << m[2] << std::endl;
}
if (std::regex_search (s2,m,e)) {
std::cout << "match[0] = " << m[0] << std::endl;
std::cout << "match[1] = " << m[1] << std::endl;
std::cout << "match[2] = " << m[2] << std::endl;
if (m[2].length() > 0) {
std::cout << m[2] << ": " << m[1] << std::endl;
}
}
}
Output:
match[0] = MC123, 456!#$
match[1] = 123, 456
match[2] =
match[0] = MC123, 456!#$USRChester
match[1] = 123, 456
match[2] = Chester
Chester: 123, 456
The matches are:
match[0] : "everything in the input string that was consumed by the Regex"
match[1] : "the thing in the first set of parentheses"
match[2] : "The thing in the second set of parentheses"
Note the use of the slightly tricky (?:USR)? expression. This says "This might (that's the ()? ) be followed by the characters USR. If it is, skip them (that's the ?: part) and match what follows."
As you can see, simply testing whether m[2] is empty will tell you whether you have just numbers, or number plus "the thing after the USR". I hope this gives you an inkling of the power of regular expressions for chomping through strings like yours.
If you are sure about the ending of the message, message.substr(3, message.size()-6) will do the trick.
However, it is good practice to check everything, just to avoid surprises.
Something like this:
if (message.size() < 6)
throw error;
if (message.substr(0,3) != "MCX") //the exact numbers do not match in your example, but you get the point...
throw error;
if (message.substr(message.size()-3) != "!##")
throw error;
string data = message.substr(3, message.size()-6);
Just calculate the offset first.
string str = ...;
size_t start = 3;
size_t end = str.find("!##");
assert(end != string::npos);
return str.substr(start, end - start);
You can get the index of "!##" by using:
message.find("!##")
Then use that answer instead of 7. You should also check for it equalling std::string::npos which indicates that the substring was not found, and take some different action.
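Put together, the find-then-substr approach with the npos check might look like the sketch below. extract_payload is a hypothetical helper name, and the start offset of 2 (the character after "MC") is taken from the example packets.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: copies the text between `start` and the first
// occurrence of `terminator` into `out`. Returns false when the terminator
// is missing, so the caller can take some different action.
bool extract_payload(const std::string& message, std::size_t start,
                     const std::string& terminator, std::string& out) {
    const std::size_t end = message.find(terminator, start);
    if (end == std::string::npos) return false;  // terminator not found
    out = message.substr(start, end - start);
    return true;
}
```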
string msg = "MC4,512,541,3123!##";
// The loop bound already stops before the trailing "!##", so each character can be printed directly
for (size_t i = 2; i + 3 < msg.length(); i++) {
cout << msg[i];
}
or use char[]
char msg[] = "MC4,123,54!##";
sizeof(msg) - 1; //instead of msg.length()
// -1 for the null byte at the end (each char takes 1 byte so the size -1 == number of chars)