C++ backward regex search


I need to build an ultra-efficient log parser (~1GB/s). I use the Hyperscan library (https://www.hyperscan.io) from Intel, and it works well to:
count the number of occurrences of specified events
give the end position of the matches
One of its limitations is that no capture groups can be reported, only end offsets. For most matches I only use the count, but for 10% of them the match must be parsed to compute further statistics.
The challenge is to efficiently run a regex that extracts the text of a Hyperscan match, knowing only its end offset. Currently, I tried:
string data(const string * block) const {
std::regex nlexpr("\n(.*)\n$");
std::smatch match;
std::regex_search((*block).begin(), (*block).begin() + end, match, nlexpr);
return match[1];
}
block points to the file loaded in memory (2GB, so no copy possible).
end is the known offset matching the regex.
But it is extremely inefficient when the string to match is far into the block. I would have expected the "$" to make the operation very quick, since the offset is given as the end position, but it definitely does not. The operation takes ~1s if end = 100000000.
It is possible to get the start of the matches from Hyperscan, but the performance impact is very high (throughput was roughly halved in my tests), so that is not an option.
Any idea how to achieve this? I am using C++11 (so the standard library provides std::regex, whose interface is derived from Boost.Regex).
Best regards
Edit:
As the question came up in the comments: I do not have any control over the regexes to be used.

I don't have enough reputation to comment, so I am posting this as an answer even though it is more of an alternative.
I doubt you will find a trick that makes the performance independent of the position (the cost probably grows linearly with the offset, even for such a simple regex).
A very simple solution is to replace the slow std::regex implementation with, e.g., POSIX regex.h (old but gold ;) or Boost.Regex.
Here is an example:
#include <iostream>
#include <string>
#include <vector>
#include <regex>
#include <regex.h>
#include <chrono>
#include <boost/regex.hpp>
inline auto now = std::chrono::steady_clock::now;
inline auto toMs = [](auto &&x){
return std::chrono::duration_cast<std::chrono::milliseconds>(x).count();
};
void cregex(std::string const&s, std::string const&p)
{
auto start = now();
regex_t r;
regcomp(&r,p.data(),REG_EXTENDED);
std::vector<regmatch_t> m(r.re_nsub+1);
regexec(&r,s.data(),m.size(),m.data(),0);
regfree(&r);
std::cout << toMs(now()-start) << "ms " << std::string{s.cbegin()+m[1].rm_so,s.cbegin()+m[1].rm_eo} << std::endl;
}
void cxxregex(std::string const&s, std::string const&p)
{
using namespace std;
auto start = now();
regex r(p.data(),regex::extended);
smatch m;
regex_search(s.begin(),s.end(),m,r);
std::cout << toMs(now()-start) << "ms " << m[1] << std::endl;
}
void boostregex(std::string const&s, std::string const&p)
{
using namespace boost;
auto start = now();
regex r(p.data(),regex::extended);
smatch m;
regex_search(s.begin(),s.end(),m,r);
std::cout << toMs(now()-start) << "ms " << m[1] << std::endl;
}
int main()
{
std::string s(100000000,'x');
std::string s1 = "yolo" + s;
std::string s2 = s + "yolo";
std::cout << "yolo + ... -> cregex "; cregex(s1,"^(yolo)");
std::cout << "yolo + ... -> cxxregex "; cxxregex(s1,"^(yolo)");
std::cout << "yolo + ... -> boostregex "; boostregex(s1,"^(yolo)");
std::cout << "... + yolo -> cregex "; cregex(s2,"(yolo)$");
std::cout << "... + yolo -> cxxregex "; cxxregex(s2,"(yolo)$");
std::cout << "... + yolo -> boostregex "; boostregex(s2,"(yolo)$");
}
Gives:
yolo + ... -> cregex 5ms yolo
yolo + ... -> cxxregex 0ms yolo
yolo + ... -> boostregex 0ms yolo
... + yolo -> cregex 69ms yolo
... + yolo -> cxxregex 2594ms yolo
... + yolo -> boostregex 62ms yolo
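If the matches of interest never span more than one line (an assumption: the question's pattern \n(.*)\n$ suggests it, but arbitrary patterns may not satisfy it), another way to avoid the linear scan is to walk backward from the known end offset to the previous newline and run the regex only on that small window. A minimal sketch, where data_windowed is a hypothetical helper name and end is assumed to be the offset just past the trailing '\n':

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: runs `re` only on the last line before `end`
// instead of on the whole prefix [0, end) of the block.
// Assumes `end` is the offset one past the '\n' that terminates the match.
std::string data_windowed(const std::string& block, std::size_t end,
                          const std::regex& re) {
    if (end < 2 || end > block.size()) return {};
    // Find the newline that precedes the final line (skipping the trailing one).
    const std::size_t prev = block.rfind('\n', end - 2);
    const std::size_t begin = (prev == std::string::npos) ? 0 : prev;
    std::smatch m;
    if (std::regex_search(block.cbegin() + static_cast<std::ptrdiff_t>(begin),
                          block.cbegin() + static_cast<std::ptrdiff_t>(end),
                          m, re) &&
        m.size() > 1) {
        return m[1].str();  // only the captured text is copied, not the block
    }
    return {};
}
```

With this approach the cost depends on the length of the line, not on how far end is into the block. If a pattern can match across several lines, the window would have to be widened accordingly.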

I just realized...
that the solution proposed below does not work, at least if there are multiple "yolo" in the text. It does not return the "first instance found in the string"; it returns the "first instance found in a substring of the string". With 4 CPUs, the string is split into 4 substrings, and the first thread to find "yolo" 'wins'. This might be OK if you only want to know whether "yolo" is anywhere in the text, but not if you want the position of the first instance.
Old answer
Building on OZ's answer, I've written a parallel version. Edit: now using a condition variable to finish early.
#include <mutex>
#include <condition_variable>
std::mutex g_mtx;
std::condition_variable g_cv;
int g_found_at = -1;
void thread(
int id,
std::string::const_iterator begin,
std::string::const_iterator end,
const boost::regex& r,
boost::smatch* const m)
{
boost::smatch m_i;
if (regex_search(begin, end, m_i, r))
{
*m = m_i;
std::unique_lock<std::mutex> lk(g_mtx);
g_found_at = id;
lk.unlock();
g_cv.notify_one();
}
}
#include <thread>
#include <vector>
#include <memory>
#include <algorithm>
#include <chrono>
using namespace std::chrono_literals;
void boostparregex(std::string const &s, std::string const &p)
{
{
std::unique_lock<std::mutex> lk(g_mtx);
g_found_at = -1;
}
auto nrOfCpus = std::thread::hardware_concurrency() / 2;
std::cout << "(Nr of CPUs: " << nrOfCpus << ") ";
auto start = std::chrono::steady_clock::now();
boost::regex r(p.data(), boost::regex::extended);
std::vector<std::shared_ptr<boost::smatch>> m; m.reserve(nrOfCpus);
std::generate_n(std::back_inserter(m), nrOfCpus, []() { return std::make_shared<boost::smatch>(); });
std::vector<std::thread> t; t.reserve(nrOfCpus);
auto sizePerThread = s.length() / nrOfCpus;
for (size_t tId = 0; tId < nrOfCpus; tId++) {
auto begin = s.begin() + (tId * sizePerThread);
auto end = tId == nrOfCpus - 1 ? s.end() : s.begin() + ((tId + 1) * sizePerThread) - 1;
t.push_back(std::thread(thread, (int)tId, begin, end, r, m[tId].get()));
}
{
std::unique_lock<std::mutex> lk(g_mtx);
g_cv.wait_for(lk, 10s, []() { return g_found_at >= 0; });
}
{
std::unique_lock<std::mutex> lk(g_mtx);
if (g_found_at < 0) std::cout << "Not found! "; else std::cout << m[g_found_at]->str() << " ";
}
std::cout << toMs(std::chrono::steady_clock::now() - start) << "ms " << std::endl;
for (auto& thr : t) thr.join();
}
Which gives me this output (don't have posix under vs2017)
yolo + ... -> cxxregex 0ms yolo
yolo + ... -> boostregex 1ms yolo
yolo + ... -> boostparregex (Nr of CPUs: 4) yolo 13ms
... + yolo -> cxxregex 5014ms yolo
... + yolo -> boostregex 837ms yolo
... + yolo -> boostparregex (Nr of CPUs: 4) yolo 222ms
I get up to a 4x speedup on 4 CPUs. There is some overhead for starting up the threads.
p.s. this is my first C++ thread program and first regex, so there could be some optimizations possible.
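Regarding the caveat at the top of this answer: one way to get the first instance in the whole string is to let every chunk finish and then take the hit from the lowest-numbered chunk. A rough sketch with std::regex and std::thread follows; first_match_parallel is a made-up name, it gives up the early exit, and it assumes that a match never straddles a chunk boundary and that the pattern has no ^ or $ anchors (each chunk is searched as if it were a whole string).

```cpp
#include <algorithm>
#include <cstddef>
#include <regex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical helper: offset of the earliest match of `r` in `s`,
// or std::string::npos if no chunk contains a match.
// Assumption: a match never straddles a chunk boundary.
std::size_t first_match_parallel(const std::string& s, const std::regex& r,
                                 unsigned chunks) {
    chunks = std::max(1u, chunks);
    const std::size_t per = s.size() / chunks;
    std::vector<std::size_t> found(chunks, std::string::npos);
    std::vector<std::thread> workers;
    workers.reserve(chunks);
    for (unsigned i = 0; i < chunks; ++i) {
        workers.emplace_back([&s, &r, &found, per, chunks, i] {
            const auto b = s.cbegin() + static_cast<std::ptrdiff_t>(i * per);
            const auto e = (i == chunks - 1)
                               ? s.cend()
                               : b + static_cast<std::ptrdiff_t>(per);
            std::smatch m;
            if (std::regex_search(b, e, m, r))
                found[i] = static_cast<std::size_t>(m[0].first - s.cbegin());
        });
    }
    for (auto& w : workers) w.join();  // wait for every chunk, not just the first hit
    for (std::size_t pos : found)      // chunks are stored in string order
        if (pos != std::string::npos) return pos;
    return std::string::npos;
}
```

Each thread writes only its own slot of found, so no mutex is needed. The price is that the slowest chunk determines the total time.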

Related

Getting coefficients from a string

I have a project to write a program that receives a polynomial string from the user up to the 5th power (ex. x^3+6x^2+9x+24) and prints out all the real and imaginary roots. The coefficients should be stored in a dynamic array.
The problem is getting these coefficients from the string. One of the coefficients can be a 0 (ex. 2x^2-18) so I can't store the coefficients from left to right by using an increment, because in this case a=2, b=-18, and c has no value, which is wrong.
Another problem is if the coefficient is 1, because in this case nothing will be written beside the x for the program to read (ex. x^2-x+14). Another problem is if the user adds a space, several, or none (ex. x ^3 +4x^ 2- 12 x + 1 3).
I have been thinking of pseudocode for a long time now, but nothing is coming to mind. I thought of detecting numbers from left to right and reading numbers and stopping at x, but the first and second problems occur. I thought of finding each x and then checking the numbers before it, but the second problem occurs, and also I don't know how big the number the user inputs.

Here is another Regex that you can use to get your coefficients after deleting whitespace characters:
(\d*)(x?\^?)(\d*)
It uses groups (indicated by the brackets). Every match has 3 groups:
Your coefficient
x^n, x or nothing
The exponent
If (1) is empty (i.e. no digits were matched), it means your coefficient is 1.
If (2) and (3) are null, you have the last single number without x.
If only (3) is null, you have a single x without ^n.
You can try some examples on online regex sites like this one, where you can see the results on the right.
There are many tutorials online how to use Regex with C++.
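As a rough illustration of how such groups can be turned into coefficients, here is a sketch that stores each coefficient at the index of its exponent, so a missing power simply stays 0. The function name parse_coefficients is hypothetical, the pattern is a slightly extended variant of the one above (it adds a sign group and makes ^n optional as a unit), and malformed input is not diagnosed.

```cpp
#include <algorithm>
#include <cctype>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: result[k] is the coefficient of x^k.
// Assumes well-formed input such as "x^3+6x^2+9x+24".
std::vector<int> parse_coefficients(std::string poly, int max_degree = 5) {
    // Normalize: drop every whitespace character.
    poly.erase(std::remove_if(poly.begin(), poly.end(),
                              [](unsigned char c) { return std::isspace(c) != 0; }),
               poly.end());
    std::vector<int> coeffs(static_cast<std::size_t>(max_degree) + 1, 0);
    // Groups: (1) sign, (2) digits, (3) x, (4) exponent digits.
    static const std::regex term(R"(([+-]?)(\d*)(x?)(?:\^(\d+))?)");
    for (std::sregex_iterator it(poly.begin(), poly.end(), term), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        if (m.length(0) == 0) continue;  // every part is optional, so skip empty matches
        const int sign = (m[1].str() == "-") ? -1 : 1;
        const int value = m[2].length() ? std::stoi(m[2].str()) : 1;  // "x" means 1x
        int exponent = 0;                                             // bare number
        if (m[3].length()) exponent = m[4].matched ? std::stoi(m[4].str()) : 1;
        if (exponent <= max_degree)
            coeffs[static_cast<std::size_t>(exponent)] += sign * value;
    }
    return coeffs;
}
```

For "2x^2-18" this yields -18 at index 0, 0 at index 1 and 2 at index 2, which addresses the missing-term problem from the question.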

You should normalize your input string first, for example by removing all spaces, and then parse the coefficients.
Here is my example. Please adapt it to your case.
#include <iostream>
#include <regex>
#include <iterator>
#include <string>
#include <vector>
#include <algorithm>
int main(int argc, char *argv[]) {
std::string input {argv[1]};
input.erase(remove_if(input.begin(), input.end(), isspace), input.end());
std::cout << input << std::endl;
std::vector<int> coeffs;
std::regex poly_regex(R"(\s*\+?\-?\s*\d*\s*x*\^*\s*\d*)");
auto coeff_begin = std::sregex_iterator(input.begin(), input.end(), poly_regex);
auto coeff_end = std::sregex_iterator();
for (std::sregex_iterator i = coeff_begin; i != coeff_end; ++i) {
std::smatch match = *i;
std::string match_str = match.str();
// std::cout << " " << match_str << "\n";
std::size_t plus_pos = match_str.find('+');
std::size_t minus_pos = match_str.find('-');
std::size_t x_pos = match_str.find('x');
if (x_pos == std::string::npos) {
std::cout << match_str.substr(plus_pos + 1) << std::endl;
} else if (x_pos == 0) {
std::cout << 1 << std::endl;
} else if (minus_pos != std::string::npos) {
if (x_pos - minus_pos == 1) std::cout << -1 << std::endl;
else std::cout << match_str.substr(minus_pos, x_pos - minus_pos) << std::endl;
}
else {
std::cout << match_str.substr(plus_pos + 1, x_pos - plus_pos - 1) << std::endl;
}
}
for (auto i: coeffs) std::cout << i << " ";
return 0;
}

Chrono C++ timings not correct

I'm just comparing the speed of a couple of Fibonacci functions. One gives its output almost immediately and reports that it finished in 500 nanoseconds, while the other, depending on the depth, may sit there for many seconds, yet when it is done it reports that it took only 100 nanoseconds... after I sat there and waited about 20 seconds for it.
It's not a big deal as I can prove the other is slower just with raw human perception, but why would chrono not be working? Something to do with recursion?
PS I know that fibonacci2() doesn't give the correct output on odd numbered depths, I'm just testing some things and the output is actually just there so the compiler doesn't optimize it away or something. Go ahead and just copy this code and you'll see fibonacci2() immediately output but you'll have to wait like 5 seconds for fibonacci(). Thank you.
#include <iostream>
#include <chrono>
int fibonacci2(int depth) {
static int a = 0;
static int b = 1;
if (b > a) {
a += b; //std::cout << a << '\n';
}
else {
b += a; //std::cout << b << '\n';
}
if (depth > 1) {
fibonacci2(depth - 1);
}
return a;
}
int fibonacci(int n) {
if (n <= 1) {
return n;
}
return fibonacci(n - 1) + fibonacci(n - 2);
}
int main() {
int f = 0;
auto start2 = std::chrono::steady_clock::now();
f = fibonacci2(44);
auto stop2 = std::chrono::steady_clock::now();
std::cout << f << '\n';
auto duration2 = std::chrono::duration_cast<std::chrono::nanoseconds>(stop2 - start2);
std::cout << "faster function time: " << duration2.count() << '\n';
auto start = std::chrono::steady_clock::now();
f = fibonacci(44);
auto stop = std::chrono::steady_clock::now();
std::cout << f << '\n';
auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);
std::cout << "way slower function with incorrect time: " << duration.count() << '\n';
}

I don't know what compiler you are using and with which compiler options, but I tested x64 msvc v19.28 using /O2 in godbolt. Here the compiled instructions are reordered such that it queries the perf_counter twice before invoking the fibonacci(int) function, which in code would look like
auto start = ...;
auto stop = ...;
f = fibonacci(44);
One way to prevent this reordering might be to place a std::atomic_thread_fence just before and just after the fibonacci function call.
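A minimal sketch of that idea follows. timed_fibonacci and the Timed struct are names made up for this illustration, and the fences are a best-effort hint: the standard does not promise that they stop a compiler from moving a pure computation across the clock reads, so the result is also stored in a volatile to make it observable.

```cpp
#include <atomic>
#include <chrono>

int fibonacci(int n) {
    if (n <= 1) return n;
    return fibonacci(n - 1) + fibonacci(n - 2);
}

// Hypothetical result type for this sketch.
struct Timed {
    int value;
    std::chrono::nanoseconds elapsed;
};

Timed timed_fibonacci(int n) {
    const auto start = std::chrono::steady_clock::now();
    std::atomic_thread_fence(std::memory_order_seq_cst);  // discourage hoisting the call
    volatile int result = fibonacci(n);                   // volatile store keeps the call alive
    std::atomic_thread_fence(std::memory_order_seq_cst);  // discourage sinking the call
    const auto stop = std::chrono::steady_clock::now();
    return {result,
            std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)};
}
```

Whether this is enough depends on the compiler, so it is worth checking the generated assembly (for example on godbolt) for the build that showed the problem.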

As Mestkon answered, the compiler can reorder your code.
For examples of how to prevent the compiler from reordering, see Memory Ordering - Compile Time Memory Barrier.
It would help in the future if you stated which compiler you are using.
gcc 7.5 with -O2, for example, does not reorder the timer instructions in this scenario.

Most 'functional' way to sum pairs of elements from a vector using C++17 or later?

I'd like some suggestions for the most terse and 'functional' way to gather pairs of successive elements from a vector (1st and 2nd, 3rd and 4th, etc.) using modern C++. Assume the vector is of arbitrary but even length. For the examples I'm pulling together, I'm summing the elements of each pair but that's not the main problem. I should add I'll use STL only, no Boost.
In Python I can zip them into 2-tuples via an iterator with
s = range(1,11)
print([(x + y) for x,y in zip(*[iter(s)] * 2)])
In Perl 5 I can peel off pairs with
use List::Util qw/pairs sum/;
use feature 'say';
@s = 1 .. 10;
say sum @$_ foreach (pairs @s);
In Perl 6 I can shove them two at a time into a block with
my @s = 1 .. 10;
for @s -> $x, $y { say $x + $y; }
and in R I can wrap the vector into a 2-column array and sum the rows with
s <- 1:10
print(apply(matrix(s, ncol=2, byrow=TRUE), 1, sum))
I am not fluent in C++ and my solution uses for(;;). That feels too much like C.
#include <iostream>
#include <vector>
#include <numeric> // std::iota
int main() {
std::vector<int> s(10);
std::iota(s.begin(), s.end(), 1);
for (auto p = s.cbegin(); p != s.cend(); p += 2)
std::cout << (*p + *(p + 1)) << std::endl;
}
The output of course should be some variant of
3
7
11
15
19

Using range-v3:
for (auto v : view::iota(1, 11) | view::chunk(2)) {
std::cout << v[0] + v[1] << '\n';
}
Note that chunk(2) doesn't give you a compile-time-fixed size view, so you can't do:
for (auto [x,y] : view::iota(1, 11) | view::chunk(2)) { ... }

Without using range-v3 I was able to do this with either a function or a lambda template. I'll show the lambda version here.
#include <cstdlib> // EXIT_SUCCESS
#include <iostream>
#include <string>
#include <vector>
template<typename T>
auto lambda = [](const std::vector<T>& values, std::vector<T>& results) {
std::vector<T> temp1, temp2;
for ( std::size_t i = 0; i < values.size(); i++ ) {
if ( i & 1 ) temp2.push_back(values[i]); // odd index
else temp1.push_back(values[i]); // even index
}
for ( std::size_t i = 0; i < values.size() / 2; i++ )
results.push_back(temp1[i] + temp2[i]);
};
int main() {
std::vector<int> values{ 1,2,3,4,5,6 };
for (auto i : values)
std::cout << i << " ";
std::cout << '\n';
std::vector<int> results;
lambda<int>(values, results);
for (auto i : results)
std::cout << i << " ";
std::cout << '\n';
std::vector<float> values2{ 1.1f, 2.2f, 3.3f, 4.4f };
for (auto f : values2)
std::cout << f << " ";
std::cout << '\n';
std::vector<float> results2;
lambda<float>(values2, results2);
for (auto f : results2)
std::cout << f << " ";
std::cout << '\n';
std::vector<char> values3{ 'a', 'd' };
for (auto c : values3)
std::cout << c << " ";
std::cout << '\n';
std::vector<char> results3;
lambda<char>(values3, results3);
for (auto c : results3)
std::cout << c << " ";
std::cout << '\n';
std::vector<std::string> values4{ "Hello", " ", "World", "!" };
for (auto s : values4)
std::cout << s;
std::cout << '\n';
std::vector<std::string> results4;
lambda<std::string>(values4, results4);
for (auto s : results4)
std::cout << s;
std::cout << '\n';
return EXIT_SUCCESS;
}
Output
1 2 3 4 5 6
3 7 11
1.1 2.2 3.3 4.4
3.3 7.7
a d
┼
Hello World!
Hello World!

At the risk of sounding like I'm trying to be clever or annoying, I say this is the answer:
print(sums(successive_pairs(range(1,11))));
Now, of course, those aren't built-in functions, so you would have to define them, but I don't think that is a bad thing. The code clearly expresses what you want in a functional style. Also, the responsibility of each of those functions is well separated, easily testable, and reusable. It isn't necessary to use a lot of tricky specialized syntax to write code in a functional style.
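For illustration, here is one possible way to define those three functions using only the standard library. This is a sketch, not the only reasonable design: the names come from the line above, the half-open range [first, last) mirrors Python's range, and a trailing unpaired element is silently dropped.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <numeric>
#include <utility>
#include <vector>

// Integers in the half-open interval [first, last), like Python's range.
std::vector<int> range(int first, int last) {
    std::vector<int> v(last > first ? static_cast<std::size_t>(last - first) : 0);
    std::iota(v.begin(), v.end(), first);
    return v;
}

// (v[0], v[1]), (v[2], v[3]), ...; an odd trailing element is ignored.
template <typename T>
std::vector<std::pair<T, T>> successive_pairs(const std::vector<T>& v) {
    std::vector<std::pair<T, T>> out;
    out.reserve(v.size() / 2);
    for (std::size_t i = 0; i + 1 < v.size(); i += 2)
        out.emplace_back(v[i], v[i + 1]);
    return out;
}

// Sum of the two members of each pair.
template <typename T>
std::vector<T> sums(const std::vector<std::pair<T, T>>& pairs) {
    std::vector<T> out;
    out.reserve(pairs.size());
    std::transform(pairs.begin(), pairs.end(), std::back_inserter(out),
                   [](const auto& p) { return p.first + p.second; });
    return out;
}
```

Calling sums(successive_pairs(range(1, 11))) gives 3, 7, 11, 15, 19. A print function is left out; a range-based for loop over the result is enough.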

Absolute wall positions in IFC

Because I want to associate windows with walls, I am trying to find the endpoints of a wall using IFCOpenShell.
#include <ifcgeom/IfcGeom.h>
#include <ifcparse/IfcFile.h>
using namespace Ifc2x3;
using namespace IfcSchema;
using namespace IfcParse;
using namespace IfcUtil;
using namespace IfcGeom;
std::string tostr(gp_XYZ a) {
std::stringstream ss;
ss << a.X() << "," << a.Y() << "," << a.Z();
return ss.str();
}
int main(int argc, char** argv) {
IfcFile file;
if (!file.Init(argc > 1 ? argv[1] : "windows connected to walls Archicad.ifc")) abort();
Kernel k;
gp_Pnt start;
gp_Pnt end;
auto list = file.entitiesByType<IfcWall>();
for ( auto it = list->begin(); it != list->end(); ++it) {
auto w1 = *it;
std::cout << "wall " << w1->entity->id() << std::endl;
bool suc = k.find_wall_end_points(w1, start, end);
std::cout << suc << std::endl;
std::cout << tostr(start.XYZ()) << std::endl;
std::cout << tostr(end.XYZ()) << std::endl;
}
}
This currently yields 10.1596,0,0 and 9.08038,0,0 as endpoints for two walls, and 0 as the starting point for both. This means they are parallel. But when I open the IFC file in Archicad, they are orthogonal. There is even an IFC tag in the file specifying that one wall connects to the other ATSTART/ATEND. How can I derive the correct coordinates?
Compilation command line: g++ -I/usr/include/oce wallpoints.cpp -l IfcGeom -l IfcParse $(for i in $(cd /usr/lib/x86_64-linux-gnu/; ls libTK*.so | cut -d. -f1 | cut -c4-); do echo "-l $i"; done) $(icu-config --ldflags) -ggdb3
IFC file: https://hushfile.it/592c22c69af2a#AwrpRI2O3ecDuAmP_pIjNmCAV3BWvwr30tfrazqJ

As stated in the standard, the placement is relative, so I have to multiply this with a couple of matrices.
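In IFC, an IfcLocalPlacement is expressed relative to its PlacementRelTo parent, so the absolute position is the product of all placement matrices along that chain. The following sketch shows only the matrix arithmetic with plain arrays; it does not use the IfcOpenShell API, the helper names are invented for this example, and rotations are restricted to the Z axis for brevity.

```cpp
#include <array>
#include <cmath>

using Vec3 = std::array<double, 3>;
using Mat4 = std::array<std::array<double, 4>, 4>;

// Homogeneous 4x4 product a * b.
Mat4 multiply(const Mat4& a, const Mat4& b) {
    Mat4 r{};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                r[i][j] += a[i][k] * b[k][j];
    return r;
}

// Hypothetical placement: rotate about Z by angle_rad, then translate to origin.
Mat4 placement(double angle_rad, const Vec3& origin) {
    Mat4 m{};
    const double c = std::cos(angle_rad), s = std::sin(angle_rad);
    m[0][0] = c;  m[0][1] = -s;
    m[1][0] = s;  m[1][1] = c;
    m[2][2] = 1.0;
    m[3][3] = 1.0;
    m[0][3] = origin[0];
    m[1][3] = origin[1];
    m[2][3] = origin[2];
    return m;
}

// Transform a point given in local coordinates.
Vec3 apply(const Mat4& m, const Vec3& p) {
    Vec3 r{};
    for (int i = 0; i < 3; ++i)
        r[i] = m[i][0] * p[0] + m[i][1] * p[1] + m[i][2] * p[2] + m[i][3];
    return r;
}
```

For a wall whose placement is relative to a storey, the absolute endpoint would be apply(multiply(storey, wall), local_endpoint), with the parent matrix on the left. Two walls with the same local endpoints can then end up orthogonal if their placements carry different rotations.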

Check if a string contains a string in C++

I have a variable of type std::string. I want to check if it contains a certain std::string. How would I do that?
Is there a function that returns true if the string is found, and false if it isn't?

Use std::string::find as follows:
if (s1.find(s2) != std::string::npos) {
std::cout << "found!" << '\n';
}
Note: "found!" will be printed if s2 is a substring of s1, both s1 and s2 are of type std::string.
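If you want a function that returns true or false, as the question asks, the check can be wrapped in a small helper. The name contains is my own choice here (it happens to match the member function added in C++23):

```cpp
#include <string>

// Returns true if `needle` occurs anywhere in `haystack`.
// An empty needle is considered to be contained in every string.
bool contains(const std::string& haystack, const std::string& needle) {
    return haystack.find(needle) != std::string::npos;
}
```

Usage: if (contains(s1, s2)) { ... }.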

You can try using the find function:
string str ("There are two needles in this haystack.");
string str2 ("needle");
if (str.find(str2) != string::npos) {
//.. found.
}

Starting from C++23 you can use std::string::contains
#include <string>
const auto haystack = std::string("haystack with needles");
const auto needle = std::string("needle");
if (haystack.contains(needle))
{
// found!
}

You can also use the Boost library; std::string does not provide methods for every common string operation. In Boost, you can simply use boost::algorithm::contains:
#include <iostream>
#include <string>
#include <boost/algorithm/string.hpp>
int main() {
std::string s("gengjiawen");
std::string t("geng");
bool b = boost::algorithm::contains(s, t);
std::cout << b << std::endl;
return 0;
}

You can try this
string s1 = "Hello";
string s2 = "el";
if(strstr(s1.c_str(),s2.c_str()))
{
cout << " S1 Contains S2";
}

If this functionality is critical to your system, it is actually beneficial to use the old strstr function. std::search from <algorithm> was the slowest in my tests. My guess would be that it takes a lot of time to create those iterators.
The code that I used to time the whole thing is
#include <string>
#include <cstring>
#include <iostream>
#include <algorithm>
#include <random>
#include <chrono>
std::string randomString( size_t len );
int main(int argc, char* argv[])
{
using namespace std::chrono;
const size_t haystacksCount = 200000;
std::string haystacks[haystacksCount];
std::string needle = "hello";
bool sink = true;
high_resolution_clock::time_point start, end;
duration<double> timespan;
int sizes[10] = { 10, 20, 40, 80, 160, 320, 640, 1280, 5120, 10240 };
for(int s=0; s<10; ++s)
{
std::cout << std::endl << "Generating " << haystacksCount << " random haystacks of size " << sizes[s] << std::endl;
for(size_t i=0; i<haystacksCount; ++i)
{
haystacks[i] = randomString(sizes[s]);
}
std::cout << "Starting std::string.find approach" << std::endl;
start = high_resolution_clock::now();
for(size_t i=0; i<haystacksCount; ++i)
{
if(haystacks[i].find(needle) != std::string::npos)
{
sink = !sink; // useless action
}
}
end = high_resolution_clock::now();
timespan = duration_cast<duration<double>>(end-start);
std::cout << "Processing of " << haystacksCount << " elements took " << timespan.count() << " seconds." << std::endl;
std::cout << "Starting strstr approach" << std::endl;
start = high_resolution_clock::now();
for(size_t i=0; i<haystacksCount; ++i)
{
if(strstr(haystacks[i].c_str(), needle.c_str()))
{
sink = !sink; // useless action
}
}
end = high_resolution_clock::now();
timespan = duration_cast<duration<double>>(end-start);
std::cout << "Processing of " << haystacksCount << " elements took " << timespan.count() << " seconds." << std::endl;
std::cout << "Starting std::search approach" << std::endl;
start = high_resolution_clock::now();
for(size_t i=0; i<haystacksCount; ++i)
{
if(std::search(haystacks[i].begin(), haystacks[i].end(), needle.begin(), needle.end()) != haystacks[i].end())
{
sink = !sink; // useless action
}
}
end = high_resolution_clock::now();
timespan = duration_cast<duration<double>>(end-start);
std::cout << "Processing of " << haystacksCount << " elements took " << timespan.count() << " seconds." << std::endl;
}
return 0;
}
std::string randomString( size_t len)
{
static const char charset[] = "abcdefghijklmnopqrstuvwxyz";
static const int charsetLen = sizeof(charset) - 1;
static std::default_random_engine rng(std::random_device{}());
static std::uniform_int_distribution<> dist(0, charsetLen - 1); // bounds are inclusive
auto randChar = []() -> char // static locals need no capture
{
return charset[ dist(rng) ];
};
std::string result(len, 0);
std::generate_n(result.begin(), len, randChar);
return result;
}
Here I generate random haystacks and search them for the needle. The number of haystacks is fixed, but the length of each string increases from 10 at the beginning to 10240 at the end. Most of the time the program spends actually generating random strings, but that is to be expected.
The output is:
Generating 200000 random haystacks of size 10
Starting std::string.find approach
Processing of 200000 elements took 0.00358503 seconds.
Starting strstr approach
Processing of 200000 elements took 0.0022727 seconds.
Starting std::search approach
Processing of 200000 elements took 0.0346258 seconds.
Generating 200000 random haystacks of size 20
Starting std::string.find approach
Processing of 200000 elements took 0.00480959 seconds.
Starting strstr approach
Processing of 200000 elements took 0.00236199 seconds.
Starting std::search approach
Processing of 200000 elements took 0.0586416 seconds.
Generating 200000 random haystacks of size 40
Starting std::string.find approach
Processing of 200000 elements took 0.0082571 seconds.
Starting strstr approach
Processing of 200000 elements took 0.00341435 seconds.
Starting std::search approach
Processing of 200000 elements took 0.0952996 seconds.
Generating 200000 random haystacks of size 80
Starting std::string.find approach
Processing of 200000 elements took 0.0148288 seconds.
Starting strstr approach
Processing of 200000 elements took 0.00399263 seconds.
Starting std::search approach
Processing of 200000 elements took 0.175945 seconds.
Generating 200000 random haystacks of size 160
Starting std::string.find approach
Processing of 200000 elements took 0.0293496 seconds.
Starting strstr approach
Processing of 200000 elements took 0.00504251 seconds.
Starting std::search approach
Processing of 200000 elements took 0.343452 seconds.
Generating 200000 random haystacks of size 320
Starting std::string.find approach
Processing of 200000 elements took 0.0522893 seconds.
Starting strstr approach
Processing of 200000 elements took 0.00850485 seconds.
Starting std::search approach
Processing of 200000 elements took 0.64133 seconds.
Generating 200000 random haystacks of size 640
Starting std::string.find approach
Processing of 200000 elements took 0.102082 seconds.
Starting strstr approach
Processing of 200000 elements took 0.00925799 seconds.
Starting std::search approach
Processing of 200000 elements took 1.26321 seconds.
Generating 200000 random haystacks of size 1280
Starting std::string.find approach
Processing of 200000 elements took 0.208057 seconds.
Starting strstr approach
Processing of 200000 elements took 0.0105039 seconds.
Starting std::search approach
Processing of 200000 elements took 2.57404 seconds.
Generating 200000 random haystacks of size 5120
Starting std::string.find approach
Processing of 200000 elements took 0.798496 seconds.
Starting strstr approach
Processing of 200000 elements took 0.0137969 seconds.
Starting std::search approach
Processing of 200000 elements took 10.3573 seconds.
Generating 200000 random haystacks of size 10240
Starting std::string.find approach
Processing of 200000 elements took 1.58171 seconds.
Starting strstr approach
Processing of 200000 elements took 0.0143111 seconds.
Starting std::search approach
Processing of 200000 elements took 20.4163 seconds.

If the strings are relatively big (hundreds of bytes or more) and C++17 is available, you might want to use the Boyer-Moore searcher (example from cppreference.com):
#include <iostream>
#include <string>
#include <algorithm>
#include <functional>
int main()
{
std::string in = "Lorem ipsum dolor sit amet, consectetur adipiscing elit,"
" sed do eiusmod tempor incididunt ut labore et dolore magna aliqua";
std::string needle = "pisci";
auto it = std::search(in.begin(), in.end(),
std::boyer_moore_searcher(
needle.begin(), needle.end()));
if(it != in.end())
std::cout << "The string " << needle << " found at offset "
<< it - in.begin() << '\n';
else
std::cout << "The string " << needle << " not found\n";
}

If you don't want to use standard library functions, below is one solution.
#include <iostream>
#include <string>
bool CheckSubstring(std::string firstString, std::string secondString){
if(secondString.size() > firstString.size())
return false;
for (int i = 0; i < firstString.size(); i++){
int j = 0;
// If the first characters match
if(firstString[i] == secondString[j]){
int k = i;
while (j < secondString.size() && i < firstString.size() && firstString[i] == secondString[j]){ // check bounds before comparing
j++;
i++;
}
if (j == secondString.size())
return true;
else // Re-initialize i to its original value
i = k;
}
}
return false;
}
int main(){
std::string firstString, secondString;
std::cout << "Enter first string:";
std::getline(std::cin, firstString);
std::cout << "Enter second string:";
std::getline(std::cin, secondString);
if(CheckSubstring(firstString, secondString))
std::cout << "Second string is a substring of the first string.\n";
else
std::cout << "Second string is not a substring of the first string.\n";
return 0;
}

std::regex_search also works, and it is a stepping stone toward making the search more generic. Below is an example with comments.
//THE STRING IN WHICH THE SUBSTRING TO BE FOUND.
std::string testString = "Find Something In This Test String";
//THE SUBSTRING TO BE FOUND.
auto pattern{ "In This Test" };
//std::regex_constants::icase - TO IGNORE CASE.
auto rx = std::regex{ pattern,std::regex_constants::icase };
//SEARCH THE STRING.
bool isStrExists = std::regex_search(testString, rx);
This requires #include <regex>.
Suppose the input string is instead something like "Find Something In This Example String", and you want to match either "In This Test" or "In This Example". The search can then be extended simply by adjusting the pattern as shown below.
//THE SUBSTRING TO BE FOUND.
auto pattern{ "In This (Test|Example)" };

what about
string response = "hello world";
string findMe = "world";
if(response.find(findMe) != string::npos)
{
//found
}

#include <algorithm> // std::search
#include <iostream>
#include <string>
using std::search; using std::string;
int main() {
    string mystring = "The needle in the haystack";
    string str = "needle";
    // if string is found... search returns an iterator to str's first element in mystring
    // if string is not found... search returns mystring.end()
    string::const_iterator it = search(mystring.cbegin(), mystring.cend(),
                                       str.cbegin(), str.cend());
    if (it != mystring.cend())
        std::cout << "found\n";     // string is found
    else
        std::cout << "not found\n"; // not found
    return 0;
}

Among the many answers on this site I didn't find a clear one, so I worked it out myself in 5-10 minutes.
There are two cases:
Either you KNOW the position of the substring you are looking for in the string,
or you don't know the position and search for it, char by char...
So, let's assume we search for the substring "cd" in the string "abcde", using the simple built-in substr function in C++.
For case 1:
#include <iostream>
#include <string>
using namespace std;
int i;
int main()
{
string a = "abcde";
string b = a.substr(2,2); // 2 will be c. Why? because we start counting from 0 in a string, not from 1.
cout << "substring of a is: " << b << endl;
return 0;
}
For case 2:
#include <iostream>
#include <string>
using namespace std;
int i;
int main()
{
string a = "abcde";
for (i=0;i<a.length(); i++)
{
if (a.substr(i,2) == "cd")
{
cout << "substring of a is: " << a.substr(i,2) << endl; // i iterates from 0 to 4; the substring is displayed only when the condition is fulfilled
}
}
return 0;
}

This is a simple function
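bool find(string line, string sWord)
{
    // An empty word is found in every line; a longer word can never be found.
    if (sWord.empty())
        return true;
    if (sWord.size() > line.size())
        return false;
    // Try every start position; restarting from i + 1 after a partial match
    // also handles inputs such as line = "aab", sWord = "ab".
    for (size_t i = 0; i + sWord.size() <= line.size(); i++)
    {
        size_t index = 0;
        while (index < sWord.size() && line.at(i + index) == sWord.at(index))
        {
            index++;
        }
        if (index == sWord.size())
        {
            return true;
        }
    }
    return false;
}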

If you are using C++/CLI (.NET), you can also use the System namespace.
Then you can use the Contains method.
#include <iostream>
using namespace System;
int main(){
String ^ wholeString = "My name is Malindu";
if(wholeString->ToLower()->Contains("malindu")){
std::cout<<"Found";
}
else{
std::cout<<"Not Found";
}
}

Note: I know that the question requires a function, which means the user is trying to find something simpler. But still I post it in case anyone finds it useful.
Approach using a Suffix Automaton. It accepts a string (haystack), and after that you can input hundreds of thousands of queries (needles) and the response will be very fast, even if the haystack and/or needles are very long strings.
Read about the data structure being used here: https://en.wikipedia.org/wiki/Suffix_automaton
#include <bits/stdc++.h>
using namespace std;
struct State {
int len, link;
map<char, int> next;
};
struct SuffixAutomaton {
vector<State> st;
int sz = 1, last = 0;
SuffixAutomaton(string& s) {
st.assign(s.size() * 2, State());
st[0].len = 0;
st[0].link = -1;
for (char c : s) extend(c);
}
void extend(char c) {
int cur = sz++, p = last;
st[cur].len = st[last].len + 1;
while (p != -1 && !st[p].next.count(c)) st[p].next[c] = cur, p = st[p].link;
if (p == -1)
st[cur].link = 0;
else {
int q = st[p].next[c];
if (st[p].len + 1 == st[q].len)
st[cur].link = q;
else {
int clone = sz++;
st[clone].len = st[p].len + 1;
st[clone].next = st[q].next;
st[clone].link = st[q].link;
while (p != -1 && st[p].next[c] == q) st[p].next[c] = clone, p = st[p].link;
st[q].link = st[cur].link = clone;
}
}
last = cur;
}
};
bool is_substring(SuffixAutomaton& sa, string& query) {
int curr = 0;
for (char c : query)
if (sa.st[curr].next.count(c))
curr = sa.st[curr].next[c];
else
return false;
return true;
}
// How to use:
// Execute the code
// Type the first string so the program reads it. This will be the string
// to search substrings on.
// After that, type a substring. When pressing enter you'll get the message showing the
// result. Continue typing substrings.
int main() {
string S;
cin >> S;
SuffixAutomaton sa(S);
string query;
while (cin >> query) {
cout << "is substring? -> " << is_substring(sa, query) << endl;
}
}

You can use std::string::find for this as well.
Here is an example from one of my projects; it includes a few extras.
Look at the if statements!
/*
Every C++ program should have an entry point. Usually, this is the main function.
Every C++ Statement ends with a ';' (semi-colon)
But, pre-processor statements do not have ';'s at end.
Also, every console program can be ended using "cin.get();" statement, so that the console won't exit instantly.
*/
#include <string>
#include <bits/stdc++.h> //Can Use instead of iostream. Also should be included to use the transform function.
using namespace std;
int main(){ //The main function. This runs first in every program.
string input;
while(input!="exit"){
cin>>input;
transform(input.begin(),input.end(),input.begin(),::tolower); //Converts to lowercase.
if(input.find("name") != std::string::npos){ //Gets a boolean value regarding the availability of the said text.
cout<<"My Name is AI \n";
}
if(input.find("age") != std::string::npos){
cout<<"My Age is 2 minutes \n";
}
}
}
