std::string comparison (check whether string begins with another string) - c++

I need to check whether an std::string begins with "xyz". How do I do it without searching through the whole string or creating temporary strings with substr()?

I would use compare method:
std::string s("xyzblahblah");
std::string t("xyz")
if (s.compare(0, t.length(), t) == 0)
{
// ok
}

An approach that might be more in keeping with the spirit of the Standard Library would be to define your own begins_with algorithm.
#include <algorithm>
using namespace std;

template<class TContainer>
bool begins_with(const TContainer& input, const TContainer& match)
{
    return input.size() >= match.size()
        && equal(match.begin(), match.end(), input.begin());
}
This provides a simpler interface to client code and is compatible with most Standard Library containers.
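Hypothetical usage, with the strings from the question:
std::string s("xyzblahblah");
std::string t("xyz");
if (begins_with(s, t))
{
    // ok, s begins with t
}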

Look at Boost's String Algorithms library, which has a number of useful functions such as starts_with, istarts_with (case insensitive), and so on. If you only want to use part of the Boost libraries in your project, you can use the bcp utility to copy just the files you need.
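For example (a sketch assuming Boost is available; starts_with lives in the string predicate header):
#include <boost/algorithm/string/predicate.hpp>

std::string s("xyzblahblah");
if (boost::algorithm::starts_with(s, "xyz"))
{
    // ok
}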

std::string::starts_with was added in C++20. Before that, std::string::find can be used (note that find keeps searching the rest of the string when there is no match at position 0):
std::string s1("xyzblahblah");
std::string s2("xyz");
if (s1.find(s2) == 0)
{
    // ok, s1 starts with s2
}
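For completeness, a sketch of the C++20 member function, together with a pre-C++20 rfind variant that also stops after comparing the prefix:
std::string s1("xyzblahblah");
std::string s2("xyz");

// C++20: starts_with only compares the prefix.
if (s1.starts_with(s2))
{
    // ok
}

// Pre-C++20: rfind anchored at position 0 never scans past the prefix.
if (s1.rfind(s2, 0) == 0)
{
    // ok
}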

I feel I'm not fully understanding your question. It looks as though it should be trivial:
s[0]=='x' && s[1]=='y' && s[2]=='z'
This only looks at (at most) the first three characters. The generalisation for a string which is unknown at compile time would require you to replace the above with a loop:
// look for t at the start of s (inside a function returning bool)
if (s.length() < t.length())
    return false;
for (size_t i = 0; i < t.length(); i++)
{
    if (s[i] != t[i])
        return false;
}
return true;

Related

std::regex escape special characters for use in regex

I'm trying to create a std::regex(__FILE__) as part of a unit test which checks some exception output that prints the file name.
On Windows it fails with:
regex_error(error_escape): The expression contained an invalid escaped character, or a trailing escape.
because the __FILE__ macro expansion contains un-escaped backslashes.
Is there a more elegant way to escape the backslashes than to loop through the resulting string (i.e. with a std algorithm or some std::string function)?
File paths can contain many characters that have special meaning in regular expression patterns. Escaping just the backslashes is not enough for robust checking in the general case.
Even a simple path, like C:\Program Files (x86)\Vendor\Product\app.exe, contains several special characters. If you want to turn that into a regular expression (or part of a regular expression), you would need to escape not only the backslashes but also the parentheses and the period (dot).
Fortunately, we can solve our regular expression problem with more regular expressions:
std::string EscapeForRegularExpression(const std::string &s) {
    // Escape every regex metacharacter, including the backslash itself.
    static const std::regex metacharacters(R"([\.\^\$\-\+\(\)\[\]\{\}\|\?\*\\])");
    return std::regex_replace(s, metacharacters, "\\$&");
}
(File paths can't contain * or ?, but I've included them to keep the function general.)
If you don't abide by the "no raw loops" guideline, a probably faster implementation would avoid regular expressions:
std::string EscapeForRegularExpression(const std::string &s) {
    // Needs <cstring> for std::strchr. The raw string below begins with a
    // literal backslash, so backslashes get escaped as well.
    static const char metacharacters[] = R"(\.^$-+()[]{}|?*)";
    std::string out;
    out.reserve(s.size());
    for (auto ch : s) {
        if (std::strchr(metacharacters, ch))
            out.push_back('\\');
        out.push_back(ch);
    }
    return out;
}
Although the loop adds some clutter, this approach allows us to drop a level of escaping on the definition of metacharacters, which is a readability win over the regex version.
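Either version can then be used to build the pattern from the question, e.g.:
std::regex pattern(EscapeForRegularExpression(__FILE__));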
Here is polymapper.
It takes an operation that takes an element and returns a range, the "map operation".
It produces a function object that takes a container, and applies the "map operation" to each element. It returns the same type as the container, where each element has been expanded/contracted by the "map operation".
#include <iterator>
#include <type_traits>
#include <utility>
#include <vector>

template<class Op>
auto polymapper( Op&& op ) {
    return [op=std::forward<Op>(op)](auto&& r) {
        using std::begin;
        using R = std::decay_t<decltype(r)>;
        using iterator = decltype( begin(r) );
        using T = typename std::iterator_traits<iterator>::value_type;
        std::vector<T> data;
        for (auto&& e : decltype(r)(r)) {
            for (auto&& out : op(e)) {
                data.push_back(out);
            }
        }
        return R{ data.begin(), data.end() };
    };
}
Here is escape_stuff:
auto escape_stuff = polymapper([](char c) -> std::vector<char> {
    if (c != '\\') return {c};
    else return {c, c};
});
live example.
#include <iostream>
#include <string>

int main() {
    std::cout << escape_stuff(std::string(__FILE__)) << "\n";
}
The advantage of this approach is that the action of messing with the guts of the container is factored out. You write code that messes with the characters or elements, and the overall logic is not your problem.
The disadvantage is polymapper is a bit strange, and needless memory allocations are done. (Those could be optimized out, but that makes the code more convoluted).
EDIT
In the end, I switched to @AdrianMcCarthy's more robust approach.
Here's the inelegant method in which I solved the problem in case someone stumbles on this actually looking for a workaround:
std::string escapeBackslashes(const std::string& s)
{
    std::string out;
    for (auto c : s)
    {
        out += c;
        if (c == '\\')
            out += c;
    }
    return out;
}
and then
std::regex(escapeBackslashes(__FILE__));
It's O(N) which is probably as good as you can do here, but involves a lot of string copying which I'd like to think isn't strictly necessary.

Does boost or STL have a predicate for comparing two char values?

I need to call boost::trim_left_if on a single char value:
// pseudo code; not tested.
std::string to_trim{"hello"};
char left_char = 'h';
boost::algorithm::trim_left_if(to_trim, /*left_char?*/);
In the last line above, I need a way to pass in the char value. I looked around but I didn't see a generic predicate in Boost or STL to simply compare two arbitrary values. I could use lambdas for this but would prefer the predicate, if one exists.
One thing I want to avoid here is using boost::is_any_of() or any other predicates that would require left_char to be converted to a string.
Since C++11, the idiomatic way to compare for equality to a fixed value is to use a bind-expression with std::equal_to:
boost::algorithm::trim_left_if(to_trim,
std::bind(std::equal_to<>{}, left_char, std::placeholders::_1));
This uses the transparent predicate std::equal_to<void> (since C++14); in C++11 use std::equal_to<char>.
Prior to C++11 (and until its removal in C++17) you could use std::bind1st in place of std::bind and std::placeholders::_1.
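Put together, a minimal compilable sketch of this approach (assuming Boost.StringAlgo is available) looks like this:
#include <functional>
#include <iostream>
#include <string>
#include <boost/algorithm/string/trim.hpp>

int main()
{
    std::string to_trim{"hello"};
    char left_char = 'h';
    boost::algorithm::trim_left_if(
        to_trim,
        std::bind(std::equal_to<>{}, left_char, std::placeholders::_1));
    std::cout << to_trim << std::endl; // prints "ello"
}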
Remaining within Boost, you could also use boost::algorithm::is_any_of with a unitary range; I find boost::assign::list_of works well:
boost::algorithm::trim_left_if(to_trim,
boost::algorithm::is_any_of(boost::assign::list_of(left_char)));
Why not just write one?
#include <iostream>
#include <string>
#include <boost/algorithm/string/trim.hpp>

struct Pred
{
    Pred(char ch) : ch_(ch) {}
    bool operator () ( char c ) const { return ch_ == c; }
    char ch_;
};

int main ()
{
    std::string to_trim{"hello"};
    char left_char = 'h';
    boost::algorithm::trim_left_if(to_trim, Pred(left_char));
    std::cout << to_trim << std::endl;
}
Seriously - the stuff in boost is not "sent down from on high", it's written by people like you and me.

perl regex faster than c++/boost

I wrote a CGI script for my website which reads through blocks of text and matches all occurrences of English words. I've been making some fundamental changes to the site's code recently which have necessitated rewriting most of it in C++. As I'd hoped, almost everything has become much faster in C++ than perl, with the exception of this function.
I know that regexes are a relatively recent addition to C++ and not necessarily its strongest suit. It may simply be the case that it is slower than perl in this instance. But I wanted to share my code in the hopes that someone might be able to find a way of speeding up what I am doing in C++.
Here is the perl code:
open(WORD, "</file/path/wordthree.txt") || die "opening";
while (<WORD>) {
    chomp;
    push @wordlist, $_;
}
close(WORD) || die "closing";

foreach (@wordlist) {
    while ($bloc =~ m/$_/g) {
        $location = pos($bloc) - length($_);
        $match = $location . ";" . pos($bloc) . ";" . $_;
        push(@hits, $match);
    }
}
wordthree.txt is a list of ~270,000 English words separated by new lines, and $bloc is 3200 characters of text. Perl performs these searches in about one second. You can see it in play here if you like: http://libraryofbabel.info/anglishize.cgi?05y-w1-s3-v20:1
With C++ I have tried the following:
typedef std::map<std::string::difference_type, std::string> hitmap;
hitmap hits;

void regres(const boost::match_results<std::string::const_iterator>& what) {
    hits[what.position()] = what[0].str();
}

words.open("/file/path/wordthree.txt");
std::string wordlist[274784];
unsigned i = 0;
while (words >> wordlist[i]) { i++; }
words.close();

for (unsigned i = 0; i < 274783; i++) {
    boost::regex word(wordlist[i]);
    boost::sregex_iterator lex(book.begin(), book.end(), word);
    boost::sregex_iterator end;
    std::for_each(lex, end, &regres);
}
The C++ version takes about 12 seconds to read the same amount of text the same number of times. Any advice on how to make it competitive with the perl script is greatly appreciated.
Firstly I'd cut down on the number of allocations:
use string_ref instead of std::string where possible
use mapped files instead of reading it all in memory ahead of time
use const char* instead of std::string::const_iterator to navigate the book
Here is a sample that uses Boost Spirit Qi to parse the wordlist (I don't have yours, so I assume line-separated words).
std::vector<sref> wordlist;
io::mapped_file_source mapped("/etc/dictionaries-common/words");
qi::parse(mapped.begin(), mapped.end(), qi::raw[+(qi::char_ - qi::eol)] % qi::eol, wordlist);
In full Live On Coliru¹
#include <algorithm>
#include <iostream>
#include <iterator>
#include <map>
#include <string>
#include <vector>
#include <boost/regex.hpp>
#include <boost/utility/string_ref.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/iostreams/device/mapped_file.hpp>

namespace qi = boost::spirit::qi;
namespace io = boost::iostreams;

using sref = boost::string_ref;
using regex = boost::regex;

namespace boost { namespace spirit { namespace traits {
    template <typename It>
    struct assign_to_attribute_from_iterators<sref, It, void> {
        static void call(It f, It l, sref& attr) { attr = { f, size_t(std::distance(f, l)) }; }
    };
} } }

typedef std::map<std::string::difference_type, sref> hitmap;
hitmap hits;

void regres(const boost::match_results<const char*>& what) {
    hits[what.position()] = sref(what[0].first, what[0].length());
}

int main() {
    std::vector<sref> wordlist;
    io::mapped_file_source mapped("/etc/dictionaries-common/words");
    qi::parse(mapped.begin(), mapped.end(), qi::raw[+(qi::char_ - qi::eol)] % qi::eol, wordlist);
    std::cout << "Wordlist contains " << wordlist.size() << " entries\n";

    io::mapped_file_source book("/etc/dictionaries-common/words");
    for (auto const& s : wordlist) {
        regex word(s.to_string());
        boost::cregex_iterator lex(book.begin(), book.end(), word), end;
        std::for_each(lex, end, &regres);
    }
}
Next step
This still creates a regex each iteration. I have a suspicion it will be a lot more efficient if you combine it all into a single pattern. You'll spend more memory/CPU creating that regex, but you'll eliminate the outer loop over the entries in the word list.
Because the regex library might not have been designed for this scale, you could have better results with a custom search and a trie implementation.
Here's a simple attempt (that is indeed much faster for my /etc/dictionaries-common/words file of 99171 lines):
Live On Coliru
#include <algorithm>
#include <map>
#include <string>
#include <boost/regex.hpp>
#include <boost/utility/string_ref.hpp>
#include <boost/iostreams/device/mapped_file.hpp>

namespace io = boost::iostreams;

using sref = boost::string_ref;
using regex = boost::regex;

typedef std::map<std::string::difference_type, sref> hitmap;
hitmap hits;

void regres(const boost::match_results<const char*>& what) {
    hits[what.position()] = sref(what[0].first, what[0].length());
}

int main() {
    // Map the word list privately so the newlines can be rewritten in place,
    // turning the whole file into one big alternation: word1|word2|...
    io::mapped_file_params params("/etc/dictionaries-common/words");
    params.flags = io::mapped_file::mapmode::priv;
    io::mapped_file mapped(params);
    std::replace(mapped.data(), mapped.end(), '\n', '|');

    regex const wordlist(mapped.begin(), mapped.end() - 1);

    io::mapped_file_source book("/etc/dictionaries-common/words");
    boost::cregex_iterator lex(book.begin(), book.end(), wordlist), end;
    std::for_each(lex, end, &regres);
}
¹ of course coliru doesn't have a suitable wordlist
It looks to me like Perl is smart enough to figure out that you're abusing regular expressions to do an ordinary linear search, a straightforward lookup. You are looking up straight text, and none of your search patterns appear to be, well, a pattern. Based on your description, all your search patterns look like ordinary strings, so Perl is likely optimizing it down to a linear string search.
I am not familiar with Boost's internal implementation of regular expression matching, but it's likely that it's compiling each search string into a state machine, and then executing the state machine for each search. That's the usual approach used with generic regular expression implementations. And that's a lot of work. A lot of completely needless work, in this specific case.
What you should do is as follows:
1) You are reading wordthree.txt into an array of strings. Instead, read it into a std::set<std::string>.
2) You are reading the entire text to search into a single book container. It's not clear, based on your code, whether book is a single std::string, or a std::vector<char>. But whatever the case, don't do that. Read the text to search iteratively, one word at a time. For each word, look it up in the std::set, and go from there.
This is, after all, what you're trying to do directly, and you should do that instead of taking a grand detour through the wonders of regular expressions, which accomplishes very little other than wasting a lot of time.
If you implement this correctly, you'll likely see C++ being just as fast, if not faster, than Perl.
I can also think of several other, more aggressively optimized approaches that also leverage std::set, but with custom classes and comparators that avoid the heap allocations inherent in using a bunch of std::strings; that probably won't be necessary, though. A basic std::set-based lookup should be fast enough.
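A minimal sketch of that idea (using the hypothetical file path from the question; it assumes words in the text are whitespace-separated and leaves out the position tracking the Perl script does):
#include <fstream>
#include <iostream>
#include <set>
#include <sstream>
#include <string>

int main()
{
    // 1) Load the word list into a std::set for logarithmic lookups.
    std::set<std::string> wordlist;
    std::ifstream words("/file/path/wordthree.txt");
    std::string w;
    while (words >> w)
        wordlist.insert(w);

    // 2) Walk the text one word at a time and look each word up.
    std::string bloc = "..."; // the 3200 characters of text to search
    std::istringstream text(bloc);
    while (text >> w)
    {
        if (wordlist.count(w))
            std::cout << w << '\n';
    }
}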

String handling in C++

How do I write a function in C++ that takes a string s and an integer n as input and gives as output a string that has spaces placed every n characters in s?
For example, if the input is s = "abcdefgh" and n = 3 then the output should be "abc def gh"
EDIT:
I could have used loops for this but I am looking for concise and an idiomatic C++ solution (i.e. the one that uses algorithms from STL).
EDIT:
Here's how I would I do it in Scala (which happens to be my primary language):
def drofotize(s: String, n: Int) = s.grouped(n).toSeq.flatMap(_ + " ").mkString
Is this level of conciseness possible with C++? Or do I have to use explicit loops after all?
Copy each character in a loop and, when i > 0 && i % n == 0 (with i indexing the source string), add an extra space to the destination string before copying s[i].
As for the Standard Library, you could write your own back_inserter-style output iterator that adds the extra spaces, and then use it as follows:
std::copy( str1.begin(), str1.end(), my_back_inserter(str2, n) );
but I would say that writing such a functor is just a waste of your time. It is much simpler to write a function copy_with_spaces with a good old for-loop in it.
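For example, a plain-loop copy_with_spaces might look like this (a sketch; it assumes n > 0, and the function name is just the one suggested above):
#include <string>

// copy_with_spaces("abcdefgh", 3) == "abc def gh"
std::string copy_with_spaces(const std::string& s, size_t n)
{
    std::string out;
    out.reserve(s.size() + s.size() / n);
    for (size_t i = 0; i < s.size(); ++i)
    {
        if (i > 0 && i % n == 0)
            out += ' ';   // start of a new group of n characters
        out += s[i];
    }
    return out;
}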
STL algorithms don't really provide anything like this. Best I can think of:
#include <string>
using namespace std;

string drofotize(const string &s, size_t n)
{
    if (s.size() <= n)
    {
        return s;
    }
    return s.substr(0, n) + " " + drofotize(s.substr(n), n);
}

How to use a hash_map with case insensitive unicode string for key?

I'm very new to STL, and pretty new to C++ in general. I'm trying to get the equivalent of a .NET Dictionary<string, value>(StringComparer.OrdinalIgnoreCase) but in C++. This is roughly what I'm trying:
stdext::hash_map<LPCWSTR, SomeStruct> someMap;
someMap.insert(std::pair<LPCWSTR, SomeStruct>(L"a string", someStruct));
someMap.find(L"a string");
someMap.find(L"A STRING");
The trouble is, neither find operation usually works (it returns someMap.end()). It seems to sometimes work, but most of the time it doesn't. I'm guessing that the hash function the hash_map is using is hashing the memory address of the string instead of the content of the string itself, and it's almost certainly not case insensitive.
How can I get a dictionary-like structure that uses case-insensitive keys and can store my custom struct?
The hash_map documentation you link to indicates that you can supply your own traits class as a third template parameter. This must satisfy the same interface as hash_compare.
Scanning the docs, I think that what you have to do is this, which basically replaces the use of StringComparer.OrdinalIgnoreCase you had in your Dictionary:
struct my_hash_compare {
    static const size_t bucket_size = 4;
    static const size_t min_buckets = 8;

    size_t operator()(const LPCWSTR &Key) const {
        // implement a case-insensitive hash function here,
        // or find something in the Windows libraries.
    }

    bool operator()(const LPCWSTR &Key1, const LPCWSTR &Key2) const {
        // implement a case-insensitive comparison function here
        return _wcsicmp(Key1, Key2) < 0;
        // or something like that. There's warnings about
        // locale plastered all over this function's docs.
    }
};
I'm worried though that the docs say that the comparison function has to be a total order, not a strict weak order as is usual for sorted containers in the C++ standard libraries. If MS really means a total order, then the hash_map might rely on it being consistent with operator==. That is, they might require that if my_hash_compare()(a,b) is false, and my_hash_compare()(b,a) is false, then a == b. Obviously that's not true for what I've written, in which case you're out of luck.
As an alternative, which in any case is probably more efficient, you could push all the keys to a common case before using them in the map. A case-insensitive comparison is more costly than a regular string comparison. There's some Unicode gotcha to do with that which I can never quite remember, though. Maybe you have to convert -> lowercase -> uppercase, instead of just -> uppercase, or something like that, in order to avoid some nasty cases in certain languages or with titlecase characters. Anyone?
Also as other people said, you might not really want LPCWSTR as your key. This will store pointers in the map, which means that anyone who inserts a string has to ensure that the data it points to remains valid as long as it's in the hash_map. It's often better in the long run for hash_map to keep a copy of the key string passed to insert, in which case you should use wstring as the key.
There was some great information given here. I gathered bits and pieces from the answers and put this one together:
#include "stdafx.h"
#include "atlbase.h"
#include <map>
#include <wchar.h>
typedef std::pair<std::wstring, int> MyPair;
struct key_comparer
{
bool operator()(std::wstring a, std::wstring b) const
{
return _wcsicmp(a.c_str(), b.c_str()) < 0;
}
};
int _tmain(int argc, _TCHAR* argv[])
{
std::map<std::wstring, int, key_comparer> mymap;
mymap.insert(MyPair(L"GHI",3));
mymap.insert(MyPair(L"DEF",2));
mymap.insert(MyPair(L"ABC",1));
std::map<std::wstring, int, key_comparer>::iterator iter;
iter = mymap.find(L"def");
if (iter == mymap.end()) {
printf("No match.\n");
} else {
printf("match: %i\n", iter->second);
}
return 0;
}
If you use an std::map instead of the non-standard hash_map, you can set the comparison function to be used when doing the binary search:
// Function object for case insensitive comparison
// (needs <algorithm> for std::for_each and <string>)
struct case_insensitive_compare
{
    case_insensitive_compare() {}

    // Function object's overloaded operator().
    // When used as a comparer, it should behave like operator<(a, b).
    bool operator()(const std::string& a, const std::string& b) const
    {
        return to_lower(a) < to_lower(b);
    }

    std::string to_lower(const std::string& a) const
    {
        std::string s(a);
        std::for_each(s.begin(), s.end(), char_to_lower);
        return s;
    }

    // static so it can be passed to std::for_each as a plain function
    static void char_to_lower(char& c)
    {
        if (c >= 'A' && c <= 'Z')
            c += ('a' - 'A');
    }
};
// ...
std::map<std::string, std::string, case_insensitive_compare> someMap;
someMap["foo"] = "Hello, world!";
std::cout << someMap["FOO"] << std::endl; // Hello, world!
LPCWSTR is a pointer to a null-terminated array of unicode characters and probably not what you want in this case. Use the wstring specialization of basic_string instead.
For case-insensitivity, you would need to convert the keys to all upper case or all lower case before you insert and search. At least I don't think you can do it any other way.
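A minimal sketch of that idea with std::wstring keys (towlower covers the ASCII range; full Unicode case folding would need something like ICU, and the struct name here is just a placeholder):
#include <algorithm>
#include <cwctype>
#include <map>
#include <string>

struct SomeStruct { int value; };

// Fold a key to lower case so look-ups become effectively case-insensitive.
std::wstring normalize_key(std::wstring s)
{
    std::transform(s.begin(), s.end(), s.begin(),
                   [](wchar_t c) { return static_cast<wchar_t>(std::towlower(c)); });
    return s;
}

int main()
{
    std::map<std::wstring, SomeStruct> someMap;
    someMap[normalize_key(L"A String")] = SomeStruct{42};

    // Look-ups normalize the same way, so the original case no longer matters.
    auto it = someMap.find(normalize_key(L"a STRING"));
    return (it != someMap.end()) ? 0 : 1;
}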