How can I match the \0 character in a regex in C++?

I need to match the text '\0' with the same regex that I would match 'a' or 'b'. (a regex for a character constant in C++). I've tried a bunch of different regexes, but haven't gotten a successful one yet. My latest attempt:
^['].|\\0[']
Most of the other things I've tried have given seg faults, so this is really the closest I've gotten.

This works pretty nicely with what I've tested ('a','b','\0').
If you don't have std::regex or boost::regex I guess what you can get out of it is the fact that the regex I used is ('.'|'\\0').
#include <boost/regex.hpp>
#include <string>
#include <iostream>
#include <vector>
int main() {
std::vector<std::string> strings;
strings.push_back(R"('a')");
strings.push_back(R"('b')");
strings.push_back(R"('\0')");
boost::regex rgx(R"(('.'|'\\0'))");
boost::smatch match;
for(auto& i : strings) {
if(boost::regex_match(i,match, rgx)) {
boost::ssub_match submatch = match[1];
std::cout << submatch.str() << '\n';
}
}
}

There's nothing magic about '\0'; it's just a character like any other, and there's almost nothing special you have to do to use it in a regular expression. The only problem you might run into is if you use it in the middle of a string literal that you pass to a function that treats it as the end of the string. To avoid that, put it into a std::string with an explicit length:
const char s[] = "a\0b";
std::string not_my_str(s); // not_my_str holds "a"
std::string str(s, 3); // str holds "a\0b"
Once you've constructed the string object, the embedded '\0' gets no special treatment, unless of course you copy the contents with a function that treats it specially.

The regex that works (in this instance, using the C header ) is:
^('(.|([\\]0))')
Thanks to @WhozCraig for the help!

Related

Why does std::views::split() compile but not split with an unnamed string literal as a pattern?

When std::views::split() gets an unnamed string literal as a pattern, it will not split the string but works just fine with an unnamed character literal.
#include <iomanip>
#include <iostream>
#include <ranges>
#include <string>
#include <string_view>
int main(void)
{
using namespace std::literals;
// returns the original string (not splitted)
auto splittedWords1 = std::views::split("one:.:two:.:three", ":.:");
for (const auto word : splittedWords1)
std::cout << std::quoted(std::string_view(word));
std::cout << std::endl;
// returns the splitted string
auto splittedWords2 = std::views::split("one:.:two:.:three", ":.:"sv);
for (const auto word : splittedWords2)
std::cout << std::quoted(std::string_view(word));
std::cout << std::endl;
// returns the splitted string
auto splittedWords3 = std::views::split("one:two:three", ':');
for (const auto word : splittedWords3)
std::cout << std::quoted(std::string_view(word));
std::cout << std::endl;
// returns the original string (not splitted)
auto splittedWords4 = std::views::split("one:two:three", ":");
for (const auto word : splittedWords4)
std::cout << std::quoted(std::string_view(word));
std::cout << std::endl;
return 0;
}
See live @ godbolt.org.
I understand that string literals are always lvalues. Even so, I am missing some important piece of information that connects everything together. Why can I pass the string that I want to split as an unnamed string literal, whereas it fails (as in: returns a range of ranges containing the original string) when I do the same with the pattern?
String literals always end with a null-terminator, so ":.:" is actually a range with the last element of \0 and a size of 4.
Since the original string does not contain such a pattern, it is not split.
When dealing with C++20 ranges, I strongly recommend using string_view instead of raw string literals, which works well with <ranges> and can avoid the error-prone null-terminator issue.
This answer is completely correct, I'd just like to add a couple additional notes that might be interesting.
First, if you use {fmt} for printing, it's a lot easier to see what's going on, since you also don't have to write your own loop. You can just write this:
fmt::print("{}\n", rv::split("one:.:two:.:three", ":.:"));
Which will output (this is the default output for a range of range of char):
[[o, n, e, :, ., :, t, w, o, :, ., :, t, h, r, e, e, ]]
In C++23, there will be a way to directly specify that this print as a range of strings, but that hasn't been added to {fmt} yet. In the meantime, because split preserves the initial range category, you can add:
auto to_string_views = std::views::transform([](auto sr){
return std::string_view(sr.data(), sr.size());
});
And then:
fmt::print("{}\n", std::views::split("one:.:two:.:three", ":.:") | to_string_views);
prints:
["one:.:two:.:three\x00"]
Note the visibly trailing zero. Likewise, the next three attempts format as:
["one", "two", "three\x00"]
["one", "two", "three\x00"]
["one:two:three\x00"]
The fact that we can clearly see the \x00 helps track down the issue.
Next, consider the difference between:
std::views::split("one:.:two:.:three", ":.:")
and
"one:.:two:.:three" | std::views::split(":.:")
We typically consider these to be equivalent, but they're... not entirely. In the latter case, the library has to capture and stash these values - which involves decaying them. In this case, because ":.:" decays into char const*, that's no longer a valid pattern for the incoming string literal. So the above doesn't actually compile.
Now, it'd be great if it both compiled and also worked correctly. Unfortunately, it's impossible to tell in the language between a string literal (where you don't want to include the null terminator) and an array of char (where you want to include the whole array). So at least, with this latter formulation, you can get the wrong thing to not compile. And at least - "doesn't compile" is better than "compiles and does something wildly different from what I expected"?

Replace single backslash with double in a string c++

I am trying to replace one backslash with two. To do that I tried using the following code
str = "d:\test\text.txt"
str.replace("\\","\\\\");
The code does not work. The whole idea is to pass str to the deletefile function, which requires double backslashes.
Since C++11, you can try using std::regex:
#include <regex>
#include <string>
#include <iostream>
int main() {
auto s = std::string(R"(\tmp\)");
s = std::regex_replace(s, std::regex(R"(\\)"), R"(\\)");
std::cout << s << std::endl;
}
A bit overkill, but it does the trick if you want a "quick" solution.
There are two errors in your code.
First line: you forgot to double the \ in the literal string.
It happens that \t is a valid escape representing the tab character, so you get no compiler error, but your string doesn't contain what you expect.
Second line: according to the reference for string::replace, you can replace a substring with another substring based on the substring's position.
However, there is no overload that performs a substitution, i.e. replaces all occurrences of a given substring with another one.
This doesn't exist in the standard library. It exists, for example, in the Boost library; see the Boost string algorithms. The algorithm you are looking for is called replace_all.

C++ std::regex confusion

While working on a solution to this question, I came up with the following c++ regex:
#include <regex>
#include <string>
#include <iostream>
std::string remove_password(std::string const& input)
{
// I think this should work for skipping escaped quotes in the password.
// It works in javascript, but not in the standard library implementation.
// anyone have any ideas?
// (.*password\(("|'))(?:\\\2|[^\2])*?(\2.*)
// const char prog[] = R"__regex((.*password\(')([^']*)('.*)))__regex";
const char prog[] = R"__regex((.*password\(("|'))(?:\\\2|[^\2])*?(\2.*))__regex";
auto reg = std::regex(prog, std::regex_constants::syntax_option_type::ECMAScript);
std::smatch match;
std::regex_match(input, match, reg);
// match[0] is the entire string
// match[1] is pre-password
// match[2] is the password
// match[3] is post-password
return match[1].str() + "********" + match[3].str();
}
int main()
{
using namespace std::literals;
auto test_string = R"__(select * from run_on_hive(server('hdp230m2.labs.teradata.com'),username('vijay'),password('vijay'),dbname('default'),query('analyze table default.test01 compute statistics'));)__";
std::cout << remove_password(test_string);
}
I wanted to capture passwords, even if they contained an escaped quote or double-quote.
However the regex does not compile in clang or gcc.
It compiles correctly in regex101.com when using the javascript syntax.
Am I wrong, or is the implementation incorrect?
Note that ECMAScript is the default flavor in C++ std::regex, you do not have to specify it explicitly. At any rate, std::regex_constants::syntax_option_type::ECMAScript causes one error here since the compiler expects a std::regex_constants value here, and the simplest fix is to remove it or use std::regex(prog, std::regex_constants::ECMAScript).
The [^\2] pattern causes the second issue, Unexpected character in bracket expression. You cannot use backreferences inside bracket expressions, but you may use a negative lookahead to restrict a . / [^] pattern to match anything but what Group 2 holds.
Use
const char prog[] = R"((.*password\((["']))(?:\\\2|(?!\2)[^])*?(\2.*))";
See your fixed C++ demo.
However, it seems you may use a "cleaner" approach using std::regex_replace:
std::string remove_password(std::string const& input)
{
const char prog[] = R"((.*password\((["']))(?:\\\2|(?!\2)[^])*?(\2.*))";
auto reg = std::regex(prog);
return std::regex_replace(input, reg, "$1********$3");
}
See another C++ demo. The $1 and $3 are the placeholders for Group 1 and 3 values.

C++ Escape occurrences of \ in a string

Is there a simple way to escape all occurrences of \ in a string? I start with the following string:
#include <string>
#include <iostream>
std::string escapeSlashes(std::string str) {
// I have no idea what to do here
return str;
}
int main () {
std::string str = "a\b\c\d";
std::cout << escapeSlashes(str) << "\n";
// Desired output:
// a\\b\\c\\d
return 0;
}
Basically, I am looking for the inverse to this question. The problem is that I cannot search for \ in the string, because C++ already treats it as an escape sequence.
NOTE: I am not able to change the string str in the first place. It is parsed from a LaTeX file. Thus, this answer to a similar question does not apply. Edit: The parsing failed due to an unrelated problem; the question here is about string literals.
Edit: There are nice solutions to find and replace known escape sequences, such as this answer. Another option is to use boost::regex("\p{cntrl}"). However, I haven't found one that works for unknown (erroneous) escape sequences.
You can use raw string literal. See http://en.cppreference.com/w/cpp/language/string_literal
#include <string>
#include <iostream>
int main() {
std::string str = R"(a\b\c\d)";
std::cout << str << "\n";
return 0;
}
Output:
a\b\c\d
It is not possible to convert the string literal a\b\c\d to a\\b\\c\\d, i.e. escaping the backslashes.
Why? Because the compiler converts \c and \d directly to c and d, respectively, giving you a warning about Unknown escape sequence \c and Unknown escape sequence \d (\b is fine as it is a valid escape sequence). This happens directly to the string literal before you have any chance to work with it.
To see this, you can compile to assembler
gcc -S main.cpp
and you will find the following line somewhere in your assembler code:
.string "a\bcd"
Thus, your problem is either in your parsing function or you use string literals for experimenting and you should use raw strings R"(a\b\c\d)" instead.

Write a program that reads a string of characters including punctuation and writes what was read but with the punctuation removed

Here's my attempt at it:
#include <iostream>
#include<string>
using namespace std;
int main()
{
string s("hello world!..!.");
for (auto &c : s)
if(!ispunct(c))
{
cout<<s;
}
}
Here's the output
hello world!..!.hello world!..!.hello world!..!.hello world!..!.hello world!..!.
hello world!..!.hello world!..!.hello world!..!.hello world!..!.hello world!..!.
hello world!..!.
Here's another attempt:
#include <iostream>
#include<string>
using namespace std;
int main()
{
string s("hello world!..!.");
for (auto &c : s)
if(!ispunct(c))
{
cout<<c;
}
}
This gives the correct output (i.e : hello world)
Why won't cout<<s; give me the correct output? After all c is a reference, so any changes to c would also apply to s. Or am I wrong about this?
This is why I don't really like the auto feature: in your case auto deduces char (so c is a char&), and nothing is removed from the string.
LE: ispunct doesn't remove the character from the string, it doesn't even know (or care) that you have a string, it only returns true if the character is punctuation character or false if not, and based on that return the cout statement is executed with the character that is not punctuation or not executed for punctuation character.
s is the entire string, so cout<<s sends the entire string to your output stream :\
Your second attempt works because you're sending individual characters to the stream. In the first attempt though, you're sending the whole string for each character that exists in the string. Count the number of non-punct characters in your string, then count the number of times the string was printed out ;)
ispunct does not remove the punctuation character. It only returns zero or non-zero to indicate whether the character is punctuation or not.
When you encounter a character that is not punctuation, it returns 0 and you enter your if. There you are printing s, which is the whole string.
Whereas with cout<<c you print only the current character (which is not punctuation, since you are inside the if).
One more variant, using STL algorithms:
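#include <algorithm>
#include <cctype>
#include <iostream>
#include <iterator>
#include <string>
using std::back_inserter;
using std::copy_if;
using std::cout;
using std::endl;
using std::string;
int main()
{
    string s("hello world!..!.");
    string result = "";
    // Copy only the non-punctuation characters into result.
    copy_if(begin(s), end(s),
            back_inserter(result),
            // Cast to unsigned char: passing a negative char to ispunct is undefined behavior.
            [](char c){ return !std::ispunct(static_cast<unsigned char>(c)); }
    );
    cout << result << endl;
}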
For real world code it's recommended to prefer suitable STL algorithms if available over a loop, because saying copy_if states your intent clearly, and does not focus on the individual steps to take. Whether or not it's better in this example I don't want to judge. But it's certainly good to keep this possibility in mind!
One more thing: writing using namespace std is generally regarded as bad practice, because it can lead to name collisions when a future version of the standard library introduces new names that you have already defined yourself in your own code. Either use the prefix std:: all the time, or follow the approach shown in my example and name only the identifiers you need; this keeps your code safe.