g++ regex crash on (possibly unsyntactic) expression


I figure the following program should either complain it can't compile the regular expression or else treat it as legal and compile it fine (I don't have the standard so I can't say for sure whether the expression is strictly legal; certainly reasonable interpretations are possible). Anyway, what happens with g++ (Ubuntu/Linaro 4.8.1-10ubuntu9) 4.8.1 is that, when run, it crashes hard
*** Error in `./a.out': free(): invalid next size (fast): 0x08b51248 ***
in the guts of the library.
Questions are:
a) It's a bug, right? I assume (perhaps incorrectly) the standard doesn't say std::regex can crash if it doesn't like the syntax. (msvc eats it fine, fwiw)
b) if it's a bug, is there some easy way to see whether it's been reported or not (my first time poking around gnu-land bug systems was intimidating)?
#include <iostream>
#include <regex>
int main()
{
    const char* Pattern = "^(%%)|";
    std::regex Machine;
    try {
        Machine = Pattern;
    }
    catch (const std::regex_error& e)  // catch by const reference, not by value
    {
        std::cerr << "regex could not compile pattern: "
                  << Pattern << "\n"
                  << e.what() << std::endl;
        throw;
    }
    return 0;
}

I would put this in a comment, but I can't, so...
I don't know if you already know, but it seems to be the pipe | character at the end that's causing your problems. "^(%%)|a" works fine for me, so it looks like g++'s handling of | as the last character of the pattern is making a mess by the time regex calls free().
The standard (or at least the online draft I'm reading) claims that:
28.8 Class template basic_regex [re.regex]
1 For a char-like type charT, specializations of class template basic_regex represent regular expressions constructed from character sequences of charT characters. In the rest of 28.8, charT denotes a given char-like type. Storage for a regular expression is allocated and freed as necessary by the member functions of class basic_regex.
2 Objects of type specialization of basic_regex are responsible for converting the sequence of charT objects to an internal representation. It is not specified what form this representation takes, nor how it is accessed by algorithms that operate on regular expressions. [ Note: Implementations will typically declare some function templates as friends of basic_regex to achieve this — end note ]
and later,
basic_regex& operator=(const charT* ptr);
3 Requires: ptr shall not be a null pointer.
4 Effects: returns assign(ptr).
So unless g++ thinks const char* Pattern ="|"; is a null ptr (I would imagine not...),
I guess it's a bug?
EDIT: Incidentally, consecutive || (even when not at the end) seem to cause a segmentation fault for me also.
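For what it's worth, the behaviour the question expects (either the pattern compiles or std::regex_error is thrown) can be probed with a small helper. This is only a sketch: try_compile is a made-up name, and it obviously cannot protect against a library bug like the heap corruption above. As far as I know, libstdc++ only gained a complete <regex> implementation in GCC 4.9, so results on 4.8 are not representative.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not part of any library): returns true if the
// pattern compiles, false if the library rejects it with std::regex_error.
bool try_compile(const std::string& pattern)
{
    try {
        std::regex machine(pattern);  // throws std::regex_error on bad syntax
        (void)machine;                // only the construction matters here
        return true;
    } catch (const std::regex_error&) {
        return false;
    }
}
```

A well-formed pattern such as "a|b" should return true and an unbalanced one such as "(" should return false; what a given library does with "^(%%)|" is exactly the open question here.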


Can std::string::c_str() be used whenever a string literal is expected?

I would guess that the last two lines in this code should compile.
#include "rapidjson/document.h"
int main(){
using namespace rapidjson ;
using namespace std ;
Document doc ;
Value obj(kObjectType) ;
obj.AddMember("key", "value", doc.GetAllocator()) ; //this compiles fine
obj.AddMember("key", string("value").c_str(), doc.GetAllocator()) ; //this does not compile!
}
My guess would be wrong, though. One line compiles and the other does not.
The AddMember method has several variants as documented here, but beyond that... why is the return of .c_str() not equivalent to a string literal?
My understanding was that wherever a string literal was accepted, you could pass string::c_str() and it should work.
PS: I'm compiling with VC++ 2010.
EDIT:
The lack of #include <string> is not the problem. It's already included by document.h
This is the error:
error C2664: 'rapidjson::GenericValue<Encoding> &rapidjson::GenericValue<Encoding>::AddMember(rapidjson::GenericValue<Encoding> &,rapidjson::GenericValue<Encoding> &,Allocator &)'
: cannot convert parameter 1 from 'const char [4]' to 'rapidjson::GenericValue<Encoding> &'
with
[
Encoding=rapidjson::UTF8<>,
Allocator=rapidjson::MemoryPoolAllocator<>
]
and
[
Encoding=rapidjson::UTF8<>
]
EDIT2:
Please ignore the fact that .c_str() is called on a temporary value. This example is just meant to show the compile error. The actual code uses a string variable.
EDIT3:
Alternate version of the code:
string str("value") ;
obj.AddMember("key", "value", doc.GetAllocator()) ; //compiles
obj.AddMember("key", str, doc.GetAllocator()) ; // does not compile
obj.AddMember("key", str.c_str(), doc.GetAllocator()) ; // does not compile

The std::string::c_str() method returns a char const*. The type of a string literal is char const[N] where N is the number of characters in the string (including the null terminator). Correspondingly, the result of c_str() can not be used in all places where a string literal can be used!
I'd be surprised if the interface you are trying to call requires a char array, though. That is, in your use it should work. It is more likely that you need to include <string>.

Even if this code compiled:
obj.AddMember("key2", string("value").c_str(), doc.GetAllocator());
You cannot guarantee that it is safe.
The const char* returned by std::string::c_str() is valid only until the end of this statement, because the temporary string is destroyed at the end of the full expression.
If the AddMember method stores a copy of the string itself, all well and good. If it stores a pointer then you're doomed. You need knowledge of the inner workings of AddMember before you can reason about the correctness of your code.
I suspect the authors have already thought of this and have constructed overloads that demand that you either send in a std::string object (or equivalent) or a string literal reference (template<std::size_t N> void AddMember(const char (&str)[N]))
Even if this is not what they had in mind, they might be looking to protect you from yourself, in case you inadvertently send in an invalid pointer.
While seemingly an inconvenience, this compile time error indicates a possibly-faulty program. It's a tribute to the library's authors. Because compile time errors are a gazillion times more useful than runtime errors.

Looking at the documentation you linked to, it seems like you are trying to call the overload of AddMember taking two StringRefTypes (and an Allocator). StringRefType is a typedef for GenericStringRef<Ch>, which has two overloaded constructors taking a single argument:
template<SizeType N>
GenericStringRef(const CharType(&str)[N]) RAPIDJSON_NOEXCEPT;
explicit GenericStringRef(const CharType *str);
When you pass a string literal, the type is const char[N], where N is the length of the string + 1 (for the null terminator). This can be implicitly converted to a GenericStringRef<Ch> using the first constructor overload. However, std::string::c_str() returns a const char*, which cannot be converted implicitly to a GenericStringRef<Ch>, because the second constructor overload is declared explicit.
The error message you get from the compiler is caused by it choosing another overload of AddMember which is a closer match.
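The effect of that explicit constructor can be reproduced with a tiny stand-in class. This is a sketch only: StringRef and ref_length are hypothetical names that mimic the two constructors quoted above, they are not rapidjson's actual implementation.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <type_traits>

// Hypothetical stand-in for GenericStringRef; NOT the real rapidjson class.
struct StringRef {
    const char* data;
    std::size_t length;

    // Implicit: binds only to arrays such as string literals.
    // N counts the terminating '\0', so the text length is N - 1.
    template <std::size_t N>
    StringRef(const char (&str)[N]) : data(str), length(N - 1) {}

    // Explicit: a bare pointer must be wrapped deliberately by the caller.
    explicit StringRef(const char* str) : data(str), length(std::strlen(str)) {}
};

// Hypothetical function taking the reference type by value, like AddMember does.
std::size_t ref_length(StringRef ref) { return ref.length; }

// A literal converts implicitly; a const char* does not.
static_assert(std::is_convertible<const char (&)[6], StringRef>::value,
              "array reference converts implicitly");
static_assert(!std::is_convertible<const char*, StringRef>::value,
              "pointer needs an explicit StringRef(...)");
```

With this, ref_length("value") compiles, ref_length(s.c_str()) does not, and ref_length(StringRef(s.c_str())) compiles again because the caller has explicitly taken responsibility for the pointer's lifetime.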

Re "why is the return of .c_str() not equivalent to a string literal":
A string literal is a zero-terminated string in an array with size known at compile time.
c_str() produces a pointer to (the first item in) a zero-terminated string in an array with size known only at run-time.
Usually a string literal expression will be used in a context where the expression decays to a pointer to the first item, but in some special cases it does not decay. These cases include
binding to a reference to array,
using the sizeof operator, and
forming a larger literal by compile time concatenation of string literals (simply writing them in order).
I think that covers the common cases (initializing a char array from a literal and applying unary & are two more).
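The non-decaying cases listed above can be checked at compile time. A small sketch, where literal_size is a hypothetical helper name:

```cpp
#include <cstddef>

// Hypothetical helper: the parameter is a reference to array, so the
// literal does not decay and N (including the terminator) is deduced.
template <std::size_t N>
constexpr std::size_t literal_size(const char (&)[N]) { return N; }

// Binding to a reference to array: "key" has type const char[4].
static_assert(literal_size("key") == 4, "three characters plus the terminator");

// sizeof sees the array, not a pointer.
static_assert(sizeof("key") == 4, "sizeof of the array type");

// Adjacent literals are concatenated at compile time into one array.
static_assert(sizeof("ab" "cd") == 5, "four characters plus the terminator");
```

None of these work with the result of c_str(), because a const char* carries no size information.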
The error message you cite, "cannot convert parameter 1 from 'const char [4]' to 'rapidjson::GenericValue<Encoding> &'", does not match your presented code:
#include "rapidjson/document.h"
int main(){
using namespace rapidjson ;
using namespace std ;
Document doc ;
Value obj(kObjectType) ;
obj.AddMember("key1", "value", doc.GetAllocator()) ; //this compiles fine
obj.AddMember("key2", string("value").c_str(), doc.GetAllocator()) ; //this does not compile!
}
Nowhere in this code is there a three character long string literal.
Hence the claims that “this compiles” and “this does not compile”, are not very trustworthy.
You
should have quoted the actual error message and actual code (at least one of them is not what you had when you compiled), and
should have quoted the documentation of the function you're calling.
Also, note that the actual argument that the compiler reacts to in the quoted diagnostic is a literal or an array declared as such, not a c_str() call.

Pointer increment and decrement

I was solving a question my teacher gave me and hit a little snag.
I am supposed to give the output of the following code:(It's written in Turbo C++)
#include<iostream.h>
void main()
{
char *p = "School";
char c;
c=++(*(p++));
cout<<c<<","<<p<<endl;
cout<<p<<","<<++(*(p--))<<","<<++(*(p++))<<endl;
}
The output the program gives is:
T,chool
ijool,j,i
I got the part where the pointer itself increments and then increments the value which the pointer points to. But I don't get the part where the string prints out ijool.
Can someone help me out?

The program you showed is non-standard and ill-formed (and should not compile).
"Small" problems:
The proper header for input/output streams in C++ is <iostream>, not <iostream.h>
main() returns an int, not a void.
cout and endl cannot be used without a using namespace std; at the beginning of the file, or better: use std::cout and std::endl.
"Core" problems:
char* p = "School"; is a pointer to a string literal. This conversion is deprecated in C++03 and invalid in C++11. Aside from that, string literals are normally read-only, and attempts to modify them often result in segfaults (modifying a string literal is undefined behavior by the standard). So you have undefined behavior every time you write through p, because what it points to is the string literal.
More subtle (and the practical explanation): you are modifying p several times in the line std::cout<<p<<","<<++(*(p--))<<","<<++(*(p++))<<std::endl;. It is undefined behavior. The order used for the operations on p is not defined; here it seems the compiler starts from the right. You can look up sequence points and sequenced before/after for a better explanation.
You might be interested in the live code here, which is more like what you seemed to expect from your program.

Let's assume you correct:
the header to <iostream> - there is no iostream.h header
your uses of cout and endl with std::cout and std::endl respectively
the return type of main to int
Okay,
char *p = "School";
The string literal "School" is of type "array of 7 const char." The conversion to char* was deprecated in C++03. In C++11, this is invalid.
c=++(*(p++));
Here we hit undefined behaviour. As I said before, the chars in a string literal are const. You simply can't modify them. The prefix ++ here will attempt to modify the S character in the string literal.
So from this point onwards, there's no use making conjectures about what should happen. You have undefined behaviour. Anything can happen.

Even if the preceding lines were legal, this line is also undefined behavior, which means that you cannot accurately predict what the output will be:
cout<<p<<","<<++(*(p--))<<","<<++(*(p++))<<endl;
Notice how it modifies the value of p multiple times on that line (really between sequence points)? That's not allowed. At best you can say "on this compiler with this run-time library and this environment at this moment of execution I observed the following behavior", but because it is undefined behavior you can't count on it to do the same thing every time you run the program, or even if the same code is encountered multiple times within the same run of the program.

There are at least three problems with this code (and maybe more; I'm not a C++ expert).
The first problem is that string constants like "School" should not be modified, as they can be placed in read-only parts of the program memory that the OS maps directly to the exe file on disk (the OS may share them between several running instances of that same program for example, or avoid those parts of memory needing to be written to the swap file when RAM is low, as it knows it can get the original from the exe). The example crashes on my compiler, for example. To modify the string you should allocate a modifiable duplicate of the string, such as with strdup.
The second problem is it's using cout and endl from the std namespace without declaring that. You should prefix their accesses with std:: or add a using namespace std; declaration.
The third problem is that the order in which the operations on the second cout line happen is undefined behavior, leading to the apparently mysterious change of the string between the time it was displayed at the end of the first cout line and the next line.
Since this code is not intended to do anything in particular, there are different, valid ways you could fix it. This will probably run:
#include <iostream>
#include <string.h>
#include <stdlib.h>
using namespace std;
int main()
{
    char *string = strdup("School");
    char *p = string;
    char c;
    c=++(*(p++));
    cout<<c<<","<<p<<endl;
    cout<<p<<","<<++(*(p--))<<","<<++(*(p++))<<endl;
    free(string);
}
(On my compiler this outputs: T,chool, diool,i,d.)
It still has undefined behavior though. To fix that, rework the second cout line as follows:
cout << p << ",";
cout << ++(*(p--)) << ",";
cout << ++(*(p++)) << endl;
That should give T,chool, chool,d,U (assuming a character set that has A to Z in order).
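Putting the two fixes together, here is a sketch of the whole thing with every modification of p in its own statement. Two liberties taken here that are not in the code above: it uses a local char array instead of strdup, and it collects the output in a string (run_fixed_version is a made-up name) so the result can be checked. It assumes an ASCII-like character set.

```cpp
#include <sstream>
#include <string>

// Hypothetical function: returns what the reworked program would print.
std::string run_fixed_version()
{
    char text[] = "School";     // modifiable copy, not the literal itself
    char* p = text;
    std::ostringstream out;

    char c = ++(*(p++));        // 'S' becomes 'T'; p now points at "chool"
    out << c << "," << p << "\n";

    out << p << ",";            // p still points at "chool"
    out << ++(*(p--)) << ",";   // 'c' becomes 'd'; p moves back to the start
    out << ++(*(p++)) << "\n";  // 'T' becomes 'U'; p moves forward again
    return out.str();
}
```

Because each statement ends before the next one touches p, there is exactly one possible result: "T,chool" on the first line and "chool,d,U" on the second.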

p++ moves the position of p from "School" to "chool". Since it is p++, not ++p, the expression still yields the old position, so the outer ++ increments the char that p pointed to before the move. That is why c is now 'T' instead of 'S'.
When you output p, you output the remainder of p, which we identified before as "chool".
Since it is best to learn from trial and error, run this code with a debugger. That is a great tool which will follow you forever. That will help for the second set of cout statements. If you need help with gdb or VS debugger, we can walk through it.

Disambiguating std::isalpha() in C++

So I am currently writing a part of a program that takes user text input. I want to ignore all input characters that are not alphabetic, and so I figured std::isalpha() would be a good way to do this. Unfortunately, as far as I know there are two std::isalpha() functions, and the general one needs to be disambiguated from the locale-specific one thusly:
(int(*)(int))std::isalpha()
If I don't disambiguate, std::isalpha seems to return true when reading uppercase but false when reading lowercase letters (if I directly print the returned value, though, it returns 0 for non-alpha chars, 1 for uppercase chars, and 2 for lowercase chars). So I need to do this.
I've done so in another program before, but for some reason, in this project, I sometimes get "ISO C++ forbids" errors. Note, only sometimes. Here is the problematic area of code (this appears together without anything in between):
std::cout << "Is alpha? " << (int(*)(int))std::isalpha((char)Event.text.unicode) << "\n";
if ( (int(*)(int))std::isalpha((char)Event.text.unicode) == true)
{
std::cout << "Is alpha!\n";
//...snip...
}
The first instance, where I send the returned value to std::cout, works fine - I get no errors for this, I get the expected values (0 for non-alpha, 1 for alpha), and if that's the only place I try to disambiguate, the program compiles and runs fine.
The second instance, however, throws up this:
error: ISO C++ forbids comparison between pointer and integer
and only compiles if I remove the (int(*)(int)) snippet, at which point bad behavior ensues. Could someone enlighten me here?

You are casting the return value of the std::isalpha() call to int(*)(int), and then comparing that pointer to true. Comparing pointers to boolean values doesn't make much sense, and you get an error.
Now, without the cast, you compare the int returned by std::isalpha() to true. bool is an integer type, and to compare the two different integer types the values are first converted to the same type. In this case they are both converted to int. true becomes 1, and if std::isalpha() returned 2 the comparison ends up as 2 == 1, which is false.
If you want to compare the result of std::isalpha() against a bool, you should cast the returned int to bool, or simply leave out the comparison and use something like if (std::isalpha(c)) {...}
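A minimal sketch of the "don't compare against true" advice, with is_alpha as a hypothetical wrapper name. The only guarantee std::isalpha gives is zero versus nonzero, so the wrapper tests against zero:

```cpp
#include <cctype>

// Hypothetical wrapper around std::isalpha that returns a real bool.
bool is_alpha(char c)
{
    // Convert to unsigned char first: passing a negative value (other
    // than EOF) to std::isalpha is undefined behavior.
    // Compare against 0, not true: a match is only promised to be nonzero,
    // so it may be 1, 2, 1024 or anything else.
    return std::isalpha(static_cast<unsigned char>(c)) != 0;
}
```

In the default "C" locale this gives true for both 'a' and 'Z' and false for '1', regardless of which nonzero value the library happens to return.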

There is no need to disambiguate, because there is no ambiguity in a normal call.
Also, there is no need to use the std:: prefix when you get the function declaration from <ctype.h>, which after C++11 is the header you should preferably use (i.e., not <cctype>) – and for that matter also before C++11, but C++11 clinched it.
Third, you should not compare the result to true.
However, you need to cast a char argument to unsigned char, lest you get Undefined Behavior for anything but 7-bit ASCII.
E.g. do like this:
bool isAlpha( char const c )
{
typedef unsigned char UChar;
return !!isalpha( UChar( c ) );
}

Constraints on the lifetime of the argument to std::regex_match (and std::regex_search)

Considering the C++11 function with the signature std::regex_match( std::string const&, std::smatch& match, std::regex const& re ), what are the constraints on the lifetime of the first argument? I don't find any, but when I execute the following program (compiled with VC++ 2010, iterator debugging active):
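#include <iostream>
#include <regex>
#include <string>
int
main()
{
    std::string a("aaa");
    std::string c("ccc");
    std::regex re("aaa(.*)ccc");
    std::smatch m;
    // The temporary string dies at the end of the condition expression.
    if (std::regex_match(a + "xyz" + c, m, re)) {
        std::cout << m[0] << std::endl;
        std::cout << m[1] << std::endl;
    }
    return 0;
}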
it crashes, doubtless because the sub_match objects in m only keep iterators into the string, and not copies. I can't find anything in the standard which forbids my code.
FWIW: it didn't work in boost::regex, either, and that's what std::regex is based on. (Of course, Boost didn't document any constraints with regards to the lifetime either.)
In the end, I guess my question is: should I send in a DR to the standards organization, or a bug report to Microsoft?

I don't recall any discussion of this possibility during the adoption of tr1::regex or std::regex, so I think it simply was not considered. In hindsight, it's certainly a trap that we should have foreseen. Off the top of my head, an overload that takes a std::string&& would signal that a temporary is involved, and a copy is needed. So I'd report it to the Standards Committee. (full disclosure: I wrote the Dinkumware implementation, which is what Microsoft ships)

The specification for this overload of regex_match states that it (28.11.2[re.alg.match]/6):
Returns: regex_match(s.begin(), s.end(), m, e, flags)
There are no additional requirements on this overload, and the overload to which it delegates takes only an iterator range--there is no way for it to keep the temporary string alive because it doesn't even know that there is a string to be kept alive.
This issue came up in discussion during STL's regex presentation at C++Now '12. Someone recommended that additional overloads might be added to the specification, to catch rvalue string arguments (e.g. basic_string<...>&&), which would give a nice compilation error instead of this runtime error. The library specification doesn't include those overloads, though, and I don't see a defect report for this.
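Until the library catches this for you, one way to stay clear of the trap is to make sure the subject string outlives every read of the match results. A sketch, where first_capture is a hypothetical helper and not a standard function; as far as I know, later standards (C++14 onward) declare the rvalue-string overloads as deleted, so the original program no longer compiles there.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the first capture group, or an empty
// string if the subject does not match.
std::string first_capture(const std::string& subject, const std::regex& re)
{
    std::smatch m;
    if (std::regex_match(subject, m, re)) {
        // m only holds iterators into subject, so copy the text out
        // while subject is guaranteed to be alive.
        return m[1].str();
    }
    return std::string();
}
```

Passing a + "xyz" + c to this helper is fine: the temporary lives until the call returns, and by then the capture has been copied into a std::string of its own.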

Is std::stoi actually safe to use?

I had a lovely conversation with someone about the downfalls of std::stoi. To put it bluntly, it uses std::strtol internally, and throws if that reports an error. According to them, though, std::strtol shouldn't report an error for an input of "abcxyz", causing stoi not to throw std::invalid_argument.
First of all, here are two programs tested on GCC about the behaviours of these cases:
strtol
stoi
Both of them show success on "123" and failure on "abc".
I looked in the standard to pull more info:
§ 21.5
Throws: invalid_argument if strtol, strtoul, strtoll, or strtoull reports that
no conversion could be performed. Throws out_of_range if the converted value is
outside the range of representable values for the return type.
That sums up the behaviour of relying on strtol. Now what about strtol? I found this in the C11 draft:
§7.22.1.4
If the subject sequence is empty or does not have the expected form, no
conversion is performed; the value of nptr is stored in the object
pointed to by endptr, provided that endptr is not a null pointer.
Given the situation of passing in "abc", the C standard dictates that nptr, which points to the beginning of the string, would be stored in endptr, the pointer passed in. This seems consistent with the test. Also, 0 should be returned, as stated by this:
§7.22.1.4
If no conversion could be performed, zero is returned.
The previous reference said that no conversion would be performed, so it must return 0. These conditions now comply with the C++11 standard for stoi throwing std::invalid_argument.
The result of this matters to me because I don't want to go around recommending stoi as a better alternative to other methods of string to int conversion, or using it myself as if it worked the way you'd expect, if it doesn't catch text as an invalid conversion.
So after all of this, did I go wrong somewhere? It seems to me that I have good proof of this exception being thrown. Is my proof valid, or is std::stoi not guaranteed to throw that exception when given "abc"?

Does std::stoi throw an error on the input "abcxyz"?
Yes.
I think your confusion may come from the fact that strtol never reports an error except on overflow. It can report that no conversion was performed, but this is never referred to as an error condition in the C standard.
strtol is defined similarly by all three C standards, and I will spare you the boring details, but it basically defines a "subject sequence" that is a substring of the input string corresponding to the actual number. The following four conditions are equivalent:
the subject sequence has the expected form (in plain English: it is a number)
the subject sequence is non-empty
a conversion has occurred
*endptr != nptr (this only makes sense when endptr is non-null)
When there is an overflow, the conversion is still said to have occurred.
Now, it is quite clear that because "abcxyz" does not contain a number, the subject sequence of the string "abcxyz" must be empty, so that no conversion can be performed. The following C90/C99/C11 program will confirm it experimentally:
#include <stdio.h>
#include <stdlib.h>
int main() {
char *nptr = "abcxyz", *endptr[1];
strtol(nptr, endptr, 0);
if (*endptr == nptr)
printf("No conversion could be performed.\n");
return 0;
}
This implies that any conformant implementation of std::stoi must throw invalid_argument when given the input "abcxyz" without an optional base argument.
Does this mean that std::stoi has satisfactory error checking?
No. The person you were talking to is correct when she says that std::stoi is more lenient than performing the full check errno == 0 && end != start && *end=='\0' after std::strtol, because std::stoi silently strips away all characters starting from the first non-numeric character in the string.
In fact off the top of my head the only language whose native conversion behaves somewhat like std::stoi is Javascript, and even then you have to force base 10 with parseInt(n, 10) to avoid the special case of hexadecimal numbers:
input      | std::atoi    | std::stoi    | Javascript  | full check
===========+==============+==============+=============+============
hello      | 0            | error        | error (NaN) | error
0xygen     | 0            | 0            | error (NaN) | error
0x42       | 0            | 0            | 66          | error
42x0       | 42           | 42           | 42          | error
42         | 42           | 42           | 42          | 42
-----------+--------------+--------------+-------------+------------
languages  | Perl, Ruby,  | Javascript   | Javascript  | C#, Java,
           | PHP, C...    | (base 10)    |             | Python...
Note: there are also differences among languages in the handling of whitespace and redundant + signs.
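The std::stoi column above can be observed through the function's optional second argument, which reports how many characters were consumed. A small sketch, with stoi_consumed as a made-up helper name:

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: number of characters std::stoi consumed (base 10),
// or 0 if it threw because no conversion was possible or the value
// did not fit in an int.
std::size_t stoi_consumed(const std::string& s)
{
    std::size_t pos = 0;
    try {
        (void)std::stoi(s, &pos);  // pos = index of first unconverted char
    } catch (const std::invalid_argument&) {
        return 0;                  // e.g. "hello"
    } catch (const std::out_of_range&) {
        return 0;
    }
    return pos;
}
```

For "42x0" this reports 2 and for "0x42" it reports 1, which is the silent truncation described above; comparing the result with s.size() is one way to notice it.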
Ok, so I want full error checking, what should I use?
I'm not aware of any built-in function that does this, but boost::lexical_cast<int> will do what you want. It is particularly strict since it even rejects surrounding whitespace, unlike Python's int() function. Note that invalid characters and overflows result in the same exception, boost::bad_lexical_cast.
#include <boost/lexical_cast.hpp>
#include <iostream>
#include <string>
int main() {
    std::string s = "42";
    try {
        int n = boost::lexical_cast<int>(s);
        std::cout << "n = " << n << std::endl;
    } catch (const boost::bad_lexical_cast&) {  // catch by const reference
        std::cout << "conversion failed" << std::endl;
    }
}
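If Boost is not an option, the full check mentioned earlier (errno == 0 && end != start && *end == '\0') can be written directly on top of std::strtol. This is a sketch under a few assumptions: parse_int_strict is a hypothetical name, the base is fixed at 10, and the extra range test is there because strtol produces a long, not an int.

```cpp
#include <cerrno>
#include <climits>
#include <cstdlib>
#include <string>

// Hypothetical helper: strict base-10 parse. Returns false if nothing was
// converted, if text is left over, or if the value does not fit in an int.
bool parse_int_strict(const std::string& s, int& out)
{
    const char* start = s.c_str();
    char* end = nullptr;
    errno = 0;  // strtol only sets errno on failure, so clear it first
    const long value = std::strtol(start, &end, 10);

    // The full check: no range error, something converted, nothing left over.
    if (errno != 0 || end == start || *end != '\0') {
        return false;
    }
    if (value < INT_MIN || value > INT_MAX) {
        return false;  // fits in long but not in int
    }
    out = static_cast<int>(value);
    return true;
}
```

Unlike boost::lexical_cast, this still accepts leading whitespace, because strtol skips it before looking for the number; reject it separately if that matters.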
