Does Re2 use string size or null termination? - c++

The title is pretty much it. If a standard C++ string with UTF-8 characters has no zero bytes, does the scanning terminate at the end of the string as defined by its size? Conversely, if the string has a zero byte, does scanning stop at that byte, or continue to the full length of the string?
I've looked at the Re2.h file and it does not seem to address this issue.

A std::string containing UTF-8 characters can't have 0-bytes as part of the text
(only as termination), because UTF-8 doesn't allow 0's anywhere.
And given you're using something C++11-compliant, a terminating 0 is guaranteed
(doesn't matter if you use data() or c_str(). And data is the original data, so...).
See http://en.cppreference.com/w/cpp/string/basic_string/data
or the standard (21.4.7.1/1 etc.).
=> The processing of a string will stop at the 0

The interface to Re2 seems to use std::string, which almost
certainly means that it uses the begin and the end of the
string, and that null characters are characters like any other.
(They are, after all, defined in Unicode and in UTF-8.) Of
course, '\0' is in the category control characters, so it won't
match something like "\pL" (which matches a letter). But it
should match "\pC". And of course, '\u0000' and other representations of the null character.
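Whether RE2 itself honours the size is something to confirm against its headers, but the underlying point about std::string can be checked directly. The sketch below uses illustrative function names (none of them come from RE2) and shows that a std::string keeps the bytes after an embedded zero, while a terminator-based C API stops early.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Builds a 7-byte std::string with a zero byte in the middle. The
// (pointer, length) constructor is required here: constructing from the
// literal alone would stop at the first '\0'.
std::string make_with_embedded_null() {
    return std::string("abc\0def", 7);
}

// Length as std::string sees it: taken from the stored size.
std::size_t length_by_size(const std::string& s) {
    return s.size();
}

// Length as a terminator-based C API sees it: stops at the first '\0'.
std::size_t length_by_terminator(const std::string& s) {
    return std::strlen(s.c_str());
}

// Illustrative usage: the asserts state what each view reports.
void embedded_null_example() {
    const std::string s = make_with_embedded_null();
    assert(length_by_size(s) == 7);
    assert(length_by_terminator(s) == 3);
    assert(s[3] == '\0' && s[4] == 'd');
}
```

Code that works from the string's begin and end (or data() and size()) sees all seven bytes; code that hands only c_str() to a terminator-based function sees three.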

Related

Why doesn't strlen() count the byte of the terminating NUL-character, when the NUL-character is defined to be part of a string?

I know that strlen() does not count the terminating NUL character. I really do know that this is a fact. Thus, this question is NOT about why strlen() might "presumably" not return the right string length, which has already been asked and answered well many times here on StackOverflow, e.g. in this thread, or this one.
So let's go ahead to my question:
ISO/IEC 9899:1990 (E), 7.1.1, states:
A string is a contiguous sequence of characters terminated by and including the first null character.
What is the reason why strlen() deviates from this standard definition, and does not "want" to accept a string with its terminating NUL character?
Why?
Because you would expect this pseudocode's assertion to hold true:
str1 = "foo"
str2 = "bar"
str3 = concatenate(str1, str2)
Assert strlen(str1) + strlen(str2) == strlen(str3)
If the terminating '\0' were counted by strlen, the above assertion would not hold, which would be much more of an overall headache than the current C string behavior is. More importantly, it would in my opinion be quite unintuitive and illogical.
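The pseudocode above can be turned into something compilable. This is only a sketch: the helper names are made up for illustration, and std::string stands in for the unspecified concatenate().

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Returns true when strlen is additive under concatenation for a and b.
bool strlen_is_additive(const char* a, const char* b) {
    assert(a != nullptr && b != nullptr);
    const std::string joined = std::string(a) + b;  // concatenate(str1, str2)
    return std::strlen(a) + std::strlen(b) == std::strlen(joined.c_str());
}

// What the sum would be if each string's terminator were counted too.
std::size_t sum_counting_terminators(const char* a, const char* b) {
    assert(a != nullptr && b != nullptr);
    return (std::strlen(a) + 1) + (std::strlen(b) + 1);
}
```

For "foo" and "bar" the terminator-counting sum is 8, whereas "foobar" plus its single terminator is 7, so the additivity would be lost.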
Taking your doubt as a reasonable point, we can state that a C string consists of two parts:
1. the string's useful content ("the text");
2. the null terminating character.
The null terminating character is purely a technical measure that lets the C library functions determine the end of the string. Still, if one types a declaration:
char * str = "some string";
they would logically expect its length to be 11, which is as many characters as they can see in this statement. Hence the strlen() value yields only the length of part 1 of the string.
Not really an answer to your question, but consider this example:
char string[] = "string";
printf("sizeof: %zu\n", sizeof(string));
printf("strlen: %zu\n", strlen(string));
This prints
sizeof: 7
strlen: 6
So sizeof counts the \0, but strlen doesn't.
Questions like this, that ask why a certain age-old decision was made one way and not another way, are hard to answer. I can say that it's perfectly obvious to me, anyway, that strlen should count just the real, "interesting" characters that are in the string, and ignore the \0 at the end that merely terminates it. I'm used to accounting for the \0 separately. I imagine it would have been considerably more of a nuisance overall if strlen had been defined the other way. But I can't prove this with convincing arguments, and I've been using strlen with its current definition for so long that I'm probably hopelessly biased; I might be saying "it's perfectly obvious to me that..." even if strlen's definition were quite wrong.
There is a difference between the physical, stored representation of a C style string and the logical representation of a C style string.
The physical representation, that is, how the string is actually stored in memory or on other media, includes the null character. The null character is included when discussing the physical representation because it takes up an additional piece of storage. In order to be a C style string, the null character must be stored.
However, the logical representation of a string does not include the null character. The logical representation of a string includes only the text characters that the programmer wants to manipulate.
I suspect that the null character, a value of binary zero, was chosen because the original ASCII character set defined a character value of zero as the NUL character. As one of the lowest values among the various teletype control codes, it seems to be the ASCII character least likely to appear in text. See ASCII Character Codes.
Another nice quality of using a binary zero as the string terminator is that it is the value that represents logical false. Iterating over a string is therefore often just a matter of incrementing an array index or a pointer while the current character is logically true, since all characters other than the end-of-string indicator have a non-zero, logically true value.
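A minimal sketch of that iteration idiom, with an illustrative function name; it does by hand what strlen() does:

```cpp
#include <cassert>
#include <cstddef>

// Counts characters by walking the pointer until it reaches the
// terminator. The loop condition is just *p: every character except
// '\0' converts to true.
std::size_t count_until_terminator(const char* p) {
    assert(p != nullptr);
    std::size_t n = 0;
    for (; *p; ++p) {
        ++n;
    }
    return n;
}
```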
Because the C programming language is so close to the hardware, the programmer needs to be concerned with both representations: the physical representation, which includes the null character, when allocating memory to store a string, and the logical representation, which is the string without the null character.
The various C style string manipulation functions in the Standard Library (strlen(), strcpy(), etc.) are all designed around the logical representation of a C style string. They treat the null character not as part of the text but as a special indicator character that marks the end of the string. However, as part of their operations they need to be aware of the null character and its use as a special symbol. For instance, when strcpy() or strcat() are used to copy strings, they must also copy the null character that indicates the end of the string, even though it is not part of the actual text of the logical representation.
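To make the two representations concrete, here is a small sketch (the function names are hypothetical) that allocates by the physical size and then lets strcpy() copy the terminator along with the text:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>

// Physical size: the logical length plus one byte for the terminator.
std::size_t storage_needed(const char* s) {
    assert(s != nullptr);
    return std::strlen(s) + 1;
}

// Copies src into freshly allocated storage of exactly the physical size.
std::unique_ptr<char[]> duplicate_c_string(const char* src) {
    assert(src != nullptr);
    std::unique_ptr<char[]> copy(new char[storage_needed(src)]);
    std::strcpy(copy.get(), src);  // copies the text and the '\0'
    return copy;
}
```

Allocating only strlen(src) bytes here would leave no room for the terminator, and strcpy() would write one byte past the end of the buffer.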
This choice allows text strings to be stored as arrays of characters, as befits the hardware orientation and efficiency characteristics of C. There is no need to create an additional built in type for text strings and it fits well with the lean character of the C programming language.
C++ is able to provide std::string because it is object oriented and has additional language facilities that allow objects to be created and managed. The C programming language, due to its simple syntax and lack of object oriented facilities, does not have this convenience.
The problem with this approach is that the programmer needs to be aware of both the physical representation and the logical representation of text strings and be able to accommodate the needs of both when writing programs.

How to match ANSI data encoded as Unicode with regex

I have some ANSI data encoded as UTF16 Little Endian.
It therefore looks like a \0 b \0 c \0. I suspect this is asking way too much of regex, but just on the off-chance, is there any way of matching this data specifically?
I am able to use ^[\w\x00]+$ but that doesn't really ensure that the null bytes are in the right place. Is there any way to have an alternating pattern match, or match based on character position mod 2 so that the even positions must be filled with null bytes, and null bytes are not allowed elsewhere?
If not I'll write a little bit of manual code; it would just be helpful to know.
Thank you.
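For reference, the "manual code" fallback mentioned in the question could look roughly like this. It is a sketch under assumptions that are not in the question: "ANSI" is taken to mean non-zero 7-bit ASCII (stricter than \w), and the function name is made up. Positions are 0-based here, so the zero bytes sit at the odd indices.

```cpp
#include <cassert>
#include <cstddef>

// Returns true when the len bytes at data look like 7-bit text stored as
// UTF-16 little endian: a non-zero ASCII byte at each even index,
// followed by a zero byte at each odd index.
bool is_ascii_as_utf16le(const char* data, std::size_t len) {
    assert(data != nullptr || len == 0);
    if (len == 0 || len % 2 != 0) {
        return false;
    }
    for (std::size_t i = 0; i < len; i += 2) {
        const unsigned char low = static_cast<unsigned char>(data[i]);
        const unsigned char high = static_cast<unsigned char>(data[i + 1]);
        if (low == 0 || low > 0x7F) return false;  // text byte expected
        if (high != 0) return false;               // zero byte expected
    }
    return true;
}
```

The length is passed explicitly because the data contains zero bytes, so strlen() cannot be used to find its end.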

what's the best way to get rid of the \0 at the end of a string literal?

I'm trying to do something like
strcmp(argv[3], "stdout")
however, in the command line I don't want to type
stdout\0
what's the best way to get rid of the \0 at the end of a string literal?
Thanks!
update:
Thanks guys. I found what's wrong with my code... I should have used
strcmp(argv[3], "stdout") == 0
Thanks @Nicol Bolas
You don't have to type "stdout\0" on the command line. Whichever way your system makes command-line arguments available to your process (it differs by operating system) automatically adds the null character.
As you know, a C-style string is terminated by the null character, which is written in code as '\0'. If that character weren't at the end of the string, a function such as strcmp would keep going well beyond the end of the string, since such a string flouts convention. Since the terminating null character is the C convention, however, the compiler is smart enough to add the null character to the end of a string literal, and the system is smart enough to add the null character to the command-line arguments stored in the memory of a freshly created process. If argc is greater than 3, and the third argument you type on the command-line for your program is "stdout", the call to strcmp(argv[3], "stdout") will return 0 to mean that the two strings match.
You don't need to type \0 in most cases. String literals have a \0 implicitly appended to them, and the C functions that store string data into character arrays will append a \0 on the end (which is why the documentation for many of those functions specifies that your character buffer must have enough space for the string and the null terminator).
A string literal consists of zero or more characters from the source character set surrounded by double quotation marks ("). A string literal represents a sequence of characters that, taken together, form a null-terminated string.
strcmp starts comparing the first character of each string. If they are equal to each other, it continues with the following pairs until the characters differ or until a terminating null-character is reached.
So you don't need to write \0 at the end of stdout; you need to compare the return value of strcmp to 0:
if (strcmp(argv[3], "stdout") == 0)
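Put together with the argc check mentioned in an earlier answer, a minimal sketch might look like this (the function name is illustrative):

```cpp
#include <cassert>
#include <cstring>

// Returns true when the third command-line argument is exactly "stdout".
// argc is checked first because argv[3] only exists when argc > 3.
bool third_arg_is_stdout(int argc, const char* const argv[]) {
    assert(argv != nullptr);
    if (argc <= 3) {
        return false;
    }
    // strcmp returns 0 for equal strings, so compare against 0 explicitly.
    return std::strcmp(argv[3], "stdout") == 0;
}
```

From main(int argc, char* argv[]) it can be called as third_arg_is_stdout(argc, argv). Neither the literal nor the typed argument needs an explicit \0.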

What to use to represent a lambda character in C++

In the program, Lambda λ theoretically represents nothing: ''. I thought of representing this programmatically as '\0', but obviously that terminates a string, which is not necessarily what lambda does. Also, I am reading in from istringstream and it has problems reading that character in.
So what character would you use?
I'm assuming you have a reason for representing Int,Char,Int as a string, rather than just define a struct to hold the data.
As you say, \0 doesn't work as it terminates the string. But there are other invisible ASCII characters that you can use and easily escape in C++. Have a look at this list of escape codes.
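As one possible illustration, the sketch below picks ASCII 0x1F (unit separator) as the lambda marker and reads a comma-separated "Int,Char,Int" line from an istringstream. The choice of 0x1F, the comma-separated layout, and every name in the code are assumptions made for the example, not something stated in the question.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical marker for lambda: ASCII 0x1F (unit separator), which is
// non-printing but, unlike '\0', does not terminate a C string.
constexpr char kLambda = '\x1F';

// Hypothetical record for one "Int,Char,Int" line.
struct Transition {
    int from;
    char symbol;
    int to;
};

// Parses "from,symbol,to". Returns false on malformed input.
bool parse_transition(const std::string& text, Transition& out) {
    std::istringstream in(text);
    char comma1 = 0;
    char comma2 = 0;
    if (!(in >> out.from >> comma1 >> out.symbol >> comma2 >> out.to)) {
        return false;
    }
    return comma1 == ',' && comma2 == ',';
}

// Illustrative usage: a lambda transition from state 1 to state 2.
void lambda_example() {
    Transition t{};
    const std::string line = std::string("1,") + kLambda + ",2";
    assert(parse_transition(line, t));
    assert(t.from == 1 && t.symbol == kLambda && t.to == 2);
}
```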

Failsafe conversion between different character encodings

I need to convert strings from one encoding (UTF-8) to another. The problem is that the target encoding does not have all the characters of the source encoding, and the libc iconv(3) function fails in that situation. What I want is to be able to perform the conversion but have the problematic characters in the output string replaced with some symbol, say '?'.
The programming language is C or C++.
Is there a way to address this issue?
Try appending "//TRANSLIT" or "//IGNORE" to the end of the destination charset string. Note that this is only supported under the GNU C library.
From iconv_open(3):
//TRANSLIT
When the string "//TRANSLIT" is appended to tocode, transliteration is activated. This means that when a character cannot be represented in the target character set, it can be approximated through one or several similarly looking characters.
//IGNORE
When the string "//IGNORE" is appended to tocode, characters that cannot be represented in the target character set will be silently discarded.
Alternatively, manually skip a character and insert a substitution in the output when iconv(3) returns (size_t)-1 with errno set to EILSEQ.
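A rough sketch of that manual approach follows. It assumes a glibc-style iconv (which reports an unrepresentable character as EILSEQ), UTF-8 input, and a target charset in which '?' is a single ASCII-compatible byte; the helper names are made up. Stateful target encodings would also need a final flush call, which is omitted here.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <iconv.h>
#include <string>

// Byte length of the UTF-8 sequence introduced by lead byte b.
// Anything that is not a multi-byte lead is skipped as a single byte.
static std::size_t utf8_sequence_length(unsigned char b) {
    if (b >= 0xF0) return 4;
    if (b >= 0xE0) return 3;
    if (b >= 0xC0) return 2;
    return 1;
}

// Converts UTF-8 text to to_charset, writing '?' for each character that
// iconv rejects. Returns false only if the conversion cannot be opened.
bool convert_with_placeholder(const std::string& utf8, const char* to_charset,
                              std::string& out) {
    assert(to_charset != nullptr);
    iconv_t cd = iconv_open(to_charset, "UTF-8");
    if (cd == (iconv_t)-1) {
        return false;
    }
    std::string input = utf8;  // iconv takes a non-const input pointer
    char* src = &input[0];
    std::size_t src_left = input.size();
    out.clear();

    while (src_left > 0) {
        char buffer[256];
        char* dst = buffer;
        std::size_t dst_left = sizeof buffer;
        const std::size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
        const int err = errno;  // only meaningful when rc is (size_t)-1
        out.append(buffer, sizeof buffer - dst_left);
        if (rc != (std::size_t)-1) {
            break;  // all remaining input was converted
        }
        if (err == E2BIG) {
            continue;  // output buffer was full: go round again
        }
        // EILSEQ, or EINVAL for a truncated sequence at the end of input:
        // skip one character and substitute a placeholder.
        std::size_t skip = utf8_sequence_length(static_cast<unsigned char>(*src));
        if (skip > src_left) {
            skip = src_left;
        }
        src += skip;
        src_left -= skip;
        out.push_back('?');
    }
    iconv_close(cd);
    return true;
}
```

On glibc, converting "a", U+00E9, "b" to "ASCII" this way should give "a?b". Other iconv implementations may substitute a character of their own instead of failing, in which case the EILSEQ branch is simply never taken.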
Another option: a regex based on the translatable source ranges can be used to swap a corresponding placeholder in for any characters that don't match.