Regex for invalid Base64 characters - regex

How do I create a regex that matches all invalid Base64 characters?
I found [^a-zA-Z0-9+/=\n\r].*$ on Stack Overflow, but when I tried it, the result still included a string containing a - sign.
I don't know regex at all; can anyone tell me whether this is a good or bad regex?

The short answer to your question is that if the message contains any match for a character from the class [^A-Za-z0-9+/=\s], then it contains an invalid base-64 character. The exception is MIME messages, which may freely mix other data (for various purposes) together with the base-64 stream; those other characters are deleted before decoding the base-64 object.
As someone who was lucky enough to help write the internals of a very fast base-64 encoding program that processed multi-byte blocks with each machine instruction, let me add a few remarks:
The base-64 alphabet is: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
output must be padded by zero or more = signs as necessary so that the total length of non-whitespace characters is a multiple of four.
Those equals signs can only occur at the end of the base-64 message, and there can be at most two of them.
Whitespace should be ignored, regardless of its type. Usually messages are wrapped to a certain margin (which must be a multiple of four), but this is not required. The purpose of a base-64 encoding is to transfer arbitrary values, especially binary data, as plain text. You could theoretically even read someone a JPEG image over the phone using base-64 encoding.
My suggestion, therefore, for validating a base-64 message is to do more than just use a regular expression. Instead (a sketch follows the steps below),
Eliminate all whitespace and call the length of the resulting output z.
Count the number x of base-64 alphabet characters.
Count the number y of equals sign(s) at the end of the message.
Return valid if y is at most 2 and x + y = z and invalid otherwise.
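A minimal sketch of those steps in C++, assuming the whole message is already in a std::string (the function name isValidBase64 is my own choice, not part of the answer):

#include <algorithm>
#include <cctype>
#include <cstddef>
#include <iterator>
#include <string>

bool isValidBase64(const std::string& message)
{
    // Eliminate all whitespace; z is the length of what remains.
    std::string stripped;
    std::copy_if(message.begin(), message.end(), std::back_inserter(stripped),
                 [](unsigned char c) { return !std::isspace(c); });
    const std::size_t z = stripped.size();

    // x: number of base-64 alphabet characters.
    const std::string alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    const std::size_t x = static_cast<std::size_t>(
        std::count_if(stripped.begin(), stripped.end(),
                      [&](char c) { return alphabet.find(c) != std::string::npos; }));

    // y: number of = signs at the very end of the message.
    std::size_t y = 0;
    while (y < z && stripped[z - 1 - y] == '=')
        ++y;

    // Valid if y is at most 2 and x + y = z, invalid otherwise.
    return y <= 2 && x + y == z;
}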
Note 1: The padding characters == or = do not serve any purpose in protecting the integrity of the data, and there are many derivatives of base-64 encoding which do not use them. Many consider the padding to be almost as useless and wasteful of processing time as the CR portion of the CRLF line-ending sequence.
Note 2: The variant used for MIME encoding accepts characters outside the base-64 alphabet to be contained within the message stream, but simply discards them when decoding the base-64 data object.
Note 3: I dislike the modern term "Base64" since it is an extremely ugly word. This fake word was never used by the original base-64 writers, but was adopted sometime in the next nine years.
You can encode most of this into a regular expression as follows (without the precise length checks on the last block of base-64 data):
^\s*(?:(?:[A-Za-z0-9+/]{4})+\s*)*[A-Za-z0-9+/]*={0,2}\s*$
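For example, a quick sketch of applying that pattern with std::regex; the test strings are the two examples that appear later in this thread:

#include <iostream>
#include <regex>
#include <string>

int main()
{
    // Raw string literal so the backslashes need no extra escaping.
    const std::regex base64Pattern(
        R"(^\s*(?:(?:[A-Za-z0-9+/]{4})+\s*)*[A-Za-z0-9+/]*={0,2}\s*$)");

    std::cout << std::regex_match("aGVsbG8sIHdvcmxkIQ==", base64Pattern)   // 1
              << std::regex_match("aGV%sb-G8sIHdvcmxkIQ==", base64Pattern) // 0
              << '\n';
}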

That should probably be ^[a-zA-Z0-9+/\r\n]+={0,2}$ instead.1
Currently it only matches one valid character and then allows anything after it. So, for instance:
aGVsbG8sIHdvcmxkIQ== match
aGV%sb-G8sIHdvcmxkIQ== also a match (starts with "a")
Whereas by removing the .* at the end and adding a quantifier to the class, we force the entire string to be legitimate:
aGVsbG8sIHdvcmxkIQ== match
aGV%sb-G8sIHdvcmxkIQ== not a match
1 As @p.s.w.g pointed out, a valid Base64 string shouldn't contain = within the value (since = has special meaning and is used as a filler).

Related

C++ tests for UTF-8 validation

I need to write unit tests for UTF-8 validation, but I don't know how to write incorrect UTF-8 cases in C++:
TEST(validation, Tests)
{
    std::string str = "hello";
    EXPECT_TRUE(validate_utf8(str));
    // I need incorrect UTF-8 cases
}
How can I write incorrect UTF-8 cases in C++?
You can specify individual bytes in the string with the \x escape sequence in hexadecimal form or the \000 escape sequence in octal form.
For example:
std::string str = "\xD0";
which is an incomplete UTF-8 sequence.
Have a look at https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt for valid and malformed UTF8 test cases.
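Building on that, here are a few more hand-written malformed sequences in the same GoogleTest style as the question (a sketch; validate_utf8 is the function from the question, and the byte values are standard examples of invalid UTF-8 rather than anything taken from this answer):

TEST(validation, RejectsMalformedUtf8)
{
    EXPECT_FALSE(validate_utf8("\xD0"));             // truncated 2-byte sequence
    EXPECT_FALSE(validate_utf8("\x80"));             // stray continuation byte
    EXPECT_FALSE(validate_utf8("\xE2\x28\xA1"));     // invalid continuation byte
    EXPECT_FALSE(validate_utf8("\xC0\xAF"));         // overlong encoding of '/'
    EXPECT_FALSE(validate_utf8("\xED\xA0\x80"));     // UTF-16 surrogate half
    EXPECT_FALSE(validate_utf8("\xF4\x90\x80\x80")); // code point above U+10FFFF
}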
In UTF-8 any byte with a most significant bit of 0 is an ordinary ASCII character; any other byte is part of a multi-byte sequence (MBS).
If the second most significant bit is also a one, then this is the first byte of an MBS; otherwise it is one of the follow-up bytes.
In the first byte of an MBS, the number of leading one-bits gives you the number of bytes in the entire sequence, e.g. 0b110xxxxx with arbitrary values for x is the start byte of a two-byte sequence.
The original scheme theoretically allowed sequences of up to six bytes; the current standard (RFC 3629) limits them to four, covering code points up to U+10FFFF.
You can now produce arbitrary code points by defining appropriate sequences, e.g. "\xc8\x85" would represent the sequence 0b11001000 0b10000101, which is a legal pattern and represents the code point 0b01000000101 (note how the leading bits forming the UTF-8 headers are stripped away), corresponding to a value of 0x205, or 517. Whether that is an assigned code point at all you would need to look up; I just formed an arbitrary bit pattern as an example.
In the same way you can represent longer valid sequences by increasing the number of leading one-bits and appending the appropriate number of follow-up bytes (note again: the number of initial one-bits is the total number of bytes, including the first byte of the MBS).
Similarly, you can produce invalid sequences in which the total number of bytes does not match (too many or too few) the number of initial one-bits.
So far you can produce arbitrary valid or invalid sequences, where the valid ones represent arbitrary code points. You might then need to look up which of those code points are actually assigned.
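A rough sketch of that length rule, assuming plain std::string input (range checks for overlong forms, surrogates, and the U+10FFFF limit are deliberately left out):

#include <cstddef>
#include <cstdint>
#include <string>

// The number of leading one-bits in the first byte of a sequence gives its
// total length; every follow-up byte must look like 0b10xxxxxx.
int expectedSequenceLength(std::uint8_t lead)
{
    if ((lead & 0x80) == 0x00) return 1; // 0xxxxxxx: plain ASCII
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx: two-byte sequence
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx: three-byte sequence
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx: four-byte sequence
    return 0;                            // 10xxxxxx or an invalid lead byte
}

bool hasConsistentSequenceLengths(const std::string& s)
{
    for (std::size_t i = 0; i < s.size(); ) {
        const int len = expectedSequenceLength(static_cast<std::uint8_t>(s[i]));
        if (len == 0 || i + len > s.size())
            return false;                // bad lead byte or truncated sequence
        for (int k = 1; k < len; ++k)    // follow-up bytes must be 10xxxxxx
            if ((static_cast<std::uint8_t>(s[i + k]) & 0xC0) != 0x80)
                return false;
        i += len;
    }
    return true;
}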
Finally, you might additionally consider composed characters (with diacritics): they can be represented either as a base character followed by a combining mark (a sequence of characters, not bytes!) or as a single precomposed character. If you want to go that far, you would need to look up in the standard which combinations are legal and which precomposed code points they correspond to.

How can I force the user/OS to input an ASCII string

This is a follow-up to this question: Is std::string suppose to have only Ascii characters
I want to build a simple console application that takes input from the user as a set of characters. Those characters include the digits 0-9 and the letters a-z.
I am handling the input on the assumption that it is ASCII. For example, I am using something like static_cast<unsigned int>(my_char - '0') to get the number as an unsigned int.
How can I make this code cross-platform? How can I specify that I want the input to always be ASCII? Or have I missed a lot of concepts, and is static_cast<unsigned int>(my_char - '0') just a bad approach?
P.S. In ASCII (at least), the digits are ordered sequentially. However, I do not know whether they are in other encodings. (I am pretty sure they are, but there is no guarantee, right?)
How can I force the user/OS to input an ASCII string?
You cannot, unless you let the user specify the numeric values of such ASCII input.
It all depends on how the terminal implementation that serves std::cin translates keystrokes like 0 into a specific number, and whether that number matches your toolchain's intrinsic translation for '0'.
You simply shouldn't expect specific ASCII values (e.g. magic numbers); use char literals instead to get portable code. The assumption that my_char - '0' yields the actual digit's value is true for all character sets the standard allows. The C++ standard states in [lex.charset]/3 that
The basic execution character set and the basic execution wide-character set shall each contain all the members of the basic source character set, plus control characters representing alert, backspace, and carriage return, plus a null character (respectively, null wide character), whose representation has all zero bits. For each basic execution character set, the values of the members shall be non-negative and distinct from one another. In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous.[...]
emphasis mine
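A minimal sketch of what that guarantee buys you in practice (my_char follows the question; the digit test uses char literals only, no magic numbers):

#include <iostream>

int main()
{
    char my_char;
    std::cin >> my_char;

    // '0'..'9' are guaranteed to be contiguous, so my_char - '0' yields the
    // digit's numeric value in any conforming execution character set.
    if (my_char >= '0' && my_char <= '9') {
        unsigned int digit = static_cast<unsigned int>(my_char - '0');
        std::cout << "digit: " << digit << '\n';
    } else {
        std::cout << "not a decimal digit\n";
    }
}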
You can't force or even verify that beforehand. An "evil user" can always sneak a UTF-8 encoded string into your application with no characters above U+7F, and such a string also happens to be ASCII-encoded.
Also, whatever platform-specific measure you take, the user can pipe in a UTF-16LE encoded file. Or /dev/urandom.
Your mistake is treating string encoding as some magic property of an input stream - it is not. It is, well, an encoding, like JPEG or AVI, and must be handled exactly the same way: read the input, match it against the format, and report errors on parsing failure.
For your case, if you want to accept only ASCII, read the input stream byte by byte and throw/exit with an error if you ever encounter a byte with a value outside the ASCII domain.
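A sketch of that check (the function name readAsciiLine is mine; reporting the error with an exception is just one possible choice):

#include <iostream>
#include <stdexcept>
#include <string>

// Read one line of raw bytes and reject anything outside the 7-bit ASCII range.
std::string readAsciiLine(std::istream& in)
{
    std::string line;
    std::getline(in, line);
    for (unsigned char byte : line) {
        if (byte > 0x7F)
            throw std::runtime_error("input contains a non-ASCII byte");
    }
    return line;
}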
However, if you later encounter a terminal providing data in some incompatible encoding, such as UTF-16LE, you have no choice but to write detection (based on the byte order mark) and a conversion routine.

Adding a diacritic mark to string fail in c++

I want to add a diacritic mark to my string in C++. Assume I want to modify the wordz string in the following manner:
String respj = resp[j];
std::string respjz1 = respj; // create respjz1 and respjz2
std::string respjz2 = respj;
respjz1[i] = 'ź'; // put diacritic marks
respjz2[i] = 'ż';
I keep receiving wordş and wordĽ (instead of wordź and wordż). I tried to google it, but I keep getting results related to the opposite problem: normalizing diacritics to non-diacritic characters.
First, what is String? Does it support accented characters or not?
But the real issue is one of encodings. When you say "I keep receiving", what do you mean? What the string will contain is not a character, but a numeric value representing a code point of a character in some encoding. If the encoding used by the compiler for accented characters is the same as the encoding used by whatever you use to visualize them, then you will get the same character. If it isn't, you will get something different. Thus, for example, depending on the encoding, LATIN SMALL LETTER Z WITH DOT ABOVE (what I think you're trying to assign to respjz2[i]) can be 0xFD or 0xBF in the encoding tables I have access to (and it's absent in most single-byte encodings); in the single-byte encoding I normally use (ISO 8859-1), these code points correspond to LATIN SMALL LETTER Y WITH ACUTE and INVERTED QUESTION MARK, respectively.
In the end, there is no real solution. Long term, I think you should probably move to UTF-8, and try to ensure that all of the tools you use (and all of the tools used by your users) understand that. Short term, it may not be that simple: for starters, you're more or less stuck with what your compiler provides (unless you enter the characters in the form \u00BF or \u00FD, and even then the compiler may do some funny mappings when it puts them into a string literal). And you may not even know what other tools your users use.
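If UTF-8 is an option end to end, one possible sketch (assuming both the source file and the terminal use UTF-8; the two-byte sequences below are the UTF-8 encodings of ź (U+017A) and ż (U+017C), so they cannot be assigned to a single char position and the final byte has to be replaced instead):

#include <iostream>
#include <string>

int main()
{
    std::string respjz1 = "wordz";
    std::string respjz2 = "wordz";

    // Replace the final byte with a two-byte UTF-8 sequence.
    respjz1.replace(respjz1.size() - 1, 1, "\xC5\xBA"); // U+017A, ź
    respjz2.replace(respjz2.size() - 1, 1, "\xC5\xBC"); // U+017C, ż

    std::cout << respjz1 << ' ' << respjz2 << '\n';     // wordź wordż
}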

How does `cin>>` take three values though they are separated by space?

When looking at some code online I found
cin>>arr[0][0]>>arr[0][1]>>arr[0][2]
where I enter a line of three integer values separated by spaces. I see that those three integers become the values of arr[0][0], arr[0][1], and arr[0][2].
It doesn't cause any trouble if there is more than one space between them.
Please, can anyone explain to me how this works?
Most overloads of operator>> consume and discard all whitespace characters first thing. They begin parsing the actual value (say, an int) starting from the first non-whitespace character in the stream.
Reading almost any type of input from a stream will skip any leading whitespace first, unless you explicitly turn that feature off. You should read the std::basic_istream documentation for more information:
Extracts an integer value potentially skipping preceding whitespace. The value is stored to a given reference value.
This function behaves as a FormattedInputFunction. After constructing and checking the sentry object, which may skip leading whitespace, extracts an integer value by calling std::num_get::get().
The same applies to other stream input functions, including the scanf family where most format specifiers will consume all whitespace characters before reading the value:
All conversion specifiers other than [, c, and n consume and discard all leading whitespace characters (determined as if by calling isspace) before attempting to parse the input. These consumed characters do not count towards the specified maximum field width.
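A small demonstration of that behaviour, using std::istringstream so the spacing in the input is visible in the code:

#include <iostream>
#include <sstream>

int main()
{
    int arr[1][3];

    // Extra spaces between the values make no difference: each operator>>
    // skips leading whitespace before parsing the next integer.
    std::istringstream input("1    2 3");
    input >> arr[0][0] >> arr[0][1] >> arr[0][2];

    std::cout << arr[0][0] << ' ' << arr[0][1] << ' ' << arr[0][2] << '\n'; // 1 2 3
}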

Encoding binary data using string class

I am going through one of the requirements for string implementations as part of a study project.
Let us assume that the standard library did not exist and we were forced to design our own string class. What functionality would it support and what limitations would we improve? Let us consider the following factors.
Does binary data need to be encoded?
Is multi-byte character encoding acceptable or is unicode necessary?
Can C-style functions be used to provide some of the needed functionality?
What kind of insertion and extraction operations are required?
My questions about the above text:
What does the author mean by "Does binary data need to be encoded?" Please explain with an example and how we can implement this.
What does the author mean by point 2? Please explain with an example and how we can implement this.
Thanks for your time and help.
Regarding point one, "Binary data" refers to sequences of bytes, where "bytes" almost always means eight-bit words. In the olden days, most systems were based on ASCII, which requires seven bits (or eight, depending on who you ask). There was, therefore, no need to distinguish between bytes and characters. These days, we're more friendly to non-English speakers, and so we have to deal with Unicode (among other codesets). This raises the problem that string types need to deal with the fact that bytes and characters are no longer the same thing.
This segues into point two, which is about how you represent strings of characters in a program. UTF-8 uses a variable-length encoding, which has the remarkable property that it encodes the entire ASCII character set using exactly the same bytes that ASCII encoding uses. However, it makes it more difficult to, e.g., count the number of characters in a string. For pure ASCII, the answer is simple: characters = bytes. But if your string might have non-ASCII characters, you now have to walk the string, decoding characters, in order to find out how many there are1.
These are the kinds of issues you need to think about when designing your string class.
1This isn't as difficult as it might seem, since the first byte of each character is guaranteed not to have 10 in its two high bits. So you can simply count the bytes that satisfy (c & 0xc0) != 0x80. Nonetheless, it is still expensive relative to just treating the length of a string buffer as its character count.
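For instance, a sketch of that count, assuming the std::string holds UTF-8 (the function name utf8Length is mine):

#include <cstddef>
#include <string>

// Continuation bytes have the form 10xxxxxx, so every byte that is *not* a
// continuation byte starts a new character.
std::size_t utf8Length(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80)
            ++count;
    return count;
}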
The question here is: "Can we store ANY old data in the string, or do certain byte values need to be encoded in some special way?" An example of that would be in the standard C language: if you want to use a newline character, it is "encoded" as \n to make it more readable and clear - of course, in this example I'm talking about the source code. In the case of binary data stored in the string, how would you deal with "strange" data - e.g. what about zero bytes? Will they need special treatment?
The values guaranteed to fit in a char are the ASCII characters and a few others (a total of 256 different characters in a typical implementation, but char is not GUARANTEED to be 8 bits by the standard). But if we take non-European languages, such as Chinese or Japanese, they contain vastly more characters than can fit in a single char. Unicode allows for over a million different characters, so any character from any European, Chinese, Japanese, Thai, Arabic, Mayan, or ancient hieroglyphic script can be represented in one "unit". This is done by using a wider character type - for the full range, we need 32 bits. The drawback here is that most of the time we don't actually use that many different characters, so it is a bit wasteful to use 32 bits for each character, only to have zeros in the upper 24 bits nearly all the time.
A multibyte character encoding is a compromise, where "common" characters (common in the European languages) are stored as one char, but less common characters are encoded with multiple char values, using a special range of characters to indicate "there is more data in the next char to combine into a single unit". (Or one could decide to always use 2, 3, or 4 char values to encode a single character.)
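A small illustration of both points (the multibyte example assumes the string literals are stored as UTF-8):

#include <iostream>
#include <string>

int main()
{
    // std::string stores raw bytes, so embedded zero bytes are fine as long as
    // the length is tracked explicitly rather than via C-style NUL termination.
    std::string binary("\x00\x01\x02", 3);
    std::cout << binary.size() << '\n';             // 3

    // In a multibyte encoding such as UTF-8, one character may span several
    // char values: é takes two bytes and 漢 takes three.
    std::string multibyte = "\xC3\xA9\xE6\xBC\xA2"; // UTF-8 for é followed by 漢
    std::cout << multibyte.size() << '\n';          // 5 bytes, but only 2 characters
}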