I see that there are many base64 implementations available as open source, and I have found multiple internal implementations in a product that I am maintaining.
I'm trying to factor out the duplicates, but I am not 100% certain that all these implementations give identical output. Therefore I need a dataset that tests all possible combinations of input.
Is such a dataset available somewhere? A Google search did not really turn it up.
I saw a similar question on Stack Overflow, but that one has not been fully answered, and it is actually just asking for one phrase (in ASCII) that would exercise all 64 characters. It does not cover padding with = for example, so one test string will certainly not fit the bill for a 100% test.
Perhaps something like Base64Test in Bouncy Castle would do what you want? The tricky part in base64 is handling the padding correctly, and it's certainly important to cover that, as you mentioned. Accordingly, RFC 4648 specifies these test vectors:
BASE64("") = ""
BASE64("f") = "Zg=="
BASE64("fo") = "Zm8="
BASE64("foo") = "Zm9v"
BASE64("foob") = "Zm9vYg=="
BASE64("fooba") = "Zm9vYmE="
BASE64("foobar") = "Zm9vYmFy"
Some of your implementations may produce base64 output that differs only in whether they insert line breaks, where those breaks are inserted, and which line termination is used. You would have to do additional testing to determine whether you can safely replace an implementation that uses one style with one that uses another. In particular, a decoder might make assumptions about line length or termination.
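A minimal harness that drives every implementation through those vectors could look like the sketch below; reference_base64 is only a stand-in to keep the example self-contained, and the product's own encoders would be added to the list.

#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Reference encoder used only to make the harness self-contained;
// in practice you would list the product's own implementations below.
std::string reference_base64(const std::string& in)
{
    static const char tbl[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    size_t i = 0;
    for (; i + 2 < in.size(); i += 3) {
        unsigned v = (unsigned char)in[i] << 16 | (unsigned char)in[i + 1] << 8 |
                     (unsigned char)in[i + 2];
        out += tbl[v >> 18]; out += tbl[(v >> 12) & 63];
        out += tbl[(v >> 6) & 63]; out += tbl[v & 63];
    }
    if (in.size() - i == 1) {                       // one byte left: two chars + "=="
        unsigned v = (unsigned char)in[i] << 16;
        out += tbl[v >> 18]; out += tbl[(v >> 12) & 63]; out += "==";
    } else if (in.size() - i == 2) {                // two bytes left: three chars + "="
        unsigned v = (unsigned char)in[i] << 16 | (unsigned char)in[i + 1] << 8;
        out += tbl[v >> 18]; out += tbl[(v >> 12) & 63];
        out += tbl[(v >> 6) & 63]; out += '=';
    }
    return out;
}

int main()
{
    // RFC 4648 section 10 test vectors.
    const std::vector<std::pair<std::string, std::string>> vectors = {
        {"", ""},             {"f", "Zg=="},          {"fo", "Zm8="},     {"foo", "Zm9v"},
        {"foob", "Zm9vYg=="}, {"fooba", "Zm9vYmE="},  {"foobar", "Zm9vYmFy"},
    };

    // Add each implementation you want to compare to this list.
    const std::vector<std::function<std::string(const std::string&)>> encoders = {
        reference_base64,
    };

    for (const auto& encode : encoders)
        for (const auto& [input, expected] : vectors)
            assert(encode(input) == expected);
}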
The module I'm working on holds a list of items and has a method to locate and return an item from that list based on certain criteria. The specification states that "...if several matching values are found, any one may be returned".
I'm trying to write some tests with NUnit, and I can't find anything that allows me to express this condition very well (i.e. the returned object must be either A or B, but I don't mind which).
Of course I could quite easily write code that sets a boolean according to whether the result is acceptable and then just do a simple assert on that boolean, but this whole question makes me wonder whether this is a "red flag" for unit testing and whether there's a better solution.
How do experienced unit testers generally handle the case where there are a range of acceptable outputs and you don't want to tie the test down to one specific implementation?
Since your question is in rather general form, I can only give a rather general answer, but for example...
Assert.That(someObject, Is.TypeOf<A>().Or.TypeOf<B>());
Assert.That(someObject, Is.EqualTo(objectA).Or.EqualTo(objectB));
Assert.That(listOfValidObjects, Contains.Item(someObject));
It depends on the details of what you are testing.
I am coming from Java, JUnit and parameterized tests, but it seems that NUnit supports those as well (see here).
One could use that to generate values for your different variables (and the "generator" could keep track of the expected overall result, too).
Using that approach you might find ways to avoid "hard-coding" all the potential input value combinations (as said: by actually generating them); at the very least you should be able to write code where the different input values and their expected results are more nicely co-located in your source code.
I want to write unit tests for a serialization method. By serialization method I mean a method that outputs a set of data into a special format.
For example, a method that outputs data in XML format. (I write in C++, but the problem is the same in every language.)
class Generator
{
public:
    std::string serialize();
};

// unit test (pseudo-code)
Generator gen;
// set some data in gen
std::string actual = gen.serialize();
std::string expected = "<xml>...</xml>";
ASSERT_EQUAL(expected, actual);
The problem with this is that the unit test depends heavily on unimportant details, like the formatting of the XML (line breaks) or the order of XML attributes.
And while this approach can work for XML, it will not work for generators that output binary data.
So, what is a robust way to test serialization methods?
The ideas I have are the following, but all have serious drawbacks.
Using external libraries to parse the data (for proprietary formats, none may exist).
Always writing serialization/deserialization methods in pairs and testing them in combination (bugs that cancel each other out might remain undiscovered).
Storing the serialized data in external files and comparing against them in the test (the unit test becomes difficult to read and maintain).
As you are asking about unit testing, I assume that the intended behaviour of the serializer in all its details is known to you. That is, you know where you want line breaks, indentation, etc. to be inserted.
The problem now is that in every single test case only a subset of these details is relevant. In other words, in some tests you want to check the proper indentation, and in some tests you just want to be sure that a number is inserted in the right way.
In addition to the options you have listed, I recommend another approach: use regular expression matching instead of string comparison. With regular expressions you can reduce the serialized string to the essential parts that are of interest in the respective test. To check, for example, whether the result string contains a certain number, say 42, you could match it against ^[^0-9]*42[^0-9]*$. The enclosing XML would then be ignored in this particular test, which makes the test robust against a large number of changes in the serialization.
With this approach you avoid the dependency on external parsing libraries (well, you depend on a regular expression library, but in many languages that is nowadays part of the standard library), you can also test aspects that a serialize-deserialize round trip cannot test (such as indentation), and your tests run fast and are not OS specific (no dependency on the file system).
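As a small C++ sketch of this approach (the serialized string here is made up for illustration; in a real test it would come from the generator in the question):

#include <cassert>
#include <regex>
#include <string>

int main()
{
    // Pretend this came from gen.serialize(); the exact markup is irrelevant.
    std::string actual = "<xml>\n  <answer value=\"42\"/>\n</xml>";

    // The test only cares that the number 42 appears exactly once,
    // surrounded by non-digit characters; indentation and attribute
    // order are deliberately ignored.
    std::regex essential("^[^0-9]*42[^0-9]*$");
    assert(std::regex_match(actual, essential));
}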
This is more like a long comment with my first thoughts on the topic.
I think you have to look at two different scenarios. Your data <-> serialized data relation could be either 1:1 or 1:n.
XML would be a 1:n relation, where your XML code has quite a bit of freedom but would still be deserialized to the same data again. In this case it seems to me that developing and testing serialization/deserialization in combination is the way to go. If external libraries are available as well, use them, of course. If no external libraries are available, then as long as serialization and deserialization yield the same result, you will probably not have "bugs", but "features"...
Testing the deserialization against stored external data files also makes sense, but this does not apply to the serialization, imho.
Looking at a 1:1 relation, like putting the data into a certain binary format, you should go for the stored data in external files, and again use external libraries if they exist, of course.
I would suggest doing all three of those approaches together, where applicable, of course. You should not rely on a single one of them.
I have an XML document (assume it is valid) and I must parse it and store it in a tree.
What is the best approach to parse it, without using other libraries, just basic manipulation of strings?
Keep in mind that I don't have to validate it, just parse and memorize it into a tree.
The basic structure of XML is quite simple:
<tagname [attribute[="value"] ...]>content</tagname>
where the content may contain both normal text and more XML structures, or the special form
<tagname [attribute[="value"] ...]/>
which is equivalent to
<tagname [attribute[="value"] ...]></tagname>
that is, empty content.
So if you don't need to interpret a DTD or do other fancy things, you can do the following:
1. Check that the first non-whitespace character is <. If not, you don't have XML and can just give an error and exit.
2. Now follows the tag name, until the first whitespace, or the / or the > character. Store that.
3. If the next non-whitespace character is /, check that it is followed by >. If so, you've finished parsing and can return your result. Otherwise, you've got malformed XML, and can exit with an error.
4. If the character is >, then you've found the end of the begin tag. Now follows the content. Continue at step 6.
5. Otherwise what follows is an attribute. Parse that, store the result, and continue at step 3.
6. Read the content until you find a < character.
7. If that character is followed by /, it's the end tag. Check that it is followed by the tag name and >, and if yes, return the result. Otherwise, throw an error.
8. If you get here, you've found the beginning of a nested XML element. Parse that with this algorithm, and then continue at step 6.
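For illustration, a minimal recursive-descent sketch of these steps in C++ might look roughly like the following. It assumes well-formed input and deliberately ignores the prolog, comments, processing instructions, entity references and CDATA:

#include <cctype>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Minimal tree node for the parsed document.
struct Node {
    std::string name;
    std::map<std::string, std::string> attributes;
    std::string text;                        // concatenated character data
    std::vector<std::unique_ptr<Node>> children;
};

// Bare-bones recursive descent parser following the steps above.
class Parser {
public:
    explicit Parser(const std::string& s) : src_(s) {}

    std::unique_ptr<Node> parse() {
        skipWhitespace();
        if (peek() != '<') throw std::runtime_error("expected '<'");
        return parseElement();
    }

private:
    const std::string& src_;
    size_t pos_ = 0;

    char peek() const { return pos_ < src_.size() ? src_[pos_] : '\0'; }
    char get() { return src_[pos_++]; }
    void skipWhitespace() { while (std::isspace(static_cast<unsigned char>(peek()))) ++pos_; }

    std::unique_ptr<Node> parseElement() {
        get();                                   // consume '<'
        auto node = std::make_unique<Node>();
        while (peek() && !std::isspace(static_cast<unsigned char>(peek())) &&
               peek() != '>' && peek() != '/')
            node->name += get();

        // attributes, then '/>' or '>'
        for (;;) {
            skipWhitespace();
            if (peek() == '/') {                 // self-closing tag
                get();
                if (get() != '>') throw std::runtime_error("expected '>'");
                return node;
            }
            if (peek() == '>') { get(); break; } // end of begin tag
            std::string name, value;             // attribute: name="value"
            while (peek() && peek() != '=' && !std::isspace(static_cast<unsigned char>(peek())))
                name += get();
            skipWhitespace();
            if (get() != '=') throw std::runtime_error("expected '='");
            skipWhitespace();
            char quote = get();                  // ' or "
            while (peek() && peek() != quote) value += get();
            get();                               // closing quote
            node->attributes[name] = value;
        }

        // content: text, nested elements, then the end tag
        for (;;) {
            while (peek() && peek() != '<') node->text += get();
            if (src_.compare(pos_, 2, "</") == 0) {
                pos_ += 2;
                std::string closing;
                while (peek() && peek() != '>') closing += get();
                get();                           // consume '>'
                if (closing != node->name) throw std::runtime_error("mismatched end tag");
                return node;
            }
            node->children.push_back(parseElement());
        }
    }
};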
Reading XML looks simple but doing it correctly involves a few complexities you don't really want to deal with. Indeed, writing a simple XML parser effectively amounts to creating yet another XML library. I have done it and an incomplete version of this is sitting somewhere on my disk. Even if you don't need to validate your XML structure:
whether you validate or not, you need to deal with entity references like &lt; and the variety of character references such as &#65; and &#x41;
the plain body of an XML document is relatively simple, but the header is a major pain to deal with, in particular the DTD: there are two versions thereof which are slightly different, and you probably need to process the inline DTD
even the body isn't entirely trivial because of those annoying character data (CDATA) sections
even without validation you may need to support external entity references
the characters to be accepted and/or rejected for various parts of XML are also somewhat interesting
note that XML is defined in terms of Unicode, and proper handling of this isn't entirely trivial either: simply using char or wchar_t just doesn't cut it.
The first version I implemented was a nice little iterator intended to pop out all the elements encountered. This allowed for the nice feature of easily stopping and continuing the parsing at the choice of the iterator's user. Unfortunately, I didn't get it to fly when trying to cope with the various entity references. It would parse simple XML files nicely and fast, but some quirks in the specification I just didn't get right.
What worked best for me was creating a simple recursive descent parser combined with a suitable stack of buffers to deal somewhat transparently with entity references. However, to finish this completely I still need to deal with some encoding issues, and in the end I just had higher-priority projects to work on (in my spare time, that is).
In summary: it can be done, obviously, as others did. It is probably a somewhat pointless exercise unless you have a really bright idea which makes your implementation uniquely better suited than the alternatives.
The best and only approach is to re-implement such a library from scratch without using any other libraries...
You're welcome to use existing libraries like pugixml, for example. Its installation is as simple as adding the files to your project and starting to use it, and it's lightweight compared to validating parsers such as Xerces.
I need some UTF-32 test strings to exercise some cross-platform string manipulation code. I'd like a suite of test strings that exercise the UTF-32 <-> UTF-16 <-> UTF-8 encodings, to validate that characters outside the BMP can be transformed from UTF-32, through UTF-16 surrogates, through UTF-8, and back properly.
And I always find it a bit more elegant if the strings in question aren't just composed of random bytes, but are actually meaningful in the (various) languages they encode.
Although this isn't quite what you asked for, I've always found this test document useful.
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
The same site offers this
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/quickbrown.txt
... which are equivalents of the English "quick brown fox" text for a variety of languages, exercising all the characters used. This page refers to a larger list of "pangrams" which used to be on Wikipedia but was apparently deleted there. It is still available here:
http://clagnut.com/blog/2380/
https://github.com/noct/cutf/tree/master/bin
It includes the following files:
UTF-8-demo.txt
big.txt
quickbrown.txt
utf8_invalid.txt
To really test all possible conversions between formats, as opposed to character conversions (i.e. towupper(), towlower()), you should test all characters. The following loop gives you all of those:
for (wint_t c(0); c < 0x110000; ++c)
{
    if (c >= 0xD800 && c <= 0xDFFF)
    {
        // skip the UTF-16 surrogate range: these code points are not characters
        continue;
    }
    // here 'c' is any one Unicode character in UTF-32
    ...
}
That way you can make sure you don't miss anything (i.e. a 100% complete test). This is only 1,112,064 characters, so it will be very fast on a modern computer.
Note that for basic conversions between encodings my loop above is more than enough. However, there are other features in Unicode which would require testing character pairs that behave differently when used together. That is really not necessary here.
Also I now have a separate C++ libutf8 library to convert characters between UTF-32, UTF-16, and UTF-8. The tests use loops as shown above. The tests also verify that using invalid character codes gets caught properly.
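As an illustration of such an exhaustive, loop-driven test, here is a minimal round trip through UTF-16 surrogate pairs; utf32_to_utf16 and utf16_to_utf32 are toy reference converters standing in for whatever implementation you actually want to exercise:

#include <cassert>
#include <vector>

// Toy reference converters; replace these with the code under test.
std::vector<char16_t> utf32_to_utf16(char32_t c)
{
    if (c < 0x10000)
        return { static_cast<char16_t>(c) };
    c -= 0x10000;                                        // 20 bits remain
    return { static_cast<char16_t>(0xD800 | (c >> 10)),  // high surrogate
             static_cast<char16_t>(0xDC00 | (c & 0x3FF)) };  // low surrogate
}

char32_t utf16_to_utf32(const std::vector<char16_t>& u)
{
    if (u.size() == 1)
        return u[0];
    return 0x10000 + ((char32_t(u[0]) - 0xD800) << 10) + (char32_t(u[1]) - 0xDC00);
}

int main()
{
    // Exhaustive round trip over every Unicode scalar value.
    for (char32_t c = 0; c < 0x110000; ++c)
    {
        if (c >= 0xD800 && c <= 0xDFFF)
            continue;                                    // surrogates are not characters
        assert(utf16_to_utf32(utf32_to_utf16(c)) == c);
    }
}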
Hmmm
You could find a lot of incidental data by googling (and see the right column for questions like these on SO...)
However, I recommend you pretty much build your test strings as byte arrays. It is not really about 'what data', but about whether Unicode gets handled correctly.
E.g. you will want to make sure that identical strings in different normalized forms (i.e. even if not in canonical form) still compare equal.
You will want to check that string length detection is robust (and recognizes single, double, triple and quadruple byte characters). You will want to check that traversing a string from beginning to end honours the same logic. Add more targeted tests for random access of Unicode characters.
These are all things you knew, I'm sure. I'm just spelling them out to remind you that you need test data catered to exactly the edge cases, the logical properties that are intrinsic to Unicode.
Only then will you have proper test data.
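For example, the length-detection point above could be pinned down with a check along these lines (a sketch; utf8_length is a hypothetical helper that counts code points by skipping UTF-8 continuation bytes):

#include <cassert>
#include <string>

// Counts Unicode code points in a UTF-8 string by skipping
// continuation bytes (those of the form 10xxxxxx).
std::size_t utf8_length(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80)
            ++count;
    return count;
}

int main()
{
    assert(utf8_length("abc") == 3);                 // 1-byte characters
    assert(utf8_length("\xC3\xA9") == 1);            // U+00E9, 2 bytes
    assert(utf8_length("\xE2\x82\xAC") == 1);        // U+20AC, 3 bytes
    assert(utf8_length("\xF0\x9F\x99\x82") == 1);    // U+1F642, 4 bytes
}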
Beyond this scope (technically correct Unicode handling) lies actual localization (collation, charset conversion, etc.). For that, I refer to the Turkey Test.
Here are helpful links:
http://minaret.info/test/collate.msp
http://www.moserware.com/2008/02/does-your-code-pass-turkey-test.html
You can try this one (there are some sentences in Russian, Greek, Chinese, etc. to test Unicode):
http://www.madore.org/~david/misc/unitest/
I need to search incoming not-very-long pieces of text for occurrences of given strings. The strings are constant for the whole session and are not many (~10). Additional simplification is that none of the strings is contained in any other.
I am currently using boost regex matching with str1 | str2 | .... The performance of this task is important, so I wonder if I can improve it. Not that I can program better than the boost guys, but perhaps a dedicated implementation is more efficient than a general one.
As the strings stay constant over long time, I can afford building a data structure, like a state transition table, upfront.
E.g., if the strings are abcx, bcy and cz, and I've read abc so far, I should be in a combined state that means I'm either 3 chars into string 1, 2 chars into string 2, or 1 char into string 3. Then reading x next will move me to the string-1-matched state, etc., any char other than x, y or z will move me back to the initial state, and I will not need to backtrack to b.
Any ideas or references are appreciated.
Check out the Aho–Corasick string matching algorithm!
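To make that suggestion concrete, here is a minimal byte-oriented sketch of an Aho–Corasick automaton (a trie of the needles whose goto and failure transitions are folded into one table, so the scan never backs up); it is meant as an illustration rather than a tuned implementation:

#include <iostream>
#include <queue>
#include <string>
#include <utility>
#include <vector>

class AhoCorasick {
public:
    explicit AhoCorasick(const std::vector<std::string>& needles)
        : next_(1, std::vector<int>(256, 0)), fail_(1, 0), out_(1, -1)
    {
        // Build the trie of needles.
        for (int id = 0; id < (int)needles.size(); ++id) {
            int s = 0;
            for (unsigned char ch : needles[id]) {
                if (next_[s][ch] == 0) {
                    next_.emplace_back(256, 0);
                    fail_.push_back(0);
                    out_.push_back(-1);
                    next_[s][ch] = (int)next_.size() - 1;
                }
                s = next_[s][ch];
            }
            out_[s] = id;                      // needle `id` ends in state s
        }
        // Breadth-first pass merges goto and failure links into next_,
        // turning the trie into a DFA over bytes.
        std::queue<int> q;
        for (int c = 0; c < 256; ++c)
            if (next_[0][c]) { fail_[next_[0][c]] = 0; q.push(next_[0][c]); }
        while (!q.empty()) {
            int s = q.front(); q.pop();
            for (int c = 0; c < 256; ++c) {
                int t = next_[s][c];
                if (t) {
                    fail_[t] = next_[fail_[s]][c];
                    // Inherit output via the failure link (only one id per
                    // state is kept, enough when no needle contains another).
                    if (out_[t] < 0) out_[t] = out_[fail_[t]];
                    q.push(t);
                } else {
                    next_[s][c] = next_[fail_[s]][c];
                }
            }
        }
    }

    // Returns (needle index, end position) pairs for every occurrence.
    std::vector<std::pair<int, size_t>> find(const std::string& text) const {
        std::vector<std::pair<int, size_t>> hits;
        int s = 0;
        for (size_t i = 0; i < text.size(); ++i) {
            s = next_[s][(unsigned char)text[i]];
            if (out_[s] >= 0) hits.emplace_back(out_[s], i);
        }
        return hits;
    }

private:
    std::vector<std::vector<int>> next_;
    std::vector<int> fail_;
    std::vector<int> out_;
};

int main()
{
    AhoCorasick ac({"abcx", "bcy", "cz"});     // the needles from the question
    for (auto [id, end] : ac.find("aabcyxxczq"))
        std::cout << "needle " << id << " ends at offset " << end << '\n';
}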
Take a look at Suffix Tree.
Look at this: http://www.boost.org/doc/libs/1_44_0/libs/regex/doc/html/boost_regex/configuration/algorithm.html
The existence of a recursive/non-recursive distinction is a pretty strong hint that Boost is not necessarily a linear-time deterministic finite-state machine. Therefore, there's a good chance you can do better for your particular problem.
The best answer depends quite a bit on how many haystacks you have and the minimum size of a needle. If the smallest needle is longer than a few characters, you may be able to do a little bit better than a generalized regex library.
Basically all string searches work by testing for a match at the current position (cursor), and if none is found, then trying again with the cursor slid farther to the right.
Knuth–Morris–Pratt (or the closely related finite-automaton matcher) builds a DFSM out of the string for which you are searching, so that the test and the cursor motion are combined in a single operation. However, it was originally designed for a single needle, so you would need to support backtracking if one match could ever be a proper prefix of another. (Remember that for when you want to reuse your code.)
Another tactic is to slide the cursor more than one character to the right if at all possible. Boyer-Moore does this. It's normally built for a single needle. Construct a table of all characters and the rightmost position that they appear in the needle (if at all). Now, position the cursor at len(needle)-1. The table entry will tell you (a) what leftward offset from the cursor that the needle might be found, or (b) that you can move the cursor len(needle) farther to the right.
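Here is a sketch of that single-needle variant (Boyer–Moore–Horspool, which uses only the bad-character table just described):

#include <array>
#include <cstddef>
#include <iostream>
#include <string>

// Boyer–Moore–Horspool: a simplified Boyer–Moore using only the
// bad-character table (single needle).
size_t horspool_find(const std::string& haystack, const std::string& needle)
{
    if (needle.empty() || haystack.size() < needle.size())
        return std::string::npos;

    // shift[c] = how far the cursor may jump when the haystack byte under
    // the last needle position is c; defaults to the full needle length.
    std::array<size_t, 256> shift;
    shift.fill(needle.size());
    for (size_t i = 0; i + 1 < needle.size(); ++i)
        shift[(unsigned char)needle[i]] = needle.size() - 1 - i;

    size_t pos = 0;
    while (pos + needle.size() <= haystack.size()) {
        // Compare right to left at the current cursor position.
        size_t i = needle.size();
        while (i > 0 && haystack[pos + i - 1] == needle[i - 1]) --i;
        if (i == 0) return pos;                       // full match
        pos += shift[(unsigned char)haystack[pos + needle.size() - 1]];
    }
    return std::string::npos;
}

int main()
{
    std::cout << horspool_find("find the needle in this haystack", "needle") << '\n';  // prints 9
}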
When you have more than one needle, the construction and use of your table grows more complicated, but it still may possibly save you an order of magnitude on probes. You still might want to make a DFSM but instead of calling a general search method, you call does_this_DFSM_match_at_this_offset().
Another tactic is to test more than 8 bits at a time. There's a spam-killer tool that looks at 32-bit machine words at a time. It then does some simple hashing to fit the result into 12 bits and looks in a table to see if there's a hit. It has four entries for each pattern (offsets of 0, 1, 2, and 3 from the start of the pattern), and that way, despite thousands of patterns in the table, it only does one or two tests per 32-bit word of the subject line.
So in general, yes, you can go faster than regexes WHEN THE NEEDLES ARE CONSTANT.
I've been looking at the answers but none seem quite explicit... and mostly boiled down to a couple of links.
What intrigues me here is the uniqueness of your problem: the solutions presented so far do not capitalize at all on the fact that we are looking for several needles at once in the haystack.
I would take a look at KMP / Boyer–Moore, for sure, but I would not apply them blindly (at least if you have some time on your hands), because they are tailored for a single needle, and I am pretty convinced we could capitalize on the fact that we have several strings and check all of them at once, with a custom state machine (or custom tables for BM).
Of course, it's unlikely to improve the big O (Boyer Moore runs in 3n for each string, so it'll be linear anyway), but you could probably gain on the constant factor.
Regex engine initialization is expected to have some overhead, so if there are no real regular expressions involved, plain C memcmp() should do fine.
If you can tell us the file sizes and give some specific use cases, we could build a benchmark (I consider this very interesting).
Interesting: memcmp explorations and timing differences
Regards
rbo
There is always Boyer Moore
Besides the Rabin–Karp algorithm and the Knuth–Morris–Pratt algorithm, my algorithms book suggests a finite state machine for string matching. For every search string you need to build such a finite state machine.
You can do it with the very popular Lex & Yacc tools, or with their Flex and Bison counterparts.
You can use Lex to get tokens from the string.
Compare your pre-defined strings with the tokens returned from Lexer.
When match is found, perform the desired action.
There are many sites which describe Lex and Yacc.
One such site is http://epaperpress.com/lexandyacc/