Get all available characters? - unit-testing

Sometimes you have a string/text that is encoded, decrypted, Base64-converted, and so on. Most of the time this works, but occasionally you miss that one of the transformations loses a bit of data. This can be hard to find because the truncation only happens with certain characters. To make sure everything works as intended, a unit test that "packs" and "unpacks" the string/text is a good validation, but to be really sure you want a test string that contains all possible characters, no matter the culture.
Is there any way in C# to easily get all available characters into a string, from all cultures?
Regards

Related

Can this regex be made memory efficient

I get XML as a plain, unformatted text blob. I have to make some replacements, and I use regex find and replace.
For example:
<MeasureValue><Text value="StartCalibration" /></MeasureValue>
has to be converted to
<MeasureValue type="Text" value="StartCalibration"/>
The regex I wrote was
<MeasureValue><((\w*)\s+value="(.*?)".*?)></MeasureValue>
And the replacement part was:
<MeasureValue type="$2" value="$3"/>
Here is a link showing the same.
The issue is that in a file with 370 such occurrences, I get an out-of-memory error. I have heard of the so-called greedy regex patterns and wonder if that could be what is plaguing me. If this is already memory efficient, then I will leave it as it is and try to increase the server memory. I have to process thousands of such documents.
EDIT: This is part of a script for Logstash from Elasticsearch. As per the documentation, Elasticsearch uses Apache Lucene internally to parse regular expressions. Not sure if that helps.
As a rule of thumb, specificity is positively correlated with efficiency in regex.
So, know your data and build something to match it surgically.
The more specifically you build your regex, even literally writing out the pattern (and usually ending up with a monstrous regex), the fewer resources it will take, because there are fewer "possibilities" it can match in your data.
To be more precise, imagine we are trying to match a string
2014-08-26 app[web.1]: 50.0.134.125
Approaches such as
(.*) (.*) (.*)
leave it too open and prone to match MANY different patterns and combinations, and thus take a LOT more work to process the near-endless possibilities. Check here: https://regex101.com/r/GvmPOC/1
On the other hand, you could spend a little more time building a more elaborate expression, such as:
^[0-9]{4}\-[0-9]{2}-[0-9]{2} app\[[a-zA-Z0-9.]+\]\: [0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$
and I agree, it is horrible, but it is much more precise. It won't waste your precious resources finding unnecessary stuff. Check here: https://regex101.com/r/quz7fo/1
Another thing to keep in mind: operators such as * or + do a scan operation, which, depending on the size of your string, might take some time. Also, whenever possible, specifying the anchors ^$ helps the engine avoid trying to find too many matches within the same string.
Bringing it to your reality...
If we have to use regex, the million-dollar question is: how can we turn your regex into something more precise?
Since there is no limit on tag name lengths in XML... there is no way to make it utterly specific :(
We could try to specify which characters to match and avoid . and \w; substituting something more like a-zA-Z is preferable. Making use of negated classes [^] also helps narrow down the range of possibilities.
Avoid * and ? and try to put in a bounded quantifier {} where you can (although I don't know your data well enough to make this decision), and as I stated above, XML imposes no limit here.
I didn't understand precisely the function of the ? in your code, so removing it is one less thing to process.
I ended up with something like:
<(([a-zA-Z]+) value="([^"]*)"[^<>]*)>
Not many changes, though. You can try to measure it to see whether there was any improvement.
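For the "measure it" part, here is a rough, illustrative sketch using std::regex and std::chrono. This is my own assumption of a test harness: the question's engine is Lucene inside Logstash, so absolute numbers will differ, but timing both patterns against the same synthetic input at least shows whether the tightened pattern helps.
#include <chrono>
#include <iostream>
#include <iterator>
#include <regex>
#include <string>

// Time how long it takes to find all matches of `pattern` in `text`.
long long microseconds_for(const std::string& pattern, const std::string& text) {
    std::regex re(pattern);
    auto start = std::chrono::steady_clock::now();
    auto hits = std::distance(std::sregex_iterator(text.begin(), text.end(), re),
                              std::sregex_iterator());
    auto stop = std::chrono::steady_clock::now();
    std::cout << hits << " matches, ";
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}

int main() {
    // Synthetic input: 370 occurrences, as in the question.
    std::string text;
    for (int i = 0; i < 370; ++i)
        text += "<MeasureValue><Text value=\"StartCalibration\" /></MeasureValue>\n";

    std::cout << microseconds_for(
        R"rx(<MeasureValue><((\w*)\s+value="(.*?)".*?)></MeasureValue>)rx", text)
              << " us (original)\n";
    std::cout << microseconds_for(
        R"rx(<(([a-zA-Z]+) value="([^"]*)"[^<>]*)>)rx", text)
              << " us (tightened)\n";
}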
But perhaps the best approach is not to use regex at all :(
I don't know the language you are working with, but if the processing time is getting out of hand, I would suggest not using regex and trying some alternative.
If there is even a slight possibility of using an XML parser, that would be preferable.
https://softwareengineering.stackexchange.com/questions/113237/when-you-should-not-use-regular-expressions
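For illustration, here is a minimal sketch of the parser route, assuming something like pugixml were available; the wrapping <Root> element and the sample input are placeholders, and the same idea applies to whatever XML library your environment offers.
#include <iostream>
#include "pugixml.hpp"

int main() {
    // Hypothetical input wrapped in a root element so it is a well-formed document.
    const char* xml =
        "<Root><MeasureValue><Text value=\"StartCalibration\" /></MeasureValue></Root>";

    pugi::xml_document doc;
    if (!doc.load_string(xml)) return 1;

    // Turn <MeasureValue><X value="..."/></MeasureValue>
    // into <MeasureValue type="X" value="..."/>.
    for (pugi::xml_node mv : doc.child("Root").children("MeasureValue")) {
        pugi::xml_node inner = mv.first_child();
        if (!inner) continue;
        mv.append_attribute("type") = inner.name();
        mv.append_attribute("value") = inner.attribute("value").value();
        mv.remove_child(inner);
    }

    doc.save(std::cout);  // writes the rewritten document
}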
Sorry if this wasn't as conclusive as you might have expected, but it is a very open-ended problem.

C++ Parsing Library with UTF-8 support

Let's say I want to write a parser for a programming language (the EBNF is already known), and I want it done with as little fuss as possible. Also, I want to support identifiers made of any UTF-8 letters. And I want it in C++.
flex/bison have non-existent UTF-8 support, as I read it. ANTLR seems not to have working C++ output.
I've considered boost::spirit; they state on their site that it's actually not meant for a full parser.
What else is left? Rolling it entirely by hand?
If you don't find something that has the support you want, don't forget that flex is mostly independent of the encoding. It lexes an octet stream, and I've used it to lex pure binary data. Something encoded in UTF-8 is an octet stream and can be handled by flex if you accept doing some of the work manually. I.e., instead of having
idletter [a-zA-Z]
if you want to accept as a letter everything in the Latin-1 supplement range except NBSP (in other words, the range U+00A1-U+00FF), you have to do something like (I may have messed up the encoding, but you get the idea)
idletter [a-zA-Z]|\xC2[\xA1-\xBF]|\xC3[\x80-\xBF]
You could even write a preprocessor that does most of the work for you (i.e., replaces \u00A1 with \xC2\xA1 and [\u00A1-\u00FF] with \xC2[\xA1-\xBF]|\xC3[\x80-\xBF]). How much work the preprocessor is depends on how generic you want your input to be; there will come a time when you'd probably be better off integrating the work into flex and contributing it upstream.
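As a rough sketch of that preprocessor idea (only for two-byte code points U+0080-U+07FF, and the function name is mine), the byte arithmetic looks like this:
#include <cstdio>

// Emit a flex-style byte pattern for a range of two-byte UTF-8 code points.
// For U+00A1..U+00FF this prints \xC2[\xA1-\xBF]|\xC3[\x80-\xBF].
void emit_two_byte_range(unsigned lo, unsigned hi) {
    unsigned lead_lo = 0xC0 | (lo >> 6), lead_hi = 0xC0 | (hi >> 6);
    for (unsigned lead = lead_lo; lead <= lead_hi; ++lead) {
        unsigned from = (lead == lead_lo) ? (0x80 | (lo & 0x3F)) : 0x80;
        unsigned to   = (lead == lead_hi) ? (0x80 | (hi & 0x3F)) : 0xBF;
        std::printf("%s\\x%02X[\\x%02X-\\x%02X]", lead == lead_lo ? "" : "|", lead, from, to);
    }
    std::printf("\n");
}

int main() {
    emit_two_byte_range(0x00A1, 0x00FF);  // Latin-1 supplement minus NBSP
}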
The parser works with tokens; it's not its duty to know the encoding. It will usually just compare token ids, and in the places where you code your special rules you can compare the underlying UTF-8 strings the way you would anywhere else.
So you need a UTF-8 lexer? Well, it depends heavily on how you define your problem. If you define your identifiers to consist of ASCII alphanumerics plus anything non-ASCII, then flex will suit your needs just fine. If you want to actually feed Unicode ranges to the lexer, you'll need something more complicated. You could look at Quex. I've never used it myself, but it claims to support Unicode. (Although I would kill somebody for "free tell/seek based on character indices".)
EDIT: Here is a similar question; it claims that flex won't work because of a bug that ignores that some implementations may have a signed char. It may be outdated, though.

Efficient memory storage and retrieval of categorized string literals in C++

Note: This is a follow-up to this question.
I have a "legacy" program that does hundreds of string matches against big chunks of HTML. For example, if the HTML matches 1 of 20+ strings, do something. If it matches 1 of 4 other strings, do something else. There are 50-100 groups of these strings to match against these chunks of HTML (usually whole pages).
I'm taking a whack at refactoring this mess of code and trying to come up with a good approach for doing all these matches.
The performance requirements of this code are rather strict. It must not wait on I/O when doing these matches, so the strings need to be in memory. Also, there can be 100+ copies of this process running at the same time, so heavy I/O at startup could cause slow I/O for the other copies.
With these requirements in mind, it would be most efficient if only one copy of these strings were stored in RAM (see my previous question linked above).
This program currently runs on Windows with the Microsoft compiler, but I'd like to keep the solution as cross-platform as possible, so I don't think I want to use PE resource files or anything similar.
Mmapping an external file might work, but then I have the issue of keeping the program version and the data version in sync; one does not normally change without the other. It also requires some file "format", which adds a layer of complexity I'd rather not have.
So after all of this preamble, it seems like the best solution is a bunch of arrays of strings that I can then iterate over. This seems kind of messy, as I'm mixing code and data heavily, but given the above requirements, is there any better way to handle this sort of situation?
I'm not sure just how slow the current implementation is, so it's hard to recommend optimizations without knowing what level of optimization is needed.
Given that, however, I might suggest a two-stage approach: take your string list and compile it into a radix tree, then save this tree in some custom format (XML might be good enough for your purposes).
Your process startup then consists of reading in the radix tree and matching. If you want or need to optimize the memory storage of the tree, that can be done as a separate project, but it sounds to me like improving the matching algorithm would be a more efficient use of time. In some ways this is a "roll your own regex system" idea, rather similar to the suggestion to use a parser generator.
Edit: I've used something similar to this where, as a precompile step, a custom script generates a somewhat optimized structure and saves it to a large char* array. (Obviously it can't be too big, but it's another option.)
The idea is to keep the list there (making maintenance reasonably easy) but have the pre-compilation step speed up access at runtime.
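As an illustration of that precompile step (the file names and the output layout are made up here), the generator can be as simple as a tool that turns the pattern list into a header the matcher includes at build time:
#include <fstream>
#include <string>

// Reads one pattern per line from patterns.txt (hypothetical name) and writes a
// header with a single static table; a real generator could emit a trie or any
// other precomputed structure instead.
int main() {
    std::ifstream in("patterns.txt");
    std::ofstream out("patterns_generated.h");
    out << "static const char* const kPatterns[] = {\n";
    std::string line;
    while (std::getline(in, line))
        if (!line.empty())
            out << "    \"" << line << "\",\n";  // assumes patterns contain no quotes or backslashes
    out << "};\n";
}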
If the strings to be matched can be locked down at compile time, you should consider using a tokenizer generator like lex to scan your input for matches. If you aren't familiar with it: lex takes a source file containing regular expressions (including the simplest regular expressions, string literals) and C action code to be executed when a match is found. It is often used in building compilers and similar programs, and there are several other similar tools you could also use (flex and ANTLR come to mind). lex builds state machine tables and then generates efficient C code for matching input against the regular expressions those state tables represent (input is standard input by default, but you can change this). Using this method would probably not result in the duplication of strings (or other data) in memory among the different instances of your program that you fear. You could probably generate the regular expressions from the string literals in your existing code fairly easily, but it may take a good bit of work to rework your program to use the code that lex generates.
If the strings you have to match change over time, there are regular expression libraries that can compile regular expressions at run time, but these do use lots of RAM, and depending on your program's architecture they might be duplicated across different instances of the program.
The great thing about using a regular expression approach rather than lots of strcmp calls is that if you had the patterns:
"string1"
"string2"
"string3"
and the input:
"string2"
The partial match for "string" would be done just once by a DFA (Deterministic Finite-state Automaton) regular expression system (like lex), which would probably speed up your system. Building these things does require a lot of work on lex's behalf, but all of the hard work is done up front.
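To make the shared-prefix point concrete, here is a small hand-rolled sketch (not what lex generates, and the structure names are mine) in which the common prefix "string" is walked only once, no matter how many literals start with it:
#include <iostream>
#include <map>
#include <memory>
#include <string>

struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> children;
    int pattern_id = -1;              // -1 means no literal ends at this node
};

void insert(TrieNode& root, const std::string& literal, int id) {
    TrieNode* node = &root;
    for (char c : literal) {
        auto& child = node->children[c];
        if (!child) child = std::make_unique<TrieNode>();
        node = child.get();
    }
    node->pattern_id = id;
}

// Walks the input once and returns the id of the longest literal that is a
// prefix of `input`, or -1 if none matches.
int match(const TrieNode& root, const std::string& input) {
    const TrieNode* node = &root;
    int best = -1;
    for (char c : input) {
        auto it = node->children.find(c);
        if (it == node->children.end()) break;
        node = it->second.get();
        if (node->pattern_id != -1) best = node->pattern_id;
    }
    return best;
}

int main() {
    TrieNode root;
    insert(root, "string1", 1);
    insert(root, "string2", 2);
    insert(root, "string3", 3);
    std::cout << match(root, "string2") << "\n";  // prints 2
}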
Are these literal strings stored in a file? If so, as you suggested, your best option might be to use memory-mapped files to share copies of the file across the hundreds of instances of the program. Also, you may want to try adjusting the working set size to see if you can reduce the number of page faults, but given that you have so many instances it might prove counterproductive (and besides, your program needs quota privileges to adjust the working set size).
There are other tricks you can try to optimize I/O performance, like allocating large pages, but that depends on your file size and the privileges granted to your program.
The bottom line is that you need to experiment to see what works best, and remember to measure after each change :)...

What is the best way to find wide string headaches such as L"%s"?

Here is an example of one of the headaches I mean:
We have a multiplatform project that uses mostly Unicode strings for rendering text to the screen. On Windows in VC++, the line:
swprintf(swWideDest, LEN, L"%s is a wide string", swSomeWideString);
compiles fine and prints the wide string into the other wide string.
However, this should really be:
swprintf(swWideDest, LEN, L"%ls is a wide string", swSomeWideString);
Without replacing the '%s' with a '%ls', this will not work on other platforms. Since testing in our environment on Windows is easier, quicker, and far simpler to debug, these kinds of bugs can easily go unnoticed.
I know that the best solution is to write correct code in the first place, but under pressure simple mistakes are made, and in this particular case the mistake can easily go unnoticed for a long time.
I suspect there are many variations on this sort of bug that we have yet to enjoy.
Does anyone have a nice and neat way of finding these kinds of bugs?
:D
You might want to have a look at FastFormat in case Boost.Format is too slow for your needs.
Compared to stringstreams and Boost.Format:
IOStreams: FastFormat.Format is faster than IOStreams, by between ~100-900%, in all cases
Boost.Format: FastFormat.Format is faster than Boost.Format, by between ~400-1650%, in all cases
As none of the functions in the *printf family are typesafe, you can either
search for probable errors via regular expressions and fix them manually, or
use another approach that is typesafe, maybe based on stringstreams or Boost.Format (see the sketch below).
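A minimal sketch of the second option, using a wide string stream so the compiler picks the right overload and there is simply no %s/%ls to get wrong (the function name here is just for illustration):
#include <iostream>
#include <sstream>
#include <string>

// Typesafe alternative to swprintf for this particular message: operator<<
// handles wide strings correctly, so there is no format specifier to mistype.
std::wstring describe(const std::wstring& someWideString) {
    std::wostringstream out;
    out << someWideString << L" is a wide string";
    return out.str();
}

int main() {
    std::wcout << describe(L"example") << L"\n";
}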

Parsing a string in C++

I have a huge set of log lines and I need to parse each line (so efficiency is very important).
Each log line is of the form
cust_name time_start time_end (IP or URL )*
So: a customer name, two times, and a possibly empty list of IP addresses or URLs. If there is only one IP or URL in the last list, there is no separator; if there is more than one, they are separated by semicolons.
I need a way to parse this line and read it into a data structure. time_start and time_end could be either system time or GMT. cust_name could also consist of multiple strings separated by spaces.
I can do this by reading character by character and essentially writing my own parser.
Is there a better way to do this?
Maybe the Boost.Regex library will help you.
http://www.boost.org/doc/libs/1_38_0/libs/regex/doc/html/index.html
I've had success with Boost Tokenizer for this sort of thing. It helps you break an input stream into tokens with custom separators between the tokens.
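A rough sketch of what that looks like for a line in the format above (the separator set and the sample line are my assumptions; in practice you would likely tokenize on spaces first and handle the semicolon list separately):
#include <boost/tokenizer.hpp>
#include <iostream>
#include <string>

int main() {
    std::string line = "acme 10:00:00 10:05:00 10.0.0.1;example.com";

    // Treat both spaces and semicolons as separators for this simple demo.
    boost::char_separator<char> sep(" ;");
    boost::tokenizer<boost::char_separator<char>> tokens(line, sep);

    for (const std::string& token : tokens)
        std::cout << token << "\n";
}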
Using regular expressions (boost::regex is a nice implementation for C++), you can easily separate the different parts of your string (cust_name, time_start, ...) and find all those URLs/IPs.
The second step is more detailed parsing of those groups if needed. Dates, for example, you can parse using the Boost.Date_Time library (writing a custom parser if the string format isn't standard).
Why do you want to do this in C++? It sounds like an obvious job for something like Perl.
Consider using a Regular Expressions library...
Custom input demands a custom parser. Or pray that we live in an ideal world and errors don't exist, especially if you want efficiency. Posting some code may be of help.
For such a simple grammar you can use split; take a look at http://www.boost.org/doc/libs/1_38_0/doc/html/string_algo/usage.html#id4002194
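For instance (a sketch only; it assumes a single-token customer name and that the trailing IP/URL list arrives as one semicolon-separated field):
#include <boost/algorithm/string.hpp>
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::string line = "acme 10:00:00 10:05:00 10.0.0.1;example.com";

    // First split the line on spaces into cust_name, time_start, time_end, rest.
    std::vector<std::string> fields;
    boost::split(fields, line, boost::is_any_of(" "), boost::token_compress_on);

    // Then split the optional trailing field on ';' to get the IPs/URLs.
    std::vector<std::string> endpoints;
    if (fields.size() > 3)
        boost::split(endpoints, fields.back(), boost::is_any_of(";"));

    std::cout << endpoints.size() << " endpoints\n";
}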
UPDATE changed answer drastically!
I have a huge set of log lines and I need to parse each line (so efficiency is very important).
Just be aware that C++ won't help much in terms of efficiency in this situation. Don't be fooled into thinking that just because you have fast parsing code in C++, your program will have high performance!
The efficiency you really need here is not performance at the "machine code" level of the parsing code, but at the overall algorithm level.
Think about what you're trying to do.
You have a huge text file, and you want to convert each line to a data structure.
Storing a huge data structure in memory is very inefficient, no matter what language you're using!
What you need to do is "fetch" one line at a time, convert it to a data structure, and deal with it; then, and only when you're done with that data structure, fetch the next line, convert it, deal with it, and repeat.
If you do that, you've already solved the major bottleneck.
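A sketch of that streaming loop (the file name is made up); only one line is ever held in memory:
#include <fstream>
#include <string>

int main() {
    std::ifstream in("huge.log");  // hypothetical log file
    std::string line;
    while (std::getline(in, line)) {
        // Parse `line` into a record and act on it here; once you are done,
        // the next iteration simply overwrites it.
    }
}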
For parsing the line of text: the format of your data seems quite simple; check out a similar question that I asked a while ago: C++ string parsing (python style)
In your case, I suppose you could use a string stream and the >> operator to read the next "thing" in the line.
see this answer for example code.
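A minimal sketch of that idea, with the caveat that it assumes cust_name is a single token and that the whole IP/URL list arrives as one semicolon-separated field (adjust for multi-word names and your actual timestamp formats):
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

struct LogLine {
    std::string cust_name, time_start, time_end;
    std::vector<std::string> endpoints;
};

bool parse(const std::string& line, LogLine& out) {
    std::istringstream in(line);
    std::string list;
    if (!(in >> out.cust_name >> out.time_start >> out.time_end)) return false;
    if (in >> list) {  // optional semicolon-separated list of IPs/URLs
        std::istringstream items(list);
        std::string item;
        while (std::getline(items, item, ';'))
            if (!item.empty()) out.endpoints.push_back(item);
    }
    return true;
}

int main() {
    LogLine record;
    if (parse("acme 10:00:00 10:05:00 10.0.0.1;example.com", record))
        std::cout << record.cust_name << " has " << record.endpoints.size() << " endpoints\n";
}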
Alternatively (I didn't want to delete this part!!):
If you could write this in Python, it would be much simpler. I don't know your situation (it seems you're stuck with C++), but still:
Look at this presentation on doing these kinds of tasks efficiently using Python generator expressions: http://www.dabeaz.com/generators/Generators.pdf
It's a worthwhile read.
At slide 31 he deals with what seems to be something very similar to what you're trying to do.
It'll at least give you some inspiration.
It also demonstrates quite strongly that performance is gained not from the particular string-parsing code, but from the overall algorithm.
You could try using a simple lex/yacc or flex/bison vocabulary to parse this kind of input.
The parser you need sounds really simple. Take a look at this. Any compiled language should be able to parse it at very high speed. Then it's a question of what data structure you build and save.