Regular expression performance issue when parsing big content - regex

Suppose I have to filter some text from a file. Then I have two options: either I read the entire contents of the file into a single variable (via a FileInputStream or something else) and parse it with a regular expression, or I loop over the file line by line and apply either a regular expression or some string function to each line. Which method will be better and faster?

Most regular expression libraries (such as PCRE) are very efficient and highly optimized, so I say go with the first option.
But of course if performance is very important to you, you should use a profiler anyway; it could give you a better answer for your exact situation.

Related

Construct String from Regexp

I've been trying to find a way to construct string data from a compiled *regexp.Regexp and a given []interface{}. For example:
re := regexp.MustCompile(`(?P<word>\w+)\s*(?P<num>\d+)`)
I want to construct a string matching the structure in re from a string and an int, which may be received as interface{} values.
I can't figure out how to do that in Go. Please help me out.
Thanks in advance.
Such a library, often called Xeger, exists for many languages, including Go, where it is called regen: a tool to generate random strings from Go/RE2 regular expressions.
Here's an example:
$ regen -n 3 '[a-z]{6,12}(\+[a-z]{6,12})?#[a-z]{6,16}(\.[a-z]{2,3}){1,2}'
iprbph+gqastu#regegzqa.msp
abxfcomj#uyzxrgj.kld.pp
vzqdrmiz#ewdhsdzshvvxjk.pi
Essentially, all regen does is parse the regular expressions it's
given and iterate over the tree produced by regexp/syntax and attempt
to generate strings based on the ops described by its results. This
could probably be optimized further by compiling the resulting Regexp
into a Prog, but I didn't feel like this was worthwhile when it's a
very small tool.
Some additional information can be
found at https://godoc.org/go.spiff.io/regen.
While technically not impossible, this is a huge amount of work.
The Go regexp package compiles regular expressions into bytecode programs that are then executed to search strings. Decompiling this bytecode back into a regular expression is not impossible, but it is quite difficult, and moreover you cannot be sure that the expression you recover is the exact one that was used originally. For example
/(?:a+)+/
and
/a+/
will be compiled to the same code for matching (expressions are "simplified" before being passed to the code generator).
Most probably you want to find another solution, like saving the original strings before compiling them into regular expression objects.

How to store regex patterns, as regex objects or strings?

I have a class X, and I need to store a pattern that will later be used for matching regular expressions. At this point I simply have a member called 'patternRegex' of type std::string. Would it not be better if I stored an object of type regex? Then the name could be just 'pattern', because the type would make it clear that it is a regex. Are there any tradeoffs I should watch out for?
"Compilation" from a string to a regular expression's finite state machine is time-consuming. If you plan to use the regular expressions frequently, e.g. in loops, your code will be faster if you keep the regex objects instead of their string representations.
Regular expression strings get compiled before use. If you intend to use one regular expression more than once you may like to compile it first by instantiating a regex object.
It's better to store them as objects, because constructing a regex from a string involves parsing the string and building (implementation-defined) parsing structures. So it's better to create a member field of type std::regex.
The other answers already mentioned that you should store a std::regex because it is faster when used multiple times. I think it's worth pointing out that there is another advantage which holds even if it is used only once: it catches errors early.
In my code the string often comes from some configuration file, and I'd like to know as soon as possible whether it is a valid regular expression or not. When you store just the string, it'll only fail when first used, which might be much harder to test.

Regular Expression Vs. String Parsing

At the risk of opening a can of worms and getting negative votes, I find myself needing to ask:
When should I use Regular Expressions and when is it more appropriate to use String Parsing?
And I'm going to need examples and reasoning as to your stance. I'd like you to address things like readability, maintainability, scaling, and probably most of all performance in your answer.
I found another question here that had only one answer that even bothered to give an example. I need more to understand this.
I'm currently playing around in C++, but regular expressions exist in almost every higher-level language, and I'd also like to know how different languages use/handle regular expressions, though that's more of an afterthought.
Thanks for the help in understanding it!
Edit: I'm still looking for more examples and talk on this but the response so far has been great. :)
It depends on how complex the language you're dealing with is.
Splitting
This is great when it works, but only works when there are no escaping conventions.
It does not work for CSV for example because commas inside quoted strings are not proper split points.
foo,bar,baz
can be split, but
foo,"bar,baz"
cannot.
Regular
Regular expressions are great for simple languages that have a "regular grammar". Perl 5 regular expressions are a little more powerful due to back-references but the general rule of thumb is this:
If you need to match brackets ((...), [...]) or other nesting like HTML tags, then regular expressions by themselves are not sufficient.
You can use regular expressions to break a string into a known number of chunks -- for example, pulling out the month/day/year from a date. They are the wrong job for parsing complicated arithmetic expressions though.
Obviously, if you write a regular expression, walk away for a cup of coffee, come back, and can't easily understand what you just wrote, then you should look for a clearer way to express what you're doing. Email addresses are probably at the limit of what one can correctly & readably handle using regular expressions.
Context free
Parser generators and hand-coded pushdown/PEG parsers are great for dealing with more complicated input where you need to handle nesting so you can build a tree or deal with operator precedence or associativity.
Context free parsers often use regular expressions to first break the input into chunks (spaces, identifiers, punctuation, quoted strings) and then use a grammar to turn that stream of chunks into a tree form.
The rule of thumb for CF grammars is
If regular expressions are insufficient but all words in the language have the same meaning regardless of prior declarations then CF works.
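To make the "grammar turns chunks into a tree" idea concrete, here is a minimal hand-coded recursive-descent parser in Go (a sketch, deliberately restricted to single-digit numbers, +, * and parentheses; tokenisation is folded into character splitting since every token is one character):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Grammar:
//   expr   := term ('+' term)*
//   term   := factor ('*' factor)*
//   factor := digit | '(' expr ')'
type parser struct {
	toks []string
	pos  int
}

func (p *parser) peek() string {
	if p.pos < len(p.toks) {
		return p.toks[p.pos]
	}
	return ""
}

func (p *parser) next() string { t := p.peek(); p.pos++; return t }

func (p *parser) expr() int {
	v := p.term()
	for p.peek() == "+" {
		p.next()
		v += p.term()
	}
	return v
}

func (p *parser) term() int {
	v := p.factor()
	for p.peek() == "*" {
		p.next()
		v *= p.factor()
	}
	return v
}

func (p *parser) factor() int {
	if p.peek() == "(" {
		p.next() // consume '('
		v := p.expr()
		p.next() // consume ')'
		return v
	}
	n, _ := strconv.Atoi(p.next())
	return n
}

func eval(s string) int {
	return (&parser{toks: strings.Split(s, "")}).expr()
}

func main() {
	fmt.Println(eval("2+3*4"))   // 14 -- precedence falls out of the grammar
	fmt.Println(eval("(2+3)*4")) // 20 -- nesting falls out of the recursion
}
```

Precedence and nesting, the two things regular expressions cannot track, come for free from the shape of the grammar and the call stack.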
Non context free
If words in your language change meaning depending on context, then you need a more complicated solution. These are almost always hand-coded solutions.
For example, in C,
#ifdef X
typedef int foo
#endif
foo * bar
If foo is a type, then foo * bar is the declaration of a foo pointer named bar. Otherwise it is a multiplication of a variable named foo by a variable named bar.
It should be regular expressions AND string parsing.
You can use both of them to your advantage! Many times programmers try to build a SINGLE regular expression for parsing a text and then find it very difficult to maintain. You should use each as and when required.
The regex engine is FAST. A simple match takes less than a microsecond. But it's not recommended for parsing HTML.

Hierarchical regex expression

Is it possible/practical to build a single regular expression that matches hierarchical data?
For example:
<h1>Action</h1>
<h2>Title1</h2><div>data1</div>
<h2>Title2</h2><div>data2</div>
<h1>Adventure</h1>
<h2>Title3</h2><div>data3</div>
I would like to end up with matches.
"Action", "Title1", "data1"
"Action", "Title2", "data2"
"Adventure", "Title3", "data3"
As I see it, this would require knowing that there is a hierarchical structure at play here, and if I code the pattern to capture the H1, it only matches the first entry of that hierarchy. If I don't code for H1, then I can't capture it. I was wondering if there are any special tricks I can employ to solve this.
This is a .NET project.
The solution is to not use regular expressions. They're not powerful enough for this sort of thing.
What you want is a parser - since it looks like you're trying to match HTML, there are plenty to choose from.
It's generally considered bad practice to attempt to parse HTML/XML with RegEx, precisely because it's hierarchical. You COULD use a recursive function to do so, but a better solution in this case is to use a real XML parser. I couldn't give you better advice than that without knowing the platform you're using.
EDIT: Regex is also very slow, which is another reason it's bad for processing HTML; however, I don't know that an XML/DOM processor is likely to be faster since it's likely to use a lot more memory.
If you JUST want data from a simple document like you've demonstrated, and/or if you want to build a solution yourself, it's not that tough to do. Just build a simple, recursive, state-based stream processor that looks for tags and passes the contents to the next recursive level.
For example:
- In a recursive function, seek out a "<" character.
- Now find a ">" character.
- Preserve everything you find until the next "<" character.
- Find a ">" character.
- Pass whatever you found between those tags into the recursive function.
You'd have to work out error checking yourself, but the base case (when you return back up to the previous level) is just when there's nothing else to find.
Maybe this helps, maybe not. Good luck to you.
Regex does not work for this type of data. It is not regular, per se.
You should use an XML parser for this.

Most efficient method to parse small, specific arguments

I have a command-line application that needs to support arguments of the following forms:
all: return everything
search: return the first match to search
all*search: return everything matching search
X*search: return the first X matches to search
search#Y: return the Yth match to search
Where search can be either a single keyword or a space separated list of keywords, delimited by single quotes. Keywords are a sequence of one or more letters and digits - nothing else.
A few examples might be:
2*foo
bar#8
all*'foo bar'
This sounds just complex enough that flex/bison come to mind - but the application can expect to have to parse strings like this very frequently, and I feel like (because there's no counting involved) a fully-fledged parser would incur entirely too much overhead.
What would you recommend? A long series of string ops? A few beefy subpattern-capturing regular expressions? Is there actually a plausible argument for a "real" parser?
It might be useful to note that the syntax for this pseudo-grammar is not subject to change, so if the code turns out less-than-wonderfully-maintainable, I won't cry. This is all in C++, if that makes a difference.
Thanks!
I wouldn't recommend a full lex/yacc parser just for this. What you describe fits a simple regular expression:
((all|[0-9]+)\*)?('[A-Za-z0-9\t ]*'|[A-Za-z0-9]+)(#[0-9]+)?
If you have a regex engine that supports captures, it's easy to extract the individual pieces of information you need (most probably in captures 2, 3 and 4).
If I understood what you mean, you will probably want to check that capture 1 and capture 4 are not both non-empty at the same time.
If you need to further split the search terms, you could do it in a subsequent step, parsing capture 3.
Even without regex, I would hand-write a function. It would be simpler than dealing with lex/yacc, and I guess you could put together something even more efficient than a regular expression.
The answer mostly depends on a balance between how much coding you want to do and how many libraries you want to depend on - if your application can depend on other libraries, you can use any of the many regular expression libraries, e.g. the POSIX regex library, which comes with all Linux/Unix flavors.
OR
If you just want those specific syntaxes, I would use the string tokenizer (strtok): split on '*' and split on '#', then handle each case.
In this case the strtok approach would be much better, since the number of commands to be parsed is small.
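The split-on-delimiters idea can be sketched in Go with strings.Cut (parseArg is a hypothetical helper, not from the question; the approach is safe here only because keywords are restricted to letters and digits, so '*' and '#' can never appear inside a search term outside quotes):

```go
package main

import (
	"fmt"
	"strings"
)

// parseArg mirrors the strtok approach: split on '*' first, then on '#'.
func parseArg(s string) (count, search, index string) {
	if before, after, ok := strings.Cut(s, "*"); ok {
		count, s = before, after
	}
	if before, after, ok := strings.Cut(s, "#"); ok {
		s, index = before, after
	}
	return count, s, index
}

func main() {
	for _, s := range []string{"2*foo", "bar#8", "all*'foo bar'"} {
		count, search, index := parseArg(s)
		fmt.Printf("count=%q search=%q index=%q\n", count, search, index)
	}
}
```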