Regular expression incremental parsing

Regular expression incremental parsing - regex

Are there languages or tools that support the parsing of regexes on a character-by-character basis?
I think this may be equivalent to "regexes on streams" which is something that seems to be one of the features of the upcoming Perl version 6.
Basically I want to do this because I'm building a tool that does translation of a terminal stream over a pseudo-terminal, and it occurred to me that the ultimate sort of flexibility that should be attainable is by allowing the specification of regex-replace expressions.
The use case is that I want to allow my mouse scroll events to be passed to a naive program such as the less pager, which means my tool (which spawns less over a PTY) will be doing something like issuing the code \x1b[?1000h which switches on mouse reporting, and then subsequently translating every mouse wheel escape code received thereafter such as \x1b[M!! (the last several chars encode the mouse position within the terminal and should be ignored but also stripped) into the \x1b[A Up-arrow code.
As you can see being able to specify a regex that works on the stdin terminal-reading stream to generate the translated stream to send to the slave pty would be ideal.
Do I need to wait for Perl 6 to be able to achieve this? There must be particular reasons for why regex engines generally require having the whole string available?
It's pretty obvious I don't need the full blown power of regex here. I can speculate for instance that it might be the case that supporting backtracking makes stream-parsing regex impossible.
So since I don't need backtracking maybe there is some sort of light-weight regex engine out there that provides a stream API. It just seems like taking advantage of some form of parsing system (if one exists that is suitable) would be smarter than building something arbitrary.

Looks like s2p is an example of something that I can use.
In particular, the potential of being able to set $| to not do line-buffering.
Actually I don't think this will work. It seems to be built around lines and uses the s operator to run regex.

Related

Specific Regex Failing on Neko and Native

So I'm working on some cleanup in haxeflixel, and I need to validate a csv map, so I'm using a regex to check if its ok (don't mention the ending commas, I know thats not valid csv but I want to allow it), and I think I have a decent regex for doing that, and it seems to work well on flash, but c++ crashes, and neko gives me this error: An error occured while running pcre_exec....
here is my regex, I'm sorry its long, but I have no idea where the problem is...
^(([ ]*-?[0-9]+[ ]*,?)+\r?\n?)+$
if anyone knows what might be going on I'd appreciate it,
Thanks,
Nico
ps. there are probably errors in my regex for checking csv, but I can figure those out, its kind of enjoyable, I'd rather just know what specifically could be causing this:)
edit: ah, I've just noticed this doesn't happen on all strings, once I narrow it down to what strings, I will post one... as for what I'm checking for, its basically just to make sure theres no weird xml header, or any non integer value in the map file, basically it should validate this:
1,1,1,1
1,1,1,1
1,1,1,1
or this:
1,1,1,1,
1,1,1,1,
1,1,1,1,
but not:
xml blahh blahh>
1,m,1,1
1,1,b,1
1,1,1,1
xml>
(and yes I know thats not valid xml;))
edit: it gets stranger:
so I'm trying to determine what strings crash it, and while this still wouldnt explain a normal map crashing, its definatly weird, and has the same result:
what happens is:
this will fail a .match() test, but not crash:
a
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
while this will crash the program:
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,*a*,1,1,1,1,1,1,1,1,1,1,1,1,1

To be honest, you wrote one of the worst regexps I ever seen. It actually looks like it was written specifically to be as slow as possible. I write it not to offend you, but to express how much you need to learn to write regexps(hint: writing your own regexp engine is a good exercise).
Going to your problem, I guess it just runs out of memory(it is extremely memory intensive). I am not sure why it happens only on pcre targets(both neko and cpp targets use pcre), but I guess it is about memory limits per regexp run in pcre or some heuristics in other targets to correct such miswritten regexps.
I'd suggest something along the lines of
~/^(( *-?[0-9]+ *,)* *-?[0-9]+ *,?\r?\n)*(( *-?[0-9]+ *,)* *-?[0-9]+ *,?\r?\n?)$/
There, "~/" and last "/" are haxe regexp markers.
I wasnt extensively testing it, just a run on your samples, but it should do the job(probably with a bit of tweaking).
Also, just as a hint, I'd suggest you to split file into lines first before running any regexps, it will lower memmory usage(or you will need to hold only a part of your text in memory) and simplify your regexp.
I'd also note that since you will need to parse csv anyhow(for any properly formed input, which are prevailing in your data I guess), it might be much faster to do all the tests while actually parsing.
Edit: the answer to question "why it eats so much memory"
Well, it is not a short topic, and that's why I proposed to you to write your own regexp engine. There are some differences in implementations, but generally imagine regexp engine works like that:
parses your regular expression and builds a graph of all possible states(state is basically a symbol value and a number of links to other symbols which can follow it).
sets up a list of read pointer and state pointer pairs, current state list, consisting of regexp initial state and a pointer to matched string first letter
sets up read pointer to the first symbol of symbol string
sets up state poiter to initial state of regexp
takes up one pair from current state list and stores it as current state and current read pointer
reads symbol under current read pointer
matches it with symbols in states which current state have links to, and makes a list of states that matched.
if there is a final regexp state in this list, goes to 12
for each item in this list adds a pair of next read pointer(which is current+1) and item to the current state list
if the current state list is empty, returns false, as string didn't match the regexp
goes to 6
here it is, in a final state of matched regexp, returns true, string matches regexp.
Of course, there are some differences between regexp engines, and some of them eliminate some problems afaik. And of course they also have pseudosymbols, groupings, they need to store the positions regexp and groups matched, they have lookahead and lookbehind and also grouping references which makes it a bit(quite a humble measure) more complex and forces to use a bit more complex data structures, but the main idea is the same. So, here we are and your problem is clearly seen from algorithm. The less specific you are about what you want to match and the more there chances for engine to match the same substring as different paths in state graph, the more memory and processor time it will consume, exponentionally.
Try to model how regexp engine matches regexp (a+a+)+b on strings aaaaaab, ab, aa, aaaaaaaaaa, aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa (Don't try the last one, it would take hours or days to compute on a modern PC.)
Also, it worth to note that some regexp engines do things in a bit different way so they can handle this situations properly, but there always are ways to make regexp extremely slow.
And another thing to note is that I may hav ebeen wrong about the exact memory problem. This case it may be processor too, and before that it may be engine limits on memory/processor kicking in, not exactly system starving of memory.

When and why did the output of qr() change?

The output of perl's qr has changed, apparently sometime between versions 5.10.1 and 5.14.2, and the change is not documented--at least not fully.
To demonstrate the change, execute the following one-liner on each version:
perl -e 'print qr(foo)is."\n"'
Output from perl 5.10.1-17squeeze6 (Debian squeeze):
(?-xism:foo)
Output from perl 5.14.2-21+deb7u1 (Debian wheezy):
(?^:foo)
The perl documentation (perldoc perlop) says:
$rex = qr/my.STRING/is;
print $rex; # prints (?si-xm:my.STRING)
s/$rex/foo/;
which appears to no longer be true:
$ perl -e 'print qr/my.STRING/is."\n"'
(?^si:my.STRING)
I would like to know when this change occurred (which version of Perl, or supporting library or whatever).
Some background, in case it's relevant:
This change has caused a bunch of unit tests to fail. I need to decide if I should simply update the unit tests to reflect the new format, or make the tests dynamic enough to support both formats, etc. To make an informed decision, I would like to understand why the change took place. Knowing when and where it took place seems like the best place to start in that investigation.

It's documented in perl5140delta:
Regular Expressions
(?^...) construct signifies default modifiers
[...] Stringification of regular expressions now uses this notation. [...]
This change is likely to break code that compares stringified regular expressions with fixed strings containing ?-xism.
The function regexp_pattern can be used to parse the modifiers for normalisation purposes.

Part of the reason this was added, was that regular expressions were getting quite a few new modifiers.
Your example would actually produce something like this if that change didn't happen:
(?d-xismpaul:foo)
That also doesn't really express the modifiers in place.
d/u/l can only be added to a regex, not subtracted like i.
They are also mutually exclusive.
a/aa There are actually two levels for this modifier.
While work went underway adding these modifiers it was determined that this will break quite a few tests on CPAN modules.
Seeing as the tests were going to break anyway, it was agreed upon that there should be a way of specifying just use the defaults ((?^:…)).
That way, the tests wouldn't have to updated every time a new modifier was added.

To receive the stringified form of a regexp you can use Regexp::Parser and its qr method. Using this module you can not only test the representation of a regexp, but also walk a tree.

Creating a simple parser in (V)C++ (2010) similar to PEG

For an school project, I need to parse a text/source file containing a simplified "fake" programming language to build an AST. I've looked at boost::spirit, however since this is a group project and most seems reluctant to learn extra libraries, plus the lecturer/TA recommended leaning to create a simple one on C++. I thought of going that route. Is there some examples out there or ideas on how to start? I have a few attempts but not really successful yet ...
parsing line by line
Test each line with a bunch of regex (1 for procedure/function declaration), one for assignment, one for while etc...
But I will need to assume there are no multiple statements in one line: eg. a=b;x=1;
When I reach a container statement, procedures, whiles etc, I will increase the indent. So all nested statements will go under this
When I reach a } I will decrement indent
Any better ideas or suggestions? Example code I need to parse (very simplified here ...)
procedure Hello {
a = 1;
while a {
b = a + 1 + z;
}
}
Another idea was to read whole file into a string, and go top down. Match all procedures, then capture everything in { ... } then start matching statements (end with ;) or containers while { ... }. This is similar to how PEG does things? But I will need to read entire file

Multipass makes things easier. On a first pass, split things into tokens, like "=", or "abababa", or a quote-delimited string, or a block of whitespace. Don't be destructive (keep the original data), but break things down to simple chunks, and maybe have a little struct or enum that describes what the token is (ie, whitespace, a string literal, an identifier type thing, etc).
So your sample code gets turned into:
identifier(procedure) whitespace( ) identifier(Hello) whitespace( ) operation({) whitespace(\n\t) identifier(a) whitespace( ) operation(=) whitespace( ) number(1) operation(;) whitespace(\n\t) etc.
In those tokens, you might also want to store line number and offset on the line (this will help with error message generation later).
A quick test would be to turn this back into the original text. Another quick test might be to dump out pretty-printed version in html or something (where you color whitespace to have a pink background, identifiers as light blue, operations as light green, numbers as light orange), and see if your tokenizer is making sense.
Now, your language may be whitespace insensitive. So discard the whitespace if that is the case! (C++ isn't, because you need newlines to learn when // comments end)
(Note: a professional language parser will be as close to one-pass as possible, because it is faster. But you are a student, and your goal should be to get it to work.)
So now you have a stream of such tokens. There are a bunch of approaches at this point. You could pull out some serious parsing chops and build a CFG to parse them. (Do you know what a CFG is? LR(1)? LL(1)?)
An easier method might be to do it a bit more ad-hoc. Look for operator({) and find the matching operator(}) by counting up and down. Look for language keywords (like procedure), which then expects a name (the next token), then a block (a {). An ad-hoc parser for a really simple language may work fine.
I've done exactly this for a ridiculously simple language, where the parser consisted of a really simple PDA. It might work for you guys. Or it might not.

Since you mentioned PEG i'll like to throw in my open source project : https://github.com/leblancmeneses/NPEG/tree/master/Languages/npeg_c++
Here is a visual tool that can export C++ version: http://www.robusthaven.com/blog/parsing-expression-grammar/npeg-language-workbench
Documentation for rule grammar: http://www.robusthaven.com/blog/parsing-expression-grammar/npeg-dsl-documentation
If i was writing my own language I would probably look at the terminals/non-terminals found in System.Linq.Expressions as these would be a great start for your grammar rules.
http://msdn.microsoft.com/en-us/library/system.linq.expressions.aspx
System.Linq.Expressions.Expression
System.Linq.Expressions.BinaryExpression
System.Linq.Expressions.BlockExpression
System.Linq.Expressions.ConditionalExpression
System.Linq.Expressions.ConstantExpression
System.Linq.Expressions.DebugInfoExpression
System.Linq.Expressions.DefaultExpression
System.Linq.Expressions.DynamicExpression
System.Linq.Expressions.GotoExpression
System.Linq.Expressions.IndexExpression
System.Linq.Expressions.InvocationExpression
System.Linq.Expressions.LabelExpression
System.Linq.Expressions.LambdaExpression
System.Linq.Expressions.ListInitExpression
System.Linq.Expressions.LoopExpression
System.Linq.Expressions.MemberExpression
System.Linq.Expressions.MemberInitExpression
System.Linq.Expressions.MethodCallExpression
System.Linq.Expressions.NewArrayExpression
System.Linq.Expressions.NewExpression
System.Linq.Expressions.ParameterExpression
System.Linq.Expressions.RuntimeVariablesExpression
System.Linq.Expressions.SwitchExpression
System.Linq.Expressions.TryExpression
System.Linq.Expressions.TypeBinaryExpression
System.Linq.Expressions.UnaryExpression

Global substitution for latex commands in vim

I am writing a long document and I am frequently formatting some terms to italics. After some time I realized that maybe that is now what I want so I would like to remove all the latex commands that format text to italics.
Example:
\textit{Vim} is undoubtedly one of the best editors ever made. \textit{LaTeX} is an extremely powerful, intelligent typesetter. \textbd{Vim-LaTeX} aims at bringing together the best of both these worlds
How can I run a substitution command that recognizes all the instances of \textit{whatever} and changes them to just whatever without affecting different commands such as \textbd{Vim-LaTeX} in this example?
EDIT: As technically the answer that helps is the one from Igor I will mark that one as the correct one. Nevertheless, Konrad's answer should be taken into account as it shows the proper Latex strategy to follow.

You shouldn’t use formatting commands at all in your text.
LaTeX is built around the idea of semantic markup. So instead of saying “this text should be italic” you should mark up the text using its function. For instance:
\product{Vim} is undoubtedly one of the best editors ever made. \product{LaTeX}
is an extremely powerful, intelligent typesetter. \product{Vim-LaTeX} aims at
bringing together the best of both these worlds
… and then, in your preamble, a package, or a document class, you (re-)define a macro \product to set the formatting you want. That way, you can adapt the macro whenever you deem necessary without having to change the code.
Or, if you want to remove the formatting completely, just make the macro display its bare argument:
\newcommand*\product[1]{#1}

Use this substitution command:
% s/\\textit{\([^}]*\)}/\1/
If textit can span muptiple lines:
%! perl -e 'local $/; $_=<>; s/\\textit{([^}]*)}/$1/g; print;'
And you can do this without perl also:
%s/\\textit{\(\_.\{-}\)}/\1/g
Here:
\_. -- any symbol including a newline character
\{-} -- make * non-greedy.

internal code-completion in vim

There's a completion type that isn't listed in the vim help files (notably: insert.txt), but which I instinctively feel the need for rather often. Let's say I have the words "Awesome" and "SuperCrazyAwesome" in my file. I find an instance of Awesome that should really be SuperCrazyAwesome, so I hop to the beginning of the word, enter insert mode, and then must type "SuperCrazy".
I feel I should be able to type "S", creating "SCrazy", and then simply hit a completion hotkey or two to have it find what's to the left of the cursor ("S"), what's to the right ("Crazy"), regex this against all words in the file ("/S\w*Crazy/"), and provide me with a completion popup menu of choices, or just do the replace if there's only one match.
I'd like to use the actual completion system for this. There exists a "user defined" completion which uses a function, and has a good example in the helps for replacing from a given list. However, I can't seem to track down many particulars that I'd need to make this happen, including:
How do I get a list of all words in the file from a vim function?
Can I list words from all buffers (with filenames), as vim's complete does?
How do I, in insert mode, get the text in the word before/after the cursor?
Can completion replace the entire word, and not just up to the cursor?
I've been at this for a couple of hours now. I keep hitting dead ends, like this one, which introduced me to \%# for matching with the cursor position, which doesn't seem to work for me. For instance, a search for \w*\%# returns only the first character of the word I'm on, regardless of where I'm in it. The \%# doesn't seem to anchor.

Although its not exactly following your desired method in the past I've written https://github.com/mjbrownie/swapit which might perform your task if you are looking for related keywords. It would fall down in this scenario if you have hundreds of matches.
It's mainly useful for 2-10 possible sequenced matches.
You would define a list
:SwapList awesomes Awesome MoreAwesome SuperCrazyAwesome FullyCompletelyAwesome UnbelievablyAwesome
and move through the matches with the incrementor decrementor keys (c+a) (c+x)
There are also a few other cycling type plugins like swap words that I know of on vim.org and github.
The advantage here is you don't have to group words together with regex.

I wrote something like that years ago when working with 3rd party libraries with rather long CamelCasePrefixes in every function different for each component. But it was in Before Git Hub era and I considered it a lost jewel, but search engine says I am not a complete ass and posted it to Vim wiki.
Here it is: http://vim.wikia.com/wiki/Custom_keyword_completion
Just do not ask me what 'MKw' means. No idea.
This will need some adaptation to your needs, as it is looking up only the word up to the cursor, but the idea is there. It works for current buffer only. Iterating through all buffers would be sluggish as it is not creating any index. For those purposes I would go with external grep.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js