Regular Expression - how to find text within particular if blocks? - regex

I'm new to regular expressions and would like to use one to search through our source control to find text within a block of code that follows a particular enum value. I.e.:
/(\/{2}\#debug)(.|\s)*?(\/{2}\#end-debug).*/
var junk = dontWantThis if (junk) {dont want this} if ( **myEnumValue** ) **{ var yes = iWantToFindThis if (true) { var yes2 = iWantThisToo } }**
var junk2 = dontWantThis if (junk) {dont want this}
var stuff = dontWantThis if (junk) {dont want this} if ( enumValue ) { wantToFindThis }
var stuff = iDontWantThis if (junk) {iDontWantThisEither}
I know I can use (\{(/?[^\>]+)\}) to find if blocks, but I only want the first encompassing block of code that follows the enum value I'm looking for. I've also noticed that (\{(/?[^\>]+)\}) gives me everything from the first { to the last }; it doesn't group the nested {} pairs.
Thank you!
Tim

Regexps simply can't handle this kind of stuff. For this you'll need a parser and scanner.

As others hint at, it's mathematically impossible to do this with regular expressions (at least in general; you might be able to get it to work if you have highly specialized cases). Try using a combination of lex and awk to get the desired results if you want to stick with standard Unix tools, or just go to Perl, Python, Ruby, etc. and build up the lexical parsing you need.

While nesting is a problem, you could use backtracking and lookahead to effectively count your matching braces or quotes. This is not strictly part of regular expressions, but it has been added to many regex libraries, such as the ones in .NET, Perl, and Java, and probably more. I wouldn't recommend that you go this route, as you should find it easier to lexically parse this. But if you do try this as a quick fix, be sure to collect a few test cases and run them through RegexBuddy or Expresso.
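The scanning approach recommended in these answers can be sketched in a few lines of Python. This is only a minimal sketch: the function name `block_after_condition` and the sample text are made up for illustration, and it assumes braces never appear inside string literals or comments. A regex finds the `if (...)` header, and a simple depth counter then finds the matching closing brace.

```python
import re


def block_after_condition(source, condition):
    """Return the first brace-delimited block following `if (condition)`, or None."""
    # Locate the `if ( condition ) {` header; whitespace around the condition is optional.
    header = re.search(r"if\s*\(\s*" + re.escape(condition) + r"\s*\)\s*\{", source)
    if header is None:
        return None
    start = header.end() - 1  # index of the opening brace
    depth = 0
    for i in range(start, len(source)):
        if source[i] == "{":
            depth += 1
        elif source[i] == "}":
            depth -= 1
            if depth == 0:  # the brace that closes the outermost block
                return source[start:i + 1]
    return None  # braces were unbalanced


sample = (
    "var junk = dontWantThis if (junk) {dont want this} "
    "if ( myEnumValue ) { var yes = iWantToFindThis "
    "if (true) { var yes2 = iWantThisToo } } "
    "var junk2 = dontWantThis if (junk) {dont want this}"
)
print(block_after_condition(sample, "myEnumValue"))
```

The counter is what a plain regular expression cannot express: it remembers how many blocks are currently open.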

Related

HTML tokenizer algorithm

I'm trying to write a basic HTML parser which doesn't tolerate errors. I was reading the HTML5 parsing algorithm, but it's just too much information for a simple parser. I was wondering if someone had an idea of the logic for a basic tokenizer which would simply turn a small piece of HTML into a list of significant tokens. I'm more interested in the logic than the code.
std::string html = "<div id='test'> Hello <span>World</span></div>";
Tokenizer t;
t.tokenize(html);
So for the above html, I want to convert it to a list of something like this:
["<", "div", "id", "=", "test", ">", "Hello", "<", "span", ">", "World", "</", "span", ">", "</", "div", ">"]
I don't have anything for the tokenize method but was wondering if iterating over the html character by character is the best way to build the list..
void Tokenizer::tokenize(std::string html){
    std::list<std::string> tokens;
    for(int i = 0; i < html.length();i++){
        char c = html[i];
        if(...){
            ...
        }
    }
}
I think what you are looking for is a lexical analyzer. Its goal is to produce all the tokens that are defined in your language, which in this case is HTML. As @IraBaxter said, you can use a lexer tool like Lex, which is available on Linux and OS X; but you must define the rules and, for this, you need to use regular expressions.
But if you want to know about an algorithm for this, see chapter 2, "Scanners", of the book by Keith D. Cooper & Linda Torczon (Engineering a Compiler). The chapter covers automata and how they can be used to build a scanner, including a table-driven scanner that produces tokens the way you want.
The idea is that you define a DFA where you have:
A finite set of states in the recognizer, including a start state, accepting states and an error state.
An alphabet.
A function which determines whether a transition is valid or not, using the table of transitions or, if you don't want to use a table, by coding the automaton directly.
Take some time to study this chapter.
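The DFA ingredients listed above can be sketched as a small table-driven scanner in Python. Everything here is an illustrative assumption rather than the book's code: the state names, the `char_class` alphabet, and the decision to emit quote characters as their own tokens. A missing table entry plays the role of the error state, and the scanner always takes the longest lexeme that ended in an accepting state.

```python
def char_class(c):
    """Map a character to its alphabet symbol (a column of the transition table)."""
    if c.isalnum() or c == "_":
        return "word"
    if c.isspace():
        return "space"
    return {"<": "lt", ">": "gt", "/": "slash", "=": "eq", "'": "quote"}.get(c, "other")


# (state, symbol) -> next state; a missing entry means "error state".
TRANSITIONS = {
    ("START", "lt"): "LT",
    ("LT", "slash"): "LT_SLASH",
    ("START", "gt"): "GT",
    ("START", "eq"): "EQ",
    ("START", "quote"): "QUOTE",
    ("START", "word"): "WORD",
    ("WORD", "word"): "WORD",
    ("START", "space"): "WS",
    ("WS", "space"): "WS",
}
ACCEPTING = {"LT", "LT_SLASH", "GT", "EQ", "QUOTE", "WORD", "WS"}


def tokenize(text):
    tokens = []
    pos = 0
    while pos < len(text):
        state = "START"
        last_accept = None  # end offset of the longest accepted lexeme so far
        i = pos
        while i < len(text):
            state = TRANSITIONS.get((state, char_class(text[i])))
            if state is None:  # no valid transition: stop scanning this token
                break
            i += 1
            if state in ACCEPTING:
                last_accept = i
        if last_accept is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        lexeme = text[pos:last_accept]
        if not lexeme.isspace():  # whitespace is recognized but not reported
            tokens.append(lexeme)
        pos = last_accept
    return tokens


print(tokenize("<div id='test'> Hello <span>World</span></div>"))
```

Adding a new token type means adding rows to the table, not new control flow, which is the main appeal of the table-driven design.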
The other answers here are great, and you should definitely use a lexical-analyzer-generator like flex for the job. The input to such a generator is a list of rules that identify the different token types. An input file might look like this:
WHITE_SPACE [ \t\r\n]+
IDENTIFIER [a-zA-Z0-9_]+
LEFT_ANGLE <
The algorithm that flex uses is essentially:
Find the rule that matches the most text.
If two rules match the same length of text, choose the one that occurs earlier in the list of rules provided.
You could write this algorithm quite easily yourself using regular expressions. However, do remember that this will not be as fast as flex, since flex compiles the regular expressions away into a very fast DFA.
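A minimal Python sketch of that do-it-yourself version follows. The rule names WHITE_SPACE, IDENTIFIER and LEFT_ANGLE come from the example above; the DIV_KEYWORD rule is a hypothetical addition, included only to show the tie-break between two rules that match the same length of text.

```python
import re

# Order matters: on a tie, the rule listed first wins.
RULES = [
    ("WHITE_SPACE", re.compile(r"\s+")),
    ("DIV_KEYWORD", re.compile(r"div")),  # hypothetical rule, for the tie-break demo
    ("IDENTIFIER", re.compile(r"[a-zA-Z0-9_]+")),
    ("LEFT_ANGLE", re.compile(r"<")),
]


def lex(text):
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so an earlier rule keeps priority when lengths are equal.
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name != "WHITE_SPACE":
            tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens


print(lex("<div divider"))
```

Here `div` is claimed by DIV_KEYWORD because it comes first, while `divider` goes to IDENTIFIER because that rule matches more text.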

Parsing Javascript with Python

In one of my scripts I use urllib2 and BeautifulSoup to parse an HTML page and read a <script> tag.
This is what I get :
<script>
var x_data = {
logged: logged,
lengthcarrousel: 2,
products : [
{
"serial" : "106541823"
...
</script>
My goal is to read the JSON in the x_data variable and I do not know how to do it properly.
I thought of:
Convert to string and remove the first chars to the { and same for last }
Use Regular Expression with something like '{.*}' and take the first group
Something else ?
I don't know if these are efficient or if there are other, nicer ways to do it.
Do you think one method is preferable to the other? Is there any method I may not be aware of?
Thank you in advance for any advice.
EDIT :
Following advice I get the Regexp solution but I can't search in multiple lines despite using re.MULTILINE :
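import re

# Triple quotes are required for a string literal that spans several lines.
string1 = '''<script>
var x_data = {
logged: logged,
lengthcarrousel: 2,
products : [
{
"serial" : "106541823"}
]
};
</script>'''
# re.MULTILINE only changes how ^ and $ behave; "." still stops at newlines.
p = re.compile(r'\{.*\};', re.MULTILINE)
m = p.search(string1)
if m:
    print(m.group(0))
else:
    print("Error !")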
I always got an "Error !".
EDIT2 :
Works well with re.DOTALL.
I think these methods are essentially the same in terms of elegance and performance (using {.*} may be slightly better because .* is greedy, i.e. there will be almost no backtracking, and because it seems to me more "forgiving" for different JS code formatting nuances). What you may be more interested in is this: https://docs.python.org/3.6/library/json.html.
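As a rough sketch of combining the regex approach with the json module: the object assigned to x_data is not valid JSON as a whole (`logged: logged` is a JavaScript expression with an unquoted key), so this example only pulls out the `products` array, which is. The sample page is the one from the question's EDIT; the pattern is an assumption about the layout and will break if the array contains a nested `]` followed by `}`.

```python
import json
import re

page = """<script>
var x_data = {
logged: logged,
lengthcarrousel: 2,
products : [
{
"serial" : "106541823"}
]
};
</script>"""

# re.DOTALL lets "." match newlines, so the array may span several lines.
# The lazy ".*?" stops at the first "]" that is followed by the closing "}".
match = re.search(r"products\s*:\s*(\[.*?\])\s*\}", page, re.DOTALL)
products = json.loads(match.group(1)) if match else None
print(products)
```

Once the text is parsed with `json.loads`, the rest of the script can work with ordinary lists and dictionaries instead of string slicing.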
If it always looks exactly like this, then you can hack a solution like the one you proposed, based on it looking exactly like this.
Because programmers do everything in code, I suspect that in practice it will not always look exactly like this, and then any hacky solution will be fragile and will fail at unexpected (read "impossibly inconvenient") moments. (Regex is known to be hacky when it comes to parsing code).
If you want to do this right, you will need to get a real JavaScript parser, apply it to the code fragment defined by the script tag content, to produce an AST, then search the AST for JavaScript nested structures that happen to look like JSON, and take the content of that tree, prettyprinted.
Even this will be fragile in the face of a programmer who assembles the JSON fragment using JavaScript assignment statements. You can handle this by computing data flow, and discovering sets of code that happen to assemble JSON code. This is rather a lot of work.
So you get to decide what the limits on your solution will be, and then accept the consequences when somebody you don't control does something random.

Conditional RegExp Replace - if reference is found, then write something else

Two cases
1. Key<A, M> desc = newKey();
2. Property<B, N> type = newKey("type", B.bar);
The RegExp and replace
find: (?:Key|Property)<(.*), (.*)> (.*) = newKey\((.*)\);
rep.: Foo<C$1, $2> $3 = pl.nP("$3", $2.class); // ($4)
The Result
1. Foo<CA, M> desc = pl.nP("desc", M.class); //
2. Foo<CB, N> type = pl.nP("type", N.class); // ("type", B.bar)
The Problem:
Now I want to avoid the empty comment on line 1.
Is there a way to write the $4 and the stuff around it only if $4 isn't empty?
You could remove empty comments afterwards with another regular expression.
EDIT
Another solution would be to deal with the special case separately (... = newKey\(\)).
Perhaps you could automate this process with a simple script, if the tedium of repetitive typing becomes too great (e.g. when dealing with multiple conditionals).
As far as I know, there isn't any 'intelligence' built into the replace field in Sublime Text; all you can do is to assemble the captured pieces to your liking.
While skimming through a few Google search results yesterday, I found an article about conditional patterns in Perl, but nothing pertaining to the problem at hand.
For the sake of full disclosure, I should say that I am in no sense an expert in the field, so I could be wrong. I do however have some experience with the Python API for Sublime Text. It might be possible to implement this functionality yourself, if it doesn't already exist within the plethora of extensions available.
I'm sorry if this sounds like a very long-winded 'uh uh', but I'll be on the lookout for a general solution.
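The "simple script" route mentioned above could look like the following Python sketch. It is not a Sublime Text feature: it relies on `re.sub` accepting a function as the replacement, which lets the comment be appended only when the fourth group is non-empty. The pattern is the one from the question; the function name `rewrite` is made up for illustration.

```python
import re

PATTERN = re.compile(r"(?:Key|Property)<(.*), (.*)> (.*) = newKey\((.*)\);")


def rewrite(match):
    """Build the replacement text, adding the comment only when $4 is non-empty."""
    first, second, name, args = match.groups()
    result = f'Foo<C{first}, {second}> {name} = pl.nP("{name}", {second}.class);'
    if args:
        result += f" // ({args})"
    return result


lines = [
    "Key<A, M> desc = newKey();",
    'Property<B, N> type = newKey("type", B.bar);',
]
for line in lines:
    print(PATTERN.sub(rewrite, line))
```

The first line comes out without a trailing comment, and the second keeps its original arguments in the comment.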

Measure the "matching"?

Is there a mechanism to measure or compare how tightly a pattern corresponds to a given string? By pattern I mean a regex or something similar. For example, we have the string "foobar" and two regexes: "fooba." and ".*". Both patterns match the string. Is it possible to determine that "fooba." is a more appropriate pattern for the given string than ".*"?
There are metrics and heuristics for string 'distance'. Check this for example http://en.wikipedia.org/wiki/Edit_distance
Here is one random Java implementation that came with Google search.
http://www.merriampark.com/ldjava.htm
Some metrics are expensive to compute so look around and find one that fits your needs.
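For reference, here is a minimal Python sketch of the Levenshtein edit distance, the best-known metric of that kind. Note that it compares two strings, not a string and a pattern, so using it for this question assumes you first turn each pattern into a representative string somehow.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions that turn a into b."""
    # previous[j] holds the distance between the current prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


print(edit_distance("foobar", "foobaz"))
print(edit_distance("kitten", "sitting"))
```

This version runs in time proportional to the product of the two lengths, which is one reason some of these metrics get expensive on long inputs.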
As for your specific example, IIRC, regex alternation in Java tries the alternatives in order from left to right, so if you use something like "(foobar)|(.*)", it will match the first one, and you can determine this by examining the results returned for the two capture groups.
How about this for an idea: Use the length of your regular expression: length("fooba.") > length(".*"), so "fooba." is more specific...
However, it depends on where the regular expressions come from and how precise you need to be as "fo.*|.*ba" would be longer than "fooba.", so the solution will not always work.
What you're asking for isn't really a property of regular expressions.
Create an enum that measures "closeness", and create a class that will hold a given regex, and a closeness value. This requires you to determine which regex is considered "more close" than another.
Instantiate your various classes, and let them loose on your code, and compare the matched objects, letting the "most closeness" one rise to the top.
pseudo-code, without actually comparing anything, or resembling any sane language:
enum Closeness
    Exact
    PrettyClose
    Decent
    NotSoClose
    WayOff
    CouldBeAnything
mune
class RegexCloser
    property Closeness Close()
    property String Regex()
ssalc
var foo = new RegexCloser(Closeness := Exact, Regex := "foobar")
var bar = new RegexCloser(Closeness := CouldBeAnything, Regex := ".*")
var target = "foobar";
if Regex.Match(target, foo)
    print String.Format("foo {0}", foo.Closeness)
fi
if Regex.Match(target, bar)
    print String.Format("bar {0}", bar.Closeness)
fi
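The same idea can be written as a runnable Python sketch. The names mirror the pseudo-code but are otherwise arbitrary, and the closeness ranks are assigned by hand, exactly as the answer describes: nothing here measures closeness automatically.

```python
import re
from dataclasses import dataclass
from enum import IntEnum


class Closeness(IntEnum):
    # Lower values mean a tighter pattern.
    EXACT = 0
    PRETTY_CLOSE = 1
    DECENT = 2
    NOT_SO_CLOSE = 3
    WAY_OFF = 4
    COULD_BE_ANYTHING = 5


@dataclass
class RegexCloser:
    regex: str
    closeness: Closeness  # assigned by whoever wrote the pattern


candidates = [
    RegexCloser(r".*", Closeness.COULD_BE_ANYTHING),
    RegexCloser(r"fooba.", Closeness.PRETTY_CLOSE),
    RegexCloser(r"foobar", Closeness.EXACT),
]

target = "foobar"
# Keep only the patterns that match the whole string, then pick the tightest one.
matching = [c for c in candidates if re.fullmatch(c.regex, target)]
best = min(matching, key=lambda c: c.closeness)
print(best.regex, best.closeness.name)
```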

Why is this regular expression faster?

I'm writing a Telnet client of sorts in C# and part of what I have to parse are ANSI/VT100 escape sequences, specifically, just those used for colour and formatting (detailed here).
One method I have is one to find all the codes and remove them, so I can render the text without any formatting if needed:
public static string StripStringFormating(string formattedString)
{
    if (rTest.IsMatch(formattedString))
        return rTest.Replace(formattedString, string.Empty);
    else
        return formattedString;
}
I'm new to regular expressions and I was suggested to use this:
static Regex rTest = new Regex(@"\e\[[\d;]+m", RegexOptions.Compiled);
However, this failed if the escape code was incomplete due to an error on the server. So then this was suggested, but my friend warned it might be slower (this one also matches another condition (z) that I might come across later):
static Regex rTest =
    new Regex(@"(\e(\[([\d;]*[mz]?))?)?", RegexOptions.Compiled);
This not only worked, but was in fact faster and reduced the impact on my text rendering. Can someone explain to a regexp newbie why? :)
Do you really want to run the regexp twice? Without having checked (bad me) I would have thought that this would work well:
public static string StripStringFormating(string formattedString)
{
    return rTest.Replace(formattedString, string.Empty);
}
If it does, you should see it run ~twice as fast...
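The single-pass idea is easy to check in Python as well. This is a sketch, not a port of the C# code: Python's re module has no `\e` escape, so the escape character is written as `\x1b`, and the pattern only covers colour and formatting sequences that end in `m`. A substitution with no matches simply returns the input unchanged, so no separate match test is needed.

```python
import re

# ESC [ followed by optional digits/semicolons and a final "m".
ANSI_SGR = re.compile(r"\x1b\[[\d;]*m")


def strip_formatting(text):
    """Remove colour/formatting escape sequences in one pass."""
    return ANSI_SGR.sub("", text)


print(strip_formatting("\x1b[1;32mThis is bright green.\x1b[0m This is the default color."))
print(strip_formatting("plain text"))
```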
The reason why #1 is slower is that [\d;]+ is a greedy quantifier. Using +? or *? makes the quantifier lazy. See MSDN - Quantifiers for more info.
You may want to try:
"(\e\[(\d{1,2};)*?[mz]?)?"
That may be faster for you.
I'm not sure if this will help with what you are working on, but long ago I wrote a regular expression to parse ANSI graphic files.
(?s)(?:\e\[(?:(\d+);?)*([A-Za-z])(.*?))(?=\e\[|\z)
It will return each code and the text associated with it.
Input string:
<ESC>[1;32mThis is bright green.<ESC>[0m This is the default color.
Results:
[ [1, 32], m, This is bright green.]
[0, m, This is the default color.]
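A comparable parser can be sketched in Python, with two adjustments that are assumptions of this sketch rather than part of the original expression: the escape character is written as `\x1b`, and the numeric parameters are captured as one group and split afterwards, because Python keeps only the last value of a repeated capture group.

```python
import re

# ESC [ params letter, then the text up to the next escape sequence or end of input.
SEQUENCE = re.compile(r"\x1b\[([\d;]*)([A-Za-z])(.*?)(?=\x1b\[|\Z)", re.DOTALL)


def parse_ansi(text):
    """Return a list of (codes, command letter, following text) tuples."""
    result = []
    for m in SEQUENCE.finditer(text):
        codes = [int(c) for c in m.group(1).split(";") if c]
        result.append((codes, m.group(2), m.group(3)))
    return result


data = "\x1b[1;32mThis is bright green.\x1b[0m This is the default color."
for entry in parse_ansi(data):
    print(entry)
```

Text that appears before the first escape sequence is skipped by this pattern, so it would need separate handling if it matters.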
Without doing detailed analysis, I'd guess that it's faster because of the question marks. These allow the regular expression to be "lazy," and stop as soon as they have enough to match, rather than checking if the rest of the input matches.
I'm not entirely happy with this answer though, because this mostly applies to question marks after * or +. If I were more familiar with the input, it might make more sense to me.
(Also, for the code formatting, you can select all of your code and press Ctrl+K to have it add the four spaces required.)