I'm very interested in Syntax Definitions for Sblime text 2
I've studied the basics but I don't know how to write RE (and rules) for smth like
variable = sentense, i.e. myvar = func(foo, bar) + baz
I can't write anything better than ^\s*([^=\n]+)=([^=\n]+\n) (that doesn't work)
How to write this RE in proper way?
Also, i have some difficulties in defining RE for block
IF i FROM .. TO ..
...
ELSE
...
END IF
Hoe to write it?
In this case you have to write a parser. A regex won't work because the patterns may vary. You've already noticed it when you stated 'variable = sentence'.
For this, you can use spoofax or javacup for grammar definitions. I'll give you a snip in JavaCup:
Scanner issues: suppose 'variable' follows the pattern: (_|[a-zA-Z])(_|[a-zA-Z])*
and 'number' is: ([0-9])+
Note that number could be any decimal or int, but here I state it as that pattern, supposing my language only deals with integer (or whatever that pattern means :) ).
Now we can declare our grammar following the JavaCUP syntax. Which is more or less like:
expression ::= variable "=" sentence
sentence ::= sentence "+" sentence;
sentence ::= sentence "-" sentence;
sentence ::= sentence "*" sentence;
sentence ::= sentence "/" sentence;
sentence ::= number;
...and that goes further.
If you've never had any compiler's class, it may seems very difficult to see. Plus there is a lot of grammar's restrictions to avoiding infinity loop in the parser, depending on which you're using (RL or LL).
Anyway, the real answer for your question is: you can't do what you want only with regex, i'll need more concepts.
Related
I would be grateful if someone could explain how the following regex should be interpreted; it is from the W3C reference for Namespaces in XML 1.0, and defines an NCName ([4]) as:
Name - (Char* ':' Char*) /* An XML Name, minus the ":" */
I can understand subtraction when applied to lists, such as:
[a-z-[aeiuo]] representing the list of all consonants (see http://www.regular-expressions.info/charclasssubtract.html), but not when applied to a group (apologies if this is the wrong term) as shown above.
The comment indicates how I should interpret the regex, but I'm struggling; why not just:
Name - ( ':' )
if the intention is for NCName to be Name minus ':', then why are the zero or more characters required on either side (I'm not asking a separate question, just indicating my area of confusion)?
Please accept my thanks in advance.
The documents published by W3C use a variant of the EBNF Notation to describe the languages standardized by them.
It is described in section "6 Notation" of the XML Recommendation.
The example you posted:
NCName ::= Name - (Char* ':' Char*) /* An XML Name, minus the ":" */
How to read it:
NCName is the object described by the rule;
::= separates the name of the described object (on the left) by the expression that describes it (on the right);
Name is an object already described by another rule;
- is the except symbol; A - B in EBNF means "matches A but doesn't match B";
(...) - the parentheses create a group; they make the expression inside them behave as a single item;
Char is another object already described by another rule in the documentation; it basically means a Unicode character;
* - repetition, matches the previous item zero or more times;
':' - string in single or double quotes is a string literal; it represents itself; here, the colon character;
Put together, it means a NCName is a Name that doesn't contain :.
The comment seems incorrect (or maybe it is just bad worded).
I'm trying to determine whether a term appears in a string.
Before and after the term must appear a space, and a standard suffix is also allowed.
Example:
term: google
string: "I love google!!! "
result: found
term: dog
string: "I love dogs "
result: found
I'm trying the following code:
regexPart1 = "\s"
regexPart2 = "(?:s|'s|!+|,|.|;|:|\(|\)|\"|\?+)?\s"
p = re.compile(regexPart1 + term + regexPart2 , re.IGNORECASE)
and get the error:
raise error("multiple repeat")
sre_constants.error: multiple repeat
Update
Real code that fails:
term = 'lg incite" OR author:"http++www.dealitem.com" OR "for sale'
regexPart1 = r"\s"
regexPart2 = r"(?:s|'s|!+|,|.|;|:|\(|\)|\"|\?+)?\s"
p = re.compile(regexPart1 + term + regexPart2 , re.IGNORECASE)
On the other hand, the following term passes smoothly (+ instead of ++)
term = 'lg incite" OR author:"http+www.dealitem.com" OR "for sale'
The problem is that, in a non-raw string, \" is ".
You get lucky with all of your other unescaped backslashes—\s is the same as \\s, not s; \( is the same as \\(, not (, and so on. But you should never rely on getting lucky, or assuming that you know the whole list of Python escape sequences by heart.
Either print out your string and escape the backslashes that get lost (bad), escape all of your backslashes (OK), or just use raw strings in the first place (best).
That being said, your regexp as posted won't match some expressions that it should, but it will never raise that "multiple repeat" error. Clearly, your actual code is different from the code you've shown us, and it's impossible to debug code we can't see.
Now that you've shown a real reproducible test case, that's a separate problem.
You're searching for terms that may have special regexp characters in them, like this:
term = 'lg incite" OR author:"http++www.dealitem.com" OR "for sale'
That p++ in the middle of a regexp means "1 or more of 1 or more of the letter p" (in the others, the same as "1 or more of the letter p") in some regexp languages, "always fail" in others, and "raise an exception" in others. Python's re falls into the last group. In fact, you can test this in isolation:
>>> re.compile('p++')
error: multiple repeat
If you want to put random strings into a regexp, you need to call re.escape on them.
One more problem (thanks to Ωmega):
. in a regexp means "any character". So, ,|.|;|:" (I've just extracted a short fragment of your longer alternation chain) means "a comma, or any character, or a semicolon, or a colon"… which is the same as "any character". You probably wanted to escape the ..
Putting all three fixes together:
term = 'lg incite" OR author:"http++www.dealitem.com" OR "for sale'
regexPart1 = r"\s"
regexPart2 = r"(?:s|'s|!+|,|\.|;|:|\(|\)|\"|\?+)?\s"
p = re.compile(regexPart1 + re.escape(term) + regexPart2 , re.IGNORECASE)
As Ωmega also pointed out in a comment, you don't need to use a chain of alternations if they're all one character long; a character class will do just as well, more concisely and more readably.
And I'm sure there are other ways this could be improved.
The other answer is great, but I would like to point out that using regular expressions to find strings in other strings is not the best way to go about it. In python simply write:
if term in string:
#do whatever
i have an example_str = "i love you c++" when using regex get error multiple repeat Error. The error I'm getting here is because the string contains "++" which is equivalent to the special characters used in the regex. my fix was to use re.escape(example_str ), here is my code.
example_str = "i love you c++"
regex_word = re.search(rf'\b{re.escape(word_filter)}\b', word_en)
Also make sure that your arguments are in the correct order!
I was trying to run a regular expression on some html code. I kept getting the multiple repeat error, even with very simple patterns of just a few letters.
Turns out I had the pattern and the html mixed up. I tried re.findall(html, pattern) instead of re.findall(pattern, html).
A general solution to "multiple repeat" is using re.escape to match the literal pattern.
Example:
>>>> re.compile(re.escape("c++"))
re.compile('c\\+\\+')
However if you want to match a literal word with space before and after try out this example:
>>>> re.findall(rf"\s{re.escape('c++')}\s", "i love c++ you c++")
[' c++ ']
basically have two questions.
1. Is there a c++ library that would do full text boolean search just like in mysql. E.g.,
Let's say I have:
string text = "this is my phrase keywords test with boolean query.";
string booleanQuery = "\"my phrase\" boolean -test -\"keywords test\" OR ";
booleanQuery += "\"boolean search\" -mysql -sql -java -php"b
//where quotes ("") contain phrases, (-) is NOT keyword and OR is logical OR.
If answer to first is no, then;
2. Is it possible to search a phrase in text. e.g.,
string text =//same as previous
string keyword = "\"my phrase\"";
//here what's the best way to search for my phrase in the text?
TR1 has a regex class (derived from Boost::regex). It's not quite like you've used above, but reasonably close. Boost::phoenix and Boost::Spirit also provide similar capabilities, but for a first attempt the Boost/TR1 regex class is probably a better choice.
As to the 2nd point: string class does have a method find, see http://www.cppreference.com/wiki/string/find
Sure there is, try Spirit:
http://boost-spirit.com/home/
my string style like this
expression1/field1+expression2*expression3+expression4/field2*expression5*expression6/field3
a real style mybe like this:
computer/(100)+web*mail+explorer/(200)*bbs*solution/(300)
"+" and "*" represent operator
"computer","web"...represent expression
(100),(200) represent field num . field num may not exist.
I want process the string to this:
<computer>/(100)+web*<mail>+explorer/(200)*bbs*<solution>/(300)
rules like this
if expression length is more than 3 and its field is not (200), then add brackets to it.
My recommendation is to mix regex with other language features. The complication arises from the fact that field appears before expressions, and lookbehind is usually more limited than lookforward.
In pseudo-Java-code, I recommend doing something like this:
String[] parts = input.split("/");
for (int i = 0; i < parts.length; i++) {
if (!parts[i].startsWith("(200)"))
parts[i] = parts[i].replaceAll("(?=[a-z]{4})([a-z]+)", "<$1>");
}
String output = parts.join("/");
I would not use just regular expression.
You say "if expression length is more than 3 and its field is not (200), then add brackets to it"
I think a normal conditional statement is the best and clearest solutoion for this.
I think regular expressions are sometimes overused. Regexes are hard to read, and when a couple of conditional statements can do the same but more clearly, then I'd say the code quality is higher.
My situation: I'm new to Spirit, I have to use VC6 and am thus using Spirit 1.6.4.
I have a line that looks like this:
//The Description;DESCRIPTION;;
I want to put the text DESCRIPTION in a string if the line starts with //The Description;.
I have something that works but looks not that elegant to me:
vector<char> vDescription; // std::string doesn't work due to missing ::clear() in VC6's STL implementation
if(parse(chars,
// Begin grammar
(
as_lower_d["//the description;"]
>> (+~ch_p(';'))[assign(vDescription)]
),
// End grammar
space_p).hit)
{
const string desc(vDescription.begin(), vDescription.end());
}
I would much more like to assign all printable characters up to the next ';' but the following won't work because parse(...).hit == false
parse(chars,
// Begin grammar
(
as_lower_d["//the description;"]
>> (+print_p)[assign(vDescription)]
>> ';'
),
// End grammar
space_p).hit)
How do I make it hit?
You might try using confix_p:
confix_p(as_lower_d["//the description;"],
(+print_p)[assign(vDescription)],
ch_p(';')
)
It should be equivalent to Fred's response.
The reason your code fails is because print_p is greedy. The +print_p parser will consume characters until it encounters the end of the input or a non-printable character. Semicolon is printable, so print_p claims it. Your input gets exhausted, the variable is assigned, and the match fails — there's nothing left for the last semicolon of your parser to match.
Fred's answer constructs a new parser, (print_p - ';'), which matches everything print_p does, except for semicolons. "Match everything except X, and then match X" is a common pattern, so confix_p is provided as a shortcut for constructing that kind of parser. The documentation suggests using it for parsing C- or Pascal-style comments, but that's not required.
For your code to work, Spirit would need to recognize that the greedy print_p matched too much and then backtrack to allow matching less. But although Spirit will backtrack, it won't backtrack to the "middle" of what a sub-parser would otherwise greedily match. It will backtrack to the next "choice point," but your grammar doesn't have any. See Exhaustive backtracking and greedy RD in the Spirit documentation.
You're not getting a hit because ';' is matched by print_p. Try this:
parse(chars,
// Begin grammar
(
as_lower_d["//the description;"]
>> (+(print_p-';'))[assign(vDescription)]
>> ';'
),
// End grammar
space_p).hit)