How do I avoid matching the last character in this regexp? - regex

I have a question about regexp in Tcl:
first output: TIP_12.3.4 %
second output: TIP_12.3.4 %
and sometimes the output may look like:
first output: TIP_12 %
second output: TIP_12 %
I want to get the number 12.3.4 or 12 using the following regexp:
output: TIP_(/[0-9].*/[0-9])
But why does it not match 12.3.4 or 12%?

You need to escape the dot; otherwise it means "match any character". Also, I'm not sure about the slashes in your regexp. A better solution:
/TIP_(\d+\.?)+/

Your problem is that / is not special in Tcl's regular expression language at all. It's just an ordinary printable non-letter character. (Other languages are a little different, as it is quite common to enclose regular expressions in / characters; this is not the case in Tcl.) Because it is a simple literal, using it in your RE makes it expect it in the input (despite it not being there); unsurprisingly, that makes the RE not match.
Fixing things: I'd use a regular expression like this: output: TIP_([\d.]+) under the assumption that the data is reasonably well formatted. That would lead to code like this:
regexp {output: TIP_([0-9.]+)} $input -> dottedDigits
Everything not in parentheses is a literal here, so that the code is able to find what to match. Inside the parentheses (the bit we're saving for later) we want one or more digits or periods; putting them inside a square-bracketed-set is perfect and simple. The net effect is to store the 12.3.4 in the variable dottedDigits (if found) and to yield a boolean result that says whether it matched (i.e., you can put it in an if condition usefully).
NB: the regular expression is enclosed in braces because square brackets are also Tcl language metacharacters; putting the RE in braces avoids trouble with misinterpretation of your script. (You could use backslashes instead, but they're ugly…)
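For comparison, here is a minimal Python sketch of the same pattern. Python's re module stands in for Tcl's regexp here, and the helper name extract_version is made up for illustration; the sample lines are the ones from the question.

```python
import re

# Same idea as the Tcl example: a literal prefix, then one or more
# digits or dots captured in group 1.
PATTERN = re.compile(r"output: TIP_([0-9.]+)")

def extract_version(line):
    """Return the dotted digits after 'TIP_', or None if nothing matches.

    extract_version is a hypothetical helper, not part of any library.
    """
    m = PATTERN.search(line)
    return m.group(1) if m else None

print(extract_version("first output: TIP_12.3.4 %"))  # 12.3.4
print(extract_version("second output: TIP_12 %"))     # 12
```

The trailing " %" is never captured because neither a space nor a percent sign is in the bracketed set.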

Try this :
output: TIP_(/([0-9\.^%]*)/[0-9])
Capture group 1.
Demo here :
http://regexr.com?31f6g

The following expression works for me:
{TIP_((\d+\.?)+)}

Related

How do I match a string that is not in a comment?

I use the following in my .vimrc to match capitalised strings and highlight them:
match Macro /\v<[A-Z|_]{2,}>/
However, I don't want to match comments (i.e. where a // precedes the text on the same line or where the text is surrounded by /* and */).
How do I modify the above to achieve this?
I'm assuming that the | in your regex was supposed to mean "or." It doesn't: within brackets, no "or" is required. Your | refers to the actual character |.
This regular expression should do the trick about 98% of the time, maybe more:
\v(\/\/[^\n]*|\/\*(\_[^*]|\*\_[^/])*)@<!<[A-Z_]{2,}>
It uses negative lookbehind to make sure that there is no // preceding the string in the same line and no /* preceding it that is not followed by a */. It fails in the following case:
if (string == "/*") { // Looks like the start of a block comment
return CONSTANT; // Won't be highlighted
}
If you want better results than this (that is, if you're worried that you'll obsess over the bug whenever you run into it) you could make this more sophisticated. How sophisticated depends on your language. In JavaScript, for example, you will need to worry about regex literals as well as strings:
// Looks like a comment after the "//" in the regex:
if (/\//.test(string)) return CONSTANT; // Won't be highlighted
If you want an idea of how complicated a regex to match a regex is, look at my answer here.

Annotating mismatches in regular expression

I need to "annotate" each mismatch of a regular expression with an X character. For example, if I have a text file like:
Line1Name: this is a (string).
Line2Name: (a string)
Line3Name this is a line without parenthesis
Line4Name: (a string 2)
Now following regular expression will match everything before a :
^[^:]+(?=:)
so the result will be
Line1Name:
Line2Name:
Line4Name:
However I would need to annotate the mismatch at the 3rd line, having this output:
Line1Name:
Line2Name:
X
Line4Name:
Is this possible with regular expressions?
If you have a look at what a regular expression is, you will realize that it is not possible to do logical operations with a regex alone. Quoting Wikipedia:
In computing, a regular expression provides a concise and flexible means to “match” (specify and recognize) strings of text, such as particular characters, words, or patterns of characters.
emphasis mine – simply put, a regex is a fancy way to find a string; it either does (it matches), or not.
To achieve what you are after, you need some kind of logic switch that operates on the match / not-match result of your regex search and triggers an action. You haven’t specified in what environment you are using your regex, so providing a solution is a bit pointless, but as an example, this would do what you are trying to do in pure bash:
# assuming your string is in $str
result="$([[ $str =~ ^[^:]+: ]] && echo "${str%%:*}" || echo "X")"
and this does the same thing in a language supporting your regex pattern (Ruby):
# assuming your string is in str
result = str[/^[^:]+(?=:)/] || "X"
As a side note, your example code does not match the output: you are using a lookahead for the colon, which excludes it in the match, but your output includes it. I’ve opted for sticking with your regex over your output pattern in my examples, thus excluding the colon from the result.
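The same match-or-annotate switch can be sketched in Python. This is only an illustration under the assumption that the input is processed line by line; the function name annotate is hypothetical, and the pattern is the one from the question, so the colon is excluded from the result.

```python
import re

# The question's pattern: everything before the first colon,
# with the colon itself left out by the lookahead.
PATTERN = re.compile(r"^[^:]+(?=:)")

def annotate(lines):
    """Return the matched prefix per line, or 'X' where the regex fails.

    annotate is a hypothetical helper name used for this sketch.
    """
    result = []
    for line in lines:
        m = PATTERN.match(line)
        result.append(m.group(0) if m else "X")
    return result

lines = [
    "Line1Name: this is a (string).",
    "Line2Name: (a string)",
    "Line3Name this is a line without parenthesis",
    "Line4Name: (a string 2)",
]
print("\n".join(annotate(lines)))
```

The regex only reports match or no match; the "X" comes from the surrounding if/else logic.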

Regex or on multiple/single characters

I'm dynamically making a regex.
I want it to match the following:
lem
le,,m
levm
lecm
Basically, "lem" but before the m it can have any number of , or any one of any character. Right now I have
le[\,]{0,}[.]?m
you can see it at
http://regexr.com?303ne
It should match every one but the third one.
Update: I figured it out:
le[\,]{0,}.?m
Whenever you think "or" in Regular Expressions, you should start with alternation:
a|b
matches either a or b. So
any number of a list of characters OR 1 of any character
can be translated quite literally to
[...]*|.
where ... would be the list of characters to match (a character class). If you use that as part of a longer expression, you need to use parentheses, because concatenation binds more tightly (has higher precedence) than alternation:
le([,]*|.)m
Because the character class has only one item, we can simplify this:
le(,*|.)m
Note that . by default means "any character but newline".
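A quick way to check the alternation is to run it against the sample words. The sketch below uses Python and assumes the whole word must match (hence fullmatch); the helper name matches and the extra negative example "leabm" are additions for illustration.

```python
import re

# "any number of commas OR exactly one arbitrary character" between "le" and "m"
PATTERN = re.compile(r"le(,*|.)m")

def matches(word):
    """True if the entire word fits the pattern (hypothetical helper)."""
    return PATTERN.fullmatch(word) is not None

for word in ["lem", "le,,m", "levm", "lecm", "leabm"]:
    print(word, matches(word))
```

"lem" matches because ,* may match the empty string, while "leabm" fails because two arbitrary characters fit neither branch.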
What about this:
le(,*|.?)m
it should do what you want.
How about this one:
([^,])(?=\\1)
But this does the opposite :-) Not sure if it is ok for you
UPD:
this should work for you:
~^(?:,|([^,])(?!\\1))+$~
not sure what dialect you're looking for, but it works in PCRE: http://ideone.com/6Q3Wk
UPD2:
the same regex included into another
$r = '(?:,|([^,])(?!\\1))+';
var_dump(preg_match('~le' . $r . 'm~', 'leem'));
In this case the final expression becomes: le(?:,|([^,])(?!\\1))+m where le and m are added around mine without modifications

Can I shorten this regular expression?

I have the need to check whether strings adhere to a particular ID format.
The format of the ID is as follows:
aBcDe-fghIj-KLmno-pQRsT-uVWxy
A sequence of five blocks of five letters upper case or lower case, separated by one dash.
I have the following regular expression that works:
string idFormat = "[a-zA-Z]{5}[-]{1}[a-zA-Z]{5}[-]{1}[a-zA-Z]{5}[-]{1}[a-zA-Z]{5}[-]{1}[a-zA-Z]{5}";
Note that there is no trailing dash, but all of the blocks within the ID follow the same format. Therefore, I would like to be able to represent this sequence of four blocks with a trailing dash inside the regular expression and avoid the duplication.
I tried the following, but it doesn't work:
string idFormat = "[[a-zA-Z]{5}[-]{1}]{4}[a-zA-Z]{5}";
How do I shorten this regular expression and get rid of the duplicated parts?
What is the best way to ensure that each block also does not contain any numbers?
Edit:
Thanks for the replies, I now understand the grouping in regular expressions.
I'm running a few tests against the regular expression, the following are relevant:
Test 1: aBcDe-fghIj-KLmno-pQRsT-uVWxy
Test 2: abcde-fghij-klmno-pqrst-uvwxy
With the following regular expression, both tests pass:
^([a-zA-Z]{5}-){4}[a-zA-Z]{5}$
With the next regular expression, test 1 fails:
^([a-z]{5}-){4}[a-z]{5}$
Several answers have said that it is OK to omit the A-Z when using a-z, but in this case it doesn't seem to be working.
You can try:
([a-z]{5}-){4}[a-z]{5}
and make it case insensitive.
If you can set regex options to be case insensitive, you could replace all [a-zA-Z] with just plain [a-z]. Furthermore, [-]{1} can be written as -.
Your grouping should be done with (, ), not with [, ] (although you're correctly using the latter in specifying character sets).
Depending on context, you probably want to throw in ^...$ which matches start and end of string, respectively, to verify that the entire string is a match (i.e. that there are no extra characters).
In JavaScript, something like this:
/^([a-z]{5}-){4}[a-z]{5}$/i
This works for me, though you might want to check it:
[a-zA-Z]{5}(-[a-zA-Z]{5}){4}
(One group of five letters, followed by [dash+group of five letters] four times)
([a-zA-Z]{5}[-]{1}){4}[a-zA-Z]{5}
Try
string idFormat = "([a-zA-Z]{5}[-]{1}){4}[a-zA-Z]{5}";
I.e. you basically replace your brackets by parentheses. Brackets are not meant for grouping but for defining a class of accepted characters.
However, be aware that with shortened versions, you can use the expression for validating the string, but not for analyzing it. If you want to process the 5 groups of characters, you will want to put them in 5 groups:
string idFormat =
"([a-zA-Z]{5})-([a-zA-Z]{5})-([a-zA-Z]{5})-([a-zA-Z]{5})-([a-zA-Z]{5})";
so you can address each group and process it.
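The difference between validating and analyzing can be sketched in Python as well (the question's code looks like C#, so treat this as an illustration rather than a drop-in). The names is_valid_id and split_id are hypothetical. The case-insensitive flag is what lets [a-z] accept upper-case letters; without it, the mixed-case test fails, as observed in the question.

```python
import re

# re.ASCII keeps [a-z] with IGNORECASE limited to the 52 ASCII letters.
FLAGS = re.IGNORECASE | re.ASCII

# Short form: good for a yes/no check only.
ID_VALIDATE = re.compile(r"([a-z]{5}-){4}[a-z]{5}", FLAGS)

# Long form: one capture group per block, so each block can be read back.
ID_PARSE = re.compile(
    r"([a-z]{5})-([a-z]{5})-([a-z]{5})-([a-z]{5})-([a-z]{5})", FLAGS
)

def is_valid_id(s):
    """True if the whole string is five dash-separated blocks of five letters."""
    return ID_VALIDATE.fullmatch(s) is not None

def split_id(s):
    """Return the five blocks as a list, or None if the ID is malformed."""
    m = ID_PARSE.fullmatch(s)
    return list(m.groups()) if m else None

print(is_valid_id("aBcDe-fghIj-KLmno-pQRsT-uVWxy"))
print(split_id("aBcDe-fghIj-KLmno-pQRsT-uVWxy"))
print(is_valid_id("aBcD1-fghIj-KLmno-pQRsT-uVWxy"))
```

Because the character class lists only letters, digits are rejected without any extra rule. fullmatch plays the role of the ^...$ anchors.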

Search text with a regular expression to match outside specific characters

I have text that looks like:
My name is (Richard) and I cannot do
[whatever (Jack) can't do] and
(Robert) is the same way [unlike
(Betty)] thanks (Jill)
The goal is to search using a regular expression to find all parenthesized names that occur anywhere in the text BUT in-between any brackets.
So in the text above, the result I am looking for is:
Richard
Robert
Jill
You can do it in two steps:
step1: match all bracket contents using:
\[[^\]]*\]
and replace it with ''
step2: match all the remaining parenthesized names(globally) using:
\([^)]*\)
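The two steps above can be sketched in Python as follows. The function name names_outside_brackets is made up for this example, and a capture group is added to the step 2 pattern (an addition to the answer's regex) so that only the name, without its parentheses, is returned. Brackets are assumed to be balanced and not nested.

```python
import re

def names_outside_brackets(text):
    """Two-step extraction; hypothetical helper for illustration."""
    # Step 1: delete every [...] chunk, including anything inside it.
    stripped = re.sub(r"\[[^\]]*\]", "", text)
    # Step 2: collect the contents of each remaining (...) pair.
    return re.findall(r"\(([^)]*)\)", stripped)

text = """My name is (Richard) and I cannot do
[whatever (Jack) can't do] and
(Robert) is the same way [unlike
(Betty)] thanks (Jill)"""

print(names_outside_brackets(text))
```

Since [^\]] also matches newlines, a bracketed chunk that spans two lines (as with Betty in the sample) is removed in one piece.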
You didn't say what language you're using, so here's some Python:
>>> import re
>>> REGEX = re.compile(r'(?:[^[(]+|\(([^)]*)\)|\[[^]]*])')
>>> s="""My name is (Richard) and I cannot do [whatever (Jack) can't do] and (Robert) is the same way [unlike (Betty)] thanks (Jill)"""
>>> list(filter(None, REGEX.findall(s)))
The output is:
['Richard', 'Robert', 'Jill']
One caveat is that this does not work with arbitrary nesting. The only nesting it's really designed to work with is one level of parens in square brackets as mentioned in the question. Arbitrary nesting can't be done with just regular expressions. (This is a consequence of the pumping lemma for regular languages.)
The regex looks for chunks of text without brackets or parens, chunks of text enclosed in parens, and chunks of text enclosed in brackets. Only text in parens (not in square brackets) is captured. Python's findall finds all matches of the regex in sequence. In some languages you may need to write a loop to repeatedly match. For non-paren matches, findall inserts an empty string in the result list, so the call to filter removes those.
If you are using .NET you can do something like:
"(?<!\[.*?)(?<name>\(\w+\))(?!.*\])"
It's not really the best job for a single regexp. Have you considered, for example, making a copy of the string and then deleting everything in between the square brackets instead? It would then be fairly straightforward to extract things from inside the parentheses. Alternatively, you could write a very basic parser that tokenises the line (into normal text, square bracket text, and parenthesised text, I imagine) and then parses the tree that produces; it'd be more work initially but would make life much simpler if you later want to make the behaviour any more complicated.
Having said that, /(?:(?:^|\])[^\[]*)\((.*?)\)/ does the trick for your test case (but it will almost certainly have some weird behaviour if your [ and ] aren't matched properly, and I'm not convinced it's that efficient).
A quick (PHP) test case:
preg_match_all('/(?:(?:^|\])[^\[]*)\((.*?)\)/', "My name is ... (Jill)", $m);
print(implode(", ", $m[1]));
Outputs:
Richard, Robert, Jill
>>> s="My name is (Richard) and I cannot do [whatever (Jack) can't do (Jill) can] and (Robert) is the same way [unlike (Betty)] thanks (Jill)"
>>> for item in s.split("]"):
...     st = item.split("[")[0]
...     if ")" in st:
...         for i in st.split(")"):
...             if "(" in i:
...                 print(i.split("(")[-1])
...
Richard
Robert
Jill
So you want the regex to match the name, but not the enclosing parentheses? This should do it:
[^()]+(?=\)[^\[\]]*(?:\[[^\[\]]*\][^\[\]]*)*$)
As with the other answers, I'm making certain assumptions about your target string, like expecting parentheses and square brackets to be correctly balanced and not nested.
I say it should work because, although I've tested it, I don't know what language/tool you're using to do the regex matching with. We could provide higher-quality answers if we had that info; all regex flavors are not created equal.
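As one data point, here is how the lookahead pattern behaves in Python's re module. This is a sketch for that flavor only; the variable names are illustrative and the sample text is the one from the question, joined onto a single line.

```python
import re

# Match a run of non-parenthesis characters only if it is followed by ")"
# and the rest of the string consists of balanced, non-nested [...] pairs.
PATTERN = re.compile(r"[^()]+(?=\)[^\[\]]*(?:\[[^\[\]]*\][^\[\]]*)*$)")

text = (
    "My name is (Richard) and I cannot do [whatever (Jack) can't do] and "
    "(Robert) is the same way [unlike (Betty)] thanks (Jill)"
)

names = PATTERN.findall(text)
print(names)
```

A name inside square brackets is rejected because, after its closing parenthesis, the lookahead runs into a ] that has no matching [ in the remaining text.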