This question already has answers here:
Regular expression to skip character in capture group
(6 answers)
Closed 7 years ago.
I have a regex capture, and I would like to exclude a character (a space, in this particular case) from the middle of the captured string. Can this be done in one step, by modifying the regex?
(Quick and dirty) example:
Text: Key name = value
My regex: (.*) = (.*)
Output: \1 = "Key name" and \2 = "value"
Desired output: \1 = "Keyname" and \2 = "value"
Update: I'm not sure what regex engine will run this regex, since it's part of a larger software product. If you have a solution, please specify which engines it will run on, and on which it will not.
Update2: The aforementioned product takes a regex as an input, and then uses the matched values further, which is the reason for which a one-step solution is asked for. There is no opportunity to insert an intermediate processing step in the pipeline.
This is a possible theoretical pure-regex implementation using the end-of-previous-match \G anchor:
/(?:\G(\w+)\h(?:(?:=\h)(\w+))?)+/g
Online demo
Legenda
(?: # Non capturing group 1
\G # Matches where the regex engine stops in the previous step
(\w+) # capture group 1: a regex word of 1+ chars
\h* # zero or more horizontal spaces (space, tabs)
(?: # Non capturing group 2
=\h* # literal '=' follower by zero or more hspaces
(\w+) # capture group 2: a regex word of 1+ chars
)? # make the non capturing group 2 optional
)+ # repeat the non capturing group 1, one or more
In the substitution section of the demo:
\1 actually contains Keyname (the 2 terms are separated by a fake space)
\2 is value
NOTE: i don't recommend using this unless actually needed (why?).
There are multiple possible approaches in 2 steps: as surely already stated simply strip spaces from the first capturing group of the OP regex.
I would come up with sth. like:
(?<key>[\w]+)\s*=\s*(?<value>.+)
# look for a word character and capture it in a group called "key"
# followed by zero or unlimited times of a whitespace character (\s)
# followed by an equation sign
# followed by zero or unlimited times of a whitespace character (\s)
# capture the rest in a group called value
... and process the captured output afterwards. But with the \w character class no whitespace will matched (do you have keys with a whitespace in it?).
See a working demo here. But as mentionned in the comments, it depends on your programming language.
Related
I have a string looks like this
#123##1234###2356####69
It starts with # and followed by any digits, every time the # appears, the number of # increases, first time 1, second time 2, etc.
It's similar to this regex, but since I don't know how long this pattern goes, so it's not very useful.
^#\d+##\d+###\d+$
I'm using PCRE regex engine, it allows recursion (?R) and conditions (?(1)...) etc.
Is there a regex to validate this pattern?
Valid
#123
#12##235
#1234##12###368
#1234##12###368####22235#####723356
Invalid
##123
#123###456
#123##456##789
I tried ^(?(1)(?|(#\1)|(#))\d+)+$ but it doesn't seem to work at all
You can do this using PCRE conditional sub-pattern matching:
^(?:((?(1)\1)#)\d+)++$
RegEx Demo
RegEx Details:
^: Start
(?:: Start non-capture group
(: Start capture group #1
(?(1)\1): if/then/else directive that means match back-reference \1 only if 1st capture group is available otherwise match null
#: Match an additional #
): End capture group #1
\d+: Match 1+ digits
)++: End non-capture group. Match 1+ of this non-capture group.
$: End
One option could be optionally matching a backreference to group 1 inside group 1 using a possessive quantifier \1?+# adding # on every iteration.
^(?:(\1?+#)\d+)++$
^ Start of string
(?: Non capture group
(\1?+#)\d+ Capture group 1, match an optional possessive backreference to what is already captured in group 1 and add matching a # followed by 1+ digits
)++ Close the non capture group and repeat 1+ times possessively
$ End of string
Regex demo
I think you can use forward-referencing here:
^(?:((?:\1(?!^)|^)#)\d+)+$
See the regex demo.
Details:
^ - start of string
(?:((?:\1(?!^)|^)#)\d+)+ - one or more occurrences of
((?:\1(?!^)|^)#) - Group 1 (the \1 value): start of string or an occurrence of the Group 1 value if it is not at the string start position
\d+ - one or more digits
$ - end of string.
NOTE: This technique does not work in regex flavors that do not support forward referencing, like ECMAScript based flavors (e.g. JavaScript, VBA, C++ std::regex)
Despite there are already working answers, and inspired by Wiktor's answer, I came up this idea:
(?:(^#|#\1)\d+)+$
Which is also quite short and effective(also works for non pcre environment).
See the test cases
(it must be something trivial and answered many times already - but I can't formulate the right search query, sorry!)
From the text like prefix start.then.123.some-more.text. All the rest I need to extract start.then.123.some-more.text - i.e. string that has no spaces, have periods in the middle and may have or not the trailing period (and that trailing period should not be included). I struggle to build a regex that would catch both cases:
prefix (start[0-9a-zA-Z\.\-]+)\..* - this works correctly only if there's a trailing period,
prefix (start[0-9a-zA-Z\.\-]+)\.?.* - I thought adding ? after \. will make it optional - but it doesn't...
P.S. My environment is MS VBA script, I'm using CreateObject("vbscript.regexp") - but I guess the question is relevant to other regex engines as well.
If you don’t want to include “prefix” you can use:
(?<=prefix )\S*?(?=\.?\s)
Demo
EDIT:
Even simpler, without lookbehinds or lookaheads, if you're using capturing groups anyway:
prefix (\S*\w)
This will stop at the last letter, number, or underscore. If you want to be able to capture a hyphen as the last character, you can change \w above to [\w-].
Demo 2
You could match prefix, and use a capturing group to first match chars A-Za-z0-9.
Then you can repeat the previous pattern in a group preceded by either a . or - using a character class.
prefix ([0-9a-zA-Z]+(?:[.-][0-9a-zA-Z]+)+)
In parts
prefix Match literally
( Capture group 1
[0-9a-zA-Z]+ Match 1+ times any of the listed chars
(?: Non capture group
[.-][0-9a-zA-Z]+ match either a . or - and again match 1+ times any of the listed chars
)+ Close group and repeat 1+ times to match at least a dot or hyphen
) Close group
Regex demo
If the value in the capturing group should begin with start:
prefix (start(?:[.-][0-9a-zA-Z]+)+)
Regex demo
This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 2 years ago.
Example: search for the pattern man but only as the beginning of a word (i.e. not preceded directly by letters).
This pattern would be found in the strings man, spider-man, manpower, iron_man. But not in woman or human.
I have assumed that if the word "man" is preceded by a hyphen or underscore, to achieve a match the hyphen or underscore must be preceded by a letter (e.g., "-man" would not be matched).
The \K escape sequence resets the beginning of the match to the current position in the token list. If supported by the regex engine, the following regular expression (with the case-indifferent flag set) could be used.
(?:^| |[a-z][-_])\Kman
Demo
The selected answer to this SO question provides a list of regex engines that support \K. That list was last updated in August 2019.
The regex engine performs the following operations.
(?: # begin non-capture group
^ # match beginning of line
| # or
# match a space
| # or
[a-z] # match a letter
[-_] # match '-' or '_'
) # end non-capture group
\K # discard everything matched so far
man # match 'man'
Alternatively, a capture group could be used.
(?:^| |[a-z][-_])(man)
Demo
You can use positive lookbehind to achieve this:
(?<=^|[a-z][_-]|\s)man
regex101 demo
Add a word boundary \b or look behind for underscore to the start:
((?<=_)|\b)man
See live demo.
My regex knowledge is pretty limited, but I'm trying to write/find an expression that will capture the following string types in a document:
DO match:
ADY123
AD12ADY
1HGER_2
145-DE-FR2
Bicycle1
2Bicycle
128D
128878P
DON'T match:
BICYCLE
183-329-193
3123123
Is such an expression possible? Basically, it should find any string containing letters AND digits, regardless of whether the string contains a dash or underscore. I can find the first two using the following two regex:
/([A-Z][0-9])\w+/g
/([0-9][A-Z)\w+/g
But searching for possible dashes and hyphens makes it more complicated...
Thanks for any help you can provide! :)
MORE INFO:
I've made slight progress with: ([A-Z|a-z][0-9]+-*_*\w+) but it doesn't capture strings with more than one hyphen.
I had a document with a lot of text strings and number strings, which I don't want to capture. What I do want is any product code, which could be any length string with or without hyphens and underscores but will always include at least one digit and at least one letter.
You can use the following expression with the case-insensitive mode:
\b((?:[a-z]+\S*\d+|\d\S*[a-z]+)[a-z\d_-]*)\b
Explanation:
\b # Assert position at a word boundary
( # Beginning of capturing group 1
(?: # Beginning of the non-capturing group
[a-z]+\S*\d+ # Match letters followed by numbers
| # OR
\d+\S*[a-z]+ # Match numbers followed by letters
) # End of the group
[a-z\d_-]* # Match letter, digit, '_', or '-' 0 or more times
) # End of capturing group 1
\b # Assert position at a word boundary
Regex101 Demo
I have created the following expression: (.NET regex engine)
((-|\+)?\w+(\^\.?\d+)?)
hello , hello^.555,hello^111, -hello,+hello, hello+, hello^.25, hello^-1212121
It works well except that :
it captures the term 'hello+' but without the '+' : this group should not be captured at all
the last term 'hello^-1212121' as 2 groups 'hello' and '-1212121' both should be ignored
The strings to capture are as follows :
word can have a + or a - before it
or word can have a ^ that is followed by a positive number (not necessarily an integer)
words are separated by commas and any number of white spaces (both not part of the capture)
A few examples of valid strings to capture :
hello^2
hello^.2
+hello
-hello
hello
EDIT
I have found the following expression which effectively captures all these terms, it's not really optimized but it just works :
([a-zA-Z]+(?= ?,))|((-|\+)[a-zA-Z]+(?=,))|([a-zA-Z]+\^\.?\d+)
Ok, there are some issues to tackle here:
((-|+)?\w+(\^.?\d+)?)
^ ^
The + and . should be escaped like this:
((-|\+)?\w+(\^\.?\d+)?)
Now, you'll also get -1212121 there. If your string hello is always letters, then you would change \w to [a-zA-Z]:
((-|\+)?[a-zA-Z]+(\^\.?\d+)?)
\w includes letters, numbers and underscore. So, you might want to restrict it down a bit to only letters.
And finally, to take into consideration of the completely not capturing groups, you'll have to use lookarounds. I don't know of anyway otherwise to get to the delimiters without hindering the matches:
(?<=^|,)\s*((-|\+)?[a-zA-Z]+(\^\.?\d+)?)\s*(?=,|$)
EDIT: If it cannot be something like -hello^2, and if another valid string is hello^9.8, then this one will fit better:
(?<=^|,)\s*((?:-|\+)?[a-zA-Z]+|[a-zA-Z]+\^(?:\d+)?\.?\d+)(?=\s*(?:,|$))
And lastly, if capturing the words is sufficient, we can remove the lookarounds:
([-+]?[a-zA-Z]+|[a-zA-Z]+\^(?:\d+)?\.?\d+)
It would be better if you first state what it is you are looking to extract.
You also don't indicate which Regular Expression engine you're using, which is important since they vary in their features, but...
Assuming you want to capture only:
words that have a leading + or -
words that have a trailing ^ followed by an optional period followed by one or more digits
and that words are sequences of one or more letters
I'd use:
([a-zA-Z]+\^\.?\d+|[-+][a-zA-Z]+)
which breaks down into:
( # start capture group
[a-zA-Z]+ # one or more letters - note \w matches numbers and underscores
\^ # literal
\.? # optional period
\d+ # one or more digits
| # OR
[+-]? # optional plus or minus
[a-zA-Z]+ # one or more letters or underscores
) # end of capture group
EDIT
To also capture plain words (without leading or trailing chars) you'll need to rearrange the regexp a little. I'd use:
([+-][a-zA-Z]+|[a-zA-Z]+\^(?:\.\d+|\d+\.\d+|\d+)|[a-zA-Z]+)
which breaks down into:
( # start capture group
[+-] # literal plus or minus
[a-zA-Z]+ # one or more letters - note \w matches numbers and underscores
| # OR
[a-zA-Z]+ # one or more letters
\^ # literal
(?: # start of non-capturing group
\. # literal period
\d+ # one or more digits
| # OR
\d+ # one or more digits
\. # literal period
\d+ # one or more digits
| # OR
\d+ # one or more digits
) # end of non-capturing group
| # OR
[a-zA-Z]+ # one or more letters
) # end of capture group
Also note that, per your updated requirements, this regexp captures both true non-negative numbers (i.e. 0, 1, 1.2, 1.23) as well as those lacking a leading digit (i.e. .1, .12)
FURTHER EDIT
This regexp will only match the following patterns delimited by commas:
word
word with leading plus or minus
word with trailing ^ followed by a positive number of the form \d+, \d+.\d+, or .\d+
([+-][A-Za-z]+|[A-Za-z]+\^(?:.\d+|\d+(?:.\d+)?)|[A-Za-z]+)(?=,|\s|$)
Please note that the useful match will appear in the first capture group, not the entire match.
So, in Javascript, you'd:
var src="hello , hello ,hello,+hello,-hello,hello+,hello-,hello^1,hello^1.0,hello^.1",
RE=/([+-][A-Za-z]+|[A-Za-z]+\^(?:\.\d+|\d+(?:\.\d+)?)|[A-Za-z]+)(?=,|\s|$)/g;
while(RE.test(src)){
console.log(RegExp.$1)
}
which produces:
hello
hello
hello
+hello
-hello
hello^1
hello^1.0
hello^.1