looking for a regex pattern key:value pairs - regex

I am trying to find a regex pattern for the outlook search, I am looking for grouping pattern to handle this
from:Jack subject:(sending invoice) title:ibm
I used this pattern but i do not get the words after the first word
(?<name>\\w+):[(](?<value>\\w*)[)]*

\w doesn't handle spaces, change your regex to:
(?<name>\\w+):[(](?<value>[^)]*)[)]
[^)]* means 0 or more characters that is not right parens.
May be you'd prefer to use [^)]+ that means ONE or more characters that is not right parens.
If the parenthesis are optional, use:
(?<name>\\w+):[(]?(?<value>[^)]+)[)]?

The first bracket in 'value' is not optional in your regex
from:Jack subject:(sending invoice) title:ibm
As for whitespaces, I'd do it this way, as the brackets are not present in every key value pair anyway:
(?<name>\\w+):[(]*(?<value>(?:\\w|\\s)+)[)]*
However, it seems that the value is either one word or a sequence of words but within enclosing brackets - let's re-write the regex then, otherwise you'll get the whole thing after ':' as the first value:
(?<name>\\w+):(?:(?<value>\\w+)|(?:\\((?<value>(?:\\w|\\s)+)))

Related

RegEx to match all sets of items that have part of specific value

I'm trying to use RegEx to filter all sets of items that have part of a specific value in a capture group that I have defined.
I have to check if the fifth capture group contains at least part of a specific text.
My string:
First Item;Second Item;Third Item;Fourth Item;First Word;Sixth
Item?First Item;Second Item;Third Item;Fourth Item;Second Word;Sixth
Item?First Item;Second Item;Third Item;Fourth Item;Can't Capture This
Set;Sixth Item
RegEx that works for exact word:
(?:^|\?)([^;]+);([^;]+);([^;]+);([^;]+);(Second Word);([^;\?$]+)
The problem is that I need this RegEx to work to capture only part of the word.
Not Working:
(?:^|\?)([^;]+);([^;]+);([^;]+);([^;]+);(.*Word.*);([^;\?$]+) >
Thanks!
Use [^;]* instead of .* because you have semi-colons as field delimiters:
(?:^|\?)([^;]+);([^;]+);([^;]+);([^;]+);([^;]*Word[^;]*);([^;?]+)
See proof. ([^;]*Word[^;]*) will match zero or more characters other than semi-colons, then a Word and zero or more characters other than semi-colons.

Is there a way to use periodicity in a regular expression?

I'm trying to find a regular expression for a Tokenizer operator in Rapidminer.
Now, what I'm trying to do is to split text in parts of, let's say, two words.
For example, That was a good movie. should result to That was, was a, a good, good movie.
What's special about a regex in a tokenizer is that it plays the role of a delimiter, so you match the splitting point and not what you're trying to keep.
Thus the first thought is to use \s in order to split on white spaces, but that would result in getting each word separately.
So, my question is how could I force the expression to somehow skip one in two whitespaces?
First of all, we can use the \W for identifying the characters that separate the words. And for removing multiple consecutive instances of them, we will use:
\W+
Having that in mind, you want to split every 2 instances of characters that are included in the "\W+" expression. Thus, the result must be strings that have the following form:
<a "word"> <separators that are matched by the pattern "\W+"> <another "word">
This means that each token you get from the split you are asking for will have to be further split using the pattern "\W+", in order to obtain the 2 "words" that form it.
For doing the first split you can try this formula:
\w+\W+\w+\K\W+
Then, for each token you have to tokenize it again using:
\W+
For getting tokens of 3 "words", you can use the following pattern for the initial split:
\w+\W+\w+\W+\w+\K\W+
This approach makes use of the \K feature that removes from the match everything that has been captured from the regex up to that point, and starts a new match that will be returned. So essentially, we do: match a word, match separators, match another word, forget everything, match separators and return only those.
In RapidMiner, this can be implemented with 2 consecutive regex tokenizers, the first with the above formula and the second with only the separators to be used within each token (\W+).
Also note that, the pattern \w selects only Latin characters, so if your documents contain text in a different character set, these characters will be consumed by the \W which is supposed to match the separators. If you want to capture text with non-Latin character sets, like Greek for example, you need to change the formula like this:
\p{L}+\P{L}+\p{L}+\K\P{L}+
Furthermore, if you want the formula to capture text on one language and not on another language, you can modify it accordingly, by specifying {Language_Identifier} in place of {L}. For example, if you only want to capture text in Greek, you will use "{Greek}", or "{InGreek}" which is what RapidMiner supports.
What you can do is use a zero width group (like a positive look-ahead, as shown in example). Regex usually "consumes" characters it checks, but with a positive lookahead/lookbehind, you assert that characters exist without preventing further checks from checking those letters too.
This should work for your purposes:
(\w+)(?=(\W+\w+))
The following pattern matches for each pair of two words (note that it won't match the last word since it does not have a pair). The first word is in the first capture group, (\w+). Then a positive lookahead includes a match for a sequence of non word characters \W+ and then another string of word characters \w+. The lookahead (?=...) the second word is not "consumed".
Here is a link to a demo on Regex101
Note that for each match, each word is in its own capture group (group 1, group 2)
Here is an example solution, (?=(\b[A-Za-z]+\s[A-Za-z]+)) inspired from this SO question.
My question sounds wrong once you understand that is a problem of an overlapping regex pattern.

Regexp, that ignores only first capture group

We have tab spaced list of "key=value" pairs.
How we can split it, using regexp?
Case key=value must be transformed into value. Case key=value=value2 must be transformed into value=value2.
https://regex101.com/r/dR5dT0/1 - I've started solution like this, but can't find beautiful way to remove only "key=" part from text.
UPD BTW, do you know cool crash courses on regular expressions?
You can just use
=(\S*)
See regex demo
Since the list is already formatted, the = in the pattern will always be the name/value delimiter.
The \S matches any non-whitespace character.
The * is a quantifier meaning that the \S should occur zero or more times (\S* matches zero or more non-whitespace characters).
You can use this regex for matching:
/\w+=(\S+)/
and grab captured group #1
RegEx Demo

Cleaning up a regular expression which has lots of repetition

I am looking to clean up a regular expression which matches 2 or more characters at a time in a sequence. I have made one which works, but I was looking for something shorter, if possible.
Currently, it looks like this for every character that I want to search for:
([A]{2,}|[B]{2,}|[C]{2,}|[D]{2,}|[E]{2,}|...)*
Example input:
AABBBBBBCCCCAAAAAADD
See this question, which I think was asking the same thing you are asking. You want to write a regex that will match 2 or more of the same character. Let's say the characters you are looking for are just capital letters, [A-Z]. You can do this by matching one character in that set and grouping it by putting it in parentheses, then matching that group using the reference \1 and saying you want two or more of that "group" (which is really just the one character that it matched).
([A-Z])\1{1,}
The reason it's {1,} and not {2,} is that the first character was already matched by the set [A-Z].
Not sure I understand your needs but, how about:
[A-E]{2,}
This is the same as yours but shorter.
But if you want multiple occurrences of each letter:
(?:([A-Z])\1+)+
where ([A-Z]) matches one capital letter and store it in group 1
\1 is a backreference that repeats group 1
+ assume that are one or more repetition
Finally it matches strings like the one you've given: AABBBBBBCCCCAAAAAADD
To be sure there're no other characters in the string, you have to anchor the regex:
^(?:([A-Z])\1+)+$
And, if you wnat to match case insensitive:
^(?i)(?:([A-Z])\1+)+$

Regex match multiple elements

I have regex that I am trying to match to specific function parameters. I want to be able to style them a certain way in a language package.
Here is the text I am trying to match:
addFill(path:svgjs.Element, pattern:Pattern, docMaxSide:number) {
pathFillId(path)
}
In this example, I want to match the words "path" "pattern" and "docMaxSide" from the parameters. I want to make sure it does NOT match the word "path" in the second line (where I am calling pathFillId).
Here is my current regex: \(.*?(\w+):.*?\)
Broken down:
\( Find open parens
.*? It may have stuff before it, but after the parens
(\w+): Capture a word before a colon
.*? There may be more stuff after the colon
\) Close parens
Right now, it will only match the first item, "path". But I need it to match all the words I mentioned above.
UPDATE: I should have been more specific. It should only match if it's a function parameter. For example, I don't want path1 matched in the following: var path1:string. The difficulty is coming up with regex that matches items only between parens.
Try this:
\w+(?=:)
with the g modifier (the global modifier finds all elements and don't return on the first match)
Also see the example
UPDATE
If you want only match the parameters in the parenthesis you can do this:
\w+(?=:[\w.]+\s*[,)])
Here is the example for this regex
You problem is this part of your regex: .*?. So you specify that you want any character (.), that's correct. But then you must decide for one of * and ? - * means {0,}, ? means {0,1}.
If that doesn't help, you might test your regex with regexe.com or similar.