Multiple results from one subgroup - regex

I have this string:
<own:egna attribute1="1" attribute2="2">test</own:egna>
I want to catch all attributes with a regexp.
This regexp matches one attribute: (\s+attribute\d=['"][^'"]+['"])
But why does appending a +, as in (\s+attribute\d=['"][^'"]+['"])+, return only the last matched attribute and not all of them?
How would you change this to return all attributes in separate groups?
I actually have more regex around this, so functions such as Python's findall and equivalents won't do.

The short answer is you can't - only the last group is accessible. The Python docs state this explicitly:
If a group matches multiple times, only the last match is accessible [...]
You'll have to use some language features:
In PHP, there's preg_match_all that returns all matches.
In other languages, you'll have to do this manually: add the g modifier to the regex and loop over it. Perl, for example, will manage a string position and return the next match in $1 each time a /([...])/g pattern is matched.
Also take a look at Capturing a repeated group.
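As a minimal sketch of the "loop over it" approach in JavaScript (the two-step split and the variable names are assumptions for illustration, not part of the original answer): an outer regex captures the whole run of attributes as one group, and a second global regex then walks that run one attribute at a time.

```javascript
const input = '<own:egna attribute1="1" attribute2="2">test</own:egna>';

// Step 1: the surrounding regex captures the whole run of attributes as ONE group.
const tag = input.match(/<own:egna((?:\s+attribute\d=['"][^'"]+['"])+)>/);

// Step 2: a second, global regex walks that run and yields one match per attribute.
const attributes = tag
  ? [...tag[1].matchAll(/\s+(attribute\d)=['"]([^'"]+)['"]/g)].map(m => [m[1], m[2]])
  : [];

console.log(attributes); // [ [ 'attribute1', '1' ], [ 'attribute2', '2' ] ]
```

The outer pattern can be extended with whatever else surrounds the attributes; only the captured run is re-scanned.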

Related

Regex to match text from multiple links

How to extract links which contain a certain word?
For example:
https://www.test.com/text/1###https://www.test.com/text/word/2###https://www.test.com/text/text/word/3###https://www.test.com/3/text###https://www.test.com/word/3/text/text
How can I search for "word" using the regex below?
((https:).*?(###))
The result should be like this
https://www.test.com/text/word/2
https://www.test.com/text/text/word/3
https://www.test.com/word/3/text/text
Let's try to build such a regex. First we need to match the beginning of the URL:
/(https?:\/\//
We add ? after the s so that http URLs match as well.
Then we need to find any text except ###, so we need to add:
(?:(?!###).)*
which means - any amount of characters not starting a ### sequence.
Also we need to add word itself and previous sub-expression again, since word can be surrounded by any text:
word(?:(?!###).)*
To make the end of the match explicit, we add one more piece that pins the last character to the URL boundary:
.(?=###|$)
which means - any character followed by ### or end of string. The final expression will look like:
/(https?:\/\/(?:(?!###).)*word(?:(?!###).)*.(?=###|$))/g
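A quick way to check the assembled pattern against the sample string, sketched in JavaScript. It assumes the optional s from the first step (https?) and drops the outer capturing group, since String.prototype.match with the g flag returns whole matches anyway.

```javascript
const text = 'https://www.test.com/text/1###https://www.test.com/text/word/2###https://www.test.com/text/text/word/3###https://www.test.com/3/text###https://www.test.com/word/3/text/text';

// (?:(?!###).)* consumes characters one at a time as long as no ### starts there.
// .(?=###|$) requires the match to end right before ### or at the end of the string.
const pattern = /https?:\/\/(?:(?!###).)*word(?:(?!###).)*.(?=###|$)/g;

// match() returns null when nothing is found, so fall back to an empty array.
const urlsWithWord = text.match(pattern) || [];

console.log(urlsWithWord);
```

On the sample string this should return the three URLs that contain word and skip the other two.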
But I believe it's better to just split the text by ### and then check for the needed word with String.prototype.includes.
If the word has to be a part of the pathname, you might use filter in combination with URL and check if the parts of the pathname contain word.
let str = 'https://www.test.com/text/1###https://www.test.com/text/word/2###https://www.test.com/text/text/word/3###https://www.test.com/3/text###https://www.test.com/word/3/text/text';
let filteredUrls = str.split("###")
  .filter(s =>
    new URL(s).pathname
      .split('/')
      .includes('word')
  );
console.log(filteredUrls);
If you want to use regex only and possessive quantifiers are supported (The javascript tag has been removed) you might use:
https?://[^#w]*(?:#(?!##)|w(?!ord)|[^#w]*)++word.*?(?=###|$)
Previous answer
You are probably looking for this regular expression:
https://www\.test\.com/(text/)*word/\d+(/text)*
Here is how you can use it in a JavaScript context (every slash / is escaped with a backslash: \/):
var str = 'https://www.test.com/text/1###https://www.test.com/text/word/2###https://www.test.com/text/text/word/3###https://www.test.com/3/text###https://www.test.com/word/3/text/text';
var urls = str.match(/https:\/\/www\.test\.com\/(text\/)*word\/\d+(\/text)*/g);
console.log(urls);
In the array you get exactly the elements you wanted.
Update, after the question was updated and the author added a comment
If you need to extract the words themselves from your example string, then you have to use a slightly more complex regular expression:
var str = 'https://www.test.com/text/1###https://www.test.com/text/word/2###https://www.test.com/text/text/word/3###https://www.test.com/3/text###https://www.test.com/word/3/text/text';
var urls = str.match(/(?<=\/)\w+(?=\/\d+\/\w)|(?<=(\w\/\w+\/))\w+(?=\/\d)/g);
console.log(urls);
Explanation
The regular expression is /(?<=\/)\w+(?=\/\d+\/\w)|(?<=(\w\/\w+\/))\w+(?=\/\d)/g, delimited by /.../ and with the g flag, which makes the search return every occurrence.
The regular expression has two alternatives: ...|...
The first one, (?<=\/)\w+(?=\/\d+\/\w), matches cases where the searched word comes directly after a slash (?<=\/) and is followed by a number and then more words (?=\/\d+\/\w).
https://www.test.com/word/3/text/text
The second alternative, (?<=(\w\/\w+\/))\w+(?=\/\d), matches cases where the word is preceded by other words following the domain (?<=(\w\/\w+\/)) (in fact two slashes separated by alphanumeric characters) and the searched word comes immediately before a slash followed by a number (?=\/\d).
https://www.test.com/text/word/2
https://www.test.com/text/text/word/3
All slashes must be escaped: \/.
The construction (?<=...) means lookbehind in regular expressions and (?=...) means lookahead in regular expressions.
Note 1. The above example currently only works well in the Chrome browser, because:
(...) now lookbehind is part of the ECMAScript 2018 specification. As of this writing (late 2018), Google's Chrome browser is the only popular JavaScript implementation that supports lookbehind. So if cross-browser compatibility matters, you can't use lookbehind in JavaScript.
Note 2. Lookbehind, even where it is supported, must contain a fixed-length pattern in most regular expression engines. The example above does not respect that, but it is still valid and works in the engines that allow variable-length lookbehind: Google Chrome's JavaScript engine, the JGsoft engine and the .NET framework RegEx classes.
Note 3. The lookbehind syntax, or its poorer \K replacement, is widely supported by the regular expression engines used in a large group of programming languages.
You can find more explanation of the regular expressions I used here, for example.
You may first split by ### then check whether /word/ exists in each element:
var s = 'https://www.test.com/text/1###https://www.test.com/text/word/2###https://www.test.com/text/text/word/3###https://www.test.com/3/text###https://www.test.com/word/3/text/text';
var result = [];
s.split(/###/).forEach(function(el) {
  if (el.includes('/word/'))
    result.push(el);
});
// or else by using filter
// result = s.split(/###/).filter(el => el.includes('/word/'))
console.log(result);

Regex capture into group everything from string except part of string

I'm trying to create a regex which will capture everything from a string except for specific parts of the string. The best place to start seems to be using groups.
For example, I want to capture everything except for "production" and "public" from a string.
Sample input:
california-public-local-card-production
production-nevada-public
Would give output
california-local-card
nevada
On https://regex101.com/ I can extract the strings I don't want with
(production|public)\g
But how to capture the things I want instead?
The following will kind of get me the word from between production and public, but not anything before or after https://regex101.com/r/f5xLLr/2 :
(production|public)-?(\w*)\g
Flipping it and going for \s\S actually gives me what I need in two separate subgroups (group2 in both matches) https://regex101.com/r/ItlXk5/1 :
(([\s\S]*?)(production|public))\g
But how to combine the results? Ideally I would like to extract them as a separate named group; this is where I've gotten to https://regex101.com/r/scWxh5/1 :
(([\s\S]*?)(production|public))(?P<app>\2)\g
But this breaks the group2 matchings and gets me empty strings. What else should I try?
Edit: This question boils down to this: How to merge regex group matches?
Which seems to be impossible to solve in regex.
A regexp match is always a continuous range of the sample string. Thus, the answer is "No, you cannot write a regexp which matches a series of concatenated substrings as described in the question".
But this common kind of task is easily solved by replacing the unwanted words with empty strings, like
s/-production|production-|-public|public-//g
(Or an equivalent in a language you're using)
Note. Provided that \b is supported, it would be more correct to spell it as
s/-production\b|\bproduction-|-public\b|\bpublic-//g
(to avoid matching words like 'subproduction' or 'publication')
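A minimal JavaScript equivalent of the substitution above, as a sketch. The helper name stripEnvironmentWords is made up for illustration; the pattern is the \b variant from the answer.

```javascript
// Remove "production" and "public" together with one adjacent hyphen.
// \b keeps words such as "subproduction" or "publication" intact.
const stripEnvironmentWords = s =>
  s.replace(/-production\b|\bproduction-|-public\b|\bpublic-/g, '');

console.log(stripEnvironmentWords('california-public-local-card-production')); // california-local-card
console.log(stripEnvironmentWords('production-nevada-public'));                // nevada
```

A string that consists only of one of the words (for example just "public") is left unchanged by this pattern, because every alternative requires a neighbouring hyphen.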
Your regex is nearly there:
([\s\S]*?)(?>production|public)
But this results in multiple matches
Match 1
Full match 0-17 `california-public`
Group 1. 0-11 `california-`
Match 2
Full match 17-39 `-local-card-production`
Group 1. 17-29 `-local-card-`
So you have to match multiple times to retrieve the result.

How to Use Delphi TRegEx to replace a particular capture group?

I am trying to use TRegex in Delphi XE7 to do a search and replace in a string.
The string looks like this "#FXXX(b, v," and I want to replace the second integer value v.
For example:
#F037(594,2027,-99,-99,0,0,0,0)
might become
#F037(594,Fred,-99,-99,0,0,0,0)
I am a newbie at RegEx but made up this pattern that seems to work fine for finding the match and identifying the right capturing group for the "2027" (the part below in parentheses). Here it is:
#F\d{3}\(\s*\d{1,5}\s*,\s*(\d{1,5})\s*,
My problem is that I cannot work out how to replace just the captured group "2027" using the Delphi TRegEx implementation. I am getting rather confused about TMatch and TGroup and how to use them. Can anyone suggest some sample code? I also suspect I am not understanding the concept of backreferences.
Here is what I have so far:
Uses
  RegularExpressions;
//The function that does the actual replacement
function TForm6.DoReplace(const Match: TMatch): string;
begin
  //This causes the whole match to be replaced.
  //#F037(594,2027,-99,-99,0,0,0,0) becomes Fred-99,-99,0,0,0,0)
  //How to just replace the first matched group (ie 2027)?
  If Match.Success then
    Result := 'Fred';
end;
//Code to set off the regex replacement based on source text in Edit1 and put the result back into Memo1
//Edit1.text set to #F037(594,2027,-99,-99,0,0,0,0)
procedure TForm6.Button1Click(Sender: TObject);
var
  regex: TRegEx;
  Pattern: string;
  Evaluator: TMatchEvaluator;
begin
  Memo1.Clear;
  Pattern := '#F\d{3}\(\s*\d{1,5}\s*,\s*(\d{1,5})\s*,';
  RegEx.Create(Pattern);
  Evaluator := DoReplace;
  Memo1.Lines.Add(RegEx.Replace(Edit1.Text, Pattern, Evaluator));
end;
When using regex replacements, the whole matched content will be replaced. You have access to the whole match, captured groups and named captured groups.
There are two different ways of doing this in Delphi.
You are currently using an Evaluator, that is, an object method containing instructions on what to replace. Inside this method you have access to the whole match content. The result will be the replacement string.
This way is useful if vanilla regex is not capable of what you want to do in the replacement (e.g. incrementing numbers, changing character case).
There is another overloaded Replace method that uses a string as the replacement. As you want to do a basic regex replace here, I would recommend using it.
In this string you can backreference your matched pattern ($0 for the whole match, $Number for captured groups, ${Name} for named capturing groups), but also add whatever characters you want.
So you can capture everything you want to keep in groups and then backreference it, as recommended in Wiktor's comment.
As you are doing a single replace, I would also recommend using the class function TRegex.Replace instead of creating the regex and then replacing.
Memo1.Lines.Add(
TRegex.Replace(
Edit1.Text,
'(#F\d{3}\(\s*\d{1,5}\s*,\s*)\d{1,5}(\s*,)',
'$1Fred$2'));
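The same capture-and-backreference idea can be tried outside Delphi. The sketch below uses JavaScript only as a convenient test bed; it is not the TRegEx API. Its replacement strings happen to use the same $1 and $2 syntax, and its replacement callback plays the role of an evaluator.

```javascript
const source = '#F037(594,2027,-99,-99,0,0,0,0)';

// Group 1 keeps everything before the second number, group 2 the comma after it.
const pattern = /(#F\d{3}\(\s*\d{1,5}\s*,\s*)\d{1,5}(\s*,)/;

// Replacement string: $1 and $2 re-insert the captured parts around the new value.
const viaString = source.replace(pattern, '$1Fred$2');

// Callback form (the analogue of an evaluator): rebuild the whole match from its groups.
const viaCallback = source.replace(pattern, (whole, before, after) => before + 'Fred' + after);

console.log(viaString);   // #F037(594,Fred,-99,-99,0,0,0,0)
console.log(viaCallback); // #F037(594,Fred,-99,-99,0,0,0,0)
```

Either way the whole match is still replaced; the groups only put back the parts that should survive.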
PCRE regex also supports \K (omits everything matched before) and lookaheads, which can be used to capture exactly what you want to replace, like
Memo1.Lines.Add(
TRegex.Replace(
Edit1.Text,
'#F\d{3}\(\s*\d{1,5}\s*,\s*\K\d{1,5}(?=\s*,)',
'Fred'));

regex expression for selecting a value

I want to write a regexp for the SIP message below that extracts a number:
< sip:callpark#as1sip1.com:5060;user=callpark;service=callpark;preason=park;paction=park;ptoken=150009;pautortrv=180;nt_server_host=47.168.105.100:5060 >
(Actually there are "<" and ">" signs in the message, but the site does not let me write)
In this case, I want to select the ptoken value. I wrote an expression such as ptoken=(.*);p but it returns ptoken=150009;p, and I just need the number: 150009
How do I write a regexp for this case?
PS: I write this for XML script..
Thanks,
I solved the problem by using two regexes:
<ereg assign_to="token" check_it="true" header="Refer-To:" regexp="(ptoken=([\d]*))" search_in="hdr"/>
<ereg assign_to="callParkToken" search_in="var" variable="token" check_it="true" regexp="([\d].*)" />
You could use the following regex:
ptoken=(\d+)
# searches for ptoken= literally
# captures every digit found in the first group
Your wanted numbers are in the first group then. Take a look at this demo on regex101.com. Depending on your actual needs, there could be better approaches (Xpath? as tagged as XML) though.
You should use lookahead and lookbehind:
(?<=ptoken=)(.+?)(?=;)
It captures any characters (.+?) that are preceded by ptoken= and followed by ;
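The difference between the whole match and the first capture group, which is what the question stumbles over, can be seen in a short JavaScript sketch. JavaScript is used here only for illustration; the XML tool from the question may use a different regex flavor, and lookbehind in particular is an assumption about engine support.

```javascript
const sip = '< sip:callpark#as1sip1.com:5060;user=callpark;service=callpark;preason=park;paction=park;ptoken=150009;pautortrv=180;nt_server_host=47.168.105.100:5060 >';

// Capture-group variant: index 0 is the whole match, index 1 is the group.
const withGroup = sip.match(/ptoken=(\d+)/);
const wholeMatch = withGroup ? withGroup[0] : null; // 'ptoken=150009'
const tokenFromGroup = withGroup ? withGroup[1] : null; // '150009'

// Lookaround variant: the whole match is already just the value.
const withLookaround = sip.match(/(?<=ptoken=).+?(?=;)/);
const tokenFromLookaround = withLookaround ? withLookaround[0] : null; // '150009'

console.log(wholeMatch, tokenFromGroup, tokenFromLookaround);
```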
The <ereg ... > action has the assign_to parameter. In your case assign_to="token". In fact, the parameter can receive several variable names. The first is assigned the whole string matching the regular expression, and the following are assigned the "capture groups" of the regular expression.
If your regexp is ptoken=([\d]*), the whole match includes ptoken which is bad. The first capture group is ([\d]*) which is the required value. Thus, use <ereg regexp="ptoken=([\d]*)" assign_to="dummyvar,token" ..other parameters here.. >.
Is it working?

How to match a string that does not end in a certain substring?

How can I write a regular expression that matches strings which do not end with a certain substring?
In my project, all classes whose names don't end with certain strings, such as "controller" and "map", should inherit from a base class. How can I do this using a regular expression?
but using both
public*.class[a-zA-Z]*(?<!controller|map)$
public*.class*.(?<!controller)$
there isnt any match case!!!
Do a search for all filenames matching this:
(?<!controller|map|anythingelse)$
(Remove the |anythingelse if no other keywords, or append other keywords similarly.)
If you can't use negative lookbehinds (the (?<!..) bit), do a search for filenames that do not match this:
(?:controller|map)$
And if that still doesn't work (might not in some IDEs), remove the ?: part and it probably will - that just makes it a non-capturing group, but the difference here is fairly insignificant.
If you're using something where the full string must match, then you can just prefix either of the above with ^.* to do that.
Update:
In response to this:
but using both
public*.class[a-zA-Z]*(?<!controller|map)$
public*.class*.(?<!controller)$
there isnt any match case!!!
Not quite sure what you're attempting with the public/class stuff there, so try this:
public.*class.*(?<!controller|map)$
The . is a regex char that means "anything except newline", and the * means zero or more times.
If this isn't what you're after, edit the question with more details.
Depending on your regex implementation, you might be able to use a lookbehind for this task. This would look like
(?<!SomeText)$
This matches any lines NOT having "SomeText" at their end. If you cannot use that, the expression
^(?!.*SomeText$).*$
matches any lines not ending with "SomeText" as well (including empty lines).
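The variants discussed so far can be compared side by side. This is a sketch in JavaScript, whose engine accepts alternatives of different lengths inside a lookbehind; many other engines do not. The sample names are made up for illustration.

```javascript
const names = ['usercontroller', 'sitemap', 'invoice', 'mapper'];

// Variant 1: negative lookbehind anchored at the end of the string.
const lookbehind = /(?<!controller|map)$/;

// Variant 2: negative lookahead from the start, for engines without lookbehind.
const lookahead = /^(?!.*(?:controller|map)$).*$/;

// Variant 3: match the unwanted suffix and invert the result in code.
const suffix = /(?:controller|map)$/;

const keptByLookbehind = names.filter(n => lookbehind.test(n));
const keptByLookahead = names.filter(n => lookahead.test(n));
const keptByInversion = names.filter(n => !suffix.test(n));

console.log(keptByLookbehind, keptByLookahead, keptByInversion);
```

All three should keep invoice and mapper and drop the two names that end in controller or map.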
You could write a regex that contains two groups, one consists of one or more characters before controller or map, the other contains controller or map and is optional.
^(.+?)(controller|map)?$
The first group is lazy (.+?) so that it does not swallow the suffix itself. With that you can match your string, and if the regex API you use has a group() method and group(2) is empty, the string does not end with controller or map.
Check if the name does not match [a-zA-Z]*controller or [a-zA-Z]*map.
Finally, I did it this way:
public.*class.*[^(controller|map|spec)]$
It worked.
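For comparison, a JavaScript sketch of how the bracket version behaves next to the lookbehind version suggested earlier in the thread. Inside [...] the parentheses and | are literal characters, so the class excludes single characters rather than whole words. The sample declarations are made up for illustration.

```javascript
const declarations = ['public class Foo', 'public class Thing', 'public class usercontroller'];

// Bracket expression: rejects any line whose LAST CHARACTER is one of
// ( c o n t r l e | m a p s ), whatever word it belongs to.
const bracket = /public.*class.*[^(controller|map|spec)]$/;

// Negative lookbehind: rejects only lines that end with one of the whole words.
const lookbehind = /public.*class.*(?<!controller|map|spec)$/;

const byBracket = declarations.filter(d => bracket.test(d));
const byLookbehind = declarations.filter(d => lookbehind.test(d));

console.log(byBracket);    // [ 'public class Thing' ]
console.log(byLookbehind); // [ 'public class Foo', 'public class Thing' ]
```

Both versions reject the name ending in controller, but the bracket version also drops Foo because it ends in the letter o.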