Regex (repeating group) does not seem to work - regex

I'm trying to match the tag name and any attributes of an XML tag.
Here is an example:
https://regex101.com/r/mZYvGU/1/
^([\S]+)(?:\s*([\S]+)="([^"]+)")*
test string:
tag att1="1" att2="2"
The problem is that it always captures only the last attribute.
Thanks for help.

You can follow my example:
let s = `tag att1="1" att2="2" att3="34"`;
let regex = /(.*?)(?<=\s)(\w+)="(\w+)"/ig;
let result = [];
s.replace(regex, function(match,tag,att_name,att_value){
if(tag.trim()!==''){
result.push(tag.trim());
result.push(att_name); //att1
result.push(att_value); //1
}else{
result.push(att_name); //att2 and att3
result.push(att_value); //2 and 34
};
});
console.log(result);
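The behaviour in the question is expected: when a capturing group sits inside a repeated group, most regex engines (JavaScript included) keep only the text from the last iteration. A minimal sketch of one way around it, assuming the attribute values are always double-quoted: match the tag name once, then walk the attributes with a separate global regex. The helper name parseTag is hypothetical and not taken from the answer above.

```javascript
// Hypothetical helper: split "tag att1="1" att2="2"" into a name and attribute pairs.
function parseTag(s) {
  // The tag name is the first run of non-whitespace characters.
  const tagMatch = /^(\S+)/.exec(s);
  if (!tagMatch) return null;

  // Each iteration of matchAll yields one attribute, so nothing is overwritten.
  const rest = s.slice(tagMatch[0].length);
  const attrs = [];
  for (const m of rest.matchAll(/([^\s=]+)="([^"]*)"/g)) {
    attrs.push([m[1], m[2]]); // [name, value]
  }
  return { tag: tagMatch[1], attrs };
}

console.log(parseTag('tag att1="1" att2="2" att3="34"'));
```

Returning name/value pairs rather than one flat array keeps each attribute's name and value together.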

Related

How to get multiple regex on same string in scala

My requirement is to get multiple regex patterns in a given String.
"<a href=\"https://page1.google.com/ab-cd/ABCDEF\”>Hello</a> hiiii <a href=\"https://page2.yahoo.com/gr\”>page</a><img src=\"https://image01.google.com/gr/content/attachment/987654321\” alt=\”demo image\”></a><a href=\"https://page3.google.com/hr\">"
With this below code:
val p = Pattern.compile("href=\"(.*?)\"")
val m = p.matcher(str)
while(m.find()){
println(m.group(1))
}
I am getting output:
https://page1.google.com/ab-cd/ABCDEF
https://page2.yahoo.com/gr
https://page3.google.com/hr
With change in Pattern:
val p = Pattern.compile("img src=\"(.*?)\"")
I am getting output:
https://image01.google.com/gr/content/attachment/987654321
But with Pattern:
val p = Pattern.compile("href=\"(.*?)\"|img src=\"(.*?)\"")
I am getting output:
https://page1.google.com/ab-cd/ABCDEF
https://page2.yahoo.com/gr
null
https://page3.google.com/hr
Please let me know how to match multiple regex patterns, or whether there is any other easy way to do this.
Thanks
You may use
val rx = "(?:href|img src)=\"(.*?)\"".r
val results = rx.findAllMatchIn(s).map(_ group 1)
// println(results.mkString(", ")) prints:
// https://page1.google.com/ab-cd/ABCDEF,
// https://page2.yahoo.com/gr,
// https://image01.google.com/gr/content/attachment/987654321,
// https://page3.google.com/hr
Details
(?:href|img src)=\"(.*?)\" matches either href or img src, then =", then captures any 0+ chars other than line break chars, as few as possible, into Group 1, and then matches the closing ".
With .findAllMatchIn, you get all matches, then .map(_ group 1) fetches only the Group 1 values.
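The stray null in the question's output comes from reading group 1 on a match produced by the second alternative, which fills group 2 instead. The same effect can be sketched in JavaScript, where an unmatched group is undefined rather than null. The function names and sample URLs below are made up for illustration.

```javascript
// Two capturing groups: only one of them is set for any given match.
function extractWithTwoGroups(text) {
  const re = /href="(.*?)"|img src="(.*?)"/g;
  return Array.from(text.matchAll(re), m => (m[1] !== undefined ? m[1] : m[2]));
}

// One capturing group shared by both alternatives, as in the answer above.
function extractWithOneGroup(text) {
  const re = /(?:href|img src)="(.*?)"/g;
  return Array.from(text.matchAll(re), m => m[1]);
}

// Hypothetical sample input.
const sampleHtml =
  '<a href="https://page1.example/a">x</a><img src="https://img.example/1" alt="demo">';
console.log(extractWithTwoGroups(sampleHtml));
console.log(extractWithOneGroup(sampleHtml));
```

Both functions return the same list; the single-group version simply avoids having to check which group matched.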

How to build regex to capture all possible matching group

I have a string which contains the data in xml format like as
str = "<p><a>_a_10gd_</a><a>_a_xy8a_</a><a>_a_1020_</a><a>_a_dfa7_</a><a>_a_ABCD_</a></p>";
I want to capture the value between _a_ and the trailing _ from every possible match. This is what I have tried.
Let's say I am doing this in JavaScript:
var regex = /_a_(.+)_/g ;
var str = "<a>_a_10gd_</a><a>_a_xy8a_</a><a>_a_1020_</a><a>_a_dfa7_</a><a>_a_ABCD_</a>";
while(m = regex.exec(str)){
console.log(m[1]); // m[1] should contain each match
}
I want to get all the matching groups in an array like this:
var a = ['10gd', 'xy8a', '1020', 'dfa7', 'ABCD'];
Please tell me what regex is required, and explain it as well, because I am new to regex and capturing groups.
Just change (.+) to (.+?), which makes the quantifier lazy:
var regex = /_a_(.+?)_/g ;
var str = "<a>_a_10gd_</a><a>_a_xy8a_</a><a>_a_1020_</a><a>_a_dfa7_</a><a>_a_ABCD_</a>";
while(m = regex.exec(str)){
console.log(m[1]); // m[1] contains each match
}
For more information about greediness, see "What do lazy and greedy mean in the context of regular expressions?"
Another option is to accept any character except _ before the closing _ (instead of the . you used), like so:
var regex = /_a_([^_]+)_/g ;
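To see the difference between the three patterns side by side, here is a small sketch that collects group 1 of every match into an array. The helper name captureAll is an assumption for illustration; it relies on String.prototype.matchAll, which needs a reasonably modern engine.

```javascript
const str = "<a>_a_10gd_</a><a>_a_xy8a_</a><a>_a_1020_</a><a>_a_dfa7_</a><a>_a_ABCD_</a>";

// Hypothetical helper: return group 1 of every match of a global regex.
function captureAll(regex, text) {
  return Array.from(text.matchAll(regex), m => m[1]);
}

const lazy = captureAll(/_a_(.+?)_/g, str);       // stops at the first following _
const negated = captureAll(/_a_([^_]+)_/g, str);  // cannot cross an _ at all
const greedy = captureAll(/_a_(.+)_/g, str);      // runs to the last _ in the string

console.log(lazy);
console.log(negated);
console.log(greedy.length); // a single, over-long match
```

The negated character class is usually the safer choice, since it cannot run past a delimiter even when the input is malformed.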

Write regex to match markup element like attribute/values even ones not wrapped in quotes

Say I have
<div class="doublequotes"></div>
<div class='simplequotes'></div>
<customElement data-attr-1=no quotes data-attr-2 = again no quotes/>
I would like to see a nice regex to grab all attribute/value pairs above as follows:
class, doublequotes
class, simplequotes
data-attr-1, no quotes
data-attr-2, again no quotes
Please note the following in the setup:
presence of both single/double quotes to wrap values
possible absence of any quote
possible absence of any quote + multiple-word value
Here is a solution, written in Javascript so you can try it out right here, that separates into tags and then attributes, which allows retaining the parent tag (if you don't want that, don't use tag[1]).
A main reason this extracts tags and then attributes is so we don't find false "attributes" outside the tags. Note how the look="a distraction" part is not included in the parsed output.
<textarea id="test" style="width:100%;height:11ex">
<div class="doublequotes"> look="a distraction" </div><div class='simplequotes'></div>
<customElement data-attr-1=no quotes data-attr-2 = again no quotes/>
<t key1="value1" key2='value2' key3 = value3 key4 = v a l u e 4 key5 = v a l u e 5 />
Poorly nested 1 (staggered tags): <a1 b1=c1>foo<d1 e1=f1>bar</a1>baz</d1>
Poorly nested 2 (nested tags): <a2 b2=c2 <d2 e2=f2>>
</textarea>
<script type="text/javascript">
function parse() {
var xml = document.getElementById("test").value; // grab the above text
var out = ""; // assemble the output
tag_re = /<([^\s>]+)(\s[^>]*\s*\/?>)/g; // each tag as (name) and (attrs)
// each attribute, leaving room for future attributes
attr_re = /([^\s=]+)\s*=\s*("[^"]*"|'[^']*'|[^'"=\/>]*?[^\s\/>](?=\s+\S+\s*=|\s*\/?>))/g;
while(tag = tag_re.exec(xml)) { // for each tag
while (attr = attr_re.exec(tag[2])) { // for each attribute in each tag
out += "\n" + tag[1] + " -> " + attr[1] + " -> "
+ attr[2].replace(/^(['"])(.*)\1$/,"$2"); // remove quotes
}
};
document.getElementById("output").innerHTML = out.replace(/</g,"&lt;");
}
</script>
<button onclick="parse()" style="float:right;margin:0">Parse</button>
<pre id="output" style="display:table"></pre>
I am not sure how complete this is since you haven't explicitly stated what is and is not valid. The comments to the question already establish that this is neither HTML nor XML.
Update: I added two nesting tests, both of which are invalid in XHTML, as an attempt to answer the comment about imbricated elements. This code does not recognize <d2 as a new element because it is inside another element and is therefore assumed to be part of the value of the b2 attribute. Because this included < and > characters, I had to HTML-escape the <s before rendering the output into the <pre> tag (this is the final replace() call).
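The same two-stage idea (find each tag first, then find attributes only inside that tag) can also be sketched without the DOM, for example to try it in Node.js. This reuses the two regexes from the snippet above unchanged; the function name parseAttributes and the shortened sample input are assumptions made for illustration.

```javascript
// Illustrative helper: returns [tagName, attrName, attrValue] triples
// for every attribute found inside a tag.
function parseAttributes(markup) {
  const tagRe = /<([^\s>]+)(\s[^>]*\s*\/?>)/g; // each tag as (name) and (attrs)
  const attrRe = /([^\s=]+)\s*=\s*("[^"]*"|'[^']*'|[^'"=\/>]*?[^\s\/>](?=\s+\S+\s*=|\s*\/?>))/g;
  const triples = [];
  for (const tag of markup.matchAll(tagRe)) {
    // Only the attribute part of the tag is scanned, so text between tags is ignored.
    for (const attr of tag[2].matchAll(attrRe)) {
      const value = attr[2].replace(/^(['"])(.*)\1$/, "$2"); // strip surrounding quotes
      triples.push([tag[1], attr[1], value]);
    }
  }
  return triples;
}

const sample =
  '<div class="doublequotes"> look="a distraction" </div>' +
  "<customElement data-attr-1=no quotes data-attr-2 = again no quotes/>";
console.log(parseAttributes(sample));
```

Because attributes are searched only within the text captured for each tag, the look="a distraction" text between the div tags never reaches the attribute regex.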
After more than a few tweaks, I have managed to build something:
([0-9a-zA-z-]+)\s*=\s*(("([^">]*)")|('([^'>]*)')|(([^'"=>\s]+\s)\s*(?![^\s]*=))*)?
This should deal reasonably even with something like
<t key1="value1" key2='value2' key3 = value3 key4 = v a l u e 4 key5 = v a l u e 5 />

Selectively uppercasing a string

I have a string with some XML tags in it, like:
"hello <b>world</b> and <i>everyone</i>"
Is there a good Scala/functional way of uppercasing the words, but not the tags, so that it looks like:
"HELLO <b>WORLD<b> AND <i>EVERYONE</i>"
We can use dustmouse's regex to replace all the text in/outside XML tags with Regex.replaceAllIn. We can get the matched text with Regex.Match.matched which then can easily be uppercased using toUpperCase.
val xmlText = """(?<!<|<\/)\b\w+(?!>)""".r
val string = "hello <b>world</b> and <i>everyone</i>"
xmlText.replaceAllIn(string, _.matched.toUpperCase)
// String = HELLO <b>WORLD</b> AND <i>EVERYONE</i>
val string2 = "<h1>>hello</h1> <span>world</span> and <span><i>everyone</i>"
xmlText.replaceAllIn(string2, _.matched.toUpperCase)
// String = <h1>>HELLO</h1> <span>WORLD</span> AND <span><i>EVERYONE</i>
Using dustmouse's updated regex:
val xmlText = """(?:<[^<>]+>\s*)(\w+)""".r
val string3 = """<h1>>hello</h1> <span id="test">world</span>"""
xmlText.replaceAllIn(string3, m =>
m.group(0).dropRight(m.group(1).length) + m.group(1).toUpperCase)
// String = <h1>>hello</h1> <span id="test">WORLD</span>
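For comparison, here is a sketch of a different technique in JavaScript, not the regex used in the answers above: match either a whole tag or a run of text between tags, and let a replacement callback decide which one to uppercase. The function name upperOutsideTags is hypothetical, and the sketch assumes tags never contain a literal < or > inside attribute values.

```javascript
// Hypothetical helper: uppercase text between tags, leave the tags themselves alone.
function upperOutsideTags(s) {
  // Group 1 matches a complete tag, group 2 a run of text containing no < or >.
  return s.replace(/(<[^<>]+>)|([^<>]+)/g, (match, tag, text) =>
    tag !== undefined ? tag : text.toUpperCase()
  );
}

console.log(upperOutsideTags("hello <b>world</b> and <i>everyone</i>"));
console.log(upperOutsideTags('<h1 id="test">hello</h1> <span>world</span>'));
```

Since a whole tag is consumed as one match, attribute names and values inside it are never uppercased.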
Okay, how about this. It just prints the results, and takes into consideration some of the scenarios brought up by others. Not sure how to capitalize the output without mercilessly poaching from Peter's answer:
val string = "<h1 id=\"test\">hello</h1> <span>world</span> and <span><i>everyone</i></span>"
val pattern = """(?:<[^<>]+>\s*)(\w+)""".r
pattern.findAllIn(string).matchData foreach {
m => println(m.group(1))
}
The main thing here is that it is extracting the correct capture group.
Working example: http://ideone.com/2qlwoP
Also need to give credit to the answer here for getting capture groups in scala: Scala capture group using regex

how to use regex in javascript to capture everything between two segments

Take these URLs:
http://service.com/room/dothings?adsf=asdf&dafadsf=dfasdf
http://service.com/room/saythings?adsf=asdf&dafadsf=dfasdf
Say I want to capture dothings and saythings.
I know the following regex:
/room\/(.+)\?/.exec(url)
and as the result I get this:
["room/dothings?", "dothings"]
What should I write to obtain only the captured string, as a single item in the array?
I know this doesn't answer your question, but parsing a URL with regex is not easy, and in some cases not even safe. I would do the parsing without regex.
In browser:
var parser = document.createElement('a');
parser.href = 'http://example.com/room/dothings?adsf=asdf&dafadsf=dfasdf';
In node.js:
var url = require('url');
var parser = url.parse('http://example.com/room/dothings?adsf=asdf&dafadsf=dfasdf');
And then in both cases:
console.log(parser.pathname.split('/')[2]);
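A variant of the same idea, sketched with the standard URL class, which is available in both current browsers and Node.js, so one code path covers both environments. The function name secondPathSegment is made up for illustration.

```javascript
// Hypothetical helper: return the path segment that follows "/room/".
function secondPathSegment(href) {
  const parsed = new URL(href);         // throws a TypeError on an invalid URL
  // pathname is "/room/dothings", so split gives ["", "room", "dothings"].
  return parsed.pathname.split("/")[2];
}

console.log(secondPathSegment("http://service.com/room/dothings?adsf=asdf&dafadsf=dfasdf"));
console.log(secondPathSegment("http://service.com/room/saythings?adsf=asdf&dafadsf=dfasdf"));
```

The query string never reaches the split, because pathname excludes everything from the ? onward.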
That's actually easy. You were almost there.
With all the obligatory disclaimers about parsing html in regex...
<script>
var subject = 'http://service.com/room/dothings?adsf=asdf&dafadsf=dfasdf';
var regex = /room\/(.+)\?/g;
var group1Caps = [];
var match = regex.exec(subject);
while (match != null) {
if( match[1] != null ) group1Caps.push(match[1]);
match = regex.exec(subject);
}
if(group1Caps.length > 0) document.write(group1Caps[0],"<br>");
</script>
Output: dothings
If you add more URLs to subject (one per line), you can loop over group1Caps, for example with for (key in group1Caps), and it will print all the matches.