Regex - Multiline extraction


Using the regex below, I'm able to extract the model_name value when nfc_support has value="true" in a few instances. However, I can't get it to match in other instances, as shown below. Any help in getting it to match in both instances would be greatly appreciated.
EX:
<capability name=\"model_name\"[A-Za-z1-9"=();,._/<>\s]*<capability name=\"nfc_support\" value=\"true\"/>
Will work with:
<capability name="model_name" value="T11"/>
<capability name="brand_name" value="Turkcell"/>
<capability name="marketing_name" value="Campaign"/>
</group>
<group id="chips">
<capability name="nfc_support" value="true"/>
</group>
But cannot match this:
<capability name="model_name" value="U8650"/>
<capability name="brand_name" value="Huawei"/>
<capability name="marketing_name" value="Sonic"/>
</group>
<group id="chips">
<capability name="nfc_support" value="true"/>

Your regex will match everything between the first model_name and the last nfc_support="true", because the * quantifier is greedy. This is a problem if nfc_support occurs more than once in the string you apply the regex to, since the match keeps extending to the last <capability name="nfc_support" value="true"/>. To match text that may appear multiple times, a better practice is to use the reluctant (lazy) quantifier *?, which stops at the first possible match and so avoids matching too much.
Assuming all lines will follow a format of model_name, brand_name, marketing_name, /group, group id, then nfc_support, a regex that enforces this format is:
(?s)<capability name=\"model_name\" value=\"(.*?)\"/>\n<capability name=\"brand_name\" value=\"(.*?)\"/>\n<capability name=\"marketing_name\" value=\"(.*?)\"/>\n</group>\n<group id=\"chips\">\n<capability name=\"nfc_support\" value=\"true\"/>
Apologies in advance if there are typos in this regex, but you get the gist of it...
This regex will store the values of model_name, brand_name, and marketing_name in groups $1, $2, and $3, respectively, only if nfc_support is "true". The (?s) flag enables dot-all mode, so . also matches line breaks.
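As a rough Python sketch of the idea above: the strict, format-enforcing pattern is compared with a looser lazy pattern on two device blocks. The `device` helper, the `nfc_support="false"` value on the first block, and the swap of `.*?` for `[^"]*` (so a capture cannot run past its closing quote) are assumptions made for illustration and are not part of the answer.

```python
import re

# Assumption: [^"]* stands in for the answer's (.*?) so each capture stops at its closing quote.
STRICT = re.compile(
    r'<capability name="model_name" value="([^"]*)"/>\n'
    r'<capability name="brand_name" value="([^"]*)"/>\n'
    r'<capability name="marketing_name" value="([^"]*)"/>\n'
    r'</group>\n'
    r'<group id="chips">\n'
    r'<capability name="nfc_support" value="true"/>'
)

# A looser lazy pattern: it starts at the first model_name it sees,
# even if the nfc_support="true" it finds belongs to a later device.
LOOSE = re.compile(
    r'(?s)<capability name="model_name" value="([^"]*)"'
    r'.*?<capability name="nfc_support" value="true"/>'
)


def device(model, brand, marketing, nfc):
    """Build one device block in the layout the answer assumes (hypothetical helper)."""
    return "\n".join([
        f'<capability name="model_name" value="{model}"/>',
        f'<capability name="brand_name" value="{brand}"/>',
        f'<capability name="marketing_name" value="{marketing}"/>',
        '</group>',
        '<group id="chips">',
        f'<capability name="nfc_support" value="{nfc}"/>',
        '</group>',
    ])


# Made-up input: the first device has NFC disabled, the second has it enabled.
text = "\n".join([
    device("U8650", "Huawei", "Sonic", "false"),
    device("T11", "Turkcell", "Campaign", "true"),
])

strict_hits = STRICT.findall(text)
loose_hits = LOOSE.findall(text)
print(strict_hits)  # [('T11', 'Turkcell', 'Campaign')]
print(loose_hits)   # ['U8650'] (wrong device: the match ran on into the next block)
```

The lazy quantifier alone does not tie nfc_support to the right model_name; spelling out the lines in between is what does that.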

Forgive me if I'm wrong, but it looks like your expression of:
[A-Za-z1-9"=();,._/<>\s]
does not account for a 0 in your character class (showing as 1-9) and should thus be:
[A-Za-z0-9"=();,._/<>\s]
EDIT: This is in regard to your example of a non-match for "model_name" value="U8650"
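A small Python check of this point, assuming the two sample blocks from the question are tested as separate strings (the `block` helper is hypothetical):

```python
import re

NFC = r'<capability name="nfc_support" value="true"/>'
# The question's character class (1-9) versus the corrected one (0-9).
ORIGINAL = re.compile(r'<capability name="model_name"[A-Za-z1-9"=();,._/<>\s]*' + NFC)
CORRECTED = re.compile(r'<capability name="model_name"[A-Za-z0-9"=();,._/<>\s]*' + NFC)


def block(model, brand, marketing):
    """Rebuild one sample block from the question (hypothetical helper)."""
    return "\n".join([
        f'<capability name="model_name" value="{model}"/>',
        f'<capability name="brand_name" value="{brand}"/>',
        f'<capability name="marketing_name" value="{marketing}"/>',
        '</group>',
        '<group id="chips">',
        '<capability name="nfc_support" value="true"/>',
    ])


samples = [block("T11", "Turkcell", "Campaign"), block("U8650", "Huawei", "Sonic")]

original_hits = [bool(ORIGINAL.search(s)) for s in samples]
corrected_hits = [bool(CORRECTED.search(s)) for s in samples]
print(original_hits)   # [True, False]  the 0 in U8650 is outside the 1-9 range
print(corrected_hits)  # [True, True]
```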

Related

How to get the index by regular expression in ANT

I have a string with a version such as .v_september (it varies every month). From this I want to take the first three letters after the underscore, i.e. "sep".
Using the regex .v_(.*) I get the complete month, but I can't get just the first three letters.
Can someone help me achieve this in Apache Ant?
Thanks !

Regex functions on properties are a bit awkward in native Ant (as opposed to working with text within files). Ant-contrib has the propertyregex task, but I try to avoid ant-contrib whenever possible.
Instead, it can be accomplished with the loadresource task and a nested filter chain:
<property name="version" value=".v_september" />
<loadresource property="version.month.short">
  <propertyresource name="version" />
  <filterchain>
    <tokenfilter>
      <replaceregex pattern="\.v_(.{3}).*" replace="\1" />
    </tokenfilter>
  </filterchain>
</loadresource>
<echo message="${version.month.short}" />
Regarding the regex pattern, note how it needs to end with .*. This is because Ant doesn't have a "match" function that simply returns the content of a capture group. It's just running a replacement, so we need to replace everything in the string that isn't part of the group.
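The same replace-only behaviour can be seen outside Ant. This is a minimal Python sketch using `re.sub` as a stand-in for Ant's replacement step; it is not Ant code:

```python
import re

version = ".v_september"

# With the trailing .* the whole string is consumed and replaced by group 1.
with_tail = re.sub(r"\.v_(.{3}).*", r"\1", version)

# Without it, only ".v_sep" is replaced and the rest of the string is left in place.
without_tail = re.sub(r"\.v_(.{3})", r"\1", version)

print(with_tail)     # sep
print(without_tail)  # september
```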

.* will capture everything; to limit the capture to three characters, write {3} instead of *. You should also escape the . at the beginning of your regex so that it matches only a literal dot. You can use this regex and read the result from group 1:
\.v_(.{3})
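For illustration, a short Python sketch of capturing from group 1 and of why the leading dot should be escaped. The input "rev_october" is a made-up string chosen to show the difference:

```python
import re

# Escaped dot: group 1 holds exactly three characters after ".v_".
match = re.search(r"\.v_(.{3})", ".v_september")
short_month = match.group(1) if match else None
print(short_month)  # sep

# Unescaped dot matches any character, so "ev_" in "rev_october" is accepted too.
unescaped = re.search(r".v_(.{3})", "rev_october")
escaped = re.search(r"\.v_(.{3})", "rev_october")
print(unescaped.group(1))  # oct
print(escaped)             # None
```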

regex match all xml tags that contain a certain attribute value

I have an xml file where I want to match all xml tags that contain an attribute matching a certain string in Perl.
Sample XML:
<item attr="Car" />
<item attr="Apple_And_Pears.htm#123" />
<item attr="Paper" />
<item attr="Orange_And_Peach.htm#213" />
I want a regex that grabs all nodes that have an attribute containing ".htm":
<item attr="Orange_And_Peach.htm#213" />
<item attr="Apple_And_Pears.htm#123" />
With the following regex, I'm matching with all tags rather than only tags with .htm attribute:
<item.*?attr="[^>]*>
Is there some sort of positive lookahead until a certain character?
Thanks

The appropriate Perl solution is not regex. With Mojo::DOM (one of many options):
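use strict;
use warnings;
use Mojo::DOM;
use File::Slurper 'read_text';
# Read the whole file and parse it in XML mode.
my $xml = read_text 'test.xml';
my $dom = Mojo::DOM->new->xml(1)->parse($xml);
# CSS attribute selector: attr contains the substring ".htm".
my $tags = $dom->find('item[attr*=".htm"]');
print "$_\n" for @$tags;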

As Grinnz suggested, you should use an appropriate XML parser (check out this interesting post on Stack Overflow explaining why), but since you asked for it, here's a simple regex you could use with a positive lookahead:
<item.*?attr=".*(?=\.htm).*
If you want to match tags with only one ".htm" in it, you can use both a negative and positive lookaround:
^(?:(?!\.htm).)*\.htm(?!.*\.htm).*$
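A quick Python comparison of the two approaches, assuming the sample items are wrapped in a hypothetical `<items>` root element so that the standard-library parser accepts them:

```python
import re
import xml.etree.ElementTree as ET

# Assumption: an <items> root is added because the sample has no single root element.
SAMPLE = """<items>
<item attr="Car" />
<item attr="Apple_And_Pears.htm#123" />
<item attr="Paper" />
<item attr="Orange_And_Peach.htm#213" />
</items>"""

# Parser approach: inspect the attribute value directly.
root = ET.fromstring(SAMPLE)
parser_hits = [
    item.get("attr")
    for item in root.iter("item")
    if ".htm" in item.get("attr", "")
]

# Regex approach from the answer above; "." does not cross line breaks here,
# so each match stays on a single line.
regex_hits = re.findall(r'<item.*?attr=".*(?=\.htm).*', SAMPLE)

print(parser_hits)  # ['Apple_And_Pears.htm#123', 'Orange_And_Peach.htm#213']
print(regex_hits)
```

The regex only behaves here because every tag sits on its own line; the parser version does not depend on layout.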

Regular Expressions: Lookback to only the first occurrence (non-greedy lookback?)

Here's the problem:
XML:
<userPermissions>
<enabled>true</enabled>
<name>ViewPublicReports</name>
</userPermissions>
<userPermissions>
<enabled>true</enabled>
<name>ViewRoles</name>
</userPermissions>
<userPermissions>
<enabled>true</enabled>
<name>ViewSetup</name>
</userPermissions>
What I'm trying to match is:
<userPermissions>
<enabled>true</enabled>
<name>ViewRoles</name>
</userPermissions>
All the patterns I've managed to put together match all the way back to the first <userPermissions>:
(?<=<userPermissions>)[\s\S]+?ViewRoles[\s\S]*?<\/userPermissions>
Not quite sure how to make the backwards match from "ViewRoles" non-greedy.
Thanks in advance for your help.
*Edit: I'm using a tool that deploys metadata between Salesforce instances, which are captured as XML. The tool provides a "find/replace" functionality that uses regex for the "find." I don't have the option of using an XML parser.

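This pattern matches only that tag:
<userPermissions>(?:(?!</userPermissions>)[\S\s])*?ViewRoles[\S\s]*?</userPermissions>
The negative lookahead stops the lazy run from crossing a closing </userPermissions>, so the match cannot begin in an earlier block.
Formatted (the added whitespace is for readability only, as in free-spacing mode):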
<userPermissions>
(?:
(?! </userPermissions> )
[\S\s]
)*?
ViewRoles
[\S\s]*?
</userPermissions>
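To see the difference between the two patterns, here is a minimal Python sketch run against the three blocks from the question (the `BLOCK` template is just a convenience for rebuilding that sample):

```python
import re

# Rebuild the question's sample: three userPermissions blocks.
BLOCK = "<userPermissions>\n<enabled>true</enabled>\n<name>{}</name>\n</userPermissions>"
xml = "\n".join(BLOCK.format(n) for n in ("ViewPublicReports", "ViewRoles", "ViewSetup"))

# The question's pattern: lazy, but free to cross block boundaries.
ORIGINAL = re.compile(r"(?<=<userPermissions>)[\s\S]+?ViewRoles[\s\S]*?<\/userPermissions>")

# The answer's pattern: each step checks that no closing tag starts here.
TEMPERED = re.compile(
    r"<userPermissions>(?:(?!</userPermissions>)[\S\s])*?ViewRoles[\S\s]*?</userPermissions>"
)

original_match = ORIGINAL.search(xml).group(0)
tempered_match = TEMPERED.search(xml).group(0)

print("ViewPublicReports" in original_match)  # True: the match started in the first block
print(tempered_match)                         # only the ViewRoles block
```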

As has already been said, the correct way to extract this would be to use an XML parser. However, you can also use the following regex:
(.+\n){2}.+ViewRoles.+\n.+
Which actually matches the following structure:
2 rows without restrictions
a row that includes "ViewRoles"
another row without restrictions

How to convert lines on the basis of pattern?

I have data that I need to convert into a pattern.
The input is a list separated by some delimiter (chosen so that it's easy to find and replace), e.g. a comma:
food,apple,10,10
sweets,candy,20,20
I want to convert it to XML:
<Item>
<Product type="food" name="apple" price="10" quantity="10">
</Item>
<Item>
<Product type="sweets" name="candy" price="20" quantity="20">
</Item>

You need a regular expression find/replace:
Use the find Dialog, replace tab:
Find What: ^([^,]*),([^,]*),([^,]*),([^\r\n]*)(\R)*
Replace with: <Item>\5 <Product type="\1" name="\2" price="\3" quantity="\4"> \5</Item>\5
check Regular expression in the lower left
press Replace All
Explanation:
the find splits each line into its comma-separated parts and captures the values in \1 to \4
\5 captures the line break
the replacement puts the captured values inside the XML node

Replace: (\w+),(\w+),(\w+),(\w+)
with : <Item>\n <Product type="\1" name="\2" price="\3" quantity="\4">\n</Item>
Please check out this demo.
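The same find/replace can be tried outside an editor. This is a rough Python equivalent of the second answer's pattern, using the two sample rows from the question; the Product tag is left unclosed exactly as in the requested output:

```python
import re

rows = "food,apple,10,10\nsweets,candy,20,20"

# \1..\4 are the four comma-separated fields; \n in the template becomes a line break.
xml = re.sub(
    r"(\w+),(\w+),(\w+),(\w+)",
    r'<Item>\n  <Product type="\1" name="\2" price="\3" quantity="\4">\n</Item>',
    rows,
)
print(xml)
```

Note that \w+ only covers letters, digits, and underscores, so fields containing spaces or punctuation would need a pattern such as [^,]* instead.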

Regex: Skip/Ignore pattern

Given that the following string is embedded in text, how can I extract the whole line but not matching on the inner "<" and ">"?
<test type="yippie<innertext>" />
EDIT:
Being more specific, we need to handle both use cases below where "type" has or does not have "<" and ">" chars.
<h:test type="yippie<innertext>" />
<h:test type="yippie">
Group 1: 'h:test'
Group 2: ' type="yippie<innertext>" ' -or- ' type="yippie"' (ie, remaining content before ">" or "/>")
So far, I have something like this, but it's a little off: Group 2 stops at the first ">". I'm tweaking the first part of Group 2's condition.
(<([a-zA-Z0-9_:-]+)([^>"]*|[^>]*?)\s*(/)?>)
Thanks for your help.

Try this:
<([:\w]+)(\s(?:"[^"]*"|[^/>"])+)/?>
Example usage (Python):
>>> import re
>>> x = '<h:test type="yippie<innertext>" />'
>>> re.search(r'<([:\w]+)(\s(?:"[^"]*"|[^/>"])+)/?>', x).groups()
('h:test', ' type="yippie<innertext>" ')
Also note that if your document is HTML or XML then you should use an HTML or XML parser instead of trying to do this with regular expressions.
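As a follow-up sketch, the same pattern can be run against both use cases from the question's edit (the `TAG` and `samples` names are only for illustration):

```python
import re

# Quoted attribute values are consumed whole, so "<" and ">" inside them are skipped.
TAG = re.compile(r'<([:\w]+)(\s(?:"[^"]*"|[^/>"])+)/?>')

samples = [
    '<h:test type="yippie<innertext>" />',
    '<h:test type="yippie">',
]

results = [TAG.search(s).groups() for s in samples]
for groups in results:
    print(groups)
# ('h:test', ' type="yippie<innertext>" ')
# ('h:test', ' type="yippie"')
```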

It looks like you are trying to parse XML/HTML with a regex. I would say that your approach is fundamentally wrong. A sufficiently advanced regex is not indistinguishable from an XML parser. After all, what if you needed to parse:
<test type="yippie<inner\"text\"_with_quotes,_literal_slash_and_quote\\\">" />
Furthermore, you probably need to escape the inner < and > as &lt; and &gt;
For further reasons why you should not parse XML with a regex, I can only yield to this superior answer:
RegEx match open tags except XHTML self-contained tags
