How to convert lines on the basis of pattern? - regex

I have data that I need to convert into a pattern.
The input data is a list separated by some delimiter (chosen so that it is easy to find and replace), e.g. a comma:
food,apple,10,10
sweets,candy,20,20
I want to convert it to XML:
<Item>
<Product type="food" name="apple" price="10" quantity="10">
</Item>
<Item>
<Product type="sweets" name="candy" price="20" quantity="20">
</Item>

You need a regular expression find/replace:
Open the Find dialog and switch to the Replace tab:
Find What: ^([^,]*),([^,]*),([^,]*),([^\r\n]*)(\R)*
Replace with: <Item>\5 <Product type="\1" name="\2" price="\3" quantity="\4"> \5</Item>\5
Check Regular expression in the lower left.
Press Replace All.
Explanation:
The find pattern splits each line into its comma-separated parts and captures the values in \1 to \4.
\5 captures the line break.
The replacement puts the captured values inside the XML node.
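
The same find/replace idea can be tried outside the editor. The following is a minimal Python sketch, assuming every line has exactly four comma-separated fields and that the values need no XML escaping; the names `LINE`, `TEMPLATE` and `csv_to_items` are made up for this illustration.

```python
import re

# Four comma-separated fields per line; no field may contain a comma or a line break.
LINE = re.compile(r"^([^,\r\n]*),([^,\r\n]*),([^,\r\n]*),([^,\r\n]*)$", re.MULTILINE)

# \1..\4 refer to the four captured fields; "\n" is a real newline in the output.
TEMPLATE = '<Item>\n <Product type="\\1" name="\\2" price="\\3" quantity="\\4">\n</Item>'


def csv_to_items(text: str) -> str:
    """Rewrite every four-field line as an <Item> block (values are not escaped)."""
    return LINE.sub(TEMPLATE, text)


if __name__ == "__main__":
    print(csv_to_items("food,apple,10,10\nsweets,candy,20,20"))
```

Because the line break is not part of the match here, it is left in place between the generated blocks, so no fifth group is needed.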

Replace: (\w+),(\w+),(\w+),(\w+)
with : <Item>\n <Product type="\1" name="\2" price="\3" quantity="\4">\n</Item>

Related

Regex to find subelement in XML

I am using the Regular Expression search feature in Notepad++ to find matches in a few hundred files.
My goal is to find a parent/child combo in each. I don't care much about what specifically is selected (parent and child, or just the child). I just want to know whether the parent contains a specific child.
I want to find a <description> parent element that also has a <description> child element.
Example of what it should find (since one of the sub-elements is a <description>):
<description>
<otherstuff>
</otherstuff>
<something>
</something>
<description>
</description>
<otherstuff>
</otherstuff>
</description>
Example of what it should NOT find:
<description>
<otherstuff>
</otherstuff>
<something>
</something>
<notadescription>
</notadescription>
<otherstuff>
</otherstuff>
</description>
Each may have other children and sub-children as well. Both may also appear in the same document.
If I search for this:
<description>(.*)<description>(.*)</description>
It selects too much: the second part can match another top-level <description>, when I only want it to match a child.
You said you're working with Notepad++, so here is a way to go:
Ctrl+F
Find what: <description>(?:(?!</description).)*<description>(?:(?!<description>).)*</description>
check Match case
check Wrap around
check Regular expression
CHECK . matches newline
Explanation:
<description> # opening tag (the parent)
(?:(?!</description).)* # tempered greedy token: consume any character as long as no closing tag starts here
<description> # opening tag (the child)
(?:(?!<description>).)* # tempered greedy token: consume any character as long as no opening tag starts here
</description> # closing tag
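
To experiment with this pattern outside Notepad++, here is a small Python sketch. It assumes that `re.DOTALL` plays the role of the ". matches newline" checkbox; the helper name `has_nested_description` is invented for the example.

```python
import re

# Same pattern as above; re.DOTALL lets "." match line breaks.
NESTED = re.compile(
    r"<description>(?:(?!</description).)*<description>(?:(?!<description>).)*</description>",
    re.DOTALL,
)


def has_nested_description(xml_text: str) -> bool:
    """True if a <description> opens inside another, still open, <description>."""
    return NESTED.search(xml_text) is not None


nested = "<description>\n<something>\n</something>\n<description>\n</description>\n</description>"
flat = "<description>\n<notadescription>\n</notadescription>\n</description>"
siblings = "<description>\n</description>\n<description>\n</description>"

if __name__ == "__main__":
    for label, text in (("nested", nested), ("flat", flat), ("siblings", siblings)):
        print(label, has_nested_description(text))
```

The first tempered token cannot step over a closing tag, which is why two sibling elements are not reported as parent and child.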
You should not use (.*) here, because it is greedy.
Here is an example of why you shouldn't use it in your case:
<description>
<otherstuff>
</otherstuff>
<description>
<description>hello</description>
</description>
</description>
Suppose we use <description>(.*)<description>(.*)</description> here.
The second (.*) will run on to the last closing tag:
<description>
<description>hello</description>
</description>
</description>
So if you want to match only what is inside the second description, use (.*?), which is called non-greedy (lazy).
Using <description>(.*)<description>(.*?)</description> will match:
<description>
<description>hello</description> # end of match
# the remaining </description> tags are not included, because (.*?) stops at the first closing tag
So use (.*?): it stops as soon as it finds the first closing tag, whereas (.*) is greedy and looks for the largest possible match.
<description>(.*)<description>(.*?)</description> will therefore be fine in your case, because the second group captures only what is inside the sub-description.
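
The difference between the greedy and the lazy second group can be checked with a short Python sketch. The one-line input string is a made-up example, and `re.DOTALL` is only there so the patterns would behave the same on multi-line input.

```python
import re

text = "<description>a<description>hello</description>b</description>"

# Second group greedy: it extends to the LAST closing tag.
greedy = re.search(r"<description>(.*)<description>(.*)</description>", text, re.DOTALL)

# Second group lazy: it stops at the FIRST closing tag.
lazy = re.search(r"<description>(.*)<description>(.*?)</description>", text, re.DOTALL)

if __name__ == "__main__":
    print(repr(greedy.group(2)))
    print(repr(lazy.group(2)))
```

Here the greedy version captures `hello</description>b`, while the lazy version captures only `hello`.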
I'm guessing that we want an expression that excludes <notadescription>, such as:
<description>(?!<notadescription>)[\s\S]*<\/description>
If we want to capture the description element, we can add a capturing group:
(<description>(?!<notadescription>)[\s\S]*<\/description>)

regex match all xml tags that contain a certain attribute value

I have an xml file where I want to match all xml tags that contain an attribute matching a certain string in Perl.
Sample XML:
<item attr="Car" />
<item attr="Apple_And_Pears.htm#123" />
<item attr="Paper" />
<item attr="Orange_And_Peach.htm#213" />
I want a regex that grabs all nodes that have an attribute containing ".htm":
<item attr="Orange_And_Peach.htm#213" />
<item attr="Apple_And_Pears.htm#123" />
With the following regex, I match all tags rather than only the tags whose attribute contains .htm:
<item.*?attr="[^>]*>
Is there some sort of positive lookahead until a certain character?
Thanks
The appropriate Perl solution is not regex. With Mojo::DOM (one of many options):
use strict;
use warnings;
use Mojo::DOM;
use File::Slurper 'read_text';
my $xml = read_text 'test.xml';
my $dom = Mojo::DOM->new->xml(1)->parse($xml);
my $tags = $dom->find('item[attr*=".htm"]');
print "$_\n" for @$tags;
As Grinnz suggested, you should use an appropriate XML parser (check out this interesting post on Stack Overflow explaining why), but since you asked for it, here's a simple regex you could use with a positive lookahead:
<item.*?attr=".*(?=\.htm).*
If you want to match tags with only one ".htm" in it, you can use both a negative and positive lookaround:
^(?:(?!\.htm).)*\.htm(?!.*\.htm).*$
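
For comparison, here is a hedged Python sketch of a regex-only filter. It does not use the lookahead patterns above; it swaps in negated character classes so that a match can never run past the end of a tag. It assumes the `attr` value is double-quoted and contains no `>` character, and the names `HTM_ITEM` and `items_with_htm` are invented for the example. A real XML parser remains the more robust choice.

```python
import re
from typing import List

# [^>]* keeps the match inside one tag; [^">]* keeps it inside one attribute value.
HTM_ITEM = re.compile(r'<item\b[^>]*?\battr="[^">]*\.htm[^">]*"[^>]*>')


def items_with_htm(xml_text: str) -> List[str]:
    """Return every <item ...> tag whose attr value contains ".htm"."""
    return HTM_ITEM.findall(xml_text)


sample = '''<item attr="Car" />
<item attr="Apple_And_Pears.htm#123" />
<item attr="Paper" />
<item attr="Orange_And_Peach.htm#213" />'''

if __name__ == "__main__":
    for tag in items_with_htm(sample):
        print(tag)
```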

How to get second tag by regex

I don't understand how to get <sub>aaaa</sub> and not <sub>eeee</sub> in the second match.
my regex:
<item>.*?<sub>(.*?)<\/sub>.*?<value>(.*?)<\/value>.*?<\/item>
content:
<item> fffffffffffff
<sub>aaaa</sub>
<value>111</value>
</item>
<item>
<sub>eeee</sub> arg34ddddddddddddddd
<atag>ddd</atag>
<sub>aaaa</sub>
<atag>dddg</atag>
<value>222</value>
</item>
Can I get it in one step, or do I need to run a regex several times?
UPDATE
I want to get the result like this:
[ [ 'aaaa', 111],['aaaa', 222] ]
Is it possible?
Try
<item>[\s\S]*?<sub>(.*?)<\/sub>((?!<sub>)[\s\S])*<\/item>
This takes only the last sub between the item tags.
Explanation:
<item>[\s\S]*?<sub> matches lazily anything between item and sub tags
<sub>(.*?)<\/sub> matches sub tag and captures its content
((?!<sub>)[\s\S])*<\/item> uses a tempered greedy token to ensure that, after the sub that was matched before, there are no more sub tags before the closing item tag
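
To get the `[ [ 'aaaa', 111],['aaaa', 222] ]` result asked for in the update, the pattern can be extended to capture the value as well. The following Python sketch is one possible variant, not the pattern from the answer: it adds a second capturing group for `<value>` and makes the tempered tokens lazy. It assumes every item contains a `<value>` after its last `<sub>` and that the value is an integer; `ITEM` and `last_sub_with_value` are names made up here.

```python
import re

ITEM = re.compile(
    r"<item>[\s\S]*?"             # lazily skip to a <sub> inside the item
    r"<sub>(.*?)</sub>"           # group 1: content of that <sub>
    r"(?:(?!<sub>)[\s\S])*?"      # no further <sub> may follow before <value>
    r"<value>(.*?)</value>"       # group 2: content of <value>
    r"(?:(?!<sub>)[\s\S])*?"      # no further <sub> before the end of the item
    r"</item>"
)


def last_sub_with_value(text: str) -> list:
    """Return [last sub, value as int] for every item."""
    return [[sub, int(value)] for sub, value in ITEM.findall(text)]


content = """<item> fffffffffffff
<sub>aaaa</sub>
<value>111</value>
</item>
<item>
<sub>eeee</sub> arg34ddddddddddddddd
<atag>ddd</atag>
<sub>aaaa</sub>
<atag>dddg</atag>
<value>222</value>
</item>"""

if __name__ == "__main__":
    print(last_sub_with_value(content))
```

In the second item the attempt starting at `<sub>eeee</sub>` fails because another `<sub>` follows, so the engine moves on to `<sub>aaaa</sub>`.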

Regex - Multiline extraction

Using the regex below, I'm able to extract the 'model_name' value when nfc_support has value="true" in a few instances. However, I'm unable to get it to match in other instances, as shown below. Any help in getting it to match in both instances would be greatly appreciated.
EX:
<capability name=\"model_name\"[A-Za-z1-9"=();,._/<>\s]*<capability name=\"nfc_support\" value=\"true\"/>
Will work with:
<capability name="model_name" value="T11"/>
<capability name="brand_name" value="Turkcell"/>
<capability name="marketing_name" value="Campaign"/>
</group>
<group id="chips">
<capability name="nfc_support" value="true"/>
</group>
But cannot match this:
<capability name="model_name" value="U8650"/>
<capability name="brand_name" value="Huawei"/>
<capability name="marketing_name" value="Sonic"/>
</group>
<group id="chips">
<capability name="nfc_support" value="true"/>
Your regex will match everything between the first model_name and the last nfc_support with value="true", because you use the greedy * quantifier. This is a problem if nfc_support occurs multiple times in the string you apply the regex to, as the match will keep extending until it reaches the last <capability name="nfc_support" value="true"/>. A better practice for selectively matching text that may appear multiple times is to use the reluctant (lazy) quantifier *? to avoid matching too much.
Assuming every entry follows the order model_name, brand_name, marketing_name, </group>, <group id>, then nfc_support, a regex that enforces this format is:
(?s)<capability name=\"model_name\" value=\"(.*?)\"/>\n<capability name=\"brand_name\" value=\"(.*?)\"/>\n<capability name=\"marketing_name\" value=\"(.*?)\"/>\n</group>\n<group id=\"chips\">\n<capability name=\"nfc_support\" value=\"true\"/>
Apologies in advance if there are typos in this regex, but you get the gist of it...
This regex stores the values of model_name, brand_name, and marketing_name in groups $1, $2, and $3, respectively, only if nfc_support is "true". The (?s) flag makes . match line breaks as well (single-line, or DOTALL, mode).
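
A format-enforcing pattern of this kind can be sketched in Python as follows, with the capability names spelled as in the sample data. Two things differ from the regex above and are assumptions of this sketch: `\s*` is used between tags instead of a literal `\n`, and `[^"]*` replaces `(.*?)` so a value can never run past its closing quote. The middle device (X1, Example, Demo) is invented test data with nfc_support set to false.

```python
import re

DEVICE = re.compile(
    r'<capability name="model_name" value="([^"]*)"/>\s*'
    r'<capability name="brand_name" value="([^"]*)"/>\s*'
    r'<capability name="marketing_name" value="([^"]*)"/>\s*'
    r'</group>\s*'
    r'<group id="chips">\s*'
    r'<capability name="nfc_support" value="true"/>'
)

sample = '''<capability name="model_name" value="T11"/>
<capability name="brand_name" value="Turkcell"/>
<capability name="marketing_name" value="Campaign"/>
</group>
<group id="chips">
<capability name="nfc_support" value="true"/>
</group>
<capability name="model_name" value="X1"/>
<capability name="brand_name" value="Example"/>
<capability name="marketing_name" value="Demo"/>
</group>
<group id="chips">
<capability name="nfc_support" value="false"/>
</group>
<capability name="model_name" value="U8650"/>
<capability name="brand_name" value="Huawei"/>
<capability name="marketing_name" value="Sonic"/>
</group>
<group id="chips">
<capability name="nfc_support" value="true"/>
</group>'''

# One (model, brand, marketing) tuple per device that supports NFC.
nfc_devices = DEVICE.findall(sample)

if __name__ == "__main__":
    print(nfc_devices)
```

Because each part of the pattern is tied to one tag, a device with nfc_support="false" cannot borrow the "true" entry of a later device.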
Forgive me if I'm wrong, but it looks like your expression of:
[A-Za-z1-9"=();,._/<>\s]
does not account for a 0 in your character class (showing as 1-9) and should thus be:
[A-Za-z0-9"=();,._/<>\s]
EDIT: This is in regards to your example of a non-match for "model_name" value="U8650"
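
The effect of the missing 0 can be reproduced with a short Python sketch. The two patterns are the question's regex with the `1-9` and the `0-9` character class; the variable names are invented for the example.

```python
import re

TAIL = r'<capability name="nfc_support" value="true"/>'
WITH_1_9 = re.compile(r'<capability name="model_name"[A-Za-z1-9"=();,._/<>\s]*' + TAIL)
WITH_0_9 = re.compile(r'<capability name="model_name"[A-Za-z0-9"=();,._/<>\s]*' + TAIL)

u8650 = '''<capability name="model_name" value="U8650"/>
<capability name="brand_name" value="Huawei"/>
<capability name="marketing_name" value="Sonic"/>
</group>
<group id="chips">
<capability name="nfc_support" value="true"/>'''

if __name__ == "__main__":
    # The class without 0 gets stuck at the "0" in "U8650".
    print(WITH_1_9.search(u8650) is not None)
    print(WITH_0_9.search(u8650) is not None)
```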

RegEx in Powershell to replace space in XML element name with underscore

I have an XML document with element names with a space or multiple spaces in it (which is not allowed in XML) and I am looking for a regex to replace the space with an _ and after modifications replace the _ with a space again. The regex can be applied to a string.
Simplified sample XML where I want to replace <User Data Blob> with <User_Data_Blob>, but I don't want to replace e.g. My Space with My_Space. So the regex needs to match a < followed by one or more words with a space in between, followed by a >, I think.
<User Data Blob>
<Item>
<Key>SomeKey</Key>
<Value>false</Value>
</Item>
<Item>
<Key>AnotherKey</Key>
<Value></Value>
</Item>
</User Data Blob>
Get-Content .\file.xml | Foreach-Object {
[regex]::replace($_,'<([^>]+)>',{$args[0] -replace ' ','_'})
}
From Space to Underscore:
(gc .\FileWithSpace.xml)| % { $_ -replace "<(/?)(\w+) (\w+)>", '<$1$2_$3>'}
From Underscore to Space:
(gc .\FileWithUnderscore.xml)| % { $_ -replace "<(/?)(\w+)_(\w+)>", '<$1$2 $3>'}
If the regex flavor used supports lookahead, you could do things like:
 (?=[^<>]*>)
(Notice the space in front.) Replace with _.
To reverse, do:
_(?=[^<>]*>)
Replace with a space.
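
The lookahead idea can be tried in Python with the sketch below, using the lookahead in the form `(?=[^<>]*>)`, i.e. with a negated character class, so that the space must be followed by a `>` before any other angle bracket. It assumes the tags carry no attributes (a space before an attribute would be replaced too) and that element names contain no underscores of their own, otherwise the reverse step would change them. The function names are made up.

```python
import re

# A space (or underscore) that is followed by ">" before any other "<" or ">".
SPACE_IN_TAG = re.compile(r" (?=[^<>]*>)")
UNDERSCORE_IN_TAG = re.compile(r"_(?=[^<>]*>)")


def spaces_to_underscores(xml_text: str) -> str:
    return SPACE_IN_TAG.sub("_", xml_text)


def underscores_to_spaces(xml_text: str) -> str:
    return UNDERSCORE_IN_TAG.sub(" ", xml_text)


sample = "<User Data Blob><Key>My Space</Key></User Data Blob>"

if __name__ == "__main__":
    fixed = spaces_to_underscores(sample)
    print(fixed)
    print(underscores_to_spaces(fixed))
```

Text content such as "My Space" is left alone because the next angle bracket after it is a `<`, not a `>`.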