Perl regular expression


<ref id="ch02_ref1"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>J.M.</surname><given-names>Astilleros</given-names></name>
This is a single line. I just need to extract the word between the tags <given-names> and </given-names>, which in this case is Astilleros. Is there a regex to do this? The problem I am facing is that there is no space between the word and the end tag </given-names>, and the end tag contains '/', which is also the regex delimiter in Perl. Please help.
The idea is to get the names out, find them in the text on the page, and put <given-names>Astilleros</given-names> tags around them. I will definitely try XML parsers.

Don't parse XML with regexes – it is just too damn hard to get right. There are good parsers lying around, just waiting to be utilized by you. Let's use XML::LibXML:
use strict; use warnings;
use XML::LibXML;
my $dom = XML::LibXML->load_xml(string => <<'END');
<ref id="ch02_ref1">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>J.M.</surname>
<given-names>Astilleros</given-names>
</name>
</person-group>
</mixed-citation>
</ref>
END
# use XPath to find your element
my ($name) = $dom->findnodes('//given-names');
print $name->textContent, "\n";
(whatever you try, do not use XML::Simple!)
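For readers working outside Perl, the same parser-first idea can be sketched in Python with the standard library's xml.etree.ElementTree. This is only an illustration, assuming the fragment has been completed into well-formed XML as in the answer above; the variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

# The fragment from the question, completed so that it is well-formed XML.
xml_text = """
<ref id="ch02_ref1">
  <mixed-citation publication-type="journal">
    <person-group person-group-type="author">
      <name>
        <surname>J.M.</surname>
        <given-names>Astilleros</given-names>
      </name>
    </person-group>
  </mixed-citation>
</ref>
""".strip()

root = ET.fromstring(xml_text)

# iter() walks the whole tree, so the nesting depth does not matter.
given_names = [el.text for el in root.iter("given-names")]
print(given_names)
```

Because the parser handles whitespace, attributes and nesting, nothing here depends on how the line happens to be laid out.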

This should work as a regex:
/<given-names>(.*?)</
From your input, it will capture Astilleros
This matches:
A literal <given-names>
Captures any character (except newline), 0 or more times but as few as possible (the ? makes the quantifier lazy)
Until it reaches a literal <
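The same pattern can be tried quickly in Python, whose re module treats the lazy quantifier the same way for this case. A minimal sketch, assuming the input is the single line from the question (the variable names are hypothetical):

```python
import re

# The single line from the question, split only for readability.
line = (
    '<ref id="ch02_ref1"><mixed-citation publication-type="journal">'
    '<person-group person-group-type="author"><name>'
    '<surname>J.M.</surname><given-names>Astilleros</given-names></name>'
)

# Lazy capture: stop at the first "<" after the opening tag.
match = re.search(r"<given-names>(.*?)<", line)
given_name = match.group(1) if match else None
print(given_name)
```

Note that no space is needed between the word and the closing tag; the capture simply ends where the next < begins.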

Related

Match particular CDATA sections in XML data

I am trying to write a PowerShell regex. I have the following page (further below) that I want to match against; the two parts in bold are the information that I want to capture and assign to variables, so I need two regexes. From the text below, the two areas I need to find exactly are King and Years & Years. Please note that these two areas change (hence the reason I need to capture them); the rest of the code stays the same.
This is the regex I have at the moment, but it's not working for me.
\s+artist\s*>\s*<\s*!\s*[CDATA\s*[(.*)\s*]\s*]\s*>\s*<\s*/artist
And here is the page (or data) I am trying to use regex with.
<on_air>
<publishedInfo publishedDate="2015-07-18 16:24:28" />
<stationName><![CDATA[Mix 106.5]]></stationName>
<stationPrefix><![CDATA[mix1065]]></stationPrefix>
<generic_coverart><![CDATA[http://media.arn.com.au/images/getImage.aspx?i=generic_mix1065.jpg]]></generic_coverart>
<now_playing>
<audio ID="id_1705168034_30458146" type="song">
<title generic="False"><![CDATA[King*]]></title>
<artist><![CDATA[Years & Years]]></artist>
<number><![CDATA[46029]]></number>
<cut><![CDATA[1]]></cut>
<ref><![CDATA[]]></ref>
<played_datetime><![CDATA[2015-07-18 16:24:27]]></played_datetime>
<length><![CDATA[00:03:28]]></length>
<coverart generic="true"><![CDATA[http://media.arn.com.au/images/getImage.aspx?i=generic_mix1065.jpg]]></coverart>
<options>
<option><![CDATA[KIIS S Integrated]]></option>
</options>
</audio>
</now_playing>

If it is valid XML, then you do not need to use regular expressions. PowerShell adapts XML objects, so you can use standard property syntax to navigate them:
$xml=[xml]@'
<on_air>
<publishedInfo publishedDate="2015-07-18 16:24:28" />
<stationName><![CDATA[Mix 106.5]]></stationName>
<stationPrefix><![CDATA[mix1065]]></stationPrefix>
<generic_coverart><![CDATA[http://media.arn.com.au/images/getImage.aspx?i=generic_mix1065.jpg]]></generic_coverart>
<now_playing>
<audio ID="id_1705168034_30458146" type="song">
<title generic="False"><![CDATA[King*]]></title>
<artist><![CDATA[Years & Years]]></artist>
<number><![CDATA[46029]]></number>
<cut><![CDATA[1]]></cut>
<ref><![CDATA[]]></ref>
<played_datetime><![CDATA[2015-07-18 16:24:27]]></played_datetime>
<length><![CDATA[00:03:28]]></length>
<coverart generic="true"><![CDATA[http://media.arn.com.au/images/getImage.aspx?i=generic_mix1065.jpg]]></coverart>
<options>
<option><![CDATA[KIIS S Integrated]]></option>
</options>
</audio>
</now_playing>
</on_air>
'@
$xml.on_air.now_playing.audio.title.'#cdata-section'
$xml.on_air.now_playing.audio.artist.'#cdata-section'
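For comparison, here is a rough Python equivalent of the property navigation above, using the standard library's xml.etree.ElementTree. It is a sketch on a trimmed copy of the sample data; the parser hands back CDATA content as ordinary element text, so no special handling is assumed to be needed.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample feed, keeping only the elements of interest.
xml_text = """<on_air>
<now_playing>
<audio ID="id_1705168034_30458146" type="song">
<title generic="False"><![CDATA[King*]]></title>
<artist><![CDATA[Years & Years]]></artist>
</audio>
</now_playing>
</on_air>"""

root = ET.fromstring(xml_text)  # root is the <on_air> element

# Paths are relative to the root element.
title = root.findtext("now_playing/audio/title")
artist = root.findtext("now_playing/audio/artist")
print(title, "/", artist)
```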

You want to escape bracket literals.
Also, it's a good practice to avoid using the dot "match almost any character" metacharacter when your intentions are more specific. In your case, what you really want to do is match until you hit the closing bracket, so it's safer to specify that:
'\s+artist\s*>\s*<\s*!\s*\[CDATA\s*\[([^]]*)\s*\]\s*\]\s*>\s*<\s*\/artist'
Note: Regex is contextual, so the reason I don't have to escape the closing bracket within the character class is because of its position, i.e., being the first character specified in the negated class--in that context, it cannot be the closing bracket for the character class. In other words, it's not ambiguous.
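To see the escaped brackets and the negated class at work, here is a small Python sketch. Python's re module also accepts ] as the first character of a negated class; the title pattern is an assumed adaptation of the artist one and is not part of the answer above.

```python
import re

# Two lines from the sample data in the question.
data = (
    '<title generic="False"><![CDATA[King*]]></title>\n'
    "<artist><![CDATA[Years & Years]]></artist>"
)

# \[ and \] match literal brackets; [^]]* matches everything up to the next "]".
artist_re = re.compile(r"<artist\s*>\s*<!\[CDATA\[([^]]*)\]\]>\s*</artist")
title_re = re.compile(r"<title[^>]*>\s*<!\[CDATA\[([^]]*)\]\]>\s*</title")

artist = artist_re.search(data).group(1)
title = title_re.search(data).group(1)
print(title, "/", artist)
```

Since the capture cannot run past a ], it stays inside one CDATA section even when several appear on the same line.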

To help get off the ground, here is a suggestion for y&y (insert a whitespace selector wherever possible):
artist><!\[CDATA\[Years & Years\]\]></artist

How can I parse an XML file and delete text between two tags using PowerShell?

I have a file that has multiple instances of the following:
<password encrypted="True">271NFANCMnd8BFdERjHoAwEA7BTuX</password>
But for each instance the password is different.
I would like the output to delete the encrypted password:
<password encrypted="True"></password>
What is the best method using PowerShell to loop through all instances of the pattern within the file and output to a new file?
Something like:
gc file1.txt | (regex here) > new_file.txt
where (regex here) is something like:
s/"True">.*<\/pass//

This one is fairly easy in regex, and you can do it that way, or you can parse it as actual XML, which may be more appropriate. I'll demonstrate both ways. In each case, we'll start with this common bit:
$raw = @"
<xml>
<something>
<password encrypted="True">hudhisd8sd9866786863rt</password>
</something>
<another>
<thing>
<password encrypted="True">nhhs77378hd8y3y8y282yr892</password>
</thing>
</another>
<test>
<password encrypted="False">plain password here</password>
</test>
</xml>
"@
Regex
$raw -ireplace '(<password encrypted="True">)[^<]+(</password>)', '$1$2'
or:
$raw -ireplace '(?<=<password encrypted="True">).+?(?=</password>)', ''
XML
$xml = [xml]$raw
foreach($password in $xml.SelectNodes('//password')) {
$password.InnerText = ''
}
Only replace the encrypted passwords:
$xml = [xml]$raw
foreach($password in $xml.SelectNodes('//password[@encrypted="True"]')) {
$password.InnerText = ''
}
Explanations
Regex 1
(<password encrypted="True">)[^<]+(</password>)
The first regex method uses 2 capture groups to capture the opening and closing tags, and replaces the entire match with those tags (so the middle is omitted).
Regex 2
(?<=<password encrypted="True">).+?(?=</password>)
The second regex method uses positive lookaheads and lookbehinds. It finds 1 or more characters which are preceded by the opening tag and followed by the closing tag. Since lookarounds are zero-width, they are not part of the match, therefore they don't get replaced.
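Both regex techniques carry over to other engines. The sketch below replays them with Python's re module on a shortened copy of the sample; re.IGNORECASE stands in for PowerShell's case-insensitive -ireplace, and the variable names are hypothetical.

```python
import re

raw = """<xml>
<something>
<password encrypted="True">hudhisd8sd9866786863rt</password>
</something>
<test>
<password encrypted="False">plain password here</password>
</test>
</xml>"""

# Technique 1: capture both tags and write them back without the middle.
with_groups = re.sub(
    r'(<password encrypted="True">)[^<]+(</password>)',
    r"\1\2",
    raw,
    flags=re.IGNORECASE,
)

# Technique 2: zero-width lookarounds, so only the middle is matched and removed.
with_lookarounds = re.sub(
    r'(?<=<password encrypted="True">).+?(?=</password>)',
    "",
    raw,
    flags=re.IGNORECASE,
)

print(with_groups)
```

Python requires a fixed-width lookbehind, which holds here because the opening tag is a literal string.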
XML
Here we're using a simple xpath query to find all of the password nodes. We iterate through each one with a foreach loop and set its innerText to an empty string.
The second version checks that the encrypted attribute is set to True and only operates on those.
Which to Choose
I personally think that the XML method is more appropriate, because it means you don't have to account for variations in XML syntax so much. You can also more easily account for different attributes specified on the nodes or different attribute values.
By using xpath you have a lot more flexibility than with regex for processing XML.
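The parser-based approach looks much the same in other languages. Here is a minimal Python sketch with xml.etree.ElementTree; it filters on the attribute in plain code rather than XPath, and short_empty_elements=False is used so that the emptied elements keep an explicit closing tag.

```python
import xml.etree.ElementTree as ET

raw = """<xml>
<something>
<password encrypted="True">hudhisd8sd9866786863rt</password>
</something>
<test>
<password encrypted="False">plain password here</password>
</test>
</xml>"""

root = ET.fromstring(raw)

# Blank only the passwords whose encrypted attribute is "True".
for password in root.iter("password"):
    if password.get("encrypted") == "True":
        password.text = ""

cleaned = ET.tostring(root, encoding="unicode", short_empty_elements=False)
print(cleaned)
```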
File operations
I noticed your sample to read the data used gc (short for Get-Content). Be aware that this reads the file line by line, producing an array of strings rather than one string.
You can use this to get your raw content in one string, for conversion to XML or processing by regex:
$raw = Get-Content file1.txt -Raw
You can write it out pretty easily too:
$raw | Out-File file1.txt

Limiting a character after a wildcard in regex to its first occurrence

How can I tell a character that comes after a wildcard to use the first occurrence of it?
I did the following to find any tag with the word "title" in it:
<(.*?)(title)(.*?)>
but clearly what happens is I end up with the entire tag to the end of
</title>
So that in
<Bla bla ="nametitle">Yada yada</title>
I want
<Bla bla ="nametitle">
but end up with the whole tag.
Please if anyone is offended by the use of parsing html with regex simply move on and accept my apologies for the transgression. I am simply trying to find out how to use the wildcard which I have not used before correctly and apply as I see fit. Thank you.

You can use this regex:
<title.+?>
The above matches <title and goes till it encounters a >

Stop parsing at the first >. Using your example, you could do this with: <(.*?)(title)([^>]*?)>

<(?![\/]).*?title.*?>
This will find title inside any set of < > tags except for closing tags beginning with </
Example:
https://regex101.com/r/QFs4ny/1
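The suggestions above can be combined into one pattern and checked in Python. This is a sketch, not one of the posted answers verbatim: it swaps both wildcards for [^<>]* so the match can never cross a tag boundary, and keeps the (?!/) lookahead to skip closing tags. The <p>intro</p> prefix is invented test data that makes the lazy version's problem visible.

```python
import re

text = '<p>intro</p><Bla bla ="nametitle">Yada yada</title>'

# Lazy wildcards may still run across ">" while looking for "title".
lazy = re.search(r"<(.*?)(title)(.*?)>", text).group(0)

# Negated classes keep the whole match inside a single tag;
# (?!/) rejects closing tags such as </title>.
bounded = re.findall(r"<(?!/)[^<>]*title[^<>]*>", text)

print(lazy)
print(bounded)
```

A lazy quantifier only prefers the shortest match from a given starting position; it does not forbid any characters, whereas the negated class does.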

SLRE regex doesn't work properly

I have a problem with the SLRE library: I can't figure out how to stop grabbing everything after my match. Let's say I have some HTML output, and somewhere in the middle of the buffer there is a line I want to parse
name="id" value="1a2b3c4d5e6f" />
Here is my regular expression
slre_compile(&test, "name=\"id\" value=\"(.*?)\" />")
I have read about greedy and non-greedy flags in other threads where people had a similar problem to mine, but in my case adding ? to the expression doesn't change anything.
SLRE returns a match starting from 1a2b3c4d5e6f" /> and shows the rest of the HTML page, ending at the </html> tag, and I don't know why. It cuts off the beginning of the HTML source but leaves everything after my expression. I have also tried the following regex
slre_compile(&test, "^.*?name=\"id\" value=\"(.*?)\" />.*?$")
and some others, modified with greedy and non-greedy flags, which gave me the same results. Does anyone know why SLRE can't stop at " /> and continues capturing characters until the source string ends?

It seems that SLRE does not understand non-greedy quantifiers and instead parses .*? as if it were (?:.*)?. However, in this case \"[^\"]*\" should work...
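The negated-class idea does not depend on lazy quantifier support, which can be shown in any engine. The Python sketch below is not SLRE and says nothing about SLRE's internals; the HTML string is invented test data with a second " /> added so that a greedy match visibly overshoots.

```python
import re

html = (
    '<form><input type="hidden" name="id" value="1a2b3c4d5e6f" />'
    '<input name="x" value="y" /></form></html>'
)

# Greedy: .* runs to the last '" />' in the buffer.
greedy = re.search(r'name="id" value="(.*)" />', html).group(1)

# Negated class: the capture cannot contain a quote, so it must stop
# at the closing quote of the value, with no lazy quantifier needed.
negated = re.search(r'name="id" value="([^"]*)" />', html).group(1)

print(greedy)
print(negated)
```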

Regex: Skip/Ignore pattern

Given that the following string is embedded in text, how can I extract the whole line but not matching on the inner "<" and ">"?
<test type="yippie<innertext>" />
EDIT:
Being more specific, we need to handle both use cases below where "type" has or does not have "<" and ">" chars.
<h:test type="yippie<innertext>" />
<h:test type="yippie">
Group 1: 'h:test'
Group 2: ' type="yippie<innertext>" ' -or- ' type="yippie"' (ie, remaining content before ">" or "/>")
So far, I have something like this, but it's a little off in that Group 2 stops at the first ">". I am still tweaking the first part of Group 2's condition.
(<([a-zA-Z0-9_:-]+)([^>"]*|[^>]*?)\s*(/)?>)
Thanks for your help.

Try this:
<([:\w]+)(\s(?:"[^"]*"|[^/>"])+)/?>
Example usage (Python):
>>> import re
>>> x = '<h:test type="yippie<innertext>" />'
>>> re.search(r'<([:\w]+)(\s(?:"[^"]*"|[^/>"])+)/?>', x).groups()
('h:test', ' type="yippie<innertext>" ')
Also note that if your document is HTML or XML then you should use an HTML or XML parser instead of trying to do this with regular expressions.
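To check that the pattern also covers the second use case from the question (no "<" or ">" inside the attribute and no self-closing slash), here is a short Python sketch that runs it on both samples. TAG_RE and the other names are hypothetical.

```python
import re

# Quoted strings are consumed whole, so "<" and ">" inside them are skipped;
# outside quotes, any character except "/", ">" and '"' is allowed.
TAG_RE = re.compile(r'<([:\w]+)(\s(?:"[^"]*"|[^/>"])+)/?>')

samples = [
    '<h:test type="yippie<innertext>" />',
    '<h:test type="yippie">',
]

results = [TAG_RE.search(sample).groups() for sample in samples]
for groups in results:
    print(groups)
```

The order of the alternatives matters: trying the quoted string first is what lets the match step over the inner angle brackets.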

It looks like you are trying to parse XML/HTML with a regex. I would say that your approach is fundamentally wrong. A sufficiently advanced regex is not indistinguishable from an XML parser. After all, what if you needed to parse:
<test type="yippie<inner\"text\"_with_quotes,_literal_slash_and_quote\\\">" />
Furthermore, you probably need to escape the inner < and > as &lt; and &gt;.
For further reasons why you should not parse XML with a regex, I can only yield to this superior answer:
RegEx match open tags except XHTML self-contained tags
