Given that the following string is embedded in text, how can I extract the whole line without matching on the inner "<" and ">"?
<test type="yippie<innertext>" />
EDIT:
Being more specific, we need to handle both use cases below where "type" has or does not have "<" and ">" chars.
<h:test type="yippie<innertext>" />
<h:test type="yippie">
Group 1: 'h:test'
Group 2: ' type="yippie<innertext>" ' -or- ' type="yippie"' (ie, remaining content before ">" or "/>")
So far, I have something like this, but it's a little off: Group 2 stops at the first ">". I'm still tweaking the first part of Group 2's condition.
(<([a-zA-Z0-9_:-]+)([^>"]*|[^>]*?)\s*(/)?>)
Thanks for your help.
Try this:
<([:\w]+)(\s(?:"[^"]*"|[^/>"])+)/?>
Example usage (Python):
>>> import re
>>> x = '<h:test type="yippie<innertext>" />'
>>> re.search(r'<([:\w]+)(\s(?:"[^"]*"|[^/>"])+)/?>', x).groups()
('h:test', ' type="yippie<innertext>" ')
Also note that if your document is HTML or XML then you should use an HTML or XML parser instead of trying to do this with regular expressions.
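As a quick check of the pattern above against both use cases from the question, here is a minimal Python sketch. The names TAG_RE, samples and results are only illustrative. Note that the pattern requires at least one whitespace character after the tag name, so it is not meant for attribute-less tags.

```python
import re

# Pattern from the answer above: tag name in group 1, attribute text in group 2.
# Quoted attribute values are consumed whole, so '<' and '>' inside them are skipped.
TAG_RE = re.compile(r'<([:\w]+)(\s(?:"[^"]*"|[^/>"])+)/?>')

samples = [
    '<h:test type="yippie<innertext>" />',
    '<h:test type="yippie">',
]

results = [TAG_RE.search(s).groups() for s in samples]
for name, attrs in results:
    print(repr(name), repr(attrs))
```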
It looks like you are trying to parse XML/HTML with a regex. I would say that your approach is fundamentally wrong. A sufficiently advanced regex is not indistinguishable from an XML parser. After all, what if you needed to parse:
<test type="yippie<inner\"text\"_with_quotes,_literal_slash_and_quote\\\">" />
Furthermore, you probably need to escape the inner < and > as &lt; and &gt;.
For further reasons why you should not parse XML with a regex, I can only yield to this superior answer:
RegEx match open tags except XHTML self-contained tags
Related
I have a XML file I use to manually route users to specific pages in a website.
Currently, we have separate entries for every variation of possible searches (plural, typos etc.). I would like to know if there is a way I can condense it with regex to something like so:
<OnSiteSearch>
...
<Search>
<SearchTerm>(horses?|cows?) for sale</SearchTerm>
<Destination>~/some/path.html</Destination>
</Search>
...
</OnSiteSearch>
Is something like this possible? I've looked online for regex and XML but it seems to be about validating content between the XML tags and not about using regex as the content.
Yes, a regex can be stored in XML as long as you mind XML escaping rules to keep the XML well-formed:
Element content: Escape any < as &lt; and any & as &amp; when writing
the regex; reverse the substitution before using the regex.
Attribute value: Follow rules for element content plus escape any " as
&quot; or any ' as &apos; to avoid conflict with chosen attribute
value delimiters.
CDATA: No escaping needed, but make sure your regex doesn't include
the string ]]>.
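To illustrate the element-content rule, here is a small Python sketch that round-trips a regex through XML. The regex text itself is a made-up example containing & and <, and the element names are borrowed from the question. It assumes the XML is read back with a real parser, which reverses the escaping automatically.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical regex that contains characters XML treats specially.
regex_text = r"(horses?|cows?) for sale & <cheap>"

# escape() turns & into &amp;, < into &lt; and > into &gt;.
escaped = escape(regex_text)
xml_doc = (
    "<Search>"
    f"<SearchTerm>{escaped}</SearchTerm>"
    "<Destination>~/some/path.html</Destination>"
    "</Search>"
)

# The parser undoes the escaping, so the regex comes back unchanged.
root = ET.fromstring(xml_doc)
loaded = root.findtext("SearchTerm")
matcher = re.compile(loaded, re.IGNORECASE)
print(escaped)
print(loaded)
```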
I have to edit a stored procedure that builds xml strings so that all the element values are wrapped in cdata. Some of the values have already been wrapped in cdata so I need to ignore those.
I figured this is a good attempt to learn some regex
From: <element>~DATA_04</element>
to: <element><![CDATA[~DATA_04]]></element>
What are my options on how to do this? I can do simple regex, this is way more advanced.
NOTE: The <element> is generic for illustration purposes, in reality, it could be anything and is unknown.
Sample text:
declare @sql nvarchar(max) =
' <data>
<header></header>
<docInfo>Blah</docInfo>
<someelement>~DATA_04</someelement>
<anotherelement><![CDATA[~DATA_05]]></anotherelement>
</data>
'
Using the sample xml, the regex would need to find someelement and add cdata to it like <someelement><![CDATA[~DATA_04]]></someelement> and leave the other elements alone.
Bear in mind, I did not write this horrible SQL code; I just have to edit it.
This is C#:
string text = Regex.Replace( inputString, @"<element>~(.+)</element>", "<element><![CDATA[~$1]]></element>" , RegexOptions.None );
The find is:
<element>~(.+)</element>
The replace is:
<element><![CDATA[~$1]]></element>
I'm assuming there is a ~ at the start of the inside of the element tag.
You will also want to watch out for whitespace if that is an issue. You may want to add some \s* (any whitespace characters, zero or more matches) around the element content.
Try with (<[^>]+>)(\~data_([^<]+))(<[^>]+>) with the case-insensitive flag set,
and replace with \1<![CDATA[\2]]>\4
This will give you: <element><![CDATA[~DATA_04]]></element>,
where element could be anything else.
Good luck
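For comparison, here is a Python sketch of the same idea for an unknown element name. It assumes, as the answers above do, that every value to wrap starts with ~ and contains no nested markup. PLACEHOLDER_RE and the variable names are illustrative only.

```python
import re

sql_text = """<data>
<header></header>
<docInfo>Blah</docInfo>
<someelement>~DATA_04</someelement>
<anotherelement><![CDATA[~DATA_05]]></anotherelement>
</data>"""

# <name>, a value that starts with '~' and contains no '<', then the matching </name>.
# A value already wrapped in CDATA starts with '<', so it can never match.
PLACEHOLDER_RE = re.compile(r"<([\w:.-]+)>(~[^<]+)</\1>")

wrapped = PLACEHOLDER_RE.sub(r"<\1><![CDATA[\2]]></\1>", sql_text)
print(wrapped)
```

The backreference \1 in the pattern ties the closing tag to the opening tag, so the element name never has to be hard-coded.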
I have a file that has multiple instances of the following:
<password encrypted="True">271NFANCMnd8BFdERjHoAwEA7BTuX</password>
But for each instance the password is different.
I would like the output to delete the encrypted password:
<password encrypted="True"></password>
What is the best method using PowerShell to loop through all instances of the pattern within the file and output to a new file?
Something like:
gc file1.txt | (regex here) > new_file.txt
where (regex here) is something like:
s/"True">.*<\/pass//
This one is fairly easy in regex, and you can do it that way, or you can parse it as actual XML, which may be more appropriate. I'll demonstrate both ways. In each case, we'll start with this common bit:
$raw = @"
<xml>
<something>
<password encrypted="True">hudhisd8sd9866786863rt</password>
</something>
<another>
<thing>
<password encrypted="True">nhhs77378hd8y3y8y282yr892</password>
</thing>
</another>
<test>
<password encrypted="False">plain password here</password>
</test>
</xml>
"@
Regex
$raw -ireplace '(<password encrypted="True">)[^<]+(</password>)', '$1$2'
or:
$raw -ireplace '(?<=<password encrypted="True">).+?(?=</password>)', ''
XML
$xml = [xml]$raw
foreach($password in $xml.SelectNodes('//password')) {
$password.InnerText = ''
}
Only replace the encrypted passwords:
$xml = [xml]$raw
foreach($password in $xml.SelectNodes('//password[@encrypted="True"]')) {
$password.InnerText = ''
}
Explanations
Regex 1
(<password encrypted="True">)[^<]+(</password>)
The first regex method uses 2 capture groups to capture the opening and closing tags, and replaces the entire match with those tags (so the middle is omitted).
Regex 2
(?<=<password encrypted="True">).+?(?=</password>)
The second regex method uses positive lookaheads and lookbehinds. It finds 1 or more characters which are preceded by the opening tag and followed by the closing tag. Since lookarounds are zero-width, they are not part of the match, therefore they don't get replaced.
XML
Here we're using a simple xpath query to find all of the password nodes. We iterate through each one with a foreach loop and set its innerText to an empty string.
The second version checks that the encrypted attribute is set to True and only operates on those.
Which to Choose
I personally think that the XML method is more appropriate, because it means you don't have to account for variations in XML syntax so much. You can also more easily account for different attributes specified on the nodes or different attribute values.
By using xpath you have a lot more flexibility than with regex for processing XML.
File operations
I noticed your sample to read the data used gc (short for Get-Content). Be aware that this reads the file line by line and returns an array of strings rather than one string.
You can use this to get your raw content in one string, for conversion to XML or processing by regex:
$raw = Get-Content file1.txt -Raw
You can write it out pretty easily too:
$raw | Out-File file1.txt
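If PowerShell is not a requirement, the parser-based approach can be sketched in Python with the standard library. This is only an illustration of the same idea, not a translation of the code above. The sample document mirrors the one used in this answer, and short_empty_elements=False is passed so that emptied nodes are written as <password ...></password> rather than as self-closing tags.

```python
import xml.etree.ElementTree as ET

raw = """<xml>
<something>
<password encrypted="True">hudhisd8sd9866786863rt</password>
</something>
<test>
<password encrypted="False">plain password here</password>
</test>
</xml>"""

root = ET.fromstring(raw)

# Blank only the password nodes whose encrypted attribute is "True".
for password in root.iter("password"):
    if password.get("encrypted") == "True":
        password.text = ""

result = ET.tostring(root, encoding="unicode", short_empty_elements=False)
print(result)
```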
<ref id="ch02_ref1"><mixed-citation publication-type="journal"><person-group person-group-type="author"><name><surname>J.M.</surname><given-names>Astilleros</given-names></name>
This is a single line. I just need to extract the word between the tags <given-names> and </given-names>, which in this case is Astilleros. Is there a regex to do this? The problem I am facing is that there is no space between the word and the end tag </given-names>, and '/' is the delimiter character in a Perl regex. Please help.
The idea is to get the names out, find them in the text on the page and put <given-names>Astilleros</given-names> tags around them. I will definitely try XML parsers.
Don't parse XML with regexes – it is just too damn hard to get right. There are good parsers lying around, just waiting to be utilized by you. Let's use XML::LibXML:
use strict; use warnings;
use XML::LibXML;
my $dom = XML::LibXML->load_xml(string => <<'END');
<ref id="ch02_ref1">
<mixed-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>J.M.</surname>
<given-names>Astilleros</given-names>
</name>
</person-group>
</mixed-citation>
</ref>
END
# use XPath to find your element
my ($name) = $dom->findnodes('//given-names');
print $name->textContent, "\n";
(whatever you try, do not use XML::Simple!)
This should work as a regex:
/<given-names>(.*?)</
From your input, it will capture Astilleros
This matches:
A literal <given-names>
Captures (0 to infinite times) any character (except newline)
Until it reaches a literal <
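The same lazy capture can be tried in Python, where patterns are not slash-delimited and '/' therefore needs no escaping. This is a sketch applied to the single-line fragment from the question. It spells out the full closing tag instead of stopping at the first <, which is a slightly stricter variant of the pattern above.

```python
import re

# The single-line fragment from the question (the outer tags are not closed).
line = (
    '<ref id="ch02_ref1"><mixed-citation publication-type="journal">'
    '<person-group person-group-type="author"><name><surname>J.M.</surname>'
    '<given-names>Astilleros</given-names></name>'
)

# (.*?) is lazy: it captures as little as possible before the closing tag.
names = re.findall(r"<given-names>(.*?)</given-names>", line)
print(names)
```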
Here is a line in my xyz.csproj file:
<Reference Include="SomeDLLNameHere, Version=10.2.6.0, Culture=neutral, PublicKeyToken=b88d1754d700e49a, processorArchitecture=MSIL" />
All I need to do is replace 'Version=10.2.6.0' with 'Version=11.0.0.0'.
The program I need to do this in is VSBuild, which I believe uses VBScript.
The problem is that I can't hardcode the 'old' version number. I therefore need to replace the following:
<Reference Include="SomeDLLNameHere, Version=10.2.6.0,
I therefore need a regex that will match the above, bearing in mind that in the example quoted, the 10.2.6.0 could be anything.
I believe that a regex that would select the text including and between
'<Reference Include="SomeDLLNameHere' and '>' is what I need.
There are other references to similar requests but none seem to work for me.
I would normally use C# to do this sort of thing and VBScript/Regex is something I avoid like the plague.
For most regex flavors, you would use this:
<Reference Include="SomeDLLNameHere.*?/>
For Visual Studio, I am not sure if the *? would work. Try this:
\<Reference Include="SomeDLLNameHere[^/]*\/\>
This regex pattern should work
"(<Reference[^>]+Version=)([^,]+),"
Applied with VBScript
str1 = "<Reference Include=""SomeDLLNameHere, Version=10.2.6.0,"
' Create regular expression.
Set regEx = New RegExp
regEx.Pattern = "(<Reference[^>]+Version=)([^,]+),"
' Make replacement.
ReplaceText = regEx.Replace(str1, "$111.0.0.0,")
WScript.echo ReplaceText
Gives the correct result
<Reference Include="SomeDLLNameHere, Version=11.0.0.0,
UPDATE
If you need something that matches between Version= and the end of the tag, use > instead of , as the terminator:
"(<Reference[^>]+Version=)([^>]+)>"
Using regex with C# or VBScript is pretty much the same, because it all comes down to developing the regular expression. Something like this could help:
<Reference\s+Include\s*=\s*"[^",]+,\s*Version\s*=\s*[^,"]+,
I'm not sure what the rules are about case sensitivity and whitespace in csproj files, but this covers the form you described. Note that the "+" operator means one or more repetitions (the Kleene plus).
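For anyone doing the same replacement outside VBScript, here is a hedged Python sketch. The assembly name SomeDLLNameHere and the target version come from the question; VERSION_RE is just an illustrative name. The replacement uses \g<1> so the group reference cannot run into the digits of the new version number.

```python
import re

line = (
    '<Reference Include="SomeDLLNameHere, Version=10.2.6.0, Culture=neutral, '
    'PublicKeyToken=b88d1754d700e49a, processorArchitecture=MSIL" />'
)

# Group 1 keeps everything up to and including "Version=";
# [^,"]+ then consumes the old version number, whatever it is.
VERSION_RE = re.compile(r'(<Reference\s+Include="SomeDLLNameHere,\s*Version=)[^,"]+')

# \g<1> is the unambiguous form of \1 when digits follow immediately.
updated = VERSION_RE.sub(r"\g<1>11.0.0.0", line)
print(updated)
```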