Regex to fetch xml node string value - regex

I have some output from which I'd like to fetch the value of the CMEngine node, i.e., everything inside the CMEngine node. Please help me with a regex. I already have Java code in place that uses the regex, so I just need the regex itself. Thanks.
My XML
<General>
<LanguageID>en_US</LanguageID>
<CMEngine>
<CMServer/> <!-- starting here -->
<DaysToKeepHistory>4</DaysToKeepHistory>
<PreprocessorMaxBuf>5000000</PreprocessorMaxBuf>
<ServiceRefreshInterval>30</ServiceRefreshInterval>
<ReuseMemoryBetweenRequests>true</ReuseMemoryBetweenRequests>
<Trace Enabled="false">
<ActiveCategories>
<Category>ENVIRONMENT</Category>
<Category>EXEC</Category>
<Category>EXTERNALS</Category>
<Category>FILESYSTEM</Category>
<Category>INPUT_DOC</Category>
<Category>INTERFACES</Category>
<Category>NETWORKING</Category>
<Category>OUTPUT_DOC</Category>
<Category>PREPROCESSOR_INPUT</Category>
<Category>REQUEST</Category>
<Category>SYSTEMRESOURCES</Category>
<Category>VIEWIO</Category>
</ActiveCategories>
<SeverityLevel>ERROR</SeverityLevel>
<MessageInfo>
<ProcessAndThreadIds>true</ProcessAndThreadIds>
<TimeStamp>true</TimeStamp>
</MessageInfo>
<TraceFile>
<FileName>CMEngine_log.txt</FileName>
<MaxFileSize>1000000</MaxFileSize>
<RecyclingMethod>Restart</RecyclingMethod>
</TraceFile>
</Trace>
<JVMLocation>C:\Informatica\9.1.0\java\jre\bin\server</JVMLocation>
<JVMInitParamList/> <!-- Ending here -->
</CMEngine>
</General>

If it has to be a regex, and if there is only one CMEngine tag per string:
Pattern regex = Pattern.compile("(?<=<CMEngine>)(?:(?!</CMEngine>).)*", Pattern.DOTALL);
Matcher regexMatcher = regex.matcher(subjectString);
if (regexMatcher.find()) {
ResultString = regexMatcher.group();
}
Since that output appears to be machine-generated and is unlikely to contain comments or other stuff that might confuse the regex, this should work quite reliably.
It starts at a position right after a <CMEngine> tag: (?<=<CMEngine>) and matches all characters until the next </CMEngine> tag: (?:(?!</CMEngine>).)*.
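For readers who want to try the pattern outside Java, here is a minimal Python sketch of the same lookbehind plus negative-lookahead approach. The XML string is a shortened, hypothetical stand-in for the output in the question, and the variable names are illustrative only.

```python
import re

# Shortened, hypothetical stand-in for the XML shown in the question.
xml = """<General>
<LanguageID>en_US</LanguageID>
<CMEngine>
<CMServer/>
<DaysToKeepHistory>4</DaysToKeepHistory>
<JVMInitParamList/>
</CMEngine>
</General>"""

# (?<=<CMEngine>)        start right after the opening tag
# (?:(?!</CMEngine>).)*  consume characters until the closing tag is next
pattern = re.compile(r"(?<=<CMEngine>)(?:(?!</CMEngine>).)*", re.DOTALL)

match = pattern.search(xml)
inner = match.group() if match else None
print(inner)
```

As in the Java version, this assumes there is only one CMEngine element in the string and that the tag carries no attributes.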

Related

How do I match a string which contains a specific string using regex lazily?

I would like to match a string using regex in python which contains a specific string (lazy match) but haven't figured out how to do so.
For instance, in the following example, how do I return just '<tag1>some text<tag2>some other text</tag2></tag1>'
and not the whole string
#!/bin/python3
import re
pattern = r'(<([a-zA-Z0-9]+?)\b[^>]*>.*?<tag2>some other text</tag2>.*?</\2>)'
text = '<root> <tag1>some text<tag2>some other text</tag2></tag1> </root>'
print(re.search(pattern, text, re.DOTALL).groups(0))
The code above prints <root> <tag1>some text<tag2>some other text</tag2></tag1> </root> when I want it to print <tag1>some text<tag2>some other text</tag2></tag1>
Of course, all of this assuming that there can be any tag in the place of tag1
It turns out the solution is quite simple. Here's the regex that works:
.*(<([a-zA-Z0-9]+?)\b[^>]*>.*?<tag2>some other text</tag2>.*?</\2>).*
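A short sketch of why the extra leading .* helps, reusing the text from the question: the greedy .* first swallows the whole string and then backtracks, so the capture group ends up starting at the rightmost opening tag from which the rest of the pattern can still match. The variable names below are illustrative.

```python
import re

text = '<root> <tag1>some text<tag2>some other text</tag2></tag1> </root>'

# Pattern from the question: the search starts at the leftmost "<", i.e. <root>.
original = r'(<([a-zA-Z0-9]+?)\b[^>]*>.*?<tag2>some other text</tag2>.*?</\2>)'
# Pattern from the answer: the greedy prefix pushes the group as far right as possible.
fixed = r'.*(<([a-zA-Z0-9]+?)\b[^>]*>.*?<tag2>some other text</tag2>.*?</\2>).*'

outer = re.search(original, text, re.DOTALL).group(1)
inner = re.search(fixed, text, re.DOTALL).group(1)

print(outer)
print(inner)
```

Note that this picks the innermost enclosing tag only because the backtracking happens from the right; it is not a general XML-aware solution.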

Find several groups of ocurrences in String with regex Scala

I have a String like this:
val rawData = "askljdld<a>content to extract</a>lkdsjkdj<a>more content to extract</a>sdkdljk"
and I want to extract the content between the tags <a>
I've tried this, but the end part of the regex is not working as I expected:
val regex = "<a>(.*)</a>".r
for(m <- regex.findAllIn(rawData)){
println(m)
}
the output is:
<a>content to extract</a>lkdsjkdj<a>more content to extract</a>
I understand what's happening: the regex finds the first <a> and the last </a>.
How can I have an iterator with the two entries?
<a>content to extract</a>
<a>more content to extract</a>
thanks in advance
It's very simple: "<a>(.*?)</a>"
.*? is a lazy (non-greedy) match: it consumes as few characters as possible, so it stops at the first </a> instead of the last one.
Your regex is greedy, which is why it runs to the last </a>. You should use <a>(.*?)</a> instead:
val rawData = "askljdld<a>content to extract</a>lkdsjkdj<a>more content to extract</a>sdkdljk"
val regex = "<a>(.*?)</a>".r
regex.findAllIn(rawData).foreach(println)
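The same greedy versus lazy difference can be reproduced in Python with re.findall, which returns the contents of the capture group for each match. This is only an illustrative sketch using the string from the question.

```python
import re

raw_data = "askljdld<a>content to extract</a>lkdsjkdj<a>more content to extract</a>sdkdljk"

# Greedy: .* runs to the last </a>, so there is a single oversized match.
greedy = re.findall(r"<a>(.*)</a>", raw_data)

# Lazy: .*? stops at the first </a>, so each element is matched separately.
lazy = re.findall(r"<a>(.*?)</a>", raw_data)

print(greedy)
print(lazy)
```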

I need a regular Expression for Starts with and ends with

I am looking for a regular expression for locating numerous expressions to find and replace. The expression looks like s360a__fieldname__c. I need to find all the instances where the s360a__ is then followed by the __c.
The issue is that it has to be within the one line so it is not finding a starting s360a__ and then the next __c which may be several lines below.
Here is an example of some of the xml I am changing.
<fields>
<fullName>s360a__AddressPreferredStreetAddressCity__c</fullName>
<deprecated>false</deprecated>
<externalId>false</externalId>
<label>Preferred Street City</label>
<length>255</length>
<required>false</required>
<trackFeedHistory>false</trackFeedHistory>
<trackHistory>false</trackHistory>
<type>Text</type>
<unique>false</unique>
</fields>
<fields>
<fullName>s360a__AddressPreferredStreetAddressCountry__c</fullName>
<deprecated>false</deprecated>
<externalId>false</externalId>
<label>Preferred Street Country</label>
<picklist>
You'd be better off using a parser combined with an XPath query instead. Here's an example in PHP (it can easily be adapted for, e.g., Python as well). The idea is to load the DOM, then use XPath functions to filter the elements (starts-with() and text() in this example):
<?php
$xml = '<fields>
<fullName>s360a__AddressPreferredStreetAddressCity__c</fullName>
<deprecated>false</deprecated>
<externalId>false</externalId>
<label>Preferred Street City</label>
<length>255</length>
<required>false</required>
<trackFeedHistory>false</trackFeedHistory>
<trackHistory>false</trackHistory>
<type>Text</type>
<unique>false</unique>
</fields>';
$dom = simplexml_load_string($xml);
// find everything where the text starts with 's360a_'
$fields = $dom->xpath("//*[starts-with(text(), 's360a_')]");
print_r($fields);
# s360a__AddressPreferredStreetAddressCity__c
The code checks whether the text starts with s360a_. Checking whether it also ends with a specific string takes more work, because XPath 1.0 (which this xpath() call uses) has no ends-with() function; it has to be emulated with substring() and string-length():
# check if the node text starts and ends with a specific string
$fields = $dom->xpath("//*[starts-with(., 's360a_') and substring(text(), string-length(text()) - string-length('_c') +1) = '_c']");
?>
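Since the answer mentions Python, here is a rough sketch of the same parser-based idea using the standard library's xml.etree.ElementTree. Instead of emulating ends-with() in XPath, it filters in Python with str.startswith and str.endswith. The wrapping <root> element and the trimmed field list are assumptions made so the sample is a well-formed document.

```python
import xml.etree.ElementTree as ET

# Trimmed sample; the <root> wrapper is hypothetical and only makes it well-formed.
xml = """<root>
<fields>
<fullName>s360a__AddressPreferredStreetAddressCity__c</fullName>
<label>Preferred Street City</label>
</fields>
<fields>
<fullName>s360a__AddressPreferredStreetAddressCountry__c</fullName>
<label>Preferred Street Country</label>
</fields>
</root>"""

root = ET.fromstring(xml)

# Keep the text of every element that starts with s360a__ and ends with __c.
matches = [
    el.text
    for el in root.iter()
    if el.text and el.text.startswith("s360a__") and el.text.endswith("__c")
]

print(matches)
```

Because each element's text is checked on its own, a match can never start in one element and end several lines further down.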

Regex without brackets

I have the following tag from an XML file:
<msg><![CDATA[Method=GET URL=http://test.de:80/cn?OP=gtm&Reset=1(Clat=[400441379], Clon=[-1335259914], Decoding_Feat=[], Dlat=[0], Dlon=[0], Accept-Encoding=gzip, Accept=*/*) Result(Content-Encoding=[gzip], Content-Length=[7363], ntCoent-Length=[15783], Content-Type=[text/xml; charset=utf-8]) Status=200 Times=TISP:270/CSI:-/Me:1/Total:271]]>
Now I try to get from this message: Clon, Dlat, Dlon and Clat.
However, I already created the following regex:
(?<=Clat=)[\[\(\d+\)\n\n][^)n]+]
But the problem is here, I would like to get only the numbers without the brackets. I tried some other expressions.
Do you maybe know, how I can expand this expression, in order to get only the values without the brackets?
Thank you very much in advance.
Best regards
The regex
(clon|dlat|dlon|clat)=\[(-?\d+)\]
As I stated before, if you use this regex to extract the information out of this CDATA element, that's okay. But you really want to get to the contents of that element using an XML parser.
Example usage
// IgnoreCase lets the lowercase names in the pattern match "Clat", "Clon", etc.
Regex r = new Regex(@"(clon|dlat|dlon|clat)=\[(-?\d+)\]", RegexOptions.IgnoreCase);
string s = ".. here's your cdata content .. ";
foreach (Match match in r.Matches(s))
{
var name = match.Groups[1].Value; //will contain "clon", "dlat", "dlon" or "clat" (in the casing found in the input)
var inner_value = match.Groups[2].Value; //will contain the value inside the square brackets, e.g. "400441379"
//Do something with the matches
}
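As a quick way to check the pattern against the message from the question, here is an equivalent sketch in Python. The message is shortened to the relevant part, and the dictionary at the end is just one illustrative way to use the matches.

```python
import re

# Relevant part of the CDATA content from the question.
msg = ("Method=GET URL=http://test.de:80/cn?OP=gtm&Reset=1"
       "(Clat=[400441379], Clon=[-1335259914], Decoding_Feat=[], "
       "Dlat=[0], Dlon=[0], Accept-Encoding=gzip, Accept=*/*)")

# Group 1 is the field name, group 2 the number without the brackets.
pattern = re.compile(r"(clon|dlat|dlon|clat)=\[(-?\d+)\]", re.IGNORECASE)

pairs = pattern.findall(msg)
values = {name: int(number) for name, number in pairs}

print(pairs)
print(values)
```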

Reverse a regex?

I am using AHK to automatically do something but it involves parsing XML. I am aware that it is a bad habit to parse XML with regex, however I pretty much have my regex working. The issue is AHK only has regexreplace as a method and I need something along the lines of regexkeep.
So what happens is the part I want to keep gets deleted and the part I want deleted gets kept.
Here is the code:
RegExReplace(response, "(?<=.dt.\n:)(.*)(?=\n..dt.)")
Is there a way to have everything but the match match? If not is there a better way to go about this?
Edit:
I have now attempted using the inverse regex and RegExMatch, but neither works in AHK. Both regexes work properly at regex101.com. RegExMatch returns 0 (meaning it found nothing) and the inverse regex returns nothing as well.
Here is a link to what is being searched by the regex: http://www.dictionaryapi.com/api/v1/references/collegiate/xml/Endoderm?key=17594df4-ff21-4045-88d9-a537fd4bcd61
Here is the entire code:
;responses := RegExReplace(response, "([\w\W])(?<=.dt.\n:)(.*)(?=\n..dt.)([\w\W])")
responses := RegExMatch(response, "(?<=.dt.\n:)(.*)(?=\n..dt.)")
MsgBox %responses%
Here is the "reversed" regex:
s).*dt.\n:|\n..dt.*
The parts in the look-arounds need matching with the * quantifier to match from the start and up to the end. To match the newline with a dot, use singleline mode.
Debuggex Demo (where endings are \r\n)
However, there is a better option with RegExMatch OutputVar:
If any capturing subpatterns are present inside NeedleRegEx, their
matches are stored in a pseudo-array whose base name is OutputVar.
Use
RegExMatch(response, "(?<=.dt.\n:)(?<Val>.*)(?=\n..dt.)", Match)
Then, just refer to this value as MatchVal (the OutputVar name, Match, followed by the subpattern name, Val).
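To make the two approaches above easier to compare, here is a sketch of both in Python rather than AHK: removing everything except the wanted text with the "reversed" pattern, and capturing the wanted text directly with a named group. The response string is a hypothetical fragment with \n line endings; the real API response may be laid out differently.

```python
import re

# Hypothetical fragment shaped like the pattern expects: <dt>, newline, colon, text.
response = "<def>\n<dt>\n:a tissue derived from this layer\n</dt>\n</def>"

# Approach 1: delete everything before and after the wanted text.
# re.DOTALL plays the role of the s) option so that . also matches newlines.
kept = re.sub(r".*dt.\n:|\n..dt.*", "", response, flags=re.DOTALL)

# Approach 2: capture the wanted text with a named group instead of replacing.
m = re.search(r"(?<=.dt.\n:)(?P<Val>.*)(?=\n..dt.)", response)
captured = m.group("Val") if m else None

print(kept)
print(captured)
```

If the real data uses \r\n line endings, the \n in both patterns would need adjusting, as the Debuggex note above suggests.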
Here's a solution that should work, assuming you want to get whatever's between the <dt> tags. Make sure you're using the latest version of AHK if possible.
xml =
(
<entry_list version="1.0">
<entry id="endoderm">
<ew>endoderm</ew>
<subj>EM#AN</subj>
<hw>en*do*derm</hw>
<sound>
<wav>endode01.wav</wav>
<wpr>!en-du-+durm</wpr>
</sound>
<pr>ˈen-də-ˌdərm</pr>
<fl>noun</fl>
<et>French
<it>endoderme,</it>from
<it>end-</it>+ Greek
<it>derma</it>skin
<ma>derm-</ma>
</et>
<def>
<date>1861</date>
<dt>:the innermost of the three primary germ layers of an embryo that is
the source of the epithelium of the digestive tract and
its derivatives and of the lower respiratory tract</dt>
<sd>also</sd>
<dt>:a tissue derived from this layer</dt>
</def>
<uro>
<ure>en*do*der*mal</ure>
<sound>
<wav>endode02.wav</wav>
<wpr>+en-du-!dur-mul</wpr>
</sound>
<pr>ˌen-də-ˈdər-məl</pr>
<fl>adjective</fl>
</uro>
</entry>
</entry_list>
)
; Remove linebreaks and indentation whitespace
xml := RegExReplace(xml, "\n|\s{2,}|\t", "")
matchArray := []
matchPos := 1
; Keep looping until we're out of matches
while ( matchPos := RegExMatch(xml, "<dt>:([^<]*)", matchVar, matchPos + StrLen(matchVar1)) )
{
; Add matches to array
matchArray.insert(matchVar1)
}
; Show what's in the array
for each, value in matchArray {
; Index = Each, Output = Value
msgBox, Ittr: %each%, Value: %value%
}
Esc::ExitApp
You really shouldn't use RegEx for parsing XML, though. It's very simple to read XML in AHK using COM. I know it's outside the scope of your question, but here's a simple example using a COM object to read the same data:
xmlData =
(LTrim
<?xml version="1.0" encoding="utf-8" ?>
<entry_list version="1.0">
<entry id="endoderm"><ew>endoderm</ew><subj>EM#AN</subj><hw>en*do*derm</hw><sound><wav>endode01.wav</wav><wpr>!en-du-+durm</wpr></sound><pr>ˈen-də-ˌdərm</pr><fl>noun</fl><et>French <it>endoderme,</it> from <it>end-</it> + Greek <it>derma</it> skin <ma>derm-</ma></et><def><date>1861</date><dt>:the innermost of the three primary germ layers of an embryo that is the source of the epithelium of the digestive tract and its derivatives and of the lower respiratory tract</dt> <sd>also</sd> <dt>:a tissue derived from this layer</dt></def><uro><ure>en*do*der*mal</ure><sound><wav>endode02.wav</wav><wpr>+en-du-!dur-mul</wpr></sound> <pr>ˌen-də-ˈdər-məl</pr> <fl>adjective</fl></uro></entry>
</entry_list>
)
xmlObj := ComObjCreate("MSXML2.DOMDocument.6.0")
xmlObj.loadXML(xmlData)
nodes := xmlObj.selectSingleNode("/entry_list/entry/def").childNodes
for node in nodes {
if (node.nodeName == "dt")
msgBox % node.text
}
Esc::ExitApp
For more information on how to use this, see this post: http://www.autohotkey.com/board/topic/56987-com-object-reference-autohotkey-v11/?p=367838
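For comparison, the same parser-based lookup can be sketched in Python with xml.etree.ElementTree instead of the MSXML COM object. The XML below is a cut-down, hypothetical version of the dictionary entry, and the definition text in the first dt is shortened.

```python
import xml.etree.ElementTree as ET

# Cut-down, hypothetical version of the entry used in the AHK example.
xml_data = (
    '<entry_list version="1.0"><entry id="endoderm"><def>'
    "<date>1861</date>"
    "<dt>:the innermost of the three primary germ layers</dt> "
    "<sd>also</sd> "
    "<dt>:a tissue derived from this layer</dt>"
    "</def></entry></entry_list>"
)

root = ET.fromstring(xml_data)  # root is the <entry_list> element

# Select every <dt> directly under entry/def and collect its text.
definitions = [dt.text for dt in root.findall("./entry/def/dt")]

for definition in definitions:
    print(definition)
```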
If the given phrase only occurs once, you can probably just fetch everything around it, can't you?
RegExReplace(response, "([\w\W]*)(?<=.dt.\n:)(.*)(?=\n..dt.)([\w\W]*)", "$1$3")
This looks like the easiest solution to me, though surely not the prettiest.
Update: in your question update, you quoted responses := RegExReplace(response, "([\w\W])(?<=.dt.\n:)(.*)(?=\n..dt.)([\w\W])"), but it should be responses := RegExReplace(response, "([\w\W]*)(?<=.dt.\n:)(.*)(?=\n..dt.)([\w\W]*)", "$1$3"), meaning keep the first ($1) and the last ($3) capturing group, each of which matches an arbitrary number of any characters ([\w\W]*) around your initial phrase. The look-arounds are not capturing groups, so they are not counted. It seems you copied it wrong. I can't say for sure that it will work, though, since I don't have any code to test it on.
Edit: one thing I don't understand is how RegExMatch helps here. It just tells us if and where a substring is present, but surely doesn't replace anything?