I am looking for a regular expression to use in a find-and-replace across many expressions. Each expression looks like s360a__fieldname__c, and I need to find every instance where s360a__ is followed by __c.
The catch is that the match has to stay within one line, so that it does not pair a starting s360a__ with a __c that appears several lines below.
Here is an example of some of the xml I am changing.
<fields>
<fullName>s360a__AddressPreferredStreetAddressCity__c</fullName>
<deprecated>false</deprecated>
<externalId>false</externalId>
<label>Preferred Street City</label>
<length>255</length>
<required>false</required>
<trackFeedHistory>false</trackFeedHistory>
<trackHistory>false</trackHistory>
<type>Text</type>
<unique>false</unique>
</fields>
<fields>
<fullName>s360a__AddressPreferredStreetAddressCountry__c</fullName>
<deprecated>false</deprecated>
<externalId>false</externalId>
<label>Preferred Street Country</label>
<picklist>
You'd be better off using a parser combined with an XPath expression instead. Here's an example in PHP (it can easily be adapted for e.g. Python as well). The idea is to load the DOM, then use XPath functions to filter the elements (starts-with() and text() in this example):
<?php
$xml = '<fields>
<fullName>s360a__AddressPreferredStreetAddressCity__c</fullName>
<deprecated>false</deprecated>
<externalId>false</externalId>
<label>Preferred Street City</label>
<length>255</length>
<required>false</required>
<trackFeedHistory>false</trackFeedHistory>
<trackHistory>false</trackHistory>
<type>Text</type>
<unique>false</unique>
</fields>';
$dom = simplexml_load_string($xml);
// find everything where the text starts with 's360a_'
$fields = $dom->xpath("//*[starts-with(text(), 's360a_')]");
print_r($fields);
# s360a__AddressPreferredStreetAddressCity__c
The code checks whether the text starts with s360a_. To also check whether it ends with a specific string, you need to fiddle quite a bit, because the corresponding function ends-with() is not part of XPath 1.0, which is what PHP's XPath support implements.
# check if the node text starts and ends with a specific string
$fields = $dom->xpath("//*[starts-with(., 's360a_') and substring(text(), string-length(text()) - string-length('_c') +1) = '_c']");
?>
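For comparison, the same parser-based idea can be sketched in Python, as the answer suggests is possible. This is only a minimal sketch, not the answer's own code: it assumes the standard library's xml.etree.ElementTree, whose limited XPath support has no starts-with(), so the prefix and suffix checks are done with plain string methods instead. The sample XML is shortened and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

xml_text = """<fields>
<fullName>s360a__AddressPreferredStreetAddressCity__c</fullName>
<deprecated>false</deprecated>
<label>Preferred Street City</label>
</fields>"""

root = ET.fromstring(xml_text)

# Collect the text of every element whose text starts with the prefix and
# ends with the suffix. Each check looks at one element's text only, so a
# prefix in one element is never paired with a suffix several lines below.
matches = [
    el.text
    for el in root.iter()
    if el.text and el.text.startswith("s360a__") and el.text.endswith("__c")
]
print(matches)
```

From here a replacement could be done by assigning a new value to el.text for each matching element and serialising the tree again.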
I am working with Tcl and trying to set up a regex to get the data within my XML string. The code below has an example string of what I am dealing with; the regexp attempts to find the first closing bracket and keep the data until the next opening bracket, then place that into the variable number. Unfortunately, the output I am getting is "<RouteLabel>Hurdman<" instead of the expected "Hurdman". Any help would really be appreciated.
set direction(1) {<RouteLabel>Hurdman</RouteLabel>}
regexp {^.*>(.*)<} $direction(1) number
The issue here is not with the regex but with how you are using it.
The syntax you need is
regexp <PATTERN> <INPUT> <WHOLE_MATCH_VAR> <CAPTURE_1_VAR> ... <CAPTURE_n_VAR>
So, in your case, as you are not interested in the whole match, just put _ where the whole match is expected:
set direction(1) {<RouteLabel>Hurdman</RouteLabel>}
regexp {^.*>(.*)<} $direction(1) _ number
puts $number
This prints Hurdman.
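The difference between the whole match and the first capture group is not specific to Tcl. As a rough illustration only (Python's re module instead of Tcl's regexp, with illustrative variable names), the same pattern gives the two values the question and the answer are talking about:

```python
import re

direction = "<RouteLabel>Hurdman</RouteLabel>"
m = re.match(r"^.*>(.*)<", direction)

whole_match = m.group(0)  # corresponds to the first variable after the input in Tcl
captured = m.group(1)     # corresponds to the second variable (the capture group)

print(whole_match)
print(captured)
```

The question's code stored the whole match in number, which is why the surrounding tag showed up in the output.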
Crash course in tDOM for this exact task:
Get tDOM (note different spelling in package name):
% package require tdom
0.8.3
Create an empty document with a root element called foobar:
% set doc [dom createDocument foobar]
domDoc02569130
Get a fix on the root:
% set root [$doc documentElement]
domNode025692E0
Set up one of your XML strings:
% set direction(1) {<RouteLabel>Hurdman</RouteLabel>}
<RouteLabel>Hurdman</RouteLabel>
Add it to the DOM tree at the root:
% $root appendXML $direction(1)
domNode025692E0
Get the string you want by XPath expression:
% $root selectNodes {string(//RouteLabel/text())}
Hurdman
Or by querying the root (only works if there is only one single text node inserted at a time, otherwise you get them all concatenated):
% $root asText
Hurdman
If you want to clear the DOM tree from the root to make it ready for appending new strings without the old ones interfering:
% foreach node [$root childNodes] {$node delete}
But if you use XPath expressions you should be able to append any number of XML strings and still retrieve their content.
Once again:
package require tdom
set doc [dom createDocument foobar]
set root [$doc documentElement]
set direction(1) {<RouteLabel>Hurdman</RouteLabel>}
$root appendXML $direction(1)
$root selectNodes {string(//RouteLabel/text())}
# => Hurdman
Documentation:
tdom (package)
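For readers who do not have tDOM at hand, the same sequence of steps can be sketched with Python's standard xml.etree.ElementTree. This is an analogy under that assumption, not a tDOM feature; the root name foobar and the variable names simply mirror the walkthrough above.

```python
import xml.etree.ElementTree as ET

root = ET.Element("foobar")                  # empty tree with a root called foobar
direction = "<RouteLabel>Hurdman</RouteLabel>"
root.append(ET.fromstring(direction))        # add the XML string under the root

label = root.findtext(".//RouteLabel")       # retrieve the text by path
print(label)

# Clear the root so new strings can be appended without the old ones interfering
for child in list(root):
    root.remove(child)
```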
I am using AHK to automatically do something but it involves parsing XML. I am aware that it is a bad habit to parse XML with regex, however I pretty much have my regex working. The issue is AHK only has regexreplace as a method and I need something along the lines of regexkeep.
So what happens is the part I want to keep gets deleted and the part I want deleted gets kept.
Here is the code:
RegExReplace(response, "(?<=.dt.\n:)(.*)(?=\n..dt.)")
Is there a way to have everything but the match match? If not is there a better way to go about this?
Edit:
I have now attempted using the inverse regex and RegExMatch, but neither works in AHK. Both regexes work properly at regex101.com; however, neither works in AHK. RegExMatch returns 0 (meaning it found nothing) and the inverse regex returns nothing as well.
Here is a link to what is being searched by the regex: http://www.dictionaryapi.com/api/v1/references/collegiate/xml/Endoderm?key=17594df4-ff21-4045-88d9-a537fd4bcd61
Here is the entire code:
;responses := RegExReplace(response, "([\w\W])(?<=.dt.\n:)(.*)(?=\n..dt.)([\w\W])")
responses := RegExMatch(response, "(?<=.dt.\n:)(.*)(?=\n..dt.)")
MsgBox %responses%
Here is the "reversed" regex:
s).*dt.\n:|\n..dt.*
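The parts that were in the look-arounds need to be matched (consumed) and extended with the * quantifier, so that the match runs from the start of the text up to the value, and from the value up to the end. To match the newline with a dot, use singleline mode (the s) option).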
Debuggex Demo (where endings are \r\n)
However, there is a better option with RegExMatch OutputVar:
If any capturing subpatterns are present inside NeedleRegEx, their
matches are stored in a pseudo-array whose base name is OutputVar.
Use
RegExMatch(response, "(?<=.dt.\n:)(?<Val>.*)(?=\n..dt.)", Match)
Then, just refer to this value as MatchVal (the OutputVar name Match followed by the subpattern name Val).
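Both ideas from this answer (deleting everything except the value, and capturing the value in a named group) can be tried outside AHK. The following is a minimal sketch in Python's re module, which is an assumption of this example and behaves slightly differently from AHK's PCRE; the response string is a shortened, made-up stand-in for the real API output, with \n line endings.

```python
import re

response = "<def>\n<dt>\n:a tissue derived from this layer\n</dt>\n</def>"

# Approach 1: the "reversed" regex. (?s) lets the dot match newlines, so the
# two alternatives swallow everything before and after the value.
kept = re.sub(r"(?s).*dt.\n:|\n..dt.*", "", response)

# Approach 2: keep the look-arounds and read the value from a named group.
m = re.search(r"(?<=.dt.\n:)(?P<Val>.*)(?=\n..dt.)", response)
value = m.group("Val")

print(kept)
print(value)
```

If the real input uses \r\n line endings, as the demo note above mentions, the \n parts of both patterns would need adjusting.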
Here's a solution that should work, assuming you want to get whatever's between the <dt> tags. Make sure you're using the latest version of AHK if possible.
xml =
(
<entry_list version="1.0">
<entry id="endoderm">
<ew>endoderm</ew>
<subj>EM#AN</subj>
<hw>en*do*derm</hw>
<sound>
<wav>endode01.wav</wav>
<wpr>!en-du-+durm</wpr>
</sound>
<pr>ˈen-də-ˌdərm</pr>
<fl>noun</fl>
<et>French
<it>endoderme,</it>from
<it>end-</it>+ Greek
<it>derma</it>skin
<ma>derm-</ma>
</et>
<def>
<date>1861</date>
<dt>:the innermost of the three primary germ layers of an embryo that is
the source of the epithelium of the digestive tract and
its derivatives and of the lower respiratory tract</dt>
<sd>also</sd>
<dt>:a tissue derived from this layer</dt>
</def>
<uro>
<ure>en*do*der*mal</ure>
<sound>
<wav>endode02.wav</wav>
<wpr>+en-du-!dur-mul</wpr>
</sound>
<pr>ˌen-də-ˈdər-məl</pr>
<fl>adjective</fl>
</uro>
</entry>
</entry_list>
)
; Remove linebreaks and indentation whitespace
xml := RegExReplace(xml, "\n|\s{2,}|\t", "")
matchArray := []
matchPos := 1
; Keep looping until we're out of matches
while ( matchPos := RegExMatch(xml, "<dt>:([^<]*)", matchVar, matchPos + StrLen(matchVar1)) )
{
; Add matches to array
matchArray.insert(matchVar1)
}
; Show what's in the array
for each, value in matchArray {
; Index = Each, Output = Value
msgBox, Ittr: %each%, Value: %value%
}
Esc::ExitApp
You really shouldn't use RegEx for parsing XML, though. It's very simple to read XML in AHK using COM. I know it's outside the scope of your question, but here's a simple example using a COM object to read the same data:
xmlData =
(LTrim
<?xml version="1.0" encoding="utf-8" ?>
<entry_list version="1.0">
<entry id="endoderm"><ew>endoderm</ew><subj>EM#AN</subj><hw>en*do*derm</hw><sound><wav>endode01.wav</wav><wpr>!en-du-+durm</wpr></sound><pr>ˈen-də-ˌdərm</pr><fl>noun</fl><et>French <it>endoderme,</it> from <it>end-</it> + Greek <it>derma</it> skin <ma>derm-</ma></et><def><date>1861</date><dt>:the innermost of the three primary germ layers of an embryo that is the source of the epithelium of the digestive tract and its derivatives and of the lower respiratory tract</dt> <sd>also</sd> <dt>:a tissue derived from this layer</dt></def><uro><ure>en*do*der*mal</ure><sound><wav>endode02.wav</wav><wpr>+en-du-!dur-mul</wpr></sound> <pr>ˌen-də-ˈdər-məl</pr> <fl>adjective</fl></uro></entry>
</entry_list>
)
xmlObj := ComObjCreate("MSXML2.DOMDocument.6.0")
xmlObj.loadXML(xmlData)
nodes := xmlObj.selectSingleNode("/entry_list/entry/def").childNodes
for node in nodes {
if (node.nodeName == "dt")
msgBox % node.text
}
Esc::ExitApp
For more information on how to use this, see this post: http://www.autohotkey.com/board/topic/56987-com-object-reference-autohotkey-v11/?p=367838
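The parser-based approach is not tied to COM. As a hedged sketch of the same idea, here is how the dt nodes could be read with Python's standard xml.etree.ElementTree; the XML is an abbreviated version of the data above and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

xml_data = (
    '<entry_list version="1.0"><entry id="endoderm"><def>'
    "<date>1861</date>"
    "<dt>:the innermost of the three primary germ layers of an embryo</dt> "
    "<sd>also</sd> "
    "<dt>:a tissue derived from this layer</dt>"
    "</def></entry></entry_list>"
)

root = ET.fromstring(xml_data)

# Walk entry_list -> entry -> def and keep only the dt children
definitions = [dt.text for dt in root.findall("./entry/def/dt")]
for text in definitions:
    print(text)
```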
If the given phrase only occurs once, you can probably just fetch everything around it, can't you?
RegExReplace(response, "([\w\W]*)(?<=.dt.\n:)(.*)(?=\n..dt.)([\w\W]*)", "$1$3")
This looks like the easiest solution to me, though surely not the prettiest.
Update: in your question update, you quoted responses := RegExReplace(response, "([\w\W])(?<=.dt.\n:)(.*)(?=\n..dt.)([\w\W])"), but it should be responses := RegExReplace(response, "([\w\W]*)(?<=.dt.\n:)(.*)(?=\n..dt.)([\w\W]*)", "$1$3"), meaning keep the first ($1) and the last ($3) capture groups, which match an arbitrary number of any characters ([\w\W]*) around your initial phrase; the look-arounds do not capture, so they are not counted. It seems you copied it wrong. I can't say for sure that it will work, though, since I don't have any code to test it on.
Edit: one thing I don't understand is how RegExMatch helps here. It just tells us if and where a substring is present, but surely doesn't replace anything?
I'm trying to write a script to simplify the process of searching through a particular application's log files for specific information. So I thought maybe there's a way to convert them into an XML tree, and I'm off to a decent start. The problem is that the application log files are an absolute mess, if you ask me.
Some entries are simple
2014/04/09 11:27:03 INFO Some.code.function - Doing stuff
Ideally I'd like to turn the above into something like this
<Message>
<Date>2014/04/09</Date>
<Time>11:48:38</Time>
<Type>INFO</Type>
<Source>Some.code.function</Source>
<Sub>Doing stuff</Sub>
</Message>
Other entries are something like this where there's additional information and line breaks
2014/04/09 11:27:04 INFO Some.code.function - Something happens
changes:
this stuff happened
I'd like to turn this last chunk into something like the above, but add the additional info into a section
<Message>
<Date>2014/04/09</Date>
<Time>11:48:38</Time>
<Type>INFO</Type>
<Source>Some.code.function</Source>
<Sub>Doing stuff</Sub>
<details>changes:
this stuff happened</details>
</Message>
and then other messages, errors will be in the form of
2014/04/09 11:27:03 ERROR Some.code.function - Something didn't work right
Log Entry: LONGARSEDGUID
Error Code: E3145
Application: Name
Details:
message information etc etc and more line breaks, this part of the message may add up to an unknown number of lines before the next entry
I'd like to convert this last chunk like the two examples above, but adding XML nodes for log entry, error code, application and, again, details, like so:
<Message>
<Date>2014/04/09</Date>
<Time>11:48:38</Time>
<Type>ERROR </Type>
<Source>Some.code.function</Source>
<Sub>Something didn't work right</Sub>
<Entry>LONGARSEDGUID</Entry>
<Code>E3145</Code>
<Application>Name</Application>
<details>message information etc etc and more line breaks, this part of the message may add up to an unknown number of lines before the next entry</details>
</Message>
Now, I know that Select-String has a context option which would let me select a number of lines after the line I've filtered. The problem is that this isn't a constant number.
I'm thinking a regular expression would also allow me to select the paragraph chunk before the date string, but regular expressions are not a strong point of mine, and I thought there might be a better way because the one constant is that new entries start with a date string.
The idea, though, is to either break these up into XML or tables of sorts, and from there I'm hoping it will make the task of filtering out non-relevant or recurring messages a little easier.
I have a sample I just tossed on pastebin after removing/replacing a few bits of information for privacy reasons
http://pastebin.com/raw.php?i=M9iShyT2
Sorry this is kind of late, I got tied up with work for a bit there (darn work expecting me to be productive while on their dime). I ended up with something similar to Ansgar Wiechers' solution, but formatted things into objects and collected those into an array. It doesn't manage the XML that you added later, but this gives you a nice array of objects to work with for the other records. I'll explain the main RegEx line here, and I'll comment in-line where it's practical.
'(^\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \[\d+?] (\w+?) {1,2}(.+?) - (.+)$' is the RegEx that detects the start of a new record. I started to explain it, but there are probably better resources for you to learn RegEx than me explaining it to you. See this RegEx101.com link for a full breakdown and examples.
$Records=@() #Create empty array that we will populate with custom objects later
$Event = $Null #make sure nothing in $Event to give script a clean start
Get-Content 'C:\temp\test1.txt' | #Load file, and start looping through it line-by-line.
?{![string]::IsNullOrEmpty($_)}|% { #Filter out blank lines, and then perform the following on each line
if ($_ -match '(^\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \[\d+?] (\w+?) {1,2}(.+?) - (.+)$') { #New Record Detector line! If it finds this RegEx match, it means we're starting a new record.
if ($Event) { #If there's already a record in progress, add it to the Array
$Records+=$Event
}
$Event = New-Object PSObject -Property @{ #Create a custom PSObject object with these properties that we just got from that RegEx match
DateStamp = [datetime](get-date $Matches[1]) #We convert the date/time stamp into an actual DateTime object. That way sorting works better, and you can compare it to real dates if needed.
Type = $Matches[2]
Source = $Matches[3]
Message = $Matches[4]}
Ok, little pause for the cause here. $Matches isn't defined by me, so why am I referencing it? When PowerShell gets matches from a RegEx expression it automagically stores the resulting matches in $Matches. So all the groups that we just matched in parentheses become $Matches[1], $Matches[2], and so on. Yes, it's an array, and there is a $Matches[0], but that is the entire text that was matched, not just the groups that matched. We now return you to your regularly scheduled script...
} else { #End of the 'New Record' section. If it's not a new record it does the following
if($_ -match "^((?:[^ ^\[])(?:\w| |\.)+?):(.*)$"){
RegEx match again. It starts off by stating that this has to be the beginning of the string with the caret character (^). Then it says (in a non-capturing group noted by the (?:<stuff>) format, which really for my purposes just means it won't show up in $Matches) [^ \[]; that means that the next character can not be a space or opening bracket (escaped with a \), just to speed things up and skip those lines for this check. If you have things in brackets [] and the first character is a caret it means 'don't match anything in these brackets'.
I actually just changed this next part to include periods, and used \w instead of [a-zA-Z0-9] because it's essentially the same thing but shorter. \w is a "word character" in RegEx, and includes letters, numbers, and the underscore. I'm not sure why the underscore is considered part of a word, but I don't make the rules I just play the game. I was using [a-zA-Z0-9] which matches anything between 'a' and 'z' (lowercase), anything between 'A' and 'Z' (uppercase), and anything between '0' and '9'. At the risk of including the underscore character \w is a lot shorter and simpler.
Then the actual capturing part of this RegEx. This has 2 groups: the first is letters, numbers, underscores, spaces, and periods (escaped with a \ because '.' on its own matches any character). Then a colon. Then a second group that is everything else until the end of the line.
$Field = $Matches[1] #Everything before the colon is the name of the field
$Value = $Matches[2].trim() #everything after the colon is the data in that field
$Event | Add-Member $Field $Value #Add the Field to $Event as a NoteProperty, with a value of $Value. Those two are actually positional parameters for Add-Member, so we don't have to go and specify what kind of member, specify what the name is, and what the value is. Just Add-Member <[string]name> <value can be a string, array, yeti, whatever... it's not picky>
} else { #End of New Field for current record. If it didn't match the field regex, then this line is just more data for the last field, so don't change the field, just append the line to it:
$Event.$Field += if(![string]::isNullOrEmpty($Event.$Field)){"`r`n$_"}else{$_}}
This is a long explanation for a fairly short bit of code. Really all it does is add data to the field! It has an inverted (prefixed with !) If check to see whether the current field already has any data or is currently Null or Empty. If it has data, it adds a new line and then adds the current line. If it doesn't have any data, it skips the new line bit and just adds the data.
}
}
}
$Records+=$Event #Adds the last event to the array of records.
Sorry, I'm not very good with XML. But at least this gets you clean records.
Edit: Ok, the code is annotated now, hopefully everything is explained well enough. If something is still confusing perhaps I can refer you to a site that explains better than I can. I ran the above against your sample input on Pastebin.
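The field-detection step described above (a name, a colon, then the value, with later lines appended to the most recent field) can be tried in isolation. Here is a minimal sketch in Python rather than PowerShell; the regex is a simplified form of the one in the script, and the function name parse_fields and the sample lines are made up for illustration.

```python
import re

# A field line: first character is not a space or '[', then word characters,
# spaces or periods, then a colon and the value.
FIELD = re.compile(r"^([^ \[][\w .]+?):(.*)$")

def parse_fields(lines):
    """Return a dict of field name -> value, appending continuation lines."""
    fields = {}
    current = None
    for line in lines:
        m = FIELD.match(line)
        if m:
            current = m.group(1)
            fields[current] = m.group(2).strip()
        elif current is not None:
            # More data for the last field: add a newline only if it has data
            fields[current] = f"{fields[current]}\n{line}" if fields[current] else line
    return fields

body = [
    "Log Entry: LONGARSEDGUID",
    "Error Code: E3145",
    "Application: Name",
    "Details:",
    "message information etc etc",
    "and more line breaks",
]
result = parse_fields(body)
print(result)
```

As in the PowerShell version, a continuation line that happens to contain a colon would be treated as a new field.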
One possible way to deal with such files is to process them line by line. Each log entry starts with a timestamp and ends when the next line starting with a timestamp appears, so you could do something like this:
Get-Content 'C:\path\to\your.log' | % {
if ($_ -match '^\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}') {
if ($logRecord) {
# If a current log record exists, it is complete now, so it can be added
# to your XML or whatever, e.g.:
$logRecord -match '^(\d{4}/\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}) (\S+) ...'
$message = $xml.CreateElement('Message')
$date = $xml.CreateElement('Date')
$date.InnerText = $matches[1]
$message.AppendChild($date)
$time = $xml.CreateElement('Time')
$time.InnerText = $matches[2]
$message.AppendChild($time)
$type = $xml.CreateElement('Type')
$type.InnerText = $matches[3]
$message.AppendChild($type)
...
$xml.SelectSingleNode('...').AppendChild($message)
}
$logRecord = $_ # start new record
} else {
$logRecord += "`r`n$_" # append to current record
}
}
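To make the line-by-line idea concrete end to end, here is a hedged sketch in Python instead of PowerShell. It assumes the simplified header format shown in the question (date, time, level, source, " - ", message); the root element name Log, the function names and the sample lines are invented for the example, and the extra fields of error entries are simply kept together as details.

```python
import re
import xml.etree.ElementTree as ET

# Header line: date, time, level, source, then " - " and the short message.
HEADER = re.compile(r"^(\d{4}/\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}) (\S+)\s+(.+?) - (.*)$")

def split_records(lines):
    """Group lines into (header_fields, continuation_lines) pairs."""
    records = []
    for line in lines:
        m = HEADER.match(line)
        if m:
            records.append((m.groups(), []))      # a timestamp starts a new record
        elif records and line.strip():
            records[-1][1].append(line.strip())   # anything else extends the current one
    return records

def to_xml(records):
    """Build one <Message> per record under a hypothetical <Log> root."""
    root = ET.Element("Log")
    for (date, time, level, source, sub), extra in records:
        msg = ET.SubElement(root, "Message")
        for tag, text in (("Date", date), ("Time", time), ("Type", level),
                          ("Source", source), ("Sub", sub)):
            ET.SubElement(msg, tag).text = text
        if extra:
            ET.SubElement(msg, "details").text = "\n".join(extra)
    return root

sample_lines = [
    "2014/04/09 11:27:03 INFO Some.code.function - Doing stuff",
    "2014/04/09 11:27:04 INFO Some.code.function - Something happens",
    "changes:",
    "this stuff happened",
]
root = to_xml(split_records(sample_lines))
print(ET.tostring(root, encoding="unicode"))
```

Because a record is only closed when the next timestamp appears, the number of continuation lines does not need to be known in advance.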
I'm a beginner in regular expressions and I want to cut out some text placed between two other words. I'm using Qt to do it. An example:
<li class="wx-feels">
Feels like <i><span class="wx-value" itemprop="feels-like-temperature-fahrenheit">55</span>°</i>
</li>
I want to get
Feels like <i><span class="wx-value" itemprop="feels-like-temperature-fahrenheit">55</span>°
from the code above, especially the number 55. My idea was to cut the whole line out of the text first and then search it for numbers, but I cannot extract it from the whole text.
I typed something like this:
QRegExp rx("(Feels like <i><span class=\"wx-value\" itemprop=\"feels-like-temperature-fahrenheit\">)[0-9]{1,3}(</span>°</i>)");
QStringList list;
list = all.split(rx);
Where all is the whole text, but the list contains only the substrings I didn't want. Is there a possibility to split a QString into three pieces?
First - text at the beginning (which I don't want)
Second - wanted text
Third - rest of text?
Description
This regex will collect the inner string within the li tags where the li tag has a class of wx-feels, it'll also capture the numeric value inside the span tag.
<li\b[^>]*\bclass=(["'])wx-feels\1[^>]*?>(.*?\bitemprop=(['"])feels-like-temperature-fahrenheit\3[^>]*>(\d+).*?)<\/li>
Groups
Group 0 gets the entire string including the open and close LI tags
Group 1 gets the open quote for the LI class attribute. This allows us to find the correct close quote after the value
Group 2 gets the string directly inside the LI tag
Group 3 gets the open quote for the itemprop attribute
Group 4 gets the digits from the span inner text
Example
This PHP example is simply to show how the regex works.
<?php
$sourcestring="<li class=\"wx-feels\">
Feels like <i><span class=\"wx-value\" itemprop=\"feels-like-temperature-fahrenheit\">55</span>°</i>
</li>";
preg_match('/<li\b[^>]*\bclass=(["\'])wx-feels\1[^>]*?>(.*?\bitemprop=([\'"])feels-like-temperature-fahrenheit\3[^>]*>(\d+).*?)<\/li>/ims',$sourcestring,$matches);
echo "<pre>".print_r($matches,true);
?>
$matches Array:
(
[0] => <li class="wx-feels">
Feels like <i><span class="wx-value" itemprop="feels-like-temperature-fahrenheit">55</span>°</i>
</li>
[1] => "
[2] =>
Feels like <i><span class="wx-value" itemprop="feels-like-temperature-fahrenheit">55</span>°</i>
[3] => "
[4] => 55
)
Disclaimer
Parsing html with a regex can be problematic because of the high number of edge cases. If you are in control of the input text or if it's always as basic as your sample, then you should have no problem.
If QT has one, I recommend using an HTML parsing tool to capture this data.
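To illustrate what the parser-based alternative looks like, here is a small sketch using Python's standard html.parser module. It is only an illustration of the approach, not Qt code; the class name FeelsLikeParser is made up, and the sample HTML is the snippet from the question.

```python
from html.parser import HTMLParser

class FeelsLikeParser(HTMLParser):
    """Collect the text of spans whose itemprop is the feels-like temperature."""

    def __init__(self):
        super().__init__()
        self._in_target = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("itemprop") == "feels-like-temperature-fahrenheit":
            self._in_target = True

    def handle_data(self, data):
        if self._in_target:
            self.values.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_target = False

html = (
    '<li class="wx-feels">\n'
    'Feels like <i><span class="wx-value" '
    'itemprop="feels-like-temperature-fahrenheit">55</span>°</i>\n'
    "</li>"
)
parser = FeelsLikeParser()
parser.feed(html)
values = parser.values
print(values)
```

Matching on the attribute value rather than on the surrounding markup means the order of attributes and the quoting style no longer matter.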
I have some output from which I'd like to fetch the value of the CMEngine node, i.e. everything inside the CMEngine node. Please help me with a regex. I already have Java code in place that uses the regex, so I just need the regex. Thanks
My XML
<General>
<LanguageID>en_US</LanguageID>
<CMEngine>
<CMServer/> <!-- starting here -->
<DaysToKeepHistory>4</DaysToKeepHistory>
<PreprocessorMaxBuf>5000000</PreprocessorMaxBuf>
<ServiceRefreshInterval>30</ServiceRefreshInterval>
<ReuseMemoryBetweenRequests>true</ReuseMemoryBetweenRequests>
<Trace Enabled="false">
<ActiveCategories>
<Category>ENVIRONMENT</Category>
<Category>EXEC</Category>
<Category>EXTERNALS</Category>
<Category>FILESYSTEM</Category>
<Category>INPUT_DOC</Category>
<Category>INTERFACES</Category>
<Category>NETWORKING</Category>
<Category>OUTPUT_DOC</Category>
<Category>PREPROCESSOR_INPUT</Category>
<Category>REQUEST</Category>
<Category>SYSTEMRESOURCES</Category>
<Category>VIEWIO</Category>
</ActiveCategories>
<SeverityLevel>ERROR</SeverityLevel>
<MessageInfo>
<ProcessAndThreadIds>true</ProcessAndThreadIds>
<TimeStamp>true</TimeStamp>
</MessageInfo>
<TraceFile>
<FileName>CMEngine_log.txt</FileName>
<MaxFileSize>1000000</MaxFileSize>
<RecyclingMethod>Restart</RecyclingMethod>
</TraceFile>
</Trace>
<JVMLocation>C:\Informatica\9.1.0\java\jre\bin\server</JVMLocation>
<JVMInitParamList/> <!-- Ending here -->
</CMEngine>
</General>
If it has to be a regex, and if there is only one CMEngine tag per string:
Pattern regex = Pattern.compile("(?<=<CMEngine>)(?:(?!</CMEngine>).)*", Pattern.DOTALL);
Matcher regexMatcher = regex.matcher(subjectString);
if (regexMatcher.find()) {
ResultString = regexMatcher.group();
}
Since that output appears to be machine-generated and is unlikely to contain comments or other stuff that might confuse the regex, this should work quite reliably.
It starts at a position right after a <CMEngine> tag: (?<=<CMEngine>) and matches all characters until the next </CMEngine> tag: (?:(?!</CMEngine>).)*.
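The pattern itself is not Java-specific. As a quick way to see what it matches, here is the same expression tried in Python's re module (an assumption of this sketch; re.DOTALL plays the role of Pattern.DOTALL), run against a shortened, made-up version of the XML above.

```python
import re

subject = (
    "<General>\n"
    "<LanguageID>en_US</LanguageID>\n"
    "<CMEngine>\n"
    "<CMServer/>\n"
    "<DaysToKeepHistory>4</DaysToKeepHistory>\n"
    "<JVMInitParamList/>\n"
    "</CMEngine>\n"
    "</General>"
)

# Look-behind for the opening tag, then consume one character at a time as
# long as the closing tag does not start at the current position.
regex = re.compile(r"(?<=<CMEngine>)(?:(?!</CMEngine>).)*", re.DOTALL)
m = regex.search(subject)
result = m.group() if m else None
print(result)
```

The match includes the newline right after <CMEngine> and the one right before </CMEngine>, so trimming the result may be desirable.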