regex match all but not specific string - regex

I have some xml file and want to remove everything but a specific string.
There are quite other similar questions at StackOverflow but none of them works for my file and after a few hours of trying different regex I would like to ask for a help.
so far the closest regex which succeeded partly but not completely is:
^((?!<query.*<\/query>).)*$
a sample of the xml file:
<search>
<query>index=_internal [`set_local_host`] source=*license_usage.log* type="Usage" | eval h=if(len(h)=0 OR isnull(h),"(SQUASHED)",h) | eval s=if(len(s)=0 OR isnull(s),"(SQUASHED)",s) | eval idx=if(len(idx)=0 OR isnull(idx),"(UNKNOWN)",idx) | bin _time span=1d | stats sum(b) as b by _time, pool, s, st, h, idx | timechart span=1d sum(b) AS volumeB by st fixedrange=false | join type=outer _time [search index=_internal [`set_local_host`] source=*license_usage.log* type="RolloverSummary" | eval _time=_time - 43200 | bin _time span=1d | stats latest(stacksz) AS "stack size" by _time] | fields - _timediff | foreach * [eval <<FIELD>>=round('<<FIELD>>'/1024/1024/1024, 3)] </query>
<earliest>$central_time.earliest$</earliest>
<latest>$central_time.latest$</latest>
<sampleRatio>1</sampleRatio>
</search>
<option name="charting.axisLabelsX.majorLabelStyle.overflowMode">ellipsisNone</option>
<option name="charting.chart.stackMode">stacked</option>
<option name="charting.chart.style">shiny</option>
<option name="trellis.scales.shared">1</option>
<option name="trellis.size">medium</option>
</chart>
</panel>
</row>
<row>
<panel>
<chart>
<search>
<query>index=_introspection sourcetype=splunk_resource_usage component=hostwide saxsa
| eval tcu = ('data.cpu_system_pct' + 'data.cpu_user_pct')
| timechart limit=0 span=1d avg(tcu) by host</query>
<earliest>$central_time.earliest$</earliest>
<latest>$central_time.latest$</latest>
<sampleRatio>1</sampleRatio>
</search>
I use regex101 so the sample can be paste there in order to see why the rex is working only partly. To tell shortly , it doesn't match the first occurrence of but it matches the second occurrence. What I expect is that the regex does not match any of the occurrence of <query>.*</query>
fx. I want to match anything but not the following string:
<query>anything between(can be multiple lines*)</query>

Sorry for the delayed response. Partially because I'm at work, partially due to the fact that this type of situation was actually rather new to me (I love regular expressions, but I haven't had any exposure to this situation so it's a learning experience for both of us), but I think I may have a solution that you're looking for.
What I basically tried to do was use a bit of recursion in the expression with a combination of a negative look-ahead and negative lookbehind to ensure I'm not capturing any <query> tags
<(?!query).*(?<!<\/query)(?R)*>
< - Matches the literal character < to match the beginning of an opening tag
(?!query) - matches the all of the text opening tags proceeding < excluding query
.* - Matches all of the characters (including the > of the opening tag) up until:
(?<!<\/query) which is a negative look-behind assertion to ensure I'm not getting text from the .* with anything before a </query closing tag (note the missing >).
(?R)* - This is the part that took me a while to wrap my mind around so I may possibly butcher this explanation because I haven't used this before. Whatever the pattern is BEFORE this pattern, it recurses the entire regular expression from the current string position. Now I know that sounded confusing because it was (still is slightly) confusing to me too. But, I believe once <(?!query).*(?<!<\/query) finds it's first match, which would be <search>, from there, it repeats that entire pattern from the end of <search>. Thus, it will then check for an opening <query and a closing </query tag. And if it finds it, it skips it.
> - Matches the literal closing tag >, in which if the XML is written correctly, it should match <search's closing tag.
Test of your example with the following regex here
I sincerely hope this helps!

Related

RegEx to format Wikipedia's infoboxes code [SOLVED]

I am a contributor to Wikipedia and I would like to make a script with AutoHotKey that could format the wikicode of infoboxes and other similar templates.
Infoboxes are templates that displays a box on the side of articles and shows the values of the parameters entered (they are numerous and they differ in number, lenght and type of characters used depending on the infobox).
Parameters are always preceded by a pipe (|) and end with an equal sign (=). On rare occasions, multiple parameters can be put on the same line, but I can sort this manually before running the script.
A typical infobox will be like this:
{{Infobox XYZ
| first parameter = foo
| second_parameter =
| 3rd parameter = bar
| 4th = bazzzzz
| 5th =
| etc. =
}}
But sometime, (lazy) contributors put them like this:
{{Infobox XYZ
|first parameter=foo
|second_parameter=
|3rd parameter=bar
|4th=bazzzzz
|5th=
|etc.=
}}
Which isn't very easy to read and modify.
I would like to know if it is possible to make a regex (or a serie of regexes) that would transform the second example into the first.
The lines should start with a space, then a pipe, then another space, then the parameter name, then any number of spaces (to match the other lines lenght), then an equal sign, then another space, and if present, the parameter value.
I try some things using multiple capturing groups, but I'm going nowhere... (I'm even ashamed to show my tries as they really don't work).
Would someone have an idea on how to make it work?
Thank you for your time.
The lines should start with a space, then a pipe, then another space, then the parameter name, then a space, then an equal sign, then another space, and if present, the parameter value.
First the selection, it's relatively trivial:
^\s*\|\s*([^=]*?)\s*=(.*)$
Then the replacement, literally your description of what you want (note the space at the beginning):
| $1 = $2
See it in action here.
#Blindy:
The best code I have found so far is the following : https://regex101.com/r/GunrUg/1
The problem is it doesn't align the equal signs vertically...
I got an answer on AutoHotKey forums:
^i::
out := ""
Send, ^x
regex := "O)\s*\|\s*(.*?)\s*=\s*(.*)", width := 1
Loop, Parse, Clipboard, `n, `r
If RegExMatch(A_LoopField, regex, _)
width := Max(width, StrLen(_[1]))
Loop, Parse, Clipboard, `n, `r
If RegExMatch(A_LoopField, regex, _)
out .= Format(" | {:-" width "} = {2}", _[1],_[2]) "`n"
else
out .= A_LoopField "`n"
Clipboard := out
Send, ^v
Return
With this script, pressing Ctrl+i formats the infobox code just right (I guess a simple regex isn't enough to do the job).

How to replace a period within string and not in Numeric using singlestore (MemSQL) DB REGEXP_REPLACE function

I have a scenario wherein I want to replace a period when its surrounded by Alphabets and not when surrounded by Numbers. I figured out a Regular Expression pattern that can identify only the periods in Key names but the pattern is not working in SQL
SELECT REGEXP_REPLACE("Amount.fee:0.75,Amount.tot:645.55","(?<!\d)(\.)(?!\d)","_","ig");
Expected output: Amount_fee:0.75,Amount_tot:645.55
Note, I am trying this because, In MemSQL I couldn't access JSON key when it has period in it.
Also verified the pattern "(?<!\d)(.)(?!\d)" using https://coding.tools/regex-replace and it working fine. But, SQL is not working. Am using MemSQL 7.1.9 and POSIX Enhanced Regular expression are supposed to be work. Any help is much appreciated.
Since it looks like you are trying to workaround accessing a JSON key with a period, I will show you how to do that.
This can be done by either surrounding the json key name with backtics while using the shorthand json extract syntax:
select col::%`Amount.fee` from (select '{"Amount.fee":0.75,"Amount.tot":645.55}' col);
+--------------------+
| col::%`Amount.fee` |
+--------------------+
| 0.75 |
+--------------------+
or by using the json_extract_ builtins directly:
select json_extract_double('{"Amount.fee":0.75,"Amount.tot":645.55}', 'Amount.fee');
+------------------------------------------------------------------------------+
| json_extract_double('{"Amount.fee":0.75,"Amount.tot":645.55}', 'Amount.fee') |
+------------------------------------------------------------------------------+
| 0.75 |
+------------------------------------------------------------------------------+
Assuming you only want to target dots that are in between two non digit characters, where the dot is not the first or last character in the string, you may match on ([^\d])\.([^\d]) and replace with \1_\2:
SELECT REGEXP_REPLACE("Amount.fee:0.75,Amount.tot:645.55", "([^\d])\.([^\d])", "\1_\2", "ig");
Here is a regex demo showing that the replacement is working. Note that you might have to use $1_$2 instead of \1_\2 as the replacement, depending on the regex flavor of your SQL tool.

Vim search replace regex + incremental function

I'm currently stuck in vim trying to find a search/replace oneliner to replace a number with another + increment for each new iteration = when it finds a new match.
I'm working in xml svg code to batch process files Inkscape cannot process the text (plain svg multiline text bug).
<tspan
x="938.91315"
y="783.20563"
id="tspan13017"
style="font-weight:bold">Text1:</tspan><tspan
x="938.91315"
y="833.20563"
id="tspan13019">Text2</tspan><tspan
x="938.91315"
y="883.20563"
id="tspan13021">✗Text3</tspan>
etc.
So what I want to do is to change that to this result:
<tspan
x="938.91315"
y="200"
id="tspan13017"
style="font-weight:bold">Text1:</tspan><tspan
x="938.91315"
y="240"
id="tspan13019">Text2</tspan><tspan
x="938.91315"
y="280"
id="tspan13021">✗Text3</tspan>
etc.
So I duckducked and found the best vim tips resource from zzapper, but I cannot understand it:
convert yy to 10,11,12 :
:let i=10 | ’a,’bg/Abc/s/yy/\=i/ |let i=i+1
I then adapted it to something I can understand and should work in my home vim:
:let i=300 | 327,$ smagic ! y=\"[0-9]\+.[0-9]\+\" ! \=i ! g | let i=i+50
But somehow it doesn't loop, all I get is that:
<tspan
x="938.91315"
300
id="tspan13017"
style="font-weight:bold">Text1:</tspan><tspan
x="938.91315"
300
id="tspan13019">Text2</tspan><tspan
x="938.91315"
300
id="tspan13021">✗Text3</tspan>
So here I'm seriously stuck. I cannot figure out what doesn't work :
My adaptation of the original formula ?
My data layout ?
My .vimrc ?
I'll try to find other resources by myself, but on that kind of trick they are pretty rare I find, and like in zzapper tips, not always delivered with a manual.
One way to fix it:
:let i = 300 | g/\m\<y=/ s/\my="\zs\d\+.\d\+\ze"/\=i/ | let i += 50
Translation:
let i = 300 - hopefully obvious
g/\m\<y=/ ... - for all lines matching \m\<y=, apply the following command; the "following command" is s/.../.../ | let ...; the regexp:
\m - "magic" regexp
\< - match only at word boundary
s/\my="\zs\d\+.\d\+\ze"/\=i/ - substitute; the regexp:
\m - "magic" regexp
\d\+ - one or more digits
\zs...\ze - replace only what is matched between these points
\=i - replace with the value of expression i
let i += 50 - hopefully obvious again.
For more information: :help :g, :help \zs, :help \ze, help s/\\=.
Just to add my take as a memo (wrote this as an answer as an EDIT didn't seem right). Sorry it is not the best vim scripting here but it enables me to understand (I'm not a vim specialist).
:let i=300 | 323,$g/y="/smagic![0-9]\+.[0-9]\+!\=i!g | let i+=50
Assign the initial value to i :
:let i=300
Start :global (:g) function from line 323 to the end of file:
323,$g
Pattern to match for executing the commands (litteral text here)
y="
Substitution with magic on (magic meaning special characters "enabled")
smagic
Pattern to find
[0-9]\+.[0-9]\+
(numbers between 0-9 one or more times, a litteral dot, the numbers again)
Replaced with
\=i
\= tells vim to evaluate i not to write it litterally
Increment i with 50 for the next iteration
let i+=50
This part is still in the g function.
The separators, in bold:
| are the separators between the different functions
/ are the separators in the :g function
! are the separators in the smagic function

Reverse a regex?

I am using AHK to automatically do something but it involves parsing XML. I am aware that it is a bad habit to parse XML with regex, however I pretty much have my regex working. The issue is AHK only has regexreplace as a method and I need something along the lines of regexkeep.
So what happens is the part I want to keep gets deleted and the part I want deleted gets kept.
Here is the code:
RegExReplace(response, "(?<=.dt.\n:)(.*)(?=\n..dt.)")
Is there a way to have everything but the match match? If not is there a better way to go about this?
Edit:
I have no attempted using the inverse regex and regexmatch but neither work in AHK. Both regexs work properly at regex101.com however neither work in AHK. The regexmath returns 0 (meaning it found nothing) and the inverse regex returns nothing as well.
Here is a link to what is being searched by the regex:http://www.dictionaryapi.com/api/v1/references/collegiate/xml/Endoderm?key=17594df4-ff21-4045-88d9-a537fd4bcd61
Here is the entire code:
;responses := RegExReplace(response, "([\w\W])(?<=.dt.\n:)(.*)(?=\n..dt.)([\w\W])")
responses := RegExMatch(response, "(?<=.dt.\n:)(.*)(?=\n..dt.)")
MsgBox %responses%
Here is the "reversed" regex:
s).*dt.\n:|\n..dt.*
The parts in the look-arounds need matching with the * quantifier to match from the start and up to the end. To match the newline with a dot, use singleline mode.
Debuggex Demo (where endings are \r\n)
However, there is a better option with RegExMatch OutputVar:
If any capturing subpatterns are present inside NeedleRegEx, their
matches are stored in a pseudo-array whose base name is OutputVar.
Use
RegExMatch(response, "(?<=.dt.\n:)(?<Val>.*)(?=\n..dt.)")
Then, just refer to this value as MatchVal.
Here's a solution that should work, assuming you want to get whatever's between the <dt> tags. Make sure you're using the latest version of AHK if possible.
xml =
(
<entry_list version="1.0">
<entry id="endoderm">
<ew>endoderm</ew>
<subj>EM#AN</subj>
<hw>en*do*derm</hw>
<sound>
<wav>endode01.wav</wav>
<wpr>!en-du-+durm</wpr>
</sound>
<pr>ˈen-də-ˌdərm</pr>
<fl>noun</fl>
<et>French
<it>endoderme,</it>from
<it>end-</it>+ Greek
<it>derma</it>skin
<ma>derm-</ma>
</et>
<def>
<date>1861</date>
<dt>:the innermost of the three primary germ layers of an embryo that is
the source of the epithelium of the digestive tract and
its derivatives and of the lower respiratory tract</dt>
<sd>also</sd>
<dt>:a tissue derived from this layer</dt>
</def>
<uro>
<ure>en*do*der*mal</ure>
<sound>
<wav>endode02.wav</wav>
<wpr>+en-du-!dur-mul</wpr>
</sound>
<pr>ˌen-də-ˈdər-məl</pr>
<fl>adjective</fl>
</uro>
</entry>
</entry_list>
)
; Remove linebreaks and indentation whitespace
xml := RegExReplace(xml, "\n|\s{2,}|\t", "")
matchArray := []
matchPos := 1
; Keep looping until we're out of matches
while ( matchPos := RegExMatch(xml, "<dt>:([^<]*)", matchVar, matchPos + StrLen(matchVar1)) )
{
; Add matches to array
matchArray.insert(matchVar1)
}
; Show what's in the array
for each, value in matchArray {
; Index = Each, Output = Value
msgBox, Ittr: %each%, Value: %value%
}
Esc::ExitApp
You really shouldn't use RegEx for parsing XML though, it's very simple to read XML in AHK using COM, I know it's outside the scope of your question, but here's a simple example using a COM object to read the same data:
xmlData =
(LTrim
<?xml version="1.0" encoding="utf-8" ?>
<entry_list version="1.0">
<entry id="endoderm"><ew>endoderm</ew><subj>EM#AN</subj><hw>en*do*derm</hw><sound><wav>endode01.wav</wav><wpr>!en-du-+durm</wpr></sound><pr>ˈen-də-ˌdərm</pr><fl>noun</fl><et>French <it>endoderme,</it> from <it>end-</it> + Greek <it>derma</it> skin <ma>derm-</ma></et><def><date>1861</date><dt>:the innermost of the three primary germ layers of an embryo that is the source of the epithelium of the digestive tract and its derivatives and of the lower respiratory tract</dt> <sd>also</sd> <dt>:a tissue derived from this layer</dt></def><uro><ure>en*do*der*mal</ure><sound><wav>endode02.wav</wav><wpr>+en-du-!dur-mul</wpr></sound> <pr>ˌen-də-ˈdər-məl</pr> <fl>adjective</fl></uro></entry>
</entry_list>
)
xmlObj := ComObjCreate("MSXML2.DOMDocument.6.0")
xmlObj.loadXML(xmlData)
nodes := xmlObj.selectSingleNode("/entry_list/entry/def").childNodes
for node in nodes {
if (node.nodeName == "dt")
msgBox % node.text
}
Esc::ExitApp
For more information on how to use this, see this post: http://www.autohotkey.com/board/topic/56987-com-object-reference-autohotkey-v11/?p=367838
If the given phrase only occurs once, you can probably just fetch everything around it, can't you?
RegExReplace(response, "([\w\W]*)(?<=.dt.\n:)(.*)(?=\n..dt.)([\w\W]*)", "$1$5")
looks like the easiest solution to me, but surely not the prettiest
update: in your question update, you quoted responses := RegExReplace(response, "([\w\W])(?<=.dt.\n:)(.*)(?=\n..dt.)([\w\W])"), but it should be responses := RegExReplace(response, "([\w\W]*)(?<=.dt.\n:)(.*)(?=\n..dt.)([\w\W]*)", "$1$5") - meaming keep the first ($1) and the last ($5) key of braces, which include an arbitrary amount of any characters ([\w\W]*) around your initial phrase. seems you copied it wrong. I can't say that it will work for sure tho since I don't have any code to test it on
edit - one thing I don't understand - how does regexMatch help here? it just tells us IF and WHERE there is a substring present, but surely doesn't replace anything?

Regexp: Keyword followed by value to extract

I had this question a couple of times before, and I still couldn't find a good answer..
In my current problem, I have a console program output (string) that looks like this:
Number of assemblies processed = 1200
Number of assemblies uninstalled = 1197
Number of failures = 3
Now I want to extract those numbers and to check if there were failures. (That's a gacutil.exe output, btw.) In other words, I want to match any number [0-9]+ in the string that is preceded by 'failures = '.
How would I do that? I want to get the number only. Of course I can match the whole thing like /failures = [0-9]+/ .. and then trim the first characters with length("failures = ") or something like that. The point is, I don't want to do that, it's a lame workaround.
Because it's odd; if my pattern-to-match-but-not-into-output ("failures = ") comes after the thing i want to extract ([0-9]+), there is a way to do it:
pattern(?=expression)
To show the absurdity of this, if the whole file was processed backwards, I could use:
[0-9]+(?= = seruliaf)
... so, is there no forward-way? :T
pattern(?=expression) is a regex positive lookahead and what you are looking for is a regex positive lookbehind that goes like this (?<=expression)pattern but this feature is not supported by all flavors of regex. It depends which language you are using.
more infos at regular-expressions.info for comparison of Lookaround feature scroll down 2/3 on this page.
If your console output does actually look like that throughout, try splitting the string on "=" when the word "failure" is found, then get the last element (or the 2nd element). You did not say what your language is, but any decent language with string splitting capability would do the job. For example
gacutil.exe.... | ruby -F"=" -ane "print $F[-1] if /failure/"