RegEx to format Wikipedia's infoboxes code [SOLVED]

RegEx to format Wikipedia's infoboxes code [SOLVED] - regex

I am a contributor to Wikipedia and I would like to make a script with AutoHotKey that could format the wikicode of infoboxes and other similar templates.
Infoboxes are templates that displays a box on the side of articles and shows the values of the parameters entered (they are numerous and they differ in number, lenght and type of characters used depending on the infobox).
Parameters are always preceded by a pipe (|) and end with an equal sign (=). On rare occasions, multiple parameters can be put on the same line, but I can sort this manually before running the script.
A typical infobox will be like this:
{{Infobox XYZ
| first parameter = foo
| second_parameter =
| 3rd parameter = bar
| 4th = bazzzzz
| 5th =
| etc. =
}}
But sometime, (lazy) contributors put them like this:
{{Infobox XYZ
|first parameter=foo
|second_parameter=
|3rd parameter=bar
|4th=bazzzzz
|5th=
|etc.=
}}
Which isn't very easy to read and modify.
I would like to know if it is possible to make a regex (or a serie of regexes) that would transform the second example into the first.
The lines should start with a space, then a pipe, then another space, then the parameter name, then any number of spaces (to match the other lines lenght), then an equal sign, then another space, and if present, the parameter value.
I try some things using multiple capturing groups, but I'm going nowhere... (I'm even ashamed to show my tries as they really don't work).
Would someone have an idea on how to make it work?
Thank you for your time.

The lines should start with a space, then a pipe, then another space, then the parameter name, then a space, then an equal sign, then another space, and if present, the parameter value.
First the selection, it's relatively trivial:
^\s*\|\s*([^=]*?)\s*=(.*)$
Then the replacement, literally your description of what you want (note the space at the beginning):
| $1 = $2
See it in action here.

#Blindy:
The best code I have found so far is the following : https://regex101.com/r/GunrUg/1
The problem is it doesn't align the equal signs vertically...

I got an answer on AutoHotKey forums:
^i::
out := ""
Send, ^x
regex := "O)\s*\|\s*(.*?)\s*=\s*(.*)", width := 1
Loop, Parse, Clipboard, `n, `r
If RegExMatch(A_LoopField, regex, _)
width := Max(width, StrLen(_[1]))
Loop, Parse, Clipboard, `n, `r
If RegExMatch(A_LoopField, regex, _)
out .= Format(" | {:-" width "} = {2}", _[1],_[2]) "`n"
else
out .= A_LoopField "`n"
Clipboard := out
Send, ^v
Return
With this script, pressing Ctrl+i formats the infobox code just right (I guess a simple regex isn't enough to do the job).

Related

Highlight line for specific commit in git log graph

I am trying to highlight the whole line for a specific commit in my git log graph. I have since before created a git log alias to format the output of my logs. I have attempted to highlight a specific line containing the commit-id, using my alias.
Alias in ~/.gitconfig
# Base command for log formatting
lg-base = "log --graph --decorate=short --decorate-refs-exclude='refs/tags/*' --color=always"
# Version 1 log format
lg1 = !"git lg-base --format=format:'%C(#f0890c)%h%C(reset) - %C(bold green)(%ar)%C(reset) %C(white)%s%C(reset) %C(dim white)- %an%C(reset)%C(#d10000)%d%C(reset)'"
Doing a test with searching for 6 months just because it should behave the same and might showcase my issue a bit better.
git lg1 | grep --color=always -E '(6 months).*|$'
Matches the correct lines. But it doesn't highlight the whole line to the right and when trying to highlight the left part of the line as well, it doesn't work as expected. Probably because of my lack in skills of using regex.
git lg1 | grep --color=always -E '.*(6 months).*|$'
Instead it marks the * in the beginning.
If you have a total other approach, that is fine with me as long as I can use my formatted git log alias.

Thomas' comment is the key to the issue here: although grep is adding its own color (or colour) changing escape sequences to highlight the line, Git has already put in color changing directives. Each such directive, for one of the named colors, looks like this:
ESC[numberm
where the number part is 30 through 37 for a foreground color and 40 through 47 for a background color (plus some extra codes for bold or dim, which I won't include here). (%C(reset) sends ESC [ m and your orange selector uses a 24-bit color directive, which is less widely supported than the eight base colors, which go back to the 1990s). Hence the original output reads:
* <sp> <orange> <hash-ID> <reset> <sp> - <sp> <blue> (n months ago) <reset> ...
The grep adds red, which is ESC [ 31 m, and a reset, around the matched expression—but the existing escapes within the expression remain.
The easiest way by far to avoid all this is to stop using color escape sequences at all, so that grep's added ones stick out like a sore red thumb. Of course that defeats your goal, which is to keep the color-changing escapes in lines that aren't highlighted. But you haven't explained what you'd like done with the color-changing escapes in lines, or parts of lines, that are highlighted. Answering that will determine what to do next.
There are any number of ways you could handle this. For instance, instead of %C(color)%<directive>%C(reset) you could use %x1b(name-of-color)%<directive>%x1b(reset) to insert the literal sequences ESC ( name of color or reset ), or assume that the terminal in question will use ANSI style escapes that end with the lowercase m character, and try to write something up in sed or awk (I'd use awk for something this complex, just because it's less like writing line noise) that does the match—awk supports regex matching—and if found, strips out the color sequences from the matched part and adds its own. Post-process this with something that inserts the appropriate terminal-dependent color-change sequences, or keep the original ESC [ ... m sequences on the assumption that you're in a window that uses that form, and you'll have the output you want (which you can now pipe through less -R if desired).
A skeleton awk program that does what you want is:
/<desired regex>/ { handle matched line; next; }
{ print }
The hard part is the "handle matched line". GNU awk has RSTART and RLENGTH to help out a lot; see, e.g., this answer. The substring of the line from the beginning to RSTART-1 wasn't matched (this may be empty), and the substring from RSTART+RLENGTH to the end of the line (which may also be empty) also was not matched; the substring of $0 at RSTART for length RLENGTH was matched and here's where you would strip out any color-changing sequences, if you want your basic red (or whatever) applied throughout.
Sample script (by Robin Hellmers)
Creating a script and placing it where you please, e.g.
~/.local/bin/highlight-commit.awk
with the contents
#!/usr/bin/nawk -f
BEGIN {
n = split(commits,arrayCommits," ");
background="145;0;0"
foreground="255;255;255"
}
{
# Compare with every given input e.g. commit id
for (i=1; i <= n; i++) {
if(match($0,arrayCommits[i])) {
# Remove any ANSI color escape sequence for matching row
gsub("\x1b\\[[0-9;]*m","",$0)
# Create ANSI color escape sequence for whole row
$0 = sprintf("\x1b[48;2;%sm\x1b[38;2;%sm%s\x1b[0m\x1b[0m",
background,
foreground,
$0);
break;
}
}
printf("%s\n", $0);
}
In ~/.gitconfig, add the following alias:
[alias]
highlight-commit = "!f() { git lg | awk -v commits=\"$*\" -f ~/.local/bin/highlight-commit.awk | less -XR; }; f"
By calling with e.g. two commits:
git highlight-commit 82451f8 310fca4

A single Regex Match Entire String first to then break up into multiple components

I am trying to come up with a RegEx (POSIX like) in a vendor application that returns data looking like illustrated below and presents a single line of data at a time so I do not need to account for multiple rows and need to match a row indvidually.
It can return one or more values in the string result
The application doesn't just let me use a "\d+\.\d+" to capture the component out of the string and I need to map all components of a row of data to a variable unfortunately even if I am going to discard it or otherwise it returns a negative match result.
My data looks like the following with the weird underscore padding.
USER | ___________ 3.58625 | ___________ 7.02235 |
USER | ___________ 10.02625 | ___________ 15.23625 |
The syntax is supports is
Matches REGEX "(Var1 Regex), (Var2 Regex), (Var3 Regex), (Var 4 regex), (Var 5 regex)" and the entire string must match the aggregation of the RegEx components, a single character off and you get nothing.
The "|" characters are field separators for the data.
So in the above what I need is a RegEx that takes it up to the beginning of the numeric and puts that in Var1, then capture the numeric value with decimal point in var 2, then capture up to the next numeric in Var 3, and then keep the numeric in var 4, then capture the space and end field | character into var 5. Only Var 2 and 4 will be useful but I have to capture the entire string.
I have mainly tried capturing between the bars "|" using ^.*\|(.*).\|*$ from this question.
I have also tried the multiple variable ([0-9]+k?[.,]?[0-9]+)\s*-\s*.*?([0-9]+k?[.,]?[0-9]+) mentioned in this question.
I seem to be missing something to get it right when I try using them via RegExr and I feel like I am missing something pretty simple.
In RegExr I never get more than one part of the string I either get just the number, the equivalent of the entire string in a single variable, or just the number which don't work in this context to accomplish the required goal.
The only example the documentation provides is the following from like a SysLog entry of something like in this example I'm consolidating there with "Fault with Resource Name: Disk Specific Problem: Offline"
WHERE value matches regex "(.)Resource Name: (.), Specific Problem: ([^,]),(.)"
SET _Rrsc = var02
SET _Prob = var03
I've spun my wheels on this for several hours so would appreciate any guidance / help to get me over this hump.

Something like this should work:
(\D+)([\d.]+)(\D+)([\d.]+)(.*)
Or in normal words: Capture everything but numbers, capture a decimal number, capture everything but numbers, capture a decimal number, capture everything.
Using USER | ___________ 10.02625 | ___________ 15.23625 |
$1 = USER | ___________  
$2 = 10.02625
$3 =  | ___________  
$4 = 15.23625
$5 =  |

Parsing log files

I'm trying to write a script to simplify the process of searching through a particular applications log files for specific information. So I thought maybe there's a way to convert them into an XML tree, and I'm off to a decent start....but The problem is, the application log files are an absolute mess if you ask me
Some entries are simple
2014/04/09 11:27:03 INFO Some.code.function - Doing stuff
Ideally I'd like to turn the above into something like this
<Message>
<Date>2014/04/09</Date>
<Time>11:48:38</Time>
<Type>INFO</Type>
<Source>Some.code.function</Source>
<Sub>Doing stuff</Sub>
</Message>
Other entries are something like this where there's additional information and line breaks
2014/04/09 11:27:04 INFO Some.code.function - Something happens
changes:
this stuff happened
I'd like to turn this last chunk into something like the above, but add the additional info into a section
<Message>
<Date>2014/04/09</Date>
<Time>11:48:38</Time>
<Type>INFO</Type>
<Source>Some.code.function</Source>
<Sub>Doing stuff</Sub>
<details>changes:
this stuff happened</details>
</Message>
and then other messages, errors will be in the form of
2014/04/09 11:27:03 ERROR Some.code.function - Something didn't work right
Log Entry: LONGARSEDGUID
Error Code: E3145
Application: Name
Details:
message information etc etc and more line breaks, this part of the message may add up to an unknown number of lines before the next entry
This last chunk I'd like to convert as the last to above examples, but adding XML nodes for log entry, error code, application, and again, details like so
<Message>
<Date>2014/04/09</Date>
<Time>11:48:38</Time>
<Type>ERROR </Type>
<Source>Some.code.function</Source>
<Sub>Something didn't work right</Sub>
<Entry>LONGARSEDGUID</Entry>
<Code>E3145</Code>
<Application>Name</Application>
<details>message information etc etc and more line breaks, this part of the message may add up to an unknown number of lines before the next entry</details>
</Message>
Now I know that Select-String has a context option which would let me select a number of lines after the line I've filtered, the problem is, this isn't a constant number.
I'm thinking a regular expression would also me to select the paragraph chunk before the date string, but regular expressions are not a strong point of mine, and I thought there might be a better way because the one constant is that new entries start with a date string
the idea though is to either break these up into xml or tables of sorts and then from there I'm hoping it might take the last or filtering non relevant or recurring messages a little easier
I have a sample I just tossed on pastebin after removing/replacing a few bits of information for privacy reasons
http://pastebin.com/raw.php?i=M9iShyT2

Sorry this is kind of late, I got tied up with work for a bit there (darn work expecting me to be productive while on their dime). I ended up with something similar to Ansgar Wiechers solution, but formatted things into objects and collected those into an array. It doesn't manage your XML that you added later, but this gives you a nice array of objects to work with for the other records. I'll explain the main RegEx line here, I'll comment in-line where it's practical.
'(^\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) [\d+?] (\w+?) {1,2}(.+?) - (.+)$' is the Regex that detects the start of a new record. I started to explain it, but there are probably better resources for you to learn RegEx than me explaining it to me. See this RegEx101.com link for a full breakdown and examples.
$Records=#() #Create empty array that we will populate with custom objects later
$Event = $Null #make sure nothing in $Event to give script a clean start
Get-Content 'C:\temp\test1.txt' | #Load file, and start looping through it line-by-line.
?{![string]::IsNullOrEmpty($_)}|% { #Filter out blank lines, and then perform the following on each line
if ($_ -match '(^\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \[\d+?] (\w+?) {1,2}(.+?) - (.+)$') { #New Record Detector line! If it finds this RegEx match, it means we're starting a new record.
if ($Event) { #If there's already a record in progress, add it to the Array
$Records+=$Event
}
$Event = New-Object PSObject -Property #{ #Create a custom PSObject object with these properties that we just got from that RegEx match
DateStamp = [datetime](get-date $Matches[1]) #We convert the date/time stamp into an actual DateTime object. That way sorting works better, and you can compare it to real dates if needed.
Type = $Matches[2]
Source = $Matches[3]
Message = $Matches[4]}
Ok, little pause for the cause here. $Matches isn't defined by me, why am I referencing it? . When PowerShell gets matches from a RegEx expression it automagically stores the resulting matches in $Matches. So all the groups that we just matched in parenthesis become $Matches[1], $Matches[2], and so on. Yes, it's an array, and there is a $Matches[0], but that is the entire string that was matched against, not just the groups that matched. We now return you to your regularly scheduled script...
} else { #End of the 'New Record' section. If it's not a new record if does the following
if($_ -match "^((?:[^ ^\[])(?:\w| |\.)+?):(.*)$"){
RegEx match again. It starts off by stating that this has to be the beginning of the string with the carat character (^). Then it says (in a non-capturing group noted by the (?:<stuff>) format, which really for my purposes just means it won't show up in $Matches) [^ \[]; that means that the next character can not be a space or opening bracket (escaped with a ), just to speed things up and skip those lines for this check. If you have things in brackets [] and the first character is a carat it means 'don't match anything in these brackets'.
I actually just changed this next part to include periods, and used \w instead of [a-zA-Z0-9] because it's essentially the same thing but shorter. \w is a "word character" in RegEx, and includes letters, numbers, and the underscore. I'm not sure why the underscore is considered part of a word, but I don't make the rules I just play the game. I was using [a-zA-Z0-9] which matches anything between 'a' and 'z' (lowercase), anything between 'A' and 'Z' (uppercase), and anything between '0' and '9'. At the risk of including the underscore character \w is a lot shorter and simpler.
Then the actual capturing part of this RegEx. This has 2 groups, the first is letters, numbers, underscores, spaces, and periods (escaped with a \ because '.' on it's own matches any character). Then a colon. Then a second group that is everything else until the end of the line.
$Field = $Matches[1] #Everything before the colon is the name of the field
$Value = $Matches[2].trim() #everything after the colon is the data in that field
$Event | Add-Member $Field $Value #Add the Field to $Event as a NoteProperty, with a value of $Value. Those two are actually positional parameters for Add-Member, so we don't have to go and specify what kind of member, specify what the name is, and what the value is. Just Add-Member <[string]name> <value can be a string, array, yeti, whatever... it's not picky>
} #End of New Field for current record
else{$Value = $_} #If it didn't find the regex to determine if it is a new field then this is just more data from the last field, so don't change the field, just set it all as data.
} else { #If it didn't find the regex then this is just more data from the last field, so don't change the field, just set it all as data.the field does not 'not exist') do this:
$Event.$Field += if(![string]::isNullOrEmpty($Event.$Field)){"`r`n$_"}else{$_}}
This is a long explanation for a fairly short bit of code. Really all it does is add data to the field! This has an inverted (prefixed with !) If check to see if the current field has any data, if it, or if it is currently Null or Empty. If it is empty it adds a new line, and then adds the $Value data. If it doesn't have any data it skips the new line bit, and just adds the data.
}
}
}
$Records+=$Event #Adds the last event to the array of records.
Sorry, I'm not very good with XML. But at least this gets you clean records.
Edit: Ok, code is notated now, hopefully everything is explained well enough. If something is still confusing perhaps I can refer you to a site that explains better than I can. I ran the above against your sample input in PasteBin.

One possible way to deal with such files is to process them line by line. Each log entry starts with a timestamp and ends when the next line starting with a timestamp appears, so you could do something like this:
Get-Content 'C:\path\to\your.log' | % {
if ($_ -match '^\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}') {
if ($logRecord) {
# If a current log record exists, it is complete now, so it can be added
# to your XML or whatever, e.g.:
$logRecord -match '^(\d{4}/\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}) (\S+) ...'
$message = $xml.CreateElement('Message')
$date = $xml.CreateElement('Date')
$date.InnerText = $matches[1]
$message.AppendChild($date)
$time = $xml.CreateElement('Time')
$time.InnerText = $matches[2]
$message.AppendChild($time)
$type = $xml.CreateElement('Type')
$type.InnerText = $matches[3]
$message.AppendChild($type)
...
$xml.SelectSingleNode('...').AppendChild($message)
}
$logRecord = $_ # start new record
} else {
$logRecord += "`r`n$_" # append to current record
}
}

VB.Net Beginner: Replace with Wildcards, Possibly RegEx?

I'm converting a text file to a Tab-Delimited text file, and ran into a bit of a snag. I can get everything I need to work the way I want except for one small part.
One field I'm working with has the home addresses of the subjects as a single entry ("1234 Happy Lane Somewhere, St 12345") and I need each broken down by Street(Tab)City(Tab)State(Tab)Zip. The one part I'm hung up on is the Tab between the State and the Zip.
I've been using input=input.Replace throughout, and it's worked well so far, but I can't think of how to untangle this one. The wildcards I'm used to don't seem to be working, I can't replace ("?? #####") with ("??" + ControlChars.Tab + "#####")...which I honestly didn't expect to work, but it's the only idea on the matter I had.
I've read a bit about using Regex, but have no experience with it, and it seems a bit...overwhelming.
Is Regex my best option for this? If not, are there any other suggestions on solutions I may have missed?
Thanks for your time. :)
EDIT: Here's what I'm using so far. It makes some edits to the line in question, taking care of spaces, commas, and other text I don't need, but I've got nothing for the State/Zip situation; I've a bad habit of wiping something if it doesn't work, but I'll append the last thing I used to the very end, if that'll help.
If input Like "Guar*###/###-####" Then
input = input.Replace("Guar:", "")
input = input.Replace(" ", ControlChars.Tab)
input = input.Replace(",", ControlChars.Tab)
input = "C" + ControlChars.Tab + strAccount + ControlChars.Tab + input
End If
input = System.Text.RegularExpressions.Regex.Replace(" #####", ControlChars.Tab + "#####") <-- Just one example of something that doesn't work.
This is what's written to input in this example
" Guar: LASTNAME,FIRSTNAME 999 E 99TH ST CITY,ST 99999 Tel: 999/999-9999"
And this is what I can get as a result so far
C 99999/9 LASTNAME FIRSTNAME 999 E 99TH ST CITY ST 99999 999/999-9999
With everything being exactly what I need besides the "ST 99999" bit (with actual data obviously omitted for privacy and professional whatnots).
UPDATE: Just when I thought it was all squared away, I've got another snag. The raw data gives me this.
# TERMINOLOGY ######### ##/##/#### # ###.##
And the end result is giving me this, because this is a chunk of data that was just fine as-is...before I removed the Tabs. Now I need a way to replace them after they've been removed, or to omit this small group of code from a document-wide Tab genocide I initiate the code with.
#TERMINOLOGY###########/##/########.##
Would a variant on rgx.Replace work best here? Or can I copy the code to a variable, remove Tabs from the document, then insert the variable without losing the tabs?

I think what you're looking for is
Dim r As New System.Text.RegularExpressions.Regex(" (\d{5})(?!\d)")
Dim input As String = rgx.Replace(input, ControlChars.Tab + "$1")
The first line compiles the regular expression. The \d matches a digit, and the {5}, as you can guess, matches 5 repetitions of the previous atom. The parentheses surrounding the \d{5} is known as a capture group, and is responsible for putting what's captured in a pseudovariable named $1. The (?!\d) is a more advanced concept known as a negative lookahead assertion, and it basically peeks at the next character to check that it's not a digit (because then it could be a 6-or-more digit number, where the first 5 happened to get matched). Another version is
" (\d{5})\b"
where the \b is a word boundary, disallowing alphanumeric characters following the digits.

Regexp: Keyword followed by value to extract

I had this question a couple of times before, and I still couldn't find a good answer..
In my current problem, I have a console program output (string) that looks like this:
Number of assemblies processed = 1200
Number of assemblies uninstalled = 1197
Number of failures = 3
Now I want to extract those numbers and to check if there were failures. (That's a gacutil.exe output, btw.) In other words, I want to match any number [0-9]+ in the string that is preceded by 'failures = '.
How would I do that? I want to get the number only. Of course I can match the whole thing like /failures = [0-9]+/ .. and then trim the first characters with length("failures = ") or something like that. The point is, I don't want to do that, it's a lame workaround.
Because it's odd; if my pattern-to-match-but-not-into-output ("failures = ") comes after the thing i want to extract ([0-9]+), there is a way to do it:
pattern(?=expression)
To show the absurdity of this, if the whole file was processed backwards, I could use:
[0-9]+(?= = seruliaf)
... so, is there no forward-way? :T

pattern(?=expression) is a regex positive lookahead and what you are looking for is a regex positive lookbehind that goes like this (?<=expression)pattern but this feature is not supported by all flavors of regex. It depends which language you are using.
more infos at regular-expressions.info for comparison of Lookaround feature scroll down 2/3 on this page.

If your console output does actually look like that throughout, try splitting the string on "=" when the word "failure" is found, then get the last element (or the 2nd element). You did not say what your language is, but any decent language with string splitting capability would do the job. For example
gacutil.exe.... | ruby -F"=" -ane "print $F[-1] if /failure/"

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js