REGEX - Automatic text selection and restructuring - regex

I am kind of new to AHK. I've written some scripts, but with my latest one I'm stuck on regex in AHK.
I want to generate a structured report from texts I write.
To do this I've set up a system:
1. Sentences ending in '.' are the important sentences, to be prefixed with "-" (variable 'Vimportant'), BUT WITHOUT the sentences containing the words listed for 'Vanecdotes2' or 'Vdelete2' (cf. point 4).
2. Sentences ending in '.*' are the anecdotes (variable 'Vanecdotes1'), where I've manually put a star after the period.
3. Sentences ending in '.!' are irrelevant sentences that need to be deleted (variable 'Vdelete1'), where I've manually put an exclamation mark after the period.
4. As an extra option, I want to define words to detect in a sentence, so that the sentence is automatically added to the variable 'Vanecdotes2' or 'Vdelete2'.
A random example would be this (I have already put ! and * after the sentences (why is not important), and "acquisition" is an example of a Vanecdotes2 word from point 4 above):
Last procedure on 19/8/2019.
Normal structure x1.!
Normal structure x2.!
Abberant structure x3, needs follow-up within 2 months.
Structure x4 is lower in activity, but still above p25.*
Abberant structure x4, needs follow-up within 6 weeks.
Normal structure x5.!
Good aqcuisition of x6.
So the output of the Regex in the variables should be
Last procedure on 19/8/2019.
Normal structure x1.! --> regex '.!' --> Vdelete1
Normal structure x2.! --> regex '.!' --> Vdelete1
Abberant structure x3, needs follow-up within 2 months. --> Regex '.' = Vimportant
Structure x4 is lower in activity, but still above p25.* --> regex '.*' = Vanecdote1
Abberant structure x4, needs follow-up within 6 weeks. --> Regex '.' = Vimportant
Normal structure x5.! --> regex '.!' --> Vdelete1
Good aqcuisition of x6. --> Regex 'sentence with the word acquisition' = Vanecdote2
And the output should be:
- Last procedure on 19/8/2019.
- Abberant structure x3, needs follow-up within 2 months.
- Abberant structure x4, needs follow-up within 6 weeks.
* Structure x4 is lower in activity, but still above p25.
* Good aqcuisition of x6.
But I have been having a lot of trouble with the regex, especially with selecting the sentences ending in * or !, but also with the exclusion criteria; they just don't work.
Because AHK doesn't have a really good tester, I first tested in another regex tester, planning to 'translate' it to AHK code later on, but it just doesn't work. (So I know that in the script below I'm mixing AHK code with non-AHK regex; I've just put the two together for illustration.)
This is what I have now:
Send ^c
clipwait, 1000
Temp := Clipboard
Regexmatch(Temp, "^.*[.]\n(?!^.*\(Anecdoteword1|Anecdoteword2|deletewordX|deletewordY)\b.*$)", Vimportant)
Regexmatch(Temp, "^.*[.][*]\n")", Vanecdotes1)
Regexmatch(Temp, "^.*[.][!]\n")", Vdelete1)
Regexmatch(Temp, "^.*\b(Anecdoteword1|Anecdoteword2)\b.*$")", Vanecdotes2)
Regexmatch(Temp, "^.*\b(deletewordX|deletewordY)\b.*$")", Vdelete2)
Vanecdotes_tot := Vanecdotes1 . Vanecdotes2
Vdelete_tot := Vdelete1 . Vdelete2
Vanecdotes_ster := "* " . StrReplace(Vanecdotes_tot, "`r`n", "`r`n* ")
Vimportant_stripe := "- " . StrReplace(Vimportant, "`r`n", "`r`n- ")
Vresult := Vimportant_stripe . "`n`n" . Vanecdotes_ster
For the "translation to AHK" I tried to turn the working (non-AHK) regex ^.*[.][*]\n into ^.*\*`n.

There isn't really such a thing as AHK regex. AHK pretty much uses PCRE, apart from the options.
So don't try to turn a linefeed \n into an AHK linefeed `n.
There also seem to be some syntax errors in your regexes; I'm not sure what the extra ") in there are supposed to be. Also, instead of [.][*], the usual way to write this is \.\*. The \ escapes the special meaning of those characters (. matches any character, * repeats the previous token zero or more times). [.][*] happens to work too, because both characters are literal inside a character class, but it is harder to read.
[] matches any one character from the group, so if you wanted to match either . or * you'd write [.*].
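To make the escaping point concrete, here is a small sketch in Go rather than AHK (Go's regexp package uses RE2 syntax, not PCRE, but escaping and character classes behave the same way for these patterns). The helper name matches and the sample sentences are only illustrative.

```go
package main

import (
	"fmt"
	"regexp"
)

// matches reports whether pattern finds a match anywhere in s.
// (Hypothetical helper, just for this demonstration.)
func matches(pattern, s string) bool {
	return regexp.MustCompile(pattern).MatchString(s)
}

func main() {
	for _, s := range []string{"still above p25.*", "Normal structure x1.!"} {
		fmt.Println(s,
			matches(`\.\*$`, s),   // literal ".*" at the end of the text
			matches(`[.][*]$`, s), // the same thing written with character classes
			matches(`.*$`, s))     // unescaped: any run of characters, always matches
	}
}
```

The unescaped pattern matches every sentence, which is one likely reason a "ends in .*" test selects far more than intended.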
It seems you've got the idea of capture groups, but just in case, here's a minimal example of them:
RegexMatch("TestTest1233334Test", "(\d+)", capture)
MsgBox, % capture
And lastly, about your approach to the problem: I'd recommend looping through the input line by line. It'll be much easier. Use e.g. Loop, Parse.
A minimal example for that as well:
inp := "
(
this is
a multiline
textblock
we're going
to loop
through it
line by line
)"
Loop, Parse, inp, `n, `r
    MsgBox, % "Line " A_Index ":`n" A_LoopField
Hope this was of help.

This is where I am up till now; nothing works (I will try the suggested loop once the regex is working):
^m::
BlockInput, On
MouseGetPos, , ,TempID, control
WinActivate, ahk_id %TempID%
if WinActive("Pt.")
Send ^c
clipwait, 1000
Temp := Clipboard
Regexmatch(Temp, "(^(?:..\n)((?! PAX|PAC|Normaal|Geen).)$)", Vimportant)
Vimportant := Vimportant.1
Regexmatch(Temp, "(^..*\n)", Vanecdotes1_ster)
Regexmatch(Temp, "(^..!\n)" , Vdelete1_uitroep)
Regexmatch(Temp, "(^.\b(PAX|PAC)\b.$)", Vanecdotes2)
Regexmatch(Temp, "(^.\b(Normaal|Geen)\b.$)", Vdelete2)
Vanecdotes1 := StrReplace(Vanecdotes1_ster, ".*", ".")
Vdelete1 := StrReplace(Vdelete1_uitroep, ".!", ".")
Vanecdotes_tot := Vanecdotes1 . Vanecdotes2
Vdelete_tot := Vdelete1 . Vdelete2
Vanecdotes_ster := "* " . StrReplace(Vanecdotes_tot, "`r`n", "`r`n* ")
Vimportant_stripe := "- " . StrReplace(Vimportant, "`r`n", "`r`n- ")
Vresult := Vimportant_stripe . "`n`n" . Vanecdotes_ster
Clipboard := Vresult
Send ^v
return

Related

RegEx to format Wikipedia's infoboxes code [SOLVED]

I am a contributor to Wikipedia and I would like to make a script with AutoHotKey that could format the wikicode of infoboxes and other similar templates.
Infoboxes are templates that display a box on the side of articles and show the values of the parameters entered (the parameters are numerous, and they differ in number, length and type of characters used depending on the infobox).
Parameters are always preceded by a pipe (|) and end with an equal sign (=). On rare occasions, multiple parameters can be put on the same line, but I can sort this manually before running the script.
A typical infobox will be like this:
{{Infobox XYZ
 | first parameter  = foo
 | second_parameter =
 | 3rd parameter    = bar
 | 4th              = bazzzzz
 | 5th              =
 | etc.             =
}}
But sometimes (lazy) contributors write them like this:
{{Infobox XYZ
|first parameter=foo
|second_parameter=
|3rd parameter=bar
|4th=bazzzzz
|5th=
|etc.=
}}
Which isn't very easy to read and modify.
I would like to know if it is possible to make a regex (or a series of regexes) that would transform the second example into the first.
The lines should start with a space, then a pipe, then another space, then the parameter name, then any number of spaces (to match the other lines' length), then an equal sign, then another space, and if present, the parameter value.
I tried some things using multiple capturing groups, but I'm going nowhere... (I'm even ashamed to show my attempts, as they really don't work).
Would someone have an idea on how to make it work?
Thank you for your time.
The lines should start with a space, then a pipe, then another space, then the parameter name, then a space, then an equal sign, then another space, and if present, the parameter value.
First the selection, it's relatively trivial:
^\s*\|\s*([^=]*?)\s*=(.*)$
Then the replacement, literally your description of what you want (note the space at the beginning):
 | $1 = $2
See it in action here.
@Blindy:
The best code I have found so far is the following : https://regex101.com/r/GunrUg/1
The problem is it doesn't align the equal signs vertically...
I got an answer on AutoHotKey forums:
^i::
    out := ""
    Send, ^x
    regex := "O)\s*\|\s*(.*?)\s*=\s*(.*)", width := 1
    ; First pass: find the longest parameter name.
    Loop, Parse, Clipboard, `n, `r
        If RegExMatch(A_LoopField, regex, _)
            width := Max(width, StrLen(_[1]))
    ; Second pass: rewrite parameter lines, padding each name to that width.
    Loop, Parse, Clipboard, `n, `r
        If RegExMatch(A_LoopField, regex, _)
            out .= Format(" | {:-" width "} = {2}", _[1],_[2]) "`n"
        else
            out .= A_LoopField "`n"
    Clipboard := out
    Send, ^v
Return
With this script, pressing Ctrl+i formats the infobox code just right (I guess a simple regex isn't enough to do the job).

Remove lines from buffer that match the selected text

When analyzing large log files, I often remove lines containing text I find irrelevant:
:g/whatever/d
Sometimes I find text that spans multiple lines, like stack traces. For that, I record the steps taken (search, go to start anchor, delete to end anchor) and replay that macro with 100000@q. I'm searching for a function or feature Vim already includes that lets me mark text and remove all lines containing it. Ideally this would also work for block selection.
If I understood your problem right, this command should do what you want:
:g/NullPointer/,/omitt/d
Example:
Before:
1
2
3
NullPointerException1
4
5
6
omitted
7
NullPointerException2
8
9
omitted
10
After:
1
2
3
7
10
Please read :h edit-paragraph-join; there is a good explanation of the command. Your case just changes join into d.
:g/whatever/d2
will delete a line with whatever and the line after it. If you can find text that always happens in the first line, you can strip out all of the following text if it has the same number of lines by changing 2 to whatever you need.
You could actually just use normal-mode commands inside a global command to achieve what you want. Looking at your example (I hope I understood it more or less right):
someText
NullPointerException
...
omitted
you want to delete from the line above the NPE until the line with omitted, right?
Just use the following:
:g/NullPointerException/execute "normal! kddd/omitted\<cr>dd"
It may look complex, but it isn't. It is not better than a macro [1], but I like commands more, because I always make errors when recording macros.
Since it only uses normal Vim movements, it is easy to adapt. If, for example, you do not know where your previous anchor is, you could use ?anchor\<cr> instead of kd. For a better demonstration you would have to submit a realistic example.
[1] You could argue, that this only needs to be run once, but that is also true for a recursive macro http://vim.wikia.com/wiki/Record_a_recursive_macro
Thanks to the answers here, I was able to code a very handy function: the source below lets you select text and remove all lines with the same (or similar) text from the current buffer. That works with both in-line and multiline selections. As I said, I was searching for something that made me faster at analyzing log files. Log files typically contain dates and times, and these change all the time, so it's a good idea to have something that lets us ignore numbers. Let's see. I'm using these two mappings:
vnoremap d :<C-U>echo RemoveSelectionFromBuffer(0)<CR>
vnoremap D :<C-U>echo RemoveSelectionFromBuffer(1)<CR>
Typical usage:
Remove similar lines ignoring numbers: Shift+v, then Shift+d
Remove same matches (single line): Mark text inline (leaving out dates and times), then d
Remove same matches (multiline): Mark text across lines (leaving out dates and times), then d
Here's the source code:
" Removes lines matching the selected text from buffer.
function! RemoveSelectionFromBuffer(ignoreNumbers)
    let lines = GetVisualSelection() " selected lines
    " Escape backslashes and slashes (delimiters)
    call map(lines, {k, v -> substitute(v, '\\\|/', '\\&', 'g')})
    if a:ignoreNumbers == 1
        " Substitute all numbers with \s*\d\+\s* - in formatted output matching
        " lines may have whitespace instead of numbers. All backslashes need
        " to be escaped because \V (very nomagic) will be used.
        call map(lines, {k, v -> substitute(v, '\s*\d\+\s*', '\\s\\*\\d\\+\\s\\*', 'g')})
    endif
    let blc = line('$') " number of lines in buffer (before deletion)
    let vlc = len(lines) " number of selected lines
    let pattern = join(lines, '\_.') " support multiline patterns
    let cmd = ':g/\V' . pattern . '/d_' . vlc " delete matching lines (d_3)
    let pos = getpos('v') " save position
    execute "silent " . cmd
    call setpos('.', pos) " restore position
    let dlc = blc - line('$') " number of deleted lines
    let dmc = dlc / vlc " number of deleted matches
    let cmd = substitute(cmd, '\(.\{50\}\).*', '\1...', '') " command output
    let lout = dlc . ' line' . (dlc == 1 ? '' : 's')
    let mout = '(' . dmc . ' match' . (dmc == 1 ? '' : 'es') . ')'
    return printf('%s removed: %s', (vlc == 1 ? lout : lout . ' ' . mout), cmd)
endfunction
I took the GetVisualSelection() code from this answer.
function! GetVisualSelection()
    if mode() == "v"
        let [line_start, column_start] = getpos("v")[1:2]
        let [line_end, column_end] = getpos(".")[1:2]
    else
        let [line_start, column_start] = getpos("'<")[1:2]
        let [line_end, column_end] = getpos("'>")[1:2]
    end
    if (line2byte(line_start)+column_start) > (line2byte(line_end)+column_end)
        let [line_start, column_start, line_end, column_end] =
        \ [line_end, column_end, line_start, column_start]
    end
    let lines = getline(line_start, line_end)
    if len(lines) == 0
        return ''
    endif
    let lines[-1] = lines[-1][: column_end - 1]
    let lines[0] = lines[0][column_start - 1:]
    return lines
endfunction
Thanks, aepksbuck, DoktorOSwaldo and Kent.

Delphi multiline regex

I have some non-regression test code in Delphi that calls an external diff tool. My code then loads the diff results and should remove acceptable differences, such as dates in the compared results. I'm trying to do this with a multiline TRegEx.Replace, but no match is found...
https://regex101.com/r/QBZuws/2 shows the pattern I came up with and a sample test diff file. I need to delete the matching "paragraphs" of 3 lines.
Here is my code:
function FilterDiff(AText:string):string;
var
  LStr:string;
  Regex: TRegEx;
begin
  // AText:=StringReplace(AText,#13+#10,'\n',[rfReplaceAll]); // doesn't help ...
  LStr := '\d\d.\d\d.20\d\d \d\d:\d\d:\d\d'; // regex for date and time
  LStr := '##.*##\n-'+LStr+'\n\+'+LStr; // regex for paragraphs to remove
  Regex := TRegEx.Create(LStr, [roMultiLine]);
  Result := Regex.Replace(AText,'');
end;
procedure TReportTest.NonRegression;
var
  LDiff : TStringList;
  // others removed for clarity
begin
  // removed section code that call an external tool and produces diff.txt file
  LDiff := TStringList.Create;
  LDiff.LoadFromFile('diff.txt');
  Status(FilterDiff(LDiff.Text)); // show the diffs in DUnit GUI for now
  LDiff.Free;
end;
Besides, while tracing TRegEx.Replace down to
System.RegularExpressionsAPI.pcre_exec($4D72A50,nil,'--- '#$D#$A'+++ '#$D#$A'## -86 +86 ##'#$D#$A'-16.11.2017 15:00:36'#$D#$A'+15.11.2017 10:47:58'#$D#$A'## -400 +400 ##'#$D#$A'-16.11.2017 15:00:36'#$D#$A'+15.11.2017 10:47:58'#$D#$A,132,0,1024,$7D56800,300)
System.RegularExpressionsCore.TPerlRegEx.Match
System.RegularExpressionsCore.TPerlRegEx.ReplaceAll
System.RegularExpressions.TRegEx.Replace(???,???)
TestReportAuto.FilterDiff('--- '#$D#$A'+++ '#$D#$A'## -86 +86 ##'#$D#$A'-16.11.2017 15:00:36'#$D#$A'+15.11.2017 10:47:58'#$D#$A'## -400 +400 ##'#$D#$A'-16.11.2017 15:00:36'#$D#$A'+15.11.2017 10:47:58'#$D#$A)
I was surprised to see quotes before and after each newline #$D#$A in the debugger, but they don't look "real" ... or are they ?
As you seem to have issues with different kinds of line breaks, I would recommend adjusting your regex to use \R instead of \n. \R matches Windows-style line breaks (CR + LF) as well as Unix-style line breaks (LF).
Well, I just noticed the \n in regex matches only LF, not CR+LF, so I added
AText:=StringReplace(AText,#13+#10,#10,[rfReplaceAll]); // \n matches only LF !
at the beginning of my function and it's much better now...
Sometimes writing down a problem helps ...

Join lines after specific word till another specific word

I have a .txt file of a transcript that looks like this
MICHAEL: blablablabla.
further talk by Michael.
more talk by Michael.
VALERIE: blublublublu.
Valerie talks more.
MICHAEL: blibliblibli.
Michael talks again.
........
All in all, this pattern goes on for up to 4000 lines, and not just with two speakers but with up to seven different speakers, all with unique names written in upper-case letters (as in the example above).
For some text mining I need to rearrange this .txt file in the following way:
1. Join the lines following one speaker, but only the ones that still belong to that speaker, so that the above file looks like this:
MICHAEL: blablablabla. further talk by Michael. more talk by Michael.
VALERIE: blublublublu. Valerie talks more.
MICHAEL: blibliblibli. Michael talks again.
2. Sort the now properly joined lines in the .txt file alphabetically, so that all lines spoken by a speaker are together. However, the sort should not reorder the sentences spoken by one speaker (after each speaker's lines have been grouped together).
I know some basic Vim commands, but not enough to figure this out, especially the first step. I do not know what kind of pattern I can use in Vim so that it only joins the lines of each speaker.
Any help would be greatly appreciated!
Alright, first the answer:
:g/^\u\+:/,/\n\u\+:\|\%$/join
And now the explanation:
g stands for global and executes the following command on every line that matches
/^\u\+:/ is the pattern :g searches for: ^ is the start of the line, \u is an uppercase character, \+ means one or more matches and : is, unsurprisingly, a literal :
then comes the tricky bit: we make the executed command a range, from the match to some other pattern match. /\n\u\+:\|\%$/ has two parts separated by the alternation \| . \n\u\+: is a newline followed by the previous pattern, i.e. it matches on the line before the next speaker. \%$ is the end of the file
join does what it says on the tin
So to put it together: For each speaker, join until the line before the next speaker or the end of the file.
The closest to the sorting I know of is
:sort /\u\+:/ r
which will only sort by speaker name and reverses the other lines, so it isn't really what you are looking for
Well, I don't know much about Vim, but I was able to match the lines corresponding to a particular speaker, and here is the regex for that.
Regex: /([A-Z]+:)([A-Za-z\s\.]+)(?!\1)$/gm
Explanation:
([A-Z]+:) captures the speaker's name which contains only capital letters.
([A-Za-z\s\.]+) captures the dialogue.
(?!\1)$ backreferences to the Speaker's name and compares if the next speaker was same as the last one. If not then it matches till the new speaker is found.
I hope this will help you with matching at least.
In vim you might take a two step approach, first replace all newlines.
:%s/\n\+/ /g
Then insert a new line before the terms UPPERCASE: except the first one:
:%s/ \([[:upper:]]\+:\)/\r\1/g
For the sorting you can leverage the UNIX sort program:
:%!sort
You can combine them using a pipe symbol:
:%s/\n\+/ /g | %s/ \([[:upper:]]\+:\)/\r\1/g | %!sort
and map them to a key in your vimrc file:
:nnoremap <F5> :%s/\n\+/ /g \| %s/ \([[:upper:]]\+:\)/\r\1/g \| %!sort<CR>
If you press F5 in normal mode, the transformation happens. Note that the | needs to get escaped in the nnoremap command.
Here is a script solution to your problem.
It's not well tested, so I added some comments so you can fix it easily.
To make it run, just:
fill the g:speakers var in the top of the script with the uppercase names you need;
source the script (ex: :sav /tmp/script.vim|so %);
run :call JoinAllSpeakLines() to join the lines by speakers;
run :call SortSpeakLines() to sort
You may adapt the different patterns to better fit your needs, for example adding some space tolerance (\u\{2,}\s*\ze:).
Here is the code:
" Fill the following array with all the speakers names:
let g:speakers = [ 'MICHAEL', 'VALERIE', 'MATHIEU' ]
call sort(g:speakers)
function! JoinAllSpeakLines()
    " In the whole file, join all the lines between two uppercase speaker names
    " followed by ':', first inclusive:
    silent g/\u\{2,}:/call JoinSpeakLines__()
endf
function! SortSpeakLines()
    " Sort the whole file by speaker, keeping the order for
    " each speaker.
    " Must be called after JoinAllSpeakLines().
    " Create a new dict, with one key for each speaker:
    let speakerlines = {}
    for speaker in g:speakers
        let speakerlines[speaker] = []
    endfor
    " For each line in the file:
    for line in getline(1,'$')
        let speaker = GetSpeaker__(line)
        if speaker == ''
            continue
        endif
        " Add the line to the right speaker:
        call add(speakerlines[speaker], line)
    endfor
    " Delete everything in the current buffer:
    normal gg"_dG
    " Add the sorted lines, speaker by speaker:
    for speaker in g:speakers
        call append(line('$'), speakerlines[speaker])
    endfor
    " Delete the first (empty) line in the buffer:
    normal gg"_dd
endf
function! GetOtherSpeakerPattern__(speaker)
    " Returns a pattern which matches all speaker names, except the
    " one given as a parameter.
    " Create a new list with a:speaker removed:
    let others = copy(g:speakers)
    let idx = index(others, a:speaker)
    if idx != -1
        call remove(others, idx)
    endif
    " Create and return the pattern list, which looks like
    " this : "\v<MICHAEL>|<VALERIE>..."
    call map(others, 'printf("<%s>:",v:val)')
    return '\v' . join(others, '|')
endf
function! GetSpeaker__(line)
    " Returns the uppercase name followed by a ':' in a line
    return matchstr(a:line, '\u\{2,}\ze:')
endf
function! JoinSpeakLines__()
    " When cursor is on a line with an uppercase name, join all the
    " following lines until another uppercase name.
    let speaker = GetSpeaker__(getline('.'))
    if speaker == ''
        return
    endif
    normal V
    " Search for other names after the cursor line:
    let srch = search(GetOtherSpeakerPattern__(speaker), 'W')
    echo srch
    if srch == 0
        " For the last one only:
        normal GJ
    else
        normal kJ
    endif
endf

Parse input from a particular format

Let us say I have the following string: "Algorithms 1" by Robert Sedgewick. This is input from the terminal.
The format of this string will always be:
1. Starts with a double quote
2. Followed by characters (may contain space)
3. Followed by double quote
4. Followed by space
5. Followed by the word "by"
6. Followed by space
7. Followed by characters (may contain space)
Knowing the above format, how do I read this?
I tried using fmt.Scanf() but that would treat a word after each space as a separate value. I looked at regular expressions but I could not make out if there is a function using which I could GET values and not just test for validity.
1) With character search
The input format is so simple, you can simply use character search implemented in strings.IndexRune():
s := `"Algorithms 1" by Robert Sedgewick`
s = s[1:] // Exclude the first double quote
x := strings.IndexRune(s, '"') // Find the 2nd double quote
title := s[:x] // Title is between the 2 double quotes
author := s[x+5:] // It is followed by " by "; skip that, the rest is the author
Printing results with:
fmt.Println("Title:", title)
fmt.Println("Author:", author)
Output:
Title: Algorithms 1
Author: Robert Sedgewick
Try it on the Go Playground.
2) With splitting
Another solution would be to use strings.Split():
s := `"Algorithms 1" by Robert Sedgewick`
parts := strings.Split(s, `"`)
title := parts[1] // First part is empty, 2nd is title
author := parts[2][4:] // 3rd is author, but cut off " by "
Output is the same. Try it on the Go Playground.
3) With a "tricky" splitting
If we cut off the first double quote, we may do a splitting by the separator
`" by `
If we do so, we will have exactly the 2 parts: title and author. Since we cut off first double quote, the separator can only be at the end of the title (the title cannot contain double quotes as per your rules):
s := `"Algorithms 1" by Robert Sedgewick`
parts := strings.Split(s[1:], `" by `)
title := parts[0] // First part is exactly the title
author := parts[1] // 2nd part is exactly the author
Try it on the Go Playground.
4) With regexp
If after all the above solutions you still want to use regexp, here's how you could do it:
Use parentheses to define the submatches you want to get out. You want 2 parts: the title between quotes and the author that follows by. You can use regexp.FindStringSubmatch() to get the matching parts. Note that the first element in the returned slice will be the complete input, so the relevant parts are the subsequent elements:
s := `"Algorithms 1" by Robert Sedgewick`
r := regexp.MustCompile(`"([^"]*)" by (.*)`)
parts := r.FindStringSubmatch(s)
title := parts[1] // First part is always the complete input, 2nd part is the title
author := parts[2] // 3rd part is exactly the author
Try it on the Go Playground.
You should use groups (parentheses) to get out the information you want:
"([\w\s]*)"\sby\s([\w\s]+)\.
This returns two groups:
[1-13] Algorithms 1
[18-34] Robert Sedgewick
Now there should be a regex method to get all matches out of a text. The result will contain a match object which then contains the groups.
I think in Go it is: FindAllStringSubmatch
(https://github.com/StefanSchroeder/Golang-Regex-Tutorial/blob/master/01-chapter2.markdown)
Test it out here:
https://regex101.com/r/cT2sC5/1