Join lines after specific word till another specific word - regex

I have a .txt file of a transcript that looks like this
MICHAEL: blablablabla.
further talk by Michael.
more talk by Michael.
VALERIE: blublublublu.
Valerie talks more.
MICHAEL: blibliblibli.
Michael talks again.
........
All in all, this pattern goes on for up to 4000 lines, with not just two but up to seven different speakers, all with unique names written in upper-case letters (as in the example above).
For some text mining I need to rearrange this .txt file in the following way
Join the lines following one speaker - but only the ones that still belong to him - so that the above file looks like this:
MICHAEL: blablablabla. further talk by Michael. more talk by Michael.
VALERIE: blublublublu. Valerie talks more.
MICHAEL: blibliblibli. Michael talks again.
Sort the now properly joined lines in the .txt file alphabetically, so that all lines spoken by one speaker end up together. However, the sort should not reorder the sentences within each speaker's lines; once a speaker's lines are grouped, they should stay in their original order.
I know some basic vim commands, but not enough to figure this out, especially the first step. I do not know what kind of pattern I can use in vim so that it only joins the lines of each speaker.
Any help would be greatly appreciated!

Alright, first the answer:
:g/^\u\+:/,/\n\u\+:\|\%$/join
And now the explanation:
g stands for global and executes the following command on every line that matches
/^\u\+:/ is the pattern :g searches for: ^ is the start of the line, \u is an upper-case character, \+ means one or more matches and : is unsurprisingly :
Then comes the tricky bit: we make the executed command a range, from the match to some other pattern match. /\n\u\+:\|\%$/ has two alternatives separated by \| . \n\u\+: is a newline followed by the previous pattern, i.e. it matches the line before the next speaker. \%$ is the end of the file.
join does what it says on the tin
So to put it together: For each speaker, join until the line before the next speaker or the end of the file.
The closest to the sorting I know of is
:sort /\u\+:/ r
which sorts only by speaker name but reverses the order of the other lines, so it isn't really what you are looking for.
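For comparison outside Vim, here is a minimal Python sketch of both steps. It assumes speaker names consist only of upper-case letters followed by a colon, as in the question; the function names join_turns and group_by_speaker are made up for this illustration. It relies on Python's sorted() being stable, so each speaker's lines keep their original order.

```python
import re

# Assumption: a speaker line starts with an all-upper-case name and a colon.
SPEAKER = re.compile(r"^([A-Z]+):")


def join_turns(lines):
    """Merge continuation lines into the preceding 'NAME: ...' line."""
    turns = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if SPEAKER.match(line) or not turns:
            turns.append(line)
        else:
            turns[-1] += " " + line
    return turns


def group_by_speaker(turns):
    """Sort on the speaker name only; the stable sort keeps each speaker's order."""
    def speaker_key(turn):
        m = SPEAKER.match(turn)
        return m.group(1) if m else ""
    return sorted(turns, key=speaker_key)


transcript = """MICHAEL: blablablabla.
further talk by Michael.
more talk by Michael.
VALERIE: blublublublu.
Valerie talks more.
MICHAEL: blibliblibli.
Michael talks again."""

for turn in group_by_speaker(join_turns(transcript.splitlines())):
    print(turn)
```

Sorting on the captured name rather than on the whole line is what keeps the sentences of one speaker from being reordered.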

Well, I don't know much about vim, but I tried to match the lines belonging to a particular speaker, and here is the regex for that.
Regex: /([A-Z]+:)([A-Za-z\s\.]+)(?!\1)$/gm
Explanation:
([A-Z]+:) captures the speaker's name which contains only capital letters.
([A-Za-z\s\.]+) captures the dialogue.
(?!\1)$ backreferences to the Speaker's name and compares if the next speaker was same as the last one. If not then it matches till the new speaker is found.
I hope this will help you with matching at least.

In vim you might take a two step approach, first replace all newlines.
:%s/\n\+/ /g
Then insert a new line before the terms UPPERCASE: except the first one:
:%s/ \([[:upper:]]\+:\)/\r\1/g
For the sorting you can leverage the UNIX sort program:
:%!sort
You can combine them using a pipe symbol:
:%s/\n\+/ /g | %s/ \([[:upper:]]\+:\)/\r\1/g | %!sort
and map them to a key in your vimrc file:
:nnoremap <F5> :%s/\n\+/ /g \| %s/ \([[:upper:]]\+:\)/\r\1/g \| %!sort<CR>
If you press F5 in normal mode, the transformation happens. Note that the | needs to get escaped in the nnoremap command.

Here is a script solution to your problem.
It's not well tested, so I added some comments so you can fix it easily.
To make it run, just:
fill the g:speakers var in the top of the script with the uppercase names you need;
source the script (ex: :sav /tmp/script.vim|so %);
run :call JoinAllSpeakLines() to join the lines by speakers;
run :call SortSpeakLines() to sort
You may adapt the different patterns to better fit your needs, for example adding some space tolerance (\u\{2,}\s*\ze:).
Here is the code:
" Fill the following array with all the speakers names:
let g:speakers = [ 'MICHAEL', 'VALERIE', 'MATHIEU' ]
call sort(g:speakers)
function! JoinAllSpeakLines()
  " In the whole file, join all the lines between two uppercase speaker names
  " followed by ':', first inclusive:
  silent g/\u\{2,}:/call JoinSpeakLines__()
endf
function! SortSpeakLines()
  " Sort the whole file by speaker, keeping the order for
  " each speaker.
  " Must be called after JoinAllSpeakLines().
  " Create a new dict, with one key for each speaker:
  let speakerlines = {}
  for speaker in g:speakers
    let speakerlines[speaker] = []
  endfor
  " For each line in the file:
  for line in getline(1,'$')
    let speaker = GetSpeaker__(line)
    if speaker == ''
      continue
    endif
    " Add the line to the right speaker:
    call add(speakerlines[speaker], line)
  endfor
  " Delete everything in the current buffer:
  normal gg"_dG
  " Add the sorted lines, speaker by speaker:
  for speaker in g:speakers
    call append(line('$'), speakerlines[speaker])
  endfor
  " Delete the first (empty) line in the buffer:
  normal gg"_dd
endf
function! GetOtherSpeakerPattern__(speaker)
  " Returns a pattern which matches all speaker names, except the
  " one given as a parameter.
  " Create a new list with a:speaker removed:
  let others = copy(g:speakers)
  let idx = index(others, a:speaker)
  if idx != -1
    call remove(others, idx)
  endif
  " Create and return the pattern, which looks like
  " this: "\v<MICHAEL>:|<VALERIE>:..."
  call map(others, 'printf("<%s>:",v:val)')
  return '\v' . join(others, '|')
endf
function! GetSpeaker__(line)
  " Returns the uppercase name followed by a ':' in a line
  return matchstr(a:line, '\u\{2,}\ze:')
endf
function! JoinSpeakLines__()
  " When cursor is on a line with an uppercase name, join all the
  " following lines until another uppercase name.
  let speaker = GetSpeaker__(getline('.'))
  if speaker == ''
    return
  endif
  let first = line('.')
  " Search for other names after the cursor line:
  let srch = search(GetOtherSpeakerPattern__(speaker), 'W')
  " Join up to the line before the next speaker, or up to the end of
  " the buffer for the last one:
  let last = (srch == 0) ? line('$') : srch - 1
  " Nothing to join when the speaker only has a single line:
  if last > first
    execute first . ',' . last . 'join'
  endif
endf

Related

RegEx to format Wikipedia's infoboxes code [SOLVED]

I am a contributor to Wikipedia and I would like to make a script with AutoHotKey that could format the wikicode of infoboxes and other similar templates.
Infoboxes are templates that display a box on the side of articles and show the values of the parameters entered (they are numerous and they differ in number, length and type of characters used depending on the infobox).
Parameters are always preceded by a pipe (|) and end with an equal sign (=). On rare occasions, multiple parameters can be put on the same line, but I can sort this manually before running the script.
A typical infobox will be like this:
{{Infobox XYZ
| first parameter = foo
| second_parameter =
| 3rd parameter = bar
| 4th = bazzzzz
| 5th =
| etc. =
}}
But sometimes (lazy) contributors put them like this:
{{Infobox XYZ
|first parameter=foo
|second_parameter=
|3rd parameter=bar
|4th=bazzzzz
|5th=
|etc.=
}}
Which isn't very easy to read and modify.
I would like to know if it is possible to make a regex (or a series of regexes) that would transform the second example into the first.
The lines should start with a space, then a pipe, then another space, then the parameter name, then any number of spaces (to match the other lines' length), then an equal sign, then another space, and if present, the parameter value.
I tried some things using multiple capturing groups, but I'm going nowhere... (I'm even ashamed to show my attempts as they really don't work).
Would someone have an idea on how to make it work?
Thank you for your time.
The lines should start with a space, then a pipe, then another space, then the parameter name, then a space, then an equal sign, then another space, and if present, the parameter value.
First the selection, it's relatively trivial:
^\s*\|\s*([^=]*?)\s*=(.*)$
Then the replacement, literally your description of what you want (note the space at the beginning):
 | $1 = $2
See it in action here.
@Blindy:
The best code I have found so far is the following : https://regex101.com/r/GunrUg/1
The problem is it doesn't align the equal signs vertically...
I got an answer on AutoHotKey forums:
^i::
    out := ""
    Send, ^x
    regex := "O)\s*\|\s*(.*?)\s*=\s*(.*)", width := 1
    Loop, Parse, Clipboard, `n, `r
        If RegExMatch(A_LoopField, regex, _)
            width := Max(width, StrLen(_[1]))
    Loop, Parse, Clipboard, `n, `r
        If RegExMatch(A_LoopField, regex, _)
            out .= Format(" | {:-" width "} = {2}", _[1],_[2]) "`n"
        else
            out .= A_LoopField "`n"
    Clipboard := out
    Send, ^v
Return
With this script, pressing Ctrl+i formats the infobox code just right (I guess a simple regex isn't enough to do the job).
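The same two-pass idea (first find the longest parameter name, then pad every name to that width) can be sketched in Python for anyone who wants to run it outside AutoHotkey. This is only an illustration; the function name align_infobox is made up, and it assumes one parameter per line.

```python
import re

# Same shape as the AutoHotkey regex: pipe, parameter name, '=', optional value.
PARAM = re.compile(r"^\s*\|\s*(.*?)\s*=\s*(.*)$")


def align_infobox(text):
    lines = text.splitlines()
    matches = [PARAM.match(line) for line in lines]
    # First pass: width of the longest parameter name.
    width = max((len(m.group(1)) for m in matches if m), default=0)
    out = []
    # Second pass: pad every name to that width so the '=' signs line up.
    for line, m in zip(lines, matches):
        if m:
            out.append(f" | {m.group(1):<{width}} = {m.group(2)}".rstrip())
        else:
            out.append(line)  # '{{Infobox ...' and '}}' are left untouched
    return "\n".join(out)


sample = """{{Infobox XYZ
|first parameter=foo
|second_parameter=
|3rd parameter=bar
|4th=bazzzzz
|5th=
|etc.=
}}"""

print(align_infobox(sample))
```

A single substitution cannot do this on its own because the padding depends on the longest name in the whole block, which is only known after every line has been read.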

Parse a log file to fetch some values in a line

I am reading a log file and trying to fetch some values from lines that contain the substring "edited by:" and end with " bye".
This is how a log file is designed.
Error nothing reported
19-06-2021 LOGGER:INFO edited by : James Cooper Person Administrator bye. //Line 2
No data match.
19-06-2021 LOGGER:INFO edited by : Harry Rhodes Person External bye. //Line 4
.......
So I am trying to fetch:
James Cooper Person Administrator //from line 2
Harry Rhodes Person External //from line 4
And assign them to variables in my tcl program.
Assuming a fetched line is stored in a variable named line2, I would do something like:
set splitList [split $line2 " "]
set agent [lindex $splitList 0]
set firstName [lindex $splitList 1]
set lastName [lindex $splitList 2]
set role [lindex $splitList 3]
I understand that having the fetched or extracted lines from log file in a list is not a good idea as they are unstructured input. Using Tcl list functions can lead to weird things when they aren't in proper Tcl list format.
I am very new to tcl. And don't have much idea using regex in tcl.
So I tried extracting values from the matched line using regex. Suppose line2 is a variable holding the extracted matched line2 from the log file,
regexp -- {edited by:(.*) bye.$} $line2 match agent
I was able to get the expected output like below.
Person Harry Rhodes External
However, on this extracted string I don't know how I can further drill to get my variables assigned values. Any suggestion on this approach or any other functions which are present in tcl library which can help me with this task please let me know.
Updated the question by editing the log format. The format of the log file was not correct.
To err on the safe side, I would modify the regex to look for whitespace ([[:space:]]) between words, using * (= "any amount") and + (= "at least one") as appropriate and storing each variable in a capturing group (surrounded by parentheses ()):
edited[[:space:]]+by[[:space:]]*:[[:space:]]*([^[:space:]]*)[[:space:]]+([^[:space:]]*)[[:space:]]+([^[:space:]]*)[[:space:]]+([^[:space:]]*)[[:space:]]+bye.$
Please note that [^[:space:]] matches any character except whitespace.
Regex101 demo: https://regex101.com/r/78l4HJ/1
First off, taking apart the name of a person into its components is extremely difficult. For example, some people have a multi-word family name. (Yes, I know specific examples of this.) Other people put the parts in different orders. Can you avoid splitting the name?
The other parts of parsing that substring are easier as we can assume that agent and role will not have spaces in. The trick to this RE is that \w+ matches a “word” character sequence, \s+ matches a space character sequence (more robustly than a single space), and .*? matches anything, but as little of it as possible.
regexp {^\s*(\w+)\s+(.*?)\s+(\w+)\s*$} $substring -> agent name role
OK, that's great for the substring, but what about the whole line? It's really just a matter of adjusting the anchors. (\y matches a word boundary.)
regexp {\yedited by\s*:\s*(\w+)\s+(.*?)\s+(\w+)\s+bye\y} $line -> agent name role
It's often not a good idea to feed more than a line at a time into a regular expression search, not unless you need to. Fortunately your records are newline-delimited so that's not a problem here.
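As a cross-check of the pattern outside Tcl, here is a rough Python sketch run against the sample log from the question. It captures the first word, the middle part and the last word before "bye"; which of these is the agent, the name or the role depends on the real log layout, which the question leaves ambiguous, so the code does not label them. The function name parse_log is made up for this illustration, and the optional whitespace before the colon is an assumption based on the sample lines.

```python
import re

# First word, lazily matched middle part, last word before "bye".
LINE = re.compile(r"\bedited by\s*:\s*(\w+)\s+(.*?)\s+(\w+)\s+bye\b")

log = """Error nothing reported
19-06-2021 LOGGER:INFO edited by : James Cooper Person Administrator bye.
No data match.
19-06-2021 LOGGER:INFO edited by : Harry Rhodes Person External bye."""


def parse_log(text):
    """Return one (first, middle, last) tuple per matching line."""
    records = []
    for line in text.splitlines():  # one line at a time, as advised above
        m = LINE.search(line)
        if m:
            records.append(m.groups())
    return records


for first, middle, last in parse_log(log):
    print(first, "|", middle, "|", last)
```

Lines without "edited by" simply produce no match and are skipped.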

Tcl - How to Add Text after last character through regex?

I need a tip or suggestion, ideally with an example, on how to append a .txt extension after the last character of a variable's value.
For example:
set txt " ONLINE ENGLISH COURSE - LESSON 5 "
set result [concat "$txt" .txt]
Print:
Note that there are spaces at the start, in the middle and at the end of the phrase in the variable (txt). The spaces at the start and in the middle must be kept, but the last space, after the end of the sentence, should be replaced with the extension [.txt].
With the built-in concat method of Tcl, it does not achieve the desired effect.
The expected result was something like this:
ONLINE ENGLISH COURSE - LESSON 5.txt
I know I could remove spaces with string map but I don't know how to remove just the last occurrence on the line.
Other than that, I don't know how to remove the last space in order to add the text [.txt].
If anyone can point me to one or more solutions, thank you in advance.
set result "[string trimright $txt].txt"
or
set result [regsub {\s*$} $txt ".txt"]
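For readers more used to Python, the first variant (trim whitespace on the right only, then append) would look roughly like this; it is just an illustration of the same idea, not part of the Tcl answer.

```python
txt = " ONLINE ENGLISH COURSE - LESSON 5 "

# rstrip() removes trailing whitespace only; leading and inner spaces are kept.
result = txt.rstrip() + ".txt"
print(repr(result))
```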

Remove lines from buffer that match the selected text

When analyzing large log files, I often remove lines containing text I find irrelevant:
:g/whatever/d
Sometimes I find text that spans multiple lines, like stacktraces. For that, I record the steps taken (search, go to start anchor, delete to end anchor) and replay that macro with 100000@q. I'm searching for a function or a feature vim already includes that allows me to mark text and remove all lines containing this text. Ideally this would also work for block selection.
If I understood your problem right, this command should do what you want:
:g/NullPointer/,/omitt/d
Example:
Before:
1
2
3
NullPointerException1
4
5
6
omitted
7
NullPointerException2
8
9
omitted
10
After:
1
2
3
7
10
Please read :h edit-paragraph-join; there is a good explanation of the command there. Your case just changes join into d.
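To make the range logic explicit, here is a small Python sketch of what the command does to the example: drop every block that starts at a line matching the first pattern and ends at the next later line matching the second. The function name delete_ranges is made up, and dropping an unterminated block up to the end of the input is an assumption of this sketch, not necessarily what Vim does in that case.

```python
import re


def delete_ranges(lines, start_pat, end_pat):
    start, end = re.compile(start_pat), re.compile(end_pat)
    kept, skipping = [], False
    for line in lines:
        if not skipping and start.search(line):
            skipping = True  # the start line itself is dropped
            continue
        if skipping:
            if end.search(line):
                skipping = False  # the end line is dropped as well
            continue
        kept.append(line)
    return kept


before = ["1", "2", "3", "NullPointerException1", "4", "5", "6", "omitted",
          "7", "NullPointerException2", "8", "9", "omitted", "10"]

print(delete_ranges(before, r"NullPointer", r"omitt"))
```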
:g/whatever/d2
will delete a line with whatever and the line after it. If you can find text that always happens in the first line, you can strip out all of the following text if it has the same number of lines by changing 2 to whatever you need.
You could actually just use some normal commands in a global command to achieve what you want. Looking at your example (I hope I understood it more or less right):
someText
NullPointerException
...
omitted
You want to delete from the line above the NPE until the line with omitted, right?
Just use the following:
:g/NullPointerException/execute "normal! kddd/omitted\<cr>dd"
It may look complex, but it isn't. It is not better than a macro [1], but I like commands more, because I always make errors recording macros.
Since it only uses normal vim movements, it is easy to adapt. If, for example, you do not know where your previous anchor is, you could use ?anchor\<cr> instead of kd. For a better demonstration you will have to submit a realistic example.
[1] You could argue, that this only needs to be run once, but that is also true for a recursive macro http://vim.wikia.com/wiki/Record_a_recursive_macro
Thanks to the answers here, I was able to code a very handy function: the source below lets one select text and remove all lines with the same (or similar) text in the current buffer. That works with both in-line and multiline selection. As I said, I was searching for something that made me faster in analyzing log files. Log files typically contain dates and times and these change all the time, so it's a good idea to have something that lets us ignore numbers. Let's see. I'm using these two mappings:
vnoremap d :<C-U>echo RemoveSelectionFromBuffer(0)<CR>
vnoremap D :<C-U>echo RemoveSelectionFromBuffer(1)<CR>
Typical usage:
Remove similar lines ignoring numbers: Shift+v, then Shift+d
Remove same matches (single line): Mark text inline (leaving out dates and times), then d
Remove same matches (multiline): Mark text across lines (leaving out dates and times), then d
Here's the source code:
" Removes lines matching the selected text from buffer.
function! RemoveSelectionFromBuffer(ignoreNumbers)
  let lines = GetVisualSelection() " selected lines
  " Escape backslashes and slashes (delimiters)
  call map(lines, {k, v -> substitute(v, '\\\|/', '\\&', 'g')})
  if a:ignoreNumbers == 1
    " Substitute all numbers with \s*\d\+\s* - in formatted output matching
    " lines may have whitespace instead of numbers. All backslashes need
    " to be escaped because \V (very nomagic) will be used.
    call map(lines, {k, v -> substitute(v, '\s*\d\+\s*', '\\s\\*\\d\\+\\s\\*', 'g')})
  endif
  let blc = line('$') " number of lines in buffer (before deletion)
  let vlc = len(lines) " number of selected lines
  let pattern = join(lines, '\_.') " support multiline patterns
  let cmd = ':g/\V' . pattern . '/d_' . vlc " delete matching lines (d_3)
  let pos = getpos('v') " save position
  execute "silent " . cmd
  call setpos('.', pos) " restore position
  let dlc = blc - line('$') " number of deleted lines
  let dmc = dlc / vlc " number of deleted matches
  let cmd = substitute(cmd, '\(.\{50\}\).*', '\1...', '') " command output
  let lout = dlc . ' line' . (dlc == 1 ? '' : 's')
  let mout = '(' . dmc . ' match' . (dmc == 1 ? '' : 'es') . ')'
  return printf('%s removed: %s', (vlc == 1 ? lout : lout . ' ' . mout), cmd)
endfunction
I took the GetVisualSelection() code from this answer.
function! GetVisualSelection()
  if mode() == "v"
    let [line_start, column_start] = getpos("v")[1:2]
    let [line_end, column_end] = getpos(".")[1:2]
  else
    let [line_start, column_start] = getpos("'<")[1:2]
    let [line_end, column_end] = getpos("'>")[1:2]
  end
  if (line2byte(line_start)+column_start) > (line2byte(line_end)+column_end)
    let [line_start, column_start, line_end, column_end] =
    \ [line_end, column_end, line_start, column_start]
  end
  let lines = getline(line_start, line_end)
  if len(lines) == 0
    return ''
  endif
  let lines[-1] = lines[-1][: column_end - 1]
  let lines[0] = lines[0][column_start - 1:]
  return lines
endfunction
Thanks, aepksbuck, DoktorOSwaldo and Kent.
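The "ignore numbers" idea can also be tried outside Vim. The following Python sketch covers single-line selections only: every run of digits in the selected text (with surrounding whitespace) is turned into a wildcard for "some number", and everything else is matched literally. The function names and the sample log lines are invented for this illustration.

```python
import re

NUMBER = r"\s*\d+\s*"


def build_pattern(selection):
    """Escape the selection, but let every digit run match any number."""
    chunks = re.split(NUMBER, selection)
    return re.compile(NUMBER.join(re.escape(chunk) for chunk in chunks))


def remove_similar(lines, selection):
    """Drop every line that matches the selection when numbers are ignored."""
    pattern = build_pattern(selection)
    return [line for line in lines if not pattern.search(line)]


# Made-up log lines, only for demonstration.
log_lines = [
    "2021-06-19 10:00:01 worker 3 started",
    "2021-06-19 10:00:05 cache miss",
    "2021-06-20 11:12:13 worker 17 started",
]

print(remove_similar(log_lines, "2021-06-19 10:00:01 worker 3 started"))
```

Both "worker ... started" lines are removed even though their dates, times and worker numbers differ, which is the behaviour the D mapping aims for.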

vim: search, capture & replace on different lines using regex

Relatively new linux/vim/regex user here. I want to use regex to search for a numerical pattern, capture it, and then use the captured value to prepend a string to a field on the previous line. In other words, I have a file of the format:
title: description_id
text: {en: '2. text description'}
I want to capture the values from the text field and append them to the beginning of the title field...to yield something like this:
title: q2_description_id
text: {en: '2. text description'}
I feel like I've come across a way to reference other lines in a search & replace but am having trouble finding that now. Or maybe a macro would be suitable. Any help would be appreciated...thanks!
Perhaps something like:
:%s/\(title: \)\(.*\n\)\(text: \D*\)\(\d*\)/\1q\4_\2\3\4/
Where we are searching for 4 parts:
"title: "
rest of line and \n
"text: " and everything until next digit in line
first string of consecutive digits in line
and spitting them back out, with 4) inserted between 1) and 2).
EDIT: Shorter solution by Peter in the comments:
:%s/title: \zs\ze\_.\{-}text: \D*\(\d*\)/q\1_/
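The four-group substitution translates almost directly to other regex engines. As a sketch, here it is with Python's re module, assuming the title line is immediately followed by the text line as in the question; \g<N> is used in the replacement only to keep the group references unambiguous.

```python
import re

doc = """title: description_id
text: {en: '2. text description'}
"""

# 1: "title: "   2: rest of the title line and the newline
# 3: "text: " plus everything up to the first digit   4: the digits
pattern = re.compile(r"(title: )(.*\n)(text: \D*)(\d+)")

# Re-emit the groups with "q<digits>_" inserted between group 1 and group 2.
result = pattern.sub(r"\g<1>q\g<4>_\g<2>\g<3>\g<4>", doc)
print(result)
```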
Use \n for the new lines (and ^v+enter for new lines on the substitute line): A quick and not very elegant example:
:%s/title: description_id\n\ntext: {en: '\(\i*\)\(.*\)/title: q\1_description_id^Mtext: {en: '\1\2/