What would be the best approach to this substitution in Vim? - regex

A multi-line document has header/title lines, each followed by about 10 listings. I need to put the header/title info in with each of the listings so that they can be properly uploaded into a website (using comma and pipe delimiters). It looks like this:
SectionName1 and TitleName1
1111 - The SubSectionName A
222 - The SubSectionName B
3333 - The SubSectionName C
SectionName2 and TitleName2
444 - The SubSectionName D
55555 - The SubSectionName E
66 - The SubSectionName F
Repeating several hundred times. What I need is to produce something like:
SectionName1,TitleName1,1111,SubSectionNameA
SectionName1,TitleName1,222,SubSectionNameB
SectionName1,TitleName1,3333,SubSectionNameC
SectionName2,TitleName2,444,SubSectionNameD
SectionName2,TitleName2,55555,SubSectionNameE
SectionName2,TitleName2,66,SubSectionNameF
I realize there can be multiple approaches to this problem, but I'm having a difficult time committing to any one method. I understand submatches, joins and getline, but I am not good at applying them in practice in this scenario.
Any help to get me mentally started would be greatly appreciated.

Let me propose the following quite general Ex command solving the
issue [1].
:g/^\s*\h/d|let@"=substitute(@"[:-2],'\s\+and\s\+',',','')|ki|/\n\s*\h\|\%$/kj|
\ 'i,'js/^\s*\(\d\+\)\s\+-\s\+The/\=@".','.submatch(1).','/|'i,'js/\s\+//g
At the top level, this is the :global command that enumerates the lines
starting with zero or more whitespace characters followed by a Latin letter or
an underscore (see :help /\h). The lines matching this pattern are supposed
to be the header lines containing section and title names. The rest of the
command, after the pattern describing the header lines, are instructions to be
executed for each of those lines.
The actions to be performed on the headers can be divided into three steps.
Delete the current header line, at the same time extracting section
and title names from it.
:d|let@"=substitute(@"[:-2],'\s\+and\s\+',',','')
First, remove the current line, saving it into the unnamed register,
using the :delete command. Then, update the contents of that
register (referred to as @"; see :help @r and :help "") to be the
result of a substitution that changes the word "and", together with
the whitespace surrounding it, to a single comma. The actual
replacement is carried out by the substitute() function.
However, the input is not the exact string containing the whole header
line, but its prefix leaving out the last character, which is
a newline symbol. The [:-2] notation is a short form of the
[0:-2] subscript expression that designates the substring from the
very first byte to the second one counting from the end (see :help
expr-[:]). This way, the unnamed register holds the section and the
title names separated by a comma.
Determine the range of dependent subsection lines.
:ki|/\n\s*\h\|\%$/kj
After the first step, the subsection records belonging to the just
parsed header line are located starting from the current line (the one
that followed the header) until the next header line or, if there is no
such line below, the end of the buffer. The numbers of these lines are
stored in the marks i and j, respectively. (Run ":helpg ^A mark is"
to find the description of marks.)
The marks are placed using the :k command that sets a specified mark
at the last line of a given range which is the current line, by
default. So, unlike the first line of the considered block, the last
one requires a specific line range to point out its location.
A particular form of range, denoting the next line where a given
pattern matches, is used in this case (see :help :range). The
pattern defining the location of the line to be found, is composed in
such a way that it matches a line immediately preceding a header (a
line starting with possible whitespace followed by an alphabetical
character), or the very last line. (See :help pattern for details
about syntax of Vim regular expressions.)
Transform the delineated subsection lines according to desired format,
prepending section and title names found in the corresponding header
line.
:'i,'js/^\s*\(\d\+\)\s\+-\s\+The/\=@".','.submatch(1).','/|'i,'js/\s\+//g
This step consists of two :substitute commands that are run
over the range of lines delimited by the locations labelled by the
marks i and j (see :help [range]).
The first substitution command matches the beginning of a subsection
line (an identifier followed by a hyphen and the word The, with
whitespace around them) and replaces it with the contents of the
unnamed register (the section and title names joined by a comma),
followed by a comma, the matched identifier, and another comma. The
second substitution finalizes the transformation by removing all
whitespace characters on the line, which joins the subsection name
and the letter following it.
To construct the replacement string in the first :substitute
command, the substitute-with-an-expression feature is used (see :help
sub-replace-\=). The substitution part of the command should start
with \= for Vim to interpret the remaining text not in a regular
way, but as an expression (see :help expression). The result of
that expression's evaluation becomes the substitution string. Note
the use of the submatch() function in the substitute expression to
retrieve the text of a submatch by its number.
[1] The command is wrapped for better readability; its one-line
version is listed below for ease of copy-pasting into the Vim command line. Note
that the wrapped command can be used in a Vim script without any change.
:g/^\s*\h/d|let@"=substitute(@"[:-2],'\s\+and\s\+',',','')|ki|/\n\s*\h\|\%$/kj|'i,'js/^\s*\(\d\+\)\s\+-\s\+The/\=@".','.submatch(1).','/|'i,'js/\s\+//g
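For readers who want to check the logic outside Vim, the three steps can be sketched in Python. This is only an illustration under the same assumptions as the Ex command (a header starts with a letter or underscore, a listing starts with digits, a hyphen and the word The); the function name flatten_listings is made up for this sketch, and lines that match neither shape are simply skipped here.

```python
import re

def flatten_listings(text):
    """Prefix every listing with the section/title fields of its header."""
    rows = []
    header = ""
    for line in text.splitlines():
        # Header line: optional whitespace, then a letter or underscore.
        if re.match(r"\s*[A-Za-z_]", line):
            # Replace the first " and " with a comma, like substitute() above.
            header = re.sub(r"\s+and\s+", ",", line, count=1)
            continue
        # Listing line: identifier, hyphen, the word "The", then the name.
        m = re.match(r"\s*(\d+)\s+-\s+The(.*)", line)
        if m:
            row = f"{header},{m.group(1)},{m.group(2)}"
            # Remove all whitespace, like the final :s/\s\+//g.
            rows.append(re.sub(r"\s+", "", row))
    return rows

sample = """SectionName1 and TitleName1
1111 - The SubSectionName A
222 - The SubSectionName B
SectionName2 and TitleName2
444 - The SubSectionName D
"""

for row in flatten_listings(sample):
    print(row)
```

Running it on the sample prints one comma-separated row per listing, each carrying the names from the header above it.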

Simplest/fastest way I can think of is a simple macro. Do once, rinse, repeat.
Assuming your cursor is initially on the first character of the first line (S of SectionName), this macro should work as long as the document is exactly in the same format as posted above.
f ctT,<Esc>yyjpjjpjddkkkddkkkJr,f ctS,<Esc>f xjJr,f ctS,f xjJr,f ctS,<Esc>f xjdd

Well, I think the question is not that clear. Why, in your demo input, is the text after the "-" like this:
55555 - The SubSectionName E
but in your expected output, it turned into:
55555,SubSectionNameE
All spaces were removed, which is OK, but why was "The" removed as well? Is there any pattern for "The"?
I wrote an awk one-liner. It removes all spaces in the output but leaves the "The" in place; you can change it to get the output you need.
awk -F' and ' -vOFS="," 'NF>1{s=$1;t=$2;next;}$1{gsub(/\s+/,"");gsub(/-/,",");print s,t,$0} ' input
test on your example input:
kent$ cat v
SectionName1 and TitleName1
1111 - The SubSectionName A
222 - The SubSectionName B
3333 - The SubSectionName C
SectionName2 and TitleName2
444 - The SubSectionName D
55555 - The SubSectionName E
66 - The SubSectionName F
kent$ awk -F' and ' -vOFS="," 'NF>1{s=$1;t=$2;next;}$1{gsub(/\s+/,"");gsub(/-/,",");print s,t,$0} ' v
SectionName1,TitleName1,1111,TheSubSectionNameA
SectionName1,TitleName1,222,TheSubSectionNameB
SectionName1,TitleName1,3333,TheSubSectionNameC
SectionName2,TitleName2,444,TheSubSectionNameD
SectionName2,TitleName2,55555,TheSubSectionNameE
SectionName2,TitleName2,66,TheSubSectionNameF

Related

RegEx in Notepad++ to find lines with less or more than n pipes

I have a large pipe-delimited text file that should have one 3-column record per line. Many of the records are split up by line breaks within a column.
I need to do a find/replace to get three, and only three, pipes per line/record.
Here's an example (I added the line breaks (\r\n) to demonstrate where they are and what needs to be replaced):
12-1234|The quick brown fox jumped over the lazy dog.|Every line should look similar to this one|\r\n
56-7890A|This record is split\r\n
\r\n
on to multiple lines|More text|\r\n
09-1234AS|\r\n
||\r\n
\r\n
56-1234|Some text|Some more text\r\n
|\r\n
76-5432ABC|A record will always start with two digits, a dash and four digits|There may or may not be up to three letters after the four digits|\r\n
The caveat is that I need to retain those mid-record line breaks for the target system. They need to be replaced with \.br\. So the final result of the above should look like this:
12-1234|The quick brown fox jumped over the lazy dog.|Every line should look similar to this one|\r\n
56-7890A|This record is split\.br\\.br\on to multiple lines|More text|\r\n
09-1234AS|\.br\||\.br\\r\n
56-1234|Some text|Some more text\.br\|\r\n
76-5432ABC|A record will always start with two digits, a dash and four digits|There may or may not be up to three letters after the four digits|\r\n
As you can see the mid-record line breaks have all been replaced with \.br\ and the end-of-line line breaks have been retained to keep each three-column/pipe record on its own line. Note the last record's text, explaining how each line/record begins. I included that in case that would help in building a regex to properly identify the beginning of a record.
I'm not sure if this can be done in one find/replace step or if it needs to be (or just should be) split up into a couple of steps.
I had the thought to first search for |\r\n, since all records end with a pipe and a CRLF, and replace those with dummy text !##$. Then search for the remaining line breaks with \r\n, which will be mid-column line breaks and replace those with \.br\, then replace the dummy text with the original line breaks that I want to keep |\r\n.
That worked for all but records that looked like the third record in the first example, which has several line breaks after a pipe within the record. In such a large file as I am working with it wasn't until much later that I found that the above process I was using didn't properly catch those instances.
You can use
(?:\G(?!^(?<!.))|^\d{2}-\d+[A-Z]*\|[^|]*?(?:\|[^|]*?)?)\K\R+
Replace with \\.br\\. Details:
(?:\G(?!^(?<!.))|^\d{2}-\d+[A-Z]*\|[^|]*?(?:\|[^|]*?)?) - either the end of the previous match (\G(?!^(?<!.))) or (|) the start of a line, two digits, a hyphen, one or more digits, zero or more letters, a |, then any zero or more chars other than |, as few as possible, and then an optional sequence of | and any zero or more chars other than |, as few as possible (^\d{2}-\d+[A-Z]*\|[^|]*?(?:\|[^|]*?)?)
\K - omit the text matched so far from the overall match
\R+ - one or more line breaks.
If you need to remove empty lines after this, use Edit > Line Operations > Remove Empty Lines.
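If a script is an option, here is a rough Python sketch of a different technique from the regex above (Python's re module has no \G or \K): count pipes and keep gluing physical lines together with \.br\ until a record has three of them. The function name join_records is invented for this sketch, and it assumes that a record is complete as soon as three pipes have been seen and that blank lines between records can be dropped.

```python
def join_records(text, pipes_per_record=3, marker="\\.br\\"):
    """Merge physical lines into records that hold three pipes each."""
    records = []
    parts = []   # physical lines collected for the current record
    pipes = 0    # pipes seen so far in the current record
    for line in text.splitlines():
        if not parts and not line:
            continue  # blank line between two records: drop it
        parts.append(line)
        pipes += line.count("|")
        if pipes >= pipes_per_record:
            # Every line break inside the record becomes one marker.
            records.append(marker.join(parts))
            parts, pipes = [], 0
    if parts:
        records.append(marker.join(parts))  # trailing incomplete record
    return records

sample = (
    "56-7890A|This record is split\r\n"
    "\r\n"
    "on to multiple lines|More text|\r\n"
    "56-1234|Some text|Some more text\r\n"
    "|\r\n"
)
print("\r\n".join(join_records(sample)))
```

Note that this handles a record whose third pipe is followed by extra blank lines differently from the expected output in the question: the trailing blank lines are dropped rather than turned into \.br\.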

Renaming fasta headers to bracketed text

I have a file with 250 fasta sequences. Right now, the headers look like this:
>NP_041982.1 DNA polymerase [Enterobacteria phage T7]
I want to change the headers so they look like this:
>Enterobacteria phage T7
For each header, I only want what is in-between the brackets. I'm trying to do this through linux commands.
Can anyone help with this?
file.fa contents
>Sequence One [Species 1]
actgtattagctaatcgatcagttacgattcga
tagctacgtacgtacgatcgatcagtcagctag
>Sequence Two [Species 2]
ttgtagctagctagctagctagctagctacgta
tgcatcgatcgattaatatcgcgccctaactcg
>Sequence Three
atgatagtctggtcatcgattcagtcagttcat
ttgcatgatctactagatcgatattagctagat
>Sequence Four [early bracket] text
tagctacgtacgatcgtacgatcgatcgtatat
gctagtcgactagctagctacgtacgtacgtaa
sed command:
sed 's#^>[^[]*\[\([^]]*\)\]$#>\1#g' file.fa
It looks a bit convoluted, but it means:
take any line that matches the pattern "starts with >, followed by any number of characters other than [, then a [, then any number of characters other than ], then a ] at the end of the line". Capture the string between the brackets, and replace the entire match with > followed by the captured text.
This prints the output:
>Species 1
actgtattagctaatcgatcagttacgattcga
tagctacgtacgtacgatcgatcagtcagctag
>Species 2
ttgtagctagctagctagctagctagctacgta
tgcatcgatcgattaatatcgcgccctaactcg
>Sequence Three
atgatagtctggtcatcgattcagtcagttcat
ttgcatgatctactagatcgatattagctagat
>Sequence Four [early bracket] text
tagctacgtacgatcgtacgatcgatcgtatat
gctagtcgactagctagctacgtacgtacgtaa
The output can be saved to a new file with
sed 's#^>[^[]*\[\([^]]*\)\]$#>\1#g' file.fa > converted_filename.fa
Note that any headers without matches are printed as-is, and any lines that have characters after the final bracket will also be printed as-is. Might act odd if it encounters left brackets that are not closed on the same line. I'd recommend you double check that the new file has the same number of lines as the original.
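The same rule can be sketched in Python if sed is not at hand. This is only an illustration of the pattern described above, not a replacement for the sed command; the names HEADER and rename_headers are made up here, and each line is assumed to be passed in without its trailing newline.

```python
import re

# ">" + anything but "[" + "[" + captured text without "]" + "]" at line end
HEADER = re.compile(r"^>[^\[]*\[([^\]]*)\]$")

def rename_headers(lines):
    """Replace each matching header with '>' plus the bracketed text."""
    return [HEADER.sub(r">\1", line) for line in lines]

sample = [
    ">NP_041982.1 DNA polymerase [Enterobacteria phage T7]",
    "actgtattagctaatcgatcagttacgattcga",
    ">Sequence Three",
    ">Sequence Four [early bracket] text",
]

for line in rename_headers(sample):
    print(line)
```

As with the sed version, headers without brackets, or with text after the closing bracket, come out unchanged.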

regex Pattern Matching over two lines - search and replace

I have a text document that I need help with. Below is an extract of a tab-delimited text document in which the first line of each three-line pattern always starts with a number. The document will always be in this format, with the same tabbed layout on each of the three lines.
nnnn **variable** V -------
* FROM CLIP NAME - **variable**
* LOC: variable variable **variable**
I want to replace the second field on the first line with the fourth field on the third line. And then replace the field after the colon on the second line with the original second field on the first line. Is this possible with regex? I am used to single line search replace function but not multiline patterns.
000003 A009C001_151210_R6XO V C 11:21:12:17 11:21:57:14 01:00:18:22 01:01:03:19
*FROM CLIP NAME: 5-1A
*LOC: 01:00:42:15 WHITE 005_NST_010_E02
000004 B008C001_151210_R55E V C 11:21:18:09 11:21:53:07 01:01:03:19 01:01:38:17
*FROM CLIP NAME: 5-1B
*LOC: 01:01:20:14 WHITE 005_NST_010_E03
The Result would look like :
000003 005_NST_010_E02 V C 11:21:12:17 11:21:57:14 01:00:18:22 01:01:03:19
*FROM CLIP NAME: A009C001_151210_R6XO
*LOC: 01:00:42:15 WHITE 005_NST_010_E02
000004 005_NST_010_E03 V C 11:21:18:09 11:21:53:07 01:01:03:19 01:01:38:17
*FROM CLIP NAME: B008C001_151210_R55E
*LOC: 01:01:20:14 WHITE 005_NST_010_E03
Many Thanks in advance.
A regular expression defines a regular language. Alone, this only expresses a structure of some input. Performing operations on this input requires some kind of processing tool. You didn't specify which tool you were using, so I get to pick.
Multiline sed
You wrote that you are "used to single line search replace function but not multiline patterns." Perhaps you are referring to substitution with sed. See the question "How can I use sed to replace a multi-line string?". It is more complicated than with a single line, but it is possible.
An AWK script
AWK is known for its powerful one-liners, but you can also write scripts. Here is a script that identifies the beginning of a new record/pattern using a regular expression to match the first number. (I hesitate to call it a "record" because this has a specific meaning in AWK.) It stores the fields of the first two lines until it encounters the third line. At the third line, it has all the information needed to make the desired replacements. It then prints the modified first two lines and continues. The third line is printed unchanged (you specified no replacements for the third line). If there are additional lines before the start of the next record/pattern, they will also be printed unchanged.
It's unclear exactly where the tab characters are in your sample input because the submission system has replaced them with spaces. I am assuming there is a tab between FROM CLIP NAME: and the following field and that the "variables" on the first and third line are also tab-separated. If the first number of each record/pattern is hexadecimal instead of decimal, replace the [[:digit:]] with [[:xdigit:]].
fixit.awk
#!/usr/bin/awk -f
BEGIN { FS="\t"; n=0 }
{n++}
/^[[:digit:]]+\t/ { n=1 }
# Split and save first two lines
n==1 { line1_NF = split($0, line1, FS); next }
n==2 { line2_NF = split($0, line2, FS); next }
n==3 {
    # At the third line, make replacements
    line1_2 = line1[2]
    line1[2] = $4
    line2[2] = line1_2
    # Print modified first two lines
    printf "%s", line1[1]
    for ( i=2; i<=line1_NF; ++i )
        printf "\t%s", line1[i]
    print ""
    printf "%s", line2[1]
    for ( i=2; i<=line2_NF; ++i )
        printf "\t%s", line2[i]
    print ""
}
1 # Print lines after the second unchanged
You can use it like
$ awk -f fixit.awk infile.txt
or to pipe it in
$ cat infile.txt | awk -f fixit.awk
This is not the most regular expression inspired solution, but it should make the replacements that you want. For a more complex structure of input, an ideal solution would be to write a scanner and parser that correctly interprets the full input language. Using tools like string substitution might work for simple specific cases, but there could be nuances and assumptions you've made that don't apply in general. A parser can also be more powerful and implement grammars that can express languages which can't be recognized with regular expressions.
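For comparison, the same three-line swap can be sketched in Python. It makes the same assumptions as the AWK script about where the tabs are (a tab after FROM CLIP NAME: and tab-separated fields on the first and third lines); the function name swap_fields is invented for this sketch, and a group that does not have the expected shape is passed through unchanged.

```python
import re

def swap_fields(lines):
    """Swap fields inside each three-line group that starts with a number."""
    out = []
    i = 0
    while i < len(lines):
        # A group starts with a line whose first tab-separated field is numeric.
        if re.match(r"\d+\t", lines[i]) and i + 2 < len(lines):
            first = lines[i].split("\t")
            second = lines[i + 1].split("\t")
            third = lines[i + 2].split("\t")
            if len(second) > 1 and len(third) > 3:
                # Line 1 field 2 takes line 3 field 4;
                # line 2 field 2 takes the old line 1 field 2.
                first[1], second[1] = third[3], first[1]
                out += ["\t".join(first), "\t".join(second), lines[i + 2]]
                i += 3
                continue
        out.append(lines[i])  # anything else is copied unchanged
        i += 1
    return out

sample = [
    "000003\tA009C001_151210_R6XO\tV\tC",
    "*FROM CLIP NAME:\t5-1A",
    "*LOC:\t01:00:42:15\tWHITE\t005_NST_010_E02",
]
print("\n".join(swap_fields(sample)))
```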

Math in Vim search-and-replace

I have a file with times (minutes and seconds), which looks approximately as follows:
02:53 rest of line 1...
03:10 rest of line 2...
05:34 rest of line 3...
05:35 rest of line 4...
10:02 rest of line 5...
...
I would like to replace the time by its equivalent in seconds. Ideally, I would like to run some magical command like this:
:%s/^\(\d\d\):\(\d\d\) \(.*\)/(=\1*60 + \2) \3/g
...where the (=\1*60 + \2) is the magical part. I know I can insert results of evaluation with the special register =, but is there a way to do this in the subst part of a regex?
Something like this?
:%s/^\(\d\d\):\(\d\d\)/\=submatch(1)*60+submatch(2)/
When the replacement starts with \=, the replacement is interpreted as an expression.
:h sub-replace-expression is copied below
Substitute with an expression *sub-replace-expression*
*sub-replace-\=*
When the substitute string starts with "\=" the remainder is interpreted as an
expression. This does not work recursively: a substitute() function inside
the expression cannot use "\=" for the substitute string.
The special meaning for characters as mentioned at |sub-replace-special| does
not apply except for "<CR>", "\<CR>" and "\\". Thus in the result of the
expression you need to use two backslashes to get one, put a backslash before a
<CR> you want to insert, and use a <CR> without a backslash where you want to
break the line.
For convenience a <NL> character is also used as a line break. Prepend a
backslash to get a real <NL> character (which will be a NUL in the file).
When the result is a |List| then the items are joined with separating line
breaks. Thus each item becomes a line, except that they can contain line
breaks themselves.
The whole matched text can be accessed with "submatch(0)". The text matched
with the first pair of () with "submatch(1)". Likewise for further
sub-matches in ().
Use submatch() to refer to a grouped part in the substitution place:
:%s/\v^(\d{2}):(\d{2})>/\=submatch(1) * 60 + submatch(2)/
With your example yields:
173 rest of line 1...
190 rest of line 2...
334 rest of line 3...
335 rest of line 4...
602 rest of line 5...
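The same idea, a replacement computed from the captured groups, exists in other regex engines too. As a small sketch for comparison, Python's re.sub accepts a function in place of the replacement string; the helper name to_seconds is made up for this example.

```python
import re

def to_seconds(line):
    """Replace a leading mm:ss time stamp with its value in seconds."""
    return re.sub(
        r"^(\d\d):(\d\d)\b",
        # The callable plays the role of Vim's \= expression.
        lambda m: str(int(m.group(1)) * 60 + int(m.group(2))),
        line,
    )

for line in ["02:53 rest of line 1...", "10:02 rest of line 5..."]:
    print(to_seconds(line))
```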
Hopefully this will be helpful to someone else: I had a similar problem where I wanted to replace the "id" value with a different number (any other number, in fact):
{"id":1,"first_name":"Ruperto","last_name":"Bonifayipio","gender":"Male","ssn":"318-69-4987"},
I used the expression below (the \ze ends the match before the comma, so the comma itself is kept):
:%s/\v(\d+)\ze,/\=submatch(1)*1111/g
which results in the following new value:
{"id":1111,"first_name":"Ruperto","last_name":"Bonifayipio","gender":"Male","ssn":"318-69-4987"},

Remove the first character of each line and append using Vim

I have a data file as follows.
1,14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065
1,13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050
1,13.16,2.36,2.67,18.6,101,2.8,3.24,.3,2.81,5.68,1.03,3.17,1185
1,14.37,1.95,2.5,16.8,113,3.85,3.49,.24,2.18,7.8,.86,3.45,1480
1,13.24,2.59,2.87,21,118,2.8,2.69,.39,1.82,4.32,1.04,2.93,735
Using Vim, I want to remove the 1's from the start of each line and append them to the end. The resultant file would look like this:
14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065,1
13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050,1
13.16,2.36,2.67,18.6,101,2.8,3.24,.3,2.81,5.68,1.03,3.17,1185,1
14.37,1.95,2.5,16.8,113,3.85,3.49,.24,2.18,7.8,.86,3.45,1480,1
13.24,2.59,2.87,21,118,2.8,2.69,.39,1.82,4.32,1.04,2.93,735,1
I was looking for an elegant way to do this.
Actually I tried it like
:%s/$/,/g
And then
:%s/$/^./g
But I could not make it to work.
EDIT: Well, actually I made one mistake in my question. In the data file, the first character is not always 1; it is a mixture of 1, 2 and 3. So, from all the answers to this question, I came up with the solution:
:%s/^\([1-3]\),\(.*\)/\2,\1/g
and it is working now.
Here is a regular expression that doesn't care which number comes first, how many digits it has, or which separator you've used. That is, it works both for lines that have 1 as their first number and for lines that have 114:
:%s/\([0-9]*\)\(.\)\(.*\)/\3\2\1/
Explanation:
:%s// - Substitute on every line (%)
\(<something>\) - Capture and store as \n (\1, \2, ...)
[0-9]* - A digit, 0 or more times
. - Any single character; in this case, the comma
.* - Any character, 0 or more times
\3\2\1 - The replacement: the captured groups in reverse order
So: cut the line up into 1 , and <the rest>, stored as \1, \2 and \3 respectively, and reorder them.
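The capture-and-reorder idea is easy to try outside Vim as well. Below is a minimal Python sketch of the same pattern; the function name move_first_field is made up, and the line is assumed to be passed in without its trailing newline.

```python
import re

def move_first_field(line):
    """Move the leading number (and its separator) to the end of the line."""
    # \1 = leading digits, \2 = the separator, \3 = the rest of the line
    return re.sub(r"^([0-9]*)(.)(.*)$", r"\3\2\1", line)

print(move_first_field("1,14.23,1.71,1065"))
print(move_first_field("114;2.5;3.1"))
```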
This
:%s/^1,//
:%s/$/,1/
could be somewhat simpler to understand.
:%s/^1,\(.*\)/\1,1/
This will do the replacement on each line in the file. The \1 inserts everything captured by \(.*\).
:%s/1,\(.*$\)/\1,1/gc
You could also solve this one using a macro. First, think about how to delete the 1, from the start of a line and append it to the end:
0 go to the start of the line
df, delete everything to and including the first ,
A,<ESC> append a comma to the end of the line
p paste the thing you deleted with df,
x delete the trailing comma
So, to sum it up, the following will convert a single line:
0df,A,<ESC>px
Now if you'd like to apply this set of modifications to all the lines, you will first need to record them:
qj start recording into the 'j' register
0df,A,<ESC>px convert a single line
j go to the next line
q stop recording
Finally, you can execute the macro anytime you want using @j, or convert your entire file with 99@j (using a higher number than 99 if you have more than 99 lines).
Here's the complete version:
qj0df,A,<ESC>pxjq99@j
This one might be easier to understand than the other solutions if you're not used to regular expressions!