Getting text that is on a different line, with ex in Vim - regex

Let's say I have the following text in Vim:
file1.txt
file2.txt
file3.txt
renamed1.txt
renamed2.txt
renamed3.txt
I want a transformation as follows:
file1.txt renamed1.txt
file2.txt renamed2.txt
file3.txt renamed3.txt
What I have in mind is something like the following:
:1,3 s/$/ <the text that is 4 lines below this line>
I'm stuck with how to specify the <the text that is 4 lines below this line> part.
I have tried something like .+4 (4 lines below the current line) but to no avail.

You can do it with blockwise cut & paste.
1) Insert a space at the start of each "renamed" line, e.g. :5,7s/^/ / (adjust the range to wherever those lines sit in your buffer).
2) Use blockwise visual selection (Ctrl-V) to select all the "file" lines, and press d to delete them.
3) Use blockwise visual selection again to select the space character at the start of all the "renamed" lines, and press p. This pastes each line of the deleted block at the start of the corresponding "renamed" line.

:1,3:s/\ze\n\%(.*\n\)\{3}\(.*\)/ \1
Explained:
\ze - marks the end of the part that gets replaced; the rest of the pattern must still match, but the text it matches is left untouched
\n - end of current line
\%(.*\n\)\{3} - next 3 lines (\%( \) groups without capturing)
\(.*\) - content of the 4th line from here, captured as \1
This will leave the later lines where they are.

I would record a macro for this, really: delete the lower line, move up, paste, join the lines, then run the macro on the others. The other method I think would be appropriate is a separate script that acts as a filter.
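For the separate-script route, here is a minimal sketch in Python. The function name pair_with_offset and its parameters are made up for illustration, and the sample buffer assumes a blank separator line between the two groups (inferred from the "4 lines below" wording in the question).

```python
def pair_with_offset(lines, count, offset):
    """Append to each of the first `count` lines the line `offset` lines below it.

    Only the paired lines are returned; unlike the :s command above,
    the lower lines are not kept in the result.
    """
    return [f"{lines[i]} {lines[i + offset]}" for i in range(count)]


if __name__ == "__main__":
    buffer = [
        "file1.txt", "file2.txt", "file3.txt",
        "",  # blank separator line (an assumption, see the note above)
        "renamed1.txt", "renamed2.txt", "renamed3.txt",
    ]
    for line in pair_with_offset(buffer, count=3, offset=4):
        print(line)
```

Used as a filter, the script would read the buffer from standard input instead of the hard-coded list.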

Related

Mass regex search-and-replace BETWEEN patterns

I have a directory with a bunch of text files, all of which follow this structure:
...
- Some random number of list items of random text
- And even more of it
PATTERN_A (surrounded by empty lines)
- Again, some list items of random text
- Which does look similar as the first batch
PATTERN_B (surrounded by empty lines)
- And even more some random text
....
And I need to run a replace operation (let's say, I need to prepend CCC at the beginning of the line, just after the dash) on only those "list items", which are between PATTERN_A and PATTERN_B. The problem is they aren't really much different from the text above PATTERN_A, or below PATTERN_B, so an ordinary regex can't really catch them without also affecting the remaining text.
So, my question would be, what tool and what regex should I use to perform that replacement?
(Just in case, I'm fine with Vim, and I can collect those files in a QuickFix for a further :cdo, for example. I'm not that good with awk, unfortunately, and absolutely bad with Perl :))
Thanks!
If I have understood your question, you can do so quite easily with a pattern-range selection and the general substitution form with sed (stream editor). For example, in your case:
$ sed '/PATTERN_A/,/PATTERN_B/s/^\([ ]*-\)/\1CCC/' file
- Some random number of list items of random text
- And even more of it
PATTERN_A (surrounded by empty lines)
-CCC Again, some list items of random text
-CCC Which does look similar as the first batch
PATTERN_B (surrounded by empty lines)
- And even more some random text
(note: to substitute in place within the file add the -i option, and to create a backup of the original add -i.bak which will save the original file as file.bak)
Explanation
/PATTERN_A/,/PATTERN_B/ - select the lines from the one matching PATTERN_A through the one matching PATTERN_B (both inclusive)
s/^\([ ]*-\)/\1CCC/ - substitute (general form 's/find/replace/'). The find part anchors at the beginning of the line (^) and captures, between \( and \), any number of spaces followed by a hyphen ([ ]*-). The replace part writes the captured text back through the backreference \1 and appends CCC to it.
Look things over and let me know if you have questions or if I misinterpreted your question.
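If it helps to see the range logic spelled out, here is a rough Python sketch of what the sed address range does. The function name prefix_between is hypothetical, and the sketch mimics sed's behaviour of looking for the end pattern only on lines after the one that opened the range.

```python
import re


def prefix_between(lines, start_pat, end_pat, prefix="CCC"):
    """Insert `prefix` after the leading dash on lines from start_pat through end_pat."""

    def substitute(line):
        # Same idea as s/^\([ ]*-\)/\1CCC/: keep optional spaces and the dash.
        return re.sub(r"^( *-)", lambda m: m.group(1) + prefix, line)

    inside = False
    out = []
    for line in lines:
        if inside:
            out.append(substitute(line))
            if re.search(end_pat, line):
                inside = False  # the end line itself is still part of the range
        elif re.search(start_pat, line):
            inside = True
            out.append(substitute(line))
        else:
            out.append(line)
    return out
```

Calling prefix_between(lines, "PATTERN_A", "PATTERN_B") on the lines of one file leaves everything outside the range untouched.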
You can get the same result with Perl:
> perl -pe ' { s/^(\s*-)/${1}CCC/g if /PATTERN_A/../PATTERN_B/ } ' mass_replace.txt
...
- Some random number of list items of random text
- And even more of it
PATTERN_A (surrounded by empty lines)
-CCC Again, some list items of random text
-CCC Which does look similar as the first batch
PATTERN_B (surrounded by empty lines)
- And even more some random text
....
>

Renaming fasta headers to bracketed text

I have a file with 250 fasta sequences. Right now, they look like this:
>NP_041982.1 DNA polymerase [Enterobacteria phage T7]
I want to change the headers so they look like this:
>Enterobacteria phage T7
For each header, I only want what is between the brackets. I'm trying to do this with Linux commands.
Can anyone help with this?
file.fa contents
>Sequence One [Species 1]
actgtattagctaatcgatcagttacgattcga
tagctacgtacgtacgatcgatcagtcagctag
>Sequence Two [Species 2]
ttgtagctagctagctagctagctagctacgta
tgcatcgatcgattaatatcgcgccctaactcg
>Sequence Three
atgatagtctggtcatcgattcagtcagttcat
ttgcatgatctactagatcgatattagctagat
>Sequence Four [early bracket] text
tagctacgtacgatcgtacgatcgatcgtatat
gctagtcgactagctagctacgtacgtacgtaa
sed command:
sed 's#^>[^[]*\[\([^]]*\)]$#>\1#' file.fa
It looks a bit convoluted, but it means:
Take any line that starts with >, followed by any number of characters other than [, followed by [, followed by any number of characters other than ], followed by ] at the end of the line. Capture the string between the brackets, and replace the entire match with > plus the captured text.
This prints the output:
>Species 1
actgtattagctaatcgatcagttacgattcga
tagctacgtacgtacgatcgatcagtcagctag
>Species 2
ttgtagctagctagctagctagctagctacgta
tgcatcgatcgattaatatcgcgccctaactcg
>Sequence Three
atgatagtctggtcatcgattcagtcagttcat
ttgcatgatctactagatcgatattagctagat
>Sequence Four [early bracket] text
tagctacgtacgatcgtacgatcgatcgtatat
gctagtcgactagctagctacgtacgtacgtaa
The output can be saved to a new file with
sed 's#^>[^[]*\[\([^]]*\)]$#>\1#' file.fa > converted_filename.fa
Note that any headers without matches are printed as-is, and any headers that have characters after the final bracket are also printed as-is. It might act oddly if it encounters left brackets that are not closed on the same line. I'd recommend you double-check that the new file has the same number of lines as the original.
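As a cross-check, the same rule can be sketched in Python. The names HEADER and rename_header are made up for this example, and lines are assumed to be passed without their trailing newline.

```python
import re

# ">" + anything without "[" + "[" + (text without "]") + "]" at end of line
HEADER = re.compile(r"^>[^\[]*\[([^\]]*)\]$")


def rename_header(line):
    """Return '>' plus the bracketed text, or the line unchanged if it does not match."""
    m = HEADER.match(line)
    return f">{m.group(1)}" if m else line


if __name__ == "__main__":
    for line in [">Sequence One [Species 1]", "actgtattagctaatcg", ">Sequence Three"]:
        print(rename_header(line))
```

Because every input line produces exactly one output line, the line count of the converted file stays the same as the original.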

Format a text file by regex match and replace

I have a text file that looks like the following:
Chanelle
Jettie
Winnie
Jen
Shella
Krysta
Tish
Monika
Lynwood
Danae
2649
2466
2890
2224
2829
2427
2816
2648
2833
2453
I need to make it look like this
Chanelle 2649
Jettie 2466
... ...
I tried a lot in Sublime Text but couldn't figure out the regex to do that. Can somebody show whether it can be done?
I tested the following in Notepad++ but it should work universally.
Use this as the search string:
(?:(\s+[A-Za-z]+)(\r?\n))((?:\s*[A-Za-z]*\r?\n)+)\s+(\d+)
and this as the replacement:
$1 $4$2$3
Each pass of the replace pairs one name with its number, so run it repeatedly until there are no matching lines left.
Alternatively, you can use this as the replacement if you want to have the values aligned by tabs, but it's not going to match in all cases:
$1\t\t$4$2$3
While the regex answer by SeinopSys will work, you don't need a regex to do this - instead, you can take advantage of Sublime's multiple cursors.
1) Place your cursor at the beginning of line 1, then hold down Shift+↓ to select all the names.
2) Hit Ctrl+Shift+L (Selection -> Split into Lines) to split the selection into lines.
3) Press Ctrl+C to copy.
4) Place your cursor on line 11 (the first number line) and press Ctrl+Shift+↓ (Windows/OS X) or Alt+Shift+↓ (Linux) to place a cursor at the beginning of each number line.
5) Hit Ctrl+V to paste the names before the numbers.
You can now delete the names at the top and you're all set. Alternatively, you could use Ctrl+X to cut the names in step 3.

How to join lines adding a separator?

The command J joins lines.
The command gJ joins lines without inserting or removing any spaces.
Is there also a command to Join lines adding a separator between the lines?
Example:
Input:
text
other text
more text
text
What I want to do:
- select these 4 lines
- if there are spaces at start and/or EOL remove them
- join lines adding a separator '//' between them
Output:
text//other text//more text//text
You can use :substitute for that, matching on \n:
:%s#\s*\n\s*#//#g
However, this appends the separator at the end, too (because the last line in the range also has a newline). You could remove that manually, or specify the c flag and quit the substitution before the last one, or reduce the range by one and :join the last one instead:
:1,$-1s#\s*\n\s*#//#g|join
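Outside Vim, the same trim-and-join can be sketched in a few lines of Python. The function name join_with is hypothetical; the sketch strips whitespace at both ends of every line and does not add a trailing separator.

```python
def join_with(lines, sep="//"):
    """Strip leading/trailing whitespace from each line and join them with `sep`."""
    return sep.join(line.strip() for line in lines)


if __name__ == "__main__":
    print(join_with(["text", "  other text", "more text  ", "text"]))
```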
I wrote a plugin, "Join", that can do what you want, and more.
https://github.com/sk1418/Join
In addition to all the features provided by the built-in :join command, Join can:
Join lines with separator (string)
Join lines with or without trimming the leading/trailing whitespaces
Join lines with negative count (backwards join)
Join lines in reverse
Join lines and keep joined lines (without removing joined lines)
Join lines with any combinations of above options
Check the homepage for details and examples/screenshots.
There are a few ways to do it, but I would recommend the simplest route possible: recording a macro or running a multi-step command, for example:
- Append the separator to all lines except the last, either
  - with a substitution (:1,$-1s#$#//#), or
  - with a normal-mode append (:1,$-1norm A//)
- Then join the lines using a visual selection (vGgJ) or by any other method.
Unless you do this operation very often, you will most likely forget any complex command or the existence of a specialized plugin in your config, hence my recommendation to use generic, frequently used sub-steps.
Another substitution, for the sake of diversity:
:%s:\n\ze.://
This lists 50 items per line, separated by commas (xargs joins each batch of 50 with spaces, and sed turns the spaces into commas):
seq 0 70 | xargs -L 50 | sed 's/ /,/g'
Output:
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49
50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70
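The same batching can be sketched in Python for comparison. The function name chunked_csv is made up for this example; it groups the items into rows of a fixed size and joins each row with commas.

```python
def chunked_csv(items, size=50):
    """Return one comma-separated string per group of `size` items."""
    items = [str(x) for x in items]
    return [",".join(items[i:i + size]) for i in range(0, len(items), size)]


if __name__ == "__main__":
    for row in chunked_csv(range(71)):
        print(row)
```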

Regular Expression over multiple lines

I'm stuck with this for several hours now and cycled through a wealth of different tools to get the job done. Without success. It would be fantastic, if someone could help me out with this.
Here is the problem:
I have a very large CSV file (400 MB+) that is not formatted correctly. Right now it looks something like this:
This is a long abstract describing something. What follows is the tile for this sentence."
,Title1
This is another sentence that is running on one line. On the next line you can find the title.
,Title2
As you can probably see the titles ",Title1" and ",Title2" should actually be on the same line as the foregoing sentence. Then it would look something like this:
This is a long abstract describing something. What follows is the tile for this sentence.",Title1
This is another sentence that is running on one line. On the next line you can find the title.,Title2
Please note that the end of the sentence can contain quotes or not. In the end they should be replaced too.
Here is what I came up with so far:
sed -n '1h;1!H;${;g;s/\."?.*,//g;p;}' out.csv > out1.csv
This should actually get the job done of matching the expression over multiple lines. Unfortunately it doesn't :)
The expression is looking for the dot at the end of the sentence and the optional quotes plus a newline character that I'm trying to match with .*.
Help much appreciated. And it doesn't really matter what tool gets the job done (awk, perl, sed, tr, etc.).
Multiline editing in sed isn't necessarily tricky per se; it's just that it uses commands most people aren't familiar with, and those commands have certain side effects, such as 'N' separating the current line from the appended next line with a '\n' in the pattern space.
Anyway, it's much easier if you match on a line that starts with a comma to decide whether or not to remove the newline, so that's what I did here:
sed 'N;/\n,/s/"\? *\n//;P;D' title_csv
Input
$ cat title_csv
don't touch this line
don't touch this line either
This is a long abstract describing something. What follows is the tile for this sentence."
,Title1
seriously, don't touch this line
This is another sentence that is running on one line. On the next line you can find the title.
,Title2
also, don't touch this line
Output
$ sed 'N;/\n,/s/"\? *\n//;P;D' title_csv
don't touch this line
don't touch this line either
This is a long abstract describing something. What follows is the tile for this sentence.,Title1
seriously, don't touch this line
This is another sentence that is running on one line. On the next line you can find the title.,Title2
also, don't touch this line
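Yours works with a few small changes:
sed -n '1h;1!H;${;g;s/\."\?\n,/.,/g;p;}' inputfile
The ? needs to be escaped (a bare ? is a literal character in a basic regular expression), the newline should be matched explicitly with \n rather than with .* (which is greedy and would also consume text beyond the line break), and the replacement has to put the period and the comma back, since otherwise they are deleted along with the newline.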
Here's another way to do it which doesn't require using the hold space:
sed -n '${p;q};N;/\n,/{s/"\?\n//p;b};P;D' inputfile
Here is a commented version:
sed -n '
# on the last input line, print it and quit
${
    p
    q
}
# otherwise, append the next line
N
# if the appended line starts with a comma ...
/\n,/{
    # ... delete an optional double quote and the newline, print the result,
    s/"\?\n//p
    # and branch to the end of the script to start the next cycle
    b
}
# it does not start with a comma, so print the first line of the pair
P
# delete the first line (it has just been printed) and loop to the top
D
' inputfile
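Since the question says any tool will do, here is a minimal streaming sketch in Python as well. The function name merge_title_lines is hypothetical. It assumes a title line is any line that starts with a comma, and that a double quote at the very end of the preceding line should be dropped, as in the sed versions above. Only one pending line is held in memory, which matters for a 400 MB file.

```python
def merge_title_lines(lines):
    """Yield lines, gluing any line that starts with ',' onto the previous one."""
    pending = None
    for raw in lines:
        line = raw.rstrip("\n")
        if pending is not None and line.startswith(","):
            if pending.endswith('"'):
                pending = pending[:-1]  # drop the stray closing quote
            pending += line
        else:
            if pending is not None:
                yield pending
            pending = line
    if pending is not None:
        yield pending


# Possible usage with the file names from the question:
#   with open("out.csv") as src, open("out1.csv", "w") as dst:
#       for merged in merge_title_lines(src):
#           dst.write(merged + "\n")
```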