Merge vertical lists in Vim - list

Is there a way to merge lists vertically?
For example, if I have these two lists:
A E
B F
C G
D H
I would like to end up with the following:
A
E
B
F
C
G
D
H

This is simple: place the cursor on the column between the lists.
Enter visual-block mode with <C-v>, select the whole column, press r to replace it, and then press <CR>, and you have what you want.

:%s/\v^(\w) /\1\r/g
: ........... command
% ........... whole file
\v .......... very magic (avoid backslashes)
(\w) ........ word character
\1 .......... everything matched by the pattern in parentheses
\r .......... Carriage Return "Enter"
g ........... globally
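For comparison, the same substitution can be sketched outside Vim. This is only a minimal Python equivalent, assuming the two-column text is already held in a string; the function name split_columns is made up for illustration.

```python
import re

def split_columns(text):
    # MULTILINE makes ^ match at the start of every line, like Vim's per-line :s.
    # (\w) captures the first item; the space after it is replaced by a newline.
    return re.sub(r"^(\w) ", r"\1\n", text, flags=re.MULTILINE)

if __name__ == "__main__":
    print(split_columns("A E\nB F\nC G\nD H"))
```

As with the Vim command, this assumes each first-column item is a single word character followed by exactly one space.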

You could also do it with an external filter. Mark the relevant lines in visual mode and press !. The following filter does what you want on a POSIX system:
paste -sd' ' | tr ' ' '\n'
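If no POSIX tools are at hand, the filter idea can be approximated in a few lines of Python. This is a rough sketch, assuming the items contain no spaces themselves; the name interleave is hypothetical.

```python
def interleave(lines):
    # str.split() with no arguments splits on any run of whitespace,
    # so each row contributes its items in left-to-right order.
    return [item for line in lines for item in line.split()]

if __name__ == "__main__":
    print("\n".join(interleave(["A E", "B F", "C G", "D H"])))
```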

Related

Removing one letter except a combination

I am trying to remove all single characters except for the combination 'r d'. To be clearer, here are some examples:
a ball -> ball
r something -> something
d someone -> someone
r d something -> r d something
r something d -> something
So far I have managed to remove the letters other than r or d, but this is not what I want. I want to keep only the combination (example 4). I use this:
\b(?!r|d)\w{1}\b
Any idea how to do it?
Edit: The regex engine supports lookbehinds.
You may capture the r d combination and use a backreference in the replacement pattern to restore that combination, and remove all other matches:
\b(r d)\b|\b\w\b\s*
See the regex demo (replace with $1 that will put the r d back into the result).
Details:
\b(r d)\b - a "whole word" r d that is captured into Group 1
| - or
\b\w\b\s* - a single whole word consisting of 1 letter/digit/underscore (\b\w\b), followed by zero or more whitespace characters (\s*, just for removing the excess whitespace; it might not be necessary).
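To see the capture-and-restore idea run, here is a small Python sketch. It assumes Python 3.5 or later, where an unmatched group in the replacement expands to an empty string; the function name keep_combination and the final strip() are my additions, not part of the answer above.

```python
import re

PATTERN = re.compile(r"\b(r d)\b|\b\w\b\s*")

def keep_combination(text):
    # Group 1 is set only when "r d" matched, so \1 puts it back.
    # For any other single-character word, \1 is empty and the match is deleted.
    # strip() removes the space left behind when the last word is deleted.
    return PATTERN.sub(r"\1", text).strip()

if __name__ == "__main__":
    for sample in ["a ball", "r something", "r d something", "r something d"]:
        print(repr(sample), "->", repr(keep_combination(sample)))
```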

Perl - split command with regex - split numeric and strings

My data look as follows:
20110627 ABC DBE EFG
217722 1425 1767 0.654504367955466 0.811585416264778 -0.157081048309312
I am trying to split in such a way that I keep numeric values in one cell, and strings in one cell.
Thus, I want "20110627" in one cell, "ABC DBE EFG" in another, "0.811585416264778" in another, "-0.157081048309312" in another, etc.
I have the following split command in perl with a regex
my @Fld = split(/[\d+][\s][\w+]/, $_);
But that doesn't seem to do what I want. Can someone tell me which regex to use? Thanks in advance.
EDIT: Following vks's suggestion, I changed his regex a little to get rid of whitespace and to take into account that the string might contain commas (,), slashes (/) or dashes (-), but then the negative sign (-) seems to be taken as a separate token in numbers:
(-?\d+(\.\d+)?)|([\/?,?\.?\-?a-zA-Z\/ ]+)
20110627 A B C
217722 1425 1767 0.654504367955466 0.811585416264778 -0.157081048309312
19950725 A C
16458 63 91 0.38279256288735 0.552922590837283 -0.170130027949933
19980323 G C I /DE/
20130516 A - E, INC.
33019 398 197 1.205366607105 0.596626184923832 0.608740422181168
20130516 A - E, INC.
24094 134 137 0.556155059350876 0.56860629202291 -0.0124512326720345
19960327 A F C /DE 38905 503 169 1.29289294435163 0.434391466392495 0.858501477959131
Expected output :
20110627 in one token
A B C in one token
-0.170130027949933 in one token
G C I /DE/ in one token
A - E, INC. in one token (of course all the others should be in separate tokens; in other words, the strings in one token and the numbers in one token. I cannot write out every single one of them, but I think it is straightforward).
2nd EDIT:
Brian found the right regex: /(-?\d+(?:\.\d+)?)|([\/,.\-a-zA-Z]+(?:\s+[\/,.\-a-zA-Z]+)*)/ (see below). Thanks, Brian! I now have a follow-up question: I am writing the results of the regex split to an Excel file, using the following code:
use warnings;
use strict;
use Spreadsheet::WriteExcel;
use Scalar::Util qw(looks_like_number);
use Spreadsheet::ParseExcel;
use Spreadsheet::ParseExcel::SaveParser;
use Spreadsheet::ParseExcel::Workbook;
if (($#ARGV < 1) || ($#ARGV > 2)) {
    die("Usage: tab2xls tabfile.txt newfile.xls\n");
};
open (TABFILE, $ARGV[0]) or die "$ARGV[0]: $!";
my $workbook = Spreadsheet::WriteExcel->new($ARGV[1]);
my $worksheet = $workbook->add_worksheet();
my $row = 0;
my $col = 0;
while (<TABFILE>) {
    chomp;
    # Split the line into numeric and string tokens
    my @Fld = split(/(-?\d+(?:\.\d+)?)|([\/,.\-a-zA-Z]+(?:\s+[\/,.\-a-zA-Z]+)*)/, $_);
    $col = 0;
    # Write each token into its own cell of the current row
    foreach my $token (@Fld) {
        $worksheet->write($row, $col, $token);
        $col++;
    }
    $row++;
}
The problem is I get empty cells when I use that code:
> "EMPTY CELL" "1000" "EMPTY CELL" "EMPTY CELL" "ABC DEG" "EMPTY CELL"
> "2500" "EMPTY CELL" "EMPTY CELL" "1500" "3500"
Why am I getting these empty cells? Any way to avoid that? Thanks a lot
This is a broad-scoped regex that also trims whitespace.
Perl's split always inserts the captured groups into the result list.
Since the regex is basically \d or \D, it matches everything,
so running the split results through grep removes the empty elements.
I'm using Perl 5.10; they probably have a noemptyelements flag by now.
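The empty cells are easier to understand once you see what a split with capture groups returns. Python's re.split behaves in a comparable way, so here is a rough sketch of the effect and of the filtering step. The pattern is a simplified, letters-only variant chosen for illustration, and both function names are hypothetical.

```python
import re

# Simplified variant of the tokenizing regex: numbers in group 1, words in group 2.
TOKEN = re.compile(r"(-?\d+(?:\.\d+)?)|([A-Za-z]+(?:\s+[A-Za-z]+)*)")

def split_with_captures(line):
    # re.split returns the text between matches plus one entry per capture group;
    # a group that did not take part in a match comes back as None.
    return TOKEN.split(line)

def tokens_only(line):
    # Drop None, empty strings, and whitespace-only separators.
    return [part for part in split_with_captures(line) if part and part.strip()]

if __name__ == "__main__":
    print(split_with_captures("1000 ABC DEG 2500"))
    print(tokens_only("1000 ABC DEG 2500"))
```

The unfiltered list mixes real tokens with separators and unused groups, which is where the "EMPTY CELL" entries come from; the filter plays the role of the grep above.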
Regex
# \s*([-\d.]+|\D+)(?<!\s)\s*
\s*
( [-\d.]+ | \D+ )
(?<! \s )
\s*
Perl
use strict;
use warnings;
$/ = undef;
my $data = <DATA>;
my @ary = grep { length($_) > 0 } split m/\s*([-\d.]+|\D+)(?<!\s)\s*/, $data;
for (@ary) {
    print "'$_'\n";
}
__DATA__
20110627 A B C
217722 1425 1767 0.654504367955466 0.811585416264778 -0.157081048309312
19950725 A C
16458 63 91 0.38279256288735 0.552922590837283 -0.170130027949933
19980323 G C I /DE/
20130516 A - E, INC.
33019 398 197 1.205366607105 0.596626184923832 0.608740422181168
20130516 A - E, INC.
24094 134 137 0.556155059350876 0.56860629202291 -0.0124512326720345
19960327 A F C /DE 38905 503 169 1.29289294435163 0.434391466392495 0.858501477959131
Output
'20110627'
'A B C'
'217722'
'1425'
'1767'
'0.654504367955466'
'0.811585416264778'
'-0.157081048309312'
'19950725'
'A C'
'16458'
'63'
'91'
'0.38279256288735'
'0.552922590837283'
'-0.170130027949933'
'19980323'
'G C I /DE/'
'20130516'
'A - E, INC.'
'33019'
'398'
'197'
'1.205366607105'
'0.596626184923832'
'0.608740422181168'
'20130516'
'A - E, INC.'
'24094'
'134'
'137'
'0.556155059350876'
'0.56860629202291'
'-0.0124512326720345'
'19960327'
'A F C /DE'
'38905'
'503'
'169'
'1.29289294435163'
'0.434391466392495'
'0.858501477959131'
Using your revised requirements that allow for /, ,, -, etc., here's a regex that will capture all numeric tokens in capture group #1 and alpha in capture group #2:
(-?\d+(?:\.\d+)?)|([\/,.\-a-zA-Z]+(?:\s+[\/,.\-a-zA-Z]+)*)
(see regex101 example)
Breakdown:
(-?\d+(?:\.\d+)?) (capture group #1) matches numbers, with possible negative sign and possible decimal places (in non-capturing group)
([\/,.\-a-zA-Z]+(?:\s+[\/,.\-a-zA-Z]+)*) (capture group #2) matches alpha strings with possible embedded whitespace
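If you collect the matches instead of splitting on them, no empty fields show up in the first place. Below is a minimal Python sketch of that approach using the regex above unchanged; the function name tokenize is an assumption of mine.

```python
import re

TOKEN = re.compile(r"(-?\d+(?:\.\d+)?)|([\/,.\-a-zA-Z]+(?:\s+[\/,.\-a-zA-Z]+)*)")

def tokenize(line):
    # finditer yields only the matches themselves, so the whitespace between
    # tokens and the unused capture group never reach the result.
    return [m.group(0) for m in TOKEN.finditer(line)]

if __name__ == "__main__":
    print(tokenize("20130516 A - E, INC."))
    print(tokenize("16458 63 91 0.38279256288735 0.552922590837283 -0.170130027949933"))
```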
(-?\d+(\.\d+)?)|([a-zA-Z ]+)
Try this. See the demo. Grab the captures. Remove the empty ones.
http://regex101.com/r/lZ5mN8/35

Filter column with awk and regexp

I've a pretty simple question. I've a file containing several columns and I want to filter them using awk.
The column of interest is the 6th column, and I want to find every string that consists of:
a number from 1 to 100
followed by one "S" or "M"
then again a number from 1 to 100
followed by one "S" or "M"
For example, 20S50M is OK.
I tried :
awk '{ if($6 == '/[1-100][S|M][1-100][S|M]/') print} file.txt
but it didn't work... What am I doing wrong?
This should do the trick:
awk '$6~/^(([1-9]|[1-9][0-9]|100)[SM]){2}$/' file
Regexplanation:
^ # Match the start of the string
(([1-9]|[1-9][0-9]|100) # Match a single digit 1-9 or double digit 10-99 or 100
[SM] # Character class matching the character S or M
){2} # Repeat everything in the parens twice
$ # Match the end of the string
You have quite a few issues with your statement:
awk '{ if($6 == '/[1-100][S|M][1-100][S|M]/') print} file.txt
== is the string comparison operator. The regex comparison operator is ~.
You don't quote regex literals (you never quote anything with single quotes in awk besides the script itself), and your script is missing the final (legal) single quote.
[0-9] is the character class for the digit characters; it's not a numeric range. It means match any one character in the class 0,1,2,3,4,5,6,7,8,9, not any numerical value inside the range. So [1-100] is not the regular expression for numbers in the range 1 - 100; it matches either a 1 or a 0.
[SM] is equivalent to (S|M). What you tried, [S|M], is the same as (S|\||M). You don't need the OR operator in a character class.
Awk uses the structure condition{action}. If the condition is true, the actions in the following block {} are executed for the current record. The condition in my solution is $6~/^(([1-9]|[1-9][0-9]|100)[SM]){2}$/, which can be read as "does the sixth column match the regular expression?" If it does, the line is printed, because when you don't give any action awk executes {print $0} by default.
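If you want to convince yourself that the pattern accepts and rejects the right values, it is easy to try it outside awk. Here is a minimal Python sketch using the same regex; the function name column6_matches and the sample lines are made up for illustration.

```python
import re

# Same pattern as in the awk one-liner, without the anchors.
CIGAR_LIKE = re.compile(r"(([1-9]|[1-9][0-9]|100)[SM]){2}")

def column6_matches(line):
    fields = line.split()
    # fullmatch plays the role of the ^...$ anchors in the awk pattern.
    return len(fields) >= 6 and CIGAR_LIKE.fullmatch(fields[5]) is not None

if __name__ == "__main__":
    for line in ["a b c d e 20S50M", "a b c d e 101S5M", "a b c d e 20|50M"]:
        print(line, "->", column6_matches(line))
```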
Regexes cannot check for numeric values. "A number from 1 to 100" is outside what regexes can do. What you can do is check for "1-3 digits."
You want something like this
/\d{1,3}[SM]\d{1,3}[SM]/
Note that the character class [SM] doesn't have the | alternation character. You would only need that if you were writing it as (S|M).
I would do the regex check and the numeric validation as different steps. This code works with GNU awk:
$ cat data
a b c d e 132x123y
a b c d e 123S12M
a b c d e 12S23M
a b c d e 12S23Mx
We'd expect only the 3rd line to pass validation
$ gawk '
match($6, /^([[:digit:]]{1,3})[SM]([[:digit:]]{1,3})[SM]$/, m) &&
1 <= m[1] && m[1] <= 100 &&
1 <= m[2] && m[2] <= 100 {
print
}
' data
a b c d e 12S23M
For maintainability, you could encapsulate that into a function:
gawk '
function validate6() {
return( match($6, /^([[:digit:]]{1,3})[SM]([[:digit:]]{1,3})[SM]$/, m) &&
1<=m[1] && m[1]<=100 &&
1<=m[2] && m[2]<=100 );
}
validate6() {print}
' data
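The same two-step idea (check the shape with a regex, then check the numbers as numbers) carries over to other languages. Here is a rough Python sketch; the names SHAPE and valid are my own, and it mirrors the gawk logic rather than reproducing it exactly.

```python
import re

# Step 1: shape check only, one to three digits before each S or M.
SHAPE = re.compile(r"([0-9]{1,3})[SM]([0-9]{1,3})[SM]")

def valid(field):
    m = SHAPE.fullmatch(field)
    if m is None:
        return False
    # Step 2: numeric range check on the two captured numbers.
    return all(1 <= int(group) <= 100 for group in m.groups())

if __name__ == "__main__":
    for field in ["132x123y", "123S12M", "12S23M", "12S23Mx"]:
        print(field, "->", valid(field))
```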
The way to write the script you posted:
awk '{ if($6 == '/[1-100][S|M][1-100][S|M]/') print} file.txt
in awk so it will do what you SEEM to be trying to do is:
awk '$6 ~ /^(([1-9][0-9]?|100)[SM]){2}$/' file.txt
Post some sample input and expected output to help us help you more.
I know this thread has already been answered, but I have a similar problem (relating to finding strings that "consume query"). I'm trying to sum up all of the integers preceding a character like 'S', 'M', 'I', '=', 'X', or 'H', in order to find the read length from a paired-end read's CIGAR string.
I wrote a Python script that takes in the column $6 from a SAM/BAM file:
import sys # getting standard input
import re # regular expression module
lines = sys.stdin.readlines() # gets all CIGAR strings for each paired-end read
total = 0
read_id = 1 # complements id from filter_1.txt
# Get an int array of all the ints matching the pattern 101M, 1S, 70X, etc.
# Example inputs and outputs:
# "49M1S" produces total=50
# "10M757N40M" produces total=50
for line in lines:
    all_ints = map(int, re.findall(r'(\d+)[SMI=XH]', line))
    for n in all_ints:
        total += n
    print(str(read_id) + ' ' + str(total))
    read_id += 1
    total = 0
The purpose of read_id is to mark each read you're going through as "unique", in case you want to take the read lengths and print them beside awk-ed columns from a BAM file.
I hope this helps, or at least helps the next user that has a similar issue.
I consulted https://stackoverflow.com/a/11339230 for reference.
Try this:
awk '$6 ~ /^([1-9]|0[1-9]|[1-9][0-9]|100)[SM]([1-9]|0[1-9]|[1-9][0-9]|100)[SM]$/' file.txt
Because you did not say exactly how column 6 is formatted, the above will match where the column looks like '03M05S', '40S100M', or '3M5S', and exclude everything else. For instance, it will not find '03F05S', '200M05S', '03M005S', '003M05S', or '003M005S'.
If you can keep the numbers in column 6 to two digits for values below 100 and three digits for exactly 100 (meaning exactly one leading zero when under 10, and no leading zeros otherwise), then it is a simpler match. You can use the above pattern but exclude single digits (remove the first [1-9] condition), e.g.
awk '$6 ~ /^(0[1-9]|[1-9][0-9]|100)[SM](0[1-9]|[1-9][0-9]|100)[SM]$/' file.txt

String formatting, with Regex - Remove given character, and insert newline

I have a file full of URLs in a weird format, characters separated by a space character.
h t t p : / / w w w . y o u t u b e . c o m / u s e r / A S D
h t t p : / / m o r c c . c o m / f r m / i n d . p h p ? t o p i c = 5 7 . 0
I would like to make it look like :
http://www.youtube.com/user/ASD
http://morcc.com/frm/ind.php?topic=57.0
I use Notepad++, and I think regex could take care of this problem for me; unfortunately I don't know regex.
I want to remove the ' ' character (space) between the characters and leave the URLs one per line, so replacing \s with '' is not a solution, because it becomes a mess :/
I think I should also insert a \n BEFORE each occurrence of "http".
Can you not just replace a space ' ' with an empty string ''? Replacing \s is not working how you want because newlines are also matched.
If that doesn't work you could, as you say, replace \s with '' and then replace http with \nhttp.
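If you end up doing this in a script rather than in the editor, the first suggestion is a one-liner per line of input. Here is a minimal Python sketch, assuming the file content is already read into a string; the name despace is hypothetical.

```python
def despace(text):
    # Handle each line separately so the line breaks between URLs survive;
    # only the space characters inside a line are removed.
    return "\n".join(line.replace(" ", "") for line in text.splitlines())

if __name__ == "__main__":
    sample = (
        "h t t p : / / w w w . y o u t u b e . c o m / u s e r / A S D\n"
        "h t t p : / / m o r c c . c o m / f r m / i n d . p h p ? t o p i c = 5 7 . 0"
    )
    print(despace(sample))
```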
Regex is fairly basic. Check out the examples page. The second example seems to have what you're looking for: http://www.regular-expressions.info/examples.html
EDIT: Also, I assume you know this, but just to be sure, regex itself will not do what you want. What language are you planning on using regex with, so that people can provide more detailed responses?
Regex reference page [Bookmark it ;)] - http://www.regular-expressions.info/reference.html

How to write this in different regex flavours

I have the following data:
a b c d FROM:<uniquepattern1>
e f g h TO:<uniquepattern2>
i j k l FROM:<uniquepattern1>
m n o p TO:<uniquepattern3>
q r s t FROM:<uniquepattern4>
u v w x TO:<uniquepattern5>
I would like a regex query that can find the contents of TO: when FROM:<uniquepattern1> is encountered, so the results would be uniquepattern2 and uniquepattern3.
I am hopeless with regex, so I would appreciate any pointers on how to write this (lookahead parameters?) and on any differences between regex flavours on different platforms (e.g. the C# .NET Regex versus grep versus Perl) that might be relevant here.
Thank you.
Try:
/FROM:<uniquepattern1>.*\r?\n.*?TO:<(.*?)>/
This works by first finding the FROM anchor and then using a dot wildcard. The dot operator does not match a newline, so this consumes the rest of the line. After the line break, a non-greedy dot wildcard consumes up to the next TO, and the group captures what's between the angle brackets.
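As one data point on flavours, the pattern works unchanged in Python's re module. A minimal sketch, assuming the whole file has been read into one string; the name find_targets is made up for illustration.

```python
import re

PATTERN = re.compile(r"FROM:<uniquepattern1>.*\r?\n.*?TO:<(.*?)>")

def find_targets(data):
    # Without the DOTALL flag the dot stops at line breaks, so each match
    # spans exactly the FROM line and the line that follows it.
    return PATTERN.findall(data)

if __name__ == "__main__":
    data = (
        "a b c d FROM:<uniquepattern1>\n"
        "e f g h TO:<uniquepattern2>\n"
        "i j k l FROM:<uniquepattern1>\n"
        "m n o p TO:<uniquepattern3>\n"
        "q r s t FROM:<uniquepattern4>\n"
        "u v w x TO:<uniquepattern5>\n"
    )
    print(find_targets(data))
```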
Your requirement for file parsing is simple, and there is no need for a multi-line regular expression. Open the file for reading, go through each line, check for FROM:<uniquepattern1>, then get the next line and print it out. Furthermore, your TO values are separated from the rest of the line only by ":", so you can use that as the field delimiter.
E.g. with awk:
$ awk -F":" '/FROM:<uniquepattern1>/{getline;print $2}' file
<uniquepattern2>
<uniquepattern3>
The same goes for other languages/tools.
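For instance, the line-by-line approach might look like this in Python. It is only a sketch that mirrors the awk one-liner; the function name targets_after and its marker parameter are my own.

```python
def targets_after(lines, marker="FROM:<uniquepattern1>"):
    results = []
    it = iter(lines)
    for line in it:
        if marker in line:
            # Fetch the line after the FROM line, like awk's getline.
            following = next(it, None)
            if following is not None and ":" in following:
                # Keep what comes after the first ":" (awk's $2 with -F":").
                results.append(following.split(":", 1)[1].strip())
    return results

if __name__ == "__main__":
    sample = [
        "a b c d FROM:<uniquepattern1>",
        "e f g h TO:<uniquepattern2>",
        "i j k l FROM:<uniquepattern1>",
        "m n o p TO:<uniquepattern3>",
        "q r s t FROM:<uniquepattern4>",
        "u v w x TO:<uniquepattern5>",
    ]
    print("\n".join(targets_after(sample)))
```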