Filter column with awk and regexp - regex

I've a pretty simple question. I've a file containing several columns and I want to filter them using awk.
So the column of interest is the 6th column and I want to find every string containing :
starting with a number from 1 to 100
after that one "S" or a "M"
again a number from 1 to 100
after that one "S" or a "M"
For example, 20S50M is OK.
I tried :
awk '{ if($6 == '/[1-100][S|M][1-100][S|M]/') print} file.txt
but it didn't work... What am I doing wrong?

This should do the trick:
awk '$6~/^(([1-9]|[1-9][0-9]|100)[SM]){2}$/' file
Regexplanation:
^ # Match the start of the string
(([1-9]|[1-9][0-9]|100) # Match a single digit 1-9 or double digit 10-99 or 100
[SM] # Character class matching the character S or M
){2} # Repeat everything in the parens twice
$ # Match the end of the string
You have quite a few issues with your statement:
awk '{ if($6 == '/[1-100][S|M][1-100][S|M]/') print} file.txt
== is the string comparison operator. The regex comparison operator is ~.
You don't quote regex strings (you never quote anything with single quotes in awk besides the script itself), and your script is missing the final (legal) single quote.
[0-9] is the character class for the digit characters; it's not a numeric range. It matches any one character in the class 0,1,2,3,4,5,6,7,8,9, not any numerical value inside the range. So [1-100] is not the regular expression for numbers in the range 1 - 100; it matches either a 1 or a 0.
[SM] is equivalent to (S|M). What you tried, [S|M], is the same as (S|\||M). You don't need the OR operator in a character class.
Awk uses the structure condition{action}. If the condition is true, the actions in the following block {} are executed for the current record being read. The condition in my solution is $6~/^(([1-9]|[1-9][0-9]|100)[SM]){2}$/, which can be read as "does the sixth column match the regular expression?". If true, the line gets printed, because if you don't give any action awk executes {print $0} by default.
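The same pattern can be tried outside awk. Here is a minimal Python sketch, assuming Python's re module treats this particular pattern the same way awk does. The helper name is_valid and the sample strings are made up for illustration. It also shows that [1-100] is just a character class holding the characters 1 and 0.

```python
import re

# Same pattern as the awk answer: a number 1-100 followed by S or M, twice.
PATTERN = re.compile(r"^(([1-9]|[1-9][0-9]|100)[SM]){2}$")

def is_valid(field):
    """Return True if the field looks like 20S50M (hypothetical helper)."""
    return PATTERN.match(field) is not None

# [1-100] is a character class containing only the characters 1 and 0.
CHAR_CLASS = re.compile(r"^[1-100]$")

for sample in ["20S50M", "100M1S", "0S50M", "101S5M", "20S50Mx"]:
    print(sample, is_valid(sample))

# Lists the single digits that [1-100] accepts.
print([c for c in "0123456789" if CHAR_CLASS.match(c)])
```

Only the first two samples pass; the last line prints the two digits the character class actually accepts.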

Regexes match characters, not numeric values, so "a number from 1 to 100" is awkward to express as a pattern. What you can easily do is check for "1-3 digits."
You want something like this
/\d{1,3}[SM]\d{1,3}[SM]/
Note that the character class [SM] doesn't have the | alternation character. You would only need that if you were writing it as (S|M).

I would do the regex check and the numeric validation as different steps. This code works with GNU awk:
$ cat data
a b c d e 132x123y
a b c d e 123S12M
a b c d e 12S23M
a b c d e 12S23Mx
We'd expect only the 3rd line to pass validation
$ gawk '
match($6, /^([[:digit:]]{1,3})[SM]([[:digit:]]{1,3})[SM]$/, m) &&
1 <= m[1] && m[1] <= 100 &&
1 <= m[2] && m[2] <= 100 {
print
}
' data
a b c d e 12S23M
For maintainability, you could encapsulate that into a function:
gawk '
function validate6() {
return( match($6, /^([[:digit:]]{1,3})[SM]([[:digit:]]{1,3})[SM]$/, m) &&
1<=m[1] && m[1]<=100 &&
1<=m[2] && m[2]<=100 );
}
validate6() {print}
' data
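For comparison, the same two-step idea (shape check with a regex, then a numeric range check) can be sketched in Python. This is only an illustration under the assumption that fields are whitespace-separated as in the sample data; the function name validate is hypothetical.

```python
import re

# Step 1: shape check, capturing the two digit runs.
SHAPE = re.compile(r"^([0-9]{1,3})[SM]([0-9]{1,3})[SM]$")

def validate(field):
    """Shape check first, then a numeric range check on both captures."""
    m = SHAPE.match(field)
    if m is None:
        return False
    # Step 2: both captured numbers must lie in 1..100.
    return all(1 <= int(g) <= 100 for g in m.groups())

rows = [
    "a b c d e 132x123y",
    "a b c d e 123S12M",
    "a b c d e 12S23M",
    "a b c d e 12S23Mx",
]
for row in rows:
    fields = row.split()
    if len(fields) >= 6 and validate(fields[5]):
        print(row)
```

Keeping the range check in ordinary arithmetic means the limits can change without rewriting the pattern.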

The way to write the script you posted:
awk '{ if($6 == '/[1-100][S|M][1-100][S|M]/') print} file.txt
in awk so it will do what you SEEM to be trying to do is:
awk '$6 ~ /^(([1-9][0-9]?|100)[SM]){2}$/' file.txt
Post some sample input and expected output to help us help you more.

I know this thread has already been answered, but I actually have a similar problem (relating to finding strings that "consume query"). I'm trying to sum up all of the integers preceding a character like 'S', 'M', 'I', '=', 'X', 'H', in order to find the read length from a paired-end read's CIGAR string.
I wrote a Python script that takes in the column $6 from a SAM/BAM file:
import sys # getting standard input
import re # regular expression module
lines = sys.stdin.readlines() # gets all CIGAR strings for each paired-end read
total = 0
read_id = 1 # complements id from filter_1.txt
# Get an int array of all the ints matching the pattern 101M, 1S, 70X, etc.
# Example inputs and outputs:
# "49M1S" produces total=50
# "10M757N40M" produces total=50
for line in lines:
    all_ints = map(int, re.findall(r'(\d+)[SMI=XH]', line))
    for n in all_ints:
        total += n
    print(str(read_id) + ' ' + str(total))
    read_id += 1
    total = 0
The purpose of read_id is to mark each read you're going through as "unique", in case you want to take the read lengths and print them beside awk-ed columns from a BAM file.
I hope this helps, or at least helps the next user that has a similar issue.
I consulted https://stackoverflow.com/a/11339230 for reference.

Try this:
awk '$6 ~ /^([1-9]|0[1-9]|[1-9][0-9]|100)[SM]([1-9]|0[1-9]|[1-9][0-9]|100)[SM]$/' file.txt
Because you did not say exactly how column 6 is formatted, the above matches values such as '03M05S', '40S100M', or '3M5S' and excludes everything else. For instance, it will not match '03F05S', '200M05S', '03M005S', '003M05S', or '003M005S'.
If you can keep the digits in column 6 to two when 1-99, or three when exactly 100 (meaning exactly one leading zero when under 10, and no leading zeros otherwise), then it is a simpler match. Use the above pattern but exclude single digits (remove the first [1-9] alternative), e.g.
awk '$6 ~ /^(0[1-9]|[1-9][0-9]|100)[SM](0[1-9]|[1-9][0-9]|100)[SM]$/' file.txt

Related

Do not select if additional character is included

Suppose I have the following numbers:
3,000mt
300mt
44,000m
320m
And I want 44,000m and 320m to be selected.
What regex should I use to only select the numbers (comma separated) that have "m" in the end and not the ones that have "mt"?
This is what I have tried:
\d+[,]?\d+m.
I have no idea how to negate mt though.
You are very close to the solution; you only missed the possibility of checking for a word boundary (represented by the regex token \b). So instead of matching any character with . at the end of your regular expression, check that the m is followed by a word boundary (e.g. a space, a newline, or the end of the string):
\d+(,\d+)?m\b
where
\d+ looks for any digits (at least one)
(,\d+)? looks for a comma followed by one digit or more (it's grouped by using parentheses and the whole group is completely optional using the ? sign)
m\b as explained above looks for a literal m at the end of a word
With this regex you can also match strings with one digit only followed by m like 9m or similar. This is a slight change in comparison to your regex (grouping comma followed by digits).
I proved the regex via Python and also added some more edge cases:
>>> import re
>>> text = "3,000mt 300mt 44,000m 1m 1mt 1,3mt 320m"
>>> re.findall(r"\d+(?:,\d+)?m\b", text) # (?:...) is a non-capturing group, so findall returns the whole match
['44,000m', '1m', '320m']
How about a Unix solution like the one below?
> echo "3,000mt 300mt 44,000m 320m" | tr ' ' '\n' | awk -F" " ' $0~/m$/ { print } '
44,000m
320m
>

Getting gawk regex parser to recognise NUL characters

I have a series of files which I am trying to process and validate using gawk. A small number of the files are corrupt and contain runs of NUL (0x00) characters which I would like to reject as invalid.
However, it appears that gawk (4.1.1) is essentially ignoring the NUL characters. Here's my minimal code to invoke the issue:
BEGIN {
FS="[#/]" #Split at hash or slash
OFS = ":"
}
$10 !~ "^[[:digit:]]+$" {
print NR, $0
}
This should print all records for which field 10 is not a positive integer. However, it fails to print a record for which field 10 is '7' followed by a long string of NULs.
How can I get gawk to recognise the NUL characters? I have tried the --posix command line option to no avail.
ADDENDUM: I changed the code to be:
BEGIN {
FS="[#/]" #Split at hash or slash
OFS = ":"
}
$10 ~ "^7$" {
print NR, $10
}
i.e. changing the criterion to ~ and searching for a 7 on its own in the tenth field. This matches 7NULNULNUL... in the tenth field. However, using:
$10 ~ "^7\0+$"
i.e. matching against 7 followed by one or more explicitly specified NUL characters (octal zero) fails to match.
If this is expected behaviour, can someone explain it to me? Is there any way to accomplish what I'm trying to achieve in gawk?
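This does not explain how gawk treats NUL, but the same validation can be sketched outside gawk. Below is a minimal Python sketch that works on bytes, so NUL is an ordinary value. It assumes newline-terminated records and the same [#/] field separator. The function name bad_records and the sample records are invented for illustration.

```python
import re

FIELD_SEP = re.compile(rb"[#/]")   # split at hash or slash, like the awk FS
DIGITS = re.compile(rb"[0-9]+")

def bad_records(lines):
    """Yield (record number, line) where field 10 is not all digits."""
    for nr, line in enumerate(lines, start=1):
        fields = FIELD_SEP.split(line.rstrip(b"\n"))
        # A missing field 10 is treated as empty, as awk would do.
        field10 = fields[9] if len(fields) > 9 else b""
        if DIGITS.fullmatch(field10) is None:
            yield nr, line

# Hypothetical sample: the second record has NULs after the 7.
sample = [
    b"a/b/c/d/e/f/g/h/i/7\n",
    b"a/b/c/d/e/f/g/h/i/7\x00\x00\x00\n",
]
for nr, line in bad_records(sample):
    print(nr, line)
```

With real files, opening them in binary mode ("rb") and passing the file object to bad_records would keep the NUL bytes intact.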

Arithmetic Calculation in Perl Substitute Pattern Matching

Using just one Perl substitute regular expression statement (s///), how can we write below:
Every successful match contains just a string of alphabetic characters A..Z. We need to replace the matched string with the sum of the alphabetical indices of its characters.
Note: For A, character index would be 1, for B, 2 ... and for Z would be 26.
Please see example below:
success match: ABCDMNA
substitution result: 38
Note:
1 + 2 + 3 + 4 + 13 + 14 + 1 = 38;
since
A = 1, B = 2, C = 3, D = 4, M = 13, N = 14 and A = 1.
I will post this as an answer, I guess, though the credit for the idea should go to abiessu, who presented it in his answer.
perl -ple'1 while s/(\d*)([A-Z])/$1+ord($2)-64/e'
Since this is clearly homework and/or of academic interest, I will post the explanation in spoiler tags.
- We match an optional number (\d*), followed by a letter ([A-Z]). The number is the running sum, and the letter is what we need to add to the sum.
- By using the /e modifier, we can do the math, which is add the captured number to the ord() value of the captured letter, minus 64. The sum is returned and inserted instead of the number and the letter.
- We use a while loop to rinse and repeat until all letters have been replaced, and all that is left is a number. We use a while loop instead of the /g modifier to reset the match to the start of the string.
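The three steps above can be mimicked in Python to see the running sum evolve. This is a sketch of the same idea, not a translation of Perl's semantics. The names STEP, add_letter and letter_sum are made up for the example.

```python
import re

# An optional running sum followed by the next letter to add to it.
STEP = re.compile(r"(\d*)([A-Z])")

def add_letter(m):
    running = int(m.group(1) or 0)               # empty capture: no sum yet
    return str(running + ord(m.group(2)) - 64)   # 'A' is 65, so A -> 1

def letter_sum(text):
    # Replace one letter per pass, restarting from the left each time,
    # which plays the role of `1 while s/.../.../e` in the one-liner.
    while True:
        text, count = STEP.subn(add_letter, text, count=1)
        if count == 0:
            return text

print(letter_sum("ABCDMNA"))
```

The string goes through 1BCDMNA, 3CDMNA, 6DMNA and so on, until only the total is left.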
Just split, translate, and sum:
use strict;
use warnings;
use List::Util qw(sum);
my $string = 'ABCDMNA';
my $sum = sum map {ord($_) - ord('A') + 1} split //, $string;
print $sum, "\n";
Outputs:
38
Can you use the /e modifier in the substitution?
$s = "ABCDMNA";
$s =~ s/(.)/$S += ord($1) - ord "@"; 1 + pos $s == length $s ? $S : ""/ge;
print "$s\n"
Consider the following matching scenario:
my $text = "ABCDMNA";
my $val = $text =~ s!(\d)*([A-Z])!($1+ord($2)-ord('A')+1)!gr;
(Without having tested it...) This should repeatedly go through the string, replacing one character at a time with its ordinal value added to the current sum which has been placed at the beginning. Once there are no more characters the copy (/r) is placed in $val which should contain the translated value.
Or a short alternative:
echo ABCDMNA | perl -nlE 'm/(.)(?{$s+=-64+ord$1})(?!)/;say$s'
or readable
$s = "ABCDMNA";
$s =~ m/(.)(?{ $sum += ord($1) - ord('A')+1 })(?!)/;
print "$sum\n";
prints
38
Explanation:
The regex tries to match any character that is not followed by the empty regex: /.(?!)/
Because an empty regex matches everywhere, the "not followed by" assertion is never true.
Therefore the regex engine moves on to the next character and tries the match again.
This repeats until the whole string is exhausted.
Because we want to capture the character, we use a capture group: /(.)(?!)/
The (?{...}) block runs Perl code that adds the value of the captured character, stored in $1, to the sum.
When the regex has exhausted the string (and failed), the final say $s prints the sum.
From perlre:
(?{ code })
This zero-width assertion executes any embedded Perl code. It always
succeeds, and its return value is set as $^R .
WARNING: Using this feature safely requires that you understand its
limitations. Code executed that has side effects may not perform
identically from version to version due to the effect of future
optimisations in the regex engine. For more information on this, see
Embedded Code Execution Frequency.

Is it possible to conditionally insert text via regex substitution in Vim?

I have several lines from a table that I’m converting from Excel to the Wiki format, and want to add link tags for part of the text on each line, if there is text in that field. I have started the conversion and have come to this point:
|10.20.30.9||x|-||
|10.20.30.10||x|s04|Server 4|
|10.20.30.11||x|s05|Server 5|
|10.20.30.12|||||
|10.20.30.13|||||
What I want is to change the fourth column from, e.g., s04 to [[server:s04]]. I do not wish to add the link brackets if the line is empty, or if it contains -. If that - is a big problem, I can remove it.
All my attempts at a regex end with the whole line being replaced.
Consider using awk to do this:
#!/bin/bash
awk -F'|' '
{
OFS = "|";
if ($5 != "" && $5 != "-")
$5 = "server:" $5;
print $0
}'
NOTE: I've edited this script since the first version. This current one, IMO is better.
Then you can process it with:
cat $FILENAME | sh $AWK_SCRIPTNAME
The -F'|' switch tells awk to use | as a field separator. The if statement and print are pretty self-explanatory: the script prints every record, with 'server:' prepended to column 5 only if that column is neither "-" nor "".
Why column 5 and not column 4? Because each record begins with |, awk takes the first field ($1) to be the empty string that precedes this first |.
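The off-by-one caused by the leading | is easy to see by splitting a line in Python. This is a minimal sketch, not the awk script itself. The function name link_servers is hypothetical, and unlike the awk version it also adds the [[ ]] brackets the question asked for.

```python
def link_servers(line, column=5):
    """Wrap awk-style field `column` in [[server:...]] unless empty or '-'."""
    fields = line.split("|")
    # The leading '|' makes fields[0] an empty string, so awk's $5
    # (fields[4] here) is the fourth visible column.
    idx = column - 1
    if idx < len(fields) and fields[idx] not in ("", "-"):
        fields[idx] = "[[server:" + fields[idx] + "]]"
    return "|".join(fields)

for line in [
    "|10.20.30.9||x|-||",
    "|10.20.30.10||x|s04|Server 4|",
    "|10.20.30.12|||||",
]:
    print(link_servers(line))
```

Only the middle line changes; the "-" and empty fields are left alone.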
This seems to do the job on the sample you give up there (with Vim):
%s/^|\%([^|]*|\)\{3}\zs[^|]*/\=(empty(submatch(0)) || submatch(0) == '-') ? submatch(0) : '[[server:'.submatch(0).']]'/
It's probably better to use awk as ArjunShankar writes, but this should work if you remove "-" ;) Didn't get it to work with it there.
:%s/^\([^|]*|\)\([^|]*|\)\([^|]*|\)\([^|]*|\)\([^|]\+|\)/\1\2\3\4[[server:\5]]/
It's just stupid though. The first 4 groups are identical (match anything up to |, 4 times). Didn't get it to work with {4}. The fifth matches the s04/s05 strings (it just requires that the field is not empty, therefore "-" must be removed).
Adding a bit more readability to the ideas given by others:
:%s/\v^%(\|.{-}){3}\|\zs(\w+)/[[server:\1]]/
Job done.
Note how {3} indicates the number of columns to skip. Also note the use of \v for very magic regex mode. This reduces the complexity of your regex, especially when it uses more 'special' characters than literal text.
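For readers more at home outside Vim, the same "skip three columns, then wrap a word" idea can be sketched with Python's re.sub. Python has no \zs, so this sketch captures the skipped prefix and puts it back. The names PATTERN and add_links are illustrative only.

```python
import re

# Leading '|', then three "column|" groups to skip (captured as \1),
# then the server name to wrap (captured as \2).
PATTERN = re.compile(r"^(\|(?:[^|]*\|){3})(\w+)", re.MULTILINE)

def add_links(text):
    return PATTERN.sub(r"\1[[server:\2]]", text)

table = "\n".join([
    "|10.20.30.9||x|-||",
    "|10.20.30.10||x|s04|Server 4|",
    "|10.20.30.12|||||",
])
print(add_links(table))
```

Because \w+ cannot match "-" or an empty field, those lines pass through unchanged, as with the Vim command.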
Let me recommend the following substitution command.
:%s/^|\%([^|]*|\)\{3}\zs[^|-]\+\ze|/[[server:&]]/
try
:1,$s/|\(s[0-9]\+\)|/|[[server:\1]]|/
assuming that your s04, s05 are always s and a number
A simpler substitution can be achieved with this:
%s/^|.\{-}|.\{-}|.\{-}|\zs\(\w\{1,}\)\ze|/[[server:\1]]/
   ^^^^^^^^^^^^^^^^^^^^ -> Match the first 3 columns (empty or not);
                       ^^^ -> Marks the "start of match";
                          ^^^^^^^^^^^ -> Match only if the 4th column contains letters, digits and `_` ([0-9A-Za-z_]);
                                     ^^^ -> Marks the "end of match";
If the _ character, like -, can appear but must not be substituted, use the following regex: %s/^|.\{-}|.\{-}|.\{-}|\zs\([0-9a-zA-Z]\{1,}\)\ze|/[[server:\1]]/

regex expression help for string

I have a string
2045111780&&-3&5&-7
I want a regex to give me groups as:
2045111780
&&-
3
... and then next groups as
3
&
5
... and so on.
I came up with (\d+)(&&?-?)? but that gives me groups as:
2045111780
&&-
... and then next groups as
3
&
... and so on.
Note that I need the delim ( regex: &&?-? )
Thanks.
update1: changed the groups output.
I think it's not possible to share a match between groups (the -3 in your example). So I recommend processing it in two steps: split the string and take the pairs from an array. For example, using Perl:
$a = "2045111780&&-3&5&-7";
@pairs = split /&+/, $a;
# at this point you get $pairs[0] = '2045111780', $pairs[1] = '-3', ...
How about (-?\d+|&+). It will match numbers with an optional minus and sequences of &s.
If I understand correctly, you want to have overlapping matches.
You could use a regex like (-?\d+)(&&?)(-?\d+) and match it repeatedly until it fails, each time removing the beginning of the given string up to the start of the third group.
You could do it in perl like this:
$ perl -ne 'while (/(-?\d+)(&&?)(-?\d+)/g) { print $1, " ", $2, " ", $3, "\n"; pos() -= length($3); }'
2045111780&&-3&5&-7 # this is the input
2045111780 && -3
-3 & 5
5 & -7
But that's very ugly. The split approach by Miguel Prz is much, much cleaner.
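Another way to get overlapping triples, sketched here in Python, is to put the third group inside a lookahead so it is captured without being consumed. This swaps in a different technique from the pos() adjustment above. As in that answer, the minus sign is treated as part of the number. The names TRIPLE and triples are made up.

```python
import re

# The right-hand number sits in a lookahead: it is captured but not
# consumed, so it can start the next match.
TRIPLE = re.compile(r"(-?\d+)(&&?)(?=(-?\d+))")

def triples(text):
    """Return (left, delimiter, right) tuples with shared numbers."""
    return [m.groups() for m in TRIPLE.finditer(text)]

for left, delim, right in triples("2045111780&&-3&5&-7"):
    print(left, delim, right)
```

This prints the same three lines as the Perl loop, with no manual position bookkeeping.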