I have a series of files which I am trying to process and validate using gawk. A small number of the files are corrupt and contain runs of NUL (0x00) characters, which I would like to reject as invalid.
However, it appears that gawk (4.1.1) is essentially ignoring the NUL characters. Here's my minimal code to reproduce the issue:
BEGIN {
FS="[#/]" #Split at hash or slash
OFS = ":"
}
$10 !~ "^[[:digit:]]+$" {
print NR, $0
}
This should print every record whose tenth field is not a string of digits. However, it fails to print a record in which field 10 is '7' followed by a long run of NULs.
How can I get gawk to recognise the NUL characters? I have tried the --posix command line option to no avail.
ADDENDUM: I changed the code to be:
BEGIN {
FS="[#/]" #Split at hash or slash
OFS = ":"
}
$10 ~ "^7$" {
print NR, $10
}
i.e. changing the criterion to ~ and searching for a 7 on its own in the tenth field. This matches 7NULNULNUL... in the tenth field. However, using:
$10 ~ "^7\0+$"
i.e. matching against 7 followed by one or more explicitly specified NUL characters (octal zero) fails to match.
If this is expected behaviour, can someone explain it to me? Is there any way to accomplish what I'm trying to achieve in gawk?
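One possible workaround, assuming GNU grep built with PCRE support (-P) is available, is to reject NUL-containing files before gawk ever sees them. A minimal sketch (the glob and the name validate.awk are placeholders for your own files):
#!/bin/bash
# Sketch: skip any input file that contains a NUL byte, then run the
# gawk validation (saved here as validate.awk) on the clean ones.
for f in data/*.txt; do
    if grep -qP '\x00' "$f"; then
        echo "rejecting $f: contains NUL bytes" >&2
    else
        gawk -f validate.awk "$f"
    fi
done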
In Powershell, I have a string of this form:
"abcdefghijk","hijk","lmnopqrstuvwxyzabcdefghij"
Assuming that I know how to escape a character (thus, if I were actually to write it including the string markers):
"`"abcdefghijk`",`"hijk`",`"lmnopqrstuvwxyzabcdefghij`""
... how would I trim anything between double quotes to only 16 characters?
The expected output is therefore:
"abcdefghijk","hijk","lmnopqrstuvwxyza"
I thought this:
% {$_ -replace "([^\`"]{16})([^\`"]+)", "$1"}
would match each relevant block into two capture groups, one of length 16 and one of unlimited length, and return only the first. However, it just removes everything longer than 16 characters, resulting in:
"abcdefghijk","hijk",""
This isn't what I expected at all! What am I doing wrong?
The answer is as simple as this: change the double quotes around $1 to single quotes:
-replace "([^`"]{16})([^`"]+)", '$1'
It is a bit counter-intuitive, but here is the reason behind it: inside a pair of double quotes, PowerShell interprets $1 as a variable name and interpolates its content, which is empty in this case, before the string even reaches the regex engine.
Or, you can escape the $ as well:
-replace "([^`"]{16})([^`"]+)", "`$1"
I came up with the following which should be pretty close to what you're trying to do:
(\w+)\W+(\w+)\W+(\w+){16}
https://regex101.com/r/kfEAFl/1
Here you have the three groups and you can join them in any way.
Here's an example to do it without regex:
$longArray = @("1221111111111111111111111111", "213")
$shortArray = $longArray.ForEach{$_.ToString().PadRight(16, " ").SubString(0,16)}
write-host $shortArray
# Writes 1221111111111111 213
I've been searching for a few hours now for a way to find a string containing 21 numeric characters and place a line break in front of the string itself. I finally found the solution using:
sed -r 's/\b[0-9]{21}\b/\n&/g'
Works great!
Now I have a new set of data containing 21 numeric characters, but in addition there are some alphabetic characters at the end of each string, with a variable length of 3 to 10 characters.
Sample input:
169349870913736210308ABC
168232727246529300209DEFGHI
166587299965005120122JKLMNOPQRS
162411281984306600005TUVWXYZ
What I would like is to have a space between the numeric and the alphabetic characters:
169349870913736210308 ABC
168232727246529300209 DEFGHI
166587299965005120122 JKLMNOPQRS
162411281984306600005 TUVWXYZ
Do note the 16 that every number starts with. I've tried using:
sed -r 's/^\b[0-9]{21}\+[A-Z]{3,10}\b/ /g' filename
But I couldn't get it to work, because I don't know (and couldn't find) how to search specifically for a string containing an exact number of numeric characters combined with alphabetic characters of a given length. I've found a lot of helpful questions on this website, but not this one.
Use capturing groups.
sed -r 's/^([0-9]{21})([A-Z]{3,10})$/\1 \2/' filename
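As a quick sanity check (a hypothetical inline test using printf instead of the real file), two of the sample lines give the expected output:
printf '%s\n' 169349870913736210308ABC 162411281984306600005TUVWXYZ |
  sed -r 's/^([0-9]{21})([A-Z]{3,10})$/\1 \2/'
# 169349870913736210308 ABC
# 162411281984306600005 TUVWXYZ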
Search from left to right for the first non-numeric character ([^0-9]) and replace it with a whitespace followed by the matched (&) non-numeric character:
sed 's/[^0-9]/ &/' file
Output:
169349870913736210308 ABC
168232727246529300209 DEFGHI
166587299965005120122 JKLMNOPQRS
162411281984306600005 TUVWXYZ
Why not just add a space after the first 21 characters, like so:
sed 's/^.\{21\}/& /'
I'm after something I guess is pretty complex, and I am pretty bad with regexes, so you guys might be able to help.
See this data source:
User ID:
a123456
a12345f
a1234e6
d123d56
b12c456
c1b3456
ba23456
Basically, what I want to do is use a regex/sed to replace all occurrences of letters with numbers EXCEPT the first letter. Letters will always map to their alphabet position, e.g. a = 1, b = 2, c = 3, etc.
So the result set should look like this:
User ID:
a123456
a123456
a123456
d123456
b123456
c123456
b123456
There will also never be any letters other than a-j, and the string will always be 7 chars long.
Can anyone shed some light? Thanks! :)
Here's one way you could do it using standard tools cut, paste and tr:
$ paste -d'\0' <(cut -c1 file) <(cut -c2- file | tr 'abcdef' '123456')
a123456
a123456
a123456
d123456
b123456
c123456
b123456
This joins the first character of each line with the result of running tr on the rest of the line, using the null (empty) string as the delimiter. tr replaces each character found in the first list with the corresponding character of the second list.
To replace a-j letters in a line by the corresponding digits except the first letter using perl:
$ perl -pe 'substr($_, 1) =~ tr/a-j/0-9/' input_file
a=0, not a=1, because j would otherwise be 10 (two digits).
j = 0, and no, only the digits 0-9 are used; letters simply replace their number counterpart, so there will never be a letter greater than j.
To make j=0 and a=1:
$ perl -pe 'substr($_, 1) =~ tr/ja-i/0-9/' input_file
sed '/[a-j][0-9a-j]\{6\}$/{h;y/abcdefghij/1234567890/;G;s/.\(.\{6\}\).\(.\).*/\2\1/;}' YourFile
filter on "number" only
remind line (for 1st letter)
change all letter to digit (including 1st)
add first form of number (as second line in buffer)
take 1st letter of second line and 6 last of 1st one, reorder and dont keep the other character
$ awk 'BEGIN{FS=OFS=""} NR>1{for (i=2;i<=NF;i++) if(p=index("jabcdefghi",$i)) $i=p-1} 1' file
User ID:
a123456
a123456
a123456
d123456
b123456
c123456
b123456
Note that the above reproduces the header line User ID: as-is. As far as I can tell, all of the other posted solutions would change the header line to Us5r ID: since they would do the letter-to-number translation on it just like on all of the subsequent lines.
I don't see the complexity. Your samples look like you just want to replace six of seven characters with the numbers 1-6:
s/^\([a-j0-9]\)[a-j0-9]\{6\}/\1123456/
Since the numbers to put there are defined by position, we don't care what the letter was (or even if it was a letter). The downside here is that we don't preserve the numbers, but they never varied in your sample data.
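To try it out (my sketch; ids.txt is just a placeholder for the sample data above):
sed 's/^\([a-j0-9]\)[a-j0-9]\{6\}/\1123456/' ids.txt
# The header line "User ID:" is left untouched, since U is not in [a-j0-9].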
If we want to replace only letters, the first method I can think of involves simply using multiple substitutions:
s/^\([a-j0-9]\{1\}\)[a-j]/\11/
s/^\([a-j0-9]\{2\}\)[a-j]/\12/
s/^\([a-j0-9]\{3\}\)[a-j]/\13/
s/^\([a-j0-9]\{4\}\)[a-j]/\14/
s/^\([a-j0-9]\{5\}\)[a-j]/\15/
s/^\([a-j0-9]\{6\}\)[a-j]/\16/
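A possible way to apply the six substitutions from the shell (my own sketch; chaining -e options is equivalent to putting the commands in a file and using sed -f):
echo 'ba23456' |
  sed -e 's/^\([a-j0-9]\{1\}\)[a-j]/\11/' \
      -e 's/^\([a-j0-9]\{2\}\)[a-j]/\12/' \
      -e 's/^\([a-j0-9]\{3\}\)[a-j]/\13/' \
      -e 's/^\([a-j0-9]\{4\}\)[a-j]/\14/' \
      -e 's/^\([a-j0-9]\{5\}\)[a-j]/\15/' \
      -e 's/^\([a-j0-9]\{6\}\)[a-j]/\16/'
# ba23456 -> b123456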
Replacing letters with specific digits, excluding the first letter:
s/\(.\)a/\11/g
This pattern replaces two-character sequences, preserving the first character, so it would have to be run more than once per letter to handle adjacent letters. Using the hold space we can instead store the first character and use a simple transliteration. The tricky part is joining the two sections back together, since sed's G command injects an unwanted newline.
# Store in hold space
h
# Remove the first character
s/^.//
# Transliterate letters
y/jabcdefghi/0123456789/
# Exchange pattern and hold space
x
# Keep the first character
s/^\(.\).*$/\1/
# Print it
#P
# Join
G
# Remove the newline
s/^\(.\)./\1/
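To run it (my own invocation sketch; keep-first.sed is a placeholder name for a file holding the commands above):
sed -f keep-first.sed ids.txt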
Still learning about sed's capabilities :)
I've a pretty simple question. I have a file containing several columns and I want to filter them using awk.
So the column of interest is the 6th column, and I want to find every string consisting of:
starting with a number from 1 to 100
followed by one "S" or one "M"
again a number from 1 to 100
followed by one "S" or one "M"
So for example: 20S50M is OK.
I tried :
awk '{ if($6 == '/[1-100][S|M][1-100][S|M]/') print} file.txt
but it didn't work... What am I doing wrong?
This should do the trick:
awk '$6~/^(([1-9]|[1-9][0-9]|100)[SM]){2}$/' file
Regexplanation:
^ # Match the start of the string
(([1-9]|[1-9][0-9]|100) # Match a single digit 1-9 or double digit 10-99 or 100
[SM] # Character class matching the character S or M
){2} # Repeat everything in the parens twice
$ # Match the end of the string
You have quite a few issue with your statement:
awk '{ if($6 == '/[1-100][S|M][1-100][S|M]/') print} file.txt
== is the string comparison operator. The regex comparison operator is ~.
You don't quote regexes (you never quote anything with single quotes in awk besides the script itself), and your script is missing its final closing single quote.
[0-9] is the character class for the digit characters; it is not a numeric range. It matches any single character in the class 0,1,2,3,4,5,6,7,8,9, not any numerical value inside a range, so [1-100] is not the regular expression for numbers in the range 1 - 100: it matches either a 1 or a 0.
[SM] is equivalent to (S|M); what you tried, [S|M], is the same as (S|\||M). You don't need the OR operator inside a character class.
Awk uses the structure condition{action}. If the condition is true, the actions in the following {} block get executed for the current record being read. The condition in my solution is $6~/^(([1-9]|[1-9][0-9]|100)[SM]){2}$/, which can be read as "does the sixth column match the regular expression?". If it is true the line gets printed, because if you don't give any action awk executes {print $0} by default.
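A tiny illustration of == versus ~ (hypothetical one-liners with dummy data echoed in; both print the line):
echo 'a b c d e 20S50M' | awk '$6 == "20S50M"'             # exact string comparison
echo 'a b c d e 20S50M' | awk '$6 ~ /^([0-9]+[SM]){2}$/'   # regex match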
Regexes don't compare numeric values directly; "a number from 1 to 100" is awkward to express in a pure regex (though it can be spelled out with alternation, as other answers show). What you can easily check for is "1-3 digits."
You want something like this
/\d{1,3}[SM]\d{1,3}[SM]/
Note that the character class [SM] doesn't use the | alternation character. You would only need that if you were writing it as (S|M).
I would do the regex check and the numeric validation as different steps. This code works with GNU awk:
$ cat data
a b c d e 132x123y
a b c d e 123S12M
a b c d e 12S23M
a b c d e 12S23Mx
We'd expect only the 3rd line to pass validation
$ gawk '
match($6, /^([[:digit:]]{1,3})[SM]([[:digit:]]{1,3})[SM]$/, m) &&
1 <= m[1] && m[1] <= 100 &&
1 <= m[2] && m[2] <= 100 {
print
}
' data
a b c d e 12S23M
For maintainability, you could encapsulate that into a function:
gawk '
function validate6() {
return( match($6, /^([[:digit:]]{1,3})[SM]([[:digit:]]{1,3})[SM]$/, m) &&
1<=m[1] && m[1]<=100 &&
1<=m[2] && m[2]<=100 );
}
validate6() {print}
' data
The way to write the script you posted:
awk '{ if($6 == '/[1-100][S|M][1-100][S|M]/') print} file.txt
in awk so it will do what you SEEM to be trying to do is:
awk '$6 ~ /^(([1-9][0-9]?|100)[SM]){2}$/' file.txt
Post some sample input and expected output to help us help you more.
I know this thread has already been answered, but I actually have a similar problem (relating to finding the operations that "consume query"). I'm trying to sum up all of the integers preceding characters like 'S', 'M', 'I', '=', 'X', 'H', so as to find the read length from a paired-end read's CIGAR string.
I wrote a Python script that takes in the column $6 from a SAM/BAM file:
import sys # getting standard input
import re # regular expression module
lines = sys.stdin.readlines() # gets all CIGAR strings for each paired-end read
total = 0
read_id = 1 # complements id from filter_1.txt
# Get an int array of all the ints matching the pattern 101M, 1S, 70X, etc.
# Example inputs and outputs:
# "49M1S" produces total=50
# "10M757N40M" produces total=50
for line in lines:
    all_ints = map(int, re.findall(r'(\d+)[SMI=XH]', line))
    for n in all_ints:
        total += n
    print(str(read_id) + ' ' + str(total))
    read_id += 1
    total = 0
The purpose of the read_id is to mark each read you're going through as "unique", in case you want to take the read lengths and print them beside awk-ed columns from a BAM file.
I hope this helps, or at least helps the next user that has a similar issue.
I consulted https://stackoverflow.com/a/11339230 for reference.
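The script reads CIGAR strings on standard input, so one way to feed it column 6 (a hypothetical invocation, assuming samtools is installed and the script above is saved as read_lengths.py):
samtools view aln.bam | awk '{print $6}' | python read_lengths.py
# For a plain SAM file, skip samtools and drop the @ header lines instead:
awk '!/^@/{print $6}' aln.sam | python read_lengths.py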
Try this:
awk '$6 ~/^([1-9]|0[1-9]|[1-9][0-9]|100)+[S|M]+([1-9]|0[1-9]|[1-9][0-9]|100)+[S|M]$/' file.txt
Because you did not say exactly how the formatting will be in column 6, the above will work where the column looks like '03M05S', '40S100M', or '3M5S', and exclude all else. For instance, it will not find '03F05S', '200M05S', '03M005S', '003M05S', or '003M005S'.
If you can keep the digits in column 6 to two digits for 0-99, or three digits for exactly 100 - meaning exactly one leading zero when under 10, and no leading zeros otherwise - then it is a simpler match. You can use the above pattern but exclude single digits (remove the first [1-9] alternative), e.g.
awk '$6 ~/^(0[1-9]|[1-9][0-9]|100)+[S|M]+(0[1-9]|[1-9][0-9]|100)+[S|M]$/' file.txt
I have several lines from a table that I'm converting from Excel to the Wiki format, and want to add link tags for part of the text on each line, if there is text in that field. I have started the conversion job and come to this point:
|10.20.30.9||x|-||
|10.20.30.10||x|s04|Server 4|
|10.20.30.11||x|s05|Server 5|
|10.20.30.12|||||
|10.20.30.13|||||
What I want is to change the fourth column from, e.g., s04 to [[server:s04]]. I do not wish to add the link brackets if the column is empty, or if it contains -. If that - is a big problem, I can remove it.
All my tries on regex to get anything from the line ends in the whole line being replaced.
Consider using awk to do this:
#!/bin/bash
awk -F'|' '
{
OFS = "|";
if ($5 != "" && $5 != "-")
$5 = "server:" $5;
print $0
}'
NOTE: I've edited this script since the first version. This current one, IMO is better.
Then you can process it with:
cat $FILENAME | sh $AWK_SCRIPTNAME
The -F'|' switch tells awk to use | as the field separator. The if and print statements are pretty self-explanatory: every record is printed, and column 5 is wrapped in [[server:...]] only if it is not "-" or "".
Why column 5 and not column 4? Because you use | at the beginning of each record, awk takes the 'first' field ($1) to be an empty string that it believes should have occurred before this first |.
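You can see the leading empty field directly (a hypothetical one-liner using one of the sample rows):
echo '|10.20.30.10||x|s04|Server 4|' | awk -F'|' '{print NF; print $5}'
# 7
# s04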
This seems to do the job on the sample you give up there (with Vim):
%s/^|\%([^|]*|\)\{3}\zs[^|]*/\=(empty(submatch(0)) || submatch(0) == '-') ? submatch(0) : '[[server:'.submatch(0).']]'/
It's probably better to use awk as ArjunShankar writes, but this should work if you remove the "-" entries ;) I didn't get it to work with them there.
:%s/^\([^|]*|\)\([^|]*|\)\([^|]*|\)\([^|]*|\)\([^|]\+|\)/\1\2\3\4[[server:\5]]/
It's just stupid though. The first 4 groups are identical (match anything up to | four times). I didn't get it to work with {4}. The fifth matches the s04/s05 strings (it just requires that the field is not empty, therefore "-" must be removed).
Adding a bit more readability to the ideas given by others:
:%s/\v^%(\|.{-}){3}\|\zs(\w+)/[[server:\1]]/
Job done.
Note how {3} indicates the number of columns to skip. Also note the use of \v for very magic regex mode. This reduces the complexity of your regex, especially when it uses more 'special' characters than literal text.
Let me recommend the following substitution command.
:%s/^|\%([^|]*|\)\{3}\zs[^|-]\+\ze|/[[server:&]]/
try
:1,$s/|\(s[0-9]\+\)|/|[[server:\1]]|/
assuming that your s04, s05 entries are always an s followed by a number
A simpler substitution can be achieved with this:
%s/^|.\{-}|.\{-}|.\{-}|\zs\(\w\{1,}\)\ze|/[[server:\1]]/
^^^^^^^^^^^^^^^^^^^^ -> Match the first 3 groups (empty or not);
^^^ -> Marks the "start of match";
^^^^^^^^^^^ -> Match only if the 4th column contains letters, numbers, and `_` ([0-9A-Za-z_]);
^^^ -> Marks the "end of match";
If the _ character is like - (it can appear but must not be substituted), use the following regex: %s/^|.\{-}|.\{-}|.\{-}|\zs\([0-9a-zA-Z]\{1,}\)\ze|/[[server:\1]]/