Last Occurrence of Character Field Separator AWK - regex

I'm using a find command to find all files of a certain format, that command has been golden. I'm piping that output into an awk command and I want to use the last underscore as a field separator. The problem being that depending on the path the file is in, there could be one or two underscores before the fact.
find . -regex ".*prob[0-9]*_.*" | awk 'BEGIN { FS = "_.*$" } { print $1 " " $2 }'
I understand what's wrong with the regular expression in my field separator: it separates on the underscore and everything that follows. Is there a way to specify just the single character itself? More to the point, how do I use a field separator on only the last occurrence of a character?
This is somewhat an extension of a question I asked earlier:
Suppress output to StdOut when piping echo
The files I get are generally like this, the wrinkle being that the directory can have an underscore as well:
/the/directory/probXXXXX_XX
where X is any integer.
A workaround I've been thinking of is separating at every underscore and then print every column... I'd rather like to get it working in the method above though.

A trick of awk that is not obvious is that $ is an operator; you can use it with a variable or even an expression, and in particular with expressions involving the predefined variable NF: $NF gets the last field, $(NF - 1) the second last field.
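As a minimal sketch of that idea (the helper name and the sample path are made up for illustration): with _ as the separator, $NF is whatever follows the last underscore, no matter how many underscores the directory contains. If the part before the last underscore is needed too, one option is match() with a regex anchored at the end of the line.

```shell
# Hypothetical helper: split each input line at its LAST underscore.
split_last_underscore() {
  awk '{
    # /_[^_]*$/ can only match at the last underscore on the line
    pos = match($0, /_[^_]*$/)
    if (pos) {
      print substr($0, 1, pos - 1), substr($0, pos + 1)
    } else {
      print $0    # no underscore at all: pass the line through
    }
  }'
}

# Made-up path with an underscore in the directory name
printf '%s\n' './my_dir/prob12345_01' | split_last_underscore
# prints: ./my_dir/prob12345 01

# If only the trailing part is needed, $NF is enough
printf '%s\n' './my_dir/prob12345_01' | awk -F_ '{ print $NF }'
# prints: 01
```

In the original pipeline the find output would simply be piped into the helper instead of the printf.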

Related

Is there a way to use sed to remove only the exact string match?

I have recently started learning bash and ran into a problem with an assignment. I have a text file that contains something like
foo:abc:200:1:1:1
foobar:asd:100:3:2:1
bar:test:100:2:2:2
where the first column is the title of the book, followed by the author name, price, quantity available and quantity sold, all separated by the delimiter ":".
The goal here is to remove a book based on the title and author the user types in.
I searched around and found that sed might be able to help with this problem, so I tested sed by deleting based on the title alone with
sed /"foo"/d Book.txt
I expected the output to be
foobar:asd:100:3:2:1
bar:test:100:2:2:2
however the output was
bar:test:100:2:2:2
which tells me that any line in the txt file containing "foo" will get deleted
Hence I would like to ask
Is there any way to use sed so it deletes the exact match only instead of lines containing foo?
is there any way to use delimiters with sed so I can use both title and author?
Should I be using something other than sed?
Using sed it is better to use:
sed -E '/(^|:)foo(:|$)/d' file
foobar:asd:100:3:2:1
bar:test:100:2:2:2
Which makes sure foo is preceded by start or : and followed by end or :.
However this job is more suitable for awk as data is delimited by colon:
awk -F: '$1 != "foo"' file
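To cover the second part of the question (matching on both title and author), the same awk idea can be extended. This is only a sketch: remove_book and its two positional parameters are hypothetical names, and the title and author are assumed to contain no colons or backslashes. Appending an empty string to both sides forces a string comparison, so a title such as 007 is not treated as the number 7.

```shell
# Hypothetical helper: drop the record whose title AND author both match.
# Reads the colon-delimited book list on stdin, writes the remaining lines to stdout.
remove_book() {
  # $1 = title, $2 = author (as typed by the user)
  awk -F: -v title="$1" -v author="$2" \
    '!(($1 "") == (title "") && ($2 "") == (author ""))'
}

printf '%s\n' 'foo:abc:200:1:1:1' 'foobar:asd:100:3:2:1' 'bar:test:100:2:2:2' |
  remove_book foo abc
# prints:
# foobar:asd:100:3:2:1
# bar:test:100:2:2:2
```

To change the file itself, write to a temporary file and move it over the original, e.g. remove_book foo abc < Book.txt > Book.tmp && mv Book.tmp Book.txt.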
Is there any way to use sed so it deletes the exact match only instead of lines containing foo?
Yes, you can for the given example: if you anchor your search pattern to match exactly foo: at the start of the line, sed deletes only that line. For example:
sed '/^foo:/d' file
The ^ anchors the pattern to the start of the line, so it matches only lines that start with foo followed by a colon, which fits your use case. This assumes foo can only appear in the first column.
Is there any way to use delimiters with sed so I can use both title and author?
Should I be using something other than sed?
If you are dealing with an input file that has a fixed delimiter like :, which will never form part of your valid column content, then awk or perl are better suited, as they parse the text easily once a delimiter is set.
As an example, if you want to change the quantity in the fourth column for one particular book named foobar, with awk you can just do
awk -F: 'BEGIN { OFS = FS } $1 == "foobar" { $4 = 6 }1' input-file
To decode the line above: the content within '..' is left untouched by the shell and passed literally to the command, which is why we wrap it in single quotes. The statements inside it are also not meaningful in the context of the shell.
The -F: sets the input field separator to :, so as the command reads the file line by line, each line is broken down into tokens separated by :. The first column is labelled $1, and so on up to $NF, the last column of the line. The part BEGIN { OFS = FS } sets the output field separator to the same value as the input one, i.e. it retains the : delimiter when awk writes the output.
The part $1 == "foobar" { $4 = 6 } is almost self-explanatory: if the first column equals the string within quotes, do the action inside {..}, which sets the fourth column to 6. The trailing 1 is an always-true pattern with the default action, i.e. shorthand for { print }, which reconstructs the line using the output field/record separators defined.
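The same command can be wrapped so that the title and the new quantity come from shell variables instead of being hard-coded. A sketch, with set_quantity and its parameters as hypothetical names:

```shell
# Hypothetical helper: set the fourth column for one title.
# Reads the book list on stdin, writes the updated list to stdout.
set_quantity() {
  # $1 = title to look for, $2 = new value for the fourth column
  awk -F: -v title="$1" -v qty="$2" '
    BEGIN { OFS = FS }
    ($1 "") == (title "") { $4 = qty }
    { print }
  '
}

printf '%s\n' 'foo:abc:200:1:1:1' 'foobar:asd:100:3:2:1' | set_quantity foobar 6
# prints:
# foo:abc:200:1:1:1
# foobar:asd:100:6:2:1
```

Passing values with -v keeps them out of the awk program text, so no shell quoting tricks are needed inside the single quotes.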
This might work for you (GNU sed):
sed '/\<foo\>/d' file
Or
sed '/\bfoo\b/d' file
The first solution uses \< start word and \> end word. The second solution uses the \b word boundary.
P.S. The dual of \b is \B so to delete lines that contain foobar or foobaz but not foo only, use:
sed '/\bfoo\B/d' file

Extract multiple independent regex matches per line

For the file below, I want to extract the two strings following "XC:Z:" and "XM:Z:". For example:
1st line output should be this: "TGGTCGGCGCGT, GAGTCCGT"
2nd line output should be this: "GAAGCCGCTTCC, ACCGACGG"
The original version of the file has a few more columns and millions of rows than the following example, but it should give you the idea:
MOUSE_10 XC:Z:TGGTCGGCGCGT RG:Z:A XM:Z:GAGTCCGT ZP:i:33
MOUSE_10 XC:Z:GAAGCCGCTTCC NM:i:0 XM:Z:ACCGACGG AS:i:16
MOUSE_10 ZP:i:36 XC:Z:TCCCCGGGTACA NM:i:0 XM:Z:GGGACGGG ZP:i:28
MOUSE_10 XC:Z:CAAATTTGGAAA RG:Z:A NM:i:1 XM:Z:GCAGATAG
In addition, each of following criteria would be a bonus but is not mandatory if you can get it to work:
use standard bash tools: awk, sed, grep, etc. (no GAWK, csvtools,...)
assume we don't know the order in which XC and XM appear (although I'm fairly certain XC is almost first, but I am unsure how to check). In the output, however, the XC-string should always be before the XM-string, if at all possible.
The answers from here awk extract multiple groups from each line come awfully close to it, but whenever I try using match(...) I get a "syntax error near unexpected token" message.
Looking forward to your solutions!
Thanks,
Felix
With sed you can capture non-space characters after XC:Z: and XM:Z:
sed -n 's/.*XC:Z:\([^[:blank:]]*\).*XM:Z:\([^[:blank:]]*\).*/\1, \2/p;' file
You can add a second s command for reversed values:
sed -n 's/.*XC:Z:\([^[:blank:]]*\).*XM:Z:\([^[:blank:]]*\).*/\1, \2/;s/.*XM:Z:\([^[:blank:]]*\).*XC:Z:\([^[:blank:]]*\).*/\1, \2/;p;' file
Following awk solution may help you in same.
awk '
/XC:Z:/{
match($0,/XC:[^ ]*/);
num=split(substr($0,RSTART,RLENGTH),a,":");
match($0,/XM:[^ ]*/);
num1=split(substr($0,RSTART,RLENGTH),b,":");
print a[num],b[num1]
}' Input_file
Output will be as follows.
TGGTCGGCGCGT GAGTCCGT
GAAGCCGCTTCC ACCGACGG
TCCCCGGGTACA GGGACGGG
CAAATTTGGAAA GCAGATAG
If we don't know the order in which XC and XM appear
You can try this sed
sed -E 'h;s/(XC:Z:.*XM:Z:)//;tA;x;s/(.*XM:Z:)([^[:blank:]]*)(.*XC:Z:)([^[:blank:]]*)(.*)/\4,\2/;b;:A;x;s/(.*XC:Z:)([^[:blank:]]*)(.*XM:Z:)([^[:blank:]]*)(.*)/\2,\4/' infile
explanation :
sed -E '
h
# keep the line in the hold space
s/(XC:Z:.*XM:Z:)//;x;tA
# if XC:Z: comes before XM:Z:, go to A, but first restore the pattern space with x
s/(.*XM:Z:)([^[:blank:]]*)(.*XC:Z:)([^[:blank:]]*)(.*)/\4,\2/
# XM:Z: comes before XC:Z:, so take the interesting parts and reorder them
b
# done with this line
:A
s/(.*XC:Z:)([^[:blank:]]*)(.*XM:Z:)([^[:blank:]]*)(.*)/\2,\4/
# XC:Z: comes before XM:Z:, so take the interesting parts
' infile
another awk
$ awk '{c=p=""; # need to reset c and p before each line
for(i=1;i<=NF;i++) # for all fields in the line
if($i~/^XC:Z:/) c=substr($i,6) # check pattern from the start of field
else if($i~/^XM:Z:/) p=substr($i,6) # otherwise check the other pattern
if(c && p) print c,p}' file # if both matched print
TGGTCGGCGCGT GAGTCCGT
GAAGCCGCTTCC ACCGACGG
TCCCCGGGTACA GGGACGGG
CAAATTTGGAAA GCAGATAG
this will print the last matches if there are multiple instances on the same line. Here is another one with slightly different characteristic.
$ awk 'function s(x) {return ($i~x)?substr($i,6):""}
{c=p="";
for(i=1;i<=NF;i++) {
c=c?c:s("^XC:Z:"); p=p?p:s("^XM:Z:");
if(c && p)
{print c,p; next}}}' file
TGGTCGGCGCGT GAGTCCGT
GAAGCCGCTTCC ACCGACGG
TCCCCGGGTACA GGGACGGG
CAAATTTGGAAA GCAGATAG
this will print the last of the repeated matches before the first match of the other tag. If they appear in pairs, it will print the first pair.
Using POSIX awk, you can only use the string-function match(s,ere) as defined by IEEE Std 1003.1-2008 :
match(s, ere)
Return the position, in characters, numbering from 1, in
string s where the extended regular expression ere occurs, or zero if
it does not occur at all. RSTART shall be set to the starting position
(which is the same as the returned value), zero if no match is found;
RLENGTH shall be set to the length of the matched string, -1 if no
match is found.
The patterns you want to match are XM:Z:[^[:blank:]]* and XC:Z:[^[:blank:]]*. This however assumes you do not have any string which contains something like PXM:Z: (i.e. an extra non-blank character in front of the searched string). When the pattern is found in the line $0, you only need to extract the important part, which starts 5 characters later.
The following code does the above:
awk '{match($0,/XM:Z:[^[:blank:]]*/);xm=substr($0,RSTART+5,RLENGTH-5)}
{match($0,/XC:Z:[^[:blank:]]*/);xc=substr($0,RSTART+5,RLENGTH-5)}
{print xc","xm}' <file>
As you can see, the first line extracts XM, the second XC and the third prints the outcome with comma-separator ",".
Remark - The following assumptions are made here :
each line contains both an xm and xc string
no strings of the type [^[:blank:]]X[CM]:Z:[^[:blank:]]* exist
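If the first assumption might not hold, the return value of match() can be checked so that lines missing either tag are skipped rather than printed with an empty value. A sketch, with extract_tags as a hypothetical name; [^ \t] is used in place of [^[:blank:]] only because some older awk builds lack POSIX character classes:

```shell
# Hypothetical helper: print "XC, XM" for every line that carries both tags,
# whatever order they appear in; lines missing one of them are skipped.
extract_tags() {
  awk '{
    xc = xm = ""
    if (match($0, /XC:Z:[^ \t]*/)) xc = substr($0, RSTART + 5, RLENGTH - 5)
    if (match($0, /XM:Z:[^ \t]*/)) xm = substr($0, RSTART + 5, RLENGTH - 5)
    if (xc != "" && xm != "") print xc ", " xm
  }'
}

printf '%s\n' \
  'MOUSE_10 XC:Z:TGGTCGGCGCGT RG:Z:A XM:Z:GAGTCCGT ZP:i:33' \
  'MOUSE_10 XM:Z:ACCGACGG NM:i:0 XC:Z:GAAGCCGCTTCC' \
  'MOUSE_10 XC:Z:AAAA NM:i:0' | extract_tags
# prints:
# TGGTCGGCGCGT, GAGTCCGT
# GAAGCCGCTTCC, ACCGACGG
```

The second assumption (no tag embedded in a longer token) still applies to this version.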
If you are willing to use gawk, then you could use the patsplit function for string operations (Ref. here). You can do this with a single regex /X[CM]:Z:[^[:blank:]]*/. This gives you the requested strings directly in a single call, still including the XM:Z: or XC:Z: prefix. Afterwards you can easily sort them and extract the last parts.
The following lines do exactly the same in gawk
gawk '{patsplit($0,a,/X[MC]:Z:[^[:blank:]]*/) }
{xc=(a[1]~/^XC/)?a[1]:a[2]; xm=(a[1]~/^XC/)?a[2]:a[1]}
{print substr(xc,6)","substr(xm,6)}' <file>
Nonetheless, I believe the awk solution is cleaner from a symmetric point of view.

BASH escaping double quotes within single quotes

I'm trying to write a bash function that would escape all double quotes within single quotes, eg:
'I need to escape "these" quotes with backslashes'
would become
'I need to escape \"these\" quotes with backslashes'
My take on it was:
Find pairs of single quotes in the input and extract them with grep
Pipe into sed, escape double quotes
Sed again the whole input and replace grep match with sedded match
I managed to get it working to the part of having correctly escaped quotes section, but replacing it in the whole input fails.
The script code copypaste:
# $1 - Full name, $2 - minified name
adjust_quotes ()
{
SINGLE_QUOTES=`grep -Eo "'.*'" $2`
ESCAPED_QUOTES=`echo $SINGLE_QUOTES | sed 's|"|\\\\"|g'`
sed -r "s|'.*'|$ESCAPED_QUOTES|g" "$2" > "$2.escaped"
mv "$2.escaped" $2
echo "Quotes escaped within single quotes on $2"
}
Random additional questions:
In the console, escaping the quote with only two backslashes works, but when the code is put in the script I need four. I'd love to know why.
Could I modify this code into a loop to escape all pairs of single quotes, one after another until EOF?
Thanks!
P.S. I know this would probably be easier to do in eg. python, but I really need to keep it in bash.
Using BASH string replacement:
s='I need to escape "these" quotes with backslashes'
r="${s//\"/\\\"}"
echo "$r"
I need to escape \"these\" quotes with backslashes
Here's a pure bash solution, which does the transformation on stdin, printing to stdout. It reads the entire input into memory, so it won't work with really enormous files.
escape_enclosed_quotes() (
IFS=\'
read -d '' -r -a fields
for ((i=1; i<${#fields[@]}; i+=2)); do
fields[i]=${fields[i]//\"/\\\"}
done
printf %s "${fields[*]}"
)
I deliberately enclosed the body of the function in parentheses rather than braces, in order to force the body to run in a subshell. That limits the modification of IFS to the body, as well as implicitly making the variables used local.
The function uses the read builtin to read the entire input (since the line delimiter is set to NUL with -d '') into an array (-a) using a single quote as the field separator (IFS=\'). The result is that the parts of the input surrounded with single quotes are in the odd positions of the array, so the function loops over the odd indices to do the substitution only for those fields. I use bash's find-and-replace syntax instead of deferring to an external utility like sed.
This being bash, there are a couple of gotchas:
If the file contains a NUL, the rest of the file will be ignored.
If the last line of the file does not end with a newline, and the last character of that line is a single quote, it will not be output.
Both of the above conditions are impossible in a portable text file, so it's probably OK. All the same, worth taking note.
The supplementary question: why are the extra backslashes needed in
ESCAPED_QUOTES=`echo $SINGLE_QUOTES | sed 's|"|\\\\"|g'`
Answer: It has nothing to do with that line being in a script. It has to do with your use of backticks (`...`) for command substitution, and the idiosyncratic and often unpredictable handling of backslashes inside backticks. This syntax is deprecated. Do not use it. (Not even if you see someone else using it in some random example on the internet.) If you had used the recommended $(...) syntax for command substitution, it would have worked as expected:
ESCAPED_QUOTES=$(echo $SINGLE_QUOTES | sed 's|"|\\"|g')
(More information is in the Bash FAQ linked above.)
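The difference can be checked side by side. In this small sketch (the sample string and variable names are made up), the shell strips one level of backslashes inside backticks before sed ever sees the command, so four are needed there to produce the same result that two produce inside $(...):

```shell
input='say "hi"'

# Recommended form: the two backslashes reach sed unchanged
with_dollar=$(printf '%s\n' "$input" | sed 's|"|\\"|g')

# Legacy form: inside backticks the shell first turns each \\ into \,
# so four backslashes are needed for sed to receive two
with_backticks=`printf '%s\n' "$input" | sed 's|"|\\\\"|g'`

printf '%s\n' "$with_dollar" "$with_backticks"
# prints:
# say \"hi\"
# say \"hi\"
```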

Extracting a string between two patterns in bash with a regex

I have a string with key/value pairs in a bash variable. The value I want is hidden like this.
{"keyIDontCareAbout"=>"valueIDontCareAbout",
"keyForValueIWant"=>"valueIWant",
...............bunch more keys
}
What should I use to extract that value? sed, awk, expr match?
My thinking is this, I should extract the string that is preceded by "keyForValueIWant"=>" and is followed by " but I'm having a hard time deciding which tool to use.
expr match seems bad, because it grabs a string at the end of an expression or at the beginning of one, but my string is in the middle of a bunch of characters.
Basically, I can't figure out the regex syntax for a substring between two other substrings.
You can use the following sed command:
valueOfInterest=$(sed -n '/keyForValueIWant/ s/.*=>"\([^"]*\).*/\1/p' <<< "$input")
-n disables output by default. The address /keyForValueIWant/ restricts the following command to the line(s) matching that regex. The substitute command then extracts the value from the line, and its p flag prints the result.
Try awk as follows:
# Specify key of interest.
key='keyForValueIWant'
# Extract matching value, assuming that the input data is
# in shell variable $input:
value=$(awk -F'("|=>)' -v key="$key" '$2==key { print $5; exit }' <<<"$input")
# Print result.
echo "Value for $key: [$value]"
-F'("|=>)' tells awk to split each line into fields based on " or => as separators - effectively, this will put the key in field 2 ($2), and the value in field 5 ($5)
The key of interest is passed as a shell variable ($key) to awk as a variable of the same name (-v key=...).
If the input line's key matches the specified key ($2==key), the 5th field - containing the value - is printed (print $5).
exit ensures that processing stops once a match is found to prevent unnecessary parsing of the remainder of the file (note: this assumes that the keys are true keys, i.e., that they are unique in the input file).
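Since the data already sits in a shell variable, plain parameter expansion is another option that avoids starting an external process. This is a sketch under the same assumptions as above (the key is unique and the value contains no double quote); the prefix, rest and value variable names are made up for illustration:

```shell
input='{"keyIDontCareAbout"=>"valueIDontCareAbout",
"keyForValueIWant"=>"valueIWant",
}'
key='keyForValueIWant'
prefix="\"$key\"=>\""          # the literal text "keyForValueIWant"=>"

case $input in
  *"$prefix"*)
    rest=${input#*"$prefix"}   # drop everything up to and including the prefix
    value=${rest%%\"*}         # drop everything from the next double quote on
    ;;
  *)
    value=''                   # key not present
    ;;
esac

printf 'Value for %s: [%s]\n' "$key" "$value"
# prints: Value for keyForValueIWant: [valueIWant]
```

The case guard matters: without it, a missing key would leave the whole input in rest and yield a bogus value.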

Problem with regular expression using grep

I've got some textfiles that hold names, phone numbers and region codes. One combination per line.
The syntax is always "Name Region_code number"
With any number of spaces between the 3 variables.
What I want to do is search for specific region codes, like 23 or 493, for example.
The problem is that these numbers might appear in the longer numbers too, which might enable a return that shouldn't have been returned.
I was thinking of this sort of command:
grep '04' numbers.txt
But if I do that, a line that contains 04 in the number but not as region code will show as a result too... which is not correct.
I'm sure you are about to get buried in clever regular expressions, but I think in this case all you need to do is include one of the spaces on each side of your region code in the grep.
grep ' 04 ' numbers.txt
I'd do:
awk '$2 == "04"' < numbers.txt
and with grep:
grep -e '^[^ ]*[ ][ ]*04[ ][ ]*[^ ]*$' numbers.txt
If you want region codes alone, you should use:
grep "[[:space:]]04[[:space:]]"
this way it will only match the code in the middle column, because that is the only column with whitespace on both sides.
You can even do:
function search_region_codes {
grep "[[:space:]]${1}[[:space:]]" FILE
}
replacing FILE with the name of your file,
and use
search_region_codes 04
or even
function search_region_codes {
grep "[[:space:]]${1}[[:space:]]" $2
}
and using
search_region_codes NUMBER FILE
Are you searching for an entire region code, or a region code that contains the subpattern?
If you want the whole region code, and there is at least one space on either side, then you can format the grep by adding a single space on either side of the specific region code. There are other ways to indicate word boundaries using regular expressions.
grep ' 04 ' numbers.txt
If there can be spaces in the name or phone number fields, then that solution might not work. Also, if the pattern can be a sub-part of the region code, then awk is a better tool. This assumes that the 'name' field contains no spaces. The matching operator '==' requires that the pattern exactly match the field. This can be tricky when there is whitespace on either side of the field.
awk '$2 == "04" {print $0}' < numbers.txt
If the file has a delimiter, that can be set with awk's '-F' argument, which sets the field separator character. In this example, a comma is used as the field separator. In addition, the matching operator in this example is '~', allowing the pattern to be any part of the region code (if that is applicable). The "\y" is a way to match word boundaries at the beginning and end of the expression (it is a GNU awk extension).
awk -F , '$2 ~ /\y04\y/ {print $0}' < numbers.txt
In both examples, the {print $0} is optional, if you want the full line to be printed. However, if you want to do any formatting on the output, that can be done inside that block.
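If the region code comes from a variable rather than being typed into the awk program, there is one catch worth a sketch: a value passed with -v that looks like a number is compared numerically against a numeric-looking field, so 04 would also match 4. Appending an empty string to both sides forces a string comparison. The helper name search_region_code is hypothetical, and the sample lines are invented:

```shell
# Hypothetical helper: print lines whose second whitespace-separated
# field is exactly the given region code. Reads the list on stdin.
search_region_code() {
  # $1 = region code, compared as a string so that 04 does not match 4
  awk -v code="$1" '($2 "") == (code "")'
}

printf '%s\n' \
  'Alice 04 5551234' \
  'Bob 23 0404040' \
  'Carol 4 1234567' \
  'Dave   04    999' | search_region_code 04
# prints:
# Alice 04 5551234
# Dave   04    999
```

With a real file this would be called as search_region_code 04 < numbers.txt.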
use word boundaries. not sure if this works in grep, but in other regex implementations i'd surround it with whitespace or word boundary patterns
'\s+04\s+' or '\b04\b'
Something like that