How to define a space in a regular expression (in awk)?

I want to print the text inside the double quotes. For example, I have the following strings:
gfdg "jkfgh" "jkfd fdgj fd-" ghjhgj
gfggf "kfdjfdgfhbg" "fhfghg" jhgj
jhfjhg "dfgdf" fgf
fgfdg "dfj jfdg jhfgjd" "hfgdh jfdhgd jkfghfd" hgjghj
And I want to print only the following:
"jkfgh" "jkfd fdgj fd-"
"kfdjfdgfhbg" "fhfghg"
"dfgdf"
"dfj jfdg jhfgjd" "hfgdh jfdhgd jkfghfd"
I have tried awk with the following regular expression:
awk '{for(i = 1; i <= NF; i++) if($i ~ /^\"[A-Za-z.$]*([A-Za-z.$][[:space:]]*[A-Za-z.$])*\"$/) print $i}' sample.txt
but it prints everything before the space and does not actually recognize the spaces I have defined in my regular expression. My current output is:
"jkfgh"
"kfdjfdgfhbg" "fhfghg"
"dfgdf"
"dfj
As you can see, only the ones without any space are printed correctly.
I have also tried [[:blank:]], \t and ' ', but none of them worked.
I would appreciate it if someone could tell me how to change this regular expression to include the space.

The question's title is misleading and based on a fundamental misconception about awk.
The naïve answer is that a space can simply be represented as itself (a literal) in regular expressions in awk.
More generally, you can use [[:space:]] to match a space, a tab or a newline (GNU Awk also supports \s), and [[:blank:]] to match a space or a tab.
However, the crux of the problem is that Awk, by default, splits each input line into fields by whitespace, so that, by definition, no input field itself contains whitespace; any attempt to match a space in a field value will therefore invariably fail.
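A quick illustration of that default splitting (with made-up input):
$ echo 'foo "bar baz" qux' | awk '{ print $2 }'
"bar
The space inside the quotes is treated like any other field separator, so $2 is just "bar.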
The input at hand has fields that are a mix of unquoted and quoted strings, but POSIX Awk has no support for recognizing quoted strings as fields.
@fedorqui has made a valiant attempt to work around the problem by splitting input into fields by double quotes, but it's no substitute for proper recognition of quoted strings, because it doesn't preserve the true field boundaries.
If you have GNU Awk, you can approximate recognition of quoted strings using the special FPAT variable, which, rather than defining a separator to split lines by, allows defining a regex that describes fields (and ignores tokens not recognized as such):
re='[[:alpha:]][[:alpha:] ]*[[:alpha:]]' # aux. shell variable
gawk -v FPAT="\"$re\"|'$re'" '{
  for (i = 1; i <= NF; ++i) printf "%s%s", $i, (i == NF ? "\n" : " ")
}' sample.txt
This will work with single- and double-quoted strings.
Explanation:
FPAT="\"$re\"|'$re'" defines fields to be either double- or single-quoted strings consisting only of letters and spaces, with at least one letter on either end (as in the OP's code).
Note that this automatically excludes the UNquoted tokens on each input line - they will not be reflected in $1, ... and NF.
Therefore, the loop for(i=1;i<=NF;++i) is already limited to enumerating only the matching fields.
Note that the restrictions placed on the contents of the quoted strings in this case luckily bypass a limitation inherent in this approach: the inability to deal with escaped, nested quotes (of the same type).
If this limitation is acceptable, you can use the following idiom to tokenize input that is a mix of barewords (unquoted tokens) and quoted strings:
gawk -v "FPAT=[^[:blank:]]+|\"[^\"]*\"|'[^']*'" ...

You are only getting the ones without any space because you loop through the fields, and fields are space-separated. Thus, you need to change the approach to something that handles the spaces differently. Assuming there are no nested quotes, you can use for example:
awk -F'"' '{for (i=2;i<NF;i+=2) printf "\"%s\"", $i; print ""}' file
That is, use " as the field separator and print the even-numbered fields.
This is equivalent, just using FS more elegantly:
awk -F'"' '{for (i=2;i<NF;i+=2) printf "%s%s%s", FS, $i, FS; print ""}' file
Note in the previous approaches the output has no space in between fields. If you need it, you can use:
awk -F'"' '{for (i=2;i<NF;i+=2) printf "%s%s%s%s", FS, $i, FS, (i>NF-2?"\n":" ")}' file
The trick (i>NF-2?"\n":" ") prints the whole field together with a separator: if we are at the last field, the separator is a newline; otherwise, a space. More idiomatically, you can also say (i>NF-2?RS:OFS), using the default values of RS (record separator, a newline) and OFS (output field separator, a space).
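For completeness, that variant in full:
awk -F'"' '{for (i=2;i<NF;i+=2) printf "%s%s%s%s", FS, $i, FS, (i>NF-2?RS:OFS)}' file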
Test
$ awk -F'"' '{for (i=2;i<NF;i+=2) printf "%s%s%s%s", FS, $i, FS, (i>NF-2?"\n":" ")}' file
"jkfgh" "jkfd fdgj fd-"
"kfdjfdgfhbg" "fhfghg"
"dfgdf"
"dfj jfdg jhfgjd" "hfgdh jfdhgd jkfghfd"

Related

Replace all commas between two quotes in a bash script

I need all "," between two " to be replaced with ";" within a bash script. I'm close, but hours on the internet and Stack Overflow led me to this:
echo ',,Lung,,"Lobular, each.|lungs, right.",false,,,,"organ, left.",,,,,' | sed -r ':a;s/(".*?),(.*?")/\1;\2/;ta'
With the result:
,,Lung,,"Lobular; each.|lungs; right.";false;;;;"organ; left.",,,,,
Correct would be:
,,Lung,,"Lobular; each.|lungs; right.",false,,,,"organ; left.",,,,,
Not sure how you want to deal with lines that have an odd number of double quotes (e.g., a double-quoted string that spans multiple lines), but perhaps:
awk '!(NR%2){gsub(",",";")} 1' RS=\" ORS=\"
This simply treats " as the record separator and does the replacement only on even-numbered records, which are the ones between quotes. Seems to work as desired. (Or, rather, it works as you seem to desire!)
As oguz points out in a comment, this prints an extra " at the end. That can be fixed with:
awk '!(NR%2){gsub(",",";")} {printf RFS $0} {RFS="\""}' RS=\"
which is a bit uglier but more correct (or, rather, less incorrect!). If your input stream ends with a ", that quote will be truncated. If, however, your input is terminated by a newline rather than a ", this will do what you want.
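Checking it against the sample line from the question (run with GNU Awk):
$ echo ',,Lung,,"Lobular, each.|lungs, right.",false,,,,"organ, left.",,,,,' |
  awk '!(NR%2){gsub(",",";")} {printf RFS $0} {RFS="\""}' RS=\"
,,Lung,,"Lobular; each.|lungs; right.",false,,,,"organ; left.",,,,,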
OTOH, you might just want to do:
perl -wpE 'BEGIN{$/=\1}; y/,/;/ if $in; $in = ! $in if $_ eq "\""'
Which reads one character and uses a simple state machine. ($_ is the current character, so $in = ! $in changes state when a double quote is seen and the transliteration only happens when $in is non-zero.)
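Applied to the sample line, it produces the desired output:
$ echo ',,Lung,,"Lobular, each.|lungs, right.",false,,,,"organ, left.",,,,,' |
  perl -wpE 'BEGIN{$/=\1}; y/,/;/ if $in; $in = ! $in if $_ eq "\""'
,,Lung,,"Lobular; each.|lungs; right.",false,,,,"organ; left.",,,,,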
If you /really/ wanted to use sed, you could do a whole-line replace and include a clause like ^(([^"]*"[^"]*")*[^"]*) at the beginning of your existing expression in order to ensure that the matched quotes are "odd".

Pass variable to awk pattern

I am writing a shell script which calls an awk script; I take some user input in BEGIN using getline and save the input to some variables.
BEGIN {
    printf "What's the word?"
    getline word < "-"
}
Now, one of these variables is called "word" and I want to use it in another pattern in the script to print all lines containing the word given. I tried something like this:
/(^| )word( |$)/
which will print all lines containing the literal word "word"; I know it's not going to work because word is not recognized as a variable there. I've searched a lot and found patterns starting with
$0~
but it's not working either in my case. Is there a way I could pass a variable to this pattern and print all lines containing the word stored in the variable?
If you use a BEGIN section to build a variable with your full pattern, you can refer to it later:
awk -v word="hello" '
BEGIN {
pattern = "(^|[[:space:]])" word "([[:space:]]|$)"
}
$0 ~ pattern { print $0 }
'
...that said, you don't even need to do that, if you don't mind the overhead of reconstructing the pattern for every line:
awk -v word="hello" '$0 ~ ("(^|[[:space:]])" word "([[:space:]]|$)") { print $0 }'
(Why [[:space:]] instead of a literal space? That way tabs and other whitespace characters besides vanilla hex-20 spaces can also act as word separators.)
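A quick test of the inline form (made-up input; the last line deliberately lacks a word boundary):
$ printf 'hello world\nsay hello\nshello\n' |
  awk -v word="hello" '$0 ~ ("(^|[[:space:]])" word "([[:space:]]|$)") { print $0 }'
hello world
say hello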
Another alternative is using the word boundary \y, which you can apply in the variable with some backslash escaping:
awk -v v="\\\y$word\\\y" '$0~v' file
Not sure all awks support this, though (it's a GNU Awk extension). Alternatively, you can use \< and \> for the left and right boundaries.
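For example, with GNU Awk (a minimal sketch; the word is hard-coded here rather than expanded from a shell variable):
$ printf 'hello world\nshello\n' | gawk -v v='\\yhello\\y' '$0 ~ v'
hello world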

awk - parse text having same character in fields as delimiter

Consider this source:
field1;field2;"data;data field3";field4;"data;data field5";field6
field1;"data;data field2";field3;field4;field5;"data;data field6"
As you can see, the field delimiter is also used inside certain fields, enclosed between ". I cannot parse this directly with awk because there is no way of avoiding the unwanted splitting, at least none I have found. Moreover, those special fields have a variable position within a line, and they can occur once, twice, four times, etc.
I thought of a solution involving a pre-parsing step, where I replace the ; in those fields with a code of some sort. The problem is that sed/awk perform greedy regex matching, so in the above example I can only replace the ; within the last quoted field on each line.
How can I match each instance of quotes and replace the specific ; within them? I do not want to use perl or python etc.
Using GNU Awk, you can use the special FPAT variable to define a regex for your fields.
You can use this command to replace all ; with | inside the double quotes:
awk -v OFS=';' -v FPAT='"[^"]*"|[^;]*' '{for (i=1; i<=NF; i++) gsub(/;/, "|", $i)} 1' file
field1;field2;"data|data field3";field4;"data|data field5";field6
field1;"data|data field2";field3;field4;field5;"data|data field6"
As an alternative to FPAT, you can set awk's FS to a double quote and then swap out your semicolon delimiter in every other field:
awk -F"\"" '{for(i=1;i<=NF;++i){ if(i%2==0) gsub(/;/, "|", $i)}} {print $0}' yourfile
Here awk is:
Splitting the record by double quotes (-F"\"")
Looping through each field that it finds ({for(i=1;i<=NF;++i))
Testing whether the field's ordinal mod 2 is 0 (if(i%2==0))
If it's even, swapping out the semicolons for pipes (gsub(/;/, "|", $i))
Printing out the transformed record ({print $0})
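To see why the even-numbered fields are the quoted ones, dump the fields of a small made-up sample:
$ echo 'a;"b;c";d' | awk -F'"' '{ for (i = 1; i <= NF; i++) print i ": " $i }'
1: a;
2: b;c
3: ;d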

Last Occurrence of Character Field Separator AWK

I'm using a find command to find all files of a certain format, that command has been golden. I'm piping that output into an awk command and I want to use the last underscore as a field separator. The problem being that depending on the path the file is in, there could be one or two underscores before the fact.
find . -regex ".*prob[0-9]*_.*" | awk 'BEGIN { FS = "_.*$" } { print $1 " " $2 }'
I get what's wrong with the regular expression in my field separator: it separates on the underscore and everything that follows. Is there a way to specify just the single character itself? Moreover, how do I use a field separator on only the last occurrence of a character?
This is somewhat an extension of a question I asked earlier:
Suppress output to StdOut when piping echo
The files I get are generally like this, the wrinkle being that the directory can have an underscore as well:
/the/directory/probXXXXX_XX
where each X is any digit.
A workaround I've been thinking of is separating at every underscore and then print every column... I'd rather like to get it working in the method above though.
A trick of awk that is not obvious is that $ is an operator; you can use it with a variable or even an expression, and in particular with expressions involving the predefined variable NF: $NF gets the last field, $(NF - 1) the second last field.
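For example, splitting on every underscore and indexing from the end (a minimal sketch with a made-up path):
$ echo '/the/direc_tory/prob12345_67' | awk -F'_' '{ print $NF, $(NF-1) }'
67 tory/prob12345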

Problem with regular expression using grep

I've got some textfiles that hold names, phone numbers and region codes. One combination per line.
The syntax is always "Name Region_code number"
With any number of spaces between the 3 variables.
What I want to do is search for specific region codes, like 23 or 493, for example.
The problem is that these numbers might appear in the longer phone numbers too, which could produce matches that shouldn't be returned.
I was thinking of this sort of command:
grep '04' numbers.txt
But if I do that, a line that contains 04 in the phone number but not as the region code will show up as a result too... which is not correct.
I'm sure you are about to get buried in clever regular expressions, but I think in this case all you need to do is include one of the spaces on each side of your region code in the grep.
grep ' 04 ' numbers.txt
I'd do:
awk '$2 == "04"' < numbers.txt
and with grep:
grep -e '^[^ ]*[ ]*04[ ]*[^ ]*$' numbers.txt
If you want region codes alone, you should use:
grep "[[:space:]]04[[:space:]]"
this way it will only match the number in the middle column, since it must have whitespace on both sides.
You can even do:
function search_region_codes {
    grep "[[:space:]]${1}[[:space:]]" FILE
}
replacing FILE with the name of your file,
and use
search_region_codes 04
or even
function search_region_codes {
    grep "[[:space:]]${1}[[:space:]]" $2
}
and using
search_region_codes NUMBER FILE
Are you searching for an entire region code, or a region code that contains the subpattern?
If you want the whole region code, and there is at least one space on either side, then you can format the grep by adding a single space on either side of the specific region code. There are other ways to indicate word boundaries using regular expressions.
grep ' 04 ' numbers.txt
If there can be spaces in the name or phone number fields, then that solution might not work. Also, if the pattern can be a sub-part of the region code, then awk is a better tool. This assumes that the 'name' field contains no spaces. The matching operator '==' requires that the pattern exactly match the field. This can be tricky when there is whitespace on either side of the field.
awk '$2 == "04" {print $0}' < numbers.txt
If the file has a delimiter, that can be set in awk using the '-F' argument, which sets the field separator character. In this example, a comma is used as the field separator. In addition, the matching operator in this example is '~', allowing the pattern to match any part of the region code (if that is applicable). The \y is a way to match word boundaries at the beginning and end of the expression.
awk -F , '$2 ~ /\y04\y/ {print $0}' < numbers.txt
In both examples, the {print $0} is optional, if you want the full line to be printed. However, if you want to do any formatting on the output, that can be done inside that block.
Use word boundaries. Not sure if this works in grep, but in other regex implementations I'd surround it with whitespace or word-boundary patterns:
'\s+04\s+' or '\b04\b'
Something like that.
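For what it's worth, GNU grep does support \b, as well as -w for whole-word matching; a quick sketch with made-up data:
$ printf 'Alice 04 5550104\nBob 104 5550404\n' | grep '\b04\b'
Alice 04 5550104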