Awk 3 Spaces + 1 space or hyphen - regex

I have a rather large chart to parse. Each column is separated by either 4 spaces or by 3 spaces and a hyphen (since the numbers in the chart can be negative).
cat DATA.txt | awk "{ print match($0,/\s\s/) }"
does nothing but print a slew of 0's. I'm trying to understand AWK and when to escape, etc, but I'm not getting the hang of it. Help is appreciated.
One line:
1979 1 -0.176 -0.185 -0.412 0.069 -0.129 0.297 -2.132 -0.334 -0.019
1979 1 -0.176 0.185 -0.412 0.069 -0.129 0.297 -2.132 -0.334 -0.019
I would like to get just, say, the second column. I copied the line, but I'd like to see -0.185 and 0.185.

You need to start by thinking about bash quoting, since it is bash which interprets the argument to awk which will be the awk program. Inside double-quoted strings, bash expands $0 to the name of the bash executable (or current script); that's almost certainly not what you want, since it will not be a quoted string. In fact, you almost never want to use double quotes around the awk program argument, so you should get into the habit of writing awk '...'.
Also, awk regular expressions don't understand \s (although Gnu awk will handle that as an extension). And match returns the position of the match, which I don't think you care about either.
Since by default, awk considers any sequence of whitespace a field separator, you don't really need to play any games to get the fourth column. Just use awk '{print $4}'

Why not just use this simple awk
awk '$0=$4' Data.txt
-0.185
0.185
It sets $0 to value in $4 and does the default action, print.
PS do not use cat with program that can read data itself, like awk
In case of filed 4 containing 0, you can make it more robust like:
awk '{$0=$4}1' Data.txt

If you're trying to split the input according to 3 or 4 spaces then you will get the expected output only from column 3.
$ awk -v FS=" {3,4}" '{print $3}' file
-0.185
0.185
FS=" {3,4}" here we pass a regex as FS value. This regex get parsed and set the Field Separator value to three or four spaces. In regex {min,max} called range quantifier which repeats the previous token from min to max times.

Related

awk Regular Expression (REGEX) get phone number from file

The following is what I have written that would allow me to display only the phone numbers
in the file. I have posted the sample data below as well.
As I understand (read from left to right):
Using awk command delimited by "," if the first char is an Int and then an int preceded by [-,:] and then an int preceded by [-,:]. Show the 3rd column.
I used "www.regexpal.com" to validate my expression. I want to learn more and an explanation would be great not just the answer.
GNU bash, version 4.4.12(1)-release (x86_64-pc-linux-gnu)
awk -F "," '/^(\d)+([-,:*]\d+)+([-,:*]\d+)*$/ {print $3}' bashuser.csv
bashuser.csv
Jordon,New York,630-150,7234
Jaremy,New York,630-250-7768
Jordon,New York,630*150*7745
Jaremy,New York,630-150-7432
Jordon,New York,630-230,7790
Expected Output:
6301507234
6302507768
....
You could just remove all non int
awk '{gsub(/[^[:digit:]]/, "")}1' file.csv
gsub remove all match
[^[:digit:]] the ^ everything but what is next to it, which is an int [[:digit:]], if you remove the ^ the reverse will happen.
"" means remove or delete in awk inside the gsub statement.
1 means print all, a shortcut for print
In sed
sed 's/[^[:digit:]]*//g' file.csv
Since your desired output always appears to start on field #3, you can simplify your regrex considerably using the following:
awk -F '[*,-]' '{print $3$4$5}'
Proof of concept
$ awk -F '[*,-]' '{print $3$4$5}' < ./bashuser.csv
6301507234
6302507768
6301507745
6301507432
6302307790
Explanation
-F '[*,-]': Use a character class to set the field separators to * OR , OR -.
print $3$4$5: Concatenate the 3rd through 5th fields.
awk is not very suitable because the comma occurs not only as a separator of records, better results will give sed:
sed 's/[^,]\+,[^,]\+,//;s/[^0-9]//g;' bashuser.csv
first part s/[^,]\+,[^,]\+,// removes first two records
second part //;s/[^0-9]//g removes all remaining non-numeric characters

Match strings with certain number of unique characters in bash

I need to delete all the strings in a file that have less than 4 unique characters in them
Input:
hello
cabby
pabba
lokka
lappa
coool
apple
Expected Output:
hello
cabby
lokka
apple
I tried to think up a regular expression to do this but I can't think how it would even be possible.
I did find a sed command that seems promising, it deletes all duplicate characters. However, I am not sure how to program sed to test if the program returns 4 characters, and then if it does, match the original string.
sed ':1;s/\(\(.\).*\)\2/\1/g;t'
Using gnu awk:
awk 'BEGIN{FS=""} {
unq=0; delete seen; for (i=1; i<=NF; i++) if (!seen[$i]++) unq++} unq > 3' file
hello
cabby
lokka
apple
FS="" breaks each character into a separate field in awk.
You tried sed ':1;s/\(\(.\).*\)\2/\1/g;t', please replace t by t1.
Before your command, copy the current line in the Hold space.
After your command, replace lines with at least 4 characters left with the original line.
Now make sure you only print lines with at least four characters.
echo 'hello
cabby
pabba
lokka
lappa
coool
apple' | sed -nE 'h;:1;s/(.)(.*)\1/\1\2/g;t1;/.{4}/x;/.{4}/p'

Parse a file without common delimiter in shell

I would like to ask you for help with parsing a file in shell.
Here is my data:
ID:1 g-t="Demo one" rfid="af7e 25" t-link="http://demo.site.com/api2",User af73 25 http://example.com/useraf73
ID:2 g-t="Demo one" rfid="77 63" t-link="http://demo.site.com/api",User 77 http://example.com/user77
There is no common delimiter, basically I need these fields:
ID=1 | g-t="Demo one" | rfid="af7e 25" | t-link="http://demo.site.com/api2" | User af73 25 | http://example.com/useraf73
Here is where I am stuck:
awk '{match($0,"g-t=([^\" ]+)",a)}END{print a[1]}'
I am trying to match double quote with space but I have no idea why it is not printing the result. All the chars work fine except double quotes.
What I am doing wrong? Awk is not a must here, I am open to suggestions.
Thanks.
It has been quite a while since I regularly used awk but if I remember correctly match() takes only 2 args and END{} happens only once, not for every line like I think you want. Something like:
awk '{match($0,/g-t="([^\"]+")/); print substr($0, RSTART, RLENGTH)}' dataFile
may be closer to what you had in mind?
A brute force Perl one-liner could look something like this:
perl -lne 'if (m/ID:(\S+) g-t="([^"]+)" rfid="([^"]+)" t-link="([^"]+)",User (.*) (http:.*)/){print "$1|$2|$3|$4|$5|$6"}' dataFile
and demonstrates getting all of the fields data separated by OR bars. You can move the () groups around to get more or less of the text you want for each resultant $1, $2 etc... See perldoc perl for more information.

Shell script, split string into everything before and after the last whitespace character

I'm having some issues with separating a string in a shell script. I've been trying similar bits of code I've found online for RegEx, perl, awk, grep etc... but I can't seem to get the required result.
Basically I have a number of strings. Most are in the following format:
long string, space, number e.g.
Something!Something_Something_#Something_Something 10
However a small number aren't all the one string (they should be!) but they have spaces instead of underscores, e.g.
Something!Something_Something_#Something Something 10
or
Something!Something - Something_#Something Something 10
Each string is then formatted as follows:
... |awk '{printf "%-100s %10d\n", $1, $2}' > file.out
which prints the correct result for the strings which contain no spaces
Something!Something_Something_#Something_Something 10
However in the case of the first example it only prints the following due to the space delimiter:
Something!Something_Something_#Something 10
So basically I need a way to pull out everything before the last " " space and assign it to $1 in the awk printf statement. Any help would be greatly appreciated!!!
It's a Solaris 5.10 server by the way.
Hackjob but this will work
awk '{x=$NF;NF--;printf "%-100s %10d\n", $0, x}'

Problem with regular expression using grep

I've got some textfiles that hold names, phone numbers and region codes. One combination per line.
The syntax is always "Name Region_code number"
With any number of spaces between the 3 variables.
What I want to do is search for specific region codes, like 23 or 493, forexample.
The problem is that these numbers might appear in the longer numbers too, which might enable a return that shouldn't have been returned.
I was thinking of this sort of command:
grep '04' numbers.txt
But if I do that, a line that contains 04 in the number but not as region code will show as a result too... which is not correct.
I'm sure you are about to get buried in clever regular expressions, but I think in this case all you need to do is include one of the spaces on each side of your region code in the grep.
grep ' 04 ' numbers.txt
I'd do:
awk '$2 == "04"' < numbers.txt
and with grep:
grep -e '^[^ ]*[ ]*04[ ]*[^ ]*$' numbers.txt
If you want region codes alone, you should use:
grep "[[:space:]]04[[:space:]]"
this way it will only look for numbers on the middle column, while start or end of strings are considered word breaks.
You can even do:
function search_region_codes {
grep "[[:space:]]${1}[[:space:]]" FILE
}
replacing FILE with the name of your file,
and use
search_region_codes 04
or even
function search_region_codes {
grep "[[:space:]]${1}[[:space:]]" $2
}
and using
search_region_codes NUMBER FILE
Are you searching for an entire region code, or a region code that contains the subpattern?
If you want the whole region code, and there is at least one space on either side, then you can format the grep by adding a single space on either side of the specific region code. There are other ways to indicate word boundaries using regular expressions.
grep ' 04 ' numbers.txt
If there can be spaces in the name or phone number fields, than that solution might not work. Also, if you the pattern can be a sub-part of the region code, then awk is a better tool. This assumes that the 'name' field contains no spaces. The matching operator '==' requires that the pattern exactly match the field. This can be tricky when there is whitespace on either side of the field.
awk '$2 == "04" {print $0}' < numbers.txt
If the file has a delimiter, than can be set in awk using the '-F' argument to awk to set the field separator character. In this example, a comma is used as the field separator. In addition, the matching operator in this example is a '~' allowing the pattern to be any part of the region code (if that is applicable). The "/y" is a way to match work boundaries at the beginning and end of the expression.
awk -F , '$2 ~ /\y04\y/ {print $0}' < numbers.txt
In both examples, the {print $0} is optional, if you want the full line to be printed. However, if you want to do any formatting on the output, that can be done inside that block.
use word boundaries. not sure if this works in grep, but in other regex implementations i'd surround it with whitespace or word boundary patterns
'\s+04\s+' or '\b04\b'
Something like that