Gawk regexp to select sequence - regex

Sorry for the nth simple question on regexps, but I'm not able to get what I need without what seems to me an overly complicated solution. I'm parsing a file containing sequences of only 3 letters, A, E and D, as in
AADDEEDDA
EEEEEEEE
AEEEDEEA
AEEEDDAAA
and I'd like to identify only those that start with E and end in D with only one change in the sequence, for example
EDDDDDDDD
EEEDDDDDD
EEEEEEEED
I'm fighting with the proper regexp to do that. Here is my last attempt:
echo "1,AAEDDEED,1\n2,EEEEDDDD,2\n3,EDEDEDED" | gawk -F, '{if($2 ~ /^E[(ED){1,1}]*D$/ && $2 !~ /^E[(ED){2,}]*D$/) print $0}'
which does not work. Any help?
Thanks in advance.

If I understand your request correctly, a simple
awk '/^E+D+$/' file.input
will do the trick.
UPDATE: if the line format contains leading and trailing numbers (the trailing one optional), as shown later in the example, this is a possible pure-regex adaptation (an alternative to using the -F, field-separator switch):
awk '/^[0-9]+,E+D+(,[0-9]+)?$/' input.test

First of all, you need the regular expression:
^E+[^ED]*D+$
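This matches one or more Es at the beginning, zero or more characters that are neither E nor D in the middle, and one or more Ds at the end. Note that the middle class also lets through lines such as EAAD; if only a single change from E to D is allowed, drop it and use ^E+D+$.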
Then your AWK program will look like
$2 ~ /^E+[^ED]*D+$/
$2 refers to the 2nd field of the current record, ~ is the regex matching operator, and /s delimit a regular expression. Together, these components form what is known in AWK jargon as a "pattern", which amounts to a boolean filter for input records. Note that there is no "action" (a series of statements in {s) specified here. That's because when no action is specified, AWK assumes that the action should be { print $0 }, which prints the entire line.

If I understand you correctly, you want to match patterns that start with at least one E and then continue with at least one D until the end.
echo "1,AAEDDEED,1\n2,EEEEDDDD,2\n3,EDEDEDED" | gawk -F, '{if($2 ~ /^E+D+$/) print $0}'
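To try the E+D+ filter from these answers on the asker's sample, something like the following should work. It is only a sketch: it uses printf rather than echo, because not every shell's echo expands \n, and the variable names sample and matches are made up for the illustration.

```shell
# Sample records in the asker's "id,sequence,id" layout.
sample=$(printf '1,AAEDDEED,1\n2,EEEEDDDD,2\n3,EDEDEDED\n')

# Keep only records whose 2nd field is a run of E followed by a run of D.
# With no action block, awk prints the matching record.
matches=$(printf '%s\n' "$sample" | awk -F, '$2 ~ /^E+D+$/')

printf '%s\n' "$matches"
```

Only the second record should survive, since it is the only one with a single change from E to D.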

awk Regular Expression (REGEX) get phone number from file

The following is what I have written that would allow me to display only the phone numbers
in the file. I have posted the sample data below as well.
As I understand (read from left to right):
Using awk with "," as the delimiter: if the line starts with an int, followed by one of [-,:] and an int, followed by one of [-,:] and an int, show the 3rd column.
I used "www.regexpal.com" to validate my expression. I want to learn more and an explanation would be great not just the answer.
GNU bash, version 4.4.12(1)-release (x86_64-pc-linux-gnu)
awk -F "," '/^(\d)+([-,:*]\d+)+([-,:*]\d+)*$/ {print $3}' bashuser.csv
bashuser.csv
Jordon,New York,630-150,7234
Jaremy,New York,630-250-7768
Jordon,New York,630*150*7745
Jaremy,New York,630-150-7432
Jordon,New York,630-230,7790
Expected Output:
6301507234
6302507768
....
You could just remove all non-digit characters:
awk '{gsub(/[^[:digit:]]/, "")}1' file.csv
gsub replaces every match on the line.
[^[:digit:]]: the ^ negates the class, so it matches everything except a digit ([[:digit:]]); if you remove the ^, the reverse happens.
"" is the replacement; an empty string deletes the matched text.
1 is an always-true pattern, a shortcut for print.
In sed:
sed 's/[^[:digit:]]*//g' file.csv
Since your desired output always appears to start on field #3, you can simplify your regex considerably using the following:
awk -F '[*,-]' '{print $3$4$5}'
Proof of concept
$ awk -F '[*,-]' '{print $3$4$5}' < ./bashuser.csv
6301507234
6302507768
6301507745
6301507432
6302307790
Explanation
-F '[*,-]': Use a character class to set the field separators to * OR , OR -.
print $3$4$5: Concatenate the 3rd through 5th fields.
awk is not very suitable here because the comma occurs not only as a field separator but also inside the phone number; sed gives better results:
sed 's/[^,]\+,[^,]\+,//;s/[^0-9]//g;' bashuser.csv
The first part, s/[^,]\+,[^,]\+,//, removes the first two fields.
The second part, s/[^0-9]//g, removes all remaining non-numeric characters.
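The same two steps can also be done inside awk without relying on field splitting at all. This is a minimal sketch on one line of the sample data; the variable names line and phone are assumptions for the example.

```shell
line='Jordon,New York,630-150,7234'

phone=$(printf '%s\n' "$line" | awk '{
    sub(/^[^,]*,[^,]*,/, "")   # drop the name and city fields
    gsub(/[^0-9]/, "")         # delete every remaining non-digit
    print
}')

printf '%s\n' "$phone"
```

Because the default field separator is never used, the stray comma inside the phone number does no harm.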

Extract multiple independent regex matches per line

For the file below, I want to extract the two strings following "XC:Z:" and "XM:Z:". For example:
1st line output should be this: "TGGTCGGCGCGT, GAGTCCGT"
2nd line output should be this: "GAAGCCGCTTCC, ACCGACGG"
The original version of the file has a few more columns and millions of rows than the following example, but it should give you the idea:
MOUSE_10 XC:Z:TGGTCGGCGCGT RG:Z:A XM:Z:GAGTCCGT ZP:i:33
MOUSE_10 XC:Z:GAAGCCGCTTCC NM:i:0 XM:Z:ACCGACGG AS:i:16
MOUSE_10 ZP:i:36 XC:Z:TCCCCGGGTACA NM:i:0 XM:Z:GGGACGGG ZP:i:28
MOUSE_10 XC:Z:CAAATTTGGAAA RG:Z:A NM:i:1 XM:Z:GCAGATAG
In addition, each of following criteria would be a bonus but is not mandatory if you can get it to work:
use standard bash tools: awk, sed, grep, etc. (no GAWK, csvtools,...)
assume we don't know the order in which XC and XM appear (although I'm fairly certain XC is almost first, but I am unsure how to check). In the output, however, the XC-string should always be before the XM-string, if at all possible.
The answers from here awk extract multiple groups from each line come awfully close to it, but whenever I try using match(...) I get a "syntax error near unexpected token" message.
Looking forward to your solutions!
Thanks,
Felix
With sed you can capture non-space characters after XC:Z: and XM:Z:
sed -n 's/.*XC:Z:\([^[:blank:]]*\).*XM:Z:\([^[:blank:]]*\).*/\1, \2/p;' file
You can add a second s command for reversed values:
sed -n 's/.*XC:Z:\([^[:blank:]]*\).*XM:Z:\([^[:blank:]]*\).*/\1, \2/;s/.*XM:Z:\([^[:blank:]]*\).*XC:Z:\([^[:blank:]]*\).*/\1, \2/;p;' file
The following awk solution may help.
awk '
/XC:Z:/{
match($0,/XC:[^ ]*/);
num=split(substr($0,RSTART,RLENGTH),a,":");
match($0,/XM:[^ ]*/);
num1=split(substr($0,RSTART,RLENGTH),b,":");
print a[num],b[num1]
}' Input_file
Output will be as follows.
TGGTCGGCGCGT GAGTCCGT
GAAGCCGCTTCC ACCGACGG
TCCCCGGGTACA GGGACGGG
CAAATTTGGAAA GCAGATAG
If we don't know the order in which XC and XM appear, you can try this sed:
sed -E 'h;s/(XC:Z:.*XM:Z:)//;tA;x;s/(.*XM:Z:)([^[:blank:]]*)(.*XC:Z:)([^[:blank:]]*)(.*)/\4,\2/;b;:A;x;s/(.*XC:Z:)([^[:blank:]]*)(.*XM:Z:)([^[:blank:]]*)(.*)/\2,\4/' infile
Explanation:
sed -E '
h
# keep the line in the hold space
s/(XC:Z:.*XM:Z:)//;x;tA
# if XCZ comes before XMZ, go to A, but first restore the pattern space with x
s/(.*XM:Z:)([^[:blank:]]*)(.*XC:Z:)([^[:blank:]]*)(.*)/\4,\2/
# XMZ comes before XCZ: get the interesting parts and reorder them
b
# done with this line
:A
s/(.*XC:Z:)([^[:blank:]]*)(.*XM:Z:)([^[:blank:]]*)(.*)/\2,\4/
# XCZ comes before XMZ: get the interesting parts
' infile
Another awk:
$ awk '{c=p=""; # need to reset c and p before each line
for(i=1;i<=NF;i++) # for all fields in the line
if($i~/^XC:Z:/) c=substr($i,6) # check pattern from the start of field
else if($i~/^XM:Z:/) p=substr($i,6) # if it didn't match, check the other pattern
if(c && p) print c,p}' file # if both matched print
TGGTCGGCGCGT GAGTCCGT
GAAGCCGCTTCC ACCGACGG
TCCCCGGGTACA GGGACGGG
CAAATTTGGAAA GCAGATAG
This will print the last matches if there are multiple instances on the same line. Here is another one with slightly different characteristics.
$ awk 'function s(x) {return ($i~x)?substr($i,6):""}
{c=p="";
for(i=1;i<=NF;i++) {
c=c?c:s("^XC:Z:"); p=p?p:s("^XM:Z:");
if(c && p)
{print c,p; next}}}' file
TGGTCGGCGCGT GAGTCCGT
GAAGCCGCTTCC ACCGACGG
TCCCCGGGTACA GGGACGGG
CAAATTTGGAAA GCAGATAG
This keeps the first match of each tag and stops as soon as both have been found. If they appear in pairs, it will print the first pair.
Using POSIX awk, you can only use the string-function match(s,ere) as defined by IEEE Std 1003.1-2008 :
match(s, ere)
Return the position, in characters, numbering from 1, in
string s where the extended regular expression ere occurs, or zero if
it does not occur at all. RSTART shall be set to the starting position
(which is the same as the returned value), zero if no match is found;
RLENGTH shall be set to the length of the matched string, -1 if no
match is found.
The patterns you want to match are XM:Z:[^[:blank:]]* and XC:Z:[^[:blank:]]*. This however assumes you do not have any string which contains something like PXM:Z: (i.e. an extra non-blank character preceding the searched string). When the pattern is found in the line $0, you only need to extract the important part, which starts 5 characters later.
The following code does the above:
awk '{match($0,/XM:Z:[^[:blank:]]*/);xm=substr($0,RSTART+5,RLENGTH-5)}
{match($0,/XC:Z:[^[:blank:]]*/);xc=substr($0,RSTART+5,RLENGTH-5)}
{print xc","xm}' <file>
As you can see, the first line extracts XM, the second XC and the third prints the outcome with comma-separator ",".
Remark - The following assumptions are made here :
each line contains both an xm and xc string
no strings of the type [^[:blank:]]X[CM]:Z:[^[:blank:]]* exist
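If those two assumptions are a concern, one possible way around them in POSIX awk is to test each whitespace-separated field for the tag prefix instead of searching the whole line. The sketch below is not the answer's method; the shell function extract_tags and the awk helper tag are hypothetical names, and lines lacking either tag are silently skipped.

```shell
extract_tags() {
    awk '
    # Hypothetical helper: return the value following "name:Z:" in the
    # first field that starts with that prefix, or "" when there is none.
    function tag(name,    i, prefix) {
        prefix = name ":Z:"
        for (i = 1; i <= NF; i++)
            if (index($i, prefix) == 1)
                return substr($i, length(prefix) + 1)
        return ""
    }
    {
        xc = tag("XC"); xm = tag("XM")
        # Print XC first regardless of the order on the input line.
        if (xc != "" && xm != "") print xc ", " xm
    }' "$@"
}

# Made-up line with XM before XC, to show that the output order is fixed.
result=$(printf 'MOUSE_10 ZP:i:36 XM:Z:GGGACGGG NM:i:0 XC:Z:TCCCCGGGTACA\n' | extract_tags)
printf '%s\n' "$result"
```

Because index($i, prefix) == 1 only accepts a field that begins with the prefix, a field such as PXM:Z:... is ignored.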
If you are willing to use gawk, then you could use the patsplit function for string operations (Ref. here). You can do this with a single regex /X[CM]:Z:[^[:blank:]]*/. This gives you directly the requested strings in a single call, which still include the XM:Z: or XC:Z: part. Afterwards you can easily sort them and extract the last parts.
The following lines do exactly the same in gawk
gawk '{patsplit($0,a,/X[MC]:Z:[^[:blank:]]*/) }
{xc=(a[1]~/^XC/)?a[1]:a[2]; xm=(a[1]~/^XC/)?a[2]:a[1]}
{print substr(xc,6)","substr(xm,6)}' <file>
Nonetheless, I believe the awk solution is cleaner from a symmetric point of view.

how to grep exact string match across 2 files

I've UTF-8 plain text lists of usernames, 1 per line, in list1.txt and list2.txt. Note, in case pertinent, that usernames may contain regex characters e.g. ! ^ . ( and such as well as spaces.
I want to get and save to matches.txt a list of all unique values occurring in both lists. I've little command line expertise but this almost gets me there:
grep -Ff list1.txt list2.txt > matches.txt
...but that is treating "jdoe" and "jdoe III" as a match, returning "jdoe III" as the matched value. This is incorrect for the task. I need the per-line pattern match to be the whole line, i.e. from ^ to $. I've tried adding the -x flag but that gets no matches at all (edit: see comment to accepted answer - I got the flag order wrong).
I'm on OS X 10.9.5 and I don't have to use grep - another command line (tool) solving the problem will do.
All you need to do is add the -x flag to your grep query:
grep -Fxf list1.txt list2.txt > matches.txt
The -x flag will restrict matches to full line matches (each PATTERN becomes ^PATTERN$). I'm not sure why your attempt at -x failed. Maybe you put it after the -f, which must be immediately followed by the first file?
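A quick way to see the difference -x makes is to run both commands on two tiny throwaway lists. The file contents below are made up for the illustration, and the variable names partial and exact are assumptions.

```shell
tmp=$(mktemp -d)
printf 'jdoe\nasmith\n' > "$tmp/list1.txt"
printf 'jdoe III\njdoe\nbob\n' > "$tmp/list2.txt"

# Fixed-string match anywhere in the line: "jdoe III" is reported too.
partial=$(grep -Ff "$tmp/list1.txt" "$tmp/list2.txt")

# Fixed-string match of the whole line only.
exact=$(grep -Fxf "$tmp/list1.txt" "$tmp/list2.txt")

printf 'without -x:\n%s\nwith -x:\n%s\n' "$partial" "$exact"
rm -rf "$tmp"
```

If the lists may contain repeated names, piping the result through sort -u is one way to get the unique values the question asks for.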
awk can be handier than grep here:
awk 'FNR==NR{a[$0]; next} $0 in a' list1.txt list2.txt > matches.txt
$0 is the line, FNR is the current line number within the current file, and NR is the overall line number (they are only the same while you are in the first file). a[$0] is an associative array (hash) whose key is the line. next ensures that further clauses (the $0 in a) will not run if the current clause (the fact that this is the first file) did. $0 in a is true when the current line is a key in the array a, thus only lines present in both files are displayed. The order will be their order of occurrence in the second file.
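Since the question asks for unique values, the same idea can be extended with a second array that remembers what has already been printed. This is a sketch under that assumption; the array name seen and the sample lists are invented for the example.

```shell
tmp=$(mktemp -d)
printf 'jdoe\nasmith\n' > "$tmp/list1.txt"
printf 'jdoe III\njdoe\nasmith\njdoe\n' > "$tmp/list2.txt"

common=$(awk 'FNR == NR { a[$0]; next }   # first file: remember each line
              ($0 in a) && !seen[$0]++    # second file: print each shared line once
             ' "$tmp/list1.txt" "$tmp/list2.txt")

printf '%s\n' "$common"
rm -rf "$tmp"
```

One caveat of the FNR == NR idiom: if the first file is empty, the condition stays true for the second file as well, so nothing is printed.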
A very simple and straightforward way to do it that doesn't require one to do all sorts of crazy things with grep is as follows
cat list1.txt list2.txt|grep match > matches.txt
Not only that, but it's also easier to remember, (especially if you regularly use cat).
grep -Fwf file1 file2 would match whole words.

How to fetch the matched items using awk and regexp?

I am trying to parse "/boot/grub/grubenv" but really not very good at regexp.
Suppose the content of /boot/grub/grubenv is:
saved_entry=1
I want to output the number "1", like below. I am currently using "awk", but open to other tools.
$ awk '/^(saved_entry=)([0-9]+)/ {print $2}' /boot/grub/grubenv
But it is obviously not working; thanks for the help.
Specify a field separator with -F option:
awk -F= '/^saved_entry=/ {print $2}' /boot/grub/grubenv
$1, $2, ... here represent fields (separated by =), not backreferences to captured groups.
If you want to match things, it is probably best to use match!
This works even if there are more fields afterwards, and it does not need you to change the field separator (in case you are doing any other stuff with the data).
The only drawback of this method is that it only finds the left-most match in the record, so if the data appears twice in the same record (line), only the first occurrence is matched. Note that the three-argument form of match used here is a gawk extension.
awk 'match($0,/^(saved_entry=)([0-9]+)/,a){print a[2]}' file
Example
input
saved_entry=1 blah blah more stuff
output
1
Explanation
Matches the regex against $0 (the record) and then stores anything in parentheses as separate array elements.
For the example, the array would contain:
a[0] is saved_entry=1
a[1] is saved_entry=
a[2] is 1
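Where gawk is not available, a similar result can be had with the two-argument match plus RSTART and RLENGTH. This is a sketch, not part of the answer above; it hard-codes 12 as the length of the "saved_entry=" prefix, and the variable names line and entry are assumptions.

```shell
line='saved_entry=1 blah blah more stuff'

entry=$(printf '%s\n' "$line" | awk '
    match($0, /^saved_entry=[0-9]+/) {
        # RSTART/RLENGTH delimit the whole match; skip the 12-character
        # "saved_entry=" prefix to keep only the digits.
        print substr($0, RSTART + 12, RLENGTH - 12)
    }')

printf '%s\n' "$entry"
```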

Problem with regular expression using grep

I've got some textfiles that hold names, phone numbers and region codes. One combination per line.
The syntax is always "Name Region_code number"
With any number of spaces between the 3 variables.
What I want to do is search for specific region codes, like 23 or 493, for example.
The problem is that these numbers might appear in the longer numbers too, which might enable a return that shouldn't have been returned.
I was thinking of this sort of command:
grep '04' numbers.txt
But if I do that, a line that contains 04 in the number but not as region code will show as a result too... which is not correct.
I'm sure you are about to get buried in clever regular expressions, but I think in this case all you need to do is include one of the spaces on each side of your region code in the grep.
grep ' 04 ' numbers.txt
I'd do:
awk '$2 == "04"' < numbers.txt
and with grep:
grep -e '^[^ ]*[ ]*04[ ]*[^ ]*$' numbers.txt
If you want region codes alone, you should use:
grep "[[:space:]]04[[:space:]]"
This way it will only match numbers that have whitespace on both sides, i.e. the middle column; the start or end of a line does not count as whitespace here.
You can even do:
function search_region_codes {
grep "[[:space:]]${1}[[:space:]]" FILE
}
replacing FILE with the name of your file,
and use
search_region_codes 04
or even
function search_region_codes {
grep "[[:space:]]${1}[[:space:]]" "$2"
}
and using
search_region_codes NUMBER FILE
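The same kind of wrapper can be built around the awk field comparison instead of grep. This is only a sketch: search_region_awk is a hypothetical name, the sample file is made up, and appending "" to both sides is there to force a string comparison, because a value passed with -v that looks numeric would otherwise be compared as a number.

```shell
search_region_awk() {
    # $1 = region code, $2 = file. Appending "" forces a string comparison,
    # so code 04 does not also match a field containing just 4.
    awk -v code="$1" '($2 "") == (code "")' "$2"
}

tmp=$(mktemp)
printf 'Alice   04   5550412\nBob  23  0455504\nCarol 4 777\n' > "$tmp"

hits=$(search_region_awk 04 "$tmp")
printf '%s\n' "$hits"
rm -f "$tmp"
```

Only the line whose second field is exactly 04 should come back; the 04 inside Bob's phone number is ignored because it is in the third field.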
Are you searching for an entire region code, or a region code that contains the subpattern?
If you want the whole region code, and there is at least one space on either side, then you can format the grep by adding a single space on either side of the specific region code. There are other ways to indicate word boundaries using regular expressions.
grep ' 04 ' numbers.txt
If there can be spaces in the name or phone number fields, then that solution might not work. Also, if the pattern can be a sub-part of the region code, then awk is a better tool. The following assumes that the 'name' field contains no spaces. The matching operator '==' requires that the pattern exactly match the field. This can be tricky when there is whitespace on either side of the field.
awk '$2 == "04" {print $0}' < numbers.txt
If the file has a delimiter, it can be set with awk's '-F' argument, which sets the field separator character. In this example, a comma is used as the field separator. In addition, the matching operator in this example is '~', allowing the pattern to be any part of the region code (if that is applicable). The "\y" (a gawk extension) is a way to match word boundaries at the beginning and end of the expression.
awk -F , '$2 ~ /\y04\y/ {print $0}' < numbers.txt
In both examples the {print $0} is optional if you just want the full line to be printed. However, if you want to do any formatting on the output, that can be done inside that block.
Use word boundaries. I'm not sure if this works in grep, but in other regex implementations I'd surround it with whitespace or word-boundary patterns:
'\s+04\s+' or '\b04\b'
Something like that.