Match last name using awk

Say there's a file like this
1 | John Smith | 70000
2 | Al McSmith | 60000
If I use
awk -F"|" '$2~/Smith/' file
both rows are matched.
Is there a way to only match John Smith? (USING AWK ONLY)
EDIT: I'm trying to match the people that have Smith as their last name, without matching McSmith, or O'Smith, etc.

This may work for you:
awk -F'|' '$2~/ Smith\s*$/' file
It won't match:
fooSmith
Smithfoo
foo Smith is middlename

Just stick a Space before Smith:
awk -F'|' '$2~/ Smith/' testfile
If there is a name like John Smitherton in there, then stick a space after as well (since it looks like you have <space><delim><space> between each field). Otherwise you can get a little fancier with the regex, but your space padding is pretty useful here.
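A minimal awk-only sketch of that idea, assuming (as in the sample) that the name is always the second |-delimited field and the surname is its last word. The function name last_name_is is made up for illustration. Rather than relying on padding inside the regex, it splits the field into words and compares the last one literally:

```shell
# Hypothetical helper: print rows whose last name (last word of field 2) equals $1.
last_name_is() {
    awk -F'|' -v want="$1" '{
        # A single-space separator makes split() ignore leading/trailing blanks.
        n = split($2, words, " ")
        if (n > 0 && words[n] == want) print
    }'
}

printf '%s\n' '1 | John Smith | 70000' '2 | Al McSmith | 60000' \
              '3 | John Smitherton | 50000' | last_name_is Smith
```

With this input only the John Smith row is printed; McSmith and Smitherton are rejected because the comparison is on the whole word.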

Another solution using grep
grep -E "[^|]*\|[^|]*\<Smith\>"
explanation
[^|] match any character except |
\| match with |
\< \> start and end of word

I ran a test. I created a file, test.in, with your content:
1 | John Smith | 70000
2 | Al McSmith | 60000
Then tried another expression:
awk -F'|' '{print $2~/\sSmith\s/}' test.in
It prints:
1
0
So, 1 for Smith, 0 for McSmith.
[UPD] \s is a gawk-specific extension; it is not part of POSIX awk.
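For awks without \s, a sketch of a portable variant, assuming the padding is made of spaces or tabs and that the surname is the last word of field 2. The helper name smith_rows is hypothetical:

```shell
# Hypothetical helper: match Smith as the last word of field 2 without \s.
# (^|[ \t]) requires start-of-field or a blank before Smith;
# [ \t]*$ allows only trailing blanks after it.
smith_rows() {
    awk -F'|' '$2 ~ /(^|[ \t])Smith[ \t]*$/'
}

printf '%s\n' '1 | John Smith | 70000' '2 | Al McSmith | 60000' | smith_rows
```

Here the regex selects rows instead of printing 1 or 0, so only the John Smith row is output.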

Related

How to grep/perl/awk overlapping regex

Trying to pipe a string into a grep/perl regex to pull out overlapping matches. Currently, the results only appear to pull out sequential matches without any "lookback":
Attempt using egrep (both on GNU and BSD):
$ echo "bob mary mike bill kim jim john" | egrep -io "[a-z]+ [a-z]+"
bob mary
mike bill
kim jim
Attempt using perl style grep (-P):
$ echo "bob mary mike bill kim jim john" | grep -oP "()[a-z]+ [a-z]+"
bob mary
mike bill
kim jim
Attempt using awk showing only the first match:
$ echo "bob mary mike bill kim jim john" | awk 'match($0, /[a-z]+ [a-z]+/) {print substr($0, RSTART, RLENGTH)}'
bob mary
The overlapping results I'd like to see from a simple working bash pipe command are:
bob mary
mary mike
mike bill
bill kim
kim jim
jim john
Any ideas?

Lookahead is your friend here
echo "bob mary mike bill kim jim john" |
perl -wnE'say "$1 $2" while /(\w+)\s+(?=(\w+))/g'
The point is that lookahead, as a "zero-width assertion," doesn't consume anything -- while it still allows us to capture a pattern in it.
So as the regex engine matches a word and spaces ((\w+)\s+), gobbling them up, it then stops there and "looks ahead," merely to "assert" that the sought pattern is there; it doesn't move from its spot between the last space and the next \w, doesn't "consume" that next word, as they say.
It is nice, though, that we can also capture the pattern that is "seen," even though it's not consumed! So we get our $1 and $2, two words.
Then, because of the /g modifier, the engine moves on to find another word+spaces, with yet another word following. That next word is the one our lookahead spotted -- so now that one is consumed, and the one after it is "looked" for (and captured). Etc.
See Lookahead and lookbehind assertions in perlretut
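awk has no lookahead, but the same "consume one word, peek at the next" behaviour can be imitated by hand. The following is only a sketch, assuming lowercase words separated by single spaces as in the question; the function name overlapping_pairs is made up:

```shell
# Hypothetical helper: print every overlapping "word word" pair on each line.
overlapping_pairs() {
    awk '{
        rest = $0
        # Find the next pair, print it, then resume right after its first word,
        # so the second word is still available to start the following pair.
        while (match(rest, /[a-z]+ [a-z]+/)) {
            pair = substr(rest, RSTART, RLENGTH)
            print pair
            rest = substr(rest, RSTART + index(pair, " "))
        }
    }'
}

echo "bob mary mike bill kim jim john" | overlapping_pairs
```

Advancing by index(pair, " ") skips only the first word and its space, which plays the role of the part the Perl regex consumes.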

Use the Perl one-liners below, which avoid the lookahead (which can still be your friend):
For whitespace-delimited words:
echo "bob mary mike bill kim jim john" | perl -lane 'print "$F[$_] $F[$_+1]" for 0..($#F-1);'
For words defined as \w+ in Perl, delimited by the non-word characters \W+:
echo "bob.mary,mike'bill kim jim john" | perl -F'/\W+/' -lane 'print "$F[$_] $F[$_+1]" for 0..($#F-1);'
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in -F option.
-F'/\W+/' : Split into @F on \W+ (one or more non-word characters), rather than on whitespace.
$#F : the last index of the array @F, into which the input line is split.
0..($#F-1) : the range of indexes (numbers), from the first (0) to the penultimate ($#F-1) index of the array @F.
$F[$_] and $F[$_+1]: two consecutive elements of the array @F, with indexes $_ and $_+1, respectively.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlre: Perl regular expressions (regexes)
perldoc perlre: Perl regular expressions (regexes): Quantifiers; Character Classes and other Special Escapes; Assertions; Capture groups
perldoc perlrequick: Perl regular expressions quick start

You can also use awk
awk '{for(i=1;i<NF;i++) print $i,$(i+1)}' <<< 'bob mary mike bill kim jim john'
This solution iterates over all whitespace-separated fields and prints the current field ($i), the output field separator (a space here), and the next field ($(i+1)).
Or, another perl solution that uses a very common technique to capture the overlapping pattern inside a positive lookahead:
perl -lane 'while (/(?=\b(\p{L}+\s+\p{L}+))/g) {print $1}' <<< 'bob mary mike bill kim jim john'
Details:
(?= - start of a positive lookahead
\b - a word boundary
(\p{L}+\s+\p{L}+) - capturing group 1: one or more letters, one or more whitespaces, one or more letters
) - end of the lookahead.
Here, only Group 1 values are printed ({print $1}).
Performance consideration
Among the Perl solutions here, mine turns out to be the slowest and Timur's the fastest; however, the awk solution is faster than any of the Perl solutions. Results:
# ./wiktor_awk.sh
real 0m17.069s
user 0m12.264s
sys 0m5.314s
# ./timur_perl.sh
real 0m18.201s
user 0m15.612s
sys 0m6.139s
# ./zdim.sh
real 0m23.559s
user 0m19.883s
sys 0m7.359s
# ./wiktor_perl.sh
real 2m12.528s
user 1m52.857s
sys 0m20.201s
Note I created *.sh files for each solution like
#!/bin/bash
N=10000
time(
for i in $(seq 1 $N); do
<SOLUTION_HERE> &>/dev/null;
done)
and ran for f in *.sh; do chmod +x "$f"; done to make them executable.

Replace the first occurrence with sed

As in the example below, I want to keep only the words before the first 'John'.
However, the pattern I applied seems to match John from the end of the line backwards, so I need to call sed twice.
What is the correct way to do this?
PATTERN="I am John, you are John also"
OUTPUT=$( echo "$PATTERN" | sed -r "s/(^.*)([ \t]*John[ ,\t]*)(.*$)/\1/" )
echo "$OUTPUT"
OUTPUT=$( echo "$OUTPUT" | sed -r "s/(^.*)([ \t]*John[ ,\t]*)(.*$)/\1/" )
echo "$OUTPUT"
I expect to call sed only once, since it becomes a problem if "John" appears several times.
The procedure above first matches and trims from the final John, then from the first John, producing:
I am John, you are
I am
I want to run sed once and get:
I am

The following sed command may help:
echo "I am John, you are John also" | sed 's/ John.*//'
Or with variables:
pattern="I am John, you are John also"
output=$(echo "$pattern" | sed 's/ John.*//')
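Why the original pattern needed two passes: a leading .* is greedy, so it takes as much of the line as it can and the literal John ends up matching the last occurrence. A small sketch contrasting the two behaviours, assuming a sed that accepts -E for extended regexes (GNU and BSD sed both do):

```shell
s="I am John, you are John also"

# Greedy: the leading .* swallows as much as possible, so John matches the LAST occurrence.
echo "$s" | sed -E 's/^(.*)John.*$/\1/'

# No leading .*: the match starts at the FIRST John and runs to the end of the line.
echo "$s" | sed -E 's/[[:space:]]*John.*//'
```

The first command leaves "I am John, you are " (with a trailing space); the second leaves "I am".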

Another way of doing it is to use the grep command in Perl mode:
echo "I am John, you are John also" | grep -oP '^(?:(?!John).)*';
I am
#there will be a space at the end
echo "I am John, you are John also" | grep -oP '^(?:(?!John).)*(?=\s)';
I am
#there is no space at the end
Regex explanations:
^(?:(?!John).)*
This matches all characters from the beginning of the line up to, but not including, the first John.

Awk solution:
s="I am John, you are John also and there is another John there"
awk '{ sub(/[[:space:]]+John.*/, "") }1' <<<"$s"
The output:
I am

Replacing newline after a pattern in unix

I want to replace a newline with space after a pattern. For example my text is:
1.
good movie
(2006)
This is a world class movie for music.
Dir:
abc
With:
lan
,
cer
,
cro
Comedy
|
Drama
|
Family
|
Musical
|
Romance
120 mins.
53,097
I want above text to become something like this
1. good movie (2006) This is a world class movie for music. Dir: abc With: lan, cer, cro Comedy | Drama | Family | Musical | Romance 120 mins

After the question update, the requirements for the solution changed:
cat test.txt | tr '\n' ' ' | perl -ne 's/(?<!\|) ([A-Z])/\n\1/g; print' | sed 's/ ,/,/g' | sed 's/ \([0-9]\+\)/\n\1/g'; echo
output:
1. good movie (2006)
This is a world class movie for music.
Dir: abc
With: lan, cer, cro
Comedy | Drama | Family | Musical | Romance
120 mins.
Explanation:
First, I replace all newline characters with spaces using tr.
Second, I replace the space before every capital letter with a newline, unless that space is preceded by a pipe symbol ("|").
Third, I correct the spacing before commas.
Last, I move the duration onto a new line.
The echo at the very end appends a newline to the output.
Deprecated:
Building on kpie's comment, I suggest the following solution:
cat test.txt | sed ':a;N;$!ba;s/\n//g' | sed 's/\([A-Z]\)/\n\1/g'
I pasted your input into test.txt.
The first sed replacement is explained here: https://stackoverflow.com/a/1252191/1863086
The second one puts a newline before every capital letter.
EDIT:
Another possibility using tr:
cat test.txt | tr -d '\n' | sed 's/\([A-Z]\)/\n\1/g'; echo
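If the single-line form shown in the question is all that is needed, a shorter sketch is to join the lines with spaces and then tidy the commas. This assumes POSIX paste and sed; the helper name join_lines is hypothetical:

```shell
# Hypothetical helper: join all input lines with single spaces,
# then remove the space that ends up before each comma.
join_lines() {
    paste -s -d ' ' - | sed 's/ ,/,/g'
}

printf '%s\n' 'With:' 'lan' ',' 'cer' ',' 'cro' | join_lines
```

For this fragment the result is "With: lan, cer, cro". Unlike tr, paste -s terminates the joined line with a single newline, so no trailing echo is needed.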

awk matches regex characters it shouldn't

My awk program does some odd character matching. Could you please explain what's going on or point me to the relevant documentation?
Input file
| 29900 | St. James | ...
| 33010 | Boole / Kirk | ...
awk
awk '/\| ([0-9]{5}) \| ([^\|]*)/{print $2 $4}' input-file.txt
Result
29900St.
33010Boole
Why is the first capturing group $1 the leading |? Usually $0 is the entire match and $1 is the first group.
Why does ([^\|]*) stop at . and / instead of reading on? I basically tell it "all characters that are not |" after all.

By default, awk separates columns by whitespace, so for the record
| 29900 | St. James | ...
we have $1="|", $2="29900", $3="|", $4="St.", $5="James", $6="|" and $7="..."
Additionally, unlike Perl, awk does not store the contents of capturing parentheses anywhere (gawk can, though, through the optional third array argument of match()).
Seeing as you want to use pipes as separators, I'd suggest:
awk -F '[[:blank:]]*\\|[[:blank:]]*' -v OFS=, '$2 ~ /[0-9]{5}/ {print $2,$3}'
29900,St. James
33010,Boole / Kirk
If you're confused about seeing $2 and $3 in there instead of $1 and $2, consider that a field separator, by definition, separates two fields and must have a field before it and after it. The first field separator shows up at the beginning of each line, therefore there must be a field consisting of an empty string before it: $1 will be the empty string.
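To see the empty first field directly, here is a small sketch that dumps every field of one sample record. It simplifies the separator to ' *[|] *' (spaces only, with the pipe in a bracket expression) purely for illustration, and the helper name show_fields is made up:

```shell
# Hypothetical helper: print the field count and each field in brackets.
show_fields() {
    awk -F ' *[|] *' '{
        printf "NF=%d\n", NF
        for (i = 1; i <= NF; i++) printf "$%d=[%s]\n", i, $i
    }'
}

echo '| 29900 | St. James | ...' | show_fields
```

This reports NF=4 with $1=[] (the empty string before the leading pipe), $2=[29900], $3=[St. James] and $4=[...].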

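awk doesn't provide a way to access capture groups; it uses $<number> to access the fields of each input record. It looks like you could do the following (the pipe is written as [|] because a backslash escape in the -F value may be consumed as a string escape before the regex is compiled, leaving a bare | that means alternation):
awk -F' *[|] *' '{print $2 $3}' input-file.txt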

Using regex to extract a substring while excluding a certain phrase

Say for the string:
test.1234.mp4
I would like to extract the numbers
1234
without extracting the 4 in mp4
What would the regex be for this?
The numbers aren't always in the second position and can be in different positions and might not always be four digits. I would like to extract the number without extracting the 4 in mp4 essentially.
More examples:
test.abc.1234.mp4
test.456.abc.mp4
test.aaa.bbb.c.111.mp4
test.e666.123.mp4
Essentially, only the standalone numbers would be extracted. Hence, for the last example, the 666 in e666 would not be extracted, only 123.
To extract I have been using
echo "example.123.mp4" | grep -o "REGEX"
Edit: test456 was meant to be test.456

The accepted answer will fail on "test.e666.123.mp4" (print 666).
This should work
$ cat | perl -ne '/\.(\d+)\./; print "$1\n"'
test.abc.1234.mp4
test.456.abc.mp4
test.aaa.bbb.c.111.mp4
test.e666.123.mp4
1234
456
111
123
Note that this will only print the first group of numbers, if we have test.123.456.mp4 only 123 will be printed.
The idea is to match a dot followed by numbers which we are interested in (parentheses to save the match), followed by another dot. This means that it will fail on 123.mp4.
To fix this you could have:
$ cat | perl -ne '/(^|\.)(\d+)\./; print "$2\n"'
test.abc.1234.mp4
test.456.abc.mp4
test.aaa.bbb.c.111.mp4
test.e666.123.mp4
781.test.mp4
1234
456
111
123
781
First match is either beginning of line (^) or a dot, followed by numbers and a dot. We use $2 here since $1 is either beginning of a line or a dot.
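An alternative sketch that avoids the anchoring question altogether: split each name on dots and print the parts that consist only of digits. This assumes the last part is always the extension (so it is skipped), and unlike the commands above it prints every numeric part, not just the first. The helper name numeric_parts is hypothetical:

```shell
# Hypothetical helper: print each dot-separated part that is purely numeric,
# ignoring the final part (assumed to be the file extension).
numeric_parts() {
    awk -F. '{
        for (i = 1; i < NF; i++)
            if ($i ~ /^[0-9]+$/) print $i
    }'
}

printf '%s\n' test.abc.1234.mp4 test.456.abc.mp4 test.aaa.bbb.c.111.mp4 \
              test.e666.123.mp4 781.test.mp4 | numeric_parts
```

For these five names it prints 1234, 456, 111, 123 and 781, one per line; e666 is skipped because the whole part must be digits.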

cut can make it:
$ echo "test.1234.mp4" | cut -d. -f2
1234
where -d'.' sets the field delimiter to a dot and -f2 selects the 2nd field.
If you provide more examples we can improve the output. With the current code you would extract any something in blablabla.something.blablabla.
Update: from your question update we can do this:
grep -o '\.[0-9]*\.' | sed 's/\.//g'
test:
$ echo "test.abc.1234.mp4
test456.abc.mp4
test.aaa.bbb.c.111.mp4
test.e666.123.mp4" | grep -o '\.[0-9]*\.' | sed 's/\.//g'
1234
111
123

grep -Po "(?<=\.)\d+(?=\.)"

echo "test.1234.mp4" | perl -lpe 's/[^.\d]+\d*//g;s/\D*(\d+).*/$1/'
or:
echo "1321.test.mp4" | perl -lpe 's/.*(?:^|\.)(\d+)\..*/$1/'
-p prints each line after processing, so no explicit print is needed.
-e says the code is given as an expression on the command line, not in a script file.
-l strips the trailing newline on input and adds it back on output.
These will also work if you have a number at the first part of the name.

perl -F'\.' -lane 'print "$F[scalar(@F)-2]" if(/\d+\.mp4$/)' your_file
tested:
> perl -F'\.' -lane 'print "$F[scalar(@F)-2]" if(/\d+\.mp4$/)' temp
1234
111
123

$ cat file
test.abc.1234.mp4
test.456.abc.mp4
test.aaa.bbb.c.111.mp4
test.e666.123.mp4
$ sed 's/.*\.\([0-9][0-9]*\)\..*/\1/' file
1234
456
111
123
