Print lines until the second field changes - regex

Let's say this is my command line output:
Mike US 11
John US 3
Dina US 1002
Dan US 44
Mike UK 552
Luc US 23
Jenny US 23
I want to print all lines starting from the first line and stop printing once the second field changes to something other than "US", even if more "US" lines appear after that. So I want the output to be:
Mike US 11
John US 3
Dina US 1002
Dan US 44
This is the code I have right now:
awk '$2 == "US"{a=1}$2 != "US"{a=0}a'
It works fine as long as there are no more "US" lines after the range I matched. With the input above, my current code outputs:
Mike US 11
John US 3
Dina US 1002
Dan US 44
Luc US 23
Jenny US 23
As you may notice, it dropped the "UK" line and kept printing, which is not what I'm trying to achieve here.

Here is a generic approach: it prints until the second field changes, regardless of what the second field contains.
awk '$2!=f && NR>1 {exit} 1; {f=$2}' file
Mike US 11
John US 3
Dina US 1002
Dan US 44
This just tests whether the second field is "US" and exits if not. Maybe closer to your question:
awk '$2!="US" {exit}1' file
Mike US 11
John US 3
Dina US 1002
Dan US 44
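To make the exit approach reproducible without a file, here is a minimal sketch that feeds the question's sample data on stdin (whitespace-separated fields assumed):

```shell
# Stop at the first record whose second field is not "US"; later "US"
# lines are never reached because awk exits on "Mike UK 552".
result=$(printf '%s\n' \
  'Mike US 11' \
  'John US 3' \
  'Dina US 1002' \
  'Dan US 44' \
  'Mike UK 552' \
  'Luc US 23' \
  'Jenny US 23' |
  awk '$2 != "US" {exit} 1')
printf '%s\n' "$result"
```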

I'm sure there is something more elegant, but this does the job:
awk 'BEGIN { P=1 } P == 1 && $2 != "US" { P = 0 }P' filename

This might work for you (GNU sed):
sed '/US/!Q' file
If a line does not contain US, quit; Q (GNU-specific) quits without printing the current line.
For specifically the second field:
sed '/^\S\+\s\+US\b/!Q' file
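Since Q is a GNU extension, a portable sketch of the same idea can use -n with q: with auto-print suppressed, q quits without printing, so the first line that does not contain "US" stops the output before its p is reached.

```shell
# POSIX-compatible equivalent of GNU sed's '/US/!Q'.
result=$(printf '%s\n' \
  'Mike US 11' 'John US 3' 'Dina US 1002' 'Dan US 44' \
  'Mike UK 552' 'Luc US 23' 'Jenny US 23' |
  sed -n '/US/!q;p')
printf '%s\n' "$result"
```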

Related

How do I fix a regex matching a few unexpected characters?

I am using a regex where, as a first preference, I intend to match the token (number or alphanumeric) immediately following the string "Lecture", and otherwise match the last token of the line when "Lecture" is absent.
Current regex
cat 1.txt | perl -ne 'print "$& \n" while /Lecture\h*\K\w+|^(?!.*Lecture).*\h\K[^.\s]+/g;/^.*?-(.*)/g' | perl -ne 'print "$& \n" while /(\d+\w*)/g'
The data is not very consistent: there may be spaces or hyphens around the string "Lecture" or around the end token, and a line may not end in .mp4.
My current regex works almost well; it only has issues with the bottom three lines. I could have included just those lines here, but I don't want the solution regex to break for the other cases, so I'm including all possibilities below.
cat 1.txt
54282068 Lecture74- AS 29 Question.mp4
174424104Lecture 74B - AS 29 Theory.mp4
Branch Accounts Lecture 105
Lecture05 - Practicals AS 28
Submissions 20.mp4
HW Section 77N
Residential status HWS Q.1 to 6 -60A
Residential status HWS Q.7 to 20 -60B
House property all HWS-60C
Salary HWS Q.11 to 13 - 60F
Salary HWS Q.1 to 5-60D
Salary HWS Q.6 to 10-60E
Salary HWS Q.14 to 20-60G
Operating Costing 351
Expected Output
74
74B
105
05
20
77N
60A
60B
60C
60F
60D
60E
60G
351
Exact issue: for the three lines just above the last one, it additionally prints 5, 10 and 20 along with the end tokens 60D, 60E and 60G.
I believe there's an issue somewhere in the last part of my regex that needs only a small edit to fix. Hopefully someone can help me.
Please inspect the following piece of code for compliance with your requirements:
use strict;
use warnings;
use feature 'say';
while ( <DATA> ) {
    chomp;
    s/\.mp4//;
    say $1 if /Lecture\s*(\w+)/ or /(\d{2}[A-Z]?)\Z/;
}
__DATA__
54282068 Lecture74- AS 29 Question.mp4
174424104Lecture 74B - AS 29 Theory.mp4
Branch Accounts Lecture 105
Lecture05 - Practicals AS 28
Submissions 20.mp4
HW Section 77N
Residential status HWS Q.1 to 6 -60A
Residential status HWS Q.7 to 20 -60B
House property all HWS-60C
Salary HWS Q.11 to 13 - 60F
Salary HWS Q.1 to 5-60D
Salary HWS Q.6 to 10-60E
Salary HWS Q.14 to 20-60G
Output
74
74B
105
05
20
77N
60A
60B
60C
60F
60D
60E
60G

Substitute second occurrence of a pattern in a line using awk

I need to replace the second occurrence of a pattern (the one that matches the last field) with another word, and also keep a count of all such changes in the file.
Example: try.txt
Hi
Change apple orange guava mango banana orange
It's hot outside
Change tom greg fred harry steve fred
George is a cool guy
Change mary lucy becky karly jill karly
thank you
In all the lines that have the pattern "Change", I want to replace the last word, for example "orange" in the second line, with, say, "pear". Note that the first "orange" should not be changed. I also want to append a suffix that shows the number of changes made in the file.
I tried the following, but it changed both occurrences (1st orange and 2nd orange, 1st fred and 2nd fred, 1st karly and 2nd karly), whereas I wanted to change only the second occurrence.
awk 'BEGIN {cntr=0} {if (/Change/) {gsub($NF,"pear"); OFS=""; print $0,cntr; cntr++} else {print}}' try.txt
The output is:
Hi
Change apple pear guava mango banana pear0
It's hot outside
Change tom greg pear harry steve pear1
George is a cool guy
Change mary lucy becky pear jill pear2
thank you
Desired output is:
Hi
Change apple orange guava mango banana pear0
It's hot outside
Change tom greg fred harry steve pear1
George is a cool guy
Change mary lucy becky karly jill pear2
thank you
When gsub is replaced with sub, it changes only the first occurrence. Any help is appreciated.
This one-liner works for your input:
awk '/Change/{$NF="pear"(i++)}7' file
Assigning to $NF rebuilds the record with OFS; if you want to keep the original spacing (runs of spaces, for example) untouched, you can use sub() on the last field instead:
awk '/Change/{sub(/\S+$/,"pear"(i++))}7' file
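A self-contained sketch of the first form, fed with a few of the question's sample lines (the uninitialized counter i starts at 0, so the first replacement gets suffix 0):

```shell
# Rewrite the last field of every "Change" line and append a counter;
# earlier occurrences of the same word are left alone.
result=$(printf '%s\n' \
  'Hi' \
  'Change apple orange guava mango banana orange' \
  'Change tom greg fred harry steve fred' |
  awk '/Change/{$NF="pear"(i++)}7')
printf '%s\n' "$result"
```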
I think I found a work-around:
awk 'BEGIN {cntr=0} {if (/Change/) {$NF=$NF_cntr; sub($NF,"pear"); OFS=""; print $0,cntr; OFS=" "; cntr++} else {print}}' try.txt
The output was as I desired.
But I would still like to hear from the community about better ways of achieving it.
Thanks

How can I extract Twitter #handles from a text with RegEx?

I'm looking for an easy way to create lists of Twitter #handles based on SocialBakers data (copy/paste into TextMate).
I've tried using the following RegEx, which I found here on StackOverflow, but unfortunately it doesn't work the way I want it to:
^(?!.*#([\w+])).*$
While the expression above deletes all lines without #handles, I'd like the RegEx to delete everything before and after the #handle as well as lines without #handles.
Example:
1
katyperry KATY PERRY (#katyperry)
Followings 158
Followers 82 085 596
Rating
5
Worst012345678910Best
2
justinbieber Justin Bieber (#justinbieber)
254 399
74 748 878
2
Worst012345678910Best
3
taylorswift13 Taylor Swift (#taylorswift13)
245
70 529 992
Desired result:
#katyperry
#justinbieber
#taylorswift13
Thanks in advance for any help!
Something like this:
cat file | perl -ne 'while(s/(#[a-z0-9_]+)//gi) { print $1,"\n"}'
This will also work if you have lines with multiple #handles in.
A Twitter handle regex is #\w+. So, to remove everything else, you need to match and capture the pattern and use a backreference to this capture group, and then just match any character:
(#\w+)|.
Use DOTALL mode to also match newline symbols. Replace with $1 (or \1, depending on the tool you are using).
See demo
Straight regex, tested in Caret:
#.*[^)]
The above matches a # and everything after it, excluding a trailing close parenthesis.
#.*\b
The above does the same thing in the Caret text editor, using a word boundary instead.
How to awk and sed this:
Get usernames as well:
$ awk '/#.*/ {print}' test
katyperry KATY PERRY (#katyperry)
justinbieber Justin Bieber (#justinbieber)
taylorswift13 Taylor Swift (#taylorswift13)
Just the Handle:
$ awk -F "(" '/#.*/ {print$2}' test | sed 's/)//g'
#katyperry
#justinbieber
#taylorswift13
A look at the test file:
$ cat test
1
katyperry KATY PERRY (#katyperry)
Followings 158
Followers 82 085 596
Rating
5
Worst012345678910Best
2
justinbieber Justin Bieber (#justinbieber)
254 399
74 748 878
2
Worst012345678910Best
3
taylorswift13 Taylor Swift (#taylorswift13)
245
70 529 992
Bash Version:
$ bash --version
GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin14)
Copyright (C) 2007 Free Software Foundation, Inc.
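The awk-plus-sed pipeline above can also be collapsed with grep's -o flag (not POSIX, but supported by GNU and BSD grep), which prints only the matched handles, one per line. A sketch with a few lines of the test file:

```shell
# -o prints just the match; -E enables extended regex syntax.
result=$(printf '%s\n' \
  'katyperry KATY PERRY (#katyperry)' \
  'Followings 158' \
  'justinbieber Justin Bieber (#justinbieber)' |
  grep -oE '#[A-Za-z0-9_]+')
printf '%s\n' "$result"
```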

Match a word just once - AWK

I was reading the GNU awk manual, but I didn't find a regular expression with which I can match a string just once.
For example, from the files aha_1.txt, aha_2.txt, aha_3.txt, ..., I would like to print the second column $2 from the first time ana appears in each file. In addition, the same thing when pedro appears.
aha_1.txt
luis 321 487
ana 454 345
pedro 341 435
ana 941 345
aha_2.txt
pedro 201 723
gusi 837 134
ana 319 518
cindy 738 278
ana 984 265
...
Meanwhile I tried this, but it matches all occurrences, not just the first:
/^ana/ {print $2 }
/^pedro/ {print $2 }
Thanks for your help :-)
Just call the exit command after printing the first value (the second column of the line that starts with the string ana).
$ awk '$1~/^ana$/{print $2; exit}' file
454
Original question
Only processing one file.
awk '/ana/ { if (ana++ == 0) print $2 }' aha.txt
or
awk '/ana/ && ana++ == 0 { print $2 }' aha.txt
Or, if you don't need to do anything else, you can exit after printing, as suggested by Avinash Raj in his answer.
Revised question
I have many files (aha.txt, aha_1.txt, aha_2.txt, ...) each file has ana inside and I need just to take the fist time ana appears in each file and the output has to be one file.
That's a slightly different question. If you have GNU grep, you can use (more or less):
grep -m1 -e ana aha*.txt
That will list the whole line, not just column 2, and will list the filenames too, so it isn't a perfect match.
Using awk, you have to work a bit more:
awk 'FILENAME != old_file { ana = 0; old_file = FILENAME }
/ana/ { if (ana++ == 0) print $2 }' aha*.txt
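An equivalent per-file reset can use FNR, which restarts at 1 for each input file. This is a sketch with made-up file names and contents modeled on the question's data:

```shell
# Clear the seen flag at the first record of each file, so only the
# first "ana" line per file prints its second column.
printf 'luis 321 487\nana 454 345\nana 941 345\n' > aha_1.txt
printf 'pedro 201 723\nana 319 518\nana 984 265\n' > aha_2.txt
result=$(awk 'FNR==1{seen=0} $1=="ana" && !seen++{print $2}' aha_1.txt aha_2.txt)
printf '%s\n' "$result"
rm -f aha_1.txt aha_2.txt   # clean up the demo files
```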

How to print all lines matching the first field of last line

I've been trying to do this for the last two days. I read a lot of tutorials and I learned a lot of new things but so far I couldn't manage to achieve what I'm trying to do. Let's say this is the command line output:
Johnny123 US 224
Johnny123 US 145
Johnny123 US 555
Johnny123 US 344
Robert UK 4322
Robert UK 52
Lucas FR 344
Lucas FR 222
Lucas FR 8945
I want to print the lines whose first field matches the first field of the last line (Lucas).
So, I want to print out:
Lucas FR 344
Lucas FR 222
Lucas FR 8945
Notes:
What I'm trying to print has a different line count each time, so I can't simply return the last three lines.
The first field doesn't have a specific pattern that I can use to print.
Here is another way using tac and awk:
tac file | awk 'NR==1{last=$1}$1==last' | tac
Lucas FR 344
Lucas FR 222
Lucas FR 8945
The last tac is only needed if the order is important.
awk 'NR==FNR{key=$1;next} $1==key' file file
or if you prefer
awk '{val[$1]=val[$1] $0 RS; key=$1} END{printf "%s", val[key]}' file
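A runnable sketch of the two-pass form: the first pass over the file leaves key holding the last line's first field, and the second pass prints matching lines. A temporary data file stands in for the question's input:

```shell
# Pass 1 (NR==FNR): remember $1 of every line; key ends as the last one.
# Pass 2: print lines whose first field equals that key.
printf '%s\n' 'Johnny123 US 224' 'Robert UK 52' \
  'Lucas FR 344' 'Lucas FR 222' > data.txt
result=$(awk 'NR==FNR{key=$1;next} $1==key' data.txt data.txt)
printf '%s\n' "$result"
rm -f data.txt   # clean up the demo file
```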
This might work for you (GNU sed):
sed -nr 'H;g;/^(\S+\s).*\n\1[^\n]*$/!{s/.*\n//;h};$p' file
Store lines with duplicate keys in the hold space. At change of key remove previous lines. At end-of-file print out what remains.