Bash/sed: delete everything from text file except match(es) - regex

I have a text file which I need to extract a match from in a bash script. There might be more than one match and everything else is supposed to be discarded.
Sample snippet of input.txt file content:
PART TWO OF TWO PARTS-
E RESNO 56/20 56/30 54/40 52/50 TUDEP
EAST LVLS NIL
WEST LVLS 310 320 330 340 350 360 370 380 390
EUR RTS WEST NIL
NAR NIL-
REMARKS.
1.TMI IS 142 AND OPERATORS ARE REMINDED TO INCLUDE THE
TMI NUMBER AS PART OF THE OCEANIC CLEARANCE READ BACK.
2.ADS-C AND CPDLC MANDATED OTS ARE AS FOLLOWS
TRACK A 350 360 370 380 390
TRACK B 350 360 370 380 390
I try to match for 142 from the line
1.TMI IS 142 AND OPERATORS ARE REMINDED TO INCLUDE THE
The match is always a number (one to three digits, may have leading zeroes) and always preceded by TMI IS.
My experiments so far led to nothing: I tried .*TMI IS ([0-9]+).* with the following sed command in my bash script
sed -n 's/.*TMI IS \([0-9]+\).*/\1/g' input.txt > output.txt
but only got an empty output.txt.
My script runs in GNU Bash-4.2. Where do I make my mistake? I ran out of ideas so your input is highly appreciated!
Thanks,
Chris

Two moments about your sed approach to make it work:
+ quantifier should be escaped in sed basic regular expressions
to print matched pattern use p subcommand:
sed -n 's/.*TMI IS \([0-9]\+\).*/\1/gp' input.txt
142
To get only the first match for your current format use:
sed -n 's/^\S\+TMI IS \([0-9]\+\).*/\1/gp' input.txt

With GNU grep:
$ grep -oP 'TMI IS \K([0-9]*)' input.txt
142

You could also do this using perl as an alternative to the above:
$ perl -nle 'print $1 if /TMI IS (\d+)/;' < input.txt
142

Related

Using grep, how to match beginning of line with pattern from stdin

I have a one-liner that prints out a series a numbers:
124
132
186
I am then piping this output into grep to match these numbers to the beginning of lines in another file but sometimes the second number in the line matches one of the patterns and I get an incorrect match like so:
$ get_id_command | grep -f - users.list
124 => 3456, Charles Charmichael, ccharmichael
132 => 2498, Sarah Walker, swalker
186 => 8934, John Casey, jcasey
240 => 1245, Morgan Grimes, mgrimes
What options do I need for grep to only match patterns at the beginning of the line? I would really like to keep this as a one-linter.
Prepend a circumflex to each line of your file and it will work. Circumflex does indicate the line start within the pattern. So modify your users.list as described, e.g.
sed -Ei 's|(.*)|^\1|' users.list
After that you should get the desired result by your command
$ get_id_command | grep -f - users.list

How can I extract Twitter #handles from a text with RegEx?

I'm looking for an easy way to create lists of Twitter #handles based on SocialBakers data (copy/paste into TextMate).
I've tried using the following RegEx, which I found here on StackOverflow, but unfortunately it doesn't work the way I want it to:
^(?!.*#([\w+])).*$
While the expression above deletes all lines without #handles, I'd like the RegEx to delete everything before and after the #handle as well as lines without #handles.
Example:
1
katyperry KATY PERRY (#katyperry)
Followings 158
Followers 82 085 596
Rating
5
Worst012345678910Best
2
justinbieber Justin Bieber (#justinbieber)
254 399
74 748 878
2
Worst012345678910Best
3
taylorswift13 Taylor Swift (#taylorswift13)
245
70 529 992
Desired result:
#katyperry
#justinbieber
#taylorswift13
Thanks in advance for any help!
Something like this:
cat file | perl -ne 'while(s/(#[a-z0-9_]+)//gi) { print $1,"\n"}'
This will also work if you have lines with multiple #handles in.
A Twitter handle regex is #\w+. So, to remove everything else, you need to match and capture the pattern and use a backreference to this capture group, and then just match any character:
(#\w+)|.
Use DOTALL mode to also match newline symbols. Replace with $1 (or \1, depending on the tool you are using).
See demo
Strait REGEX Tested in Caret:
#.*[^)]
The above will search for and any given and exclude close parenthesis.
#.*\b
The above here does the same thing in Caret text editor.
How to awk and sed this:
Get usernames as well:
$ awk '/#.*/ {print}' test
katyperry KATY PERRY (#katyperry)
justinbieber Justin Bieber (#justinbieber)
taylorswift13 Taylor Swift (#taylorswift13)
Just the Handle:
$ awk -F "(" '/#.*/ {print$2}' test | sed 's/)//g'
#katyperry
#justinbieber
#taylorswift13
A look at the test file:
$ cat test
1
katyperry KATY PERRY (#katyperry)
Followings 158
Followers 82 085 596
Rating
5
Worst012345678910Best
2
justinbieber Justin Bieber (#justinbieber)
254 399
74 748 878
2
Worst012345678910Best
3
taylorswift13 Taylor Swift (#taylorswift13)
245
70 529 992
Bash Version:
$ bash --version
GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin14)
Copyright (C) 2007 Free Software Foundation, Inc.

Sed command garbled with very easy mutiline regex in bash

I'm again garbled with sed command, because most probably i have very old version of sed but according to my limitations i couldn't change the version of 'sed' (!)
My question is this i wrote such an easy regex that fits with my string file such as:
/[^,]*$/mg
My string file is this :
23:53:20,650
23:53:20,654
23:53:20,655
23:53:20,656
23:53:21,238
23:53:21,240
23:53:21,302
23:53:21,303
23:53:21,304
23:53:21,305
23:53:21,889
23:53:21,890
23:53:21,896
23:53:21,897
23:53:21,898
23:53:21,899
23:53:22,492
23:53:22,538
23:53:22,539
23:53:23,109
23:53:23,110
23:53:23,115
23:53:23,117
23:53:23,118
23:53:23,119
23:53:23,690
23:53:23,721
23:53:23,722
23:53:24,275
23:53:24,276
23:53:24,313
23:53:24,316
23:53:24,317
23:53:24,318
23:53:24,854
23:53:24,888
23:53:24,889
23:53:24,890
23:53:24,891
23:53:50,676
23:53:50,677
23:53:50,711
23:53:50,713
23:53:50,714
23:53:51,257
23:53:51,258
23:53:51,296
23:53:51,297
23:53:51,298
23:53:51,820
23:53:51,822
23:53:51,823
23:53:52,358
23:53:52,364
23:53:52,367
23:53:52,909
23:53:52,910
23:53:52,936
23:53:52,939
23:53:52,941
23:53:52,944
23:53:52,945
23:53:52,946
23:53:52,949
23:53:52,953
23:53:52,956
23:53:52,959
23:53:52,963
23:53:52,966
23:53:52,970
23:53:52,971
23:53:52,974
23:53:52,978
23:53:52,980
23:53:52,983
23:53:52,984
23:53:52,986
23:53:52,987
23:53:52,989
23:53:52,990
23:53:52,991
23:53:52,994
23:53:52,995
23:53:52,999
23:53:53,001
23:53:53,002
23:53:53,004
23:53:53,005
23:53:53,007
23:53:53,010
23:53:53,026
23:53:53,027
23:53:53,081
23:53:53,082
23:53:53,083
23:53:53,085
07:32:54,519
07:32:54,521
07:32:54,537
07:32:54,538
07:32:54,539
07:32:54,540
07:32:54,541
07:32:54,542
07:32:54,543
07:32:54,544
07:32:54,545
07:32:54,546
07:32:54,547
07:32:54,548
07:32:54,549
07:32:54,550
I'm trying to get the values after the comma then assign them into array, when I used the sed command like :
`sed -n '/[^,]*$/mg'` file
It says command garbled, i read about multiline sed but i still couldn't reach to solution, i am new to regexes so the help will be appreciated.
Thank you in advance!
If you are using a "recent" bash, I think you can use cut and assign extracted values to an array:
numbers="$(cut -d',' -f2 filename.txt)"
array_numbers=( $numbers )
If you want to get the values after comma then you could use the below sed command which removes the values from the start upto the first comma.
sed 's/^[^,]*,//' file
OR
sed 's/^.*,//' file
Example:
$ echo '23:53:22,492' | sed 's/^[^,]*,//'
492
$ echo '23:53:22,492' | sed 's/^.*,//'
492
sed s/.*,// file
would match the till the first , are substitute the match wth nothing, which effectively gives the values after comma
for the input file
23:53:20,650
23:53:20,654
23:53:20,655
23:53:20,656
23:53:21,238
23:53:21,240
23:53:21,302
23:53:21,303
23:53:21,304
23:53:21,305
23:53:21,889
23:53:21,890
23:53:21,896
23:53:21,897
23:53:21,898
23:53:21,899
23:53:22,492
23:53:22,538
will produce output as
650
654
655
656
238
240
302
303
304
305
889
890
896
897
898
899
492
538

Match a word just once - AWK

I was reading GNU awk manual but I didnt find a regular expression wich whom I can match a string just once.
For example from the files aha_1.txt, aha_2.txt, aha_3.txt, .... I would like to print the second column $2 from the first time ana appears in the files (aha_1.txt, aha_2.txt, aha_3.txt, ....). In addition, the same thing when pedro appears.
aha_1.txt
luis 321 487
ana 454 345
pedro 341 435
ana 941 345
aha_2.txt
pedro 201 723
gusi 837 134
ana 319 518
cindy 738 278
ana 984 265
.
.
.
.
Meanwhile I did this but it counts all the cases not just the first time
/^ana/ {print $2 }
/^pedro/ {print $2 }
Thanks for your help :-)
Just call the exit command after printing the first value(second column in the line which starts with the string ana).
$ awk '$1~/^ana$/{print $2; exit}' file
454
Original question
Only processing one file.
awk '/ana/ { if (ana++ == 0) print $2 }' aha.txt
or
awk '/ana/ && ana++ == 0 { print $2 }' aha.txt
Or, if you don't need to do anything else, you can exit after printing, as suggested by Avinash Raj in his answer.
Revised question
I have many files (aha.txt, aha_1.txt, aha_2.txt, ...) each file has ana inside and I need just to take the fist time ana appears in each file and the output has to be one file.
That's sightly different as a question. If you have GNU grep, you can use (more or less):
grep -m1 -e ana aha*.txt
That will list the whole line, not just column 2, and will list the filenames too, so it isn't a perfect match.
Using awk, you have to work a bit more:
awk 'FILENAME != old_file { ana = 0; old_file = FILENAME }
/ana/ { if (ana++ == 0) print $2 }' aha*.txt

Bash: write regex matches to separate file

I have the following text where I'm trying to extract several regex-matched lines into a separate text-file. The regex used is
^[A-Z][ \t].*$
and matches the required lines. The point I'm struggling with is separating the matched lines into a separate text-file. I tried sed, but was unable to achieve anything useful.
sample data:
272106 EGGXZOZX
(NAT-1/2 TRACKS FLS 310/390 INCLUSIVE
DEC 28/1130Z TO DEC 28/1900Z
PART ONE OF TWO PARTS-
A ERAKA 59/15 59/20 59/30 58/40 57/50 LOACH FOXXE
EAST LVLS NIL
WEST LVLS 310 320 330 340 350 360 370
EUR RTS WEST ETSOM
NAR NIL-
B GOMUP 58/15 58/20 58/30 57/40 56/50 SCROD VALIE
EAST LVLS NIL
WEST LVLS 310 320 330 340 350 360 370 380 390
EUR RTS WEST GINGA
NAR NIL-
C SUNOT 57/20 57/30 56/40 55/50 OYSTR STEAM
EAST LVLS NIL
WEST LVLS 310 320 330 340 350 360 370 380 390
EUR RTS WEST NIL
NAR NIL-
END OF PART ONE OF TWO PARTS)
desired result:
A ERAKA 59/15 59/20 59/30 58/40 57/50 LOACH FOXXE
B GOMUP 58/15 58/20 58/30 57/40 56/50 SCROD VALIE
C SUNOT 57/20 57/30 56/40 55/50 OYSTR STEAM
Any help or nudge in the right direction is greatly appreciated.
All the best,
Chris
working solution:
#anubhava had the solution working best for me:
grep '^[A-Z][[:space:]]' file > out.txt
Thank you!
Does this solve it?
grep -e '^[A-Z][ \t].*$' inputfile.txt > outputfile.txt
I believe this grep should work:
grep '^[A-Z][[:space:]]' file > out.txt
OR using awk:
awk '/^[A-Z][[:space:]]/' file > out.txt
OR using sed:
sed -n '/^[A-Z][[:space:]]/p' file > out.txt
This regex is more restrictive, it matches exact format of your lines, not just a single character in front of the string.
grep -Po '^[A-Z]\s+[A-Z]+\s([0-9]+/[0-9]+\s+)+[A-Z ]+$' file