Using grep, how to match beginning of line with pattern from stdin - regex

I have a one-liner that prints out a series of numbers:
124
132
186
I am then piping this output into grep to match these numbers to the beginning of lines in another file but sometimes the second number in the line matches one of the patterns and I get an incorrect match like so:
$ get_id_command | grep -f - users.list
124 => 3456, Charles Charmichael, ccharmichael
132 => 2498, Sarah Walker, swalker
186 => 8934, John Casey, jcasey
240 => 1245, Morgan Grimes, mgrimes
What options do I need for grep to only match patterns at the beginning of the line? I would really like to keep this as a one-liner.

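Prepend a circumflex to each pattern coming from stdin and it will work. A circumflex anchors the pattern to the start of the line. There is no need to modify users.list; insert a sed step into the pipeline instead, e.g.
$ get_id_command | sed 's/^/^/' | grep -f - users.list
Each pattern then looks like ^124, so it can no longer match a number in the middle of a line. To also keep ^124 from matching a longer ID such as 1245 at the start of a line, append the space that follows the ID:
$ get_id_command | sed 's/^/^/; s/$/ /' | grep -f - users.list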
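A minimal sketch of the anchored-pattern pipeline, run against sample data modeled on the question. The temporary file stands in for users.list and the get_ids function is a hypothetical stand-in for get_id_command; neither name comes from the original post.

```shell
# Sample data modeled on the question; the temp file stands in for users.list.
users=$(mktemp)
printf '%s\n' \
  '124 => 3456, Charles Charmichael, ccharmichael' \
  '132 => 2498, Sarah Walker, swalker' \
  '186 => 8934, John Casey, jcasey' \
  '240 => 1245, Morgan Grimes, mgrimes' > "$users"

# Hypothetical stand-in for get_id_command.
get_ids() { printf '%s\n' 124 132 186; }

# Turn each ID into the pattern "^ID " and feed the patterns to grep on stdin.
matched=$(get_ids | sed 's/^/^/; s/$/ /' | grep -f - "$users")
printf '%s\n' "$matched"

rm -f "$users"
```

The line for ID 240 is no longer printed, because 124 only matches when it is the first thing on the line.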

Related

Regular Expression to match against first character and file extension

I'm using Bash to try to write a command that gets every file where the first character is not 'a' and the file does not end with '.html' but cannot seem to get both to work properly.
So far I can get my regex to match all the files that start with 'a' and end with '.html' and remove them but my issue that I cannot seem to solve is when the file starts with 'a' and ends with a different file extension. My regex seems to ignore that second requirement and just hides it regardless.
cat inputfile.txt | sed -n '/^[^a].*[^html$]/p'
Input File Contents:
123
anapple.html
456
theapple.html
789
nottrue.html
apple.csv
12
Output:
123
456
theapple.html
789
nottrue.html
12
Instead of trying to write a pattern that matches the rows to keep, write a pattern that matches the rows to remove, and use grep -v to print all the lines that don't match it.
grep -v '^a.*\.html$' inputfile.txt
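The original attempt fails because [^html$] is a bracket expression: it matches any single character that is not h, t, m, l or $, rather than "a line that does not end in html". A small sketch of the grep -v approach on the input from the question (the variable names are only for illustration):

```shell
# Input lines copied from the question.
input=$(printf '%s\n' 123 anapple.html 456 theapple.html 789 nottrue.html apple.csv 12)

# Drop only the lines that both start with "a" and end with ".html".
kept=$(printf '%s\n' "$input" | grep -v '^a.*\.html$')
printf '%s\n' "$kept"
```

Here anapple.html is removed, while apple.csv is kept because it does not end in .html.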

Bash/sed: delete everything from text file except match(es)

I have a text file which I need to extract a match from in a bash script. There might be more than one match and everything else is supposed to be discarded.
Sample snippet of input.txt file content:
PART TWO OF TWO PARTS-
E RESNO 56/20 56/30 54/40 52/50 TUDEP
EAST LVLS NIL
WEST LVLS 310 320 330 340 350 360 370 380 390
EUR RTS WEST NIL
NAR NIL-
REMARKS.
1.TMI IS 142 AND OPERATORS ARE REMINDED TO INCLUDE THE
TMI NUMBER AS PART OF THE OCEANIC CLEARANCE READ BACK.
2.ADS-C AND CPDLC MANDATED OTS ARE AS FOLLOWS
TRACK A 350 360 370 380 390
TRACK B 350 360 370 380 390
I try to match for 142 from the line
1.TMI IS 142 AND OPERATORS ARE REMINDED TO INCLUDE THE
The match is always a number (one to three digits, may have leading zeroes) and always preceded by TMI IS.
My experiments so far led to nothing: I tried .*TMI IS ([0-9]+).* with the following sed command in my bash script
sed -n 's/.*TMI IS \([0-9]+\).*/\1/g' input.txt > output.txt
but only got an empty output.txt.
My script runs in GNU Bash-4.2. Where do I make my mistake? I ran out of ideas so your input is highly appreciated!
Thanks,
Chris
Two things are needed to make your sed approach work:
- the + quantifier must be escaped (\+) in sed basic regular expressions; this is a GNU extension
- to print the result of a successful substitution, add the p flag to the s command:
sed -n 's/.*TMI IS \([0-9]\+\).*/\1/gp' input.txt
142
To get only the first match for your current format use:
sed -n 's/^\S\+TMI IS \([0-9]\+\).*/\1/gp' input.txt
With GNU grep:
$ grep -oP 'TMI IS \K([0-9]*)' input.txt
142
You could also do this using perl as an alternative to the above:
$ perl -nle 'print $1 if /TMI IS (\d+)/;' < input.txt
142
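For a sed that lacks the GNU \+ extension, the same extraction can be sketched in two other ways: with [0-9][0-9]* in a basic regular expression, or with -E (extended regular expressions, accepted by both GNU and BSD sed). The input below is a two-line excerpt of the sample from the question.

```shell
# Two lines from the sample input; only the first contains "TMI IS".
input='1.TMI IS 142 AND OPERATORS ARE REMINDED TO INCLUDE THE
TMI NUMBER AS PART OF THE OCEANIC CLEARANCE READ BACK.'

# Basic regular expression: "one or more digits" written without \+.
tmi_bre=$(printf '%s\n' "$input" | sed -n 's/.*TMI IS \([0-9][0-9]*\).*/\1/p')

# Extended regular expression: + and ( ) need no backslashes.
tmi_ere=$(printf '%s\n' "$input" | sed -nE 's/.*TMI IS ([0-9]+).*/\1/p')

printf '%s\n' "$tmi_bre" "$tmi_ere"
```

Both commands print 142; the second input line is suppressed by -n because no substitution happens there.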

sed - replace all strings that begin with

Hi I want to find all strings anywhere in a file that begin with the letters rs i.e.
rs12345 100
rs54321 200
300 rs13579
and delete all strings that begin with the criteria so that I get:
100
200
300
i.e. replace the string with nothing. I am not bothered about leading whitespace before final output as I deal with that later. I have tried sed 's/rs*//g' however this gives:
12345 100
54321 200
i.e. only removes the rs.
How do I edit my sed command to delete the entire string? Thanks
You can replace from starting rs up to a space by an empty string, so that the rsXXX gets removed.
sed 's/^rs[^ ]*//' file
This assumes that rs is at the beginning of the line. If rs can appear anywhere in the line, use:
sed 's/\brs[^ ]*//' file
The \b works as a word boundary, so that things like hellorshello do not match.
Test
$ cat a
rs12345 100
rs54321 200
hellooo 300
$ sed 's/^rs[^ ]*//' a
100
200
hellooo 300
Note I am not dealing with the whitespace, since you mention you are handling it later on. In case you needed it, you can say sed 's/^rs[^ ]* \+//' file.
rs anywhere:
$ cat a
rs12345 100
rs54321 200
hellooo 300
and rs12345 100
thisrsisatest 24
$ sed 's/\brs[^ ]*//' a
100
200
hellooo 300
and 100
thisrsisatest 24
Note your current approach wasn't working because rs* means: r followed by zero or more s.
Let the word be a pattern starting with a word boundary \b and consisting of word characters \w (it ends with \b too, but we will not need that).
The command that removes words starting with rs will be sed -r 's/\brs\w+//g' file
$ cat file
rs12345 100
rs54321 200
ars1234 000
$ sed -r 's/\brs\w+//g' file
100
200
ars1234 000
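\b and \w are GNU extensions. As a hedged, more portable variant of the same idea, the sketch below treats "start of line or whitespace" as the word boundary instead; this is a swapped-in technique, not what the answers above use, and it also removes the whitespace character in front of the word.

```shell
# Sample lines from the question plus one word that merely contains "rs".
input='rs12345 100
rs54321 200
300 rs13579
ars1234 000'

# Remove every whitespace-delimited word that starts with "rs".
cleaned=$(printf '%s\n' "$input" | sed -E 's/(^|[[:space:]])rs[^[:space:]]*//g')
printf '%s\n' "$cleaned"
```

The word ars1234 is left alone because its rs is not at the start of a word; leftover leading whitespace is kept, as the question says it is handled later.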

Sed command garbled with very easy multiline regex in bash

I'm stuck again on a sed command, most probably because I have a very old version of sed, and due to my limitations I can't change the version of sed.
I wrote a simple regex that fits my string file:
/[^,]*$/mg
My string file is this :
23:53:20,650
23:53:20,654
23:53:20,655
23:53:20,656
23:53:21,238
23:53:21,240
23:53:21,302
23:53:21,303
23:53:21,304
23:53:21,305
23:53:21,889
23:53:21,890
23:53:21,896
23:53:21,897
23:53:21,898
23:53:21,899
23:53:22,492
23:53:22,538
23:53:22,539
23:53:23,109
23:53:23,110
23:53:23,115
23:53:23,117
23:53:23,118
23:53:23,119
23:53:23,690
23:53:23,721
23:53:23,722
23:53:24,275
23:53:24,276
23:53:24,313
23:53:24,316
23:53:24,317
23:53:24,318
23:53:24,854
23:53:24,888
23:53:24,889
23:53:24,890
23:53:24,891
23:53:50,676
23:53:50,677
23:53:50,711
23:53:50,713
23:53:50,714
23:53:51,257
23:53:51,258
23:53:51,296
23:53:51,297
23:53:51,298
23:53:51,820
23:53:51,822
23:53:51,823
23:53:52,358
23:53:52,364
23:53:52,367
23:53:52,909
23:53:52,910
23:53:52,936
23:53:52,939
23:53:52,941
23:53:52,944
23:53:52,945
23:53:52,946
23:53:52,949
23:53:52,953
23:53:52,956
23:53:52,959
23:53:52,963
23:53:52,966
23:53:52,970
23:53:52,971
23:53:52,974
23:53:52,978
23:53:52,980
23:53:52,983
23:53:52,984
23:53:52,986
23:53:52,987
23:53:52,989
23:53:52,990
23:53:52,991
23:53:52,994
23:53:52,995
23:53:52,999
23:53:53,001
23:53:53,002
23:53:53,004
23:53:53,005
23:53:53,007
23:53:53,010
23:53:53,026
23:53:53,027
23:53:53,081
23:53:53,082
23:53:53,083
23:53:53,085
07:32:54,519
07:32:54,521
07:32:54,537
07:32:54,538
07:32:54,539
07:32:54,540
07:32:54,541
07:32:54,542
07:32:54,543
07:32:54,544
07:32:54,545
07:32:54,546
07:32:54,547
07:32:54,548
07:32:54,549
07:32:54,550
I'm trying to get the values after the comma and then assign them to an array. When I use a sed command like:
`sed -n '/[^,]*$/mg'` file
it says "command garbled". I read about multiline sed but still couldn't reach a solution. I am new to regexes, so any help will be appreciated.
Thank you in advance!
If you are using a "recent" bash, I think you can use cut and assign extracted values to an array:
numbers="$(cut -d',' -f2 filename.txt)"
array_numbers=( $numbers )
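If you want to get the values after the comma, you could use the sed command below, which removes everything from the start of the line up to and including the first comma.
sed 's/^[^,]*,//' file
OR, because .* is greedy, the following removes everything up to the last comma, which is the same thing here since each line has only one comma:
sed 's/^.*,//' file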
Example:
$ echo '23:53:22,492' | sed 's/^[^,]*,//'
492
$ echo '23:53:22,492' | sed 's/^.*,//'
492
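The two commands only differ when a line contains more than one comma. A short sketch with a made-up sample string (a,b,c is not from the question) shows the difference:

```shell
# Hypothetical input with two commas, to contrast the two patterns.
sample='a,b,c'

# [^,]* cannot cross a comma, so only the text up to the FIRST comma is removed.
first=$(printf '%s\n' "$sample" | sed 's/^[^,]*,//')

# .* is greedy, so everything up to the LAST comma is removed.
last=$(printf '%s\n' "$sample" | sed 's/^.*,//')

printf '%s\n' "$first" "$last"
```

For the timestamp data in the question both forms give the same result, since each line has exactly one comma.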
sed s/.*,// file
matches everything up to the last comma on the line (.* is greedy; each line here has only one comma) and substitutes the match with nothing, which effectively leaves the values after the comma
for the input file
23:53:20,650
23:53:20,654
23:53:20,655
23:53:20,656
23:53:21,238
23:53:21,240
23:53:21,302
23:53:21,303
23:53:21,304
23:53:21,305
23:53:21,889
23:53:21,890
23:53:21,896
23:53:21,897
23:53:21,898
23:53:21,899
23:53:22,492
23:53:22,538
will produce output as
650
654
655
656
238
240
302
303
304
305
889
890
896
897
898
899
492
538

how to extract number in a single quote from a line with awk or sed?

I have these lines, tab delimited:
chr1 11460 11462 '16/38' 421 + chr1 11460 11462 '21/29' 724 + 2
chr1 11479 11481 '11/29' 379 + chr1 11479 11481 '20/5' 667 + 2
What I want to do is to test whether every second number inside ' ' (the one after the slash) is greater than or equal to 10. If so, I'll output the line. So the result should be to print only the first line:
chr1 11460 11462 '16/38' 421 + chr1 11460 11462 '21/29' 724 + 2
I can write Perl code to do it, but this seems to be something awk can do easily. Does anyone have a solution?
Thanks.
If you set the right field separators, it's pretty easy:
awk -F "['/]" '{for (i=3; i<=NF; i+=3) if ($i<10) next; print}' file
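To see why the loop steps through fields 3, 6, 9, ...: with ' and / as separators, each quoted pair contributes three fields (the text before the quote, the first number, the second number). A minimal sketch on the two lines from the question, with tabs written as \t in printf and illustrative variable names:

```shell
# The two sample lines, rebuilt with real tab characters.
good=$(printf "chr1\t11460\t11462\t'16/38'\t421\t+\tchr1\t11460\t11462\t'21/29'\t724\t+\t2")
bad=$(printf "chr1\t11479\t11481\t'11/29'\t379\t+\tchr1\t11479\t11481\t'20/5'\t667\t+\t2")

# Fields 3 and 6 hold the numbers after each slash.
fields=$(printf '%s\n' "$good" | awk -F "['/]" '{print $3, $6}')

# Skip a line as soon as one of those numbers is below 10; otherwise print it.
kept=$(printf '%s\n%s\n' "$good" "$bad" |
  awk -F "['/]" '{for (i=3; i<=NF; i+=3) if ($i<10) next; print}')

printf '%s\n' "$fields" "$kept"
```

The second line is dropped because the 5 in '20/5' is below 10.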
The easiest way to fetch the content inside single quotes might be just to strip off everything from both ends of each line, up to and including the single quote:
$ sed "s/^[^']*'//;s/'.*//" file
16/38
11/29
This sed expression consists of two commands:
s/^[^']*'// -- strips off all text to the first single quote,
s/'.*// -- strips off all text from the first (remaining) single quote to EOL.
To wrap this in a shell script that does something with that data requires .. well, a shell script...
You can parse this stuff using bash's read command. For example:
#!/bin/bash
IFS=/
sed "s/^[^']*'//;s/'.*//" file \
| while read left right; do
echo "$left / $right"
done
To implement something that grabs contents of multiple single-quoted numbers, you can expand the sed script appropriately, and implement if statements for the conditions you want. For example, a sed expression to grab the TWO single-quoted strings might be:
sed "s/^[^']*'\([^']*\)'[^']*'\([^']*\)'.*/\1 \2/"
This is a single large regex that uses two sets of escaped parentheses, \( and \), to mark the groups that will be placed in the output as \1 and \2.
But you might be better off parsing things according to column position:
$ while read _ _ _ A _ _ _ _ _ B _; do echo "$A .. $B"; done < file
'16/38' .. '21/29'
'11/29' .. '20/5'
Actually implementing your programming logic is left as an exercise to the reader. If you'd like us to help you with your script, please include your work so far.
As long as those are the only ' characters in the string and the numbers won't have leading zeros, you could use the regular expression:
\d\d+'.*\d\d+'
If either of those preconditions isn't true there are changes that could be made, but it would depend on the situation.
You should be able to use grep to get the lines you want using that regex. Note that \d needs PCRE support (grep -P in GNU grep), and the pattern has to be double-quoted because it contains single quotes.
The following prints just the first line, assuming the two lines are stored in a file named file:
grep -P "\d\d+'.*\d\d+'" file
My version, serious overkill but should work with any amount of 'xx/xx' per line:
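awk -F'\t' "{
    found = 1;
    # Fields are numbered 1..NF, so start at 1 and include NF.
    for (i = 1; i <= NF; i++) {
        if (match(\$i, /'[[:digit:]]+\/([[:digit:]]+)'/, capts)) {
            # Adding 0 forces a numeric comparison.
            if (capts[1] + 0 < 10) {
                found = 0;
                break;
            }
        }
    }
    if (found) {
        print;
    }
}" file.txt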
Explanation:
This will loop through each field of the line and apply a regex against the field to find the last digits of 'xx/xx'. If the last digits are less than 10 it will break out of the loop and go to the next line. If all fields have been processed by the for loop and no last digits were less than 10, it will print the line.
Note:
Since I'm using the three-argument form of the match function to capture regex groups, this will only work with GNU awk.
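Where GNU awk is not available, the same per-field check can be sketched with split() instead of the three-argument match(). This is a swapped-in technique, not the answer's own code; the variable q (holding a single quote) and the sample data are assumptions made for illustration.

```shell
# The two sample lines from the question, rebuilt with real tab characters.
good=$(printf "chr1\t11460\t11462\t'16/38'\t421\t+\tchr1\t11460\t11462\t'21/29'\t724\t+\t2")
bad=$(printf "chr1\t11479\t11481\t'11/29'\t379\t+\tchr1\t11479\t11481\t'20/5'\t667\t+\t2")

kept=$(printf '%s\n%s\n' "$good" "$bad" | awk -F'\t' -v q="'" '
BEGIN { pat = "^" q "[0-9]+/[0-9]+" q "$" }   # matches a field like (quote)16/38(quote)
{
    ok = 1
    for (i = 1; i <= NF; i++) {
        if ($i ~ pat) {
            split($i, parts, "/")              # parts[2] is the number plus the closing quote
            if (parts[2] + 0 < 10) {           # + 0 converts the leading digits to a number
                ok = 0
                break
            }
        }
    }
    if (ok) print
}')
printf '%s\n' "$kept"
```

Passing the quote in with -v avoids having to escape a single quote inside the single-quoted awk program.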