grep matching but not printing if line end in dos ^M - regex

I need to search in multiple files for a PATTERN, if found display the file, line and PATTERN surrounded by a few extra chars. My problem is that if the line matching the PATTERN ends with ^M (CRLF) grep prints an empty line instead.
Create a file like this: first line "a^M", second line "a", third line empty, fourth line "a" (not followed by a newline).
a^M
a
a
Without trying to match a few chars after the PATTERN all occurrences are found and displayed:
# grep -srnoEiI ".{0,2}a" *
1:a
2:a
4:a
If I try to match any chars at the end of the PATTERN, it prints an empty line instead of line one, the one ending in CRLF:
# grep -srnoEiI ".{0,2}a.{0,2}" *
2:a
4:a
How can I change this to act as expected?
P.S. I would like to fix this with grep, but I will accept other solutions, for example in awk.
EDIT:
Based on the answers below I chose to strip the \r and force grep to pipe the colors to tr:
grep --color=always -srnoEiI ".{0,2}a.{0,2}" * | tr -d '\r'

Here's a simpler case that reproduces your problem:
# Output
echo $'a\r' | grep -o "a"
# No output
echo $'a\r' | grep -o "a."
This is because the ^M matches like a regular character and, once printed, makes your terminal overwrite its own output. grep does print the match; the problem is purely cosmetic.
How you want to fix this depends on what you want to do.
# Show the output in hex format to ensure it's correct
$ echo $'a\r' | grep -o "a." | od -t x1 -c
0000000 61 0d 0a
a \r \n
# Show the output in visually less ambiguous format
$ echo $'a\r' | grep -o "a." | cat -v
a^M
# Strip the carriage return
$ echo $'a\r' | grep -o "a." | tr -d '\r'
a
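To keep the file name and line number in the output (as grep -rn would print them) while stripping the carriage returns before matching, something like the following could be used. It is a minimal sketch: the function name search_crlf and the temporary test file are made up for illustration, and it takes an explicit list of files instead of recursing.

```bash
# Hypothetical helper: search_crlf PATTERN FILE...
# Strips CRs from each file before grep sees it, then re-adds the file name.
search_crlf() (
    pattern=$1
    shift
    for f in "$@"; do
        tr -d '\r' < "$f" | grep -noEi -- "$pattern" |
            while IFS= read -r hit; do
                printf '%s:%s\n' "$f" "$hit"
            done
    done
)

# Demo file from the question: "a\r\n", "a\n", empty line, "a" without newline
tmp=$(mktemp)
printf 'a\r\na\n\na' > "$tmp"
search_crlf '.{0,2}a.{0,2}' "$tmp"
```

Because the CR is removed before matching, it can never become part of a match, so lines 1, 2 and 4 are all displayed.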

awk -v pattern="a" '$0 ~ pattern && !/\r$/ {print NR ": " $0}' file
or
sed -n '/a/{/\r$/!{=;p}}' ~/tmp/srcfile | paste -d: - -
Both of these do: find the pattern, see if the line does not end in a carriage return, print the line number and the line. For the sed, the line number is on its own line, so we have to join two consecutive lines with a colon.

You could use pcregrep:
pcregrep -n '.{0,2}a.{0,2}' inputfile
For your sample input:
$ printf $'a\r\na\n\na\n' | pcregrep -n '.{0,2}a.{0,2}'
1:a
2:a
4:a

A couple more ways:
Use the dos2unix utility to convert the dos-style line endings to unix-style:
dos2unix myfile.txt
Or preprocess the file using tr to remove the CR characters, then pipe to grep:
$ tr -d '\r' < myfile.txt | grep -noEiI ".{0,2}a.{0,2}"
1:a
2:a
4:a
$
Note dos2unix may need to be installed on whatever OS you are using. More than likely tr will be available on any POSIX-compliant OS.
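Before converting anything, it can help to check which files actually contain CRLF line endings. The following is a rough sketch; the helper name has_crlf and the two sample files are assumptions made for the demonstration.

```bash
# Hypothetical helper: succeeds if FILE has at least one line ending in CR.
cr=$(printf '\r')
has_crlf() {
    grep -q -- "${cr}\$" "$1"
}

# Demo with one DOS-style and one Unix-style file
dir=$(mktemp -d)
printf 'a\r\nb\r\n' > "$dir/dos.txt"
printf 'a\nb\n' > "$dir/unix.txt"

for f in "$dir"/*.txt; do
    if has_crlf "$f"; then
        echo "CRLF line endings: $f"
    fi
done
```

Only dos.txt is reported, so only that file would need dos2unix or tr -d '\r'.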

You can use awk with a custom field separator:
awk -F '[[:blank:]\r]' '/.{0,2}a.{0,2}/{print FILENAME, NR, $1}' OFS=':' file
TESTING:
Your grep command:
grep -srnoEiI ".{0,2}a.{0,2}" file|cat -vte
file:1:a^M$
file:2:a$
file:4:a$
Suggested awk command:
awk -F '[[:blank:]\r]' '/.{0,2}a.{0,2}/{print FILENAME, NR, $1}' OFS=':' file|cat -vte
file:1:a$
file:2:a$
file:4:a$

Related

Extract sub-string from strings based on condition with shell command line

I have lines in myfile like this:
mount -t cifs //hostname/path/ /mount/path/ -o username='xxxx',password='xxxxx'
I need to extract sub-strings from this based on condition "start with // till next white-space including //".
I can't parse with the position as it won't be the same in all matched lines.
So far I have extracted the sub-string using grep's perl assertion, but the result does not return the //.
The piece of code I've used is
cat myfile | grep " cifs " | grep -oP "(?<=/)[^\s]*" | grep -v ^/
Output:
hostname/path/
Expected Output:
//hostname/path/
Is there a way to get the desired output by modifying the perl regex, perhaps some other method?
Simple bash one line solution
grep " cifs " myfile | sed -e "s/ /\n/g" | grep '^\/\/'
You may consider using some non-PCRE based solutions like
sed -En '/ cifs /{s,.*(//[^[:space:]]+).*,\1,p}' file
grep -oE '//[^[:space:]]+' file
The grep solution simply extracts all occurrences of // and 1+ non-whitespace chars after from the file.
The sed solution finds lines containing cifs and then extracts the last occurrence of // and 1+ non-whitespace chars after on those lines.
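The same extraction can also be sketched in plain shell by splitting the line into words and keeping those that start with //. The function name extract_share is made up for illustration. Unlike the grep -oE version, this only matches // at the start of a whitespace-separated word.

```bash
# Hypothetical helper: print every word starting with // on a " cifs " line.
extract_share() (
    case $1 in
        *" cifs "*) ;;      # only handle cifs mount lines
        *) exit 0 ;;
    esac
    set -f                  # no globbing while the line is word-split
    for word in $1; do
        case $word in
            //*) printf '%s\n' "$word" ;;
        esac
    done
)

line="mount -t cifs //hostname/path/ /mount/path/ -o username='xxxx',password='xxxxx'"
extract_share "$line"
```

For the sample line this prints //hostname/path/ and skips /mount/path/, which starts with a single slash.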
Following command should do what you ask for
grep cifs myfile | cut -d ' ' -f 4
or
grep cifs myfile | nawk '{print $4}'
or
awk '/cifs/ { print $4 }' myfile
or
perl -ne 'print "$1\n" if /cifs\s+(\S+)/' myfile

Find all text between $...$ delimiters using bash script

I have a text file, and I'm trying to get an array of strings containing between $..$ delimiters (LaTeX formulas) using bash script. My current code doesn't work, result is empty:
#!/bin/bash
array=($(grep -o '\$([^\$]*)\$' test.txt))
echo ${array[@]}
I tested this regex here, it finds the matches. I use the following test string:
b5f1e7$bfc2439c621353$d1ce0$629f$b8b5
Expected result is
bfc2439c621353 629f
But echo returns empty. Although if I use '[0-9]\+' it works:
5 1 7 2439 621353 1 0 629 8 5
What do I do wrong?
How about:
grep -o '\$[^$]*\$' test.txt | tr -d '$'
This is basically performing your original grep (but without the brackets, which were causing it to not match), then removing the first/last characters from each match.
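To get the results into a bash array as the question asks, the output can be read line by line. This is a sketch only: it assumes bash (arrays and process substitution are not available in plain sh) and uses the test string from the question instead of test.txt.

```bash
#!/bin/bash
s='b5f1e7$bfc2439c621353$d1ce0$629f$b8b5'

array=()
# One match per line; strip the delimiters and append each to the array.
while IFS= read -r m; do
    array+=("$m")
done < <(grep -o '\$[^$]*\$' <<< "$s" | tr -d '$')

echo "${array[@]}"
```

Reading line by line avoids the word splitting and globbing that an unquoted array=($(...)) would apply to each match.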
You may use awk with input field separator as $:
s='b5f1e7$bfc2439c621353$d1ce0$629f$b8b5'
awk -F '$' '{for (i=2; i<=NF; i+=2) print $i}' <<< "$s"
Note that this awk command doesn't validate input. If you want awk to allow for only valid inputs then you may use this gnu awk command with FPAT:
awk -v FPAT='\\$[^$]*\\$' '{for (i=1; i<=NF; i++) {gsub(/\$/, "", $i); print $i}}' <<< "$s"
bfc2439c621353
629f
What about this?
grep -Eo '\$[^$]+\$' a.txt | sed 's/\$//g'
I'm using sed to replace the $.
Try escaping your braces:
tst> grep -o '\$\([^\$]*\)\$' test.txt
$bfc2439c621353$
$629f$
of course, you then have to strip out the $ signs (-o prints the entire match). You can try sed instead:
tst> sed 's/[^\$]*\$\([^\$]*\)\$[^\$]*/\1\n/g' test.txt
bfc2439c621353
629f
Why is your expected output given b5f1e7$bfc2439c621353$d1ce0$629f$b8b5 the two elements bfc2439c621353 629f rather than the three elements bfc2439c621353 d1ce0 629f?
Here's a single grep command to extract those:
$ grep -Po '\$\K[^\$]*(?=\$)' <<<'b5f1e7$bfc2439c621353$d1ce0$629f$b8b5'
bfc2439c621353
d1ce0
629f
(This requires GNU grep as compiled with libpcre for -P)
This uses \$\K (equivalent to (?<=\$)) to look behind at the first $ and (?=\$) to look ahead to the next $. Since the look-ahead does not consume the closing $, that $ can also serve as the opening delimiter of the next match, and therefore d1ce0 is available to be found.
Here's a single POSIX sed command to extract those:
$ sed 's/^[^$]*\$//; s/\$[^$]*$//; s/\$/\n/g' \
<<<'b5f1e7$bfc2439c621353$d1ce0$629f$b8b5'
bfc2439c621353
d1ce0
629f
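This avoids GNU-only options and is meant to work on POSIX-compatible systems (such as OS X), with one caveat: \n in the replacement is understood by GNU sed but is not guaranteed by POSIX, so other seds may need a backslash followed by a literal newline instead. It removes the leading and trailing portions that aren't wanted, then replaces each $ with a newline.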
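If the three-element reading is the one you want, it can also be sketched with awk by printing every field strictly between the first and the last $. The variable names s and inner are made up for this example.

```bash
s='b5f1e7$bfc2439c621353$d1ce0$629f$b8b5'

# Split on $ and print fields 2 .. NF-1, one per line.
inner=$(printf '%s\n' "$s" | awk -F '$' '{ for (i = 2; i < NF; i++) print $i }')

printf '%s\n' "$inner"
```

Compared with stepping by two fields, this treats every $ as both a closing and an opening delimiter, so d1ce0 is included.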
Using bash regex:
var="b5f1e7\$bfc2439c621353\$d1ce0\$629f\$b8b5" # string to var
while [[ $var =~ ([^$]*\$)([^$]*)\$(.*) ]] # matching
do
echo -n "${BASH_REMATCH[2]} " # 2nd element has the match
var="${BASH_REMATCH[3]}" # 3rd is the rest of the string
done
echo # trailing newline
bfc2439c621353 629f

sed & regex expression

I'm trying to add the string 'chr' at the start of the lines where it is missing. This is necessary only for lines that do not start with '#'.
At first I used grep + sed, as follows, but I want the command to overwrite the original file.
grep -v "^#" 5b110660bf55f80059c0ef52.vcf | grep -v 'chr' | sed 's/^/chr/g'
So, to run the command in file I write this:
sed -i -E '/^#.*$|^chr.*$/ s/^/chr/' 5b110660bf55f80059c0ef52.vcf
This is the content of the vcf file.
##FORMAT=<ID=DP4,Number=4,Type=Integer,Description="#ref plus strand,#ref minus strand, #alt plus strand, #alt minus strand">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 24430-0009S21_GM17-12140
1 955597 95692 G T 1382 PASS VARTYPE=1;BGN=0.00134309;ARL=150;DER=53;DEA=55;QR=40;QA=39;PBP=1091;PBM=300;TYPE=SNP;DBXREF=dbSNP:rs115173026,g1000:0.2825,esp5400:0.2755,ExAC:0.2290,clinvar:rs115173026,CLNSIG:2,CLNREVSTAT:mult,CLNSIGLAB:Benign;SGVEP=AGRN|+|NM_198576|1|c.45G>T|p.:(p.Pro15Pro)|synonymous GT:DP:AD:DP4 0/1:125:64,61:50,14,48,13
chr1 957898 82729935 G T 1214 off_target VARTYPE=1;BGN=0.00113362;ARL=149;DER=50;DEA=55;QR=38;QA=40;PBP=245;PBM=978;NVF=0.53;TYPE=SNP;DBXREF=dbSNP:rs2799064,g1000:0.3285;SGVEP=AGRN|+|NM_198576|2|c.463+56G>T|.|intronic GT:DP:AD:DP4 0/1:98:47,51:9,38,10,41
If I understand what is your expected result, try:
sed -ri '/^(#|chr)/! s/^/chr/' file
Your question isn't clear and you didn't provide the expected output so we can't test a potential solution but if all you want is to add chr to the start of lines where it's not already present and which don't start with # then that's just:
awk '!/^(#|chr)/{$0="chr" $0} 1' file
To overwrite the original file using GNU awk would be:
awk -i inplace '!/^(#|chr)/{$0="chr" $0} 1' file
and with any awk:
awk '!/^(#|chr)/{$0="chr" $0} 1' file > tmp && mv tmp file
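A quick way to check the awk program before overwriting anything is to run it on a small sample. In this sketch the wrapper name add_chr and the shortened VCF-like lines are invented for the test; they are not the real file contents.

```bash
# Hypothetical wrapper around the awk program above.
add_chr() {
    awk '!/^(#|chr)/ { $0 = "chr" $0 } 1' "$1"
}

sample=$(mktemp)
printf '##meta\n#CHROM POS\n1 955597\nchr1 957898\n' > "$sample"

add_chr "$sample"
```

Header lines and the line that already starts with chr pass through unchanged; only "1 955597" becomes "chr1 955597".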
This can be done with a single sed invocation. The script itself is something like the following.
If you have an input of format
$ echo -e '#\n#\n123chr456\n789chr123\nabc'
#
#
123chr456
789chr123
abc
then prepending chr to lines that are not comments and do not already contain chr is done as
$ echo -e '#\n#\n123chr456\n789chr123\nabc' | sed '/^#/ {p
d
}
/chr/ {p
d
}
s/^/chr/'
which prints
#
#
123chr456
789chr123
chrabc
(Note the multiline sed script.)
Now you only need to run this script on a file in-place (-i in modern sed versions.)

Extract all numbers from a text file and store them in another file

I have a text file which has lots of lines. I want to extract all the numbers from that file.
The file contains text and numbers, and each line contains only one number.
How can I do it using sed or awk in a bash script?
I tried
#! /bin/bash
sed 's/\([0-9.0-9]*\).*/\1/' <myfile.txt >output.txt
but this didn't work.
grep can handle this:
grep -Eo '[0-9\.]+' myfile.txt
-o tells grep to print only the matches, and [0-9\.]+ is a regular expression that matches runs of digits and periods.
To put all numbers on one line and save them in output.txt:
echo $(grep -Eo '[0-9\.]+' myfile.txt) >output.txt
Text files should normally end with a newline character. The use of echo above ensures that this happens.
Non-GNU grep:
If your grep does not support the -o flag, try:
echo $(tr ' ' '\n' <myfile.txt | grep -E '[0-9\.]+') >output.txt
This uses tr to replace all spaces with newlines (so each number appears separately on a line) and then uses grep to search for numbers.
tr -sc '0-9.' ' ' < "$file"
This will transform every run of non-digit-or-period characters into a single space. Note that tr reads only standard input, so the file has to be redirected into it.
You can also use Bash:
while read line; do
if [[ $line =~ [0-9\.]+ ]]; then
echo $BASH_REMATCH
fi
done <myfile.txt >output.txt
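One caveat with the [0-9\.]+ patterns above is that they also match a lone period, such as the one ending a sentence. A slightly stricter pattern is sketched below; the function name extract_numbers and the sample text are assumptions, and grep -o is assumed to be available (it is in GNU and BSD grep).

```bash
# Hypothetical helper: print integers and decimals, one per line.
extract_numbers() {
    grep -Eo -- '[0-9]+(\.[0-9]+)?' "$1"
}

sample=$(mktemp)
printf 'price is 3.50 dollars.\nwidth 42 cm\n' > "$sample"

extract_numbers "$sample"
```

Here the trailing period after "dollars" is ignored because a period only counts when digits follow it.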

How can i display the second matched regex in sed

Suppose I have this text
The code for 233-CO is the main reason for 45-DFG and this 45-GH
Now I have this regexp \s[0-9]+-\w+ which matches 233-CO, 45-DFG and 45-GH.
How can I display just the third match 45-GH?
sed -re 's/\s[0-9]+-\w+/\3/g' file.txt
where \3 should be the third regexp match.
Is it mandatory to use sed? You could do it with grep, using arrays:
text="The code for 233-CO is the main reason for 45-DFG and this 45-GH"
matches=( $(echo "$text" | grep -o -m 3 '\s[0-9]\+-\w\+') ) # store first 3 matches in array
echo "${matches[0]} ${matches[2]}" # print first and third match
To find the last occurrence of your pattern, you can use this:
$ sed -re 's/.*\s([0-9]+-\w+).*/\1/g' file
45-GH
If awk is acceptable, here is an awk one-liner: you give it the number of the match you want to grab, and it prints the matched string. (The three-argument match() requires GNU awk.)
awk -vn=$n '{l=$0;for(i=1;i<=n;i++){match(l,/\s[0-9]+-\w+/,a);l=substr(l,RSTART+RLENGTH);}print a[0]}' file
test
kent$ echo $STR #so we have 7 matches in str
The code for 233-CO is the main reason for 45-DFG and this 45-GH,foo 004-AB, bar 005-CC baz 006-DDD and 007-AWK
kent$ n=6 #now I want the 6th match
#here you go:
kent$ awk -vn=$n '{l=$0;for(i=1;i<=n;i++){match(l,/\s[0-9]+-\w+/,a);l=substr(l,RSTART+RLENGTH);}print a[0]}' <<< $STR
006-DDD
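The same match-then-trim loop can be written without GNU extensions by using the two-argument match() and taking the matched text from RSTART and RLENGTH. This is a sketch under assumptions: the function name nth_match is made up, and \s and \w are replaced by an explicit bracket expression, so the leading whitespace is not part of the match.

```bash
# Hypothetical helper: nth_match N  (reads text on stdin, prints the Nth match)
nth_match() {
    awk -v n="$1" '{
        rest = $0
        for (i = 1; i <= n; i++) {
            # Fewer than n matches on this line: print nothing for it.
            if (!match(rest, /[0-9]+-[A-Za-z0-9_]+/)) next
            found = substr(rest, RSTART, RLENGTH)
            rest = substr(rest, RSTART + RLENGTH)
        }
        print found
    }'
}

text="The code for 233-CO is the main reason for 45-DFG and this 45-GH"
printf '%s\n' "$text" | nth_match 3
```

For the sample text this prints 45-GH; asking for a fourth match prints nothing.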
This might work for you (GNU sed):
sed -r 's/\b[0-9]+-[A-Z]+\b/\n&\n/3;s/.*\n(.*)\n.*/\1/' file
s/\b[0-9]+-[A-Z]+\b/\n&\n/3 prepend and append \n (newlines) to the third (n) pattern in question.
s/.*\n(.*)\n.*/\1/ delete the text before and after the pattern
With grep for matching and sed for printing the occurrence:
$ egrep -o '\b[0-9]+-\w+' file | sed -n '1p'
233-CO
$ egrep -o '\b[0-9]+-\w+' file | sed -n '2p'
45-DFG
$ egrep -o '\b[0-9]+-\w+' file | sed -n '3p'
45-GH
Or with a little awk passing the occurrence to print using the variable o:
$ awk -v o=1 '{for(i=0;i++<NF;)if($i~/[0-9]+-\w+/&&j++==o-1)print $i}' file
233-CO
$ awk -v o=2 '{for(i=0;i++<NF;)if($i~/[0-9]+-\w+/&&j++==o-1)print $i}' file
45-DFG
$ awk -v o=3 '{for(i=0;i++<NF;)if($i~/[0-9]+-\w+/&&j++==o-1)print $i}' file
45-GH