Egrep expression: how to unescape single quotes when reading from file? - regex

I need to use egrep to obtain an entry in an index file.
In order to find the entry, I use the following command:
egrep "^$var_name" index
$var_name is the variable read from a var list file:
while read var_name; do
egrep "^$var_name" index
done < list
One of the possible keys usually comes in this format:
$ERROR['SOME_VAR']
My index file is in the form:
$ERROR['SOME_VAR'] --> n
Where n is the line where the variable is found.
The problem is that $var_name is automatically escaped when read. When I enable the debug mode, I get the following command being executed:
+ egrep '^$ERRORS['\''SELECT_COUNTRY'\'']' index
The command above doesn't work, because egrep will try to interpret the pattern.
If I don't use the extended version, using grep or fgrep, the command will work only if I remove the ^ anchor:
grep -F "$var_name" index # this actually works
The problem is that I need to ensure that the match is made at the beginning of the line.
Ideas?

set -x shows the command being executed in shell notation.
The backslashes you see do not become part of the argument, they're just printed by set -x to show the executed command in a copypastable format.
Your problem is not too much escaping, but too little: $ in regex means "end of line", so ^$ERROR will never match anything. Similarly, [ ] is a character range, and will not match literal square brackets.
The correct regex to match your pattern would be ^\$ERROR\['SOME_VAR'], equivalent to the shell argument in egrep "^\\\$ERROR\['SOME_VAR']".
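A quick way to sanity-check that escaping, feeding the question's sample index line straight into egrep:
printf '%s\n' "\$ERROR['SOME_VAR'] --> n" | egrep "^\\\$ERROR\['SOME_VAR']"
# the line is printed, confirming the escaped pattern matches at the start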
Your options to fix this are:
If you expect to be able to use regex in your input file, you need to include regex escapes like above, so that your patterns are valid.
If you expect to be able to use arbitrary, literal strings, use a tool that can match flexibly and literally. This requires jumping through some hoops, since UNIX tools are, for legacy reasons, sloppy about this.
Here's one with awk:
while IFS= read -r line
do
export line
gawk 'BEGIN{var=ENVIRON["line"];} substr($0, 1, length(var)) == var' index
done < list
It passes the string in through the environment (because -v mangles backslash escapes) and then literally compares it against the start of each input line.
Here's an example invocation:
$ cat script
while IFS= read -r line
do
export line
gawk 'BEGIN{var=ENVIRON["line"];} substr($0, 1, length(var)) == var' index
done < list
$ cat list
$ERRORS['SOME_VAR']
\E and \Q
'"'%##%*'
$ cat index
hello world
$ERRORS['SOME_VAR'] = 'foo';
\E and \Q are valid strings
'"'%##%*' too
etc
$ bash script
$ERRORS['SOME_VAR'] = 'foo';
\E and \Q are valid strings
'"'%##%*' too
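If you'd rather stay with grep, you can also escape the pattern yourself before matching. A minimal sketch, assuming GNU tools and patterns without embedded newlines:
while IFS= read -r line
do
# backslash-escape every ERE metacharacter in the literal string
esc=$(printf '%s' "$line" | sed 's@[][\.*^$(){}?+|]@\\&@g')
grep -E "^$esc" index
done < list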

You can use printf "%q":
while read -r var_name; do
egrep "^$(printf "%q\n" "$var_name")" index
done < list
Update: If your grep supports PCRE (GNU grep's -P option), you can also do:
while read -r var_name; do
grep -P "^\Q$var_name\E" index
done < list
Here \Q and \E make the string in between literal, removing the special meaning of all regex symbols. Note that \Q...\E is a PCRE feature, so it requires grep -P; plain egrep (ERE) does not support it.
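One caveat, echoing the \E and \Q line in the test data above: this quoting breaks if the search string itself contains \E, since the embedded \E terminates the quoted section early. A minimal illustration with a hypothetical value:
var_name='foo\Ebar'
grep -P "^\Q$var_name\E" index # matches foobar, not the literal foo\Ebar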

Related

Regex -> extracting fixed position occurrences from complex string

I have a string like this one below (nvram extract) that is used by tinc VPN to define the network hosts:
1<host1<host1.network.org<<0<10.10.10.0/24<<Ed25519PublicKey = 8dtRRgAaTbUNtPxW9U3nGn6U7uvfIPwRo1wnx7xMIUH<Subnet = 10.10.3.0/24>1<host2<host2.network.org<<0<10.10.9.0/24<<Ed25519PublicKey = irn48tqF2Em4rIG0ggBmpEfaVKtkl6DmGdSzTHMmVEI<>0<host3<host3.network.org<<0<10.10.11.0/24<<Ed25519PublicKey = wQt1sFwOsd1hnBaNGHq4JDyib22fOg1YqzOp0p08ZTD<>
I'm trying to extract from the above:
host1.network.org
host2.network.org
host3.network.org
The hostnames and keys are made up, but the structure of the input string is accurate. By the way, the end node could just as well be defined as an IP address, so I'm trying to extract what's between the second occurrence of "<" and the first occurrence of "<<". Since this is a multi-match, the occurrences are counted from either the beginning of the line or the ">" character. So the above could be read as follows:
1<host1<host1.network.org<<0<10.10.10.0/24<<Ed25519PublicKey = 8dtRRgAaTbUNtPxW9U3nGn6U7uvfIPwRo1wnx7xMIUH<Subnet = 10.10.3.0/24>
1<host2<host2.network.org<<0<10.10.9.0/24<<Ed25519PublicKey = irn48tqF2Em4rIG0ggBmpEfaVKtkl6DmGdSzTHMmVEI<>
0<host3<host3.network.org<<0<10.10.11.0/24<<Ed25519PublicKey = wQt1sFwOsd1hnBaNGHq4JDyib22fOg1YqzOp0p08ZTD<>
As I need this info in a shell script, I guess I would need to store each host/IP as an element of an array.
I have used regexp online editors, and managed to work out this string:
^[0|1]<.*?(\<(.*?)\<<)|>[0|1]<.*?(\<(.*?)\<)
however, if I run
grep -Eo '^[0|1]<.*?(\<(.*?)\<<)|>[0|1]<.*?(\<(.*?)\<)'
against the initial string, I get the full string in return, so I must be doing something wrong :-/
P.S. running on BusyBox:
BusyBox v1.25.1 (2017-05-21 14:11:58 CEST) multi-call binary.
Usage: grep [-HhnlLoqvsriwFE] [-m N] [-A/B/C N] PATTERN/-e PATTERN.../-f FILE [FILE]...
Search for PATTERN in FILEs (or stdin)
-H Add 'filename:' prefix
-h Do not add 'filename:' prefix
-n Add 'line_no:' prefix
-l Show only names of files that match
-L Show only names of files that don't match
-c Show only count of matching lines
-o Show only the matching part of line
-q Quiet. Return 0 if PATTERN is found, 1 otherwise
-v Select non-matching lines
-s Suppress open and read errors
-r Recurse
-i Ignore case
-w Match whole words only
-x Match whole lines only
-F PATTERN is a literal (not regexp)
-E PATTERN is an extended regexp
-m N Match up to N times per file
-A N Print N lines of trailing context
-B N Print N lines of leading context
-C N Same as '-A N -B N'
-e PTRN Pattern to match
-f FILE Read pattern from file
Thanks!
OK, no response to my comment so I'll enter it as an answer. How about
\w*[a-z]\w*(\.\w*[a-z]\w*)+
It matches at least two parts of a fully qualified name, separated by a dot.
grep -Eo '\w*[a-z]\w*(\.\w*[a-z]\w*)+'
yields
host1.network.org
host2.network.org
host3.network.org
(assuming your string is entered on stdin ;)
The regex you have is based on capturing groups, and with grep you can only get full matches. Besides, you use -E (the POSIX ERE flavor), while your regex is actually not POSIX ERE compatible, as it contains lazy quantifiers that are not supported by this flavor.
I think you can extract all non-< chars between < and << followed by a digit and then a <, using a PCRE regex (-P option):
s='1<host1<host1.network.org<<0<10.10.10.0/24<<Ed25519PublicKey = 8dtRRgAaTbUNtPxW9U3nGn6U7uvfIPwRo1wnx7xMIUH<Subnet = 10.10.3.0/24>1<host2<host2.network.org<<0<10.10.9.0/24<<Ed25519PublicKey = irn48tqF2Em4rIG0ggBmpEfaVKtkl6DmGdSzTHMmVEI<>0<host3<host3.network.org<<0<10.10.11.0/24<<Ed25519PublicKey = wQt1sFwOsd1hnBaNGHq4JDyib22fOg1YqzOp0p08ZTD<>'
echo $s | grep -oP '(?<=<)[^<]+(?=<<[0-9]<)'
Output:
host1.network.org
host2.network.org
host3.network.org
Here, (?<=<) is a positive lookbehind that only checks for the < presence immediately to the left of the current location but does not add < to the match value, [^<]+ matches 1+ chars other than < and (?=<<[0-9]<) (a positive lookahead) requires <<, then a digit, and then a < but again does not add these chars to the match.
If you have no PCRE option in grep, try replacing all the text you do not need with some char, and then either split with awk, or use grep:
echo $s | \
sed 's/[^<]*<[^<]*<\([^<][^<]*\)<<[0-9]<[^<]*<<[^<]*[<>]*/|\1/g' | \
grep -oE '[^|]+'
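Another option that needs neither PCRE nor lookarounds: treat ">" as the record separator and "<" as the field separator, so the host/IP is always the third field of each record. A hedged awk sketch, assuming that field layout holds for every record (it should work even with BusyBox awk):
echo "$s" | awk -v RS='>' -F'<' 'NF >= 3 { print $3 }'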

Grep invert on string matched, not line matched

I'll keep this explanation of why I need help to a minimum. One of my file directories got hacked through XSS, and the attack placed a long string at the beginning of all PHP files. I've tried to use sed to replace the string with nothing, but it won't work because the pattern to match includes many, many characters that would need to be escaped.
I found out that I can use fgrep to match a fixed string saved in a pattern file, but I'd like to remove the matched string (NOT THE LINE) from each file; grep's -v inverts the result per line, rather than per matched string.
This is the command I'm using on an example file that contains the hacked string:
fgrep -v -f ~/hacked-string.txt example.php
I need the output to contain the <?php that's at the end of the line (sometimes it's a <style> tag), but the -v option excludes the whole line, so the output doesn't contain the <?php at all.
NOTE
I've tried to use the -o or --only-matching which outputs nothing instead:
fgrep -f ~/hacked-string.txt example.php --only-matching -v
Is there another option in grep that I can use to invert on the end of the matched pattern, rather than the line where the pattern was matched? Or alternatively, is there an easier option to replace the hacked string in all .php files?
Here is a small snippet of what's in hacked-string.txt (line breaks added for readability):
]55Ld]55#*<%x5c%x7825bG9}:}.}-}!#*<%x55c%x7825)
dfyfR%x5c%x7827tfs%x5c%x7c%x785c%x5c%x7825j:^<!
%x5c%x7825w%x5c%x7860%x5c%x785c^>Ew:25tww**WYsb
oepn)%x5c%x7825bss-%x5c%x7825r%x5c%x7878B%x5c%x
7825h>#]y3860msvd},;uqpuft%x5c%x7860msvd}+;!>!}
%x5c%x7827;!%x5c%x7825V%x5c%x7827{ftmfV%x5e56+9
9386c6f+9f5d816:+946:ce44#)zbssb!>!ssbnpe_GMFT%
x5c5c%x782f#00#W~!%x5c%x7825t2w)##Qtjw)#]82#-#!
#-%x5c%x7825tmw)%x5c%x78w6*%x5c%x787f_*#fubfsdX
k5%x5c%xf2!>!bssbz)%x5c%x7824]25%x5c%x7824-8257
-K)fujs%x5c%x7878X6<#o]o]Y%x5c%x78257;utpI#7>-1
-bubE{h%x5c%x7825)sutcvt)!gj!|!*bubEpqsut>j%x5c
%x7825!*72!%x5c%x7827!hmg%x5c%x78225>2q%x5c%x7
Thanks in advance!
I think what you are asking is this:
"Is it possible to use the grep utility to remove all instances of a fixed string (which might contain lots of regex metacharacters) from a file?"
In that case, the answer is "No".
What I think you wanted to ask was:
"What is the easiest way to remove all instances of a fixed string (which might contain lots of regex metacharacters) from a file?"
Here's one reasonably simple solution:
delete_string() {
awk -v s="$1" '{while(i=index($0,s))$0=substr($0,1,i-1)substr($0,i+length(s))}1'
}
delete_string 'some_hideous_string_with*!"_inside' < original_file > new_file
The shell syntax is slightly fragile; it will break if the string contains an apostrophe ('). However, you can read a raw string from stdin into a variable with:
$ IFS= read -r the_string
absolutely anything here
which will work with any string which doesn't contain a newline or a NUL character. Once you have the string in a variable, you can use the above function:
delete_string "$the_string" < original_file > new_file
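To run this across a whole directory of infected files, a hedged usage sketch (assuming the entire hacked string sits on the first line of ~/hacked-string.txt; test on copies first):
IFS= read -r the_string < ~/hacked-string.txt
for f in *.php; do
delete_string "$the_string" < "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done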
Here's another possible one liner, using python:
delete_string() {
python -c 'import sys;[sys.stdout.write(l.replace(r"""'"$1"'""","")) for l in sys.stdin]'
}
This won't handle strings which have three consecutive quotes (""").
Is the hacked string the same in every file?
If the length of the hacked string in chars was 1234, then you can use
tail -c +1235 file.php > fixed-file.php
for each infected file.
Note that tail -c +1235 tells it to start output at the 1235th character of the input file.
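If you don't want to count the characters by hand, you can derive the offset from the pattern file. A sketch, assuming ~/hacked-string.txt holds exactly the injected string plus a trailing newline:
len=$(( $(wc -c < ~/hacked-string.txt) - 1 )) # subtract the trailing newline
for f in *.php; do
tail -c +$((len + 1)) "$f" > "$f.fixed" && mv "$f.fixed" "$f"
done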
With perl:
perl -i.hacked -pe "s/\Q$(<hacked-string.txt)\E//g" example.php
Notes:
The $(<file) bit is a bash shortcut to read the contents of a file.
The \Q and \E bits are from perl, they treat the stuff in between as plain characters, ignoring regex metachars.
The -i.hacked option will edit the file in-place, creating a backup "example.php.hacked"
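Since perl accepts multiple file arguments, the same one-liner extends to every infected file in one pass, keeping a .hacked backup of each:
perl -i.hacked -pe "s/\Q$(<hacked-string.txt)\E//g" ./*.php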

"sed" special characters handling

We have a sed command in our script to replace the file content with values from variables.
For example:
export value="dba01upc\Fusion_test"
sed -i "s%{"sara_ftp_username"}%$value%g" /home_ldap/user1/placeholder/Sara.xml
The sed command ignores special characters like '\' and replaces with the string "dba01upcFusion_test", without the '\'.
It works if I do the export like export value='dba01upc\Fusion_test' (with the value surrounded by ''), but unfortunately our client wants to export the original text dba01upc\Fusion_test as-is, without single/double quotes, and he doesn't want to add any extra characters to the text.
Can anyone let me know how to make sed place the text with its special characters?
Before Replacement : Sara.xml
<?xml version="1.0" encoding="UTF-8"?>
<ser:service-account >
<ser:description/>
<ser:static-account>
<con:username>{sara_ftp_username}</con:username>
</ser:static-account>
</ser:service-account>
After Replacement : Sara.xml
<?xml version="1.0" encoding="UTF-8"?>
<ser:service-account>
<ser:description/>
<ser:static-account>
<con:username>dba01upcFusion_test</con:username>
</ser:static-account>
</ser:service-account>
Thanks in advance
You cannot robustly solve this problem with sed. Just use awk instead:
awk -v old="string1" -v new="string2" '
idx = index($0,old) {
$0 = substr($0,1,idx-1) new substr($0,idx+length(old))
}
1' file
Ah, @mklement0 has a good point: to stop escapes from being interpreted, you need to pass the values in the arg list along with the file names and then assign the variables from that, rather than assigning values to the variables with -v (see the summary I wrote a LONG time ago for the comp.unix.shell FAQ at http://cfajohnson.com/shell/cus-faq-2.html#Q24 but apparently had forgotten!).
The following will robustly make the desired substitution (a\ta -> e\tf) on every search string found on every line:
$ cat tst.awk
BEGIN {
old=ARGV[1]; delete ARGV[1]
new=ARGV[2]; delete ARGV[2]
lgthOld = length(old)
}
{
head = ""; tail = $0
while ( idx = index(tail,old) ) {
head = head substr(tail,1,idx-1) new
tail = substr(tail,idx+lgthOld)
}
print head tail
}
$ cat file
a\ta a a a\ta
$ awk -f tst.awk 'a\ta' 'e\tf' file
e\tf a a e\tf
The white space in file is tabs. You can shift ARGV[3] down and adjust ARGC if you like but it's not necessary in most cases.
Update with the benefit of hindsight, to present options:
Update 2: If you're intent on using sed, see the - somewhat cumbersome, but now robust and generic - solution below.
If you want a robust, self-contained awk solution that also properly handles both arbitrary search and replacement strings (but cannot incorporate regex features such as word-boundary assertions), see Ed Morton's answer.
If you want a pure bash solution and your input files are small and preserving multiple trailing newlines is not important, see Charles Duffy's answer.
If you want a full-fledged third-party templating solution, consider, for instance, j2cli, a templating CLI for Jinja2 - if you have Python and pip, install with sudo pip install j2cli.
Simple example (note that since the replacement string is provided via a file, this may not be appropriate for sensitive data; note the double braces ({{...}})):
value='dba01upc\Fusion_test'
echo "sara_ftp_username=$value" >data.env
echo '<con:username>{{sara_ftp_username}}</con:username>' >tmpl.xml
j2 tmpl.xml data.env # -> <con:username>dba01upc\Fusion_test</con:username>
If you use sed, careful escaping of both the search and the replacement string is required, because:
As Ed Morton points out in a comment elsewhere, sed doesn't support use of literal strings as replacement strings - it invariably interprets special characters/sequences in the replacement string.
Similarly, the search string literal must be escaped in a way that its characters aren't mistaken for special regular-expression characters.
The following uses two generic helper functions that perform this escaping (quoting) that apply techniques explained at "Is it possible to escape regex characters reliably with sed?":
#!/usr/bin/env bash
# SYNOPSIS
# quoteRe <text>
# DESCRIPTION
# Quotes (escapes) the specified literal text for use in a regular expression,
# whether basic or extended - should work with all common flavors.
quoteRe() { sed -e 's/[^^]/[&]/g; s/\^/\\^/g; $!a\'$'\n''\\n' <<<"$1" | tr -d '\n'; }
# '
# SYNOPSIS
# quoteSubst <text>
# DESCRIPTION
# Quotes (escapes) the specified literal string for safe use as the substitution string (the 'new' in `s/old/new/`).
quoteSubst() {
IFS= read -d '' -r < <(sed -e ':a' -e '$!{N;ba' -e '}' -e 's/[&/\]/\\&/g; s/\n/\\&/g' <<<"$1")
printf %s "${REPLY%$'\n'}"
}
# The search string.
search='{sara_ftp_username}'
# The replacement string; a demo value with characters that need escaping.
value='&\1%"'\'';<>/|dba01upc\Fusion_test'
# Use the appropriately escaped versions of both strings.
sed "s/$(quoteRe "$search")/$(quoteSubst "$value")/g" <<<'<el>{sara_ftp_username}</el>'
# -> <el>&\1%"';<>/|dba01upc\Fusion_test</el>
Both quoteRe() and quoteSubst() correctly handle multi-line strings.
Note, however, given that sed reads a single line at a time by default, use of quoteRe() with multi-line strings only makes sense in sed commands that explicitly read multiple (or all) lines at once.
quoteRe() is always safe to use with a command substitution ($(...)), because it always returns a single-line string (newlines in the input are encoded as '\n').
By contrast, if you use quoteSubst() with a string that has trailing newlines, you mustn't use $(...), because the latter will remove the last trailing newline and therefore break the encoding (since quoteSubst() \-escapes actual newlines, the string returned would end in a dangling \).
Thus, for strings with trailing newlines, use IFS= read -d '' -r escapedValue < <(quoteSubst "$value") to read the escaped value into a separate variable first, then use that variable in the sed command.
This can be done with bash builtins alone -- no sed, no awk, etc.
orig='{sara_ftp_username}' # put the original value into a variable
new='dba01upc\Fusion_test' # ...no need to 'export'!
contents=$(<Sara.xml) # read the file's content into
new_contents=${contents//"$orig"/$new} # use parameter expansion to replace
printf '%s' "$new_contents" >Sara.xml # write new content to disk
See the relevant part of BashFAQ #100 for information on using parameter expansion for string substitution.
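One subtlety worth a small illustration (with hypothetical strings): quoting the pattern, as in ${contents//"$orig"/$new}, is what forces a literal match; unquoted, characters like [ and ] are treated as glob syntax:
s="value['key']"
pat="['key']"
echo "${s//$pat/X}" # unquoted: ['key'] acts as a glob character class
echo "${s//"$pat"/X}" # quoted: matched literally -> valueX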

In GNU Grep or another standard bash command, is it possible to get a resultset from regex?

Consider the following:
var="text more text and yet more text"
echo $var | egrep "yet more (text)"
It should be possible to get the result of the regex as the string: text
However, I don't see any way to do this in bash with grep or its siblings at the moment.
In perl, php or similar regex engines:
$output = preg_match('/yet more (text)/', 'text more text yet more text');
$output[1] == "text";
Edit: To elaborate on why I can't just use multiple regexes: in the end I will have a regex with multiple of these (pictured below), so I need to be able to get all of them. This also eliminates the option of using lookahead/lookbehind (as they are all variable-length).
egrep -i "([0-9]+) +$USER +([0-9]+).+?(/tmp/Flash[0-9a-z]+) "
Example input as requested, straight from lsof (Replace $USER with "j" for this input data):
npviewer. 17875 j 11u REG 8,8 59737848 524264 /tmp/FlashXXu8pvMg (deleted)
npviewer. 17875 j 17u REG 8,8 16037387 524273 /tmp/FlashXXIBH29F (deleted)
The end goal is to cp /proc/$var1/fd/$var2 ~/$var3 for every line, which ends up "Downloading" flash files (Flash used to store in /tmp but they drm'd it up)
So far I've got:
#!/bin/bash
regex="([0-9]+) +j +([0-9]+).+?/tmp/(Flash[0-9a-zA-Z]+)"
echo "npviewer. 17875 j 11u REG 8,8 59737848 524264 /tmp/FlashXXYOvS8S (deleted)" |
sed -r -n -e " s%^.*?$regex.*?\$%\1 \2 \3%p " |
while read -a array
do
echo /proc/${array[0]}/fd/${array[1]} ~/${array[2]}
done
It cuts off the first digits of the first value to return, and I'm not familiar enough with sed to see what's wrong.
End result for downloading flash 10.2+ videos (Including, perhaps, encrypted ones):
#!/bin/bash
lsof | grep "/tmp/Flash" | sed -r -n -e " s%^.+? ([0-9]+) +$USER +([0-9]+).+?/tmp/(Flash[0-9a-zA-Z]+).*?\$%\1 \2 \3%p " |
while read -a array
do
cp /proc/${array[0]}/fd/${array[1]} ~/${array[2]}
done
Edit: look at my other answer for a simpler bash-only solution.
So, here is the solution using sed to fetch the right groups and split them up. You still have to use bash to read them afterwards. (And this way it only works if the groups themselves do not contain any spaces; otherwise we would have to use another divider character and patch read by setting $IFS to this value.)
#!/bin/bash
USER=j
regex=" ([0-9]+) +$USER +([0-9]+).+(/tmp/Flash[0-9a-zA-Z]+) "
sed -r -n -e " s%^.*$regex.*\$%\1 \2 \3%p " |
while read -a array
do
cp /proc/${array[0]}/fd/${array[1]} ~/${array[2]}
done
Note that I had to adapt your last regex group to allow uppercase letters, and added a space at the beginning to be sure to capture the whole block of numbers. Alternatively here a \b (word limit) would have worked, too.
Ah, I forgot to mention that you should pipe the text to this script, like this:
./grep-result.sh < grep-result-test.txt
(provided your files are named like this). Instead you can add a < grep-result-test.txt after the sed call (before the |), or prepend the line with cat grep-result-test.txt |.
How does it work?
sed -r -n calls sed in extended-regexp-mode, and without printing anything automatically.
-e " s%^.*$regex.*\$%\1 \2 \3%p " gives the sed program, which consists of a single s command.
I'm using % instead of the normal / as parameter separator, since / appears inside the regex and I don't want to escape it.
The regex to search is prefixed by ^.* and suffixed by .*$ to grab the whole line (and avoid printing parts of the rest of the line).
Note that this .* grabs greedy, so we have to insert a space into our regexp to avoid it grabbing the start of the first digit group too.
The replacement text consists of the three parenthesized groups, separated by spaces.
the p flag at the end of the command says to print out the pattern space after replacement. Since we grabbed the whole line, the pattern space consists of only the replacement text.
So, the output of sed for your example input is this:
17875 11 /tmp/FlashXXu8pvMg
17875 17 /tmp/FlashXXIBH29F
This is much more friendly for reuse, obviously.
Now we pipe this output as input to the while loop.
read -a array reads a line from standard input (which is the output from sed, due to our pipe), splits it into words (at spaces, tabs and newlines), and puts the words into an array variable.
We could also have written read var1 var2 var3 instead (preferably using better variable names), then the first two words would be put to $var1 and $var2, with $var3 getting the rest.
If read succeeded reading a line (i.e. not end-of-file), the body of the loop is executed:
${array[0]} is expanded to the first element of the array and similarly.
When the input ends, the loop ends, too.
This isn't possible using grep or another tool called from a shell prompt/script because a child process can't modify the environment of its parent process. If you're using bash 3.0 or better, then you can use in-process regular expressions. The syntax is perl-ish (=~) and the match groups are available via ${BASH_REMATCH[x]}, where x is the match group.
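A minimal illustration using the question's own string:
var="text more text and yet more text"
re='yet more (text)'
if [[ $var =~ $re ]]; then
echo "${BASH_REMATCH[1]}" # -> text
fi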
After creating my sed solution, I also wanted to try the pure-bash approach suggested by Mark. It works quite fine for me.
#!/bin/bash
USER=j
regex=" ([0-9]+) +$USER +([0-9]+).+(/tmp/Flash[0-9a-zA-Z]+) "
while read
do
if [[ $REPLY =~ $regex ]]
then
echo cp /proc/${BASH_REMATCH[1]}/fd/${BASH_REMATCH[2]} ~/${BASH_REMATCH[3]}
fi
done
(If you upvote this, you should think about also upvoting Mark's answer, since it is essentially his idea.)
The same as before: pipe the text to be filtered to this script.
How does it work?
As said by Mark, the [[ ... ]] special conditional construct supports the binary operator =~, which interprets its right operand (after parameter expansion) as an extended regular expression (just as we want), and matches the left operand against this. (We have again added a space at the front to avoid matching only the last digit.)
When the regex matches, the [[ ... ]] returns 0 (= true), and also puts the parts matched by the individual groups (and the whole expression) into the array variable BASH_REMATCH.
Thus, when the regex matches, we enter the then block, and execute the commands there.
Here again ${BASH_REMATCH[1]} is an array-access to an element of the array, which corresponds to the first matched group. ([0] would be the whole string.)
Another note: Both my scripts accept multi-line input and work on every line which matches. Non-matching lines are simply ignored. If you are inputting only one line, you don't need the loop, a simple if read ; then ... or even read && [[ $REPLY =~ $regex ]] && ... would be enough.
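Spelled out, that loop-free variant for single-line input would look something like this (a sketch, reusing the same $regex as above):
read -r && [[ $REPLY =~ $regex ]] && cp /proc/${BASH_REMATCH[1]}/fd/${BASH_REMATCH[2]} ~/${BASH_REMATCH[3]}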
echo "$var" | pcregrep -o "(?<=yet more )text"
Well, for your simple example, you can do this:
var="text more text and yet more text"
echo $var | grep -e "yet more text" | grep -o "text"

how to use sed, awk, or gawk to print only what is matched?

I see lots of examples and man pages on how to do things like search-and-replace using sed, awk, or gawk.
But in my case, I have a regular expression that I want to run against a text file to extract a specific value. I don't want to do search-and-replace. This is being called from bash. Let's use an example:
Example regular expression:
.*abc([0-9]+)xyz.*
Example input file:
a
b
c
abc12345xyz
a
b
c
As simple as this sounds, I cannot figure out how to call sed/awk/gawk correctly. What I was hoping to do is, from within my bash script, have:
myvalue=$( sed <...something...> input.txt )
Things I've tried include:
sed -e 's/.*([0-9]).*/\\1/g' example.txt # extracts the entire input file
sed -n 's/.*([0-9]).*/\\1/g' example.txt # extracts nothing
My sed (Mac OS X) didn't work with +. I tried * instead, and I added the p flag to print the match:
sed -n 's/^.*abc\([0-9]*\)xyz.*$/\1/p' example.txt
For matching at least one numeric character without +, I would use:
sed -n 's/^.*abc\([0-9][0-9]*\)xyz.*$/\1/p' example.txt
You can use sed to do this
sed -rn 's/.*abc([0-9]+)xyz.*/\1/gp'
-n don't print lines by default
-r use extended regex, so you don't have to escape the capture group parens ()
\1 the capture group match
/g global match
/p print the result
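Against the sample input file above, that looks like this (GNU sed; on BSD/macOS use -E instead of -r):
$ sed -rn 's/.*abc([0-9]+)xyz.*/\1/gp' example.txt
12345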
I wrote a tool for myself that makes this easier
rip 'abc(\d+)xyz' '$1'
I use perl to make this easier for myself. e.g.
perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/'
This runs Perl, the -n option instructs Perl to read in one line at a time from STDIN and execute the code. The -e option specifies the instruction to run.
The instruction runs a regexp on the line read, and if it matches, prints out the contents of the first set of brackets ($1).
You can do this with multiple file names on the end also, e.g.
perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/' example1.txt example2.txt
If your version of grep supports it you could use the -o option to print only the portion of any line that matches your regexp.
If not then here's the best sed I could come up with:
sed -e '/[0-9]/!d' -e 's/^[^0-9]*//' -e 's/[^0-9]*$//'
... which deletes/skips lines with no digits and, for the remaining lines, removes all leading and trailing non-digit characters. (I'm only guessing that your intention is to extract the number from each line that contains one.)
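Run against the sample input, that pipeline yields just the number:
$ sed -e '/[0-9]/!d' -e 's/^[^0-9]*//' -e 's/[^0-9]*$//' example.txt
12345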
The problem with something like:
sed -e 's/.*\([0-9]*\).*/&/'
.... or
sed -e 's/.*\([0-9]*\).*/\1/'
... is that sed only supports "greedy" matching ... so the first .* will match the rest of the line. Unless we can use a negated character class to achieve a non-greedy match, or a version of sed with Perl-compatible or other extensions to its regexes, we can't extract a precise pattern match from within the pattern space (a line).
You can use awk with match() to access the captured group:
$ awk 'match($0, /abc([0-9]+)xyz/, matches) {print matches[1]}' file
12345
This tries to match the pattern abc[0-9]+xyz. If it does so, it stores its slices in the array matches, whose first item is the part captured by ([0-9]+). Since match() returns the character position, or index, of where that substring begins (1, if it starts at the beginning of the string), it triggers the print action.
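Note that the three-argument form of match() is a gawk extension. With a POSIX awk you can get the same effect from RSTART and RLENGTH; a sketch for this specific pattern:
awk 'match($0, /abc[0-9]+xyz/) { print substr($0, RSTART+3, RLENGTH-6) }' file
# RSTART+3 skips the 3-char "abc" prefix; RLENGTH-6 also drops the "xyz" suffix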
With grep you can use a look-behind and look-ahead:
$ grep -oP '(?<=abc)[0-9]+(?=xyz)' file
12345
$ grep -oP 'abc\K[0-9]+(?=xyz)' file
12345
This checks the pattern [0-9]+ when it occurs between abc and xyz and just prints the digits.
perl is the cleanest syntax, but if you don't have perl (not always there, I understand), then the only way to get at the components of a regex match with gawk is the gensub feature.
gawk '/abc[0-9]+xyz/ { print gensub(/.*abc([0-9]+)xyz.*/,"\\1","g"); }' < file
output of the sample input file will be
12345
Note: gensub replaces the entire regex (between the //), so you need to put the .* before and after the abc([0-9]+)xyz to get rid of the text before and after the number in the substitution.
If you want to select lines then strip out the bits you don't want:
egrep 'abc[0-9]+xyz' inputFile | sed -e 's/^.*abc//' -e 's/xyz.*$//'
It basically selects the lines you want with egrep and then uses sed to strip off the bits before and after the number.
You can see this in action here:
pax> echo 'a
b
c
abc12345xyz
a
b
c' | egrep 'abc[0-9]+xyz' | sed -e 's/^.*abc//' -e 's/xyz.*$//'
12345
pax>
Update: obviously if your actual situation is more complex, the REs will need to be modified. For example, if you always had a single number buried within zero or more non-numerics at the start and end:
egrep '[^0-9]*[0-9]+[^0-9]*$' inputFile | sed -e 's/^[^0-9]*//' -e 's/[^0-9]*$//'
The OP's case doesn't specify that there can be multiple matches on a single line, but for the Google traffic, I'll add an example for that too.
Since the OP's need is to extract a group from a pattern, using grep -o will require 2 passes. But, I still find this the most intuitive way to get the job done.
$ cat > example.txt <<TXT
a
b
c
abc12345xyz
a
abc23451xyz asdf abc34512xyz
c
TXT
$ cat example.txt | grep -oE 'abc([0-9]+)xyz'
abc12345xyz
abc23451xyz
abc34512xyz
$ cat example.txt | grep -oE 'abc([0-9]+)xyz' | grep -oE '[0-9]+'
12345
23451
34512
Since processor time is basically free but human readability is priceless, I tend to refactor my code based on the question, "a year from now, what am I going to think this does?" In fact, for code that I intend to share publicly or with my team, I'll even open man grep to figure out what the long options are and substitute those. Like so: grep --only-matching --extended-regexp
Why even need a match group?
gawk/mawk/mawk2 'BEGIN{ FS="(^.*abc|xyz.*$)" } ($2 ~ /^[0-9]+$/) {print $2}'
Let FS collect away both ends of the line.
If $2, the leftover not swallowed by FS, doesn't contain non-numeric characters, that's your answer to print out.
If you're extra cautious, confirm length of $1 and $3 both being zero.
** edited answer after realizing a zero-length $2 would trip up my previous solution
There's a standard piece of code from the awk channel called "FindAllMatches", but it's still very manual: literally just long loops of while(), match(), substr(), more substr(), then rinse and repeat.
If you're looking for ideas on how to obtain just the matched pieces, but with a complex regex that matches multiple times per line, or not at all, try this:
mawk/mawk2/gawk 'BEGIN { srand(); for(x = 0; x < 128; x++ ) {
alnumstr = sprintf("%s%c", alnumstr , x)
};
gsub(/[^[:alnum:]_=]+|[AEIOUaeiou]+/, "", alnumstr)
# resulting str should be 44-chars long :
# all digits, non-vowels, equal sign =, and underscore _
x = 10; do { nonceFS = nonceFS substr(alnumstr, 1 + int(44*rand()), 1)
} while ( --x ); # you can pick any level of precision you need.
# 10 chars randomly among the set is approx. 54-bits
#
# i prefer this set over all ASCII being these
# just about never require escaping
# feel free to skip the _ or = or r/t/b/v/f/0 if you're concerned.
#
# now you've made a random nonce that can be
# inserted right in the middle of just about ANYTHING
# -- ASCII, Unicode, binary data -- (1) which will always fully
# print out, (2) has extremely low chance of actually
# appearing inside any real word data, and (3) even lower chance
# it accidentally alters the meaning of the underlying data.
# (so intentionally leaving them in there and
# passing it along unix pipes remains quite harmless)
#
# this is essentially the lazy man's approach to making nonces
# that kinda-sorta have some resemblance to base64
# encoded, without having to write such a module (unless u have
# one for awk handy)
regex1 = (..); # build whatever regex you want here
FS = OFS = nonceFS;
} $0 ~ regex1 {
gsub(regex1, nonceFS "&" nonceFS); $0 = $0;
# now you've essentially replicated what gawk patsplit( ) does,
# or gawk's split(..., seps) tracking 2 arrays one for the data
# in between, and one for the seps.
#
# via this method, that can all be done upon the entire $0,
# without any of the hassle (and slow downs) of
# reading from associatively-hashed arrays,
#
# simply print out all your even numbered columns
# those will be the parts of "just the match"
for (i = 2; i <= NF; i += 2) print $i
}'
If you also run another OFS = ""; $1 = $1;, then instead of needing the 4-argument split() or patsplit() (both gawk-specific) to see what the regex seps were, the entire $0's fields are now in a data1-sep1-data2-sep2-... pattern, all while $0 looks EXACTLY the same as when you first read in the line: a straight-up print will be byte-for-byte identical to printing immediately upon reading.
I once tested it to the extreme, using a regex that represents valid UTF-8 characters. It took maybe 30 seconds or so for mawk2 to process a 167 MB text file with plenty of CJK Unicode all over, all read in at once into $0, and crank through this split logic, resulting in an NF of around 175,000,000, with each field being a single character of either ASCII or multi-byte UTF-8 Unicode.
You can do it with the shell:
while read -r line
do
case "$line" in
*abc*[0-9]*xyz* )
t="${line##*abc}"
echo "num is ${t%%xyz*}";;
esac
done <"file"
For awk, I would use the following script:
/.*abc([0-9]+)xyz.*/ {
print $0;
next;
}
{
# default: do nothing (awk comments use #, not /* */)
}
Or, condensed into a one-liner:
gawk '/.*abc([0-9]+)xyz.*/' file