grep on unix / linux: how to replace or capture text? - regex

So I'm pretty good with regular expressions, but I'm having some trouble with them on unix. Here are two things I'd love to know how to do:
1) Replace all text except letters, numbers, and underscore
In PHP I'd do this: (works great)
preg_replace('#[^a-zA-Z0-9_]#','',$text).
In bash I tried this (with limited success); it seems like it doesn't allow you to use the full set of regex:
text="my #1 example!"
${text/[^a-zA-Z0-9_]/''}
I tried it with sed but it still seems to have problems with the full regex set:
echo "my #1 example!" | sed s/[^a-zA-Z0-9\_]//
I'm sure there is a way to do it with grep, too, but it was breaking it into multiple lines when i tried:
echo abc\!\#\#\$\%\^\&\*\(222 | grep -Eos '[a-zA-Z0-9\_]+'
And finally I also tried using expr but it seemed like that had really limited support for extended regex...
2) Capture (multiple) parts of text
In PHP I could just do something like this:
preg_match('#(word1).*(word2)#',$text,$matches);
I'm not sure how that would be possible in *nix...

Part 1
You are almost there with sed: just add the g modifier so that the replacement happens globally. Without g, the replacement happens only once.
$ echo "my #1 example!" | sed s/[^a-zA-Z0-9\_]//g
my1example
$
You made the same mistake with your bash pattern replacement too: not making the replacement global:
$ text="my #1 example!"
# non-global replacement. Only the space is deleted.
$ echo ${text/[^a-zA-Z0-9_]/''}
my#1 example!
# global replacement by adding an additional /
$ echo ${text//[^a-zA-Z0-9_]/''}
my1example
Part 2
Capturing works the same in sed as it did in PHP's regex: enclosing the pattern in parentheses triggers capturing:
# swap foo and bar's number using capturing and back reference.
$ echo 'foo1 bar2' | sed -r 's/foo([0-9]+) bar([0-9]+)/foo\2 bar\1/'
foo2 bar1
$
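For the second part of the question, the same capture-group idea can pull several pieces out of a line at once. A minimal sketch, assuming a sed that accepts -E (GNU and BSD sed both do); the sample text and the variable names are made up for illustration:

```shell
# Hypothetical input: two fields we want to capture and reorder.
line='user=alice id=42'

# -n suppresses default output; the trailing p prints only lines where
# the substitution succeeded, so non-matching lines produce nothing.
swapped=$(printf '%s\n' "$line" | sed -nE 's/^user=([a-z]+) id=([0-9]+)$/\2 \1/p')

echo "$swapped"
```

Anchoring the pattern with ^ and $ makes the substitution replace the whole line, so only the reordered captures are left.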

As an alternative to codaddict's nice answer using sed, you could also use tr for the first part of your question.
echo "my #1 _ example!" | tr -d -C '[:alnum:]_'
I've also made use of the [:alnum:] character class, just to show another option.
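The three approaches from the answers so far can be checked against each other. A small sketch, assuming bash. tr takes a set of characters rather than a bracket expression, so the sketch writes the class as '[:alnum:]_', and LC_ALL=C is an added assumption to keep the class ASCII-only:

```shell
text='my #1 example!'

# sed: delete every character outside the allowed set (g = all matches).
via_sed=$(printf '%s' "$text" | LC_ALL=C sed 's/[^a-zA-Z0-9_]//g')

# bash pattern substitution: // replaces every match of the glob pattern.
via_bash=${text//[^a-zA-Z0-9_]/}

# tr: -c complements the set, -d deletes the complement.
via_tr=$(printf '%s' "$text" | LC_ALL=C tr -cd '[:alnum:]_')

printf '%s\n' "$via_sed" "$via_bash" "$via_tr"
```

Note that the bash form uses a glob pattern, not a regex, which is why only bracket expressions and wildcards are available there.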

what do you mean you can't use the regex syntax for bash?
$ text="my #1 example!"
$ echo ${text//[^a-zA-Z0-9_]/}
my1example
You have to use // to replace every occurrence.
For your second question, with bash 3.2 and later (leave the regex unquoted, because a quoted pattern is matched as a literal string):
$ [[ $text =~ (my).*(example) ]]
$ echo ${BASH_REMATCH[1]}
my
$ echo ${BASH_REMATCH[2]}
example
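A common way to sidestep quoting problems with =~ is to keep the pattern in a variable and expand it unquoted. A minimal sketch, assuming bash 3.2 or later; the variable names are arbitrary:

```shell
text='my #1 example!'
re='(my).*(example)'

# $re is left unquoted on purpose: quoting it would make bash match
# the pattern as a literal string instead of a regex.
if [[ $text =~ $re ]]; then
  whole=${BASH_REMATCH[0]}   # entire matched text
  first=${BASH_REMATCH[1]}   # first parenthesised group
  second=${BASH_REMATCH[2]}  # second parenthesised group
  echo "$first / $second"
fi
```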

Related

Linux shell extracting substring between matching patterns

Let's say I have a string poskek|gfgfd|XLSE|a1768|d234|uijjk and I want to extract just the LSE part.
I only know that there will be |X directly before LSE, and | directly after the part I am interested in LSE.
The other answer using sed should work, but I always find sed to be a bit awkward for regex selection, as it's really intended for replacement (hence why either side of the pattern needs to be flanked with .* and the part you actually want needs to be in parentheses). Here's a solution using grep:
grep -Po '\|X\K[^|]+'
-P signals grep to use Perl's regex engine which is more advanced
-o only prints the matching part of the line
\|X match a literal vertical bar and a capital X
\K forget what has currently been matched (do not include it in the final output)
[^|]+ one or more characters other than vertical bars
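Put together with the sample string from the question, the pipeline might look like this. The sketch assumes GNU grep built with PCRE support, since -P is not available in every grep:

```shell
str='poskek|gfgfd|XLSE|a1768|d234|uijjk'

# \K drops the "|X" prefix from the reported match; [^|]+ stops at the next bar.
field=$(printf '%s\n' "$str" | grep -Po '\|X\K[^|]+')

echo "$field"
```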
As a pure bash solution, please try:
str='poskek|gfgfd|XLSE|a1768|d234|uijjk'
ext=${str#*|X}
ext=${ext%%|*}
echo "$ext"
If regex matching is available ([[ ... =~ ... ]] in bash), the following also works:
if [[ $str =~ .*\|X([^|]+) ]]; then
echo "${BASH_REMATCH[1]}"
fi
echo 'poskek|gfgfd|XLSE|a1768|d234|uijjk' | sed -n 's/.*|X\([^|]\+\).*/\1/p'
That ought to do the trick.
Explained:
sed -n will not print anything unless specified
s/ - search and replace
.*|X - match everything up to and including |X
\([^|]\+\) - capture multiple (at least one) character that isn't a |
.* - match the rest of the text (just to "eat it up")
/\1/p - Replace all matched text with the first capture, and print
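The \+ operator in that command is a GNU extension to basic regular expressions. A sketch of a variant that should behave the same on other sed implementations, under the assumption that spelling "one or more" as [^|][^|]* is acceptable:

```shell
str='poskek|gfgfd|XLSE|a1768|d234|uijjk'

# [^|][^|]* means "one non-bar character followed by zero or more",
# which is the portable BRE spelling of [^|]\+.
field=$(printf '%s\n' "$str" | sed -n 's/.*|X\([^|][^|]*\).*/\1/p')

echo "$field"
```

Because the leading .* is greedy, this picks the last |X on the line if there is more than one.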
For this particular case, you could do the rather unconventional:
awk '$1=="X"{$1="";print}' FS= OFS= RS=\|
try this
echo 'poskek|gfgfd|XLSE|a1768|d234|uijjk' |
awk -F "|" '{for(i=1;i<=NF;++i) printf "%s", (substr($i,1,1)=="X"?substr($i,2):"")}'
where
-F is the field separator => '|'
NF is number of fields

Excluding the first 3 characters of a string using regex

Given any string in bash, e.g flaccid, I want to match all characters in the string but the first 3 (in this case I want to exclude "fla" and match only "ccid"). The regex also needs to work in sed.
I have tried positive look behind and the following regex expressions (as well as various other unsuccessful ones):
^.{3}+([a-z,A-Z]+)
sed -r 's/(?<=^....)(.[A-Z]*)/,/g'
Google hasn't been very helpful, as it only produces results like "get first 3 characters ..."
Thanks in advance!
If you want to get all characters but the first 3 from a string, you can use cut:
str="flaccid"
cut -c 4- <<< "$str"
or bash variable substitution:
str="flaccid"
echo "${str:3}"
That will strip the first 3 characters out of your string.
You may just use a capturing group within an expression like ^.{3}(.*) / ^.{3}([a-zA-Z]+) and grab the ${BASH_REMATCH[1]} contents:
#!/bin/bash
text="flaccid"
rx="^.{3}(.*)"
if [[ $text =~ $rx ]]; then
echo ${BASH_REMATCH[1]};
fi
See online Bash demo
In sed, you can likewise use capturing groups / backreferences to get what you need. To just remove the first 3 chars, a simple substitution is enough:
echo "flaccid" | sed 's/.\{3\}//'
See this regex demo. The .\{3\} matches any 3 chars and removes them from the beginning only, since the g modifier is not used.
Both solutions above output ccid, that is, everything except the first 3 chars.
Using sed, just remove them
echo string | sed 's/^...//g'
How is it that no-one has named the most simple and portable solution:
shell "Parameter expansions":
str="flaccid"
echo "${str#???}"
For a regex (bash):
$ str="flaccid"
$ regex='^.{3}(.*)$'
$ [[ $str =~ $regex ]] && echo "${BASH_REMATCH[1]}"
ccid
Same regex in sed:
$ echo "flaccid" | sed -E "s/$regex/\1/"
ccid
Or sed (Basic Regex):
$ echo "flaccid" | sed 's/^.\{3\}\(.*\)$/\1/'
ccid
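The approaches above agree on strings of at least three characters but differ on shorter input, which may matter in a script. A small sketch, assuming bash (the ${var:offset} form is not POSIX); the variable names are illustrative only:

```shell
str='flaccid'
short='ab'

by_offset=${str:3}      # substring from index 3 onward
by_pattern=${str#???}   # strip a prefix matching any three characters
by_cut=$(printf '%s\n' "$str" | cut -c 4-)

# With fewer than three characters the two expansions diverge:
short_offset=${short:3}     # empty: the offset is past the end
short_pattern=${short#???}  # unchanged: ??? cannot match two characters

printf '%s\n' "$by_offset" "$by_pattern" "$by_cut" "[$short_offset]" "[$short_pattern]"
```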

Capture group from regex in bash

I have the following string /path/to/my-jar-1.0.jar for which I am trying to write a bash regex to pull out my-jar.
Now I believe the following regex would work: ([^\/]*?)-\d but I don't know how to get bash to run it.
The following: echo '/path/to/my-jar-1.0.jar' | grep -Po '([^\/]*?)-\d' captures my-jar-1
In BASH you can do:
s='/path/to/my-jar-1.0.jar'
[[ $s =~ .*/([^/[:digit:]]+)-[[:digit:]] ]] && echo "${BASH_REMATCH[1]}"
my-jar
Here "${BASH_REMATCH[1]}" will print captured group #1 which is expression inside first (...).
You can do this as well with shell prefix and suffix removal:
$ path=/path/to/my-jar-1.0.jar
# Remove the longest prefix ending with a slash
$ base="${path##*/}"
# Remove the longest suffix starting with a dash followed by a digit
$ base="${base%%-[0-9]*}"
$ echo "$base"
my-jar
Although it's a little annoying to have to do the transform in two steps, it has the advantage of only using POSIX features, so it will work with any compliant shell.
Note: The order is important, because the basename cannot contain a slash, but a path component could contain a dash. So you need to remove the path components first.
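The two-step removal can be wrapped in a small function. A sketch using only POSIX parameter expansion; the function name artifact_name and the second sample path are hypothetical:

```shell
# Print the file name without directory or "-<version>..." suffix.
artifact_name() {
  base=${1##*/}          # drop everything up to the last slash
  base=${base%%-[0-9]*}  # drop the longest suffix starting with -<digit>
  printf '%s\n' "$base"
}

artifact_name /path/to/my-jar-1.0.jar
# A directory containing "-<digit>" shows why the path must be stripped first.
artifact_name /opt/app-2/my-jar-1.0.jar
```

With the steps reversed, the second path would be cut at app-2 and the result would no longer be the jar name.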
grep -o doesn't recognize "capture groups" I think, just the entire match. That said, with Perl regexps (-P) you have the "lookahead" option to exclude the -\d from the match:
echo '/path/to/my-jar-1.0.jar' | grep -Po '[^/]*(?=-\d)'
Some reference material on lookahead/lookbehind:
http://www.perlmonks.org/?node_id=518444

Sed replace entire line with replacement

I posted this question on superuser a bit ago, but I haven't gotten an answer I like. Here it is, slightly modified.
I was hoping for a way to make sed replace the entire line with the replacement (rather than just the match) so I could do something like this:
sed -e "/$some_complex_regex_with_a_backref/\1/"
and have it only print the back-reference.
From this question, it seems like the way to do it is mess around with the regex to match the entire line, or use some other tool (like perl). Simply changing the regex to .*regex.* doesn't always work (as mentioned in that question). For example:
$ echo $regex
\([[:alpha:]]*\)day
$ echo "$phrase"
it is Saturday tomorrow
Warm outside...
$ echo "$phrase" | sed "s/$regex/\1/"
it is Satur tomorrow
Warm outside...
$ echo "$phrase" | sed "s/.*$regex.*/\1/"
Warm outside...
$ # what I'd like to have happen
$ echo "$phrase" | [[[some command or string of commands]]]
Satur
Warm outside...
I'm looking for the most concise way to replace the entire line (not just the matching part) assuming the following:
The regex is in a variable, so can't be changed on a case by case basis.
I'd like to do this without using perl or other beefier languages (sed, awk, grep, etc. are ok)
The solution can't remove lines that don't match the original regex (sorry grep -o doesn't cut it).
What I thought of was the following (but it's ugly, so I'd like to know if there is something better):
$ sentinel=XXX
$ echo "$phrase" | sed "s/$regex/$sentinel\1$sentinel/" |
> sed "s/^.*$sentinel\(.*\)$sentinel.*$/\1/"
This might work for you:
sed '/'"$regex"'/!b;s//\n\1\n/;s/.*\n\(.*\)\n.*/\1/' file
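That one-liner packs three steps together: skip lines that do not match, wrap the captured group in newlines, then keep only what sits between them. A sketch of it run on the question's sample data, assuming GNU sed (which accepts \n in the replacement) and a regex written with the bracketed class [[:alpha:]]:

```shell
regex='\([[:alpha:]]*\)day'
phrase='it is Saturday tomorrow
Warm outside...'

# /re/!b     : lines without a match branch to the end and print unchanged
# s//\n\1\n/ : the empty regex reuses the last one; mark the group with newlines
# s/.*\n\(.*\)\n.*/\1/ : keep only the text between the two markers
result=$(printf '%s\n' "$phrase" | sed '/'"$regex"'/!b;s//\n\1\n/;s/.*\n\(.*\)\n.*/\1/')

printf '%s\n' "$result"
```

A newline works as the sentinel because the pattern space never contains one when sed reads input line by line.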

Return a regex match in a Bash script, instead of replacing it

I just want to match some text in a Bash script. I've tried using sed but I can't seem to make it just output the match instead of replacing it with something.
echo -E "TestT100String" | sed -E 's/[0-9]+/dontReplace/g'
Which will output TestTdontReplaceString.
Which isn't what I want, I want it to output 100.
Ideally, it would put all the matches in an array.
edit:
Text input is coming in as a string:
newName()
{
#Get input from function
newNameTXT="$1"
if [[ $newNameTXT ]]; then
#Use code that im working on now, using the $newNameTXT string.
fi
}
You could do this purely in bash using the double square bracket [[ ]] test operator, which stores results in an array called BASH_REMATCH:
[[ "TestT100String" =~ ([0-9]+) ]] && echo "${BASH_REMATCH[1]}"
echo "TestT100String" | sed 's/[^0-9]*\([0-9]\+\).*/\1/'
echo "TestT100String" | grep -o '[0-9]\+'
The method you use to put the results in an array depends somewhat on how the actual data is being retrieved. There's not enough information in your question to be able to guide you well. However, here is one method:
index=0
while read -r line
do
array[index++]=$(echo "$line" | grep -o '[0-9]\+')
done < filename
Here's another way:
array=($(grep -o '[0-9]\+' filename))
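If the input is a single string rather than a file, the matches can also be collected without external commands. A rough sketch, assuming bash and a simple pattern whose matched text does not also occur earlier in the string; the names rest and matches and the sample string are illustrative:

```shell
string='Test1T100String22'
re='[0-9]+'
matches=()
rest=$string

# Repeatedly match, record the hit, then cut the string just past it.
while [[ $rest =~ $re ]]; do
  matches+=("${BASH_REMATCH[0]}")
  rest=${rest#*"${BASH_REMATCH[0]}"}
done

printf '%s\n' "${matches[@]}"
```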
Pure Bash. Use parameter substitution (no external processes and pipes):
string="TestT100String"
echo ${string//[^[:digit:]]/}
Removes all non-digits.
I know this is an old topic, but I came here through the same searches and found another great possibility: applying a regex to a string/variable using grep:
# Simple
match=$(echo "TestT100String" | grep -Po "[0-9]{3}")
# More complex using lookaround
match=$(echo "TestT100String" | grep -Po "(?i)TestT\K[0-9]{3}(?=String)")
Lookaround lets the expression state what must surround the match without including it in the output. Here (?i) turns on case-insensitive matching,
\K discards everything matched so far (so TestT acts like a lookbehind), and (?=String) is a lookahead that requires String after the match without consuming it.
https://www.regular-expressions.info/lookaround.html
The given example matches the same as the PCRE regex TestT([0-9]{3})String
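To see what each piece contributes, the pattern can be run against input whose case differs from the pattern. A sketch assuming GNU grep with PCRE support; the lowercase sample string is made up:

```shell
# (?i) makes the whole pattern case-insensitive, so lowercase input matches.
# \K drops "testt" from the result; (?=String) checks the suffix without consuming it.
digits=$(printf '%s\n' 'testt100string' | grep -Po '(?i)TestT\K[0-9]{3}(?=String)')

echo "$digits"
```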
Use grep. Sed is an editor. If you only want to match a regexp, grep is more than sufficient.
using awk
linux$ echo -E "TestT100String" | awk '{gsub(/[^0-9]/,"")}1'
100
I don't know why nobody ever uses expr: it's portable and easy.
newName()
{
#Get input from function
newNameTXT="$1"
if num=`expr "$newNameTXT" : '[^0-9]*\([0-9][0-9]*\)'`; then
echo "contains $num"
fi
}
Well, sed with s/pattern1/pattern2/g just replaces every occurrence of pattern1 with pattern2.
Besides that, sed prints the entire line by default.
I suggest piping the output to a cut command and extracting the numbers you want.
If you want to use only sed, then use TRE:
sed -n 's/.*\([0-9]\)\([0-9]\)\([0-9]\).*/\1,\2,\3/p'
I haven't run the above command, so double-check the syntax.
Hope this helps.
using just the bash shell
declare -a array
i=0
while read -r line
do
case "$line" in
*TestT*String* )
while true
do
line=${line#*TestT}
array[$i]=${line%%String*}
line=${line#*String*}
i=$((i+1))
case "$line" in
*TestT*String* ) continue;;
*) break;;
esac
done
esac
done <"file"
echo "${array[@]}"