Extract number from specific string pattern in string - regex

I am iterating through file names with bash and am in need of a way to pull out a specific number from the string notated by a preceding character. Essentially, all of the files have a part of their name that looks like D01 or D02. An example filename is Build-asdasdasd.D01V02.dat. I am trying to use sed, but to no avail thus far. Thanks!

Pure Bash:
name='Build-asdasdasd.D01V02.dat'
[[ "$name" =~ \.(D[[:digit:]]{2}[[:upper:]][[:digit:]]{2})\. ]] \
&& number="${BASH_REMATCH[1]}" || number=''
echo "'$number'"
The echo shows
'D01V02'

You don't have to do everything in a single expression. You can build a pipeline, like so:
echo 'Build-asdasdasd.D01V02.dat' |
egrep -o '\.D([[:digit:]]{2}[^.]+)' |
sed 's/^.//'
This returns D01V02 for me, but you may want to test your expression against a wider corpus to see if there are any edge cases.

Here is another pure bash answer that assumes your filenames are always similar to the example. Regex is not required.
name='Build-asdasdasd.D01V02.dat'
number="${name%.*}"      # strip the extension: Build-asdasdasd.D01V02
number="${number##*.}"   # strip everything up to the last dot: D01V02
echo "$number"

Your question is quite unclear. If you want the digits after the D, you can use
f="Build-asdasdasd.D01V02.dat"
num=$(grep -Po '(?<=D)\d\d' <<< "$f")
or
num=$(sed 's/^.*D\([[:digit:]][[:digit:]]\).*/\1/' <<< "$f")
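
A minimal sketch of how the digits after the D could be collected while looping over file names, as the question describes. It assumes bash; the function name extract_d_number and the extra sample names are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the two digits that follow ".D" in a file name,
# or return non-zero when the name does not contain that pattern.
extract_d_number() {
    local re='\.D([0-9]{2})'
    [[ $1 =~ $re ]] || return 1
    printf '%s\n' "${BASH_REMATCH[1]}"
}

# Sample names invented for this sketch.
for name in 'Build-asdasdasd.D01V02.dat' 'Build-other.D02V10.dat' 'README.txt'; do
    if num=$(extract_d_number "$name"); then
        echo "$name -> $num"
    else
        echo "$name -> no match"
    fi
done
```

Keeping the regex in a variable avoids quoting surprises on the right-hand side of =~.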

Related

replace string with underscore and dots using sed or awk

I have a bunch of files with filenames composed of underscore and dots, here is one example:
META_ALL_whrAdjBMI_GLOBAL_August2016.bed.nodup.sortedbed.roadmap.sort.fgwas.gz.r0-ADRL.GLND.FET-EnhA.out.params
I want to remove the part that contains .bed.nodup.sortedbed.roadmap.sort.fgwas.gz. so the expected filename output would be META_ALL_whrAdjBMI_GLOBAL_August2016.r0-ADRL.GLND.FET-EnhA.out.params
I am using these sed commands but neither one works:
stringZ=META_ALL_whrAdjBMI_GLOBAL_August2016.bed.nodup.sortedbed.roadmap.sort.fgwas.gz.r0-ADRL.GLND.FET-EnhA.out.params
echo $stringZ | sed -e 's/\([[:lower:]]\.[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.\)//g'
echo $stringZ | sed -e 's/\[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.[[:lower:]]\.//g'
Any solution in sed or awk would help a lot.
Don't use external utilities and regexes for such a simple task! Use parameter expansions instead.
stringZ=META_ALL_whrAdjBMI_GLOBAL_August2016.bed.nodup.sortedbed.roadmap.sort.fgwas.gz.r0-ADRL.GLND.FET-EnhA.out.params
echo "${stringZ/.bed.nodup.sortedbed.roadmap.sort.fgwas.gz}"
To perform the renaming of all the files containing .bed.nodup.sortedbed.roadmap.sort.fgwas.gz, use this:
shopt -s nullglob
substring=.bed.nodup.sortedbed.roadmap.sort.fgwas.gz
for file in *"$substring"*; do
echo mv -- "$file" "${file/"$substring"}"
done
Note. I left echo in front of mv so that nothing is going to be renamed; the commands will only be displayed on your terminal. Remove echo if you're satisfied with what you see.
Your regex doesn't really feel too much more general than the fixed pattern would be, but if you want to make it work, you need to allow for more than one lower case character between each dot. Right now you're looking for exactly one, but you can fix it with \+ after each [[:lower:]] like
printf '%s' "$stringZ" | sed -e 's/\([[:lower:]]\+\.[[:lower:]]\+\.[[:lower:]]\+\.[[:lower:]]\+\.[[:lower:]]\+\.[[:lower:]]\+\.[[:lower:]]\+\.\)//g'
which with
stringZ="META_ALL_whrAdjBMI_GLOBAL_August2016.bed.nodup.sortedbed.roadmap.sort.fgwas.gz.r0-ADRL.GLND.FET-EnhA.out.params"
gives me the output
META_ALL_whrAdjBMI_GLOBAL_August2016.r0-ADRL.GLND.FET-EnhA.out.params
Try this:
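#!/bin/bash
# Glob instead of parsing ls, quote the variables,
# and escape the dots so they match literally.
for line in META*
do
f2=$(printf '%s\n' "$line" | sed 's/\.bed\.nodup\.sortedbed\.roadmap\.sort\.fgwas\.gz//')
mv -- "$line" "$f2"
done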

How to get a part of a string with a regular expression in a /bin/sh script

I need to extract part of a string in a shell script. The original string is pretty complicated, so I really need a regular expression to select the right part of the original string; just removing a prefix and suffix won't work. Also, the regular expression needs to check the context of the string I want to extract. For example, I need a regular expression like a\([^b]*\)b to extract 123 from 12a123b23.
The shell script needs to be portable, so I cannot make use of the Bash constructs [[ and BASH_REMATCH.
I want the script to be robust, so when the regular expression does not match, the script should notice this e.g. through a non-zero exit code of the command to be used.
What is a good way to do this?
I've tried various tools, but none of them fully solved the problem:
expr match "$original" ".*$regex.*" works except for the error case. With this command, I don't know how to detect if the regex did not match. Also, expr seems to take the extracted string to determine its exit code - so when I happened to extract 00, expr had an exit code of 1. So I would need to generally ignore the exit code with expr match "$original" ".*$regex.*" || true
echo "$original" | sed "s/.*$regex.*/\\1/" also works except for the error case. To handle this case, I'd need to test if I got back the original string, which is also quite inelegant.
So, isn't there a better way to do this?
You could use the -n option of sed to suppress output of all input lines and add the p option to the substitute command, like this:
echo "$original" | sed -n -e "s/.*$regex.*/\1/p"
If the regular expression matches, the matched group is printed as before. But now if the regular expression does not match, nothing is printed and you will need to test only for the empty string.
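
The approach above can be wrapped so that the caller gets a non-zero exit status when nothing matched. This is only a sketch in POSIX sh; the function name extract is hypothetical, and it assumes the regex contains exactly one \(...\) group and no unescaped slash.

```shell
#!/bin/sh
# Hypothetical helper: print the first \(...\) group of the BRE in $2 as
# matched against the string in $1. Returns 1 when the pattern does not
# match. A match whose group is empty is also reported as "no match".
extract() {
    result=$(printf '%s\n' "$1" | sed -n "s/.*$2.*/\1/p")
    [ -n "$result" ] || return 1
    printf '%s\n' "$result"
}

if part=$(extract '12a123b23' 'a\([^b]*\)b'); then
    echo "matched: $part"
else
    echo "no match" >&2
fi
```

Unlike expr, the exit status here does not depend on the extracted text, so a result such as 00 still counts as a match.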
How about grep -o? The only possible problem is portability; otherwise it satisfies all requirements:
➜ echo "hello and other things" | grep -o hello
hello
➜ echo $?
0
➜ echo "hello and other things" | grep -o nothello
➜ echo $?
1
One of the best things is that, since it's grep, you can pick which regex flavor you want: BRE, ERE or Perl.
If egrep is available (pretty much all the time):
egrep 'YourPattern' YourFile
or
egrep "${YourPattern}" YourFile
If only grep is available:
grep -e 'YourPattern' YourFile
Check the status of the command with a classic [ $? -eq 0 ] (this also catches a failure to read YourFile).
For the content itself, extract it with sed or awk (for portability) after the failure test:
Content="$( sed -n -e "/${YourPattern}/{s/.*\(${YourPattern}\).*/\1/p;q;}" YourFile )"

Extract number of variable length from string

I want to extract a number of variable length from a string.
The string looks like this:
used_memory:1775220696
I would like to have the 1775220696 part in a variable. There are a lot of questions about this, but I could not find a solution that suits my needs.
You can use cut:
my_val=$(echo "used_memory:1775220696" | cut -d':' -f2)
Or also awk:
my_val=$(echo "used_memory:1775220696" | awk -F':' '{print $2}')
Use parameter expansion:
string=used_memory:1775220696
num=${string#*:} # Delete everything up to the first colon.
I used to use egrep
echo used_memory:1775220696 | egrep -o '[0-9]+'
Output:
1775220696
Use the regex:
s/^[^:]*://g
You can use it with sed or perl to get the part you need.
> echo "used_memory:1775220696" | perl -pe 's/^[^:]*://g'
1775220696
bash supports regular-expression matching, but for a simple case like this it is overkill; use parameter expansion (see choroba's answer).
For the sake of completeness, here's an example using regular expression matching:
[[ $string =~ (.*):([[:digit:]]+) ]] && num=${BASH_REMATCH[2]}
Can be done using awk, like this:
var=`echo "used_memory:1775220696" | awk -F':' '{print $2;}'`
echo $var
output:
1775220696
If your number could be anywhere in the string, but you know that the digits are contiguous, you can use shell parameter expansion to remove everything that is not a digit:
$ str=used_memory:1775220696
$ num=${str//[!0-9]}
$ echo "$num"
1775220696
This works also for used_memory:1775220696andmoretext and 123numberfirst. However, something like abc123def456 would become 123456.
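
If only the first run of digits is wanted, so that abc123def456 yields 123, two nested expansions can do it. This is a sketch in POSIX sh; the function name first_number is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the first contiguous run of digits in $1
# (prints an empty line if there are none). No external commands are used.
first_number() {
    rest=${1#"${1%%[0-9]*}"}          # drop the leading non-digits
    printf '%s\n' "${rest%%[!0-9]*}"  # cut at the first following non-digit
}

first_number abc123def456             # prints 123
first_number used_memory:1775220696   # prints 1775220696
```

The inner ${1%%[0-9]*} yields everything before the first digit, which the outer expansion then strips as a prefix.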

Bash regex for strong password

How can I use the following regex in a BASH script?
(?=^.{8,255}$)((?=.*\d)(?!.*\s)(?=.*[A-Z])(?=.*[a-z]))^.*
I need to check the user input (password) for the following:
at least one Capital Letter.
at least one number.
at least one small letter.
and the password should be between 8 and 255 characters long.
If your version of grep has the -P option, it supports PCRE (Perl-Compatible Regular Expressions).
grep -P '(?=^.{8,255}$)(?=^[^\s]*$)(?=.*\d)(?=.*[A-Z])(?=.*[a-z])'
I had to change your expression to reject spaces since it always failed. The extra set of parentheses didn't seem necessary. I left off the ^.* at the end since that always matches and you're really only needing the boolean result like this:
while ! echo "$password" | grep -P ...
do
read -r -s -p "Please enter a password: " password
done
I don't think that your regular expression is the best (or correct?) way to check the things on your list (hint: I'd check the length independently of the other conditions), but to answer the question about using it in Bash: use the return value of grep -Eq, e.g.:
if echo "$candidate_password" | grep -Eq "$strong_pw_regex"; then
echo strong
else
echo weak
fi
Alternatively in Bash 3 and later you can use the =~ operator. Leave the right-hand side unquoted; since Bash 3.2 a quoted pattern is matched as a literal string:
if [[ "$candidate_password" =~ $strong_pw_regex ]]; then
…
fi
The regexp syntax of grep -E or Bash does not necessarily support all the things you are using in your example, but it is possible to check your requirements with either. But if you want fancier regular expressions, you'll probably need to substitute something like Ruby or Perl for Bash.
As for modifying your regular expression, check the length with Bash (${#candidate_password} gives you the length of the string in the variable candidate_password) and then use a simple syntax with no lookahead. You could even check all three conditions with separate regular expressions for simplicity.
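
A minimal sketch of that suggestion, assuming bash: the length is checked with ${#...} and each character class gets its own simple regex. The function name is_strong_password is hypothetical, and the whitespace test mirrors the (?!.*\s) part of the regex in the question.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the password in $1 meets every rule.
is_strong_password() {
    local pw=$1
    (( ${#pw} >= 8 && ${#pw} <= 255 )) || return 1   # length 8..255
    [[ $pw =~ [[:upper:]] ]] || return 1             # at least one capital
    [[ $pw =~ [[:lower:]] ]] || return 1             # at least one small letter
    [[ $pw =~ [[:digit:]] ]] || return 1             # at least one number
    [[ $pw =~ [[:space:]] ]] && return 1             # no whitespace allowed
    return 0
}

if is_strong_password 'tEsTstr1ng'; then
    echo strong
else
    echo weak
fi
```

Because each rule is a separate test, it is easy to report which one failed instead of a single pass or fail.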
These matches are connected with the logical AND operator, which means the only good match is when all of them match.
Therefore the simplest way is to match those conditions chained, with the previous result piped into the next expression. Then if any of the matches fail, the entire expression fails:
$ echo "tEsTstr1ng" | egrep "^.{8,255}$" | egrep "[ABCDEFGHIJKLMNOPQRSTUVWXYZ]" | egrep "[abcdefghijklmnopqrstuvwxyz]" | egrep "[0-9]"
I manually entered all characters instead of "[A-Z]" and "[a-z]" because different system locales might substitute them as [aAbBcC..., which is two conditions in one match and we need to check for both conditions.
As shell script:
#!/bin/sh
a="tEsTstr1ng"
b=`echo "$a" | egrep "^.{8,255}$" | \
egrep "[ABCDEFGHIJKLMNOPQRSTUVWXYZ]" | \
egrep "[abcdefghijklmnopqrstuvwxyz]" | \
egrep "[0-9]"`
# now featuring W in the alphabet string
# if the result string is empty, one of the conditions has failed
if [ -z "$b" ]
then
echo "Conditions do not match"
else
echo "Conditions match"
fi
grep with the -E option uses Extended Regular Expressions (ERE). According to the documentation, ERE does not support lookahead.
So you can use Perl for this as:
perl -ne 'exit 1 if(/(?=^.{8,255}$)((?=.*\d)(?=.*[A-Z])(?=.*[a-z])|(?=.*\d)(?=.*[^A-Za-z0-9])(?=.*[a-z])|(?=.*[^A-Za-z0-9])(?=.*[A-Z])(?=.*[a-z])|(?=.*\d)(?=.*[A-Z])(?=.*[^A-Za-z0-9]))^.*/);exit 0;'
I get that you are looking for a regex, but have you considered doing it through a PAM module?
dictionary
quality
There might be other interesting modules.

Return a regex match in a Bash script, instead of replacing it

I just want to match some text in a Bash script. I've tried using sed but I can't seem to make it just output the match instead of replacing it with something.
echo -E "TestT100String" | sed 's/[0-9]+/dontReplace/g'
Which will output TestTdontReplaceString.
Which isn't what I want, I want it to output 100.
Ideally, it would put all the matches in an array.
edit:
Text input is coming in as a string:
newName()
{
#Get input from function
newNameTXT="$1"
if [[ $newNameTXT ]]; then
#Use code that im working on now, using the $newNameTXT string.
fi
}
You could do this purely in bash using the double square bracket [[ ]] test operator, which stores results in an array called BASH_REMATCH:
[[ "TestT100String" =~ ([0-9]+) ]] && echo "${BASH_REMATCH[1]}"
echo "TestT100String" | sed 's/[^0-9]*\([0-9]\+\).*/\1/'
echo "TestT100String" | grep -o '[0-9]\+'
The method you use to put the results in an array depends somewhat on how the actual data is being retrieved. There's not enough information in your question to be able to guide you well. However, here is one method:
index=0
while read -r line
do
array[index++]=$(echo "$line" | grep -o '[0-9]\+')
done < filename
Here's another way:
array=($(grep -o '[0-9]\+' filename))
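
When the input is already in a variable, all matches can also be collected without grep by matching repeatedly and cutting off the consumed part. This is a sketch assuming bash; the sample string and the variable names are made up for illustration.

```shell
#!/usr/bin/env bash
string='Test100String200and3'   # sample input invented for this sketch
matches=()
rest=$string
re='[0-9]+'

# Each pass takes the leftmost match, stores it, and then removes
# everything up to and including that match from the remaining text.
while [[ $rest =~ $re ]]; do
    matches+=("${BASH_REMATCH[0]}")
    rest=${rest#*"${BASH_REMATCH[0]}"}
done

printf '%s\n' "${matches[@]}"
```

For this input the array ends up holding 100, 200 and 3, one element per match.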
Pure Bash. Use parameter substitution (no external processes and pipes):
string="TestT100String"
echo ${string//[^[:digit:]]/}
Removes all non-digits.
I know this is an old topic, but I came here through the same searches and found another great possibility: applying a regex to a string or variable using grep:
# Simple
num=$(echo "TestT100String" | grep -Po "[0-9]{3}")
# More complex using lookaround
num=$(echo "TestT100String" | grep -Po "(?i)TestT\K[0-9]{3}(?=String)")
With lookaround, search expressions can be narrowed for better matching. Here (?i) makes the match case-insensitive,
\K discards everything matched so far from the reported match (so the TestT prefix must be present but is not printed, much like a lookbehind), and (?=String) is a lookahead that requires String to follow without including it in the match.
https://www.regular-expressions.info/lookaround.html
The given example matches the same as the PCRE regex TestT([0-9]{3})String
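
For comparison, the same capture can be done in bash itself when grep -P is not available. This is a sketch under that assumption; note that, unlike (?i), it is case-sensitive unless shopt -s nocasematch is set.

```shell
#!/usr/bin/env bash
# Same capture as the grep -P example, using a bash ERE capture group.
re='TestT([0-9]{3})String'
if [[ 'TestT100String' =~ $re ]]; then
    num=${BASH_REMATCH[1]}   # text matched by the first parenthesized group
    echo "$num"
fi
```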
Use grep. Sed is an editor. If you only want to match a regexp, grep is more than sufficient.
using awk
linux$ echo -E "TestT100String" | awk '{gsub(/[^0-9]/,"")}1'
100
I don't know why nobody ever uses expr: it's portable and easy.
newName()
{
#Get input from function
newNameTXT="$1"
if num=`expr "$newNameTXT" : '[^0-9]*\([0-9][0-9]*\)'`; then
echo "contains $num"
fi
}
Well, sed with s/pattern1/pattern2/g just replaces every pattern1 with pattern2 globally.
Besides that, sed prints the entire line by default.
I suggest piping the output to a cut command and extracting the numbers you want.
If you want to use only sed, then use TRE:
sed -n 's/.*\([0-9]\)\([0-9]\)\([0-9]\).*/\1,\2,\3/p'
I didn't try executing the above command, so make sure the syntax is right.
Hope this helped.
using just the bash shell
declare -a array
i=0
while read -r line
do
case "$line" in
*TestT*String* )
while true
do
line=${line#*TestT}
array[$i]=${line%%String*}
line=${line#*String*}
i=$((i+1))
case "$line" in
*TestT*String* ) continue;;
*) break;;
esac
done
esac
done <"file"
echo "${array[@]}"