Bash: Regex match between first occurrence of characters - regex

Input:
p45-322-16.jpg
Desired output:
p45
I'm trying to make a bash script with grep or awk or sed or something that could run on a bash shell.
Currently I'm stuck with this:
echo "p45-322-16.jpg" | sed 's/\(.*\)-.*/\1/'
Output:
p45-322

You can use cut command: echo "p45-322-16.jpg" | cut -d"-" -f1

You need to limit what you will accept. Right now, you are accepting too much by using ., and the greedy-by-default nature of regexes is consuming too many characters.
Try either limiting the accepted characters to letters and digits only, or specifically excluding the dash:
\([[:alnum:]]*\).*
\([^-]*\).*

echo "p45-322-16.jpg" | sed 's/\([^-]*\).*/\1/'
The .* part of your regex is greedy, so it reads as far as possible so that the regex still matches. This means it claims everything up to the last -.
Use [^-]* to match everything until a -.
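A minimal sketch of the difference, using the file name from the question. The variable names (`name`, `greedy`, `first`, `expanded`) are only illustrative, and the last line shows a pure-shell parameter expansion as an assumed alternative that avoids sed entirely:

```shell
#!/bin/bash
name="p45-322-16.jpg"

# Greedy: .* runs as far as it can, so the capture ends at the LAST hyphen.
greedy=$(printf '%s\n' "$name" | sed 's/\(.*\)-.*/\1/')

# Negated class: [^-]* cannot cross a hyphen, so the capture ends at the FIRST one.
first=$(printf '%s\n' "$name" | sed 's/\([^-]*\).*/\1/')

# Pure-shell alternative: remove the longest suffix that starts with a hyphen.
expanded="${name%%-*}"

printf '%s\n' "$greedy" "$first" "$expanded"
```

The first line printed is p45-322 and the other two are p45.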

Related

How do I take only the first occurrence of a hyphen in sed?

I have a string, for example home/JOHNSMITH-4991-common-task-list, and I want to take out the uppercase part and the numbers with the hyphen between them. I echo the string and pipe it to sed like so, but I keep getting all the hyphens I don't want, e.g.:
echo home/JOHNSMITH-4991-common-task-list | sed 's/[^A-Z0-9-]//g'
gives me:
JOHNSMITH-4991---
I need:
JOHNSMITH-4991
How do I ignore all but the first hyphen?
You can use
sed 's,.*/\([^-]*-[^-]*\).*,\1,'
POSIX BRE regex details:
.* - any zero or more chars
/ - a / char
\([^-]*-[^-]*\) - Group 1: any zero or more chars other than -, a hyphen, and then again zero or more chars other than -
.* - any zero or more chars
The replacement is the Group 1 placeholder, \1, to restore just the text captured.
See the online demo:
#!/bin/bash
s="home/JOHNSMITH-4991-common-task-list"
sed 's,.*/\([^-]*-[^-]*\).*,\1,' <<< "$s"
# => JOHNSMITH-4991
1st solution: With awk this is simpler. Written and tested with your shown samples.
echo "home/JOHNSMITH-4991-common-task-list" | awk -F'/|-' '{print $2"-"$3}'
Explanation: set the field separator to / or -, then print the 2nd field, a hyphen, and the 3rd field of the current line.
2nd solution: Using match function of awk program here.
echo "home/JOHNSMITH-4991-common-task-list" |
awk '
match($0,/\/[^-]*-[^-]*/){
print substr($0,RSTART+1,RLENGTH-1)
}'
3rd solution: Using GNU grep. The -o option prints only the matched part, and -P enables PCRE (Perl-compatible regular expressions). The pattern matches .*/, uses \K to discard what has been matched so far, and then matches [^-]*-[^-]* to get everything before the 2nd occurrence of - in the line.
echo "home/JOHNSMITH-4991-common-task-list" | grep -oP '.*/\K[^-]*-[^-]*'
Here is a simple alternative solution using cut with bash string substitution:
s='home/JOHNSMITH-4991-common-task-list'
cut -d- -f1-2 <<< "${s##*/}"
JOHNSMITH-4991
You could match up to the first occurrence of /, reset the start of the reported match with \K, and then repeat the character class 1+ times on each side of a hyphen, so that at least one character is required before and after the hyphen.
[^/]*/\K[A-Z0-9]+-[A-Z0-9]+
If supported, using GNU grep:
echo "home/JOHNSMITH-4991-common-task-list" | grep -oP '[^/]*/\K[A-Z0-9]+-[A-Z0-9]+'
Output
JOHNSMITH-4991
If GNU awk is an option, using the same pattern but with a capture group:
echo "home/JOHNSMITH-4991-common-task-list" | awk 'match($0, /[^\/]*\/([A-Z0-9]+-[A-Z0-9]+)/, a) {print a[1]}'
If the desired output is always the first match where the character class with a hyphen matches:
echo "home/JOHNSMITH-4991-common-task-list" | awk -v FPAT="[A-Z0-9]+-[A-Z0-9]+" '$0=$1'
Output
JOHNSMITH-4991
Assumptions:
could be more than one fwd slash in string
(after the last fwd slash) there are 2 or more hyphens in the string
desired output is between last fwd slash and 2nd hyphen
One idea using parameter substitutions:
$ string='home/dir/JOHNSMITH-4991-common-task-list'
$ string1="${string##*/}"
$ typeset -p string1
declare -- string1="JOHNSMITH-4991-common-task-list"
$ string1="${string1%%-*}"
$ typeset -p string1
declare -- string1="JOHNSMITH"
$ string2="${string#*-}"
$ typeset -p string2
declare -- string2="4991-common-task-list"
$ string2="${string2%%-*}"
$ typeset -p string2
declare -- string2="4991"
$ newstring="${string1}-${string2}"
$ echo "${newstring}"
JOHNSMITH-4991
NOTES:
typeset commands added solely to show progression of values
a bit of typing but if doing this a lot of times in bash the overall performance should be good compared to other solutions that require spawning a sub-process
if there's a need to parse a large number of strings best performance will come from streaming all strings at once (via a file?) to one of the other solutions (eg, a single awk call that processes all strings will be faster than running the set of strings through a bash loop and performing all of these parameter substitutions)
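The two approaches in these notes can be sketched side by side. Both function names (`ticket_prefix`, `ticket_prefix_stream`) are hypothetical, and the sketch relies on the assumptions listed above (output is the text between the last forward slash and the 2nd hyphen). Unlike the step-by-step session, the first function takes the second field from the basename rather than from the full string, so a hyphen in a directory name should not interfere:

```shell
#!/bin/bash
# Hypothetical helper: parameter expansion only, no sub-process per string.
ticket_prefix() {
    local base="${1##*/}"        # drop everything through the last slash
    local first="${base%%-*}"    # text before the first hyphen
    local rest="${base#*-}"      # text after the first hyphen
    printf '%s-%s\n' "$first" "${rest%%-*}"
}

# Hypothetical streaming variant: one awk process handles every input line.
ticket_prefix_stream() {
    awk '{ sub(/.*\//, ""); split($0, f, "-"); print f[1] "-" f[2] }'
}

ticket_prefix 'home/dir/JOHNSMITH-4991-common-task-list'
printf '%s\n' 'home/JOHNSMITH-4991-common-task-list' | ticket_prefix_stream
```

For a handful of strings inside a script the first form is convenient; for a file full of strings, piping the whole file through the second form starts only one process.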

How to use grep/sed/awk, to remove a pattern from beginning of a text file

I have a text file with the following pattern written to it:
TIME[32.468ms] -(3)-............."TEXT I WANT TO KEEP"
I would like to discard the first part of each line containing
TIME[32.468ms] -(3)-.............
To test the regular expression I've tried the following:
cat myfile.txt | egrep "^TIME\[.*\]\s\s\-\(3\)\-\.+"
This identifies correctly the lines I want. Now, to delete the pattern I've tried:
cat myfile.txt | sed s/"^TIME\[.*\]\s\s\-\(3\)\-\.+"//
but it just seems to be doing the cat, since it shows the content of the complete file and no substitution happens.
What am I doing wrong?
OS: CentOS 7
With your shown samples, please try following grep command. Written and tested with GNU grep.
grep -oP '^TIME\[\d+\.\d+ms\]\s+-\(\d+\)-\.+\K.*' Input_file
Explanation: Adding detailed explanation for above code.
^TIME\[ ##Matching string TIME from starting of value here.
\d+\.\d+ms\] ##Matching digits(1 or more occurrences) followed by dot digits(1 or more occurrences) followed by ms ] here.
\s+-\(\d+\)-\.+ ##Matching spaces (1 or more occurrences) followed by - digits(1 or more occurrences) - and 1 or more dots.
\K ##Using \K option of GNU grep to make sure previous match is found in line but don't consider it in printing, print next matched regex part only.
.* ##to match till end of the value.
2nd solution: Adding awk program here.
awk 'match($0,/^TIME\[[0-9]+\.[0-9]+ms\][[:space:]]+-\([0-9]+\)-\.+/){print substr($0,RSTART+RLENGTH)}' Input_file
Explanation: using match function of awk, to match regex ^TIME\[[0-9]+\.[0-9]+ms\][[:space:]]+-\([0-9]+\)-\.+ which will catch text which we actually want to remove from lines. Then printing rest of the text apart from matched one which is actually required by OP.
This awk using its sub() function:
awk 'sub(/^TIME[[][^]]*].*\.+/,"")' file
"TEXT I WANT TO KEEP"
sub() returns true when it makes a replacement, so only the lines that were changed get printed.
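A small sketch of that behaviour, assuming the text to keep always starts at the first double quote. The function name `strip_time_prefix` and the simplified pattern are illustrative, not the pattern from the answer above:

```shell
#!/bin/bash
# Hypothetical helper: sub() is used as the awk pattern, so a line is
# printed only when sub() actually replaced something (returned 1).
strip_time_prefix() {
    awk 'sub(/^TIME[^"]*/, "")'
}

printf '%s\n' \
    'TIME[32.468ms] -(3)-............."TEXT I WANT TO KEEP"' \
    'unrelated line' | strip_time_prefix
```

The second input line does not start with TIME, so sub() returns 0 for it and awk prints nothing.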
$ cut -d'"' -f2 file
TEXT I WANT TO KEEP
You may use:
s='TIME[32.468ms] -(3)-............."TEXT I WANT TO KEEP"'
sed -E 's/^TIME\[[^]]*].*\.+//' <<< "$s"
"TEXT I WANT TO KEEP"
The \s regex extension may not be supported by your sed.
In BRE syntax (which is what sed speaks out of the box) you do not backslash round parentheses - doing that turns them into regex metacharacters which do not match themselves, somewhat unintuitively. Also, + is just a regular character in BRE, not a repetition operator (though you can turn it into one by similarly backslashing it: \+).
You can try adding an -E option to switch from BRE syntax to the perhaps more familiar ERE syntax, but that still won't enable Perl regex extensions, which are not part of ERE syntax, either.
sed 's/^TIME\[[^][]*\][[:space:]][[:space:]]-(3)-\.*//' myfile.txt
should work on any reasonably POSIX sed. (Notice also how the minus character does not need to be backslash-escaped, though doing so is harmless per se. Furthermore, I tightened up the regex for the square brackets, to prevent the "match anything" regex you had .* from "escaping" past the closing square bracket. In some more detail, [^][] is a negated character class which matches any character which isn't (a newline or) ] or [; they have to be specified exactly in this order to avoid ambiguity in the character class definition. Finally, notice also how the entire sed script should normally be quoted in single quotes, unless you have specific reasons to use different quoting.)
If you have sed -E or sed -r you can use + instead of * but then this complicates the overall regex, so I won't suggest that here.
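To make the BRE/ERE difference concrete, here is a sketch that writes the same substitution both ways. It assumes a sed that accepts -E (GNU and BSD sed both do), uses \{1,\} as the portable BRE spelling of "one or more", and allows any amount of whitespace after the closing bracket; the variable names are illustrative:

```shell
#!/bin/bash
line='TIME[32.468ms]  -(3)-............."TEXT I WANT TO KEEP"'

# BRE (default): plain ( ) match themselves; \{1,\} means "one or more".
bre=$(printf '%s\n' "$line" |
    sed 's/^TIME\[[^][]*][[:space:]]*-(3)-\.\{1,\}//')

# ERE (-E): ( ) would group, so literal parentheses need backslashes; + repeats.
ere=$(printf '%s\n' "$line" |
    sed -E 's/^TIME\[[^][]*][[:space:]]*-\(3\)-\.+//')

printf '%s\n' "$bre" "$ere"
```

Both commands print the quoted text only; the escaping rules for parentheses are simply reversed between the two syntaxes.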
A simpler one for sed:
sed 's/^[^"]*//' myfile.txt
If the "text you want to keep" is always surrounded by quotes like this, and it is the only quoted text on the lines starting with "TIME...", then:
sed -n '/^TIME/p' file | awk -F'"' '{print $2}'
should get the line starting with "TIME..." and print the text within the quotes.
Thanks all, for your help.
By the end, I've found a way to make it work:
echo 'TIME[32.468ms] -(3)-.............TEXT I WANT TO KEEP' | grep TIME | sed -r 's/^TIME\[[0-9]+\.[0-9]+ms\]\s\s-\(3\)-\.+//'
More generally,
grep TIME myfile.txt | sed -r 's/^TIME\[[0-9]+\.[0-9]+ms\]\s\s-\(3\)-\.+//'
Cheers,
Pedro

How to extract jira ticket number with sed?

I want to extract Jira ticket number from the branch name with sed.
This is what I have
echo "PTW-123-branch-name" | sed 's/.*\([A-Z]+-[0-9]+[^-]\).*/\1/'
expected result: PTW-123
What is wrong with the regexp?
You may use this sed:
echo "PTW-123-branch-name" | sed 's/\([0-9]\)-.*$/\1/'
PTW-123
Details:
\([0-9]\)-: Matches a digit and captures it in group #1 followed by hyphen
.*$: Match remaining string until end
\1: Is replacement that puts captured digit back in output
Alternatively you can use cut also:
echo "PTW-123-branch-name" | cut -d- -f1,2
PTW-123
In case you are OK with GNU grep, try the following. The output of echo is passed to grep as standard input. The -o option prints only the matched portion and -P enables PCRE. The pattern uses a non-greedy match up to the first run of digits that is followed by -, and prints it if a match is found.
echo "PTW-123-branch-name" | grep -oP '^.*?\d+(?=-)'
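As for what is wrong with the pattern in the question: by default sed uses BRE syntax, where + is an ordinary character rather than a repetition operator. A sketch of two ways around that, assuming the ticket key is always at the start of the branch name (variable names are illustrative):

```shell
#!/bin/bash
branch="PTW-123-branch-name"

# ERE via -E: + means "one or more", and ( ) group without backslashes.
ticket_ere=$(printf '%s\n' "$branch" | sed -E 's/^([A-Z]+-[0-9]+).*/\1/')

# Plain BRE: spell "one or more" as X followed by X*.
ticket_bre=$(printf '%s\n' "$branch" | sed 's/^\([A-Z][A-Z]*-[0-9][0-9]*\).*/\1/')

printf '%s\n' "$ticket_ere" "$ticket_bre"
```

Both print PTW-123.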

How to use sed to grab regular expression

I'd like to grab the digits in a string like so :
"sample_2341-43-11.txt" to 2341-43-11
And so I tried the following command:
echo "sample_2341-43-11.txt" | sed -n -r 's|[0-9]{4}\-[0-9]{2}\-[0-9]{2}|\1|p'
I got the idea from this answer ("Use sed to grab a string"), but it doesn't work on my machine:
it gives an error "illegal option -r".
it doesn't like the \1, either.
I'm using sed on MacOSX yosemite.
Is this the easiest way to extract that information from the file name?
You need to set your grouping and match the rest of the line to remove it with the group. Also the - does not need to be escaped. The -n option suppresses automatic printing, so a line is output only when the substitution succeeds and carries the p flag; it is not needed here. On macOS, use -E instead of -r:
echo "sample_2341-43-11.txt" | sed -E 's/^.*([0-9]{4}-[0-9]{2}-[0-9]{2}).*$/\1/'
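A short sketch of how -n and the p flag combine, assuming a sed that accepts -E (both GNU sed and the BSD sed shipped with macOS do). The function name `extract_date` is hypothetical:

```shell
#!/bin/bash
# Hypothetical helper: -n turns off automatic printing, and the p flag
# prints a line only when the substitution succeeded on it.
extract_date() {
    sed -nE 's/^.*([0-9]{4}-[0-9]{2}-[0-9]{2}).*$/\1/p'
}

printf '%s\n' 'sample_2341-43-11.txt' 'no-date-here.txt' | extract_date
```

Only 2341-43-11 is printed; the second file name has no match and is dropped, which is handy when filtering a list of names.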
The -r option is not supported by the Mac (BSD) version of sed; it uses -E to enable extended regular expressions.
You can use grep instead:
echo "sample_2341-43-11.txt" | grep -Eo '[0-9]+(-[0-9]+)+'
OUTPUT
2341-43-11
echo "one1sample_2341-43-11.txt" \
| sed 's/[^[:digit:]-]\{1,\}/ /g;s/ \{1,\}/ /g;s/^ //;s/ $//'
1 2341-43-11
Extracts every run of digits and hyphens (so something like --12 is also kept, but that is easy to handle separately).
POSIX compliant.
All numbers found on a line are printed on the same line, separated by a space (this could be changed to a newline if wanted).
You can also try this:
sed 's/[^_]\+_\([^.]\+\).*/\1/' <<< sample_2341-43-11.txt
Output:
2341-43-11
Explanation:
[^_]\+_ - Match the content up to and including _ (sample_)
\([^.]\+\) - Match the content until . and capture it (2341-43-11)
.* - Discard the remaining characters (.txt)
You can go with what the poster above said. Alternatively, the pattern "\d+-\d+-\d+" would match what you are looking for. See the demo here:
https://regex101.com/r/kO2cZ1/3

Why is sed matching minimum and not maximum string?

I am trying to extract the numbers (0.000500) from a bunch of files named like eta_x2-0.000500.
I would think that the line below would do that, but I get only 0 and not 0.000500.
How can I get the maximum match?
find eta* | sed 's/.*\([0-9.]\+\)/\1/g'
The .* is greedy, so it is going to match as many characters as possible. In this case the .* will match eta_x2-0.00050 with just the final 0 matched in your group.
Normally the answer here would be to just use a non-greedy match using .*?, but I don't think sed supports that.
You should be able to get this to work by requiring that there is one non-digit character before you start matching, this way the .* will have to stop before consuming the digits:
sed 's/.*[^0-9.]\([0-9.]\+\)/\1/g'
Of course if you know the digits you want will be immediately after a -, you can replace [^0-9.] with - and it will work the same way.
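The effect can be reproduced on a single name. This is a sketch that uses \{1,\} as the portable spelling of \+, with a parameter expansion added as an assumed pure-shell alternative for when the number always follows the last hyphen; the variable names are illustrative:

```shell
#!/bin/bash
name="eta_x2-0.000500"

# Greedy .* leaves only the minimum the group needs: the final "0".
greedy=$(printf '%s\n' "$name" | sed 's/.*\([0-9.]\{1,\}\)/\1/')

# Requiring one non-digit, non-dot character first stops .* before the number.
guarded=$(printf '%s\n' "$name" | sed 's/.*[^0-9.]\([0-9.]\{1,\}\)/\1/')

# Pure-shell alternative: remove everything through the last hyphen.
suffix="${name##*-}"

printf '%s\n' "$greedy" "$guarded" "$suffix"
```

The first line printed is 0 and the other two are 0.000500.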
Add a - to the sed pattern before the final numbers; the g flag is unnecessary.
find eta* | sed 's/.*-\([0-9.]\+\)/\1/'
You don't need sed:
find eta* | grep -oP '(?<=-)[\d.]+'
find eta* | cut -d'-' -f 2
This should work:
find eta* | sed 's/[^-]*-\([0-9]*\.[0-9]*\)/\1/'