bash regexp to extract part of URL - regex

From the following URL:
https://console.developers.google.com/storage/browser/test-lab-acteghe53j0sf-jrf3f8u8p12n4/2017-09-27_15:23:07.566833_MPoy/]
I need to extract the following part:
test-lab-acteghe53j0sf-jrf3f8u8p12n4/2017-09-27_15:23:07.566833_MPoy/
I'm pretty bad at regex. I came up with the following but it doesn't work:
sed -n "s/^.*browser\(test-lab.*/.*/\).*$/\1/p"
Can anyone help with what I'm doing wrong?

Could you please try the following awk solution and let me know if it helps you.
echo "https://console.developers.google.com/storage/browser/test-lab-acteghe53j0sf-jrf3f8u8p12n4/2017-09-27_15:23:07.566833_MPoy/" | awk '{sub(/.*browser\//,"");sub(/\/$/,"");print}'
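Explanation: the first sub removes everything up to and including browser/, and the second sub removes the trailing /. Drop the second sub if you want to keep the trailing / shown in the requested output.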
EDIT1: Adding a sed solution here too.
sed 's/\(.[^//]*\)\/\/\(.[^/]*\)\(.[^/]*\)\(.[^/]*\)\/\(.*\)/\5/' Input_file
Output will be as follows.
test-lab-acteghe53j0sf-jrf3f8u8p12n4/2017-09-27_15:23:07.566833_MPoy/
Explanation of the sed command: the line is split into parts, and sed's capture groups (buffers) remember each matched part. These are the pieces I used:
\(.[^//]*\): holds everything up to and including https:. To print it, use \1, because it is sed's very first buffer.
\/\/: matches the // that comes next in the URL.
\(.[^/]*\): the 2nd buffer, which holds console.developers.google.com, because the regex stops at the first occurrence of /.
\(.[^/]*\), \(.[^/]*\) and \/\(.*\): the next buffers work the same way. The 3rd holds /storage, the 4th holds /browser, and after one more / the 5th holds the rest of the line.
/\5/: the whole match is replaced with \5, the 5th buffer, which contains the part the OP asked for.
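To see what each buffer actually captures, here is a minimal sketch that prints all five of them side by side. It uses an equivalent regex with ~ as the delimiter (so the slashes need no escaping); the variable names url and buffers are just placeholders for this illustration.

```shell
url='https://console.developers.google.com/storage/browser/test-lab-acteghe53j0sf-jrf3f8u8p12n4/2017-09-27_15:23:07.566833_MPoy/'

# Equivalent regex with ~ as the delimiter; the replacement prints
# all five buffers separated by " | " instead of only \5.
buffers=$(printf '%s\n' "$url" |
  sed 's~\(.[^/]*\)//\(.[^/]*\)\(.[^/]*\)\(.[^/]*\)/\(.*\)~\1 | \2 | \3 | \4 | \5~')

printf '%s\n' "$buffers"
# https: | console.developers.google.com | /storage | /browser | test-lab-acteghe53j0sf-jrf3f8u8p12n4/2017-09-27_15:23:07.566833_MPoy/
```

Note that buffers 3 and 4 start with the / because the leading . in each group consumes it.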

Use a different sed delimiter and don't forget to escape the parentheses.
avinash:~/Desktop$ echo 'https://console.developers.google.com/storage/browser/test-lab-acteghe53j0sf-jrf3f8u8p12n4/2017-09-27_15:23:07.566833_MPoy/]' | sed 's~.*/browser/\([^/]*/[^/]*/\).*~\1~'
test-lab-acteghe53j0sf-jrf3f8u8p12n4/2017-09-27_15:23:07.566833_MPoy/
OR
Use grep with the -oP options (-P needs GNU grep with PCRE support).
avinash:~/Desktop$ echo 'https://console.developers.google.com/storage/browser/test-lab-acteghe53j0sf-jrf3f8u8p12n4/2017-09-27_15:23:07.566833_MPoy/]' | grep -oP '/browser/\K[^/]*/[^/]*/'
test-lab-acteghe53j0sf-jrf3f8u8p12n4/2017-09-27_15:23:07.566833_MPoy/
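If the URL is already in a shell variable, plain parameter expansion may be enough and avoids spawning sed or grep. This is only a sketch: it assumes the URL always contains /browser/ and that the wanted part is everything after it up to the last /; the variable names are made up for the example.

```shell
url='https://console.developers.google.com/storage/browser/test-lab-acteghe53j0sf-jrf3f8u8p12n4/2017-09-27_15:23:07.566833_MPoy/]'

# Strip the shortest prefix that ends in "/browser/".
path=${url#*/browser/}
# Strip the shortest suffix that starts at the last "/", then put the slash back.
path=${path%/*}/

printf '%s\n' "$path"
# test-lab-acteghe53j0sf-jrf3f8u8p12n4/2017-09-27_15:23:07.566833_MPoy/
```

Unlike the sed and grep versions, this does not check that exactly two path components follow /browser/.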

Related

Replace unknown sub-string in an URL

I have an URL in the format like https://foo.bar.whoo.dum.io, for which I like to replace the foo string with something else. Of course, the foo part is unknown and can be anything.
I tried with a simple regex like (.+?)\.(.+), but it seems that regex in Bash is always greedy (or?).
My best attempt is to split the string by . and then join it back with the first part left out, but I was wondering, whether there is a more intuitive, different solution.
Thank you
There are a lot of ways of getting the desired output.
If you're sure the url will always start with https://, we can use parameter expansion to remove everything before the first . and then add the replacement you need:
input="https://foo.bar.whoo.dum.io"
echo "https://new.${input#*.}"
Will output
https://new.bar.whoo.dum.io
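On the greediness question: the patterns in parameter expansion are globs, not regexes, and you pick shortest or longest matching by using a single or a doubled operator. A small sketch of the difference (rest, last and head are placeholder names):

```shell
input="https://foo.bar.whoo.dum.io"

rest=${input#*.}     # remove the shortest prefix matching "*."
last=${input##*.}    # remove the longest prefix matching "*."
head=${input%%.*}    # remove the longest suffix matching ".*"

printf '%s\n' "$rest" "$last" "$head"
# bar.whoo.dum.io
# io
# https://foo
```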
You can use sed:
url='https://foo.bar.whoo.dum.io'
url=$(sed 's,\(.*://\)[^/.]*,\1new_value,' <<< "$url")
Here, the sed command means:
\(.*://\) - Capturing group 1: any text and then ://
[^/.]* - zero or more chars other than / and .
\1new_value - replaces the match with the Group 1 and new_value is appended to this group value.
Demo:
url='https://foo.bar.whoo.dum.io'
sed 's,\(.*://\)[^/.]*,\1new_value,' <<< "$url"
# => https://new_value.bar.whoo.dum.io
1st solution: Using bash parameter expansion, where newValue is a variable holding the new value you want in your URL.
url='https://foo.bar.whoo.dum.io'
newValue="newValue"
echo "${url%//*}//$newValue.${url#*.}"
2nd solution: With your shown samples, please try the following sed code, where the variable url holds your sample URL.
echo "$url" | sed 's/:\/\/[^.]*/:\/\/new_value/'
Explanation: echo prints the value of the shell variable url and pipes it to sed as standard input. sed's substitution then replaces :// and everything up to the first . with ://new_value.

How can I use sed to regex string and number in bash script

I want to separate string and number in a file to get a specific number in bash script, such as:
Branches executed:75.38% of 1190
I want to get only the number
75.38
I tried the code below
$new_value=value | sed -r 's/.*_([0-9]*)\..*/\1/g'
but it was incorrect and it failed.
How should it work? Thank you in advance for your help.
You can use the following regex to extract the first number in a line:
^[^0-9]*\([0-9.]*\).*$
Usage:
% echo 'Branches executed:75.38% of 1190' | sed 's/^[^0-9]*\([0-9.]*\).*$/\1/'
75.38
Give this a try:
value=$(sed "s/^Branches executed:\([0-9][.0-9]*[0-9]*\)%.*$/\1/" afile)
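It is assumed that the line appears only once in afile. Any other lines in afile would also end up in value, because sed prints non-matching lines unchanged.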
The value is stored in the value variable.
There are several things here that we could improve. One is that you need to escape the parentheses in sed: \(...\)
Another one is that it would be good to have a full specification of the input strings as well as a good script that can help us to play with this.
Anyway, this is my first attempt:
Update: I added a little more bash around this regex so it'll be easier to play with:
value='Branches executed:75.38% of 1190'
new_value=$(echo "$value" | sed -e 's/[^0-9]*\([0-9]*\.[0-9]*\).*/\1/g')
echo "$new_value"
Update 2: as john pointed out, it will match only numbers that contain a decimal dot. We can fix it with an optional group: \(\.[0-9]\+\)\?.
An explanation for the optional group:
\(...\) is a group.
\(...\)\? is a group that appears zero or one times (mind the escaped question mark; \? and \+ are GNU sed extensions in basic regexes).
\.[0-9]\+ is the pattern for a dot and one or more digits.
Putting all together:
value='Branches executed:75.38% of 1190'
new_value=$(echo "$value" | sed -e 's/[^0-9]*\([0-9]\+\(\.[0-9]\+\)\?\).*/\1/g')
echo "$new_value"
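For reference, the same optional-group idea can be written with extended regexes (sed -E), which avoids the backslashes and the GNU-only \+ and \?. This is a sketch rather than a drop-in fix; first_number is a hypothetical helper name. It also shows the assignment syntax the question got wrong: no $ on the left-hand side, and the pipeline goes inside $( ).

```shell
# first_number is a hypothetical helper used only for this sketch.
# It prints the first integer or decimal number found in its argument.
first_number() {
  printf '%s\n' "$1" | sed -E 's/[^0-9]*([0-9]+(\.[0-9]+)?).*/\1/'
}

# Correct assignment: name=$(command), not $name=command.
new_value=$(first_number 'Branches executed:75.38% of 1190')
echo "$new_value"                          # 75.38
first_number 'Lines executed:80% of 10'    # 80
```

If the input contains no digit at all, sed leaves the line unchanged, so callers may want to check the result.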

Why those two sed commands get different result?

A csv file example.csv, it has
hello,world,wow
this,is,amazing
I want to get the first column elements, at the beginning I wrote a sed command like:
sed -n 's/\([^,]*\),*/\1/p' example.csv
output:
helloworld,wow
thisis,amazing
Then I modified my command to the following and get what I want:
sed -n 's/\([^,]*\).*/\1/p' example.csv
output:
hello
this
In command 1 I used a comma (,) and in command 2 I replaced the comma with a dot (.), and it works as expected. Can anyone explain how sed really works to produce the first output? What's the story behind it? Is it because of the dot (.) or because of the capture group and back-reference?
In both regexes, \([^,]*\) consumes the same part of the string: all the characters preceding the first comma. The difference is in how the remaining part of each regex is treated.
In the first one, it's ,* (zero or more commas). All it can consume is the comma itself; the rest of the line isn't covered by the pattern.
In the second one, it's .* (zero or more of any character). That covers the remaining string completely, as it has nothing to stop at.
In both cases the part of the string covered by the pattern is replaced by the contents of the capturing group (all the characters before the first comma), and whatever the rest of the regex covered is simply removed. So in the first case only the first comma is erased; in the second, the comma and the rest of the line.
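The difference is easy to reproduce without a file. A minimal sketch, feeding the two sample lines from the question to both commands (csv, first and second are placeholder variable names):

```shell
csv=$(printf '%s\n' 'hello,world,wow' 'this,is,amazing')

# ",*" only swallows the comma right after the first field.
first=$(printf '%s\n' "$csv" | sed -n 's/\([^,]*\),*/\1/p')
# ".*" swallows everything after the first field.
second=$(printf '%s\n' "$csv" | sed -n 's/\([^,]*\).*/\1/p')

printf '%s\n' "$first" "$second"
# helloworld,wow
# thisis,amazing
# hello
# this
```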
The reason is that the pattern matches only the first part of the line, i.e. only the hello, part is replaced. The ,* part takes an arbitrary number of commas, and nothing comes after it in the pattern, so the rest of the line is left untouched. For example:
hello,,,,,,,,,,,,,,,,,,world
would be replaced to
helloworld
A good example would be
sed -n 's/\([^,]*\),*$/\1/p' example.csv
This will work if and only if all the commas are at the end of the line and will trim them, e.g.
hello,,,,,,
Hope this makes the problem a bit clearer.
In a regex, the . (dot) is a placeholder for any single character.
Can I suggest not using sed?
cut -d, -f1 example.csv
Personally, I'm a huge sed fan, but cut is much more appropriate in this instance.
If you want the first field, why not use awk?
awk -F, '{print $1}' file
hello
this
Using sed with back reference
sed -nr 's/([^,]*),.*/\1/p' file
hello
this
To make it work you need the .* so that the match covers the whole line.
The -r option (GNU sed; -E is the more portable spelling) enables extended regexes, so the parentheses do not need to be escaped as \( and \).

my sed is close... but not quite there, can you help please?

I want to print only the lines that meet the criteria : "worde:" and "wordo;"
I got this far:
sed -n '/\([a-z]*\)\1e:\1o;/p;'
But it doesn't quite work.
Can someone please fix it and tell me exactly what was wrong with mine?
(Please note there are no capital letters ever, hence why I didn't bother including that within my initial character range)
Thanks heaps,
This will handle lines where "worde:wordo;" (nothing between the words) appears:
sed -n '/\([a-z]*\)e:\1o;/p;'
If you need to allow for characters BETWEEN the words, you'll need something like this:
sed -n '/\([a-z]*\)e:.*\1o;/p;'
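As for what was wrong with the original: the extra \1 right after the group asks for the word to appear twice before e:. The sketch below runs the two corrected patterns on three made-up lines, plus a stricter variant that I am adding on the assumption that an empty "word" should not count.

```shell
lines=$(printf '%s\n' 'cate:cato;' 'cate: then cato;' 'cate:dogo;')

# Adjacent form: only the first line matches.
adjacent=$(printf '%s\n' "$lines" | sed -n '/\([a-z]*\)e:\1o;/p')
# With .* in between, [a-z]* may match the empty string, so the
# third line also matches as "e:" ... "o;".
loose=$(printf '%s\n' "$lines" | sed -n '/\([a-z]*\)e:.*\1o;/p')
# Requiring at least one letter in the group (an assumption about
# the intent) rejects the third line.
strict=$(printf '%s\n' "$lines" | sed -n '/\([a-z][a-z]*\)e:.*\1o;/p')

printf '%s\n' "adjacent:" "$adjacent" "loose:" "$loose" "strict:" "$strict"
```

Here adjacent keeps only the first line, loose keeps all three, and strict keeps the first two.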
My interpretation of your question is that you want to match lines which contain both worde: and wordo;
sed -n '/worde:/{/wordo;/p}' infile
The -n option stops sed from printing each line automatically. When the first regex matches, control flows into the block; otherwise the line is ignored. Inside the block, the line is printed if the second regex also matches.
One way using alternation:
sed -n '/word\(e:\|o;\)/ p' infile
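The nested-address and alternation versions answer different questions, which a quick comparison makes visible. This is a sketch on made-up input; the alternation is written with -E and a plain | here instead of the \| used above, and a ; is placed before the closing brace because some non-GNU seds require it.

```shell
lines=$(printf '%s\n' 'worde: wordo;' 'worde: only' 'only wordo;' 'neither')

# Nested addresses: print only when BOTH patterns occur on the line.
both=$(printf '%s\n' "$lines" | sed -n '/worde:/{/wordo;/p;}')
# Alternation: print when EITHER pattern occurs.
either=$(printf '%s\n' "$lines" | sed -E -n '/word(e:|o;)/p')

printf '%s\n' "both:" "$both" "either:" "$either"
# both:
# worde: wordo;
# either:
# worde: wordo;
# worde: only
# only wordo;
```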
Is it a requirement to use capture groups? I went without them.
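$ sed -n '/\w*[oe][:;]/p'
\w* - Match any run of word characters. \w is a GNU extension and has to sit outside brackets: inside [...] the backslash is literal, so [\w] would match a backslash or a w. (If you really want only [a-z], swap that back in.)
[oe] - Those word characters must end in an e or o
[:;] - And then have a : or ;
Note that this also accepts worde; and wordo:, and it needs only one of the two words on the line.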
This might work for you:
sed '/^\(.*\)[eE]:\s*\1[oO];/!d' file

using sed to copy lines and delete characters from the duplicates

I have a file that looks like this:
#"Afghanistan.png",
#"Albania.png",
#"Algeria.png",
#"American_Samoa.png",
I want it to look like this
#"Afghanistan.png",
#"Afghanistan",
#"Albania.png",
#"Albania",
#"Algeria.png",
#"Algeria",
#"American_Samoa.png",
#"American_Samoa",
I thought I could use sed to do this but I can't figure out how to store something in a buffer and then modify it.
Am I even using the right tool?
Thanks
You don't have to get tricky with regular expressions and replacement strings: use sed's p command to print the line intact, then modify the line and let it print implicitly
sed 'p; s/\.png//'
Glenn Jackman's answer is fine, but it also duplicates the lines that do not match the expression.
This one, instead, duplicates only the lines that match the expression:
sed -n 'p; s/\.png//p'
Here, -n stands for "print nothing unless explicitly printed", and the p flag in s/\.png//p prints the line only if the substitution was made.
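A minimal sketch of the difference between the two commands, using one .png line and one made-up line that does not contain .png:

```shell
input=$(printf '%s\n' '#"Albania.png",' '// not an image line')

# p prints every line, then the implicit print emits it again.
all_doubled=$(printf '%s\n' "$input" | sed 'p; s/\.png//')
# With -n, only the explicit p and the p flag of a successful s print.
only_matches=$(printf '%s\n' "$input" | sed -n 'p; s/\.png//p')

printf '%s\n' "without -n:" "$all_doubled" "with -n:" "$only_matches"
# without -n:
# #"Albania.png",
# #"Albania",
# // not an image line
# // not an image line
# with -n:
# #"Albania.png",
# #"Albania",
# // not an image line
```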
That is pretty easy to do with sed, and you do not even need to use the hold space (sed's auxiliary buffer). Given the input file below:
$ cat input
#"Afghanistan.png",
#"Albania.png",
#"Algeria.png",
#"American_Samoa.png",
you should use this command:
sed 's/#"\([^.]*\)\.png",/&\
#"\1",/' input
The result:
$ sed 's/#"\([^.]*\)\.png",/&\
#"\1",/' input
#"Afghanistan.png",
#"Afghanistan",
#"Albania.png",
#"Albania",
#"Algeria.png",
#"Algeria",
#"American_Samoa.png",
#"American_Samoa",
This command is just a substitution (s///). It matches anything starting with #" followed by non-period characters ([^.]*) and then by .png",. The non-period characters before .png", are wrapped in the group delimiters \( and \), so we can reuse what the group matched. So this is the regular expression to be replaced:
#"\([^.]*\)\.png",
Next comes the replacement part of the command. The & character inserts everything that was matched by #"\([^.]*\)\.png", into the replacement. If it were the only element of the replacement, the output would be unchanged. However, the & is followed by a newline character, written as a backslash \ followed by an actual newline, and on the new line we add the #" string followed by the content of the first group (\1) and then the string ",.
This is just a brief explanation of the command. Hope this helps. Also, note that you can use the \n string to represent newlines in some versions of sed (such as GNU sed). It would render a more concise and readable command:
sed 's/#"\([^.]*\)\.png",/&\n#"\1",/' input
I prefer this over Carles Sala and Glenn Jackman's:
sed '/\.png/p;s/\.png//'
Could just say it's personal preference.
or one can combine both versions and apply the duplication only on lines matching the required pattern
sed -e '/^#".*\.png",/{p;s/\.png//;}' input