Regex to find text between second and third slashes - regex

I would like to capture the text that occurs after the second slash and before the third slash in a string. Example:
/ipaddress/databasename/
I need to capture only the database name. The database name might have letters, numbers, and underscores. Thanks.

How you access it depends on your language, but you'll basically just want a capture group for whatever falls between your second and third "/". Assuming your string is always in the same form as your example, this will be:
/.*/(.*)/
If multiple slashes can exist, but a slash can never exist in the database name, you'd want:
/.*/(.*?)/

/.*?/(.*?)/
In the event that your lines always have / at the end of the line:
([^/]*)/$
Alternate split method:
split("/")[2]

The regex would be:
/[^/]*/([^/]*)/
so in Perl, the regex capture statement would be something like:
($database) = $text =~ m!/[^/]*/([^/]*)/!;
Normally the / character is used to delimit regexes but since they're used as part of the match, another character can be used. Alternatively, the / character can be escaped:
($database) = $text =~ /\/[^\/]*\/([^\/]*)\//;
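
For readers working outside Perl, the same capture can be sketched in Python. This is only an illustration; the function name database_name is hypothetical.

```python
import re

# Same pattern as above: skip the first slash-delimited field, capture the second.
SECOND_FIELD = re.compile(r"/[^/]*/([^/]*)/")


def database_name(path):
    """Return the text between the second and third slash, or None."""
    match = SECOND_FIELD.search(path)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(database_name("/ipaddress/databasename/"))
```

Since Python regexes are written as ordinary strings, the slashes need neither escaping nor an alternative delimiter.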

You can shorten the pattern even further this way:
[^/]+/(\w+)
Here \w includes characters like A-Z, a-z, 0-9 and _
I would suggest giving a split function priority wherever one can be used, since in my experience split performs better than regex functions.

You can use the explode function in PHP, or split in other languages, to do this kind of operation.
Anyway, here is a regex pattern:
/[\/]*[^\/]+[\/]([^\/]+)/

I know you specifically asked for regex, but you don't really need regex for this. You simply need to split the string by its delimiter (in this case a forward slash), then choose the part you need (in this case the 3rd field, because the first field is empty).
cut example:
cut -d '/' -f 3 <<< "$string"
awk example:
awk -F '/' '{print $3}' <<< "$string"
perl expression, using split function:
(split '/', $string)[2]
etc.
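
The split approach looks much the same in Python. A minimal sketch, assuming the string starts with a slash so that the first field is empty; the function name third_field is hypothetical.

```python
def third_field(path, sep="/"):
    """Return the third separator-delimited field, or None if there is none."""
    parts = path.split(sep)
    # "/ipaddress/databasename/" splits into ['', 'ipaddress', 'databasename', ''],
    # so index 2 is the database name.
    return parts[2] if len(parts) > 2 else None


if __name__ == "__main__":
    print(third_field("/ipaddress/databasename/"))
```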


Grep a filename with a specific underscore pattern

I am trying to grep a pattern from files using egrep and regex without success.
What I need is to get a file with for example a convention name of:
xx_code_lastname_firstname_city.doc
The code should have at least 3 digits; the lastname, firstname, and city can vary in length.
I am trying the code below but it fails to achieve what I desire:
ls -1 | grep -E "[xx_][A-Za-z]{3,}[_][A-Za-z]{2,}[_][A-Za-z]{2,}[_][A-Za-z]{2,}[.][doc|pdf]"
That is trying to get the standard xx_ from the beginning, then any code that has at least 3 characters, after which there must be another underscore, and so on.
Could anybody help ?
Consider an extglob, as follows:
#!/bin/bash
shopt -s extglob # turn on extended globbing syntax
files=( xx_[[:alpha:]][[:alpha:]]+([[:alpha:]])_[[:alpha:]]+([[:alpha:]])_[[:alpha:]]+([[:alpha:]])_[[:alpha:]]+([[:alpha:]]).@(doc|docx|pdf) )
[[ -e ${files[0]} || -L ${files[0]} ]] && printf '%s\n' "${files[@]}"
This works because
[[:alpha:]][[:alpha:]]+([[:alpha:]])
...matches any string of three or more alpha characters -- two of them explicitly, one of them with the +() one-or-more extglob syntax.
Similarly,
@(doc|docx|pdf)
...matches any of these three specific strings.
So you're trying to match a literal xx_? Begin your pattern with that portion then.
xx_
Next comes the "3 digits" you're trying to match. I'm going to assume based off your own regex that by "digits" you mean characters (hence the [a-zA-Z] character classes). Let's make the quantifier non-greedy to avoid any unintentional capturing behavior.
xx_[a-zA-Z]{3,}?
For the firstname and lastname portions, I see you've specified a variable length with at least 2 characters. Let's make sure these quantifiers are non-greedy as well by appending the ? character after our quantifiers. According to your regex, it also looks like you expect your city construct to take a similar form to the firstname and lastname bits. Let's add all three then.
xx_[a-zA-Z]{3,}?_[a-zA-Z]{2,}?_[a-zA-Z]{2,}?_[a-zA-Z]{2,}\.
NOTE: We didn't need to make the city quantifier non-greedy since we asserted that it's followed by a literal ".", which we don't expect to appear anywhere else in the text we're interested in matching. Notice how it's escaped because it's a metacharacter in the regex syntax.
Lastly comes the file extensions, which your example has as "docx". I also see you put a "doc" and a "pdf" extension in your regex. Let's combine all three of these.
xx_[a-zA-Z]{3,}?_[a-zA-Z]{2,}?_[a-zA-Z]{2,}?_[a-zA-Z]{2,}\.(docx?|pdf)
Hopefully this works. Comment if you need any clarification. Notice how the "doc" and the "docx" portions were condensed into one element. This is not necessary, but I think it looks more deliberate in this form. It could also be written as (doc|docx|pdf). A little repetitive for my taste.
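
As a rough illustration of the final pattern, here is a minimal Python sketch. The function name matches_convention and the sample filenames are hypothetical. The lazy ? markers are left out because the character classes cannot match an underscore, so greedy and lazy quantifiers accept the same names; note also that grep -E (POSIX ERE) does not define lazy quantifiers.

```python
import re

# The pattern built up above, without the lazy "?" markers.
FILENAME = re.compile(
    r"xx_[a-zA-Z]{3,}_[a-zA-Z]{2,}_[a-zA-Z]{2,}_[a-zA-Z]{2,}\.(docx?|pdf)"
)


def matches_convention(name):
    """Return True if the whole filename follows xx_code_last_first_city.ext."""
    return FILENAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("xx_abc_smith_john_paris.doc", "xx_ab_smith_john_paris.doc"):
        print(candidate, matches_convention(candidate))
```

fullmatch is used so that the pattern has to cover the entire name, which plays the role that ^ and $ anchors would play with grep.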

How to substitute a string even it contains regex meta characters using Shell or Perl?

I want to substitute a word that may contain regex metacharacters with another word, for example, substitute .Precilla123 with .Precill. I tried the solution below:
sed 's/.Precilla123/.Precill/g'
but it also changes the line below
"Precilla123";"aaaa aaa";"bbb bbb"
to
.Precill";"aaaa aaa";"bbb bbb"
This side effect is not what I wanted. So I tried:
perl -pe 's/\Q.Precilla123\E/.Precill/g'
The above solution disables the interpretation of regex metacharacters, so it does not have the side effect.
Unfortunately, if the word contains $ or @, this solution still does not work, because Perl treats $ and @ as variable prefixes.
Can anybody help with this? Many thanks.
Please note that the word I want to substitute is NOT hard-coded; it comes from an input file, so you can consider it a variable.
"Unfortunately, if the word contains $ or @, this solution still does not work, because Perl treats $ and @ as variable prefixes."
This is not true.
If the value that you want to replace is in a Perl variable, then quotemeta will work on the variable's contents just fine, including the characters $ and @:
echo 'pre$foo to .$foobar' | perl -pe 'my $from = q{.$foo}; s/\Q$from\E/.to/g'
Outputs:
pre$foo to .tobar
If the words that you want to replace are in an external file, then simply load that data in a BEGIN block before composing your regular expressions for replacement.
sed 's/\.Precilla123/.Precill/g'
Escape the meta character with \.
Be careful: the metacharacters are not the same in the two halves of the command. In the search pattern they are mainly the regex characters []{}()\.*+^$, whereas in the replacement only & and \ are special (plus the separator, which is whatever character follows the s, in both halves).
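
The same idea, escaping the search word so that it is matched literally, can be sketched in Python. This is only an illustration under the assumption that the word and its replacement arrive as ordinary strings (for example, read from the input file); the function name replace_literal is hypothetical.

```python
import re


def replace_literal(line, word, replacement):
    """Replace every literal occurrence of `word` in `line`."""
    # re.escape neutralizes regex metacharacters in the search word.
    # Passing a function as the replacement keeps any backslashes in
    # `replacement` literal as well.
    return re.sub(re.escape(word), lambda _match: replacement, line)


if __name__ == "__main__":
    # The leading quote is not a dot, so this line is left untouched.
    print(replace_literal('"Precilla123";"aaaa aaa";"bbb bbb"', ".Precilla123", ".Precill"))
    # A word containing $ is handled like any other literal text.
    print(replace_literal("pre$foo to .$foobar", ".$foo", ".to"))
```

For a purely literal substitution like this one, Python's str.replace would do the same job without any regex at all.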

Regex expression to find file extension in a file with multiple periods

How would you write a regular expression to find the file extension of the following files, keeping in mind that what I am looking for is the ".pdf" or ".xls" portion of the string?
REPORTPDF.20130810.pdf.pgp
REPORTXLS.20130810.xls.pgp
EDIT:
The resulting filenames I want to end up with are the following:
REPORT20130810.PDF
REPORT20130810.XLS
I am on a Windows platform. I've played around with this a bit at http://regexpal.com/ but so far I can only figure out how to match the date:
([0-9]{4}[0-9]{2}[0-9]{2})
Using sed:
sed 's/^\(.*[^.]*\)\.[^.]*$/\1/' <<< "REPORTPDF.20130810.pdf.pgp"
REPORTPDF.20130810.pdf
Using grep -P (PCRE regex):
grep -oP '^.+[^.]+(?=\.[^.]+$)' <<< "REPORTPDF.20130810.pdf.pgp"
REPORTPDF.20130810.pdf
.+\.(\w+)\.\w+$ would deliver the last-but-one extension as group 1; how you access it then depends on the host language for the regex.
If you don't need the file extension to be capitalized, this should work
([a-zA-Z]+)\.([0-9]{4}[0-9]{2}[0-9]{2})\.(xls|pdf)\.pgp
Matches:
REPORTXLS.20130810.xls.pgp
And then the groups you'd use are two and three
REPORT\2.\3
Result:
REPORT20130810.xls
Problem is that you don't provide much context for how you're going about changing these file names.
You don't say what language/library you're using, but this Perl one-liner does the trick:
perl -lpe "s/^([^.]*)(...)\.(\d+)(\.\2)\.pgp/\1\3\4/i; $_=uc"
I think this will work for you :)
^(([A-Z a-z]*)(?:XLS.|PDF.)(\d{8})(.pdf|.xls))
^ starts at the beginning of the string
(.*) any character before
\d any number 0-9
{8} only 8 times for that character section (in this case 8 times of the numbers 0-9)
?: is non capture groups
I wrapped the capture groups into one large one so the thing that you want will be in the first capture group :).
This can probably be replaced
([A-Z a-z]*)
with
(REPORT)
This (.*?(?:\..*)?)(\..*) will hold things like:
'hello.1a.2bb.3' ---> group(1) == 'hello.1a.2bb', group(2) == '.3'
'yep.1' ---> group(1) == 'yep', group(2) == '.1'
If the format is pretty much fixed you could use
(REPORT)([^.]++)[.]([^.]++)[.]([^.]++)[.](pgp)
and cherry pick replacement based on what you want
Used java here but regex match would still be same
String a = "REPORTPDF.20130810.pdf.pgp".replaceAll(
"(REPORT)([^.]++)[.]([^.]++)[.]([^.]++)[.](pgp)",
"$1--$2--$3--$4--$5");
String b = "REPORTXLS.20130810.xls.pgp".replaceAll(
"(REPORT)([^.]++)[.]([^.]++)[.]([^.]++)[.](pgp)",
"$1--$2--$3--$4--$5");
System.out.println(a);
System.out.println(b);
REPORT--PDF--20130810--pdf--pgp
REPORT--XLS--20130810--xls--pgp
in your case "$1$3.$2"
String b = "REPORTXLS.20130810.xls.pgp".replaceAll(
"(REPORT)([^.]++)[.]([^.]++)[.]([^.]++)[.](pgp)",
"$1$3.$2");
which produces intended result
REPORT20130810.XLS
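
For comparison, here is a minimal Python sketch of the same rename. It assumes the names always have the shape REPORT<type>.<8 digits>.<ext>.pgp; the function name renamed is hypothetical. Plain + is used instead of the possessive ++, which Python only accepts from version 3.11 on.

```python
import re

# REPORT, a type tag, an 8-digit date, the real extension, then .pgp
PGP_NAME = re.compile(r"^(REPORT)([^.]+)\.(\d{8})\.([^.]+)\.pgp$")


def renamed(filename):
    """Return e.g. REPORT20130810.PDF, or None if the name has another shape."""
    match = PGP_NAME.match(filename)
    if match is None:
        return None
    prefix, _type_tag, date, extension = match.groups()
    # The target names use an upper-case extension.
    return f"{prefix}{date}.{extension.upper()}"


if __name__ == "__main__":
    print(renamed("REPORTPDF.20130810.pdf.pgp"))
    print(renamed("REPORTXLS.20130810.xls.pgp"))
```

Building the new name from the real extension rather than from the type tag is a design choice; with these two sample files both routes give the same result.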

Regex substituting opening parenthesis

As part of a parsing script I'm trying to convert strings like this:
<a href="http://www.web.com/%20Special%20event%202013%20%282%29.pdf">
into
<a href="http://www.web.com/%20Special%20event%202013%20(2).pdf">
The regex for the closing parenthesis works fine
perl -i -pe "s~(href\=\/?[\"\']\.\.\/$i\-(?:(?!%29).)*)%29([^\"\']*[\"\'])~\1)\2~g" "$pageName".html
giving me
<a href="http://www.web.com/%20Special%20event%202013%20%282).pdf">
The problem arises with the equivalent regex for the opening parenthesis:
perl -i -pe "s~(href\=\/?[\"\']\.\.\/$i\-(?:(?!%28).)*)%28([^\"\']*[\"\'])~\1(\2~g" "$pageName".html
just returns the two groups with nothing in between:
<a href="http://www.web.com/%20Special%20event%202013%202%29.pdf">
Escaping the ( in the substitution with a backslash (or two) has no effect. If I wrap it in some other characters (say ~\1#(#\2~g ) the parenthesis still disappears (giving me %20##2%29 ).
If, however, in a fit of desperation I add seven parentheses to the substitution, it works.
perl -i -pe "s~(href\=\/?[\"\']\.\.\/$i\-(?:(?!%28).)*)%28([^\"\']*[\"\'])~\1(((((((\L\2~g" "$pageName".html
outputs
<a href="http://www.web.com/%20Special%20event%202013%20(2%29.pdf">
Can somebody please make sense of this?
Perhaps the following will be helpful or at least provide some direction. It will work on Perl 5.10 and above.
use strict;
use warnings;
use v5.10.0; # For regex \K
use URI::Escape;
my $string = '<a href="http://www.web.com/%20Special%20event%202013%20%282%29.pdf">';
$string =~ s/.+2013%20\K([^.]+)(?=\.pdf)/uri_unescape($1)/e;
print $string;
Output:
<a href="http://www.web.com/%20Special%20event%202013%20(2).pdf">
Left enough of the date and the space (%20) as an anchor, then used \K to *K*eep all of that. Then captured the URI encoded text, which is later decoded and used as the substitution text.
The pattern you have doesn't match the string you show at all. It matches something that looks like
<a href=/"../$i-xxxxxxxxxxxxxxx%29xxxxxxxxxx">
with literal dots, and whatever $i contains.
Also, a couple of points about your substitution:
Don't escape characters that don't need escaping. It may take some experience to know without checking which characters you need to escape, but the main point of using ~ as a delimiter is to avoid having to escape slashes in the regex, so at least you could have avoided that.
Don't use \1, \2 etc. in the replacement string. Perl tries very hard to make this work, but normally in Perl those sequences mean to insert the characters \x01 and \x02. Use $1 and $2.
So your regex could be written
s~(href=/?["']\.\./$i-(?:(?!%29).)*)%29([^"']*["'])~$1)$2~;
but it still doesn't "work fine" with the string you gave, which would have to look something like
<a href=/"../$i-xxxxxxxxxxxxxxx%282%29xxxxxxxxxx">
again, containing whatever is in $i. I don't understand at all the optional slash before the href attribute value: it is invalid HTML.
However, using a string that your first regex matches, your second one also works, replacing opening parentheses correctly, so I can't guess at what the problem may be.
There is often no need to verify the entire string. You can just replace the parts you're interested in. So I would write something like
s/(href="[^"]+)%28(\d+)%29(\.pdf")/$1($2)$3/;
which works fine on the string you gave, and replaces both open and close parentheses at once.
I had some problems understanding your regex, but this might work:
perl -pe "s~(href\s*=\s*\"[^\"]*)%28(.*?)%29~\$1(\$2)~g" input
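
The "replace only the part you're interested in" approach can also be sketched in Python. This is an illustration only: it reuses the pattern s/(href="[^"]+)%28(\d+)%29(\.pdf")/$1($2)$3/ from the earlier answer, and the function name decode_parens is hypothetical.

```python
import re

# href prefix, then %28 digits %29, then the .pdf" tail
ENCODED_PARENS = re.compile(r'(href="[^"]+)%28(\d+)%29(\.pdf")')


def decode_parens(html):
    """Turn %28N%29 just before .pdf" into (N), replacing both ends at once."""
    # Parentheses are ordinary characters in the replacement template.
    return ENCODED_PARENS.sub(r"\1(\2)\3", html)


if __name__ == "__main__":
    tag = '<a href="http://www.web.com/%20Special%20event%202013%20%282%29.pdf">'
    print(decode_parens(tag))
```

Because both escapes are handled in a single substitution, there is no separate pass for the opening parenthesis to go wrong.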

Using regex to find any last occurrence of a word between two delimiters

Suppose I have the following test string:
Start_Get_Get_Get_Stop_Start_Get_Get_Stop_Start_Get_Stop
where _ means any characters, eg: StartaGetbbGetcccGetddddStopeeeeeStart....
What I want to extract is the last occurrence of the word Get within each pair of Start and Stop delimiters. The result here would be the three Get occurrences that come last before each Stop (bolded in the original post):
Start__Get__Get__Get__Stop__Start__Get__Get__Stop__Start__Get__Stop
To be precise, I'd like to do this using only regex and, as far as possible, in a single pass.
Any suggestions are welcome.
Thanks
Get(?=(?:(?!Get|Start|Stop).)*Stop)
I'm assuming your Start and Stop delimiters will always be properly balanced and they can't be nested.
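
A quick way to check this lookahead pattern is to run it over the test string. Below is a minimal Python sketch; the function name last_get_positions is hypothetical, and it relies on the same assumption of balanced, non-nested delimiters.

```python
import re

# A Get is "last" when no other Get, Start, or Stop appears before the next Stop.
LAST_GET = re.compile(r"Get(?=(?:(?!Get|Start|Stop).)*Stop)")


def last_get_positions(text):
    """Return the start offset of the last Get in each Start...Stop block."""
    return [match.start() for match in LAST_GET.finditer(text)]


if __name__ == "__main__":
    sample = "Start_Get_Get_Get_Stop_Start_Get_Get_Stop_Start_Get_Stop"
    print(last_get_positions(sample))
```

Note that . does not match a newline by default, so blocks that span several lines would need the re.DOTALL flag.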
I would have done it with two passes. The first pass finds the word "Get", and the second pass counts the number of occurrences of it.
$ echo "Start_Get_Get_Get_Stop_Start_Get_Get_Stop_Start_Get__Stop" | awk -vRS="Stop" -F"_*" '{print $(NF-1)}'
Get
Get
Get
Something like this, maybe:
(?<=Start(?:.Get)*)Get(?=.Stop)
That requires variable-length lookbehind support, which not all regex engines support.
It could be made to have a max length, which a few more (but still not all) support, by changing the first * to {0,99} or similar.
Also, in the lookahead, possibly the . should be a .+ or .{1,2} depending on if the double underscore is a typo or not.
With Perl, I'd do:
my $test = "Start_Get_Get_Get_Stop_Start_Get_Get_Stop_Start_Get_Stop";
$test =~ s#(?<=Start_)((Get_)*)(Get)(?=_Stop)#$1<FOUND>$3</FOUND>#g;
print $test;
output:
Start_Get_Get_<FOUND>Get</FOUND>_Stop_Start_Get_<FOUND>Get</FOUND>_Stop_Start_<FOUND>Get</FOUND>_Stop
You should adapt to your regex flavour.