Regex to extract everything until it encounters a number after a slash - regex

I am looking to extract everything from a string but ignore everything after encountering a number after a slash (alphanumeric segments are allowed).
Examples:
http://www.test.com/products/cards/product_code100/12345/something_else
http://www.test.com/products/123abc/45678/
Desired output -
http://www.test.com/products/cards/product_code100/
http://www.test.com/products/123abc/
The following regex gives me everything in backreferences but it'll be great if I could get rid of numbers after a slash-
^(.*:)//([a-z\-.]+)(:[0-9]+)?(.*)
Additional information: a language-independent regex is needed.
Many Thanks

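This should work with most languages and should produce the desired output:
(http://.*)(?=/\d+(?!\w+))
It takes every character until it finds (via the lookahead) a / followed by a purely numeric segment. Note that the match stops just before that slash, so the trailing / itself is not part of the result.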
If you try to match
http://www.test.com/products/123abc/
or
http://www.test.com/products/123abc
it simply will not find a match, so you can be sure the checked string does not contain a purely numeric segment after a slash.
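As a rough illustration, here is how the pattern above could be applied in Python. Python is picked only as an example language, and the function name is hypothetical. Because the lookahead leaves the slash outside the match, this sketch appends it to reproduce the desired output.

```python
import re

# Pattern taken from the answer above.
PATTERN = re.compile(r"(http://.*)(?=/\d+(?!\w+))")

def strip_after_numeric_segment(url):
    """Return the URL up to (and including) the slash that precedes the
    last purely numeric path segment, or None if there is no such segment."""
    m = PATTERN.match(url)
    # The lookahead does not consume the slash, so add it back by hand.
    return m.group(1) + "/" if m else None

print(strip_after_numeric_segment(
    "http://www.test.com/products/cards/product_code100/12345/something_else"))
print(strip_after_numeric_segment("http://www.test.com/products/123abc/45678/"))
print(strip_after_numeric_segment("http://www.test.com/products/123abc/"))
```

The third call prints None, matching the "no match" behaviour described above.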

Example in Perl:
echo "http://...." | perl -pe 's/(.*\/)\d+\/.*/$1/'
or:
echo "http://...." | perl -ne 'print "$1\n" if /(.*\/)\d+\/.*/'
Edit: It's true what @creinig noted in his comment: there is no such thing as a generic regex. Nonetheless, Perl is widely used, so it's an option.
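For comparison, the same substitution idea can be sketched in Python with re.sub. This is only an illustration of the Perl one-liner above, not part of the original answer, and the function name is made up. Like the Perl version, it requires a slash after the numeric segment and leaves the input unchanged when nothing matches.

```python
import re

def drop_after_number(url):
    # Keep everything up to the slash before a numeric segment ("\1"),
    # and drop that segment, the slash after it, and the rest of the string.
    return re.sub(r"(.*/)\d+/.*", r"\1", url)

print(drop_after_number(
    "http://www.test.com/products/cards/product_code100/12345/something_else"))
print(drop_after_number("http://www.test.com/products/123abc/45678/"))
```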

Related

Regex does not match in Perl, while it does in other programs

I have the following string:
load Add 20 percent
to accommodate
I want to get to:
load Add 20 percent to accommodate
With, e.g., regex in sublime, this is easily done by:
Regex:
([a-z])\n\s([a-z])
Replace:
$1 $2
However, in Perl, if I input this command, (adapted to test if I can match the pattern in any case):
perl -pi.orig -e 's/[a-z]\n.+to/TEST/g' file
It doesn't match anything.
Does anyone know why Perl would be different in this case, and what the correct formulation of the Perl command should be?
By default, Perl's -p flag reads the input one line at a time, so you can't expect your regex to match anything after the \n.
Instead, you want to read the whole input at once. You can do this by using the flag -0777 (this is documented in perlrun):
perl -0777 -pi.orig -e 's/([a-z])\n\s(to)/$1 $2/' file
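The difference between line-by-line and whole-input matching can be reproduced in other languages too. The Python sketch below is only an illustration; it assumes the second line of the sample starts with a single space, as the \s in the Sublime pattern suggests.

```python
import re

# Assumption: the continuation line starts with one space.
text = "load Add 20 percent\n to accommodate\n"
pattern = re.compile(r"([a-z])\n\s([a-z])")

# Line by line (what perl -p does): each chunk ends at its newline,
# so the pattern never sees the characters that follow it.
joined_per_line = "".join(
    pattern.sub(r"\1 \2", line) for line in text.splitlines(keepends=True)
)

# Whole input at once (what -0777 does): the newline is inside the string.
whole = pattern.sub(r"\1 \2", text)

print(repr(joined_per_line))
print(repr(whole))
```

The first result is identical to the input, while the second has the two lines joined.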
For reference, here is your initial Perl attempt:
perl -pi.orig -e 's/[a-z]\n.+to/TEST/g' file
Note that in a Perl regex, [a-z] matches exactly one lowercase letter and never whitespace. So, as a start, add a repetition quantifier and include the space in the character class. Also, because the 'to' is consumed by the match, you must put it back in the replacement string, as in the following Perl program:
$str = "load Add 20 percent
to accommodate";
print "before:\n$str\n";
$str =~ s/([ a-z]+)\n\s*to/$1 to/;
print "after:\n$str\n";
This program produces the following output:
before:
load Add 20 percent
to accommodate
after:
load Add 20 percent to accommodate
So, if I understood correctly what you want to do, your regexp should look like:
s/([ a-z]+)\n\s*to/$1 to/ (note the leading space before 'a-z').

Perl Regex Command Line Issue

I'm trying to use a negative lookahead in perl in command line:
echo 1.41.1 | perl -pe "s/(?![0-9]+\.[0-9]+\.)[0-9]$/2/g"
to get an incremented version that looks like this:
1.41.2
but it just returns:
![0-9]+\.[0-9]+\.: event not found
I've tried it in regex101 (PCRE) and it works fine, so I'm not sure why it doesn't work here.
In Bash, ! is the "history expansion character", except when escaped with a backslash or single-quotes. (Double-quotes do not disable this; that is, history expansion is supported inside double-quotes. See Difference between single and double quotes in Bash)
So, just change your double-quotes to single-quotes:
echo 1.41.1 | perl -pe 's/(?![0-9]+\.[0-9]+\.)[0-9]$/2/g'
and voilà:
1.41.2
I'm guessing that this expression also might work:
([0-9.]+)\.([0-9]+)
Test
perl -e'
my $name = "1.41.1";
$name =~ s/([0-9.]+)\.([0-9]+)/$1\.2/;
print "$name\n";
'
Output
1.41.2
If you want to "increment" a number, then you can't hard-code the new value; you need to capture what is there and increment that:
echo "1.41.1" | perl -pe's/[0-9]+\.[0-9]+\.\K([0-9]+)/$1+1/e'
Here the /e modifier makes the replacement side be evaluated as code, so we can add 1 to the captured number, and the result is then substituted. The \K keeps everything matched before it out of the replaced portion, so we don't need to put it back; see "Lookaround Assertions" in Extended Patterns in perlre.
The lookarounds are sometimes just the thing you want, but they increase the regex complexity (just by being there), can be tricky to get right, and hurt efficiency. They aren't needed here.
The strange output you get is because the double quotes used around the Perl program "invite" the shell to look at what's inside whereby it interprets the ! as history expansion and runs that, as explained in ruakh's post.
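The capture-and-increment idea is not specific to Perl. As a hedged sketch, the Python version below uses a replacement function instead of /e; the function name bump_patch is invented for this example, and it assumes a plain three-part version string.

```python
import re

def bump_patch(version):
    # Group 1 is "major.minor.", group 2 is the patch number to increment.
    return re.sub(
        r"^(\d+\.\d+\.)(\d+)$",
        lambda m: m.group(1) + str(int(m.group(2)) + 1),
        version,
    )

print(bump_patch("1.41.1"))
print(bump_patch("1.41.9"))
```

Because the number is parsed and incremented rather than hard-coded, 1.41.9 becomes 1.41.10.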
As an alternative to the lookahead, we can use capture groups; e.g., the following captures the version number in three groups.
(\d+)\.(\d+)\.(\d+)
If you wanted to output the captured version number as is, it would be:
\1.\2.\3
And to just replace the 3rd part with the number "2" would be:
\1.\2.2
To adapt this to the OP's question, it would be:
$ echo 1.14.1 | perl -pe 's/(\d+)\.(\d+)\.(\d+)/\1.\2.2/'
1.14.2
$

How can I use sed to regex string and number in bash script

I want to separate the text and the number in a line of a file to get a specific number in a bash script, for example:
Branches executed:75.38% of 1190
I only want to get the number
75.38
I tried the code below,
$new_value=value | sed -r 's/.*_([0-9]*)\..*/\1/g'
but it is incorrect and fails.
How should this be done? Thanks in advance for your help.
You can use the following regex to extract the first number in a line:
^[^0-9]*\([0-9.]*\).*$
Usage:
% echo 'Branches executed:75.38% of 1190' | sed 's/^[^0-9]*\([0-9.]*\).*$/\1/'
75.38
Give this a try:
value=$(sed "s/^Branches executed:\([0-9][.0-9]*[0-9]*\)%.*$/\1/" afile)
It is assumed that the line appears only once in afile.
The value is stored in the value variable.
There are several things here that we could improve. One is that you need to escape the parentheses in sed: \(...\)
Another one is that it would be good to have a full specification of the input strings as well as a good script that can help us to play with this.
Anyway, this is my first attempt:
Update: I added a little more bash around this regex so it is easier to play with:
value='Branches executed:75.38% of 1190'
new_value=`echo $value | sed -e 's/[^0-9]*\([0-9]*\.[0-9]*\).*/\1/g'`
echo $new_value
Update 2: as john pointed out, this matches only numbers that contain a decimal point. We can fix it with an optional group: \(\.[0-9]\+\)\?.
An explanation for the optional group:
\(...\) is a group.
\(...\)\? is a group that appears zero or one times (mind the \?; in GNU sed's basic regex syntax, both \? and \+ need the backslash).
\.[0-9]\+ is the pattern for a dot and one or more digits.
Putting it all together:
value='Branches executed:75.38% of 1190'
new_value=`echo $value | sed -e 's/[^0-9]*\([0-9]\+\(\.[0-9]\+\)\?\).*/\1/g'`
echo $new_value
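For readers who want to experiment outside of sed, the same "digits with an optional decimal part" idea can be sketched in Python. This is only an illustration; first_number is a hypothetical helper name, and the second sample line is made up.

```python
import re

def first_number(line):
    # One or more digits, optionally followed by a dot and more digits.
    m = re.search(r"[0-9]+(?:\.[0-9]+)?", line)
    return m.group(0) if m else None

print(first_number("Branches executed:75.38% of 1190"))
print(first_number("Lines executed:80% of 10"))
```

The optional group is what lets the second call return a number without a decimal point.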

Matching A File Name Using Grep

The overarching problem:
So I have a file name that comes in the form of
JohnSmith14_120325_A10_6.raw
and I want to match it using regex. I have a couple of issues in building a working example but unfortunately my issues won't be solved unless I get the basics.
So I have just recently learned about piping and one of the cool things I learned was that I can do the following.
X=ll_paprika.sc (don't ask)
VAR=`echo $X | cut -d_ -f 2`
echo $VAR
which gives me paprika.sc
Now when I try to execute the pipe idea in grep, nothing happens.
x=ll_paprika.sc
VAR=`echo $X | grep *.sc`
echo $VAR
Can anyone explain what I am doing wrong?
Second question:
How does one match a single underscore using regex?
Here's what I am ultimately trying to do;
VAR=`echo $X | grep -e "^[a-bA-Z][a-bA-Z0-9]*(_){1}[0-9]*(_){1}[a-bA-Z0-9]*(_){1}[0-9](\.){1}(raw)"
So the basic idea of my pattern here is that the file name must start with a letter
and then it can have any number of letters and numbers following it, and it must have an _ to delimit a series of numbers, another _ to delimit the next set of numbers and characters, and another _ to delimit the next set of numbers, and then it must have a single period followed by raw. This looks grossly wrong and ugly (because I am not sure about the syntax). So how does one match a file extension? Can someone put up a simple example for something like ll_paprika.sc so that I can figure out how to write my own regex?
Thanks.
x=ll_paprika.sc
VAR=`echo $X | grep *.sc`
echo $VAR
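The reason this isn't doing what you want is twofold. First, grep matches a line and returns the whole line, not just the part that matched. Second, *.sc is a shell glob, not a regular expression: left unquoted, the shell may expand it against file names in the current directory, and as a basic regex a leading * is a literal asterisk, so the pattern does not match ll_paprika.sc. With a quoted pattern such as '\.sc$', grep matches and the whole line ends up in $VAR.
If you just want to get a part of it, the cut line is probably better. There is a grep -o option that returns only the matching portion, but for this you'd basically have to put in the thing you were looking for, at which point why bother?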
the file name must start with a letter
grep -E "^[a-zA-Z]
and then it can have any number of letters and numbers following it
[a-zA-Z0-9]*
and it must have an _ delimit a series of numbers and another _ to delimit the next set of numbers and characters and another _ to delimit the next set of numbers
_[0-9]+_[a-zA-Z0-9]+_[0-9]+
and then it must have a single period followed by raw.
\.raw$"
(Use grep -E here: with plain grep -e, the + is taken literally. The dot is escaped so that it matches only a literal period.)
For the first, use:
VAR=`echo $X | egrep '\.sc$'`
For the second, you can try this alternative instead:
VAR=`echo $X | egrep '^[[:alpha:]][[:alnum:]]*_[[:digit:]]+_[[:alnum:]]+_[[:digit:]]+\.raw'`
Note that your character classes from your expression differ from the description that follows in that they seem to only be permissive of a-b for lower case characters in some places. This example is permissive of all alphanumeric characters in those places.
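To try the file-name pattern without a shell, here is a small Python sketch of the same rules (letter first, then alphanumerics, then three underscore-delimited fields, then .raw). The names RAW_NAME and is_raw_name are hypothetical, and the character classes follow the prose description rather than the a-b ranges in the question.

```python
import re

# Letter, alphanumerics, _digits, _alphanumerics, _digits, literal ".raw".
RAW_NAME = re.compile(r"[A-Za-z][A-Za-z0-9]*_[0-9]+_[A-Za-z0-9]+_[0-9]+\.raw")

def is_raw_name(name):
    # fullmatch requires the whole name to fit the pattern.
    return RAW_NAME.fullmatch(name) is not None

print(is_raw_name("JohnSmith14_120325_A10_6.raw"))
print(is_raw_name("ll_paprika.sc"))
```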

Is there a truly universal wildcard in Grep? [duplicate]

Really basic question here. So I'm told that a dot . matches any character EXCEPT a line break. I'm looking for something that matches any character, including line breaks.
All I want to do is to capture all the text in a website page between two specific strings, stripping the header and the footer. Something like HEADER TEXT(.+)FOOTER TEXT and then extract what's in the parentheses, but I can't find a way to include all text AND line breaks between header and footer, does this make sense? Thanks in advance!
When I need to match several characters, including line breaks, I do:
[\s\S]*?
Note I'm using a non-greedy pattern
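A short Python sketch shows the difference between a plain dot and [\s\S]. The sample page text is made up for the example, and Python's re.DOTALL flag is included as a third variant that makes the dot itself match newlines.

```python
import re

# Hypothetical page content with line breaks between the delimiters.
page = "HEADER TEXT\nline one\nline two\nFOOTER TEXT\n"

# A plain dot stops at the first line break, so this finds nothing.
plain = re.search(r"HEADER TEXT(.+?)FOOTER TEXT", page)

# [\s\S] matches any character, including line breaks.
any_char = re.search(r"HEADER TEXT([\s\S]*?)FOOTER TEXT", page)

# Alternatively, DOTALL lets the dot match line breaks as well.
dotall = re.search(r"HEADER TEXT(.+?)FOOTER TEXT", page, re.DOTALL)

print(plain)
print(repr(any_char.group(1)))
print(repr(dotall.group(1)))
```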
You could do it with Perl:
$ perl -ne 'print if /HEADER TEXT/ .. /FOOTER TEXT/' file.html
To print only the text between the delimiters, use
$ perl -000 -lne 'print $1 while /HEADER TEXT(.+?)FOOTER TEXT/sg' file.html
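The /s switch makes the regular expression matcher treat the entire string as a single line, which means dot matches newlines, and /g means match as many times as possible. Note that -000 reads the input in paragraph mode (chunks separated by blank lines), so a match cannot span a blank line; use -0777 instead to slurp the whole file at once.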
The examples above assume you're cranking on HTML files on the local disk. If you need to fetch them first, use get from LWP::Simple:
$ perl -MLWP::Simple -le '$_ = get "http://stackoverflow.com";
print $1 while m!<head>(.+?)</head>!sg'
Please note that parsing HTML with regular expressions as above does not work in the general case! If you're working on a quick-and-dirty scanner, fine, but for an application that needs to be more robust, use a real parser.
By definition, grep looks for lines which match; it reads a line, sees whether it matches, and prints the line.
One possible way to do what you want is with sed:
sed -n '/HEADER TEXT/,/FOOTER TEXT/p' "$@"
This prints from the first line that matches 'HEADER TEXT' to the first line that matches 'FOOTER TEXT', and then iterates; the '-n' stops the default 'print each line' operation. This won't work well if the header and footer text appear on the same line.
To do what you want, I'd probably use perl (but you could use Python if you prefer). I'd consider slurping the whole file, and then use a suitably qualified regex to find the matching portions of the file. However, the Perl one-liner given by '@gbacon' is an almost exact transliteration into Perl of the 'sed' script above and is neater than slurping.
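For those who prefer Python, the line-range behaviour of the sed script above can be approximated as follows. This is a simplified sketch: lines_between is a hypothetical helper, and it uses plain substring tests instead of regular expressions.

```python
def lines_between(lines, start="HEADER TEXT", end="FOOTER TEXT"):
    """Collect lines from one containing `start` through the next one
    containing `end`, repeating for any later ranges."""
    inside = False
    out = []
    for line in lines:
        if not inside and start in line:
            inside = True
            out.append(line)
            # As in sed, the end marker is only looked for on later lines.
            continue
        if inside:
            out.append(line)
            if end in line:
                inside = False
    return out

sample = ["intro", "HEADER TEXT", "body", "FOOTER TEXT", "outro"]
print(lines_between(sample))
```

As with the sed version, whole lines are kept, so the header and footer lines themselves appear in the result.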
The man page of grep says:
grep, egrep, fgrep, rgrep - print lines matching a pattern
grep is not made for matching more than a single line. You should try to solve this task with perl or awk.
As this is tagged with 'bbedit' and BBedit supports Perl-Style Pattern Modifiers you can allow the dot to match linebreaks with the switch (?s)
(?s).
will match ANY character. And yes,
(?s).+
will match the whole text.
As pointed elsewhere, grep will work for single line stuff.
For multiple-lines (in ruby with Regexp::MULTILINE, or in python, awk, sed, whatever), "\s" should also capture line breaks, so
HEADER TEXT(.*\s*)FOOTER TEXT
might work ...
Here's one way to do it with gawk, if you have it:
awk -vRS="FOOTER" '/HEADER/{gsub(/.*HEADER/,"");print}' file