How do I write a regular expression that returns only the letters and numbers, without the asterisks in between?
You could use a regex replacement here:
my $var = 'RMRIV43069411**2115.82';
$var =~ s/^.*?\D(\d+(?:\.\d+)*)$/$1/g;
print "$var"; # 2115.82
The idea is to capture the final number in the string, and then replace with only that captured quantity.
Here is an explanation of the pattern:
^ from the start of the input
.*? consume all content up until
\D the first non digit character, which is followed by
(\d+(?:\.\d+)*) match AND capture: a number, with optional decimal component, occurring before
$ the end of the input
Then we replace the entire string with just this captured number, which is available in $1.
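For readers who want to try the same substitution outside Perl, here is a minimal sketch in Python. The helper name last_number is made up for illustration; the pattern itself is the one explained above.

```python
import re

def last_number(text):
    # Lazy lead, one non-digit, then the captured trailing number
    # (digits with optional ".digits" groups) up to the end of the input.
    return re.sub(r'^.*?\D(\d+(?:\.\d+)*)$', r'\1', text)

print(last_number('RMRIV43069411**2115.82'))  # 2115.82
```

If the input does not end in a number preceded by a non-digit, the pattern does not match and the string comes back unchanged.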
Related
I use the following:
if ($content =~ /([0-9]{11})/) {
my $digits = $1;
}
to extract 11 consecutive digits from a string. However, it grabs the first 11 consecutive digits. How can I get it to extract the last 11 consecutive digits so that I would get 24555199361 from a string with hdjf95724555199361?
/([0-9]{11})/
means
/^.*?([0-9]{11})/s # Minimal lead that allows a match.
You get what you want by making the .* greedy.
/^.*([0-9]{11})/s # Maximal lead that allows a match.
If the digits appear at the very end of the string, you can also use the following:
/([0-9]{11})\z/
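The difference between the leftmost match, the greedy lead, and the end anchor can be checked with a short Python sketch. It uses the string from the question; the variable names are illustrative only, and Python spells Perl's \z as \Z.

```python
import re

content = 'hdjf95724555199361'

# An unanchored search finds the leftmost run of 11 digits.
first = re.search(r'([0-9]{11})', content).group(1)

# A greedy lead consumes as much as it can and backtracks only as far as
# needed, so the capture is the rightmost run of 11 digits.
last = re.search(r'^.*([0-9]{11})', content, re.DOTALL).group(1)

# If the digits are known to end the string, anchor at the end instead.
at_end = re.search(r'([0-9]{11})\Z', content).group(1)

print(first, last, at_end)  # 95724555199 24555199361 24555199361
```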
Whenever you want to match something at the end of a string, use the end of line anchor $.
$content =~ m/(\d{11})$/;
If that pattern is not at the very end, but you want to match the "last" occurrence of it, you would first match "the entire string" with /.*/ and then let the engine backtrack to the final occurrence of the pattern. The /s flag permits the . metacharacter to match a line feed.
$content =~ m/.*(\d{11})/s;
See the Perl regexp tutorial for more information.
I'm trying to regex match patterns with the following criteria:
I want to match a string that has exactly one occurrence of a single colon in the entire string. I then want to capture the portion before that single colon.
Examples of valid strings:
JohnP: random text here
BobF::student: random text here (this is valid because there's only ONE occurrence of a single colon. the other is a double colon)
Paris: random text here::student (valid for the same reason as above)
Examples of invalid strings:
JohnP: student: random text here
BobF::student: random text here: more
I have no idea how to do a regex match like this. In the case of the valid strings, the group I want to return is:
JohnP
BobF::student
Paris
I would appreciate the help! I have tried $string =~ ^[^:]+:\s* but that only matches up to the first colon.
You can use this regex:
^((?:::|[^:])*+):(?!.*(?<!:):(?!:))
It looks for some number of pairs of colons or non-colon characters followed by a colon, using a possessive quantifier (*+) to prevent matching part-way through a double-colon in a string such as Bill:: xyz. Those characters are captured in group 1. A negative lookahead assertion is then used to check that there are no more single colons in the string.
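A minimal Python sketch of the same idea follows. Note the assumption: possessive quantifiers need Python 3.11 or later, so this variant drops the *+ and instead adds (?!:) right after the matched colon. That is a swapped-in technique with the same intent, not the exact pattern above. The names SINGLE_COLON and prefix_before_single_colon are hypothetical.

```python
import re

# Group 1: any mix of "::" pairs and non-colon characters.
# ":(?!:)" : the single colon, which must not start a "::".
# Final lookahead: no other single colon later in the string.
SINGLE_COLON = re.compile(r'^((?:::|[^:])*):(?!:)(?!.*(?<!:):(?!:))')

def prefix_before_single_colon(line):
    m = SINGLE_COLON.match(line)
    return m.group(1) if m else None

print(prefix_before_single_colon('BobF::student: random text here'))  # BobF::student
print(prefix_before_single_colon('JohnP: student: random text here'))  # None
```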
Perhaps the regular expression can take this form: match up to a : that is neither preceded nor followed by another :.
Note: the code is written in shortened form.
use strict;
use warnings;
use feature 'say';
my $re = qr/^(.+)(?<!:):(?!:)/;
/$re/ && say $1 for <DATA>;
__DATA__
JohnP: random text here
BobF::student: random text here (this is valid because there's only ONE occurrence of a single colon. the other is a double colon)
Paris: random text here::student (valid for the same reason as above)
Output
JohnP
BobF::student
Paris
In a string like /p20 (it can be any number) I want to replace the /p with /pag- but keep the number, giving /pag-20.
This is what i tried:
preg_replace('/\/p+[0-9]/', '/pag-', $string);
but the result is /pag-0
Use a capturing group and a backreference:
$string = "/p20";
echo preg_replace('~/p([0-9])~', '/pag-$1', $string); // /pag-20
Here, /p matches a literal substring and ([0-9]) matches and captures a single digit into Group 1, which the replacement pattern refers to with the $1 backreference. The remaining digits are not part of the match, so they stay in place.
Alternatively, you may use a lookahead based solution:
preg_replace('~/p(?=[0-9])~', '/pag-', $string);
Here, no backreference is necessary as the (?=[0-9]) positive lookahead does not consume text it matches, i.e. it does not add the text to the match value.
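Both approaches can be compared side by side in a short Python sketch, using the /p20 input from the question. Python writes the backreference in the replacement as \1 rather than $1; the variable names are illustrative.

```python
import re

s = '/p20'

# Capture the first digit and put it back with \1 in the replacement.
with_group = re.sub(r'/p([0-9])', r'/pag-\1', s)

# Or assert the digit with a lookahead, so it is never part of the match
# and nothing has to be put back.
with_lookahead = re.sub(r'/p(?=[0-9])', '/pag-', s)

print(with_group, with_lookahead)  # /pag-20 /pag-20
```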
Here's the string I'm searching.
T+4ACCGT+12CAAGTACTACCGT+12CAAGTACTACCGT+4ACCGA+6CTACCGT+12CAAGTACTACCGT+12CAAGTACTACCG
I want to capture the letters that follow each number, taking exactly X letters (X being that number). I also want to capture the complete string, including the + and the number.
ie the capture should return:
+4ACCG
+12CAAGTACTACCG
etc.
and :
ACCG
CAAGTACTACCG
etc.
Here's the regex I'm using:
(\+(\d+)([ATGCatgcnN]){\2});
and I'm using $1 and $3 for the captures.
What am I missing?
You cannot use a backreference in a quantifier. \1 is an instruction to match what $1 contains, so {\1} is not a valid quantifier. But why do you need to match the exact number? Just match the letters (because the next part starts again with a +).
So try:
(\+\d+([ATGCatgcnN]+));
and find the complete match in $1 and the letters in $2
Another problem in your regex is that your quantifier is outside your third capturing group. That way only the last letter would be in the capturing group. Place the quantifier inside the group to capture the whole sequence.
You can also drop either the uppercase or the lowercase letters from your class by using the i modifier to match case-insensitively:
/(\+\d+([ATGCN]+))/gi
This loop works because the \G assertion tells the regex engine to begin the search right after the last match (the digits) in the string.
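use feature 'say'; # needed for say()
$_ = 'T+4ACCGT+12CAAGTACTACCGT+12CAAGTACTACCGT+4ACCGA+6CTACCGT+12CAAGTACTACCGT+12CAAGTACTACCG';
# Find each number, then match exactly that many bases right after it.
while (/(\d+)/g) {
    my $dig = $1;
    /\G([TAGCN]{$dig})/i;
    say $1;
}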
The results are
ACCG
CAAGTACTACCG
CAAGTACTACCG
ACCG
CTACCG
CAAGTACTACCG
CAAGTACTACCG
I think this is correct but not sure :-|
Update: Added the \G assertion which tells the regex to begin immediately after the last matched number.
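The two-step idea (read the count, then match exactly that many letters at that position) can also be sketched in Python. Python's re module has no \G, so this sketch assumes Pattern.match(string, pos) as the way to anchor at a position. The function name counted_runs is made up for illustration.

```python
import re

NUMBER = re.compile(r'\+(\d+)')

def counted_runs(seq):
    """Return (full, bases) pairs such as ('+4ACCG', 'ACCG')."""
    results = []
    pos = 0
    while True:
        m = NUMBER.search(seq, pos)
        if m is None:
            break
        n = int(m.group(1))
        # Build the quantifier from the number just read, then anchor the
        # match directly after that number.
        run = re.compile(r'[ATGCN]{%d}' % n, re.IGNORECASE).match(seq, m.end())
        if run:
            results.append(('+' + m.group(1) + run.group(0), run.group(0)))
            pos = run.end()
        else:
            pos = m.end()
    return results

print(counted_runs('T+4ACCGT+6CTACCG'))
# [('+4ACCG', 'ACCG'), ('+6CTACCG', 'CTACCG')]
```

Unlike matching [ATGCN]+ up to the next +, this stops after exactly the stated number of letters, so a stray base before the next + is not swallowed.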
my @sequences = split(/\+/, $string);
for my $seq (@sequences) {
    my ($bases) = $seq =~ /([^\d]+)/; # first run of non-digit characters in this piece
}
I have a string 1/temperatoA,2/CelcieusB!23/33/44,55/66/77 and I would like to extract the words temperatoA and CelcieusB.
I have this regular expression (\d+/(\w+),?)*! but I only get the match 1/temperatoA,2/CelcieusB!
Why?
Your whole match evaluates to '1/temperatoA,2/CelcieusB' because that matches the following expression:
qr{ ( # begin group
\d+ # at least one digit
/ # followed by a slash
(\w+) # followed by at least one word character
,? # maybe a comma
)* # ANY number of repetitions of this pattern.
}x;
'1/temperatoA,' fulfills capture #1 first, but since you are asking the engine to capture as many of those as it can it goes back and finds that the pattern is repeated in '2/CelcieusB' (the comma not being necessary). So the whole match is what you said it is, but what you probably weren't expecting is that '2/CelcieusB' replaces '1/temperatoA,' as $1, so $1 reads '2/CelcieusB'.
Anytime you want to capture anything that fits a certain pattern in a certain string it is always best to use the global flag and assign the captures into an array. Since an array is not a single scalar like $1, it can hold all the values that were captured for capture #1.
When I do this:
use Data::Dumper; # provides Dumper()
my $str = '1/temperatoA,2/CelcieusB!23/33/44,55/66/77';
my $regex = qr{(\d+/(\w+))};
if ( my @matches = $str =~ /$regex/g ) {
    print Dumper( \@matches );
}
I get this:
$VAR1 = [
'1/temperatoA',
'temperatoA',
'2/CelcieusB',
'CelcieusB',
'23/33',
'33',
'55/66',
'66'
];
Now, I figure that's probably not what you expected. But '3' and '6' are word characters, and so--coming after a slash--they comply with the expression.
So, if this is an issue, you can change your regex to the stricter qr{(\d+/(\p{Alpha}\w*))}, specifying that the first character must be an alpha followed by any number of word characters. Then the dump looks like this:
$VAR1 = [
'1/temperatoA',
'temperatoA',
'2/CelcieusB',
'CelcieusB'
];
And if you only want 'temperatoA' or 'CelcieusB', then you're capturing more than you need to and you'll want your regex to be qr{\d+/(\p{Alpha}\w*)}.
However, the secret to capturing more than one chunk in a capture expression is to assign the match to an array. You can then sort through the array to see if it contains the data you want.
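The same behaviour can be reproduced in Python as a quick check: a repeated group keeps only its last repetition, while a global search collects every occurrence. This is a sketch under one assumption: [A-Za-z] stands in for \p{Alpha}, so it only covers ASCII letters.

```python
import re

s = '1/temperatoA,2/CelcieusB!23/33/44,55/66/77'

# One match with a repeated group: only the last repetition survives.
m = re.match(r'(\d+/(\w+),?)*!', s)
last_only = (m.group(1), m.group(2))

# A global search returns every occurrence; requiring a leading letter
# skips the purely numeric fields after the "!".
words = re.findall(r'\d+/([A-Za-z]\w*)', s)

print(last_only)  # ('2/CelcieusB', 'CelcieusB')
print(words)      # ['temperatoA', 'CelcieusB']
```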
The question here is: why are you using a regular expression that’s so obviously wrong? How did you get it?
The expression you want is simply as follows:
(\w+)
With a Perl-compatible regex engine you can search for
(?<=\d/)\w+(?=.*!)
(?<=\d/) asserts that there is a digit and a slash before the start of the match
\w+ matches the identifier. This allows for letters, digits and underscore. If you only want to allow letters, use [A-Za-z]+ instead.
(?=.*!) asserts that there is a ! ahead in the string, i.e. the regex will fail once we have passed the !.
Depending on the language you're using, you might need to escape some of the characters in the regex.
E.g., for use in C (with the PCRE library), you need to escape the backslashes:
myregexp = pcre_compile("(?<=\\d/)\\w+(?=.*!)", 0, &error, &erroroffset, NULL);
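As a quick way to check the lookaround pattern without compiling C, here is a minimal Python sketch. Python's re module accepts this pattern because the lookbehind has a fixed width of two characters; the variable names are illustrative.

```python
import re

s = '1/temperatoA,2/CelcieusB!23/33/44,55/66/77'

# The lookbehind requires "digit/" just before the word; the lookahead
# requires a "!" somewhere later, which rules out the fields after it.
words = re.findall(r'(?<=\d/)\w+(?=.*!)', s)

print(words)  # ['temperatoA', 'CelcieusB']
```

A raw string (r'...') avoids the doubled backslashes that the C string literal needs.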
Will this work?
/([[:alpha:]]\w+)\b(?=.*!)
I made the following assumptions...
A word begins with an alphabetic character.
A word always immediately follows a slash. No intervening spaces, no words in the middle.
Words after the exclamation point are ignored.
You have some sort of loop to capture more than one word. I'm not familiar enough with the C library to give an example.
[[:alpha:]] matches any alphabetic character.
The \b matches a word boundary.
And the (?=.*!) came from Tim Pietzcker's post.