I am new to Perl, and I need to extract a number that sits between two different strings.
I have this string variable:
my $var = "1234 23.3";
How can I extract the number between the white-space and the dot? In the example the output should be 23.
The string may vary: sometimes it is 123 4.32 or 123 334.4, in which case the output should be 4 or 334 respectively.
Whitespace can be matched using the \s backslash sequence; see perlrecharclass:
\s matches any single character considered whitespace
Likewise, a digit can be matched using \d:
\d matches a single character considered to be a decimal digit.
To match a period, be aware that the dot is a regex metacharacter that matches any character (except newline); see perlre and perlretut. To match a literal dot, escape it as \. instead.
Hence, given $var = "1234 23.3", the following statement:
$var =~ /\s+(\d+)\./;
should extract the number after the space and before the dot into capture group variable $1. See perlre for more information on the + metacharacter and also for information about capture groups.
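As a rough illustration of the same pattern outside Perl, here is a minimal JavaScript sketch. The helper name numberBeforeDot is made up for this example; it applies \s+(\d+)\. to the sample values from the question and returns capture group 1, or null when nothing matches.

```javascript
// Hypothetical helper: returns the digits between whitespace and a dot, or null.
function numberBeforeDot(text) {
  const match = /\s+(\d+)\./.exec(text);
  return match ? match[1] : null;
}

for (const sample of ["1234 23.3", "123 4.32", "123 334.4"]) {
  console.log(sample, "->", numberBeforeDot(sample));
}
```

In Perl the equivalent precaution is to test whether the match succeeded before reading $1, since $1 is only meaningful after a successful match.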
I am trying to write a regex that can find only numbers from given string. What I mean is:
Input: My number is +12 345 678. I have galaxy s3, its symbol 34abc.
Output: 345 and 678 (but not +12, 3 from word s3 or 34 from 34abc)
I tried plain digits (\d+) and combinations with whitespace and word characters. The closest was ^\d$, but that doesn't work because my numbers are part of a bigger string, not the whole string themselves. Can you give me a hint?
------- EDIT
Looks like I just don't know how to check a character without actually getting it into result. Like "digit that follow space character (without this space)".
In general case, you can make use of lookbehind and lookahead:
(?<=^|\s)\d+(?=$|\s)
The part which makes it into the captured output is \d+.
Lookbehind and lookahead are not included in the match.
I just included spaces as delimiters in the regex, but you may replace \s with any character class, as defined by your requirements. For example, to allow dots as separators (both in front and after the digits), use the following regex:
(?<=^|[\s.])\d+(?=$|[\s.])
The (?<=^|\s) should be read as follows:
(?<= ... ) defines the lookbehind group.
The expression which must precede the \d+ is ^|\s, meaning "either start of the line (^) or whitespace".
Similarly, (?=$|\s) defines the lookahead group (it must follow the captured digits), which is either end of the line ($) or whitespace.
A note on \b mentioned in other answers: it is a nice feature, means "word boundary", but the "word characters" are not customizable. This means that, for example, the "+" character is considered to be a separator and you can't change this if you use \b. With lookaround, you can customize the separators to your needs.
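To see how the choice of delimiters changes the result, here is a small JavaScript sketch that runs both lookaround patterns from this answer against the sentence from the question. The variable names are illustrative only, and it assumes an engine with lookbehind support (for example a recent Node.js).

```javascript
const sentence = "My number is +12 345 678. I have galaxy s3, its symbol 34abc.";

// Whitespace-only delimiters: "678" is followed by a dot, so it is skipped.
const spaceDelimited = /(?<=^|\s)\d+(?=$|\s)/g;
// Whitespace or dot as delimiters: "678" now qualifies as well.
const spaceOrDotDelimited = /(?<=^|[\s.])\d+(?=$|[\s.])/g;

console.log(sentence.match(spaceDelimited));      // [ '345' ]
console.log(sentence.match(spaceOrDotDelimited)); // [ '345', '678' ]
```

In both cases +12, the 3 in s3, and the 34 in 34abc are rejected, because the character before or after the digits is not one of the allowed delimiters.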
What you seem to want is a sequence of digits (\d+) that is preceded by whitespace (\s) or the start of the string (^), and followed by whitespace, a punctuation character ([\s.,:;!?]), or the end of the string ($). The surrounding whitespace or punctuation should not be included in the match, so you need a positive lookahead ((?=xxx)) and a positive lookbehind ((?<=xxx)).
(?<=^|\s)\d+(?=[\s.,:;!?]|$)
See regex101 for demo.
Remember to double the backslashes in a Java literal.
Safer RegEx
Try this:
(?<=\s|^)\d+(?=\s|\b)
Live Demo on Regex101
How it works:
(?<=\s|^) # Start of String OR Whitespace (will not select +)
# Positive Lookbehind ensures the data is not included in the match
\d+ # Digit(s)
(?=\s|\b) # Whitespace OR Word Boundary
# Positive Lookahead ensures the data is not included in the match
Lookarounds do not consume any characters, so they can be used in place of capture groups when you only want part of the text in the match. For example:
# Regex /.*barbaz/
barbaz # Matched Data Result: barbaz
foobarbaz # Matched Data Result: foobarbaz
# Regex (with Positive Lookahead) /.*bar(?=baz)/
barbaz # Matched Data Result: bar
foobarbaz # Matched Data Result: foobar
As you can see with the second RegEx, baz is never included in the matched data result, even though it was required in the string for the RegEx to match. The RegEx above works on the same principle.
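A quick way to check this is to run the pattern yourself. The following JavaScript sketch (variable names are illustrative; lookbehind support is assumed) applies the safer pattern to the sentence from the question and repeats the smaller barbaz experiment.

```javascript
const text = "My number is +12 345 678. I have galaxy s3, its symbol 34abc.";

// Lookbehind: start of string or whitespace. Lookahead: whitespace or word boundary.
const saferPattern = /(?<=\s|^)\d+(?=\s|\b)/g;
console.log(text.match(saferPattern)); // [ '345', '678' ]

// Same principle on the smaller example: "baz" is required but not returned.
const withLookahead = /.*bar(?=baz)/;
console.log("foobarbaz".match(withLookahead)[0]); // foobar
console.log("foobar".match(withLookahead));       // null, because "baz" is missing
```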
Not as Safe (Old) RegEx
You can try this RegEx:
\b\d+\b
\b is a Word Boundary. This will, however, select 12 from +12.
You can change the RegEx to this to stop 12 from being selected:
(?<!\+)\b\d+\b
This uses a Negative Lookbehind and will fail if there is a + before the digits.
Live Demo on Regex101
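For comparison, this JavaScript sketch runs the two word-boundary patterns against the sentence from the question. The variable names are made up for the example, and the negative lookbehind assumes an engine that supports it.

```javascript
const message = "My number is +12 345 678. I have galaxy s3, its symbol 34abc.";

const wordBounded = /\b\d+\b/g;          // "+" is not a word character, so 12 is matched
const notAfterPlus = /(?<!\+)\b\d+\b/g;  // rejects digits directly preceded by "+"

console.log(message.match(wordBounded));  // [ '12', '345', '678' ]
console.log(message.match(notAfterPlus)); // [ '345', '678' ]
```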
I am looking for a pattern that can find apostrophes that are inside single quotes. For example the text
Foo 'can't' bar 'don't'
I want to find and replace the apostrophe in can't and don't, but I don't want to match the surrounding single quotes.
I have tried something like
(.*)'(.*)'(.*)'
and apply the replace on the second matching group. But for text that has 2 words with apostrophes this pattern won't work.
Edit: to clarify the text could have single quotes with no apostrophes inside them, which should be preserved as is. For example
'foo' 'can't' bar 'don't'
I am still looking only for apostrophes, so the single quotes around foo should not match.
I believe you need to require "word" characters to appear before and after a ' symbol, and it can be done with a word boundary:
\b'\b
See the regex demo
To match only an apostrophe that sits between two letters, use one of:
(?<=\p{L})'(?=\p{L})
(?<=[[:alpha:]])'(?=[[:alpha:]])
(?U)(?<=\p{Alpha})'(?=\p{Alpha}) # Java, double the backslashes in the string literal
Or ASCII only
(?<=[a-zA-Z])'(?=[a-zA-Z])
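As a sketch of how the letter-based lookarounds behave, the JavaScript below replaces each matched apostrophe with a typographic one (U+2019). The replacement character and the variable names are arbitrary choices for the example; \p{L} needs the u flag in JavaScript.

```javascript
const quoted = "'foo' 'can't' bar 'don't'";

// Unicode-aware: an apostrophe with a letter on both sides (requires the "u" flag).
const betweenLetters = /(?<=\p{L})'(?=\p{L})/gu;
// ASCII-only variant.
const betweenAsciiLetters = /(?<=[a-zA-Z])'(?=[a-zA-Z])/g;

console.log(quoted.replace(betweenLetters, "\u2019"));
console.log(quoted.replace(betweenAsciiLetters, "\u2019"));
```

Both calls leave the single quotes around foo, can't and don't untouched, because each of those quotes has a space or the string boundary on one side. The two variants differ only for non-ASCII letters: the ASCII version would not treat the apostrophe in a word like l'été as being between letters.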
You can use the following regular expression:
'[^']+'\s|'[^']+(')[^' ]+'
It returns three matches; when capture group 1 participates in a match, it holds the apostrophe inside the word:
'foo'
'can't'
'don't'
How it works:
'[^']+'\s
' match an apostrophe
[^']+ followed by at least one character that isn't an apostrophe
' followed by an apostrophe
\s followed by a space
| or
'[^']+(')[^' ]+'
' match an apostrophe
[^']+ followed by at least one character that isn't an apostrophe
(') followed by an apostrophe, and capture it in capture group 1
[^' ]+ followed by at least one character that is not an apostrophe or a space
' followed by an apostrophe
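The breakdown above can be checked with a short JavaScript sketch (names are illustrative). It lists every match and reports whether capture group 1 took part, which is how you can tell a quoted word with an inner apostrophe from a plainly quoted one.

```javascript
const sample = "'foo' 'can't' bar 'don't'";
const quotedWord = /'[^']+'\s|'[^']+(')[^' ]+'/g;

// Collect [matched text, did group 1 participate?] for every match.
const found = [...sample.matchAll(quotedWord)].map((m) => [m[0], m[1] !== undefined]);

for (const [matched, hasInnerApostrophe] of found) {
  console.log(JSON.stringify(matched), "inner apostrophe:", hasInnerApostrophe);
}
```

Note that the first alternative consumes the whitespace after the closing quote, so the match for 'foo' includes the trailing space.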
I am using the regex
(.*)\d.txt
on the expression
MyFile23.txt
The online tester says that this string matches the above regex. My understanding is that it should not match, because the name contains two digits (2 and 3) while the regex contains only one digit token, i.e. \d; it would need to be \d+ to allow both. As I read it, my expression means: zero or more of any character, followed by one digit, followed by .txt. Why does the above string match the regex?
This regex (.*)\d.txt will still match MyFile23.txt because of .* which will match 0 or more of any character (including a digit).
So for the given input MyFile23.txt, here is the breakdown:
.* # matches MyFile2
\d # matches 3
. # matches a dot (though it can match anything here due to unescaped dot)
txt # will match literal txt
To make sure it only matches MyFile2.txt you can use:
^\D*\d\.txt$
Where ^ and $ are anchors to match start and end. \D* will match 0 or more non-digit.
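To make the difference concrete, here is a small JavaScript sketch comparing the original pattern with the anchored one. The variable names and the extra file name MyFile2xtxt are made up for illustration.

```javascript
const original = /(.*)\d.txt/;
const strict = /^\D*\d\.txt$/;

// Group 1 absorbs the first digit, so the single \d still finds the "3".
console.log(original.exec("MyFile23.txt")[1]); // MyFile2
// The unescaped dot accepts any character in place of the period.
console.log(original.test("MyFile2xtxt"));     // true

console.log(strict.test("MyFile2.txt"));       // true
console.log(strict.test("MyFile23.txt"));      // false
```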
The pattern you have contains one group, (.*), which for your example matches MyFile2, because . allows any character.
Furthermore, the . after this group is not escaped, so it allows one more character of any kind.
To avoid this use:
(\D*)\d+\.txt
the group (\D*) would now match all non digit characters.
Here is why "MyFile23.txt" matches the regex pattern:
A literal period . should always be escaped as \., otherwise it matches any character.
In addition, (.*) matches everything from the beginning of the string up to the last digit (MyFile2). Have a look at the "MATCH INFORMATION" area on the right at this page.
So, I'd suggest the following fix:
^\D*\d\.txt$ = the beginning of the line/string, any number of non-digit characters, a digit, a literal period, the literal txt, and the end of the string/line. Whether ^ and $ anchor to lines or to the whole string depends on the m switch; which one you need depends on whether your input is a list of names on separate lines or a single file name.
Here is a working example.
What does the following do in Perl?
$string =~ s#[^a-zA-Z0-9]+# #sg;
$string =~ s#\s+# #sg;
I understand that [^a-zA-Z0-9]+ is a start of sentence and at least one of a-zA-Z0-9, and that \s+ is at least one whitespace.
But I can not figure out what this snippet does as a whole.
First, it replaces any sequence of non-alphanumeric characters (being neither upper case chars, lower case chars nor numbers) in the string with a single space.
After that it replaces all multi-spaces, i.e. any sequence of whitespaces with just one space character.
The first pattern replaces every run of non-alphanumeric characters with a space.
The second replaces any run of whitespace characters (spaces, tabs, newlines) with a single space.
Note that you can replace these two patterns with a single pattern:
$string =~ s#[^a-zA-Z0-9]+# #sg;
$string =~ s#[^a-zA-Z0-9]+# #sg;
$string =~ s#\s+# #sg;
is more commonly written as
$string =~ s/[^a-zA-Z0-9]+/ /sg;
$string =~ s/\s+/ /sg;
The choice of delimiter isn't significant, but / is used by convention unless the pattern contains many / characters.
Here we have two instances of the substitution operator. Between the first two delimiters is a regular expression pattern to search for. Between the last two delimiters is the string with which to replace the matching text. The trailing s and g are flags.
The s flag affects what . matches. Given that . isn't used, the s flag is useless.
The g flag causes all matches to be replaced instead of just the first one.
The first regex pattern, [^a-zA-Z0-9]+
[...] is a character class that matches a single character among those specified. A leading ^ negates the class, so [^a-zA-Z0-9] matches any character other than unaccented latin letters and numbers.
atom+ matches atom one or more times, so [^a-zA-Z0-9]+ matches a sequence of non-alphanumeric characters (and some alphanumeric characters such as "é").
Therefore, s/[^a-zA-Z0-9]+/ /g replaces all sequences of non-alphanumeric characters (and some alphanumeric characters such as "é") with a single space. For example, "abc - déf :)" becomes "abc d f ".
The second regex pattern, \s+
\s matches any whitespace character (except the vertical tab and the non-breaking space sometimes).
Therefore, s/\s+/ /g replaces all sequences of white space with a single space. For example, "abc\tdef ghi\n" becomes "abc def ghi ".
As a whole
When used together, the second statement does absolutely nothing. There will never be any sequences of two or more whitespace characters left in $string after the first statement.
So
$string =~ s#[^a-zA-Z0-9]+# #sg;
$string =~ s#\s+# #sg;
is the same as
$string =~ s/[^a-zA-Z0-9]+/ /g;
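The claim that the second substitution is redundant can be checked with a rough JavaScript port of the two statements. This is only a sketch: the function name normalize and the sample string are invented, and JavaScript's \s is not identical to Perl's, although that does not matter here because every whitespace character is already outside a-zA-Z0-9.

```javascript
// Hypothetical JavaScript port of the two Perl substitutions.
function normalize(text) {
  const afterFirst = text.replace(/[^a-zA-Z0-9]+/g, " ");
  const afterSecond = afterFirst.replace(/\s+/g, " ");
  return { afterFirst, afterSecond };
}

// "\u00e9" is the accented letter é, which the ASCII-only class treats as non-alphanumeric.
const result = normalize("abc - d\u00e9f :)\tghi\n");
console.log(JSON.stringify(result.afterFirst));        // "abc d f ghi "
console.log(result.afterFirst === result.afterSecond); // true
```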
I'd like to understand what this line of JavaScript means...
(/^\w+, ?\w+, ?\w\.?$/)
I understand that 'w' stands for 'word', but I need your help in understanding '/', '^', '+', '?', and '.?$/'.
Thank you..
That's a regular expression, not HTML.
It's inside of a regex literal (/.../) in Javascript.
^ matches the beginning of the string
\w matches any word character
+ matches one or more of the previous set.
? matches zero or one of the previous set (in this case a single space)
\. matches a literal . (an unescaped . matches any single character)
$ matches the end of the string.
Let's break it down, because then it is easier to read:
^ beginning of the line
\w+ 1 or more 'word' characters
, a comma
? an optional space
\w+ 1 or more 'word' characters
, a comma
? an optional space
\w a single 'word' character
\.? an optional period
$ end of line
The meaning of a 'word' character is an alpha-numeric character or an underscore.
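To see the breakdown in action, the following JavaScript sketch tests the pattern against a few strings. The sample strings and variable names are invented for illustration and do not come from the original code.

```javascript
const pattern = /^\w+, ?\w+, ?\w\.?$/;

const samples = [
  "Smith, John, A.",  // optional spaces and trailing period present
  "Smith,John,A",     // optional parts omitted
  "Smith, John, Al",  // last part has two word characters, but only one \w is allowed
  "Smith,  John, A",  // two spaces after the first comma
];

const outcomes = samples.map((s) => pattern.test(s));
samples.forEach((s, i) => console.log(JSON.stringify(s), "->", outcomes[i]));
```

The first two strings match and the last two do not, which reflects that the final \w has no + and that each " ?" permits at most one space.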
It is not HTML code but Regular Expression. Read more about it:
Regular expression
In computing, regular expressions, also referred to as regex or regexp, provide a concise and flexible means for matching strings of text, such as particular characters, words, or patterns of characters. A regular expression is written in a formal language that can be interpreted by a regular expression processor, a program that either serves as a parser generator or examines text and identifies parts that match the provided specification.
/^\w+, ?\w+, ?\w\.?$/
Outside in...
/ / delimiters
^ $ Matches the whole string (^ means to match the beginning, $ means to match the end)
One by one...
\w means word character (simply w doesn't match anything but the ASCII character w)
\w+ word characters (at least one, matches as much as possible)
? after a space means that space is optional: it matches 0 or 1 space character
. matches any character that is not a line break (can be configured with regex modifiers)
\. (like in the example) matches exactly one dot
It's a regular expression that looks for a string of word characters (like letters, digits, or underscores) that has two commas in it with an optional single space after each comma.