Regex Non Printable Character Plus Currency Symbols - regex

I want to match non-printable characters plus currency symbols. The following matches non-printable characters; how do I add an exception for currency symbols?
$str = preg_replace('/[[:^print:]]/', '', $str);

The \p{Sc} pattern matches currency symbols; you just need to place it inside the negated character class (a bracket expression in POSIX terminology).
Use
$re = '/(*UTF)[^[:print:]\p{Sc}]+/';
echo preg_replace($re, '', '£aA€');
Details:
(*UTF) - a PCRE verb that makes the PCRE engine treat the string as a Unicode string rather than a byte string. Note that we cannot use the /u modifier here, since it enables both the (*UTF) and (*UCP) verbs; the latter makes all subpatterns Unicode-aware, so [:print:] would then also match non-ASCII printable characters and [^[:print:]] would stop matching them.
[^[:print:]\p{Sc}]+ - matches any 1 or more symbols (due to the + quantifier) other than:
[:print:] - printable chars
\p{Sc} - currency symbols
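For comparison outside PCRE, the same filtering rule can be sketched in Python. This is only an illustration under stated assumptions: Python's built-in re module has no \p{Sc}, so the sketch checks the Unicode category with unicodedata instead of using a regex, and it takes "printable" to mean ASCII 0x20 to 0x7E, which is what [:print:] covers when UCP is off. The function name is hypothetical.

```python
import unicodedata


def strip_nonprintable_keep_currency(text: str) -> str:
    """Drop every character that is neither ASCII-printable nor a currency symbol."""
    kept = []
    for ch in text:
        # Assumption: "printable" means the ASCII range space (0x20) to tilde (0x7E).
        is_ascii_printable = " " <= ch <= "~"
        # Unicode general category "Sc" is the same set that \p{Sc} refers to.
        is_currency = unicodedata.category(ch) == "Sc"
        if is_ascii_printable or is_currency:
            kept.append(ch)
    return "".join(kept)


# Pound sign, two ASCII letters, euro sign, a NUL byte and an accented letter.
print(strip_nonprintable_keep_currency("\u00a3aA\u20ac\x00\u00e9"))
```

Only the NUL byte and the accented letter are dropped here; the pound and euro signs survive because they are in category Sc.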

Related

preg_replace last character

I have code that works, but I need to make only the last two characters red:
$text = '£5,485.00';
$text = preg_replace('/(\b[a-z])/i','<span style="color:red;">\1</span>',$text);
echo $text;
I need it to look like this (image not available).
Taking your question verbatim:
preg_replace('/\w{2}$/','<span style="color:red;">\0</span>', $text);
\w{2} : two word characters
$ : anchored at the end
\0 : the whole match (main matching group)
You may want to support Unicode (/u - the u modifier) and prevent the $ from also matching before a newline at the end of the string (/D - the D modifier):
u (PCRE_UTF8)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern and subject strings are treated as UTF-8. An invalid subject will cause the preg_* function to match nothing; an invalid pattern will trigger an error of level E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid.
D (PCRE_DOLLAR_ENDONLY)
If this modifier is set, a dollar metacharacter in the pattern matches only at the end of the subject string. Without this modifier, a dollar also matches immediately before the final character if it is a newline (but not before any other newlines). This modifier is ignored if m modifier is set. There is no equivalent to this modifier in Perl.
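The end-anchor difference described above can be tried out in Python as a rough analogue. This is a sketch, not PHP: in Python's re module the plain $ also matches just before a trailing newline, and \Z plays the role that /D gives to $ in PCRE. The helper name and the end_only parameter are made up for illustration.

```python
import re

# \g<0> is Python's spelling of "the whole match" (\0 in the PHP replacement).
SPAN = r'<span style="color:red;">\g<0></span>'


def highlight_last_two(text: str, end_only: bool = True) -> str:
    """Wrap the last two word characters of text in a red span."""
    # \Z matches only at the very end; $ also matches before a final newline.
    anchor = r"\Z" if end_only else "$"
    return re.sub(r"\w{2}" + anchor, SPAN, text)


print(highlight_last_two("\u00a35,485.00"))
print(repr(highlight_last_two("ab\n", end_only=False)))
print(repr(highlight_last_two("ab\n", end_only=True)))
```

With a trailing newline in the input, only the $ variant still finds the two characters; the end-only variant leaves the string unchanged.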

Regular Expression Remove all English and Arabic numbers?

I want a regular expression that removes Arabic and English digits.
My variable is
$variable="12121212ABDHSتشؤآئ۳۳۴۳۴729384234owiswoisw";
I want to remove all digits, like this:
ABDHSتشؤآئowiswoisw
I found the following expression, but it does not work:
$newvariable = preg_replace('/^[\u0621-\u064A]+$', '', $variable);
Thanks for your help.
You may use
$newvariable = preg_replace('/\d+/u', '', $variable);
The \d matches ASCII digits by default, but when you add the u modifier, it enables the PCRE_UCP option (together with PCRE_UTF8) that enables \d to match all Unicode digits.
See PCRE documentation:
This option changes the way PCRE processes \B, \b, \D, \d, \S, \s, \W,
\w, and some of the POSIX character classes. By default, only ASCII
characters are recognized, but if PCRE_UCP is set, Unicode properties
are used instead to classify characters.
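If you only want to match ASCII digits plus ranges of your choice, you can fix your regex like this. Note that PCRE does not support the \uXXXX notation (use \x{XXXX} instead), and that U+0621-U+064A covers Arabic letters rather than digits; the Arabic-Indic digits are U+0660-U+0669 and the Extended Arabic-Indic (Persian) digits are U+06F0-U+06F9:
preg_replace('/[0-9\x{0660}-\x{0669}\x{06F0}-\x{06F9}]+/u', '', $variable)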
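As a cross-check in another engine, here is a small Python sketch of the same three choices: Unicode-aware \d, ASCII-only \d, and an explicit list of digit ranges. It relies on the fact that Python's re treats \d as Unicode-aware for str patterns unless re.ASCII is given. The function names and the sample string are made up for illustration; the Arabic characters are written as escapes to keep the example unambiguous.

```python
import re


def remove_all_digits(text: str) -> str:
    # For str patterns, \d matches any Unicode decimal digit by default.
    return re.sub(r"\d+", "", text)


def remove_ascii_digits(text: str) -> str:
    # re.ASCII restricts \d to 0-9 only.
    return re.sub(r"\d+", "", text, flags=re.ASCII)


def remove_listed_digits(text: str) -> str:
    # Explicit ranges: ASCII, Arabic-Indic and Extended Arabic-Indic digits.
    return re.sub(r"[0-9\u0660-\u0669\u06F0-\u06F9]+", "", text)


# "12AB", two Arabic letters, two Extended Arabic-Indic digits, then "79ow".
sample = "12AB\u062A\u0634\u06F3\u06F479ow"
print(remove_all_digits(sample))
print(remove_ascii_digits(sample))
print(remove_listed_digits(sample))
```

The explicit-range version is the one to reach for when only specific scripts should be affected and digits from other scripts must stay.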

Regex command is replacing two characters instead of one

I am attempting to replace the spaces in my string with an underscore. With my limited coding experience, I have come up with this:
s/\b[ ]\D/_/g
This command finds all of the appropriate places in my file; however, it replaces the space and the following character rather than only the space. How can I ensure it replaces only the whitespace and no additional characters?
Also, I would not like this to affect number characters (hence the \D).
The regex \b[ ]\D (which could also be written as \b \D, by the way) matches the space and the following non-digit character, so that's what's replaced with an underscore.
There are two (well, there are more, but these two are the straightforward ones) ways to go about fixing this in Perl:
With a capture group and back reference:
s/\b (\D)/_$1/g
Here the regex still matches the space and the non-digit character, but the non-digit character is captured as $1 and put back as part of the replacement. (\1 also works in the replacement part, but $1 is the preferred form in Perl.)
With a lookahead zero-length assertion:
s/\b (?=\D)/_/g
(?=\D) matches the empty string if (and only if) it is followed by something matching \D, so the non-digit character is no longer part of the match and is not replaced.
Addendum: By the way, I suspect you meant to use \b\D instead of just \D. \D matches spaces (because they are not digits), therefore
$ echo 'foo 123 bar  baz qux' | perl -pe 's/\b (?=\D)/_/g'
foo 123_bar_ baz_qux
as opposed to
$ echo 'foo 123 bar  baz qux' | perl -pe 's/\b (?=\b\D)/_/g'
foo 123_bar  baz_qux
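The two fixes, and the \b\D refinement from the addendum, behave the same way in Python's re module, so they can be compared side by side with a quick sketch. The function names are made up for illustration, and the sample string deliberately has two spaces between "bar" and "baz".

```python
import re

SAMPLE = "foo 123 bar  baz qux"  # two spaces between "bar" and "baz"


def with_capture_group(text: str) -> str:
    # Match the space and the non-digit, then put the non-digit back.
    return re.sub(r"\b (\D)", r"_\1", text)


def with_lookahead(text: str) -> str:
    # Match only the space; the non-digit is checked but not consumed.
    return re.sub(r"\b (?=\D)", "_", text)


def with_lookahead_word_start(text: str) -> str:
    # Additionally require that the next character starts a word.
    return re.sub(r"\b (?=\b\D)", "_", text)


print(with_capture_group(SAMPLE))         # foo 123_bar_ baz_qux
print(with_lookahead(SAMPLE))             # foo 123_bar_ baz_qux
print(with_lookahead_word_start(SAMPLE))  # foo 123_bar  baz_qux
```

The space before "123" is never replaced because a digit follows it, and only the last variant leaves the double space untouched.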
Try
s/\s/_/g
\s is the character class that matches any whitespace character.
If you are worried about adjacent spaces, use \s+
The + means one or more whitespace characters.

What do these qr{} regular expressions mean?

What do these mean?
qr{^\Q$1\E[a-zA-Z0-9_\-]*\Q$2\E$}i
qr{^[a-zA-Z0-9_\-]*\Q$1\E$}i
If $pattern is a Perl regular expression, what is $identity in the code below?
$identity =~ $pattern;
When the RHS of =~ isn't m//, s/// or tr///, a match operator (m//) is implied.
$identity =~ $pattern;
is the same as
$identity =~ /$pattern/;
It matches the pattern or pre-compiled regex $pattern (qr//) against the value of $identity.
The binding operator =~ applies a regex to a string variable. This is documented in perldoc perlop
The \Q ... \E escape sequence is a way to quote meta characters (also documented in perlop). It allows for variable interpolation, though, which is why you can use it here with $1 and $2. However, using those variables inside a regex is somewhat iffy, because they themselves are defined during the use of a capture inside a regex.
The character class brackets [ ... ] define a set of characters, any one of which can match. The * quantifier that follows means the class may match zero or more times. The dashes denote ranges, such as a-z meaning "from a through z". The escaped dash \- means a literal dash.
The ^ and $ (the dollar sign at the end) are anchors, matching the beginning and end of the string respectively. The modifier i at the end makes the match case-insensitive.
In your example, $identity is a variable that presumably contains a string (or whatever it contains will be converted to a string).
The perlre documentation is your friend here. Search it for unfamiliar regex constructs.
A detailed explanation is below, but it is so hairy that I wonder whether using a module such as Text::Balanced would be a superior approach.
The first pattern matches possibly empty delimited strings, and the delimiters are in $1 and $2, which we do not know until runtime. Say $1 is ( and $2 is ), then the first pattern matches strings of the form
()
(a)
(9)
(abcABC_012-)
and so on …
The second pattern matches terminated strings, where the terminator is in $1—also not known until runtime. Assuming the terminator is ], then the second pattern matches strings of the form
]
a]
Aa9a_9]
Using \Q...\E around a pattern removes any special regex meaning from the characters inside, as documented in perlop:
For the pattern of regex operators (qr//, m// and s///), the quoting from \Q is applied after interpolation is processed, but before escapes are processed. This allows the pattern to match literally (except for $ and #). For example, the following matches:
'\s\t' =~ /\Q\s\t/
Because $ or # trigger interpolation, you'll need to use something like /\Quser\E\#\Qhost/ to match them literally.
The patterns in your question do want to trigger interpolation but do not want any regex metacharacters to have special meaning, as with parentheses and square brackets above that are meant to match literally.
Other parts:
Square brackets delimit a character class. For example, [a-zA-Z0-9_\-] matches any single character that is upper- or lowercase A through Z (but with no accents or other extras), zero through nine, underscore, or hyphen. Note that the hyphen at the end is escaped to emphasize that it matches a literal hyphen rather than specifying part of a range.
The * quantifier means match zero or more of the preceding subpattern. In the examples from your question, the star repeats character classes.
The patterns are bracketed with ^ and $, which means an entire string must match rather than some substring to succeed.
The i at the end, after the closing curly brace, is a regex switch that makes the pattern case-insensitive. As TLP helpfully points out in the comment below, this makes the delimiters or terminators match without regard to case if they contain letters.
The expression $identity =~ $pattern tests whether the compiled regex stored in $pattern (created with $pattern = qr{...}) matches the text in $identity. As written above, it is likely being evaluated for its side effect of storing capture groups in $1, $2, etc. This is a red flag. Never use $1 and friends unconditionally but instead write
if ($identity =~ $pattern) {
print $1, "\n"; # for example
}
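To make the role of \Q...\E and the "check the match before using captures" advice concrete, here is a rough Python analogue. It is a sketch under assumptions: re.escape stands in for \Q...\E, the opener, closer and terminator arguments stand in for the runtime values of $1 and $2, and all function names are hypothetical.

```python
import re


def delimited_pattern(opener: str, closer: str) -> re.Pattern:
    """Build the equivalent of qr{^\\Q$1\\E[a-zA-Z0-9_\\-]*\\Q$2\\E$}i."""
    # re.escape removes any special regex meaning from the delimiters.
    return re.compile(
        "^" + re.escape(opener) + r"[a-zA-Z0-9_\-]*" + re.escape(closer) + "$",
        re.IGNORECASE,
    )


def terminated_body(text: str, terminator: str):
    """Return the part before the terminator, or None if there is no match."""
    match = re.search(
        r"^([a-zA-Z0-9_\-]*)" + re.escape(terminator) + "$", text, re.IGNORECASE
    )
    # Only touch the capture group after confirming the match succeeded.
    return match.group(1) if match else None


parens = delimited_pattern("(", ")")
print(bool(parens.search("(abcABC_012-)")))  # True
print(bool(parens.search("(a b)")))          # False: space is not in the class
print(terminated_body("Aa9a_9]", "]"))       # Aa9a_9
```

Without the escaping step, a delimiter such as ( would be read as the start of a group instead of a literal parenthesis.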

Pattern matches an hyphen too

I have a piece of Perl code (pattern matching) like this,
$var = "<AT>this is an at command</AT>";
if ($var =~ /<AT>([\s\w]*)<\/AT>/i)
{
print "Matched in AT command\n";
print "$var\n\n";
}
It works fine if the content between the tags has no hyphen. It does not match if a hyphen appears in the string between the tags, like this: <AT>this is an at-command</AT>.
Can anyone fix this regex so that it also matches when a hyphen is present?
Please help.
Senthil
On character class
Your pattern contains this subpattern:
[\s\w]*
The […] is a character class. Something like [aeiou] matches one of any of the lowercase vowels. [^…] is a negated character class. [^aeiou] matches one of anything but the lowercase vowels.
\s is the shorthand for whitespace character class; \w for word character class. Neither contains the hyphen.
The * is the zero-or-more repetition specifier.
Now you should understand why this pattern does not match a hyphen: it matches zero or more characters, each of which is either a whitespace or a word character. If you want to match a hyphen, you can include it in the character class.
[\s\w-]*
If you also want to include the period, question mark, and exclamation mark, for example, then you can simply add them in as well:
[\s\w.!?-]*
Special note on hyphen
BE CAUTIOUS when including the hyphen in a character class. It is used as a regex metacharacter in character class definition to define character range. For example,
[a-z]
matches any one character in the range between 'a' and 'z', inclusive. By contrast,
[az-]
matches one of exactly 3 characters, 'a', 'z', and '-'. When you put - as the last element in a character class, it becomes a literal hyphen instead of range definition. You can also put it as the first element, or escape it (by preceding with backslash, which is the way you escape all other regex metacharacters too).
That is, the following 3 character classes are equivalent:
[az-] [-az] [a\-z]
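The claim about hyphen placement is easy to check mechanically. The sketch below uses Python's re module, which follows the same rules for hyphens inside a character class; the helper name and the candidate string are made up for illustration.

```python
import re

# Three spellings of the same set: hyphen last, hyphen first, hyphen escaped.
CLASSES = ["[az-]", "[-az]", r"[a\-z]"]


def matched_chars(char_class: str, candidates: str = "abmz-") -> set:
    """Return the candidate characters that the character class accepts."""
    return {ch for ch in candidates if re.fullmatch(char_class, ch)}


for char_class in CLASSES:
    print(char_class, sorted(matched_chars(char_class)))

# The range form accepts the letters in between, but not the hyphen itself.
print("[a-z]", sorted(matched_chars("[a-z]")))
```

All three spellings accept exactly 'a', 'z' and '-', while [a-z] accepts 'a', 'b', 'm' and 'z' from the candidates and rejects the hyphen.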
Related questions
Regex: why doesn't [01-12] range work as expected?
You can just add a hyphen in the char class as:
if ($var =~ /<AT>([\s\w-]*)<\/AT>/i)
Also since your regex has a / in it you can use a different delimiter, this way you can avoid escaping /:
if ($var =~m{<AT>([\s\w-]*)</AT>}i)
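For a quick check of the widened character class, the same pattern can be run in Python, where no delimiter escaping is needed at all. This is only a sketch in a different language than the question; the function name is hypothetical.

```python
import re

# Same class as in the Perl fix: whitespace, word characters and a literal hyphen.
AT_PATTERN = re.compile(r"<AT>([\s\w-]*)</AT>", re.IGNORECASE)


def at_command(text: str):
    """Return the text between <AT> and </AT>, or None if nothing matches."""
    match = AT_PATTERN.search(text)
    return match.group(1) if match else None


print(at_command("<AT>this is an at command</AT>"))
print(at_command("<AT>this is an at-command</AT>"))
print(at_command("<AT>what?</AT>"))  # None: "?" is not in the class
```

The last call shows the limit of this approach: every extra punctuation character has to be added to the class explicitly.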
Use \S instead of \w.
if ($var =~ /<AT>([\s\S]*)<\/AT>/i) {
If you want to have everything between <AT> and </AT>, you can use
if ($var =~ /<AT>((?:(?!<AT>).)*)<\/AT>/i)
And it's ungreedy.
You need to add more characters to your class like [\s\w-]* (as codaddict told you).
Moreover, you might use a lookahead to match the end of your command ("I want to match that only if it is followed by the ending statement"), like this:
if ($var =~ /<AT>([^<]*)(?=<\/AT>)/i)
[^<] stands for "any character (including the hyphen) except <".
You could even add a lookbehind:
if ($var =~ /(?<=<AT>)([^<]*)(?=<\/AT>)/i)
For more complex things (since you seem to want a little parser), you should look at the theory of grammars and at lex/yacc.