I'm working on a module which passes data to an online accounting site, and one thing I need to do in order for it to parse correctly is remove a currency symbol from the price of a product.
My regex pattern is as follows:
$regex = '/^\D?([\d\.,]*)\D?$/is';
I've tested this on the https://regex101.com/ website and it works correctly, but when I do the preg_replace as follows:
$price_no_curr = preg_replace($regex,"$1",$product_price);
where $product_price is, for example £123.45, $price_no_curr just returns as £123.45 as it was originally. So, when I cast it to a float it returns nothing.
Where am I going wrong with this regex?
Simplest solution, use /u modifier to make it support UTF-8 characters.
$regex = '/^[^\d\.,]?([\d\.,]*)[^\d\.,]?$/u';
$price_no_curr = preg_replace($regex,"$1",$product_price);
£ is outside of the ASCII range and needs several bytes to be encoded in UTF-8:
$a="£";
echo implode(' ', array_map(function ($i) {
return dechex(ord($i));
}, str_split($a)));
// c2 a3
By default the regex engine works byte by byte (one byte = one character). That is why \D can't match the two bytes of £.
To make it work with multibyte strings, you need to switch on the u modifier. This way the regex engine reads the string character by character, whatever the number of bytes used to encode each one.
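The same byte-versus-character effect can be reproduced outside PHP. The following is a minimal Python sketch, not the PHP code from the question: a bytes pattern stands in for PHP's default byte mode, and a str pattern stands in for the u modifier. The variable names are made up for illustration.

```python
import re

price = "£123.45"
raw = price.encode("utf-8")  # b'\xc2\xa3123.45': the pound sign takes two bytes

# Byte mode: \D? can absorb only one of the two bytes, so the anchored
# pattern cannot match and the input comes back unchanged.
byte_result = re.sub(rb"^\D?([\d.,]*)\D?$", rb"\1", raw)

# Character mode: the pound sign is a single character, so \D? matches it.
char_result = re.sub(r"^\D?([\d.,]*)\D?$", r"\1", price)

print(byte_result == raw)  # True: the byte-mode substitution did nothing
print(char_result)         # 123.45
```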
Your pattern can be written like this:
$regex = '/^\D?([\d.,]*)\D?$/u';
but you can also do it without the u modifier if you change your quantifiers:
$regex = '/^\D*([\d.,]*)\D*$/';
A simpler and more flexible way is to remove every currency symbol, and possibly whitespace, regardless of where it appears:
$str = preg_replace('~[\p{Sc}\s]+~u', '', $str);
\p{Sc} is a unicode character class that contains all currency symbols.
or more radically:
$str = preg_replace('~[^\d.,]+~u', '', $str);
or without regex:
$str = '£1823.45';
$allowed_chars = [0,1,2,3,4,5,6,7,8,9,'.',','];
echo implode('', array_intersect(str_split($str), $allowed_chars));
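For comparison, the currency-class idea can be sketched in Python. Python's built-in re module has no \p{Sc}, so this sketch checks the Unicode general category of each character instead. The function name strip_currency is hypothetical.

```python
import unicodedata

def strip_currency(text):
    """Drop currency symbols (Unicode category Sc) and whitespace."""
    return "".join(
        ch for ch in text
        if unicodedata.category(ch) != "Sc" and not ch.isspace()
    )

print(strip_currency("£ 1,823.45"))
print(strip_currency("123.45 €"))
```

The result is still a string, and a thousands separator such as the comma has to be handled before converting it to a number.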
I want to extract the first two words from a sentence using a Perl function in PostgreSQL. In PostgreSQL, I can do this with:
text = "I am trying to make this work";
Select substring(text from '(^\w+-\w+|^\w+(\s+)?(!|,|\&|'')?(\s+)?\w+)');
It would return "I am".
I tried to build a Perl function in Postgresql that does the same thing.
CREATE OR REPLACE FUNCTION extract_first_two (text)
RETURNS text AS
$$
my $my_text = $_[0];
my $temp;
$pattern = '^\w+-\w+|^\w+(\s+)?(!|,|\&|'')?(\s+)?\w+)';
my $regex = qr/$pattern/;
if ($my_text=~ $regex) {
$temp = $1;
}
return $temp;
$$ LANGUAGE plperl;
But I receive a syntax error near the regular expression. I am not sure what I am doing wrong.
Extracting words is non-trivial even in English. Take the following contrived example using Locale::CLDR:
use Locale::CLDR;
my $locale = Locale::CLDR->new('en');
my @words = $locale->split_words('adf543. 123.25');
@words now contains
adf543
.
123.25
Note that the full stop after adf543 is split into a separate word, but the one between 123 and 25 is kept as part of the number 123.25, even though the '.' is the same character.
It gets worse when you look at non-English languages, and much worse when you use non-Latin scripts.
You need to precisely define what you think a word is, otherwise the following French gets split incorrectly.
Je avais dit «Elle a dit «Il a dit «Ni» il ya trois secondes»»
The parentheses are mismatched in your regex pattern. It has three opening parentheses and four closing ones.
Also, you have two single quotes in the middle of a singly-quoted string, so
'^\w+-\w+|^\w+(\s+)?(!|,|\&|'')?(\s+)?\w+)'
is parsed as two separate strings
'^\w+-\w+|^\w+(\s+)?(!|,|\&|'
and
')?(\s+)?\w+)'
But I can't suggest how to fix it as I don't understand your intention.
Did you mean a double quote perhaps? In which case (!|,|\&|")? can be written as [!,&"]?
Update
At a rough guess I think you want this
my $regex = qr{ ^ \w++ \s* [-!,&"]* \s* \w+ }x;
$temp = $1 if $my_text=~ /($regex)/;
but I can't be sure. If you describe what you're looking for in English then I can help you better. For instance, it's unclear why you don't have question marks, full stops, and semicolons in the list of intervening punctuation.
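To make that guess easier to try out, here is a rough Python translation. It is only a sketch under the same assumptions as the guessed pattern: the punctuation set is the one guessed above, the \b stands in for Perl's possessive \w++ so that a single word cannot be split in two, and the function name simply mirrors the one in the question.

```python
import re

# A word, optional punctuation from a small assumed set, then a second word.
FIRST_TWO = re.compile(r'^\w+\b\s*[-!,&"]*\s*\w+')

def extract_first_two(text):
    """Return the first two words (with any joining punctuation) or None."""
    m = FIRST_TWO.match(text)
    return m.group(0) if m else None

print(extract_first_two("I am trying to make this work"))
print(extract_first_two("Tom & Jerry run"))
```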
Problem statement: I am processing some data files. In that data dump I have some strings which contain the Unicode values of characters. Characters may be in both upper case and lower case. I need to do the following processing on each string.
1- If there is any - , _ ) ( } { ] [ ' " then delete them. All these characters appear in the string in their Unicode form ($ followed by 4 hex digits).
2- All upper case characters need to be converted to lower case ( including all different unicode characters 'Φ' -> 'φ', 'Ω' -> 'ω', 'Ž' -> 'ž')
3- Later I will use this final string for matching for different user inputs.
Problem detail description-- I have some strings like Buna$002C_Texas , Zamboanga_$0028province$0029 and many more.
Here $002C, $0028 and $0029 are Unicode values, and I am converting them to their character representation using one of the following:
$str =~s/\$(....)/chr(hex($1))/eg;
OR
$str =~s/\$(....)/pack 'U4', $1/eg;
Now I substitute all the characters as per my requirement. Then I decode the string as UTF-8 to get the lowercase of all characters, including Unicode ones, because lc on its own does not support Unicode characters.
$str =~ s/(^\-|\-$|^\_|\_$)//g;
$str =~ s/[\-\_,]/ /g;
$str =~ s/[\(\)\"\'\.]|ʻ|’|‘//g;
$str =~ s/^\s+|\s+$//g;
$str =~ s/\s+/ /g;
$str = decode('utf-8',$str);
$str = lc($str);
$str = encode('utf-8',$str);
But I am getting below error when Perl tries to decode the string.
Cannot decode string with wide characters at /usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi/Encode.pm line 173
This error is expected, as described here: http://www.perlmonks.org/?node_id=569402
So I changed my logic as per the above URL. I used the following to convert the Unicode values to their character representation.
$str =~s/\$(..)(..)/chr(hex($1)).chr(hex($2))/eg;
But now I do not get the character representation. I get non-printable characters instead.
So how do I deal with this problem when I don't know how many different Unicode representations there will be?
You want to decode the string before you do your transformations, preferably by using a PerlIO layer like :utf8. Because you interpolate the escaped codepoints before decoding, your string may already contain multi-byte characters. Remember, Perl (seemingly) operates on codepoints, not bytes.
So what we'll do is the following: decode, unescape, normalize, remove, case fold:
use strict; use warnings;
use utf8; # This source file holds Unicode chars, should be properly encoded
use feature 'unicode_strings'; # we want Unicode semantics everywhere
use Unicode::CaseFold; # or: use feature 'fc'
use Unicode::Normalize;
# implicit decode via PerlIO-layer
open my $fh, "<:utf8", $file or die ...;
while (<$fh>) {
chomp;
# interpolate the escaped code points
s/\$(\p{AHex}{4})/chr hex $1/eg;
# normalize the representation
$_ = NFD $_; # or NFC or whatever you like
# remove unwanted characters. prefer transliterations where possible,
# as they are more efficient:
tr/.ʻ//d;
s/[\p{Quotation_Mark}\p{Open_Punctuation}\p{Close_Punctuation}]//g; # I suppose you want to remove *all* quotation marks?
tr/-_,/ /;
s/\A\s+//;
s/\s+\z//;
s/\s+/ /g;
# finally normalize case
$_ = fc $_
# store $_ somewhere.
}
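The same sequence (decode, unescape, normalize, remove, case fold) can be sketched in Python for comparison. This is an approximation, not a port: the function name clean is hypothetical, the input is assumed to be already decoded (for example read with open(path, encoding="utf-8")), and the Unicode categories Ps, Pe, Pi and Pf plus a few explicit characters stand in for the Perl properties used above.

```python
import re
import unicodedata

# Ps/Pe: open/close punctuation, Pi/Pf: initial/final quotation marks.
DROP_CATEGORIES = {"Ps", "Pe", "Pi", "Pf"}
DROP_CHARS = set("'\".\u02bb")  # straight quotes, full stop, and the okina

def clean(raw):
    # 1. unescape $XXXX (four hex digits) into the code point it names
    s = re.sub(r"\$([0-9A-Fa-f]{4})", lambda m: chr(int(m.group(1), 16)), raw)
    # 2. normalize the representation
    s = unicodedata.normalize("NFC", s)
    # 3. remove unwanted characters
    s = "".join(
        ch for ch in s
        if ch not in DROP_CHARS and unicodedata.category(ch) not in DROP_CATEGORIES
    )
    # 4. turn separators into spaces, then trim and collapse whitespace
    s = re.sub(r"[-_,]", " ", s)
    s = " ".join(s.split())
    # 5. case fold for caseless comparison
    return s.casefold()

print(clean("Buna$002C_Texas"))
print(clean("Zamboanga_$0028province$0029"))
```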
You may be interested in perluniprops, a list of all available Unicode character properties, like Quotation_Mark, Punct (punctuation), Dash (dashes like - – —), Open_Punctuation (parens like ({[〈 and quotation marks like „“) etc.
Why do we perform Unicode normalization? Some graphemes (visual characters) can have multiple distinct representations. E.g. á can be represented as “a with acute” or as “a” + “combining acute”. NFC tries to combine the information into one code point, whereas NFD decomposes it into multiple code points. Note that these operations can change the length of the string, as the length is measured in code points.
Before outputting data which you decomposed, it might be good to recompose it again.
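The effect on string length is easy to see in a small Python sketch using the standard unicodedata module (the variable names are illustrative only):

```python
import unicodedata

composed = "\u00e1"      # "a with acute" as one code point
decomposed = "a\u0301"   # "a" followed by "combining acute"

print(composed == decomposed)                         # False: different code points
print(len(unicodedata.normalize("NFC", decomposed)))  # 1
print(len(unicodedata.normalize("NFD", composed)))    # 2
```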
Why do we use case folding with fc instead of lowercasing? Two lowercase characters may be equivalent, but wouldn't compare the same, e.g. the Greek lowercase sigma: σ and ς. Case folding normalizes this. The German ß is uppercased as the two-character sequence SS. Therefore, "ß" ne (lc uc "ß"). Case folding normalizes this, and transforms the ß to ss: fc("ß") eq fc(uc "ß"). (But whatever you do, you will still have fun with Turkish data).
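Python's str.casefold behaves like fc for these two cases, so the difference between lowercasing and case folding can be checked with a short sketch (variable names are illustrative only):

```python
sigma, final_sigma = "\u03c3", "\u03c2"   # the two Greek lowercase sigmas
sharp_s = "\u00df"                        # German sharp s

# Lowercasing keeps the two sigmas distinct; case folding unifies them.
print(sigma.lower() == final_sigma.lower())        # False
print(sigma.casefold() == final_sigma.casefold())  # True

# Upper then lower does not round-trip the sharp s, but case folding
# sends both forms to the same string.
print(sharp_s.upper().lower() == sharp_s)                # False
print(sharp_s.upper().casefold() == sharp_s.casefold())  # True
```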
I have run into another problem in relation to a site I am trying to scrape.
Basically I have stripped most of what I don't want from the page content and thanks to some help given here have managed to isolate the dates I wanted. Most of it seems to be working fine, despite some initial problems matching a non-breaking space. However, I am now having difficulty with the final regex, which is intended to split each line of data into fields. Each line represents the price of a share price index. The fields on each line are:
A name of arbitrary length made from characters from the latin alphabet and sometimes a comma or ampersand, no numerics.
A number with two digits after the decimal point (the absolute value of the index).
A number with two digits after the decimal point (the change in the value).
A number with two digits after the decimal point followed by a percent sign (the percentage change in value).
Here is an example string, before splitting:
"Fishery, Agriculture & Forestry243.45-1.91-0.78% Mining360.74-4.15-1.14% Construction465.36-1.01-0.22% Foods783.2511.281.46% Textiles & Apparels412.070.540.13% Pulp & Paper333.31-0.29-0.09% Chemicals729.406.010.83% "
The regex I am using to split this line is this:
$mystr =~ s/\n(.*?)(\d{1,4}\.\d{2})(\-?\d{1,3}\.\d{2})(.*?%)\n/\n$1 == $2 == $3 == $4\n/ig;
It works sometimes but not other times and I cannot work out why this should be. (The doubled equal signs in the example output below are used to make the field split more easily visible.)
Fishery, Agriculture & Forestry == 243.45 == -1.91 == -0.78%
Mining360.74-4.15-1.14%
Construction == 465.36 == -1.01 == -0.22%
Foods783.2511.281.46%
I thought the minus sign was an issue for those indices that saw a negative change in the price of the index, but sometimes it works despite the minus sign.
Q. Why is the final regex shown below failing to split the fields consistently?
Example code follows.
#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::Tree;
my $url_full = "http://www.tse.or.jp/english/market/STATISTICS/e06_past.html";
my $content = get($url_full);
# get dates:
(my @dates) = $content =~ /(?<=dateFormat\(')\d{4}\/\d{2}\/\d{2}(?='\))/g;
foreach my $date (@dates) { # convert to yyyy-mm-dd
$date =~ s/\//-/ig;
}
my $tree = HTML::Tree->new();
$tree->parse($content);
my $mystr = $tree->as_text;
$mystr =~ s/\xA0//gi; # remove non-breaking spaces
# remove first chunk of text:
$mystr =~
s/^(TSE.*?)IndustryIndexChange ?/IndustryIndexChange\n$dates[0]\n\n/gi;
$mystr =~ s/IndustryIndexChange ?/IndustryIndexChange/ig;
$mystr =~ s/IndustryIndexChange/Industry Index Change\n/ig;
$mystr =~ s/% /%\n/gi; # percent symbol is marker for end of line
# indicate breaks between days:
$mystr =~ s/Stock.*?IndustryIndexChange/\nDAY DELIMITER\n/gi;
$mystr =~ s/Exemption from Liability.*$//g; # remove boilerplate at bottom
# and here's the problem regex...
# try to split it:
$mystr =~
s/\n(.*?)(\d{1,4}\.\d{2})(\-?\d{1,3}\.\d{2})(.*?%)\n/\n$1 == $2 == $3 == $4\n/ig;
print $mystr;
It appears to be doing every other one.
My guess is that your records have a single \n between them, but your pattern starts and ends with a \n. So the final \n on the first match consumes the \n that the second match needed to find the second record. The net result is that it picks up every other record.
You might be better off wrapping your pattern in ^ and $ (instead of \n and \n), and using the m flag on the s///.
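The effect is easy to reproduce in isolation. The sketch below uses Python rather than Perl, with three records taken from the sample data; re.M plays the role of the m flag, and the names FIELDS, newline_names and anchored_names are made up for the example.

```python
import re

records = (
    "\nMining360.74-4.15-1.14%"
    "\nFoods783.2511.281.46%"
    "\nChemicals729.406.010.83%\n"
)
FIELDS = r"(.*?)(\d{1,4}\.\d{2})(-?\d{1,3}\.\d{2})(.*?%)"

# Newline-delimited: each match swallows the newline the next record needs.
newline_names = [m.group(1) for m in re.finditer(r"\n" + FIELDS + r"\n", records)]

# Anchored with ^ and $ in multiline mode: the anchors consume nothing.
anchored_names = [m.group(1) for m in re.finditer("^" + FIELDS + "$", records, re.M)]

print(newline_names)   # ['Mining', 'Chemicals']
print(anchored_names)  # ['Mining', 'Foods', 'Chemicals']
```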
The problem is that you have \n both at the start and at the end of the regex.
Consider something like this:
$s = 'abababa';
$s =~ s/aba/axa/g;
that will set $s to axabaxa, not axaxaxa, because there are only two non-overlapping occurrences of aba.
My interpretation (pseudocode) -
one = [a-zA-Z,& ]+
two = -?\d{1,4}\.\d\d
three = <<two>>
four = <<two>>%
regex = (<<one>>)(<<two>>)(<<three>>)(<<four>>)
= ([a-zA-Z,& ]+)(-?\d{1,4}\.\d\d)(-?\d{1,4}\.\d\d)(-?\d{1,4}\.\d\d%)
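Building the pattern from named pieces can be sketched in Python as follows. This is one possible reading of the pseudocode, with an optional minus sign and an escaped decimal point in the number piece; NAME, NUM, RECORD and parse_records are hypothetical names.

```python
import re

NAME = r"[A-Za-z,& ]+"
NUM = r"-?\d{1,4}\.\d\d"   # optional minus sign, escaped decimal point
RECORD = re.compile(rf"({NAME})({NUM})({NUM})({NUM}%)")

def parse_records(text):
    """Return (name, index, change, percent_change) tuples found in text."""
    return [
        (m.group(1).strip(), m.group(2), m.group(3), m.group(4))
        for m in RECORD.finditer(text)
    ]

sample = "Mining360.74-4.15-1.14% Foods783.2511.281.46% Textiles & Apparels412.070.540.13% "
for row in parse_records(sample):
    print(row)
```

Because the pattern does not depend on newlines, it can be run directly over the unsplit example string.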
However, you are already presented with 'structured' data in the form of HTML. Why not take advantage of this?
The "HTML parsing in perl" question references Mojo for DOM-based parsing in Perl, and unless there are serious performance reasons, I'd highly recommend such an approach.
I have created a regexp in Perl that is about 95 characters long. I wish to shorten it to 78 characters per line but can't find a suitable method. Any advice is welcome. The regexp is similar to the code below; ideally there would be something similar to the \ line continuation in C.
my ($foo, $bar, $etc) = $input_line =~
/^\d+: .... (\X+)\(\X(\d+.\d+|\d+)\/\X(\d+.\d+|\d+) (\X+)\)$/
There is a way to tell the regex engine to skip embedded whitespace and comments, so you can not only split the pattern across multiple lines but also comment it, format it into sections, and so on. This is the x modifier; see perlre for the details.
So you'd change it to something like:
my ($foo, $bar, $etc) = $input_line =~ /
    ^\d+:\ ....\           # prefix; literal spaces must be escaped under x
    (\X+)\(
    \X(\d+.\d+|\d+)        # numerator
    \/\X(\d+.\d+|\d+)      # denominator
    \ (\X+)\)$             # mind the escaped space!
/x;
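Other regex flavours offer the same facility. As a rough illustration, here is the pattern in Python's re.VERBOSE mode. Two things are assumptions of this sketch rather than part of the original: Python's re module has no \X, so a plain dot is used in its place, and the decimal point is escaped.

```python
import re

LINE_RE = re.compile(
    r"""
    ^\d+:[ ]....[ ]         # prefix; literal spaces go in a character class
    (.+)\(
    .(\d+\.\d+|\d+)         # numerator
    /.(\d+\.\d+|\d+)        # denominator
    [ ](.+)\)$
    """,
    re.VERBOSE,
)

m = LINE_RE.match("123: .... X(X1.1/X5 XXX)")
print(m.groups())  # ('X', '1.1', '5', 'XXX')
```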
It's also possible to construct pieces of regular expression separately via the 'qr' string prefix and combine them using variable substitution. Something like
my $num_re = qr/(\X+)\(\X(\d+.\d+|\d+)\/\X(\d+.\d+|\d+)/;
my ($foo, $bar, $etc) = $input_line =~ /^\d+: .... $num_re (\X+)\)$/;
I have not done this for a while, so I am not sure whether any flags are needed.
Perl interpolates regex, so you could do something like this
my $input_line = '123: .... X(X1.1/X5 XXX)';
my $dOrI = '(\d+.\d+|\d+)';
my ($foo, $bar, $etc) = $input_line =~
/^\d+: .... (\X+)\(\X$dOrI\/\X$dOrI (\X+)\)$/;
print "$foo, $bar, $etc";
Output -
X, 1.1, 5
One thing I see in the regex is the period in '\d+.\d+'.
Note that '.' in a regex matches ANY character (except a newline, by default), not only an actual period character.
If you want to match only an actual period character, you'll have to use '\.' instead.
The other thing is that you may be able to replace '\d+.\d+|\d+' with '\d+(?:\.\d+)?', which matches an integer optionally followed by a decimal part.
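A quick way to see what the unescaped dot lets through is to test a few inputs. The sketch below uses Python, whose syntax for these constructs is the same as Perl's; the pattern names loose and strict are made up for the example.

```python
import re

loose = re.compile(r"\d+.\d+|\d+")      # unescaped dot
strict = re.compile(r"\d+(?:\.\d+)?")   # escaped dot, optional decimal part

for sample in ["5", "1.1", "1x2"]:
    print(sample, bool(loose.fullmatch(sample)), bool(strict.fullmatch(sample)))
```

Only the last input distinguishes the two: the loose pattern accepts 1x2, the strict one rejects it.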
[EDIT]
One more thing: if you use the interpolated regex more than once and don't change it in between uses (say, in a loop), you can use the /o option so that Perl compiles the regex only once instead of every time. On modern Perls, precompiling the pattern with qr// is the more usual way to get the same effect.
Say I have a text file to parse, which contains some fixed length content:
123jackysee        45678887
456charliewong     32145644
<3><------16------><--8---> # Not part of the data.
The first three characters are the ID, then comes a 16-character user name, then an 8-digit phone number.
I would like to write a regular expression to match and verify the input for each line, the one I come up with:
(\d{3})([A-Za-z ]{16})(\d{8})
The user name should contain 8-16 characters. But ([A-Za-z ]{16}) would also match an empty or all-space value. I thought of ([A-Za-z]{8,16} {0,8}), but it would accept more than 16 characters. Any suggestions?
No, no, no, no! :-)
Why do people insist on trying to pack so much functionality into a single RE or SQL statement?
My suggestion, do something like:
Ensure the length is 27.
Extract the three components into separate strings (0-2, 3-18, 19-26).
Check that the first matches "\d{3}".
Check that the second matches "[A-Za-z]{8,} *".
Check that the third matches "\d{8}".
If you want the entire check to fit on one line of source code, put it in a function, isValidLine(), and call it.
Even something like this would do the trick:
def isValidLine(s):
    if s.len() != 27 return false
    return s.match("^\d{3}[A-Za-z]{8,} *\d{8}$")
Don't be fooled into thinking that's clean Python code, it's actually PaxLang, my own proprietary pseudo-code. Hopefully, it's clear enough, the first line checks to see that the length is 27, the second that it matches the given RE.
The middle field is automatically 16 characters total due to the first line and the fact that the other two fields are fixed-length in the RE. The RE also ensures that it's eight or more alphas followed by the right number of spaces.
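For anyone who wants something runnable, the step-by-step version of the check could look like this in real Python. It is a sketch of the listed steps, not a definitive implementation; the names is_valid_line, ID_RE, NAME_RE and PHONE_RE are invented for the example.

```python
import re

ID_RE = re.compile(r"\d{3}")
NAME_RE = re.compile(r"[A-Za-z]{8,} *")
PHONE_RE = re.compile(r"\d{8}")

def is_valid_line(line):
    # 1. the whole record must be exactly 27 characters
    if len(line) != 27:
        return False
    # 2. slice out the three fixed-width fields
    ident, name, phone = line[0:3], line[3:19], line[19:27]
    # 3. validate each field on its own
    return bool(
        ID_RE.fullmatch(ident)
        and NAME_RE.fullmatch(name)
        and PHONE_RE.fullmatch(phone)
    )

print(is_valid_line("123" + "jackysee".ljust(16) + "45678887"))
print(is_valid_line("789" + "jop".ljust(16) + "12345678"))
```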
To do this sort of thing with a single RE would be some monstrosity like:
^\d{3}(([A-Za-z]{8} {8})
      |([A-Za-z]{9} {7})
      |([A-Za-z]{10} {6})
      |([A-Za-z]{11} {5})
      |([A-Za-z]{12} {4})
      |([A-Za-z]{13} {3})
      |([A-Za-z]{14} {2})
      |([A-Za-z]{15} )
      |([A-Za-z]{16}))
      \d{8}$
You could do it by ensuring it passes two separate REs:
^\d{3}[A-Za-z]{8,} *\d{8}$
^.{27}$
but, since that last one is simply a length check, it's no different to the isValidLine() above.
I would use the regex you suggested with a small addition:
(\d{3})([A-Za-z]{3,16} {0,13})(\d{8})
which will match things that have a non-whitespace username but still allow space padding. The only addition is that you would then have to check the length of each input to verify the correct number of characters.
Hmm... Depending on the exact version of Regex you're running, consider:
(?P<id>\d{3})(?=[A-Za-z\s]{16}\d)(?P<username>[A-Za-z]{8,16})\s*(?P<phone>\d{8})
Not 100% sure this will work, and I've used the whitespace escape char instead of an actual space - I get nervous with just the space character myself, but you may want to be more restrictive.
See if it works. I'm only intermediate with RegEx myself, so I might be in error.
Check that the named groups syntax for your version of RegEx a) exists and b) matches the standard I've used above.
EDIT:
Just to expand what I'm trying to do (sorry to make your eyes bleed, Pax!) for those without a lot of RegEx experience:
(?P<id>\d{3})
This will try to match a named capture group - 'id' - that is three digits in length. Most versions of RegEx let you use named capture groups to extract the values you matched against. This lets you do validation and data capture at the same time. Different versions of RegEx have slightly different syntaxes for this - check out http://www.regular-expressions.info/named.html for more detail regarding your particular implementation.
(?=[A-Za-z\s]{16}\d)
The ?= is a lookahead operator. This looks ahead at the next sixteen characters, and succeeds if they are all letters or whitespace characters AND are followed by a digit. The lookahead is zero length, so it doesn't actually consume anything. Your RegEx keeps matching from the point where the lookahead started. Check out http://www.regular-expressions.info/lookaround.html for more detail on lookahead.
(?P<username>[A-Za-z]{8,16})\s*
If the lookahead passes, then we keep counting from the fourth character in. We want to find eight-to-sixteen characters, followed by zero or more whitespaces. The 'or more' is actually safe, as we've already made sure in the lookahead that there can't be more than sixteen characters in total before the next digit.
Finally,
(?P<phone>\d{8})
This should check the eight-digit phone number.
I'm a bit nervous that this won't exactly work - your version of RegEx may not support the named group syntax or the lookahead syntax that I'm used to.
I'm also a bit nervous that this Regex will successfully match an empty string. Different versions of Regex handle empty strings differently.
You may also want to consider anchoring this Regex between a ^ and $ to ensure you're matching against the whole line, and not just part of a bigger line.
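The (?P<name>...) syntax used above happens to be the one Python's re module accepts, so the whole idea can be tried there. The sketch below anchors the pattern as suggested and, as an assumption of this example, uses a literal space instead of \s so that newlines cannot count as padding. LINE_RE and parse_line are hypothetical names.

```python
import re

LINE_RE = re.compile(
    r"^(?P<id>\d{3})"
    r"(?=[A-Za-z ]{16}\d)"           # lookahead: 16 letters/spaces, then a digit
    r"(?P<username>[A-Za-z]{8,16})"
    r" *"                            # padding; the lookahead already caps the width
    r"(?P<phone>\d{8})$"
)

def parse_line(line):
    """Return the named fields as a dict, or None if the line is invalid."""
    m = LINE_RE.match(line)
    return m.groupdict() if m else None

print(parse_line("123" + "jackysee".ljust(16) + "45678887"))
print(parse_line("789" + "jop".ljust(16) + "12345678"))
```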
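Assuming you mean Perl regex, and if you allow '_' in the username (the length of 28 counts the trailing newline; \s* rather than \s+ lets a full 16-character name through):
perl -ne 'exit 1 unless /^(\d{3})(\w{8,16})\s*(\d{8})$/ && length == 28'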
@OP, not every problem needs a regex. Your problem is pretty simple to check. Depending on what language you are using, there will be some sort of built-in string functions. Use them.
The following minimal example is done in Python.
import sys

with open("file") as f:
    for line in f:
        line = line.rstrip("\n")
        # check that the first 3 chars are digits
        if not line[0:3].isdigit():
            sys.exit()
        # check length of username (16-char field, space padded)
        name = line[3:19].rstrip()
        if not 8 <= len(name) <= 16:
            sys.exit()
        # check phone number length and whether it is all digits
        phone = line[19:27]
        if len(phone) != 8 or not phone.isdigit():
            sys.exit()
        print(line)
I also don't think you should try to pack all the functionality into a single regex. Here is one way to do it:
#!/usr/bin/perl
use strict;
use warnings;
while ( <DATA> ) {
chomp;
last unless /\S/;
my @fields = split;
if (
( my ($id, $name) = $fields[0] =~ /^([0-9]{3})([A-Za-z]{8,16})$/ )
and ( my ($phone) = $fields[1] =~ /^([0-9]{8})$/ )
) {
print "ID=$id\nNAME=$name\nPHONE=$phone\n";
}
else {
warn "Invalid line: $_\n";
}
}
__DATA__
123jackysee 45678887
456charliewong 32145644
678sdjkfhsdjhksadkjfhsdjjh 12345678
And here is another way:
#!/usr/bin/perl
use strict;
use warnings;
while ( <DATA> ) {
chomp;
last unless /\S/;
my ($id, $name, $phone) = unpack 'A3A16A8';
if ( is_valid_id($id)
and is_valid_name($name)
and is_valid_phone($phone)
) {
print "ID=$id\nNAME=$name\nPHONE=$phone\n";
}
else {
warn "Invalid line: $_\n";
}
}
sub is_valid_id { ($_[0]) = ($_[0] =~ /^([0-9]{3})$/) }
sub is_valid_name { ($_[0]) = ($_[0] =~ /^([A-Za-z]{8,16})\s*$/) }
sub is_valid_phone { ($_[0]) = ($_[0] =~ /^([0-9]{8})$/) }
__DATA__
123jackysee        45678887
456charliewong     32145644
678sdjkfhsdjhksadkjfhsdjjh 12345678
Generalizing:
#!/usr/bin/perl
use strict;
use warnings;
my %validators = (
id => make_validator( qr/^([0-9]{3})$/ ),
name => make_validator( qr/^([A-Za-z]{8,16})\s*$/ ),
phone => make_validator( qr/^([0-9]{8})$/ ),
);
INPUT:
while ( <DATA> ) {
chomp;
last unless /\S/;
my %fields;
@fields{qw(id name phone)} = unpack 'A3A16A8';
for my $field ( keys %fields ) {
unless ( $validators{$field}->($fields{$field}) ) {
warn "Invalid line: $_\n";
next INPUT;
}
}
print "$_ : $fields{$_}\n" for qw(id name phone);
}
sub make_validator {
my ($re) = @_;
return sub { ($_[0]) = ($_[0] =~ $re) };
}
__DATA__
123jackysee        45678887
456charliewong     32145644
678sdjkfhsdjhksadkjfhsdjjh 12345678
You can use lookahead: ^(\d{3})((?=[a-zA-Z]{8,})([a-zA-Z ]{16}))(\d{8})$
Testing:
123jackysee        45678887      Match
456charliewong     32145644      Match
789jop             12345678      No Match - username too short
999abcdefghijabcde12345678       No Match - username 'column' is less than 16 characters
999abcdefghijabcdef12345678      Match
999abcdefghijabcdefg12345678     No Match - username column more than 16 characters