To format thousands of SQL queries, I need to convert every character that is not inside a pair of quotation marks to upper case.
For example:
select * from region where regionkey = 'America'
to be converted to
SELECT * FROM REGION WHERE REGIONKEY = 'America'
With perl I'm able to convert those quoted characters to upper case by:
perl -p -e 's/('.+?')/\U\1/g'
and get:
select * from region where regionkey = 'AMERICA'
The question is how to "reverse" the capture, that is, how to match the text that is not in quotation marks?
s/([^']*)('[^']*'|\z)/\U$1\E$2/g
so
perl -pe's/([^'\'']*)('\''[^'\'']*'\''|\z)/\U$1\E$2/g'
ysth suggests a mixed-quote approach:
perl -pe"s/([^']*)('[^']*'|\z)/"'\U$1\E$2/g'
If the quotes can have backslash escapes in them, change
'[^']*'
to
'(?:[^'\\]+|\\.)*'
Split the string on quoted substrings and uppercase every other chunk. Like this
my $str = "select * from region where regionkey = 'America'";
my $uc;
$str = join '', map { ($uc = !$uc) ? uc : $_ } split /('[^']*')/, $str;
print $str;
output
SELECT * FROM REGION WHERE REGIONKEY = 'America'
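For comparison, the same split-and-uppercase idea can be sketched in Python. This is only an illustration under the same assumption as the answer above (single-quoted literals with no escaped quotes inside); the function name `upper_outside_quotes` is made up for the example.

```python
import re

def upper_outside_quotes(sql):
    # The capturing group makes re.split keep the quoted pieces in the result,
    # so even indices hold unquoted text and odd indices hold quoted literals.
    parts = re.split(r"('[^']*')", sql)
    return "".join(p if i % 2 else p.upper() for i, p in enumerate(parts))

print(upper_outside_quotes("select * from region where regionkey = 'America'"))
```

Because the quoted chunks are passed through untouched, only the SQL keywords and identifiers change case.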
I'm trying to use preg_replace to search for a string but only replace a portion of the string, rather than the entire string, in a dynamic fashion.
For example, I am able to find the strings 'od', ':od', 'od:', '#od', and 'od ' with my code below. I want to replace only the 'od' portion with the word 'odometer' and leave the colon, hashtag, and white spaces untouched. However, the way that my current preg_replace is written would replace the colons and the hashtag in addition to the letters themselves. Any creative solutions to replace the characters only but preserve the surrounding symbols?
Thank you!
if(isset($_POST["text"]))
{
$original = $_POST["text"];
$abbreviation= array();
$abbreviation[0] = 'od';
$abbreviation[1] = 'rn';
$abbreviation[2] = 'ph';
$abbreviation[3] = 'real';
$translated= array();
$translated[0] ='odometer';
$translated[1] ='run';
$translated[2] ='pinhole';
$translated[3] ='fake';
function add_regex_finders($str){
return "/[\s:\#]" . $str . "[\s:]/i";
}
$original_parsed = array_map('add_regex_finders',$original);
preg_replace($original_parsed,$translated,$original);
}
You can add capture groups around the characters before and after the matched abbreviation, and then add the group references to the replacement string:
function add_regex_finders($str){
return "/([\s:\#])" . $str . "([\s:])/i";
}
$abbrevs_parsed = array_map('add_regex_finders', $abbreviation);
$translt_parsed = array_map(function ($v) { return '$1' . $v . '$2'; }, $translated);
echo preg_replace($abbrevs_parsed, $translt_parsed, $original);
Demo on 3v4l.org
Note you had a typo in your code, passing $original to the call to add_regex_finders when it should be $abbreviation.
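The capture-and-reinsert idea is not specific to PHP. Below is a rough Python sketch of the same approach; the dictionary `ABBREVIATIONS` and the function `expand` are names invented for this illustration, and the delimiter classes are copied from the pattern above.

```python
import re

# Hypothetical lookup table mirroring the two PHP arrays in the question.
ABBREVIATIONS = {"od": "odometer", "rn": "run", "ph": "pinhole", "real": "fake"}

def expand(text):
    for abbr, full in ABBREVIATIONS.items():
        # Group 1 is the delimiter before the abbreviation, group 2 the one after.
        pattern = re.compile(r"([\s:#])" + re.escape(abbr) + r"([\s:])", re.IGNORECASE)
        # Re-emit both delimiters around the expanded word.
        text = pattern.sub(lambda m: m.group(1) + full + m.group(2), text)
    return text

print(expand("check #od: and :rn now"))
```

As with the PHP version, an abbreviation at the very start or end of the text is not matched, because the pattern requires a delimiter character on both sides.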
I have multiple complex logic and one of them is comparison of 2 strings.
$db_string_param1 = 'AA1 AB AC1 AK1 BKK2';
$file_string_param1 = 'AK1B25';
I need to test whether $file_string_param1 contains any of the content of $db_string_param1, delimited by space without making $db_string_param1 an array.
I am thinking maybe this is possible using regex, and for now I am not that good using complex regex.
If your data contains no special characters, you could simply replace each whitespace character with |, using s/\s/|/g.
The regex substitution below also handles special characters.
It captures a whitespace character in $1 or a non-word character in $2. The e flag evaluates the right-hand side as an expression. If $1 is defined, the match was whitespace, so it is replaced with | as a separator. If $1 is not defined, $2 holds a special character, which is escaped with a backslash.
$file_string_param1 = 'AK1B25';
$db_string_param1 = 'AA1 AB AC1 AK1 BKK2';
$db_string_param1=~s/(\s)|(\W)/ defined($1) ? "|" : "\\$2" /ge;
$file_string_param1=~m/$db_string_param1/;
print $&,"\n";
This solution doesn't suffer from a regex injection bug. It will work no matter which characters appear between the whitespace separators.
my $pat = join '|', map quotemeta, split ' ', $db_string_param1;
if ( $file_string_param1 =~ /$pat/ ) {
...
}
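For readers who don't use Perl, here is a hedged Python sketch of the same escape-then-join technique. `re.escape` plays the role of `quotemeta`; the function name `contains_any` is an assumption made for this example.

```python
import re

def contains_any(haystack, space_separated):
    # Escape every token so that characters such as . or + are matched literally,
    # then join the tokens into one alternation.
    pattern = "|".join(re.escape(tok) for tok in space_separated.split())
    # An empty token list would give an empty pattern that matches everything,
    # so treat it as "nothing to find".
    return bool(pattern) and re.search(pattern, haystack) is not None

print(contains_any("AK1B25", "AA1 AB AC1 AK1 BKK2"))
```

The split happens only to build the pattern; the caller still passes and keeps the original space-delimited string.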
You don't explain why you don't want your database parameter converted into an array, so it is hard to understand your intention, but this program demonstrates how to convert the string into a regex pattern and test the file parameter against it:
use strict;
use warnings 'all';
use feature 'say';
my $db_string_param1 = 'AA1 AB AC1 AK1 BKK2';
my $file_string_param1 = 'AK1B25';
( my $db_re = $db_string_param1 ) =~ s/\A\s+|\s+\z//g;
$db_re =~ s/\s+/|/g;
$db_re = qr/$db_re/;
say $file_string_param1 =~ /(?<![A-Z])(?:$db_re)(?![0-9])/ ? 'found' : 'not found';
output
found
I'm working on a module which passes data to an online accounting site, and one thing I need to do in order for it to parse correctly is remove a currency symbol from the price of a product.
My regex pattern is as follows:
$regex = '/^\D?([\d\.,]*)\D?$/is';
I've tested this on the https://regex101.com/ website and it works correctly, but when I do the preg_replace as follows:
$price_no_curr = preg_replace($regex,"$1",$product_price);
where $product_price is, for example £123.45, $price_no_curr just returns as £123.45 as it was originally. So, when I cast it to a float it returns nothing.
Where am I going wrong with this regex?
Simplest solution, use /u modifier to make it support UTF-8 characters.
$regex = '/^[^\d\.,]?([\d\.,]*)[^\d\.,]?$/u';
$price_no_curr = preg_replace($regex,"$1",$product_price);
£ is outside of the ASCII range and needs several bytes to be encoded in UTF-8:
$a="£";
echo implode(' ', array_map(function ($i) {
return dechex(ord($i));
}, str_split($a)));
// c2 a3
By default the regex engine works byte by byte (one byte = one character). That is why \D can't match the two bytes of £.
To make it work with multibyte strings, you need to switch on the u modifier. This way the regex engine reads the string character by character, whatever the number of bytes used to encode each one.
Your pattern can be written like this:
$regex = '/^\D?([\d.,]*)\D?$/u';
but you can also do it without the u modifier if you change your quantifiers:
$regex = '/^\D*([\d.,]*)\D*$/';
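The byte-versus-character distinction can be reproduced outside PHP. The sketch below uses Python, where a pattern compiled from `bytes` works byte by byte and a pattern compiled from `str` works character by character; this is an analogy to PCRE with and without the u modifier, not a statement about PHP internals.

```python
import re

price = "£123.45"
raw = price.encode("utf-8")  # the pound sign becomes the two bytes c2 a3

# Same pattern twice: once over bytes, once over Unicode text.
pattern_bytes = re.compile(rb"^\D?([\d.,]*)\D?$")
pattern_text = re.compile(r"^\D?([\d.,]*)\D?$")

print(" ".join(f"{b:02x}" for b in raw[:2]))  # c2 a3
print(pattern_bytes.match(raw))               # None: \D? can absorb only one of the two bytes
print(pattern_text.match(price).group(1))     # 123.45

# Relaxing the quantifier lets the byte-wise pattern skip both bytes.
relaxed = re.compile(rb"^\D*([\d.,]*)\D*$")
print(relaxed.match(raw).group(1))            # b'123.45'
```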
A simpler and more flexible way is to remove every currency symbol, and any whitespace, regardless of position:
$str = preg_replace('~[\p{Sc}\s]+~u', '', $str);
\p{Sc} is a unicode character class that contains all currency symbols.
or more radically:
$str = preg_replace('~[^\d.,]+~u', '', $str);
or without regex:
$str = '£1823.45';
$allowed_chars = [0,1,2,3,4,5,6,7,8,9,'.',','];
echo implode('', array_intersect(str_split($str), $allowed_chars));
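The "remove every currency symbol" idea can also be sketched in Python. Python's built-in `re` module does not support `\p{Sc}`, so this example checks the Unicode general category with `unicodedata` instead; the function name `strip_currency` is made up for the illustration.

```python
import unicodedata

def strip_currency(text):
    # "Sc" is the Unicode general category for currency symbols.
    return "".join(
        ch for ch in text
        if unicodedata.category(ch) != "Sc" and not ch.isspace()
    )

print(strip_currency("£ 1,823.45"))
```

The thousands separator is left in place, so the result may still need normalising before it is cast to a float.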
I am trying to use either Perl or MATLAB to parse a few numbers out of a single line of text. My text line is:
t10_t20_t30_t40_
Now in MATLAB, I used the following script:
str = 't10_t20_t30_t40_';
a = regexp(str,'t(\d+)_t(\d+)','match')
and it returns
a =
't10_t20' 't30_t40'
What I want is for it to also return 't20_t30', since this obviously is a match. Why doesn't regexp scan it?
I thus turned to Perl, and wrote the following in Perl:
#!/usr/bin/perl -w
$str = "t10_t20_t30_t40_";
while($str =~ /(t\d+_t\d+)/g)
{
print "$1\n";
}
and the result is the same as matlab
t10_t20
t30_t40
but I really wanted "t20_t30" to also be in the results.
Can anyone tell me how to accomplish that? Thanks!
[update with a solution]:
With help from colleagues, I identified a solution using the so-called "look-around assertion" afforded by Perl.
#!/usr/bin/perl -w
$str = "t10_t20_t30_t40_";
while($str =~ m/(?=(t\d+_t\d+))/g)
{print "$1\n";}
The key is to use a zero-width look-ahead assertion. When Perl (and other regex engines) scans a string with the global modifier, each new match attempt starts where the previous match ended, so text consumed by one match is never re-scanned. That is why t20_t30 does not show up in the original results. A look-ahead consumes no characters, so after each zero-width match the engine advances by only one position and tries again. With the "global" modifier (i.e. m//g) it therefore tests every position in the string, and the capture group inside the look-ahead records the overlapping substring found at that position (see the working code above).
This is explained in more detail in this blog post.
The expression (?=t\d+_t\d+) matches any 0-width string followed by t\d+_t\d+, and this creates the actual "sliding window". This effectively finds ALL t\d+_t\d+ patterns in $str without any exclusion, since every position in $str is a 0-width string. The additional parentheses in (?=(t\d+_t\d+)) capture the pattern while it's doing the sliding match, and thus return the desired sliding-window outcome.
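The same look-ahead trick carries over to other engines. As a small illustration in Python (standard `re` module, using the string from the question), a capture group inside a look-ahead returns the overlapping matches, while the plain pattern does not:

```python
import re

text = "t10_t20_t30_t40_"

# Ordinary scan: each match consumes its characters.
plain = re.findall(r"t\d+_t\d+", text)

# Zero-width look-ahead: nothing is consumed, the group captures the window.
overlapping = re.findall(r"(?=(t\d+_t\d+))", text)

print(plain)        # ['t10_t20', 't30_t40']
print(overlapping)  # ['t10_t20', 't20_t30', 't30_t40']
```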
Using Perl:
#!/usr/bin/perl
use Data::Dumper;
use Modern::Perl;
my $re = qr/(?=(t\d+_t\d+))/;
my @l = 't10_t20_t30_t40' =~ /$re/g;
say Dumper(\@l);
Output:
$VAR1 = [
't10_t20',
't20_t30',
't30_t40'
];
Once the regexp algorithm has found a match, the matched characters are not considered for further matches (and usually, this is what one wants, e.g. .* is not supposed to match every conceivable contiguous substring of this post). A workaround would be to start the search again one character after the first match, and collect the results:
str = 't10_t20_t30_t40_';
sub_str = str;
reg_ex = 't(\d+)_t(\d+)';
start_idx = 0;
all_start_indeces = [];
all_end_indeces = [];
off_set = 0;
%// While there are matches later in the string and the first match of the
%// remaining string is not the last character
while ~isempty(start_idx) && (start_idx < numel(str))
%// Calculate offset to original string
off_set = off_set + start_idx;
%// extract string starting at first character after first match
sub_str = sub_str((start_idx + 1):end);
%// find further matches
[start_idx, end_idx] = regexp(sub_str, reg_ex, 'once');
%// save match if any
if ~isempty(start_idx)
all_start_indeces = [all_start_indeces, start_idx + off_set];
all_end_indeces = [all_end_indeces, end_idx + off_set];
end
end
display(all_start_indeces)
display(all_end_indeces)
matched_strings = arrayfun(@(st, en) str(st:en), all_start_indeces, all_end_indeces, 'uniformoutput', 0)
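The restart-one-character-later workaround is language independent. The following Python sketch shows the same loop in a compact form; the function name `overlapping_matches` is an assumption for this example, and it relies on the `pos` argument of a compiled pattern's `search` method rather than on slicing the string.

```python
import re

def overlapping_matches(pattern, text):
    compiled = re.compile(pattern)
    found = []
    pos = 0
    while True:
        m = compiled.search(text, pos)
        if m is None:
            break
        found.append(m.group(0))
        # Resume one character after the START of this match, not after its end,
        # so that a later match may overlap the current one.
        pos = m.start() + 1
    return found

print(overlapping_matches(r"t\d+_t\d+", "t10_t20_t30_t40_"))
```

Keeping the original string and moving only the start position avoids the offset bookkeeping needed when the string is truncated on every pass.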
I have run into another problem in relation to a site I am trying to scrape.
Basically I have stripped most of what I don't want from the page content and thanks to some help given here have managed to isolate the dates I wanted. Most of it seems to be working fine, despite some initial problems matching a non-breaking space. However, I am now having difficulty with the final regex, which is intended to split each line of data into fields. Each line represents the price of a share price index. The fields on each line are:
A name of arbitrary length made from characters from the latin alphabet and sometimes a comma or ampersand, no numerics.
A number with two digits after the decimal point (the absolute value of the index).
A number with two digits after the decimal point (the change in the value).
A number with two digits after the decimal point followed by a percent sign (the percentage change in value).
Here is an example string, before splitting:
"Fishery, Agriculture & Forestry243.45-1.91-0.78% Mining360.74-4.15-1.14% Construction465.36-1.01-0.22% Foods783.2511.281.46% Textiles & Apparels412.070.540.13% Pulp & Paper333.31-0.29-0.09% Chemicals729.406.010.83% "
The regex I am using to split this line is this:
$mystr =~ s/\n(.*?)(\d{1,4}\.\d{2})(\-?\d{1,3}\.\d{2})(.*?%)\n/\n$1 == $2 == $3 == $4\n/ig;
It works sometimes but not other times and I cannot work out why this should be. (The doubled equal signs in the example output below are used to make the field split more easily visible.)
Fishery, Agriculture & Forestry == 243.45 == -1.91 == -0.78%
Mining360.74-4.15-1.14%
Construction == 465.36 == -1.01 == -0.22%
Foods783.2511.281.46%
I thought the minus sign was an issue for those indices that saw a negative change in the price of the index, but sometimes it works despite the minus sign.
Q. Why is the final regex shown below failing to split the fields consistently?
Example code follows.
#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::Tree;
my $url_full = "http://www.tse.or.jp/english/market/STATISTICS/e06_past.html";
my $content = get($url_full);
# get dates:
(my @dates) = $content =~ /(?<=dateFormat\(')\d{4}\/\d{2}\/\d{2}(?='\))/g;
foreach my $date (@dates) { # convert to yyyy-mm-dd
$date =~ s/\//-/ig;
}
my $tree = HTML::Tree->new();
$tree->parse($content);
my $mystr = $tree->as_text;
$mystr =~ s/\xA0//gi; # remove non-breaking spaces
# remove first chunk of text:
$mystr =~
s/^(TSE.*?)IndustryIndexChange ?/IndustryIndexChange\n$dates[0]\n\n/gi;
$mystr =~ s/IndustryIndexChange ?/IndustryIndexChange/ig;
$mystr =~ s/IndustryIndexChange/Industry Index Change\n/ig;
$mystr =~ s/% /%\n/gi; # percent symbol is marker for end of line
# indicate breaks between days:
$mystr =~ s/Stock.*?IndustryIndexChange/\nDAY DELIMITER\n/gi;
$mystr =~ s/Exemption from Liability.*$//g; # remove boilerplate at bottom
# and here's the problem regex...
# try to split it:
$mystr =~
s/\n(.*?)(\d{1,4}\.\d{2})(\-?\d{1,3}\.\d{2})(.*?%)\n/\n$1 == $2 == $3 == $4\n/ig;
print $mystr;
It appears to be doing every other one.
My guess is that your records have a single \n between them, but your pattern starts and ends with a \n. So the final \n on the first match consumes the \n that the second match needed to find the second record. The net result is that it picks up every other record.
You might be better off wrapping your pattern in ^ and $ (instead of \n and \n), and using the m flag on the s///.
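To illustrate the suggestion, here is a hedged Python sketch that anchors the pattern with ^ and $ in multi-line mode instead of consuming the newlines. The sample lines are taken from the question; the variable names are invented for the example, and Python's `re.MULTILINE` stands in for Perl's m flag.

```python
import re

lines = (
    "Fishery, Agriculture & Forestry243.45-1.91-0.78%\n"
    "Mining360.74-4.15-1.14%\n"
    "Construction465.36-1.01-0.22%\n"
    "Foods783.2511.281.46%\n"
)

# ^ and $ are zero-width, so no newline is consumed and every line can match.
record = re.compile(
    r"^(.*?)(\d{1,4}\.\d{2})(-?\d{1,3}\.\d{2})(.*?%)$",
    re.MULTILINE,
)

split_lines = record.sub(r"\1 == \2 == \3 == \4", lines)
print(split_lines)
```

With the anchors in place, consecutive records no longer compete for the same newline, so none of them is skipped.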
The problem is that you have \n both at the start and at the end of the regex.
Consider something like this:
$s = 'abababa';
$s =~ s/aba/axa/g;
that will set $s to axabaxa, not axaxaxa, because there are only two non-overlapping occurrences of aba.
My interpretation (pseudocode), with the decimal point escaped and an optional minus sign on the two change fields -
one = [a-zA-Z,& ]+
two = \d{1,4}\.\d\d
three = -?<<two>>
four = -?<<two>>%
regex = (<<one>>)(<<two>>)(<<three>>)(<<four>>)
= ([a-zA-Z,& ]+)(\d{1,4}\.\d\d)(-?\d{1,4}\.\d\d)(-?\d{1,4}\.\d\d%)
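A field-by-field pattern of this kind can be tried directly on the unsplit sample string. The Python sketch below is only an illustration: it escapes the decimal point and allows an optional minus sign on the two change fields (adjustments assumed here so that the negative values in the sample match), and the names `NAME`, `INDEX`, `CHANGE` and `RECORD` are invented for the example.

```python
import re

NAME = r"[A-Za-z,& ]+?"          # letters, commas, ampersands, spaces (lazy)
INDEX = r"\d{1,4}\.\d\d"         # absolute value of the index
CHANGE = r"-?\d{1,4}\.\d\d"      # change, possibly negative

RECORD = re.compile(rf"\s*({NAME})({INDEX})({CHANGE})({CHANGE}%)")

sample = (
    "Fishery, Agriculture & Forestry243.45-1.91-0.78% Mining360.74-4.15-1.14% "
    "Foods783.2511.281.46% Textiles & Apparels412.070.540.13% "
)

# finditer walks the string record by record, so no newline markers are needed.
rows = [m.groups() for m in RECORD.finditer(sample)]
for row in rows:
    print(" == ".join(row))
```

Because each record ends at its percent sign, the pattern does not depend on line breaks having been inserted beforehand.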
However, you are already presented with 'structured' data in the form of HTML. Why not take advantage of this?
"HTML parsing in perl" references Mojo for DOM-based parsing in Perl, and unless there are serious performance reasons, I'd highly recommend such an approach.