Pattern matching in postgres 9.1 - regex

I am trying to extract the camera make & model from exifdata.
The exifdata itself is quite long, 4 lines follow:
JPEG.APP1.Ifd0.ImageDescription = ' '
JPEG.APP1.Ifd0.Make = 'Canon'
JPEG.APP1.Ifd0.Model = 'Canon PowerShot S120'
JPEG.APP1.Ifd0.Orientation = 1 = '0,0 is top left'
I use the following regexes, but they do not match. Are the patterns correct?
make := substring( meta from 'Make\\s+=\\s+(.*)');
model := substring( meta from 'Model\\s+=\\s+(.*)');

substring([str] from [pattern]) doesn't work like you seem to think it does. You can find the details of how it works here: 9.7.2. SIMILAR TO Regular Expressions. That's the regular expression syntax your call uses.
For starters, there's this relevant bit of info:
As with SIMILAR TO, the specified pattern must match the entire data string, or else the function fails and returns null.
Your regular expressions clearly don't match the entire string.
Second is the next sentence:
To indicate the part of the pattern that should be returned on success, the pattern must contain two occurrences of the escape character followed by a double quote (")
This isn't quite the standard regex, so you need to be aware of it.
Rather than trying to get substring([str] from [pattern]) working, I'm going to recommend an alternative: regexp_matches. This function uses standard POSIX regex syntax, and it returns a text[] containing all the captured groups from the match. Here's a quick test to show that it works:
SELECT regexp_matches($$JPEG.APP1.Ifd0.ImageDescription = ' '
JPEG.APP1.Ifd0.Make = 'Canon'
JPEG.APP1.Ifd0.Model = 'Canon PowerShot S120'
JPEG.APP1.Ifd0.Orientation = 1 = '0,0 is top left'$$, '(Make)') m;
(I'm using dollar quoting for your example string, in case you're not familiar with that syntax.)
This gives back the array {Make}.
Second, your regex actually doesn't work, as I found out in my testing. You have two problems:
The double slashes are incorrect. You don't need to escape the \ as PostgreSQL doesn't treat it as an escape character by default. You can read about escaping in strings here; the most relevant section is probably 4.1.2.2. String Constants with C-style Escapes. That section describes what you thought was happening by default, but it actually requires an E prefix to enable.
Fixing that improves the result:
SELECT regexp_matches($$JPEG.APP1.Ifd0.ImageDescription = ' '
JPEG.APP1.Ifd0.Make = 'Canon'
JPEG.APP1.Ifd0.Model = 'Canon PowerShot S120'
JPEG.APP1.Ifd0.Orientation = 1 = '0,0 is top left'$$, 'Make\s+=\s+(.*)') m;
now gives an array containing this string:
'Canon'
JPEG.APP1.Ifd0.Model = 'Canon PowerShot S120'
JPEG.APP1.Ifd0.Orientation = 1 = '0,0 is top left'
This brings us to...
The (.*) is matching everything to the end of the string, not the end of the line. You can actually fix this by doing something you probably want to do anyway: get the single quote marks out of the match. You can use this pattern to do that:
$$Make\s+=\s+'([^']+)'$$
I've used dollar quoting again, this time to avoid the ugliness of escaping all those single quote marks. Now the query is:
SELECT regexp_matches($$JPEG.APP1.Ifd0.ImageDescription = ' '
JPEG.APP1.Ifd0.Make = 'Canon'
JPEG.APP1.Ifd0.Model = 'Canon PowerShot S120'
JPEG.APP1.Ifd0.Orientation = 1 = '0,0 is top left'$$, $$Make\s+=\s+'([^']+)'$$) m;
which gives you pretty much exactly what you want: an array containing just the string Canon. You'll need to extract the result from the array, of course, but I'll leave that as an exercise for you.
That should be enough info for you to get the second expression working, too.
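To check the final pattern outside the database, here is a minimal Python sketch. It assumes Python's re module rather than PostgreSQL's regex engine (the two treat this particular pattern the same way), and the helper name extract_field is made up for illustration.

```python
import re

META = """JPEG.APP1.Ifd0.ImageDescription = ' '
JPEG.APP1.Ifd0.Make = 'Canon'
JPEG.APP1.Ifd0.Model = 'Canon PowerShot S120'
JPEG.APP1.Ifd0.Orientation = 1 = '0,0 is top left'"""


def extract_field(meta, name):
    """Return the single-quoted value of `name`, or None if it is absent."""
    # Same shape as the pattern above: name, spaces, '=', spaces, quoted value.
    match = re.search(re.escape(name) + r"\s+=\s+'([^']+)'", meta)
    return match.group(1) if match else None


make = extract_field(META, "Make")
model = extract_field(META, "Model")
print(make, "|", model)
```

Because [^'] cannot cross a quote mark, the capture stops at the closing quote instead of running to the end of the string.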
P.S. PostgreSQL's truly fine manual is a blessing.

Related

How to extract part of a string with slash constraints?

Hello I have some strings named like this:
BURGERDAY / PPA / This is a burger fest
I've tried using regex to get it but I can't seem to get it right.
The output should just get the final string of This is a burger fest (without the first whitespace)
Here, we can capture our desired output once we reach the last slash followed by any number of spaces:
.+\/\s+(.+)
where (.+) collects what we wish to return.
const regex = /.+\/\s+(.+)/gm;
const str = `BURGERDAY / PPA / This is a burger fest`;
const subst = `$1`;
// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);
console.log(result);
Advice
Based on revo's advice, we can also use this expression, which is much better:
\/ +([^\/]*)$
According to Bohemian's advice, it may not be required to escape the forward slash, depending on the language we wish to use. In JavaScript this holds when the pattern is built with the RegExp constructor rather than written as a /.../ literal:
.+/\s+(.+)
We also assume the target content contains no forward slash; otherwise, we can change our constraints based on other possible inputs/scenarios.
Note: This is a pythonish answer (my mistake). I'll leave this for its value, as it could apply to many languages.
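As a quick cross-check of the two expressions above, here is a small Python sketch. It assumes Python's re module, where the forward slash has no special meaning and so is left unescaped; the variable names are illustrative only.

```python
import re

text = "BURGERDAY / PPA / This is a burger fest"

# Greedy .+ runs up to the last slash; group 1 is what follows the spaces.
greedy = re.search(r".+/\s+(.+)", text).group(1)

# revo's variant: a slash, spaces, then slash-free text up to the end.
anchored = re.search(r"/ +([^/]*)$", text).group(1)

print(greedy)
print(anchored)
```

Both captures hold the final segment without its leading whitespace.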
Another approach is to split it and then rejoin it.
data = 'BURGERDAY / PPA / This is a burger fest'
Here it is in four steps:
parts = data.split('/') # break into a list by '/'
parts = parts[2:] # get a new list excluding the first 2 elements
output = '/'.join(parts) # join them back together with a '/'
output = output.strip() # strip spaces from each side of the output
And in one concise line:
output= str.join('/', data.split('/')[2:]).strip()
Note: I feel that str.join(..., ...) is more readable than '...'.join(...) in some contexts. It is the identical call though.

Extract root, month letter-year and yellow key from a Bloomberg futures ticker

A Bloomberg futures ticker usually looks like:
MCDZ3 Curncy
where the root is MCD, the month letter and year is Z3 and the 'yellow key' is Curncy.
Note that the root can be of variable length, 2-4 letters or 1 letter and 1 whitespace (e.g. S H4 Comdty).
The letter-year allows only the letters listed below in expr and can have two-digit years.
Finally the yellow key can be one of several security type strings but I am interested in (Curncy|Equity|Index|Comdty) only.
In Matlab I have the following regular expression
expr = '[FGHJKMNQUVXZ]\d{1,2} ';
[rootyk, monthyear] = regexpi(bbergtickers, expr,'split','match','once');
where
rootyk{:}
ans =
'mcd' 'curncy'
and
monthyear =
'z3 '
I don't want to match the ' ' (space) in the monthyear. How can I do that?
Assuming there are no leading or trailing whitespaces and only upcase letters in the root, this should work:
^([A-Z]{2,4}|[A-Z]\s)([FGHJKMNQUVXZ]\d{1,2}) (Curncy|Equity|Index|Comdty)$
You've got root in the first group, letter-year in the second, yellow key in the third.
I don't know Matlab, nor whether it supports Perl Compatible Regular Expressions. If it fails, try e.g. a literal space instead of \s. Also, drop the ^...$ if you'd like to extract from a bigger source text.
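For readers who want to try the three-group expression outside Matlab, here is a minimal Python sketch. It uses Python's re module, and split_ticker is a hypothetical helper name, not part of any Bloomberg or Matlab API.

```python
import re

# Group 1: root, group 2: month letter plus year, group 3: yellow key.
TICKER = re.compile(
    r"^([A-Z]{2,4}|[A-Z]\s)([FGHJKMNQUVXZ]\d{1,2}) (Curncy|Equity|Index|Comdty)$"
)


def split_ticker(ticker):
    """Return (root, month_year, yellow_key), or None if the ticker does not match."""
    match = TICKER.match(ticker)
    return match.groups() if match else None


print(split_ticker("MCDZ3 Curncy"))
print(split_ticker("S H4 Comdty"))
```

Note that for a one-letter root the first group keeps its trailing space, so it may need stripping afterwards.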
The expression you're feeding regexpi with contains a space and is used as a pattern for 'match'. This is why the matched monthyear string also has a space (see note 1 below).
If you want to keep it simple and let regexpi do the work for you (instead of postprocessing its output), try a different approach and capture tokens instead of matching, and ignore the intermediate space:
%// <$1><----------$2---------> <$3>
expr = '(.+)([FGHJKMNQUVXZ]\d{1,2}) (.+)';
tickinfo = regexpi(bbergtickers, expr, 'tokens', 'once');
You can also simplify the expression to a more generic '(.+)(\w{1}\d{1,2})\s+(.+)', if you wish.
Example
bbergtickers = 'MCDZ3 Curncy';
expr = '(.+)([FGHJKMNQUVXZ]\d{1,2})\s+(.+)';
tickinfo = regexpi(bbergtickers, expr, 'tokens', 'once');
The result is:
tickinfo =
'MCD'
'Z3'
'Curncy'
1 This expression is also used as a delimiter for 'split'. Removing the trailing space from it won't help, as it will reappear in the rootyk output instead.
Assuming you just want to get rid of the leading and/or trailing spaces at the edges, there is a very simple command for that:
monthyear = strtrim(monthyear)
For removing all spaces, you can do:
monthyear(isspace(monthyear))=[]
Here is a completely different approach, basically this searches the letter before your year number:
s = 'MCDZ3 Curncy'
p = regexp(s,'\d')
s(min(p)-1:max(p))

Regex to select semicolons that are not enclosed in double quotes

I have string like
a;b;"aaa;;;bccc";deef
I want to split string based on delimiter ; only if ; is not inside double quotes. So after the split, it will be
a
b
"aaa;;;bccc"
deef
I tried using look-behind, but I'm not able to find a correct regular expression for splitting.
Regular expressions are probably not the right tool for this. If possible, use a CSV library and specify ; as the delimiter and " as the quote character; this should give you the exact fields you are looking for.
That being said, here is one approach that works by ensuring that there is an even number of quotation marks between the ; we are considering splitting at and the end of the string.
;(?=(([^"]*"){2})*[^"]*$)
Example: http://www.rubular.com/r/RyLQyR8F19
This will break down if you can have escaped quotation marks within a string, for example a;"foo\"bar";c.
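A minimal Python sketch of the lookahead split, assuming Python's re module. The only change from the expression above is that the groups are made non-capturing, because re.split would otherwise insert the captured text into the result list; the constant name is made up for illustration.

```python
import re

# Split on ';' only when an even number of quotes remains up to the end.
SEMICOLON_OUTSIDE_QUOTES = re.compile(r';(?=(?:(?:[^"]*"){2})*[^"]*$)')

fields = SEMICOLON_OUTSIDE_QUOTES.split('a;b;"aaa;;;bccc";deef')
for field in fields:
    print(field)
```

The same limitation applies here: escaped quotation marks inside a quoted field throw off the quote count.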
Here is a much cleaner example using Python's csv module:
import csv
import io

# Python 3: io.StringIO wraps the string so csv.reader can read it like a file
reader = csv.reader(io.StringIO('a;b;"aaa;;;bccc";deef'),
                    delimiter=';', quotechar='"')
for row in reader:
    print('\n'.join(row))
Regular expressions will only get messier and break on even minor changes. You are better off using a CSV parser with any scripting language. Perl has a built-in module called Text::ParseWords (so you don't need to download anything from CPAN if there are any restrictions) that allows you to specify the delimiter, so you are not limited to ,. Here is a sample snippet:
#!/usr/local/bin/perl
use strict;
use warnings;
use Text::ParseWords;
my $string = 'a;b;"aaa;;;bccc";deef';
my @ary = parse_line(q{;}, 0, $string);
print "$_\n" for @ary;
Output
a
b
aaa;;;bccc
deef
This is kind of ugly, but if you don't have \" inside your quoted strings (meaning you don't have strings that look like "foo bar \"badoo\" goo"), you can split on the " first and then assume that every second array element (the odd zero-based indexes) is, in fact, a quoted string, and split the remaining elements into their component parts on the ; token.
If you do have \" in your strings, then you'll want to first convert those into some other temporary token that you'll convert back later after you've performed your operation.
Here's a fiddle...
http://jsfiddle.net/VW9an/
var str = 'abc;def;ghi"some other dogs say \\"bow; wow; wow\\". yes they do!"and another; and a fifth';
var strCp = str.replace(/\\"/g, "--##--");
var parts = strCp.split(/"/);
var allPieces = new Array();
for (var i in parts) {
    if (i % 2 == 0) {
        var innerParts = parts[i].split(/\;/);
        for (var j in innerParts)
            allPieces.push(innerParts[j]);
    }
    else {
        allPieces.push('"' + parts[i] + '"');
    }
}
for (var a in allPieces) {
    allPieces[a] = allPieces[a].replace(/--##--/g, '\\"');
}
console.log(allPieces);
Match All instead of Splitting
Answering long after the battle because no one used the way that seems the simplest to me.
Once you understand that Match All and Split are Two Sides of the Same Coin, you can use this simple regex:
"[^"]*"|[^";]+
See the matches in the Regex Demo.
The left side of the alternation | matches full quoted strings
The right side matches any chars that are neither ; nor "
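A short Python sketch of the match-all idea, assuming Python's re module (FIELD is an illustrative name). One caveat that follows from the + quantifier: empty fields, as in a;;b, produce no match and are silently dropped.

```python
import re

# Left branch: a whole quoted string. Right branch: a run without ';' or '"'.
FIELD = re.compile(r'"[^"]*"|[^";]+')

fields = FIELD.findall('a;b;"aaa;;;bccc";deef')
for field in fields:
    print(field)
```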

preg_match Part of a url

I have a link that looks like this http://site.com/numbers_and_letters/This_is_what-I-need_to-retrieve.html
I basically need to retrieve this part: This_is_what-I-need_to-retrieve
And also replace the dashes and underscores with spaces so it would end up looking like this: This is what I need to retrieve
I'm new to regex so this is what I'm using:
(it works but has poor performance)
function clean($url)
{
$cleaned = preg_replace("/http:\/\/site.com\/.+\//", '', $url);
$cleaned = preg_replace("/[-_]/", ' ', $cleaned);
//remove the html extension
$cleaned = substr($cleaned, 0,-4);
return $cleaned;
}
What you've got isn't that bad. But maybe you can try comparing its performance to this:
// PCRE patterns need delimiters (~ here); $match[0] holds the matched text
preg_match('~[^/]+$~', $url, $match);
$cleaned = preg_replace('~[-_]~', ' ', $match[0]);
EDIT:
If all you have is a hammer, everything looks like a nail.
How about avoiding regex altogether? (I presume each input is a valid URL.)
$cleaned = strtr(substr($url, strrpos($url, '/') + 1, -5), '-_', '  ');
This even removes the .html extension! (I'm making all the same assumptions you already seem to be making, i.e. that all links end in .html.) A brief explanation:
strtr translates a set of characters, e.g. -_, to respective characters in another set, e.g. spaces. (I imagine it'd be more efficient than invoking the entire regex engine.)
substr, you must know, but note that if the last argument is negative, e.g. -5, it indicates the number of characters from the end to ignore. Handy for this case, and again, probably more efficient than regex.
strrpos, of course, finds the last position of a character in a string, e.g. /.
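The same regex-free idea can be sketched in Python for comparison. This assumes, like the answer, that every link ends in .html; clean and URL are illustrative names, and str.rfind, slicing and str.translate stand in for strrpos, substr and strtr.

```python
URL = "http://site.com/numbers_and_letters/This_is_what-I-need_to-retrieve.html"


def clean(url):
    """Return the last path segment without '.html', with - and _ as spaces."""
    # Keep what follows the last '/', minus the 5 characters of ".html".
    slug = url[url.rfind("/") + 1 : -len(".html")]
    # Map each of '-' and '_' to a space (one target character per source).
    return slug.translate(str.maketrans("-_", "  "))


cleaned = clean(URL)
print(cleaned)
```

Note that the translation table needs one replacement character per source character, hence the two spaces.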

Replace using RegEx outside of text markers

I have the following sample text and I want to replace '[core].' with something else but I only want to replace it when it is not between text markers ' (SQL):
PRINT 'The result of [core].[dbo].[FunctionX]' + [core].[dbo].[FunctionX] + '.'
EXECUTE [core].[dbo].[FunctionX]
The result should be:
PRINT 'The result of [core].[dbo].[FunctionX]' + [extended].[dbo].[FunctionX] + '.'
EXECUTE [extended].[dbo].[FunctionX]
I hope someone can understand this. Can this be solved by a regular expression?
Not in a single step, and not in an ordinary text editor. If your SQL is syntactically valid, you can do something like this:
First, you remove every string from the SQL and replace with placeholders. Then you do your replace of [core] with something else. Then you restore the text in the placeholders from step one:
Replace all occurrences of '(?:''|[^'])+' with 'n', where n is an index number (the number of the match). Store each match in an array at that same index n. This will remove all SQL strings from the input and exchange them for harmless replacements without invalidating the SQL itself.
Do your replace of [core]. No regex required, normal search-and-replace is enough here.
Iterate the array, replacing the placeholder '1' with the first array item, '2' with the second, up to n. Now you have restored the original strings.
The regex, explained:
' # a single quote
(?: # begin non-capturing group
''|[^'] # either two single quotes, or anything but a single quote
)+ # end group, repeat at least once
' # a single quote
In JavaScript, this would look something like this:
var sql = 'your long SQL code';
var str = [];
// step 1 - remove everything that looks like an SQL string
var newSql = sql.replace(/'(?:''|[^'])+'/g, function(m) {
    str.push(m);
    return "'" + (str.length - 1) + "'";
});
// step 2 - actual replacement (regex with /g so every occurrence is replaced)
newSql = newSql.replace(/\[core\]/g, "[new-core]");
// step 3 - restore all original strings
for (var i = 0; i < str.length; i++) {
    newSql = newSql.replace("'" + i + "'", str[i]);
}
// done.
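The same idea, recognising SQL strings before touching [core], can also be folded into a single pass. The Python sketch below is a variation, not the placeholder method above: it matches either a quoted string or [core] and only rewrites the latter. TOKEN and replace_outside_strings are made-up names, and the string pattern uses * instead of + (my assumption) so that an empty string '' is also recognised as a string.

```python
import re

# Branch 1 (captured): a whole SQL string, with '' as an escaped quote.
# Branch 2: the text we actually want to replace.
TOKEN = re.compile(r"('(?:''|[^'])*')|\[core\]")


def replace_outside_strings(sql, replacement="[extended]"):
    """Replace [core] everywhere except inside single-quoted SQL strings."""
    def swap(match):
        # A quoted string matched: hand it back unchanged.
        if match.group(1) is not None:
            return match.group(1)
        return replacement
    return TOKEN.sub(swap, sql)


sql = ("PRINT 'The result of [core].[dbo].[FunctionX]' + [core].[dbo].[FunctionX] + '.'\n"
       "EXECUTE [core].[dbo].[FunctionX]")
result = replace_outside_strings(sql)
print(result)
```

Like the placeholder method, this assumes the SQL is syntactically valid, so that every quote mark is properly paired.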
Here is a solution (javascript):
str.replace(/('[^']*'.*)*\[core\]/g, "$1[extended]");