Detect quoted strings in SQL query - regex

I am writing a bash script that I am using to detect certain classes of strings in a SQL query (like all upper-case, all lower-case, all numeric characters, etc.). Before doing the classification, I want to extract all quoted strings. I am having trouble getting a regex that will properly extract the quoted strings from the query string. For example, take this query from the TPC-H benchmark:
select
o_year,
sum(case
when nation = 'JAPAN' then volume
else 0
end) / sum(volume) as mkt_share
from
(
select
extract(year from o_orderdate) as o_year,
l_extendedprice * (1 - l_discount) as volume,
n2.n_name as nation
from
part,
supplier,
lineitem,
orders,
customer,
nation n1,
nation n2,
region
where
p_partkey = l_partkey
and s_suppkey = l_suppkey
and l_orderkey = o_orderkey
and o_custkey = c_custkey
and c_nationkey = n1.n_nationkey
and n1.n_regionkey = r_regionkey
and r_name = 'ASIA'
and s_nationkey = n2.n_nationkey
and o_orderdate between date '1995-01-01' and date '1996-12-31'
and p_type = 'MEDIUM BRUSHED BRASS'
) as all_nations
group by
o_year
order by
o_year;
It's a complex query, but that is beside the point. I need to be able to extract all of the single-quoted strings from this file and print each on its own line, i.e.:
'JAPAN'
'ASIA'
'1995-01-01'
'1996-12-31'
'MEDIUM BRUSHED BRASS'
Right now (not being very familiar with regex), all I have is:
printf '%s\n' $SQL_FILE_VARIABLE | grep -E "'*'"
But this doesn't support strings with spaces, and it doesn't work when multiple strings are on the same line of the file. Ideally, I can get this to work in my bash script, so preferably the solution will use grep/sed/perl. I have done some googling and found solutions to similar problems, but I have not been able to get them to work for this in particular.
Any ideas how I can achieve this? Thanks.

You want something like this:
printf '%s\n' "$SQL_FILE_VARIABLE" | grep -oE "'[^']*'"
The -o flag makes grep print each match on its own line instead of the whole matching line, which also handles multiple strings on one line; quoting the variable preserves the file's line breaks.

Why not try /'(.*?)'/g
This means: between the quotes, match everything (non-greedily, so two strings on one line aren't merged into one match) and extract it.
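Both suggested patterns can be sanity-checked outside the shell. Here is a quick sketch in Python (re.findall plays the role of a global match; the sample line is taken from the query in the question):

```python
import re

line = "and r_name = 'ASIA' and p_type = 'MEDIUM BRUSHED BRASS'"

# Character-class version: a quote, any run of non-quote characters, a quote.
# This naturally handles spaces and multiple strings on the same line.
print(re.findall(r"'[^']*'", line))
# → ["'ASIA'", "'MEDIUM BRUSHED BRASS'"]
```

A greedy pattern like '(.*)' would instead match from the first quote to the last, swallowing both strings in one match.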

Related

How do I keep parts (not all the same) of a string that match a regular expression in Teradata?

I want to keep parts of a string in my output. The problem is these parts can be many things.
My input data have a store name and a selling code that looks like garbage:
'Amazon Store cod14%359'
'Roses Market XYwK$p##2'
'Amazon Store cod99#9ab'
'MyStore _89ab$$3'
There are many answers that do this by hard-coding the store name in the output string, like
regexp_replace(store, '(Amazon Store).*$', 'Amazon Store')
but this implies one rule per store (too much work; regular expressions are better than this!)
Theoretically, I should be able to do things like this:
my $date = "2009-27-02";
$date =~ s/(\d{4})-(\d{2})-(\d{2})/$1-$3-$2/;
Here, year is carried to output string as $1, month as $3 and day is $2.
I would like to do something like this:
sel store,
regexp_replace(store, '((Amazon Store)|(Roses Market)|(MyStore)).*$', '$1') as BetterStore
from myDatabase
(I have tried a lot of syntax variations, with {}, [], \ and everything else I could think of, without success...)
Does somebody know if this is possible, in Teradata?
You can use
regexp_replace(store, '(Amazon Store|Roses Market|MyStore).*', '\1')
Details:
(Amazon Store|Roses Market|MyStore) - Group 1 (\1): one of three phrases
.* - the rest of the line.
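The same replacement can be prototyped locally before running it in Teradata. Here is a sketch in Python, whose re.sub uses \1 backreferences in the replacement string just like regexp_replace (the sample store strings are taken from the question):

```python
import re

stores = [
    'Amazon Store cod14%359',
    'Roses Market XYwK$p##2',
    'MyStore _89ab$$3',
]
pattern = r'(Amazon Store|Roses Market|MyStore).*'

# Keep only group 1 (the store name); .* swallows the selling code.
print([re.sub(pattern, r'\1', s) for s in stores])
# → ['Amazon Store', 'Roses Market', 'MyStore']
```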

REGEXP_SUBSTR Redshift

I am trying to extract a substring from a text string in postgresql. The column name of the text string is short_description and I am using the REGEXP_SUBSTR function to define a regex that will return only the portion that I want:
SELECT short_description,
REGEXP_SUBSTR(short_description,'\\[[^=[]*') AS space
FROM my_table
This returns the following:
short_description space
----------------------------------------------------------------------------
[ABC12][1][ABCDEFG] ACB DEF [HIJ] | [ABC12]
What I would like to pull is the following:
short_description space
----------------------------------------------------------------------------
[ABC12][1][ABCDEFG] ACB DEF [HIJ] | [ABCDEFG]
Any ideas?
You can use POSIX character classes to help with this kind of match. Here I'm looking for letters only, surrounded by brackets, and a following space. Note the use of the double backslash \\ to escape the literal brackets, and the doubled brackets in [[:alpha:]] for the character class.
SELECT REGEXP_SUBSTR('[ABC12][1][ABCDEFG] ACB DEF [HIJ]','\\[[[:alpha:]]+\\] ');
regexp_substr
---------------
[ABCDEFG]
You could also use the SPLIT_PART function to achieve something similar, by splitting on a closing bracket ] and choosing the 3rd value.
SELECT SPLIT_PART('[ABC12][1][ABCDEFG] ACB DEF [HIJ]',']',3);
split_part
------------
[ABCDEFG
I recommend using the built-in functions rather than a UDF if at all possible. UDFs are fantastic when you need them, but they do incur a performance penalty.
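For quick experimentation outside the database, the same match can be sketched in Python; note that Python's re module has no POSIX character classes, so [[:alpha:]] becomes [A-Za-z], and the backslashes are single in a raw string:

```python
import re

s = '[ABC12][1][ABCDEFG] ACB DEF [HIJ]'

# Letters only between literal brackets, followed by a space.
# [ABC12] and [1] contain digits, so the first match is [ABCDEFG].
m = re.search(r'\[[A-Za-z]+\] ', s)
print(m.group().strip())
# → [ABCDEFG]
```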
Here you go.
I found the right regex using
https://txt2re.com
Then I implemented it as a Python Redshift UDF:
create or replace function f_regex (input_str varchar(max),regex_expression varchar(max))
returns VARCHAR(max)
stable
as $$
import re
rg = re.compile(regex_expression,re.IGNORECASE|re.DOTALL)
m = rg.search(input_str)
return m.group(1) if m is not None else None
$$ language plpythonu;
select f_regex('[ABC12][1][ABCDEFG] ACB DEF [HIJ] '::text,'.*?\\[.*?\\].*?\\[.*?\\](\\[.*?\\])'::text);
Once you have created the function, you can use it within any of your redshift selects.
So, in your case:
SELECT short_description,
f_regex(short_description::text,'.*?\\[.*?\\].*?\\[.*?\\](\\[.*?\\])'::text) AS space
FROM my_table
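The pattern passed to the UDF can also be tested locally before creating the function. A sketch in Python (single backslashes here, since the doubling in the SQL string literal is Redshift escaping):

```python
import re

pattern = r'.*?\[.*?\].*?\[.*?\](\[.*?\])'
m = re.search(pattern, '[ABC12][1][ABCDEFG] ACB DEF [HIJ]')

# Lazily skips the first two bracketed groups, captures the third.
print(m.group(1))
# → [ABCDEFG]
```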

Parsing Transact SQL with RegEx

I'm quite inexperienced with regex - just the occasional straightforward regex for a programming task, worked out by trial and error - but now I have a serious regex challenge:
I have about 970 text files containing Sybase Transact-SQL snippets, and I need to find every table name in those files and preface the table name with ' #'. So my options are to either spend a week editing the files by hand or write a script or application using regex (Python 3 or Delphi-PCRE) that will perform this task.
The rules are as follows:
- Table names are ALWAYS upperCase - so I'm only looking for upperCase words;
- Column names, SQL expressions and variables are ALWAYS lowerCase;
- SQL keywords, table aliases and column values CAN BE upperCase, but must NOT be prefixed with ' #';
- Table aliases (must not be prefixed) will always have whiteSpace preceding them until the end of the previous word, which will be a table name;
- Column values (must not be prefixed) will either be numerical values or characters enclosed in quotes.
Here is some sample text requiring application of all the above mentioned rules:
update SYBASE_TABLE
set ok = convert(char(10),MB.limit)
from MOVE_BOOKS MB, PEOPLEPLACES PPL
where MB.move_num = PPL.move_num
AND PPL.mot_ind = 'B'
AND PPL.trade_type_ind = 'P'
So far with I've gotten only this far: (not too far...)
(?-i)[[:upper:]]
Any help would be most appreciated.
TIA,
MN
This is not doable with a simple regex replacement. You will not be able to distinguish between upper-case words that are table names, string literals, or comments:
update TABLE set x='NOT_A_TABLE' where y='NOT TABLES EITHER'
-- AND NO TABLES HERE AS WELL
EDIT
You seem to think that determining if a word is inside a string literal or not is easy, then consider SQL like this:
-- a quote: '
update TABLE set x=42 where y=666
-- another quote: '
or
update TABLE set x='not '' A '''' table' where y=666
EDIT II
Okay, I (obsessively) hammered on the fact that a simple regex replacement is not doable, but I didn't offer a (possible) solution yet. What you could do is create a sort of "hybrid lexer" based on a couple of different regexes. You scan through the input file and, at each position, try to match either a comment, a string literal, a keyword, or a capitalized word. If none of these four patterns matches, you just consume a single character and repeat the process.
A little demo in Python:
#!/usr/bin/env python
import re
input = """
UPDATE SYBASE_TABLE
SET ok = convert(char(10),MB.limit) -- ignore me!
from MOVE_BOOKS MB, PEOPLEPLACES PPL
where MB.move_num = PPL.move_num
-- comment '
AND PPL.mot_ind = 'B '' X'
-- another comment '
AND PPL.trade_type_ind = 'P -- not a comment'
"""
regex = r"""(?xs) # x = enable inline comments, s = enable DOT-ALL
(--[^\r\n]*) # [1] comments
| # OR
('(?:''|[^\r\n'])*') # [2] string literal
| # OR
(\b(?:AND|UPDATE|SET)\b) # [3] keywords
| # OR
([A-Z][A-Z_]*) # [4] capitalized word
| # OR
. # [5] fall through: matches any char
"""
output = ''
for m in re.finditer(regex, input):
# append a `#` if group(4) matched
if m.group(4): output += '#'
# append the matched text (any of the groups!)
output += m.group()
# print the adjusted SQL
print(output)
which produces:
UPDATE #SYBASE_TABLE
SET ok = convert(char(10),#MB.limit) -- ignore me!
from #MOVE_BOOKS #MB, #PEOPLEPLACES #PPL
where #MB.move_num = #PPL.move_num
-- comment '
AND #PPL.mot_ind = 'B '' X'
-- another comment '
AND #PPL.trade_type_ind = 'P -- not a comment'
This may not be the exact output you want, but I'm hoping the script is simple enough for you to adjust to your needs.
Good luck.

How can I replace just the last occurrence in a string with a Perl regex?

Ok, here is my test (this is not production code, but just a test to illustrate my problem)
my $string = <<EOS; # auto generated query
SELECT
users.*
, roles.label AS role_label
, hr_orders.position_label
, deps.label AS dep_label
, top_deps.label AS top_dep_label
FROM
users
LEFT JOIN $conf->{systables}->{roles} AS roles ON users.id_role = roles.id
LEFT JOIN (
SELECT
id_user
, MAX(dt) AS max_dt
FROM
hr_orders
WHERE
fake = 0
AND
IFNULL(position_label, ' ') <> ' '
GROUP BY id_user
) last_hr_orders ON last_hr_orders.id_user = users.id
LEFT JOIN hr_orders ON hr_orders.id_user = last_hr_orders.id_user AND hr_orders.dt = last_hr_orders.max_dt
$join
WHERE
$filter
ORDER BY
$order
$limit
EOS
my $where = "WHERE\nusers.fake = -1 AND ";
$string =~ s{where}{$where}i;
print "result: \n$string";
The code which generates the query ends with a simple s{where}{$where}i, which replaces the first occurrence of "where" it finds (here, the wrong one).
I want to replace the top-level WHERE (the last occurrence of WHERE?) with 'WHERE users.fake = -1' (actually with a more complex pattern, but that doesn't matter).
Any ideas?
Why do you want to build your sql queries by hard-coding strings and then making replacements on them? Wouldn't something like
my $proto_query = <<'EOQ'
select ... where %s ...
EOQ
my $query = sprintf $proto_query, 'users.fake = -1 AND ...';
or (preferably, as it avoids a lot of issues your initial approach and the above have) using a module such as Data::Phrasebook::SQL make a lot of things easier?
If you really wanted to go for string substitutions, you're probably looking for something like
my $foo = "foo bar where baz where moo";
$foo =~ s/(.*)where/$1where affe and/;
say $foo; # "foo bar where baz where affe and moo"
That is, the greedy (.*) captures as much as it can while still leaving a "where" immediately after it - so it stops just before the last "where" - and then the captured text is re-inserted, plus whatever modifications you want to make.
However, note that this has various limitations if you're using that to mangle SQL queries. To do things right, you'd have to actually understand the SQL at some level. Consider, for example, select ... where user.name = 'where'.
Apparently, what I needed was the regex look-ahead feature.
my regex is
s{where(?!.*where)}{$where}is;
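The negative look-ahead trick is easy to verify in any PCRE-flavoured engine. A sketch in Python (the replacement text is made up for illustration; re.S matches the /s modifier so the look-ahead sees across newlines):

```python
import re

s = "foo bar where baz where moo"

# Match a 'where' that is NOT followed by another 'where' anywhere
# later in the string -- i.e. the last occurrence.
print(re.sub(r"where(?!.*where)", "WHERE fake = -1 AND", s, flags=re.S))
# → foo bar where baz WHERE fake = -1 AND moo
```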
The right way to parse SQL queries is with a parser, not a regex.
see SQL::Statement::Structure - parse and examine structure of SQL queries

How can I find the count of semicolon separated values?

I have a list of email ids which I copied from the 'To' field of an email I received in MS Outlook. These values (email ids) are separated by semicolons. I have copied this big list of email ids into Excel. Now I want to find the number of email ids in this list, basically by counting the number of semicolons.
One way I can do this is by writing C code, i.e. store the big list in a string buffer and count the characters equal to ';' in a loop.
But I want to do it quickly.
Is there any quick way to find that out using either:
1.) Regular expression (I use powergrep for processing the regexps)
2.) In excel itself (any excel macro/plugin for that?)
3.) DOS script method
4.) Any other quick way of getting it done?
I believe the following should work in Excel:
= Len(A1) - Len(Substitute(A1, ";", "")) + 1
/EDIT: if you've pasted the email addresses over several cells, you can count the cells with the following function:
= CountA(A1:BY1)
CountA counts non-empty cells in a given range. You can specify the range by typing =CountA( into a cell and then selecting your cell range with the mouse cursor.
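The Len/Substitute trick generalizes to any language. A quick sketch of the same idea in Python (the addresses are placeholders):

```python
s = "alice@example.com;bob@example.com;carol@example.com"

# Removing the separators and comparing lengths counts them;
# +1 turns "number of separators" into "number of items".
count = len(s) - len(s.replace(";", "")) + 1
print(count)
# → 3
```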
Bash/Cygwin One-Liner
$ echo "user@domain.tld;user@domain.tld;user@domain.tld" | sed -e 's/;/\n/g' | wc -l
3
If you already have Cygwin installed it's effectively instant. If not, Cygwin is worth installing IMHO. It basically provides a Linux bash prompt overlaid on your Windows system.
As an aside, stuff like this is why I prefer *nix over Windows for work; I can't live on a Windows box without Cygwin, since bash scripts are so much more powerful than batch scripts.
If counting the number of semicolons is good enough for you, you can do it in Perl using this solution: Perl FAQ 4.24: How can I count the number of occurrences of a substring within a string
PowerShell:
> $a = 'blah;blah;blah'
> $a.Split(';').Count
3
3) If you have neither Cygwin nor PowerShell installed, try this .cmd:
@echo off
set /a i = 0
for %%i in (name1@mail.com;name2@mail.com;name3@mail.com) do set /a i = i + 1
echo %i%
If you are using Excel, you can use this code and expose it as a worksheet function.
Public Function CountSubString(ByVal RHS As String, ByVal Delimiter As String) As Integer
Dim V As Variant
V = Split(RHS, Delimiter)
CountSubString = UBound(V) + 1
End Function
If you have .NET you can make a little command line utility
Module CountSubString
Public Sub Main(ByVal Args() As String)
If Args.Length <> 2 Then
Console.WriteLine("wrong arguments passed->")
Else
Dim I As Integer = 0
Dim Items() = Split(Args(0), Args(1))
Console.WriteLine("There are " & CStr(UBound(Items) + 1) & " items.")
End If
End Sub
End Module
Load the list in your favorite (not Notepad!) editor, replace ; by \n, see in the status bar how many lines you have, remove the last line if needed.
C# 3.0 with LINQ would make this easy if it is an option for you over C:
myString.ToCharArray().Count(c => c == ';')
If awk and echo are available (and they are, even on Windows):
echo "addr1;addr2;addr3...." | awk -F ";" "{print NF}"
Looping over the string with a while loop and counting the ';' characters is probably going to be the fastest, and the most readable.
Consider Konrad's suggestion too: it also effectively checks every char to see if it is a semicolon, but it does so by modifying the string (which may or may not involve a mutable copy, I don't know with Excel) and then comparing its length to the original string's.