Parsing Transact SQL with RegEx - regex

I'm quite inexperienced with RegEx - just an occasional straighforward RegEx for a programming task that I worked out by trial and error, but now I have a serious regEx challenge:
I have about 970 text files containing Sybase Transact SQL snippets, and I need to find every table name in those files and preface the table name with ' #'. So my options are to either spend a week editing the files by hand or write a script or application using regEx (Python 3 or Delphi-PRCE) that will perform this task.
The rules are as follows:
Table names are ALWAYS upperCase - so I'm only looking for upperCase
words;
Column names, SQL expressions and variables are ALWAYS lowerCase;
SQL keywords, Table aliases and column values CAN BE upperCase, but must NOT be prefixed with ' #';
Table aliases (must not be prefixed) will always have whiteSpace preceding them until the end of the
previous word, which will be a table name.
Column values (must not be prefixed) will either be numerical values or characters enclosed in
quotes.
Here is some sample text requiring application of all the above mentioned rules:
update SYBASE_TABLE
set ok = convert(char(10),MB.limit)
from MOVE_BOOKS MB, PEOPLEPLACES PPL
where MB.move_num = PPL.move_num
AND PPL.mot_ind = 'B'
AND PPL.trade_type_ind = 'P'
So far with I've gotten only this far: (not too far...)
(?-i)[[:upper:]]
Any help would be most appreciated.
TIA,
MN

This is not doable with a simple regex-replacement. You will not be able to make a distinction between upper case words that are tables, are string literals or are commented:
update TABLE set x='NOT_A_TABLE' where y='NOT TABLES EITHER'
-- AND NO TABLES HERE AS WELL
EDIT
You seem to think that determining if a word is inside a string literal or not is easy, then consider SQL like this:
-- a quote: '
update TABLE set x=42 where y=666
-- another quote: '
or
update TABLE set x='not '' A '''' table' where y=666
EDIT II
Okay, I (obsessively) hammered on the fact that a simple regex replacements is not doable. But I didn't offer a (possible) solution yet. What you could do is create some sort of "hybrid-lexer" based on a couple of different regex-es. What you do is scan through the input file and at the start of each character, try to match either a comment, a string literal, a keyword, or a capitalized word. And if none of these 4 previous patterns matched, then just consume a single character and repeat the process.
A little demo in Python:
#!/usr/bin/env python
import re
input = """
UPDATE SYBASE_TABLE
SET ok = convert(char(10),MB.limit) -- ignore me!
from MOVE_BOOKS MB, PEOPLEPLACES PPL
where MB.move_num = PPL.move_num
-- comment '
AND PPL.mot_ind = 'B '' X'
-- another comment '
AND PPL.trade_type_ind = 'P -- not a comment'
"""
regex = r"""(?xs) # x = enable inline comments, s = enable DOT-ALL
(--[^\r\n]*) # [1] comments
| # OR
('(?:''|[^\r\n'])*') # [2] string literal
| # OR
(\b(?:AND|UPDATE|SET)\b) # [3] keywords
| # OR
([A-Z][A-Z_]*) # [4] capitalized word
| # OR
. # [5] fall through: matches any char
"""
output = ''
for m in re.finditer(regex, input):
# append a `#` if group(4) matched
if m.group(4): output += '#'
# append the matched text (any of the groups!)
output += m.group()
# print the adjusted SQL
print output
which produces:
UPDATE #SYBASE_TABLE
SET ok = convert(char(10),#MB.limit) -- ignore me!
from #MOVE_BOOKS #MB, #PEOPLEPLACES #PPL
where #MB.move_num = #PPL.move_num
-- comment '
AND #PPL.mot_ind = 'B '' X'
-- another comment '
AND #PPL.trade_type_ind = 'P -- not a comment'
This may not be the exact output you want, but I'm hoping the script is simple enought for you to adjust to your needs.
Good luck.

Related

Combine two references into one or some serious regex work [duplicate]

This question already has an answer here:
Parse parameters and values of smarty-like string in PHP
(1 answer)
Closed 2 years ago.
Question, basically
If I have a regex ((key1)(value1)|(key2)(value2)), key1 is ref'd by $2 & key2 by $4. Is it possible to combine these into the same reference? (I'm guessing no)
Thus $7 might be key & $8 might be value, regardless of which capture group it originated in
Any regex masters who can solve the below? I've spent a couple hours on it and am kinda stuck.
I would like it to work across different regex engines with minimal modifications. Been testing with PCRE on regexr.com
What I'm doing
I'm trying to make a file format that is parsed into key/value pairs with a single regex.
There's just a few rules:
Keys are a string of characters at the start of a line, followed by a colon (:).
So far, I'm just using [a-z]+ for the keys, but that will be expanded to some more characters. I don't think that will functionally change the regex.
values can be multi-line
all white-space is trimmed from values
I don't think I've added this to the regex yet
Values end when another key begins
delimiters can be used to wrap values in the format key:DELIM: then the value, then :DELIM: on it's own line.
Delimiter can be an empty string, thus :: serves as a delimiter
The regex I have
Correctly matches non-delimited keys & values
([a-z]+):((?:(?:.|\n|\r)(?!^[a-z]+:))+)
Correctly matches delimited keys & values
([a-z]+):([A-Z]*:)((.|\r|\n)*)^:\2
Matches everything correctly, BUT requires two sets of references
(?:(?:([a-z]+):([A-Z]*:)((.|\r|\n)*)^:\2)|([a-z]+):((?:(?:.|\n|\r)(?!^[a-z]+:))+))
$1 & $5 are keys. $3 & $6 are values
Sample Input
key: value 1
nightmare:DELIM:
notakey:
obviously not a key
notakey:
:DELIM:
abc: value 2
new line
anotherkey:: value
nostring: on this one
::
Which would yield These key/value pairs
key
value1
nightmare
notakey:
obviously not a key
notakey:
abc
value 2
new line
anotherkey
value
nostring: on this one
My latest attempt
My latest attempt got me here, but it doesn't actually match anything:
^([a-z]+): # key CP#1
((?:[A-Z]*:)? # delimiter, optional
(?:\s*(\r?\n|$)) # whitespace, new line OR end of file (line?)
) # CP#2
( # value, CP#3
(?:(?:
(?:.|\n|\r) # characters we want
(?!^[a-z]+:) # But NOT if those characters make up a key
)+)
| # or
((.|\r|\n)*) # characters we want
^:\2 # Ends with delimiter
) # delimited value
Thanks to the commenter for the ?| operator, which turns out to be what I needed.
((key1)(value1)|(key2)(value2)) => (?|(key1)(value1)|(key2)(value2)).
(?|(?:([a-z]+):([A-Z]*:)((.|\r|\n)*)^:\2)|([a-z]+):()((?:(?:.|\n|\r)(?!^[a-z]+:))+)) basically does it, though the final product certainly still needs more work.

regular expression

I am trying to find a regular expression which should satisfy the following needs.
It should identify all space(s) as separators until a doublepoint is passed 2 times. After this pass, it should continue to use spaces as separators until a 3rd doublepoint is identified. This 3rd colon should be used as separator as well. But all spaces before and after this specific colon should not be used as separator. After this special doublepoint has been identified, no more separator should be found even its a space or a colon.
2019-12-28 13:00:00.112 DEBUG n-somethingspecial.at --- [9999-118684] 3894ß8349ß84930ßaa14e38eae18e3ebf c.w.f.w.NiceController : z rest as async texting: json, special character, spacses.....
I would like to have the separators her identified as following (Separator shown as X)
2019-12-28X13:00:00.112XDEBUGXn-somethingspecial.atX---X[9999-118684]X3894ß8349ß84930ßaa14e38eae18e3ebfXc.w.f.w.NiceControllerXz rest as async texting: json, special character, spacses.....
2019-12-28 X 13:00:00.112 X DEBUG X n-somethingspecial.at X --- X [9999-118684] X 3894ß8349ß84930ßaa14e38eae18e3ebf X c.w.f.w.NiceController X z rest as async texting: json, special character, spacses.....
Exactly 8 separtors are found here.
Any ideas how to do this via regular expression?
My current approach does not work as I tried to to this like the following
Any ideas about this?
Update:
(?<=\d{4}-\d{2}-\d{2})\s|(?<=\d{2}:\d{2}:\d{2}\.\d{3})\s|(?<=DEBUG)\s|(?<=\s---)\s|(?<=WARN)\s|(?<=ERROR)\s|(?<=\[[0-9a-z\#\.\-]{15}\])\s|((?<=\[[0-9a-z\#\.\-]{15}\]\s)\s|(?<=\[[0-9a-z\#\.\-]{15}\]\s[a-z0-9]{32})\s)|\s(?=---)|(?<=[a-zA-Z])\s+\:\s
That's my current syntax to identify the separators.
Update 2:
Regex above is faulty.
Update 3:
(?<=\d{4}-\d{2}-\d{2})\s|(?<=\d{2}:\d{2}:\d{2}\.\d{3})\s|(?:(?<=DEBUG)\s|(?<=WARN)\s|(?<=ERROR)\s|(?<=INFO)\s)|(?<=(?:p|t)-.{7}\-.{5}\.domain\.sys)\s|(?<=\s---)\s|(?<=\[[\s0-9a-z\#\.\-]{15}\])\s|(?:(?<=\[[\s0-9a-z\#\.\-]{15}\]\s)\s|(?<=[a-z0-9]{32})\s)|\s+\:\s(?<=[\sa-z]{1}\s{1}\:\s{1})
This is the current regex. Targetapproach is to call
df = pd.read_csv(file_name,
sep="(?<=\d{4}-\d{2}-\d{2})\s|(?<=\d{2}:\d{2}:\d{2}\.\d{3})\s|(?:(?<=DEBUG)\s|(?<=WARN)\s|(?<=ERROR)\s|(?<=INFO)\s)|(?<=(?:p|t)-.{7}\-.{5}\.domain\.sys)\s|(?<=\s---)\s|(?<=\[[\s0-9a-z\#\.\-]{15}\])\s|(?:(?<=\[[\s0-9a-z\#\.\-]{15}\]\s)\s|(?<=[a-z0-9]{32})\s)|\s+\:\s(?<=[\sa-z]{1}\s{1}\:\s{1})",
names=['date', 'time', 'level', 'host', 'template', 'threadid', 'logid', 'classmethods', 'line'],
engine='python',
nrows=100)
This could be extended later to dask which gives me the change to parse multiple log files in one dataframe.
The last column line is not identified correctly. For unknown reasons yet.
If that log format is sufficiently regular, you can take the lines apart much more easily with str.split.
The assumptions are that none of the first eight fields have an internal space, and that all of them are always present (or, if not all are present, that the last field, which starts after the colon, is also not present). You can then use the maxsplit argument to str.split in order to stop splitting when the ninth field starts:
def separate(logline):
fields = logline.split(maxsplit=8) # 8 space separate fields + the rest
if len(fields) > 8:
# Fix up the ninth field. Perhaps you want to remove the colon:
fields[8] = fields[8][1:]
# or perhaps you want the text starting at the first non-whitespace
# character after the colon:
#
# if fields[8][0] == ':':
# fields[8] = fields[8].split(maxsplit=1)[1]
#
# etc.
return fields
>>> logline = ( "2019-12-28 13:00:00.112 DEBUG n-somethingspecial.at"
... + " --- [9999-118684] 3894ß8349ß84930ßaa14e38eae18e3ebf"
... + " c.w.f.w.NiceController"
... + " : z rest as async texting: json, special character, spaces.....")
>>> separate(logline)
['2019-12-28', '13:00:00.112', 'DEBUG', 'n-somethingspecial.at', '---',
'[9999-118684]', '3894ß8349ß84930ßaa14e38eae18e3ebf',
'c.w.f.w.NiceController',
' z rest as async texting: json, special character, spaces.....']
Solution
The current outcome of my problem can be solved via the following regular expression.
(?:(?<=\d{4}-\d{2}-\d{2})\s|(?<=\d{2}:\d{2}:\d{2}\.\d{3})\s|(?:(?<=DEBUG)\s|(?<=WARN)\s|(?<=ERROR)\s|(?<=INFO)\s)|(?<=(?:p|t)-.{7}\-.{5}\.hostname\.sys)\s|(?<=\s---)\s|(?<=\[[\s0-9a-z\#\.\-]{15}\])\s|(?:(?<=\[[\s0-9a-z\#\.\-]{15}\]\s)\s|(?<=[a-z0-9]{32})\s))|\s+\:\s(?<=[\sa-z]{1}\s{1}\:\s{1})
Maybe minor adaptions have to be done maybe but for now it works pretty good.

How to select section in regular expression in linux commands

I have these lines that every line begin a word then equal and several sentence so I like select every section. For example:
delete = \account
user\
admin
admin right is good.
add = \
nothing
no out
input output is not good
edit = permission
bob
admin
alice killed bob!!!
I want to select a section for example:
add = \
nothing
no out
input output is not good
I like do it with regular expression.
Your question is a bit vague but you could try the following ...
/\s*(\w+) = ([^=]*\n)*/m
... subject to the requirement that the last section is terminated with \n.
this works by:
'\s*' matching some optional leading whitespace
'(\w+)' capturing the name of the section
' = ' matches the space equals space separator
'([^=]*\n)' it then captures a string that does not include an equals and ends with a newline
'*' and it does that last bit multiple times
The m flag is then required to set multi-line.
See the following to quickly see the groups that are output for each match ...
https://regex101.com/r/oDKSy9/1
(NOTE: The g flag will probably not be required depending on how you use the regex.)
Solution by OP.
I find this solution:
csplit -k fileName '/.*=/' '{*}'
Thanks #haggisandchips

Python script to extract data from text file

I have a text file which have some website list links like
test.txt:
http://www.site1.com/
http://site232546ee.com/
https://www.site3eiue213.org/
http://site4.biz/
I want to make a simple python script which can extract only site names with length of 8 characters... no name more than 8 characters.... the output should be like:
output.txt:
site1
site2325
site3eiu
site4
i have written some code:
txt1 = open("test.txt").read()
txt2 = txt1.split("http://www.")
f = open('output.txt', 'w')
for us in txt2:
f.write(us)
print './done'
but i don't know how to split() more than one command in one line ... i also tried it with import re module but don't able to know that how to write code for it.
can some one help me please to make this script. :(
you can achieve this using regular expression as below.
import re
no = 8
regesx = "\\bhttp://www.|\\bhttp://|\\bhttps://www."
text = "http://site232546ee.com/"
match = re.search(regesx, text)
start = match.end(0)
end = start+no
string1 = text[start:end]
end = string1.find('.')
if end > 0:
final = string1[0:end]
else:
final = string1
print(final)
You said you want to extract site names with 8 characters, but the output.txt example shows bits of domain names. If you want to filter out domain names which have eight or less characters, here is a solution.
Step 1: Get all the domain names.
import tldextract
import pandas as pd
text_s=''
list_u=('http://www.site1.com/','http://site232546ee.com/','https://www.site3eiue213.org/','http://site4.biz/')
#http:\//www.(\w+).*\/?
for l in list_u:
extracted = tldextract.extract(l)
text_s+= extracted.domain + ' '
print (text_s) #gives a string of domain names delimited by whitespace
Step 2: filter domain names with 8 or less characters.
word= text_s.split()
lent= [len(x) for x in text_s.split()]
word_len_list = pd.DataFrame(
{'words': word,
'char_length': lent,
})
word_len_list[(word_len_list.char_length <= 8)]
Output looks like this:
words char_length
0 site1 5
3 site4 5
Disclaimer: I am new to Python. Please ignore any unnecessary and/or stupid steps I may have written
Have you tried printing txt2 before doing anything with it? You will see that it did not do what (I expect) you wanted it to do, since there's only one "http://www." available in the text. Try to split at a newline \n. That way you get a list of all the urls.
Then, for each url you'll want to strip the front and back, which you can do with regular expression but which can be quite hard, depending on what you want to be able to strip off. See here.
When you have found a regular expression that works for you, simply check the domain for its length and write those domains to a file that satisfy your conditions using an if statement (if len(domain) <= 8: f.write(domain))

Extract root, month letter-year and yellow key from a Bloomberg futures ticker

A Bloomberg futures ticker usually looks like:
MCDZ3 Curcny
where the root is MCD, the month letter and year is Z3 and the 'yellow key' is Curcny.
Note that the root can be of variable length, 2-4 letters or 1 letter and 1 whitespace (e.g. S H4 Comdty).
The letter-year allows only the letter listed below in expr and can have two digit years.
Finally the yellow key can be one of several security type strings but I am interested in (Curncy|Equity|Index|Comdty) only.
In Matlab I have the following regular expression
expr = '[FGHJKMNQUVXZ]\d{1,2} ';
[rootyk, monthyear] = regexpi(bbergtickers, expr,'split','match','once');
where
rootyk{:}
ans =
'mcd' 'curncy'
and
monthyear =
'z3 '
I don't want to match the ' ' (space) in the monthyear. How can I do?
Assuming there are no leading or trailing whitespaces and only upcase letters in the root, this should work:
^([A-Z]{2,4}|[A-Z]\s)([FGHJKMNQUVXZ]\d{1,2}) (Curncy|Equity|Index|Comdty)$
You've got root in the first group, letter-year in the second, yellow key in the third.
I don't know Matlab nor whether it covers Perl Compatible Regex. If it fails, try e.g. with instead of \s. Also, drop the ^...$ if you'd like to extract from a bigger source text.
The expression you're feeding regexpi with contains a space and is used as a pattern for 'match'. This is why the matched monthyear string also has a space1.
If you want to keep it simple and let regexpi do the work for you (instead of postprocessing its output), try a different approach and capture tokens instead of matching, and ignore the intermediate space:
%// <$1><----------$2---------> <$3>
expr = '(.+)([FGHJKMNQUVXZ]\d{1,2}) (.+)';
tickinfo = regexpi(bbergtickers, expr, 'tokens', 'once');
You can also simplify the expression to a more genereic '(.+)(\w{1}\d{1,2})\s+(.+)', if you wish.
Example
bbergtickers = 'MCDZ3 Curncy';
expr = '(.+)([FGHJKMNQUVXZ]\d{1,2})\s+(.+)';
tickinfo = regexpi(bbergtickers, expr, 'tokens', 'once');
The result is:
tickinfo =
'MCD'
'Z3'
'Curncy'
1 This expression is also used as a delimiter for 'split'. Removing the trailing space from it won't help, as it will reappear in the rootyk output instead.
Assuming you just want to get rid of the leading and or trailing spaces at the edge, there is a very simple command for that:
monthyear = trim(monthyear)
For removing all spaces, you can do:
monthyear(isspace(monthyear))=[]
Here is a completely different approach, basically this searches the letter before your year number:
s = 'MCDZ3 Curcny'
p = regexp(s,'\d')
s(min(p)
s(min(p)-1:max(p))