I have many SQL files. I am trying to locate the files that contain a variable (in the format #varname), but ONLY if it appears within matching single or double quotes. I don't need the matches themselves; I just need to know which files this occurs in.
I can match all the quoted strings, but I can't figure out how to test that at least one # char appears within the match.
Matching single and double quote pairs: (["'])(.*?)\1
example file:
...sql statements
select #sql = 'select * from
Users where id = #id '
...more sql statements
Thanks in advance.
EDIT
Here is a better example file, with SQL comments marking which statements should match and which shouldn't:
...sql statements
-- only this quoted string would match
select #sql = "select * from
Users where id = #id "
-- other statements that wouldn't match because not in a pair of quotes
if ltrim(isnull(#stat,'')) <> '' and #stat <> '""'
begin
select #sql = #sql + " and Stat in ("+#stat+")"
end
if isnull(#atype,'') <> ''
begin
select #sql = #sql + " and Type in ("+#atype+")"
end
...more sql statements
For the sample text given....
Try:
(?:\s|=)(?:\"[^"]*#[^"]*\"|(?:\s|=)\'[^']*#[^']*\')
Demo:
https://regex101.com/r/BXcYt4/1
Using PCRE, and
to test that even just a single # char appears within the match
you can use an alternation with one branch per quote type. Each branch uses negated character classes that exclude its own quote character and also #, so the # between the two classes has to be matched explicitly.
To get both values in the same group, you can use a branch reset group.
=\h*(?|"([^"#]*#[^"#]+)"|'([^#']*#[^'#]*)')
The pattern matches:
=\h* Match = and optional horizontal whitespace chars
(?| Branch reset group
"( Match " and start group 1
[^"#]*# Match optional chars other than " or # and then match #
[^"#]+ Match 1+ chars other than " or #
)" Close group 1 and match "
| Or
'([^#']*#[^'#]*)' The same as previous pattern, this time for '
) Close branch reset group
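For readers without PCRE, here is a minimal Python sketch of the same idea. It is an adaptation, not the pattern above verbatim: Python's re module has no branch reset group (?|...) and no \h, so this version uses two ordinary capture groups and [ \t]* instead. The function name find_quoted_vars and the sample text are illustrative assumptions.

```python
import re

# Python port of the pattern above: \h* becomes [ \t]*, and the branch
# reset group becomes two ordinary groups (one per quote type).
QUOTED_VAR = re.compile(
    r"""=[ \t]*(?:"([^"#]*#[^"#]+)"|'([^#']*#[^'#]*)')"""
)


def find_quoted_vars(sql):
    """Return the contents of quoted strings that follow '=' and contain '#'."""
    return [m.group(1) or m.group(2) for m in QUOTED_VAR.finditer(sql)]


# Sample text modelled on the question's example file.
sample = '''select #sql = "select * from
Users where id = #id "
if ltrim(isnull(#stat,'')) <> '' and #stat <> '""'
begin
select #sql = #sql + " and Stat in ("+#stat+")"
end
'''

hits = find_quoted_vars(sample)
print(len(hits))
```

To list matching files, one could read each file's text and keep the file names for which find_quoted_vars returns a non-empty list.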
Related
I have a long string in BigQuery from which I need to extract some data.
Part of the string looks like this:
... source: "agent" resolved_query: "hi" score: 0.61254 parameters ...
I want to extract out data such as agent, hi, and 0.61254.
I'm trying to use regexp_extract but I can't get the regexp to work correctly:
select
regexp_extract([col],r'score: [0-9]*\.[0-9]+') as score,
regexp_extract([col],r'source: [^"]*') as source
from [table]
What should the regexp be to just get agent or 0.61254 without the field name and no quotation marks?
Thank you in advance.
I love non-trivial approaches; below is one of them:
select * except(col) from (
select col, split(kv, ': ')[offset(0)] key,
trim(split(kv, ': ')[offset(1)], '"') value,
from your_table,
unnest(regexp_extract_all(col, r'\w+: "?[\w.]+"?')) kv
)
pivot (min(value) for key in ('source', 'resolved_query', 'score'))
if applied to sample data as in your question
with your_table as (
select '... source: "agent" resolved_query: "hi" score: 0.61254 parameters ... ' col union all
select '... source: "agent2" resolved_query: "hello" score: 0.12345 parameters ... ' col
)
the output is one row per input string, with the columns source, resolved_query and score.
As you might have noticed, the benefit of this approach is that if you have more fields/attributes to extract, you do not need to clone a line of code for each attribute. You just add another value to the list on the last line; the rest of the code stays the same.
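The same "collect every key/value pair, then pick the keys you want" idea can be sketched outside BigQuery. The Python below is only an illustration under stated assumptions: extract_fields is a hypothetical helper, and unlike min(value) in the pivot it keeps the last value seen for a repeated key.

```python
import re

# Same pair pattern as in the query: a word, ': ', then an optionally
# quoted run of word characters or dots.
PAIR = re.compile(r'\w+: "?[\w.]+"?')


def extract_fields(text, wanted):
    """Collect key/value pairs from text and keep only the wanted keys."""
    pairs = {}
    for kv in PAIR.findall(text):
        key, value = kv.split(': ', 1)
        pairs[key] = value.strip('"')  # last occurrence wins (assumption)
    return {key: pairs.get(key) for key in wanted}


row = '... source: "agent" resolved_query: "hi" score: 0.61254 parameters ... '
fields = extract_fields(row, ['source', 'resolved_query', 'score'])
print(fields)
```

Extracting another attribute only means adding its name to the list passed as wanted.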
You can use
select
regexp_extract([col],r'score:\s*(\d*\.?\d+)') as score,
regexp_extract([col],r'resolved_query:\s*"([^"]*)"') as resolved_query,
regexp_extract([col],r'source:\s*"([^"]*)"') as source
from [table]
Here,
score:\s*(\d*\.?\d+) matches score: string, then any zero or more whitespaces, and then there is a capturing group with ID=1 that captures zero or more digits, an optional . and then one or more digits
resolved_query:\s*"([^"]*)" matches a resolved_query: string, zero or more whitespaces, ", then captures into Group 1 any zero or more chars other than " and then matches a " char
source:\s*"([^"]*)" matches a source: string, zero or more whitespaces, ", then captures into Group 1 any zero or more chars other than " and then matches a " char.
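As a quick way to try these three patterns outside BigQuery, here is a small Python sketch. It relies on the same behaviour described above (the value of interest is whatever Group 1 captures); the helper first_group and the dictionary layout are illustrative assumptions.

```python
import re

text = '... source: "agent" resolved_query: "hi" score: 0.61254 parameters ...'

# The three patterns from the query above, unchanged.
PATTERNS = {
    'score': r'score:\s*(\d*\.?\d+)',
    'resolved_query': r'resolved_query:\s*"([^"]*)"',
    'source': r'source:\s*"([^"]*)"',
}


def first_group(pattern, s):
    """Return the text captured by Group 1, or None if there is no match."""
    m = re.search(pattern, s)
    return m.group(1) if m else None


extracted = {name: first_group(p, text) for name, p in PATTERNS.items()}
print(extracted)
```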
I have to find and remove a substring from the text using regexp in PostgreSQL. The substring corresponds to the condition: <any text between double-quotes containing for|while inside>
Example
Text:
PERFORM xxkkcsort.f_write_log("INFO", "xxkkcsort.f_load_tables__outof_order_system", " Script for data loading: ", false, v_sql, 0);
So, my purpose is to find and remove the substring "Script for data loading: ".
When I tried to use the script below:
SELECT regexp_replace(
'PERFORM xxkkcsort.f_write_log("INFO", "xxkkcsort.f_load_tables__outof_order_system", "> Table for loading: "||cc.source_table_name , false, null::text, 0);'
, '(\")(.*(for|while)(\s).*)(\")'
, '');
Everything from the first double quote to the last one gets replaced. The result looks like:
PERFORM xxkkcsort.f_write_log(||cc.source_table_name , false, null::text, 0);
What's a proper regular expression to solve the issue?
You can use
SELECT regexp_replace(
'PERFORM xxkkcsort.f_write_log("INFO", "xxkkcsort.f_load_tables__outof_order_system", "> Table for loading: "||cc.source_table_name , false, null::text, 0);',
'"[^"]*(for|while)\s[^"]*"',
'') AS Result;
Output:
PERFORM xxkkcsort.f_write_log("INFO", "xxkkcsort.f_load_tables__outof_order_system", ||cc.source_table_name , false, null::text, 0);
Details:
" - a " char
[^"]* - zero or more chars other than "
(for|while) - for or while
\s - a whitespace
[^"]*" - zero or more chars other than " and then a " char.
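The pattern can be checked quickly in Python as well. This is only a sketch: Python's re.sub replaces every match by default, so count=1 is passed here to mirror regexp_replace called without the 'g' flag, which replaces only the first match.

```python
import re

stmt = ('PERFORM xxkkcsort.f_write_log("INFO", '
        '"xxkkcsort.f_load_tables__outof_order_system", '
        '"> Table for loading: "||cc.source_table_name , false, null::text, 0);')

# [^"]* cannot cross a double quote, so the match stays inside one string.
cleaned = re.sub(r'"[^"]*(for|while)\s[^"]*"', '', stmt, count=1)
print(cleaned)
```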
any text between double-quotes containing for|while inside
SELECT regexp_replace(string, '"[^"]*\m(?:for|while)\M[^"]*"', '');
" ... literal " (no special meaning here, so no need to escape it)
[^"]* ... character class including all characters except ", 0-n times
\m ... beginning of a word
(?:for|while) ... two branches in non-capturing parentheses
(regexp_replace() works with simple capturing parentheses, too, but it's cheaper this way since you don't use the captured substring. But try either with the replacement '\1', where it makes a difference ...)
\M ... end of a word
[^"]* ... like above
" ... like above
I dropped \s from your expression, as the task description does not strictly require a white-space character (end of string or punctuation delimiting the word ...).
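The effect of the word boundaries can be illustrated with a small Python sketch. Python has no \m and \M, so \b is used here as an approximation; the two sample strings are made up for the comparison.

```python
import re

# Variant that requires whitespace after the keyword.
WITH_SPACE = re.compile(r'"[^"]*(?:for|while)\s[^"]*"')
# Variant with word boundaries (\b stands in for PostgreSQL's \m and \M).
WHOLE_WORD = re.compile(r'"[^"]*\b(?:for|while)\b[^"]*"')

inside_word = 'log("meanwhile done", 1)'  # "while" is only part of a word
at_end = 'log("wait for", 1)'             # "for" is followed by the quote

print(bool(WITH_SPACE.search(inside_word)), bool(WHOLE_WORD.search(inside_word)))
print(bool(WITH_SPACE.search(at_end)), bool(WHOLE_WORD.search(at_end)))
```

The whitespace variant accepts "meanwhile done" and misses "wait for"; the word-boundary variant does the opposite.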
Related:
Escape function for regular expression or LIKE patterns
How can I create a regex that finds the string "greet" and then, inside that same method, the string "name"?
I tried
(^.*greet(\n|.|\t)*)(.*name*)
def greet(name):
    print("Hello, " + name + ". Good morning!") <--- this name should be selected
def meet(name):
    print("Lets meet, " + name )
I would use this regex:
greet([^\n]|\n+[^\S\n])*name
Here the strings greet and name are separated by characters that are either not a linebreak ([^\n]) or one or more linebreaks followed by a whitespace character that is not a linebreak (\n+[^\S\n]), i.e. the start of an indented line. This ensures that name is in the same method as greet.
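A short Python sketch of how this behaves, assuming the method body is indented with spaces as in ordinary Python source (the pattern relies on that indentation to tell a body line from the next def):

```python
import re

source = (
    'def greet(name):\n'
    '    print("Hello, " + name + ". Good morning!")\n'
    'def meet(name):\n'
    '    print("Lets meet, " + name)\n'
)

# A newline is only crossed when the next line starts with whitespace,
# so the match cannot run on into the unindented "def meet" line.
SAME_METHOD = re.compile(r'greet([^\n]|\n+[^\S\n])*name')
match = SAME_METHOD.search(source)
print(repr(match.group(0)))
```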
You can capture in a group what is between the parentheses, and use a backreference \1 on the next line to match the same text.
If you want to select it, you could also capture that in a group itself.
\bdef greet\(([^\s()]+)\):\r?\n.*(\1)
If it should be name only
\bdef greet\([^\s()]+\):\r?\n.*\b(name)\b
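The backreference variant can be tried in Python as follows. This is a sketch on an assumed, indented version of the sample; group 1 holds the parameter name and group 2 marks where it is used on the following line.

```python
import re

source = (
    'def greet(name):\n'
    '    print("Hello, " + name + ". Good morning!")\n'
    'def meet(name):\n'
    '    print("Lets meet, " + name)\n'
)

# Group 1 captures the parameter; \1 must match the same text on the next line.
BACKREF = re.compile(r'\bdef greet\(([^\s()]+)\):\r?\n.*(\1)')
m = BACKREF.search(source)

param = m.group(1)
start, end = m.span(2)  # position of the usage inside greet's body
print(param, source[start:end])
```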
I want to match a statement only if it contains certain words. Below is a piece of MySQL code embedded in PHP, and I want to match an SQL statement if it has 4 or more JOIN keywords in it. The match should start at the equals sign and end at the semicolon.
Below is the PHP code.
My regex is not working for the source file below: it selects many lines, although only one query actually has four joins.
<?php
$query = "SELECT users.id, com.com, com.dt ".
"FROM com, users ".
"WHERE com.user_id = users.user_id ".
"LIMIT 10";
$query = "SELECT cd.id, sf.cap
la.val as last_sort
FROM cd AS cd
JOIN ci AS ci
ON cd.id = ci.id
JOIN sf AS sf
ON ci.field_id = sf.f_id
JOIN ci as name
ON cd.ci = ln.c_id
JOIN ca as ad
ON cd.cid = ad.c_id
AND ln.fld = 7
WHERE cd.progress = 1"
$result = mysql_quy($queryr));
?>
I tried the regex below, and it always stops at the last JOIN keyword.
=[[:space:]]*"[[:space:]]*SELECT[a-zA-Z0-9\"\,\.\= \n\_\*]+JOIN
But the problem with the above regex is that it also selects queries that have only 2 JOIN keywords.
To require four JOINs, apply the quantifier {4} to a group that ends in JOIN, and end the regex with a ;.
You can use this regex:
$re = '/=\s*"\s*SELECT\s+(?:[^;]+?JOIN){4}[^;]+?;/i';
i.e. match =, optional whitespace, a double quote and optional whitespace, then SELECT followed by 1 or more whitespace characters. The group (?:[^;]+?JOIN) matches 1 or more non-semicolon characters up to the next JOIN, {4} repeats that group 4 times, and [^;]+?; matches up to the next ;.
Please note that ; must not appear inside the quoted text.
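Here is a small Python sketch of the same pattern, run against made-up PHP-like text in which every assignment ends with a semicolon (the pattern depends on that). The variable names $q1 to $q3 and the table names are assumptions for illustration.

```python
import re

# Same pattern as above; re.IGNORECASE plays the role of the /i flag.
FOUR_JOINS = re.compile(
    r'=\s*"\s*SELECT\s+(?:[^;]+?JOIN){4}[^;]+?;', re.IGNORECASE
)

php_src = '''
$q1 = "SELECT users.id FROM com, users WHERE com.user_id = users.user_id LIMIT 10";
$q2 = "SELECT a.id FROM a JOIN b ON a.id = b.id JOIN c ON b.id = c.id";
$q3 = "SELECT cd.id FROM cd
    JOIN ci ON cd.id = ci.id
    JOIN sf ON ci.f_id = sf.f_id
    JOIN ln ON cd.ci = ln.c_id
    JOIN ad ON cd.cid = ad.c_id
    WHERE cd.progress = 1";
'''

# Only the statement with four JOINs before its semicolon should match.
matches = FOUR_JOINS.findall(php_src)
print(len(matches))
```

Because [^;] cannot cross a semicolon, the JOINs of one statement are never counted together with those of the next.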
In my java file there are many sql queries assigned to java strings like:
/* ... */
String str1 = "SELECT item1, item2 from table1 where this=that and MYWORD=that2 and this3=that3";
/* ... */
String str2 = "SELECT item1, item2 from table1 where this=that and" +
" MYWORD=that2 and this3=that3 and this4=that4";
/* ... */
/* ... */
String str3 = "SELECT item1, item2 from table2 where this=that and this2=that2 and" +
" this3=that3 and this4=that4";
/* ... */
String str4 = "SELECT item1, item2 from table3 where this=that and MYWORD=that2" +
" and this3=that3 and this4=that4";
/* ... */
String str5 = "SELECT item1, item2 from table4 where this=that and this2=that2 and this3=that3";
/* ... */
Now I want to find the 'SELECT...' queries that don't contain the word 'MYWORD'.
From one of my previous S/O question I got the answer how to find all the 'SELECT...' queries, but I need to extend that solution to find the ones that doesn't contain certain word.
I have tried the regex SELECT(?!.*MYWORD).*; but it can't find the multiline queries (like str3 above); it finds only the single-line ones.
I've also tried the regex SELECT[\s\S]*?(?!MYWORD).*(?<=;)$ but it finds all the queries and can't tell whether the word 'MYWORD' is present in a query or not.
I know I'm very near to the solution, still can't figure it out.
Can anyone help me, please?
(I am using notepad++ on windows)
The problem with the first regex is that . doesn't match a newline. In normal regexes, there is an option to change that, but I don't know whether that feature exists in notepad++.
The problem with the second regex is that it matches "SELECT, then some stuff, then anything that doesn't match MYWORD, then more stuff, then a semicolon". Even if MYWORD exists, the regex engine will happily match (?!MYWORD) at some other position in the string that is not MYWORD.
Something like this should work (caveat: not tested on Notepad++):
SELECT(?![^;]*MYWORD)[^;]*;
Instead of ., match anything that is not a semicolon. This should allow you to match a newline.
Beyond that, it is also important that you don't allow a semicolon to be part of the match. Otherwise, the pattern can expand to gobble up multiple SELECT statements as it tries to match.
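The behaviour of this pattern can be checked outside Notepad++ with a short Python sketch. The Java-like sample is adapted from the question; treat the snippet as an illustration of the lookahead, not as a statement about Notepad++ itself.

```python
import re

java_src = '''
String str1 = "SELECT item1, item2 from table1 where this=that and MYWORD=that2 and this3=that3";
String str3 = "SELECT item1, item2 from table2 where this=that and this2=that2 and" +
        " this3=that3 and this4=that4";
String str4 = "SELECT item1, item2 from table3 where this=that and MYWORD=that2" +
        " and this3=that3 and this4=that4";
String str5 = "SELECT item1, item2 from table4 where this=that and this2=that2 and this3=that3";
'''

# The lookahead scans up to the next semicolon and rejects the statement
# if MYWORD occurs anywhere before it; [^;] also matches newlines.
NO_MYWORD = re.compile(r'SELECT(?![^;]*MYWORD)[^;]*;')
matches = NO_MYWORD.findall(java_src)
print(len(matches))
```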
Try this (on a current version of Notepad++ using Perl-compatible regexes; older versions don't support multiline regexes):
SELECT (?:(?!MYWORD)[^"]|"\s*\+\s*")*"\s*;
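As a quick check of the pattern above, here is a Python sketch on a Java-like sample adapted from the question. It only shows how the regex behaves in Python's re module; Notepad++ behaviour is not verified here.

```python
import re

java_src = '''
String str2 = "SELECT item1, item2 from table1 where this=that and" +
        " MYWORD=that2 and this3=that3 and this4=that4";
String str3 = "SELECT item1, item2 from table2 where this=that and this2=that2 and" +
        " this3=that3 and this4=that4";
String str5 = "SELECT item1, item2 from table4 where this=that and this2=that2 and this3=that3";
'''

# Walk through the string literal one character at a time, refusing to step
# onto MYWORD, and allow a '" + "' joint between concatenated pieces.
PIECEWISE = re.compile(r'SELECT (?:(?!MYWORD)[^"]|"\s*\+\s*")*"\s*;')
matches = [m.group(0) for m in PIECEWISE.finditer(java_src)]
print(len(matches))
```

The multiline query without MYWORD (str3) is found, while the one whose second piece contains MYWORD (str2) is skipped.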
Explanation:
SELECT # Match SELECT
(?: # Match either...
(?!MYWORD) # (as long as it's not the word MYWORD)
[^"] # any character except a quote
| # or
"\s* # an ending quote, optional whitespace,
\+\s* # a plus sign, optional whitespace (including newlines),
" # and another opening quote.
)* # Repeat as needed.
"\s*; # Match a closing quote, optional whitespace, and a semicolon.