Extract string after last match strings [duplicate] - regex

I'm using BigQuery and I want to extract the part of a string after the last occurrence of a specific match string; in my case, the match string is sc
I have a string like this:
www.xxss.com?psct=T-EST2%20.coms&.com/u[sc'sc(mascscin', sc'.c(scscossccnfiscg.scjs']-/ci=1(sctitis)
My expected result is:
titis)
Is this possible?

In general, across all RDBMSs, the index of the last instance of a match in a string is easy to compute by first reversing the string; then we only need to look for the first match.
Update: BigQuery
Follow the documentation for REGEXP_EXTRACT in BigQuery's String Functions reference.
NOTE: BigQuery provides regular expression support using the re2 library; see that documentation for its regular expression syntax.
However, this problem can be solved without RegEx.
BigQuery supports array processing and has a SPLIT function, so you could split by the lookup variable and capture only the last result:
SELECT ARRAY_REVERSE(SPLIT( !YOUR COLUMN HERE! , "sc"))[OFFSET(0)]
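Note that OFFSET is zero-based, so OFFSET(0) picks the first element of the reversed array, i.e. the last piece of the original string. For example, applied to the literal from the question (a sketch; substitute your own column in practice), this should return titis):
SELECT ARRAY_REVERSE(SPLIT(
  "www.xxss.com?psct=T-EST2%20.coms&.com/u[sc'sc(mascscin', sc'.c(scscossccnfiscg.scjs']-/ci=1(sctitis)",
  "sc"))[OFFSET(0)] AS after_last_sc
-- expected result: titis)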
The following adaptation from my original submission may still work:
SELECT REVERSE(SUBSTR(REVERSE(@text), 1, STRPOS(REVERSE(@text), "cs") - 1))
For those who have a similar requirement in MS SQL Server, the following syntax can be used.
Other RDBMSs can use a similar query; you will have to use the appropriate platform functions to achieve the result.
DECLARE @text varchar(200) = 'www.xxss.com?psct=T-EST2%20.coms&.com/u[sc''sc(mascscin'', sc''.c(scscossccnfiscg.scjs'']-/ci=1(sctitis)'
SELECT REVERSE(LEFT(REVERSE(@text), CharIndex('cs', REVERSE(@text), 1) - 1))
Produces: titis)
You could achieve a similar result by obtaining the last index of 'sc' as above and using that value in a SUBSTRING; however, for that to work you would need to re-compute the length. This solution instead uses the LEFT function and then REVERSEs the result, reducing the functional complexity of the query by one function call.
Step this through:
Reverse the value:
SELECT REVERSE(@text)
Results in:
)sititcs(1=ic/-]'sjcs.gcsifnccssocscs(c.'cs ,'nicscsam(cs'cs[u/moc.&smoc.02%2TSE-T=tcsp?moc.ssxx.www
Now we find the first Index of 'cs'
Note: we have to reverse the lookup string as well!
SELECT CharIndex('cs', REVERSE(@text), 1)
Result: 7
Select the characters before this index:
Note: we subtract 1 here because CharIndex returns a 1-based index, and we only want the characters before the match.
SELECT LEFT(REVERSE(@text), CharIndex('cs', REVERSE(@text), 1) - 1)
Finally, we reverse the result:
SELECT REVERSE(LEFT(REVERSE(@text), CharIndex('cs', REVERSE(@text), 1) - 1))
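For comparison, a SUBSTRING-based variant might look like the following sketch; the start position has to be recomputed from the reversed index, which is the extra arithmetic the LEFT/REVERSE version avoids:
SELECT SUBSTRING(@text, LEN(@text) - CharIndex('cs', REVERSE(@text), 1) + 2, LEN(@text))
-- also produces: titis)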

Guess you could use 'sc' as the separator and define the string length (if it is constant) in your query (wildcard):
STRING_SPLIT ( string , separator )
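A sketch of that idea in SQL Server follows. Assumptions: SQL Server 2022+ (for STRING_SPLIT's ordinal argument), and, because STRING_SPLIT only accepts a single-character separator, the two-character marker 'sc' is first collapsed to an unused control character:
DECLARE @text varchar(200) = 'www.xxss.com?psct=T-EST2%20.coms&.com/u[sc''sc(mascscin'', sc''.c(scscossccnfiscg.scjs'']-/ci=1(sctitis)'
SELECT TOP (1) value
FROM STRING_SPLIT(REPLACE(@text, 'sc', CHAR(7)), CHAR(7), 1)
ORDER BY ordinal DESC
-- produces: titis)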

Related

Extract Specific Parameter value using Regex Postgresql

Given input string as
'PARAM_1=TRUE,THRESHOLDLIST=kWh,2000,Gallons,1000,litre,3000,PARAM_2=TRUE,PARAM_3=abc,123,kWh,800,Gallons,500'
and unit_param = 'Gallons'
I need to extract the value of unit_param (Gallons), which is 1000, using PostgreSQL regex functions.
As of now, I have a function that first extracts value for THRESHOLDLIST which is "kWh,2000,Gallons,1000,litre,3000", then splits and loops over the array to get the value.
Can I get this efficiently using regex?
SELECT substring('PARAM_1=TRUE,THRESHOLDLIST=kWh,2000,Gallons,1000,litre,3000,PARAM_2=TRUE,PARAM_3=abc,123,xyz' FROM '%THRESHOLDLIST=#".........#",%' FOR '#')
Use substring() with the target input grouped:
substring(myCol, 'THRESHOLDLIST=[^=]*Gallons,([0-9]+)')
The expression [^=]* means “characters that are not =”, so it won’t match Gallons within another parameter.
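For example, against the sample input from the question, this should return just the number:
SELECT substring('PARAM_1=TRUE,THRESHOLDLIST=kWh,2000,Gallons,1000,litre,3000,PARAM_2=TRUE,PARAM_3=abc,123,kWh,800,Gallons,500', 'THRESHOLDLIST=[^=]*Gallons,([0-9]+)');
-- returns: 1000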
select
Substring('PARAM_1=TRUE,THRESHOLDLIST=kWh,2000,Gallons,1000,litre,3000,PARAM_2=TRUE,PARAM_3=abc,123,xyz' from 'Gallons,\d*');
returns Gallons,1000

regular expression replace for SQL

I have to replace a string pattern in SQL with an empty string; could anyone please suggest how?
Input String 'AC001,AD001,AE001,SA001,AE002,SD001'
Output String 'AE001,AE002'
The codes consist of 2 alphabetic characters followed by 3 digits, so each code is always 5 characters long. I have to replace all codes except the ones starting with "AE".
I can have 0 or more instances of "AE" codes in the string. The final output should be a comma-separated string of the "AE" codes, as shown above.
Here is one option calling regexp_replace multiple times, eliminating the "not required" strings little by little in each iteration to arrive at the required output.
SELECT regexp_replace(
         regexp_replace(
           regexp_replace(
             'AC001,AD001,AE001,SA001,AE002,SD001', '(?<!AE)\d{3},{0,1}', 'X', 'g'
           ), '..X', '', 'g'
         ), ',$', '', 'g'
       )
See Demo here
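To see how each level of nesting whittles the string down, here are the intermediate values, traced by hand from the patterns above:
SELECT regexp_replace('AC001,AD001,AE001,SA001,AE002,SD001', '(?<!AE)\d{3},{0,1}', 'X', 'g'); -- ACXADXAE001,SAXAE002,SDX
SELECT regexp_replace('ACXADXAE001,SAXAE002,SDX', '..X', '', 'g'); -- AE001,AE002,
SELECT regexp_replace('AE001,AE002,', ',$', '', 'g'); -- AE001,AE002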
I would convert the list to an array, unnest that into rows, keep only the rows that match, and then aggregate the result back into a string:
select string_agg(x.t, ',')
from unnest(string_to_array('AC001,AD001,AE001,SA001,AE002,SD001', ',')) as x(t)
where x.t like 'AE%'; --<< only keep those
This is independent of the number of elements in the string and can easily be extended to support more complex conditions.
This is a good example of why storing comma-separated values in a single column is not such a good idea to begin with.

Extract text up to the Nth character in a string

How can I extract the text up to the 4th instance of a character in a column?
I'm selecting text out of a column called filter_type up to the fourth > character.
To accomplish this, I've been trying to find the position of the fourth > character, but it's not working:
select substring(filter_type from 1 for position('>' in filter_type))
You can use the pattern matching function in Postgres.
First figure out a pattern to capture everything up to the fourth > character.
To start your pattern you should create a sub-group that captures non > characters, and one > character:
([^>]*>)
Then capture that four times to get to the fourth instance of >
([^>]*>){4}
Then, you will need to wrap that in a group so that the match brings back all four instances:
(([^>]*>){4})
and put a start-of-string anchor at the front for good measure, to make sure it only matches from the beginning of the string (not from the middle):
^(([^>]*>){4})
Here's a working regex101 example of that!
Once you have a pattern that returns what you want in the first group element (which you can verify in the right-hand panel of the online regex tester), you need to select it back in the SQL.
In Postgres, the substring function can take a regex pattern to extract text from the input, using a 'from' clause inside the call.
To finish, put it all together!
select substring(filter_type from '^(([^>]*>){4})')
from filter_table
See a working sqlfiddle here
If you want to match the entire string whenever there are less than four instances of >, use this regular expression:
^(([^>]*>){4}|.*)
You can also use a simple, non-regex solution:
SELECT array_to_string((string_to_array(filter_type, '>'))[1:4], '>')
The above query:
splits your string into an array, using '>' as the delimiter
selects only the first 4 elements
transforms the array back to a string
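For instance, on a made-up value (not from the original post) with five '>' separators:
SELECT array_to_string((string_to_array('a>b>c>d>e', '>'))[1:4], '>');
-- returns: a>b>c>d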
substring(filter_type from '^(([^>]*>){4})')
This form of substring lets you extract the portion of a string that matches a regex pattern.
You can also split the string, then choose the N'th element inside the result list. For example:
SELECT SPLIT_PART('aa,bb,cc', ',', 2)
will return: bb.
This function is defined as:
SPLIT_PART(string, delimiter, position)
In order to look at this problem, I did the following (all of the code below is available on the fiddle here):
CREATE TABLE s
(
a TEXT
);
I then created a SQL function to generate random strings, as follows.
CREATE FUNCTION f() RETURNS TEXT LANGUAGE SQL AS
$$
SELECT STRING_AGG(SUBSTR('abcdef>', CEIL(RANDOM() * 7)::INTEGER, 1), '')
FROM GENERATE_SERIES(1, 40)
$$;
I got the code from here and modified it so that it would produce strings with lots of > characters for testing purposes.
I then manually inserted a few strings at the beginning so that a quick look would tell me if the code was working as anticipated.
INSERT INTO s VALUES
('afsad>adfsaf>asfasf>afasdX>asdffs>asfdf>'),
('23433>433453>4>4559>455>3433>'),
('adfd>adafs>afadsf>'), -- only 3 '>'s!
('babedacfab>feaefbf>fedabbcbbcdcfefefcfcd'),
('e>>>>>'), -- edge case - multiple terminal '>'s
('aaaaaaa'); -- edge case - no '>'s whatsoever
The reason I put in the records with fewer than 4 >s is because the accepted answer (see discussion at the end of this answer) puts forward a solution which should return the entire string if this is the case!
On the fiddle, I then added 50,000 records as follows:
INSERT INTO s
SELECT f() FROM GENERATE_SERIES(1, 50000);
I also created a table s on a home laptop (16GB RAM, 500GB NVMe SSD) and populated it with 40,000,000 (40M) records - times also shown.
Now, my reading of the question is that we need to extract the string up to but not including the 4th > character.
The first solution (from treecon) was this one. (I also show the queries running on the fiddle, but to save space here I've only included partial output of EXPLAIN (ANALYZE, BUFFERS, VERBOSE).) The times shown are typical over a few runs:
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT
ARRAY_TO_STRING((STRING_TO_ARRAY(a, '>'))[1:4], '>'),
a
FROM s;
Result (only key parts included):
Seq Scan on public.s
Execution Time: 81.807 ms
40M Time: 46 seconds
A regex solution which works (significantly faster):
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT
SUBSTRING(a FROM '^(?:[^>]*>){0,3}[^>]*'),
a
FROM s;
Result:
Seq Scan on public.s
Execution Time: 74.757 ms
40M Time: 32 seconds
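As a quick sanity check on one of the seeded rows, the pattern keeps everything up to, but not including, the 4th '>' (output traced by hand from the pattern above):
SELECT SUBSTRING('23433>433453>4>4559>455>3433>' FROM '^(?:[^>]*>){0,3}[^>]*');
-- returns: 23433>433453>4>4559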
The accepted answer fails on many levels (see the fiddle). It leaves a > at the end and fails on various strings even when modified. Also, the solution proposed to include strings with fewer than 4 >s (i.e. ^(([^>]*>){4}|.*)) merely returns the original string (see end of fiddle).

Does mongodb $regex without the option `i` still make use of the index if I am searching on the Index?

I have a model with a normal index using Mongoose.
const mod = new mongoose.Schema({
  number: { type: String, required: true, index: { unique: true } },
});
I am using a regex in a query to get the mod corresponding to a specific number. Will my regex query utilize the index that is on this model?
query.number = {
  $regex: `.*Q10.*`
}
modelName.find(query)
I am concerned that this is looking through the entire collection without using the indexes. What would be the best way to know if I am using the index? Or, if you happen to know a way that will utilize the index, could you show me? Here I am looking for anything close to Q10, not trying to get an exact match. Would using /^Q10.*/ be better and use the index?
Referencing the MongoDB documentation on $regex index use and the comments made on a previous Stack Overflow question:
The best way to confirm index usage for a given query is using MongoDB's query explain() feature. See Explain Results in the manual for your version of MongoDB for more information on the output fields and interpretation.
With regular expressions a main concern is efficient use of indexes. An unanchored substring match like /Q10/ will require examining all index keys (assuming a candidate index exists, as in your example). This is an improvement over scanning the full collection data (as would be the case without an index), but not as ideal as being able to check a subset of relevant index keys as is possible with a regex prefix search.
If you are routinely searching for substring matches and there is a common pattern to your strings, you could design a more scalable schema. For example, you could save whatever your Q10 value represents into a separate field (such as part_number) where you could use a prefix match or an exact match (non-regex).
To illustrate, I set up some test data using MongoDB 3.4.2 and the mongo shell:
// Needles: strings to search for
db.mod.insert([{number:'Q10'}, {number: 'foo-Q10'}, {number:'Q10-123'}])
// Haystack: some string values to illustrate key comparisons
for (i=0; i<1000; i++) { db.mod.insert({number: "I" + i}) }
Regex search without an index:
db.mod.find({ number: { $regex: /Q10/ }}).explain('executionStats')
The winningPlan is a COLLSCAN (collection scan), which requires the server to retrieve every document in the collection to perform the comparison. Note that the original regex includes an unnecessary .* prefix and suffix; this is implicit with a substring match, so it can be written more concisely as /Q10/.
Highlights from the executionStats section of the explain output:
"nReturned": 2,
"totalKeysExamined": 0,
"totalDocsExamined": 1003,
The explain output confirms there are no index keys examined and 1003 documents (all docs in this collection).
Add an index for the following two examples:
db.mod.createIndex({number:1}, {unique: true})
Regex substring search with an index:
db.mod.find({ number: { $regex: /Q10/}}).explain('executionStats')
The winningPlan is now an IXSCAN, but it has to examine all 1003 indexed string values to find substring matches:
"nReturned": 3,
"totalKeysExamined": 1003,
"totalDocsExamined": 3,
Regex prefix search with an index:
db.mod.find({ number: { $regex: /^Q10/}}).explain('executionStats')
The winningPlan is an IXSCAN (Index scan) which requires 3 key comparisons and 2 document fetches to return the 2 matching documents:
"nReturned": 2,
"totalKeysExamined": 3,
"totalDocsExamined": 2,
A prefix search isn't equivalent to the first two searches, as it will not match the document with value foo-Q10. However, this does illustrate a more efficient regex search.
Note that totalKeysExamined is 3. It might be reasonable to expect this to be 2 since there were only 2 matches, however this metric includes any comparisons with out-of-range keys (eg. end of a range of values). For more information see Explain Results: keysExamined.
With the index in place, a case-sensitive regular expression query traverses the entire index (loading it into memory) and then loads the matching documents to be returned. This is expensive, but it can still be better than a full collection scan.
For a /John Doe/ regex, mongo will scan the entire keyset in the index and then fetch the matched documents.
However, if you use a prefix query:
Further optimization can occur if the regular expression is a “prefix expression”, which means that all potential matches start with the same string. This allows MongoDB to construct a “range” from that prefix and only match against those values from the index that fall within that range.

Simplest way to find out if at least one cell in a cell array matches a regular expression

I need to search a cell array and return a single boolean value indicating whether any cell matches a regular expression.
For example, suppose I want to find out if the cell array strs contains foo or -foo (case-insensitive). The regular expression I need to pass to regexpi is ^-?foo$.
Sample inputs:
strs={'a','b'} % result is 0
strs={'a','foo'} % result is 1
strs={'a','-FOO'} % result is 1
strs={'a','food'} % result is 0
I came up with the following solution based on How can I implement wildcard at ismember function of matlab? and Searching cell array with regex, but it seems like I should be able to simplify it:
~isempty(find(~cellfun('isempty', regexpi(strs, '^-?foo$'))))
The problem I have is that it looks rather cryptic for such a simple operation. Is there a simpler, more human-readable expression I can use to achieve the same result?
NOTE: The answer refers to the original regexp in the question: '-?foo'
You can avoid the find:
any(~cellfun('isempty', regexpi(strs, '-?foo')))
Another possibility: first concatenate all the cells into a single string:
~isempty(regexpi([strs{:}], '-?foo'))
Note that you can remove the '-?' prefix in any of the above:
any(~cellfun('isempty', regexpi(strs, 'foo')))
~isempty(regexpi([strs{:}], 'foo'))
And that allows using strfind (with lower) instead of regexpi:
~isempty(strfind(lower([strs{:}]),'foo'))