Regexextract behaving unexpectedly - regex

I am trying to extract a (variable) substring from a longer result output string in a cell.
=SPLIT(TRIM(REGEXEXTRACT($Z3,“.(?s)+\([R][1][-][1][M]\)\s+\w+\s+\w+\s+\w+\s+[-]+\s+(.*)“)),1)
Typical content of cell Z3 is:
(F1-1D) Unique identifier schemes found [‘url’], (R1-1M) Resource type specified - webpage, (R1.2-1M) Found date-related picture information, (A1-2M) Access to metadata found: slurp
I want to extract the word between - and , following (R1-1M).
In this example it is webpage.
The string can contain any number of the comma-separated elements.

EDIT
Taking a better look at the OP's question
I want to extract the word between - and , following (R1-1M).
In this example it is webpage.
The string can contain any number of the comma-separated elements.
I believe the whole formula can be further simplified to
=REGEXEXTRACT($A$3, "- (\w+),")
Original answer
You can try the following
=REGEXEXTRACT($A$3, "(\w+),[ \(R1\.2\-1M\)]")
or even
=REGEXEXTRACT($A$3, "[\(R1\-1M\)] (\w+),")
(Do adjust ranges to your needs)

try:
=ARRAYFORMULA(TRIM(FLATTEN(QUERY(TRANSPOSE(IFNA(REGEXEXTRACT(
IFERROR(SPLIT(A1:A, ",")), "\(R1-1M\).+- (.+)"))),,9^9))))

Related

Extract text up to the Nth character in a string

How can I extract the text up to the 4th instance of a character in a column?
I'm selecting text out of a column called filter_type up to the fourth > character.
To accomplish this, I've been trying to find the position of the fourth > character, but it's not working:
select substring(filter_type from 1 for position('>' in filter_type))
You can use the pattern matching function in Postgres.
First figure out a pattern to capture everything up to the fourth > character.
To start your pattern you should create a sub-group that captures non > characters, and one > character:
([^>]*>)
Then capture that four times to get to the fourth instance of >
([^>]*>){4}
Then, you will need to wrap that in a group so that the match brings back all four instances:
(([^>]*>){4})
and put a start of string symbol for good measure to make sure it only matches from the beginning of the String (not in the middle):
^(([^>]*>){4})
Here's a working regex101 example of that!
Once you have the pattern that will return what you want in the first group element (which you can tell at the online regex on the right side panel), you need to select it back in the SQL.
In Postgres, the substring function has an option to use a regex pattern to extract text out of the input using a 'from' statement in the substring.
To finish, put it all together!
select substring(filter_type from '^(([^>]*>){4})')
from filter_table
See a working sqlfiddle here
If you want to match the entire string whenever there are less than four instances of >, use this regular expression:
^(([^>]*>){4}|.*)
You can also use a simple, non-regex solution:
SELECT array_to_string((string_to_array(filter_type, '>'))[1:4], '>')
The above query:
splits your string into an array, using '>' as delimeter
selects only the first 4 elements
transforms the array back to a string
substring(filter_type from '^(([^>]*>){4})')
This form of substring lets you extract the portion of a string that matches a regex pattern.
You can also split the string, then choose the N'th element inside the result list. For example:
SELECT SPLIT_PART('aa,bb,cc', ',', 2)
will return: bb.
This function is defined as:
SPLIT_PART(string, delimiter, position)
In order to look at this problem, I did the following (all of the code below is available on the fiddle here):
CREATE TABLE s
(
a TEXT
);
I then created a PL/pgSQL function to generate random strings as follows.
CREATE FUNCTION f() RETURNS TEXT LANGUAGE SQL AS
$$
SELECT STRING_AGG(SUBSTR('abcdef>', CEIL(RANDOM() * 7)::INTEGER, 1), '')
FROM GENERATE_SERIES(1, 40)
$$;
I got the code from here and modified it so that it would produce strings with lots of > characters for testing purposes.
I then manually inserted a few strings at the beginning so that a quick look would tell me if the code was working as anticipated.
INSERT INTO s VALUES
('afsad>adfsaf>asfasf>afasdX>asdffs>asfdf>'),
('23433>433453>4>4559>455>3433>'),
('adfd>adafs>afadsf>'), -- only 3 '>'s!
('babedacfab>feaefbf>fedabbcbbcdcfefefcfcd'),
('e>>>>>'), -- edge case - multiple terminal '>'s
('aaaaaaa'); -- edge case - no '>'s whatsoever
The reason I put in the records with fewer than 4 >s is because the accepted answer (see discussion at the end of this answer) puts forward a solution which should return the entire string if this is the case!
On the fiddle, I then added 50,000 records as follows:
INSERT INTO s
SELECT f() FROM GENERATE_SERIES(1, 50000);
I also created a table s on a home laptop (16GB RAM, 500MB NVMe SSD) and populated it with 40,000,000 (50M) records - times also shown.
Now, my reading of the question is that we need to extract the string up to but not including the 4th > character.
The first solution (from treecon) was this one (I also show them running on the fiddle, but to save space here, I've only included the partial output of EXPLAIN (ANALYZE, BUFFERS, VERBOSE)) - the times shown are typical over a few runs:
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT
ARRAY_TO_STRING((STRING_TO_ARRAY(a, '>'))[1:4], '>'),
a
FROM s;
Result (only key parts included):
Seq Scan on public.s
Execution Time: 81.807 ms
40M Time: 46 seconds
A regex solution which works (significantly faster):
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT
SUBSTRING(a FROM '^(?:[^>]*>){0,3}[^>]*'),
a
FROM s;
Result:
Seq Scan on public.s
Execution Time: 74.757 ms
40M Time: 32 seconds
The accepted answer fails on many levels (see the fiddle). It leaves a > at the end and fails on various strings even when modified. Also, the solution proposed to include strings with fewer than 4 >s (i.e. ^(([^>]*>){4}|.*)) merely returns the original string (see end of fiddle).

Split string and get last element

Let's say I have a column which has values like:
foo/bar
chunky/bacon/flavor
/baz/quz/qux/bax
I.e. a variable number of strings separated by /.
In another column I want to get the last element from each of these strings, after they have been split on /. So, that column would have:
bar
flavor
bax
I can't figure this out. I can split on / and get an array, and I can see the function INDEX to get a specific numbered indexed element from the array, but can't find a way to say "the last element" in this function.
Edit:
this one is simplier:
=REGEXEXTRACT(A1,"[^/]+$")
You could use this formula:
=REGEXEXTRACT(A1,"(?:.*/)(.*)$")
And also possible to use it as ArrayFormula:
=ARRAYFORMULA(REGEXEXTRACT(A1:A3,"(?:.*/)(.*)$"))
Here's some more info:
the RegExExtract function
Some good examples of syntax
my personal list of Regex Tricks
This formula will do the same:
=INDEX(SPLIT(A1,"/"),LEN(A1)-len(SUBSTITUTE(A1,"/","")))
But it takes A1 three times, which is not prefferable.
You could do this too
=index(SPLIT(A1, "/"), COLUMNS(SPLIT(A1, "/"))-1)
Also possible, perhaps best on a copy, with Find:
.+/
(Replace with blank) and Search using regular expressions ticked.
You can try use this!
You've got the array of String, so you can acess the last element by length
String message = "chunky/bacon/flavor";
String[] outSplited = message.split("/");
System.out.println(outSplited[outSplited.length -1]);

Regular Expression: allow only some specific set of words in the json string

I have an case where in I have to allow only some specific set of words in the json string, example:
{"name":"prashant","id":123,"address":"e-56 first floor"} This is valid\n
{"name":"prashant","address":"e-56 first floor"} This is valid\n
{"id":123,"address":"e-56 first floor"} This is valid\n
{"address":"e-56 first floor"} This is valid\n
Any permutation combination for given words (name|id|address) is ok
Now Invalid
{"name":"prashant","id":123,"address":"e-56 first floor" ,"phno":"9999999999"}
Not valid
if any word which is not the part of the given list has been used , then the whole string will be invalid , even though its contains 3 valid one invalid , will not allowed.
Can any one provide me the REGEX for this?
I would sugest to parse the data as JSON, there are many libraries for many languages. After you parsed the JSON you just need to check if all required fields are set.
This should be enough to validate if every key in the dictionary is valid. It will not detect some malformed json.
{("(name|id|address)":[^,]*,?)+}
The following one does not allow malformed json
{(("(name|address)":"[^"]*",)|("id":\d+,))*(("(name|address)":"[^"]*")|("id":\d+))}

variable number of capturing groups

I have a xpath expression which I want to use to extract City and date from a td which contains a string of this kind:
City(may contain spaces and may be missing, but the following space is always present) on 2013/07/20
So far, I got to the following solution for extracting the date, which works partially:
//path/to/my/td/text()/replace(.,'(.*) on (.*)','$3')
This works when City is present, but when City is missing I get "on 2013/07/20" as a result.
I think this is because the first capturing group fails and so the number of groups is different.
How can I get this expression to work?
I did not fully check your regex, but it looks fine at first sight. Anyway, you can also go an easier way if you only want to get the date by extracting the text after "on ":
//path/to/my/td/text()/substring-after(.,'on ')
edit: or you may go the substring-way and select the last 10 characters of the content:
//path/to/my/td/text()/substring(., string-length(.) - 9)

How can I use regexextract function in Google Docs spreadsheets to get "all" occurrences of a string?

My text string is in cell D2:
Decision, ERC Case No. 2009-094 MC, In the Matter of the Application for Authority to Secure Loan from the National Electrification Administration (NEA), with Prayer for Issuance of Provisional Authority, Dinagat Island Electric Cooperative, Inc. (DIELCO) applicant(12/29/2011)
This function:
=regexextract(D2,"\([A-Z]*\)")
will grab (NEA) but not (DIELCO)
I would like it to extract both (NEA) and (DIELCO)
You can use capture groups, which will cause regexextract() to return an array. You can use this as the cell result, in which case you will get a range of results, or you can feed the array to another function to reformat it to your purpose. For example:
regexextract( "abracadabra" ; "(bra).*(bra)" )
will return the array:
{bra,bra}
Another approach would be to use regexreplace(). This has the advantage that the replace is global (like s/pattern/replacement/g), so you do not need to know the number of results in advance. For example:
regexreplace( "aBRAcadaBRA" ; "[a-z]+" ; "..." )
will return the string:
...BRA...BRA
Here are two solutions, one using the specific terms in the author's example, the other one expanding on the author's sample regex pattern which appears to match all ALLCAPS terms. I'm not sure which is wanted, so I gave both.
(Put the block of text in A1)
Generic solution for all words in ALLCAPS
=regexreplace(regexreplace(REGEXREPLACE(A1,"\b\w[^A-Z]*\b","|"),"\W+","|"),"^\||\|$","")
Result:
ERC|MC|NEA|DIELCO
NB: The brunt of the work is in the CAPITALIZED formula, the lowercase functions are just for cleanup.
If you want space separation, the formula is a little simpler:
=trim(regexreplace(REGEXREPLACE(A1,"\b\w[^A-Z]*\b"," "),"\W+"," "))
Result:
ERC MC NEA DIELCO
(One way I like playing with regex in google spreadsheets is to read the regex pattern from another cell so I can change it without having to edit or re-paste into all the cells using that pattern. This looks so:
Cell A1:
Block of text
Cell B1 (no quote marks):
\b\w[^A-Z]*\b
Formula, in any cell:
=trim(regexreplace(REGEXREPLACE(A1,B$1," "),"\W+"," "))
By anchoring it to B$1, I can fill all my rows at once and the reference won't increment.)
Previous answer:
Specific solution for selected terms (ERC, DIELCO)
=regexreplace(join("|",IF(REGEXMATCH(A1,"ERC"),"ERC",""),IF(REGEXMATCH(A1,"DIELCO"),"DIELCO","")),"(^\||\|$)","")
Result:
ERC|DIELCO
As before, the brunt of the work is in the CAPITALIZED formula, the lowercase functions are just for cleanup.
This formula will find any ERC or DIELCO, or both in the block of text. The initial order doesn't matter, but the output will always be ERC followed by DIELCO (the order of appearance is lost). This fixes the shortcoming with the previous answer using "(bra).*(bra)" in that isolated ERC or DIELCO can still be matched.
This also has a simpler form with space separation:
=trim(join(" ",IF(REGEXMATCH(A1,"ERC"),"ERC",""),IF(REGEXMATCH(A1,"DIELCO"),"DIELCO","")))
Result:
ERC DIELCO
Please try:
=SPLIT(regexreplace(A1 ; "(?s)(.)?\(([A-Z]+)\)|(.)" ; "🧸$2");"🧸")
or
=REGEXEXTRACT(A1;"\Q"&REGEXREPLACE(A1;"\([A-Z]+\)";"\\E(.*)\\Q")&"\E")