I have a query which I was using in an Access database to match a field. The rows I wish to retrieve have a field which contains a sequence of characters in two possible forms (case-insensitive):
*PO12345, 5 digits preceded by *PO, or
PO12345, 5 digits preceded by PO.
In Access I achieved this with:
WHERE MyField LIKE '*PO#####*'
I have tried to replicate the query for use in an Oracle database:
WHERE REGEXP_LIKE(MyField, '/\*+PO[\d]{5}/i')
However, it doesn't return anything. I have tinkered with the Regex slightly, such as placing brackets around PO, but to no avail. To my knowledge what I have written is correct.
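Your regex has two problems. First, Oracle does not use /.../i delimiters: inside the pattern string the slashes and the trailing i are literal characters, so the pattern cannot match your data. Second, \*+ requires at least one asterisk, but the asterisk is optional in your data, so there shouldn't be a + after \*.
Using ? instead solves the second problem; in generic regex notation that is /\*?PO\d{5}/i.
In Oracle, pass i (case-insensitive) as the third parameter, like this: REGEXP_LIKE (MyField, '^\*?PO\d{5}$', 'i'); Note that the ^ and $ anchors require the whole field to match, so drop them if the sequence can appear anywhere in the field.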
Read REGEXP_LIKE documentation.
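The difference between \*+ and \*? can be checked quickly outside the database. This is only a sketch using Python's re module, not Oracle's engine (that substitution is my assumption for illustration), and the sample values and the helper name matching are made up.

```python
import re

# Hypothetical sample values; only the two patterns come from the thread.
samples = ["*PO12345", "PO12345", "po54321", "PO1234", "order *po00001 shipped"]

# Original pattern: \*+ demands at least one asterisk before PO.
required_star = re.compile(r"\*+PO\d{5}", re.IGNORECASE)
# Corrected pattern: \*? makes the asterisk optional.
optional_star = re.compile(r"\*?PO\d{5}", re.IGNORECASE)

def matching(pattern, values):
    """Return the values in which the pattern is found anywhere."""
    return [v for v in values if pattern.search(v)]

print(matching(required_star, samples))
print(matching(optional_star, samples))
```

With \*+ the rows that have no leading asterisk are lost; with \*? they are kept, while PO1234 (only four digits) is rejected by both.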
I want to select the records that contain non-alphanumeric characters and remove those symbols from the strings. The result I am expecting is strings with only numbers and letters.
I'm not really familiar with regular expressions, and sometimes they are really confusing. The code below is from answers to similar questions, but it also returns records that have only letters and spaces. I also tried /s in case some of the spaces are actually tabs, but I got the same result.
Also, I want to remove all symbols, that is, every character except letters, numbers and spaces. I found a function named removesymbols through Google, but it seems this function does not exist at all. The page that introduces removesymbols is https://cloud.google.com/dataprep/docs/html/REMOVESYMBOLS-Function_57344727. How can I remove all symbols? I don't want to use replace because there are a lot of symbols and I don't know every kind of non-alphanumeric character the data contains.
-- the code here only shows I want to select all records with non-alphanumeric
SELECT EMPLOYER
FROM fec.work
WHERE EMPLOYER NOT LIKE '[^a-zA-Z0-9/s]+'
GROUP BY 1;
Below is for BigQuery Standard SQL
SELECT
REGEXP_REPLACE(EMPLOYER, '[^a-zA-Z\\d\\s\\t]', ''), -- option 1
REGEXP_REPLACE(EMPLOYER, r'[^a-zA-Z\d\s\t]', ''), -- option 2
REGEXP_REPLACE(EMPLOYER, r'[^\w]', ''), -- option 3
REGEXP_REPLACE(EMPLOYER, r'\W', '') -- option 4
FROM fec.work
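As you can see, option 1 is the most verbose; you can avoid double escaping by putting r in front of the regular expression string, as in option 2.
To simplify further, you can use \w or \W directly, as in options 3 and 4. Note that these two behave slightly differently from options 1 and 2: they also remove whitespace, and they keep underscores, because \w is [a-zA-Z0-9_].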
Note: BigQuery provides regular expression support using the re2 library; see that documentation for its regular expression syntax.
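To see how the options differ, here is a small sketch in Python rather than BigQuery. Python's re is not RE2, so I pass re.ASCII to approximate RE2's ASCII-only \d, \w and \s (an assumption on my part), and the employer value is made up.

```python
import re

# Made-up employer value for illustration.
employer = "AT&T Corp. #42_b"

escaped = "[^a-zA-Z\\d\\s]"  # option 1 style: every backslash doubled
raw = r"[^a-zA-Z\d\s]"       # option 2 style: raw string, no doubling
assert escaped == raw        # both spell exactly the same pattern

# Options 1 and 2: keep letters, digits and whitespace.
keep_spaces = re.sub(raw, "", employer, flags=re.ASCII)
# Options 3 and 4: \W drops whitespace too, but keeps the underscore.
drop_spaces = re.sub(r"\W", "", employer, flags=re.ASCII)

print(repr(keep_spaces))
print(repr(drop_spaces))
```

So the r prefix changes only how the pattern is written, while switching to \W changes what is removed.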
I suggest using REGEXP_REPLACE in the SELECT to remove the characters, and REGEXP_CONTAINS in the WHERE clause to get only the rows you want.
SELECT REGEXP_REPLACE(EMPLOYER, r'[^a-zA-Z\d\s]', '')
FROM fec.work
WHERE REGEXP_CONTAINS(EMPLOYER, r'[^a-zA-Z\d\s]')
You say you don't want to use replace because you don't know all the non-alphanumeric characters there are. But instead of listing every non-alphanumeric character, why not use ^ to match everything except alphanumerics?
EDIT:
To complement Mikhail's answer, you have several choices for your regex:
'[^a-zA-Z\\d\\s]' // Basic regex
r'[^a-zA-Z\d\s]' // Uses r to avoid escaping
r'[^\w\s]' // \w = [a-zA-Z0-9_] (! underscore as alphanumerical !)
If you don't consider underscores to be alphanumerical, you should not use \w
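The filter-then-clean idea, and the underscore caveat, can be sketched in Python as follows. This is not BigQuery code: the rows and the function name clean_rows_with_symbols are hypothetical, and re.ASCII is my assumption to mimic RE2's ASCII-only classes.

```python
import re

# Same character class as in the query: anything that is not a letter, digit or whitespace.
SYMBOL = re.compile(r"[^a-zA-Z\d\s]", re.ASCII)
# Variant built on \w, which treats the underscore as alphanumeric.
SYMBOL_W = re.compile(r"[^\w\s]", re.ASCII)

def clean_rows_with_symbols(rows):
    """Keep only rows that contain a symbol (the WHERE step), then strip the symbols (the SELECT step)."""
    return [SYMBOL.sub("", row) for row in rows if SYMBOL.search(row)]

rows = ["ACME INC", "SELF-EMPLOYED", "N/A", "U_S_ARMY"]  # made-up values
print(clean_rows_with_symbols(rows))
print(SYMBOL_W.search("U_S_ARMY"))  # None: \w does not flag the underscore
```

ACME INC is skipped because it has no symbols, and the \w variant would also skip U_S_ARMY.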
I'm having an issue with Regex.
I'm trying to match T0000001 (2, 3 and so on).
However, some of the lines it searches have what I can only describe as positioners. These are shown as a question mark followed by 2 digits, such as ?21.
These positioners describe a new position if the document were to be printed off the website.
Example:
T123?214567
T?211234567
I need to disregard ?21 and match T1234567.
From what I can see, this is not possible.
I have looked everywhere and tried numerous attempts.
All we have to work from is the linked image. The creators can't even confirm which flavour of regex it is; they believe it's Python, but I'm unsure.
Regex Image
Update
Unfortunately, none of the patterns below have worked so far. I tested each one live (rather than only in a regex tester, thinking it might behave differently), but they still didn't work.
There is no replace feature and, as mentioned before, I'm not sure whether it is Python. I appreciate your help.
Do two regex operations
First do the regex replace to replace the positioners with an empty string.
(\?[0-9]{2})
Then do the regex match
T[0-9]{7}
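A minimal sketch of the two operations, assuming the tool really is Python and that a replace step is available (the question leaves both open). The function name find_t_numbers and the sample text are mine.

```python
import re

POSITIONER = re.compile(r"\?[0-9]{2}")  # step 1: the positioner, e.g. ?21
T_NUMBER = re.compile(r"T[0-9]{7}")     # step 2: T followed by seven digits

def find_t_numbers(text):
    """Remove every positioner, then collect the T numbers from what is left."""
    cleaned = POSITIONER.sub("", text)
    return T_NUMBER.findall(cleaned)

sample = "T123?214567\nT?211234567\nT0000001"
print(find_t_numbers(sample))
```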
If there's only one occurrence of the 'positioners' in each match, something like this should work: (T.*?)\?\d{2}(.*)
This can be tested here: https://regex101.com/r/XhQXkh/2
Basically, match two capture groups before and after the '?21' sequence. You'll need to concatenate these two matches.
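The concatenation step might look like this in Python. It is only a sketch under the same assumption as the answer (at most one positioner per line), and strip_single_positioner is a name I made up.

```python
import re

# Group 1: T plus anything before the positioner; group 2: everything after it.
pattern = re.compile(r"(T.*?)\?\d{2}(.*)")

def strip_single_positioner(line):
    """Join the text before and after the positioner; return the line unchanged if there is none."""
    m = pattern.search(line)
    if m is None:
        return line
    return m.group(1) + m.group(2)

for line in ["T123?214567", "T?211234567", "T1234567"]:
    print(strip_single_positioner(line))
```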
First, match the ?21 and replace it with a distinctive character such as #:
\?21
Then you can try this regex to find what you want:
(T(?:\d{7}|[\#\d]{8}))\s
The target string is captured in group 1 (\1).
Finally, replace # with ?21 or whatever you like.
A Python script might look like this:
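import re

ss = """T123?214567
T?211234567
T1234567
T1234434?21
T5435433"""
# rexpre finds the positioner; regx finds T followed by 7 digits, or by
# 8 characters made of digits and the # mask, and then one whitespace character.
rexpre = re.compile(r'\?21')
regx = re.compile(r'(T(?:\d{7}|[\#\d]{8}))\s')
# Print each match with the positioner masked as #.
# The last line of ss has no trailing whitespace, so \s keeps it out of the results.
for m in regx.findall(rexpre.sub('#', ss)):
    print(m)
print()
# Print each match again with # turned back into ?21.
for m in regx.findall(rexpre.sub('#', ss)):
    print(re.sub('#', r'?21', m))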
Output is
T123#4567
T#1234567
T1234567
T1234434#
T123?214567
T?211234567
T1234567
T1234434?21
If using a replace functionality is an option for you then this might be an approach to match T0000001 or T123?214567:
Capture a T followed by zero or more digits before the optional part in group 1 (T\d*)
Make the question mark followed by 2 digits part optional (?:\?\d{2})?
Capture one or more digits after in group 2 (\d+).
Then in the replacement you could use group 1 followed by group 2: \1\2.
Using word boundaries \b (Or use assertions for the start and the end of the line ^ $) this could look like:
\b(T\d*)(?:\?\d{2})?(\d+)\b
Example Python
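A minimal sketch of this approach with Python's re.sub, assuming a replace step is available (the question says it may not be). The function name remove_positioners and the sample string are mine.

```python
import re

# Group 1: T and any digits before the positioner; group 2: the digits after it.
pattern = re.compile(r"\b(T\d*)(?:\?\d{2})?(\d+)\b")

def remove_positioners(text):
    """Rewrite every match as group 1 followed by group 2, dropping the optional ?NN part."""
    return pattern.sub(r"\1\2", text)

sample = "T123?214567 T?211234567 T0000001"
print(remove_positioners(sample))
```

A value without a positioner, such as T0000001, still matches and is rewritten to itself, so it passes through unchanged.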
Is the below what you want?
Use RegExReplace with multiline tag (m) and enable replace all occurrences!
Pattern = (T\d*)\?\d{2}(\d*)
replace = $1$2
I want to use a regular expression to split string values from a field.
Here is an example to illustrate my question:
mydatabase=> SELECT regexp_replace('a1=1,2;B2b=2,3,4;C3c={3,4,5;4,5,6};D4d={4,5,6;7,8,9}',
'([^0-9]|^)([=.*])(?=;|$)', '\1 \2', 'g');
regexp_replace
------------------------------------------------------
a1=1,2;B2b=2,3,4;C3c={3,4,5;4,5,6};D4d={4,5,6;7,8,9}
(1 row)
But I want a result like the one below:
mydatabase=>YOUR_ANSWER_QUERY
regexp_replace
------------------
a1=1
B2b=2,3,4
C3c={3,4,5;4,5,6}
D4d={4,5,6;7,8,9}
(4 rows)
You have semicolons inside your braces. To avoid splitting on those, I added a negative look-ahead, (?![0-9]), so that a semicolon followed by a digit is not treated as a separator. regexp_split_to_table then separates the pieces into rows.
This should do it:
SELECT regexp_split_to_table( 'a1=1,2;B2b=2,3,4;C3c={3,4,5;4,5,6};D4d={4,5,6;7,8,9}', ',?[1-2]?(;(?![0-9]))');
I used the online regex replace verifier at http://regexr.com.
The regex to search for is ([^=]+)=(\{[^\}]+\}|[^;]+)(?:;|$)
and the regex to replace is $1=$2\r.
For your input string they give the required result.
Note that this verifier requires $ sign (+ number) to refer to a capturing group.
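The same search pattern can be tried outside the verifier. Below is a sketch with Python's re.findall instead of a search-and-replace (Python is my choice here, not the asker's PostgreSQL), and split_pairs is a hypothetical helper name.

```python
import re

# Key, "=", then either a whole {...} block or everything up to the next semicolon.
PAIR = re.compile(r"([^=]+)=(\{[^\}]+\}|[^;]+)(?:;|$)")

def split_pairs(value):
    """Return one 'key=value' string per pair, keeping braces and their inner semicolons intact."""
    return [f"{key}={val}" for key, val in PAIR.findall(value)]

field = "a1=1,2;B2b=2,3,4;C3c={3,4,5;4,5,6};D4d={4,5,6;7,8,9}"
for item in split_pairs(field):
    print(item)
```

Because the brace alternative is tried first, the semicolons inside {...} never act as separators.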
I want to get only the records that have certain words in one column. I tried using WHERE ... IN (...), but Postgres is case-sensitive in this WHERE clause.
This is why I tried a regex and the ~* operator.
The following SQL snippet returns all the columns and tables from the DB; I want to restrict the rows to only the tables named in the regular expression.
SELECT ordinal_position as COLUMN_ID, TABLE_NAME, COLUMN_NAME
FROM information_schema.columns
WHERE table_schema = 'public' and table_name ~* 'PRODUCTS|BALANCES|BALANCESBARCODEFORMATS|BALANCESEXPORTCATEGORIES|BALANCESEXPORTCATEGORIESSUB'
order by TABLE_NAME, COLUMN_ID
This regex will bring all the columns of BALANCES and the columns of the tables that contain the 'BALANCES' keyword.
I want to restrict the result to complete names only.
Using regexes, the common solution is using word boundaries before and after the current expression.
See effect without: http://regexr.com?35ecl
See effect with word boundaries: http://regexr.com?35eci
In PostgreSQL, word boundaries are denoted by \y (other popular regex engines, such as PCRE, C# and Java, use \b instead, hence its use in the regex demo above; thanks @IgorRomanchenko).
Thus, for your case, the expression below could be used (the matches are the same as the example regexes in the links above):
'\y(PRODUCTS|BALANCES|BALANCESBARCODEFORMATS|BALANCESEXPORTCATEGORIES|BALANCESEXPORTCATEGORIESSUB)\y'
See demo of this expression in use here:
http://sqlfiddle.com/#!12/9f597/1
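For a quick check without a database, the effect of the boundaries can be sketched in Python, which spells the word boundary \b rather than PostgreSQL's \y. The last two table names are hypothetical, added only to show the difference.

```python
import re

NAMES = "PRODUCTS|BALANCES|BALANCESBARCODEFORMATS|BALANCESEXPORTCATEGORIES|BALANCESEXPORTCATEGORIESSUB"

loose = re.compile(NAMES, re.IGNORECASE)                # like ~* without boundaries
bounded = re.compile(rf"\b({NAMES})\b", re.IGNORECASE)  # \b here plays the role of \y

# "balanceshistory" and "oldproducts" are made-up names that merely contain a keyword.
tables = ["products", "balances", "balanceshistory", "oldproducts"]

print([t for t in tables if loose.search(t)])
print([t for t in tables if bounded.search(t)])
```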
If you want to match only whole table_name use something like
'^(PRODUCTS|BALANCES|BALANCESBARCODEFORMATS|BALANCESEXPORTCATEGORIES|BALANCESEXPORTCATEGORIESSUB)$'
^ matches at the beginning of the string.
$ matches at the end of the string.
Alternatively you can use something like:
upper(table_name) IN ('PRODUCTS','BALANCES','BALANCESBARCODEFORMATS','BALANCESEXPORTCATEGORIES', ...)
to make IN case insensitive.
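Both suggestions (anchoring the pattern, or upper-casing before the IN test) can be compared in a small Python sketch. The table names are taken from the regex above; the function names and the test inputs are hypothetical.

```python
import re

WANTED = {
    "PRODUCTS",
    "BALANCES",
    "BALANCESBARCODEFORMATS",
    "BALANCESEXPORTCATEGORIES",
    "BALANCESEXPORTCATEGORIESSUB",
}

# ^ and $ force the whole name to equal one of the alternatives.
anchored = re.compile(r"^(" + "|".join(sorted(WANTED)) + r")$", re.IGNORECASE)

def is_wanted_regex(name):
    """Anchored, case-insensitive regex test."""
    return anchored.search(name) is not None

def is_wanted_upper(name):
    """Equivalent of upper(table_name) IN (...)."""
    return name.upper() in WANTED

for name in ["Balances", "balances_old", "products"]:
    print(name, is_wanted_regex(name), is_wanted_upper(name))
```

For plain names the two tests agree; the upper-case version has the advantage of not needing a regex at all.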
I am working on Oracle 10gR2.
I am working on a column which stores username. Let's say that one of the values in this column is "Ankur". I want to fetch all records where username is a concatenated string of "Ankur" followed by some numerical digits, like "Ankur1", "Ankur2", "Ankur345" and so on. I do not want to get records with values such as "Ankurab1" - that is anything which is concatenation of some characters to my input string.
I tried to use REGEX functions to achieve the desired result, but am not able to.
I was trying:
SELECT 1 FROM dual WHERE regexp_like ('Ankur123', '^Ankur[:digit:]$');
Can anyone help me here?
Oracle uses POSIX EREs (which don't support the common \d shorthand), so you can use
^Ankur[0-9]+$
Your version would nearly have worked, too:
^Ankur[[:digit:]]+$
One set of [...] for the character class, and one for the [:digit:] subset. And of course a + to allow more than just one digit.
You were close
Try regexp_like ('Ankur123', '^Ankur[[:digit:]]+$') instead.
Note the [[ instead of the single [ and the + in front of the $.
Without the + the regexp only matches for exactly one digit after the String Ankur.
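The effect of the missing + is easy to reproduce. The sketch below uses Python's re module, which does not support POSIX classes such as [[:digit:]], so it uses the equivalent [0-9]; the list of names is made up from the examples in the question.

```python
import re

one_digit = re.compile(r"^Ankur[0-9]$")     # no +: exactly one digit allowed
many_digits = re.compile(r"^Ankur[0-9]+$")  # with +: one or more digits

names = ["Ankur", "Ankur1", "Ankur345", "Ankurab1"]

print([n for n in names if one_digit.search(n)])
print([n for n in names if many_digits.search(n)])
```

Only the version with + accepts Ankur345, and neither accepts Ankurab1 or the bare Ankur.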