Oracle REGEX functions - regex

I am working on Oracle 10gR2.
I am working on a column which stores username. Let's say that one of the values in this column is "Ankur". I want to fetch all records where username is a concatenated string of "Ankur" followed by some numerical digits, like "Ankur1", "Ankur2", "Ankur345" and so on. I do not want to get records with values such as "Ankurab1" - that is anything which is concatenation of some characters to my input string.
I tried to use REGEX functions to achieve the desired result, but am not able to.
I was trying:
SELECT 1 FROM dual WHERE regexp_like ('Ankur123', '^Ankur[:digit:]$');
Can anyone help me here?

Oracle uses POSIX EREs (which don't support the common \d shorthand), so you can use
^Ankur[0-9]+$
Your version would nearly have worked, too:
^Ankur[[:digit:]]+$
One set of [...] for the character class, and one for the [:digit:] subset. And of course a + to allow more than just one digit.

You were close
Try regexp_like ('Ankur123', '^Ankur[[:digit:]]+$') instead.
Note the [[ instead of the single [ and the + in front of the $.
Without the + the regexp only matches for exactly one digit after the String Ankur.

Related

Prxmatch in SAS - using $ to limit results doesn't work

I'm trying to use prxmatch to verify if postcode format (UK) is correct. The ('/^[A-Z]{1,2}\d{2,3}[A-Z]{2}|[A-Z]{1,2}\d[A-Z]\d[A-Z]{2}$/') bit covers (I think) all the possible post code formats used in UK, however I only want exact and not partial matches and no additional chars before or after match.
data pc_flag ; set abc ;
format pc_correct_flag $1. compressed_postcode $100.;
compressed_postcode = compress(postcode);
pc_regex = prxparse('/^[A-Z]{1,2}\d{2,3}[A-Z]{2}|[A-Z]{1,2}\d[A-Z]\d[A-Z]{2}$/');
if prxmatch(pc_regex,compressed_postcode)>0
then pc_correct_flag='Y';
else pc_correct_flag='N';run;
I was expecting 'Y' only on exact matches on full string, i.e. with no additional characters before and after regex. However, I'm also getting false positives, where a part of 'compressed_postcode' matches regex, but there are additional characters after the match, which I thought using $ would prevent.
I.e. I'd expect only something like AA11AA to match, but not AA11AAAA. I suspect this has to do with $ positioning but can't figure out exactly what's wrong. Any idea what I've missed?
SAS character variables contain trailing spaces out to the length of the variable. Either trim the value to be examined, or add \s*$ as the pattern termination.
if prxmatch(pc_regex,TRIM(compressed_postcode))>0 then …
Your regex is quite permissive - it allows every letter of the alphabet in every valid character position, so it matches quite a lot of strings that look like valid postcodes but do not exist as such, e.g. ZZ1 1ZZ.
I provided a more specific SAS-compatible postcode regex as an answer to another question - here's link in case this proves useful to you:
https://stackoverflow.com/a/43793562/667489
That one still matches some non-postcode strings, but it filters out any with characters on Royal Mail's blacklists for each position within the postcode.
As per Richard's answer, you need to trim the string being matched before applying the regex, or amend the regex to match extra trailing blanks.

How can I select all records with non-alphanumeric and remove them?

I want to select the records containing non-alphanumeric and remove those symbols from strings. The result I expecting is strings with only numbers and letters.
I'm not really familiar with regular expression and sometime it's really confusing. The code below is from answers to similar questions. But it also returns records having only letters and space. I also tried to use /s in case some spaces are not spaces but tabs instead. But I got the same result.
Also, I want to remove all symbols, characters excepting letters, numbers and spaces. I found a function named removesymbols from google could reference. But it seems this function does not exist at all. The website introduces removesymbols is https://cloud.google.com/dataprep/docs/html/REMOVESYMBOLS-Function_57344727. How can I remove all symbols? I don't want to use replace because there are a lot of symbols and I don't know all kinds of non-alphanumeric they have.
-- the code here only shows I want to select all records with non-alphanumeric
SELECT EMPLOYER
FROM fec.work
WHERE EMPLOYER NOT LIKE '[^a-zA-Z0-9/s]+'
GROUP BY 1;
Below is for BigQuery Standard SQL
SELECT
REGEXP_REPLACE(EMPLOYER, '[^a-zA-Z\\d\\s\\t]', ''), -- option 1
REGEXP_REPLACE(EMPLOYER, r'[^a-zA-Z\d\s\t]', ''), -- option 2
REGEXP_REPLACE(EMPLOYER, r'[^\w]', ''), -- option 3
REGEXP_REPLACE(EMPLOYER, r'\W', '') -- option 4
FROM fec.work
As you can see - option 1 is most verbose and you can avoid double escaping by using r in front of string regular expression as it is in option 2
To further simplify - you can use \w or directly \W as in options 3 and 4
Note: BigQuery provides regular expression support using the re2 library; see that documentation for its regular expression syntax.
I suggest using REGEXP_REPLACE for select, to remove the characters, and using REGEXP_CONTAINS to get only the one you want.
SELECT REGEXP_REPLACE(EMPLOYER, r'[^a-zA-Z\d\s]', '')
FROM fec.work
WHERE REGEXP_CONTAINS(EMPLOYER, r'[^a-zA-Z\d\s]')
You say you don't want to use replace because you don't know how many alphanumerical there is. But instead of listing all non-alphanumerical, why not use ^ to get all but alphanumerical ?
EDIT :
To complete with what Mikhail answered, you have multiple choices for your regex :
'[^a-zA-Z\\d\\s]' // Basic regex
r'[^a-zA-Z\d\s]' // Uses r to avoid escaping
r'[^\w\s]' // \w = [a-zA-Z0-9_] (! underscore as alphanumerical !)
If you don't consider underscores to be alphanumerical, you should not use \w

REGEXP_LIKE in Oracle

I have a query which I was using in an Access database to match a field. The rows I wish to retrieve have a field which contains a sequence of characters in two possible forms (case-insensitive):
*PO12345, 5 digits preceded by *PO, or
PO12345, 5 digits preceded by PO.
In Access I achieved this with:
WHERE MyField LIKE '*PO#####*'
I have tried to replicate the query for use in an Oracle database:
WHERE REGEXP_LIKE(MyField, '/\*+PO[\d]{5}/i')
However, it doesn't return anything. I have tinkered with the Regex slightly, such as placing brackets around PO, but to no avail. To my knowledge what I have written is correct.
Your regex \*+PO[\d]{5} is wrong. There shouldn't be + after \* as it's optional.
Using ? like this /\*?PO\d{5}/i solves the problem.
Use i (case insensitive) as parameter like this: REGEXP_LIKE (MyField, '^\*?PO\d{5}$', 'i');
Regex101 Demo
Read REGEXP_LIKE documentation.

Interesting easy looking Regex

I am re-phrasing my question to clear confusions!
I want to match if a string has certain letters for this I use the character class:
[ACD]
and it works perfectly!
but I want to match if the string has those letter(s) 2 or more times either repeated or 2 separate letters
For example:
[AKL] should match:
ABCVL
AAGHF
KKUI
AKL
But the above should not match the following:
ABCD
KHID
LOVE
because those are there but only once!
that's why I was trying to use:
[ACD]{2,}
But it's not working, probably it's not the right Regex.. can somebody a Regex guru can help me solve this puzzle?
Thanks
PS: I will use it on MYSQL - a differnt approach can also welcome! but I like to use regex for smarter and shorter query!
To ensure that a string contains at least two occurencies in a set of letters (lets say A K L as in your example), you can write something like this:
[AKL].*[AKL]
Since the MySQL regex engine is a DFA, there is no need to use a negated character class like [^AKL] in place of the dot to avoid backtracking, or a lazy quantifier that is not supported at all.
example:
SELECT 'KKUI' REGEXP '[AKL].*[AKL]';
will return 1
You can follow this link that speaks on the particular subject of the LIKE and the REGEXP features in MySQL.
If I understood you correctly, this is quite simple:
[A-Z].*?[A-Z]
This looks for your something in your set, [A-Z], and then lazily matches characters until it (potentially) comes across the set, [A-Z], again.
As #Enigmadan pointed out, a lazy match is not necessary here: [A-Z].*[A-Z]
The expression you are using searches for characters between 2 and unlimited times with these characters ACDFGHIJKMNOPQRSTUVWXZ.
However, your RegEx expression is excluding Y (UVWXZ])) therefore Z cannot be found since it is not surrounded by another character in your expression and the same principle applies to B ([ACD) also excluded in you RegEx expression. For example Z and A would match in an expression like ZABCDEFGHIJKLMNOPQRSTUVWXYZA
If those were not excluded on purpose probably better can be to use ranges like [A-Z]
If you want 2 or more of a match on [AKL], then you may use just [AKL] and may have match >= 2.
I am not good at SQL regex, but may be something like this?
check (dbo.RegexMatch( ['ABCVL'], '[AKL]' ) >= 2)
To put it in simple English, use [AKL] as your regex, and check the match on the string to be greater than 2. Here's how I would do in Java:
private boolean search2orMore(String string) {
Matcher matcher = Pattern.compile("[ACD]").matcher(string);
int counter = 0;
while (matcher.find())
{
counter++;
}
return (counter >= 2);
}
You can't use [ACD]{2,} because it always wants to match 2 or more of each characters and will fail if you have 2 or more matching single characters.
your question is not very clear, but here is my trial pattern
\b(\S*[AKL]\S*[AKL]\S*)\b
Demo
pretty sure this should work in any case
(?<l>[^AKL\n]*[AKL]+[^AKL\n]*[AKL]+[^AKL\n]*)[\n\r]
replace AKL for letters you need can be done very easily dynamicly tell me if you need it
Is this what you are looking for?
".*(.*[AKL].*){2,}.*" (without quotes)
It matches if there are at least two occurences of your charactes sorrounded by anything.
It is .NET regex, but should be same for anything else
Edit
Overall, MySQL regular expression support is pretty weak.
If you only need to match your capture group a minimum of two times, then you can simply use:
select * from ... where ... regexp('([ACD].*){2,}') #could be `2,` or just `2`
If you need to match your capture group more than two times, then just change the number:
select * from ... where ... regexp('([ACD].*){3}')
#This number should match the number of matches you need
If you needed a minimum of 7 matches and you were using your previous capture group [ACDF-KM-XZ]
e.g.
select * from ... where ... regexp('([ACDF-KM-XZ].*){7,}')
Response before edit:
Your regex is trying to find at least two characters from the set[ACDFGHIJKMNOPQRSTUVWXZ].
([ACDFGHIJKMNOPQRSTUVWXZ]){2,}
The reason A and Z are not being matched in your example string (ABCDEFGHIJKLMNOPQRSTUVWXYZ) is because you are looking for two or more characters that are together that match your set. A is a single character followed by a character that does not match your set. Thus, A is not matched.
Similarly, Z is a single character preceded by a character that does not match your set. Thus, Z is not matched.
The bolded characters below do not match your set
ABCDEFGHIJKLMNOPQRSTUVWXYZ
If you were to do a global search in the string, only the italicized characters would be matched:
ABCDEFGHIJKLMNOPQRSTUVWXYZ

PCRE regex replace a text pattern within double quotes

In Notepad++ 6.5.1 I need to replace certain patterns within quote pairs. I want to save the replace as part of a macro, so all replacements need to happen in one step.
For example, in the following string, replace all 'a' characters within quote pairs with a dash, while leaving characters outside the quote pairs untouched:
Input: aa"bbabaavv"kdjhas"bbabaavv"x
Desired result: aa"bb-b--vv"kdjhas"bb-b--vv"x
Note that the quotes are matched up pairwise, such that the 'a' in kdjhas is untouched.
So far I have tried searching for (?:"[^"a]*|\G)\Ka([^"a]*) and replacing with -$1, but that simply replaces all the a's, with the result --"bb-b--vv"kdjh-s"bb-b--vv"x. I'm attempting PCRE regex that will let me recursively replace the quote-delimited text.
Edit: Quote marks within a quoted string are escaped with an extra quote, e.g. "". However, assume I will have already replaced these in a previous pass with a special character. Therefore a regex solution to this problem will not have to deal with escaped quotes.
It is hard to tell if this is possible as you've only provided one line of input text.
But assuming that input follows this pattern:
BOL|any text|string with two groups of a's|any text|string with two groups of a's|any text|EOL
aa "bbabaavv" kdjhas "bbabaavv" x
I was able to create this regexp search string:
^(.+?\".+?)([a]+)(.+?)([a]+)(.*?\")(.+?\".+?)([a]+)(.+?)([a]+)(.*?\".*)$
With this replace string:
\1-\3-\5\6-\8-\A
and it turn your input string from this:
aa"bbabaavv"kdjhas"bbabaavv"x
into this:
aa"bb-b-vv"kdjhas"bb-b-vv"x
Now naturally the search an replace will fail if the input varies from that pattern described as the search is looking for those four groups of a's inside the two groups of quoted strings.
Also I tested that regexp using Zeus which can create a regexp with more than 9 groups.
As you can see the regexp requires 10 groups.
I'm not familar with Notpad++ so I don't know if it supports that many groups.
If your data have variable number of occurrences of quoted strings, then it is not possible to perform replacements only via regex at least in its form offered by Notepad++.
To replace using regex, you would need to perform regex find in existing regex match. As far as I know such a functionality is not available in Notepad++ regexes.
Self-answer
I may have been reaching for the stars in trying to get Notepad++ to do this regex replace, but I think I found a workaround.
The actual task I was attempting involved creating a SQL Server VALUES list from an Excel spreadsheet, where I was copying and pasting selected cells into Notepad++. The delimiters are \t and \r\n. But, cells can have linefeeds too, which are delimited by ". So, I was going to replace these linefeeds with <br> (or something like it), so that
"line1
line2"
would become "line1<br>line2", before processing the actual end-of-row line feeds.
Having such parsing work reliably, especially when more than two lines were in a single cell, may have been too much to ask of Notepad++'s regex capability.
So I came up with a workaround that seems to be working:) Basically it starts with selecting a blank "dummy" column to the right of my column selection (which I can insert if I'm partially selecting from the middle). This will leave a trailing \t at the end of each row, which effectively sets these EOL's apart from ones that might exist with a text cell, freeing me from having to parse line feeds from a "..." field.
So I compiled a macro from the following steps, which seems to be working well:
replace ' with ''
replace \t\r\n with '\)\r\n, \('
replace \t with ', '
replace "" with ''
replace " with <blank>
replace ^ with \(' (cleanup - first row only)
replace ^, \('$ with <blank> (cleanup - last row only)
Example transformation:
from
line1 line 2
"line3
line3b
line3c" line 4
to
('line1', 'line 2')
, ('line3
line3b
line3c', 'line 4')
which can now be easily modified into a SELECT statement:
SELECT *
FROM (VALUES('line1', 'line 2')
, ('line3
line3b
line3c', 'line 4')
) t(a,b)