remove number and special character using owa_pattern - regex

Using owa_pattern.change in Oracle 9i.
Is it possible to remove a number and the trailing special character (please note: only the trailing one) from a string?
By "special character" I mean a character that is neither a letter nor a digit,
e.g. _, #, $, etc.
For example:
String = TEST_STRING_10
The desired output would be TEST_STRING (notice that only the trailing special character _ was removed).
I have already figured out how to remove the number but am stuck on the special character part.
I have this code so far:
OWA_PATTERN.CHANGE (string, '\d', '', 'g');
I'd appreciate any input.
Thanks!

Try the following.
OWA_PATTERN.CHANGE (string, '[^a-zA-Z]+$', '');
Regular expression
[^a-zA-Z]+ any character except: 'a' to 'z', 'A' to 'Z'
(1 or more times (matching the most amount possible))
$ before an optional \n, and the end of the string
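For readers who want to try the pattern outside Oracle, here is a minimal Python sketch of the same substitution. It assumes Python's re module handles this simple pattern the same way OWA_PATTERN does; the helper name strip_trailing_non_letters is made up for illustration.

```python
import re

def strip_trailing_non_letters(text):
    """Remove every trailing character that is not an ASCII letter."""
    # [^a-zA-Z]+ : one or more characters that are not letters
    # $          : anchored at the end of the string
    return re.sub(r'[^a-zA-Z]+$', '', text)

print(strip_trailing_non_letters('TEST_STRING_10'))  # TEST_STRING
```

Digits and underscores in the middle of the string are left alone, because the match has to reach the end of the string.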

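This will do it (on Oracle 10g or later; REGEXP_REPLACE is not available in 9i):
DECLARE
  input_str VARCHAR2(255) := 'TEST_STRING_10';
  result VARCHAR2(255);
BEGIN
  -- Keep everything before the final "_<digits>" group via the \1 backreference.
  result := REGEXP_REPLACE(input_str, '([[:alnum:]_].*)_[[:digit:]]+', '\1', 1, 0, 'c');
END;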

How to check if a RegEx matches all the target string?

I need to check if a regex pattern matches with all the target string.
For example, if the pattern is '[0-9]+':
Target string '123' should result True
Target string '123' + sLineBreak should result False
The code should look like the following:
uses
System.RegularExpressions;
begin
if(TRegEx.IsFullMatch('123' + sLineBreak, '[0-9]+'))
then ShowMessage('Match all')
else ShowMessage('Not match all');
end;
I've tried TRegEx.Match(...).Success and TRegEx.IsMatch without success and I'm wondering if there is an easy way for checking if a pattern matches the whole target string.
I've also tried using ^ - start of line and $ - end of line but without any success.
uses
System.RegularExpressions;
begin
if(TRegEx.IsMatch('123' + sLineBreak, '^[0-9]+$'))
then ShowMessage('Match all')
else ShowMessage('Not match all');
end;
Here you can find an online test demonstrating that if the target string ends with a new line, the regex still matches even using start/end of line.
Make sure the whole string matches:
\A[0-9]+\z
Explanation
\A the beginning of the string
[0-9]+ any character of: '0' to '9' (1 or more times (matching the most amount possible))
\z the end of the string
Also, see "What's the difference between \z and \Z in a regular expression and when and how do I use it?"
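The difference between $ and a strict end-of-string anchor can be sketched in Python. This is an illustration in a different engine, not Delphi code: Python's re module spells the strict end anchor \Z (it behaves like \z in PCRE), and re.fullmatch is Python's own whole-string check.

```python
import re

target = '123\n'  # digits followed by a line break

# '$' also matches just before a trailing newline, so this still succeeds.
dollar_match = re.search(r'^[0-9]+$', target) is not None

# In Python, '\Z' matches only at the very end of the string.
strict_match = re.search(r'\A[0-9]+\Z', target) is not None

# re.fullmatch requires the whole string to match, with no explicit anchors.
full_match = re.fullmatch(r'[0-9]+', target) is not None

print(dollar_match, strict_match, full_match)  # True False False
```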
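var str = '123';
var sLineBreak = '\n';
console.log(str.match(/^\d+$/)); // match: '123'
console.log((str + 'b').match(/^\d+$/)); // null: '123b' does not match
console.log((str + sLineBreak).match(/^\d+$/)); // null: '123\n' does not match
In JavaScript you can use: ^\d+$
^ start of string
\d+ one or more digits
$ end of string
Without the m flag, JavaScript's $ matches only at the very end of the input. In PCRE-based engines such as Delphi's TRegEx, $ also matches just before a trailing line break, which is why \z is needed there.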

regex that allows 5-10 characters but can have spaces in-between not counting

Problem
Build a regex statement that allows the following:
minimum 5 characters
maximum 10 characters
can contain whitespace but whitespace does not increment character count
any non-whitespace characters increment character count
Test Cases:
expected_to_pass = ['testa', ' test a', 12342, 1.234, 'test a']
expected_to_fail = [' test', 'test ', ' test ', ' ', 1234, 0.1, ' ','12345678901']
Example regex statements and their purpose
Allow 5-10 non-whitespace characters:
[\S]{5,10}$
Allow 5-10 characters regardless of whitespace:
[\s\S]{5,10}$
I've been farting around with this for a few hours and cannot think of the best way to handle this.
How's this?
\s*(?:[\w\.]\s*){5,10}+$
Or:
\s*(?:[\w\.]\s*){5,10}$
Also, if ANY non-whitespace character goes:
\s*(?:\S\s*){5,10}$
You can test it here
There is a wrong assumption in your question: \w doesn't match all non-space characters; it matches word characters, meaning letters, digits and the underscore. Depending on the language and the flags set, this may or may not include Unicode letters and digits. There are many more non-space characters, e.g. . and |. To match space characters one usually uses \s, so \S matches non-space characters.
You can use ^\s*(?:\S\s*){5,10}$ to check your requirements. You might be able to drop the anchors, if you use some kind of full match functionality (e.g. Java .matches() or Python re.fullmatch).
Depending on the language you use, you might prefer not to use a regex and instead iterate over the string, checking it character by character. This is often faster than a regex.
Pseudocode:
number of chars = 0
for each character from the first to the last character of the string
    if character is not a space
        increment number of chars by 1
return true if number of chars is between 5 and 10 (inclusive)
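Both approaches can be sketched in Python and run against the test cases from the question. This is a minimal sketch: the function names are invented for illustration, and the numeric test values are converted with str() first, which is an assumption about how they would reach the validator.

```python
import re

# Optional leading whitespace, then 5 to 10 groups of
# "one non-space character followed by optional whitespace".
PATTERN = re.compile(r'\s*(?:\S\s*){5,10}')

def is_valid_regex(value):
    # fullmatch makes the ^ and $ anchors unnecessary.
    return PATTERN.fullmatch(str(value)) is not None

def is_valid_loop(value):
    # Count only the characters that are not whitespace.
    count = sum(1 for ch in str(value) if not ch.isspace())
    return 5 <= count <= 10

expected_to_pass = ['testa', ' test a', 12342, 1.234, 'test a']
expected_to_fail = [' test', 'test ', ' test ', ' ', 1234, 0.1, ' ', '12345678901']

for case in expected_to_pass + expected_to_fail:
    print(repr(case), is_valid_regex(case), is_valid_loop(case))
```

The two functions agree on every case in the lists above, so either can be used; the loop version avoids regex backtracking entirely.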
Check this out:
(\s*?\w\s*?){5,10}$
It won't match 1.234 because . is not included in the \w set.
If you need it to be included, then:
(\s*?[\w\.]\s*?){5,10}$
Note that inside a character class | is a literal pipe, so [\w|\.] would also accept the | character.
Cheers

How to exclude newline mark from requests.get().text

I'm trying to extract the numbers from the site response at http://app.lotto.pl/wyniki/?type=dl with the code below:
import requests
import re
url = 'http://app.lotto.pl/wyniki/?type=dl'
p = re.compile(r'[^\d{4}\-\d{2}\-\d{2}]\d+')
response = requests.get(url)
data = re.findall(p, response.text)
print(data)
But instead of ['7', '46', '8', '43', '9', '47'] I'm getting ['\n7', '\n46', '\n8', '\n43', '\n9', '\n47']. How can I get rid of the "\n"?
Your regex is not appropriate because [^\d{4}\-\d{2}\-\d{2}]\d+ matches any single character other than a digit, {, 4, }, - or 2, followed by 1 or more digits. In other words, you turned a sequence into a character class. That negated character class can match a newline, any letter, and a lot more. strip() hides the symptom here but will not help in other contexts; you need to fix the regular expression.
Use
r'(?<!-)\b\d+\b(?!-)'
See the regex and IDEONE demo
This pattern matches 1+ digits (\d+) that are not preceded by a hyphen ((?<!-)) or a word character (\b), and are not followed by a word character (\b) or a hyphen ((?!-)).
Your code will look like:
import requests
import re
url = 'http://app.lotto.pl/wyniki/?type=dl'
p = re.compile(r'(?<!-)\b\d+\b(?!-)')
response = requests.get(url)
data = p.findall(response.text)
print(data)
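To compare the two patterns without a network call, here is a small sketch that runs both on a hard-coded string. The sample text is a hypothetical stand-in for response.text (a date line followed by one number per line); the real response may be laid out differently.

```python
import re

# Hypothetical stand-in for response.text: a date, then the drawn numbers.
sample_text = '2016-03-12\n7\n46\n8\n43\n9\n47'

# Original pattern: the negated class consumes the newline before each number.
original = re.compile(r'[^\d{4}\-\d{2}\-\d{2}]\d+')
# Fixed pattern: whole numbers that do not touch a hyphen on either side.
fixed = re.compile(r'(?<!-)\b\d+\b(?!-)')

original_result = original.findall(sample_text)
fixed_result = fixed.findall(sample_text)

print(original_result)  # ['\n7', '\n46', '\n8', '\n43', '\n9', '\n47']
print(fixed_result)     # ['7', '46', '8', '43', '9', '47']
```

The date parts are skipped by the fixed pattern because each of them has a hyphen directly before or after it.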
You can strip the \n using the strip() function:
data = [x.strip() for x in re.findall(p, response.text)]
I am assuming that the \n can be at the beginning as well as at the end.
Since your numbers are strings, you can use the lstrip() string method. It removes leading whitespace, including newline and carriage return characters, from the left side of the string (hence the l in lstrip).
You can try something like
print([item.lstrip() for item in data])
to remove your newlines.
Or you can as well overwrite data with the stripped version of itself:
data=[item.lstrip() for item in data]
and then simply print(data).

remove characters from a string in a data frame

I have a data frame where column "ID" has values like these:
1234567_GSM00298873
1238416_GSM90473673
98377829
In other words, some rows have 7 digits followed by "_" followed by letters and digits; other rows have just digits.
I want to remove the digits and the underscore preceding the letters, without affecting the rows that have only digits. I tried
dataframe$ID <- gsub("*_", "", dataframe$ID)
but that only removes the underscore. So I learned that * means zero or more.
Is there a wildcard, and a repetition operator such that I can tell it to find the pattern "anything-seven-times-followed-by-_"?
Thanks!
Your regular expression syntax is incorrect. You have nothing preceding your repetition operator.
dataframe$ID <- gsub('[0-9]+_', '', dataframe$ID)
This matches any character of: 0 to 9 (1 or more times) that is followed by an underscore.
Working Demo
Something like this?:
dataframe$ID <- gsub("[0-9]+_", "", dataframe$ID)
The link http://marvin.cs.uidaho.edu/Handouts/regex.html may help you.
"[0-9]*_" will match zero or more digits followed by '_' (so it also matches a bare '_')
"[0-9]{7}_" will match exactly 7 digits followed by '_'
".{7}_" will match any 7 characters followed by '_'
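The same quantifier can be tried quickly in Python, shown here as a sketch only, since the question itself is about R's gsub. The list ids reuses the values from the question, and the leading ^ anchor is an addition that restricts the match to the start of each value.

```python
import re

ids = ['1234567_GSM00298873', '1238416_GSM90473673', '98377829']

# ^[0-9]{7}_ : exactly seven digits at the start, followed by an underscore.
cleaned = [re.sub(r'^[0-9]{7}_', '', value) for value in ids]

print(cleaned)  # ['GSM00298873', 'GSM90473673', '98377829']
```

Rows without an underscore are returned unchanged because the pattern simply does not match them.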
A different method. If a string has an underscore, return everything after the underscore; if not, return the string. Note that the start position 9 assumes the underscore is always the eighth character.
ID <- c("1234567_GSM00298873", "1238416_GSM90473673", "98377829")
ifelse(grepl("_", ID), substr(x = ID, 9, nchar(ID)), ID)

PostgreSQL regular expression

I have a string like 'ABC3245-bG1353BcvG34'. I need to remove the hyphen and all the letters after the hyphen,
so the above string should become ABC3245135334.
I tried the code below:
select substring('ABC3245-bG1353BcvG34' from 1 for (position('-' in 'ABC3245-bG1353BcvG34')-1))||regexp_replace(substring('ABC3245-bG1353BcvG34' from (position('-' in 'ABC3245-bG1353BcvG34') +1) for length('ABC3245-bG1353BcvG34')),'[a-zA-Z]','')
but I am not able to remove the letters after the hyphen.
I need to remove the hyphen including all the letters after hyphen.
so the above string (ABC3245-bG1353BcvG34) should be ABC3245135334
This suggests that all numbers should remain after the hyphen (in their original order). If that's what you want, you cannot do this with a single regex. Assuming you can have only 1 hyphen in your input:
SELECT substring(input.value from '^[^-]*') ||
regexp_replace(substring(input.value from '[^-]*$'), '\D+', '', 'g')
FROM (SELECT 'ABC3245-bG1353BcvG34'::text AS value) AS input
If you can have multiple hyphens in your input, please define how to handle characters between hyphens.
Fixed version
SELECT a[1] || regexp_replace(a[2], '\D', '', 'g')
FROM string_to_array('ABC3245-bG1353BcvG34', '-') AS a
Or, more convenient to deal with a set (like a table):
SELECT split_part(str, '-', 1)
|| regexp_replace(split_part(str, '-', 2), '\D', '', 'g')
FROM (SELECT 'ABC3245-bG1353BcvG34'::text AS str) tbl
Removes all non-digits after the hyphen. (Assuming there is only one hyphen.) Result:
ABC3245135334
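The split-then-strip idea can be sketched outside the database as well. The Python below is an illustration of the same logic, not PostgreSQL code; the function name is invented, and unlike split_part(str, '-', 2) it keeps the digits from everything after the first hyphen if there happen to be several hyphens.

```python
import re

def strip_letters_after_hyphen(value):
    # Split on the first hyphen only; with no hyphen, tail is empty.
    head, _sep, tail = value.partition('-')
    # \D matches any non-digit, so only the digits of the tail survive.
    return head + re.sub(r'\D', '', tail)

print(strip_letters_after_hyphen('ABC3245-bG1353BcvG34'))  # ABC3245135334
```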
First version
Missed that OP wants to remove all letters after -.
SELECT regexp_replace('ABC3245-bG1353BcvG34', '-\D*', '')
Result:
ABC32451353BcvG34
Regex explained:
- .. literal hyphen -
\D .. class shorthand for "non-digits".
* .. 0 or more times
Removes the first hyphen and everything that follows until the first digit.
A RegEx that would work:
[a-zA-Z0-9]+(?=-)
Do note that this requires the string to actually contain the hyphen. It uses a lookahead to grab a substring of all alphanumeric characters followed by a hyphen.