Regex - Don't match nulls after words - regex

I'm splitting a string on pipes (|) and the regex [^|]* is working fine when there are missing fields, but it's matching null characters after words:
GARBAGE|||GA|30604
yields matches
GARBAGE, null, null, null, GA, null, 30604, null
I've also tried [^|]+ which yields matches
GARBAGE, GA, 30604
EDIT: What I want is GARBAGE, null, null, GA, 30604. And |||| would yield matches null, null, null, null, null.
I'm referencing matches by index, so how can I fix the regex so it matches field by field, including nulls if there is no other data in the field?

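This is how splitting works, so you should use a split-type function rather than a match.
Every split implementation has a built-in bias about how it treats empty fields.
Your case is simple in that it splits on a single character; in such cases a regex is normally not needed.
If you do use a regex here, that bias cannot be reproduced without lookahead/lookbehind: [^|]* can also match the empty string immediately after each word (just before the following pipe), which is where the extra nulls come from.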
# (?:^|(?<=\|))([^|]*)(?=\||$)
(?:
^ # BOS
| # or
(?<= \| ) # Pipe behind
)
( [^|]* ) # (1), Optional non-pipe chars
(?=
\| # Pipe ahead
| # or
$ # EOS
)
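For comparison, here is a minimal JavaScript sketch (not part of the original answer; the variable names are illustrative) suggesting that a plain split and the lookaround pattern above give the same field list, with empty strings standing in for the nulls. It assumes a runtime whose regex engine supports lookbehind.

```javascript
// Sample input taken from the question.
const line = "GARBAGE|||GA|30604";

// Option 1: a plain split keeps the empty fields.
const bySplit = line.split("|");

// Option 2: the lookaround pattern from the answer, applied globally.
const fieldPattern = /(?:^|(?<=\|))([^|]*)(?=\||$)/g;
const byRegex = [...line.matchAll(fieldPattern)].map((m) => m[1]);

console.log(bySplit); // [ 'GARBAGE', '', '', 'GA', '30604' ]
console.log(byRegex); // [ 'GARBAGE', '', '', 'GA', '30604' ]

// Four pipes delimit five empty fields.
const onlyPipes = [..."||||".matchAll(fieldPattern)].map((m) => m[1]);
console.log(onlyPipes.length); // 5
```

Since both routes agree on this input, the split is probably the simpler choice whenever the delimiter is a single literal character.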

While not exactly what you want, perhaps you could turn it into rows and work with it that way:
select nvl(regexp_substr( str, '([^|]*)\|{0,1}', 1, level, 'i', 1 ), 'null') part
from ( select 'GARBAGE|||GA|30604' str from dual )
connect by level <= regexp_count( str, '\|' ) + 1;
Specify which row (field) by adding a where clause where level equals the row (field) you want:
select nvl(regexp_substr( str, '([^|]*)\|{0,1}', 1, level, 'i', 1 ), 'null') part
from ( select 'GARBAGE|||GA|30604' str from dual )
where level = 4
connect by level <= regexp_count( str, '\|' ) + 1;

Related

How to find any non-digit characters using RegEx in ABAP

I need a Regular Expression to check whether a value contains any other characters than digits between 0 and 9.
I also want to check the length of the value.
The RegEx I've made: ^([0-9]\d{6})$
My test values are: 123Z45 and 123456
The ABAP code:
FIND ALL OCCURRENCES OF REGEX '^([0-9]\d{6})$' IN L_VALUE RESULTS DATA(LT_RESULTS).
I'm expecting a result in LT_RESULTS when I test the first value, '123Z45', because it contains a non-digit character.
But LT_RESULTS is empty in nearly every test case.
Your expression ^([0-9]\d{6})$ translates to:
^ - start of input
( - begin capture group
[0-9] - a character between 0 and 9
\d{6} - six digits (digit = character between 0 and 9)
) - end capture group
$ - end of input
So it will only match 1234567 (7 digit strings), not 123456, or 123Z45.
If you just need to find a string that contains non-digits you could use the following instead (it matches only when the non-digits form one contiguous run, so 1a2b3 is rejected): ^\d*[^\d]+\d*$
* - previous element may occur zero, one or more times
[^\d] - ^ right after [ means "NOT", i.e. any character which is not a digit
+ - previous element may occur one or more times
Example:
const expression = /^\d*[^\d]+\d*$/;
const inputs = ['123Z45', '123456', 'abc', 'a21345', '1234f', '142345'];
console.log(inputs.filter(i => expression.test(i)));
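As a small complement, the sketch below separates the two checks the question asks for: "exactly six digits" and "contains any non-digit". The helper names isSixDigits and hasNonDigit are hypothetical, and the six-digit length is an assumption based on the CO / STRLEN suggestion in this thread. It also shows one input that the anchored pattern above does not catch.

```javascript
// Hypothetical helpers, not taken from the original answers.
const isSixDigits = (value) => /^\d{6}$/.test(value); // digits only, length 6
const hasNonDigit = (value) => /\D/.test(value);      // any non-digit anywhere

// The anchored pattern from the answer, for comparison.
const singleRun = /^\d*[^\d]+\d*$/;

for (const value of ["123Z45", "123456", "1234567", "a1b"]) {
  console.log(value, isSixDigits(value), hasNonDigit(value), singleRun.test(value));
}
// "a1b" contains non-digits, yet singleRun rejects it because the
// non-digits are split into two runs; hasNonDigit still reports true.
```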
You can also use this character class if you want to extract non-digit group:
DATA(l_guid) = '0074162D8EAA549794A4EF38D9553990680B89A1'.
DATA(regx) = '[[:alpha:]]+'.
DATA(substr) = match( val = l_guid
regex = regx
occ = 1 ).
It finds the first group of alphabetic characters and returns it.
If you just want to check whether such characters exist, or how many groups of them your string contains, the count built-in function is your friend:
DATA(how_many) = count( val = l_guid regex = regx ).
DATA(yes) = boolc( count( val = l_guid regex = regx ) > 0 ).
Match and count exist since ABAP 7.50.
If you don't need a Regular Expression for something more complex, ABAP has some nice comparison operators for you, such as CO (Contains Only), CA (Contains Any) and NA (contains Not Any). Something like:
IF L_VALUE CO '0123456789' AND STRLEN( L_VALUE ) = 6.

Regular expression - Remove special characters except single white space

From Stack Overflow, I got a standard regular expression
to eliminate:
a) special characters
b) digits
c) runs of two or more spaces (collapsed to a single space)
and to keep:
d) - (hyphen)
e) ' (single quote)
SELECT ID, REGEXP_REPLACE(REGEXP_REPLACE(forenames, '[^A-Za-z-]', ' '),'\s{2,}',' ') , REGEXP_REPLACE(REGEXP_REPLACE(surname, '[^A-Za-z-]', ' '),'\s{2,}',' ') , forenames, surname from table1;
Instead of two functions, how can I get the result with a single function?
Also, trying to include ' (single quote) as \' is not working in regexp_replace.
Thanks.
Oracle Setup:
CREATE TABLE test_data ( id, value ) AS
SELECT 1, '123a45b£$- ''c45d#{e''' FROM DUAL
Query:
SELECT ID,
REGEXP_REPLACE(
value,
'[^a-zA-Z'' -]| +( )',
'\1'
)
FROM test_data
Output:
ID | REGEXP_REPLACE(VALUE,'[^A-ZA-Z''-]|+()','\1')
-: | :--------------------------------------------
1 | ab- 'cde'
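The same single-pass idea can be tried outside Oracle. The following is only a sketch in JavaScript, with a hypothetical helper name cleanName: one alternative deletes every disallowed character, the other captures the last space of a run so that the replacement keeps exactly one space.

```javascript
// Hypothetical helper mirroring the pattern used in the query above.
// Alternative 1: any character that is not a letter, quote, space or hyphen.
// Alternative 2: one or more spaces followed by a captured final space.
// "$1" is empty when alternative 1 matched, and a single space otherwise.
const cleanName = (value) => value.replace(/[^a-zA-Z' -]| +( )/g, "$1");

console.log(cleanName("123a45b£$- 'c45d#{e'"));     // ab- 'cde'
console.log(cleanName("Mary   Ann O'Neil-Smith9")); // Mary Ann O'Neil-Smith
```

One caveat of doing everything in a single pass: spaces that only become adjacent after a character between them is removed (for example "a 1 b") are not collapsed, because each part of the input is visited once.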

What is this regex expression doing?

What does this expression mean?
Pattern.compile("^.*(?=.*\\d).*$", Pattern.CASE_INSENSITIVE | Pattern.COMMENTS)
I tried to split each part of the expression, but could not get its meaning. please help me on this.
From regex101.com:
TL;DR:
Matches any String that contains at least one digit (characters '0' to '9').
As a side note I'd like to point out that this is a horrendous way to do so, and could be replaced by the following:
Pattern.compile("\\d");
I basically removed all the nonsense greedy fillers and the useless anchors. Use this regex with Matcher#find() method and not Matcher#matches().
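To make the equivalence concrete, here is a minimal sketch in JavaScript rather than Java (the sample strings are made up). It assumes single-line input, since . does not match line terminators, and it drops Pattern.COMMENTS, which has no direct JavaScript counterpart and does not affect this whitespace-free pattern.

```javascript
// The original pattern and the simplified one, translated to JavaScript.
const original = /^.*(?=.*\d).*$/i;
const simplified = /\d/;

// Made-up single-line samples.
const samples = ["abc", "abc1", "7", "", "no digits here", "12hh34ddd567uuu"];

// test() searches anywhere in the string, like Matcher#find() in Java.
const agree = samples.every((s) => original.test(s) === simplified.test(s));
console.log(agree); // true
```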
There are two parts to this regex.
1. The part up to (but not including) the digit.
2. The part from the digit to the end of the string.
The regex is processed left to right.
The first thing it sees is .*. This tells it to go directly to the
end of the string and then backtrack toward the start to satisfy
the next thing it sees, which is (?=.*\d).
Inside that assertion the .* ends up matching nothing: the previous .*
gives back one character at a time, and the assertion first succeeds
when a digit is directly ahead of the current position.
So the search effectively progresses to the left until it reaches the
last digit in the string.
Once that is found, the final .* matches that digit and everything after it up to the end of
the string. This is part 2 described above.
Visually, it can be seen if you add some capture groups, and test it on some
real input.
^
( .* ) # (1)
(?=
( .* ) # (2)
( \d ) # (3)
)
( .* ) # (4)
$
Output:
** Grp 0 - ( pos 0 , len 15 )
12hh34ddd567uuu
** Grp 1 - ( pos 0 , len 11 )
12hh34ddd56
** Grp 2 - ( pos 11 , len 0 ) EMPTY
** Grp 3 - ( pos 11 , len 1 )
7
** Grp 4 - ( pos 11 , len 4 )
7uuu
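The group dump above can be reproduced with a short JavaScript sketch (an illustration only; the original question uses Java, and the variable names are made up):

```javascript
// Same pattern as above, with the whitespace and comments removed.
const grouped = /^(.*)(?=(.*)(\d))(.*)$/;
const m = grouped.exec("12hh34ddd567uuu");

console.log(m.index, m[0].length); // 0 15
console.log(m.slice(1));           // [ '12hh34ddd56', '', '7', '7uuu' ]
```

Group 2 is empty and group 3 holds the last digit, which is consistent with the first .* backtracking from the end of the string until a digit is directly ahead.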

how to get out string oracle regex

I have the following string and I'm trying to extract the 1111111 and 33333333333 without the |
character
SELECT regexp_substr('7|1111111|2222222|33333333333|0||20140515|||false|0|0|0|0|0|','*[|]*[|][0-9]*')FROM dual
Using REGEXP_REPLACE may be a bit simpler;
SELECT REGEXP_REPLACE('7|1111111|2222222|33333333333|0||20140515|||false|0|0|0|0|0|',
'^([^|]*[|]){1}([^|]*).*$', '\2') FROM dual;
> 1111111
SELECT REGEXP_REPLACE('7|1111111|2222222|33333333333|0||20140515|||false|0|0|0|0|0|',
'^([^|]*[|]){3}([^|]*).*$', '\2') FROM dual;
> 33333333333
You can choose column by choosing how many pipes to skip in the {1} part.
A short explanation of the regexp:
^([^|]*[|]){3} -- Matches 3 groups of {optional characters}{pipe}
([^|]*) -- Matches the next field (the one we want)
.*$ -- Matches the rest of the string
What we want is the second parenthesized group, that is, we replace the whole string by the back reference \2.
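The same replace-with-backreference idea can be checked quickly in JavaScript. This is just a sketch using the question's sample string; the variable names are illustrative.

```javascript
const record = "7|1111111|2222222|33333333333|0||20140515|||false|0|0|0|0|0|";

// Skip one "field plus pipe" group, keep the next field, drop the rest.
const second = record.replace(/^([^|]*[|]){1}([^|]*).*$/, "$2");

// Skip three groups to reach the fourth field.
const fourth = record.replace(/^([^|]*[|]){3}([^|]*).*$/, "$2");

// Empty fields work too: the sixth field of the sample is empty.
const sixth = record.replace(/^([^|]*[|]){5}([^|]*).*$/, "$2");

console.log(second);        // 1111111
console.log(fourth);        // 33333333333
console.log(sixth === "");  // true
```

Note that when the input has fewer pipes than the quantifier asks for, the pattern does not match and replace returns the input unchanged, so a caller may want to check for that case.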
Because the "|" separators are always present, it's simpler to extract fields with a plain substring function than with regular expressions.
Just find the positions of the corresponding separators in the source string and extract the content between them:
with test_data as (
select
'7|1111111|2222222|33333333333|0||20140515|||false|0|0|0|0|0|ABC' as s,
8 as field_number -- test 1, 3, 8, 10 and 16
from dual
)
select
field_number,
substr(
s,
decode( field_number,
1,1,
instr(s,'|',1,field_number - 1) + 1
),
(
decode( instr(s,'|',1,field_number),
0, length(s)+ 1,
instr(s,'|',1,field_number)
)
-
decode( field_number,
1, 1,
instr(s,'|',1,field_number - 1) + 1
)
)
) as field_value
from
test_data
This variant works with empty fields, non-numeric fields and so on.
A possible simplification is to append additional separators to the start and the end of the string:
with test_data as (
select
(
'|' ||
'7|1111111|2222222|33333333333|0||20140515|||false|0|0|0|0|0|ABC' ||
'|'
) as s, -- additional separators appended before and after original string
10 as field_number -- test 1, 3, 8, 10 and 16
from dual
)
select
field_number,
substr(
s,
instr(s, '|', 1, field_number) + 1,
(
instr(s, '|', 1, field_number + 1)
-
(instr(s, '|', 1, field_number) + 1)
)
) as field_value
from
test_data
;
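The separator-position approach carries over to other languages as well. Below is a rough JavaScript sketch of the simplified variant (wrap the string in separators, then slice between the n-th and the (n+1)-th one). The helpers nthIndexOf and fieldAt are hypothetical names, and returning null for a field that does not exist is an assumption, not something the SQL above does.

```javascript
// Position of the n-th occurrence of sep in s (1-based), or -1 if absent.
function nthIndexOf(s, sep, n) {
  let pos = -1;
  for (let i = 0; i < n; i++) {
    pos = s.indexOf(sep, pos + 1);
    if (pos === -1) return -1;
  }
  return pos;
}

// Field number fieldNumber (1-based) of a sep-delimited string.
function fieldAt(s, fieldNumber, sep = "|") {
  const wrapped = sep + s + sep; // extra separators before and after
  const start = nthIndexOf(wrapped, sep, fieldNumber);
  const end = nthIndexOf(wrapped, sep, fieldNumber + 1);
  if (start === -1 || end === -1) return null; // no such field
  return wrapped.slice(start + 1, end);
}

const s = "7|1111111|2222222|33333333333|0||20140515|||false|0|0|0|0|0|ABC";
console.log(fieldAt(s, 1));  // 7
console.log(fieldAt(s, 8));  // (empty string)
console.log(fieldAt(s, 10)); // false
console.log(fieldAt(s, 16)); // ABC
```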

regexp_substr skips over empty positions

With this code to return the nth value in a pipe delimited string...
regexp_substr(int_record.interfaceline, '[^|]+', 1, i)
it works fine when all values are present
Mike|Male|Yes|20000|Yes so the 3rd value is Yes (correct)
but if the string is
Mike|Male||20000|Yes, the 3rd value is 20000 (not what I want)
How can I tell the expression to not skip over the empty values?
TIA
Mike
The regexp_substr works this way:
If occurrence is greater than 1, then the database searches for the
second occurrence beginning with the first character following the
first occurrence of pattern, and so forth. This behavior is different
from the SUBSTR function, which begins its search for the second
occurrence at the second character of the first occurrence.
So the pattern [^|] will look for NON pipes, meaning it will skip consecutive pipes ("||") looking for a non-pipe char.
You might try:
select trim(regexp_substr(replace('A|test||string', '|', '| '), '[^|]+', 1, 4)) from dual;
This will replace a "|" with a "| " and allow you to match based on the pattern [^|]
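The skipping behaviour and the padding workaround are easy to see outside the database. The JavaScript sketch below is only an illustration with made-up variable names; it mimics the replace and trim steps of the query above.

```javascript
const row = "Mike|Male||20000|Yes";

// [^|]+ needs at least one non-pipe character, so the empty field
// between the two adjacent pipes never produces a match.
const nonEmptyRuns = row.match(/[^|]+/g);
console.log(nonEmptyRuns[2]); // 20000

// Padding every pipe with a space gives each field at least one character.
const padded = row.replace(/\|/g, "| ");
const paddedFields = padded.match(/[^|]+/g).map((f) => f.trim());
console.log(paddedFields[2] === ""); // true
console.log(paddedFields[3]);        // 20000
```

As with the SQL version, the trim also strips any genuine leading or trailing spaces from the field values.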
I had a similar problem with a CSV file, where my separator was the semicolon (;).
So I started with an expression like the following one:
select regexp_substr(';2;;4;', '[^;]+', 1, i) from dual
letting i iterate from 1 to 5.
And of course it didn't work either.
To get the empty parts I just say they could be at the beginning (^;), or in the middle (;;) or at the end (;$). And or-ing all of this together gives:
select regexp_substr(';2;;4;', '[^;]+|^;|;;|;$', 1, i) from dual
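And believe it or not: testing for i from 1 to 5, it works!
But let's not forget the last detail: with this approach you get ; or ;; for fields that were originally empty. Note also that a run of three or more separators (for example a;;;b) still yields too few matches, because ;; consumes two separators at once.
The next lines show how to get rid of them easily by replacing them with empty strings (nulls):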
with stage1 as (
select regexp_substr(';2;;4;', '[^;]+|^;|;;|;$', 1, 2) as F from dual
)
select case when F like '%;' then '' else F end from stage1
OK. This should be the best solution for you.
SELECT
REGEXP_REPLACE ( 'Mike|Male||20000|Yes',
'^([^|]*\|){2}([^|]*).*$',
'\2' )
TEXT
FROM
DUAL;
So for your problem
SELECT
REGEXP_REPLACE ( INCOMINGSTREAMOFSTRINGS,
'^([^|]*\|){N-1}([^|]*).*$',
'\2' )
TEXT
FROM
DUAL;
--INCOMINGSTREAMOFSTRINGS is your complete string with delimiter
--You should pass n-1 to obtain nth position
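The N-1 placeholder above has to be substituted before the pattern is used. As a hedged illustration in JavaScript, the pattern can be built from the field number at run time. The function name nthField is hypothetical, it assumes n is a positive integer, and it uses a match instead of a replace so that a missing field can be reported as null.

```javascript
// Hypothetical helper: n-th field (1-based) of a pipe-delimited string.
function nthField(text, n) {
  // Skip n - 1 "field plus pipe" groups, then capture the next field.
  const pattern = new RegExp("^(?:[^|]*\\|){" + (n - 1) + "}([^|]*)");
  const m = pattern.exec(text);
  return m ? m[1] : null; // null when the string has fewer than n fields
}

const person = "Mike|Male||20000|Yes";
console.log(nthField(person, 1));        // Mike
console.log(nthField(person, 3) === ""); // true
console.log(nthField(person, 4));        // 20000
console.log(nthField(person, 6));        // null
```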
ALTERNATE 2:
WITH T AS (SELECT 'Mike|Male||20000|Yes' X FROM DUAL)
SELECT
X,
REGEXP_REPLACE ( X,
'^([^|]*).*$',
'\1' )
Y1,
REGEXP_REPLACE ( X,
'^[^|]*\|([^|]*).*$',
'\1' )
Y2,
REGEXP_REPLACE ( X,
'^([^|]*\|){2}([^|]*).*$',
'\2' )
Y3,
REGEXP_REPLACE ( X,
'^([^|]*\|){3}([^|]*).*$',
'\2' )
Y4,
REGEXP_REPLACE ( X,
'^([^|]*\|){4}([^|]*).*$',
'\2' )
Y5
FROM
T;
ALTERNATE 3:
SELECT
REGEXP_SUBSTR ( REGEXP_REPLACE ( 'Mike|Male||20000|Yes',
'\|',
';' ),
'(^|;)([^;]*)',
1,
1,
NULL,
2 )
AS FIRST,
REGEXP_SUBSTR ( REGEXP_REPLACE ( 'Mike|Male||20000|Yes',
'\|',
';' ),
'(^|;)([^;]*)',
1,
2,
NULL,
2 )
AS SECOND,
REGEXP_SUBSTR ( REGEXP_REPLACE ( 'Mike|Male||20000|Yes',
'\|',
';' ),
'(^|;)([^;]*)',
1,
3,
NULL,
2 )
AS THIRD,
REGEXP_SUBSTR ( REGEXP_REPLACE ( 'Mike|Male||20000|Yes',
'\|',
';' ),
'(^|;)([^;]*)',
1,
4,
NULL,
2 )
AS FOURTH,
REGEXP_SUBSTR ( REGEXP_REPLACE ( 'Mike|Male||20000|Yes',
'\|',
';' ),
'(^|;)([^;]*)',
1,
5,
NULL,
2 )
AS FIFTH
FROM
DUAL;
You can use the following :
with l as (select 'Mike|Male||20000|Yes' str from dual)
select regexp_substr(str,'(".*"|[^|]*)(\||$)',1,level,null,1)
from dual,l
where level=3/*use any position*/ connect by level <= regexp_count(str,'([^|]*)(\||$)')
As a complement to @tbone's response...
Oddly, my Oracle didn't recognize the blank space character in this list: [^|]
Cases like this can be confusing, and it is hard to see what is going wrong.
Try this regex instead: ([^|]| )+. Also, to detect a possible empty first item, it is better to add the blank space before the separator rather than after it:
' |'
trim(regexp_substr(replace('A|test||string', '|', ' |'), '([^|]| )+', 1, 4))