Searching backwards with regex

I have the following different texts
line1: SELECT column1,
line2: column2,
line3: RTRIM(LTRIM(blah1)) || ' ' || RTRIM(LTRIM(blah3)),
line4: RTRIM(LTRIM(blah3)) || ' ' || RTRIM(LTRIM(some1)) outColumn,
line5: RTRIM(LTRIM(blah3)) || ' ' || RTRIM(LTRIM(some1)) something,
line6: somelast
The following is what I want to get out of each line.
Basically, I want to start the regex search from the end of the string and keep going until the first space. I can take out the comma later on.
line1: column1
line2: column2
line3: <space> nothing found
line4: outColumn
line5: something
line6: somelast
Basically, I will be fine if I can start the regex from the end and walk towards the first space.
There will probably have to be a special case for line3, as I don't expect anything back.
I am using groovy for this regex.

Iterate over the lines and match each line against the regex:
(?i).*(column\w+).*
The word you're looking for is captured in group 1 ($1).

I think you want:
(\w*)\s*,?$
Where match group one contains the last word on the line (it is empty when the line does not end in a word, as in line3).
Anchoring the expression to the end of the line with $ is effectively starting the regex from the end.
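As a rough illustration of the end-anchored pattern above, here is a minimal Python sketch run over the six sample lines. The question uses Groovy; this pattern should behave the same way on ASCII input like this, but that is an assumption, and the helper name last_word is made up for the example.

```python
import re

# End-anchored pattern from the answer: a trailing word, then an optional comma.
LAST_WORD = re.compile(r"(\w*)\s*,?$")

def last_word(line: str) -> str:
    """Return the word at the end of the line, or '' if the line does not end in a word."""
    match = LAST_WORD.search(line)
    return match.group(1) if match else ""

lines = [
    "SELECT column1,",
    "column2,",
    "RTRIM(LTRIM(blah1)) || ' ' || RTRIM(LTRIM(blah3)),",
    "RTRIM(LTRIM(blah3)) || ' ' || RTRIM(LTRIM(some1)) outColumn,",
    "RTRIM(LTRIM(blah3)) || ' ' || RTRIM(LTRIM(some1)) something,",
    "somelast",
]

for line in lines:
    print(repr(last_word(line)))
```

Whitespace after the trailing comma would make the group come back empty, so it is probably safest to strip each line first.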

Related

Extract book name from a string in Hive

My data is something like this -
1124 An Orphan's Journey
234 Red Dragon
35600 You'll Know When It's Time
It has two values, the first one is Book ID, and the second one is the book name.
I used the split function in Hive but that doesn't look proper.
SELECT split(books, '\\ ')[0] book_id,
split(books, '\\ ')[1] + ' ' +
split(books, '\\ ')[2] + ' ' +
split(books, '\\ ')[3] + ' ' +
split(books, '\\ ')[4] as book_name
FROM books;
So far values are good but I don't feel it is the right approach.
Please help.
You may use
REGEXP_EXTRACT(books, '^\\d+', 0)
to get the book ID and
REGEXP_EXTRACT(books, '\\s+(\\S.*)', 1)
to extract the book name. The second regex can be more verbose: for example, '^\\d+\\s+(\\S.*)' also checks that the string starts with digits before the whitespace.
Here,
^\d+ - matches one or more (+) digits at the start of the string (^)
\s+(\S.*) - matches one or more whitespace chars (\s+) and then captures into Group 1 a non-whitespace char (\S) followed by the rest of the string (.* matches zero or more chars other than line break chars, as many as possible). Note that the index argument is set to 1 in the second call to REGEXP_EXTRACT so that only the Group 1 value is returned, without the leading whitespace.
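To see what the two patterns capture, here is a small Python sketch that applies them to the sample rows. It only approximates REGEXP_EXTRACT with Python's re module (Hive is not involved), and the function name split_book and the choice to return an empty string when nothing matches are assumptions made for the example.

```python
import re

def split_book(row: str) -> tuple:
    """Split 'ID title' into (book_id, book_name) using the two patterns above."""
    book_id = re.search(r"^\d+", row)           # whole match, like index 0
    book_name = re.search(r"\s+(\S.*)", row)    # capture group 1, like index 1
    return (
        book_id.group(0) if book_id else "",
        book_name.group(1) if book_name else "",
    )

for row in ["1124 An Orphan's Journey", "234 Red Dragon", "35600 You'll Know When It's Time"]:
    print(split_book(row))
```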

SQLite Pattern Matching with Extra Character

My database contains these rows:
DuPage
Saint John
What queries could I use that would match people entering either 'Du Page' or 'SaintJohn': in other words: adding an extra character (at any position) that shouldn't be there, or removing a character (at any position) that should be there?
The first example has a possible workaround: I could just remove the space character from the 'Du Page' input before searching the table, but I cannot do that with the second example unless there was some way of saying 'match 'SaintJohn' with the database text that has had all spaces removed', or alternatively 'match a database row that has every letter in 'SaintJohn' somewhere in the row.
Remove spaces from the column and the search text:
select * from tablename
where replace(textcolumn, ' ', '') like '%' || replace('<your search string>', ' ', '') || '%'
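The query above can be tried out with Python's built-in sqlite3 module. This is a minimal sketch against an in-memory database; the table and column names are the placeholders from the answer, and the search helper is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tablename (textcolumn TEXT)")
conn.executemany("INSERT INTO tablename VALUES (?)", [("DuPage",), ("Saint John",)])

def search(term: str) -> list:
    """Compare the column and the search term with all spaces removed from both."""
    rows = conn.execute(
        "SELECT textcolumn FROM tablename "
        "WHERE replace(textcolumn, ' ', '') LIKE '%' || replace(?, ' ', '') || '%'",
        (term,),
    )
    return [row[0] for row in rows]

print(search("Du Page"))
print(search("SaintJohn"))
```

This only handles extra or missing spaces, not arbitrary inserted or deleted characters.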

Split single row string into multiple rows by multi-character delimiter in Oracle

I have attempted to use the question "Splitting string into multiple rows in Oracle" and adjust it to my needs; however, I'm not very confident with regex and have not been able to solve it by searching.
Currently that question's answers use a lot of regexp_substr and so on, with [^,]+ as the pattern, so it splits on a single comma. I need it to split on a multi-character delimiter (e.g. #;), but that kind of pattern treats every single character in the class as a delimiter, so any # or ; elsewhere in the text also causes a split.
I've worked out that the pattern (#;+) will match every group of #; but I cannot work out how to invert it, as done above, to split the row into multiple rows.
I'm sure I'm just missing something simple so any help would be greatly appreciated!
I think you should use:
[^#;+]+
instead of
(#;+)
This matches runs of characters other than #, ; and +, so the string is split wherever any one of those characters occurs.
You can change it according to your requirements, but in the regex I shared, I am considering #, ; and + as delimiters.
So, in the end, the query would look something like this:
with tbl(str) as (
select ' My, Delimiter# Hello My; Delimiter World My Delimiter My Delimiter test My Delimiter ' from dual
)
SELECT LEVEL AS element,
REGEXP_SUBSTR( str ,'([^#;+]+)', 1, LEVEL, NULL, 1 ) AS element_value
FROM tbl
CONNECT BY LEVEL <= regexp_count(str, '[#;+]')+1
Output:
ELEMENT ELEMENT_VALUE
1 My, Delimiter
2 Hello My
3 Delimiter World My Delimiter My Delimiter test My Deli
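To make the behaviour of the character class concrete, here is a short Python sketch (not Oracle) that contrasts it with splitting on the literal two-character delimiter the question asked about. Both helper names are hypothetical, and the literal split is shown only for comparison; it is not part of the answer.

```python
import re

sample = " My, Delimiter# Hello My; Delimiter World My Delimiter My Delimiter test My Delimiter "

def split_on_any(text: str) -> list:
    """Split on any single '#', ';' or '+', as the class [^#;+]+ does."""
    return [part.strip() for part in re.findall(r"[^#;+]+", text)]

def split_on_literal(text: str, delimiter: str = "#;") -> list:
    """Split only where the full multi-character delimiter occurs."""
    return [part.strip() for part in text.split(delimiter)]

print(split_on_any(sample))
print(split_on_any("a#b#;c"))      # a lone '#' also causes a split
print(split_on_literal("a#b#;c"))  # only the full '#;' causes a split
```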
-- EDIT --
If you want to split only on runs of two or more # or ; characters, and not on a single occurrence, I found the regex below, but again it is not supported by Oracle.
(?:(?:(?![;#]+).#(?![;#]+).|(?![;#]+).;(?![;#]+).|(?![;#]+).)*)+
So, I found no easy way apart from the query below, which will not split on a single occurrence of # or ; between two delimiters:
with tbl(str) as (
select ' My, Delimiter;# Hello My Delimiter ;;# World My Delimiter ; My Delimiter test#; My Delimiter ' from dual
)
SELECT LEVEL AS element,
REGEXP_SUBSTR( str ,'([^#;]+#?[^#;]+;?[^#;]+)', 1, LEVEL, NULL, 1 ) AS element_value
FROM tbl
CONNECT BY LEVEL <= regexp_count(str, '[#;]{2,}')+1
Output:
ELEMENT ELEMENT_VALUE
1 My, Delimiter
2 Hello My Delimiter
3 World My Delimiter ; My Delimiter test
4 My Delimiter

Regex for replacing all characters excepting last and first non space ones

I have emails stored in a SAP HANA table column of char datatype. I need to replace all letters and digits with the '*' char, except the first and last non-whitespace chars. I wrote the regex like this: regex_replace('abcd#efg.hij', '(?!^)[A-Za-z0-9](?!$)', '*')
It works fine and I get masked email 'a***#***.**j'.
But it goes wrong when there are some white spaces at the start and/or the end of the email. For example, if the email string is ' abcd#efg.hij ' the result would be
' ****#***.**** ' while I need ' a***#***.**j '
Unfortunately, I cannot trim email before regexing.
Denis, I tried the following in a SELECT statement with the REPLACE_REGEXPR function:
select
REPLACE_REGEXPR('(?!^)[\sA-Za-z0-9](?!$)' IN trim(' abcd#efg.hij ') WITH '*')
from dummy;
It removes the leading and trailing spaces and returns "a***#***.**j"
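The behaviour described above can be reproduced in Python as a rough check. The first helper mirrors the trim-then-mask approach from the answer. The second is a variation that is not from the answer: it leaves the string untrimmed and masks a letter or digit only when it has a non-whitespace character on both sides. Both function names are hypothetical, and whether the second pattern carries over to SAP HANA unchanged is an untested assumption.

```python
import re

def mask_trimmed(email: str) -> str:
    """Answer's approach: trim first, then mask everything except the first and last char."""
    return re.sub(r"(?!^)[\sA-Za-z0-9](?!$)", "*", email.strip())

def mask_keep_spaces(email: str) -> str:
    """Variation (assumption): mask a letter or digit only if both neighbours are non-whitespace."""
    return re.sub(r"(?<=\S)[A-Za-z0-9](?=\S)", "*", email)

print(repr(mask_trimmed(" abcd#efg.hij ")))
print(repr(mask_keep_spaces(" abcd#efg.hij ")))
```

The variation assumes the value has no whitespace inside it; a string with interior spaces would keep the characters next to each space.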

negative look ahead on whole number but preceded by a character(perl)

I have text like this;
2500.00 $120.00 4500 12.00 $23.00 50.0989
I've written a regex:
/(?!$)\d+\.\d{2}/g
I want it to only match 2500.00, 12.00 nothing else.
The requirement is to add the '$' sign to numeric values that have exactly two digits after the decimal point. With the current regex, it adds an extra '$' to the ones that already have a '$' sign. The real task is longer, but I'm just describing it briefly. I know I can use one regex to remove the '$' and then another regex to add '$' to all the desired numbers.
Any help would be appreciated, thanks!
To answer your question, you need a negative lookbehind, which looks before the position where the first digit is:
(?<!\$)
But that's not going to work on its own, as it will match 23.45 of $123.45 and change it into $1$23.45, and it will match 123.45 of 123.456 and change it into $123.456. You want to make sure there are no digits before or after what you match.
s/(?<![\$\d])(\d+\.\d{2})(?!\d)/\$$1/g;
Or the quicker
s/(?<![\$\d])(?=\d+\.\d{2}(?!\d))/\$/g;
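For readers who want to try the lookaround logic outside Perl, here is a minimal Python sketch of the first substitution above. The function name add_dollar is made up for the example; the pattern itself is carried over unchanged apart from the escaping of $ inside the character class.

```python
import re

def add_dollar(text: str) -> str:
    """Prefix '$' to numbers with exactly two decimals that have no '$' or digit before them."""
    # (?<![$\d]) : not preceded by '$' or a digit
    # (?!\d)     : not followed by another digit
    return re.sub(r"(?<![$\d])(\d+\.\d{2})(?!\d)", r"$\1", text)

print(add_dollar("2500.00 $120.00 4500 12.00 $23.00 50.0989"))
```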
This is tricky only because you are trying to include too many functionalities in your single regex. If you manipulate the string first to isolate each number, this becomes trivial, as this one-liner demonstrates:
$ perl -F"(\s+)" -lane's/^(?=\d+\.\d{2}$)/\$/ for @F; print @F;'
2500.00 $120.00 4500 12.00 $23.00 50.0989
$2500.00 $120.00 4500 $12.00 $23.00 50.0989
The full code for this would be something like:
while (<>) { # or whatever file handle or input you read from
    my @line = split /(\s+)/;
    s/^(?=\d+\.\d{2}$)/\$/ for @line;
    print @line; # or select your desired means of output
    # my $out = join "", @line; # as string
}
Note that this split is non-destructive because we use parentheses to capture our delimiters. So for our sample input, the resulting list looks like this when printed with Data::Dumper:
$VAR1 = [
'2500.00',
' ',
'$120.00',
' ',
'4500',
' ',
'12.00',
' ',
'$23.00',
' ',
'50.0989'
];
Our regex here is simply anchored at both ends, and allowed to contain digits, followed by a period . and two digits, and nothing else. Because we use a look-ahead assertion, it will insert the dollar sign at the beginning, and keep everything else. Because of the strictness of our regex, we do not need to worry about checking for any other characters, and because we split on whitespace, we do not need to check for whitespace either.
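The same split-then-check idea can be sketched in Python. As in the Perl version, the capture group in the split pattern keeps the whitespace runs, so joining the tokens restores the original spacing. The function name is hypothetical, and re.fullmatch stands in for the anchored look-ahead substitution.

```python
import re

def add_dollar_by_token(line: str) -> str:
    """Split on whitespace (keeping it), then prefix '$' to tokens like 12.00."""
    # The capture group makes re.split return the whitespace runs as list items too.
    tokens = re.split(r"(\s+)", line)
    # A token qualifies only if it is entirely digits, a period, and two digits.
    fixed = ["$" + tok if re.fullmatch(r"\d+\.\d{2}", tok) else tok for tok in tokens]
    return "".join(fixed)

print(add_dollar_by_token("2500.00 $120.00 4500 12.00 $23.00 50.0989"))
```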
You can use this pattern:
s/(?<!\S)\d+\.\d{2}(?!\S)/\$${^MATCH}/gp
or
s/(?<!\S)(?=\d+\.\d{2}(?!\S))/\$/g
I think it is the shorter way.
(?<!\S) not preceded by a character that is not a whitespace character
(?!\S) not followed by a character that is not a whitespace character
The main benefit of these double negations is that they automatically cover the beginning and the end of the string.
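A quick Python sketch of the double-negation pattern shows the boundary behaviour: a number at the very start or end of the string still qualifies, while one glued to other characters does not. The function name add_dollar_ws is made up for the example.

```python
import re

def add_dollar_ws(text: str) -> str:
    """Prefix '$' to two-decimal numbers delimited by whitespace or the string boundaries."""
    return re.sub(r"(?<!\S)(\d+\.\d{2})(?!\S)", r"$\1", text)

print(add_dollar_ws("12.00"))                # the whole string is the number
print(add_dollar_ws("x12.00 12.00x 12.00"))  # only the last token stands alone
```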