extracting data in table using REGEXP_REPLACE - regex

I asked this question before but got only a partial answer, possibly because I marked that answer as accepted too early, so I am trying my luck again.
My data looks like this:
reported_name
--------------
HEMA using TM-0497
TEGDMA
Blue HEMA using TM-0510
Norbloc using TM-0545
SIMAA2 using TM-0547
Tensile Strength using
Appearance using TM-0011
Haze using TM-0561
Blue HEMA using CRM-0126
t-Amyl Alcohol
Transmittance TM-0509
DK (edge corrected) TM-0534
Decanoic Acid CRM-0200
Glycol using CRM-0094
% Ketotifen Released using TM-0578_V2_RELEASE
TMPTMA using CRM-0208
% Ketotifen Released using TM-0578_V2_RE
Ca2DTPA Assay using USP_541 (3 day drying)
Water using TM-0449 OOS Analyst 1, Equip 1, set 2
Leachable Polymer using CRM-0225 Sample B
DMA using TM-0500 2333-30e
Decanoic Acid using TM-0622 - Rev # 1
Ketotifen Fumarate Assay using TM-0624_ASSAY_RC - Rev # 2
Refractive Index using TM-0589 - Day 8
Refractive Index using TM-0589 - Rev # 0 - Day 5
I need my output to be like:
reported_name                                              analysis_method     revision_number
---------------------------------------------------------  ------------------  ---------------
HEMA using TM-0497                                         TM-0497             null
TEGDMA                                                     null                null
Blue HEMA using TM-0510                                    TM-0510             null
Norbloc using TM-0545                                      TM-0545             null
SIMAA2 using TM-0547                                       TM-0547             null
Tensile Strength using                                     null                null
Appearance using TM-0011                                   TM-0011             null
Haze using TM-0561                                         TM-0561             null
Blue HEMA using CRM-0126                                   CRM-0126            null
t-Amyl Alcohol                                             null                null
Transmittance TM-0509                                      TM-0509             null
DK (edge corrected) TM-0534                                TM-0534             null
Decanoic Acid CRM-0200                                     null                null
Glycol using CRM-0094                                      CRM-0094            null
% Ketotifen Released using TM-0578_V2_RELEASE              TM-0578_V2_RELEASE  null
TMPTMA using CRM-0208                                      CRM-0208            null
% Ketotifen Released using TM-0578_V2_RE                   TM-0578_V2_RE       null
Ca2DTPA Assay using USP_541 (3 day drying)                 USP_541             null
Water using TM-0449 OOS Analyst 1, Equip 1, set 2          TM-0449             null
Leachable Polymer using CRM-0225 Sample B                  CRM-0225            null
DMA using TM-0500 2333-30e                                 TM-0500             null
Decanoic Acid using TM-0622 - Rev # 1                      TM-0622             Rev # 1
Ketotifen Fumarate Assay using TM-0624_ASSAY_RC - Rev # 2  TM-0624_ASSAY_RC    Rev # 2
Refractive Index using TM-0589 - Day 8                     TM-0589             null
Refractive Index using TM-0589 - Rev # 0 - Day 5           TM-0589             Rev # 0
Is this possible? I can't seem to make it work right.
I still need to find a way to extract analysis_method when the method code, such as CRM-0200, appears without "using", as in the string 'Decanoic Acid CRM-0200'.
This is what I have so far:
select distinct t.reported_name,
(case
when regexp_like(t.reported_name, '.* using (.*)([ ]?[-]?[ ]?Rev.*)')
then regexp_replace(t.reported_name, '.* using (.*)([ ]?- Rev.*)', '\1')
when regexp_like(t.reported_name, '.* using (.*)')
then regexp_replace(t.reported_name, '.* using (.*)', '\1')
else '' end) as analysis_method_regexp,
(case when regexp_like(t.reported_name, '.*[ ]?[-]?[ ]?(Rev[ ]?#[ ]?[0-9]+).*')
then regexp_replace(t.reported_name, '.*[ ]?[-]?[ ]?(Rev[ ]?#[ ]?[0-9]+).*', '\1')
else '' end) as revision_regexp
from test t;

Here's how I approached the problem. First describe the search criteria for each output column:
- 1st column is the whole line.
- If the line has "using" in it, then the 2nd column is the string following the word "using".
- If the line has the string "Rev #" in it, the 3rd column is the "Rev # " plus the string following it.
Then code the rules using regular expressions for the matching rules:
with sel as(
-- Test data that matches all combinations.
select 'TEGDMA' str from dual
union
select 'Blue HEMA using TM-0510' str from dual
union
select 'Decanoic Acid using TM-0622 - Rev # 1' str from dual
union
select '% Ketotifen Released using TM-0578_V2_RELEASE' str from dual
union
select 'Leachable Polymer using CRM-0225 Sample B' str from dual
union
select 'Ketotifen Fumarate Assay using TM-0624_ASSAY_RC - Rev # 2' str from dual
)
SELECT str reported_name,
       CASE
         -- If the line contains " using " then...
         WHEN regexp_like(str, '.* using .*') THEN
           -- Use groups. The 1st group is the beginning of the line, ending
           -- with "... using ". The 2nd group starts right after the space
           -- following "using" and matches any run of characters that are
           -- not a space. (Inside brackets $ is a literal dollar sign, not
           -- an end-of-line anchor, so it is left out.) Whatever follows is
           -- matched and discarded.
           -- Return the second group only.
           regexp_replace(str, '^(.* using )([^ ]*).*', '\2')
       END analysis_method,
       CASE
         -- If the line contains " Rev # (a number)" then...
         WHEN regexp_like(str, '.* Rev # \d.*') THEN
           -- Capture "Rev # " plus all of the digits that follow it.
           regexp_replace(str, '(.* )(Rev # \d+).*', '\2')
       END revision_number
from sel;
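The question also asks about rows such as 'Decanoic Acid CRM-0200', where the method code is not preceded by "using". One possible way around that, sketched here in Python rather than Oracle SQL, is to match the shape of the code instead of its position. This is a different technique from the answer above. It assumes, from the sample rows only, that method codes start with TM-, CRM- or USP_ followed by digits. The function name split_reported_name is hypothetical.

```python
import re

# Assumption (inferred from the sample rows only): a method code starts with
# TM-, CRM- or USP_, then digits, then an optional suffix such as _V2_RELEASE.
METHOD = re.compile(r"\b(?:TM-|CRM-|USP_)\d+(?:_\w+)?")
# "Rev # " followed by one or more digits.
REVISION = re.compile(r"Rev # \d+")


def split_reported_name(name):
    """Return (analysis_method, revision_number); None where nothing matches."""
    method = METHOD.search(name)
    revision = REVISION.search(name)
    return (
        method.group(0) if method else None,
        revision.group(0) if revision else None,
    )
```

With this sketch, 'Decanoic Acid CRM-0200' yields CRM-0200 even though the word "using" is absent, while 'Tensile Strength using' and 'TEGDMA' yield no method at all.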


no preceding characters in regexp statement

I have tried to use a negative lookbehind in a regexp statement and have looked at other solutions online, but they don't work for me, so I am obviously doing something wrong.
I want a match returned for the first row only; the other rows should return null. Essentially I need CT CHEST or CT LUNG.
Any assistance appreciated, TIA.
with test (id, description) as (
select 1, 'CT CHEST HIGH RESOLUTION, NO CONTRAST' from dual union all --want this
select 2, 'INJECTION, THORACIC TRANSFORAMEN EPIDURAL, NON NEUROLYTIC W IMAGE GUIDANCE.' from dual union all --do not want this
select 3, 'The cow came back. But the dog went for a walk' from dual) --do not want this
select id, description, regexp_substr(description, '(?<![a-z]ct).{1,20}(CHEST|THOR|LUNG)',1,1,'i') from test;
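Oracle regular expressions do not support lookbehind, so match the preceding character explicitly instead. This works on the sample rows:
regexp_substr(description,'([^A-Z]|^)[CT].{1,20}(CHEST|THOR|LUNG)',1,1,'i')
Note that [CT] is a bracket expression that matches a single C or T; to require the literal letters CT, write CT without the brackets.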
Leverage Oracle Subexpression Parameter to Check for CT
I would leverage subexpressions with a pattern like this:
regexp_substr(description, '(^| )((ct ).*((CHEST)|(THOR)|(LUNG)))', 1, 1, 'i', 2)
- subexpression 1 looks for the beginning of the line or a space: (^| )
- subexpression 3 looks for 'CT': (ct )
- allow for other characters: .*
- subexpressions 5, 6, 7: (CHEST)|(THOR)|(LUNG)
- subexpression 2 contains subexpression 3 and subexpression 4
I use the last optional parameter to specify that I want subexpression 2.
WITH test (id, description) AS (
  SELECT 1, 'CT CHEST HIGH RESOLUTION, NO CONTRAST' FROM dual UNION ALL  -- want this
  SELECT 2, 'INJECTION, THORACIC TRANSFORAMEN EPIDURAL, NON NEUROLYTIC W IMAGE GUIDANCE.' FROM dual UNION ALL  -- do not want this
  SELECT 3, 'The cow came back. But the dog went for a walk' FROM dual  -- do not want this
)
SELECT id
     , description
     , regexp_substr(description, '(^| )((ct ).*((CHEST)|(THOR)|(LUNG)))', 1, 1, 'i', 2)
FROM test;
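The same idea, matching the preceding character explicitly and returning only a capture group, can be tried outside the database. The sketch below uses Python's re module, which is a different engine from Oracle's, so it only illustrates the logic. The function name ct_match is hypothetical.

```python
import re

# (?:^| ) requires the start of the string or a space before "ct ",
# so the "CT" at the end of a longer word such as ABSTRACT is not accepted.
# Group 1 plays the role of subexpression 2 in the Oracle pattern.
PATTERN = re.compile(r"(?:^| )(ct .*(?:CHEST|THOR|LUNG))", re.IGNORECASE)


def ct_match(description):
    """Return the 'CT ... CHEST/THOR/LUNG' part, or None when there is no match."""
    match = PATTERN.search(description)
    return match.group(1) if match else None
```

For the first sample row this returns CT CHEST; for the other two rows it returns None.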

Trim UK Postcodes in PostgreSQL

I know a similar question exists but that solution doesn't work in PostgreSQL.
What I'm trying to do: create new columns holding a copy of the full postcode, then trim the copy down first to sector, then to district, and finally to area. For example, copy postcode to postcode_sector, then trim postcode_sector.
TA15 1PL becomes:
TA15 1 for sector
TA15 for district
TA for area.
What I've tried:
I created new columns in the table for each, then ran:
SELECT postcode_sector FROM postcodes
RTRIM (Left([postcode_sector],(Len([postcode_sector])-2)) + " " +
Right([postcode_sector],3));
This throws a syntax error. I also tried:
Select
Postcode,
RTRIM(LEFT(Postcode, PATINDEX('%[0-9]%', Postcode) - 1)) As AreaTest
From postcodes
This doesn't work because PostgreSQL has no PATINDEX function. From here I looked at an alternative approach using the SUBSTRING function, helped by the excellent tutorial here. Using:
SELECT
substring (postcode FROM 1 FOR 6) AS postcode_sector
FROM postcodes;
This gets me part of the way: I now have a column with TA15 1, but because of the way the system works I also have T15 1A. Is there a way in PostgreSQL to count the number of characters in a cell and delete one? Out of wider interest, is it faster to use TRIM than SUBSTRING? I'm executing across the full postcode file, which is ~27 million rows.
I'm not that familiar with UK post codes, but according to Wikipedia's format, this should handle all cases:
select postcode,
m[1] || m[2] || ' ' || m[3] sector,
m[1] || m[2] district,
m[1] area
from src,
regexp_matches(postcode, '^([A-Z]{1,2})([0-9A-Z]{1,2}) ([0-9])([A-Z]{2})') m
http://rextester.com/KREPX19406
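To see how the four capture groups map onto area, district and sector, here is a small Python sketch that reuses the same pattern outside PostgreSQL. The trailing $ anchor and the function name split_postcode are additions for this illustration and are not part of the query above.

```python
import re

# Same four groups as the regexp_matches pattern: area letters, the rest of
# the outward code, the sector digit, and the two unit letters.
# The trailing $ is an extra assumption that nothing follows the postcode.
POSTCODE = re.compile(r"^([A-Z]{1,2})([0-9A-Z]{1,2}) ([0-9])([A-Z]{2})$")


def split_postcode(postcode):
    """Return area, district and sector, or None if the format is not matched."""
    match = POSTCODE.match(postcode)
    if match is None:
        return None
    area, district_rest, sector_digit, _unit = match.groups()
    district = area + district_rest
    return {
        "area": area,
        "district": district,
        "sector": f"{district} {sector_digit}",
    }
```

Because the area group is limited to letters, shorter postcodes such as M1 1AA split correctly without counting characters.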
This seems to do it:
with postcodes (postcode) as (
values ('TA15 1PL')
)
select substring(postcode from '[^0-9]{2}[0-9]+ [0-9]') as sector,
substring(postcode from '[^0-9]{2}[0-9]+') as district,
substring(postcode from '([^0-9]+)') as area
from postcodes;
returns
sector | district | area
-------+----------+-----
TA15 1 | TA15 | TA

SQLite extract string from text in column

I have a Spatialite Database and I've imported OSM Data into this database.
With the following query I get all motorways:
SELECT * FROM lines
WHERE other_tags GLOB '*A [0-9]*'
AND highway='motorway'
I use GLOB '*A [0-9]*' here, because in Germany every Autobahn begins with A, followed by a number (like A 73).
There is a column called other_tags with information about the motorway part:
"bdouble"=>"yes","hazmat"=>"designated","lanes"=>"2","maxspeed"=>"none","oneway"=>"yes","ref"=>"A 73","width"=>"7"
If you look closer there is the part "ref"=>"A 73".
I want to extract the A 73 as the name for the motorway.
How can I do this in sqlite?
If the format doesn't change, meaning you can expect the other_tags field to look like %"ref"=>"A 73","width"=>"7"%, then you can use instr and substr (note that 8 is the length of "ref"=>"):
SELECT substr(other_tags,
instr(other_tags, '"ref"=>"') + 8,
instr(other_tags, '","width"') - 8 - instr(other_tags, '"ref"=>"')) name
FROM lines
WHERE other_tags GLOB '*A [0-9]*'
AND highway='motorway'
The result will be
name
A 73
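The query above relies on "width" being the tag that follows "ref". A possible variant that drops that assumption is sketched below using Python's built-in sqlite3 module with an in-memory table. It cuts the string at the start of the ref value and then stops at the next double quote. The table contents are the sample row from the question; the two-step subquery is this sketch's own choice, not part of the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lines (highway TEXT, other_tags TEXT)")
conn.execute(
    "INSERT INTO lines VALUES (?, ?)",
    (
        "motorway",
        '"bdouble"=>"yes","hazmat"=>"designated","lanes"=>"2",'
        '"maxspeed"=>"none","oneway"=>"yes","ref"=>"A 73","width"=>"7"',
    ),
)

# Inner query: everything after '"ref"=>"' (8 characters long).
# Outer query: keep the text up to, but not including, the next double quote.
QUERY = """
SELECT substr(rest, 1, instr(rest, '"') - 1) AS name
FROM (
    SELECT substr(other_tags, instr(other_tags, '"ref"=>"') + 8) AS rest
    FROM lines
    WHERE highway = 'motorway'
      AND other_tags GLOB '*"ref"=>"A [0-9]*'
)
"""

names = [row[0] for row in conn.execute(QUERY)]
```

The GLOB filter is tightened to look for the ref tag itself, so rows where "A 7" only appears in some other tag are skipped.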
Check the following conditions:
other_tags LIKE 'A%' -- Begins with 'A'.
abs(substr(other_tags, 3,2)) <> 0.0 -- The two characters starting at the 3rd character are a number.
length(other_tags) = 4 -- The length of other_tags is 4.
So here is how your query should look:
SELECT *
FROM lines
WHERE other_tags LIKE 'A%'
AND abs(substr(other_tags, 3,2)) <> 0.0
AND length(other_tags) = 4
AND highway = 'motorway'

Google Sheets Formula to Extract and Convert Currency from € or £ to USD

I'm trying to do the following:
1. Check the cell for N/A or No; if it has either of these then it should output N/A or No
2. Check the cell for either £ or € or Yes; if it has one of these then it would continue to step 3. If it has $ then it should repeat the same input as the output.
3. Extract currency from cell using: REGEXEXTRACT(A1, "\$\d+") or REGEXEXTRACT(A1, "\£\d+") (I assume that's the best way)
4. Convert it to $ USD using GoogleFinance("CURRENCY:EURUSD") or GoogleFinance("CURRENCY:GBPUSD")
5. Output the original cell but replacing the extracted currency from step 3 with the output from step 4.
Examples: (Original --> Output)
N/A --> N/A
No --> No
Alt --> Alt
Yes --> Yes
Yes £10 --> Yes $12.19
Yes £10 per week --> Yes $12.19 per week
Yes €5 (Next) --> Yes $5.49 (Next)
Yes $5 22 EA --> Yes $5 22 EA
Yes £5 - £10 --> Yes $5.49 - $12.19
I can't get a working IF statement. I could do this in normal code, but I can't work it out in spreadsheet formulas.
I've tried modifying @Rubén's answer many times to include N/A (the text, not the Sheets error). I also tried to make any USD inputs come out unchanged, but I really can't get the hang of IF/OR/AND in Excel/Google Sheets.
=ArrayFormula(
SUBSTITUTE(
A1,
OR(IF(A1="No","No",REGEXEXTRACT(A1, "[\£|\€]\d+")),IF(A1="N/A","N/A",REGEXEXTRACT(A1, "[\£|\€]\d+"))),
IF(
A1="No",
"No",
TEXT(
REGEXEXTRACT(A1, "[\£|\€](\d+)")*
IF(
"€"=REGEXEXTRACT(A1, "([\£|\€])\d+"),
GoogleFinance("CURRENCY:EURUSD"),
GoogleFinance("CURRENCY:GBPUSD")
),
"$###,###"
)
)
)
)
In the formula above I tried adding an OR() before the first IF statement to include N/A as an option. In the one below I tried it in various other ways (replace line 4 with this):
IF(
OR(
A1="No",
"No",
REGEXEXTRACT(A1, "[\£|\€]\d+");
A1="No",
"No",
REGEXEXTRACT(A1, "[\£|\€]\d+")
)
)
But that doesn't work either. I thought using ; was a way to separate the OR expressions but apparently not.
Re: Rubén's latest code 16/10/2016
I've modified it to:
=ArrayFormula(
IF(NOT(ISBLANK(A2)),
IF(IFERROR(SEARCH("$",A2),0),A2,IF(A2="N/A","N/A",IF(A2="No","No",IF(A2="Alt","Alt",IF(A2="Yes","Yes",
SUBSTITUTE(
A2,
REGEXEXTRACT(A2, "[\£|\€]\d+"),
TEXT(
REGEXEXTRACT(A2, "[\£|\€](\d+)")
*
VLOOKUP(
REGEXEXTRACT(A2, "([\£|\€])\d+"),
{
{"£";"€"},
{GoogleFinance("CURRENCY:GBPUSD");GoogleFinance("CURRENCY:EURUSD")}
},
2,0),
"$###,###"
)
)
)))))
,"")
)
This fixes:
Blank cells no longer throw #N/A
Yes only cells no longer throw #N/A
Added another text value Alt
Changes the format of the currency to 0 decimal places rather than my original request of 2 decimal places.
As you can see in the image below, the two red cells aren't quite correct because I never thought of this scenario: the second of the two values stays in its input form and is not converted to USD.
Direct answer
Try
=ArrayFormula(
IF(IFERROR(SEARCH("$",A1:A6),0),A1:A6,IF(A1:A6="N/A","N/A",IF(A1:A6="No","No",
SUBSTITUTE(
A1:A6,
REGEXEXTRACT(A1:A6, "[\£|\€]\d+"),
TEXT(
REGEXEXTRACT(A1:A6, "[\£|\€](\d+)")
*
VLOOKUP(
REGEXEXTRACT(A1:A6, "([\£|\€])\d+"),
{
{"£";"€"},
{GoogleFinance("CURRENCY:GBPUSD");GoogleFinance("CURRENCY:EURUSD")}
},
2,0),
"$###,###.00"
)
)
)))
)
Result
+---+------------------+---------------------+
| | A | B |
+---+------------------+---------------------+
| 1 | N/A | N/A |
| 2 | No | No |
| 3 | Yes £10 | Yes $12.19 |
| 4 | Yes £10 per week | Yes $12.19 per week |
| 5 | Yes €5 (Next) | Yes $5.49 (Next) |
+---+------------------+---------------------+
Explanation
OR function
Instead of using the OR function, the above formula uses nested IF functions.
REGEXEXTRACT
Instead of using a REGEXEXTRACT function for each currency symbol, a single bracket expression that accepts either symbol was used. (Inside the brackets the | is a literal character rather than an OR operator, so [£€] would be enough.) Example
REGEXEXTRACT(A1:A6, "[\£|\€]\d+")
Three regular expressions were used:
get currency symbol and the amount [\£|\€]\d+
get the amount [\£|\€](\d+)
get the currency symbol ([\£|\€])\d+
Currency conversion
Instead of using nested IF to handle currency conversion rates, VLOOKUP and an array are used. This could make the formula easier to maintain, assuming that more currencies could be added in the future.
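For comparison, the whole rule set, including cells that hold two amounts such as Yes £5 - £10, can be sketched in ordinary code. The Python below is only an illustration of the logic, not a Sheets formula. The fixed rates are assumptions chosen to match the question's examples (in the sheet they come from GoogleFinance), and the names RATES_TO_USD and to_usd are hypothetical.

```python
import re
from decimal import Decimal, ROUND_HALF_UP

# Assumed, fixed rates that reproduce the question's examples
# (£10 -> $12.19, €5 -> $5.49). Real rates change constantly.
RATES_TO_USD = {"£": Decimal("1.219"), "€": Decimal("1.098")}

# Group 1 is the currency symbol, group 2 the whole-number amount.
AMOUNT = re.compile(r"([£€])(\d+)")


def to_usd(cell):
    """Replace every £ or € amount in the cell with its USD equivalent."""

    def convert(match):
        symbol, digits = match.group(1), match.group(2)
        usd = (Decimal(digits) * RATES_TO_USD[symbol]).quantize(
            Decimal("0.01"), rounding=ROUND_HALF_UP
        )
        return f"${usd}"

    # Cells with no £ or € amount (N/A, No, Alt, Yes, "$5 ...") are
    # returned unchanged because there is nothing to substitute.
    return AMOUNT.sub(convert, cell)
```

Substituting every match, rather than extracting only the first one, is what lets the second amount in a range be converted as well.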

fetching multiple values from a string using regular expression

I have a table temp that has a column named "REMARKS".
Create script
Create table temp (id number,remarks varchar2(2000));
Insert script
Insert into temp values (1,'NAME =GAURAV Amount=981 Phone_number =98932324 Active Flag =Y');
Insert into temp values (2,'NAME =ROHAN Amount=984 Phone_number =98932333 Active Flag =N');
Now I want to fetch the corresponding values of NAME, Amount, Phone_number and Active Flag from the remarks column of the table.
I thought of using a regular expression, but I am not comfortable using them.
I tried substr and instr to fetch the name from the remarks column, but to fetch all four I would need to write PL/SQL. Can we achieve this using a regular expression?
Can I get output (cursor) like:
id Name Amount phone_number Active flag
------------------------------------------
1 Gaurav 981 98932324 Y
2 Rohan 984 98932333 N
-------------------------------------------
Thanks for your help
You can use something like:
select regexp_replace(remarks, '.*NAME *=([^ ]*).*', '\1') name,
       regexp_replace(remarks, '.*Amount *=([^ ]*).*', '\1') amount,
       regexp_replace(remarks, '.*Phone_number *=([^ ]*).*', '\1') ph_number,
       regexp_replace(remarks, '.*Active Flag *=([^ ]*).*', '\1') flag
from temp;
NAME                 AMOUNT               PH_NUMBER            FLAG
-------------------- -------------------- -------------------- --------------------
GAURAV               981                  98932324             Y
ROHAN                984                  98932333             N
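The pattern behind each regexp_replace call is "field name, optional spaces, an equals sign, then everything up to the next space". A quick way to check that logic against the two sample rows is the Python sketch below. It is not Oracle code, and the names FIELDS and parse_remarks are hypothetical.

```python
import re

# The four labels exactly as they appear in the remarks column.
FIELDS = ("NAME", "Amount", "Phone_number", "Active Flag")


def parse_remarks(remarks):
    """Return a dict mapping each label to its value, or None if it is missing."""
    result = {}
    for field in FIELDS:
        # Label, optional spaces, "=", then the run of non-space characters.
        match = re.search(re.escape(field) + r" *=(\S*)", remarks)
        result[field] = match.group(1) if match else None
    return result
```

Unlike regexp_replace, which returns the whole input string when the pattern does not match, this sketch returns None for a missing label.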