Trim UK Postcodes in PostgreSQL - regex

I know a similar question exists, but that solution doesn't work in PostgreSQL.
What I'm trying to do: create new columns holding a copy of the full postcode, then trim this down first to sector, then to district, and finally to area. For example, copy postcode to postcode_sector, then trim postcode_sector.
TA15 1PL becomes:
TA15 1 for sector
TA15 for district
TA for area.
What I've tried:
Create new columns in table for each, then;
SELECT postcode_sector FROM postcodes
RTRIM (Left([postcode_sector],(Len([postcode_sector])-2)) + " " +
Right([postcode_sector],3));
This throws a syntax error. I also tried:
Select
Postcode,
RTRIM(LEFT(Postcode, PATINDEX('%[0-9]%', Postcode) - 1)) As AreaTest
From postcodes
This doesn't work because PostgreSQL has no PATINDEX function. From there I looked at an alternative approach using the SUBSTRING function, helped by the excellent tutorial here. Using:
SELECT
substring (postcode FROM 1 FOR 6) AS postcode_sector
FROM postcodes;
This gets me part way: I now have a column with TA15 1, but due to the way the system works I also have T15 1A. Is there a way in PostgreSQL to count the number of characters in a cell and delete one? Out of wider interest, is it faster to use TRIM than SUBSTRING? I'm executing across the full postcode file, which is ~27 million rows.

I'm not that familiar with UK post codes, but according to Wikipedia's format, this should handle all cases:
select postcode,
m[1] || m[2] || ' ' || m[3] sector,
m[1] || m[2] district,
m[1] area
from src,
regexp_matches(postcode, '^([A-Z]{1,2})([0-9A-Z]{1,2}) ([0-9])([A-Z]{2})') m
http://rextester.com/KREPX19406

This seems to do it:
with postcodes (postcode) as (
values ('TA15 1PL')
)
select substring(postcode from '[^0-9]{2}[0-9]+ [0-9]') as sector,
substring(postcode from '[^0-9]{2}[0-9]+') as district,
substring(postcode from '([^0-9]+)') as area
from postcodes;
returns
sector | district | area
-------+----------+-----
TA15 1 | TA15     | TA
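
For readers who want to check the splitting logic outside the database, here is a minimal Python sketch of the same idea. It is an adaptation, not either answer's exact pattern: the name `split_postcode` is hypothetical, and the pattern assumes the district part starts with a digit and that the postcode contains exactly one space before the final three characters.

```python
import re

# Assumed layout: 1-2 area letters, a district part that starts with a digit,
# one space, the sector digit, then the two-letter unit.
POSTCODE_RE = re.compile(r"^([A-Z]{1,2})([0-9][0-9A-Z]?) ([0-9])([A-Z]{2})$")


def split_postcode(postcode):
    """Return sector, district and area for a postcode, or None if it does not match."""
    match = POSTCODE_RE.match(postcode.strip().upper())
    if match is None:
        return None
    area, district_rest, sector_digit, _unit = match.groups()
    district = area + district_rest
    return {
        "sector": f"{district} {sector_digit}",
        "district": district,
        "area": area,
    }


if __name__ == "__main__":
    for code in ("TA15 1PL", "B1 1AA", "EC1A 1BB"):
        print(code, split_postcode(code))
```

Because the area group accepts one or two letters, single-letter areas such as B1 1AA are split as well, which a fixed-width SUBSTRING cannot do.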

Related

google sheets, splitting and stacking a paragraph

I have a 3 row by 2 column table
1Q18 hello. testing row one.
2Q18 There are about 7.5b people. That's alot.
3Q18 Last sentence. To be stacking.
I want to split each paragraph into sentences and pair each sentence with its quarter label. The output would be:
1Q18 hello
1Q18 testing row one
2Q18 There are about 7.5b people
2Q18 That's alot
3Q18 Last sentence
3Q18 To be stacking
I can get one line to work with:
=TRANSPOSE({split(rept(A1&" ",counta(split(B1,".")))," ");split(B1,".")})
which would give me:
1Q18 hello
1Q18 testing row one
I need a formula that works down 100 rows, so I can't manually repeat the formula for every row and stack the results with {} and ;
I've also tried using:
=map(A1:A,B2:B,LAMBDA(x,y,TRANSPOSE({split(rept(x&" ",counta(split(y,".")))," ");split(y,".")})))
but get a
Error Result should be a single column.

try:
=INDEX(QUERY(SPLIT(FLATTEN(LAMBDA(x, IF(x="",,A1:A&""&x))
(SPLIT(B1:B&" ", ". ", ))), ""), "where Col2 is not null", ))

Try the formula below:
=QUERY(REDUCE(,B1:B3,LAMBDA(a,x,{a;TRANSPOSE(INDEX(INDEX(A1:A,ROW(x)) & " " & SPLIT(SUBSTITUTE(x,". ",".|"),"|")))})),"offset 1",0)
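
The logic these formulas implement (split each paragraph on a period followed by a space, then repeat the label for every sentence) can be sketched in Python to make the expected result explicit. This is only an illustration of the technique, not a replacement for the Sheets formulas; `stack_sentences` is a hypothetical name.

```python
def stack_sentences(rows):
    """Turn (label, paragraph) pairs into one (label, sentence) pair per sentence."""
    stacked = []
    for label, paragraph in rows:
        # Split on ". " (period plus space) so decimals such as "7.5b" stay intact.
        for sentence in paragraph.split(". "):
            sentence = sentence.strip().rstrip(".")
            if sentence:
                stacked.append((label, sentence))
    return stacked


rows = [
    ("1Q18", "hello. testing row one."),
    ("2Q18", "There are about 7.5b people. That's alot."),
    ("3Q18", "Last sentence. To be stacking."),
]

if __name__ == "__main__":
    for label, sentence in stack_sentences(rows):
        print(label, sentence)
```

Splitting on ". " rather than "." alone is the same trick the formulas use to keep "7.5b" in one piece.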

Here's another formula you can try:
=ARRAYFORMULA(
QUERY(
REDUCE({0,0},
QUERY(A1:A&"❄️"&SPLIT(B1:B,". ",),
"where Col1 <> '#VALUE!'"),
LAMBDA(a,c,
{a;SPLIT(c,"❄️",,)})),
"where Col2 is not null offset 1",))

How to extract specific information from strings

I have a dataset with the addresses of authors' affiliations. The addresses differ in length, but the text before the first comma is always the name of the institution and the text after the last comma is always the country. What I want to do is extract the country and create a new variable for it.
I tried this code in Stata. It works to extract the name of institutions.
generate splitat = strpos(institutions ,",")
generate str80 univ = substr(institutions, 1, splitat - 1)
I am wondering whether this code also can be applied to extract the country.
I thought it could check from the end instead from the start?
My dataset looks like the following example:
Natl Taiwan Univ, Inst Epidemiol, Taipei 106, Taiwan
Radboud Univ Nijmegen, Inst Water & Wetland Res, Dept Anim Ecol & Ecophysiol, NL-6525 AJ Nijmegen, Netherlands

There is a specific function in Stata 14+ to look for the last occurrence of a substring (e.g. a specific character) in a string. See help string functions in Stata 14 for documentation of strrpos().
If that is not in your version of Stata, you merely reverse the string, find the substring using the method you already know, and then reverse what you found.
If you are not using the latest version of Stata, it is always a good idea to say so in questions in any forum that supports Stata questions.
clear
input str244 institutions
"Natl Taiwan Univ, Inst Epidemiol, Taipei 106, Taiwan"
"Radboud Univ Nijmegen, Inst Water & Wetland Res, Dept Anim Ecol & Ecophysiol, NL-6525 AJ Nijmegen, Netherlands"
end
compress
gen country = substr(institutions, strrpos(institutions, ",") + 1, .)
local rev strreverse(institutions)
gen country2 = strreverse(substr(`rev', 1, strpos(`rev', ",") - 1))
assert country == country2
l country
+--------------+
| country |
|--------------|
1. | Taiwan |
2. | Netherlands |
+--------------+
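
The same "first comma / last comma" idea can be sketched in Python for comparison. This is a minimal illustration, assuming every address contains at least one comma; `institution_and_country` is a hypothetical helper name.

```python
def institution_and_country(address):
    """Return (text before the first comma, text after the last comma)."""
    institution, _, _ = address.partition(",")   # splits at the first comma
    _, _, country = address.rpartition(",")      # splits at the last comma
    return institution.strip(), country.strip()


addresses = [
    "Natl Taiwan Univ, Inst Epidemiol, Taipei 106, Taiwan",
    "Radboud Univ Nijmegen, Inst Water & Wetland Res, Dept Anim Ecol & Ecophysiol, "
    "NL-6525 AJ Nijmegen, Netherlands",
]

if __name__ == "__main__":
    for address in addresses:
        print(institution_and_country(address))
```

If an address has no comma at all, both values come back as the whole string, so such rows would need separate handling.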

extracting data in table using REGEXP_REPLACE

I already asked this question, but I am not sure why no one is answering back.
I got a partial answer, maybe because I marked it as correct answer... Not sure, but will try my luck here again.
My data looks like this:
reported_name
--------------
HEMA using TM-0497
TEGDMA
Blue HEMA using TM-0510
Norbloc using TM-0545
SIMAA2 using TM-0547
Tensile Strength using
Appearance using TM-0011
Haze using TM-0561
Blue HEMA using CRM-0126
t-Amyl Alcohol
Transmittance TM-0509
DK (edge corrected) TM-0534
Decanoic Acid CRM-0200
Glycol using CRM-0094
% Ketotifen Released using TM-0578_V2_RELEASE
TMPTMA using CRM-0208
% Ketotifen Released using TM-0578_V2_RE
Ca2DTPA Assay using USP_541 (3 day drying)
Water using TM-0449 OOS Analyst 1, Equip 1, set 2
Leachable Polymer using CRM-0225 Sample B
DMA using TM-0500 2333-30e
Decanoic Acid using TM-0622 - Rev # 1
Ketotifen Fumarate Assay using TM-0624_ASSAY_RC - Rev # 2
Refractive Index using TM-0589 - Day 8
Refractive Index using TM-0589 - Rev # 0 - Day 5
I need my output to be like:
reported_name                                              analysis_method     revision_number
---------------------------------------------------------  ------------------  ---------------
HEMA using TM-0497                                         TM-0497             null
TEGDMA                                                     null                null
Blue HEMA using TM-0510                                    TM-0510             null
Norbloc using TM-0545                                      TM-0545             null
SIMAA2 using TM-0547                                       TM-0547             null
Tensile Strength using                                     null                null
Appearance using TM-0011                                   TM-0011             null
Haze using TM-0561                                         TM-0561             null
Blue HEMA using CRM-0126                                   CRM-0126            null
t-Amyl Alcohol                                             null                null
Transmittance TM-0509                                      TM-0509             null
DK () TM-0534                                              TM-0534             null
Decanoic Acid CRM-0200                                     null                null
Glycol using CRM-0094                                      CRM-0094            null
% Ketotifen Released using TM-0578_V2_RELEASE              TM-0578_V2_RELEASE  null
TMPTMA using CRM-0208                                      CRM-0208            null
% Ketotifen Released using TM-0578_V2_RE                   TM-0578_V2_RE       null
Ca2DTPA Assay using USP_541 (3 day drying)                 USP_541             null
Water using TM-0449 OOS Analyst 1                          TM-0449             null
Leachable Polymer using CRM-0225 Sample B                  CRM-0225            null
DMA using TM-0500 2333-                                    TM-0500             null
Decanoic Acid using TM-0622 - Rev # 1                      TM-0622             Rev # 1
Ketotifen Fumarate Assay using TM-0624_ASSAY_RC - Rev # 2  TM-0624_ASSAY_RC    Rev # 2
Refractive Index using TM-0589 - Day 8                     TM-0589             null
Refractive Index using TM-0589 - Rev # 0 - Day 5           TM-0589             Rev # 0
Is this possible, because I can't seem to make it work right.
I still need to find the way to extract analysis_method, when I see things like CRM-0200 in string : 'Decanoic Acid CRM-0200'
This is what I got so far:
select distinct t.reported_name,
(case
when regexp_like(t.reported_name, '.* using (.*)([ ]?[-]?[ ]?Rev.*)')
then regexp_replace(t.reported_name, '.* using (.*)([ ]?- Rev.*)', '\1')
when regexp_like(t.reported_name, '.* using (.*)')
then regexp_replace(t.reported_name, '.* using (.*)', '\1')
else '' end) as analysis_method_regexp,
(case when regexp_like(t.reported_name, '.*[ ]?[-]?[ ]?(Rev[ ]?#[ ]?[0-9]+).*')
then regexp_replace(t.reported_name, '.*[ ]?[-]?[ ]?(Rev[ ]?#[ ]?[0-9]+).*', '\1')
else '' end) as revision_regexp
from test t;

Here's how I approached the problem. First describe the search criteria for each output column:
- 1st column is the whole line.
- If the line has "using" in it, then the 2nd column is the string following the word "using".
- If the line has the string "Rev #" in it, the 3rd column is the "Rev # " plus the string following it.
Then code the rules using regular expressions for the matching rules:
with sel as(
-- Test data that matches all combinations.
select 'TEGDMA' str from dual
union
select 'Blue HEMA using TM-0510' str from dual
union
select 'Decanoic Acid using TM-0622 - Rev # 1' str from dual
union
select '% Ketotifen Released using TM-0578_V2_RELEASE' str from dual
union
select 'Leachable Polymer using CRM-0225 Sample B' str from dual
union
select 'Ketotifen Fumarate Assay using TM-0624_ASSAY_RC - Rev # 2' str from dual
)
SELECT str reported_name,
CASE
-- if line contains " using " then...
WHEN regexp_like(str, '.* using .*') THEN
-- Use groups. The 1st group is the beginning of the line, ending with
-- "... using ". The 2nd group starts right after the space following
-- "using" and matches any run of non-space characters, so it stops at
-- the next space or at the end of the line. The trailing .* consumes
-- whatever follows, which we do not care about.
-- Return the second group only.
regexp_replace(str, '^(.* using )([^ ]*).*', '\2')
END analysis_method,
CASE
-- If the line contains " Rev # (a number)" then return "Rev # " plus
-- all of the digits that follow it.
WHEN regexp_like(str, '.* Rev # \d.*') THEN
regexp_replace(str, '(.* )(Rev # \d+).*', '\2')
END revision_number
from sel;
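
The three rules can also be prototyped in Python before porting them to SQL. The sketch below is an illustration under stated assumptions, not the answer's query: `parse_reported_name` is a hypothetical name, and the fallback that picks up a bare TM- or CRM- code when "using" is absent (as in 'Decanoic Acid CRM-0200') is an assumption about what the asker wants.

```python
import re

USING_RE = re.compile(r"\busing\s+(\S+)")        # token right after "using"
BARE_CODE_RE = re.compile(r"\b(?:TM|CRM)-\d+\w*")  # assumed fallback: bare method code
REV_RE = re.compile(r"Rev\s*#\s*\d+")            # "Rev # " plus all digits


def parse_reported_name(name):
    """Return (analysis_method, revision_number); either may be None."""
    match = USING_RE.search(name)
    if match:
        method = match.group(1)
    else:
        match = BARE_CODE_RE.search(name)
        method = match.group(0) if match else None
    revision = REV_RE.search(name)
    return method, (revision.group(0) if revision else None)


if __name__ == "__main__":
    samples = [
        "HEMA using TM-0497",
        "Tensile Strength using",
        "Decanoic Acid CRM-0200",
        "Ketotifen Fumarate Assay using TM-0624_ASSAY_RC - Rev # 2",
    ]
    for sample in samples:
        print(sample, "->", parse_reported_name(sample))
```

Matching all digits after "Rev # " matters once revision numbers reach two digits; a single-character match would truncate them.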

read table with spaces in one column

I am attempting to extract tables from very large text files (computer logs). Dickoa provided very helpful advice to an earlier question on this topic here: extracting table from text file
I modified his suggestion to fit my specific problem and posted my code at the link above.
Unfortunately I have encountered a complication. One column in the table contains spaces. These spaces are generating an error when I try to run the code at the link above. Is there a way to modify that code, or specifically the read.table function to recognize the second column below as a column?
Here is a dummy table in a dummy log:
> collect.models(, adjust = FALSE)
model npar AICc DeltaAICc weight Deviance
5 AA(~region + state + county + city)BB(~region + state + county + city)CC(~1) 17 11111.11 0.0000000 5.621299e-01 22222.22
4 AA(~region + state + county)BB(~region + state + county)CC(~1) 14 22222.22 0.0000000 5.621299e-01 77777.77
12 AA(~region + state)BB(~region + state)CC(~1) 13 33333.33 0.0000000 5.621299e-01 44444.44
12 AA(~region)BB(~region)CC(~1) 6 44444.44 0.0000000 5.621299e-01 55555.55
>
> # the three lines below count the number of errors in the code above
Here is the R code I am trying to use. This code works if there are no spaces in the second column, the model column:
my.data <- readLines('c:/users/mmiller21/simple R programs/dummy.log')
top <- '> collect.models\\(, adjust = FALSE)'
bottom <- '> # the three lines below count the number of errors in the code above'
my.data <- my.data[grep(top, my.data):grep(bottom, my.data)]
x <- read.table(text=my.data, comment.char = ">")
I believe I must use the variables top and bottom to locate the table in the log because the log is huge, variable and complex. Also, not every table contains the same number of models.
Perhaps a regex expression could be used somehow taking advantage of the AA and the CC(~1) present in every model name, but I do not know how to begin. Thank you for any help and sorry for the follow-up question. I should have used a more realistic example table in my initial question. I have a large number of logs. Otherwise I could just extract and edit the tables by hand. The table itself is an odd object which I have only ever been able to export directly with capture.output, which would probably still leave me with the same problem as above.
EDIT:
All spaces seem to come right before and right after a plus sign. Perhaps that information can be used here to fill the spaces or remove them.

Try removing the spaces around each plus sign before calling read.table. Note that my.data is a character vector returned by readLines, not a data frame, so apply gsub to the vector itself rather than to a model column:
my.data <- my.data[grep(top, my.data):grep(bottom, my.data)]
my.data <- gsub(" *\\+ *", "+", my.data)
x <- read.table(text=my.data, comment.char = ">")
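
To show why collapsing the spaces around "+" is enough, here is a small Python sketch of the same preprocessing. It is an illustration only: `parse_model_table` is a hypothetical name, and it assumes the seven-field row layout of the dummy table (row label, model, npar, AICc, DeltaAICc, weight, Deviance).

```python
import re


def parse_model_table(lines):
    """Parse table rows whose model names contain spaces only around '+'."""
    rows = []
    for line in lines:
        # Skip blank lines and log prompt/comment lines starting with ">".
        if not line.strip() or line.lstrip().startswith(">"):
            continue
        # Remove spaces around "+" so each model name becomes a single token.
        fields = re.sub(r"\s*\+\s*", "+", line).split()
        if fields[0] == "model":
            continue  # header row
        rows.append({
            "row": int(fields[0]),
            "model": fields[1],
            "npar": int(fields[2]),
            "AICc": float(fields[3]),
            "DeltaAICc": float(fields[4]),
            "weight": float(fields[5]),
            "Deviance": float(fields[6]),
        })
    return rows


log_lines = [
    "> collect.models(, adjust = FALSE)",
    "model npar AICc DeltaAICc weight Deviance",
    "5 AA(~region + state + county + city)BB(~region + state + county + city)CC(~1) 17 11111.11 0.0000000 5.621299e-01 22222.22",
    "12 AA(~region)BB(~region)CC(~1) 6 44444.44 0.0000000 5.621299e-01 55555.55",
    ">",
]

if __name__ == "__main__":
    for row in parse_model_table(log_lines):
        print(row["row"], row["model"], row["npar"])
```

Once the spaces are gone, ordinary whitespace splitting sees the model name as one field, which is the same reason read.table succeeds after the gsub call.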

fetching multiple values from a string using regular expression

I have a table temp that has a column named "REMARKS".
Create script
Create table temp (id number,remarks varchar2(2000));
Insert script
Insert into temp values (1,'NAME =GAURAV Amount=981 Phone_number =98932324 Active Flag =Y');
Insert into temp values (2,'NAME =ROHAN Amount=984 Phone_number =98932333 Active Flag =N');
Now I want to fetch the corresponding values of NAME, Amount, Phone_number and Active Flag from the remarks column of the table.
I thought of using a regular expression, but I am not comfortable using them.
I tried substr and instr to fetch the name from the remarks column, but to fetch all four values I would need to write PL/SQL. Can this be achieved with a regular expression?
Can I get output (a cursor) like this:
id Name Amount phone_number Active flag
------------------------------------------
1 Gaurav 981 98932324 Y
2 Rohan 984 98932333 N
-------------------------------------------
Thanks for your help

You can use something like:
select regexp_replace(remarks, '.*NAME *=([^ ]*).*', '\1') name,
       regexp_replace(remarks, '.*Amount *=([^ ]*).*', '\1') amount,
       regexp_replace(remarks, '.*Phone_number *=([^ ]*).*', '\1') ph_number,
       regexp_replace(remarks, '.*Active Flag *=([^ ]*).*', '\1') flag
from temp;
NAME                 AMOUNT               PH_NUMBER            FLAG
-------------------- -------------------- -------------------- --------------------
GAURAV               981                  98932324             Y
ROHAN                984                  98932333             N
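
The pattern behind each regexp_replace call (a key, optional spaces, an equals sign, then the next run of non-space characters) can be checked with a short Python sketch. This is illustrative only; `parse_remarks` and `FIELDS` are hypothetical names, and it assumes values never contain spaces.

```python
import re

# Keys exactly as they appear in the remarks column.
FIELDS = ("NAME", "Amount", "Phone_number", "Active Flag")


def parse_remarks(remarks):
    """Return a dict mapping each key to the value after its '=' sign, or None."""
    result = {}
    for field in FIELDS:
        # key, optional spaces, "=", optional spaces, then the non-space value
        match = re.search(re.escape(field) + r"\s*=\s*(\S+)", remarks)
        result[field] = match.group(1) if match else None
    return result


if __name__ == "__main__":
    for row in (
        "NAME =GAURAV Amount=981 Phone_number =98932324 Active Flag =Y",
        "NAME =ROHAN Amount=984 Phone_number =98932333 Active Flag =N",
    ):
        print(parse_remarks(row))
```

Unlike regexp_replace, which returns the whole input string when the pattern does not match, this sketch returns None for a missing key, which makes absent fields easier to spot.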
