Adding nulls to dataframe output with regexp replace in Spark 2.4 - regex

I am trying to use regexp_replace to add the string "null" to the output. I am using Scala on Spark 2.4 in AWS Glue. What is the best approach for this problem?
I create a dataframe by selecting the columns that need "null" added:
var select_df = raw_df.select(
col("example_column_1"),
col("example_column_2"),
col("example_column_3")
)
Input of example_column_1
#;#;Runner#;#;bob
Desired Output of example_column_1
null#;null#;Runner#;null#;bob
Attempt:
select_df.withColumn("example_column_1", regexp_replace(col("example_column_1"), "", "null"))

The task can be split into two parts:
replace # at the beginning of the string
replace all occurrences of ;#
select_df
.withColumn("example_column_1", regexp_replace('example_column_1, "^#", "null#"))
.withColumn("example_column_1", regexp_replace('example_column_1, ";#", ";null#"))
.show(false)
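To sanity-check the two patterns outside Spark, here is a minimal Python sketch using the standard re module on the sample value from the question. The function names are made up for illustration, and it assumes Python's regex behaves like Spark's Java regex for these simple patterns. The second function shows a possible single-pass variant; in Spark's regexp_replace the group reference would be written $1 rather than \1.

```python
import re

def fill_empty_fields(value: str) -> str:
    """Two-step version mirroring the answer above."""
    # Step 1: a leading '#' means the first field is empty.
    value = re.sub(r"^#", "null#", value)
    # Step 2: a '#' right after ';' means that field is empty.
    return re.sub(r";#", ";null#", value)

def fill_empty_fields_single_pass(value: str) -> str:
    """Same idea with one pattern: '#' at string start or right after ';'."""
    return re.sub(r"(^|;)#", r"\1null#", value)

sample = "#;#;Runner#;#;bob"
print(fill_empty_fields(sample))
print(fill_empty_fields_single_pass(sample))
```

Both calls should print the desired output from the question for this sample; other inputs (for example an empty last field) are not covered by either pattern.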

Related

Pyspark regex_extract number only from a text string which contains special characters too

I am trying to extract numbers only from a freeText column, and the column will have text like DH-09878877ABC or 9009898DEC or qwert9876788plk.
I just want to extract numbers using below PySpark but it's not working. Please advise
df=df.withColumn("acount_nbr",regexp_extract(df['freeText'],r'(^[0-9])',1))
Thanks
If you just want to extract numbers, and assuming the input would have only at most one substring of numbers, you should be using the regex pattern [0-9]+:
df = df.withColumn("acount_nbr", regexp_extract(df['freeText'], r'([0-9]+)', 1))
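The difference between the two patterns can be checked without a Spark session. The sketch below uses Python's re module on the three sample values from the question; extract_first_number is a hypothetical helper that mimics regexp_extract by returning an empty string when nothing matches.

```python
import re

def extract_first_number(text: str) -> str:
    """Return the first run of digits, or '' when there is none."""
    match = re.search(r"([0-9]+)", text)
    return match.group(1) if match else ""

for value in ["DH-09878877ABC", "9009898DEC", "qwert9876788plk"]:
    print(value, "->", extract_first_number(value))

# The pattern from the question, '(^[0-9])', matches a single digit,
# and only when that digit is the first character of the string.
print(re.search(r"(^[0-9])", "DH-09878877ABC"))
print(re.search(r"(^[0-9])", "9009898DEC").group(1))
```

This also shows why the original attempt returned nothing for values that start with letters: the ^ anchor ties the match to the first character.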

Finding string that has repeated pattern in snowflake

I am trying to find the strings in a Snowflake table that contain a repeated pattern, using regex.
Example:
Strings: 'abc', 'abcabc', 'snowsnowflake'
The query should return only 'abcabc' and 'snowsnowflake', because they contain repeated patterns.
Thank you.
I couldn't make it work with plain regex in SQL, but I was able to create a JavaScript UDF to get the desired results:
create or replace function find_repeated("x" string)
returns string
language javascript
as
$$
return x.match(/(.+)\1/g)
$$;
select x.value
, find_repeated(x.value)
, find_repeated(x.value) is not null has_repeated
from table(split_to_table('abc,abcabc,snowsnowflake', ',')) x
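The backreference pattern used in the UDF can be tried on the sample strings with a short Python sketch. has_repeated and its min_len parameter are made up for illustration. Note that (.+)\1 also flags any doubled character (such as the "ll" in "hello"), so the sketch lets you require a longer repeated unit if that is not what you want.

```python
import re

def has_repeated(text: str, min_len: int = 1) -> bool:
    # (.{n,})\1 captures at least n characters and then requires
    # the very same text to appear again immediately afterwards.
    pattern = r"(.{%d,})\1" % min_len
    return re.search(pattern, text) is not None

for value in ["abc", "abcabc", "snowsnowflake", "hello"]:
    print(value, has_repeated(value), has_repeated(value, min_len=2))
```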

Select the next line of the matched pattern in clob column using oracle regular expression

I have a CLOB column "details" in table xxx. I want to select the line that follows a matched pattern using regex.
The input text (CLOB data) looks like this, with each item on its own line:
MODEL_DATA 1
TEST1:
NONE
TEST2:
NONE
INFO:
SERVICES,VALUED-YES
TYPE:
NONE
I tried to use INFO as the match pattern and retrieve the next line of the text, but I could not do it with the regular expression functions. Please help me resolve this.
Output :
SERVICES,VALUED-YES
You can use the below to get the details
select replace(regexp_substr(details,'INFO:'||chr(10)||'.+'),'INFO:')
from your_table;
You can also try the following to be operating-system independent (it accepts both LF and CRLF line endings):
select replace(regexp_substr(details,'INFO:('||chr(10)||'|'||chr(13)||chr(10)||').+'),'INFO:')
from your_table;
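The same "line after a label" idea can be sketched in Python to see how the line-ending handling works. line_after is a hypothetical helper, and the sample text is the CLOB content from the question joined with newlines. Unlike the queries above, this sketch captures only the following line, so no leading newline is left in the result.

```python
import re
from typing import Optional

def line_after(label: str, text: str) -> Optional[str]:
    # \r?\n accepts both Unix (LF) and Windows (CRLF) line endings;
    # [^\r\n]+ captures the next line without any trailing '\r'.
    match = re.search(re.escape(label) + r"\r?\n([^\r\n]+)", text)
    return match.group(1) if match else None

details = (
    "MODEL_DATA 1\nTEST1:\nNONE\nTEST2:\nNONE\n"
    "INFO:\nSERVICES,VALUED-YES\nTYPE:\nNONE"
)
print(line_after("INFO:", details))
print(line_after("INFO:", details.replace("\n", "\r\n")))
```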

How to split a string in db2?

I have some URLs in my cas_fnd_dwd_det table:
casi_imp_urls cas_code
----------------------------------- -----------
www.casiac.net/fnds/CASI/qnxp.pdf
www.casiac.net/fnds/casi/as.pdf
www.casiac.net/fnds/casi/vindq.pdf
www.casiac.net/fnds/CASI/mnip.pdf
How do I copy the letters between the last '/' and '.pdf' to another column?
Expected outcome:
casi_imp_urls cas_code
----------------------------------- -----------
www.casiac.net/fnds/CASI/qnxp.pdf qnxp
www.casiac.net/fnds/casi/as.pdf as
www.casiac.net/fnds/casi/vindq.pdf vindq
www.casiac.net/fnds/CASI/mnip.pdf mnip
The URL prefixes below are static:
www.casiac.net/fnds/CASI/
www.casiac.net/fnds/casi/
Please advise: how do I select the codes between the last '/' and '.pdf'?
I would recommend taking a look at REGEXP_SUBSTR, which lets you apply a regular expression. Db2 has other string processing functions, but the regex function may be the easiest solution. See SO question on regex and URI parts for different ways of writing the expression. The following returns the last slash, the file name, and the extension:
SELECT REGEXP_SUBSTR('http://fobar.com/one/two/abc.pdf','\/(\w)*\.pdf' ,1,1)
FROM sysibm.sysdummy1
/abc.pdf
The following uses REGEXP_REPLACE; the pattern is from this SO question, with the pdf file extension added. It splits the string into three parts: everything up to the last slash (a non-capturing group), then the file name (group 1), then the ".pdf" (group 2). Group 0 is the whole match, so '$1' returns the file name.
SELECT REGEXP_REPLACE('http://fobar.com/one/two/abc.pdf','(?:.+\/)(.+)(\.pdf)','$1' ,1,1)
FROM sysibm.sysdummy1
abc
You could apply LENGTH and SUBSTR to extract the relevant part or try to build that into the regex.
For Db2 versions older than 11.1, try the following as is. I am not sure whether it works on 9.5, but it should work from 9.7 onward.
with cas_fnd_dwd_det (casi_imp_urls) as (values
'www.casiac.net/fnds/CASI/qnxp.pdf'
, 'www.casiac.net/fnds/casi/as.pdf'
, 'www.casiac.net/fnds/casi/vindq.pdf'
, 'www.casiac.net/fnds/CASI/mnip.PDF'
)
select
casi_imp_urls
, xmlcast(xmlquery('fn:replace($s, ".*/(.*)\.pdf", "$1", "i")' passing casi_imp_urls as "s") as varchar(50)) cas_code
from cas_fnd_dwd_det
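The extraction logic itself (last path segment, minus a case-insensitive ".pdf" suffix) can be checked with a small Python sketch before committing to a Db2 expression. file_code is a hypothetical helper, and the sample URLs are the ones from the question.

```python
import re

def file_code(url: str) -> str:
    # [^/]+ cannot cross a slash, so it can only match the last path segment;
    # \.pdf$ requires a literal '.pdf' at the end, matched case-insensitively.
    match = re.search(r"([^/]+)\.pdf$", url, re.IGNORECASE)
    return match.group(1) if match else ""

urls = [
    "www.casiac.net/fnds/CASI/qnxp.pdf",
    "www.casiac.net/fnds/casi/as.pdf",
    "www.casiac.net/fnds/casi/vindq.pdf",
    "www.casiac.net/fnds/CASI/mnip.PDF",
]
for url in urls:
    print(url, "->", file_code(url))
```

Escaping the dot matters here: an unescaped "." would also accept any other character in front of "pdf".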

Pentaho Data Integration - Extract string from string

I have this string:
Goods: <i>1 pallet 120x80x100 100KG<br>
This is the regex I would use in Ruby:
^Goods: <i>(.*)<br>$
This is what I need as the result:
1 pallet 120x80x100 100KG
How do I do it in Pentaho Data Integration?
There is a step called 'Split Fields'. Feed it the column with this data and set : as the delimiter; in the New Fields area, declare two new columns that will receive the split data. This step works much like splitting a string per token.
You can also use the Regex Evaluation step, which relies on Java regex. Java regex differs a bit from Ruby's, but in your case the pattern is the same:
^Goods: <i>(.*)<br>$
You can use the same regex in a [Modified] Java Script [Value] step:
^Goods: <i>(.*)<br>$
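The pattern can be tried outside PDI with a short Python sketch. It assumes the raw field still contains the <i> and <br> tags that the pattern expects, and extract_goods is a made-up name for illustration. For a single-line value like this one, the pattern should behave the same way in Python, Ruby, and Java.

```python
import re
from typing import Optional

def extract_goods(raw: str) -> Optional[str]:
    # Group 1 is everything between 'Goods: <i>' and the closing '<br>'.
    match = re.match(r"^Goods: <i>(.*)<br>$", raw)
    return match.group(1) if match else None

# Assumed raw value, with the tags included.
raw = "Goods: <i>1 pallet 120x80x100 100KG<br>"
print(extract_goods(raw))
```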