How to remove whitespace from a string in Redshift? - amazon-web-services

I've been trying to join two tables 'A' and 'B' on a column, say 'Col1'. The problem I'm facing is that the data in the two columns comes in different formats. For example, 'A - Air' is coming in as 'A-Air', 'B - Air' as 'B-Air', etc.
Therefore, I'm trying to remove the white space from the data in Col1 of A, but I'm not able to remove it using any function given in the AWS documentation. I've tried TRIM and REPLACE, but they won't work in this case. This might be achievable with regular expressions, but I can't figure out how. Below is a snippet of how I tried using a regex, which didn't work.
select Col1, regexp_replace( Col1, '#.*\\.( )$')
from A
WHERE
date = TO_DATE('2020/08/01', 'YYYY/MM/DD')
limit 5
Please let me know how I can remove the spaces from a string using regular expressions or any other possible means in Redshift.

select Col1, regexp_replace(Col1, '\\s', '')
from A
This worked for me.
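To come back to the original join, a minimal sketch built on that answer (table, column, and date filter taken from the question; purely an illustration, stripping the whitespace on both sides so either format matches) could look like this:

select a.Col1, b.Col1
from A a
join B b
  -- strip all whitespace from the join keys so 'A - Air' and 'A-Air' compare equal
  on regexp_replace(a.Col1, '\\s', '') = regexp_replace(b.Col1, '\\s', '')
where a.date = TO_DATE('2020/08/01', 'YYYY/MM/DD');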

Related

BigQuery regexp replace character between quotes

I'm trying to use the BigQuery function regexp_replace for the following scenario:
Given a string field with a comma as the delimiter, I need to remove only the commas within double quotes.
I found a regex that works on the regex101 website, but it seems that the BigQuery function doesn't support lookahead groups. Could you please help me find an equivalent expression that is supported by BigQuery's regexp_replace?
https://regex101.com/r/nxkqtb/3
BigQuery example code (not supported):
WITH tbl AS (
  SELECT 'LINE_NR="1",TXT_FIELD="Some text",CID="0"' as text
  UNION ALL
  SELECT 'LINE_NR="2",TXT_FIELD=",,Some text",CID="0"' as text
  UNION ALL
  SELECT 'LINE_NR="3",TXT_FIELD="Some text ,",CID="0"' as text
  UNION ALL
  SELECT 'LINE_NR="4",TXT_FIELD=",Some ,text,",CID="0"' as text
)
SELECT
  REGEXP_REPLACE(text, r'(?m),(?=[^"]*"(?:[^"\r\n]*"[^"]*")*[^"\r\n]*$)', "")
FROM tbl;
Thank you
Consider the approach below (assuming you know in advance the keys within the text field):
select text,
  ( select string_agg(replace(kv, ',', ''), ',' order by offset)
    from unnest(regexp_extract_all(text, r'((?:LINE_NR|TXT_FIELD|CID)=".*?")')) kv with offset
  ) corrected_text
from tbl;
If applied to the sample data in your question, the output is:

How can I use regular expressions to select text between commas?

I am using BigQuery on Google Cloud Platform to extract data from GDELT. This uses SQL syntax and regular expressions.
I have a column of data (called V2Tone), in which each cell looks like this:
1.55763239875389,2.80373831775701,1.24610591900312,4.04984423676012,26.4797507788162,2.49221183800623,299
To select only the first number (i.e., the number before the first comma) using regular expressions, we use this:
regexp_replace(V2Tone, r',.*', '')
How can we select only the second number (i.e., the number between the first and second commas)?
How about the third number (i.e., the number between the second and third commas)?
I understand that re2 syntax (https://github.com/google/re2/wiki/Syntax) is used here, but my understanding of how to put that all together is limited.
If anything is unclear, please let me know. Thank you for your help as I learn to use regular expressions.
The example below is for BigQuery Standard SQL, using a super simple SPLIT approach:
#standardSQL
SELECT
SPLIT(V2Tone)[SAFE_OFFSET(0)] first_number,
SPLIT(V2Tone)[SAFE_OFFSET(1)] second_number,
SPLIT(V2Tone)[SAFE_OFFSET(2)] third_number
FROM `project.dataset.table`
If for some reason you need or want to use a regexp here, use the following:
#standardSQL
SELECT
REGEXP_EXTRACT(V2Tone, r'^(.*?),') first_number,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),)(.*?),') second_number,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),){2}(.*?),') third_number,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),){4}(.*?),') fifth_number
FROM `project.dataset.table`
Note the use of REGEXP_EXTRACT instead of REGEXP_REPLACE.
You can play with and test the options above using the dummy string from your question, as below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT '1.55763239875389,2.80373831775701,1.24610591900312,4.04984423676012,26.4797507788162,2.49221183800623,299' V2Tone
)
SELECT
SPLIT(V2Tone)[SAFE_OFFSET(0)] first_number,
SPLIT(V2Tone)[SAFE_OFFSET(1)] second_number,
SPLIT(V2Tone)[SAFE_OFFSET(2)] third_number,
REGEXP_EXTRACT(V2Tone, r'^(.*?),') first_number_re,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),)(.*?),') second_number_re,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),){2}(.*?),') third_number_re,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),){4}(.*?),') fifth_number_re
FROM `project.dataset.table`
with output:
first_number      1.55763239875389
second_number     2.80373831775701
third_number      1.24610591900312
first_number_re   1.55763239875389
second_number_re  2.80373831775701
third_number_re   1.24610591900312
fifth_number_re   26.4797507788162
I don't know of a single regex replace that could be used to isolate one number in your CSV string, because in general we need to remove things on both sides of the match. But we can chain together two calls to regexp_replace. For example, if you wanted to target the third number in the CSV string, you could try this:
regexp_replace(regexp_replace(V2Tone, r'^(?:(?:\d+(?:\.\d+)?),){2}', ''),
               r',.*', '')
The pattern I am using to strip off the first n numbers is this:
^(?:(?:\d+(?:\.\d+)?),){n}
This just removes a number, followed by a comma, n times, from the beginning of the string.
Demo
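For example, a self-contained sketch using the dummy-table pattern from the earlier answer (the chained replaces here target the third number) could look like this:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT '1.55763239875389,2.80373831775701,1.24610591900312,4.04984423676012,26.4797507788162,2.49221183800623,299' V2Tone
)
SELECT
  -- first strip the leading two "number," groups, then drop everything from the next comma onward
  regexp_replace(regexp_replace(V2Tone, r'^(?:(?:\d+(?:\.\d+)?),){2}', ''), r',.*', '') third_number
FROM `project.dataset.table`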
Here is a solution with a single regex replace:
^([^,]+(?:,|$)){2}([^,]+(?:,|$))*|^.*$
Demo
\n is added to the negated character class in the demo to avoid matching across lines in multiline mode.
Usage:
regexp_replace(V2Tone, r'^([^,]+(?:,|$)){2}([^,]+(?:,|$))*|^.*$', r'\1')
Explanation:
([^,]+(?:,|$)){n} captures everything up to the next comma or the end of the string, n times
([^,]+(?:,|$))* captures the rest, 0 or more times
^.*$ captures everything if we cannot match n times
And then, finally, we can reinsert the nth match using \1 (BigQuery's REGEXP_REPLACE uses backslash backreferences in the replacement string).

Hive - regexp_replace function for multiple strings

I am using Hive 0.13. I want to find multiple tokens like "hip hop" and "rock music" in my data and replace them with "hiphop" and "rockmusic" - basically, replace them without the white space. I have used the regexp_replace function in Hive. Below is my query, and it works great for the two examples above.
drop table vp_hiphop;
create table vp_hiphop as
select userid, ntext,
regexp_replace(regexp_replace(ntext, 'hip hop', 'hiphop'), 'rock music', 'rockmusic') as ntext1
from vp_nlp_protext_males
;
But I have 100 such bigrams/ngrams and want to be able to do the replacement efficiently, where I just remove the whitespace. I can pattern-match the phrases - hip hop and rock music - but in the replacement I want to simply strip the white space. Below is what I tried. I also tried using trim with regexp_replace, but regexp_replace requires the third argument.
drop table vp_hiphop;
create table vp_hiphop as
select userid, ntext,
regexp_replace(ntext, '(hip hop)|(rock music)') as ntext1
from vp_nlp_protext_males
;
You can strip all occurrences of a character from a string by using the TRANSLATE function to map it to the empty string. For your query it would become this:
drop table vp_hiphop;
create table vp_hiphop as
select userid, ntext,
translate(ntext, ' ', '') as ntext1
from vp_nlp_protext_males
;
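If you only want to remove the space inside the matched phrases (rather than every space in ntext), one hedged alternative - assuming all of your phrases are two-word bigrams - is to capture both words and glue them back together with backreferences in a single regexp_replace (Hive uses Java regex, so $1$2 is valid in the replacement):

drop table vp_hiphop;
create table vp_hiphop as
select userid, ntext,
       -- rejoin each captured bigram without the space; note that this alternation also matches
       -- cross-combinations such as "hip music", so list whole phrases if that matters for your data
       regexp_replace(ntext, '\\b(hip|rock) (hop|music)\\b', '$1$2') as ntext1
from vp_nlp_protext_males;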

Finding/replacing values for a specific column in Notepad++

I think I need RegEx for this, but it is new to me...
What I have in a text file are 200 rows of data: 100 INSERT INTO rows and 100 corresponding VALUES rows.
So it looks like this:
INSERT INTO DB1.Tbl1 (Col1, Col2, Col3........Col20)
VALUES(123, 'ABC', '201450204 15:37:48'........'DEF')
What I want to do is replace every Date/Timestamp value in Col3 with this: CURRENT_TIMESTAMP. The Date/Timestamps are NOT the same for every row. They differ, but they are all in Column 3.
There are 100 records in this table, and some other tables have more; that's why I am looking for a shortcut to do this.
Try this:
Search with (INSERT[^,]+,[^,]+,)([^,]+,)([^']+'[^']+'[^']+)('[^']+',) and replace with $1$3, and tick "Regular expression" in Notepad++.
Live demo
With
"VALUES" being right at the beginning of the line,
"Col1" values being all numeric, and
no single quotes inside the values for "Col2"
you can search for
^(VALUES\(\d+, '[^']+', )'(\d{9} \d{2}:\d{2}:\d{2})'
and replace with
\1CURRENT_TIMESTAMP
as demonstrated on RegEx101. (Remember, Notepad++ uses the backslash in the replacement string…)
Personally, I'd consider going straight to the database and fixing the timestamp there - especially if you have more data to handle. (See my comment above for the general idea.)
Please comment, if and as further detail / adjustment is required.
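If you do take the database route, a minimal sketch (purely illustrative, assuming Col3 is the timestamp column and every row should simply get the load time) would be to run the INSERTs unchanged and then overwrite the column:

-- hypothetical follow-up statement: stamp every row with the current server time
UPDATE DB1.Tbl1
SET Col3 = CURRENT_TIMESTAMP;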

How to compare Unicode characters in SQL server?

Hi, I am trying to find all rows in my database (SQL Server) which have the character é in their text, by executing the following queries.
SELECT COUNT(*) FROM t_question WHERE patindex(N'%[\xE9]%',question) > 0;
SELECT COUNT(*) FROM t_question WHERE patindex(N'%[\u00E9]%',question) > 0;
But I found two problems: (a) they return different numbers of rows, and (b) they return rows which do not contain the specified character.
Is the way I am constructing the regular expression and comparing the Unicode character correct?
EDIT:
The question column is stored using datatype nvarchar.
The following query gives the correct result though.
SELECT COUNT(*) FROM t_question WHERE question LIKE N'%é%';
Why not use SELECT COUNT(*) FROM t_question WHERE question LIKE N'%é%'?
NB: LIKE and PATINDEX do not accept regular expressions.
In SQL Server pattern syntax, [\xE9] means match any single character within the specified set, i.e. match \, x, E, or 9. So any of the following strings would match that pattern:
"Elephant"
"axis"
"99.9"