Finding/replacing values for a specific column in Notepad++ - regex

I think I need RegEx for this, but it is new to me...
What I have in a text file are 200 rows of data, 100 INSERT INTO rows and 100 corresponding VALUE rows.
So it looks like this:
INSERT INTO DB1.Tbl1 (Col1, Col2, Col3........Col20)
VALUES(123, 'ABC', '201450204 15:37:48'........'DEF')
What I want to do is replace every Date/Timestamp value in Col3 with this: CURRENT_TIMESTAMP. The Date/Timestamps are NOT the same for every row. They differ, but they are all in Column 3.
There are 100 records in this table, some other tables have more, that's why I am looking for a shortcut to do this.

Try this:
search with (INSERT[^,]+,[^,]+,)([^,]+,)([^']+'[^']+'[^']+)('[^']+',) and replace with $1$3 and check mark regular expression in the notepad++
Live demo

With
"VALUES" being right at the beginning of the line,
"Col1" values being all numeric, and
no single quotes inside the values for "Col2"
you can search for
^(VALUES\(\d+, '[^']+', )'(\d{9} \d{2}:\d{2}:\d{2})'
and replace with
\1CURRENT_TIMESTAMP
along RegEx101. (Remember, Notepad++ uses the backslash in the replacement string…)
Personally, I'd consider to go straight to the database, and fix the timestamp there - especially, if you have more data to handle. (See my above comment for the general idea.)
Please comment, if and as further detail / adjustment is required.

Related

regular expression replace for SQL

I have to replace a string pattern in SQL with empty string, could anyone please suggest me?
Input String 'AC001,AD001,AE001,SA001,AE002,SD001'
Output String 'AE001,AE002
There are the 4 digit codes with first 2 characters "alphabets" and last two are digits. This is always a 4 digit code. And I have to replace all codes except the codes starting with "AE".
I can have 0 or more instances of "AE" codes in the string. The final output should be a formatted string "separated by commas" for multiple "AE" codes as mentioned above.
Here is one option calling regex_replace multiple times, eliminating the "not required" strings little by little in each iteration to arrive at the required output.
SELECT regexp_replace(
regexp_replace(
regexp_replace(
'AC001,AD001,AE001,SA001,AE002,SD001', '(?<!AE)\d{3},{0,1}', 'X','g'
),'..X','','g'
),',$','','g'
)
See Demo here
I would convert the list to an array, unnest that to rows then filter out those that should be kept and aggregate it back to a string:
select string_agg(t, ',')
from unnest(string_to_array('AC001,AD001,AE001,SA001,AE002,SD001',',') as x(t)
where x.t like 'AE%'; --<< only keep those
This is independent of the number of elements in the string and can easily be extended to support more complex conditions.
This is a good example why storing comma separated values in a single column is not such a good idea to begin with.

How can I use regular expressions to select text between commas?

I am using BigQuery on Google Cloud Platform to extract data from GDELT. This uses an SQL syntax and regular expressions.
I have a column of data (called V2Tone), in which each cell looks like this:
1.55763239875389,2.80373831775701,1.24610591900312,4.04984423676012,26.4797507788162,2.49221183800623,299
To select only the first number (i.e., the number before the first comma) using regular expressions, we use this:
regexp_replace(V2Tone, r',.*', '')
How can we select only the second number (i.e., the number between the first and second commas)?
How about the third number (i.e., the number between the second and third commas)?
I understand that re2 syntax (https://github.com/google/re2/wiki/Syntax) is used here, but my understanding of how to put that all together is limited.
If anything is unclear, please let me know. Thank you for your help as I learn to use regular expressions.
Below example is for BigQuery Standard SQL using super simple SPLIT approach
#standardSQL
SELECT
SPLIT(V2Tone)[SAFE_OFFSET(0)] first_number,
SPLIT(V2Tone)[SAFE_OFFSET(1)] second_number,
SPLIT(V2Tone)[SAFE_OFFSET(2)] third_number
FROM `project.dataset.table`
If for some reason you need/want to use regexp here - use below
#standardSQL
SELECT
REGEXP_EXTRACT(V2Tone, r'^(.*?),') first_number,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),)(.*?),') second_number,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),){2}(.*?),') third_number,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),){4}(.*?),') fifth_number
FROM `project.dataset.table`
Note use of REGEXP_EXTRACT instead of REGEXP_REPLACE
You can play, test above options with dummy string from your question as below
#standardSQL
WITH `project.dataset.table` AS (
SELECT '1.55763239875389,2.80373831775701,1.24610591900312,4.04984423676012,26.4797507788162,2.49221183800623,299' V2Tone
)
SELECT
SPLIT(V2Tone)[SAFE_OFFSET(0)] first_number,
SPLIT(V2Tone)[SAFE_OFFSET(1)] second_number,
SPLIT(V2Tone)[SAFE_OFFSET(2)] third_number,
REGEXP_EXTRACT(V2Tone, r'^(.*?),') first_number_re,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),)(.*?),') second_number_re,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),){2}(.*?),') third_number_re,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),){4}(.*?),') fifth_number_re
FROM `project.dataset.table`
with output :
first_number second_number third_number first_number_re second_number_re third_number_re fifth_number_re
1.55763239875389 2.80373831775701 1.24610591900312 1.55763239875389 2.80373831775701 1.24610591900312 26.4797507788162
I don't know of a single regex replace which could be used to isolate a single number in your CSV string, because we need to remove things on both sides of the match, in general. But, we can chain together two calls to regex_replace. For example, if you wanted to target the third number in the CSV string, we could try this:
regexp_replace(regexp_replace(V2Tone, r'^(?:(?:\d+(?:\.\d+)?),){2}', ''),
r',.*', ''))
The pattern I am using to strip of the first n numbers is this:
^(?:(?:\d+(?:\.\d+)?),){n}
This just removes a number, followed by a comma, n times, from the beginning of the string.
Demo
Here is a solution with a single regex replace:
^([^,]+(?:,|$)){2}([^,]+(?:,|$))*|^.*$
Demo
\n is added to the negated character class in the demo to avoid matching accross lines in m|multiline mode.
Usage:
regexp_replace(V2Tone, r'^([^,]+(?:,|$)){2}([^,]+(?:,|$))*|^.*$', '$1')
Explanation:
([^,]+(?:,|$){n} captures everything to the next comma or the end of the string n times
([^,]+(?:,|$))* captures the rest 0 or more times
^.*$ capture everything if we cannot match n times
And then, finally, we can reinsert the nth match using $1.

How can I separate a string by underscore (_) in google spreadsheets using regex?

I need to create some columns from a cell that contains text separated by "_".
The input would be:
campaign1_attribute1_whatever_yes_123421
And the output has to be in different columns (one per field), with no "_" and excluding the final number, as it follows:
campaign1 attribute1 whatever yes
It must be done using a regex formula!
help!
Thanks in advance (and sorry for my english)
=REGEXEXTRACT("campaign1_attribute1_whatever_yes_123421","(("&REGEXREPLACE("campaign1_attribute1_whatever_yes_123421","((_)|(\d+$))",")$1(")&"))")
What this does is replace all the _ with parenthesis to create capture groups, while also excluding the digit string at the end, then surround the whole string with parenthesis.
We then use regex extract to actuall pull the pieces out, the groups automatically push them to their own cells/columns
To solve this you can use the SPLIT and REGEXREPLACE functions
Solution:
Text - A1 = "campaign1_attribute1_whatever_yes_123421"
Formula - A3 = =SPLIT(REGEXREPLACE(A1,"_+\d*$",""), "_", TRUE)
Explanation:
In cell A3 We use SPLIT(text, delimiter, [split_by_each]), the text in this case is formatted with regex =REGEXREPLACE(A1,"_+\d$","")* to remove 123421, witch will give you a column for each word delimited by ""
A1 = "campaign1_attribute1_whatever_yes_123421"
A2 = "=REGEXREPLACE(A1,"_+\d*$","")" //This gives you : *campaign1_attribute1_whatever_yes*
A3 = SPLIT(A2, "_", TRUE) //This gives you: campaign1 attribute1 whatever yes, each in a separate column.
I finally figured it out yesterday in stackoverflow (spanish): https://es.stackoverflow.com/questions/55362/c%C3%B3mo-separo-texto-por-guiones-bajos-de-una-celda-en...
It was simple enough after all...
The reason I asked to be only in regex and for google sheets was because I need to use it in Google data studio (same regex functions than spreadsheets)
To get each column just use this regex extract function:
1st column: REGEXP_EXTRACT(Campaña, '^(?:[^_]*_){0}([^_]*)_')
2nd column: REGEXP_EXTRACT(Campaña, '^(?:[^_]*_){1}([^_]*)_')
3rd column: REGEXP_EXTRACT(Campaña, '^(?:[^_]*_){2}([^_]*)_')
etc...
The only thing that has to be changed in the formula to switch columns is the numer inside {}, (column number - 1).
If you do not have the final number, just don't put the last "_".
Lastly, remember to do all the calculated fields again, because (for example) it gets an error with CPC, CTR and other Adwords metrics that are calculated automatically.
Hope it helps!

Regex inside group

I have a csv file I need to import into my db.
Sample input:
122545;bmwx3;new;red,black,white,pink
I want the final output to be like this:
INSERT INTO myTable VALUES ("122545", "bmwx3", "new", "red");
INSERT INTO myTable VALUES ("122545", "bmwx3", "new", "black");
INSERT INTO myTable VALUES ("122545", "bmwx3", "new", "white");
INSERT INTO myTable VALUES ("122545", "bmwx3", "new", "pink");
The 4th element is a "sub-csv" with an unknown amount of entries. But always in that format (no ")
Ideally I would like to do this in notepad++ using regex, if not possible I will have to cook up a script.
I think that first I need to make this:
122545;bmwx3;new;red,black,white,pink
Look like this:
122545;bmwx3;new;red
122545;bmwx3;new;black
122545;bmwx3;new;white
122545;bmwx3;new;pink
My problem is that I don't know to match the sub-csv. Is it even possible to do this in pure regex (no programming needed)?
If the 122545;bmwx3;new; part is not fixed
In three steps:
Get to red,black,white,pink#LIMIT#122545;bmwx3;new;: replace (.*;)([^;]*) with \2#LIMIT#\1
Create the 122545;bmwx3;new;red stings: replace
(\w+)(?:,|(?=#LIMIT#))(?=.*#LIMIT#(.*))
with \2\1\n (see demo)
Remove the #LIMIT#... lines: replace ^#LIMIT#.* with an empty string
If the 122545;bmwx3;new; part is fixed
#hjpotter's idea seems pretty cool, you just new to replace , with
\n122545;bmwx3;new;
What's left
Replace
^(\w*);(\w*);(\w*);(\w*)$
with
INSERT INTO myTable VALUES ("\1", "\2", "\3", "\4")
You're good to go !
Certainly not the simplest way, but it works:
Find what: ^([^,]+;)(.+),([^,]+)$
Replace with: $1$2\n$1$3
And click on Replace all as many time as needed!

regular expression clob field

I have a question related to an regular expression in oracle 10.
Assuming I have a value like 123456;12345;454545 stored in a clob field, is there a way via an regular expression to only filter on the second pattern (12345) knowing that the value can be more then 5 digits but always occurs after the first semicolon and always has a trailing semicolon at the end?
Thanks a lot for your support in that matter,
Have a nice day,
This query should give you your desired output.
SELECT REGEXP_REPLACE(REGEXP_SUBSTR('123456;12345;454545;45634',';[0-9]+;'),';')
FROM dual;
You can get filter any pattern using this query just change 2 to any value, but it should be less than or equal to the number of elements in the string
with tab(value) as
(select '123456;12345;454545' from dual)
select regexp_substr(value, '[^;]+', 1, 2) from tab;
easily by one call:
select regexp_replace('123456;12345;454545','^[0-9]+;([0-9]+);.*$','\1')
from dual;
perhaps, regexp expression can be modified in a way of more good-looking or your business logic, but the idea, I think, is clear.
select regexp_replace(regexp_substr(Col_name,';\d+;'),';','') from your_table;