I am using BigQuery on Google Cloud Platform to extract data from GDELT. This uses an SQL syntax and regular expressions.
I have a column of data (called V2Tone), in which each cell looks like this:
1.55763239875389,2.80373831775701,1.24610591900312,4.04984423676012,26.4797507788162,2.49221183800623,299
To select only the first number (i.e., the number before the first comma) using regular expressions, we use this:
regexp_replace(V2Tone, r',.*', '')
How can we select only the second number (i.e., the number between the first and second commas)?
How about the third number (i.e., the number between the second and third commas)?
I understand that re2 syntax (https://github.com/google/re2/wiki/Syntax) is used here, but my understanding of how to put that all together is limited.
If anything is unclear, please let me know. Thank you for your help as I learn to use regular expressions.
The example below is for BigQuery Standard SQL and uses a simple SPLIT approach:
#standardSQL
SELECT
SPLIT(V2Tone)[SAFE_OFFSET(0)] first_number,
SPLIT(V2Tone)[SAFE_OFFSET(1)] second_number,
SPLIT(V2Tone)[SAFE_OFFSET(2)] third_number
FROM `project.dataset.table`
If for some reason you need or want to use a regular expression here, use the following:
#standardSQL
SELECT
REGEXP_EXTRACT(V2Tone, r'^(.*?),') first_number,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),)(.*?),') second_number,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),){2}(.*?),') third_number,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),){4}(.*?),') fifth_number
FROM `project.dataset.table`
Note the use of REGEXP_EXTRACT instead of REGEXP_REPLACE.
You can test both options with the dummy string from your question, as below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT '1.55763239875389,2.80373831775701,1.24610591900312,4.04984423676012,26.4797507788162,2.49221183800623,299' V2Tone
)
SELECT
SPLIT(V2Tone)[SAFE_OFFSET(0)] first_number,
SPLIT(V2Tone)[SAFE_OFFSET(1)] second_number,
SPLIT(V2Tone)[SAFE_OFFSET(2)] third_number,
REGEXP_EXTRACT(V2Tone, r'^(.*?),') first_number_re,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),)(.*?),') second_number_re,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),){2}(.*?),') third_number_re,
REGEXP_EXTRACT(V2Tone, r'^(?:(?:.*?),){4}(.*?),') fifth_number_re
FROM `project.dataset.table`
with output:
first_number second_number third_number first_number_re second_number_re third_number_re fifth_number_re
1.55763239875389 2.80373831775701 1.24610591900312 1.55763239875389 2.80373831775701 1.24610591900312 26.4797507788162
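For readers who want to check the logic outside BigQuery, here is a minimal Python sketch of the two approaches above. It assumes that Python's re module behaves like RE2 for these particular patterns; the helper names nth_by_split and nth_by_regex are made up for illustration and are not BigQuery functions.

```python
import re

V2TONE = ("1.55763239875389,2.80373831775701,1.24610591900312,"
          "4.04984423676012,26.4797507788162,2.49221183800623,299")


def nth_by_split(value, n):
    """Roughly mimic SPLIT(value)[SAFE_OFFSET(n)]: None when n is out of range."""
    parts = value.split(",")
    return parts[n] if 0 <= n < len(parts) else None


def nth_by_regex(value, n):
    """Apply the answer's REGEXP_EXTRACT pattern; n is the number of fields to skip."""
    match = re.search(r"^(?:(?:.*?),){%d}(.*?)," % n, value)
    return match.group(1) if match else None


print(nth_by_split(V2TONE, 1))  # second number via split
print(nth_by_regex(V2TONE, 1))  # second number via regex
print(nth_by_split(V2TONE, 6))  # last number via split
print(nth_by_regex(V2TONE, 6))  # regex variant finds no trailing comma
```

In this sketch the regex variant returns nothing for the last field, because the pattern requires a comma after the captured number, while the split variant still returns 299.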
I don't know of a single regex replace which could be used to isolate a single number in your CSV string, because in general we need to remove things on both sides of the match. But we can chain together two calls to regexp_replace. For example, if you wanted to target the third number in the CSV string, you could try this:
regexp_replace(regexp_replace(V2Tone, r'^(?:(?:\d+(?:\.\d+)?),){2}', ''),
r',.*', '')
The pattern I am using to strip off the first n numbers is this:
^(?:(?:\d+(?:\.\d+)?),){n}
This just removes a number, followed by a comma, n times, from the beginning of the string.
Demo
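The chained replacement can be tried out with a small Python sketch. It assumes Python's re.sub as a stand-in for regexp_replace, and the function name nth_by_chained_sub is hypothetical. As written, the pattern only matches unsigned numbers, so a value with a leading minus sign would not be stripped.

```python
import re

V2TONE = ("1.55763239875389,2.80373831775701,1.24610591900312,"
          "4.04984423676012,26.4797507788162,2.49221183800623,299")


def nth_by_chained_sub(value, n):
    """Return the n-th field (1-based) by removing what precedes and follows it."""
    # Step 1: drop the first n-1 numbers together with their trailing commas.
    head_removed = re.sub(r"^(?:(?:\d+(?:\.\d+)?),){%d}" % (n - 1), "", value)
    # Step 2: drop everything from the first remaining comma onward.
    return re.sub(r",.*", "", head_removed)


print(nth_by_chained_sub(V2TONE, 3))
```

If the string has fewer than n fields, step 1 does not match and the sketch falls back to returning the first field, which may or may not be what you want.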
Here is a solution with a single regex replace:
^([^,]+(?:,|$)){2}([^,]+(?:,|$))*|^.*$
Demo
\n is added to the negated character class in the demo to avoid matching across lines in m|multiline mode.
Usage:
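regexp_replace(V2Tone, r'^([^,]+(?:,|$)){2}([^,]+(?:,|$))*|^.*$', r'\1')
Explanation:
([^,]+(?:,|$)){n} captures everything up to the next comma or the end of the string, n times
([^,]+(?:,|$))* captures the rest, 0 or more times
^.*$ captures everything if we cannot match n times
And then, finally, we can reinsert the nth match using \1 (many regex tools write this as $1, but BigQuery's REGEXP_REPLACE expects backslash-escaped group references). Note that the captured text includes the trailing comma unless it is the last field.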
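As a rough check of the single-replace idea, here is a Python sketch using re.sub. The function name nth_by_single_sub is invented for this example, and the rstrip call is an addition of mine to drop the comma that the capture group carries along.

```python
import re

V2TONE = ("1.55763239875389,2.80373831775701,1.24610591900312,"
          "4.04984423676012,26.4797507788162,2.49221183800623,299")


def nth_by_single_sub(value, n):
    """Keep only the n-th field (1-based) with one substitution."""
    pattern = r"^([^,]+(?:,|$)){%d}([^,]+(?:,|$))*|^.*$" % n
    # Group 1 holds the last repetition, i.e. the n-th field plus its comma.
    kept = re.sub(pattern, r"\1", value)
    # Assumption: the trailing comma is unwanted, so strip it here.
    return kept.rstrip(",")


print(nth_by_single_sub(V2TONE, 2))
```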
I have a comma-separated file in which I need to change the first column by removing the leading zeroes from the number in the string. The text file looks like this:
ABC-0001,ab,0001
ABC-0010,bc,0010
I need to get the data as follows:
ABC-1,ab,0001
ABC-10,bc,0010
I tried a command-line replace as below:
sed 's/ABC-0*[1-9]/ABC-[1-9]/g' file
I ended up getting output:
ABC-[1-9],ab,0001
ABC-[1-9]0,bc,0010
Can you please tell me what I am missing here?
Alternately I also tried to apply formatting in the SQL that generates this file as below:
select regexp_replace(key,'((0+)|1-9|0+)','(1-9|0+)') from file where key in ('ABC-0001','ABC-0010')
which gives output as
ABC-(1-9|0+)1
ABC-(1-9|0+)1(1-9|0+)
Help with either solution would be much appreciated!
Try this:
sed -E 's/ABC-0*([1-9])/ABC-\1/g' file
                -------     --
                   |        |
            capturing group |
                      captured group
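The original attempt failed because the replacement side of s/// is plain text, so [1-9] was inserted literally instead of the digit that was matched. The same capture-and-reinsert idea can be sketched in Python; re.sub is used here as a stand-in for sed, and strip_key_zeros is a hypothetical helper name.

```python
import re

lines = ["ABC-0001,ab,0001", "ABC-0010,bc,0010"]


def strip_key_zeros(line):
    # ([1-9]) captures the first non-zero digit; \1 puts that digit back.
    return re.sub(r"ABC-0*([1-9])", r"ABC-\1", line)


fixed = [strip_key_zeros(line) for line in lines]
for line in fixed:
    print(line)
```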
To do it in the query using Oracle, assuming the value with the zeroes you want to remove is in a column called "key" in a table called "file", it would look like this:
select regexp_replace(key, '(-)(0+)(.*)', '\1\3')
from file;
You need to capture the dash because it is "consumed" by the regex when it is matched. It is followed by a second group of one or more 0's, and then by a third group containing the rest of the field. Replacing the match with captured groups 1 and 3 leaves out the 0's between them.
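To see what the three groups do, here is a small Python sketch with the same pattern. Python's re.sub is only a stand-in for Oracle's regexp_replace, and the function name strip_after_dash is hypothetical.

```python
import re


def strip_after_dash(key):
    # Group 1: the dash, group 2: the zeros to drop, group 3: the rest.
    return re.sub(r"(-)(0+)(.*)", r"\1\3", key)


for key in ["ABC-0001", "ABC-0010", "ABC-123"]:
    print(key, "->", strip_after_dash(key))
```

In this sketch a key without zeros directly after the dash is left unchanged, while a key whose number consists only of zeros would lose all of its digits.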
I want to extract the last word of each line in Notepad++. For this purpose I used the regular expression (\w+)$ in the "Find what" text box, and after the search it highlighted the last word of every line. But I don't know how to extract these words. Below is a sample of my file:
gene expression
gene
B activation
B
surface receptor
cell activation
proliferation
T lymphocyte
oxygen intermediate
activation
B
complex
expression
signaling cascade
tyrosine kinase activity
tyrosine kinase
A2
metabolite
formation
B activation
B
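You can use the following in the Replace dialog, with the search mode set to "Regular expression":
Search: (.* )(\w+)$
Replace with: $2
This deletes everything except the last word in each line. Lines that contain only one word have no space to match and are left as they are.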
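The effect of that search and replace can be previewed with a short Python sketch. It assumes Python's re module in place of Notepad++'s regex engine (which writes the group reference as $2 rather than \2), and it uses only a few lines from the sample file.

```python
import re

text = """gene expression
gene
B activation
tyrosine kinase activity"""

# Variant 1: the search/replace from the answer, applied line by line.
replaced = [re.sub(r"(.* )(\w+)$", r"\2", line) for line in text.splitlines()]

# Variant 2: collect the last word of every line directly.
last_words = re.findall(r"(\w+)$", text, flags=re.MULTILINE)

print(replaced)
print(last_words)
```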
To extract the last word of each line, you can use the command line cc.gc ^-1w of the plugin ConyEdit.
In Google Sheets, I have this in one cell:
Random stuff blah blah 123456789
<Surname, Name><123456><A><100><B><200>
<Surname2, Name2><456789><A><300><B><400>
Some more random stuff
I would like to match the strings within the <> brackets. With =REGEXEXTRACT(A4, "<(.*)>") I got this far:
Surname, Name><123456><A><100><B><200
which is nice, but it is only the first line. The desired output would be this (maybe including the <> at the beginning/end, it doesn't really matter):
Surname, Name><123456><A><100><B><200>
<Surname2, Name2><456789><A><300><B><400
or simply:
Surname, Name><123456><A><100><B><200><Surname2, Name2><456789><A><300><B><400
How to get there?
Please try:
=SUBSTITUTE(regexextract(substitute(A4,char(10)," "),"<(.*)>"),"> <",">"&char(10)&"<")
Working from the inside out, the inner SUBSTITUTE replaces line breaks (char(10)) with spaces. This gives REGEXEXTRACT the complete (i.e., multi-line) string to work on as a single line, with the same pattern the OP is already familiar with. The outer SUBSTITUTE then turns the relevant space (identified as being immediately surrounded by > and <) back into a line break.
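The three steps can be traced in a minimal Python sketch. It assumes str.replace and re.search as stand-ins for SUBSTITUTE and REGEXEXTRACT, with the cell content taken from the question.

```python
import re

cell = ("Random stuff blah blah 123456789\n"
        "<Surname, Name><123456><A><100><B><200>\n"
        "<Surname2, Name2><456789><A><300><B><400>\n"
        "Some more random stuff")

# Step 1: flatten the cell so that . can run across the former line breaks.
flattened = cell.replace("\n", " ")
# Step 2: greedy match from the first < to the last >.
inner = re.search(r"<(.*)>", flattened).group(1)
# Step 3: turn the space between > and < back into a line break.
result = inner.replace("> <", ">\n<")

print(result)
```

One caveat of this approach is that a genuine "> <" sequence inside the original text would also be turned into a line break.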
Google Sheets uses RE2 syntax. You can set the multi-line (m) and s flags in order to match across multiple lines. The following will match all characters over multiple lines in cell A2.
=REGEXEXTRACT(A2, "(?ms)^(.*)$")
REGEXEXTRACT(A1,"text1(?ms)(.*)text2")
So, in this case:
REGEXEXTRACT(A1,"<(?ms)(.*)>")
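The effect of the s flag can be sketched in Python, again only as a stand-in for RE2 in Google Sheets. Python requires inline flags at the start of the pattern, so the flag is moved there, and only s is set because this pattern has no ^ or $ for the m flag to act on.

```python
import re

cell = ("Random stuff blah blah 123456789\n"
        "<Surname, Name><123456><A><100><B><200>\n"
        "<Surname2, Name2><456789><A><300><B><400>\n"
        "Some more random stuff")

# (?s) lets . match line breaks, so the greedy group spans both lines.
match = re.search(r"(?s)<(.*)>", cell)
result = match.group(1)

print(result)
```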
I'm trying to substitute all non-matching characters in a single line between certain columns (after a search).
Example:
The search can be everything
In example below the search = test
The substitute character for non-matching characters: a space.
I want to substitute all characters that are not part of "test" between columns 10 and 30.
Columns 10 and 30 are indicated with |
before: djd<aj.testjal.kjetestjaja testlala ratesttsuvtesta !<-a-
| |
after: djd<aj.test test testlala ratesttsuvtesta !<-a-
How can I realize this?
Use the following substitution command on that line.
:s/\(test\)\zs\|\%>9v\%<31v./\=submatch(1)!=''?'':' '/g
If the range of columns is specified using visual selection, run
:'<,'>s/\(test\)\zs\|\%V./\=submatch(1)!=''?'':' '/g
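The logic of the substitution (keep every occurrence of the search word, blank every other character inside the column range) can be sketched in Python. This is only an approximation: it counts characters rather than Vim's virtual columns, so tabs and wide characters would behave differently, and the function name blank_outside_matches is hypothetical.

```python
import re


def blank_outside_matches(line, word, first_col, last_col):
    """Replace characters in columns first_col..last_col (1-based) with
    spaces unless they belong to an occurrence of word."""
    pattern = re.compile("(" + re.escape(word) + ")|.")

    def repl(match):
        if match.group(1) is not None:
            return match.group(1)       # part of the search word: keep
        col = match.start() + 1         # 1-based character column
        return " " if first_col <= col <= last_col else match.group(0)

    return pattern.sub(repl, line)


line = "djd<aj.testjal.kjetestjaja testlala"
result = blank_outside_matches(line, "test", 10, 30)
print(result)
```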
One method may be to select the appropriate column range using Visual mode (Ctrl+V).
Once selected, the search and replace can be restricted to the selection using \%V:
%s/\%Vfoo/bar/g
A regular expression for not test can be found here: Regular expression to match a line that doesn't contain a word?