Having issues separating the required data using regex - regex

I am trying to use a regex in a Rails application I'm building to separate input without splitting the string up manually.
My regex is:
(?<action>\S+)(?:\s(?<query>.*)\s)(?<id>(?<=.).*?(?=\s))
And the test data I am using is as follows:
add hello by name
remove first second by id
add first
From this, I want the following values:
action: add, query: hello, id: name
action: remove, query: first second, id: id
action: add, query: first, id: nil (or "")
What am I doing wrong? It won't match at all on the last line of test data. Any help would be great.

Try this one:
^(?<action>\S+)(?:\s(?<query>(?:(?! by ).)*))(?: by (?<id>\w+))?
The id is always preceded by " by ", so each character in your <query> group should repeat a negative lookahead for that " by " substring.
Also ensure that the group around the id is optional, so that the third line gets matched as well.
Another option, instead of repeating a negative lookahead, would be to have a single positive lookahead for " by " or the end of the string, and repeat lazily:
^(?<action>\S+)(?:\s(?<query>.*?(?= by |$)))(?: by (?<id>\w+))?$
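To check the first pattern against the three test lines, here is a minimal sketch. It is written in Python rather than Ruby, so the named groups are spelled (?P<name>...) instead of (?<name>...); parse_command is a hypothetical helper name used only for illustration.

```python
import re

# Same pattern as the first suggestion, using Python's (?P<name>...) group syntax.
COMMAND_RE = re.compile(
    r"^(?P<action>\S+)(?:\s(?P<query>(?:(?! by ).)*))(?: by (?P<id>\w+))?"
)

def parse_command(line):
    """Return (action, query, id) for one input line; id is None when absent."""
    m = COMMAND_RE.match(line)
    if m is None:
        return None
    return m.group("action"), m.group("query"), m.group("id")

for line in ["add hello by name", "remove first second by id", "add first"]:
    print(parse_command(line))
```

The optional " by ..." group is what lets the third line match; its id comes back as None rather than an empty string.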


Extract from string in BigQuery using regexp_extract

I have a long string in BigQuery from which I need to extract some data.
Part of the string looks like this:
... source: "agent" resolved_query: "hi" score: 0.61254 parameters ...
I want to extract out data such as agent, hi, and 0.61254.
I'm trying to use regexp_extract but I can't get the regexp to work correctly:
select
regexp_extract([col],r'score: [0-9]*\.[0-9]+') as score,
regexp_extract([col],r'source: [^"]*') as source
from [table]
What should the regexp be to just get agent or 0.61254 without the field name and no quotation marks?
Thank you in advance.
I love non-trivial approaches; below is one of them:
select * except(col) from (
select col, split(kv, ': ')[offset(0)] key,
trim(split(kv, ': ')[offset(1)], '"') value,
from your_table,
unnest(regexp_extract_all(col, r'\w+: "?[\w.]+"?')) kv
)
pivot (min(value) for key in ('source', 'resolved_query', 'score'))
if applied to sample data as in your question
with your_table as (
select '... source: "agent" resolved_query: "hi" score: 0.61254 parameters ... ' col union all
select '... source: "agent2" resolved_query: "hello" score: 0.12345 parameters ... ' col
)
the output has one row per input string, with the columns source, resolved_query, and score.
As you might have noticed, the benefit of this approach is that if you have more fields/attributes to extract, you do not need to clone a line of code for each attribute. You just add another value to the list in the last line; the rest of the code stays the same.
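The same key/value idea can be tried outside BigQuery. The sketch below is an illustration in Python, not part of the query above: it reuses the regexp_extract_all pattern to collect every key: value pair into a dict. The extract_fields name is made up for this example.

```python
import re

def extract_fields(text):
    """Collect every `key: value` pair into a dict, stripping surrounding quotes."""
    fields = {}
    # Same pattern as in the regexp_extract_all call.
    for kv in re.findall(r'\w+: "?[\w.]+"?', text):
        key, value = kv.split(": ", 1)
        fields[key] = value.strip('"')
    return fields

row = '... source: "agent" resolved_query: "hi" score: 0.61254 parameters ... '
print(extract_fields(row))
```

As in the SQL version, picking out more attributes only means reading more keys from the result; the extraction code itself does not change.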
You can use
select
regexp_extract([col],r'score:\s*(\d*\.?\d+)') as score,
regexp_extract([col],r'resolved_query:\s*"([^"]*)"') as resolved_query,
regexp_extract([col],r'source:\s*"([^"]*)"') as source
from [table]
Here,
score:\s*(\d*\.?\d+) matches score: string, then any zero or more whitespaces, and then there is a capturing group with ID=1 that captures zero or more digits, an optional . and then one or more digits
resolved_query:\s*"([^"]*)" matches a resolved_query: string, zero or more whitespaces, ", then captures into Group 1 any zero or more chars other than " and then matches a " char
source:\s*"([^"]*)" matches a source: string, zero or more whitespaces, ", then captures into Group 1 any zero or more chars other than " and then matches a " char.
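To see what the capturing groups return, the three patterns can be run against the sample string. This is a rough Python equivalent, assuming that regexp_extract returns the content of the single capturing group when the pattern has one; the extract helper is a hypothetical stand-in for it.

```python
import re

row = '... source: "agent" resolved_query: "hi" score: 0.61254 parameters ...'

def extract(pattern, text):
    """Return capture group 1 of the first match, or None when nothing matches."""
    m = re.search(pattern, text)
    return m.group(1) if m else None

score = extract(r'score:\s*(\d*\.?\d+)', row)
resolved_query = extract(r'resolved_query:\s*"([^"]*)"', row)
source = extract(r'source:\s*"([^"]*)"', row)
print(score, resolved_query, source)
```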

Regex for SQL Query

Hello everyone, I have the following problem:
I have a long list of SQL queries that I would like to adapt to a change I made. In the end it is a renaming problem, and I'm afraid I am making the solution more complicated than it needs to be.
The query looks like this:
INSERT member (member, prename, name, street, postalcode, town, tel1, tel2, fax, bem, anrede, salutation, email, name2, name3, association, project) VALUES (2005, N'John', N'Doe', N'Street 4711', N'1234', N'Town', N'1234-5678', N'1234-5678', N'1234-5678', N'Leader', NULL, N'Dear Mr. Doe', N'a#b.com', N'This is the text i want to delete', N'Name2', N'Name3', NULL, NULL);
In the INSERT column list there was another column, which I removed via Notepad++ by searching for the term "example, " and replacing it with an empty string. I can't remove the corresponding entry in VALUES with this method, though, because the text varies there. So far I have only worked with the text file in which I adjusted the list of queries.
So, as you can see, there is one more entry in VALUES than in the column list (there was another column here, but it was removed by my change).
It is the entry after the email address. I would like to remove it, including the comma (N'This is the text i want to delete',).
My idea was to form a group and say that the 14th comma-separated value should be removed. However, even after some research I do not know how to do this.
I thought it could look like this (tried in https://regex101.com/)
VALUES\s?\((,) something here
Is this even the right approach or is there another method? I only knew Regex to solve this problem, because of course the values look different here.
And how can I finally use the regex to get the queries adapted (because the queries are local to my computer and not yet included in the code).
Short summary:
Change the query from
VALUES (... test5, test6, test7 ...)
To
VALUES (... test5, test7 ...)
As per my comment, you could use find/replace, where you search for:
(\bVALUES +\((?:[^,]+,){13})[^,]+,
And replace with $1
( - Open 1st capture group.
\bVALUES +\( - Match a word boundary, the literal 'VALUES', at least one space, and a literal opening parenthesis.
(?: - Open non-capturing group.
[^,]+, - Match anything but a comma at least once followed by a comma.
){13} - Close non-capture group and repeat it 13 times.
) - Close 1st capture group.
[^,]+, - Match anything but a comma at least once followed by a comma.
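If the queries live in a text file, the same find/replace can also be scripted. The following is a sketch in Python using the pattern above; the query string is the example from the question, and the replacement \1 plays the role of $1.

```python
import re

query = (
    "INSERT member (member, prename, name, street, postalcode, town, tel1, tel2, "
    "fax, bem, anrede, salutation, email, name2, name3, association, project) "
    "VALUES (2005, N'John', N'Doe', N'Street 4711', N'1234', N'Town', "
    "N'1234-5678', N'1234-5678', N'1234-5678', N'Leader', NULL, "
    "N'Dear Mr. Doe', N'a#b.com', N'This is the text i want to delete', "
    "N'Name2', N'Name3', NULL, NULL);"
)

# Group 1 keeps "VALUES (" plus the first 13 values; the 14th value and its comma are dropped.
PATTERN = re.compile(r"(\bVALUES +\((?:[^,]+,){13})[^,]+,")

fixed = PATTERN.sub(r"\1", query)
print(fixed)
```

Note that [^,]+ counts commas, so this only works as long as none of the first 14 values contains a comma inside its text.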
You may use the following to remove / replace the value you need:
Find What: \bVALUES\s*\((\s*(?:N'[^']*'|\w+))(?:,(?1)){12}\K,(?1)
Replace With: (empty string, or whatever value you need)
Details
\bVALUES - whole word VALUES
\s* - 0+ whitespaces
\( - a (
(\s*(?:N'[^']*'|\w+)) - Group 1: 0+ whitespaces and then either N' followed with any 0 or more chars other than ' and then a ', or 1+ word chars
(?:,(?1)){12} - twelve repetitions of , followed with the Group 1 pattern
\K - match reset operator that discards the text matched so far from the match memory buffer
, - a comma
(?1) - Group 1 pattern.

Regex get every string from start until new line?

I have a string like this :
Name: Yoza Jr
Address: Street 123, Canada
Email: yoza#gmail.com
I need to extract the data on each line, up to the newline, using a regex. For example,
starting from Name:, get Yoza Jr (everything up to the newline) as the name,
so that I end up with 3 values: Name, Address, Email.
How can I use a regex to get each string from its start until the newline?
btw I will use it in golang : https://regex-golang.appspot.com/assets/html/index.html
The pattern ^.*$ should work in multiline mode (in Go, prefix the pattern with (?m) so that ^ and $ match at line boundaries). This assumes that .* is not running in dot-all mode, meaning that .* will not extend past the \r?\n newline at the end of each line.
If you want to capture the field value, then use:
^[^:]+:\s*(.+)$
The value you want will be present in the first capture group. (.+ is used rather than \S+ because values such as Yoza Jr contain spaces.)
I would suggest you use the pattern ^(.+):\s*(.*)$
Demo: https://regex101.com/r/Q9D4RM/1
Not only will it result in 3 distinct matches for the string given by you, the field name (before the ":") will be read as group 1 of the match, and the value (after the ":") will be read as group 2. So, if you want the key-value pairs, you can just search for groups 1 and 2 for each match.
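As a quick check of the group layout, here is a small sketch in Python (not Go, so treat it as an illustration of the pattern only). It uses re.MULTILINE so that ^ and $ match at each line boundary; the variable names are made up for the example.

```python
import re

text = "Name: Yoza Jr\nAddress: Street 123, Canada\nEmail: yoza#gmail.com"

# Group 1 is the field name before the colon, group 2 the value after it.
pairs = re.findall(r"^(.+):\s*(.*)$", text, flags=re.MULTILINE)
record = dict(pairs)
print(record)
```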
Please let me know if it's unclear so I can elaborate.

Get text between a group of delimiters

I have a string of text with four delimiters (ST:, SI:, T:, and I:), each followed by a sequence of digits and letters. I need to capture the delimiter in a group called group and the digits and letters in a group called code.
ST:12YEOR48000FCT:24YEOR48000FCSI:12YEOR13000FCI:12YEOR13000FCT:12YEOR51200FCI:12YEOR14500FCST:12YEOR48000FCT:24YEOR48000FCSI:12YEOR13000FCI:12YEOR13000FCT:12ACTYEI:12ACTYET:32000ACTFCI:13300ACTFC
The results should be
GROUP CODE
ST: 12YEOR48000FC
T: 24YEOR48000FC
SI: 12YEOR13000FC
I: 12YEOR13000FC
T: 12YEOR51200FC
I: 12YEOR14500FC
ST: 12YEOR48000FC
T: 24YEOR48000FC
SI: 12YEOR13000FC
I: 12YEOR13000FC
T: 12ACTYE
I: 12ACTYE
T: 32000ACTFC
I: 13300ACTFC
(?'group'ST:|SI:|T:|I:)(?'code'.*?)(?<=ST:|SI:|T:|I:|$)
My thought is that I want grab the starting delimiter as the group, then any character as the code, until another delimiter or end of string is found. The regex I came up with gets the delimiters but not the code.
Thanks for any help.
You're using a positive lookbehind for your code group, which won't accomplish the functionality you're looking for.
However, you're on the right track! Removing the < to create a positive lookahead will achieve what you're looking for:
(?'group'ST:|SI:|T:|I:)(?'code'.*?)(?=ST:|SI:|T:|I:|$)
You should also consider optimizing the pattern a bit for maintainability by using nested matching groups to break out the colon token for each of your group items. This will make it easier to add group codes later and limit the potential of typos (i.e., forgetting the colon in the new group code):
(?'group'(?:ST|SI|T|I):)(?'code'.*?)(?=(?:ST|SI|T|I):|$)
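To confirm that the lookahead version splits the sample string as intended, here is a sketch in Python. Python's re module does not accept the (?'name'...) spelling, so the groups are written as (?P<name>...); everything else is the pattern above.

```python
import re

text = (
    "ST:12YEOR48000FCT:24YEOR48000FCSI:12YEOR13000FCI:12YEOR13000FC"
    "T:12YEOR51200FCI:12YEOR14500FCST:12YEOR48000FCT:24YEOR48000FC"
    "SI:12YEOR13000FCI:12YEOR13000FCT:12ACTYEI:12ACTYET:32000ACTFCI:13300ACTFC"
)

# The lazy code group stops as soon as the lookahead sees the next delimiter or the end.
DELIM_RE = re.compile(r"(?P<group>(?:ST|SI|T|I):)(?P<code>.*?)(?=(?:ST|SI|T|I):|$)")

pairs = [(m.group("group"), m.group("code")) for m in DELIM_RE.finditer(text)]
for group, code in pairs:
    print(group, code)
```

Because the lookahead does not consume anything, each delimiter is still available to start the next match.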

Regex: Removing Space Between Quotes, And Stopping Before a Colon (With Yahoo Pipes)

I've been working on this for a while, but it's beyond my understanding of regex.
I'm using Yahoo Pipes on an RSS, and I want to create hashtags from titles; so, I'd like to remove space from everything between quotes, but, if there's a colon within the quotes, I only want the space removed between the words before the colon.
And, it would be great if I could also capture the unspaced words as a group, to be able to use: #$1 to output the hashtag in one step.
So, something like:
"The New Apple: Worlds Within Worlds" Before We Begin...
Could be substituted like #$1 - with this result:
"#TheNewApple: Worlds Within Worlds" Before We Begin...
After some work, I was able to come up with this regex:
\s(?=\s)?|(‘|’|(Review)|:.*)
("Review" was a word that often came before colons and wouldn't be stripped, if it were later in the title; that's what that's for, but I would like to not require that, to be more universal)
But, it has two problems:
I have to use multiple steps. The result of that regex would be:
"TheNewApple: Worlds Within Worlds" Before We Begin...
And I could then add another regex step, to put the hash # in front
But, it only works if the quotes are first, and I don't know how to fix that...
You can do this all in one step with regex, with a caveat. You run into problems with a repeated capturing group because only the last iteration is available in the replacement string. Searching for ( (\w+))+ and replacing with $2 will replace all the words with just the last match - not what we want.
The way around this is to repeat the pattern an arbitrary number of times that will suffice for your use. Each separate group can be referenced.
Search: "(\w+)(?: (\w+))?(?: (\w+))?(?: (\w+))?(?: (\w+))?(?: (\w+))?
Replace: "#$1$2$3$4$5$6
This will replace up to 6-word titles, exactly as you need them. First, "(\w+) matches any word following a quote. In the replacement string, it is put back as "#$1, adding the hashtag. The rest is a repeated list of (?: (\w+))? matches, each matching a possible space and word. Notice the space is part of a non-capturing group; only the word is part of the inner capture group. In the replacement string, I have $1$2$3$4$5$6, which puts back the words, without the spaces. Notice that a colon will not match any part of this, so it will stop once it hits a colon.
Examples:
"The New Apple: Worlds Within Worlds" Before We Begin...
"The New Apple" Before We Begin...
"One: Two"
only "One" word
this has "Two Words"
"The Great Big Apple Dumpling"
"The Great Big Apple Dumpling Again: Part 2"
Results:
"#TheNewApple: Worlds Within Worlds" Before We Begin...
"#TheNewApple" Before We Begin...
"#One: Two"
only "#One" word
this has "#TwoWords"
"#TheGreatBigAppleDumpling"
"#TheGreatBigAppleDumplingAgain: Part 2"
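The search/replace pair can be checked with a short script. This sketch uses Python, where the replacement references are written \1 to \6 instead of $1 to $6 and optional groups that did not take part in the match are substituted as empty strings (Python 3.5 or later). Whether Yahoo Pipes treats unmatched groups the same way is an assumption here.

```python
import re

SEARCH = r'"(\w+)(?: (\w+))?(?: (\w+))?(?: (\w+))?(?: (\w+))?(?: (\w+))?'
REPLACE = r'"#\1\2\3\4\5\6'

examples = [
    '"The New Apple: Worlds Within Worlds" Before We Begin...',
    'only "One" word',
    'this has "Two Words"',
    '"The Great Big Apple Dumpling Again: Part 2"',
]

results = [re.sub(SEARCH, REPLACE, title) for title in examples]
for line in results:
    print(line)
```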
You can match the text with
"([^:]*)(.*?)"(.*)
then use some programming language to output the result like this:
'"#' + removeSpace($1) + $2 + '"' + $3
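One way to write that out, sketched in Python since the question does not name a language: the removeSpace step becomes str.replace, and hashtag_title is a hypothetical function name.

```python
import re

# Group 1: text up to the first colon, group 2: the rest inside the quotes,
# group 3: whatever follows the closing quote.
TITLE_RE = re.compile(r'"([^:]*)(.*?)"(.*)')

def hashtag_title(title):
    """Rebuild the title with the spaces removed from group 1 and '#' prepended."""
    def build(m):
        return '"#' + m.group(1).replace(" ", "") + m.group(2) + '"' + m.group(3)
    return TITLE_RE.sub(build, title, count=1)

print(hashtag_title('"The New Apple: Worlds Within Worlds" Before We Begin...'))
```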
I have no idea what language you're using, but this seems like a poor choice for regex. In Python I'd do this:
# Python 3
titles = ['''"The New Apple: Worlds Within Worlds" Before We Begin...''',
          '''"Made Up Title: For Example Only" So We Can Continue...''']
hashtagged_titles = []
for title in titles:
    # Split at the first colon only; the separator and the rest stay untouched.
    hashtagme, colon, restofstring = title.partition(":")
    # Drop the opening quote, strip the spaces, and prepend '"#'.
    hashtag = '"#' + hashtagme[1:].replace(" ", "")
    hashtagged_titles.append(hashtag + colon + restofstring)
Do a global search for
\ (?=.*:)
and replace each match with nothing.
You'll need a second search on the results of that if you want to capture "TheNewApple" as a single word.
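For reference, this is what that global replacement does to the sample title, sketched in Python. The pattern removes every space that still has a colon somewhere after it, so it relies on the only colon being the one inside the quotes.

```python
import re

title = '"The New Apple: Worlds Within Worlds" Before We Begin...'

# Remove each space that is followed, somewhere later in the string, by a colon.
unspaced = re.sub(r" (?=.*:)", "", title)
print(unspaced)
```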