Pyspark - Regex - Extract value from last brackets - regex

I created the following regular expressions with the idea of extracting the last element in brackets. If the string has only one pair of parentheses, it works fine, but if it has two pairs, the first expression extracts the first one (which is wrong) and the second extracts the value with the brackets included.
Do you know how to solve it?
from pyspark.sql.functions import col, regexp_extract
tmp = spark.createDataFrame(
[
(1, 'foo (123) oiashdj (hi)'),
(2, 'bar oiashdj (hi)'),
],
['id', 'txt']
)
tmp = tmp.withColumn("old", regexp_extract(col("txt"), r"(?<=\().+?(?=\))", 0))
tmp = tmp.withColumn("new", regexp_extract(col("txt"), r"\(([^)]+)\)?$", 0))
tmp.show()
+---+--------------------+---+----+
| id|                 txt|old| new| needed
+---+--------------------+---+----+
|  1|foo (123) oiashdj...|123|(hi)| hi
|  2|    bar oiashdj (hi)| hi|(hi)| hi
+---+--------------------+---+----+

To extract the substring between parentheses with no other parentheses inside at the end of the string you may use
tmp = tmp.withColumn("new", regexp_extract(col("txt"), r"\(([^()]+)\)$", 1))
Details
\( - matches (
([^()]+) - captures into Group 1 any 1+ chars other than ( and )
\) - a ) char
$ - at the end of the string.
The 1 argument tells the regexp_extract to extract Group 1 value.
See the regex demo online.
NOTE: To allow trailing whitespace, add \s* right before $: r"\(([^()]+)\)\s*$"
NOTE2: To match the last occurrence of such a substring in a longer string, with exactly the same code as above, use
r"(?s).*\(([^()]+)\)"
The .* will grab all the text up to the end, and then backtracking will do the job.
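The three patterns above can be tried without a Spark session. This is a minimal sketch using Python's re module instead of regexp_extract (an assumption made for convenience; Spark uses Java regex, and these particular patterns should behave the same way there). The helper name extract is hypothetical; it mimics regexp_extract by returning an empty string when nothing matches.

```python
import re

# Patterns taken from the answer above.
LAST_AT_END = re.compile(r"\(([^()]+)\)$")        # group must end the string
LAST_AT_END_WS = re.compile(r"\(([^()]+)\)\s*$")  # trailing whitespace allowed
LAST_ANYWHERE = re.compile(r"(?s).*\(([^()]+)\)") # last group, text may follow


def extract(pattern, text):
    """Return Group 1 of the first match, or "" when there is no match."""
    m = pattern.search(text)
    return m.group(1) if m else ""


print(extract(LAST_AT_END, "foo (123) oiashdj (hi)"))         # hi
print(extract(LAST_AT_END_WS, "bar oiashdj (hi)   "))         # hi
print(extract(LAST_ANYWHERE, "foo (123) oiashdj (hi) tail"))  # hi
```

In the last pattern the greedy .* first consumes the whole string and then gives characters back until a parenthesized group fits, which is why the rightmost group wins.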

This should work. Use it with the single line flag.
\([^\(\)]*?\)(?!.*\([^\(\)]*?\))
https://regex101.com/r/Qrnlf3/1
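A small sketch of how the lookahead pattern behaves, written with Python's re module as an assumption (the question itself targets PySpark). Note that the match includes the brackets, so this sketch strips them afterwards; the function name last_bracket_value is hypothetical.

```python
import re

# re.DOTALL plays the role of the "single line" flag: "." also matches newlines.
LAST_GROUP = re.compile(r"\([^\(\)]*?\)(?!.*\([^\(\)]*?\))", re.DOTALL)


def last_bracket_value(text):
    """Return the content of the last (...) group, or "" if there is none."""
    m = LAST_GROUP.search(text)
    # m.group(0) looks like "(hi)", so drop the first and last character.
    return m.group(0)[1:-1] if m else ""


print(last_bracket_value("foo (123) oiashdj (hi)"))  # hi
print(last_bracket_value("bar oiashdj (hi)"))        # hi
```

The negative lookahead rejects any bracketed group that is followed by another bracketed group, so only the last one can match.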

Related

Regex handling "|" in Text

I got the following text:
Code = ABCD123 | Points = 30
Code = ABCD333 | Points = 44
In the end, I want to remove everything except the Code, so the output is:
ABCD123
ABCD333
I actually tried it with
Code = | P.+
But I don't know how to get "|" removed. Currently, I have just ABCD333 | left as an example.
I'm struggling there.
Assuming the code only consists of word characters, you may use the following:
^Code = (\w+).+$
..and replace with:
\1
Demo.
If the code can be anything, you may use something like this instead:
^Code = (.+?)[ ]\|.+$
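Both patterns can be checked quickly outside the editor. The sketch below uses Python's re module (an assumption; the question does not name a tool at this point) with re.MULTILINE so that ^ and $ anchor at every line. The names WORD_CODE, ANY_CODE and keep_codes are hypothetical.

```python
import re

text = "Code = ABCD123 | Points = 30\nCode = ABCD333 | Points = 44"

# Variant 1: the code consists of word characters only.
WORD_CODE = re.compile(r"^Code = (\w+).+$", re.MULTILINE)
# Variant 2: the code can be anything up to the " |" separator.
ANY_CODE = re.compile(r"^Code = (.+?)[ ]\|.+$", re.MULTILINE)


def keep_codes(pattern, s):
    """Replace every matching line with the content of Group 1."""
    return pattern.sub(r"\1", s)


print(keep_codes(WORD_CODE, text))
print(keep_codes(ANY_CODE, "Code = AB-12 X | Points = 1"))
```

The pipe is escaped as \| because an unescaped | means alternation, which is why the attempt in the question left the separator behind.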
Ctrl+H
Find what: ^Code = (\w+).+
Replace with: $1
CHECK Wrap around
CHECK Regular expression
UNCHECK . matches newline
Replace all
Explanation:
^ # beginning of line
Code = # literally
(\w+) # group 1, 1 or more word character
.+ # 1 or more any character but newline
Replacement:
$1 # content of group 1

Regex matching all characters from the beginning of the string to the first underscore

I am trying to substring elements of a vector to only keep the part before the FIRST underscore. I am a bit of a newbie with taking substrings and don't fully understand all regex yet. I am close to the answer, I can get the part that I want to delete but still don't see how to get the opposite part. Any help and/or explanation of regex is appreciated!
my vector looks like the following, with multiple underscores in some elements
v = c("WL_Alk", "LQ_Frac_C_litter_origin", "MI_Nr_gat", "SED_C_N", "WL_CO2", "WL_S")
my desired output looks like
v_short = c("WL", "LQ", "MI", "SED", "WL", "WL")
The code that gets me the part I want to delete is sub("^[^_]*", "", v). I think I have to do something with $ in regex because sub("[_$]", "", v) deletes the first underscore, but I can't get it to delete the part behind it. Even with the regex helpfile I don't fully understand the meaning of ^, $ and * yet, so explanation on those is also appreciated!
You can use
> v = c("WL_Alk", "LQ_Frac_C_litter_origin", "MI_Nr_gat", "SED_C_N", "WL_CO2", "WL_S")
> sub("_.*", "", v)
[1] "WL" "LQ" "MI" "SED" "WL" "WL"
The "_.*" pattern matches the first _ and .* matches any 0+ characters up to the end of string greedily (that is, grabs them at one go).
With stringr str_extract, you can use your pattern:
> library(stringr)
> v_short = str_extract(v, "^[^_]*")
> v_short
[1] "WL" "LQ" "MI" "SED" "WL" "WL"
The ^[^_]* pattern matches the beginning of the string and 0 or more characters other than _.
If I understood correctly, this should work:
gsub("(.*?)(_.*)","\\1",v, perl = TRUE)
Explanation:
(.*?) the first capturing group;
(_.*) the second capturing group;
\\1 return the first capturing group;
There are two ways to do it.
Either use ^[^_]+ and match string before first _. Regex101 Demo
OR
Select the part after first _ using \_.+$ and eliminate it. Regex101 Demo
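For readers who want to try the two approaches outside R, here is a rough Python equivalent (an assumption; the question uses R's sub and stringr). The comments spell out what ^, $, * and [^_] mean, since the question asks about them.

```python
import re

v = ["WL_Alk", "LQ_Frac_C_litter_origin", "MI_Nr_gat", "SED_C_N", "WL_CO2", "WL_S"]

# Approach 1: delete the first "_" and everything after it.
#   _    a literal underscore (the leftmost one is found first)
#   .*   any character, zero or more times, greedily up to the end
by_removal = [re.sub(r"_.*", "", s, count=1) for s in v]

# Approach 2: keep only the part before the first "_".
#   ^      start of the string
#   [^_]   any single character that is NOT an underscore
#   *      repeat the previous item zero or more times
# ($ would anchor at the end of the string; it is not needed here.)
by_extraction = [re.match(r"^[^_]*", s).group(0) for s in v]

print(by_removal)
print(by_extraction)
```

Both lists come out identical; the first approach removes the unwanted tail, the second extracts the wanted head.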

Regex for receipt items

I have a simple receipt which I typed out. I need to be able to read the items purchased on the receipt. The sample receipt is below.
Tim Hortons
Alwasy Fresh
1 Brek Wrap Combo /A ($0.76)
1 Bacon-wrap $3.79
1 Grilled $0.00
1 5 Pieces Bacon-wrap $0.00
1 Orange $1.40
1 Deposit $0.10
Subtotal: $55.84
GST: $0.29
Debit: $55.84
Take out
Thanks for stopping by!!
Tell us how we did
I came up with the following regex string to find the items.
\d(\s){1,10}(.)*\s{1,}\$\d\.[0-9]{2}
It works for the most part but there are a few incorrect lines like
4
GST: $0.29
Can someone come up with a better pattern. Below is a link to see it in action.
http://regexr.com/3cnk9
I see a number of problems with this original regex:
\d(\s){1,10}(.)*\s{1,}\$\d\.[0-9]{2}
First, parentheses both group and capture, though when you quantify a group, only the last iteration is captured, so a group like (.)* will only store the last character; you wanted (.*) for that. Since it's greedy, that will be the character before the space preceding a dollar sign, which given your data will always be a space. Similarly, you're quantifying a group at the beginning with (\s){1,10}, which captures only the last whitespace character. In this case, you don't need the group since \s already matches a single whitespace character, so you can simply use \s{1,10}.
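The "only the last iteration is captured" behaviour is easy to see in a short experiment. This sketch uses Python's re module (an assumption; the question does not name a language) and a hypothetical receipt line padded with several spaces before the price.

```python
import re

line = "1 Orange    $1.40"  # hypothetical line with four spaces before the price

# Original pattern: the quantified groups (\s){1,10} and (.)* each keep
# only their final iteration.
original = re.match(r"\d(\s){1,10}(.)*\s{1,}\$\d\.[0-9]{2}", line)
print(repr(original.group(2)))  # ' '  (just one character, a space)

# Moving the quantifier inside the group captures the whole run instead.
fixed = re.match(r"\d\s{1,10}(.*)\s{1,}\$\d\.[0-9]{2}", line)
print(repr(fixed.group(1)))     # 'Orange   '  (greedy, so it keeps spare spaces)
```

The trailing spaces in the second capture are the reason the capturing solution further down ends its item group with \S.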
Here is a piece-by-piece explanation of what that regular expression does.
Capturing solution
The following regex captures the quantity ($1), item description ($2), whether the price is parenthesized ($3), and the price ($4):
^\s*(\d+)\s+(.*\S)\s+(\(?)\$([0-9.]+)\)?\s*$
Explained and matched to your sample at regex101.
Separated out and commented (assumes the /x flag is supported):
/ # begin regex
^\s* # start of line, ignore leading spaces if present
(\d+) # $1 = quantity
\s+ # spacing as a delimiter
(.*\S) # $2 = item: contains anything, must end in a non-space char
\s+ # spacing as a delimiter
(\(?) # $3 = negation, an optional open parenthesis
\$ # dollar sign
([0-9.]+) # $4 = price
\)?\s*$ # trailing characters: optional end-paren and space(s)
/x # end regex; /x is the extended (free-spacing) flag that permits this commented layout
with sample perl code executed from a command line:
perl -ne '
my ($quantity, $item, $neg, $price)
= /^\s*(\d+)\s+(.*\S)\s+(\(?)\$([0-9.]+)\)?\s*$/;
if ($item) {
if ($neg) { $price *= -1; }
print "<$quantity><$item><$price>\n"
}' RECEIPT_FILE
(If you want that as a perl script, wrap the code with while(<>) { } and you're done.)
This assigns the variables $quantity, $item, and $price to the itemized lines on your receipt. I am assuming that a parenthesized item is to be subtracted (but I can't verify that since the totals are nonsensical), so $neg notes the existence of a parenthesis so the $price can be negated.
I set the output to use angle brackets (< and >) to indicate what each variable stores.
The output of your given sample receipt would therefore be:
<1><Brek Wrap Combo /A><-0.76>
<1><Bacon-wrap><3.79>
<1><Grilled><0.00>
<1><5 Pieces Bacon-wrap><0.00>
<1><Orange><1.40>
<1><Deposit><0.10>
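The same regex ports directly to other engines. Below is a rough Python sketch of the Perl one-liner, processing one line at a time as perl -n does. The function name parse_receipt and the use of Decimal for prices are my own assumptions, not part of the answer; Decimal would raise an error on a malformed price such as "1.2.3", which [0-9.]+ technically allows.

```python
import re
from decimal import Decimal

ITEM_LINE = re.compile(r"^\s*(\d+)\s+(.*\S)\s+(\(?)\$([0-9.]+)\)?\s*$")


def parse_receipt(lines):
    """Return (quantity, item, price) tuples for the itemized lines only."""
    items = []
    for line in lines:
        m = ITEM_LINE.match(line)
        if not m:
            continue  # headers, totals and footers do not match
        quantity, item, neg, price = m.groups()
        amount = Decimal(price)
        if neg:  # an open parenthesis marks a negative amount
            amount = -amount
        items.append((int(quantity), item, amount))
    return items


sample = """Tim Hortons
1 Brek Wrap Combo /A ($0.76)
1 Bacon-wrap $3.79
1 5 Pieces Bacon-wrap $0.00
Subtotal: $55.84
GST: $0.29"""

for row in parse_receipt(sample.splitlines()):
    print(row)
```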
Prices only solution
You didn't say what you wanted to match. If you don't care about anything but the prices and there are no negative values, you don't need capture groups if you have look-behind or \K:
grep -Po '^\s*[0-9].*\$\K[0-9.]+' RECEIPT_FILE
Grep's -P flag invokes libpcre (which may not be available if you're on an old or embedded system) and -o displays only the matching text. \K denotes the start of the match. Put the \$ after the \K if you want to capture it. (See also the regex101 description and matches.)
Output from that grep command:
0.76
3.79
0.00
0.00
1.40
0.10
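Python's built-in re module does not support \K, and its look-behind must be fixed-width, so a port of the grep command has to fall back on a capture group. A minimal sketch under that assumption (the names PRICE and receipt are hypothetical):

```python
import re

# Same idea as the grep pattern, with a capture group standing in for \K.
PRICE = re.compile(r"^\s*[0-9].*\$([0-9.]+)", re.MULTILINE)

receipt = """Tim Hortons
1 Brek Wrap Combo /A ($0.76)
1 Bacon-wrap $3.79
1 Grilled $0.00
Subtotal: $55.84
GST: $0.29"""

# findall returns the content of the single group for every match.
prices = PRICE.findall(receipt)
print(prices)
```

Lines such as "Subtotal" and "GST" are skipped because they do not start with a digit.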
Prices only – with awk
There aren't great ways to handle this regex with efficiency. If you're processing through a mountain of content, you'll feel the hurt. Here's a solution using awk that should be significantly faster. (The difference won't be noticeable with a small input.)
awk '$1 / 1 > 0 && $NF ~ /\$/ { gsub(/[()]/, "", $NF); print $NF; }' RECEIPT_FILE
Commented version with explanation:
awk '
# if the quantity is indeed a number and the last field has a dollar sign
$1 / 1 > 0 && $NF ~ /\$/ {
gsub(/[()]/, "", $NF); # remove all parentheses from the last field
print $NF; # print the contents of the last field
}' RECEIPT_FILE
Prices only – with awk, supporting negative prices
awk '
# if the quantity is indeed a number and the last field has a dollar sign
$1 / 1 > 0 && $NF ~ /\$/ {
neg = 1;
if ( $NF ~ /\(/ ) { # the last field has an open parenthesis
neg = -1;
}
gsub(/[()$]/, "", $NF); # remove parentheses and the dollar sign so the field is numeric
print $NF * neg; # print the last field, negated if parenthesized
}' RECEIPT_FILE
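The field-splitting idea behind the awk versions does not need a regex at all. Here is a rough Python sketch of it, assuming whitespace-separated fields as in awk; the function name prices_by_fields is hypothetical, and unlike awk it requires the first field to be made of digits only.

```python
def prices_by_fields(lines):
    """Return the price of each itemized line, negative when parenthesized."""
    prices = []
    for line in lines:
        fields = line.split()  # split on runs of whitespace, like awk
        if len(fields) < 2:
            continue
        first, last = fields[0], fields[-1]
        # Keep lines whose first field is a positive integer quantity
        # and whose last field contains a dollar sign.
        if not first.isdigit() or int(first) <= 0 or "$" not in last:
            continue
        sign = -1 if "(" in last else 1
        amount = last.replace("(", "").replace(")", "").replace("$", "")
        prices.append(sign * float(amount))
    return prices


lines = [
    "Tim Hortons",
    "1 Brek Wrap Combo /A ($0.76)",
    "1 Bacon-wrap $3.79",
    "Subtotal: $55.84",
    "Take out",
]
print(prices_by_fields(lines))
```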
Here's my attempt:
^(\d+)\s+(.*)\s+\(?(\$.+)\)?$
Stub. Remember to turn the multiline option on. Components:
^ - beginning of line
(\d+) - capture the quantity at the beginning of each line item
\s+ - one or more space
(.*) - capture the item description
\s+ - one or more space
\(? - optional open bracket `(` character
(\$.+) - capture anything including and after the dollar sign
\)? - optional close bracket `)` character
$ - end of line
You can use
^(\d+)\s+(.*?)\s+\(?\$(\d+\.\d+)
See the regex demo
This regex should be used with the /m modifier to match data on different lines. In JS, the /g modifier is also required.
Explanation:
^ - start of a line
(\d+) - Group 1 capturing one or more digits
\s+ - one or more whitespaces
(.*?) - Group 2 capturing zero or more characters other than a newline, as few as possible, up to the closest match of the subpattern that follows
\s+ - one or more whitespaces
\(? - an optional ( (on the first line)
\$ - a literal $
(\d+\.\d+) - Group 3 capturing one or more digits followed with . and one or more digits.
JS demo:
var re = /^(\d+)\s+(.*?)\s+\(?\$(\d+\.\d+)/gm;
var str = ' Tim Hortons\n Alwasy Fresh\n\n1 Brek Wrap Combo /A ($0.76)\n1 Bacon-wrap $3.79\n1 Grilled $0.00\n1 5 Pieces Bacon-wrap $0.00\n1 Orange $1.40\n1 Deposit $0.10\nSubtotal: $55.84\nGST: $0.29\nDebit: $55.84\nTake out\n\n Thanks for stopping by!!\n Tell us how we did';
var m;
while ((m = re.exec(str)) !== null) {
document.body.innerHTML += "Pcs: <b>" + m[1] + "</b>, item: <b>" + m[2] + "</b>, paid: <b>" + m[3] + "</b><br/>";
}
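For comparison, a rough Python version of the same demo (an assumption; the answer itself uses JavaScript). re.MULTILINE corresponds to the /m modifier, and finditer plays the role of the exec loop with /g. The shortened receipt string is hypothetical sample input.

```python
import re

ITEM = re.compile(r"^(\d+)\s+(.*?)\s+\(?\$(\d+\.\d+)", re.MULTILINE)

receipt = (
    "Tim Hortons\n"
    "1 Brek Wrap Combo /A ($0.76)\n"
    "1 5 Pieces Bacon-wrap $0.00\n"
    "GST: $0.29\n"
)

# Each match yields (quantity, item, price) as strings.
rows = [m.groups() for m in ITEM.finditer(receipt)]
for pcs, item, paid in rows:
    print(f"Pcs: {pcs}, item: {item}, paid: {paid}")
```

Because Group 2 is lazy, it stops at the first point where whitespace, an optional parenthesis and a dollar amount follow, so the item text carries no trailing spaces.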
Adam Katz's answer should be the accepted one! I used this variation of his answer for an implementation in JavaScript:
const receiptRegex = /^\s*(\d+)\s+(.*\S)\s+(\(?)\$([0-9.]+)\)?\s*$/gm
let items = [];
const matches = inputStr.matchAll(receiptRegex);
for (const matchedGroup of matches) {
const [
fullString, //[0] -> matched string "1 Blue gatorade $2.00"
quantity, //[1] -> quantity "1"
item, //[2] -> item description "Blue gatorade"
ignoredSymbol, //[3] -> "(" if the price is parenthesized, otherwise "" (ignored here)
price //[4] -> amount "2.00"
] = matchedGroup;
items.push({
quantity,
item,
price,
});
}

Regex to select NOT and operand

I am trying to break a string into an array using Regex in C#.
I have for example the string
{([Field] = '100' OR [LaneDescription] LIKE '%DENTINPALEUW%'
OR [LaneDescription] = 'asdf' OR ([ObjectID] = 1) AND [ITEM_HEIGHT] >=
10 AND [SENDER_COMPANY] NOT LIKE '%DHL%'}
(Generated from Telerik RadFilter)
and I need it broken up so I can pass it to a custom object with types: open parenthesis, field, comparator, value, close parenthesis.
So far, with the help of http://regexr.com, I have arrived at
\[([^\[\]]*)\]+|[\w'%]+|[()=]
but I need to get '>=' and 'NOT LIKE' as single tokens (and similar operators like <>, != etc.)
You can see my late night attempts at http://regexr.com/39g6b
Any help would be much appreciated.
(PS: There are no newline characters at the string)
Try
\(|\)|\[[a-zA-Z0-9_]+\]|'.*?'|\d+|NOT LIKE|\w+|[=><!]+
Demo.
Explanation:
\( // match "(" literally
| // or
\) // ")"
| // or
\[[a-zA-Z0-9_]+\] // any words inside square braces []
|
'.*?' // strings enclosed in single quotes '' (escape sequences can easily trip this up though)
|
\d+ // digits
|
NOT LIKE // "NOT LIKE", because this is the only token that can contain whitespace
|
\w+ // words like "NOT", "AND", etc
|
[=><!]+ // operators like ">", "!=", etc
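The alternation is not specific to one engine, so it can be tried quickly in Python (an assumption; the question targets C#). This sketch tokenizes a shortened, hypothetical version of the filter string with re.findall.

```python
import re

# Order matters: "NOT LIKE" is listed before \w+ so it is kept as one token.
TOKEN = re.compile(r"\(|\)|\[[a-zA-Z0-9_]+\]|'.*?'|\d+|NOT LIKE|\w+|[=><!]+")

expr = ("([Field] = '100' OR [ITEM_HEIGHT] >= 10 "
        "AND [SENDER_COMPANY] NOT LIKE '%DHL%')")

tokens = TOKEN.findall(expr)
print(tokens)
```

Whitespace between tokens is simply skipped because no alternative matches it.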

How to match the whole expression only, even when there are sub parts that match?

I'm trying to write an input validation pattern that allows wildcard characters. The input field is 9 characters max and should follow these rules:
* + 1-8 characters
1-8 characters + *
* + 1-7 characters + *
I've written this regex using the regex documentation and testing it on one of the regex testers.
\*{1}[0-9]{1,7}\*{1}|[0-9]{1,8}\*{1}|\*{1}[0-9]{1,8}|[0-9]{9}
It matches all these correctly
123456789
*1*
*12*
*123*
*1234*
*12345*
*123456*
*1234567*
1234567*
123456*
12345*
1234*
123*
12*
1*
*1
*12
*123
*1234
*12345
*123456
*1234567
*12345678
But it also matches when I don't want it to. For example, it finds 2 matches in *123456789*: the first match is *12345678 and the second one is 9*.
I don't want in this case to find any matches. Either the whole string matches one of the patterns or not. How does one do that?
Use anchors that make sure the regex always matches the entire string:
^(\*[0-9]{1,7}\*|[0-9]{1,8}\*|\*[0-9]{1,8}|[0-9]{9})$
Note the parentheses to make sure that the alternation is contained within the group:
^
(
\*[0-9]{1,7}\*
|
[0-9]{1,8}\*
|
\*[0-9]{1,8}
|
[0-9]{9}
)
$
Also, {1} is always superfluous - one match per token is the default.
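The effect of the anchors can be checked with a short script. This is a sketch in Python (an assumption; the question does not name a language), and the names UNANCHORED, ANCHORED and is_valid are hypothetical.

```python
import re

# The pattern from the question (minus the redundant {1}), without anchors.
UNANCHORED = re.compile(r"\*[0-9]{1,7}\*|[0-9]{1,8}\*|\*[0-9]{1,8}|[0-9]{9}")
# The anchored version from the answer.
ANCHORED = re.compile(r"^(\*[0-9]{1,7}\*|[0-9]{1,8}\*|\*[0-9]{1,8}|[0-9]{9})$")


def is_valid(value):
    """True only when the whole input matches one of the allowed shapes."""
    return ANCHORED.match(value) is not None


# Without anchors the engine finds partial matches inside a bad input.
print(UNANCHORED.findall("*123456789*"))  # ['*12345678', '9*']

print(is_valid("*123456789*"))  # False
print(is_valid("*1234567*"))    # True
print(is_valid("123456789"))    # True
```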
You could use start and end string anchors:
http://www.regular-expressions.info/anchors.html
So, your regex would be something like this (note first and last symbol):
^(\*{1}[0-9]{1,7}\*{1}|[0-9]{1,8}\*{1}|\*{1}[0-9]{1,8}|[0-9]{9})$