How to convert a regular expression from OR to XOR - regex

I wish to evaluate a structure similar to the following:
The house is green but my favorite colors are blue red and yellow
I determine the color of the house with a regular expression like this:
the house\s+(\w\s*)+(?=(cyan|green|red|blue))
What does it do? This expression returns the next match:
The house is green but my favorite colors are blue
That is, it matches up to the last color from the list that appears in the string, i.e. it consumes everything up to just before RED, even though the first color that appears is GREEN.
What should I do? What I'm looking for is to just take the first color mentioned in the list and stop looking, that is to tell me that the house color is green, and nothing else.
Q1: How do I scan the string until exactly one of the listed expressions appears? That is, how do I convert the expression (cyan or green or blue or red) into a list that behaves like an XOR? Important: use only regular expressions, i.e. without relying on any host language such as .NET, Java, Perl, etc.
Q2: Is there an alternative to regular expressions that I missed? That is, is the road I took the right one?
In advance, thank you all

It's returning the last match because your (\w\s*)+ is greedy; it matches as much as it can (i.e. all the way up to just before the 'red').
You could change it to non-greedy by using +? instead of +:
the house\s+(\w\s*)+?(?=(cyan|green|red|blue))
But I think you can do better than that.
Why (\w\s*)+? You're potentially matching just a single letter at a time! Why not match whole words instead, with (\w+\s+)+?
Also, why not just match up to the first colour?
the\s+house\s+(\w+\s+)+?(cyan|green|red|blue)
Then capturing group 2 (the second set of brackets) will contain the first occurrence of cyan, green, red, or blue (i.e. your colour list). Note the +?, which makes the word regex non-greedy, meaning it won't gobble up instances of 'cyan', 'green', 'red' or 'blue'.
You could even just do
house.*?\b(cyan|green|red|blue)
Where the .*? is non-greedy, and just gobbles everything up, up to the first colour. The \b is a "word boundary" and just makes sure the regex doesn't match the 'red' in 'desired', for example.
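To see the difference between the greedy and non-greedy versions, here is a minimal Python sketch. Python is only used to run the patterns; the variable names are illustrative and the "desired" sentence is a made-up test string.

```python
import re

s = "The house is green but my favorite colors are blue red and yellow"
colors = r"\b(cyan|green|red|blue)"

# Greedy .* runs to the end of the string, then backtracks to the LAST colour.
greedy = re.search(r"house.*" + colors, s)
# Non-greedy .*? expands one character at a time and stops at the FIRST colour.
lazy = re.search(r"house.*?" + colors, s)

print(greedy.group(1))  # red
print(lazy.group(1))    # green

# Made-up sentence: \b keeps the pattern from matching the "red" inside "desired".
t = "The house I desired is blue"
with_boundary = re.search(r"house.*?\b(cyan|green|red|blue)", t)
without_boundary = re.search(r"house.*?(cyan|green|red|blue)", t)

print(with_boundary.group(1))     # blue
print(without_boundary.group(1))  # red
```

The only difference between the first two patterns is the ? after the *, which is what flips the result from the last colour to the first.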

This is how I would do it in Python; I'm not sure if other languages have the .search feature.
"What I'm looking for is to just take the first color mentioned in the list and stop looking, "
import re
s = 'The house is green but my favorite colors are blue red and yellow'
# .search scans the whole string and stops at the first colour it finds
print(re.search('(cyan|green|red|blue)', s).group(1))
# .match only matches at the start of the string, so the prefix has to be spelled out
print(re.match('The house is (cyan|green|red|blue)', s).group(1))
Note the lack of spaces in (cyan|green|red|blue).
It prints this:
green
green

Related

RegexReplace the nth occurrence of a string of underscores

I'm having trouble getting a REGEXREPLACE working in a Google Sheets formula. I'm aiming to replicate a certain card game which is opposed to humankind. I have a cell containing a string which contains one, two or three occurrences of a series of underscores, e.g.
"_____ is the new _____"
And let's say I want to substitute in the strings "Orange" for the first occurrence, and "Black" for the second occurrence.
I don't know how many underscores will be in each string, it could be one or more, so it seems like a job for regex. I tried SUBSTITUTE and it didn't seem to recognise asterisks. Based on this link, I tried using {1} {2} and {3} to match the first/second/third occurrence, but I'm not doing something right:
=REGEXREPLACE(G16,".*(_*){1}.*",G17)
G16 is: _____ is the new _____.
G17 is: Orange
The output of the formula is: OrangeOrange.
Can anyone help me figure out the correct way to do this?
You may use
=REGEXREPLACE(REGEXREPLACE(G16,"^([^_]*)_+","$1Orange"), "^([^_]*)_+", "$1Black")
|----- First occurrence -----------------|
|----------------- Second occurrence ------------------------------------------|
Details
^ - start of string
([^_]*) - Capturing group 1 ($1 will refer to this group value): 0 or more chars other than an underscore
_+ - 1 or more underscores.
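The same nesting idea can be sketched outside Sheets. The following Python snippet is only an illustration (the helper name replace_first_blank is made up, and re.sub stands in for REGEXREPLACE); it applies the same pattern once per substitution:

```python
import re

def replace_first_blank(text, word):
    """Replace the first run of underscores in text with word.

    Same pattern as the formula: a prefix without underscores (group 1),
    followed by one or more underscores.
    """
    # A function replacement avoids any escaping issues in `word`.
    return re.sub(r"^([^_]*)_+", lambda m: m.group(1) + word, text, count=1)

template = "_____ is the new _____"
step1 = replace_first_blank(template, "Orange")
step2 = replace_first_blank(step1, "Black")

print(step1)  # Orange is the new _____
print(step2)  # Orange is the new Black
```

Each call consumes the leftmost remaining blank, which is why nesting the calls fills the first, then the second occurrence.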

Detecting whole number with an "x" or "-" after using regex

I'm trying to use regex to detect the quantity in a list of items on a receipt. The software uses OCR, so the returned text can vary a bit. To help, I've narrowed it down by assuming that the quantity will always be at the start of the line and is always a whole number. The use cases I'm trying to cover are:
2 Burgers $4.00
2 x Burgers $4.00
2 X Burgers $4.00
2x Burgers $4.00
2X Burgers $4.00
2- Burgers $4.00
2 - Burgers $4.00
The plan is for the regex to return 2 for each example above. The regex I have so far is \\d{1,2}(\\s[xX]|[xX]); this returns the top three examples fine, but as much as I have tried I can't seem to get the rest detected. I haven't looked at adding the - yet, as I was stuck on detecting the x right next to the integer.
Any help would be great, thanks
To help, I've narrowed it down by assuming that the quantity will always be at the start of the line and is always a whole number.
I suggest using something like
let pattern = "(?m)^\\d+"
See the regex demo.
The pattern will match 1 or more digits at the start of any line:
(?m) - a MULTILINE modifier that makes ^ match the start of a line rather than the start of a string
^ - start of a line
\d+ - 1 or more (+) digits.
If you need to specify that some text should follow the digits, use a positive lookahead. E.g. you may require x/X/- after 0+ whitespaces, or a whitespace right after. Then, you need to use
let pattern = "(?m)^\\d+(?=\\s*[xX-]|\\s)"
Here, (?=\\s*[xX-]|\\s) will make the regex match only those digits at the start of the line(s) that are immediately followed with either 0+ whitespace chars and then X, x or -, or that are immediately followed with a whitespace.
See this regex demo.
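As a quick check of the lookahead idea, here is a small Python sketch run against the seven sample lines from the question. It is not the Swift code from the answer: it uses a raw string (so single backslashes) and keeps the ^ anchor so that only digits at the start of a line qualify.

```python
import re

lines = """2 Burgers $4.00
2 x Burgers $4.00
2 X Burgers $4.00
2x Burgers $4.00
2X Burgers $4.00
2- Burgers $4.00
2 - Burgers $4.00"""

# Digits at the start of a line, followed by optional whitespace and x/X/-,
# or followed directly by whitespace.
anchored = re.findall(r"(?m)^\d+(?=\s*[xX-]|\s)", lines)
print(anchored)

# Without ^ the "00" at the end of a price also matches, because it is
# followed by a newline, which counts as whitespace.
unanchored = re.findall(r"(?m)\d+(?=\s*[xX-]|\s)", lines)
print(unanchored)
```

The anchored list contains one "2" per line, while the unanchored list also picks up "00" from the prices, which is why the ^ matters here.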
^(\\d+)\\s?[xX-]?.*?([$£](?:\\d{1,2})(?:,?\\d{3})*\\.?\\d{0,2})$
See it working here (extra backslashes have been added in the code above to allow it to work in Swift, whereas the link below shows the expected result in JS, Python, Go and PHP, which means there are fewer backslashes there).
This captures the number of items and the price; the item name itself is not captured.
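A minimal Python sketch of that pattern, written as a raw string with single backslashes and applied to a few of the sample lines from the question (the variable names are illustrative):

```python
import re

# Group 1: quantity at the start of the line. Group 2: price at the end.
pattern = re.compile(
    r"^(\d+)\s?[xX-]?.*?([$£](?:\d{1,2})(?:,?\d{3})*\.?\d{0,2})$"
)

samples = ["2 Burgers $4.00", "2x Burgers $4.00", "2 - Burgers $4.00"]
results = [pattern.match(line).groups() for line in samples]

for line, groups in zip(samples, results):
    print(line, "->", groups)  # each line gives ('2', '$4.00')
```

Since the item name is swallowed by the non-capturing .*? in the middle, only the quantity and the price come back.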

Match and highlight two sets of columns in VIM

This SO post describes how to highlight all characters on a line in VIM past a given line number (80, in this case).
I'd like to have two sets of highlighted characters, columns 81-100 highlighted with one background color, and columns 101+ with another background color.
Here's what I've tried so far:
" Light highlight characters past column 80. Red highlight past 100.
highlight OverLength1 ctermbg=red ctermfg=white guibg=#5b4f62
match OverLength1 /\%81v.\+/
highlight OverLength2 ctermbg=red ctermfg=white guibg=#990500
match OverLength2 /\%101v.\+/
as well as this variation on the 3rd line:
match OverLength1 /\%81v.\+($|100v)/
Neither works. The best I can get is to match 101+ alone; it seems like the second match overwrites the first match.
I don't like the colorcolumn option, I don't want to highlight empty columns, just text in the ranges specified.
Try
" Light highlight characters past column 80. Red highlight past 100.
highlight OverLength1 ctermbg=red ctermfg=white guibg=#5b4f62
match OverLength1 /\%81v.\+/
highlight OverLength2 ctermbg=red ctermfg=white guibg=#990500
2match OverLength2 /\%101v.\+/
Read more about it on :h 2match.

Complete Regex Pattern- String Exclusion, Optional End Brackets, Multiple Matches

I'm parsing a bunch of line items on an inventory list, and while each line describes something similar, the text format was not standardized. I've been working on a regex pattern for the past few days, but I'm not having much luck getting a pattern that matches all of my test scenarios. I'm hoping that someone with a lot more regex experience might be able to point out a few errors in the pattern.
Pattern To Match the palette number: \([Pp]alette [No\.\s]?#?(.*?)\),
1. Warehouse A, (Palette #91L41)
# Match Result Correct: 91L41
2. Warehouse B Palette No. 214
# Match Result Incorrect: no match
3. Warehouse Lot Storage C (Palette No. 9),
# Match Result Incorrect: o. 9 //I don't quite understand why it matches the o
4. Store Location D of Palette (Palette #1),
# Match Result Correct: 1
5. Store Location E of Palette, Empty, lot #45,
# Match Result Incorrect: no match
I've also tried to make the parentheses optional so that it will match examples 2 and 5, but then it's too greedy and includes the previously mentioned word "lot".
Anything in brackets causes the engine to look for ONE of the provided characters. Your pattern successfully matches, for example, strings like: Palette Nabcdefg
To indicate one of several options, you'll need to use parentheses. What you're actually looking for should look something like this: [Pp]alette (No\.?\s?|#)?(\d+)
Though it seems highly ineffective to not standardize the pattern. Your last case for example could be completely incompatible since it seems to be capable of containing possibly any kind of input.
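The difference between a character class and a group can be checked on example 3 from the question. This is only a Python sketch with illustrative variable names; the second pattern uses a greedy \d+ so that multi-digit numbers are captured whole.

```python
import re

line = "3. Warehouse Lot Storage C (Palette No. 9),"

# [No\.\s]? is a character class: it matches at most ONE of N, o, '.', or
# whitespace. It consumes only the "N", so the lazy group picks up "o. 9".
with_class = re.search(r"\([Pp]alette [No\.\s]?#?(.*?)\),", line)
print(with_class.group(1))  # o. 9

# (No\.?\s?|#)? is a group with alternatives: it matches the whole "No. "
# prefix (or a "#"), leaving just the number for group 2.
with_group = re.search(r"[Pp]alette (No\.?\s?|#)?(\d+)", line)
print(with_group.group(2))  # 9
```

That explains the stray "o" in the question's result: the class ate a single letter of "No." and left the rest for the capture.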
A little bit of explanation on matching your patterns with regular expressions. You really don't need to look for and match your parentheses ( .. ) in this case.
Let's say we want to just find any string with the word Palette that is followed with whitespace and the # symbol and capture the Palette sequence from it.
You could simply just use the following:
[Pp]alette\s+#([A-Z0-9]+)
This will result in capturing 91L41 and 1 from the matched patterns
1. Warehouse A, (Palette #91L41)
4. Store Location D of Palette (Palette #1)
Now say we want to find any string that has Palette, followed by whitespace and either a # symbol or No.
We can use a Non-capturing group for this. Non-capturing parentheses group the regex so you can apply regex operators, but do not capture anything.
So we could do something like:
[Pp]alette\s+(?:No[ .]+|#)([A-Z0-9]+)
Now this results in matching the following strings and capturing 91L41, 214, 9 and 1
1. Warehouse A, (Palette #91L41)
2. Warehouse B Palette No. 214
3. Warehouse Lot Storage C (Palette No. 9)
4. Store Location D of Palette (Palette #1)
And lastly, if you want to match all of the example strings, including the "lot #45" one, and capture the palette sequence:
[Pp]alette[\w, ]+(?:No[ .]+|#)([A-Z0-9]+)
See working demo and an explanation on this regular expression.
Everyone has a different way of using regular expressions; this is just one of many ways you can understand and accomplish this.
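For reference, a short Python sketch that runs the last pattern over the five sample lines from the question (the list and variable names are illustrative):

```python
import re

lines = [
    "1. Warehouse A, (Palette #91L41)",
    "2. Warehouse B Palette No. 214",
    "3. Warehouse Lot Storage C (Palette No. 9),",
    "4. Store Location D of Palette (Palette #1),",
    "5. Store Location E of Palette, Empty, lot #45,",
]

# (?:...) groups the "No." / "#" alternatives without capturing them,
# so the palette sequence is always group 1.
pattern = re.compile(r"[Pp]alette[\w, ]+(?:No[ .]+|#)([A-Z0-9]+)")

captured = [pattern.search(line).group(1) for line in lines]
print(captured)  # ['91L41', '214', '9', '1', '45']
```

In line 4 the first "Palette" is followed by " (", which [\w, ]+ cannot cross, so the engine moves on and matches at the second "Palette".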
This should work for your case:
[Pp]alette.*?(?:No\.?|#)\s*(\w+)
This will search following types of patterns:
[Pp]alette{any_characters}No.{optional_spaces}(alphanumeric)
[Pp]alette{any_characters}No{optional_spaces}(alphanumeric)
[Pp]alette{any_characters}#{optional_spaces}(alphanumeric)
Check it in action here
MATCH 1
1. [26-31] `91L41`
MATCH 2
1. [60-63] `214`
MATCH 3
1. [104-105] `9`
MATCH 4
1. [148-149] `1`
MATCH 5
1. [195-197] `45`

All characters that may be bullet points (e.g. "*") or "dash" points

This question is a simple point (pardon the pun):
What are all the characters that may, when starting a paragraph, be reasonably interpreted as indicating (in the Anglo-saxon demographic) that the paragraph was meant to be a bullet point or a "dash" point.
Here are the ones I would expect, so far:
Bullets
Asterisk: "*",
HTML entity &bull; (&#8226;): "•"
Dash
The dash: "-"
The en-dash (&ndash;): "–"
The em-dash (&mdash;): "—"
Are there others?
Thank you for reading.
Brian
In unicode there are lots. How about:
Black left pointing index: U+261A ☚
Black right pointing index: U+261B ☛
White left pointing index: U+261C ☜
White right pointing index: U+261E ☞
just for a quick example. Heck, there is a whole range dedicated to various kinds of arrows (2190–21FF), which can easily be used as bullet points. I guess you can start browsing the Unicode code charts; there are a lot of characters out there, though. I expect you'll have a hard time finding everything anybody might use.
I've seen +, >, and # used to indicate bullet points.
Dashes have a Unicode category of Pd. As of Unicode 5.2, there are 21 of these characters:
U+002D - HYPHEN-MINUS
U+058A ֊ ARMENIAN HYPHEN
U+05BE ־ HEBREW PUNCTUATION MAQAF
U+1400 ᐀ CANADIAN SYLLABICS HYPHEN
U+1806 ᠆ MONGOLIAN TODO SOFT HYPHEN
U+2010 ‐ HYPHEN
U+2011 ‑ NON-BREAKING HYPHEN
U+2012 ‒ FIGURE DASH
U+2013 – EN DASH
U+2014 — EM DASH
U+2015 ― HORIZONTAL BAR
U+2E17 ⸗ DOUBLE OBLIQUE HYPHEN
U+2E1A ⸚ HYPHEN WITH DIAERESIS
U+301C 〜 WAVE DASH
U+3030 〰 WAVY DASH
U+30A0 ゠ KATAKANA-HIRAGANA DOUBLE HYPHEN
U+FE31 ︱ PRESENTATION FORM FOR VERTICAL EM DASH
U+FE32 ︲ PRESENTATION FORM FOR VERTICAL EN DASH
U+FE58 ﹘ SMALL EM DASH
U+FE63 ﹣ SMALL HYPHEN-MINUS
U+FF0D － FULLWIDTH HYPHEN-MINUS
Bullets are a lot more complicated, as the others have mentioned.
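For the dash case, the Pd category can be queried directly instead of hard-coding the list. A minimal Python sketch, assuming the check is applied to the first non-whitespace character of a paragraph (the function name starts_with_dash is made up):

```python
import unicodedata

def starts_with_dash(paragraph):
    """Return True if the first non-whitespace character is a Unicode dash (Pd)."""
    text = paragraph.lstrip()
    return bool(text) and unicodedata.category(text[0]) == "Pd"

print(starts_with_dash("- first point"))       # True  (HYPHEN-MINUS)
print(starts_with_dash("\u2014 first point"))  # True  (EM DASH)
print(starts_with_dash("* first point"))       # False (asterisk is category Po)
```

The exact set of Pd characters depends on the Unicode version bundled with the Python build, so newer versions may recognise more dashes than the 21 listed above. Bullets have no single category like this, so they would still need an explicit list.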
Even ordinary Windows code-page characters like: º+·˙̣·۰۠۟۟•▪■□►●○▬─
can be used, especially if CSS is used to size and position them.
Also, pretty much the same significance as a bullet point, but ordered, is outline notation:
1.
2.
2.1
2.1.A
2.1.B
etc.