I have to match this pattern:
3BxFxxx131xxxx
where "x" stands for a character that does not need to be matched.
Is it possible to match a string against this pattern?
Am I forced to concatenate something like "/^.{2}B/" for every position, or is there a better solution?
Thank you
As per your comment, "x" stands for any alphanumeric char. In that case you could try:
^(?!.*_)3B\wF\w{3}131\w{4}$
See the online demo
^ - Start string anchor.
(?!.*_) - Negative lookahead to prevent underscore anywhere in the string.
3B - Match "3B" literally.
\w - A single word-character. Shorthand for [0-9A-Za-z_].
F - Match "F" literally.
\w{3} - Three word-characters.
131 - Match "131" literally.
\w{4} - Four word-characters.
$ - End string anchor.
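As a quick check, the pattern above can be tried in Python. This is only a sketch: the sample strings and the helper name matches_code are made up for illustration, and re.ASCII is passed so that \w means exactly [0-9A-Za-z_] as described above.

```python
import re

# re.ASCII restricts \w to [0-9A-Za-z_]; without it Python's \w also
# accepts non-ASCII letters and digits.
PATTERN = re.compile(r"^(?!.*_)3B\wF\w{3}131\w{4}$", re.ASCII)

def matches_code(text):
    """Return True if text fits the 3BxFxxx131xxxx layout (hypothetical helper)."""
    return PATTERN.match(text) is not None

# Made-up samples: valid, contains underscore, one char too short, wrong prefix.
samples = ["3B1FABC1319876", "3B_FABC1319876", "3B1FABC131987", "4B1FABC1319876"]
for sample in samples:
    print(sample, "->", matches_code(sample))
```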
I want to extract a number that comes before one of a list of specific characters. I want to extract volume, price and more from different websites.
For example, I want to extract the volume from here:
<td class="data">Single Malt Scotch Whisky der Marke Speyburn 10 Years 40% 0,7l Flasche</td>
or
<td class="data">Irish Whiskey der Marke Bushmills the Original 40% 1,0l Flasche</td>
I tried the following code:
re.findall("[-+]?[.]?[\d]+(?:,\d\d\d)*[\.]?\d*?(?=l|L|Liter| Liter| l| L|ml)", string)
And this is the result:
First String = ['7'] and Second String = ['0']
How do I get the complete number (0,7 and 1,0)?
For the volume I tried to convert the comma into a dot. This works fine for the volume but not for the price.
if ',' in string:
    string = string.replace(',', '.')
If possible, I want to use the regex for the price as well. The difficulty here is the different number formats.
The following formats occur:
10.00€
10,00€
1,234.56€
1.234,56€
You may use
[-+]?\.?\d+(?:[.,]\d+)*(?= ?[mM]?[lL])
See the regex demo. To match the measurement units as whole words, add \b word boundary at the end of the lookahead pattern, (?= ?(?:[mM]?[lL]|[Ll]iter)\b).
Details
[-+]? - an optional - or +
\.? - an optional .
\d+ - 1+ digits
(?:[.,]\d+)* - 0 or more occurrences of a dot or comma and then 1+ digits
(?= ?[mM]?[lL]) - a positive lookahead that matches a location that is immediately followed with
' ?' - an optional space (a literal space followed by ?; you may use \s? here to match any whitespace)
[mM]? - an optional m or M
[lL] - l or L.
Note that you do not need the Liter alternative in the lookahead if you use (?= ?[mM]?[lL]), but if you add a word boundary, you will need the Liter alternative.
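A minimal sketch of both variants against the two rows from the question. The names VOLUME, VOLUME_WORD and rows are chosen here for illustration, and the "5 Lagen" string is a made-up example of a false positive that the word boundary avoids.

```python
import re

# Plain lookahead: number followed by an optional space and l/L/ml/ML.
VOLUME = re.compile(r"[-+]?\.?\d+(?:[.,]\d+)*(?= ?[mM]?[lL])")
# Whole-word variant: the unit must end at a word boundary, so "Liter"
# has to be listed explicitly.
VOLUME_WORD = re.compile(r"[-+]?\.?\d+(?:[.,]\d+)*(?= ?(?:[mM]?[lL]|[Ll]iter)\b)")

rows = [
    '<td class="data">Single Malt Scotch Whisky der Marke Speyburn 10 Years 40% 0,7l Flasche</td>',
    '<td class="data">Irish Whiskey der Marke Bushmills the Original 40% 1,0l Flasche</td>',
]

for row in rows:
    print(VOLUME.findall(row), VOLUME_WORD.findall(row))

# Made-up example: without \b the "L" of "Lagen" is taken for a unit.
print(VOLUME.findall("5 Lagen"), VOLUME_WORD.findall("5 Lagen"))
```

The word-boundary form is the safer choice when the number may be followed by ordinary words that start with l or L.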
I need to match and replace all UPPERCASE words in a Postgres string field like
'GARLASCO Cavour/Oriani'
'SANNAZZARO DE' BURGONDI Italia, 46 (Direzione Sud)'
'S.MARGHERITA STAFFORA Vallechiara (Bivio Montemartino)'
'GAMBOLO' Umberto I, 312'
I tried with
[A-Z\''.]{2,}
SELECT REGEXP_REPLACE('SANNAZZARO DE' BURGONDI Italia, 46 (Direzione Sud)',' \b[A-Z]{2,}\b','','g')
but it works only for strings with one uppercase word, like 'GARLASCO Cavour/Oriani'
You may use
REGEXP_REPLACE(your_col_here, '^[A-Z[:space:].'']+\y', '')
This will replace the following matches:
^ - start of string
[A-Z[:space:].']+ - 1+ uppercase letters (you may also replace A-Z with [:upper:]), whitespaces, dots or apostrophes at...
\y - a word boundary.
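To see what the pattern removes without a database at hand, here is a rough Python equivalent. It is only an approximation of the Postgres expression: Python's re has no \y or [:space:], so \b and \s are used instead, and the function name strip_leading_uppercase is made up for this sketch.

```python
import re

# Approximation of '^[A-Z[:space:].'']+\y' using Python syntax:
# \s stands in for [:space:] and \b for \y.
LEADING_UPPER = re.compile(r"^[A-Z\s.']+\b")

def strip_leading_uppercase(value):
    """Remove the leading run of uppercase words (hypothetical helper)."""
    return LEADING_UPPER.sub("", value)

values = [
    "GARLASCO Cavour/Oriani",
    "SANNAZZARO DE' BURGONDI Italia, 46 (Direzione Sud)",
    "S.MARGHERITA STAFFORA Vallechiara (Bivio Montemartino)",
    "GAMBOLO' Umberto I, 312",
]
for value in values:
    print(repr(strip_leading_uppercase(value)))
```

The greedy class first swallows the capital letter of the next mixed-case word; the word boundary then forces the match to back off to the space before it.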
The example names that I am trying it on are here
O'Kefe,Shiley
Folenza,Mitchel V
Briscoe Jr.,Sanford Ray
Andade-Alarenga,Blnca
De La Cru,Feando
Carone,Letca Jo
O'Conor,Mole K
Daeron III,Lawence P
Randall,Jason L
Esquel Mendez,Mara D
Dinle III,Jams E
Coras Sr.,Cleybr E
Hsieh-Krnk,Caolyn E
Graves II,Theodore R
I am trying to capture everything before the comma except Roman numerals and the Sr./Jr. suffixes.
So if the name is like Andade-Alarenga,Blnca I want to capture Andade-Alarenga, but if the name is Briscoe Jr.,Sanford Ray I just want Briscoe.
The code I have tried is here:
^((?:(?![JjSs][rR]\.|\b(?:[IV]+))[^,]))
Also this one: ^(?!\w+ \A[jr|sr|Jr|Sr].*)\w+| \w+ \w+|'\w+|-\w+$
Regex101 with my code and example sets: https://regex101.com/r/jX5cK6/2
One option could be using a capturing group with a non greedy match up till the first occurrence of a comma and optionally before the comma match Jr Sr jr sr or a roman numeral.
Then match the comma itself. The value is in capture group 1.
Note that the character class [XVICMD]+ is a broad match that also allows combinations that are not valid Roman numerals; a stricter pattern for a Roman numeral can be found, for example, on this page.
^(\w.*?)(?: (?:[JjSs]r\.|[XVICMD]+\b))?,
^ Start of string
( Capture group 1
\w.*? Match a word char and 0+ times any char except a newline non greedy
) close group
(?: Non capturing group
(?: Match a space and start non capturing group
[JjSs]r\. Match any of the listed followed by r.
| Or
[XVICMD]+\b Match 1+ times any of the listed and a word boundary
) Close group
)? Close group and make it optional
, Match the comma
Regex demo
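A short Python sketch of the capture-group approach, run against a few of the names from the question. The names SURNAME and surname are chosen here for illustration.

```python
import re

# Group 1 lazily collects the name; the optional non-capturing group
# consumes a " Jr." / " Sr." suffix or a Roman-numeral word before the comma.
SURNAME = re.compile(r"^(\w.*?)(?: (?:[JjSs]r\.|[XVICMD]+\b))?,")

def surname(full_name):
    """Return capture group 1, or None when there is no match (hypothetical helper)."""
    match = SURNAME.match(full_name)
    return match.group(1) if match else None

for name in ["O'Kefe,Shiley", "Briscoe Jr.,Sanford Ray", "De La Cru,Feando",
             "Daeron III,Lawence P", "Esquel Mendez,Mara D"]:
    print(name, "->", surname(name))
```

"De La Cru" and "Esquel Mendez" stay intact because the C and M there are not followed by a word boundary, so they are not taken for Roman numerals.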
Because of your test on Regex101, I'm assuming your regex engine supports positive lookaheads (This is true for PCRE, Javascript or Python, for example)
A positive lookahead will enable you to match only what you want, without the need for capturing groups. The full match will be the string you're looking for.
^[\w'\- ]+?(?= ?(?:\b(?:[IVXCMD]*|\w+\.)),)
The part that matches the name is as simple as it gets:
^[\w'\- ]+?
All it does is match any of the characters on the list. The final ? makes the quantifier lazy: this way, the engine matches only as few characters as it needs to.
The important part is this one:
(?= ?(?:\b(?:[IVXCMD]*|\w+\.)),)
It starts with an optional space and a word boundary, and is then divided in two parts by the pipe (the | character). The first part matches Roman numerals (or nothing), and the second part matches titles (basically, anything that ends in a .). Finally, we need to match the comma, because of your requirement.
Here it is on Regex101
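For comparison, a small Python sketch of the lookahead version, where the full match (group 0) is already the wanted name. LOOKAHEAD and surname_lookahead are names made up for this example.

```python
import re

# Lazy name part, then a lookahead that must see an optional suffix
# (Roman numeral or anything ending in ".") followed by the comma.
LOOKAHEAD = re.compile(r"^[\w'\- ]+?(?= ?(?:\b(?:[IVXCMD]*|\w+\.)),)")

def surname_lookahead(full_name):
    """Return the full match, or None when nothing matches (hypothetical helper)."""
    match = LOOKAHEAD.match(full_name)
    return match.group(0) if match else None

for name in ["O'Conor,Mole K", "Coras Sr.,Cleybr E", "Hsieh-Krnk,Caolyn E",
             "Graves II,Theodore R", "Esquel Mendez,Mara D"]:
    print(name, "->", surname_lookahead(name))
```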
You didn't specify a language so I used a regex in the replaceAll() String method of Java.
String[] names = {
    "O'Kefe,Shiley", "Folenza,Mitchel V", "Briscoe Jr.,Sanford Ray",
    "Andade-Alarenga,Blnca", "De La Cru,Feando", "Carone,Letca Jo",
    "O'Conor,Mole K", "Daeron III,Lawence P", "Randall,Jason L",
    "Esquel Mendez,Mara D", "Dinle III,Jams E", "Coras Sr.,Cleybr E",
    "Hsieh-Krnk,Caolyn E", "Graves II,Theodore R"
};
for (String name : names) {
    // Remove an optional " I".." III", " Sr." or " Jr." suffix (including the
    // space before it), then the comma and everything after it.
    System.out.println(name + " -> "
            + name.replaceAll("(?: (?:I{1,3}|(?:Sr|Jr)\\.))?,.*", ""));
}
Here is a python solution using re.sub
import re
names = ["O'Kefe,Shiley", "Folenza,Mitchel V", "Briscoe Jr.,Sanford Ray",
         "Andade-Alarenga,Blnca", "De La Cru,Feando", "Carone,Letca Jo",
         "O'Conor,Mole K", "Daeron III,Lawence P", "Randall,Jason L",
         "Esquel Mendez,Mara D", "Dinle III,Jams E", "Coras Sr.,Cleybr E",
         "Hsieh-Krnk,Caolyn E", "Graves II,Theodore R"]
for name in names:
    # Remove an optional " I".." III", " Sr." or " Jr." suffix (including the
    # space before it), then the comma and everything after it.
    print(name, "->", re.sub(r"(?: (?:I{1,3}|(?:Sr|Jr)\.))?,.*", "", name))
You may use
^(?:(?![JS]r\.|\b(?:[XVICMD]+)\b)[^,])+\b(?<!\s)
See the regex demo
Details
^ - start of a string
(?:(?![JS]r\.|\b(?:[XVICMD]+)\b)[^,])+ - any char but , ([^,]), one or more occurrences (+), that does not start a Jr. or Sr. char sequence or a whole word consisting of 1 or more X, V, I, C, M,D chars
\b - a word boundary
(?<!\s) - no whitespace immediately to the left is allowed (it is trimming the match)
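A brief Python sketch of this pattern on some of the sample names. TEMPERED and surname_tempered are illustrative names. As written, the pattern only recognizes capitalized Jr. and Sr.

```python
import re

# Each consumed character must be a non-comma that does not start
# "Jr." / "Sr." or a whole word made of Roman-numeral letters.
TEMPERED = re.compile(r"^(?:(?![JS]r\.|\b(?:[XVICMD]+)\b)[^,])+\b(?<!\s)")

def surname_tempered(full_name):
    """Return the full match, or None when nothing matches (hypothetical helper)."""
    match = TEMPERED.match(full_name)
    return match.group(0) if match else None

for name in ["Briscoe Jr.,Sanford Ray", "Daeron III,Lawence P",
             "De La Cru,Feando", "O'Kefe,Shiley", "Dinle III,Jams E"]:
    print(name, "->", surname_tempered(name))
```

The trailing \b(?<!\s) is what drops the space left in front of a suffix such as " Jr." or " III".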
I am looking for a regex to find the first one or two capitalized words in a string. If the first two words are capitalized, I want both of them. A hyphen should be considered part of a word.
For "Madonna has a new album" I'm looking for "Madonna"
For "Paul Young has no new album" I'm looking for "Paul Young"
For "Emmerson Lake-palmer is not here" I'm looking for "Emmerson Lake-palmer"
I have been using ^[A-Z]+.*?\b( [A-Z]+.*?\b){0,1} which does great on the first two, but for the 3rd example I get Emmerson Lake, instead of Emmerson Lake-palmer.
What REGEX can I use to find the first one or two capitalized words in the above examples?
You may use
^[A-Z][-a-zA-Z]*(?:\s+[A-Z][-a-zA-Z]*)?
See the regex demo
Basically, use a character class [-a-zA-Z]* instead of a dot matching pattern to only match letters and a hyphen.
Details
^ - start of string
[A-Z] - an uppercase ASCII letter
[-a-zA-Z]* - zero or more ASCII letters / hyphens
(?:\s+[A-Z][-a-zA-Z]*)? - an optional (1 or 0 due to ? quantifier) sequence of:
\s+ - 1+ whitespace
[A-Z] - an uppercase ASCII letter
[-a-zA-Z]* - zero or more ASCII letters / hyphens
A Unicode aware equivalent (for the regex flavors supporting Unicode property classes):
^\p{Lu}[-\p{L}]*(?:\s+\p{Lu}[-\p{L}]*)?
where \p{L} matches any letter and \p{Lu} matches any uppercase letter.
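A small Python sketch of the ASCII pattern on the three sentences from the question. The names CAPITALIZED and leading_name are made up here. Python's built-in re module does not support \p{Lu} or \p{L}, so only the ASCII version is shown; the Unicode variant would need an engine with property classes.

```python
import re

# One capitalized word, optionally followed by whitespace and a second one;
# hyphens count as part of a word.
CAPITALIZED = re.compile(r"^[A-Z][-a-zA-Z]*(?:\s+[A-Z][-a-zA-Z]*)?")

def leading_name(sentence):
    """Return the first one or two capitalized words, or None (hypothetical helper)."""
    match = CAPITALIZED.match(sentence)
    return match.group(0) if match else None

for sentence in ["Madonna has a new album",
                 "Paul Young has no new album",
                 "Emmerson Lake-palmer is not here"]:
    print(leading_name(sentence))
```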
This is probably simpler:
^([A-Z][-A-Za-z]+)(\s[A-Z][-A-Za-z]+)?
Replace + with * if you expect single-letter words.
If you need a full name only (exactly two words, each starting with a capital letter), here is a simple example:
^([A-Z][a-z]*)(\s)([A-Z][a-z]+)$
Try it. Enjoy!