I have the following snippet of text:
#article{carr2006,
title={Techniques for qualitative and quantitative measurement of aspects of laser-induced damage important for laser beam propagation},
author={Carr, CW and Feit, MD and Nostrand, MC and Adams, JJ},
journal={Meas. Sci. Technol.},
volume={17},
number={7},
pages={1958},
year={2006},
publisher={IOP Publishing}
}
#article{NIF1998,
author = {Schwartz, Sheldon and Feit, Michael D. and Kozlowski, Mark R. and Mouser, Ron P.},
title = {Current 3-ω large optic test procedures and data analysis for the quality assurance of National Ignition Facility optics},
journal = {Proc. SPIE},
volume = {3578},
number = {},
pages = {314-321},
year = {1999},
}
And I've been trying to extract the article by it's tag, however I fail to understand how the greedy/non-greedy works, or rather how to capture everything in the brackets when it contains more brackets :/
The following regexp returns a result up until first brackets, which is not what I'm aiming for...
/\{(carr2006[^}]+)\}?/s
Also was trying to capture full text with #article in front, but that doesn't work either...
/#*\{(carr2006[^}]+)\}?/s
Any explanations on what I'm doing wrong would be helpful :)
You may change your regex like below.
#\w+\{1st_standard(?:,\s*\w+\s*=\s*(?:{[^}]*}|"[^"]*"))+,?\s*\}
DEMO
\s* should match any type of whitespace character so this would match also the line breaks.
Related
I'm a regex noob that's trying to use the regexp_extract() function in data studio to extract part of a string. Could you help me out?
I need to extract the part of the string that comes after 'May'. Everything before 'May' is exactly the same across all campaigns.
I've tried googling the solution and killed a lot of time on regexer.com but i can't figure it out
Current Campaign Name:
Xxxxx_xxxxx_PKN_Trueview_24th MayComedy Movie Fans18-24
Xxxxx_xxxxx_PKN_Trueview_24th MaySouth Asian Film Fans18-24
Xxxxx_xxxxx_PKN_Trueview_24th MayCricket Enthusiasts18-24
Xxxxx_xxxxx_PKN_Trueview_24th MayMotorcycle Enthusiasts18-24
Expected Campaign Names:
Comedy Movie Fans18-24
South Asian Film Fans18-24
Cricket Enthusiasts18-24
Motorcycle Enthusiasts18-24
EDIT: I'm trying to use this in data studio in the REGEXP_EXTRACT(Campaign,"regex_code_here") function. I think the acceptable syntax is re2.
You may actually use REGEXP_REPLACE here to remove all before and including May:
REGEXP_REPLACE(Campaign, '.*May', '')
See the regex demo:
The regex you need is this:
(?<=May).*$
Test it here.
You can use replace
^.*?May - Match everything up-to first occurrence of May
"$`" - replace with portion that follows substring Ref
let arr = ["Xxxxx_xxxxx_PKN_Trueview_24th MayComedy Movie Fans18-24","Xxxxx_xxxxx_PKN_Trueview_24th MaySouth Asian Film Fans18-24","Xxxxx_xxxxx_PKN_Trueview_24th MayCricket Enthusiasts18-24","Xxxxx_xxxxx_PKN_Trueview_24th MayMotorcycle Enthusiasts18-24"]
let op = arr.map(str=> str.replace(/^.*?May/g, "$`"))
console.log(op)
noob here. I have strings where I want to keep some emoji and to discard the rest.
INPUT:
This book is so funny❤️. This book 📖 is the bomb(AS IN THE BEST
IN THE WORLD 🌎 🌍🗺 )I love 💗 💕 💖 it!I definitely recommend it!'
DESIRED OUTPUT:
This book is so funny❤️. This book is the bomb(AS IN THE BEST
IN THE WORLD )I love 💗 💕 💖 it!I definitely recommend it!'
I have the re.compile that matches:
my emoji
all emoji Removing Emoticons from..... See David Mabodo answer
I don't know how to put it together in re.compile that excludes one from the other. Alternatively keep alphanumeric, punctuation, and my emoji, and substitute the rest to "".
mytext = This book is so funny❤️. This book 📖 is the bomb(AS IN THE BEST
IN THE WORLD 🌎 🌍🗺 )I love 💗 💕 💖 it!I definitely recommend it!'
# Desired out put:
# u'This book is so funny❤️. This book is the bomb(AS IN THE BEST
IN THE WORLD )I love 💗 💕 💖 it!I definitely recommend it!'
print ("Original text:")
print (mytext, "\n")
# Strip out emoticon modifiers, leaving a simplified emoticon to work with.
# https://en.wikipedia.org/wiki/Variation_Selectors_(Unicode_block)
# https://en.wikipedia.org/wiki/Variation_Selectors_Supplement
Emoji_Modifiers = re.compile(u'([\U0000FE00-\U0000FE0F])|([\U000E0100-\U000E0100])')
mytext_mod_gone = Emoji_Modifiers.sub(r'', mytext)
print ("Modifiers Removed:")
print (mytext_mod_gone, "\n")
# All emoticons
find_regex = re.compile(u'([\U00002600-\U000027BF])|([\U0001f300-\U0001f64F])|([\U0001f680-\U0001f6FF])')
# Heart emoticons
#find_regex = re.compile(u"([\U00002619])|([\U00002661])|([\U00002665])|([\U00002763])|([\U00002764])|([\U00002765])|([\U00002766])|([\U00002767])|([\U00002E96])|([\U00002E97])|([\U00002F3C])|([\U0001F394])|([\U0001F48C])|([\U0001F48F])|([\U0001F491])|([\U0001F493])|([\U0001F494])|([\U0001F495])|([\U0001F496])|([\U0001F497])|([\U0001F498])|([\U0001F499])|([\U0001F49A])|([\U0001F49B])|([\U0001F49C])|([\U0001F49D])|([\U0001F49E])|([\U0001F49F])|([\U0001F4D6])|([\U0001F5A4])|([\U0001F60D])|([\U0001F618])|([\U0001F63B])|([\U0001F970])|([\U0001F9E1])")
# Alphanumeric + punctuation for an alternative solution
#find_regex = re.compile(r"[^a-zA-Z0-9!,.?!#&'()*+,-./:;<=>?#\^_`{|}~\s]") #
mytext_emoji_gone = find_regex.sub(r'', mytext)
I am falling down at:
Negating unicode with a Negative Lookbehind (?<!...). I don't understand the operands well enough, and regex101.com only works with r', not u'.
Combining multiple regex together in a re.compile. Say if I wanted to keep alphanumeric and my emoji, it complains when I do re.compile(u'(\Uxxxx)' | r'(regex)' ). unsupported operand type(s) for |: 'str' and 'str', so a OR type statement does not work here...and an OR gives undesired results.
Could I have some help with either:
Ignoring a subset of emoticons and deleting the rest (my preferred solution)
Keeping (alphanumeric, punctuation, and my emoticons), and deleting the rest.
A specific question: Can you 'stack' re.compiles? IE create 2 different re.compiles to match (or not match) things, then join them together.
regex101 has a Unicode option, it is a flag you can turn on from the right side of the regex box.
I think the easiest thing to do is to find all the emojis in the string except for the ones you want to keep and replace them with an empty string like you wanted to do. To do that you can use a regex that will find any emoji (for this example I'll use [\U00010000-\U0010ffff] but I'm sure there are better ones out there so use one of those) and add a negative look ahead to ignore the emoji you wish to keep.
The finale regex should look similar to this:
(?![\u2764])[\U00010000-\U0010ffff]
The first part (?![\u2764]) will make sure the match is not an emoji you wish to keep and the second part [\U00010000-\U0010ffff] will make sure it's an emoji
You can add all the other emojis you wish to keep in the square brackets (?![\u2764
here ])
I went with:
find_regex = re.compile(u"(?![\U00002619])(?![\U00002661])(?![\U00002665])(?![\U00002763])(?![\U00002764])(?![\U00002765])(?![\U00002766])(?![\U00002767])(?![\U00002E96])(?![\U00002E97])(?![\U00002F3C])(?![\U0001F394])(?![\U0001F48C])(?![\U0001F48F])(?![\U0001F491])(?![\U0001F493])(?![\U0001F494])(?![\U0001F495])(?![\U0001F496])(?![\U0001F497])(?![\U0001F498])(?![\U0001F499])(?![\U0001F49A])(?![\U0001F49B])(?![\U0001F49C])(?![\U0001F49D])(?![\U0001F49E])(?![\U0001F49F])(?![\U0001F4D6])(?![\U0001F5A4])(?![\U0001F60D])(?![\U0001F618])(?![\U0001F63B])(?![\U0001F970])(?![\U0001F9E1])"r"[^a-zA-Z0-9!,.?!#&'()*+,-./:;<=>?#\^_`{|}~\s]")
mytext_emoji_gone = find_regex.sub(r'', mytext)
which stripped out all other emoji, leaving only the heart and book emojis, and alphanumeric and punctuation.
As part of my original question, is there a way to stack those? Currently, that is one huge long line of code. Could we do something like:
regex = re.compile(a)
regex += re.compile(b)
That would use vertial real estate but I am ok with that
I need to check an input field for a German IBAN. The user should be allowed to leave in white spaces and input should be validated to have a starting DE and then exact 20 characters numbers and letters.
Without the white space allowance, I tried
^[DE]{2}([0-9a-zA-Z]{20})$
but I cannot find where and how I can add "white spaces anywhere allowed.
This should be simple, but I simply cannot find a solution.
Thanks for help!
Because you should use the right tool for the right task: you should not rely on regexps to validate IBAN numbers, but instead use the IBAN checksum algorithm to check the whole code is actually correct, making any regexp superfluous and redundant. i.e.: remove all spaces, rearrange the code, convert to integers, and compute remainder, here it's best explained.
Though, there am I trying to answer your question, for the fun of it:
what about:
^DE([0-9a-zA-Z]\s?){20}$
which only difference is allowing a whitespace (or not) after each occurence of a alphanumeric character.
here is the visualization:
edit: for the OP's information, the only difference is that this regexp, from #ulugbex-umirov: (?:\s*[0-9a-zA-Z]\s*) does a lookahead check to see if there's a space between the iso country code and the checksum (which only made of numerical digits), which I do not support on purpose.
And actually to support a correct IBAN syntax, which is formed of groups of 4 characters, as the wikipedia page says:
^DE\d{2}\s?([0-9a-zA-Z]{4}\s?){4}[0-9a-zA-Z]{2}$
example
If your UI is in Javascript, you can use that library for doing IBAN validation:
<script src="iban.js"></script>
<script>
// the API is now accessible from the window.IBAN global object
IBAN.isValid('hello world'); // false
IBAN.isValid('BE68539007547034'); // true
</script>
so you know this is a valid IBAN, and can validate it before the data is ever even sent to the backend. Simpler, lighter and more elegant… Why do something else?
Here is a list of IBANs from 70 Countries. I generated it with a python script i wrote based on this https://en.wikipedia.org/wiki/International_Bank_Account_Number
AL[a-zA-Z0-9]{2}\s?([0-9]{4}\s?){2}([a-zA-Z0-9]{4}\s?){4}\s?
AD[a-zA-Z0-9]{2}\s?([0-9]{4}\s?){2}([a-zA-Z0-9]{4}\s?){3}\s?
AT[a-zA-Z0-9]{2}\s?([0-9]{4}\s?){4}\s?
AZ[a-zA-Z0-9]{2}\s?([a-zA-Z0-9]{4}\s?){1}([0-9]{4}\s?){5}\s?
BH[a-zA-Z0-9]{2}\s?([a-zA-Z]{4}\s?){1}([a-zA-Z0-9]{4}\s?){3}([a-zA-Z0-9]{2})\s?
BY[a-zA-Z0-9]{2}\s?([a-zA-Z0-9]{4}\s?){1}([0-9]{4}\s?){5}\s?
BE[a-zA-Z0-9]{2}\s?([0-9]{4}\s?){3}\s?
BA[a-zA-Z0-9]{2}\s?([0-9]{4}\s?){4}\s?
BR[a-zA-Z0-9]{2}\s?([0-9]{4}\s?){5}([0-9]{3})([a-zA-Z]{1}\s?)([a-zA-Z0-9]{1})\s?
BG[a-zA-Z0-9]{2}\s?([a-zA-Z]{4}\s?){1}([0-9]{4}\s?){1}([0-9]{2})([a-zA-Z0-9]{2}\s?)([a-zA-Z0-9]{4}\s?){1}([a-zA-Z0-9]{2})\s?
CR[a-zA-Z0-9]{2}\s?([0-9]{4}\s?){4}([0-9]{2})\s?
HR[a-zA-Z0-9]{2}\s?([0-9]{4}\s?){4}([0-9]{1})\s?
CY[a-zA-Z0-9]{2}\s?([0-9]{4}\s?){2}([a-zA-Z0-9]{4}\s?){4}\s?
CZ[a-zA-Z0-9]{2}\s?([0-9]{4}\s?){5}\s?
DK[a-zA-Z0-9]{2}\s?([0-9]{4}\s?){3}([0-9]{2})\s?
DO[a-zA-Z0-9]{2}\s?([a-zA-Z]{4}\s?){1}([0-9]{4}\s?){5}\s?
TL[a-zA-Z0-9]{2}\s?([0-9]{4}\s?){4}([0-9]{3})\s?
EE[a-zA-Z0-9]{2}\s?([0-9]{4}\s?){4}\s?
FO[a-zA-Z0-9]{2}\s?([0-9]{4}\s?){3}([0-9]{2})\s?
FI[a-zA-Z0-9]{2}\s?([0-9]{4}\s?){3}([0-9]{2})\s?
FR[a-zA-Z0-9]{2}\s?([0-9]{4}\s?){2}([0-9]{2})([a-zA-Z0-9]{2}\s?)([a-zA-Z0-9]{4}\s?){2}([a-zA-Z0-9]{1})([0-9]{2})\s?
GE[a-zA-Z0-9]{2}\s?([a-zA-Z0-9]{2})([0-9]{2}\s?)([0-9]{4}\s?){3}([0-9]{2})\s?
DE[a-zA-Z0-9]{2}\s?([0-9]{4}\s?){4}([0-9]{2})\s?
GI[a-zA-Z0-9]{2}\s?([a-zA-Z]{4}\s?){1}([a-zA-Z0-9]{4}\s?){3}([a-zA-Z0-9]{3})\s?
GR[a-zA-Z0-9]{2}\s?([0-9]{4}\s?){1}([0-9]{3})([a-zA-Z0-9]{1}\s?)([a-zA-Z0-9]{4}\s?){3}([a-zA-Z0-9]{3})\s?
GL[a-zA-Z0-9]{2}\s?([0-9]{4}\s?){3}([0-9]{2})\s?
GT[a-zA-Z0-9]{2}\s?([a-zA-Z0-9]{4}\s?){1}([a-zA-Z0-9]{4}\s?){5}\s?
HU[a-zA-Z0-9]{2}\s?([0-9]{4}\s?){6}\s?
IS[a-zA-Z0-9]{2}\s?([0-9]{4}\s?){5}([0-9]{2})\s?
IE[a-zA-Z0-9]{2}\s?([a-zA-Z0-9]{4}\s?){1}([0-9]{4}\s?){3}([0-9]{2})\s?
IL[a-zA-Z0-9]{2}\s?([0-9]{4}\s?){4}([0-9]{3})\s?
IT[a-zA-Z0-9]{2}\s?([a-zA-Z]{1})([0-9]{3}\s?)([0-9]{4}\s?){1}([0-9]{3})([a-zA-Z0-9]{1}\s?)([a-zA-Z0-9]{4}\s?){2}([a-zA-Z0-9]{3})\s?
JO[a-zA-Z0-9]{2}\s?([a-zA-Z]{4}\s?){1}([0-9]{4}\s?){5}([0-9]{2})\s?
KZ[a-zA-Z0-9]{2}\s?([0-9]{4}\s?){3}([0-9]{1})([a-zA-Z0-9]{3}\s?)([a-zA-Z0-9]{4}\s?){2}([a-zA-Z0-9]{2})\s?
XK[a-zA-Z0-9]{2}\s?([0-9]{4}\s?){1}([0-9]{4}\s?){2}([0-9]{2})([0-9]{2}\s?)\s?
KW[a-zA-Z0-9]{2}\s?([a-zA-Z]{4}\s?){1}([a-zA-Z0-9]{4}\s?){5}([a-zA-Z0-9]{2})\s?
LV[a-zA-Z0-9]{2}\s?([a-zA-Z]{4}\s?){1}([a-zA-Z0-9]{4}\s?){3}([a-zA-Z0-9]{1})\s?
LB[a-zA-Z0-9]{2}\s?([0-9]{4}\s?){1}([a-zA-Z0-9]{4}\s?){5}\s?
LI[a-zA-Z0-9]{2}\s?([0-9]{4}\s?){1}([0-9]{1})([a-zA-Z0-9]{3}\s?)([a-zA-Z0-9]{4}\s?){2}([a-zA-Z0-9]{1})\s?
LT[a-zA-Z0-9]{2}\s?([0-9]{4}\s?){4}\s?
LU[a-zA-Z0-9]{2}\s?([0-9]{3})([a-zA-Z0-9]{1}\s?)([a-zA-Z0-9]{4}\s?){3}\s?
MK[a-zA-Z0-9]{2}\s?([0-9]{3})([a-zA-Z0-9]{1}\s?)([a-zA-Z0-9]{4}\s?){2}([a-zA-Z0-9]{1})([0-9]{2})\s?
MT[a-zA-Z0-9]{2}\s?([a-zA-Z]{4}\s?){1}([0-9]{4}\s?){1}([0-9]{1})([a-zA-Z0-9]{3}\s?)([a-zA-Z0-9]{4}\s?){3}([a-zA-Z0-9]{3})\s?
MR[a-zA-Z0-9]{2}\s?([0-9]{4}\s?){5}([0-9]{3})\s?
MU[a-zA-Z0-9]{2}\s?([a-zA-Z]{4}\s?){1}([0-9]{4}\s?){4}([0-9]{3})([a-zA-Z]{1}\s?)([a-zA-Z]{2})\s?
MC[a-zA-Z0-9]{2}\s?([0-9]{4}\s?){2}([0-9]{2})([a-zA-Z0-9]{2}\s?)([a-zA-Z0-9]{4}\s?){2}([a-zA-Z0-9]{1})([0-9]{2})\s?
MD[a-zA-Z0-9]{2}\s?([a-zA-Z0-9]{2})([a-zA-Z0-9]{2}\s?)([a-zA-Z0-9]{4}\s?){4}\s?
ME[a-zA-Z0-9]{2}\s?([0-9]{4}\s?){4}([0-9]{2})\s?
NL[a-zA-Z0-9]{2}\s?([a-zA-Z]{4}\s?){1}([0-9]{4}\s?){2}([0-9]{2})\s?
NO[a-zA-Z0-9]{2}\s?([0-9]{4}\s?){2}([0-9]{3})\s?
PK[a-zA-Z0-9]{2}\s?([a-zA-Z0-9]{4}\s?){1}([0-9]{4}\s?){4}\s?
PS[a-zA-Z0-9]{2}\s?([a-zA-Z0-9]{4}\s?){1}([0-9]{4}\s?){5}([0-9]{1})\s?
PL[a-zA-Z0-9]{2}\s?([0-9]{4}\s?){6}\s?
PT[a-zA-Z0-9]{2}\s?([0-9]{4}\s?){5}([0-9]{1})\s?
QA[a-zA-Z0-9]{2}\s?([a-zA-Z]{4}\s?){1}([a-zA-Z0-9]{4}\s?){5}([a-zA-Z0-9]{1})\s?
RO[a-zA-Z0-9]{2}\s?([a-zA-Z]{4}\s?){1}([a-zA-Z0-9]{4}\s?){4}\s?
SM[a-zA-Z0-9]{2}\s?([a-zA-Z]{1})([0-9]{3}\s?)([0-9]{4}\s?){1}([0-9]{3})([a-zA-Z0-9]{1}\s?)([a-zA-Z0-9]{4}\s?){2}([a-zA-Z0-9]{3})\s?
SA[a-zA-Z0-9]{2}\s?([0-9]{2})([a-zA-Z0-9]{2}\s?)([a-zA-Z0-9]{4}\s?){4}\s?
RS[a-zA-Z0-9]{2}\s?([0-9]{4}\s?){4}([0-9]{2})\s?
SK[a-zA-Z0-9]{2}\s?([0-9]{4}\s?){5}\s?
SI[a-zA-Z0-9]{2}\s?([0-9]{4}\s?){3}([0-9]{3})\s?
ES[a-zA-Z0-9]{2}\s?([0-9]{4}\s?){5}\s?
SE[a-zA-Z0-9]{2}\s?([0-9]{4}\s?){5}\s?
CH[a-zA-Z0-9]{2}\s?([0-9]{4}\s?){1}([0-9]{1})([a-zA-Z0-9]{3}\s?)([a-zA-Z0-9]{4}\s?){2}([a-zA-Z0-9]{1})\s?
TN[a-zA-Z0-9]{2}\s?([0-9]{4}\s?){5}\s?
TR[a-zA-Z0-9]{2}\s?([0-9]{4}\s?){1}([0-9]{1})([a-zA-Z0-9]{3}\s?)([a-zA-Z0-9]{4}\s?){3}([a-zA-Z0-9]{2})\s?
AE[a-zA-Z0-9]{2}\s?([0-9]{3})([0-9]{1}\s?)([0-9]{4}\s?){3}([0-9]{3})\s?
GB[a-zA-Z0-9]{2}\s?([a-zA-Z]{4}\s?){1}([0-9]{4}\s?){3}([0-9]{2})\s?
VA[a-zA-Z0-9]{2}\s?([0-9]{3})([0-9]{1}\s?)([0-9]{4}\s?){3}([0-9]{2})\s?
VG[a-zA-Z0-9]{2}\s?([a-zA-Z0-9]{4}\s?){1}([0-9]{4}\s?){4}\s?
Original:
^[DE]{2}([0-9a-zA-Z]{20})$
Debuggex Demo
Modified:
^DE(?:\s*[0-9a-zA-Z]\s*){20}$
Debuggex Demo
This is the correct regex to match DE IBAN account numbers:
DE\d{2}[ ]\d{4}[ ]\d{4}[ ]\d{4}[ ]\d{4}[ ]\d{2}|DE\d{20}
Pass: DE89 3704 0044 0532 0130 00|||DE89370400440532013000
Fail: DE89-3704-0044-0532-0130-00
Most simple solution I can think of:
^DE(\s*[[:alnum:]]){20}\s*$
In particular, your initial [DE]{2} is wrong, as it allows 'DD', 'EE', 'ED' as well as the intended 'DE'.
To allow any amount of spaces anywhere:
^ *D *E( *[A-Za-z0-9]){20} *$
As you want to allow lower letters, also DE might be lower?
^ *[Dd] *[Ee]( *[A-Za-z0-9]){20} *$
^ matches the start of the string
$ end anchor
in between each characters there are optional spaces *
[character class] defines a set/range of characters
To allow at most one space in between each characters, replace the quantifier * (any amount of) with ? (0 or 1). If supported, \s shorthand can be used to match [ \t\r\n\f] instead of space only.
Test on regex101.com, also see the SO regex FAQ
Using Google Apps Script, I pasted Laurent's code from github into a script and added the following code to test.
// Use the Apps Script IDE's "Run" menu to execute this code.
// Then look at the View > Logs menu to see execution results.
function myFunction() {
//https://github.com/arhs/iban.js/blob/master/README.md
// var IBAN = require('iban');
var t1 = IBAN.isValid('hello world'); // false
var t2 = IBAN.isValid('BE68539007547034'); // true
var t3 = IBAN.isValid('BE68 5390 0754 7034'); // true
Logger.log("Test 1 = %s", t1);
Logger.log("Test 2 = %s", t2);
Logger.log("Test 3 = %s", t3);
}
The only thing needed to run the example code was commenting out the require('iban') line:
// var IBAN = require('iban');
Finally, instead of using client handlers to attempt a RegEx validation of IBAN input, I use a a server handler to do the validation.
At one point in my app, I need to match some strings against a pattern. Let's say that some of the sample strings look as follows:
Hi there, John.
What a lovely day today!
Lovely sunset today, John, isn't it?
Will you be meeting Linda today, John?
Most (not all) of these strings are from pre-defined patterns as follows:
"Hi there, %s."
"What a lovely day today!"
"Lovely sunset today, %s, isn't it?"
"Will you be meeting %s today, %s?"
This library of patterns is ever-expanding (currently at around 1,500), but is manually maintained. The input strings though (the first group) is largely unpredictable. Though most of them will match one of the patterns, some of them will not.
So, here's my question: Given a string (from the first group) as input, I need to know which of the patterns (known second group) it matched. If nothing matched, it needs to tell me that.
I'm guessing the solution involves building a regex out of the patterns, and iteratively checking which one matched. However, I'm unsure what the code to build those regexes looks like.
Note: The strings I've given here are for illustration purposes. In reality, the strings aren't human generated, but are computer-generated human-friendly strings as shown above from systems I don't control. Since they aren't manually typed in, we don't need to worry about things like typos and other human errors. Just need to find which pattern it matches.
Note 2: I could modify the patterns library to be some other format, if that makes it easier to construct the regexes. The current structure, with the printf style %s, isn't set in stone.
I am looking at this as a parsing problem. The idea is that the parser function takes a string and determines if it is valid or not.
The string is valid if you can find it among the given patterns. That means you need an index of all the patterns. The index must be a full text index. Also it must match according to the word position. eg. it should short circuit if the first word of the input is not found among the first word of the patterns. It should take care of the any match ie %s in the pattern.
One solution is to put the patterns in an in memory database (eg. redis) and do a full text index on it. (this will not match according to word position) but you should be able to narrow down to the correct pattern by splitting the input into words and searching. The searches will be very fast because you have a small in memory database. Also note that you are looking for the closest match. One or more words will not match. The highest number of matches is the pattern you want.
An even better solution is to generate your own index in a dictionary format. Here is an example index for the four patterns you gave as a JavaScript object.
{
"Hi": { "there": {"%s": null}},
"What: {"a": {"lovely": {"day": {"today": null}}}},
"Lovely": {"sunset": {"today": {"%s": {"isnt": {"it": null}}}}},
"Will": {"you": {"be": {"meeting": {"%s": {"today": {"%s": null}}}}}}
}
This index is recursive descending according to the word postion. So search for the first word, if found search for the next within the object returned by the first and so on. Same words at a given level will have only one key. You should also match the any case. This should be blinding fast in memory.
My first thought would be to have the regexp engine take all the trouble of handling this. They're usually optimised to handle large amounts of text so it shouldn't be that much of a performance hassle. It's brute force but the performance seems to be okay. And you could split the input into pieces and have multiple processes handle them. Here's my moderately tested solution (in Python).
import random
import string
import re
def create_random_sentence():
nwords = random.randint(4, 10)
sentence = []
for i in range(nwords):
sentence.append("".join(random.choice(string.lowercase) for x in range(random.randint(3,10))))
ret = " ".join(sentence)
print ret
return ret
patterns = [ r"Hi there, [a-zA-Z]+.",
r"What a lovely day today!",
r"Lovely sunset today, [a-zA-Z]+, isn't it?",
r"Will you be meeting [a-zA-Z]+ today, [a-zA-Z]+\?"]
for i in range(95):
patterns.append(create_random_sentence())
monster_pattern = "|".join("(%s)"%x for x in patterns)
print monster_pattern
print "--------------"
monster_regexp = re.compile(monster_pattern)
inputs = ["Hi there, John.",
"What a lovely day today!",
"Lovely sunset today, John, isn't it?",
"Will you be meeting Linda today, John?",
"Goobledigoock"]*2000
for i in inputs:
ret = monster_regexp.search(i)
if ret:
print ".",
else:
print "x",
I've created a hundred patterns. This is the maximum limit of the python regexp library. 4 of them are your actual examples and the rest are random sentences just to stress performance a little.
Then I combined them into a single regexp with 100 groups. (group1)|(group2)|(group3)|.... I'm guessing you'll have to sanitise the inputs for things that can have meanings in regular expressions (like ? etc.). That's the monster_regexp.
Testing one string against this tests it against 100 patterns in a single shot. There are methods that fetch out the exact group which was matched. I test 10000 strings 80% of which should match and 10% which will not. It short cirtcuits so if there's a success, it will be comparatively quick. Failures will have to run through the whole regexp so it will be slower. You can order things based on the frequency of input to get some more performance out of it.
I ran this on my machine and this is my timing.
python /tmp/scratch.py 0.13s user 0.00s system 97% cpu 0.136 total
which is not too bad.
However, to run a pattern against such a large regexp and fail will take longer so I changed the inputs to have lots of randomly generated strings that won't match and then tried. 10000 strings none of which match the monster_regexp and I got this.
python /tmp/scratch.py 3.76s user 0.01s system 99% cpu 3.779 total
Similar to Noufal's solution, but returns the matched pattern or None.
import re
patterns = [
"Hi there, %s.",
"What a lovely day today!",
"Lovely sunset today, %s, isn't it",
"Will you be meeting %s today, %s?"
]
def make_re_pattern(pattern):
# characters like . ? etc. have special meaning in regular expressions.
# Escape the string to avoid interpretting them as differently.
# The re.escape function escapes even %, so replacing that with XXX to avoid that.
p = re.escape(pattern.replace("%s", "XXX"))
return p.replace("XXX", "\w+")
# Join all the pattens into a single regular expression.
# Each pattern is enclosed in () to remember the match.
# This will help us to find the matched pattern.
rx = re.compile("|".join("(" + make_re_pattern(p) + ")" for p in patterns))
def match(s):
"""Given an input strings, returns the matched pattern or None."""
m = rx.match(s)
if m:
# Find the index of the matching group.
index = (i for i, group in enumerate(m.groups()) if group is not None).next()
return patterns[index]
# Testing with couple of patterns
print match("Hi there, John.")
print match("Will you be meeting Linda today, John?")
Python solution. JS should be similar.
>>> re2.compile('^ABC(.*)E$').search('ABCDE') == None
False
>>> re2.compile('^ABC(.*)E$').search('ABCDDDDDDE') == None
False
>>> re2.compile('^ABC(.*)E$').search('ABX') == None
True
>>>
The trick is to use ^ and $ to bound your pattern and making it a "template". Use (.*) or (.+) or whatever it is that you want to "search" for.
The main bottleneck for you, imho, will be iterating through a list of these patterns. Regex searches are computationally expensive.
If you want a "does any pattern match" result, build a massive OR based regex and let your regex engine handle the 'OR'ing for you.
Also, if you have only prefix patterns, check out the TRIE data structure.
This could be a job for sscanf, there is an implementation in js: http://phpjs.org/functions/sscanf/; the function being copied is this: http://php.net/manual/en/function.sscanf.php.
You should be able to use it without changing the prepared strings much, but I have doubts about the performances.
the problem isn't clear to me. Do you want to take the patterns and build regexes out of it?
Most regex engines have a "quoted string" option. (\Q \E). So you could take the string and make it
^\QHi there,\E(?:.*)\Q.\E$
these will be regexes that match exactly the string you want outside your variables.
if you want to use a single regex to match just a single pattern, you can put them in grouped patterns to find out which one matched, but that will not give you EVERY match, just the first one.
if you use a proper parser (I've used PEG.js), it might be more maintainable though. So that's another option if you think you might get stuck in regex hell
I need a regular Expression for Validating City textBox, the city textbox field accepts only Letters, spaces and dashes(-).
This answer assumes that the letters which #Manaysah refers to also encompasses the use of diacritical marks. I've added the single quote ' since many names in Canada and France have it. I've also added the period (dot) since it's required for contracted names.
Building upon #UIDs answer I came up with,
^([a-zA-Z\u0080-\u024F]+(?:. |-| |'))*[a-zA-Z\u0080-\u024F]*$
The list of cities it accepts:
Toronto
St. Catharines
San Fransisco
Val-d'Or
Presqu'ile
Niagara on the Lake
Niagara-on-the-Lake
München
toronto
toRonTo
villes du Québec
Provence-Alpes-Côte d'Azur
Île-de-France
Kópavogur
Garðabær
Sauðárkrókur
Þorlákshöfn
And what it rejects:
A----B
------
*******
&&
()
//
\\
I didn't add in the use of brackets and other marks since it didn't fall within the scope of this question.
I've stayed away from \s for whitespace. Tabs and line feeds aren't part of a city name and shouldn't be used in my opinion.
This can be arbitrarily complex, depending on how precise you need the match to be, and the variation you're willing to allow.
Something fairly simple like ^[a-zA-Z]+(?:[\s-][a-zA-Z]+)*$ should work.
warning: This does not match cities like München, etc, but here you basically need to work with the [a-zA-Z] part of the expression, and define what characters are allowed for your particular case.
Keep in mind that it also allows for something like San----Francisco, or having several spaces.
Translates to something like:
1 or more letters, followed by a block of: 0 or more spaces or dashes and more letters, this last block can occur 0 or more times.
Weird stuff in there: the ?: bit. If you're not familiarized with regexes, it might be confusing, but that simply states that the piece of regex between parenthesis, is not a capturing group (I don't want to capture the part it matches to reuse later), so the parenthesis are only used as to group the expression (and not to capture the match).
"New York" // passes
"San-Francisco" // passes
"San Fran Cisco" // passes (sorry, needed an example with three tokens)
"Chicago" // passes
" Chicago" // doesn't pass, starts with spaces
"San-" // doesn't pass, ends with a dash
Adding my answer if anybody needs its while searching for Regex for City Names, Like I did
Please use this :
^[a-zA-Z\u0080-\u024F\s\/\-\)\(\`\.\"\']+$
As many city names contains dashes, such as Soddy-Daisy, Tennessee, or special characters like, ñ in La Cañada Flintridge, California
Hope this helps!
Here is the one I've found works best
for PCRE flavours allowing \p{L} (.NET, php, Golang)
/^\p{L}+(?:([\ \-\']|(\.\ ))\p{L}+)*$/u
for regex that does not allow \p{L} replace it with [a-zA-Z\u0080-\u024F]
so for javascript, python regex use
/^[a-zA-Z\u0080-\u024F]+(?:([\ \-\']|(\.\ ))[a-zA-Z\u0080-\u024F]+)*$/
White listing a bunch of character is easy, but there are things to watch for in your regex
consecutive non-alphabetical characters should not be allowed. i.e. Los Angeles should fail because it has two spaces
periods should have a space after. i.e. St.Albert should fail because it's missing the space
names cannot start or end with non-alphabetical characters i.e. -Chicago- should fail
a whitespace character \s !== \, i.e. a tab and line feed character could pass, so space character should be defined instead
Note: When building regex rules, I find https://regex101.com/tests is very helpful, as you can easily create unit tests
js: https://regex101.com/r/cgJwc0/1/tests
php: https://regex101.com/r/Yo3GV2/1/tests
Here's one that will work with most cities, and has been tested:
^[a-zA-Z\u0080-\u024F]+(?:. |-| |')*([1-9a-zA-Z\u0080-\u024F]+(?:. |-| |'))*[a-zA-Z\u0080-\u024F]*$
Python code below, including its test.
import re
import pytest
CITY_RE = re.compile(
r"^[a-zA-Z\u0080-\u024F]+(?:. |-| |')*" # a word
r"([1-9a-zA-Z\u0080-\u024F]+(?:. |-| |'))*"
r"[a-zA-Z\u0080-\u024F]*$"
)
def is_city(value: str) -> bool:
valid = CITY_RE.match(value) is not None
return valid
# Tests
#pytest.mark.parametrize(
"value,expected",
(
("1", False),
("Toronto", True),
("Saint-Père-en-Retz", True),
("Saint Père en Retz", True),
("Saint-Père en Retz", True),
("Paris 13e Arrondissement", True),
("Paris 13e Arrondissement ", True),
("Bouc-Étourdi", True),
("Arnac-la-Poste", True),
("Bourré", True),
("Å", True),
("San Francisco", True),
),
)
def test_is_city(value, expected):
valid, msg = validate.is_city(value)
assert valid is expected
^[a-zA-Z\- ]+$
Also this might be useful http://www.cheatography.com/davechild/cheat-sheets/regular-expressions/
use this regex:
^[a-zA-Z-\s]+$
After many hours of looking for a city regex matcher I have built this and it meets my needs 100%
(?ix)^[A-Z.-]+(?:\s+[A-Z.-]+)*$
expression for testing city.
Matches
City
St. City
Some Silly-City
City St.
Too Many Words City
it seems that there are many flavors of regex and I built this for my Java needs and it works great
^[a-zA-Z.-]+(?:[\s-][\/a-zA-Z.]+)*$
This will help identify some city names like St. Johns, Baie-Sainte-Anne, Grand-Salut/Grand Falls
I like shepley's suggestion, but it has a couple flaws in it.
If you change shpeley's regex to this, it will not accept other special characters:
^([a-zA-Z\u0080-\u024F]{1}[a-zA-Z\u0080-\u024F\. |\-| |']*[a-zA-Z\u0080-\u024F\.']{1})$
I use that one:
^[a-zA-Z\\u0080-\\u024F.]+((?:[ -.|'])[a-zA-Z\\u0080-\\u024F]+)*$
You can try this:
^\p{L}+(?:[\s\-]\p{L}+)*
The above regex will:
Restrict leading and trailing spaces, hyphens
Match cities with names like Néewiller-près-lauterbourg
Here are some fun edge-cases:
's Graveland
's Gravendeel
's Gravenpolder
's Gravenzande
's Heer Arendskerke
's Heerenberg
's Heerenhoek
's Hertogenbosch
't Harde
't Veld
't Zand
100 Mile House
6 October City
So, don't forget to add ' and 0-9 as a possible first character of the city name.