How can I find all words that contain at least one non-Latin letter (Arabic, Chinese, ...) using the regex.h library?
cityدبي
How about:
(?=\pL)(?![a-zA-Z])
This matches at the position just before a letter, in any alphabet, that is not an ASCII Latin letter:
not ok - cityدبي
ok - city
not ok - دبي
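A minimal sketch of the same idea in JavaScript rather than regex.h: POSIX regular expressions have neither lookaheads nor \p classes, so this only illustrates the pattern itself. JavaScript needs the u flag and the braced form \p{L}. The helper name hasNonLatinLetter and the sample words are made up for illustration, and accented letters such as é also count as "non-Latin" here because only a-z and A-Z are excluded.

```javascript
// A letter in any script (\p{L}) that is not an ASCII Latin letter.
const nonLatinLetter = /(?![a-zA-Z])\p{L}/u;

// Hypothetical helper: true if the word has at least one such letter.
function hasNonLatinLetter(word) {
  return nonLatinLetter.test(word);
}

const words = ["city", "cityدبي", "دبي", "1234"];
for (const word of words) {
  console.log(word, hasNonLatinLetter(word));
}
```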
Try this :
[a-zA-Z]*[^A-Za-z \d]+[a-zA-Z]*
Meaning: one or more non-Latin characters, preceded and followed by zero or more Latin letters, i.e. a word containing at least one non-Latin character.
See demo with some random text:
http://regexr.com?326s3
You may need to adjust this regex to your needs and account for things like digits, special characters, and word boundaries, depending on your input.
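As a rough illustration of this pattern, the sketch below collects every match from a sample string in JavaScript. The helper name findMixedWords and the sample text are assumptions made for the example. Since the negated class only excludes ASCII letters, spaces and digits, punctuation would be matched as well.

```javascript
// The pattern from this answer, with the g flag to collect every match.
const mixedWord = /[a-zA-Z]*[^A-Za-z \d]+[a-zA-Z]*/g;

// Hypothetical helper: returns all matched substrings in the text.
function findMixedWords(text) {
  return text.match(mixedWord) ?? [];
}

const sampleText = "city cityدبي دبي 2024";
console.log(findMixedWords(sampleText));
```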
Just use [^a-zA-Z].
If it matches, the string contains a character outside a-z and A-Z, which may be an international character (digits, spaces and punctuation match as well).
I am new to RegExp. I have a sentence and I would like to pull out a word which satisfies the following -
It must contain only one capitalized letter
It must consist of only characters/letters without numbers
For instance -
"appLe", "warDrobe", "hUsh"
The words that do not fit - "sf_dsfsdF", "331ffsF", "Leopard1997", "mister_Ram" et cetera.
How would you resolve this problem?
The following regex should work:
will find words that have only one capital letter
will only find words with letters (no numbers or special characters)
will match the entire word
\b(?=[A-Z])[A-Z][a-z]*\b|\b(?=[a-z])[a-z]+[A-Z][a-z]*\b
Matches:
appLe
hUsh
Harry
suSan
I
Rejects
HarrY - has TWO capital letters
warDrobeD - has TWO capital letters
sf_dsfsdF - has SPECIAL characters
331ffsF - has NUMBERS
Leopd1997 - has NUMBERS
mistram - does not have a CAPITAL LETTER
See it in action here
Note:
If the capital letter is OPTIONAL- then you will need to add a ? after each [A-Z] like this:
\b(?=[A-Z])[A-Z]?[a-z]*\b|\b(?=[a-z])[a-z]+[A-Z]?[a-z]*\b
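The lists above can be checked with a short JavaScript sketch. It runs the first pattern (the one that requires exactly one capital) against the sample words from this answer; the variable names are made up for illustration.

```javascript
// The answer's pattern: one capital letter, letters only, whole word.
const oneCapital =
  /\b(?=[A-Z])[A-Z][a-z]*\b|\b(?=[a-z])[a-z]+[A-Z][a-z]*\b/;

// Words the answer lists as matches.
const accepted = ["appLe", "hUsh", "Harry", "suSan", "I"];
// Words the answer lists as rejected.
const rejected = [
  "HarrY", "warDrobeD", "sf_dsfsdF", "331ffsF", "Leopd1997", "mistram",
];

for (const word of [...accepted, ...rejected]) {
  console.log(word, oneCapital.test(word));
}
```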
You can do this by using character sets ([a-z] & [A-Z]) with appropriate quantifiers (use ? for one or zero capitals), wrapped in () to capture, surrounded by word breaks \b.
If the capital is optional and can appear anywhere use:
/\b([a-z]*[A-Z]?[a-z]*)\b/ // still matches the empty string, so check the match length
If you always want one capital appearing anywhere use:
/\b([a-z]*[A-Z][a-z]*)\b/ // does not match empty string
If you always want one capital that must not be the first or last character use:
/\b([a-z]+[A-Z][a-z]+)\b/ // does not match empty string
Here is a working snippet demonstrating the second regex from above in JavaScript:
const exp = /\b([a-z]*[A-Z][a-z]*)\b/
const strings = ["appLe", "warDrobe", "hUsh", "sf_dsfsdF", "331ffsF", "Leopard1997", "mister_Ram", ""];
for (const str of strings) {
console.log(str, exp.test(str))
}
Regex101 is great for dev & testing!
RegExp:
/\b[a-z\-]*[A-Z][a-z\-]*\b/g
Demo:
RegEx101
Explanation
Segment - Description
\b[a-z\-]* - find a word boundary (a point where a word character, \w or [A-Za-z0-9_], is adjacent to a non-word character), then match zero or more lowercase letters and hyphens (the hyphen is escaped as \- so that it is read literally)
[A-Z] - match a single uppercase letter
[a-z\-]*\b - match zero or more lowercase letters and hyphens, then find a word boundary
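A small sketch of how this pattern behaves on running text in JavaScript. The helper name findCapitalizedWords and the sample sentence are assumptions for the example. Because hyphens are allowed on both sides of the capital, hyphenated words are returned whole.

```javascript
const hyphenated = /\b[a-z\-]*[A-Z][a-z\-]*\b/g;

// Hypothetical helper: all words with exactly one capital, hyphens allowed.
function findCapitalizedWords(text) {
  return text.match(hyphenated) ?? [];
}

console.log(findCapitalizedWords("a well-Known fact about mister-Ram"));
```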
If I want to find "KFC", "EU 8RF", and "IK-OTP" simultaneously, what should the code look like?
My code is :
db.business.find({name:/^[A-Z\s?\d?\-?]*$/}, {name:1}).sort({"name":1})
but it also returns names that consist only of digits, such as 1973 or 1999. How should I improve my code? TIA
Use a lookahead to require at least one letter.
^(?=.*[A-Z])[A-Z\d\s-]+$
DEMO
You can use
^(?=[\d -]*[A-Z])[A-Z\d]+(?:[ -][A-Z\d]+)*$
See the regex demo.
Details:
^ - start of string
(?=[\d -]*[A-Z]) - a positive lookahead that requires an uppercase ASCII letter after any zero or more digits, spaces or hyphens immediately to the right of the current location
[A-Z\d]+ - one or more uppercase ASCII letters or digits
(?:[ -][A-Z\d]+)* - zero or more repetitions of a space or - and then one or more uppercase ASCII letters or digits
$ - end of string.
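To compare the two suggested patterns, here is a quick check in plain JavaScript, outside MongoDB. The sample names beyond those in the question ("KFC " with a trailing space and "A--B") are made up to show where the patterns differ.

```javascript
// Pattern from the first answer: any mix of the allowed characters,
// as long as at least one uppercase letter is present.
const loose = /^(?=.*[A-Z])[A-Z\d\s-]+$/;
// Pattern from the second answer: alphanumeric chunks joined by a
// single space or hyphen, with at least one uppercase letter.
const strict = /^(?=[\d -]*[A-Z])[A-Z\d]+(?:[ -][A-Z\d]+)*$/;

const names = ["KFC", "EU 8RF", "IK-OTP", "1973", "KFC ", "A--B"];
for (const name of names) {
  console.log(JSON.stringify(name), loose.test(name), strict.test(name));
}
```

With these samples, the three names from the question pass both patterns and 1973 fails both, while "KFC " and "A--B" pass only the first pattern.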
I need a regex that allows only uppercase extended ASCII characters, up to a maximum word length (maxLength) that I set beforehand.
Regex for uppercase letters: \P{Ll}*
Regex for extended ASCII letters: [\x00-\xFF]*
Using ^[\p{Ll}] is not enough, because the characters must also be extended ASCII (so that emoji and other special characters outside the extended ASCII range are not allowed).
How can I combine these two requirements, together with the maxLength limit?
Thank you!!
Generally, you can use
^(?:(?=\p{Lu})\p{Latin}){1,10}$
See the regex demo. Details:
^ - start of string
(?: - start of a non-capturing group:
(?=\p{Lu})\p{Latin} - a character from the Latin script that is an uppercase letter
){1,10} - end of the group, repeat one to ten occurrences
$ - end of string.
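For reference, a sketch of how this pattern could be tried in JavaScript. JavaScript does not accept the \p{Latin} shorthand, so the script is spelled \p{Script=Latin} and the u flag is required. The helper name isUpperLatin and the samples are assumptions for the example.

```javascript
// One to ten characters, each an uppercase letter of the Latin script.
const upperLatin = /^(?:(?=\p{Lu})\p{Script=Latin}){1,10}$/u;

// Hypothetical helper wrapping the test.
function isUpperLatin(text) {
  return upperLatin.test(text);
}

const samples = [
  "ABC",
  "\u00C0\u00C9",       // ÀÉ, uppercase letters from Latin-1 Supplement
  "abc",                // lowercase
  "\u0391\u0392\u0393", // ΑΒΓ, uppercase but Greek
  "ABCDEFGHIJK",        // eleven characters, over the limit
];
for (const s of samples) {
  console.log(s, isUpperLatin(s));
}
```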
Since you are using the regex in a DevExpress masked input component you need to enumerate all these letters in a character class. Based on Regex Latin characters filter and non latin character filer, you need
Latin-1 Supplement U+0080 - U+00FF
Latin Extended-A U+0100 - U+017F
Latin Extended-B U+0180 - U+024F
All chars that are uppercase letters in these three ranges are the ones you want to allow:
var res = []
for (var i=128; i<=591; i++) { // Get chars from \u0080 to \u024F
if (/^\p{Lu}$/u.test(String.fromCharCode(i))) { // If it is an uppercase letter
res.push(String.fromCharCode(i)); // Add it to the results
}
}
console.log(res.join(""))
The code will look like
settings.MaskExpression = "[\\u00C0-\\u00D6\\u00D8-\\u00DE\\u0100\\u0102\\u0104\\u0106\\u0108\\u010A\\u010C\\u010E\\u0110\\u0112\\u0114\\u0116\\u0118\\u011A\\u011C\\u011E\\u0120\\u0122\\u0124\\u0126\\u0128\\u012A\\u012C\\u012E\\u0130\\u0132\\u0134\\u0136\\u0139\\u013B\\u013D\\u013F\\u0141\\u0143\\u0145\\u0147\\u014A\\u014C\\u014E\\u0150\\u0152\\u0154\\u0156\\u0158\\u015A\\u015C\\u015E\\u0160\\u0162\\u0164\\u0166\\u0168\\u016A\\u016C\\u016E\\u0170\\u0172\\u0174\\u0176\\u0178\\u0179\\u017B\\u017D\\u0181\\u0182\\u0184\\u0186\\u0187\\u0189-\\u018B\\u018E-\\u0191\\u0193\\u0194\\u0196-\\u0198\\u019C\\u019D\\u019F\\u01A0\\u01A2\\u01A4\\u01A6\\u01A7\\u01A9\\u01AC\\u01AE\\u01AF\\u01B1-\\u01B3\\u01B5\\u01B7\\u01B8\\u01BC\\u01C4\\u01C7\\u01CA\\u01CD\\u01CF\\u01D1\\u01D3\\u01D5\\u01D7\\u01D9\\u01DB\\u01DE\\u01E0\\u01E2\\u01E4\\u01E6\\u01E8\\u01EA\\u01EC\\u01EE\\u01F1\\u01F4\\u01F6-\\u01F8\\u01FA\\u01FC\\u01FE\\u0200\\u0202\\u0204\\u0206\\u0208\\u020A\\u020C\\u020E\\u0210\\u0212\\u0214\\u0216\\u0218\\u021A\\u021C\\u021E\\u0220\\u0222\\u0224\\u0226\\u0228\\u022A\\u022C\\u022E\\u0230\\u0232\\u023A\\u023B\\u023D\\u023E\\u0241\\u0243-\\u0246\\u0248\\u024A\\u024C\\u024E]{1,10}";
The \u... part matches any letters from the ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞĀĂĄĆĈĊČĎĐĒĔĖĘĚĜĞĠĢĤĦĨĪĬĮİIJĴĶĹĻĽĿŁŃŅŇŊŌŎŐŒŔŖŘŚŜŞŠŢŤŦŨŪŬŮŰŲŴŶŸŹŻŽƁƂƄƆƇƉƊƋƎƏƐƑƓƔƖƗƘƜƝƟƠƢƤƦƧƩƬƮƯƱƲƳƵƷƸƼDŽLJNJǍǏǑǓǕǗǙǛǞǠǢǤǦǨǪǬǮDZǴǶǷǸǺǼǾȀȂȄȆȈȊȌȎȐȒȔȖȘȚȜȞȠȢȤȦȨȪȬȮȰȲȺȻȽȾɁɃɄɅɆɈɊɌɎ set.
The {1,10} limiting quantifier matches one to ten occurrences. You may adjust it further.
A slight modification of @Wiktor's comment that I think is easier to read:
^[^\P{Lu}\P{Latin}]{0,10}$
should match a string of a max of 10 uppercase Latin (inc. extended) characters. Using a negation class to find 10 characters that are not not uppercase nor not Latin. It does match such beautiful and definitely not cursed strings as ĦꜴꝎꞂꜨⱠƎƢƔ.
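The double negation can be checked against the lookahead form with a small JavaScript sketch. JavaScript needs \p{Script=Latin} (and \P{Script=Latin}) plus the u flag, and both patterns use {0,10} here so that they are directly comparable. The variable names and samples are made up for illustration.

```javascript
// Lookahead form: an uppercase letter that is also in the Latin script.
const lookaheadForm = /^(?:(?=\p{Lu})\p{Script=Latin}){0,10}$/u;
// Negated form: exclude anything that is not uppercase or not Latin.
const negatedForm = /^[^\P{Lu}\P{Script=Latin}]{0,10}$/u;

const inputs = ["", "ABC", "\u0126\u018E", "abc", "\u0391", "ABC1"];
for (const s of inputs) {
  console.log(JSON.stringify(s), lookaheadForm.test(s), negatedForm.test(s));
}
```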
I need a RegEx to match an uppercase string ending with a colon. The string can contain spaces, numbers and periods. So that if:
mystring = "I have a C. GRAY CAT2:"
I want the ColdFusion expression
REFind("[A-Z0-9. ][:]",mystring)
to return the number 9, matching "C. GRAY CAT2:". Instead, it is returning the number 21, matching only the colon. I hope that a correction of the regex will solve the problem. Of course I have tried many, many things.
Thank you!
I suggest using
[A-Z0-9][A-Z0-9. ]*:
See the regex demo
Details
[A-Z0-9] - an uppercase letter or digit (in case the first char can be a digit, else remove 0-9)
[A-Z0-9. ]* - zero or more uppercase letters/digits, . or space
: - a colon.
Variations
To avoid matching 345: like substrings but still allow 23 VAL: like ones, use
\b(?=[0-9. ]*[A-Z])[A-Z0-9][A-Z0-9. ]*:
See this regex demo. Here, \b(?=[0-9. ]*[A-Z]) matches a word boundary first, and then the positive lookahead (?=[0-9. ]*[A-Z]) makes sure there is an uppercase letter after 0+ digits, spaces or dots.
If you do not expect numbers at the start of the sequence, i.e. from "I have a 22 C. GRAY CAT2:" you need to extract "C. GRAY CAT2", use Sebastian's suggestion (demo).
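The two patterns from this answer can be tried outside ColdFusion with a short JavaScript sketch. The helper name firstMatch is an assumption for the example, and the index it reports is zero-based, unlike REFind.

```javascript
const basic = /[A-Z0-9][A-Z0-9. ]*:/;
const needsLetter = /\b(?=[0-9. ]*[A-Z])[A-Z0-9][A-Z0-9. ]*:/;

// Hypothetical helper: matched text and zero-based index, or null.
function firstMatch(pattern, text) {
  const m = pattern.exec(text);
  return m ? { text: m[0], index: m.index } : null;
}

console.log(firstMatch(basic, "I have a C. GRAY CAT2:"));
console.log(firstMatch(basic, "345:"));         // digits alone still match
console.log(firstMatch(needsLetter, "345:"));   // rejected: no letter
console.log(firstMatch(needsLetter, "23 VAL:"));
```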
I have revised the selected answer to my own question to cover the German special characters:
[A-Z][A-Z0-9.ÜÄÖß ]*:
This appears to work, however the Germans have recently added a capital ß to their alphabet, which is surely not on most keyboards yet, and therefore will not be a problem for the RegEx for a while.
I am looking for a regex to find the first one or two capitalized words in a string. If the first two words are capitalized, I want both of them. A hyphen should be considered part of a word.
for "Madonna has a new album" I'm looking for "Madonna"
for "Paul Young has no new album" I'm looking for "Paul Young"
for "Emmerson Lake-palmer is not here" I'm looking for "Emmerson Lake-palmer"
I have been using ^[A-Z]+.*?\b( [A-Z]+.*?\b){0,1} which does great on the first two, but for the 3rd example I get Emmerson Lake, instead of Emmerson Lake-palmer.
What REGEX can I use to find the first one or two capitalized words in the above examples?
You may use
^[A-Z][-a-zA-Z]*(?:\s+[A-Z][-a-zA-Z]*)?
See the regex demo
Basically, use a character class [-a-zA-Z]* instead of a dot matching pattern to only match letters and a hyphen.
Details
^ - start of string
[A-Z] - an uppercase ASCII letter
[-a-zA-Z]* - zero or more ASCII letters / hyphens
(?:\s+[A-Z][-a-zA-Z]*)? - an optional (1 or 0 due to ? quantifier) sequence of:
\s+ - 1+ whitespace
[A-Z] - an uppercase ASCII letter
[-a-zA-Z]* - zero or more ASCII letters / hyphens
A Unicode aware equivalent (for the regex flavors supporting Unicode property classes):
^\p{Lu}[-\p{L}]*(?:\s+\p{Lu}[-\p{L}]*)?
where \p{L} matches any letter and \p{Lu} matches any uppercase letter.
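A minimal JavaScript sketch applying the ASCII pattern to the three examples from the question. The helper name leadingCapitalizedWords is made up for illustration.

```javascript
const leadingNames = /^[A-Z][-a-zA-Z]*(?:\s+[A-Z][-a-zA-Z]*)?/;

// Hypothetical helper: the first one or two capitalized words, or null.
function leadingCapitalizedWords(text) {
  const m = text.match(leadingNames);
  return m ? m[0] : null;
}

const titles = [
  "Madonna has a new album",
  "Paul Young has no new album",
  "Emmerson Lake-palmer is not here",
];
for (const title of titles) {
  console.log(leadingCapitalizedWords(title));
}
```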
This is probably simpler:
^([A-Z][-A-Za-z]+)(\s[A-Z][-A-Za-z]+)?
Replace + with * if you expect single-letter words.
If you need a full name only (two words, each starting with a capital letter), here is a simple example:
^([A-Z][a-z]*)(\s)([A-Z][a-z]+)$
Try it. Enjoy!