Swift Regular Expressions name validate Vietnamese name - regex

I use the following to validate a Vietnamese address. It works on https://regex101.com but gives the wrong result in my Swift project.
extension String {
    func isValidAddress() -> Bool {
        let RegEx = "([0-9A-ZẮẰẲẴẶĂẤẦẨẪẬÂÁÀÃẢẠĐẾỀỂỄỆÊÉÈẺẼẸÍÌỈĨỊỐỒỔỖỘÔỚỜỞỠỢƠÓÒÕỎỌỨỪỬỮỰƯÚÙỦŨỤÝỲỶỸỴ']+\\s?\\b){2,}"
        // %@ is the object placeholder in an NSPredicate format string
        let Test = NSPredicate(format: "SELF MATCHES %@", RegEx)
        return Test.evaluate(with: self.uppercased())
    }
}
My test string is " 123/13 Hương lộ 2. Khu phố 2, Quận Bình Tân. Phường Bình Trị Đông A".
It validates correctly when I delete ".", "/" and ",", like: 12313 Hương lộ 2 Khu phố 2 Quận Bình Tân. Phường Bình Trị Đông A
Thanks for your help.

First of all, MATCHES with NSPredicate requires a full string match. Since your pattern does not match punctuation, it can't match the " 123/13 Hương lộ 2. Khu phố 2, Quận Bình Tân. Phường Bình Trị Đông A" string.
Depending on your requirements, either use range(of:options:range:locale:) with your current pattern, which allows a partial match:
return self.range(of: "(?i)([0-9A-ZẮẰẲẴẶĂẤẦẨẪẬÂÁÀÃẢẠĐẾỀỂỄỆÊÉÈẺẼẸÍÌỈĨỊỐỒỔỖỘÔỚỜỞỠỢƠÓÒÕỎỌỨỪỬỮỰƯÚÙỦŨỤÝỲỶỸỴ']+\\s?\\b){2,}", options: .regularExpression) != nil
(Note that (?i) is a shorter way to tell the regex engine that the pattern is case insensitive). Or else, add those patterns to the regex that you expect to appear in the input string.
E.g. you may match your string fully with "[\\w\\p{P}]+(?:\\s[\\w\\p{P}]+)+" pattern where \w matches any letters, digits and _, \p{P} matches any punctuation (you might think of using just \S instead to match any non-whitespaces).
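The difference between a full match and a partial match can be sketched in Java, where Matcher.matches() behaves like NSPredicate's MATCHES and Matcher.find() behaves like range(of:options:). This is only an illustration under stated assumptions: the class and method names are hypothetical, the address is a shortened, trimmed version of the one in the question (written with Unicode escapes), and the (?U) flag is Java's way of making \w cover non-ASCII letters.

```java
import java.util.regex.Pattern;

class FullVsPartialMatch {
    // Letters, digits and apostrophes only; (?U) makes \w Unicode-aware in Java.
    static final Pattern WORDS_ONLY =
            Pattern.compile("(?U)[\\w']+(?:\\s[\\w']+)+");
    // Same shape, but each token may also contain punctuation (\p{P}).
    static final Pattern WITH_PUNCTUATION =
            Pattern.compile("(?U)[\\w\\p{P}]+(?:\\s[\\w\\p{P}]+)+");

    // Whole-string match, comparable to NSPredicate's MATCHES.
    static boolean matchesFully(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    // Substring match, comparable to range(of:options:) with .regularExpression.
    static boolean matchesPartially(Pattern p, String s) {
        return p.matcher(s).find();
    }

    public static void main(String[] args) {
        // "123/13 Huong lo 2. Khu pho 2" with its Vietnamese diacritics, spelled with Unicode escapes.
        String address = "123/13 H\u01B0\u01A1ng l\u1ED9 2. Khu ph\u1ED1 2";
        System.out.println(matchesFully(WORDS_ONLY, address));       // false: "/" and "." are not covered
        System.out.println(matchesPartially(WORDS_ONLY, address));   // true: a substring is enough
        System.out.println(matchesFully(WITH_PUNCTUATION, address)); // true: punctuation is allowed
    }
}
```

Note that a leading space, as in the question's test string, would still make the full match fail, so trimming the input first is probably a good idea.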

Related

Regex: Only matching at the end of String not anywhere in elastic [duplicate]

The following should be matched:
AAA123
ABCDEFGH123
XXXX123
can I do: ".*123" ?
Yes, you can. That should work.
. = any char except newline
\. = the actual dot character
.? = .{0,1} = match any char except newline zero or one times
.* = .{0,} = match any char except newline zero or more times
.+ = .{1,} = match any char except newline one or more times
Yes that will work, though note that . will not match newlines unless you pass the DOTALL flag when compiling the expression:
Pattern pattern = Pattern.compile(".*123", Pattern.DOTALL);
Matcher matcher = pattern.matcher(inputStr);
boolean matchFound = matcher.matches();
Use the pattern . to match any character once, .* to match any character zero or more times, .+ to match any character one or more times.
The most common way I have seen to encode this is with a character class whose members form a partition of the set of all possible characters.
Usually people write that as [\s\S] (whitespace or non-whitespace), though [\w\W], [\d\D], etc. would all work.
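A minimal Java sketch of the difference, assuming a hypothetical helper fullMatch and a made-up two-line input: the plain dot stops at the newline, while DOTALL or a partition class such as [\s\S] crosses it.

```java
import java.util.regex.Pattern;

class AnyCharDemo {
    // Compiles the regex with the given flags and requires the whole input to match.
    static boolean fullMatch(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String multiLine = "AAA\nBBB123";
        System.out.println(fullMatch(".*123", 0, multiLine));              // false: . stops at \n
        System.out.println(fullMatch(".*123", Pattern.DOTALL, multiLine)); // true
        System.out.println(fullMatch("[\\s\\S]*123", 0, multiLine));       // true
    }
}
```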
.* and .+ are for any chars except for new lines.
Double Escaping
In case you want to include newlines, the following expressions also work in languages that require double escaping in string literals, such as Java or C++:
[\\s\\S]*
[\\d\\D]*
[\\w\\W]*
for zero or more times, or
[\\s\\S]+
[\\d\\D]+
[\\w\\W]+
for one or more times.
Single Escaping:
Double escaping is not required in languages with regex literals or raw/verbatim strings, such as C#, PHP, Ruby, Perl, Python and JavaScript:
[\s\S]*
[\d\D]*
[\w\W]*
[\s\S]+
[\d\D]+
[\w\W]+
Test
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegularExpression {
    public static void main(String[] args) {
        final String regex_1 = "[\\s\\S]*";
        final String regex_2 = "[\\d\\D]*";
        final String regex_3 = "[\\w\\W]*";
        final String string = "AAA123\n\t"
            + "ABCDEFGH123\n\t"
            + "XXXX123\n\t";
        final Pattern pattern_1 = Pattern.compile(regex_1);
        final Pattern pattern_2 = Pattern.compile(regex_2);
        final Pattern pattern_3 = Pattern.compile(regex_3);
        final Matcher matcher_1 = pattern_1.matcher(string);
        final Matcher matcher_2 = pattern_2.matcher(string);
        final Matcher matcher_3 = pattern_3.matcher(string);
        if (matcher_1.find()) {
            System.out.println("Full Match for Expression 1: " + matcher_1.group(0));
        }
        if (matcher_2.find()) {
            System.out.println("Full Match for Expression 2: " + matcher_2.group(0));
        }
        if (matcher_3.find()) {
            System.out.println("Full Match for Expression 3: " + matcher_3.group(0));
        }
    }
}
Output
Full Match for Expression 1: AAA123
ABCDEFGH123
XXXX123
Full Match for Expression 2: AAA123
ABCDEFGH123
XXXX123
Full Match for Expression 3: AAA123
ABCDEFGH123
XXXX123
If you wish to explore the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link, how it would match against some sample inputs.
RegEx Circuit
jex.im visualizes regular expressions:
There are lots of sophisticated regex testing and development tools, but if you just want a simple test harness in Java, here's one for you to play with:
String[] tests = {
    "AAA123",
    "ABCDEFGH123",
    "XXXX123",
    "XYZ123ABC",
    "123123",
    "X123",
    "123",
};
for (String test : tests) {
    System.out.println(test + " " + test.matches(".+123"));
}
Now you can easily add new testcases and try new patterns. Have fun exploring regex.
See also
regular-expressions.info/Tutorial
No, * will match zero-or-more characters. You should use +, which matches one-or-more instead.
This expression might work better for you: [A-Z]+123
Specific Solution to the example problem:-
Try [A-Z]*123$, which will match 123, AAA123 and ASDFRRF123. If you need at least one character before 123, use [A-Z]+123$.
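As a rough Java illustration of the two anchored patterns (the class name and the found helper are made up for this sketch), find() is used so that the $ anchor is what enforces the "at the end" requirement:

```java
import java.util.regex.Pattern;

class EndsWith123 {
    static final Pattern ZERO_OR_MORE = Pattern.compile("[A-Z]*123$");
    static final Pattern ONE_OR_MORE = Pattern.compile("[A-Z]+123$");

    // True if the pattern matches somewhere in s; $ ties the match to the end.
    static boolean found(Pattern p, String s) {
        return p.matcher(s).find();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"123", "AAA123", "XYZ123ABC"}) {
            System.out.println(s + " * " + found(ZERO_OR_MORE, s) + " + " + found(ONE_OR_MORE, s));
        }
        // 123 * true + false
        // AAA123 * true + true
        // XYZ123ABC * false + false
    }
}
```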
General Solution to the question (How to match "any character" in the regular expression):
If you are looking for anything including whitespace you can try [\w\W]{min_char_to_match,} (a | inside a character class is a literal pipe, so it is not needed there).
If you are trying to match anything except whitespace you can try [\S]{min_char_to_match,}.
Try the regex .{3,}. This will match all characters except a new line.
In JavaScript, [^] matches any character, including newline; most other regex flavors do not accept an empty negated class. [^CHARS] matches all characters except for those in CHARS, so if CHARS is empty, it matches all characters.
JavaScript example:
/a[^]*Z/.test("abcxyz \0\r\n\t012789ABCXYZ") // Returns ‘true’.
I like the following:
[!-~]
This matches every printable ASCII character except the space (char codes 33 to 126), that is, the special characters as well as the normal A-Z, a-z and 0-9.
https://www.w3schools.com/charsets/ref_html_ascii.asp
E.g. faker.internet.password(20, false, /[!-~]/)
Will generate a password like this: 0+>8*nZ\\*-mB7Ybbx,b>
This worked for me. The dot does not always mean any character: it matches line terminators only in single-line (DOTALL) mode. \p{all} should be used instead:
String value = "|°¬<>!\"#$%&/()=?'\\¡¿/*-+_#[]^^{}";
String expression = "[a-zA-Z0-9\\p{all}]{0,50}";
if (value.matches(expression)) {
    System.out.println("true");
} else {
    System.out.println("false");
}

Scala regex: Find matching URL regex within a sentence

import java.util.regex._

object RegMatcher extends App {
  val str = "facebook.com"
  val urlpattern = "(http://|https://|file://|ftp://)?(www.)?([a-zA-Z0-9]+).[a-zA-Z0-9]*.[a-z]{3}.?([a-z]+)?"
  var regex_list: Set[(String, String)] = Set()
  val url = Pattern.compile(urlpattern)
  var m = url.matcher(str)
  if (m.find()) {
    regex_list += (("date", m.group(0)))
    println("match: " + m.group(0))
  }
  val str2 = "url is ftp://filezilla.com"
  m = url.matcher(str2)
  if (m.find()) {
    regex_list += (("date", m.group(0)))
    println("str 2 match: " + m.group(0))
  }
}
This returns
match: facebook.com
str 2 match: url is ftp:
How do I change the regex pattern so that both strings are matched correctly?
What do the symbols in the regex actually mean? I am very new to regex. Please help.
I read your regex as:
0 or 1 (? modifier) of the schemes (http://, https://, etc.),
followed by 0 or 1 instance of www followed by any one character (the . in www. is unescaped),
followed by 1 or more (+ modifier) alphanumeric characters,
followed by any character (. is a regex special character, remember, standing for any one character),
followed by 0 or more (* modifier) alphanumerics,
followed by any character (. again),
followed by 3 lowercase letters ({3} being an exact count modifier),
followed by 0 or 1 of any character (.?),
followed by an optional group of one or more lowercase letters (([a-z]+)?).
If you plug your regex into regex101.com, you'll not only see a similar breakdown (without any errors I might have made, though I think I nailed it), you'll also have a chance to test various strings against it. Then, once you have your regexes working the way you want, you can bring them back to your script. It's a solid workflow for both learning regexes and developing an expression for a particular purpose.
If you drop your regex and your inputs into regex 101, you'll see why you're getting the output you see. But here's a hint: when you ask your regular expression to match "url is ftp://filezilla.com", nothing excludes "url is" from being part of the match. That's why you're not matching the scheme you want. Regex101 really is a great way to investigate this further.
The regex can be updated to
((ftp|https|http?):\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,})
This is all I needed.
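To see the effect of the unescaped dots described above, here is a small Java sketch. ORIGINAL is the pattern from the question; ESCAPED is a hypothetical, deliberately simplified pattern written only for this illustration (it is not the updated regex quoted above), and the class and method names are likewise assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlFinder {
    // The pattern from the question: every unescaped '.' matches any character.
    static final Pattern ORIGINAL = Pattern.compile(
            "(http://|https://|file://|ftp://)?(www.)?([a-zA-Z0-9]+).[a-zA-Z0-9]*.[a-z]{3}.?([a-z]+)?");
    // Hypothetical simplified pattern: optional scheme, optional "www.", dot-separated labels.
    static final Pattern ESCAPED = Pattern.compile(
            "(?:(?:https?|ftp|file)://)?(?:www\\.)?[a-zA-Z0-9-]+(?:\\.[a-zA-Z0-9-]+)+");

    // Returns the first match in text, or null if there is none.
    static String firstMatch(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String sentence = "url is ftp://filezilla.com";
        System.out.println(firstMatch(ORIGINAL, sentence));      // url is ftp:
        System.out.println(firstMatch(ESCAPED, sentence));       // ftp://filezilla.com
        System.out.println(firstMatch(ESCAPED, "facebook.com")); // facebook.com
    }
}
```

With the dots escaped, a space can no longer stand in for a separator, so the words before the URL cannot become part of the match.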

How to create "blocks" with Regex

For a project of mine, I want to create 'blocks' with Regex.
\xyz\yzx //wrong format
x\12 //wrong format
12\x //wrong format
\x12\x13\x14\x00\xff\xff //correct format
When using Regex101 to test my regular expressions, I came to this result:
([\\x(0-9A-Fa-f)])/gm
This leads to incorrect output, because
12\x
still gets detected as a correct string although the order is wrong. It needs to be in the order specified below, and in no other order:
backslash x 0-9A-Fa-f 0-9A-Fa-f
Can anyone explain how that works and why it works in that way? Thanks in advance!
To match a \ followed by x, followed by 2 hex chars, anywhere in the string, you need to use
\\x[0-9A-Fa-f]{2}
See the regex demo
To force it match all non-overlapping occurrences, use the specific modifiers (like /g in JavaScript/Perl) or specific functions in your programming language (Regex.Matches in .NET, or preg_match_all in PHP, etc.).
The ^(?:\\x[0-9A-Fa-f]{2})+$ regex validates a whole string that consists of the patterns like above. It happens due to the ^ (start of string) and $ (end of string) anchors. Note the (?:...)+ is a non-capturing group that can repeat in the string 1 or more times (due to + quantifier).
Some Java demo:
String s = "\\x12\\x13\\x14\\x00\\xff\\xff";
// Extract valid blocks
Pattern pattern = Pattern.compile("\\\\x[0-9A-Fa-f]{2}");
Matcher matcher = pattern.matcher(s);
List<String> res = new ArrayList<>();
while (matcher.find()){
res.add(matcher.group(0));
}
System.out.println(res); // => [\x12, \x13, \x14, \x00, \xff, \xff]
// Check if a string consists of valid "blocks" only
boolean isValid = s.matches("(?i)(?:\\\\x[a-f0-9]{2})+");
System.out.println(isValid); // => true
Note that we may shorten [0-9A-Fa-f] to [a-f0-9] if we add a case insensitive modifier (?i) to the start of the pattern, or just use \p{XDigit}, which matches any hexadecimal digit in a Java regex.
The String#matches method always anchors the regex, so we do not need the leading ^ and trailing $ anchors when using the pattern with it.
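As a quick check, the four strings from the question can be run through a small validator. This is a sketch with hypothetical names (HexBlocks, isValid); it uses \p{XDigit}, Java's POSIX class for one hexadecimal digit.

```java
class HexBlocks {
    // One or more blocks of: backslash, x, two hex digits. String#matches anchors both ends.
    static boolean isValid(String s) {
        return s.matches("(?:\\\\x\\p{XDigit}{2})+");
    }

    public static void main(String[] args) {
        System.out.println(isValid("\\xyz\\yzx"));                    // false
        System.out.println(isValid("x\\12"));                         // false
        System.out.println(isValid("12\\x"));                         // false
        System.out.println(isValid("\\x12\\x13\\x14\\x00\\xff\\xff")); // true
    }
}
```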

How to search for only whole words in a Swift String

I have this NSRegularExpression search. searchString passes in a String that I would like to search for in baseString and highlight. However, at the moment, if I search for the word 'I', the 'i' in the word 'hide', for example, is also highlighted.
I've seen that I can use \b to search for only whole words, but I can't see where to add it to the expression so that only whole words are highlighted.
Another example: if my baseString contains 'His story is history' and I use searchString to search for 'his', it will also highlight part of 'history'.
let regex = try! NSRegularExpression(pattern: searchString as! String,options: .caseInsensitive)
for match in regex.matches(in: baseString!, options: NSRegularExpression.MatchingOptions(), range: NSRange(location: 0, length: (baseString?.characters.count)!)) as [NSTextCheckingResult] {
attributed.addAttribute(NSBackgroundColorAttributeName, value: UIColor.yellow, range: match.range)
}
You can easily create a regex pattern from your searchString:
let baseString = "His story is history"
let searchString = "his" //This needs to be a single word
let attributed = NSMutableAttributedString(string: baseString)
//Create a regex pattern matching with word boundaries
let searchPattern = "\\b"+NSRegularExpression.escapedPattern(for: searchString)+"\\b"
let regex = try! NSRegularExpression(pattern: searchPattern, options: .caseInsensitive)
for match in regex.matches(in: baseString, range: NSRange(0..<baseString.utf16.count)) {
attributed.addAttribute(NSBackgroundColorAttributeName, value: UIColor.yellow, range: match.range)
}
Some comments:
The code above assumes baseString and searchString are non-Optional Strings; if they are not, make them so as soon as possible, before searching.
An empty OptionSet is represented by [], so options: NSRegularExpression.MatchingOptions() in your code can be simplified to options: []. It is also the default value for the options: parameter of the matches method, so you have no need to specify it.
NSRegularExpression takes and returns ranges based on the UTF-16 representation of String. You should not use characters.count to make an NSRange; use utf16.count instead.
The return type of matches(in:range:) is declared as [NSTextCheckingResult], so you have no need to cast it.
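The same idea carries over to other engines. As a rough Java analogue (not the Swift API; the class and method names are made up for this sketch), Pattern.quote plays the role of NSRegularExpression.escapedPattern(for:), and the offsets returned are UTF-16 based, as with NSRange:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeWordSearch {
    // Returns the start offsets (UTF-16 code units) of whole-word, case-insensitive matches.
    static List<Integer> findWholeWord(String base, String word) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b",
                Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(base);
        List<Integer> starts = new ArrayList<>();
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(findWholeWord("His story is history", "his")); // [0]
        System.out.println(findWholeWord("I hide", "i"));                 // [0]
    }
}
```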
Update
I thought of a better solution than my previous answer so I updated it. The original answer will follow for anyone that prefers so.
"(?<=[^A-Za-z0-9]|^)[A-Za-z0-9]+(?=[^A-Za-z0-9]|$)"
Breaking down this expression, (?<=[^A-Za-z0-9]|^) checks for any non-alphanumeric or start of line ^ before the word I want to match. [A-Za-z0-9]+ matches any alphanumeric characters and requires at least one of them because of the +. (?=[^A-Za-z0-9]|$) checks for another non-alphanumeric or end of line $ after the word I matched. Therefore this expression will match any alphanumeric word. To exclude digits and match only letters, simply remove 0-9 from the expression, like
"(?<=[^A-Za-z]|^)[A-Za-z]+(?=[^A-Za-z]|$)"
For usage replace the center matching expression with the word to match like:
"(?<=[^A-Za-z]|^)\(searchString)(?=[^A-Za-z]|$)"
Old Answer
I tried using this before, it finds every string separated by whitespace. Should do what you need
"\\s[a-zA-Z1-9]*\\s"
Change [a-zA-Z1-9]* to match what you are searching for, in your case fit your original search string into it like
let regex = try! NSRegularExpression(pattern: "\\s\(searchString)\\s" ,options: .caseInsensitive)
As an added answer, \\s will include the whitespace before and after the word. I added a check to exclude the whitespace if it becomes more useful, the pattern is like:
"(?<=\\s)[A-Za-z0-9]*(?=\\s)"
similarly, replace [A-Za-z0-9]* which searches for all words with the search string you need.
Note, (?<=\\s) checks for whitespace before the word but does not include it, (?=\\s) checks for whitespace after, also not including it. This will work better in most scenarios compared to my original answer above since there is no extra whitespace.
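To compare the three variants from this answer, here is a small Java sketch (Java accepts the same lookaround syntax; the class name, the count helper and the sample text are assumptions for illustration). The whitespace-based versions miss a word at the very start or end of the string, while the version with ^ and $ alternatives finds it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundaryVariants {
    // Counts case-insensitive matches of regex in text.
    static int count(String regex, String text) {
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "His story is history";
        String word = Pattern.quote("his");
        // Whitespace consumed on both sides: nothing precedes "His", so no match.
        System.out.println(count("\\s" + word + "\\s", text));                            // 0
        // Whitespace checked by lookarounds: still nothing before "His".
        System.out.println(count("(?<=\\s)" + word + "(?=\\s)", text));                   // 0
        // Non-alphanumeric or string edge on both sides: matches the leading "His".
        System.out.println(count("(?<=[^A-Za-z0-9]|^)" + word + "(?=[^A-Za-z0-9]|$)", text)); // 1
    }
}
```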

Parse string using regex

I need to come up with a regular expression to parse my input string. My input string is of the format:
[alphanumeric].[alpha][numeric].[alpha][alpha][alpha].[julian date: yyyyddd]
eg:
A.A2.ABC.2014071
3.M1.MMB.2014071
I need to substring it from the 3rd position and was wondering what would be the easiest way to do it.
Desired result:
A2.ABC.2014071
M1.MMB.2014071
(?i) makes the pattern case insensitive.
(?i)^[a-z\d]\.[a-z]\d\.[a-z]{3}\.\d{7}$
Here a-z means any letter from a to z, and \d means any digit from 0 to 9.
Now, if you want to remove the first section before the dot, use this regex and replace the match with $1 (or maybe \1, depending on the language):
(?i)^[a-z\d]\.([a-z]\d\.[a-z]{3}\.\d{7})$
Another option is to replace the match of the pattern below with an empty string:
(?i)^[a-z\d]\.
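Both replacement options could look like this in Java, where $1 is the backreference syntax for String#replaceFirst. The class and method names are hypothetical; a string that does not match the pattern is returned unchanged.

```java
class StripFirstSection {
    // Validates the whole string and keeps only the captured part after the first dot.
    static String stripWithGroup(String s) {
        return s.replaceFirst("(?i)^[a-z\\d]\\.([a-z]\\d\\.[a-z]{3}\\.\\d{7})$", "$1");
    }

    // Removes only the leading "<alphanumeric>." without validating the rest.
    static String stripPrefix(String s) {
        return s.replaceFirst("(?i)^[a-z\\d]\\.", "");
    }

    public static void main(String[] args) {
        System.out.println(stripWithGroup("A.A2.ABC.2014071")); // A2.ABC.2014071
        System.out.println(stripPrefix("3.M1.MMB.2014071"));    // M1.MMB.2014071
    }
}
```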
If the input string is just the long form, then you want everything except the first two characters. You could arrange to substitute them with nothing:
s/^..//
Or you could arrange to capture everything except the first two characters:
/^..(.*)/
If the expression is part of a larger string, then the breakdown of the alphanumeric components becomes more important.
The details vary depending on the language that is hosting the regex. The notations written above could be Perl or PCRE (Perl Compatible Regular Expressions). Many other languages would accept these regexes too, but other languages would require tweaks.
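Use this regex (note that each unescaped . matches any character, not only a literal dot; write \. for a stricter match):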
\w.[A-Z]\d.[A-Z]{3}.\d{7}
Use the above regex like this:
String[] in = {
    "A.A2.ABC.2014071", "3.M1.MMB.2014071"
};
Pattern p = Pattern.compile("\\w.[A-Z]\\d.[A-Z]{3}.\\d{7}");
for (String s : in) {
    Matcher m = p.matcher(s);
    while (m.find()) {
        System.out.println("Result: " + m.group().substring(2));
    }
}
Live demo: http://ideone.com/tns9iY