Case and diacritic insensitive matching of regex with metacharacter in Swift - regex

I am trying to match rude words in user inputs, for example "I Hate You!" or "i.håté.Yoù" will match with "hate you" in an array of words parsed from JSON.
So I need it to be case and diacritic insensitive and to treat whitespaces in the rude words as any non-letter character:
regex metacharacter \P{L} should work for that, or at least \W
Now I know [cd] works with NSPredicate, like this:
func matches(text: String) -> [String]? {
if let rudeWords = JSON?["words"] as? [String]{
return rudeWords.filter {
let pattern = $0.stringByReplacingOccurrencesOfString(" ", withString: "\\P{L}", options: .CaseInsensitiveSearch)
return NSPredicate(format: "SELF MATCHES[cd] %@", pattern).evaluateWithObject(text)
}
} else {
log.debug("error fetching rude words")
return nil
}
}
That doesn't work with either metacharacter. I guess they are not parsed by NSPredicate, so I tried using NSRegularExpression like this:
func matches(text: String) -> [String]? {
if let rudeWords = JSON?["words"] as? [String]{
return rudeWords.filter {
do {
let pattern = $0.stringByReplacingOccurrencesOfString(" ", withString: "\\P{L}", options: .CaseInsensitiveSearch)
let regex = try NSRegularExpression(pattern: pattern, options: .CaseInsensitive)
return regex.matchesInString(text, options: [], range: NSMakeRange(0, text.utf16.count)).count > 0 // NSRange counts UTF-16 units, not Characters
}
catch _ {
log.debug("error parsing rude word regex")
return false
}
}
} else {
log.debug("error fetching rude words")
return nil
}
}
This seems to work OK. However, I know of no way to make a regex diacritic insensitive, so I tried this (and other solutions like re-encoding):
let text = text.stringByFoldingWithOptions(.DiacriticInsensitiveSearch, locale: NSLocale.currentLocale())
However, this does not work for me since I check user input every time a character is typed so all the solutions I tried to strip accents made the app extremely slow.
Does anyone know if there are any other solutions, or if I am using this the wrong way?
Thanks
EDIT
I was actually mistaken: what was making the app slow was matching with \P{L}. I tried the second solution with \W and with the accent-stripping line, and now it works OK, even if it matches fewer strings than I initially wanted.
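For readers on current Swift, the approach described in the edit (fold the accents away, then match with \W and the case-insensitive option) might look roughly like the sketch below. The function name rudeWords(in:from:) and the sample inputs are made up for illustration, and the sketch assumes the rude words themselves contain no accents.

```swift
import Foundation

// Hypothetical helper: returns the entries of `words` found in `text`,
// ignoring case and accents.
func rudeWords(in text: String, from words: [String]) -> [String] {
    // Strip diacritics once per input rather than once per word.
    let folded = text.folding(options: .diacriticInsensitive, locale: nil)
    return words.filter { word in
        // Escape each part so regex metacharacters inside a word are literal,
        // then let a single non-word character (\W) stand in for each space.
        let pattern = word
            .split(separator: " ")
            .map { NSRegularExpression.escapedPattern(for: String($0)) }
            .joined(separator: "\\W")
        return folded.range(of: pattern,
                            options: [.regularExpression, .caseInsensitive]) != nil
    }
}

print(rudeWords(in: "i.håté.Yoù", from: ["hate you"]))
```

Folding the input once and reusing it for every word keeps the per-keystroke cost down to one fold plus one regex search per word.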
Links
These might help some people dealing with regex and predicates:
http://www.regular-expressions.info/unicode.html
http://juehualu.blogspot.fr/2013/08/ios-notes-for-predicates-programming.html
https://regex101.com

It might be worthwhile to go in a different direction. Instead of flattening the input, what if you changed the regex?
Instead of matching against hate.you, you could match against [h][åæaàâä][t][ëèêeé].[y][o0][ùu], for example (the list is not comprehensive). It makes most sense to do this transformation on the fly rather than storing the result, because that makes it easier to change what each character expands to later.
This will give you some more control over what characters will match. If you look, I have 0 as a character matching o. No amount of Unicode coercion could let you do that.
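A minimal sketch of building such a pattern on the fly, in a single pass over the word. The variants table and the name expandedPattern(for:) are assumptions for illustration, and the character lists are as incomplete as the ones above.

```swift
import Foundation

// Hypothetical lookup table: each base letter maps to the characters
// that should be accepted in its place (note the 0 accepted for o).
let variants: [Character: String] = [
    "a": "aàáâäæãåā",
    "e": "eèéêëēėę",
    "o": "o0ôöòóœøōõ",
    "u": "uûüùúū",
]

// Expands a plain word into a regex: letters with known variants become
// character classes, spaces become \P{L}, everything else is escaped.
func expandedPattern(for word: String) -> String {
    var pattern = ""
    for ch in word.lowercased() {
        if ch == " " {
            pattern += "\\P{L}"
        } else if let set = variants[ch] {
            pattern += "[\(set)]"
        } else {
            pattern += NSRegularExpression.escapedPattern(for: String(ch))
        }
    }
    return pattern
}

let pattern = expandedPattern(for: "hate you")
let found = "i.håté.y0ù".range(of: pattern,
                               options: [.regularExpression, .caseInsensitive]) != nil
print(found)
```

This only matches precomposed accented characters as written in the table; input in decomposed form would need normalizing first.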

I ended up using the solution suggested by Laurel. It works well for me.
I post it here for anybody who might need it.
extension String {
func getCaseAndDiacriticInsensitiveRegex() throws -> NSRegularExpression {
var pattern = self.folding(options: [.caseInsensitive, .diacriticInsensitive], locale: .current)
pattern = pattern.replacingOccurrences(of: "a", with: "[aàáâäæãåā]")
pattern = pattern.replacingOccurrences(of: "c", with: "[cçćč]")
pattern = pattern.replacingOccurrences(of: "e", with: "[eèéêëēėę]")
pattern = pattern.replacingOccurrences(of: "l", with: "[lł]")
pattern = pattern.replacingOccurrences(of: "i", with: "[iîïíīįì]")
pattern = pattern.replacingOccurrences(of: "n", with: "[nñń]")
pattern = pattern.replacingOccurrences(of: "o", with: "[oôöòóœøōõ]")
pattern = pattern.replacingOccurrences(of: "s", with: "[sßśš]")
pattern = pattern.replacingOccurrences(of: "u", with: "[uûüùúū]")
pattern = pattern.replacingOccurrences(of: "y", with: "[yýÿ]")
pattern = pattern.replacingOccurrences(of: "z", with: "[zžźż]")
return try NSRegularExpression(pattern: pattern, options: [.caseInsensitive])
}
}

Related

Regex pattern with [:] returns as invalid

I'm searching a text for a specific pattern using regex and it seems to work fine until I want to include a ":"
the function I use to find the text is:
func matches(for regex: String, in text: String) -> [String] {
do {
let regex = try NSRegularExpression(pattern: regex)
let results = regex.matches(in: text,
range: NSRange(text.startIndex..., in: text))
return results.map {
String(text[Range($0.range, in: text)!])
}
} catch let error {
print("invalid regex: \(error.localizedDescription)")
return []
}
}
The example array I use to test the pattern is:
var textstringarray = ["10:50 - 13:40","ABC"]
And here is the loop that checks the different items:
for myString in textstringarray{
let matched2 = matches(for: "[0-9][0-9][:][0-9][0-9] [-] [0-9][0-9][:][0-9][0-9]", in: myString)
if !matched2.isEmpty{
print(matched2)
}
}
I expect it to return only the first item, but the Log in Playground only says
invalid regex: The value “[0-9][0-9][:][0-9][0-9] [-] [0-9][0-9][:][0-9][0-9]” is invalid
So far I have figured out that the second [:] is the problem, because when I delete it everything works fine. Does anyone have an idea what I could do?
Thanks a lot
Maybe it thinks [: ... :] is an invalid posix character class? Seems like a bug to me.
Does [\:] fix it? (Though there's no need to use a character class for a single character; you could just have : instead of [:].)
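As a sketch of that last suggestion, here is the same check with a bare : and a bare -, neither of which needs a character class. The sample array is the one from the question; the constant names are made up.

```swift
import Foundation

// A colon and a hyphen are ordinary characters outside a character class.
let timeRange = "[0-9]{2}:[0-9]{2} - [0-9]{2}:[0-9]{2}"
let samples = ["10:50 - 13:40", "ABC"]

// Keep only the strings that contain a time range.
let hits = samples.filter {
    $0.range(of: timeRange, options: .regularExpression) != nil
}
print(hits) // ["10:50 - 13:40"]
```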

NSRegularExpression to extract subset of text in Swift 3

I am trying to use NSRegularExpression(pattern: regex) to extract 10.32.15.235 in a string: \"IPAddress\":\"10.32.15.235\",\"WAN\" using Swift 3.
However, I'm getting an error using this function from this answer
func matches(for regex: String, in text: String) -> [String] {
do {
let regex = try NSRegularExpression(pattern: regex)
let nsString = text as NSString
let results = regex.matches(in: text, range: NSRange(location: 0, length: nsString.length))
return results.map { nsString.substring(with: $0.range)}
} catch let error {
print("invalid regex: \(error.localizedDescription)")
return []
}
}
With this call:
let pattern = "IPAddress\\\":\\\"(.+?)\\"
let IPAddressString = self.matches(for: pattern, in: stringData!)
print(IPAddressString)
However, the error part of the function is called with this error:
invalid regex: The value “IPAddress\":\"(.+?)\” is invalid.
Can you help me modify the regex expression for Swift 3?
Thanks
Note that in case you have a valid JSON, you may use a JSON parser with Swift.
To fix your current regex approach, you may use
let pattern = "(?<=IPAddress\":\")[^\"]+"
Pattern details
(?<=IPAddress\":\") - a positive lookbehind that matches a position in the string right after the IPAddress":" substring
[^\"]+ - a negated character class matching 1 or more chars other than "
See the regex demo.
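A small self-contained sketch of using that pattern, assuming the data contains plain double quotes with no literal backslashes. The function name extractIP(from:) is hypothetical.

```swift
import Foundation

// Returns the text between `IPAddress":"` and the next double quote, if any.
func extractIP(from text: String) -> String? {
    // The lookbehind anchors the match right after the key without consuming it.
    let pattern = "(?<=IPAddress\":\")[^\"]+"
    guard let range = text.range(of: pattern, options: .regularExpression) else {
        return nil
    }
    return String(text[range])
}

let stringData = "\"IPAddress\":\"10.32.15.235\",\"WAN\""
print(extractIP(from: stringData) ?? "no match") // 10.32.15.235
```

Because the lookbehind is not part of the match, range(of:options:) is enough here and no capture group is needed.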

Target double quotes using regular expressions in Swift

I have been trying to extract a piece of text inside an string using regular expressions in Swift. The text I want to extract is inside double quotes so I'm trying to target those double quotes and get the piece of text inside.
This is the RegExp that I'm using: (?<=")(?:\\.|[^"\\])*(?=")
It works pretty well with any kind of text, and it could be even simpler since I'm looking for anything that could be inside those double quotes.
When I try to use this RegExp with Swift I have to escape the double quotes in it, but for some reason the RegExp doesn't work with escaped double quotes, e.g. (?<=\")(?:\\.|[^\"\\])*(?=\").
Even if I try something as simple as \", the RegExp doesn't match any double quote in the string.
Code Example
func extractText(sentence: String?) -> String {
let pattern = "(?<=\")(?:\\.|[^\"\\])*(?=\")"
let source = sentence!
if let range = source.range(of: pattern, options: .regularExpression) {
return "Text: \(source[range])"
}
return ""
}
extractText("Hello \"this is\" a test") -> "this is"
To have in mind:
All these RegExps must be inside double quotes to create the string literal that is going to be used as a pattern.
I'm using the String's range method with the .regularExpression option to match the content.
I'm using Swift 4 with an Xcode 9 Playground
How can I escape double quotes in Swift to successfully match these in a string?
Solution
Thanks to @Atlas_Gondal and @vadian I noticed that the problem is, "in part", not the RegExp but the string I'm getting, which uses a different type of double quotes (“ ... ”), so I have to change my pattern to something like "(?<=“).*(?=”)" in order to match it.
The resulting code looks like this:
func extractText(sentence: String?) -> String {
let pattern = "(?<=“).*(?=”)"
let source = sentence!
if let range = source.range(of: pattern, options: .regularExpression) {
return "\(source[range])"
}
return ""
}
range(of:) with the .regularExpression option can't do that because it is not able to capture groups.
You need a real NSRegularExpression:
func extractText(sentence: String) -> String {
let pattern = "\"([^\"]+)\""
let regex = try! NSRegularExpression(pattern: pattern)
if let match = regex.firstMatch(in: sentence, range: NSRange(sentence.startIndex..., in: sentence)) {
let range = Range(match.range(at: 1), in: sentence)!
return String(sentence[range])
}
return ""
}
extractText(sentence:"Hello \"this is\" a test")
The pattern is much simpler: Search for a double quote followed by one or more non-double-quote characters followed by a closing double quote. Capture the characters between the double quotes.
Note that escaped double quotes in a literal string are only virtually escaped.
PS: Your code doesn't compile without the parameter label in Swift 3 nor 4.
try this code:
extension String {
func capturedGroups(withRegex pattern: String) -> [String] {
var results = [String]()
var regex: NSRegularExpression
do {
regex = try NSRegularExpression(pattern: pattern, options: [])
} catch {
return results
}
let matches = regex.matches(in: self, options: [], range: NSRange(location: 0, length: self.utf16.count)) // NSRange is measured in UTF-16 units
guard let match = matches.first else { return results }
let lastRangeIndex = match.numberOfRanges - 1
guard lastRangeIndex >= 1 else { return results }
for i in 1...lastRangeIndex {
let capturedGroupIndex = match.rangeAt(i)
let matchedString = (self as NSString).substring(with: capturedGroupIndex)
results.append(matchedString)
}
return results
}
}
Use extension like this:
print("This is \"My String \"".capturedGroups(withRegex: "\"(.*)\""))
Even though it's a bit late, I've fixed it by using a raw string.
Since Swift 5 you can do this:
let pattern = #"(?<=“).*(?=”)"# // <- Note the # in front and after.
// ...
And you are good to go. By far the simplest solution in my opinion!
⚠️ Note: This means that every character inside the double quotes is taken literally: plain "\(variable)" interpolation and escapes such as \n no longer apply unless you write them with the matching #, as in \#(variable) or \#n.
Here is a great article about raw strings.
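To illustrate, a short sketch comparing an escaped literal with its raw-string equivalent. The phone-number pattern is borrowed purely as an example, and the constant names are made up.

```swift
import Foundation

// The same regex written three ways; all three strings are identical.
let escaped = "^\\d{3}-\\d{3}-\\d{3}$"        // classic literal, backslashes doubled
let raw = #"^\d{3}-\d{3}-\d{3}$"#             // raw literal, backslashes as written

// Interpolation inside a raw string needs the extra # after the backslash.
let digits = 3
let interpolated = #"^\d{\#(digits)}-\d{\#(digits)}-\d{\#(digits)}$"#

print(escaped == raw, raw == interpolated) // true true
print("123-456-789".range(of: raw, options: .regularExpression) != nil) // true
```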

Make sure regex matches the entire string with Swift regex

How can I check whether a WHOLE string matches a regex? In Java there is the method String.matches(regex).
You need to use anchors, ^ (start of string anchor) and $ (end of string anchor), with range(of:options:range:locale:), passing the .regularExpression option:
import Foundation
let phoneNumber = "123-456-789"
let result = phoneNumber.range(of: "^\\d{3}-\\d{3}-\\d{3}$", options: .regularExpression) != nil
print(result)
Or, you may pass an array of options, [.regularExpression, .anchored], where .anchored will anchor the pattern at the start of the string only, and you will be able to omit ^, but still, $ will be required to anchor at the string end:
let result = phoneNumber.range(of: "\\d{3}-\\d{3}-\\d{3}$", options: [.regularExpression, .anchored]) != nil
See the online Swift demo
Also, using NSPredicate with MATCHES is an alternative here:
The left hand expression equals the right hand expression using a regex-style comparison according to ICU v3 (for more details see the ICU User Guide for Regular Expressions).
MATCHES actually anchors the regex match both at the start and end of the string (note this might not work in all Swift 3 builds):
let pattern = "\\d{3}-\\d{3}-\\d{3}"
let predicate = NSPredicate(format: "self MATCHES [c] %@", pattern)
let result = predicate.evaluate(with: "123-456-789")
What you are looking for is range(of:options:range:locale:). You can then compare its result with the whole range of the string being checked.
Example:
let phoneNumber = "(999) 555-1111"
let wholeRange = phoneNumber.startIndex..<phoneNumber.endIndex
if let match = phoneNumber.range(of: "\\(?\\d{3}\\)?\\s\\d{3}-\\d{4}", options: .regularExpression), wholeRange == match {
print("Valid number")
}
else {
print("Invalid number")
}
//Valid number
Edit: You can also use NSPredicate and compare your string with evaluate(with:) method of its.
let pattern = "^\\(?\\d{3}\\)?\\s\\d{3}-\\d{4}$"
let predicate = NSPredicate(format: "self MATCHES [c] %@", pattern)
if predicate.evaluate(with: "(888) 555-1111") {
print("Valid")
}
else {
print("Invalid")
}
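As a variation on the answers above, the anchoring can be wrapped into a small helper so callers do not have to remember it. This is only a sketch: matchesEntirely(_:pattern:) is a made-up name, and it uses \A and \z (absolute start and end of input) with a non-capturing group so that alternations in the pattern are anchored as a whole.

```swift
import Foundation

// Hypothetical helper: true only if `pattern` matches all of `text`.
func matchesEntirely(_ text: String, pattern: String) -> Bool {
    // \A and \z pin the match to the very start and very end of the input;
    // (?: ) keeps a top-level `|` in the pattern inside the anchors.
    let anchored = "\\A(?:\(pattern))\\z"
    return text.range(of: anchored, options: .regularExpression) != nil
}

let phonePattern = "\\(?\\d{3}\\)?\\s\\d{3}-\\d{4}"
print(matchesEntirely("(999) 555-1111", pattern: phonePattern))  // true
print(matchesEntirely("x(999) 555-1111", pattern: phonePattern)) // false
print(matchesEntirely("ab", pattern: "a|ab"))                    // true
```

The last call is the case where comparing the first match against the whole range can give a different answer, since an unanchored search for a|ab finds only "a" first.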
Swift extract regex matches
with little bit of edit
import Foundation
func matches(for regex: String, in text: String) -> Bool {
do {
let regex = try NSRegularExpression(pattern: regex)
let nsString = text as NSString
let results = regex.matches(in: text, range: NSRange(location: 0, length: nsString.length))
return !results.isEmpty
} catch let error {
print("invalid regex: \(error.localizedDescription)")
return false
}
}
Example usage from link above:
let string = "19320"
let matched = matches(for: "^[1-9]\\d*$", in: string)
print(matched) // true: the whole string matches
let otherString = "a19320"
let otherMatched = matches(for: "^[1-9]\\d*$", in: otherString)
print(otherMatched) // false: the leading "a" prevents a match

How to use regex with Swift?

I am making an app in Swift and I need to catch 8 numbers from a string.
Here's the string:
index.php?page=index&l=99182677
My pattern is:
&l=(\d{8,})
And here's my code:
var yourAccountNumber = "index.php?page=index&l=99182677"
let regex = try! NSRegularExpression(pattern: "&l=(\\d{8,})", options: NSRegularExpressionOptions.CaseInsensitive)
let range = NSMakeRange(0, yourAccountNumber.characters.count)
let match = regex.matchesInString(yourAccountNumber, options: NSMatchingOptions.Anchored, range: range)
Firstly, I don't know what the NSMatchingOptions means, on the official Apple library, I don't get all the .Anchored, .ReportProgress, etc stuff. Anyone would be able to lighten me up on this?
Then, when I print(match), the variable seems to contain nothing ([]).
I am using Xcode 7 Beta 3, with Swift 2.0.
ORIGINAL ANSWER
Here is a function you can leverage to get captured group texts:
import Foundation
extension String {
func firstMatchIn(string: NSString!, atRangeIndex: Int!) -> String {
var error : NSError?
let re = NSRegularExpression(pattern: self, options: .CaseInsensitive, error: &error)
let match = re.firstMatchInString(string, options: .WithoutAnchoringBounds, range: NSMakeRange(0, string.length))
return string.substringWithRange(match.rangeAtIndex(atRangeIndex))
}
}
And then:
var result = "&l=(\\d{8,})".firstMatchIn(yourAccountNumber, atRangeIndex: 1)
The 1 in atRangeIndex: 1 will extract the text captured by (\d{8,}) capture group.
NOTE1: If you plan to extract 8, and only 8 digits after &l=, you do not need the , in the limiting quantifier, as {8,} means 8 or more. Change to {8} if you plan to capture just 8 digits.
NOTE2: NSMatchingAnchored is something you would like to avoid if your expected result is not at the beginning of a search range. See documentation:
Specifies that matches are limited to those at the start of the search range.
NOTE3: Speaking about "simplest" things, I'd advise to avoid using look-arounds whenever you do not have to. Look-arounds usually come at some cost to performance, and if you are not going to capture overlapping text, I'd recommend to use capture groups.
UPDATE FOR SWIFT 2
I have come up with a function that will return all matches with all capturing groups (similar to preg_match_all in PHP). Here is a way to use it for your scenario:
func regMatchGroup(regex: String, text: String) -> [[String]] {
do {
var resultsFinal = [[String]]()
let regex = try NSRegularExpression(pattern: regex, options: [])
let nsString = text as NSString
let results = regex.matchesInString(text,
options: [], range: NSMakeRange(0, nsString.length))
for result in results {
var internalString = [String]()
for var i = 0; i < result.numberOfRanges; ++i{
internalString.append(nsString.substringWithRange(result.rangeAtIndex(i)))
}
resultsFinal.append(internalString)
}
return resultsFinal
} catch let error as NSError {
print("invalid regex: \(error.localizedDescription)")
return [] // no matches, so matches.count > 0 is false for the caller
}
}
// USAGE:
let yourAccountNumber = "index.php?page=index&l=99182677"
let matches = regMatchGroup("&l=(\\d{8,})", text: yourAccountNumber)
if (matches.count > 0) // If we have matches....
{
print(matches[0][1]) // Print the first one, Group 1.
}
It may be easier just to use the NSString method instead of NSRegularExpression.
var yourAccountNumber = "index.php?page=index&l=99182677"
println(yourAccountNumber) // index.php?page=index&l=99182677
let regexString = "(?<=&l=)\\d{8,}+"
let options :NSStringCompareOptions = .RegularExpressionSearch | .CaseInsensitiveSearch
if let range = yourAccountNumber.rangeOfString(regexString, options:options) {
let digits = yourAccountNumber.substringWithRange(range)
println("digits: \(digits)")
}
else {
print("Match not found")
}
The (?<=&l=) means precedes but not part of.
In detail:
Look-behind assertion. True if the parenthesized pattern matches text preceding the current input position, with the last character of the match being the input character just before the current position. Does not alter the input position. The length of possible strings matched by the look-behind pattern must not be unbounded (no * or + operators.)
In general performance considerations of a look-behind without instrumented proof is just premature optimization. That being said there may be other valid reasons for and against look-arounds in regular expressions.
ICU User Guide: Regular Expressions
For Swift 2, you can use this extension of String:
import Foundation
extension String {
func firstMatchIn(string: NSString!, atRangeIndex: Int!) -> String {
do {
let re = try NSRegularExpression(pattern: self, options: NSRegularExpressionOptions.CaseInsensitive)
let match = re.firstMatchInString(string as String, options: .WithoutAnchoringBounds, range: NSMakeRange(0, string.length))
return string.substringWithRange(match!.rangeAtIndex(atRangeIndex))
} catch {
return ""
}
}
}
You can get the account-number with:
var result = "&l=(\\d{8,})".firstMatchIn(yourAccountNumber, atRangeIndex: 1)
Replace NSMatchingOptions.Anchored with NSMatchingOptions() (no options)
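For anyone reading this with a current toolchain, the answers above use Swift 1 and 2 APIs. A rough modern equivalent of the capture-group approach might look like the sketch below; the name firstCapture(of:in:group:) is made up, and returning an optional instead of an empty string is a design choice of this sketch, not of the original answers.

```swift
import Foundation

// Hypothetical helper: returns the text of one capture group from the
// first match, or nil if the pattern is invalid or nothing matches.
func firstCapture(of pattern: String, in text: String, group: Int = 1) -> String? {
    guard let regex = try? NSRegularExpression(pattern: pattern),
          group >= 0, group <= regex.numberOfCaptureGroups else { return nil }
    // Build the NSRange from String indices so it is measured in UTF-16 units.
    let searchRange = NSRange(text.startIndex..., in: text)
    // No matching options: in particular not .anchored, which would only
    // allow a match at the very start of the search range.
    guard let match = regex.firstMatch(in: text, options: [], range: searchRange),
          let range = Range(match.range(at: group), in: text) else { return nil }
    return String(text[range])
}

let yourAccountNumber = "index.php?page=index&l=99182677"
print(firstCapture(of: "&l=(\\d{8,})", in: yourAccountNumber) ?? "no match") // 99182677
```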