How to extract nth URL from string using regex?

I want to extract the second URL from a string using a regex; I can't use anything else. So far my regex matches the URLs in the string, but the code below only returns the first one.
fun main() {
    var text = "hello world https://www.google.com hello world https://www.stackoverflow.com hello world https://www.test.com"
    var regex = """((http|ftp|https):\/\/([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,#?^=%&:\/~+#-]*[\w#?^=%&\/~+#-])?)"""
    println(performRegex(text, regex))
}
private fun performRegex(text: String?, regex: String?): String? {
    val regexPattern = Regex("""$regex""")
    return regexPattern.find(text.toString())?.value
}
Current Output: https://www.google.com
Expected Output: https://www.stackoverflow.com

You can use
private fun performRegex(text: String?, regex: String?): String? {
    val regexPattern = Regex("""$regex""")
    val matchList = regexPattern.findAll(text.toString()).map{it.value}.toList()
    return if (matchList.size >= 2) matchList[1] else null
}
fun main(args: Array<String>) {
    var text = "hello world https://www.google.com hello world https://www.stackoverflow.com hello world https://www.test.com"
    var regex = """(?:https?|ftp)://\S+"""
    println(performRegex(text, regex))
}
See the online Kotlin demo.
The regex is (?:https?|ftp)://\S+. It matches http://, https:// or ftp:// followed by one or more non-whitespace characters.
The val matchList = regexPattern.findAll(text.toString()).map{it.value}.toList() part finds all matches and collects their values into a list of strings.
The return if (matchList.size >= 2) matchList[1] else null part returns the second match if the list contains at least two items; otherwise it returns null.
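The same idea generalizes from the second match to the n-th. The sketch below is not part of the original answer: nthMatch is a hypothetical helper name, and a 1-based n is an assumption. Because findAll returns a lazy sequence, elementAtOrNull stops searching once the n-th match has been found.

```kotlin
// Hypothetical helper: returns the n-th (1-based) match of `pattern` in `text`,
// or null if there are fewer than n matches.
fun nthMatch(text: String, pattern: String, n: Int): String? {
    if (n < 1) return null
    // findAll is lazy, so no matches beyond the n-th are computed.
    return Regex(pattern).findAll(text).map { it.value }.elementAtOrNull(n - 1)
}

fun main() {
    val text = "hello world https://www.google.com hello world https://www.stackoverflow.com hello world https://www.test.com"
    val urlPattern = """(?:https?|ftp)://\S+"""
    println(nthMatch(text, urlPattern, 2)) // https://www.stackoverflow.com
    println(nthMatch(text, urlPattern, 4)) // null
}
```

Returning null for a missing match mirrors the behaviour of the answer's performRegex.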

Related

regex keeps returning false even when regex101 returns match

I am doing a list.where filter:
String needleTemp = '';
final String hayStack =
    [itemCode, itemDesc, itemCodeAlt, itemDescAlt, itemGroup].join(' ');
for (final k in query.split(" ")) {
  needleTemp = '$needleTemp(?=.*\\Q$k\\E)';
}
var re = RegExp(needleTemp);
return re.hasMatch(hayStack);
I printed needleTemp and it looks the same as in my regex101 example.
In Dart it prints (?=.*\Qa/a\E)(?=.*\Qpatro\E)
It is basically the same, but nothing matches, not even a single letter.
Is Dart regex different, or do I need another syntax?
Edit:
Simple example to test in DartPad:
void main() {
  print("(?=.*\\Qpatrol\\E)");
  var re = RegExp("(?=.*\\Q2020\\E)");
  print(re.hasMatch('A/A PATROL 2020'));
}
This still returns false.
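Found the solution:
Dart's RegExp follows JavaScript regular-expression syntax, which has no \Q...\E quoting. I just needed to remove \Q and \E and use RegExp.escape(text_to_escape) on each term inside the needle instead.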
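For comparison with the Kotlin questions on this page, here is a sketch of the same "every term must occur" filter in Kotlin. containsAllTerms is a hypothetical helper, and case-insensitive matching is an assumption based on the sample data; Regex.escape plays the role that RegExp.escape plays in Dart.

```kotlin
// Hypothetical helper: true if every whitespace-separated term of `query`
// occurs somewhere in `haystack` (case-insensitively, which is an assumption).
fun containsAllTerms(haystack: String, query: String): Boolean {
    val needle = query.trim().split(Regex("""\s+"""))
        .filter { it.isNotEmpty() }
        // Regex.escape quotes metacharacters such as '.' or '(' so each term is literal.
        .joinToString(separator = "") { "(?=.*" + Regex.escape(it) + ")" }
    return Regex(needle, RegexOption.IGNORE_CASE).containsMatchIn(haystack)
}

fun main() {
    println(containsAllTerms("A/A PATROL 2020", "a/a patro")) // true
    println(containsAllTerms("A/A PATROL 2020", "2021"))      // false
}
```

Each term becomes its own lookahead, so the order of the terms in the query does not matter.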

Replacing the 1st regex-match group instead of the 0th

I was expecting this
val string = "hello , world"
val regex = Regex("""(\s+)[,]""")
println(string.replace(regex, ""))
to result in this:
hello, world
Instead, it prints this:
hello world
I see that the replace function cares about the whole match. Is there a way to replace only the 1st group instead of the 0th one?
Add the comma in the replacement:
val string = "hello , world"
val regex = Regex("""(\s+)[,]""")
println(string.replace(regex, ","))
Or use a lookahead, which Kotlin supports, so that the comma is checked but not consumed:
val string = "hello , world"
val regex = Regex("""\s+(?=,)""")
println(string.replace(regex, ""))
You can get the range of the first group from the groups property (a MatchGroupCollection) of the MatchResult and pass that range to the String.removeRange method. Note that this handles only the first match, and the !! operators throw if there is no match:
val string = "hello , world"
val regex = Regex("""(\s+)[,]""")
val result = string.removeRange(regex.find(string)!!.groups[1]!!.range)
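To replace only one group in every match, one option is the lambda overload of Regex.replace. The following is a minimal sketch, not a standard-library feature: replaceGroup is a hypothetical helper, and it assumes a target where MatchGroup.range is available, such as the JVM.

```kotlin
// Hypothetical helper (not a standard-library function): for every match of `regex`,
// replace only the text captured by group number `group`.
fun replaceGroup(input: String, regex: Regex, group: Int, replacement: String): String =
    regex.replace(input) { m ->
        // A group that did not take part in the match is null: keep the match unchanged.
        val g = m.groups[group] ?: return@replace m.value
        // Group offsets are absolute; convert them to offsets inside the matched text.
        val start = g.range.first - m.range.first
        val endExclusive = g.range.last + 1 - m.range.first
        m.value.replaceRange(start, endExclusive, replacement)
    }

fun main() {
    val regex = Regex("""(\s+)[,]""")
    println(replaceGroup("hello , world", regex, 1, "")) // hello, world
}
```

The text returned by the lambda is inserted literally, so the replacement string is not scanned for group references such as $1.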

How to use regex to split string into groups of identical characters?

I have a string like this:
var string = "AAAAAAABBBCCCCCCDD"
and would like to split it into an array of this format (same characters --> same group) using regular expressions:
Array: "AAAAAAA", "BBB", "CCCCCC", "DD"
This is what I have so far, but I can't get it working.
var array = [String]()
var string = "AAAAAAABBBCCCCCCDD"
let pattern = "\\ b([1,][a-z])\\" // mistake?!
let regex = try! NSRegularExpression(pattern: pattern, options: [])
array = regex.matchesInString(string, options: [], range: NSRange(location: 0, length: string.count))
You can achieve that using this function from this answer:
func matches(for regex: String, in text: String) -> [String] {
    do {
        let regex = try NSRegularExpression(pattern: regex)
        let results = regex.matches(in: text,
                                    range: NSRange(text.startIndex..., in: text))
        return results.map {
            String(text[Range($0.range, in: text)!])
        }
    } catch let error {
        print("invalid regex: \(error.localizedDescription)")
        return []
    }
}
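Pass (.)\\1+ as the regex and AAAAAAABBBCCCCCCDD as the text, like this (note that \\1+ requires at least one repetition, so a run consisting of a single character would not be matched):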
let result = matches(for: "(.)\\1+", in: "AAAAAAABBBCCCCCCDD")
print(result) // ["AAAAAAA", "BBB", "CCCCCC", "DD"]
You can achieve that with a "back reference"; compare the NSRegularExpression documentation:
\n
Back Reference. Match whatever the nth capturing group matched. n must be a number ≥ 1 and ≤ total number of capture groups in the pattern.
Example (using the utility method from Swift extract regex matches):
let string = "AAAAAAABBBCCCCCCDDE"
let pattern = "(.)\\1*"
let array = matches(for: pattern, in: string)
print(array)
// ["AAAAAAA", "BBB", "CCCCCC", "DD", "E"]
The pattern matches an arbitrary character, followed by zero or more
occurrences of the same character. If you are only interested in
repeating word characters use
let pattern = "(\\w)\\1*"
instead.
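The back-reference pattern is not specific to NSRegularExpression. As a cross-check, a minimal Kotlin sketch (Kotlin being the language of the first question on this page) produces the same groups; runsOfIdenticalChars is a hypothetical name.

```kotlin
// Hypothetical helper: splits `text` into runs of identical characters.
fun runsOfIdenticalChars(text: String): List<String> =
    // (.) captures one character; \1* matches zero or more repeats of that same character.
    Regex("""(.)\1*""").findAll(text).map { it.value }.toList()

fun main() {
    println(runsOfIdenticalChars("AAAAAAABBBCCCCCCDDE")) // [AAAAAAA, BBB, CCCCCC, DD, E]
}
```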

Make sure regex matches the entire string with Swift regex

How can I check whether a WHOLE string matches a regex? In Java there is the method String.matches(regex).
You need to use anchors, ^ (start of string anchor) and $ (end of string anchor), with range(of:options:range:locale:), passing the .regularExpression option:
import Foundation
let phoneNumber = "123-456-789"
let result = phoneNumber.range(of: "^\\d{3}-\\d{3}-\\d{3}$", options: .regularExpression) != nil
print(result)
Or, you may pass an array of options, [.regularExpression, .anchored], where .anchored will anchor the pattern at the start of the string only, and you will be able to omit ^, but still, $ will be required to anchor at the string end:
let result = phoneNumber.range(of: "\\d{3}-\\d{3}-\\d{3}$", options: [.regularExpression, .anchored]) != nil
See the online Swift demo
Also, using NSPredicate with MATCHES is an alternative here:
The left hand expression equals the right hand expression using a regex-style comparison according to ICU v3 (for more details see the ICU User Guide for Regular Expressions).
MATCHES actually anchors the regex match both at the start and end of the string (note this might not work in all Swift 3 builds):
let pattern = "\\d{3}-\\d{3}-\\d{3}"
let predicate = NSPredicate(format: "self MATCHES [c] %@", pattern)
let result = predicate.evaluate(with: "123-456-789")
What you are looking for is range(of:options:range:locale:). You can then compare the range it returns with the whole range of the string being checked.
Example:
let phoneNumber = "(999) 555-1111"
let wholeRange = phoneNumber.startIndex..<phoneNumber.endIndex
if let match = phoneNumber.range(of: "\\(?\\d{3}\\)?\\s\\d{3}-\\d{4}", options: .regularExpression), wholeRange == match {
    print("Valid number")
} else {
    print("Invalid number")
}
// Valid number
Edit: You can also use NSPredicate and test your string with its evaluate(with:) method.
let pattern = "^\\(?\\d{3}\\)?\\s\\d{3}-\\d{4}$"
let predicate = NSPredicate(format: "self MATCHES [c] %@", pattern)
if predicate.evaluate(with: "(888) 555-1111") {
    print("Valid")
} else {
    print("Invalid")
}
This is based on the answer to "Swift extract regex matches", with a small edit so that it returns a Bool:
import Foundation
func matches(for regex: String, in text: String) -> Bool {
    do {
        let regex = try NSRegularExpression(pattern: regex)
        let nsString = text as NSString
        let results = regex.matches(in: text, range: NSRange(location: 0, length: nsString.length))
        return !results.isEmpty
    } catch let error {
        print("invalid regex: \(error.localizedDescription)")
        return false
    }
}
Example usage, adapted from the link above:
let string = "19320"
let matched = matches(for: "^[1-9]\\d*$", in: string)
print(matched) // true: the whole string matches
let string2 = "a19320"
let matched2 = matches(for: "^[1-9]\\d*$", in: string2)
print(matched2) // false: the leading "a" does not match
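For comparison, Kotlin separates the two behaviours explicitly, much like Java's String.matches: Regex.matches requires the whole input to match, while containsMatchIn accepts a match anywhere. A minimal sketch; isWholeMatch is a hypothetical helper name.

```kotlin
// Hypothetical helper: true only if `pattern` matches all of `text`.
fun isWholeMatch(text: String, pattern: String): Boolean =
    // matches() succeeds only when the pattern covers the entire input,
    // so no ^ and $ anchors are needed.
    Regex(pattern).matches(text)

fun main() {
    val pattern = """\d{3}-\d{3}-\d{3}"""
    println(isWholeMatch("123-456-789", pattern))            // true
    println(isWholeMatch("x123-456-789", pattern))           // false
    println(Regex(pattern).containsMatchIn("x123-456-789"))  // true
}
```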

Selectively uppercasing a string

I have a string with some XML tags in it, like:
"hello <b>world</b> and <i>everyone</i>"
Is there a good Scala/functional way of uppercasing the words, but not the tags, so that it looks like:
"HELLO <b>WORLD</b> AND <i>EVERYONE</i>"
We can use dustmouse's regex to match the text outside the XML tags and replace it with Regex.replaceAllIn. Regex.Match.matched gives the matched text, which can then be uppercased with toUpperCase.
val xmlText = """(?<!<|<\/)\b\w+(?!>)""".r
val string = "hello <b>world</b> and <i>everyone</i>"
xmlText.replaceAllIn(string, _.matched.toUpperCase)
// String = HELLO <b>WORLD</b> AND <i>EVERYONE</i>
val string2 = "<h1>>hello</h1> <span>world</span> and <span><i>everyone</i>"
xmlText.replaceAllIn(string2, _.matched.toUpperCase)
// String = <h1>>HELLO</h1> <span>WORLD</span> AND <span><i>EVERYONE</i>
Using dustmouse's updated regex:
val xmlText = """(?:<[^<>]+>\s*)(\w+)""".r
val string3 = """<h1>>hello</h1> <span id="test">world</span>"""
xmlText.replaceAllIn(string3, m =>
  m.group(0).dropRight(m.group(1).length) + m.group(1).toUpperCase)
// String = <h1>>hello</h1> <span id="test">WORLD</span>
Okay, how about this. It just prints the results, and takes into consideration some of the scenarios brought up by others. Not sure how to capitalize the output without mercilessly poaching from Peter's answer:
val string = "<h1 id=\"test\">hello</h1> <span>world</span> and <span><i>everyone</i></span>"
val pattern = """(?:<[^<>]+>\s*)(\w+)""".r
pattern.findAllIn(string).matchData foreach {
  m => println(m.group(1))
}
The main thing here is that it is extracting the correct capture group.
Working example: http://ideone.com/2qlwoP
Also need to give credit to the answer here for getting capture groups in scala: Scala capture group using regex
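The regexes above decide word by word whether a word sits next to a tag. A different approach, sketched here in Kotlin and not taken from either answer, is to tokenize the string into tags and text runs and uppercase only the text runs. uppercaseOutsideTags is a hypothetical helper; it assumes tags contain no '<' or '>' inside attribute values, and Kotlin 1.5 or later for uppercase().

```kotlin
// Hypothetical helper: uppercases everything except <...> tags.
fun uppercaseOutsideTags(markup: String): String =
    // Alternation: either a whole tag (<...>) or a run of text up to the next '<'.
    Regex("""<[^<>]*>|[^<]+""").replace(markup) { m ->
        if (m.value.startsWith("<")) m.value else m.value.uppercase()
    }

fun main() {
    println(uppercaseOutsideTags("hello <b>world</b> and <i>everyone</i>"))
    // HELLO <b>WORLD</b> AND <i>EVERYONE</i>
}
```

Because a whole tag is consumed as one token, attribute names and values such as id="test" are left untouched.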