I want to do some processing on a string in Scala. The first stage of that is finding the index of articles such as: "A ", " A ", "a ", " a ". I am trying to do that like this:
"A house is in front of us".indexOfSlice("\\s+[Aa] ")
I think this should return 0, as the substring is first matched in the first position of the string.
However, this returns -1.
Why does it return -1? Is the regex I am using incorrect?
The other answers as I type this are just missing the point. Your problem is that indexOfSlice doesn't take a regexp, but a sub-sequence to search for in the sequence. So fixing the regexp won't help at all.
Try this:
val pattern = "\\b[Aa]\\b".r.unanchored
for (mo <- pattern.findAllMatchIn("A house is in front of us, a house is in front of us all")) {
  println("pattern starts at " + mo.start)
}
//> pattern starts at 0
//| pattern starts at 27
(with fixed regex, too)
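To make the difference concrete, here is a minimal sketch contrasting the two calls. The names ArticleIndex and firstArticleIndex are made up for illustration; indexOfSlice and findFirstMatchIn are the standard library methods.

```scala
import scala.util.matching.Regex

object ArticleIndex {
  // Same word-boundary pattern as above: a standalone "a" or "A".
  private val article: Regex = "\\b[Aa]\\b".r

  // Hypothetical helper: start index of the first article, if any.
  def firstArticleIndex(text: String): Option[Int] =
    article.findFirstMatchIn(text).map(_.start)

  def main(args: Array[String]): Unit = {
    val text = "A house is in front of us"
    // indexOfSlice compares characters literally, so the regex syntax is
    // searched for verbatim and never found.
    println(text.indexOfSlice("\\s+[Aa] "))
    // A literal slice is found as expected.
    println(text.indexOfSlice("house"))
    // The regex API reports where the pattern first matches.
    println(firstArticleIndex(text))
  }
}
```

Running main should print -1, 2 and Some(0). Returning an Option avoids the -1 sentinel when there is no article at all.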
Edit: counter-example for the popular but wrong suggestion of "\\s*[Aa] "
val pattern2 = "\\s*[Aa] ".r.unanchored
for (mo <- pattern2.findAllMatchIn("The agenda is hidden")) {
  println("pattern starts at " + mo.start)
}
//> pattern starts at 9
I see a mistake in your regex. Your regex is searching for
at least one whitespace character (\s+)
a letter (either A or a)
a single space
but the string you are matching doesn't start with whitespace. That's why it returns -1 instead of index 0.
You could write your regex as "^\\s*[Aa] "
Here is an example:
import java.util.regex.Pattern

val text = "A house is in front of us"
val matcher = Pattern.compile("^\\s*[Aa] ").matcher(text)
var idx = -1 // stays -1 when there is no match
if (matcher.find()) {
  idx = matcher.start()
}
println(idx)
It should print 0, as expected.
I want to find an index in a string. How can I find the index of the first letter of the second-to-last word in a string?
val index = "Hey! How are you men? How you doing"
I want to search for you doing in the string above, but I want the index of the y in you. I wrote some code to find the index, but I am unable to get it.
fun main(vararg args: String) {
    val inputString = "Hey! How are you men? How you doing"
    val regex = "you doing".toRegex()
    val match = regex.find(inputString)!!
    println(match.value)
    println(match.range)
}
This regex finds the last two words in your sentence and calculates the index by subtracting the length of the two words from the length of the string.
val result = Regex("^(?:.*?\\s+)?([^\\s]+\\s+[^\\s]+)$").matchEntire(inputString)
if (result != null) {
    println(inputString.length - result.groupValues[1].length)
} else {
    println("not supported")
}
Supports inputs like
Hey! How are you men? How you doing
Hey! How are you men? How you doing?
Hey! How are you, John?
Hello there!
Split the string, then take the first character of the second-to-last element of the resulting array.
If you are looking for the index of the y in you doing relative to the entire string (Hey! How are you men? How you doing), you can use indexOf.
val inputString = "Hey! How are you men? How you doing"
val matchString = "you doing"
val matchIndex = inputString.indexOf(matchString)
More info on indexOf here.
If you don't want to use a regex (which you probably shouldn't unless you need the efficiency), the simplest option is probably what @samuei says:
index.split(' ').takeLast(2).first().first()
(take the last two words, take the first of those, and then the first character of that)
If you want to mess with indices instead you could do this kind of thing:
val lastSpaceIndex = index.lastIndexOf(' ')
val secondToLastSpace = index.lastIndexOf(' ', startIndex = lastSpaceIndex - 1)
println(index[secondToLastSpace + 1])
where you're finding the index of the last space, then the index of the last space before that one, and then grabbing the character after that. But this is already getting a lot less readable, and is it worth the extra complexity? Your call!
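If what you need is the index rather than the character, the same two lastIndexOf calls give it directly. Below is a sketch in Scala, where String.lastIndexOf behaves the same way for this purpose. SecondLastWord and startIndex are made-up names, and the code assumes words are separated by single spaces with no trailing space.

```scala
object SecondLastWord {
  // Hypothetical helper: index of the first character of the second-to-last
  // word, or None when the string has fewer than two words.
  // Assumption: single spaces between words, no trailing space.
  def startIndex(s: String): Option[Int] = {
    val lastSpace = s.lastIndexOf(' ')
    if (lastSpace <= 0) None
    // lastIndexOf returns -1 when there is no earlier space, so adding 1
    // yields 0, the start of the string.
    else Some(s.lastIndexOf(' ', lastSpace - 1) + 1)
  }

  def main(args: Array[String]): Unit = {
    val input = "Hey! How are you men? How you doing"
    println(startIndex(input))
    println(startIndex(input).map(i => input.substring(i)))
  }
}
```

For the sentence from the question this should give Some(26), and the substring from that index is you doing.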
I have two lists of Strings. I want to replace every occurrence in a sentence of the word at index i of the first list with the word at index i of the second list.
So if I have
list a=("am","I","my")
and
list b=("are","You","your")
I want the sentence "I am an amateur"
to become "You are an amateur"
What is the cleanest way to do that in Kotlin (without a for loop)?
First split the string to a list of its words and then map each word if it exists in list a to the corresponding word in list b. Finally rejoin the string:
val a = listOf("am", "I", "my")
val b = listOf("are", "You", "your")
val str = "I am an amateur"
val new = str
    .split("\\s+".toRegex())
    .map { val i = a.indexOf(it); if (i < 0) it else b[i] }
    .joinToString(" ")
Another way of doing the same thing is:
var new = " $str "
a.forEachIndexed { i, s -> new = new.replace(" $s ", " ${b[i]} ") }
new = new.trim()
although this is closer to a for loop.
I assume there is no punctuation, all whitespaces are spaces and so on.
val m = a.zip(b).toMap()
return s.split(' ').joinToString(" ") { m[it] ?: it }
First you create a map m for more efficient... mapping. Then
Split the string to get a list of words
Map all words: if m contains the word, then return the value (i.e. the replacement), otherwise return the original word (since we shouldn't replace it).
Join all words, separate them by spaces.
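The three steps above can be sketched as follows. This is a Scala rendering of the same idea, not the answer's own code. MapReplace and replaceWords are hypothetical names, and the assumption stated above still holds: words are separated by single spaces and carry no punctuation.

```scala
object MapReplace {
  // Hypothetical helper: a(i) is replaced by b(i) wherever it appears as a
  // whole space-separated token.
  def replaceWords(s: String, a: Seq[String], b: Seq[String]): String = {
    val m = a.zip(b).toMap          // build the lookup map once
    s.split(' ')                    // 1. split into words
      .map(w => m.getOrElse(w, w))  // 2. replace known words, keep the rest
      .mkString(" ")                // 3. join with spaces
  }

  def main(args: Array[String]): Unit = {
    val a = Seq("am", "I", "my")
    val b = Seq("are", "You", "your")
    println(replaceWords("I am an amateur", a, b))
    // "my," is not a key in the map, so it stays as it is.
    println(replaceWords("It is my, I think", a, b))
  }
}
```

The second call shows the limit of the splitting approach: a word with punctuation attached is a different token and is left untouched.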
You can use the regular expression \b\w+\b to match words in a sentence and then call replace function with the lambda that provides a replacement string for each match:
val input = "I am an amateur, alas."
val wordsToReplace = listOf("I", "am", "my")
val wordsReplaceWith = listOf("You", "are", "your")
val wordRegex = """\b\w+\b""".toRegex()
val result = wordRegex.replace(input) { match ->
val wordIndex = wordsToReplace.indexOf(match.value)
if (wordIndex >= 0) wordsReplaceWith[wordIndex] else match.value
}
println(result)
If there are a lot of words in your lists, it makes sense to build a map of them to speed up searches:
val replaceMap = (wordsToReplace zip wordsReplaceWith).toMap()
val result = wordRegex.replace(input) { match ->
replaceMap[match.value] ?: match.value
}
I think the simplest way is to create a set of the regexes you want and replace within the string by iteration. Say you want to replace the word "am"; your regex will be "\bam\b". You can use "(?i)\bam\b" if you want it to be case-insensitive. To turn "I am an amateur" into "You are an amateur":
var str = "I am an amateur"
val replacements = setOf("\\bam\\b" to "are",
    "\\bI\\b" to "You",
    "\\bmy\\b" to "your")
replacements.forEach {
    str = str.replace(Regex(it.first), it.second)
}
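One thing to watch with iterative replacement: each pass sees the output of the previous one, so a replacement word that is itself in the list gets replaced again. The Scala sketch below compares a single-pass lookup with the iterative variant. SinglePassSwap, swap and sequential are made-up names; Regex.quote and Regex.quoteReplacement are standard library helpers.

```scala
import scala.util.matching.Regex

object SinglePassSwap {
  private val word: Regex = "\\b\\w+\\b".r

  // Single pass: every word is looked up once, so output is never re-replaced.
  def swap(sentence: String, pairs: Seq[(String, String)]): String = {
    val lookup = pairs.toMap
    word.replaceAllIn(sentence, m =>
      Regex.quoteReplacement(lookup.getOrElse(m.matched, m.matched)))
  }

  // Iterative variant for comparison: one replaceAll per pair, in order.
  def sequential(sentence: String, pairs: Seq[(String, String)]): String =
    pairs.foldLeft(sentence) { case (acc, (from, to)) =>
      acc.replaceAll("\\b" + Regex.quote(from) + "\\b", Regex.quoteReplacement(to))
    }

  def main(args: Array[String]): Unit = {
    val pairs = Seq("I" -> "You", "You" -> "I")
    println(swap("You and I", pairs))
    println(sequential("You and I", pairs))
  }
}
```

With the two words swapping places, the single pass should print I and You, while the iterative version ends up with I and I. For the original lists nothing overlaps, so both approaches agree there.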
I want to validate that a field has no whitespace before or after the text. Spaces in the middle of the string are allowed.
Here is my code
$.validator.addMethod("trimLookup", function(value, element) {
    regex = "^[^\s]+(\s+[^\s]+)*$";
    regex = new RegExp( regex );
    return this.optional( element ) || regex.test( value );
}, $.validator.format("Cannot contains any spaces at beginning or end"));
I tested the regex at https://regex101.com/ and it works fine. I also tested this code with other regexes and it works. But if I enter " " or " abc ", the validation doesn't work.
Any suggestions?
Thank you for your time!
I want to replace all consecutive underscores with a single space. This is the code I have written, but it is not replacing anything. What am I doing wrong?
import scala.util.matching.Regex
val regex: Regex = new Regex("/[\\W_]+/g")
val name: String = "cust_id"
val newName: String = regex.replaceAllIn(name, " ")
println(newName)
Answer: "cust_id"
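You could use String's replaceAll without building a Regex object. Note that its first argument is still a regular expression, so "_+" turns each run of underscores into a single space:
val name: String = "cust_id"
val newName: String = name.replaceAll("_+", " ")
println(newName)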
The slashes in your regular expression don't belong there, and neither does the g flag: replaceAllIn already replaces every match.
new Regex("[\\W_]+").replaceAllIn("cust_id", " ")
// "cust id"
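As a quick check of both points, here is a small sketch. Underscores and collapse are hypothetical names; it uses only "_+" because the question asks specifically about underscores.

```scala
object Underscores {
  // One or more underscores in a row become a single space.
  def collapse(name: String): String = "_+".r.replaceAllIn(name, " ")

  def main(args: Array[String]): Unit = {
    println(collapse("cust_id"))
    println(collapse("cust___id"))
    // With the slashes kept, the pattern requires a literal "/" before the
    // character class and a literal "/g" after it, so nothing matches.
    println("/[\\W_]+/g".r.findFirstIn("cust_id"))
  }
}
```

Both collapse calls should print cust id, and the last line should print None, which is why the original code returned the name unchanged.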
A string in Scala may be treated as a collection, hence we can map over it and in this case apply pattern matching to substitute characters, like this
"cust_id".map {
case '_' => " "
case c => c
}.mkString
The mkString method joins the mapped elements back into a string.
VBScript's Trim function only trims spaces. Sometimes I want to trim TABs as well. For this I've been using this custom trimSpTab function that is based on a regular expression.
Today I ran into a performance problem. The input consisted of rather long lines (several 1000 chars).
As it turns out
- the function is slow, only if the string is long AND contains many spaces
- the right-hand part of the regular expression is responsible for the poor performance
- the run time seems quadratic in the line length (O(n^2))
So why is this line trimmed fast
" aaa xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx bbb " '10000 x's
and this one trimmed slowly
" aaa bbb " '10000 spaces
Both contain only 6 characters to be trimmed.
Can you propose a modification to my trimSpTab function?
Dim regex
Set regex = new regexp
' TEST 1 - executes in no time
' " aaa XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX bbb "
t1 = Timer
character = "X"
trimTest character
MsgBox Timer-t1 & " sec",, "with '" & character & "' in the center of the string"
' TEST 2 - executes in 1 second on my machine
' " aaa bbb "
t1 = Timer
character = " "
trimTest character
MsgBox Timer-t1 & " sec",, "with '" & character & "' in the center of the string"
Sub trimTest (character)
    sInput = " aaa " & String (10000, character) & " bbb "
    trimmed = trimSpTab (sInput)
End Sub
Function trimSpTab (byval s)
    'trims spaces & tabs
    regex.Global = True
    regex.Pattern = "^[ \t]+|[ \t]+$" 'trim left+right
    trimSpTab = regex.Replace (s, "")
End Function
I have tried this (with regex.Global = false) but to no avail
regex.Pattern = "^[ \t]+" 'trim left
s = regex.Replace (s, "")
regex.Pattern = "[ \t]+$" 'trim right
trimSpTab = regex.Replace (s, "")
UPDATE
I've come up with this alternative in the meantime. It processes a 100-million-character string in less than a second.
Function trimSpTab (byval s)
    'trims spaces & tabs
    regex.Pattern = "^[ \t]+"
    s = strReverse (s)
    s = regex.Replace (s, "")
    s = strReverse (s)
    s = regex.Replace (s, "")
    trimSpTab = s
End Function
Solution
As mentioned in the question, your current solution is to reverse the string. However, this is not necessary, since .NET regex supports the RightToLeft matching option. With it, the engine matches the same regex from right to left instead of the default left to right.
Below is sample code in C#, which I hope you can adapt to a VB solution (I don't know VB well enough to write sample code):
input = new Regex("^[ \t]+").Replace(input, "", 1);
input = new Regex("[ \t]+$", RegexOptions.RightToLeft).Replace(input, "", 1);
Explanation
The long run time is due to the engine trying to match [ \t]+ indiscriminately in the middle of the string and failing whenever the run is not a trailing blank sequence.
The observation that the complexity is quadratic is correct.
We know that the regex engine starts matching from index 0. If there is a match, then the next attempt starts at the end of the last match. Otherwise, the next attempt starts at the (current index + 1). (Well, to simplify things, I don't mention the case where a zero-length match is found).
The diagram below illustrates the farthest point each attempt reaches (some attempts match, some do not) when the engine matches the regex ^[ \t]+|[ \t]+$. _ is used to denote a space (or tab character) for clarity.
_____ab_______________g________
^----
     ^
      ^
       ^--------------
        ^-------------
         ^------------
          ...
                     ^
                      ^
                       ^-------
When there is a long sequence of spaces & tabs in the middle of the string (which will not produce a match), the engine attempts a match at every index in that sequence. As a result, the engine ends up going through O(k^2) characters on a non-matching sequence of spaces & tabs of length k.
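On engines that support lookbehind, another way to avoid the repeated attempts is to let the trailing pattern start only at the first blank of a run. The sketch below uses Scala on the JVM's java.util.regex, which has lookbehind; as far as I know VBScript's RegExp does not, so treat this as an illustration of the idea and not as a drop-in fix. AnchoredTrim is a made-up name.

```scala
object AnchoredTrim {
  private val leading = "^[ \\t]+".r
  // The negative lookbehind rejects every start position that is preceded by
  // a blank, so a run of k blanks is scanned from its first character only.
  private val trailing = "(?<![ \\t])[ \\t]+$".r

  def trimSpTab(s: String): String =
    trailing.replaceFirstIn(leading.replaceFirstIn(s, ""), "")

  def main(args: Array[String]): Unit = {
    val input = " aaa " + (" " * 10000) + " bbb \t "
    val trimmed = trimSpTab(input)
    println(trimmed.length)
  }
}
```

Each interior position of a blank run now fails after a single lookbehind check, so the work per run should be roughly proportional to its length instead of its square.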
Your evidence proves that VBScript's RegExp implementation does not optimize for the $ anchor: It spends time (backtracking?) for each of the spaces in the middle of your test string. Without doubt, that's a fact good to know.
If this causes you real world problems, you'll have to find/write a better (R)Trim function. I came up with:
Function trimString(s, p)
    Dim l : l = Len(s)
    If 0 = l Then
        trimString = s
        Exit Function
    End If
    Dim ps, pe
    For ps = 1 To l
        If 0 = Instr(p, Mid(s, ps, 1)) Then
            Exit For
        End If
    Next
    For pe = l To ps Step -1
        If 0 = Instr(p, Mid(s, pe, 1)) Then
            Exit For
        End If
    Next
    trimString = Mid(s, ps, pe - ps + 1)
End Function
It surely needs testing and benchmarks for long heads or tails of white space, but I hope it gets you started.