How do I remove numbers from a text in Scala?
For example, I have this text:
canon 40 22mm lens lock strength plenty orientation 321 .
After removing the numbers:
canon lens lock strength plenty orientation .
Try filter or filterNot:
val text = "canon 40 22mm lens lock strength plenty orientation 321 ."
val without_digits = text.filter(!_.isDigit)
or
val text = "canon 40 22mm lens lock strength plenty orientation 321 ."
val without_digits = text.filterNot(_.isDigit)
\\d+\\S*\\s+
Try this. Replace each match with an empty string. See the demo:
https://regex101.com/r/tS1hW2/1
It appears that you want to remove every word that contains a digit, since in your example mm is also gone because it is prefixed by a number.
val s = "That's 22m, which is gr8."
s.split(" ").filterNot(_.exists(_.isDigit)).mkString(" ")
res8: String = That's which is
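For readers outside Scala, the same "drop every word that contains a digit" idea can be sketched in Python. The function name drop_words_with_digits is made up for this illustration, and the sketch assumes that words are separated by single spaces, as in the question.

```python
def drop_words_with_digits(text):
    # Split on single spaces, as the Scala answer does, and keep only the
    # words that contain no digit character.
    words = text.split(" ")
    return " ".join(w for w in words if not any(c.isdigit() for c in w))


text = "canon 40 22mm lens lock strength plenty orientation 321 ."
cleaned = drop_words_with_digits(text)
print(cleaned)
```

Unlike filtering individual characters, this also removes the mm that is attached to 22.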
I am trying to split text into sentences, with exceptions for abbreviations such as Mr. and Mrs., and add the sentences to an array.
This worked in vanilla JS for me:
/(?<!\sMrs)(?<!\sMr)(?<!\sSr)(\.\s)|(!\s)|(\?\s)/gm
Unfortunately, React Native doesn't support negative lookbehind.
Is there another way I can achieve the same result?
You can create exceptions in the following way:
let str = "Hello Mr. Jackson. How are you doing today?"
let sentences = str.match(/(?:\b(?:Mrs?|Sr)\.\s|[^!?.])+[!?.]?/g).map(x => x.trim())
console.log(sentences)
The regex (see its online demo) matches
(?:\b(?:Mrs?|Sr)\.\s|[^!?.])+ - one or more occurrences of
\b(?:Mrs?|Sr)\.\s - Mr, Mrs or Sr as whole words followed by . and a whitespace char
| - or
[^!?.] - any single char other than ?, ! and .
[!?.]? - an optional !, ? or ..
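The pattern described above uses no lookbehind, so it should carry over to other regex engines. Here is a minimal sketch in Python, assuming the same three abbreviations; split_sentences is a hypothetical helper name, not part of any library.

```python
import re

# Same pattern as in the answer: an abbreviation followed by ". " is consumed
# as ordinary sentence content instead of ending the sentence.
SENTENCE = re.compile(r"(?:\b(?:Mrs?|Sr)\.\s|[^!?.])+[!?.]?")


def split_sentences(text):
    # findall returns the full matches because the pattern has no capturing
    # groups; strip the whitespace left over between sentences.
    return [m.strip() for m in SENTENCE.findall(text) if m.strip()]


sentences = split_sentences("Hello Mr. Jackson. How are you doing today?")
print(sentences)
```

Other abbreviations (Dr, Prof and so on) would have to be added to the alternation by hand.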
I ended up doing this:
let str = "Hello Mr. Jackson. How are you doing today?"
let sentences = str
  .replace(
    /(!”\s|\?”\s|\.”|!\)\s|\.\)\s|\?\)|!"\s|\."\s|\.\s|!\s|\?\s|[.?!])\s*/g,
    "$1|"
  )
  .split("|")
let arr = []
for (let i = 0; i < sentences.length; i++) {
  // Use logical OR (||) rather than bitwise OR (|)
  if (sentences[i].includes("Mr.") || sentences[i].includes("Mrs.")) {
    arr.push(sentences[i] + " " + sentences[i+1])
    i++
  } else {
    arr.push(sentences[i])
  }
}
console.log(arr)
If anyone has a more efficient solution, let me know!
I need to split a string into an array whose elements are pairs of consecutive words, using Scala:
"Hello, it is useless text. Hope you can help me."
The result:
[[it is], [is useless], [useless text], [Hope you], [you can], [can help], [help me]]
One more example:
"This is example 2. Just\nskip it."
Result:
[[This is], [is example], [Just skip], [skip it]]
I tried this regex:
val re = """[a-zA-Z]+\s[a-zA-Z]+""".r
But the output is:
scala> for (m <- re.findAllIn("Hello, it is useless text. Hope you can help me.")) println(m)
it is
useless text
Hope you
can help
So it ignores some cases.
First split on the punctuation and digits, then split on the spaces, then slide over the results.
def doubleUp(txt: String): Array[Array[String]] =
  txt.split("[.,;:\\d]+")
    .flatMap(_.trim.split("\\s+").sliding(2))
    .filter(_.length > 1)
usage:
val txt1 = "Hello, it is useless text. Hope you can help me."
doubleUp(txt1)
//res0: Array[Array[String]] = Array(Array(it, is), Array(is, useless), Array(useless, text), Array(Hope, you), Array(you, can), Array(can, help), Array(help, me))
val txt2 = "This is example 2. Just\nskip it."
doubleUp(txt2)
//res1: Array[Array[String]] = Array(Array(This, is), Array(is, example), Array(Just, skip), Array(skip, it))
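The three steps (split on punctuation and digits, split on whitespace, slide a window of two) can also be sketched in Python for comparison. The name double_up simply mirrors the Scala function, and zip(words, words[1:]) stands in for sliding(2); chunks with fewer than two words produce no pairs, so no extra filter is needed.

```python
import re


def double_up(txt):
    pairs = []
    # Step 1: break the text at runs of punctuation or digits.
    for chunk in re.split(r"[.,;:\d]+", txt):
        # Step 2: split each chunk on any whitespace, newlines included.
        words = chunk.split()
        # Step 3: pair every word with its successor.
        pairs.extend(zip(words, words[1:]))
    return pairs


print(double_up("Hello, it is useless text. Hope you can help me."))
print(double_up("This is example 2. Just\nskip it."))
```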
First, preprocess the string by expanding any escape sequences it contains.
scala> val string = "Hello, it is useless text. Hope you can help me."
val preprocessed = StringContext.processEscapes(string)
//preprocessed: String = Hello, it is useless text. Hope you can help me.
OR
scala> val string = "This is example 2. Just\nskip it."
val preprocessed = StringContext.processEscapes(string)
//preprocessed: String =
//This is example 2. Just
//skip it.
Then filter out the unwanted tokens (empty strings, words followed by punctuation, etc.) and use the sliding function:
val result = preprocessed.split("\\s").filter(e => !e.isEmpty && !e.matches("(?<=^|\\s)[A-Za-z]+\\p{Punct}(?=\\s|$)") ).sliding(2).toList
//scala> res9: List[Array[String]] = List(Array(it, is), Array(is, useless), Array(useless, Hope), Array(Hope, you), Array(you, can), Array(can, help))
You need to use split to break the string down into words separated by non-word characters, and then sliding to pair up the words in the way that you want:
val text = "Hello, it is useless text. Hope you can help me."
text.trim.split("\\W+").sliding(2)
You may also want to remove escape characters, as explained in other answers.
Sorry, I only know Python. I heard the two are almost the same. Hope you can understand it.
string = "it is useless text. Hope you can help me."
split = string.split(' ')  # splits on space (you can use regex for this)
result = []
no = 0
count = len(split)
for x in range(count):
    no += 1
    if no < count:
        pair = split[x] + ' ' + split[no]  # adds the current word to the next
        result.append(pair)
The output will be:
['it is', 'is useless', 'useless text.', 'text. Hope', 'Hope you', 'you can', 'can help', 'help me.']
I am trying to loop through regex results, and insert the first capture group into a variable to be processed in a loop. But I can't figure out how to do so. Here's what I have so far, but it just prints the second match:
aQuote = "The big boat has a big assortment of big things."
theMatches = regmatches(aQuote, gregexpr("big ([a-z]+)", aQuote ,ignore.case = TRUE))
results = lapply(theMatches, function(m){
capturedItem = m[[2]]
print(capturedItem)
})
Right now it prints
[1] "big assortment"
What I want it to print is
[1] boat
[1] assortment
[1] things
Try this:
regmatches(aQuote, gregexpr("(?<=big )[a-z]+", aQuote ,ignore.case = TRUE,perl=TRUE))[[1]]
#[1] "boat" "assortment" "things"
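To show the "loop over the matches and read capture group 1" idea in isolation, here is a rough Python sketch. It is only an analogue of the R code above, and the variable names are made up for the illustration.

```python
import re

quote = "The big boat has a big assortment of big things."

captured = []
# finditer yields one match object per occurrence; group(1) is the text
# captured by the first pair of parentheses.
for match in re.finditer(r"big ([a-z]+)", quote, flags=re.IGNORECASE):
    captured.append(match.group(1))

for item in captured:
    print(item)
```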
Include the g (global) modifier in your code as well.
The equivalent regex in Perl / JavaScript is: /big ([a-z]+)/ig
Sample Perl program:
$aQuote = "The big boat has a big assortment of big things.";
print $1."\n" while ($aQuote =~ /big ([a-z]+)/ig);
JS Fiddle here.
Edit: In R, we can write:
aQuote = "The big boat has a big assortment of big things."
theMatches = regmatches(aQuote, gregexpr("big ([a-z]+)", aQuote ,ignore.case = TRUE))
results = lapply(theMatches, function(m){
  len = length(m)
  for (i in 1:len)
  {
    print(m[[i]])
  }
})
r fiddle here.
I have two strings like:
"Nikon Coolpix AW130 16MP Point and Shoot Digital Camera Black with 5x Optical Zoom"
"Nikon Coolpix AW130 16 MP Point & Shoot Camera Black"
I am trying to compare strings like these. As you can see, both describe the same product, but when I tokenize on spaces and compare word by word, the space between 16 and MP in the second string causes a difference that is not really there.
Is there any way I can add a space in the first string, where 16MP is written together, so that I can tokenize on spaces properly?
import scala.collection.mutable.ListBuffer

val productList = List("Nikon Coolpix AW130 16MP Point and Shoot Digital Camera Black with 5x Optical Zoom", "Nikon Coolpix AW130 16 MP Point & Shoot Camera Black")
val tokens = ListBuffer[String]()
// split is defined on String, not on List, so split each title first
productList.flatMap(_.split(" ")).foreach(x => {
  tokens += x
})
val res = tokens.toList
If you just want to remove the space between a number and a fixed MP string, you can use the following regex:
scala> "Nikon Coolpix AW130 16 MP Point & Shoot Camera Black".replaceAll("""(\d+) ?(MP)""", "$1$2")
res13: String = Nikon Coolpix AW130 16MP Point & Shoot Camera Black
The (\d+) part matches any number with at least 1 digit
The " ?" part (note the leading space) matches zero or one space.
The (MP) part matches the string MP literally.
$1$2 outputs the match of the first parentheses (\d+) followed by the match of the second parentheses (MP), omitting the space if there is one.
After that, the 16MP tokens should be equal. You will still have the problem of and vs. &, though.
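As a cross-check of the explanation above, the same replacement can be sketched in Python, where the group references are written \1\2 instead of $1$2. The helper name join_megapixels is hypothetical.

```python
import re


def join_megapixels(title):
    # (\d+) captures the number, " ?" allows an optional space, (MP) captures
    # the unit; the replacement glues the two groups together.
    return re.sub(r"(\d+) ?(MP)", r"\1\2", title)


a = join_megapixels("Nikon Coolpix AW130 16MP Point and Shoot Digital Camera Black with 5x Optical Zoom")
b = join_megapixels("Nikon Coolpix AW130 16 MP Point & Shoot Camera Black")
print(a.split(" "))
print(b.split(" "))
```

Both token lists now contain 16MP as a single token.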
You can do it with a regex.
Search for both formats and replace them with one canonical form.
Instead of splitting, it is easier to apply a series of regex replacements.
public static boolean equivalent(String a, String b) {
    return normalize(a).equalsIgnoreCase(normalize(b));
}

private static String normalize(String s) {
    return s.replaceAll("(\\d+)", "$0 ") // At least one space after digits.
            .replaceAll("\\bLimited\\b", "Ltd") // Example.
            .replace("'", "") // Example.
            .replace("&", " and ")
            .replaceAll("\\s+", " ") // Multiple spaces to one.
            .trim();
}
Or do a split on the normalized string (to get keywords).
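The "split the normalized string into keywords" variant might look roughly like this in Python. The normalization rules are a reduced, assumed subset of the Java version, and keywords is a made-up helper; comparing keyword sets is one possible way to use the result, not something the answer prescribes.

```python
import re


def normalize(s):
    s = re.sub(r"(\d+)", r"\1 ", s)  # at least one space after digits
    s = s.replace("&", " and ")      # unify "&" and "and"
    s = re.sub(r"\s+", " ", s)       # collapse runs of whitespace
    return s.strip().lower()


def keywords(s):
    return set(normalize(s).split(" "))


long_title = "Nikon Coolpix AW130 16MP Point and Shoot Digital Camera Black with 5x Optical Zoom"
short_title = "Nikon Coolpix AW130 16 MP Point & Shoot Camera Black"

# Every keyword of the shorter title also appears in the longer one.
print(keywords(short_title) <= keywords(long_title))
```

Note that this rule also splits model names that have letters after digits (5x becomes 5 x), which may or may not be what you want.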
You don't give enough details about the format of these strings, but from this example I can infer something like this: (\w+) (\d+)\s*MP Point.*
You can then parse the strings and read the groups of the regex to compare the products.
Here is an example:
import scala.util.matching.Regex

def main(args: Array[String]): Unit = {
  val s0 = "Nikon Coolpix AW130 16MP Point and Shoot Digital Camera Black with 5x Optical Zoom"
  val s1 = "Nikon Coolpix AW130 16 MP Point & Shoot Camera Black"
  println(Product.parse(s0) == Product.parse(s1)) // prints true
}
case class Product(name: String, resolution: Int)
object Product {
  private val regex = new Regex("(\\w+) (\\d+)\\s*MP Point.*", "productName", "resolution")
  def parse(s: String) = regex.findFirstMatchIn(s) match {
    case Some(m) => Product(m.group("productName"), m.group("resolution").toInt)
    case None => throw new RuntimeException("Invalid format")
  }
}
I have a list of several phrases in the following format
thisIsAnExampleSentance
hereIsAnotherExampleWithMoreWordsInIt
and I'm trying to end up with
This Is An Example Sentance
Here Is Another Example With More Words In It
Each phrase has the white space condensed and the first letter is forced to lowercase.
Can I use regex to add a space before each A-Z and have the first letter of the phrase be capitalized?
I thought of doing something like
([a-z]+)([A-Z])([a-z]+)([A-Z])([a-z]+) // etc
$1 $2$3 $4$5 // etc
but with 50 records of varying length, my idea is a poor solution. Is there a more dynamic way to do this with a regex? Thanks
A Java fragment I use looks like this (now revised):
result = source.replaceAll("(?<=^|[a-z])([A-Z])|([A-Z])(?=[a-z])", " $1$2");
result = result.substring(0, 1).toUpperCase() + result.substring(1);
This, by the way, converts the string givenProductUPCSymbol into Given Product UPC Symbol, so make sure that is fine for the way you use this type of thing.
Finally, a single line version could be:
result = source.substring(0, 1).toUpperCase() + source.substring(1).replaceAll("(?<=^|[a-z])([A-Z])|([A-Z])(?=[a-z])", " $1$2");
Also, in an example similar to one given in the question comments, the string hiMyNameIsBobAndIWantAPuppy will be changed to Hi My Name Is Bob And I Want A Puppy.
For the space problem, it's easy if your language supports zero-width lookbehind:
var result = Regex.Replace(@"thisIsAnExampleSentanceHereIsAnotherExampleWithMoreWordsInIt", "(?<=[a-z])([A-Z])", " $1");
or, even if it doesn't support it:
var result2 = Regex.Replace(@"thisIsAnExampleSentanceHereIsAnotherExampleWithMoreWordsInIt", "([a-z])([A-Z])", "$1 $2");
I'm using C#, but the regexes should be usable in any language that supports replacement with $1...$n.
But you can't do the lower-to-upper case conversion directly in a regex. You can match the first character with a regex like ^[a-z], but you can't convert it.
For example, in C# you could do
var result4 = Regex.Replace(result, "^([a-z])", m =>
{
    return m.ToString().ToUpperInvariant();
});
using a match evaluator to change the input string.
You could then even fuse the two together:
var result5 = Regex.Replace(@"thisIsAnExampleSentanceHereIsAnotherExampleWithMoreWordsInIt", "^([a-z])|([a-z])([A-Z])", m =>
{
    if (m.Groups[1].Success)
    {
        return m.ToString().ToUpperInvariant();
    }
    else
    {
        return m.Groups[2].ToString() + " " + m.Groups[3].ToString();
    }
});
A Perl example with unicode character support:
s/\p{Lu}/ $&/g;
s/^./\U$&/;
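The two Perl substitutions translate fairly directly to other languages. Below is a minimal Python sketch of the same idea, assuming ASCII capitals only ([A-Z] rather than \p{Lu}); camel_to_title is a made-up name.

```python
import re


def camel_to_title(phrase):
    # Insert a space before every capital letter that is not at the start.
    spaced = re.sub(r"(?<!^)(?=[A-Z])", " ", phrase)
    # Upper-case the first character; slicing keeps empty input safe.
    return spaced[:1].upper() + spaced[1:]


print(camel_to_title("thisIsAnExampleSentance"))
print(camel_to_title("hereIsAnotherExampleWithMoreWordsInIt"))
```

Because every capital gets its own space, an acronym is split letter by letter (givenProductUPCSymbol becomes Given Product U P C Symbol), unlike the Java version above.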