scala-regexp: split string into array of two following words - regex

I need to split a string into an array whose elements are pairs of consecutive words, using Scala:
"Hello, it is useless text. Hope you can help me."
The result:
[[it is], [is useless], [useless text], [Hope you], [you can], [can help], [help me]]
One more example:
"This is example 2. Just\nskip it."
Result:
[[This is], [is example], [Just skip], [skip it]]
I tried this regex:
val re = """[a-zA-Z]+\s[a-zA-Z]+""".r
But the output is:
scala> for (m <- re.findAllIn("Hello, it is useless text. Hope you can help me.")) println(m)
it is
useless text
Hope you
can help
So it ignores some cases.

First split on the punctuation and digits, then split on the spaces, then slide over the results.
def doubleUp(txt :String) :Array[Array[String]] =
  txt.split("[.,;:\\d]+")
     .flatMap(_.trim.split("\\s+").sliding(2))
     .filter(_.length > 1)
usage:
val txt1 = "Hello, it is useless text. Hope you can help me."
doubleUp(txt1)
//res0: Array[Array[String]] = Array(Array(it, is), Array(is, useless), Array(useless, text), Array(Hope, you), Array(you, can), Array(can, help), Array(help, me))
val txt2 = "This is example 2. Just\nskip it."
doubleUp(txt2)
//res1: Array[Array[String]] = Array(Array(This, is), Array(is, example), Array(Just, skip), Array(skip, it))
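For comparison, the same three steps (split on punctuation and digits, split on whitespace, slide a window of two) can be sketched in Python. This is only an illustration of the technique, not a translation endorsed by the answer; the name double_up is made up here.

```python
import re

def double_up(text):
    """Return pairs of adjacent words, never pairing across punctuation or digits."""
    pairs = []
    # Step 1: cut the text into chunks at runs of punctuation or digits.
    for chunk in re.split(r"[.,;:\d]+", text):
        # Step 2: split each chunk into words on any whitespace.
        words = chunk.split()
        # Step 3: slide a window of size two over the words.
        # A chunk with fewer than two words contributes nothing.
        pairs.extend([a, b] for a, b in zip(words, words[1:]))
    return pairs

print(double_up("Hello, it is useless text. Hope you can help me."))
print(double_up("This is example 2. Just\nskip it."))
```

Because zip stops at the shorter sequence, single-word chunks such as "Hello" drop out on their own, which plays the role of the filter(_.length > 1) step in the Scala version.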

First expand any escape sequences in the string with StringContext.processEscapes.
scala> val string = "Hello, it is useless text. Hope you can help me."
val preprocessed = StringContext.processEscapes(string)
//preprocessed: String = Hello, it is useless text. Hope you can help me.
OR
scala> val string = "This is example 2. Just\nskip it."
val preprocessed = StringContext.processEscapes(string)
//preprocessed: String =
//This is example 2. Just
//skip it.
Then filter out the unwanted tokens (empty strings and words with trailing punctuation) and use the sliding function:
val result = preprocessed.split("\\s").filter(e => !e.isEmpty && !e.matches("(?<=^|\\s)[A-Za-z]+\\p{Punct}(?=\\s|$)") ).sliding(2).toList
//scala> res9: List[Array[String]] = List(Array(it, is), Array(is, useless), Array(useless, Hope), Array(Hope, you), Array(you, can), Array(can, help))

You need to use split to break the string down into words separated by non-word characters, and then sliding to double up the words the way you want:
val text = "Hello, it is useless text. Hope you can help me."
text.trim.split("\\W+").sliding(2)
You may also want to remove escape characters, as explained in other answers.

Sorry I only know Python. I heard the two are almost the same. Hope you can understand
string = "it is useless text. Hope you can help me."
split = string.split(' ')  # splits on space (you can use regex for this)
result = []
no = 0
count = len(split)
for x in range(count):
    no += 1
    if no < count:
        pair = split[x] + ' ' + split[no]  # joins the current word to the next one
        result.append(pair)
The output will be:
['it is', 'is useless', 'useless text.', 'text. Hope', 'Hope you', 'you can', 'can help', 'help me.']

Related

Is there a simpler way to count the number of tokens in a string with duplicated delimiters in Kotlin?

I want to count the number of tokens in a string with occasional delimiter duplicates. String.split() doesn't work well in this case because it doesn't deduplicate delimiters. For example:
val msg = "The  cat won't  stay   away  from  the chickens."
val tokens = msg.split(' ')
var count = 0
for(t in tokens)
    if(t != "") count++
tokens = ["The", "", "cat", "won't", "", "stay", "", "", "away", "", "from", "", "the", "chickens."]
tokens.size = 14
count = 8
I'm looking for a simpler way to get to the count of 8 in this example. Maybe there's a regex way.
You can use
val msg = "The  cat won't  stay   away  from  the chickens."
val regex = """\S+""".toRegex()
val tokens = regex.findAll(msg).map{it.value}
println(tokens.joinToString(", "))
// => The, cat, won't, stay, away, from, the, chickens.
See the Kotlin demo.
Here, you extract all non-whitespace text chunks from the given string.
This solution should be preferred, because val tokens = msg.split("""\s+""".toRegex()) (see demo) will produce extra empty items if the string has leading or trailing whitespace.
The splitting approach can work better if you first trim() the string:
val tokens = msg.trim().split("""\s+""".toRegex())
See this Kotlin demo.
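The difference between extracting tokens and splitting on whitespace shows up the same way in Python. The sketch below is an illustration only: the spacing inside msg is reconstructed from the token list in the question, and padded is a made-up variable that adds leading and trailing whitespace.

```python
import re

msg = "The  cat won't  stay   away  from  the chickens."

# Extracting: every run of non-whitespace characters is one token.
tokens = re.findall(r"\S+", msg)
print(len(tokens))  # 8

# Splitting on whitespace runs leaves empty strings at the edges
# when the input has leading or trailing whitespace.
padded = "  " + msg + " "
print(re.split(r"\s+", padded))          # first and last items are ''
print(re.split(r"\s+", padded.strip()))  # trimming first avoids that
```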

converting a list to string and printing it out python

I am trying to convert the first letter of each word of a string to uppercase in Python, but I keep getting a generator object at 0x10315b8 printed instead of the string. No earlier post seems to answer my question.
def capitalize(str):
    newstr = str.split(' ')
    newlist = []
    for word in newstr:
        if word[0][0] == word[0][0].upper():
            newlist.append(word[0][0].upper())
            newlist.append(word[0][1:])
            newlist.append(" ")
    convert_first = (str(w) for w in newlist)
    print(convert_first)

capitalize(input("enter some string"))#calling the function
Your problem lies in how you are trying to make a string out of a list of strings. The opposite of "splitting" a string into a list is "joining" a list into a string.
def capitalize(str):
    newstr = str.split(' ')
    newlist = []
    for word in newstr:
        newlist.append(word[0].upper() + word[1:])
    convert_first = ' '.join(newlist)
    print(convert_first)

capitalize(input("enter some string"))#calling the function
Note: I made an attempt to have my code be as close as possible to that in the question.
Also, why is there an if statement in your code? With that in place you're really just capitalizing all the words that are already capitalized and discarding the rest since they never make it into newlist.
There are a few issues with your code:
What you saw is not really an error message: it is what you get when you print convert_first, which is a generator, not a string.
newstr is a list of words, so word is a string and word[0] is already its first character. Writing word[0][0] or word[0][1:] is therefore pointless: word[0][0] is the same character again, and word[0][1:] is always empty.
if word[0][0] == word[0][0].upper(): only lets through the words whose first character is already uppercase and drops all the others.
So some simple code like this will do what you described:
def capitalize(str):
    newstr = str.split(' ')
    newlist = []
    for word in newstr:
        newlist.append(word[0].upper())
        newlist.append(word[1:])
        newlist.append(" ")
    convert_first = ''.join(w for w in newlist)
    print(convert_first)

capitalize(input("enter some string"))
Or, for those who favor short code and generator expressions:
def capitalize(str):
    print(' '.join(word[0].upper() + word[1:] for word in str.split(' ')))

capitalize(input("enter some string"))
This also removes the trailing space from the generated string, which may or may not be what you intended.
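One caveat with the versions above: they index word[0], and split(' ') yields empty strings when the input contains consecutive spaces, so word[0] raises IndexError on such input. A minimal sketch that avoids this by slicing instead of indexing follows; the name capitalize_words is made up here, and the function returns the result rather than printing it.

```python
def capitalize_words(text):
    # word[:1] is an empty string for an empty word, so runs of
    # spaces do not raise IndexError the way word[0] would.
    return " ".join(word[:1].upper() + word[1:] for word in text.split(" "))

print(capitalize_words("enter some string"))
print(capitalize_words("two  spaces here"))
```

Joining on the same single space that was used for splitting keeps the original spacing intact.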

Splitting string into a list of substrings

I have a string id <- "Hello these are words N12345678 hooray how fun".
I would like to extract just N12345678 from this string.
So far I have used strsplit(id, " "). Now I have
>id
>[[1]]
>[1] "Hello" "these" "are" "words" "N12345678" "hooray" "how"
>[8] "fun"
Which is of type list and of length 1 (despite apparently having 8 elements?)
If I then use id <- id[grep("^[N][0-9]",id)],
id is an empty list.
I think what I need to do is split the string into a list of length 8 with each element as a substring and then grep should be able to pick out the pattern, but I'm not sure how to go about that.
Use regmatches
> regmatches(id, regexpr("N[0-9]+", id))
[1] "N12345678"
If you insist on using strsplit, note that it returns a list whose single element is the character vector of words, so you have to index into it with id[[1]]:
id <- "Hello these are words N12345678 hooray how fun"
id = strsplit(id, " ")
id[[1]][grep("^N[0-9]", id[[1]])]
Notice that I have kept your regex (minus the redundant brackets around N). A more precise expression would be ^N\\d+$.
Do you know about strtok? It will parse your input line on certain characters. For the purpose of my example, I am breaking off a piece of my string every time I hit a space.
tempVar = strtok(string, " ");
// tempVar has "id" or everything up to the first space
while (tempVar != NULL)
{
    tempVar = strtok(NULL, " ");
    // now tempVar picked up the next word, and will loop picking up the next word until the end of string
}
Using this, your "Hello these are words N12345678 hooray how fun" would do this:
tempVar would be "Hello", then "these", and so on.
Each time through the loop tempVar gets a new value, so I would suggest evaluating tempVar inside the loop (before grabbing the next token) so that you can stop when you have N12345678.
Try:
gsub('\\b[a-zA-Z]+\\b','',id)
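The two main ideas in the answers above (match the token directly, or split into words and then filter) can be sketched in Python as follows. This is an illustration of the techniques, not a port of the R code; the word-boundary anchors \b in the first pattern are an addition made here so that the N is not matched in the middle of another word.

```python
import re

text = "Hello these are words N12345678 hooray how fun"

# Approach 1: search the whole string for the pattern directly.
match = re.search(r"\bN[0-9]+\b", text)
code = match.group(0) if match else None
print(code)

# Approach 2: split into words first, then keep the words that
# consist of an N followed only by digits.
codes = [w for w in text.split(" ") if re.fullmatch(r"N[0-9]+", w)]
print(codes)
```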

Combine several list comprehension codes

I have three list comprehensions that trim a given string. They remove words that contain '/', remove the words listed in a set called 'remove_set', and combine consecutive single letters into one word.
import itertools
import re

regex = re.compile(r'.*/.*')
parent = ' '.join([p for p in parent.split() if not regex.match(p)])
remove_set = {'hello', 'corp', 'world'}
parent = ' '.join([i for i in parent.split() if i not in remove_set])
parent = ' '.join((' ' if x else '').join(y) for x, y in itertools.groupby(parent.split(), lambda x: len(x) > 1))
For example:
string = "hello C S people in some corp/llc"
changes to
string = "CS people in some"
Can these commands be written as one beautiful command?
Thanks in advance!
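One possible way to fold the three steps together, offered as a sketch rather than a definitive answer: the two filters can share a single comprehension, and the groupby step can then run on the filtered list directly, so the string is split only once. The names clean and REMOVE_SET are made up here, and the regex is swapped for a plain '/' in w test, which gives the same result for whitespace-free words.

```python
import itertools

REMOVE_SET = {"hello", "corp", "world"}

def clean(parent, remove_set=REMOVE_SET):
    # Steps 1 and 2 in one pass: drop words containing '/' and words in remove_set.
    words = [w for w in parent.split() if "/" not in w and w not in remove_set]
    # Step 3: glue runs of single-character words together, keep longer words spaced.
    groups = itertools.groupby(words, key=lambda w: len(w) > 1)
    return " ".join((" " if is_long else "").join(run) for is_long, run in groups)

print(clean("hello C S people in some corp/llc"))
```

It is still two statements rather than one, which seems a reasonable trade for readability.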

regex how can I split this word?

I have a list of several phrases in the following format
thisIsAnExampleSentance
hereIsAnotherExampleWithMoreWordsInIt
and I'm trying to end up with
This Is An Example Sentance
Here Is Another Example With More Words In It
Each phrase has the white space condensed and the first letter is forced to lowercase.
Can I use regex to add a space before each A-Z and have the first letter of the phrase be capitalized?
I thought of doing something like
([a-z]+)([A-Z])([a-z]+)([A-Z])([a-z]+) // etc
$1 $2$3 $4$5 // etc
but on 50 records of varying length, my idea is a poor solution. Is there a way to regex in a way that will be more dynamic? Thanks
A Java fragment I use looks like this (now revised):
result = source.replaceAll("(?<=^|[a-z])([A-Z])|([A-Z])(?=[a-z])", " $1$2");
result = result.substring(0, 1).toUpperCase() + result.substring(1);
This, by the way, converts the string givenProductUPCSymbol into Given Product UPC Symbol. Make sure this is fine for the way you use this type of thing.
Finally, a single line version could be:
result = source.substring(0, 1).toUpperCase() + source.substring(1).replaceAll("(?<=^|[a-z])([A-Z])|([A-Z])(?=[a-z])", " $1$2");
Also, in an example similar to the one given in the question comments, the string hiMyNameIsBobAndIWantAPuppy will be changed to Hi My Name Is Bob And I Want A Puppy.
For the space problem it's easy if your language supports zero-width look-behind:
var result = Regex.Replace(@"thisIsAnExampleSentanceHereIsAnotherExampleWithMoreWordsInIt", "(?<=[a-z])([A-Z])", " $1");
or even if it doesn't support it:
var result2 = Regex.Replace(@"thisIsAnExampleSentanceHereIsAnotherExampleWithMoreWordsInIt", "([a-z])([A-Z])", "$1 $2");
I'm using C#, but the regexes should be usable in any language that supports replacement using $1...$n.
But the lower-to-upper conversion can't be done directly in the regex. You can match the first character with a regex like ^[a-z], but you can't convert it.
For example in C# you could do
var result3 = Regex.Replace(result, "^([a-z])", m =>
{
    return m.ToString().ToUpperInvariant();
});
using a match evaluator to change the input string.
You could then even fuse the two together
var result4 = Regex.Replace(@"thisIsAnExampleSentanceHereIsAnotherExampleWithMoreWordsInIt", "^([a-z])|([a-z])([A-Z])", m =>
{
    if (m.Groups[1].Success)
    {
        return m.ToString().ToUpperInvariant();
    }
    else
    {
        return m.Groups[2].ToString() + " " + m.Groups[3].ToString();
    }
});
A Perl example with unicode character support:
s/\p{Lu}/ $&/g;
s/^./\U$&/;
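The same idea can be sketched in Python. Python's re module only accepts fixed-width look-behind, so the (?<=^|[a-z]) form from the Java answer cannot be used as is; the pattern below is an adaptation that matches the empty position where a space belongs. The names BOUNDARY and split_camel are made up here.

```python
import re

# A space goes at each lower-to-upper boundary, and before the last
# capital of an acronym when a lowercase letter follows it.
BOUNDARY = re.compile(r"(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")

def split_camel(phrase):
    spaced = BOUNDARY.sub(" ", phrase)
    # Capitalize only the first character; the rest is left untouched.
    return spaced[:1].upper() + spaced[1:]

print(split_camel("thisIsAnExampleSentance"))
print(split_camel("givenProductUPCSymbol"))
```

Since the matches are zero-width, nothing is consumed and no capture groups are needed in the replacement.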