Extract all tokens from string using regex in Scala

Extract all tokens from string using regex in Scala - regex

I have a string like "httpx://__URL__/__STUFF__?param=value"
This sample is a url by convention...it could be anything with zero or more __X__ tokens in it.
I want to use a regex to extract a list of all the tokens, so output here would be List("__URL__","__STUFF__"). Remember, I don't know beforehand how many (if any) tokens may be in the input string.
I've been struggling but unable to come up with a regex expression that will do the trick.
Something like this did not work:
(?:.?(__[a-zA-Z0-9]+__).?)+

Scala Regex, which is just a wrapper around Java Regex, will never return multiple subgroups for repetitions.
The only way about it is to have a regex for the token, and then find it multiple times. You pretty much already have everything you want:
"__[a-zA-Z0-9]+__".r findAllIn "httpx://__URL__/__STUFF__?param=value"
That returns an Iterator. Use .toSeq or similar to convert into a collection.

Greg, have you tried a simple
_+[^_]+_+
This will match all the __TOKENS__
It doesn't do any check for any __TOKENLIKE__ string after the ?params, but you have mentioned you are not only using that for urls. If you need some refinement, please let us know.

Combine a regex with split:
def urlPathComponents(s: String): Option[Array[String]] =
"""(?<=http(s?)://)[^?]+""".r findFirstIn s map (_.split("/"))

Related

Regex ignore first 12 characters from string

I'm trying to create a custom filter in Google Analytic to remove the query parts of the url which I don't want to see. The url has the following structure
[domain]/?p=899:2000:15018702722302::NO:::
I would like to create a regex which skips the first 12 characters (that is until:/?p=899:2000), and what ever is going to be after that replace it with nothing.
So I made this one: https://regex101.com/r/Xgbfqz/1 (which could be simplified to .{0,12}) , but I actually would like to skip those and only let the regex match whatever is going to be after that, so that I'll be able to tell in Google Analytics to replace it with "".
The part in the url that is always the same is
?p=[3numbers]:[0-4numbers]
Thank you

Your regular expression:
\/\?p=\d{3}\:\d{0,4}(.*)
Tested in Golang RegEx 2 and RegEx101
It search for /p=###:[optional:####] and capture the rest of the right side string.
(extra) JavaScript:
paragraf='[domain]/?p=899:2000:15018702722302::NO:::'
var regex= /\/\?p=\d{3}\:\d{0,4}(.*)/;
var match = regex.exec(paragraf);
alert('The rest of the right side of the string: ' + match[1]);

Easily use "[domain]/?p=899:2000:15018702722302::NO:::".substr(12)

You can try this:
/\?p\=\d{3}:\d{0,4}
Which matches just this: ?p=[3numbers]:[0-4numbers]
Not sure about replacing though.
https://regex101.com/r/Xgbfqz/1

Match a string with regular expression

I have a problem to create a regular expression. I have a string like this (a simple json response froma webserver):
{
"key1":valueInt1,
"key2":valueInt2,
[...]
"specialKey":"",
"subtitle":"Every kind of character, like char, num or punct",
"key3":"Useful line",
}
What I want is to delete the keys "specialKey" and its value and "subtitle" with its value. Both of them can be empty (like specialKey in my example).
I have tried something like that:
(\"subtitle\"\:\")([:punct:]*[:space:]*[:word:]*)*(\",)
Without success. The error is the part ([:punct:][:space:][:word:]) to match my random sentence.
Thanks for your help!

JSON is not a regular language so regular expressions are not the best tool for this kind of manipulation. You better use a JSON parser for achieving your goal.

Character classes are usually only used within bracket expressions. If you change what you have into:
(\"subtitle\"\:\")([[:punct:][:space:][:word:]]*)*(\",)
It should work.

Ruby Puppet Regex string matching

I'm somewhat new to ruby and have done a ton of google searching but just can't seem to figure out how to match this particular pattern. I have used rubular.com and can't seem to find a simple way to match. Here is what I'm trying to do:
I have several types of hosts, they take this form:
Sample hostgroups
host-brd0000.localdomain
host-cat0000.localdomain
host-dog0000.localdomain
host-bug0000.localdomain
Next I have a case statement, I want to keep out the bugs (who doesn't right?). I want to do something like this to match the series of characters. However, it starts matching at host-b, host-c, host-d, and matches only a single character as if I did a [brdcatdog].
case $hostgroups { #variable takes the host string up to where the numbers begin
# animals to keep
/host-[["brd"],["cat"],["dog"]]/: {
file {"/usr/bin/petstore-friends.sh":
owner => petstore,
group => petstore,
mode => 755,
source => "puppet:///modules/petstore-friends.sh.$hostgroups",
}
}
I could do something like [bcd][rao][dtg] but it's not very clean looking and will match nonsense like "bad""cot""dat""crt" which I don't want.
Is there a slick way to use \A and [] that I'm missing?
Thanks for your help.
-wootini

How about using negative lookahead?
host-(?!bug).*
Here is the RUBULAR permalink matching everything except those pesky bugs!

Is this what you're looking for?
host-(brd|cat|dog)
(Following gtgaxiola's example, here's the Rubular permalink)

Article spinner with 2 tiers

I made an article spinner that used regex to find words in this syntax:
{word1|word2}
And then split them up at the "|", but I need a way to make it support tier 2 brackets, such as:
{{word1|word2}|{word3|word4}}
What my code does when presented with such a line, is take "{{word1|word2}" and "{word3|word4}", and this is not as intended.
What I want is when presented with such a line, my code breaks it up as "{word1|word2}|{word3|word4}", so that I can use this with the original function and break it into the actual words.
I am using c#.
Here is the pseudo code of how it might look like:
Check string for regex match to "{{word1|word2}|{word3|word4}}" pattern
If found, store each one as "{word1|word2}|{word3|word4}" in MatchCollection (mc1)
Split the word at the "|" but not the one inside the brackets, and select a random one (aka, "{word1|word2}" or "{word3|word4}")
Store the new results aka "{word1|word2}" and "{word3|word4}" in a new MatchCollection (mc2)
Now search the string again, this time looking for "{word1|word2}" only and ignore the double "{{" "}}"
Store these in mc2.
I can not split these up normally
Here is the regex I use to search for "{word1|word2}":
Regex regexObj = new Regex(#"\{.*?\}", RegexOptions.Singleline);
MatchCollection m = regexObj.Matches(originalText); //How I store them
Hopefully someone can help, thanks!
Edit: I solved this using a recursive method. I was building an article spinner btw.

That is not parsable using a regular expression, instead you have to use a recursive descent parser. Map it to JSON by replacing:
{ with [
| with ,
wordX with "wordX" (regex \w+)
Then your input
{{word1|word2}|{word3|word4}}
becomes valid JSON
[["word1","word2"],["word3","word4"]]
and will map directly to PHP arrays when you call json_decode.
In C#, the same should be possible with JavaScriptSerializer.

I'm really not completely sure WHAT you're asking for, but I'll give it a go:
If you want to get {word1|word2}|{word3|word4} out of any occurrence of {{word1|word2}|{word3|word4}} but not {word1|word2} or {word3|word4}, then use this:
#"\{(\{[^}]*\}\|\{[^}]*\})\}"
...which will match {{word1|word2}|{word3|word4}}, but with {word1|word2}|{word3|word4} in the first matching group.
I'm not sure if this will be helpful or even if it's along the right track, but I'll try to check back every once in a while for more questions or clarifications.

s = "{Spinning|Re-writing|Rotating|Content spinning|Rewriting|SEO Content Machine} is {fun|enjoyable|entertaining|exciting|enjoyment}! try it {for yourself|on your own|yourself|by yourself|for you} and {see how|observe how|observe} it {works|functions|operates|performs|is effective}."
print spin(s)
If you want to use the [square|brackets|syntax] use this line in the process function:
'/[(((?>[^[]]+)|(?R))*)]/x',

regular expression: how to ignore rest of the line

I have an input like this (a JSON format)
{"location":[{"id":"1BCDEFGHIJKLM","somename":"abcd","fewname":"xyzland","sid":"","sname":"","regionname":"Zee-Whole","type":"some","siteCode":"","someCode":"ROTXY","fewCode":"NL","pCode":"ROTXY","someid":"1BCDEFGHIJKLM","fewid":"GIC8"},{"id":"7823XYZHMOPRE","somename":"abcd Junction","fewname":"United States","sid":"","sname":"","regionname":"New York","type":"some","siteCode":"","someCode":"USRTJ","fewCode":"US","pCode":"USNWK","someid":"7823XYZHMOPRE","fewid":"7823XYZLMOPRE"},{"id":"799XYZHMOPRE","somename":"abcd-Maasvlakte","fewname":"xyzland","sid":"","sname":"","regionname":"Zee-Whole","type":"some","siteCode":"","someCode":"XYROT","fewCode":"NL","pCode":"","someid":"799XYZHMOPRE","fewid":"OIUOWER348534"}]}
Now, I want to pick up the first "id" value which is 1BCDEFGHIJKLM using regular expressions. I have managed upto the point using
[^({"location":[?{"id":")].{0,12} but this is incomplete. Could some one help how do I ignore the rest of the line after the value 1BCDEFGHIJKLM

Regex isn't the way to do this. Whatever platform you are using, it must have a JSON parser.
That will be your best error-free solution.
Assuming you must use regex, you can grab all the id's using "id":"(.*?)", and take the first match.
I found the following article, which might help you.

While messy, how is your regex incomplete?
It could be shortened to ("id":"([^"]+)") which is more readable, and doesn't limit the ID to twelve characters. If that is beneficial.
If you problem is getting more than one result, most languages have a "g" global switch.
In javascript, the following would return "1BCDEFGHIJKLM":
var firstID = str.match(/"id":"([^"]+)"/)[1]
As match()returns an array, in which [0] is the entire returned string, and [1] the first parenthasis.

Don't have to use regex. In your favourite language, split on commas. Then go through each item, check for "id" and split on colon (:). Get the last element. Eg Python
>>> s
'{"location":[{"id":"1BCDEFGHIJKLM","somename":"abcd","fewname":"xyzland","sid":"","sname":"","regionname":"Zee-Whole","type":"some","siteCode":"","someCode":"ROTXY","fewCode":"NL","pCode":"ROTXY","someid":"1BCDEFGHIJKLM","fewid":"GIC8"},{"id":"7823XYZHMOPRE","somename":"abcd Junction","fewname":"United States","sid":"","sname":"","regionname":"New York","type":"some","siteCode":"","someCode":"USRTJ","fewCode":"US","pCode":"USNWK","someid":"7823XYZHMOPRE","fewid":"7823XYZLMOPRE"},{"id":"799XYZHMOPRE","somename":"abcd-Maasvlakte","fewname":"xyzland","sid":"","sname":"","regionname":"Zee-Whole","type":"some","siteCode":"","someCode":"XYROT","fewCode":"NL","pCode":"","someid":"799XYZHMOPRE","fewid":"OIUOWER348534"}]}'
>>> for i in s.split(","):
... if '"id"' in i:
... print i.split(":")[-1]
... break
...
"1BCDEFGHIJKLM"
Of course, ideally, you should use a dedicated JSON parser.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Extract all tokens from string using regex in Scala - regex

Greg, have you tried a simple _+[^_]+_+ This will match all the TOKENS It doesn't do any check for any TOKENLIKE string after the ?params, but you have mentioned you are not only using that for urls. If you need some refinement, please let us know.

Combine a regex with split: def urlPathComponents(s: String): Option[Array[String]] = """(?<=http(s?)://)[^?]+""".r findFirstIn s map (_.split("/"))

Related

Regex ignore first 12 characters from string

Match a string with regular expression

Ruby Puppet Regex string matching

Article spinner with 2 tiers

regular expression: how to ignore rest of the line

Categories

Resources

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Extract all tokens from string using regex in Scala - regex

Greg, have you tried a simple _+[^_]+_+ This will match all the __TOKENS__ It doesn't do any check for any __TOKENLIKE__ string after the ?params, but you have mentioned you are not only using that for urls. If you need some refinement, please let us know.

Combine a regex with split: def urlPathComponents(s: String): Option[Array[String]] = """(?<=http(s?)://)[^?]+""".r findFirstIn s map (_.split("/"))

Related

Regex ignore first 12 characters from string

Match a string with regular expression

Ruby Puppet Regex string matching

Article spinner with 2 tiers

regular expression: how to ignore rest of the line

Categories

Resources

Greg, have you tried a simple _+[^_]+_+ This will match all the TOKENS It doesn't do any check for any TOKENLIKE string after the ?params, but you have mentioned you are not only using that for urls. If you need some refinement, please let us know.