I would like to understand the patterns they are using for their smileys.
If emoticons are only replaced if surrounded by whitespace or (presumably) at start/end of a line/string, then you can use a series of regexes.
Using this list (taken from http://www.skype-forum.com/ftopic13197.html),...
you can construct these like this:
(?<=^|\s)<<smiley regex>>(?=\s|$)
will match <<smiley regex>> only if it's on its own.
Examples for <<smiley regex>>:
:-?\) :-?\( :-?D 8\)
;\( \(sweat\) :\| :\*
:\$ :\^\) \|-\) \|\(
;\) \]:\) \(talk\) \(yawn\)
\(doh\) :# \(wasntme\) \(party\)
etc. - you'll need to escape a lot of special-meaning characters for use in a regex. Your language might have a re.escape() function for this.
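As a rough illustration of the approach above, here is a minimal Python sketch. The `SMILEYS` list and the `find_smileys` helper are hypothetical names chosen for this example, and the list is only a small subset. Python's `re` module only accepts fixed-width lookbehind, so the `(?<=^|\s)` / `(?=\s|$)` pair is written here as the equivalent `(?<!\S)` / `(?!\S)`.

```python
import re

# Hypothetical subset of the smiley list, written as plain text (not yet escaped).
SMILEYS = [":)", ":-)", ":(", ":-(", ":D", ";(", "(sweat)", ":|", "|-)", "(yawn)"]

# re.escape() takes care of the special-meaning characters such as ( ) | and *.
alternation = "|".join(re.escape(s) for s in sorted(SMILEYS, key=len, reverse=True))

# (?<!\S) = "not preceded by a non-space", (?!\S) = "not followed by a non-space",
# i.e. the smiley must stand on its own or sit at the start/end of the string.
SMILEY_RE = re.compile(rf"(?<!\S)(?:{alternation})(?!\S)")

def find_smileys(text):
    """Return every stand-alone smiley found in text, in order."""
    return SMILEY_RE.findall(text)

found = find_smileys("hi :) there(yawn) (yawn) :-(")
```

Here the `(yawn)` glued to `there` is skipped because it is not surrounded by whitespace.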
I have a csv in following format.
id,mobile
1,02146477474
2,08585377474
3,07646474637
4,02158789566
5,04578599525
I want to add a new column containing just the leading 3 digits of the number for specific prefixes, and the string NOT_VALID for all others. So the result should be:
id,number,provider
1,02146477474,021
2,08585377474,085
3,07646474637,NOT_VALID
4,02158789566,021
5,04578599525,NOT_VALID
I can use the following regex for a single replacement, but I would like to handle all possible conversions in one step, using the UpdateRecord processor.
${field.value:replaceFirst('085[0-9]+','085')}
When I use something like this:
${field.value:replaceFirst('085[0-9]+','085'):or(${field.value:replaceFirst('086[0-9]+','086')})}
This replaces all with false.
NiFi uses Java regex.
Since you are using record processing, this should work for you:
${field.value:replaceFirst('^(021|085)?.*','$1')}
The group (021|085), made optional by ?, captures 021 or 085 at the beginning of the string (^); the trailing .* consumes the rest.
The replacement, $1, is the content of the first group.
PS: Sites like https://regex101.com/ help to understand regexes.
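To see what that pattern does outside NiFi, here is a small Python sketch of the same idea. The `provider` function and the NOT_VALID fallback are assumptions added for illustration; in this Python version the pattern alone leaves an empty string for numbers with any other prefix, so the fallback is applied afterwards.

```python
import re

# Same pattern as above; Python writes the group reference as \1 instead of $1.
PREFIX_RE = re.compile(r'^(021|085)?.*')

def provider(mobile: str) -> str:
    """Return the leading 021/085 prefix, or NOT_VALID for anything else."""
    # An optional group that did not participate is substituted as "" (Python 3.5+).
    prefix = PREFIX_RE.sub(r'\1', mobile, count=1)
    return prefix or "NOT_VALID"

rows = ["02146477474", "08585377474", "07646474637"]
providers = [provider(m) for m in rows]
```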
I'm using the following regex to find URLs in a text file:
/http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+/
It outputs the following:
http://rda.ucar.edu/datasets/ds117.0/.
http://rda.ucar.edu/datasets/ds111.1/.
http://www.discover-earth.org/index.html).
http://community.eosdis.nasa.gov/measures/).
Ideally they would print out this:
http://rda.ucar.edu/datasets/ds117.0/
http://rda.ucar.edu/datasets/ds111.1/
http://www.discover-earth.org/index.html
http://community.eosdis.nasa.gov/measures/
Any ideas on how I should tweak my regex?
Thank you in advance!
UPDATE - Example of the text would be:
this is a test http://rda.ucar.edu/datasets/ds117.0/. and I want this to be copied over http://rda.ucar.edu/datasets/ds111.1/. http://www.discover-earth.org/index.html). http://community.eosdis.nasa.gov/measures/).
This will trim the trailing ) and . characters from your output.
import re
regx= re.compile(r'(?m)[\.\)]+$')
print(regx.sub('', your_output))
This regex also seems to work for extracting the URLs directly from your original sample text:
https?:[\S]*\/(?:\w+(?:\.\w+)?)?
Demo (edited from https?:[\S]*\/)
A Python script might look something like this:
import re
ss = """ this is a test http://rda.ucar.edu/datasets/ds117.0/. and I want this to be copied over http://rda.ucar.edu/datasets/ds111.1/. http://www.discover-earth.org/index.html). http://community.eosdis.nasa.gov/measures/). """
# Match up to the last "/" of each URL, plus an optional file name such as index.html
regx = re.compile(r'https?:[\S]*\/(?:\w+(?:\.\w+)?)?')
for m in regx.findall(ss):
    print(m)
So for the urls you have here:
https://regex101.com/r/uSlkcQ/4
Pattern explanation:
Protocols (e.g. https://)
^[A-Za-z]{3,9}:(?://)
Look for recurring .[-;:&=+\$,\w]+-class (www.sub.domain.com)
(?:[\-;:&=\+\$,\w]+\.?)+
Look for recurring /[\-;:&=\+\$,\w\.]+ (/some.path/to/somewhere)
(?:\/[\-;:&=\+\$,\w\.]+)+
Now, for your special case: ensure that the last character is not a dot or a parenthesis, using negative lookahead
(?!\.|\)).
The full pattern is then
^[A-Za-z]{3,9}:(?://)(?:[\-;:&=\+\$,\w]+\.?)+(?:\/[\-;:&=\+\$,\w\.]+)+(?!\.|\)).
There are a few things to improve or change in your existing regex to allow this to work:
http[s]? can be changed to https?. They're identical. No use putting s in its own character class
[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),] You can shorten this entire thing and combine character classes instead of using | between them. This not only improves performance, but also allows you to combine certain ranges into existing character class tokens. Simplifying this, we get [a-zA-Z0-9$-_#.&+!*\(\),]
We can go one step further: a-zA-Z0-9_ is the same as \w. So we can replace those in the character class to get [\w$-#.&+!*\(\),]. Note that the leftover - now forms the reversed range $-#, which most regex engines reject, so it has to be dealt with next.
In the original regex we have $-_. This creates a range, so it actually includes everything between $ and _ in the ASCII table. This will cause unwanted characters to be matched: $%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_. There are a few options to fix this:
[-\w$#.&+!*\(\),] Place - at the start of the character class
[\w$#.&+!*\(\),-] Place - at the end of the character class
[\w$\-#.&+!*\(\),] Escape - such that you have \- instead
You don't need to escape ( and ) in the character class: [\w$#.&+!*(),-]
[0-9a-fA-F][0-9a-fA-F] You don't need to specify [0-9a-fA-F] twice. Just use a quantifier like so: [0-9a-fA-F]{2}
(?:%[0-9a-fA-F][0-9a-fA-F]) The non-capture group isn't actually needed here, so we can drop it (it adds another step that the regex engine needs to perform, which is unnecessary)
So the result of just simplifying your existing regex is the following:
https?://(?:[$\w#.&+!*(),-]|%[0-9a-fA-F]{2})+
Now you'll notice it doesn't match / so we need to add that to the character class. Your regex was matching this originally because it has an improper range $-_.
https?://(?:[$\w#.&+!*(),/-]|%[0-9a-fA-F]{2})+
Unfortunately, even with this change, it'll still match ). at the end. That's because your regex isn't told to stop matching after /. Even implementing this will now cause it to not match file names like index.html. So a better solution is needed. If you give me a couple of days, I'm working on a fully functional RFC-compliant regex that matches URLs. I figured, in the meantime, I would at least explain why your regex isn't working as you'd expect it to.
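The effect of the accidental range described above can be checked with a short Python sketch. The `matched_chars` helper and the `candidates` list are made up for this illustration; the two character classes are reduced to just the part under discussion.

```python
import re

# The accidental range: every code point from '$' (0x24) up to '_' (0x5F).
accidental = re.compile(r'[$-_]')
# With '-' moved to the end, only the three literal characters $, _ and - match.
literal = re.compile(r'[$_-]')

def matched_chars(pattern, candidates):
    """Return the candidate characters that the character class accepts."""
    return [c for c in candidates if pattern.fullmatch(c)]

candidates = ['$', '_', '-', '/', ':', ')', '.', 'A', '7', 'a']
range_hits = matched_chars(accidental, candidates)
literal_hits = matched_chars(literal, candidates)
```

The range accepts `/`, `:`, `)` and `.` as well, which is why the original pattern happened to match slashes and also swallowed the trailing `).`.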
Thanks all for the responses. A coworker ended up helping me with it. Here is the solution:
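import re
# des holds the text to search; the pattern is the original one from the question
des_links = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', des)
for i in des_links:
    # Drop everything from the last "/" onward
    tmps = "/".join(i.split('/')[0:-1])
    print(tmps)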
I have a URL like the one below and want to use a regex to extract segments like: Id:Reference, Title:dfgdfg, Status.Title:Current Status, CreationDate:Logged...
This is the closest pattern I got [=,][^,]*:[^,]*[,&] but obviously the result is not as expected, any better ideas?
P.S. I'm using [^,] to match any character except , because , will never appear inside a segment.
This is the site I'm using to test the pattern:
http://regexpal.com/
The URL:
http://localhost/site/=powerManagement.power&query=_Allpowers&attributes=Id:Reference,Title:dfgdfg,Status.Title:Current Status,CreationDate:Logged,RaiseUser.Title:标题,_MinutesToBreach&sort_by=CreationDate"
Thanks,
You haven't specified what programming language you use, but almost all will support this:
([\p{L}\.]+):([\p{L}\.]+)
\p{L} matches a letter in any language, provided that your regex engine supports Unicode. RegEx 101.
You can extract the matches via capturing groups if you want.
In python:
import re
# url holds the URL string shown in the question
matchobj = re.match(r"^.*Id:(.*?),Title:(.*?),.*$", url)
Id = matchobj.group(1)
Title = matchobj.group(2)
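If every key:value segment is wanted rather than just Id and Title, one possible sketch in Python is shown below. The `SEGMENT_RE` name and the pattern are assumptions for illustration, not something from the answers above: a segment is taken to start right after = or , and to contain a : before the next , or &.

```python
import re

url = ("http://localhost/site/=powerManagement.power&query=_Allpowers"
       "&attributes=Id:Reference,Title:dfgdfg,Status.Title:Current Status,"
       "CreationDate:Logged,RaiseUser.Title:标题,_MinutesToBreach&sort_by=CreationDate")

# Key: text right after '=' or ',' up to the ':'; value: everything up to the next ',' or '&'.
SEGMENT_RE = re.compile(r'(?<=[=,])([^,:&]+):([^,&]+)')

# findall returns one (key, value) tuple per segment that contains a colon.
segments = SEGMENT_RE.findall(url)
```

Entries without a colon, such as _MinutesToBreach, are skipped by this pattern.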
I have a string like "httpx://__URL__/__STUFF__?param=value"
This sample is a url by convention...it could be anything with zero or more __X__ tokens in it.
I want to use a regex to extract a list of all the tokens, so output here would be List("__URL__","__STUFF__"). Remember, I don't know beforehand how many (if any) tokens may be in the input string.
I've been struggling but unable to come up with a regex expression that will do the trick.
Something like this did not work:
(?:.?(__[a-zA-Z0-9]+__).?)+
Scala Regex, which is just a wrapper around Java Regex, will never return multiple subgroups for repetitions.
The only way around it is to have a regex for a single token, and then find it multiple times. You pretty much already have everything you need:
"__[a-zA-Z0-9]+__".r findAllIn "httpx://__URL__/__STUFF__?param=value"
That returns an Iterator. Use .toSeq or similar to convert into a collection.
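The same "one token pattern, applied repeatedly" idea looks like this in Python, shown only as a comparison sketch (`find_tokens` is a hypothetical helper name). The last line illustrates why a repeated capture group is not enough: it keeps only its final repetition.

```python
import re

# Pattern for a single token; the engine is asked to find it as many times as it occurs.
TOKEN_RE = re.compile(r'__[a-zA-Z0-9]+__')

def find_tokens(text):
    """Return all __X__ tokens in text (an empty list if there are none)."""
    return TOKEN_RE.findall(text)

tokens = find_tokens("httpx://__URL__/__STUFF__?param=value")

# A capture group inside a repetition retains only the last thing it matched.
last_only = re.search(r'(?:(__[a-zA-Z0-9]+__)/?)+', "__URL__/__STUFF__").group(1)
```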
Greg, have you tried a simple
_+[^_]+_+
This will match all the __TOKENS__
It doesn't do any check for any __TOKENLIKE__ string after the ?params, but you have mentioned you are not only using that for urls. If you need some refinement, please let us know.
Combine a regex with split:
def urlPathComponents(s: String): Option[Array[String]] =
  """(?<=http(s?)://)[^?]+""".r findFirstIn s map (_.split("/"))
I'm looking for a RegEx pattern to use in a rereplace() function that will keep URL safe characters, but include UTF-8 characters with accents. For example: ç and ã.
Something like: url = rereplace(local.url, "pattern") etc. I prefer a ColdFusion only solution, but I'm open to using Java too since it's so easy to integrate with CF.
My URL pattern will look like: /posts/[postId]/[title-with-accents-like-ç-and-ã]
I don't know what language you are using. Perl has some utf8 matching, see for example Tatsuhiko Miyagawa's URI::Find::UTF8
This can be done by matching alphanumeric characters using \w and removing everything else.
rereplace(string, "[^\w]", "", "all")
See this answer for reference.
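For comparison only, here is how the same negated-class idea behaves in Python, where \w is Unicode-aware for text strings by default. Whether \w covers accented letters in ColdFusion's rereplace() is not confirmed here, so treat this as a sketch of the technique rather than a drop-in answer. The `keep_url_safe` name is hypothetical, and the hyphen is added to the class as an assumption so that title separators survive.

```python
import re
import unicodedata

def keep_url_safe(title: str) -> str:
    """Remove everything except word characters (incl. accented letters) and '-'."""
    # Compose characters first so that e.g. 'c' + combining cedilla becomes a single 'ç'.
    composed = unicodedata.normalize("NFC", title)
    return re.sub(r'[^\w-]', '', composed)

slug = keep_url_safe("title-with-accents-like-ç-and-ã!?")
```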