Regex to encode a URL with special characters - regex

I need to encode a string that contains special characters such as whitespace, ' and ".
I don't know how to create a regex.
I've tried many solutions but none of them seem to work.
My final objective is to have a string such as "black cat" encoded like this "black%20cat".
EDIT:
Guys, I'm working with a specific piece of software called "Axway policy studio". It has a component where you put in a regex and a string, and in the end you get a boolean output (true or false).

It sounds more like you're trying to encode things to make them appropriate for a URL, which on most platforms does not require you to write your own regex.
For instance, in Python there's urllib.parse.quote for a single string (and urllib.parse.urlencode for a set of query parameters). In JavaScript, there's encodeURI and encodeURIComponent.
TL;DR: Look up urlencode in your language and you'll probably find what you need. Don't bother writing regexes for it unless you really need to.
P.S. Most URL encoding is just replacing characters with % followed by their ASCII hex value (' ' => %20, '!' => %21, ...)
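As a rough illustration of the library route, here is a minimal Python sketch using urllib.parse.quote. The helper name encode_for_url is made up for this example, and passing safe="" is an assumption that you want every reserved character (including "/") encoded.

```python
from urllib.parse import quote


def encode_for_url(text: str) -> str:
    """Percent-encode every character that is not URL-safe."""
    # safe="" means that even "/" gets encoded (an assumption for this sketch).
    return quote(text, safe="")


print(encode_for_url("black cat"))          # black%20cat
print(encode_for_url("it's \"quoted\""))    # it%27s%20%22quoted%22
```

This only shows what the encoded output looks like. A component that just returns true or false for a regex match cannot produce the encoded string by itself.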

Related

Ruby: How can I capture all characters, ignoring whitespace?

I want to capture all word characters, ignoring whitespace in a given string.
str = "Hello there how are you?"
I want the result to be:
"Hellotherehowareyou"
I have tried:
str[/(\w*)*/]
# => "Hello"
…but it returns the first word only. How do I capture all the word characters?
What's Wrong
str[/(\w*)*/] returns only the first matching substring; it does not scan the whole string for matches or remove undesirable characters. You'd be better off using one of the other String methods like #gsub, #tr, #delete, #scan, or #match, depending on what your real intent is.
Use Character Properties or Classes
If you're looking for a robust solution, Ruby character properties or POSIX character classes are probably the way to go. To get the results you provided in your original post, you could use the Unicode-aware \p{Alpha} property. For example:
str.scan(/\p{Alpha}/).join
#=> "Hellotherehowareyou"
Alternatively, if you just want to delete spaces and the question mark, and you don't care about other types of characters, then String#delete may suffice for your specific corpus.
str.delete ' ?'
#=> "Hellotherehowareyou"
If you need a more complex way to select or reject elements from a stream of characters, you could even do something like:
str.chars.select { _1 =~ /\p{Alpha}/ }.join
#=> "Hellotherehowareyou"
There are certainly other approaches, too. The KISS and YAGNI principles probably apply. Meanwhile, choose a solution based on readability and the semantic intent of your code, since most solutions will yield very similar results for your specific example.

Need regex that ignores date/sequence number, but matches the rest of the string

I'm looking for a regex that ignores part of a larger string of text, specifically a date/sequence number. For example, the string looks like this:
E0618456458784NOS REGRESSION COMPANY 5454545455SAL 3-MAAAA2018/2/00192
I would like the regex to ignore the "2018/2/00192" but still match up the rest of the string. In the next file I use the regex for, the date/sequence number might be in a different location and the string may change, but the format will always be the same, meaning "2018/#/#####". I'm using C++ and I've gotten close with this regex (found on this site):
[^2018/2/00192]+
It ignores the date/sequence number, but it also ignores the "0", "1", and "8" at the beginning of the line. I basically don't care about the "2018/2/00192" in the string because I know that's going to change. Everything else, though, I want to match. I'd appreciate any suggestions.
Thanks.
I don't know how you would do this in C++, but the sed equivalent would be
sed -e "s/[0-9]\{4\}\/[0-9]\/[0-9]\{5\}//" text
In programming languages like Perl, you could use \d in place of [0-9]
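For comparison, here is a small Python sketch of the same substitution using \d. The names DATE_SEQ and strip_date_seq are made up for this example, and the pattern assumes the format is always four digits, one digit, then five digits, as in the sample line. The last line also shows why the attempt in the question drops digits elsewhere: a bracket expression is a set of single characters, not a sequence.

```python
import re

# Assumed shape of the date/sequence number: 4 digits / 1 digit / 5 digits.
DATE_SEQ = re.compile(r"\d{4}/\d/\d{5}")


def strip_date_seq(line: str) -> str:
    """Remove the first date/sequence number found in the line."""
    return DATE_SEQ.sub("", line, count=1)


line = "E0618456458784NOS REGRESSION COMPANY 5454545455SAL 3-MAAAA2018/2/00192"
print(strip_date_seq(line))
# E0618456458784NOS REGRESSION COMPANY 5454545455SAL 3-MAAAA

# [^2018/2/00192] excludes the single characters 2, 0, 1, 8, /, 9 wherever
# they occur, so the leading "0", "1" and "8" are skipped as well.
print(re.findall(r"[^2018/2/00192]+", "E0618"))  # ['E', '6']
```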

How to check for certain characters using regex

I am trying to check for the following characters in my string using regex, but based on online tutorials and some questions on SO I haven't been able to figure out a solution so far. Can anyone help? I would really appreciate it.
Here is my string:
0-9~!##$%^&*()_+`-={}[]\|:”;’,./<>?ÀàÂâÄäÆæÇçÉéÈèÊêËëÎîÏïÔôÖöŒœßÙùÛûÜüŸÿ
I also want to allow single and double quotes in my string. Is there a way to do it?
If you just want to match the presence of any of those characters in the string you can just use this.
Updated to include ' and "
/["'\d~!##\$%\^&\*\(\)_\+`\-=\{\}\[\]\|:”;’,\.\/<>\?ÀàÂâÄäÆæÇçÉéÈèÊêËëÎîÏïÔôÖöŒœßÙùÛûÜüŸÿ]/g
This is just a basic character class - http://www.regular-expressions.info/charclass.html
I would suggest you might be better off with a whitelist approach rather than listing every character to match. For example, /[^\w\s"']/g will match anything that is not ", ', _, whitespace, or alphanumeric.
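To make the whitelist idea concrete, here is a minimal Python sketch of the same negated class. The names NOT_ALLOWED and has_disallowed are made up for this example. One assumption to keep in mind: in Python, \w is Unicode-aware for str, so accented letters count as word characters, whereas in JavaScript \w covers only ASCII letters, digits, and _.

```python
import re

# Matches any character that is NOT a word character, whitespace, ' or ".
NOT_ALLOWED = re.compile(r"""[^\w\s"']""")


def has_disallowed(text: str) -> bool:
    """Return True if the text contains a character outside the whitelist."""
    return NOT_ALLOWED.search(text) is not None


print(has_disallowed("plain text with 'quotes'"))  # False
print(has_disallowed("50% off!"))                  # True
print(has_disallowed("Éé"))                        # False (Unicode-aware \w)
```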

How can I remove junk characters with regex?

I have a web application that reads the contents of a web page and parses the sentences using an NLP algorithm. I have been using regex to split the contents into single sentences and then parsing them.
I would like to remove characters like  from my sentences. These characters, I imagine, are because of the HTML encoding.
I obviously cannot use a regex like [^\w\d]+ or its variations because I need the punctuations intact. Of course I could add individual exceptions for each of the punctuation like [^\w\d\.,:]+ and so on, but I would like it if there is an easier way to do this, like probably a character class that knows it is a... funny character?
Any help will be much appreciated. Thanks.
EDIT: The app is built with PHP and I am using a simple file_get_contents() to fetch the HTML data from the site and reading the contents inside <p> tags.
This was mentioned in the comments by @TheGreatCO, but you can create a character class of "special" characters. You can use hex code values to create a range in a character class. So, to match any special character above ASCII 127, you would use this.
[\x80-\xFE]
That would match anything but your most basic characters. For reference, here's an ASCII character table with hex codes.
This page discusses the different ways you can reference special characters in regex.
I found this regex helpful for identifying junk characters in a file using Atom:
[^(\x20-\x7F\p{Sc})]
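A hedged Python sketch of the hex-range idea follows; the question itself uses PHP, so treat this only as an illustration of the pattern. The names JUNK and strip_junk are made up for this example. It assumes that everything outside printable ASCII is junk, which would also strip legitimate accented letters, so it only suits text that is expected to be plain English.

```python
import re

# Anything outside printable ASCII (space through tilde) is treated as junk.
# Note that this also covers tabs and newlines.
JUNK = re.compile(r"[^\x20-\x7E]+")


def strip_junk(sentence: str) -> str:
    """Replace each run of junk with one space, keeping punctuation intact."""
    cleaned = JUNK.sub(" ", sentence)
    # Collapse any doubled spaces left behind by the replacement.
    return re.sub(r" {2,}", " ", cleaned).strip()


print(strip_junk("It works\u00c2\u00a0fine, really."))  # It works fine, really.
```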

Why does URLEncodedFormat() in CFML encode valid URL characters?

What are the reasons behind URLEncodedFormat() escaping valid URL characters?
valid characters:
- _ . ! ~ * " ( )
The CF8 Doc said, "[URLEncodedFormat() escapes] non-alphanumeric characters with equivalent hexadecimal escape sequences." However, why escape valid URL characters?
They are valid, but it seems pretty normal to me that if you ask a programming language to URL-encode a string, it converts all non-alphanumeric chars to their hex equivalents.
ASP's Server.URLEncode() does the same, and PHP's urlencode() does too, except for -, _ and . (which it leaves alone). Note that JavaScript's encodeURIComponent() is less strict: it leaves - _ . ! ~ * ' ( ) as they are and encodes the other non-alphanumeric chars to their hex equivalents.
It is a good idea anyway to encode all non-alphanumeric characters when using user input to form server requests, to prevent anything unexpected from happening.
Is the encoding of valid URL characters causing an error or a problem?
One issue might be that by not doing so, if you embed a link with non-encoded characters in an email, the email software may decide to break the link into two lines.
If you use a fully encoded url though, the chances of this are greatly reduced. Just one way of seeing it though.
I could see at least in the case of " that it would be nice to have it encoded when using the URL as a link in an anchor tag.
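To see how an encoder's choice of "safe" characters changes the output, here is a small Python sketch. It uses urllib.parse.quote only as a stand-in and says nothing about how URLEncodedFormat() is implemented. The variable names are made up for this example.

```python
from urllib.parse import quote

marks = "-_.!*()"

# Strict: only letters, digits and a few unreserved marks are left alone.
strict = quote(marks, safe="")
# Lenient: the caller lists extra characters that should stay untouched.
lenient = quote(marks, safe="!*()")

print(strict)               # -_.%21%2A%28%29
print(lenient)              # -_.!*()
print(quote('"', safe=""))  # %22
```

Both results are valid in a URL; the strict form simply leaves less room for surprises when the value ends up inside HTML attributes or email bodies.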