Using ColdBox on ColdFusion 9.
I have tested this with a form post and a URL parameter. In both cases, I submit the string:
"à á Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö Ø Ù Ú Û Ü ß",
and in both cases I immediately dump the input to the browser, which now looks like:
"à á à à à à à à à à à à à à à à à à à à à à Ã
à à à à Ã"
The meta-tag has: charset=utf-8 and I have also tried charset=iso-8859-1.
It makes no difference.
GetLocale() = en_IE
GetEncoding("url") = UTF-8
GetEncoding("form") = UTF-8
Now, what's interesting is that I built a simple CF page on the same server but outside of the Coldbox framework and the characters display correctly after the form/url post.
In Coldbox, the form and url values are transferred to the RequestCollection (RC). If I dump the RC immediately after the form/url post I see the wrong characters.
Therefore it is starting to look like Coldbox is taking the 'good' characters out of the native url/form scope and putting the 'bad' characters in their place in the RC.
Can anyone suggest where I can look next? Is there a ColdBox setting I should look for? Might it be something else entirely?
UPDATE
I am calling the script with SES-style routing like this:
/index.cfm/Organisation/get/q/à á Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö Ø Ù Ú Û Ü ß
If I call it in a more conventional manner, I get the correct characters to display!
/index.cfm/Organisation/get?q=à á Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö Ø Ù Ú Û Ü ß
Trouble is, the whole application uses SES so I can't start refactoring it. So I need to find something in the SES configuration that makes it go wrong. Weird...
Anybody seen this before?
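One possible explanation, not confirmed anywhere in this thread: the percent-encoded UTF-8 bytes in the path part of the SES URL are being decoded as ISO-8859-1, while the query string is decoded as UTF-8. The Python sketch below only reproduces the symptom. The helper name is hypothetical, and it says nothing about where in ColdBox or the servlet container the decoding actually happens.

```python
from urllib.parse import quote, unquote


def decode_path_segment(segment: str, charset: str) -> str:
    """Percent-decode a URL segment with the given charset (hypothetical helper)."""
    return unquote(segment, encoding=charset)


original = "à á Ã Ä ß"
encoded = quote(original)  # UTF-8 bytes, percent-encoded, as a browser would send them

good = decode_path_segment(encoded, "utf-8")
bad = decode_path_segment(encoded, "iso-8859-1")

print(good)
print(repr(bad))  # each accented letter turns into "Ã" plus a second character
```

If the dumped value looks like `bad`, the place to look is whatever decodes the path info before the routing code sees it.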
Is there a way to require a space before one of the alternatives in an OR group, but only when one of them actually matches? The items can repeat inside the string.
REGEX:
^([A-ZÁÂÉÊÍÓÔÚ][a-záãâéêíóõôúç]+)([(e|da|das|de|do|dos)]*[\s][A-ZÁÂÉÊÍÓÔÚ][a-záãâéêíóõôúç]+)+$
A space is mandatory before any of these items: [(e|da|das|de|do|dos)]
Result I want:
Paulo César Oliveira (this is working)
Antonio Carlos da Silva (must have ONE space before "da")
João da Silva dos Santos e Souza (must have ONE space before "da", "dos" and "e")
You can use
^\p{Lu}\p{Ll}+(?:(?:\s(?:e|d(?:[ao]s|[aeo])))?\s\p{Lu}\p{Ll}+)+$
^[A-ZÁÂÉÊÍÓÔÚ][a-záãâéêíóõôúç]+(?:(?:\s(?:e|d(?:[ao]s|[aeo])))?\s[A-ZÁÂÉÊÍÓÔÚ][a-záãâéêíóõôúç]+)+$
See the regex demo. \p{Lu} and \p{Ll} may be unsupported by your regex engine; in that case, keep using your character classes.
Details:
^ - start of string
\p{Lu}\p{Ll}+ - an uppercase letter followed by one or more lowercase letters
(?:(?:\s(?:e|d(?:[ao]s|[aeo])))?\s\p{Lu}\p{Ll}+)+ - one or more occurrences of the following patterns:
(?:\s(?:e|d(?:[ao]s|[aeo])))? - an optional occurrence of:
\s - a whitespace character
(?:e|d(?:[ao]s|[aeo])) - e, or d followed by as/os or by a, e, or o
\s - a whitespace character
\p{Lu}\p{Ll}+ - an uppercase letter followed by one or more lowercase letters
$ - end of string.
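As a quick check of the pattern described above, here is a minimal Python sketch. It uses the explicit character classes because Python's built-in re module does not support \p{Lu} and \p{Ll}. The constant and function names are illustrative only, and the test names assume precomposed (NFC) accented letters.

```python
import re

# Explicit classes stand in for \p{Lu} / \p{Ll}, which Python's re lacks.
UPPER = "A-ZÁÂÉÊÍÓÔÚ"
LOWER = "a-záãâéêíóõôúç"
NAME = f"[{UPPER}][{LOWER}]+"
PARTICLE = r"(?:e|d(?:[ao]s|[aeo]))"  # e, da, das, de, do, dos

# A name word, then one or more "[ particle] name" groups.
FULL_NAME = re.compile(rf"^{NAME}(?:(?:\s{PARTICLE})?\s{NAME})+$")


def is_valid_name(text: str) -> bool:
    return FULL_NAME.match(text) is not None


for candidate in [
    "Paulo César Oliveira",
    "Antonio Carlos da Silva",
    "João da Silva dos Santos e Souza",
    "João da silva",    # rejected: lowercase surname
    "João  da Silva",   # rejected: two spaces before "da"
]:
    print(candidate, is_valid_name(candidate))
```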
This is my regex so far, which splits on non-alphanumeric characters, including international characters (e.g. Korean, Japanese, or Chinese characters).
title = '[MV] SUNMI(선미) _ 누아르(Noir)'
title.split(/[^a-zA-Z0-9 ']/)
This is the regex to match any international character:
[^\x00-\x7F]+
Which I got from: Regular expression to match non-English characters? Let's assume this is 100% correct (no debating!)
How do I combine these two so I can split on non-alphanumeric characters, excluding international characters? The easy part is done. I just need to combine these regexes somehow.
My expected output would be something like this
["MV", "SUNMI", "선미", "누아르", "Noir"]
TLDR: I want to split on non-alphanumeric characters only (english letters, foreign characters should not be split on)
(?:[^a-zA-Z0-9](?<![^\x00-\x7F]))+
https://regex101.com/r/EDyluc/1
What is not matched (remains from split) is what you want to keep.
Explained:
(?:
[^a-zA-Z0-9] # Not Ascii AlphaNum
(?<! [^\x00-\x7F] ) # Lookbehind: the character just matched is not non-ASCII, i.e. it is ASCII
)+
Let me know if you need a more detailed explanation.
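A minimal Python sketch of the same lookbehind pattern, for anyone who wants to try it outside regex101. The function name is illustrative, and empty strings produced by separators at the start or end of the input are filtered out.

```python
import re

# One or more characters that are not ASCII alphanumerics
# and (via the lookbehind) are themselves ASCII.
SEPARATORS = re.compile(r"(?:[^a-zA-Z0-9](?<![^\x00-\x7F]))+")


def split_title(title: str) -> list[str]:
    # re.split leaves '' at the edges when the string starts or ends with a separator.
    return [part for part in SEPARATORS.split(title) if part]


print(split_title("[MV] SUNMI(선미) _ 누아르(Noir)"))
print(split_title("Top10 hits!"))  # digits are kept inside tokens
```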
So basically you want to split on all ASCII characters that are not letters. You can use this regex, which selects exactly those characters within the ASCII range.
[ -@[-`{-~]+
This regex has a range from space to @, skips the uppercase letters, picks all characters from [ to backtick, skips the lowercase letters, and then picks all characters from { to ~, as can be seen in an ASCII table. Note that the first range includes the digits, so digits act as separators too.
In case you also want to exclude the extended ASCII characters, replace ~ in the regex with ÿ and use the [ -@[-`{-ÿ]+ regex.
Check out this Ruby code:
s = '[MV] SUNMI(선미) _ 누아르(Noir)'
puts s.split(/[ -@\[-`{-~]+/).reject(&:empty?)
Prints:
MV
SUNMI
선미
누아르
Noir
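For comparison, here is a rough Python equivalent of the range-based split, assuming the first range runs from space to @ (so it covers ASCII punctuation and the digits). The names are illustrative. The second call shows the practical difference from a pattern that keeps digits: here digits are treated as separators.

```python
import re

# Every printable ASCII character that is not a letter:
# space..@, [..`, and {..~
ASCII_NON_LETTERS = re.compile(r"[ -@\[-`{-~]+")


def split_on_ascii_non_letters(text: str) -> list[str]:
    # Drop the empty strings re.split produces at the edges.
    return [part for part in ASCII_NON_LETTERS.split(text) if part]


print(split_on_ascii_non_letters("[MV] SUNMI(선미) _ 누아르(Noir)"))
print(split_on_ascii_non_letters("Top10 hits"))  # "10" is a separator here
```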
My regex should return the lines whose last word ends with a consonant.
I wrote:
egrep '[^aeiou]\b$'
but it returns only the lines that do not end in a dot.
I'm a beginner in regex, so I will be grateful if you could help me.
For example, my test file:
Hello world
Hello world.
London is the capital of GB.
Oslo is the capital of Norway
Oslo is the capital of Norway.
Oslo is not a capital of Ukraine.
Now my expression returns:
Hello world
Oslo is the capital of Norway
But it should return:
Hello world
Hello world.
London is the capital of GB.
Oslo is the capital of Norway
Oslo is the capital of Norway.
The problem with your regex is that [^aeiou] matches any character that is not a vowel, which is not necessarily a consonant (a dot or a digit qualifies too). The \b shouldn't be there either. Since you want to ignore punctuation marks, try the following:
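egrep -i '[b-df-hj-np-tv-z]\W*$'
\W means a character that is not [a-zA-Z0-9_]. The -i flag makes the match case-insensitive, so the uppercase B in "GB." counts as a consonant too.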
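To check the consonant pattern against the test file from the question, here is a small Python sketch. The case-insensitive flag is an assumption added so that the line ending in "GB." is matched, and the variable names are illustrative.

```python
import re

# A consonant, then any trailing non-word characters (such as a dot), then end of line.
ENDS_WITH_CONSONANT = re.compile(r"[b-df-hj-np-tv-z]\W*$", re.IGNORECASE)

lines = [
    "Hello world",
    "Hello world.",
    "London is the capital of GB.",
    "Oslo is the capital of Norway",
    "Oslo is the capital of Norway.",
    "Oslo is not a capital of Ukraine.",
]

matching = [line for line in lines if ENDS_WITH_CONSONANT.search(line)]
for line in matching:
    print(line)
```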
I was attempting to replace what I thought was a standard dash using gsub. The code I was testing was:
gsub("-", "ABC", "reported – estimate")
This does nothing, though. I copied and pasted the dash into http://unicodelookup.com/#–/1 and it seems to be an en dash. That site provides the hex, decimal, and other codes for an en dash, and I've been trying to replace the en dash with them but am not having any luck. Suggestions?
(As a bonus, if you can tell me if there is a function to identify special characters that would be helpful).
I'm not sure if SO's code formatting will change the dash format so here is the dash I'm using (–).
You can replace the en-dash by just specifying it in the regex pattern.
gsub("–", "ABC", "reported – estimate")
You can match all hyphens, en- and em-dashes with
gsub("[-–—]", "ABC", "reported – estimate — more - text")
To check if there are non-ascii characters in a string, use
> s = "plus ça change, plus c'est la même chose"
> gsub("[[:ascii:]]+", "", s, perl=T)
[1] "çê"
You will get either an empty result (if the string consists only of ASCII characters) or, as here, the non-ASCII "special" characters.
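The same two ideas, sketched in Python for readers who want to experiment outside R: a character class that covers the hyphen, en dash, and em dash, and a substitution that strips every ASCII character so that only the non-ASCII ones remain. The function names are illustrative.

```python
import re


def replace_dashes(text: str, replacement: str) -> str:
    # The leading "-" in the class is a literal hyphen; the other two
    # characters are the en dash (U+2013) and the em dash (U+2014).
    return re.sub("[-–—]", replacement, text)


def non_ascii_chars(text: str) -> str:
    # Remove every ASCII character; whatever is left is non-ASCII.
    return re.sub(r"[\x00-\x7F]+", "", text)


print(replace_dashes("reported – estimate — more - text", "ABC"))
print(non_ascii_chars("plus ça change, plus c'est la même chose"))
```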
For special character replacement you can use a negated character class.
gsub('[^\\w]+', 'ABC', 'reported - estimate', perl = TRUE) replaces every run of non-word characters (spaces included) with ABC. The [^\w] class matches anything that is not a letter, digit, or underscore.
Good Day,
I have a simple working routine in Perl that swaps two words:
i.e. John Doe -----> Doe John
Here it is:
sub SwapTokens()
{
my ($currentToken) = @_;
$currentToken =~ s/([A-Za-z]+) ([A-Za-z]+)/$2 $1/;
# $currentToken =~ s/(\u\L) (\u\L)/$2 $1/;
return $currentToken;
}
The following usage yields exactly what I want:
print &SwapTokens("John Doe");
But when I uncomment the line '$currentToken =~ s/(\u\L) (\u\L)/$2 $1/;'
I get an error. Am I missing something? It looks like my syntax is correct.
TIA,
coson
\u is not a regex atom that matches an uppercase letter. \L is not a regex atom that matches a number of lowercase letters. You're looking for
s/(\p{Lu}\p{Ll}+) (\p{Lu}\p{Ll}+)/$2 $1/;
\p{Lu} Uppercase letter.
\p{Ll} Lowercase letter.
$ unichars '\p{Lu}' | head -n 5
A U+0041 LATIN CAPITAL LETTER A
B U+0042 LATIN CAPITAL LETTER B
C U+0043 LATIN CAPITAL LETTER C
D U+0044 LATIN CAPITAL LETTER D
E U+0045 LATIN CAPITAL LETTER E
$ unichars '\p{Ll}' | head -n 5
a U+0061 LATIN SMALL LETTER A
b U+0062 LATIN SMALL LETTER B
c U+0063 LATIN SMALL LETTER C
d U+0064 LATIN SMALL LETTER D
e U+0065 LATIN SMALL LETTER E
Perhaps you're looking for something like this:
sub swap_the_words {
my ($processed_string) = @_;
$processed_string =~ s/([A-Z][A-Za-z]+) ([A-Z][A-Za-z]+)/$2 $1/;
return $processed_string;
}
print swap_the_words('John Doe'); # prints Doe John
As for \u and \L, they are good for modifying the string, not the regex. For example, you can slightly alter your script like this:
$processed_string =~ s/([a-z]+) ([a-z]+)/\u\L$2\E \u\L$1\E/i;
...
print swap_the_words('cOsOn hAcKeR'); # Hacker Coson
... so your words are not only swapped, but given the proper case as well. Note, though, that these modifiers are used in the replacement part of s/// operator.
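The split between "pattern" and "replacement" can be seen in other languages too. As a rough Python analogue (not a translation of the Perl internals), the pattern only matches the two words, and the case change happens in a replacement function. The names here are illustrative.

```python
import re

WORD_PAIR = re.compile(r"([A-Za-z]+) ([A-Za-z]+)")


def swap_words(text: str) -> str:
    # Plain swap: backreferences in the replacement, first match only.
    return WORD_PAIR.sub(r"\2 \1", text, count=1)


def swap_and_capitalize(text: str) -> str:
    # The case fix lives in the replacement, much like \u\L...\E in Perl.
    def fix(match: re.Match) -> str:
        return f"{match.group(2).capitalize()} {match.group(1).capitalize()}"

    return WORD_PAIR.sub(fix, text, count=1)


print(swap_words("John Doe"))
print(swap_and_capitalize("cOsOn hAcKeR"))
```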
\L means "lowercase till \E"; i.e., it needs to be followed at some point by \E. You do not have \E in your regex, thus it is not valid; adding \E after each \L gets the script to compile, though I have no idea what you are actually trying to accomplish there.