I want to show a PHP function containing a regex in a code snippet using the listings package (lstlisting environment). TeX gives me several "Package inputenc Error: Invalid UTF-8 byte sequence" errors, and the dollar sign seems to put my TeX code in math mode. The whole document is UTF-8 encoded. Any ideas how to deal correctly with these special characters in the lstlisting environment? Thanks.
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{listings}
\begin{lstlisting}[language=php,label={lis:mylisting}]
public function passes($attribute, $value)
{
return preg_match("/^(?=.*?[A-Z])(?=.*?[a-z])(?=.*?[0-9])(?=.*?[0-9])(?=.*?[#?!#()$%^&*=_{}[\]:;\"'|\\<>,.\/~`±§+-]).{8,255}$/", $value);
}
\end{lstlisting}
The problem is the plus-minus sign (±) and the section sign (§). You can declare them with the literate option:
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{listings}
\begin{document}
\begin{lstlisting}[language=php,label={lis:mylisting},extendedchars=true,literate={±}{{$\pm$}}1 {§}{{\S}}1]
public function passes($attribute, $value)
{
return preg_match("/^(?=.*?[A-Z])(?=.*?[a-z])(?=.*?[0-9])(?=.*?[0-9])(?=.*?[#?!#()$%^&*=_{}[\]:;\"'|\\<>,.\/~`±§+-]).{8,255}$/", $value);
}
\end{lstlisting}
\end{document}
I am trying to use the perl module "RTF::Writer" for strings of text that must be a mix of formats. This is proving more complicated than I anticipated. I am just trying a test at the moment with:
$rtf->paragraph( \'\b', "Name: $name, le\cf1 ng\cf0 th $len" );
but this writes:
{\pard
\b
Name: my_name, le\'061 ng\'060 th 7
\par}
where \'061 should be \cf1 and \'060 should be \cf0.
I then tried to remedy this with a perl 1-liner:
perl -pi -e "s/\'06/\cf/g"
but this made things worse. I do not know what "\^F" represents in vi, but that is what it shows.
It did not matter if I escaped the backslashes or not.
Can anyone explain this behavior, and what to do about it?
Can anyone suggest how to get the RTF::Writer to create the file as desired from the start?
Thanks
\ is a special character in double-quoted string literals. If you want a string that contains \, you need to use \\ in the literal. To create the string \cf1, you need to use "\\cf1". ("\cf" means Ctrl-F, which is to say the byte 06.)
Alternatively, \ is only special if followed by \ or a delimiter in single-quoted string literals. So the string \cf1 could also be created from '\cf1'.
Both produce the string you want, but they don't produce the document you want. That's because there's a second problem.
When you pass a string to RTF::Writer, it's expected to be text to render. But you are passing a string you wanted included as is in the final document. You need to pass a reference to a string if you want to provide raw RTF. \'...', \"..." and \$str all produce a reference to a string.
Fixed:
use RTF::Writer qw( );
my $name = "my_name";
my $rtf = RTF::Writer->new_to_file("greetings.rtf");
$rtf->prolog( 'title' => "Greetings, hyoomon" );
$rtf->paragraph( \'\b', "Name: $name, le", \'\cf1', "ng", \'\cf0', "th".length($name));
$rtf->close;
Output from the call to paragraph:
{\pard
\b
Name: my_name, le\cf1
ng\cf0
th7
\par}
Note that I didn't use the following because it would be a code injection bug:
$rtf->paragraph(\("\\b Name: $name, le\\cf1 ng\\cf0 th".length($name)));
Don't pass text such as the contents of $name using \...; use that for raw RTF only.
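To illustrate the difference between text and raw RTF, here is a minimal Python sketch. The helper rtf_escape_text is hypothetical and is not RTF::Writer's actual implementation; it only assumes that a text-escaping layer backslash-escapes \, { and }, and writes control and 8-bit characters as \'hh. Under that assumption it reproduces the \'061 seen in the question and shows why user data must go through the text path:

```python
def rtf_escape_text(text: str) -> str:
    """Hypothetical text escaper (a sketch, not RTF::Writer's real code)."""
    out = []
    for ch in text:
        code = ord(ch)
        if ch in "\\{}":
            # RTF metacharacters are escaped with a leading backslash.
            out.append("\\" + ch)
        elif code < 0x20 or 0x7F <= code <= 0xFF:
            # Control and 8-bit characters become \'hh hex escapes.
            out.append("\\'%02x" % code)
        else:
            # Everything else (including code points above 0xFF, which this
            # sketch does not handle) is passed through unchanged.
            out.append(ch)
    return "".join(out)


# "\cf1" in a double-quoted Perl string is Ctrl-F (byte 0x06) followed by "1".
from_question = rtf_escape_text("le\x061 ng")

# A name containing RTF syntax: escaped as text it is harmless,
# concatenated into raw RTF it would be interpreted as a command.
name = "my\\b name"
as_text = rtf_escape_text(name)
as_raw = name

print(from_question)
print(as_text)
print(as_raw)
```

The first line printed shows the same \'061 as in the question's output; the escaped name keeps its backslash as visible text, whereas the raw version would hand \b to the RTF reader.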
I have a parsing environment (Marpa::R2::Scanless) that needs to use single Perl regexp character classes to control tokenizing. I've got something to tokenize that doesn't seem to fit any of the existing character classes. So, after digging around in the perlunicode docs, I've come up with the following code, except it doesn't work as expected. I expect to see a row of dots interspersed with all the non-alphanumerics (except parens). Instead, I get a runtime error about not being able to find the character class.
#!/usr/bin/env perl
use 5.018;
use utf8;
local $| = 1;
for my $i (map { chr($_) } 32 .. 127) {
if ($i =~ /\p{Magic::Wow}/) {
print $i;
}
else {
print ".";
}
}
package Magic;
sub Wow {
return <<'MAGIC';
+utf8::Assigned
-utf8::Letter
-utf8::Number
-0028
-0029
MAGIC
}
1;
Any hints, tips, tricks, or suggestions?
Name the sub IsWow and the property Magic::IsWow.
Quoting User-Defined Character Properties in perlunicode:
You can define your own binary character properties by defining subroutines whose names begin with "In" or "Is".
I'm parsing some HTML pages and need to detect any Arabic characters in them.
I've tried various regexes, but no luck.
Does anyone know a working way to do that?
Thanks
Here is the page I'm processing: http://pastie.org/2509936
And my code is:
#!/usr/bin/perl
use LWP::UserAgent;
#MyAgent::ISA = qw(LWP::UserAgent);
# set inheritance
$ua = LWP::UserAgent->new;
$q = 'pastie.org/2509936';;
$request = HTTP::Request->new('GET', $q);
$response = $ua->request($request);
if ($response->is_success) {
if ($response->content=~/[\p{Script=Arabic}]/g) {
print "found arabic";
} else {
print "not found";
}
}
If you're using Perl, you should be able to match on the Unicode script property: /\p{Arabic}/.
If that doesn't work, you'll have to look up the range of Unicode characters for Arabic and test for them with something like /[\x{0600}\x{0601}...\x{06FF}]/.
EDIT (as I have obviously wandered into tchrist's area of expertise). Skip using $response->content, which always returns a raw byte string, and use $response->decoded_content, which applies any decoding hints it gets from the response headers.
The page you are downloading is UTF-8 encoded, but you are not reading it as UTF-8 (in fairness, there are no hints on the page about what the encoding is [update: the server does return the header Content-Type: text/html; charset=utf-8, though]).
You can see this if you examine $response->content:
use List::Util 'max';
my $max_ord = max map{ord}split //, $response->content;
print "max ord of response content is $max_ord\n";
If you get a value less than 256, then you are reading this content in as raw bytes, and your strings will never match /\p{Arabic}/. You must decode the input as UTF-8 before you apply the regex:
use Encode;
my $content = decode('utf-8', $response->content);
# now check $content =~ /\p{Arabic}/
Sometimes (and now I am wading well outside my area of expertise) the page you are loading contains hints about how it is decoded, and $response->content may already be decoded correctly. In that case, the decode call above is unnecessary and may be harmful. See other SO posts on detecting the encoding of an arbitrary string.
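The same effect can be sketched in Python, as an illustration only. Python's built-in re module has no \p{Arabic}, so this sketch assumes the basic Arabic block U+0600 to U+06FF is a good enough stand-in, and it simulates "raw bytes treated as characters" by decoding as Latin-1. The sample word is made up for the example:

```python
import re

# Stand-in for \p{Arabic}: the basic Arabic block only (an assumption).
ARABIC = re.compile("[\u0600-\u06FF]")

# UTF-8 bytes of a short Arabic word, as they would arrive over the wire.
raw_bytes = "\u0645\u0631\u062d\u0628\u0627".encode("utf-8")

# Treating every byte as one character mimics an undecoded byte string.
undecoded = raw_bytes.decode("latin-1")
decoded = raw_bytes.decode("utf-8")

max_ord_undecoded = max(ord(c) for c in undecoded)
found_undecoded = bool(ARABIC.search(undecoded))
found_decoded = bool(ARABIC.search(decoded))

print("max ord of undecoded content:", max_ord_undecoded)
print("match on undecoded content:", found_undecoded)
print("match on decoded content:", found_decoded)
```

Every character of the undecoded string is below 256, so the Arabic class can never match it; only the decoded string matches.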
Just for the record, at least in .NET regexps, you need to use \p{IsArabic}.
I have a Groovy script that converts some very poorly formatted data into XML. This part works fine, but it's also happily passing some characters along that aren't legal in XML. So I'm adding some code to strip these out, and this is where the problem is coming from.
The code that isn't compiling is this:
def illegalChars = ~/[\u0000-\u0008]|[\u000B-\u000C]|[\u000E-\u001F]|[\u007F-\u009F]/
What I'm wondering is, why? What am I doing wrong here? I tested this regex in http://regexpal.com/ and it works as expected, but I'm getting an error compiling it in Groovy:
[ERROR] BUILD ERROR
[INFO] ------------------------------------------------------------------------
[INFO] line 23:26: unexpected char: 0x0
The line above is line 23. The surrounding lines are just variable declarations that I haven't changed while working on the regex.
Thanks!
Update:
The code compiles, but it's not filtering as I'd expected it to.
In regexpal I put the regex:
[\u0000-\u0008\u000B-\u000C\u000E-\u001F\u007F-\u009F]
and the test data:
name='lang'>E</field><field name='title'>CHEMICAL IMMUNOLOGY AND ALLERGY</field></doc>
<doc><field name='page'>72-88</field><field name='shm'>3146.757500</field><field
name='pubc'>47</field><field name='cs'>1</field><field name='issue'>NUMBER</field>
<field name='auth'>Dvorak, A.</field><field name='pub'>KARGER</field><field
name='rr'>GBP013.51</field><field name='issn'>1660-2242</field><field
name='class1'>TS</field><field name='freq'>S</field><field
name='class2'>616.079</field><field name='text'>Subcellular Localization of the
Cytokines, Basic Fibroblast Growth Factor and Tumor Necrosis Factor- in Mast
Cells</field><field name='id'>RN170369808</field><field name='volume'>VOL 85</field>
<field name='year'>2005</field><field name='lang'>E</field><field
name='title'>CHEMICAL IMMUNOLOGY AND ALLERGY</field></doc><doc><field
name='page'>89-97</field><field name='shm'>3146.757500</field><field
name='pubc'>47</field><field name='cs'>1</field><field
It's a grab from a file with one of the illegal characters, so it's a little random. regexpal highlights only the illegal character, but Groovy replaces even the '<' and '>' characters, so it's basically annihilating the entire document.
The code snippet:
def List parseFile(File file){
println "reading File name: ${file.name}"
def lineCount = 0
List data = new ArrayList()
file.eachLine {
String input ->
lineCount ++
String line = input
if(input =~ illegalChars){
line = input.replaceAll(illegalChars, " ")
}
Map document = new HashMap()
elementNames.each(){
token ->
def val = getValue(line, token)
if(val != null){
if(token.equals("ISSUE")){
List entries = val.split(";")
document.putAt("year",entries.getAt(0).trim())
if(entries.size() > 1){
document.putAt("volume", entries.getAt(1).trim())
}
if(entries.size() > 2){
document.putAt("issue", entries.getAt(2).trim())
}
} else {
document.putAt(token, val)
}
}
}
data.add(document)
}
println "done"
return data
}
I don't see any reason that the two should behave differently; am I missing something?
Again, thanks!
line 23:26: unexpected char: 0x0
This error message points to this part of the code:
def illegalChars = ~/[\u0000-...
12345678901234567890123
It looks like for some reason the compiler doesn't like having the Unicode character 0 in the source code. That said, you should be able to fix this by doubling the backslash. This prevents the Unicode escapes from being processed at the source code level and lets the regex engine handle them instead:
def illegals = ~/[\\u0000-\\u0008\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F]/
Note that I've also combined the character classes into one instead of using alternation. I've also removed the range notation where it's not necessary.
References
regular-expressions.info/Character Classes
On doubling the slash
Here's the relevant quote from java.util.regex.Pattern
Unicode escape sequences such as \u2014 in Java source code are processed as described in JLS 3.3. Such escape sequences are also implemented directly by the regular-expression parser so that Unicode escapes can be used in expressions that are read from files or from the keyboard. Thus the strings "\u2014" and "\\u2014", while not equal, compile into the same pattern, which matches the character with hexadecimal value 0x2014.
To illustrate, in Java:
System.out.println("\n".matches("\\u000A")); // prints "true"
However:
System.out.println("\n".matches("\u000A"));
// DOES NOT COMPILE!
// "String literal is not properly closed by a double-quote"
This is because \u000A, which is the newline character, is escaped in the second snippet at the source code level. The source code essentially becomes:
System.out.println("\n".matches("
"));
// DOES NOT COMPILE!
// "String literal is not properly closed by a double-quote"
This is not legal Java source code.
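Python has the same two layers of escape processing, so the quoted point can be sketched there too. This is only an analogy to the Java behaviour described above: in a normal Python string literal the compiler resolves \u2014 itself, while with a doubled backslash the six characters reach the re module, which resolves the escape on its own.

```python
import re

source_level = "\u2014"    # the compiler turns this into one character
regex_level = "\\u2014"    # six characters; the regex engine decodes them

target = "\u2014"

same_string = source_level == regex_level
source_matches = re.fullmatch(source_level, target) is not None
regex_matches = re.fullmatch(regex_level, target) is not None

# Unlike Java, a \u000A inside a Python literal does not break the source,
# but the doubled form still works at the regex level.
newline_matches = re.fullmatch("\\u000A", "\n") is not None

print(len(source_level), len(regex_level), same_string)
print(source_matches, regex_matches, newline_matches)
```

The two strings are not equal, yet both compile to a pattern that matches the same single character.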
Try this Regular Expression to remove unicode char from the string :
/*\\u([0-9]|[a-fA-F])([0-9]|[a-fA-F])([0-9]|[a-fA-F])([0-9]|[a-fA-F])/
OK here's my finding:
>>> print "XYZ".replaceAll(
/[\\u0000-\\u0008\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F]/,
"-"
)
---
>>> print "X\0YZ".replaceAll(
/[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F-\u009F]/,
"-"
)
X-YZ
>>> print "X\0YZ".replaceAll(
"[\\u0000-\\u0008\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F]",
"-"
)
X-YZ
In other words, my \\uNNNN answer within /pattern/ is WRONG. What happens is that the backslash is taken literally, so 0-\ becomes a range inside the character class, and that range includes <, > and all capital letters.
The \\uNNNN only works in "pattern", not in /pattern/.
I will edit my official answer based on comments to this "answer".
Related questions
How to escape Unicode escapes in Groovy’s /pattern/ syntax
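The accidental range can be reproduced outside Groovy. The following Python sketch is an illustration under the assumption that a raw string with doubled backslashes hands the regex engine the same characters that Groovy's /pattern/ does; in both cases \\ is a literal backslash and the class ends up containing the range 0-\.

```python
import re

# Doubled backslashes reach the regex engine: "\\" is a literal backslash,
# so the class contains the ranges 0-\, E-\ and F-\ plus a few literals.
literal_backslash_class = re.compile(
    r"[\\u0000-\\u0008\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F]"
)

# Single backslashes: the regex engine sees real \uNNNN escapes.
real_escape_class = re.compile(
    r"[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F-\u009F]"
)

wrong_letters = literal_backslash_class.sub("-", "XYZ")
wrong_brackets = literal_backslash_class.sub("-", "<>")
right_control = real_escape_class.sub("-", "X\0YZ")
right_markup = real_escape_class.sub("-", "<XYZ>")

print(wrong_letters, wrong_brackets)
print(right_control, right_markup)
```

The first class wipes out capital letters and angle brackets, which is what destroyed the XML; the second touches only the control character.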
Try:
def illegalChars = ~/[\u0001-\u0008]|[\u000B-\u000C]|[\u000E-\u001F]|[\u007F-\u009F]/
I've written a URL validator for a project I am working on. For my requirements it works great, except that it breaks when the last part of the URL is 22 characters or longer. My expression:
/((https?):\/\/)([^\s.]+.)+([^\s.]+)(:\d+\/\S+)/i
It expects input that looks like "http(s)://hostname:port/location".
When I give it the input:
https://demo10:443/111112222233333444445
it works, but if I pass the input
https://demo10:443/1111122222333334444455
it breaks. You can test it out easily at http://ryanswanson.com/regexp/#start. Oddly, I can't reproduce the problem with just the relevant (I would think) part /(:\d+\/\S+)/i. I can have as many characters after the required / and it works great. Any ideas or known bugs?
Edit:
Here is some code for a sample application that demonstrates the problem:
<mx:Application xmlns:mx="http://www.adobe.com/2006/mxml" layout="absolute">
<mx:Script>
<![CDATA[
private function click():void {
var value:String = input.text;
var matches:Array = value.match(/((https?):\/\/)([^\s.]+.)+([^\s.]+)(:\d+\/\S+)/i);
if(matches == null || matches.length < 1 || matches[0] != value) {
area.text = "No Match";
}
else {
area.text = "Match!!!";
}
}
]]>
</mx:Script>
<mx:TextInput x="10" y="10" id="input"/>
<mx:Button x="178" y="10" label="Button" click="click()"/>
<mx:TextArea x="10" y="40" width="233" height="101" id="area"/>
</mx:Application>
I debugged your regular expression on RegexBuddy and apparently it takes millions of steps to find a match. This usually means that something is terribly wrong with the regular expression.
Look at ([^\s.]+.)+([^\s.]+)(:\d+\/\S+).
1- It seems like you're trying to match subdomains too, but it doesn't work as intended since you didn't escape the dot. If you escape it, demo10:443/123 won't match because it'll need at least one dot. Change ([^\s.]+\.)+ to ([^\s.]+\.)* and it'll work.
2- [^\s.]+ is a bad character class here: it will match the whole rest of the string and start backtracking from there. You can avoid this by using [^\s:.], which will stop at the colon.
This one should work as you want:
https?:\/\/([^\s:.]+\.)*([^\s:.]+):\d+\/\S+
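As a quick check of the corrected pattern, here is a small Python sketch. It assumes Python's re module is close enough to the Flex engine for this pattern; the helper name is_valid and the third and fourth test URLs are made up for illustration. fullmatch stands in for the matches[0] != value comparison in the question.

```python
import re

# Corrected pattern from above; the slashes need no escaping in Python.
FIXED = re.compile(r"https?://([^\s:.]+\.)*([^\s:.]+):\d+/\S+", re.IGNORECASE)


def is_valid(url: str) -> bool:
    """Return True if the whole string looks like http(s)://hostname:port/location."""
    return FIXED.fullmatch(url) is not None


results = {
    url: is_valid(url)
    for url in (
        "https://demo10:443/111112222233333444445",
        "https://demo10:443/1111122222333334444455",
        "http://sub.demo10.example:8080/path",  # hypothetical host with dots
        "https://demo10/path",                  # no port, should be rejected
    )
}

for url, ok in results.items():
    print(ok, url)
```

Because [^\s:.] cannot run past the colon, the host part is matched in one pass and the length of the location no longer matters.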
This is a bug, either in Ryan's implementation or within Flex/Flash.
The regular expression syntax used above (less the surrounding slashes and flags) is also valid in Python, which gives the following output:
# the case-insensitive flag is omitted because it doesn't matter in this case
>>> import re
>>> rx = re.compile(r'((https?):\/\/)([^\s.]+.)+([^\s.]+)(:\d+\/\S+)')
>>> print(rx.match('https://demo10:443/1111122222333334444455').groups())
('https://', 'https', 'demo1', '0', ':443/1111122222333334444455')