Perl: replace binary content - regex

I need to replace a binary content with some other content (textual or binary) in Perl.
If the content is textual I can use s/// to replace content.
my $txt = "tomturbo";
$txt =~ s/t/T/g; # TomTurbo
But that is not working if the content is binary.
Is there an easy way to handle that??
Thanks!

Perl's regex engine will work with arbitrary strings.
s/t/T/g
is no different than
s/\x{74}/\x{54}/g
meaning it will change the value of characters from 7416 to 5416. The regex engine doesn't care whether value 7416 means t or something else.
In binary data, character A16 doesn't indicate the end of a line, so you'll want to use the s flag, avoid the m flag, and avoid using $.
Furthermore, you'll find the i flag and the built-in character class (e.g. \d, \p{Letter}, etc) useless unless you are matching against strings of decoded text (strings of Unicode Code Points), but you may otherwise use any feature of the regex engine.

Related

Case insensitive regex for non-English characters

I need to perform a regular expression match on text that includes non-English characters (Spanish, French, German, and Russian).
I want the match to ignore case, so with English characters I would just use the /i modifier, but that doesn't work with words like übermäßig.
What is the simplest way to write a regex that will match both, say, übermäßig and ÜBERMÄßig? And can the same approach be used to convert upper case non-English letters to their lowercase equivalents in Perl?
It works perfectly fine
$ perl -E'use utf8; say "ÜBERMÄẞIG" =~ /^übermäßig\z/i ? "match" : "no match"'
match
$ perl -E'use utf8; say "ÜBERMÄSSIG" =~ /^übermäßig\z/i ? "match" : "no match"'
match
(The use utf8; says the source code is encoded using UTF-8. It would be impossible to have those characters in the script any other way.)
I suspect an encoding problem, meaning you think you gave Perl "ß" when you didn't. It could also be that you're using an older version of Perl that can't handle multi-char folds correctly. Generally speaking, it could help to use /u, but it shouldn't make a difference for this example.
The /i modifier works nicely if the strings use Perl's internal encoding.
For example, this prints "yes":
perl -le 'use utf8; print "yes" if "ÜBERMäßig" =~ /überMÄßiG/i'
The "use utf8" tells Perl that my source code is encoded in UTF-8, and therefore Perl decodes all literal strings in my source code from UTF-8 into its internal encoding. This example will not work without use utf8.
If your strings come from somewhere else then you may need to apply Encode::decode -- or tell your source to generate properly decoded strings (e.g. possible with most DBI drivers).
It works for me. Do you need to use utf8;, maybe?
(Disclaimer: I don't know Perl.)
If you set the locale to the appropriate value in your Perl script, then the /i modifier will work on non-English characters--as will other features like regex matching of word boundaries and the uc and lc functions.
Note that if you need to handle multiple foreign character sets, the linked documentation shows you how to switch locales within your script as needed, using setlocale().
Edit: I should have mentioned that this method is deprecated in most cases. Things should just work with UTF-8. But it can still be useful sometimes.
use locale;
use POSIX qw(locale_h);
setlocale (LC_ALL, $locale{German}) or die "failed to load locale!";

How to represent Unicode characters in an ASCII regex pattern?

RegEx flavor: wxRegEx in C++.
One of the strings that I need to match contains characters like '…' (U+2026, Horizontal Ellipsis) which translates to \205 when pasted to Emacs and '»' (U+00BB, Right-Pointing Double Angle Quotation Mark) which remains » when pasted to Emacs (ASCII source code mode).
In the regex pattern itself I tried representing '…' as both \205 and \\205 to no avail.
What is the right way of approaching this problem?
Update: The wxRegEx documentation states that to represent a Unicode character you use \uwxyz (where wxyz is exactly four hexadecimal digits) the Unicode character U+wxyz in the local byte ordering.
I tried that, but for some reason it doesn't work for me (yet).
It depends on the language. In many languages there’s no need to escape non-ASCII, but you may have to tell the compiler what encoding the source is in. For example:
$ java -encoding UTF-8 SomeThing.java
or
$ perl -Mutf8 somescript
Although with things like Perl, Python, and Ruby, you can put the declaration inside the file, providing it’s upwards compatible with ASCII. For example:
#!/usr/bin/perl
use utf8;
use strict;
use warnings;
use autodie;
my $s = "Où se trouve mon élève?";
if ($s =~ /élève/) { ... }
# although of course this also works fine:
while ($s =~ /\b(\w+)\b/g) {
print "Found <$1>\n";
}
That’s the easiest way to do it, and I highly recommend it: just put the real UTF-8 characters in your source code. If you have to figure out to escape things, well, it’s far less convenient.
If you are going to use escapes, well, how you specify non-ASCII symbolically also varies by language. In Java you can use the asquerous Java preprocessor via \uXXXX:
String s = "e\u0301le\u0300ve";
although I do not recommend that way. If it’s going to be used in a pattern, you can delay interpolation, which is cleaner and messier at the same time:
String s = "e\\u0301le\\u0300ve";
That second mechanism spares you from the trying to figure out what it is after the Java preprocessor has its way with it (you can’t use \u0022 but can use \\0022), but then it screws up your Pattern.CANON_EQ flag.
Most other languages have a more straightforward way to do it that Java — which also insists on ugly UTF-16 unless you use java -encoding UTF-8 for your source. Hardcoding UTF-16 surrogates is absolutely idiotic. Do not do it!!
In Perl you could use:
my $s = "e\x{301}le\x{300}ve"; # NFD form
my $s = "\xE9l\xE8ve"; # NFC form
but you can also name them symbolically
use charnames qw< :full >;
my $s_as_NFD = "e\N{COMBINING ACUTE ACCENT}le\N{COMBINING GRAVE ACCENT}e";
my $s_as_NFC = "\N{LATIN SMALL LETTER E WITH ACUTE}l\N{LATIN SMALL LETTER E WITH GRAVE}ve";
The last one can be made much shorter if you’d prefer:
use charnames qw< :full latin >;
my $s_as_NFC = "\N{e WITH ACUTE}l\N{e WITH GRAVE}ve";
All of those are just about infinitely superior to hardcoding magic numbers into your code.
This all assumes your language supports Unicode, but many do not.

find all text before using regex

How can I use regex to find all text before the text "All text before this line will be included"?
I have includes some sample text below for example
This can include deleting, updating, or adding records to your database, which would then be reflex.
All text before this line will be included
You can make this a bit more sophisticated by encrypting the random number and then verifying that it is still a number when it is decrypted. Alternatively, you can pass a value and a key instead.
Starting with an explanation... skip to end for quick answers
To match upto a specific piece of text, and confirm it's there but not include it with the match, you can use a positive lookahead, using notation (?=regex)
This confirms that 'regex' exists at that position, but matches the start position only, not the contents of it.
So, this gives us the expression:
.*?(?=All text before this line will be included)
Where . is any character, and *? is a lazy match (consumes least amount possible, compared to regular * which consumes most amount possible).
However, in almost all regex flavours . will exclude newline, so we need to explicitly use a flag to include newlines.
The flag to use is s, (which stands for "Single-line mode", although it is also referred to as "DOTALL" mode in some flavours).
And this can be implemented in various ways, including...
Globally, for /-based regexes:
/regex/s
Inline, global for the regex:
(?s)regex
Inline, applies only to bracketed part:
(?s:reg)ex
And as a function argument (depends on which language you're doing the regex with).
So, probably the regex you want is this:
(?s).*?(?=All text before this line will be included)
However, there are some caveats:
Firstly, not all regex flavours support lazy quantifiers - you might have to use just .*, (or potentially use more complex logic depending on precise requirements if "All text before..." can appear multiple times).
Secondly, not all regex flavours support lookaheads, so you will instead need to use captured groups to get the text you want to match.
Finally, you can't always specify flags, such as the s above, so may need to either match "anything or newline" (.|\n) or maybe [\s\S] (whitespace and not whitespace) to get the equivalent matching.
If you're limited by all of these (I think the XML implementation is), then you'll have to do:
([\s\S]*)All text before this line will be included
And then extract the first sub-group from the match result.
(.*?)All text before this line will be included
Depending on what particular regular expression framework you're using, you may need to include a flag to indicate that . can match newline characters as well.
The first (and only) subgroup will include the matched text. How you extract that will again depend on what language and regular expression framework you're using.
If you want to include the "All text before this line..." text, then the entire match is what you want.
This should do it:
<?php
$str = "This can include deleting, updating, or adding records to your database, which would then be reflex.
All text before this line will be included
You can make this a bit more sophisticated by encrypting the random number and then verifying that it is still a number when it is decrypted. Alternatively, you can pass a value and a key instead.";
echo preg_filter("/(.*?)All text before this line will be included.*/s","\\1",$str);
?>
Returns:
This can include deleting, updating, or adding records to your database, which would then be reflex.

(grep) Regex to match non-ASCII characters?

On Linux, I have a directory with lots of files. Some of them have non-ASCII characters, but they are all valid UTF-8. One program has a bug that prevents it working with non-ASCII filenames, and I have to find out how many are affected. I was going to do this with find and then do a grep to print the non-ASCII characters, and then do a wc -l to find the number. It doesn't have to be grep; I can use any standard Unix regular expression, like Perl, sed, AWK, etc.
However, is there a regular expression for 'any character that's not an ASCII character'?
This will match a single non-ASCII character:
[^\x00-\x7F]
This is a valid PCRE (Perl-Compatible Regular Expression).
You can also use the POSIX shorthands:
[[:ascii:]] - matches a single ASCII char
[^[:ascii:]] - matches a single non-ASCII char
[^[:print:]] will probably suffice for you.**
No, [^\x20-\x7E] is not ASCII.
This is real ASCII:
[^\x00-\x7F]
Otherwise, it will trim out newlines and other special characters that are part of the ASCII table!
You could also to check this page: Unicode Regular Expressions, as it contains some useful Unicode characters classes, like:
\p{Control}: an ASCII 0x00..0x1F or Latin-1 0x80..0x9F control character.
You can use this regex:
[^\w \xC0-\xFF]
Case ask, the options is Multiline.
[^\x00-\x7F] and [^[:ascii:]] miss some control bytes so strings can be the better option sometimes. For example cat test.torrent | perl -pe 's/[^[:ascii:]]+/\n/g' will do odd things to your terminal, where as strings test.torrent will behave.
To Validate Text Box Accept Ascii Only use this Pattern
[\x00-\x7F]+
I use [^\t\r\n\x20-\x7E]+ and that seems to be working fine.
You don't really need a regex.
printf "%s\n" *[!\ -~]*
This will show file names with control characters in their names, too, but I consider that a feature.
If you don't have any matching files, the glob will expand to just itself, unless you have nullglob set. (The expression does not match itself, so technically, this output is unambiguous.)
This turned out to be very flexible and extensible.
$field =~ s/[^\x00-\x7F]//g ; # thus all non ASCII or specific items in question could be cleaned. Very nice either in selection or pre-processing of items that will eventually become hash keys.

regex implementation to replace group with its lowercase version

Is there any implementation of regex that allow to replace group in regex with lowercase version of it?
If your regex version supports it, you can use \L, like so in a POSIX shell:
sed -r 's/(^.*)/\L\1/'
In Perl, you can do:
$string =~ s/(some_regex)/lc($1)/ge;
The /e option causes the replacement expression to be interpreted as Perl code to be evaluated, whose return value is used as the final replacement value. lc($x) returns the lowercased version of $x. (Not sure but I assume lc() will handle international characters correctly in recent Perl versions.)
/g means match globally. Omit the g if you only want a single replacement.
If you're using an editor like SublimeText or TextMate1, there's a good chance you may use
\L$1
as your replacement, where $1 refers to something from the regular expression that you put parentheses around. For example2, here's something I used to downcase field names in some SQL, getting everything to the right of the 'as' at the end of any given line. First the "find" regular expression:
(as|AS) ([A-Za-z_]+)\s*,$
and then the replacement expression:
$1 '\L$2',
If you use Vim (or presumably gvim), then you'll want to use \L\1 instead of \L$1, but there's another wrinkle that you'll need to be aware of: Vim reverses the syntax between literal parenthesis characters and escaped parenthesis characters. So to designate a part of the regular expression to be included in the replacement ("captured"), you'll use \( at the beginning and \) at the end. Think of \ as—instead of escaping a special character to make it a literal—marking the beginning of a special character (as with \s, \w, \b and so forth). So it may seem odd if you're not used to it, but it is actually perfectly logical if you think of it in the Vim way.
1 I've tested this in both TextMate and SublimeText and it works as-is, but some editors use \1 instead of $1. Try both and see which your editor uses.
2 I just pulled this regex out of my history. I always tweak regexen while using them, and I can't promise this the final version, so I'm not suggesting it's fit for the purpose described, and especially not with SQL formatted differently from the SQL I was working on, just that it's a specific example of downcasing in regular expressions. YMMV. UAYOR.
Several answers have noted the use of \L. However, \E is also worth knowing about if you use \L.
\L converts everything up to the next \U or \E to lowercase. ... \E turns off case conversion.
(Source: https://www.regular-expressions.info/replacecase.html )
So, suppose you wanted to use rename to lowercase part of some file names like this:
artist_-_album_-_Song_Title_to_be_Lowercased_-_MultiCaseHash.m4a
artist_-_album_-_Another_Song_Title_to_be_Lowercased_-_MultiCaseHash.m4a
you could do something like:
rename -v 's/^(.*_-_)(.*)(_-_.*.m4a)/$1\L$2\E$3/g' *
In Perl, there's
$string =~ tr/[A-Z]/[a-z]/;
Most Regex implementations allow you to pass a callback function when doing a replace, hence you can simply return a lowercase version of the match from the callback.