Perl decimal to ASCII conversion - regex

I am pulling SNMP information from an F5 LTM, and storing this information in a psql database. I need help converting the returned data in decimal format into ASCII characters.
Here is an example of the information returned from the SNMP request:
iso.3.6.1.4.1.3375.2.2.10.2.3.1.9.10.102.111.114.119.97.114.100.95.118.115 = Counter64: 0
In my script, I need to identify the different sections of this information:
my ($prefix, $num, $char_len, $vs) = ($oid =~ /($vsTable)\.(\d+)\.(\d+)\.(.+)/);
This gives me the following:
(prefix= .1.3.6.1.4.1.3375.2.2.10.2.3.1)
(num= 9 )
(char-len= 10 )
(vs= 102.111.114.119.97.114.100.95.118.115)
The variable $vs is the Object name in decimal format. I would like to convert this to ASCII characters (which should be "forward_vs").
Does anyone have a suggestion on how to do this?

Given that this is related to interpreting SNMP data, it seems logical to me to use one or more of the SNMP modules available from CPAN. You have to know quite a lot about SNMP to determine when the string you quote stops being the identifier (prefix) and starts to be the value. You have a better chance of getting a general solution with SNMP code than with hand-hacked code.

Jonathan Leffler has the right answer, but here are a couple of things to expand your Perl horizons:
use v5.10;
$_ = "102.111.114.119.97.114.100.95.118.115";
say "Version 1: " => eval;
say "Version 2: " => pack "W".(1+y/.//) => /\d+/g;
Executed, that prints:
Version 1: forward_vs
Version 2: forward_vs
Once both are clear to you, you may hit space to continue or q to quit. :)
EDIT: The last one can also be written
pack "WW".y/.//,/\d+/g
But please don't. :)

my $new_vs = join("", map { chr($_) } split(/\./,$vs));

Simple solution:
$ascii .= chr for split /\./, $vs;

pack 'C*', split /\./
For example,
>perl -E"say pack 'C*', split /\./, $ARGV[0]" 102.111.114.119.97.114.100.95.118.115
forward_vs
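In the question's data the field labelled char-len (10) equals the number of characters in the name, so it can double as a sanity check while decoding. A minimal sketch, assuming the index has already been separated from the table prefix; decode_oid_string is a made-up helper name, not part of any SNMP module:

```perl
use strict;
use warnings;

# Hypothetical helper: decode a length-prefixed OID index such as
# "10.102.111..." into the string it encodes.
sub decode_oid_string {
    my ($index) = @_;
    my ($len, @codes) = split /\./, $index;
    # The first number is expected to be the count of the remaining ones.
    die "length prefix $len does not match " . scalar(@codes) . " values\n"
        unless $len == @codes;
    return pack 'C*', @codes;
}

print decode_oid_string('10.102.111.114.119.97.114.100.95.118.115'), "\n";
# prints forward_vs
```

The length check is only a guard against splitting the OID at the wrong place; a proper SNMP module remains the more general route.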

Related

Including regex on variable before matching string

I'm trying to find and extract the occurrence of words read from a text file in a text file. So far I can only find when the word is written correctly and not munged (a changed to # or i changed to 1). Is it possible to add a regex to my strings for matching or something similar? This is my code so far:
sub getOccurrenceOfStringInFileCaseInsensitive
{
my $fileName = $_[0];
my $stringToCount = $_[1];
my $numberOfOccurrences = 0;
my @wordArray = wordsInFileToArray ($fileName);
foreach (@wordArray)
{
my $numberOfNewOccurrences = () = (m/$stringToCount/gi);
$numberOfOccurrences += $numberOfNewOccurrences;
}
return $numberOfOccurrences;
}
The routine receives the name of a file and the string to search. The routine wordsInFileToArray () just gets every word from the file and returns an array with them.
Ideally I would like to perform this search directly reading from the file in one go instead of moving everything to an array and iterating through it. But the main question is how to hard code something into the function that allows me to capture munged words.
Example: I would like to extract both lines from the file.
example.txt:
russ1#anh#ck3r
russianhacker
# this variable also will be read from a blacklist file
$searchString = "russianhacker";
getOccurrenceOfStringInFileCaseInsensitive ("example.txt", $searchString);
Thanks in advance for any responses.
Edit:
The possible substitutions will be defined by a user and the regex must be set to fit. A user could say that a common substitution is to change the letter "a" to "#" or even "1". The possible change is completely arbitrary.
When searching for a specific word ("russian" for example) this could be done with something like:
(m/russian/i); # would just match the word as it is
(m/russi[a#1]n/i); # would match the munged word
But I'm not sure how to do that if I have the string to match stored in a variable, such as:
$stringToSearch = "russian";
This is sort of a full-text search problem, so one method is to normalize the document strings before matching against them.
use strict;
use warnings;
use Data::Munge 'list2re';
...
my %norms = (
'#' => 'a',
'1' => 'i',
...
);
my $re = list2re keys %norms;
s/($re)/$norms{$1}/ge for @wordArray;
This approach only works if there's only a single possible "normalized form" for any given word, and may be less efficient anyway than just trying every possible variation of the search string if your document is large enough and you recompute this every time you search it.
As a note your regex m/$randomString/gi should be m/\Q$randomString/gi, as you don't want any regex metacharacters in $randomString to be interpreted that way. See docs for quotemeta.
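The effect of \Q is easy to see on a search string that contains a dot. A small sketch with made-up sample words:

```perl
use strict;
use warnings;

my @words  = ('a.c', 'abc');
my $needle = 'a.c';

# Unquoted: "." is a metacharacter and matches any character.
my @loose   = grep { /$needle/ } @words;
# Quoted with \Q...\E: "." only matches a literal dot.
my @literal = grep { /\Q$needle\E/ } @words;

print "loose:   @loose\n";     # a.c abc
print "literal: @literal\n";   # a.c
```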
There are parts of the problem which aren't specified precisely enough (yet).
Some of the roll-your-own approaches, that depend on the details, are
If user defined substitutions are global (replace every occurrence of a character in every string) the user can submit a mapping, as a hash say, and you can fix them all. The process will identify all candidates for the words (along with the actual, unmangled, words, if found). There may be false positives so also plan on some post-processing
If the user can supply a list of substitutions along with words that they apply to (the mangled or the corresponding unmangled ones) then we can have a more targeted run
Before this is clarified, here is another way: use a module for approximate ("fuzzy") matching.
The String::Approx module seems to fit quite a few of your requirements.
The match of the target with a given string relies on the notion of the Levenshtein edit distance: how many insertions, deletions, and replacements ("edits") it takes to make the given string into the sought target. The maximum accepted number of edits can be set.
A simple-minded example:
use warnings;
use strict;
use feature 'say';
use String::Approx qw(amatch);
my $target = qq(russianhacker);
my @text = qw(that h#cker was a russ1#anh#ck3r);
my @matches = amatch($target, ["25%"], @text);
say for @matches; #==> russ1#anh#ck3r
See the documentation for what the module offers, but at least two comments are in order.
First, note that the second argument in amatch specifies the percentage deviation from the target string that is acceptable. For this particular example we need to allow every fourth character to be "edited." So much room for tweaking can result in accidental matches which then need to be filtered out, so there will be some post-processing to do.
Second -- we didn't catch the easier one, h#cker. The module takes a fixed "pattern" (target), not a regex, and can search for only one at a time. So, in principle, you need a pass for each target string. This can be improved a lot, but there'll be more work to do.
Please study the documentation; the module offers a whole lot more than this simple example.
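To make the notion of edit distance concrete, here is a plain textbook Levenshtein computation. It is only an illustration of the idea; it is not how String::Approx is implemented, and edit_distance is a name chosen for this sketch:

```perl
use strict;
use warnings;
use List::Util qw(min);

# Number of single-character insertions, deletions and replacements
# needed to turn $s into $t (classic dynamic programming, row by row).
sub edit_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length($t));
    for my $i (1 .. length($s)) {
        my @cur = ($i);
        for my $j (1 .. length($t)) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @cur, min(
                $prev[$j] + 1,            # deletion
                $cur[$j - 1] + 1,         # insertion
                $prev[$j - 1] + $cost,    # replacement (or match)
            );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

print edit_distance('hacker', 'h#cker'), "\n";   # 1
print edit_distance('kitten', 'sitting'), "\n";  # 3
```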
I ended up solving the problem by including the regex directly in the variable that I use to match against the lines of my file. It looks something like this:
sub getOccurrenceOfMungedStringInFile
{
my $fileName = $_[0];
my $mungedWordToCount = $_[1];
my $numberOfOccurrences = 0;
open (my $inputFile, "<", $fileName) or die "Can't open file: $!";
$mungedWordToCount =~ s/a/\[a\#4\]/gi;
while (my $currentLine = <$inputFile>)
{
chomp ($currentLine);
$numberOfOccurrences += () = ($currentLine =~ m/$mungedWordToCount/gi);
}
close ($inputFile) or die "Can't close file: $!";
return $numberOfOccurrences;
}
Where the line:
$mungedWordToCount =~ s/a/\[a\#4\]/gi;
Is just one of the substitutions that are needed and others can be added similarly.
I didn't know that Perl would just interpret the regex inside of the variable since I've tried that before and could only get the wanted results defining the variables inside the function using single quotes. I must've done something wrong the first time.
Thanks for the suggestions, people.
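The single substitution above could be generalized to a table of user-defined replacements. A minimal sketch, assuming the substitutions are one character for one character; the %munge table and the munged_regex helper are hypothetical names, and the sample strings are made up for the demonstration:

```perl
use strict;
use warnings;

# Hypothetical table: plain letter => characters that may stand in for it.
my %munge = ( a => '#4', i => '1', e => '3' );

# Build a case-insensitive regex that matches a word and its munged variants.
sub munged_regex {
    my ($word) = @_;
    my $pattern = join '', map {
        my $alts = $munge{ lc $_ };
        # Letters with substitutes become a character class; others stay literal.
        defined $alts ? '[' . quotemeta($_ . $alts) . ']' : quotemeta $_;
    } split //, $word;
    return qr/$pattern/i;
}

my $re = munged_regex('russianhacker');
for my $line ('russ1#nh4ck3r', 'russianhacker', 'harmless') {
    print "$line\n" if $line =~ $re;
}
# prints russ1#nh4ck3r and russianhacker
```

Because every character goes through quotemeta, metacharacters in the word or in the substitution list are matched literally.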

Match string with escape characters or backslashes

The following Perl script and test data reproduce the situation: I find only 2 matches instead of the 4 expected (I want to match every support.tier.1, with or without backslashes in between).
How can I modify the Perl regex here? Thanks.
my @TestData = (
"support.tier.1",
"support.tier.2",
qw("support\.tier\.1"),
"support\.tier\.2",
quotemeta("support.tier.1\@example.com"),
"support.tier.2\@example.com",
"support\.tier\.1\@example\.com",
"support\.tier\.2\@example\.com",
"sales\@example\.com"
);
Here is the code to be changed:
my $count = 0;
foreach my $tier(@TestData){
if($tier =~ m/support.tier.1/){
print "$count: $tier\n";
}
$count++;
}
I only get 2 matches while the expected is 4:
0: support.tier.1
6: support.tier.1@example.com
Update
Since it seems that you may indeed be getting strings containing backslashes, I suggest that you use String::Unescape to remove those backslashes before testing your strings. You will probably have to install it as it isn't a core module
Your code would look like this
use strict;
use warnings;
use String::Unescape;
my @tiers = (
"support.tier.1",
"support.tier.2",
qw("support\.tier\.1"),
"support\.tier\.2",
quotemeta("support.tier.1\@example.com"),
"support.tier.2\@example.com",
"support\.tier\.1\@example\.com",
"support\.tier\.2\@example\.com",
"sales\@example\.com",
);
my $count = 0;
for my $tier ( @tiers ) {
my $plain = String::Unescape->unescape($tier);
if ( $plain =~ /support\.tier\.1/ ) {
printf "%d: %s\n", ++$count, $tier;
}
}
output
1: support.tier.1
2: "support\.tier\.1"
3: support\.tier\.1\@example\.com
4: support.tier.1@example.com
Note that there is a bug in the String::Unescape module that prevents it from exporting the unescape function. It just means you have to use String::Unescape::unescape or String::Unescape->unescape all the time. Or you could import it manually with *unescape = \&String::Unescape::unescape
The @tiers array contains these exact strings
support.tier.1
support.tier.2
"support\.tier\.1"
support.tier.2
support\.tier\.1\@example\.com
support.tier.2@example.com
support.tier.1@example.com
support.tier.2@example.com
sales@example.com
Can you see that only items 1 and 7 contain the string support.tier.1? The other two that I imagine you expected to match are 3 and 5, which contain spurious backslashes
It's not clear, but it seems unlikely that you will be getting data in this format. If you really want to match support.tier.1 where either dot may be preceded by a backslash character then you need /support\\?\.tier\\?\.1/, but I think you are misunderstanding the way Perl strings work
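The difference between the two quoting styles can be checked directly. A small sketch (the variable names are illustrative only):

```perl
use strict;
use warnings;

my $double = "support\.tier\.1";   # double quotes: "\." collapses to "."
my $single = 'support\.tier\.1';   # single quotes: the backslashes are kept

print length($double), "\n";   # 14
print length($single), "\n";   # 16

# An optional backslash before each literal dot matches both forms.
my $re = qr/support\\?\.tier\\?\.1/;
for my $str ($double, $single, 'support.tier.2') {
    print "$str\n" if $str =~ $re;
}
# prints support.tier.1 and support\.tier\.1
```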
I may not fully understand, but if I do I agree with the answer that Matt has already attempted to give you. Regex definitely can handle your request if you are saying that the escape character may or may not be before each period in support.tier.1.
A single backslash is \\ and ? means essentially "one or zero:"
use strict;
use warnings;
my @tiers = (
"support.tier.1",
"support.tier.2",
qw("support\.tier\.1"),
"support\.tier\.2",
quotemeta("support.tier.1\@example.com"),
"support.tier.2\@example.com",
"support\.tier\.1\@example\.com",
"support\.tier\.2\@example\.com",
"sales\@example\.com",
);
my $count = 0;
foreach my $tier (@tiers) {
if ($tier =~ /support\\?\.tier\\?\.1/) {
print "$count: $tier\n";
}
$count++;
}
On an unrelated note, for the purpose of creating an easy-to-follow example, I included a suggestion on how you might better format your sample data instead of using the $str and pushes.
If this works, I'd recommend you ask Matt to post his comment responses as an answer and accept it.

Need help in designing a regular expression

"LIM-1-2::PROVPEC=NTK552DA,CTYPE=\"LIM C-Band\":OOS-AU,UEQ"
"2XOSC-1-4::PROVPEC=NTK554BA,CTYPE=\"OSC w/WSC 2 Port SFP 2 Port 10/100 BT\":OOS-AU,UEQ"
"P155M-1-4-1::PROVPEC=NTK592NP,CTYPE=\"OC-3 0-15dB CWDM 1511 nm\":OOS-AU,UEQ"
I have this data in a file. I need to extract -1-2 for the first equipment and likewise -1-4-1 for the last one. I will be using this data later. I was able to figure out how to get -1-1, but my expression is not versatile enough to also get -1-1-4.
Equipment can also have a subslot. This list is tentative.
The format is EQP-shelf-slot-subslot. I need an expression that checks whether the subslot exists and gives me the output in the form -shelf-slot-subslot or -shelf-slot.
How about:
my ($wanted) = $str =~ /^\w+([^:]+)/;
or, if quotes are part of the string:
my ($wanted) = $str =~ /^"\w+([^:]+)/;
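If the shelf, slot and optional subslot are needed separately, a slightly stricter pattern can be combined with split. A sketch under the assumption that each part is purely numeric; the sample lines are shortened versions of the ones in the question:

```perl
use strict;
use warnings;

# Shortened versions of the sample lines from the question.
my @lines = (
    'LIM-1-2::PROVPEC=NTK552DA',
    '2XOSC-1-4::PROVPEC=NTK554BA',
    'P155M-1-4-1::PROVPEC=NTK592NP',
);

my @found;
for my $line (@lines) {
    # Name, then -shelf-slot with an optional -subslot, then "::".
    next unless $line =~ /^"?\w+((?:-\d+){2,3})::/;
    my $location = $1;
    # The leading "-" yields an empty first field, which is discarded.
    my (undef, $shelf, $slot, $subslot) = split /-/, $location;
    push @found, $location;
    printf "%-8s shelf=%s slot=%s subslot=%s\n",
        $location, $shelf, $slot, $subslot // 'none';
}
```

Here $subslot is undefined when the equipment has no subslot, which gives a simple way to test for its presence.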

Use Regex to modify specific column in a CSV

I'm looking to convert some strings in a CSV which are in 0000-2400 hour format to 00-24 hour format. e.g.
2011-01-01,"AA",12478,31703,12892,32575,"0906",-4.00,"1209",-26.00,2475.00
2011-01-02,"AA",12478,31703,12892,32575,"0908",-2.00,"1236",1.00,2475.00
2011-01-03,"AA",12478,31703,12892,32575,"0907",-3.00,"1239",4.00,2475.00
The 7th and 9th columns are departure and arrival times, respectively. Preferably the lines should look like this when I'm done:
2011-01-01,"AA",12478,31703,12892,32575,"09",-4.00,"12",-26.00,2475.00
The whole csv will eventually be imported into R and I want to try and handle some of the processing beforehand because it will be kinda large. I initially attempted to do this with Perl but I'm having trouble picking out multiple digits w/ a regex. I can get a single digit before a given comma with a lookbehind expression, but not more than one.
I'm also open to being told that doing this in Perl is needlessly silly and I should stick to R. :)
I may as well offer my own solution to this, which is
s/"(\d\d)\d\d"/"$1"/g
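Applied to the first sample line, the substitution behaves like this. Note the assumption built into it: every quoted four-digit field in the line is treated as a time.

```perl
use strict;
use warnings;

my $line = '2011-01-01,"AA",12478,31703,12892,32575,"0906",-4.00,"1209",-26.00,2475.00';

# Copy the line, then keep only the first two digits of each quoted HHMM field.
(my $short = $line) =~ s/"(\d\d)\d\d"/"$1"/g;

print "$short\n";
# prints 2011-01-01,"AA",12478,31703,12892,32575,"09",-4.00,"12",-26.00,2475.00
```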
Like I mentioned in the comments, using a CSV module like Text::CSV is a safe option. This is a quick sample script showing how it's used. You'll notice that it does not preserve quotes, though it should, since I put in keep_meta_info. If it's important to you, I'm sure there's a way to fix it.
use strict;
use warnings;
use Data::Dumper;
use Text::CSV;
my $csv = Text::CSV->new({
binary => 1,
eol => $/,
keep_meta_info => 1,
});
while (my $row = $csv->getline(*DATA)) {
for ($row->[6], $row->[8]) {
s/\d\d\K\d\d//;
}
$csv->print(*STDOUT, $row);
}
__DATA__
2011-01-01,"AA",12478,31703,12892,32575,"0906",-4.00,"1209",-26.00,2475.00
2011-01-02,"AA",12478,31703,12892,32575,"0908",-2.00,"1236",1.00,2475.00
2011-01-03,"AA",12478,31703,12892,32575,"0907",-3.00,"1239",4.00,2475.00
Output:
2011-01-01,AA,12478,31703,12892,32575,09,-4.00,12,-26.00,2475.00
2011-01-02,AA,12478,31703,12892,32575,09,-2.00,12,1.00,2475.00
2011-01-03,AA,12478,31703,12892,32575,09,-3.00,12,4.00,2475.00

Regex to replace gibberish

I have to clean some input from OCR which recognizes handwriting as gibberish. Any suggestions for a regex to clean out the random characters? Example:
Federal prosecutors on Monday charged a Miami man with the largest
case of credit and debit card data theft ever in the United States,
accusing the one-time government informant of swiping 130 million
accounts on top of 40 million he stole previously.
, ':, Ie
':... 11'1
. '(.. ~!' ': f I I
. " .' I ~
I' ,11 l
I I I ~ \ :' ,! .~ , .. r, 1 , ~ I . I' , .' I ,.
, i
I ; J . I.' ,.\ ) ..
. : I
'I', I
.' '
r,"
Gonzalez is a former informant for the U.S. Secret Service who helped
the agency hunt hackers, authorities say. The agency later found out that
he had also been working with criminals and feeding them information
on ongoing investigations, even warning off at least one individual,
according to authorities.
eh....l
~.\O ::t
e;~~~
s: ~ ~. 0
qs c::; ~ g
o t/J (Ii .,
::3 (1l Il:l
~ cil~ 0 2:
t:lHj~(1l
. ~ ~a
0~ ~ S'
N ("b t/J :s
Ot/JIl:l"-<:!
v'g::!t:O
-....c......
VI (:ll <' 0
:= - ~
< (1l ::3
(1l ~ '
t/J VJ ~
Pl
.....
....
(II
One of the simplest solutions (not involving regexes):
# Python sketch; line holds the current input line
import string
number_of_punct = sum(c in string.punctuation for c in line)
if number_of_punct > len(line) / 2:
    line_is_garbage()  # placeholder for whatever you do with a garbage line
Or, more crudely, with a regex: s/[!,'"##~$%^& ]{5,}//g
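The same punctuation-ratio test could be written in Perl roughly as follows. This is a sketch: mostly_punct is a made-up name, the one-half threshold is taken from the pseudocode above, and the sample lines come from the question.

```perl
use strict;
use warnings;

# True when more than half of the characters in a line are punctuation.
sub mostly_punct {
    my ($line) = @_;
    my $punct = () = $line =~ /[[:punct:]]/g;   # count the matches
    return $punct > length($line) / 2;
}

for my $line ('accounts on top of 40 million he stole previously.',
              q{':... 11'1}, 'e;~~~') {
    my $label = mostly_punct($line) ? 'garbage' : 'keep';
    print "$label: $line\n";
}
```

Only the first sample line is kept; the other two are flagged as garbage.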
A simple heuristic, similar to anonymous answer:
listA = [0,1,2..9, a,b,c..z, A,B,C,..Z , ...] // alphanumerical symbols
listB = [!#$%^&...] // other symbols
Na = number_of_alphanumeric_symbols( line )
Nb = number_of_other_symbols( line )
if Na/Nb <= garbage_ratio then
// garbage
No idea how well it would work, but I have considered this problem in the past, idly. I've on occasion played with a little programmatic device called a Markov chain.
Now the Wikipedia article probably won't make much sense until you see some of the other things a Markov chain is good for. One example of a Markov chain in action is this Greeking generator. Another example is the MegaHAL chatbot.
Greeking is gibberish that looks like words. Markov chains provide a way of randomly generating a sequence of letters, but weighting the random choices to emulate the frequency patterns of an examined corpus. So for instance, given the letter "T", the letter "h" is more likely to show up next than any other letter. So you examine a corpus (say some newspapers, or blog postings) to produce a kind of fingerprint of the language you're targeting.
Now that you have that frequency table/fingerprint, you can examine your sample text, and rate each letter according to the likelihood of it appearing. Then, you can flag the letters under a particular threshold likelihood for removal. In other words, a surprise filter. Filter out surprises.
There's some leeway for how you generate your frequency tables. You're not limited to one letter following another. You can build a frequency table that predicts which letter will likely follow each digraph (group of two letters), or each trigraph, or quadgraph. You can work the other side, predicting likely and unlikely trigraphs to appear in certain positions, given some previous text.
It's kind of like a fuzzy regex. Rather than MATCH or NO MATCH, the whole text is scored on a sliding scale according to how similar it is to your reference text.
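A very small version of this "surprise filter" can be sketched with single-character transitions. Everything here is an assumption made for illustration: the function names train and score, the tiny one-sentence corpus (taken from the question's sample text), and scoring by the average transition probability. A real filter would need a much larger corpus and a tuned threshold.

```perl
use strict;
use warnings;

# Count, for every character in a reference text, which characters follow it.
sub train {
    my ($corpus) = @_;
    my (%follow, %total);
    my @chars = split //, lc $corpus;
    for my $i (0 .. $#chars - 1) {
        $follow{ $chars[$i] }{ $chars[$i + 1] }++;
        $total{ $chars[$i] }++;
    }
    return { follow => \%follow, total => \%total };
}

# Average probability of each character-to-character step in $text.
# Steps the model has never seen contribute 0, so gibberish scores low.
sub score {
    my ($model, $text) = @_;
    my @chars = split //, lc $text;
    return 0 if @chars < 2;
    my $sum = 0;
    for my $i (0 .. $#chars - 1) {
        my ($cur, $next) = @chars[ $i, $i + 1 ];
        my $seen = $model->{total}{$cur} or next;
        $sum += ( $model->{follow}{$cur}{$next} // 0 ) / $seen;
    }
    return $sum / ( @chars - 1 );
}

my $model = train('Federal prosecutors on Monday charged a Miami man with '
    . 'the largest case of credit and debit card data theft ever in the United States');

for my $line ('the agency later found out', '~.\O ::t') {
    printf "%.3f  %s\n", score($model, $line), $line;
}
```

Lines whose score falls below some chosen cut-off would then be treated as gibberish.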
I used a combination: eliminate lines that don't contain at least two words of three or more letters, or one word of six or more letters.
([a-zA-Z]{3,}\s){2,}|([a-zA-Z]{6,})
http://www.regexpal.com/
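Used as a line filter in Perl, this rule might look like the sketch below. The character classes are written as [A-Za-z], since a | inside a character class is a literal pipe rather than an alternation. The sample lines are taken from the question.

```perl
use strict;
use warnings;

my @lines = (
    'accounts on top of 40 million he stole previously.',
    q{I' ,11 l},
    'e;~~~',
);

# Keep lines with two consecutive words of 3+ letters, or one word of 6+ letters.
my @kept = grep { /(?:[A-Za-z]{3,}\s){2,}|[A-Za-z]{6,}/ } @lines;

print "$_\n" for @kept;
# prints only the first line
```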
Here is a Perl implementation of the garbage_ratio heuristic:
#!/usr/bin/perl
use strict;
use warnings;
while ( defined( my $chunk = read_chunk(\*DATA) ) ) {
next unless length $chunk;
my @tokens = split ' ', $chunk;
# what is a word?
my @words = grep {
/^[A-Za-z]{2,}[.,]?$/
or /^[0-9]+$/
or /^(?:a|I)$/
or /^(?:[A-Z][.])+$/
} @tokens;
# completely arbitrary threshold
my $score = @words / @tokens;
print $chunk, "\n" if $score > 0.5;
}
sub read_chunk {
my ($fh) = @_;
my ($chunk, $line);
while ( my $line = <$fh> ) {
if( $line =~ /\S/ ) {
$chunk .= $line;
last;
}
}
while (1) {
$line = <$fh>;
last unless (defined $line) and ($line =~ /\S/);
$chunk .= $line;
}
return $chunk;
}
__DATA__
Paste the text above after __DATA__ above (not repeating the text here to save space). Of course, the use of the __DATA__ section is for the purpose of posting a self-contained script. In real life, you would have code to open the file etc.
Output:
Federal prosecutors on Monday charged a Miami man with the largest
case of credit and debit card data theft ever in the United States,
accusing the one-time government informant of swiping 130 million
accounts on top of 40 million he stole previously.
Gonzalez is a former informant for the U.S. Secret Service who helped
the agency hunt hackers, authorities say. The agency later found out that
he had also been working with criminals and feeding them information
on ongoing investigations, even warning off at least one individual,
according to authorities.
Regex won't help here. I'd say if you have control over the recognition part then focus on better quality there:
http://www.neurogy.com/ocrpreproc.html
You can also ask the user to help you by specifying the type of text you work with, e.g. if it is a page from a book then you would expect the majority of lines to be the same length and to consist mainly of letters, spaces and punctuation.
Well a group of symbols would match a bit of gibberish. Perhaps checking against a dictionary for words?
There seems to be a lot of line breaks where gibberish is, so that may be an indicator too.
Interesting problem.
If this is representative, I suppose you could build a library of common words and delete any line which didn't match any of them.
Or perhaps you could match character and punctuation characters and see if there is a reliable ratio cut-off, or simply a frequency of occurrence of some characters which flags it as gibberish.
Regardless, I think there will have to be some programming logic, not simply a single regular expression.
I guess that a regex would not help here. Regex would basically match a deterministic input i.e. a regex will have a predefined set of patterns that it will match. And gibberish would in most cases be random.
One way would be to invert the problem i.e. match the relevant text instead of matching the gibberish.
I'd claim a rule like "any punctuation followed by anything except a space is spam".
So in .NET it's possibly something like (with input holding the text)
Regex.Replace(input, @"\p{P}+[a-zA-Z0-9]+", "");
Then you'd consider "any word with two or more punctuations consecutively":
Regex.Replace(input, @" \p{P}{2,} ", "");
Seems like a good start anyway.
I like @Breton's answer - I'd suggest using his Corpus approach also with a library of known 'bad scans', which might be easier to identify because 'junk' has more internal consistency than 'good text' if it comes from bad OCR scans (the number of distinct glyphs is lower for example).
Another good technique is to use a spell checker/dictionary and look up the 'words' after you've eliminated the non readable stuff with regex.