Is it possible to replicate SAS md5 function output via GNU coreutils? - sas

I expected this to be fairly straightforward, but I've run out of ideas this time. I'm working with GNU coreutils on Windows 7 (not that it should make any difference). I've found another command line utility that does what I want, but I'd prefer to find a way of doing this via GNU md5sum if possible.
Here's what I'm trying to reproduce:
data _null_;
length a $32;
a = put(md5("Hello"), $hex32.);
put a=;
run;
/*Output to replicate: 8B1A9953C4611296A827ABF8C47804D7*/
Here's what I've tried so far:
%macro wincmd /parmbuff;
filename cmd pipe "&SYSPBUFF" lrecl = 32767;
data _null_;
infile cmd lrecl = 32767;
input;
put _infile_;
run;
filename cmd clear;
%mend wincmd;
%let MD5SUM = C:\Program Files (x86)\coreutils\bin\md5sum.exe;
%wincmd(echo Hello | ""&MD5SUM"");
/*Output: f0d07a42adce73f0e4bc2d5e1cdb71e5 *- */
%wincmd(echo Hello | ""&MD5SUM"" -t);
/*Output: adb3f07f896745a101145fc3c1c7b2ea *- */
%wincmd(echo ""Hello"" | ""&MD5SUM"");
/*Output: 2c3a70806465ad43c09fd387e659fbce *- */
%let MD5 = C:\Program Files (x86)\md5\md5.exe;
%wincmd(echo Hello | ""&MD5"");
/*Output: F0D07A42ADCE73F0E4BC2D5E1CDB71E5 (matches md5sum)*/
%wincmd(echo ""Hello"" | ""&MD5"");
/*Output: 2C3A70806465AD43C09FD387E659FBCE (matches md5sum)*/
%wincmd(""&MD5"" -d""Hello"");
/*Output: 8B1A9953C4611296A827ABF8C47804D7 (matches SAS!)*/
Is there some form of syntax I can use with md5sum that will result in the same output (except possibly for upper/lower case differences) as SAS and md5 -d ? And why does the same string produce a different MD5 hash when read from stdin rather than as a command line parameter?
Update: fix, as suggested by DomPazz and Rob:
I thought I might as well go all in with coreutils at this point and match the SAS output exactly:
%let GNUPATH = C:\Program Files (x86)\coreutils\bin;
%let ECHO = &GNUPATH\echo.exe;
%let TR = &GNUPATH\tr.exe;
%let CUT = &GNUPATH\cut.exe;
%wincmd(""&ECHO"" -n ""Hello"" | ""&MD5SUM"" | ""&TR"" '[a-f]' '[A-F]' | ""&CUT"" -f 1 -d "" "");
/*Output: 8B1A9953C4611296A827ABF8C47804D7*/

Your problem is not in md5sum, but in echo, which is adding white space to the "Hello" string.
Verify
C:\>echo Hello > c:\temp\test.txt
C:\>md5sum c:\temp\test.txt
-- I get: f0d07a42adce73f0e4bc2d5e1cdb71e5
Now open the file and notice the trailing white space and the newline. Delete those.
Run
C:\>md5sum c:\temp\test.txt
-- I get 8b1a9953c4611296a827abf8c47804d7, which matches SAS.
EDIT:
As mentioned in the comments below, GNU echo has the -n option, which suppresses the trailing newline, so only the bytes of Hello reach md5sum.
C:\Cygwin\bin>echo.exe -n Hello | md5sum.exe
returns: 8b1a9953c4611296a827abf8c47804d7
which matches the SAS value.
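The same check can be reproduced in a POSIX shell with printf, which writes exactly the bytes it is given. This is a minimal sketch assuming GNU md5sum is on the PATH; md5_hex is a hypothetical helper name, not part of coreutils.

```shell
# md5_hex is a hypothetical helper: hash exactly the bytes of $1,
# with no trailing space or newline added.
md5_hex() {
    printf '%s' "$1" | md5sum | cut -d ' ' -f 1
}

md5_hex 'Hello'                    # 8b1a9953c4611296a827abf8c47804d7
md5_hex 'Hello' | tr 'a-f' 'A-F'   # same digest, uppercased like $hex32.

# Any extra byte changes the digest, e.g. a trailing space and CRLF
# sent down the pipe along with the word itself.
printf 'Hello \r\n' | md5sum
```

Using printf '%s' rather than echo avoids relying on how a particular echo treats -n or trailing blanks.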

The MD5 output depends on the exact bytes of the input, so as far as I'm aware the two things that usually cause a mismatch are:
case of source string
length of source string (includes leading/trailing blanks, length of empty string, etc.)
My guess is that the reason you are getting different outputs is because the different approaches pass in the string to hash with different (perhaps default) lengths and/or leading/trailing blanks, or perhaps your quotes are being included in the hash.
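One way to check what is actually being hashed is to count or dump the bytes before they reach md5sum. A small sketch, assuming a POSIX shell with wc and od available; byte_count is a hypothetical helper name.

```shell
# byte_count is a hypothetical helper: number of bytes that would be
# hashed for the string in $1.
byte_count() {
    printf '%s' "$1" | wc -c | tr -d ' '
}

byte_count 'Hello'      # 5: just the five letters
byte_count 'Hello '     # 6: a trailing blank counts
byte_count '"Hello"'    # 7: quotes that reach the program count too

# od -c shows each byte, which makes a stray space, \r or \n visible.
printf 'Hello \r\n' | od -c
```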

Related

How to use sed/grep to extract text between two different strings (+ the 2 strings) that contain "/and *"

I am trying to extract the text between two strings that contain the symbols / and *. I believe this is the reason the regex commands break. So my starting string is "/25043*" and my ending string is "*25043/". I have to extract the text between these two strings (including these 2 strings) from a text file.
I've tried this using : grep -o -P '(?<="/25043*").*(?="*25043/")' code.txt
which gives me an error : "grep: lookbehind assertion is not fixed length"
My input file:
#include <stdio.h>
#include <stdlib.h>
#define MAX 256
/25043/definiramo konstanto MAX
int main(){
/25043*Vstopna tocka programa
Ne uporabljamo parameterov pri main, kjer jih ne potrebujemo
*25043/
printf("");/25043/izpis vrednosti konstante MAX
return 0;
}
/25043/Koncamo z izvajanjem programa
My wanted output:
/25043*Vstopna tocka programa
Ne uporabljamo parameterov pri main, kjer jih ne potrebujemo
*25043/
With grep -Po, you could use grep -oPz '(?s)/25043\*.*?\*25043/' file, but this solution has some drawbacks, namely, the value extracted has no line contents before /25043* and there is no trailing newline at the end.
The best approach here is to use awk:
awk '/\/25043\*/,/\*25043\//' file
Output:
/25043*Vstopna tocka programa
Ne uporabljamo parameterov pri main, kjer jih ne potrebujemo
*25043/
Note that / and * have to be escaped here (/ is a regex delimiter char, and the * is a special regex metacharacter).
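The escaping can also be avoided by comparing the markers as plain substrings instead of regexes. The following is only a sketch of that idea, using awk's index() function; extract_block and the variable names are made up for illustration, and the sample file is abbreviated from the question.

```shell
# Sample input, abbreviated from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
#define MAX 256
/25043/definiramo konstanto MAX
int main(){
/25043*Vstopna tocka programa
Ne uporabljamo parameterov pri main, kjer jih ne potrebujemo
*25043/
return 0;
}
EOF

# extract_block START STOP FILE (hypothetical helper)
# index() does a plain substring search, so / and * need no escaping.
extract_block() {
    awk -v start="$1" -v stop="$2" '
        index($0, start) { inside = 1 }   # start marker seen: begin printing
        inside           { print }
        index($0, stop)  { inside = 0 }   # stop marker seen: end after this line
    ' "$3"
}

extract_block '/25043*' '*25043/' "$sample"
```

One caveat: awk processes backslash escapes in -v values, so markers containing a backslash would need extra care.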
EDIT:
Based on the OP's edit, try using -z, which makes grep treat the whole file as one large string:
grep -zoP '/25043\*(.|\s)*\*25043/' code.txt
ORIGINAL ANSWER:
have you tried to do it without grouping? like so :
example assumption of code.txt:
asdf asdfasd /25043* asdf *25043/ asdfa asdfa
command:
grep -o '/25043\* .* \*25043/' code.txt
result is :
/25043* asdf *25043/
If you can't use awk, and are looking for a more portable solution, you could try this:
sed '/\/25043\*/,/\*25043\//!d;s/\(\*25043\/\).*$/\1/g' file

List lines between 2 keywords using grep/sed/awk

I have a sas log file and I want to list only those lines that are between two words: data and run.
File can contain many such words in many lines, for example:
MPRINT: data xxxxx;
yyyyy
xxxxxx
MPRINT: run;
fffff
yyyyy
data fff;
fffff
run;
I would like to have lines 1-4 and 8-10.
I tried something like
egrep -iz file -e '\sdata\s+\S*\s+(.|\s)*\srun\s' but this expression lists all lines between first begin and last end ((.|\s) is for the purpose of new line character).
I may also want to add additional words to pattern between data and run like:
MPRINT: data xxx;
fffff
NOTE: ffdd
set fff;
xxxxxx
MPRINT: run;
data fff;
yyyyyy
run;
In some cases I would like to list only the lines between data and run where the word set appears in some line.
I know there are many similar threads, but I didn't find any when keywords can repeat multiple times.
I'm not familiar with awk or sed, but if they can help I can also use them.
[Edit]
Note that data and run are not necessarily on the beginning of the line (I updated the example). Also there can't be any other data between data and run.
[Edit2]
As Tom noted, every line that I was looking for started with MPRINT(...):, so I filtered those lines.
Anubhava's answer helped me the most with my final solution, so I marked it as the answer.
Final expression looked like this :
grep -o path -e 'MPRINT.*' | cut -f '2-' -d ' '|
grep -iozP '(?ms) data [^\(;\s]+.*?(set|infile).*?run[^\n]*\n'
You may use this GNU grep command with the -P (PCRE) option:
grep -ozP '(?ms).*?data .*?run[^\n]*\n' file
If you only want to print blocks that contain a line starting with set, then use:
grep -ozP '(?ms).*?data .*?^set.*?run[^\n]*\n' file
MPRINT: data xxxxx;
yyyyy
set fff;
xxxxxx
MLOGIC: run;
You may use this awk to print between 2 keywords that must contain a line starting with set:
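awk '/data / {          # a data step starts
    p = 1
}
p && !y {               # inside a block, no set line seen yet
    if (/^set/)
        y = 1
    else
        buf = buf $0 ORS
}
y {                     # set line seen: flush the buffered lines, then print
    if (buf != "")
        printf "%s", buf
    buf = ""
    print
}
/run/ {                 # block ends: reset state and drop any unprinted buffer
    p = y = 0
    buf = ""
}' file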
MPRINT: data xxxxx;
yyyyy
set fff;
xxxxxx
MLOGIC: run;
If you just want to print the data between 2 keywords in awk, it is as simple as:
awk '/data /,/run/' file
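When the keyword may appear anywhere in a line rather than at its start, another option is to buffer each block and decide at the closing run; whether to print it. This is a sketch under that assumption, not the method from the answer above; blocks_with and the sample log are hypothetical.

```shell
# Sample log (made up, modelled on the question).
log=$(mktemp)
cat > "$log" <<'EOF'
MPRINT: data xxx;
fffff
NOTE: ffdd
set fff;
xxxxxx
MPRINT: run;
data fff;
yyyyyy
run;
EOF

# blocks_with KEYWORD FILE (hypothetical helper): print only the
# data ... run; blocks in which KEYWORD occurs as a plain substring.
blocks_with() {
    awk -v want="$1" '
        /data /          { inside = 1; buf = ""; hit = 0 }  # block starts
        inside           { buf = buf $0 ORS                 # collect the line
                           if (index($0, want)) hit = 1 }   # keyword seen
        inside && /run;/ { if (hit) printf "%s", buf        # keep or drop
                           inside = 0 }
    ' "$2"
}

blocks_with 'set ' "$log"   # prints the first block only
```

The patterns /data / and /run;/ are as loose as in the one-liner above, so a word such as metadata followed by a blank would also open a block.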
From what I understand, the following will do the trick:
sed -n '/data.*;/,/run;/p' $FILENAME
Note that the '.*' after data can be tightened to something like [a-zA-Z]\{5\}, which protects against matching the word data somewhere in the middle of other text.
Limiting the match to the part from data to set would already require some extra decision logic; the basic command would be
sed -n '/data.*;/,/set.*;/p' $FILENAME
(Probably learned along the way from How to use sed/grep to extract text between two words?)
Just try (?s)data.+?run;
Explanation:
(?s) - single line mode, . matches newline character
data - match data literally
.+? - match one or more of any character (including newline), non-greedy due to ?
run; - match run; literally
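To see what the non-greedy quantifier changes, the pattern can be tried with GNU grep. This sketch assumes a grep built with PCRE support (-P); the sample log and the lazy_blocks name are made up for illustration.

```shell
# Small sample log (made up for illustration).
log=$(mktemp)
printf '%s\n' 'MPRINT: data a;' 'x' 'MPRINT: run;' 'junk' 'data b;' 'y' 'run;' > "$log"

# -z reads the whole file as one record, -o prints each match followed
# by a NUL byte, and tr turns those NULs into newlines.
lazy_blocks() {
    grep -ozP '(?s)data.+?run;' "$1" | tr '\0' '\n'
}

lazy_blocks "$log"   # two separate blocks, without the "junk" line

# With a greedy .+ the match runs from the first data to the last run;
# so the "junk" line in between is included.
grep -ozP '(?s)data.+run;' "$log" | tr '\0' '\n'
```

Each match starts at the word data itself, so a prefix such as MPRINT: on the opening line is not part of the output.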

How to append space before match pattern in bash

How can I insert a number of spaces before a matched pattern, line by line, in bash with the sed command?
file.txt
str_len: equ $ - str ; calcs length of string (bytes) by
; subtracting this address ($ symbol)
output.txt
str_len: equ $ - str ; calcs length of string (bytes) by
; subtracting this address ($ symbol)
There is just a tool from unix-magic-set-of-wondrous-things that does exactly what you want:
$ column -t -s ';' -o ';' <input>
str_len: equ $ - str ; calcs length of string (bytes) by
; subtracting this address ($ symbol)
Other than that, sed is Turing complete, and so is Turing's machine. But that does not mean one has time to implement non-trivial solutions on such architectures :D
Edit: The above command was run using column from util-linux 2.25.2, with flags:
-o, --output-separator string
Specify the columns delimiter for table output (default is two spaces).
-s, --separator separators
Specify the possible input item delimiters (default is whitespace).
-t, --table
Determine the number of columns the input contains and create a table.
Columns are delimited with whitespace, by default, or with the characters
supplied using the --output-separator option. Table output is useful for
pretty-printing.
Here's one way you could do it using awk:
$ awk -F' *;' -vOFS=\; '{print sprintf("%-24s", $1),$2}' file.txt
str_len: equ $ - str ; calcs length of string (bytes) by
; subtracting this address ($ symbol)
Set the input field separator to any number of spaces followed by a semicolon. Set the output field separator to a semicolon. Print the first column, followed by as many spaces as are needed to pad to 24 characters, followed by the second column.
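If a comment can itself contain semicolons, or some lines have no comment at all, splitting on the first semicolon by hand is a little more forgiving. The following is a sketch of that variation with a configurable width; pad_comments and its arguments are hypothetical names.

```shell
# pad_comments WIDTH FILE (hypothetical helper): move the first ";"
# of each line to column WIDTH + 1, padding the text before it.
pad_comments() {
    awk -v width="$1" '
        BEGIN { fmt = "%-" width "s%s\n" }       # e.g. "%-24s%s\n"
        {
            pos = index($0, ";")                 # first semicolon only
            if (pos == 0) { print; next }        # no comment: leave as is
            code = substr($0, 1, pos - 1)
            sub(/ +$/, "", code)                 # drop blanks before the ;
            printf fmt, code, substr($0, pos)
        }
    ' "$2"
}

src=$(mktemp)
printf '%s\n' 'str_len: equ $ - str ; calcs length of string (bytes) by' \
              '; subtracting this address ($ symbol)' > "$src"
pad_comments 24 "$src"
```

Text longer than the chosen width is not truncated, so such lines simply keep their semicolon further to the right.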

AWK regex to find String with pattern

I have a file that contains a lot of characters and symbols. I want to find an exact string, cut it up, and assign the pieces to two variables. I have written this with grep, but I want to write it in **AWK** or sed.
Here is my example file :
.f#alU|A#Z<inCWV6a=L?o`A5vIod"%Mm+YW1RM#,L;aN
r^n<&)}[??!VcVIV**2zTest1.Test2n9**94EN~yK,$lU=9?UT.[
e`)G:FS.nGz%?#~k!20aLJ^PU-[#}0W\ !8x
cujOmEK"1;!cI134lu%0-A +/t!VIf?8uT`!
aC1QAQY>4RE$46iVjAE^eo5yR|
1?/T?<H5,%G~[|9I/c&8MY$O]%,UYQe{!{Bm[rRC[
aHC`<m?BUau#N_O>Yct.MXo[>r5^uV&26#MkYB'Kiu\Y
K(*}ldO:ZQnI8t989fi+
CrvEwmTQ80k3==,a'Jj9907+}NNy=0Op
"nzb.j-.i%z5`U*8]~#64sF'r;\x\;ylr_;q5F` A!~p*
First I want to find 2zTest1.Test2n9, then cut off the first two and the last two characters, and finally get the two words without the dot (.). The first word should go into one variable and the second word into another variable.
Note: I want to find 2zTest1.Test2n9 first and then cut it.
output :
variable 1 = test1
variable 2 = test2
Thanks
With sed it's:
sed -n 's/.*\(2z\(\(.*\)\.\(.*\)\)n9\).*/variable 1 = \L\3\nvariable 2 = \L\4/p' your.file
Output:
variable 1 = test1
variable 2 = test2
Using GNU awk:
read var1 var2 < <(
    gawk 'match($0, /2[[:alpha:]]([^.]+)\.(.*)[[:alpha:]]9/, m) {
        print m[1], m[2]
    }' file
)
echo "var1=$var1"
echo "var2=$var2"
var1=Test1
var2=Test2
I read your comments to hek2mgl's answer -- those requirements need to be in the question itself.
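The three-argument form of match() is a gawk extension. Where only a POSIX awk is available, the same extraction can be sketched with RSTART and RLENGTH. This version assumes the token is always wrapped in the literal prefix 2z and suffix n9, as in the question, and lowercases the words to match the requested output; extract_pair is a hypothetical helper name.

```shell
# extract_pair FILE (hypothetical helper): find 2z<word1>.<word2>n9 and
# print the two words in lowercase, separated by a blank.
extract_pair() {
    awk '
        match($0, /2z[^.]+\.[^.]+n9/) {
            token = substr($0, RSTART + 2, RLENGTH - 4)   # strip "2z" and "n9"
            split(token, part, ".")
            print tolower(part[1]), tolower(part[2])
        }
    ' "$1"
}

sample=$(mktemp)
printf '%s\n' 'r^n<&)}[??!VcVIV**2zTest1.Test2n9**94EN~yK,$lU=9?UT.[' > "$sample"

# A here-document keeps read in the current shell without needing
# bash-only process substitution.
read -r var1 var2 <<EOF
$(extract_pair "$sample")
EOF
echo "var1=$var1"
echo "var2=$var2"
```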

Convert columns and rows in readable format

I have a file whose entries are like:
Time;Instance;Database;Status;sheapthres;bp_heap;fcm_heap;other_heap;sessions;sessions_in_exec;locks_held;lock_escal;log_reads;log_writes;deadlocks;l_reads;p_reads;hit_ratio;pct_async_reads;d_writes;a_writes;lock_waiting;sortheap;sort_overflows;pct_sort_overflows;AvgPoolReadTime;AvgDirReadTime;AvgPoolWriteTime;AvgDirWriteTim
02:07:49;SAN33;SAMPLE;Active;0;10688;832;72064;8;0;0%;0;0;0;0;0;0;0%;0%;0;0;0;0;0;0%;0;0;0;0
02:08:09;SAN33;SAMPLE;Active;0;10688;832;72064;8;0;0%;0;0;0;0;0;0;0%;0%;0;0;0;0;0;0%;0;0;0;0
02:08:29;SAN33;SAMPLE;Active;0;10688;832;72064;8;0;0%;0;0;0;0;0;0;0%;0%;0;0;0;0;0;0%;0;0;0;0
and want to convert this in a readable format like:
Time Instance Database
02:07:49 SAN33 SAMPLE
02:08:09 SAN33 SAMPLE
02:08:29 SAN33 SAMPLE
and so on..
I have tried tr -s ";" "\t" but did not get a good result. Can anyone help me with this?
You might want to use column as follows:
column -s\; -t your_file
where -s\; says that your column delimiter is a semicolon (protected with a backslash to avoid interpretation by the shell). See also Command line CSV viewer?.
How about a more Unix-aware variant:
cat <your file> | sed 's/;/\t/g'
Solaris and HP-UX users note: instead of \t character typing, use Ctrl+V and then TAB key sequence.
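For a plain one-character substitution, tr is another option, since tr itself understands the \t escape. The sketch below also shows why the -s flag from the question is risky: it squeezes runs of the replacement character, so empty fields disappear. semis_to_tabs is a hypothetical helper name.

```shell
# semis_to_tabs (hypothetical helper): turn every ";" on stdin into a tab.
semis_to_tabs() {
    tr ';' '\t'
}

printf 'Time;Instance;Database\n02:07:49;SAN33;SAMPLE\n' | semis_to_tabs

# With -s, the two adjacent semicolons collapse into a single tab,
# so the empty middle field is lost.
printf 'a;;b\n' | tr -s ';' '\t'
```

Tabs only line up when every value is shorter than a tab stop, which is where column -t gives a tidier result.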