With the base64-encoded string JVBERi0xLjENCiXi48/TDQoxIDAgb2JqDQo8PCAN I am getting different results from emacs than from the Clojure code below.
Can anyone explain to me why?
The elisp below gives the correct output, ultimately giving me a valid PDF document (when I paste the entire string). I am sure my emacs buffer is set to utf-8:
(base64-decode-string "JVBERi0xLjENCiXi48/TDQoxIDAgb2JqDQo8PCAN")
"%PDF-1.1
%âãÏÓ
1 0 obj
<<
Here is the same output with the chars shown as octal escapes:
"%PDF-1.1
%\342\343\317\323
1
The Clojure below gives incorrect output, rendering the PDF document invalid when I give it the entire string:
(import 'java.util.Base64)

(defn decode [to-decode]
  (let [byts    (.getBytes to-decode "UTF-8")
        decoded (.decode (java.util.Base64/getDecoder) byts)]
    (String. decoded "UTF-8")))
(decode "JVBERi0xLjENCiXi48/TDQoxIDAgb2JqDQo8PCAN")
"%PDF-1.1
%����
1 0 obj
<<
Same output, with the chars shown as octal escapes. I couldn't even copy/paste this, I had to type it in. This is what it looked like when I opened the PDF in text-mode, for the first three lines:
"%PDF-1.1
%\357\277\275\357\277\275\357\277\275\357\277\275
1"
Edit: Taking emacs out of the equation:
If I write the encoded string to a file called encoded.txt and pipe it through the Linux program base64 --decode, I get valid output and a good PDF as well:
This is clojure:
(defn decode [to-decode]
  (let [byts           (.getBytes to-decode "ASCII")
        decoded        (.decode (java.util.Base64/getDecoder) byts)
        flip-negatives #(if (neg? %) (char (+ 255 %)) (char %))]
    (String. (char-array (map flip-negatives decoded)))))
(spit "./output/decoded.pdf" (decode "JVBERi0xLjENCiXi48/TDQoxIDAgb2JqDQo8PCAN"))
(spit "./output/encoded.txt" "JVBERi0xLjENCiXi48/TDQoxIDAgb2JqDQo8PCAN")
Then this at the shell:
➜ output git:(master) ✗ cat encoded.txt| base64 --decode > decoded2.pdf
➜ output git:(master) ✗ diff decoded.pdf decoded2.pdf
2c2
< %áâÎÒ
---
> %����
➜ output git:(master) ✗
Update - this seems to work
Alan Thompson's answer below put me on the correct track, but geez, what a pain to get there.
Here's the idea of what works:
(def iso-latin-1-charset (java.nio.charset.Charset/forName "ISO-8859-1"))

(as-> some-giant-string-i-hate-at-this-point $
  (.getBytes $)
  (String. $ iso-latin-1-charset)
  (base64/decode $ "ISO-8859-1")
  (spit "./output/a-pdf-that-actually-works.pdf" $ :encoding "ISO-8859-1"))
Returning the results as a string, I get:
(b64/decode-str "JVBERi0xLjENCiXi48/TDQoxIDAgb2JqDQo8PCAN")
=> "%PDF-1.1\r\n%����\r\n1 0 obj\r\n<< \r"
and as a vector of ints:
(mapv int (b64/decode-str "JVBERi0xLjENCiXi48/TDQoxIDAgb2JqDQo8PCAN"))
=> [37 80 68 70 45 49 46 49 13 10 37 65533 65533 65533 65533 13 10 49 32 48
32 111 98 106 13 10 60 60 32 13]
Since both the beginning and end of the string look OK, I suspect the B64 string might be malformed?
Update
I went to http://www.base64decode.org and got the result
"Malformed input... :("
Update #2
The root of the problem is that the source characters are not UTF-8 encoded. Rather, they are ISO-8859-1 (aka ISO-LATIN-1) encoded. See this code:
(defn decode-bytes
  "Decodes a byte array from base64, returning a new byte array."
  [code-bytes]
  (.decode (java.util.Base64/getDecoder) code-bytes))

(def iso-latin-1-charset (java.nio.charset.Charset/forName "ISO-8859-1")) ; aka ISO-LATIN-1

(let [b64-str        "JVBERi0xLjENCiXi48/TDQoxIDAgb2JqDQo8PCAN"
      bytes-default  (vec (.getBytes b64-str))
      bytes-8859     (vec (.getBytes b64-str iso-latin-1-charset))
      src-byte-array (decode-bytes (byte-array bytes-default))
      src-bytes      (vec src-byte-array)
      src-str-8859   (String. src-byte-array iso-latin-1-charset)]
  ...)
with result:
iso-latin-1-charset => <#sun.nio.cs.ISO_8859_1 #object[sun.nio.cs.ISO_8859_1 0x3edbd6e8 "ISO-8859-1"]>
bytes-default => [74 86 66 69 82 105 48 120 76 106 69 78 67 105 88 105 52 56 47 84 68 81 111 120 73 68 65 103 98 50 74 113 68 81 111 56 80 67 65 78]
bytes-8859 => [74 86 66 69 82 105 48 120 76 106 69 78 67 105 88 105 52 56 47 84 68 81 111 120 73 68 65 103 98 50 74 113 68 81 111 56 80 67 65 78]
(= bytes-default bytes-8859) => true
src-bytes => [37 80 68 70 45 49 46 49 13 10 37 -30 -29 -49 -45 13 10 49 32 48 32 111 98 106 13 10 60 60 32 13]
src-str-8859 => "%PDF-1.1\r\n%âãÏÓ\r\n1 0 obj\r\n<< \r"
So the java.lang.String constructor will work correctly with a byte[] input, even when the high bit is set (making them look like "negative" values), as long as you tell the constructor the correct java.nio.charset.Charset to use for interpreting the values.
Interesting that the object type is sun.nio.cs.ISO_8859_1.
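The same behavior can be demonstrated outside the JVM. A minimal Python sketch (standard library only) showing that the decoded bytes are valid ISO-8859-1 but not valid UTF-8:

```python
import base64

b64_str = "JVBERi0xLjENCiXi48/TDQoxIDAgb2JqDQo8PCAN"
raw = base64.b64decode(b64_str)          # the actual PDF bytes

# Interpreting the bytes as ISO-8859-1 recovers the header emacs showed
print(raw.decode("iso-8859-1"))          # %PDF-1.1 ... %âãÏÓ ...

# Interpreting them as UTF-8 fails: 0xE2 0xE3 0xCF 0xD3 are not valid
# UTF-8 sequences, so each bad byte becomes U+FFFD (65533), the '�' above
print(raw.decode("utf-8", errors="replace"))
```

Java's new String(bytes, "UTF-8") performs the same silent replacement, which is where the 65533 values in the int vectors above come from.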
Update #3
See the SO question below for a list of libraries that can (usually) autodetect the encoding of a byte stream (e.g. UTF-8, ISO-8859-1, ...)
What is the most accurate encoding detector?
I think you need to verify the actual bytes that are produced in both scenarios. I would save both decoded results in a file and then compare them using, for example, the xxd command line tool to get a hex display of the bytes in the files.
I suspect your emacs and Clojure application use different fonts, causing the same non-ASCII bytes to be rendered differently, e.g. the same byte value is rendered as â in emacs and � in the Clojure output.
I would also check whether elisp indeed creates the resulting string using UTF-8. base64-decode-string mentions unibyte strings, and I am not sure it's really UTF-8. Unibyte sounds like encoding characters using always one byte per character, whereas UTF-8 uses one to four bytes per character.
Update
#glts made a correct point in his comment to the question. If we go to http://www.utilities-online.info/base64/ (for example), and we try to decode the original string, we get a third, different result:
%PDF-1.1
%⣏Ӎ
1 0 obj
<<
However, if we try to encode the data the OP posted, we get a different Base64 string: JVBERi0xLjEKICXDosOjw4/DkwogMSAwIG9iagogPDwg, which, if we run it through the original decode implementation as written by the OP, gives the same output:
(decode "JVBERi0xLjEKICXDosOjw4/DkwogMSAwIG9iagogPDwg")
"%PDF-1.1\n %âãÏÓ\n 1 0 obj\n << "
No need to make any conversions. I guess you should check out the encoder.
Original answer
This problem is due to Java's byte being signed. So much fun!
When you convert the decoded bytes to a string as UTF-8, every byte sequence that isn't valid UTF-8 is replaced by the Unicode replacement character (code point 65533), which is plainly wrong:
(map long (decode "JVBERi0xLjENCiXi48/TDQoxIDAgb2JqDQo8PCAN"))
;; (37 80 68 70 45 49 46 49 13 10 37 65533 65533 65533 65533 13 10 49 32 48 32 111 98 106 13 10 60 60 32 13)
Let's see what happens:
(defn decode [to-decode]
  (let [byts    (.getBytes to-decode "UTF-8")
        decoded (.decode (java.util.Base64/getDecoder) byts)]
    decoded))
(into [] (decode "JVBERi0xLjENCiXi48/TDQoxIDAgb2JqDQo8PCAN"))
;; [37 80 68 70 45 49 46 49 13 10 37 -30 -29 -49 -45 13 10 49 32 48 32 111 98 106 13 10 60 60 32 13]
See the negatives? Let's try to fix that (the offset has to be 256, not 255, to map a signed byte back to its unsigned value):
(into [] (char-array (map #(if (neg? %) (char (+ 256 %)) (char %)) (decode "JVBERi0xLjENCiXi48/TDQoxIDAgb2JqDQo8PCAN"))))
;; [\% \P \D \F \- \1 \. \1 \return \newline \% \â \ã \Ï \Ó \return \newline \1 \space \0 \space \o \b \j \return \newline \< \< \space \return]
And if we turn this into a string, we get what emacs gave us:
(String. (char-array (map #(if (neg? %) (char (+ 256 %)) (char %)) (decode "JVBERi0xLjENCiXi48/TDQoxIDAgb2JqDQo8PCAN"))))
;; "%PDF-1.1\r\n%âãÏÓ\r\n1 0 obj\r\n<< \r"
Related
I have a source data file on which I have been using JREPL.BAT to perform some very simple search-and-replace operations quite successfully. I now need to expand on that to do two jobs:
1. Remove all lines that start with the string "appX_formContent". This line also contains a lot of HTML output; it all needs to be deleted on that line.
2. Remove all lines that start with "Hex Payload:", along with the subsequent line that comes with it.
This is an example of the input data file, showing two records. The delimiter between records is the row that contains "-----------------".
-----------------
Message Headers
JMSCorrelationID: 60bb7750-e9e2-11e9-98bb-42010a961307
JMSPriority: 4
JMSRedelivered: false
Message Properties
app_trackingId: 190990a2-d8d8-43eb-814a-36ceba7a9111
appX_formInstanceIdentifier: FRM389083
appX_formContent: {"data":{"C7d14a6eb-70e7-402d-9d6e-4efd01ba561c":"N","Y","test.</p>\n<p>test form data to be informed </p>\n<p>...............</p>\n<p><strong>Update</strong></p>\n<p><strong>years</strong>"<p>supervision</p>","<p>:true,"c9377ae2-901d-4461-929c-c76e26dc6183":false}}}
app_sourceSystemId: source
app_eventCode: FORM_OUTPUT
app_instigatingUserId: 66
JMSXGroupSeq: 0
Hex Payload:
25 50 44 46 2D 31 2E 35 0D 0A 34 20 30 20 6F 62 6A 0D 0A 3C 3C 2F 54
-----------------
Message Headers
JMSCorrelationID: 641a80d0-e9e2-11e9-98bb-42010a961307
JMSPriority: 4
JMSTimestamp: 2019 10 08 16:43:40
JMSRedelivered: false
Message Properties
app_trackingId: a3c2fe93-ef71-4611-9605-9858ff67a6e8
appX_formInstanceIdentifier: FRM388843
appX_formContent: {"data":{"C7d14a6eb-70e7-402d-9d6e-4efd01ba561c":"N","Y","test.</p>\n<p>test form data to be informed </p>\n<p>...............</p>\n<p><strong>Update</strong></p>\n<p><strong>years</strong>"<p>supervision</p>","<p>:true,"c9377ae2-901d-4461-929c-c76e26dc6183":false}}}
app_sourceSystemId: source
app_eventCode: FORM_OUTPUT
app_instigatingUserId: 433
JMSXGroupSeq: 0
Hex Payload:
25 50 44 46 2D 31 2E 35 0D 0A 34 20 30 20 6F 62 6A 0D 0A 3C 3C 2F
-----------------
This is the batch file that I use to call jrepl - very simple:
call jrepl ".*(?:appX_formContent: .*)" "" /m /f "inpu.txt" /o "output.txt"
I've only tried to remove the appX_formContent line with the regex, but it isn't producing any output. I'm not good with regex, so help is appreciated.
I'm also not sure how to handle the second task of deleting the Hex Payload: line.
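For what it's worth, the filtering logic the two tasks describe (drop one kind of line outright; drop another kind plus the line after it) can be sketched in a few lines of Python. This is not JREPL, just an illustration of the rules, with the field names taken from the sample above:

```python
def filter_dump(lines):
    """Drop 'appX_formContent:' lines, and drop 'Hex Payload:' lines
    together with the single line that follows each of them."""
    out = []
    skip_next = False
    for line in lines:
        if skip_next:
            skip_next = False
            continue
        if line.startswith("appX_formContent:"):
            continue
        if line.startswith("Hex Payload:"):
            skip_next = True     # also swallow the hex line below it
            continue
        out.append(line)
    return out

sample = [
    "Message Headers",
    'appX_formContent: {"data":{...}}',
    "app_eventCode: FORM_OUTPUT",
    "Hex Payload:",
    "25 50 44 46 2D 31 2E 35",
    "-----------------",
]
print(filter_dump(sample))
# → ['Message Headers', 'app_eventCode: FORM_OUTPUT', '-----------------']
```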
I am messing around with the regex2dfa library (https://github.com/kpdyer/regex2dfa), using the command ./regex2dfa -r "(abc+)+"
This returns
0 2 97 97
1 3 99 99
2 1 98 98
3 2 97 97
3 3 99 99
3
Looking at this:
https://lambda.uta.edu/cse5317/spring01/notes/node8.html
and using the DFA generated for the regex (abc+)+ here:
http://hackingoff.com/compilers/regular-expression-to-nfa-dfa
I can't seem to figure out how to get from the diagram to the transition table(?) that the regex2dfa tool is outputting.
What am I missing?
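For what it's worth, if each output line is read as one transition, "source state, destination state, low byte, high byte", with the trailing lone number listing an accepting state and state 0 as the start (my reading of the output, not something I've confirmed against the regex2dfa docs), the table can be executed directly:

```python
# Transition table as printed by regex2dfa for (abc+)+, read as
# "src dst low high" per line; the final lone "3" is the accepting state.
# 97, 98, 99 are the byte values of 'a', 'b', 'c'.
table = [(0, 2, 97, 97),
         (1, 3, 99, 99),
         (2, 1, 98, 98),
         (3, 2, 97, 97),
         (3, 3, 99, 99)]
accepting = {3}

def accepts(s):
    state = 0                        # assume state 0 is the start state
    for ch in s:
        for src, dst, lo, hi in table:
            if src == state and lo <= ord(ch) <= hi:
                state = dst
                break
        else:
            return False             # no matching transition: reject
    return state in accepting

print([w for w in ["abc", "abcc", "abcabc", "ab", "abca"] if accepts(w)])
# → ['abc', 'abcc', 'abcabc']
```

Under that reading, the table matches the language of (abc+)+, so the diagram and the table describe the same DFA, just with different state numbering.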
I am doing some research on image compression via discrete cosine transformations, and I want to change the quantization table sizes so that I can study what happens when I change the sizes of the sub-matrices that I divide my pictures into. The standard sub-matrix size is 8x8, and there are a lot of tables for those dimensions. For example, the standard JPEG quantization table (that I use) is:
standardmatrix8 = np.matrix('16 11 10 16 24 40 51 61;\
12 12 14 19 26 58 60 55;\
14 13 16 24 40 57 69 56;\
14 17 22 29 51 87 80 62;\
18 22 37 56 68 109 103 77;\
24 35 55 64 81 104 103 92;\
49 64 78 77 103 121 120 101;\
72 92 95 98 112 100 103 99').astype('float')
I have assumed that the quantization tables for 2x2 and 4x4 would be:
standardmatrix2 = np.matrix('16 11;\
                             12 12').astype('float')
standardmatrix4 = np.matrix('16 11 10 16;\
                             12 12 14 19;\
                             14 13 16 24;\
                             18 22 37 56').astype('float')
since the entries in the standard table correspond to the same frequencies in the smaller matrices.
But what about quantization tables with dimensions 16x16, 24x24, and so on? I know that the standard quantization tables are worked out by experiment and can't be calculated from some formula, but I assume that someone has tried changing the matrix sizes before me! Where can I find these tables? Or can I just make something up and scale the last entries to higher frequencies?
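I don't know of published tables for other sizes, but one ad-hoc option (an assumption on my part, not anything from the JPEG standard) is to resample the 8x8 table so each entry of the new table takes the value at the same relative frequency position. Note this differs from the corner-cropping guess above, which keeps only the lowest frequencies:

```python
import numpy as np

# The standard JPEG luminance table from above, as a plain array
q8 = np.array([[16, 11, 10, 16, 24, 40, 51, 61],
               [12, 12, 14, 19, 26, 58, 60, 55],
               [14, 13, 16, 24, 40, 57, 69, 56],
               [14, 17, 22, 29, 51, 87, 80, 62],
               [18, 22, 37, 56, 68, 109, 103, 77],
               [24, 35, 55, 64, 81, 104, 103, 92],
               [49, 64, 78, 77, 103, 121, 120, 101],
               [72, 92, 95, 98, 112, 100, 103, 99]], dtype=float)

def resample_table(q, n):
    """Nearest-neighbour resample of an 8x8 table to n x n: entry (i, j)
    takes the 8x8 value at the same relative position."""
    idx = (np.arange(n) * 8) // n        # map 0..n-1 onto 0..7
    return q[np.ix_(idx, idx)]

q4 = resample_table(q8, 4)     # samples rows/cols 0, 2, 4, 6
q16 = resample_table(q8, 16)   # each 8x8 entry repeated as a 2x2 block
```

Whether nearest-neighbour or a smoother interpolation is more faithful to the psychovisual experiments behind the 8x8 table is exactly the kind of thing the research would have to test.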
I am learning from the Little SAS Book. Below is some code from the book, and the raw data. The issue is that when I run it, the final data set keeps missing the record at the end of each line, i.e., it keeps missing 75 and 56 and labels them as missing ("."). Could anyone point out where the problem could be? When I add spaces after 75 and 56 at the line ends, the problem is gone.
DATA class;
INFILE 'c:\MyRawData\Scores.dat';
INPUT Score ##;
RUN;
PROC UNIVARIATE DATA = class;
VAR Score;
TITLE;
RUN;
Data in that file:
56 78 84 73 90 44 76 87 92 75
85 67 90 84 74 64 73 78 69 56
87 73 100 54 81 78 69 64 73 65
after run it shows more like
56 78 84 73 90 44 76 87 92 .
85 67 90 84 74 64 73 78 69 .
87 73 100 54 81 78 69 64 73 65
My suspicion is that you have something wrong with your line endings: either you have a spurious character, or your end-of-line isn't correct in some fashion. Most likely you are using a Windows file and you are running on Unix, so you have
75CRLF85
and since Unix uses only LF as the line terminator, it sees "75CR" endofline "85", not "75" endofline "85" like it should.
In that case you can either do what you did - add a space, though that will likely still leave some 'blank' records in there - or use TERMSTR= in your INFILE statement to tell SAS how to properly read the file in.
Otherwise, you may have some spurious end characters - for example, if you pasted this from the web, it's possible you have a non-breaking space that is not converted to a regular space.
You can find out by doing this:
data _null_;
infile 'c:\rawdata\myfile.dat';
input #;
put _infile_ $HEX60.;
run;
The 60 is 2x the length of the line. That tells you what SAS is seeing. What you should see:
3536203738203834203733203930203434203736203837203932203735
3835203637203930203834203734203634203733203738203639203536
383720373320313030203534203831203738203639203634203733203635
Digits in ASCII are 30 (hex) plus the digit, so 35 is a 5, 36 is a 6, etc. A space is 20. The first line:
35|36|20|37|38|20|38|34|20|37|33|20| ...
so 5 6 space 7 8 space 8 4 space 7 3 space. If you see something else after the 37 35, then you know there is a problem. You might see any of the following:
0A = Line feed.
0D = Carriage return.
A0 = Nonbreaking (web) space.
There are lots of other things you could see, but those are the most likely to trip you up. Pasting from the web is often a problem.
I have a spreadsheet in Calc with some records. There is a column that contains the following information:
Ecole Saint-Exupery
Rue Saint-Malo 24
67544 Paris
Well, I need to have those lines divided into at least three columns:
name: Ecole Saint-Exupery
street: Rue Saint-Malo 24
postal code and town: 67544 Paris
Or even better - with the postal code and town divided into two separate columns!?
Question: is this possible? Can (or should) I do this in Calc (OpenDocument format)?
Do I need to use a regex and Perl, or can I solve this issue without a regex?
Note - in the end I need to transfer the data into a MySQL database...
I look forward to a tip...
greetings
BTW: you can see all of this in a real-world live demo: http://192.68.214.70/km/asps/schulsuche.asp?q=a&a=50&s=1750 - see the fields
Schulname
Straße
PLZ Ort
These fields contain three things - the name, the street, and the postal code plus town!
Question: can this be divided into parts!? If you copy and paste the information and drop it into Calc, you get all the information in only one cell. How do I divide and separate all this information into three cells, or even four?
BTW - I tried to translate the information to hex code - see the following:
Staatl. Realschule Grafenau
Rachelweg 20
94481 Grafenau
00000000: 53 74 61 61 74 6C 2E 20 52 65 61 6C 73 63 68 75
00000010: 6C 65 20 47 72 61 66 65 6E 61 75 20 0A 52 61 63
00000020: 68 65 6C 77 65 67 20 32 30 0A 39 34 34 38 31 20
00000030: 20 47 72 61 66 65 6E 61 75 20 20
but I do not know if this helps here!??
Can you help me to solve the problem? Do I need a regex!?
Many thanks in advance for any and all help!
You may not need a regex. You should be able to take the contents of the cell in question and split it up on the newline characters that are present. I am not familiar with Calc, but if there is a split() or explode() function that returns an array, then splitting on a newline will yield the three pieces you are looking for.
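If the split ends up happening outside Calc anyway (say, while preparing the rows for MySQL), it is only a few lines in Python. The postal-code/town separation below assumes the code is always the first whitespace-separated token of the last line:

```python
def split_address(cell: str):
    """Split one multi-line Calc cell into name, street, postal code, town.
    Assumes exactly three lines, and that the postal code is the first
    whitespace-separated token of the last line."""
    name, street, place = [line.strip() for line in cell.strip().split("\n")]
    postal_code, town = place.split(maxsplit=1)   # "67544 Paris" -> two parts
    return {"name": name, "street": street,
            "postal_code": postal_code, "town": town}

cell = "Ecole Saint-Exupery\nRue Saint-Malo 24\n67544 Paris"
print(split_address(cell))
# → {'name': 'Ecole Saint-Exupery', 'street': 'Rue Saint-Malo 24',
#    'postal_code': '67544', 'town': 'Paris'}
```

The strip() calls also take care of the trailing spaces visible in the hex dump above (the 20 20 bytes after "Grafenau").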