Consider the set of all bit arrays of length n. Now consider the set of all 1-to-1 functions that map from this set to this set.
Now select a single function out of the latter set. Is there any algorithm to find a "minimal" method of implementing this function? Assume that we only have access to fundamental bit array operators such as AND, OR, XOR, NOT, and left and right bitshifts.
In case you're wondering, the reason I want this is because I'm writing an algorithm to convert from z-curve ordering of bits to hilbert-curve ordering of bits. My current method is to make a lookup table, but I bet there's a better option.
As a simple example, let's say I have a truth table that looks like this:
00 -> 10
01 -> 01
10 -> 00
11 -> 11
Then I should be able to infer that, given an input bit string input, the output bit string output is (in java syntax)
output = ((~input) << 1) ^ input
Here's the proof in this case (input -> ~input -> (~input) << 1 -> output, keeping only the low two bits at each step):
00 -> 11 -> 10 -> 10
01 -> 10 -> 00 -> 01
10 -> 01 -> 10 -> 00
11 -> 00 -> 00 -> 11
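The hand check above can also be done mechanically. The following is a minimal Python sketch, not a minimization algorithm: it only verifies that the candidate expression reproduces the truth table. The names `TRUTH_TABLE`, `candidate`, and `matches_table` are made up for illustration, and the explicit mask is an assumption standing in for "bit array of length n" (a Java `int` would need the same `& 0b11`).

```python
# Truth table from the example: 2-bit input -> 2-bit output.
TRUTH_TABLE = {0b00: 0b10, 0b01: 0b01, 0b10: 0b00, 0b11: 0b11}


def candidate(x: int, n: int = 2) -> int:
    """Evaluate ((~x) << 1) ^ x, truncated to n bits."""
    mask = (1 << n) - 1  # assumption: only the low n bits are kept
    return (((~x) << 1) ^ x) & mask


def matches_table(table: dict, n: int = 2) -> bool:
    """True if the candidate expression agrees with every table row."""
    return all(candidate(x, n) == y for x, y in table.items())


print(matches_table(TRUTH_TABLE))  # True
```

A brute-force search for a "minimal" expression could reuse `matches_table` as its acceptance test, enumerating expressions in order of increasing size.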
I'm trying to compare two hex values as in the code at the bottom. I was expecting that the IF statement, where I compare field A and field WS-FLD-X, would result in true, but it doesn't do it.
In other words, when I move 12 to WS-FLD, the value seen through WS-FLD-X should be stored as X'0C', right? I expect this to be the same value as in field A, so comparing the two should result in true, but that is not happening.
Why? What is the difference between the value held in field A and the value in WS-FLD-X?
IDENTIFICATION DIVISION.
PROGRAM-ID. HELLO-WORLD.
DATA DIVISION.
WORKING-STORAGE SECTION.
01 FF.
05 A PIC XX.
05 B PIC XXXXXX.
01 F.
05 WS-FLD PIC S9(4) COMP.
05 WS-FLD-X REDEFINES WS-FLD PIC XX.
PROCEDURE DIVISION.
DISPLAY 'Hello, world' UPON CONSOLE.
MOVE X'0C' TO A.
MOVE "SOME TEXT" TO B.
DISPLAY FF UPON CONSOLE.
MOVE 12 TO WS-FLD
DISPLAY "HEX OF 12 IS:" WS-FLD-X UPON CONSOLE.
IF WS-FLD-X = A THEN DISPLAY "SAME" UPON CONSOLE END-IF.
You moved a single byte to a two-byte field, so the MOVE padded the value on the right with a space (per Simon).
MOVE X'0C' TO A. *> A now contains X'0C20' (with an ASCII space), which is not 12
You'd need to move both bytes to keep the value of the number intact.
MOVE X'000C' TO A.
The program now displays 'SAME'.
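The padding effect can be reproduced outside COBOL. This is a rough Python sketch under two assumptions that are not stated by the compiler output: the pad character is an ASCII space (0x20), and the two-byte COMP field is stored big-endian, as the working fix X'000C' suggests. The helper `move_alphanumeric` is hypothetical and only imitates the left-justify-and-pad rule.

```python
def move_alphanumeric(value: bytes, width: int, pad: bytes = b" ") -> bytes:
    """Imitate an alphanumeric MOVE: left-justify, pad right with spaces, truncate."""
    return value.ljust(width, pad)[:width]


field_a_short = move_alphanumeric(b"\x0c", 2)      # MOVE X'0C' TO A
field_a_full = move_alphanumeric(b"\x00\x0c", 2)   # MOVE X'000C' TO A
# Assumption: WS-FLD (PIC S9(4) COMP) holds 12 as two big-endian bytes.
ws_fld_x = (12).to_bytes(2, "big", signed=True)

print(field_a_short.hex(), field_a_full.hex(), ws_fld_x.hex())  # 0c20 000c 000c
print(field_a_short == ws_fld_x, field_a_full == ws_fld_x)      # False True
```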
The problem: create a function with one input. Return the index of an array containing the fibonacci sequence (starting from 0) whose element matches the input to the function.
def fib(n)
  return 0 if n == 0

  last = 0u128
  current = 1u128

  (n - 1).times do
    last, current = current, last + current
  end

  current
end

def usage
  progname = String.new(ARGV_UNSAFE.value)

  STDERR.puts <<-H
    #{progname} <integer>
    Given Fibonacci; determine which fib value would
    exist at <integer> index.
    H

  exit 1
end

if ARGV.empty?
  usage
end

begin
  i = ARGV[0].to_i
  puts fib i
rescue e
  STDERR.puts e
  usage
end
My solution to the problem is in no way elegant and I did it at 2AM when I was quite tired. So I'm not looking for a more elegant solution. What I am curious about is that if I run the resultant application with an input larger than 45 then I'm presented with Arithmetic overflow. I think I've done something wrong with my variable typing. I ran this in Ruby and it runs just fine so I know it's not a hardware issue...
Could someone help me find what I did wrong in this? I'm still digging, too. I just started working with Crystal this week. This is my second application/experiment with it. I really like it, but I am not yet aware of some of its idiosyncrasies.
EDIT
Updated script to reflect suggested change and outcome of runtime from said change. With said change, I can now run the program successfully with inputs above 45, but only up to about the low 90s. So that's interesting. I'm gonna run through this and see where I may need to add additional explicit casting. It seems very unintuitive that changing the type at the time of initialization didn't "stick" through the entire runtime, which I tried first and that failed. Something doesn't make sense here to me.
Original Results
$ crystal build fib.cr
$ ./fib 45
1836311903
$ ./fib 46
Arithmetic overflow
$ ./fib.rb 460
985864329041134079854737521712801814394706432953315\
510410398508752777354792040897021902752675861
Latest Results
$ ./fib 92
12200160415121876738
$ ./fib 93
Arithmetic overflow
./fib <integer>
Given Fibonacci; determine which fib value would
exist at <integer> index.
Edit ^2
Now also decided that maybe ARGV[0] is the problem. So I changed the call to f() to:
begin
  i = ARGV[0].to_u64.as(UInt64)
  puts f i
rescue e
  STDERR.puts e
  usage
end
and added a debug print to show the types of the variables in use:
return 0 if p == 0

puts "p: %s\tfib_now: %s\tfib_last: %s\tfib_hold: %s\ti: %s" % [typeof(p), typeof(fib_now), typeof(fib_last), typeof(fib_hold), typeof(i)]
loop do
p: UInt64 fib_now: UInt64 fib_last: UInt64 fib_hold: UInt64 i: UInt64
Arithmetic overflow
./fib <integer>
Given Fibonacci; determine which fib value would
exist at <integer> index.
Edit ^3
Updated with latest code after bug fix solution by Jonne. Turns out the issue is that I'm hitting the limits of the structure even with 128 bit unsigned integers. Ruby handles this gracefully. Seems that in crystal, it's up to me to gracefully handle it.
The default integer type in Crystal is Int32, so if you don't explicitly specify the type of an integer literal, you get that.
In particular the lines
fib_last = 0
fib_now = 1
turn the variables into the effective type Int32. To fix this, make sure you specify the type of these integers, given you don't need negative numbers, UInt64 seems most appropriate here:
fib_last = 0u64
fib_now = 1u64
Also note the literal syntax I'm using here. Your 0.to_i64 calls create an Int32 and then an Int64 out of that. The compiler will be smart enough to do this conversion at compile time in release builds, but I think it's nicer to just use the literal syntax.
Edit answering the updated question
Fibonacci is defined as F(0) = 0, F(1) = 1, F(n) = F(n-2) + F(n-1), so 0, 1, 1, 2, 3, 5.
Your algorithm is off by one. It calculates F(n+1) for a given n > 1, in other words 0, 1, 2, 3, 5; in yet other words, it basically skips F(2).
Here's one that does it correctly:
def fib(n)
  return 0 if n == 0

  last = 0u64
  current = 1u64

  (n - 1).times do
    last, current = current, last + current
  end

  current
end
This correctly gives 7540113804746346429 for F(92) and 12200160415121876738 for F(93). However, it still overflows for F(94) because that would be 19740274219868223167, which is bigger than 2^64 = 18446744073709551616, so it doesn't fit into UInt64. To clarify once more, your version tries to calculate F(94) when being asked for F(93), hence you get it "too early".
So if you want to support calculating F(n) for n > 93, then you need to venture into the experimental Int128/UInt128 support or use BigInt.
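The limits quoted above can be cross-checked with arbitrary-precision arithmetic. The sketch below uses Python, whose integers do not overflow, purely as a calculator; `fib` and `largest_index_fitting` are illustrative names, and the three maxima are the usual ranges of signed 32-bit and unsigned 64/128-bit integers.

```python
def fib(n: int) -> int:
    """Return F(n) with F(0) = 0 and F(1) = 1; Python ints never overflow."""
    last, current = 0, 1
    for _ in range(n):
        last, current = current, last + current
    return last


def largest_index_fitting(max_value: int) -> int:
    """Largest n such that F(n) <= max_value."""
    n = 0
    while fib(n + 1) <= max_value:
        n += 1
    return n


LIMITS = {
    "Int32": 2**31 - 1,
    "UInt64": 2**64 - 1,
    "UInt128": 2**128 - 1,
}
for name, limit in LIMITS.items():
    print(name, largest_index_fitting(limit))
```

Note that fib(46) is 1836311903, the value the original program printed for input 45, which is consistent with the off-by-one described above.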
I think one more thing should be mentioned to explain the Ruby/Crystal difference, besides the fact that integer literals default to Int32.
In Ruby, a dynamically typed interpreted language, there is no concept of variable type, only value type. All variables can hold values of any type.
This allows it to transparently turn a Fixnum into a Bignum behind the scenes when it would overflow.
Crystal, on the contrary, is a statically typed compiled language. It looks and feels like Ruby thanks to type inference and type unions, but the variables themselves are typed.
This allows it to catch a large number of errors at compile time and run Ruby-like code at C-like speed.
I think, but don't take my word for it, that Crystal could in theory match Ruby's behavior here, but it would be more trouble than good. It would require all operations on integers to return a type union with BigInt, at which point, why not leave the primitive types alone, and use big integers directly when necessary.
Long story short, if you need to work with very large integer values beyond what a UInt128 can hold, require "big" and declare the relevant variables as BigInt.
edit: see also here for extreme cases, apparently BigInts can overflow too (I never knew) but there's an easy remedy.
I used qpdf to uncompress a PDF file and below is the output. You can see that both an Encoding and a ToUnicode map are present. If there is only a ToUnicode map, I know how to map individual characters with the CMap. But the content stream looks like this:
Tf
0.999402 0 0 1 71.9995 759.561 Tm
[()-2.11826()-1.14177()2.67786()-2.11826()8.55269()-5.44998()-4.70186()2.67786()-2.32338()2.67786()12.679( )-3.75591()9.73429()]TJ
Inside the parentheses there is some garbage data that is not visible. So how do I link this data to the CMap?
Another question: in /Encoding, what do the values in the Differences array mean?
10 0 obj
<< /BaseEncoding /WinAnsiEncoding /Differences [ 1 /g100 /g28 /g94 /g3 /g87 /g24 /g38 /g47 /g62 ] /Type /Encoding >>
Even if I pass the values of the Differences array one by one into the FreeType function FT_Get_Name_Index, it returns values like [ 100 28 94 3 87 24 38 47 62 ].
What are those values? How do I map them?
Here is the PDF.
Run the following command:
qpdf --stream-data=uncompress input.pdf output.text
output.text
I got the same output when I passed the content stream data through zlib. Kindly check the output.txt file from the link.
Firstly the general question
how to exract the text in pdf if encoding and ToUnicode both are present in pdf? how to map it?
[...] if you see there are encoding and ToUnicode both are present in pdf. i know if only ToUnicode is there so how to map individual char with Cmap file.
In such a case, i.e. when you have both a sufficiently complete and correct ToUnicode map and an Encoding for a font, you can ignore the Encoding and only use the ToUnicode map.
This follows from the PDF specification, which in section 9.10.2 "Mapping Character Codes to Unicode Values" states that the method with the highest priority for mapping a character code to a Unicode value is
If the font dictionary contains a ToUnicode CMap (see 9.10.3, "ToUnicode CMaps"), use that CMap to convert the character code to Unicode.
Thus, if you (as you say) already know how to extract text if there only is a ToUnicode map, you can use the same algorithm unchanged. And as a corollary, if that doesn't work, the ToUnicode map in question is insufficiently complete or incorrect, or your knowledge itself on how to extract text using only a ToUnicode map actually is incomplete.
Secondly the sample document
You wrote
[()-2.11826()-1.14177()2.67786()-2.11826()8.55269()-5.44998()-4.70186()2.67786()-2.32338()2.67786()12.679( )-3.75591()9.73429()]TJ
in break-at there are some garbag data that is not visible. so how to link data to cmap file ?
In the brackets there are the values identifying your glyphs, so they definitely are not garbage.
Thus, here are the byte values from within the brackets:
[(
01
)-2.11826(
02
)-1.14177(
03
)2.67786(
01
)-2.11826(
04
)8.55269(
05
)-5.44998(
06
)-4.70186(
07
)2.67786(
04
)-2.32338(
07
)2.67786(
08
)12.679(
09
)-3.75591(
02
)9.73429(
04
)]TJ
Using the ToUnicode map of the font in question
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CMapType 2 def
1 begincodespacerange
<00><ff>
endcodespacerange
9 beginbfrange
<01><01><0054>
<02><02><0045>
<03><03><0053>
<04><04><0020>
<05><05><0050>
<06><06><0044>
<07><07><0046>
<08><08><0049>
<09><09><004c>
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end end
the byte values from within the brackets map to:
01 0054 "T"
02 0045 "E"
03 0053 "S"
01 0054 "T"
04 0020 " "
05 0050 "P"
06 0044 "D"
07 0046 "F"
04 0020 " "
07 0046 "F"
08 0049 "I"
09 004c "L"
02 0045 "E"
04 0020 " "
Thus,
"TEST PDF FILE "
which matches the rendered file just fine:
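The lookup walked through above can be sketched in a few lines of Python. This is a deliberately narrow illustration, not a general CMap parser: it assumes single-byte codes, only the `<lo><hi><dst>` form of bfrange entries, and destinations that are single BMP code points. The names `BFRANGE`, `parse_bfrange`, and `decode` are made up for the example.

```python
import re

# bfrange section copied from the ToUnicode CMap of the sample font.
BFRANGE = """
<01><01><0054>
<02><02><0045>
<03><03><0053>
<04><04><0020>
<05><05><0050>
<06><06><0044>
<07><07><0046>
<08><08><0049>
<09><09><004c>
"""

ENTRY = re.compile(r"<([0-9a-fA-F]+)><([0-9a-fA-F]+)><([0-9a-fA-F]+)>")


def parse_bfrange(section: str) -> dict:
    """Map every code in each <lo><hi><dst> range to its Unicode character."""
    mapping = {}
    for lo, hi, dst in ENTRY.findall(section):
        start, end, base = int(lo, 16), int(hi, 16), int(dst, 16)
        for offset in range(end - start + 1):
            mapping[start + offset] = chr(base + offset)
    return mapping


def decode(codes: bytes, mapping: dict) -> str:
    """Translate single-byte character codes; unknown codes become U+FFFD."""
    return "".join(mapping.get(code, "\ufffd") for code in codes)


TO_UNICODE = parse_bfrange(BFRANGE)
# Byte values taken from the strings inside the TJ array.
CODES = bytes([1, 2, 3, 1, 4, 5, 6, 7, 4, 7, 8, 9, 2, 4])
print(repr(decode(CODES, TO_UNICODE)))  # 'TEST PDF FILE '
```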
Thirdly the encoding
and one another question is that in /Encoding what are values contain in Difference ?
10 0 obj << /BaseEncoding /WinAnsiEncoding /Differences [ 1 /g100 /g28 /g94 /g3 /g87 /g24 /g38 /g47 /g62 ] /Type /Encoding >>
According to the PDF specification,
The value of the Differences entry shall be an array of character codes and character names organized as follows:
code_1 name_1,1 name_1,2 …
code_2 name_2,1 name_2,2 …
…
code_n name_n,1 name_n,2 …
Each code shall be the first index in a sequence of character codes to be changed. The first character name after the code becomes the name corresponding to that code. Subsequent names replace consecutive code indices until the next code appears in the array or the array ends. These sequences may be specified in any order but shall not overlap.
Thus, the encoding entry in your case says that the encoding basically is WinAnsiEncoding with the difference that the codes 1, ..., 9 instead represent the glyphs named /g100, /g28, /g94, /g3, /g87, /g24, /g38, /g47, and /g62 respectively.
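The rule quoted from the specification can be illustrated with a short Python sketch. It assumes the Differences array has already been parsed into a list in which numbers are `int` and glyph names are `str` without the leading slash; `expand_differences` is a hypothetical helper, not part of any PDF library.

```python
def expand_differences(differences: list) -> dict:
    """Expand a /Differences array into {character code: glyph name}."""
    mapping = {}
    code = None
    for item in differences:
        if isinstance(item, int):
            code = item  # a number restarts the run at this code
        else:
            if code is None:
                raise ValueError("a glyph name appeared before any code")
            mapping[code] = item
            code += 1  # subsequent names take consecutive codes
    return mapping


# The Differences array of object 10 0 from the sample file.
DIFFERENCES = [1, "g100", "g28", "g94", "g3", "g87", "g24", "g38", "g47", "g62"]
for code, glyph in expand_differences(DIFFERENCES).items():
    print(code, glyph)
```

The result only names glyphs. As explained in the text, names like g100 carry no Unicode meaning, so this table does not help with text extraction.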
As these glyph names are not standard glyph names, the PDF specification does not consider this encoding helpful for text extraction because it only describes a method for a simple font
that has an encoding whose Differences array includes only character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font (see Annex D)
The "/gXX" names in your sample clearly are not among them.
It's worth observing that most of the time the /Encoding map goes from character codes (the encoded bytes of a string) to CIDs, where a CID (Character ID) in most font types corresponds to a glyph index/identifier. The exception appears to be Type2 fonts, which have separate CID and GID (Glyph ID) concepts and supply a /CIDToGIDMap to convert between them. In these cases the /Encoding map has nothing to do with decoding a Unicode representation of the string.

To decode the Unicode representation you definitely should use /ToUnicode when available, as pointed out by mkl. If it is not available, you either have a predefined encoding (optionally with a /Differences array) or CMap, or the font program supplies an implicit encoding, as in Type1 fonts. This is all stated in mkl's very good answer as well.

/Encoding can correspond to the map between character codes and Unicode code points when it is a predefined encoding (like MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding; I have also seen possibly non-compliant use of Identity-H, which is a predefined CMap name, not a predefined encoding), or in a supposedly malformed font. In this regard the PDF reference/standard is often confusing about what is legal and what is not, so a library decoding encoded strings in PDF should always be as lenient as possible. The PDF reference/standard itself is also not very clear in explaining the distinction between character codes, CIDs, GIDs, and Unicode representations.
I found this code which does exactly what I want: it takes a list of integers and generates every permutation.
perm([H|T], Perm) :-
perm(T, SP),
insert(H, SP, Perm).
perm([], []).
insert(X, T, [X|T]).
insert(X, [H|T], [H|NT]) :-
insert(X, T, NT).
Now what I want to do is, if one permutation does not meet some criteria, have perm return another result. So, and sorry for the lack of vocabulary, I want the same effect as if I executed that code, got a solution, and typed ; to get more results. I believe this is a very simple idea but I can't see it right now.
So, pseudocode would be:
enumerate(InputList, OutputNodes, OutputArcs) :-
    perm(InputList, OutputPermutation),
    % I want to build OutputArcs, then check that every element is unique.
    % If it isn't, generate another list with perm; if it is, return said list as accepted.
    getArcs(OutputPermutation, OutputArcs),
    % TODO: this is where I do not know how to make the call.
    % If it is valid, end; if it isn't, call perm again.
    areArcsNumberUniques(OutputArcs, OutputArcs).
So I would need to understand how do I go about this. Also, any other ideas about the problem are welcome, since I'm brute forcing my way because I'm unable to find any type of algorithm or pattern to solve the actual problem (which I've asked about before. This is my attempted solution, just in order to give an actual answer to the exercise...)
edit: query:
enumerate([a-b,b-c], EnumNodos, EnumArcos).
expected output:
EnumNodos = [enum(3,a), enum(1,b), enum(2,c)],
EnumArcos = [enum(2,a-b), enum(1,b-c)]
This would be the end goal: I get a list of arcs where each arc has a unique value equal to the difference between the values of its two nodes (every node also has a unique value).
So far, since I did not find any way to do this algorithmically, I thought about trying every possibility (basically I cannot find a single way to do this, trees with different branches seem different to me, and the only restriction is that there are N nodes and N-1 arcs).
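The "try every possibility" idea is a generate-and-test search. In Prolog, a goal that fails after perm/2 makes the engine backtrack into perm/2 for the next permutation, which is the same effect as typing ; at the prompt. The Python sketch below imitates that loop explicitly. It is only an illustration under assumptions: node values are 1..N, an arc's value is the absolute difference of its nodes' values, and `enumerate_tree` is a made-up name. It returns the first valid labelling it finds, which need not be the one listed in the expected output.

```python
from itertools import permutations


def enumerate_tree(arcs):
    """Brute-force search; returns (node_labels, arc_labels) or None."""
    # Collect nodes in order of first appearance.
    nodes = []
    for a, b in arcs:
        for node in (a, b):
            if node not in nodes:
                nodes.append(node)

    # Generate: every assignment of 1..N to the nodes.
    for labels in permutations(range(1, len(nodes) + 1)):
        node_labels = dict(zip(nodes, labels))
        arc_labels = {
            (a, b): abs(node_labels[a] - node_labels[b]) for a, b in arcs
        }
        # Test: accept only if all arc values are distinct.
        if len(set(arc_labels.values())) == len(arcs):
            return node_labels, arc_labels
    return None  # no labelling satisfies the constraint


print(enumerate_tree([("a", "b"), ("b", "c")]))
```

The search is factorial in the number of nodes, so it is only practical for small trees like the examples here.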
edit more examples:
    6a
  5    4
1b      2e
2       3
3c      5f
1
4d
EnumNodos = [enum(6,a), enum(1,b), enum(2,e), enum(3,c), enum(5,f), enum(4,d)],
EnumArcos = [enum(5,a-b), enum(4,a-e), enum(3,e-f), enum(2,b-c), enum(1,c-d)]
    5a
  4    3
1b      2e
1       2
3c      4f
EnumNodos = [enum(5,a), enum(1,b), enum(2,e), enum(3,c), enum(4,f)],
EnumArcos = [enum(4,a-b), enum(3,a-e), enum(1,b-c), enum(2,e-f)]
    5a
  4    3
1b      2e
2
3c
1
4d

    9a
  8    7
1b      2e
6       4
7c      6f
2       2
5d      4g
3       1
8h      3i
I'm trying to use the indices of a sorted column of a dataset. I want to reorder the entire dataset by one sorted column.
area.sort<-sort(xsample$area1, index.return=TRUE)[2]
The output is a list, so I can't use it to index the whole dataset.
Error in xj[i] : invalid subscript type 'list'
Someone suggested using unlist but I can't get rid of the ix*.
Any ideas? Thanks
> area.sort<-unlist(area.sort)
ix1 ix2 ix3 ix4 ix5 ix6 ix7 ix8 ix9 ix10 ix11 ix12 ix13
45 96 92 80 53 54 24 21 63 81 40 66 64
The call to sort with index.return=TRUE returns a list with two components: x and ix. Indexing with [2] returns a subset of the list - still a list.
If you index using [[2]] it should work better. That returns the element in the list.
But indexing using $ix is perhaps a bit clearer.
But then again, if you only need the sorted indices, you should call order instead of sort...
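The distinction between sorting the values and obtaining the ordering indices can be sketched in Python, used here only for illustration. The `order` function below is a hypothetical stand-in for the idea behind R's order (note it is 0-based, unlike R), and the data is made up.

```python
def order(values):
    """Return the 0-based indices that would sort values in ascending order."""
    return sorted(range(len(values)), key=values.__getitem__)


# Made-up two-column dataset.
area = [30.5, 12.0, 47.1, 8.3]
name = ["w", "x", "y", "z"]

idx = order(area)                                 # plain list of indices
sorted_rows = [(area[i], name[i]) for i in idx]   # reorder every column with it
print(idx)          # [3, 1, 0, 2]
print(sorted_rows)  # [(8.3, 'z'), (12.0, 'x'), (30.5, 'w'), (47.1, 'y')]
```

The point is that the index vector is a plain sequence of positions, so it can be applied to every column of the dataset.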