shortest encoding for Guid for use in a URL - compression

Mads Kristensen got one down to 00amyWGct0y_ze4lIsj2Mw
Can it go smaller than that?

It looks like only 73 characters can be used unescaped in a URL. If that's the case, you could convert the 128-bit number to base 73 and get a 21-character string.
If you can find 85 legal characters, you can get down to 20 characters.

A GUID looks like this: c9a646d3-9c61-4cb7-bfcd-ee2522c8f633. That's 32 hex digits, each encoding 4 bits, so 128 bits in total.
A base64 encoding uses 6 bits per symbol, which is easy to achieve with URL-safe characters and gives a 22-character encoded string (128 / 6, rounded up). As others have noted, you could work with 73 URL-safe symbols and encode the value as a base-73 number to get 21 characters.
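As a rough check of the arithmetic above, here is a minimal Python sketch. The helper names (`min_symbols`, `short_guid`, `parse_short_guid`) are made up for illustration, and the byte order (`UUID.bytes`) is an assumption: .NET's `Guid.ToByteArray()` orders some fields differently, so the output will not necessarily match the string quoted in the question.

```python
import base64
import uuid


def min_symbols(alphabet_size, bits=128):
    """Smallest n such that alphabet_size**n covers every value of `bits` bits."""
    n = 0
    while alphabet_size ** n < 2 ** bits:
        n += 1
    return n


def short_guid(guid):
    """URL-safe base64 of the 16 raw bytes, with the '==' padding dropped."""
    return base64.urlsafe_b64encode(guid.bytes).rstrip(b"=").decode("ascii")


def parse_short_guid(text):
    """Inverse of short_guid: restore the padding and decode."""
    return uuid.UUID(bytes=base64.urlsafe_b64decode(text + "=="))


if __name__ == "__main__":
    g = uuid.UUID("c9a646d3-9c61-4cb7-bfcd-ee2522c8f633")
    print(short_guid(g), len(short_guid(g)))
    print([min_symbols(n) for n in (64, 73, 85)])
```

The loop uses exact integer arithmetic rather than logarithms, so the lengths for 64, 73 and 85 symbols are not affected by floating-point rounding.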

Extract fixed length string between two numbers

I have this number: 003859389453604802410207622210986832370060. In this instance, I need to extract 07622210986832, which is preceded by 02 and followed by 37.
In the real world, the number is always 14 digits long and is always preceded by 02 and followed by 37, but it can appear at any position in a string of arbitrary length. All we know is that the number will be there somewhere.
I'm currently using the formula:
=IF(LEN(IFERROR(REGEXEXTRACT(A1:A&"", "02(.*)37")))=14,
However, you will notice that the sample contains another 02, in "024102".
This is causing an issue.
What I really want to happen is:
Look up 02.
Take the next 14 digits; if digits 15 and 16 are 3 and 7 (37), that is the number we need.
If you find another 02 followed by 14 digits where the next two digits are not 37, ignore it.
Use the pattern 02(\d{14})37. It extracts a sequence of 14 digits preceded by 02 and followed by 37.
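To see how the pattern behaves on the sample from the question, here is a small Python sketch (the function name `extract_code` is hypothetical). The regex engine first tries the earlier 02, finds that the 14 digits after it are not followed by 37, and moves on to the next candidate by itself.

```python
import re

# 02, then exactly 14 digits (captured), then 37.
PATTERN = re.compile(r"02(\d{14})37")


def extract_code(text):
    """Return the 14 digits between 02 and 37, or None if there is no match."""
    match = PATTERN.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(extract_code("003859389453604802410207622210986832370060"))
```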
Try it like this:
=ARRAYFORMULA(REGEXEXTRACT(TO_TEXT({A2:A,B2:B,C2:C}), "02(\d{14})37"))
If you want to collapse the results into a single column:
=ARRAYFORMULA(TRIM(TRANSPOSE(QUERY(TRANSPOSE(REGEXEXTRACT(TO_TEXT({A2:A,B2:B,C2:C}), "02(\d{14})37")),,999^99))))

Regular expression to validate 2 character hex string

I have a source of data that was converted from an Oracle database and loaded into a Hadoop storage point. One of the columns was a BLOB and therefore had lots of control characters and unreadable ASCII characters outside of the available codeset. I am using Impala to write a regex replace function to strip out some of the Unicode characters that the regex library cannot understand. I would like to remove the offending two-character hex codes BEFORE I use the unhex query function so that I can do the rest of the regex parsing with a "clean" string.
Here's the code I've used so far, which doesn't quite work:
'[2-7]{1}([A-Fa-f]|[0-9]{1})'
I've determined that I only need to capture \u0020-\u007f, or, represented as two-digit hex, 20-7f.
If my string looks like this:
010A000000153020405C00000000143020405CBC000000F53320405C4C010000E12F204058540100002D01
I would like to be able to capture 2 characters at a time (e.g. 01, 0A, 00), evaluate whether or not each pair fits the acceptable range of two-digit hex values I mentioned above, and return only what is acceptable.
The correct output should be:
30 20 40 5C 30 20 40 5C 33 20 40 5C 4C 2F 20 40 58 and 54
However, my expression finds the first acceptable number in my first range (5) and starts the capture from there which returns the position or indexing wrong for the rest of the string... and this is the return from my expression -
010A0000001**53**0**20****40****5C**000000001**43**0**20****40****5C**BC000000F**53****32**0**40****5C****4C**010000E1**2F****20****40****58****54**010000**2D**01
I just don't know how to evaluate only two characters at a time in a mixed-length string. And, if they don't fit the expression, iterate to the next two characters. But only in two character increments.
My example: https://regex101.com/r/BZL7t0/1
I have added a positive lookbehind to it. It starts at the beginning of the string and then matches 2 characters at a time, which ensures that the match is always preceded by a whole number of 2-character groups.
Positive lookbehind:
(?<=^(..)*)
Updated regex:
(?<=^(..)*)([2-7]{1}[A-Fa-f0-9]{1})
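Not every regex engine accepts a variable-length lookbehind like this one (Python's `re` module, for example, requires fixed-width lookbehinds). Where it is unavailable, the same two-at-a-time evaluation can be sketched without a regex by slicing the string into pairs first. The function name below is hypothetical, and the range check assumes the 20-7F bounds from the question.

```python
def printable_hex_pairs(hex_string):
    """Split into aligned 2-character pairs and keep those in 0x20..0x7F."""
    # Step by 2 so a pair never straddles two bytes; a trailing odd
    # character, if any, is ignored.
    pairs = [hex_string[i:i + 2] for i in range(0, len(hex_string) - 1, 2)]
    return [p for p in pairs if 0x20 <= int(p, 16) <= 0x7F]


if __name__ == "__main__":
    sample = ("010A000000153020405C00000000143020405CBC000000F5"
              "3320405C4C010000E12F204058540100002D01")
    print(" ".join(printable_hex_pairs(sample)))
```

On the sample string this keeps the pairs listed in the question plus the trailing 2D, which also falls inside 20-7F.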

Regex that allows 3-20 digits and must end with 00

I have a regex ".*00$" which requires that the input ends with 00. How can I extend it to also enforce a maximum length of 20? So:
100 => valid,
00 => invalid,
12345678900 => valid,
111111111111111111100 => 21 digits - invalid.
Based on your title and short description this should probably work for you:
^[0-9]{1,18}00$
This will allow 3-20 digits in input that ends with 00
This can be expressed almost literally as
/^[0-9]{1,18}00$/
That is 3 to 20 digits, the last two being zeros.
You can use {} to specify a repetition count, so
".{1,18}00"
would allow any 1-18 characters followed by 00. If you want to restrict it further, you could use
"[1-9][0-9]{0,17}00"
so that you ensure the first digit is not 0, followed by 0-17 digits and finally 00.
Try this:
^[1-9][0-9]{0,17}00$
The first [] ensures it doesn't start with 0.
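The two anchored patterns from the answers can be compared with a short Python sketch. The names `LOOSE`, `STRICT` and `is_valid` are made up for illustration; `re.fullmatch` plays the role of the ^ and $ anchors.

```python
import re

# 1-18 digits followed by 00: 3 to 20 digits in total.
LOOSE = re.compile(r"[0-9]{1,18}00")
# Same length limits, but the first digit may not be 0.
STRICT = re.compile(r"[1-9][0-9]{0,17}00")


def is_valid(text, pattern=LOOSE):
    """True if the whole string matches the given pattern."""
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ("100", "00", "12345678900", "1" * 19 + "00"):
        print(candidate, is_valid(candidate))
```

The only difference between the two is leading zeros: an input such as 0100 passes the loose pattern and fails the strict one.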

Storing encrypted data with libconfig

I'm using libconfig to create a configuration file, and one of the fields holds the contents of an encrypted file. The problem is that the file contains some escape characters, which cause the content to be stored only partially. What is the best way to store this data and avoid accidental escape characters? Convert to Unicode?
Any suggestions?
You can use either URL encoding, where each byte outside the safe printable ASCII set is encoded as a % character followed by two hex digits, or base64 encoding, where each group of 3 bytes is encoded as 4 ASCII characters (3x8 bits -> 4x6 bits).
For example, if you have the following bytes:
00 01 41 31 80 FE
You can URL encode it as follows:
%00%01A1%80%FE
Or you can base64 encode it like this, with 0-25 = A-Z, 26-51 = a-z, 52-61 = 0-9, 62 = ., 63 = /:
(00000000 00000001 01000001) (00110001 10000000 11111110) -->
(000000 000000 000101 000001) (001100 011000 000011 111110)
AAFBMYD.
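Both encodings can be reproduced with Python's standard library, which is a convenient way to check a hand-worked example like the one above. This is only a sketch for verification, not a libconfig recipe; note that the standard base64 alphabet uses + for value 62, so its last character differs from the "." variant described above.

```python
import base64
from urllib.parse import quote

# The six example bytes: 00 01 41 31 80 FE
data = bytes.fromhex("0001413180FE")

# Percent-encode every byte that is not an unreserved ASCII character.
url_encoded = quote(data, safe="")

# Standard base64: 6 bytes -> 8 characters, no padding needed.
b64_encoded = base64.b64encode(data).decode("ascii")

if __name__ == "__main__":
    print(url_encoded)
    print(b64_encoded)
```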
The standard for encoding binary data in text used to be uuencode and is now base64. Both use the same paradigm: a byte uses 8 bits, so 3 bytes use 24 bits, or four 6-bit characters.
uuencode just used the 6 bits with an offset of 32 (the ASCII code for space), so characters are in the range 32-95: all in the printable ASCII range, but including space and possibly other characters that could have special meanings.
base64 chose these 64 characters to represent values from 0 to 63 (no =:;,'"\*(){}[] that could have special meaning...):
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
with the equal sign (=) serving as a placeholder for empty positions at the end of an encoded string, to ensure that the encoded string length is a multiple of 4.
Unfortunately, neither the C nor the C++ standard library offers functions for uuencode or base64 conversion, but you can find nice implementations around, with many pointers in this other SO answer: How do I base64 encode (decode) in C?

Parse certain bytes of a variable in bash

I have a variable that contains the following string (where each dot stands for a non-printable character):
.[?1h.=.81..
which is this in hex:
ESC [ ? 1 h ESC = CR 8 1 CR LF
1b 5b 3f 31 68 1b 3d 0d 38 31 0d 0a
What I want is to isolate the '81'. The number can change, so it could for example be 100 and then take 3 bytes in the string, but it is always between the two "0x0d" bytes.
So the goal is to isolate all bytes (which are always ASCII digits) between the two "0x0d" bytes and save them as an integer in another variable.
Is this possible using only bash? Would it be possible to do it with a regex?
You can do it like this:
a=$'\033[?1h\033=\r81\r\n' # or a=$'\x1b[?1h\x1b=\r81\r\n'
[[ $a =~ $'\r'([0-9]+)$'\r' ]] && echo ${BASH_REMATCH[1]}
The $'...' syntax interprets escape sequences in a string, such as \r, \n, the octal form \033, or the hex form \x1b.
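For comparison outside bash, the same match (one or more digits between two carriage returns) can be sketched in Python on the raw bytes. The function name is hypothetical.

```python
import re


def number_between_crs(data):
    """Return the integer found between two CR (0x0d) bytes, or None."""
    match = re.search(rb"\r([0-9]+)\r", data)
    if match is None:
        return None
    return int(match.group(1).decode("ascii"))


if __name__ == "__main__":
    print(number_between_crs(b"\x1b[?1h\x1b=\r81\r\n"))
```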
A simple regex would capture the required decimal characters from the hex representation as follows:
0[dD](\s*3(\d))*\s*0[dD]
Group 2 captures the decimal digit, which is the hex value minus 0x30, i.e. just the second character of each pair.
Unfortunately, only the last repetition of a group is captured. If you can restrict yourself to a maximum number of decimal places, you can simply duplicate the term, as in
0[dD](\s*3(\d))(\s*3(\d))?(\s*3(\d))?\s*0[dD]
and replace it by
\2\4\6
to get the decimal value.
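Where the regex can be post-processed in code, the limit on repeated groups can be sidestepped by capturing the whole run of 3x pairs once and pulling the digits out in a second step. The Python sketch below assumes the input is a hex dump like the one in the question; the function name is hypothetical.

```python
import re


def digits_from_hex_dump(dump):
    """Decode the ASCII digits found between two 0d bytes in a hex dump."""
    # Capture the entire run of "3x" pairs in one group.
    match = re.search(r"0[dD]((?:\s*3\d)+)\s*0[dD]", dump)
    if match is None:
        return None
    # Each pair 3x encodes the decimal digit x.
    return int("".join(re.findall(r"3(\d)", match.group(1))))


if __name__ == "__main__":
    print(digits_from_hex_dump("1b 5b 3f 31 68 1b 3d 0d 38 31 0d 0a"))
```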
Edit
If your input is not hex but an ordinary string, it would look as follows
\x0d(\d)*\x0d
or with manual repetition (here 3x):
\x0d(\d)(\d)?(\d)?\x0d
with the same replacement pattern
\1\2\3
Edit2
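In sed (GNU sed, with -E for extended syntax) it should work as follows:
sed -E -n 's/^.*\x0d([0-9])([0-9])?([0-9])?\x0d.*$/\1\2\3/p'
now with start and end padding (^.*matcher.*$) and a replacement pattern, in the form s/search/replace/. sed does not understand \d, so [0-9] is used instead, and the p flag is needed because -n suppresses the default output.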