How to check formatting of a SHA-1 message-digest [duplicate]

This question already has answers here:
A Regex to match a SHA1
(6 answers)
Closed 5 years ago.
I need some basic validation (sanity checks) to determine whether some input is a valid SHA1 sum or just a (random) string, if possible with simple parsing rules or a regex.
Are there any rules a SHA1 sum should adhere to? I cannot find any, but from quick tests, all seem to be hexadecimal and around 40 characters long[1].
I am not interested in tests that prove whether or not the SHA-1 sum was made in a secure, properly random or other manner. Just that the format is correct.
I am also not interested in testing that the digest is an actual representation of some message; Just that it has the format of digest in the first place.
For the curious: this is for an application where I build avatars for users based on, among other things, their UUID. I don't, however, want to place those UUIDs in the URL, but obfuscate them a little. So instead of avatars/baa4833d-b962-4ab1-87c5-283c9820eac4.png, we request avatars/5f2a13cb1d84a2e019842cdb8d0c8b03c9e1e414.png, where 5f2a... is e.g. Digest::SHA1.hexdigest(uuid + "secret").
On the receiving side, I am adding some basic protection that sends back a 400 bad request whenever something is obviously false. Such as avatars/haxor.png or avatars/traversal../../../../attempt.png. Note that this is a very much simplified example.
[1] Two tests with different outcome:
Using sha1sum on Ubuntu Linux:
$ echo "hello" | sha1sum | cut -d" " -f1 | wc -c
41
using Ruby's Digest:
Digest::SHA1.hexdigest("hello").length
=> 40
Edit: turns out this is me being stupid: wc -c includes the newline, as kennytm points out in the comments. Still: is it safe to assume it will always be 40 characters?

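SHA-1 produces a 160-bit digest.
160 bits is 160 / 8 = 20 bytes.
20 bytes rendered in hexadecimal are 40 characters (digits), two for each byte.
Each digit is one of [0-9a-f] (or [0-9A-F] if the producer emits uppercase).
So the following regex should correctly validate a SHA-1 sum rendered as a lowercase hexadecimal string:
/\A[0-9a-f]{40}\z/
Note that in Ruby ^ and $ match at line boundaries, so /^[0-9a-f]{40}$/ would also accept a multi-line string that merely contains such a line; \A and \z anchor to the whole string.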
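For readers outside Ruby, here is a minimal sketch of the same format check in Python. The name looks_like_sha1 is made up for illustration, and the check assumes lowercase hex as produced by hexdigest. It only tests the format, not whether the digest belongs to any message.

```python
import hashlib
import re

# Exactly 40 lowercase hex digits; fullmatch() anchors to the whole string,
# so a trailing newline or extra path segments are rejected.
SHA1_HEX = re.compile(r"[0-9a-f]{40}")

def looks_like_sha1(text: str) -> bool:
    """Return True if text has the format of a lowercase hex SHA-1 digest."""
    return SHA1_HEX.fullmatch(text) is not None

digest = hashlib.sha1(b"hello").hexdigest()
print(len(digest), looks_like_sha1(digest))   # 40 True
print(looks_like_sha1("haxor"))               # False
print(looks_like_sha1(digest + "\n"))         # False
```

If uppercase digests must be accepted as well, the character class would become [0-9a-fA-F].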

Related

regex to match max of input numeric string

I have incoming input entries like these:
750
1500
1
100
25
55
And there is a lookup table like the one given below:
25
7
5
75
When I receive my first entry, in this case 750, I look it up in the lookup table and try to find the entry with the longest match from left to right.
So for 750, the longest match would be 75.
I was wondering whether it is possible to write a regex for this kind of scenario, because if I use Java's startsWith function, I can get 7 as output as well.
The input entries come from a text file one by one, and all lookup entries are in a different text file.
I'm using Java.
How can I write a regex for this?

This doesn't seem like a regex problem at first, but you could actually solve it with a regex, and the result would be pretty efficient.
A regex for your example lookup table would be:
/^(75?|5|25)/
This will do what you want, and it will avoid the repeated searches of a naive "check every one" approach.
The regex would get complicated, though, as your lookup table grew. Adding a couple of terms to your lookup table:
25
7
5
75
750
72
We now have:
/^(7(50?|2)?|5|25)/
This is obviously going to get complicated quickly. The trick would be programmatically constructing the appropriate regex for arbitrary data--not a trivial problem, but not insurmountable either.
That said, this would be an... umm... unusual thing to implement in production code.
I would be hesitant to do so.
In most cases, I would simply do this:
1. Find all the strings that match.
2. Find the longest one.
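The two steps above can be sketched without any regex. This is shown in Python for brevity (the question uses Java, where startsWith plays the same role), and longest_prefix is a made-up helper name.

```python
def longest_prefix(entry, table):
    """Return the longest table string that entry starts with, or None."""
    matches = [key for key in table if entry.startswith(key)]  # step 1
    return max(matches, key=len, default=None)                 # step 2

table = ["25", "7", "5", "75"]
for entry in ["750", "1500", "1", "100", "25", "55"]:
    print(entry, "->", longest_prefix(entry, table))
```

For a small lookup table this linear scan is usually fast enough; the regex or trie approaches only start to pay off when the table is large.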

(?: 25 | 5 | 75? )
There is free software that automatically makes a full-blown regex trie for you.
Just put the output regex into a text file and load it instead.
If your values don't change very much, this is a very fast way to do a lookup.
If they do change, generate another one.
What's good about a full-blown trie here is that it never takes more than 8 steps to match.
The one I just did: http://imgur.com/a/zwMhL (app screenshot)
Even a 175,000-word dictionary takes no more than 8 steps.
Internally the app initially makes a ternary tree from the input,
then converts it into a full-blown regex trie.
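This is not the app mentioned above, just a rough sketch of the general idea in Python: build a character trie from the lookup entries, then emit a regex from it. The function names build_trie and trie_to_regex are invented for this example. At every trie node the branches start with different characters, so at most one alternative can match, and the greedy optional groups make the engine prefer the longest entry.

```python
import re

def build_trie(words):
    """Nested dicts keyed by character; the "" key marks the end of a word."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = True
    return trie

def trie_to_regex(node):
    """Turn a trie node into a regex fragment that prefers the longest word."""
    branches = [re.escape(ch) + trie_to_regex(child)
                for ch, child in sorted(node.items()) if ch != ""]
    if not branches:
        return ""
    # If a word ends at this node, everything after it is optional.
    optional = "?" if "" in node else ""
    if len(branches) == 1 and not optional:
        return branches[0]
    return "(?:" + "|".join(branches) + ")" + optional

pattern = re.compile(trie_to_regex(build_trie(["25", "7", "5", "75", "750", "72"])))
match = pattern.match("7501")
print(match.group(0) if match else None)
```

Regenerating the pattern whenever the lookup file changes keeps the matching code itself trivial.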

Convert paged columns to rows with a regular expressions

First, a sample of the actual data, mangled. (The data is originally a mix of text and numbers. None of it is significant at this point, and some of the patterns exist only because I replaced most of the characters with 0s, 1s and Zs, since the random number generator in my brain is broken.)
011.0ZN1ZZ 001.F5ZS1Z 001.ZO5ZY0
014.5ZZZ1Z 001.1SZZOZ 001.ZLMZY0
016.01NM1SU54 001.EX0Z1Z 001.LIZZOZ
018.01NM1SS41 001.F83Z1Z 001.0011M1SU54
014.ZZ1YZZ 001.ZZZ1IZ 001.0011M1SS41
013.2EBSIZ 001.ZZZ11Z 001.0011SE4
01N.ZINSIZ 001.ZZZZ1Z P01.ZZZZ1Z
01N.01NSE4 001.LSZZHG N01.ZZZZ1Z
001.01ON5O 001.5Z21OL F01.ZZZZ1Z
001.NE5ZO1 001.ZOM05O D01.ZZZZ1Z
001.ZO5ZOZ 001.01NO1G Z01.ZZZZ1Z
001.ZO5ZOZ 001.01NO1G Z01.ZZZZ1Z
001.011ZOZ 001.01NZ0Y
Some additional comments: I can clean up whitespace and deal with record length with no issues, so I'd like to simplify the question to the layout below. I'm including the data above only in case there's a solution to the simplified version that can't easily be extended to the more complex one.
1 7 13
2 8 14
3 9 15
4 10 16
5 11 17
6 12 18
19 25
20 26
21 27
22 28
23 29
24
So there will be a variable number of pages, with the same number of columns and rows on each page (in case it matters, it's actually 12x3 instead of 6x3, but I wanted to keep it simple), although the last page may have some empty rows or columns.
I'm using Notepad++, but I have access to various GNU utilities, so if there's a solution that's way, way better than a regular expression, I don't mind. Still, since I'll be doing this a lot and use Notepad++ a lot, I'd appreciate a regex solution if it isn't too insane.

If you've got Git installed on your Windows machine, you can use the Perl bundled with it from Git Bash. Provided your input file is named data, try the following command (caution: it will overwrite the input file):
echo >>data ; \
perl -i.bak -lane'
  $i=0;
  # append each field of the line to its column
  push @{$c[$i++]}, $_ foreach @F;
  # a blank line ends the page: print its columns one after another
  if (/^\s*$/) {
    push @l, @{$_} foreach @c;
    print "@l\015";
    @l=@c=();
  }' data
The Perl command treats each line of input as space-delimited fields and accumulates the fields in the @c matrix. When it encounters an empty line (if (/^\s*$/) ...), it prints the matrix columns concatenated into a list.
The input file is changed in place. A backup copy data.bak is created (that is what the .bak suffix on -i requests).
The input file may not end with an empty line, so I add one with echo >>data. This makes the Perl script shorter and simpler.
Another trick is the trailing \015 in print "@l\015";. This gives us Windows CRLF line endings in the Unix-flavoured Git Bash environment.
A demo can be found here: https://ideone.com/vnYoOd. But since Ideone forbids file read/write, the original command has been modified to make the code run there.
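For comparison, the same column-by-column accumulation can be sketched in Python. This is not the code behind the Ideone link; pages_to_rows is a hypothetical name, and like the Perl version it assumes pages are separated by blank lines and that columns fill from the left.

```python
def pages_to_rows(text):
    """Flatten each blank-line-separated page column by column into one row."""
    rows = []
    columns = []
    # The extra "" acts as a final blank line so the last page is flushed.
    for line in text.splitlines() + [""]:
        fields = line.split()
        if fields:
            for i, field in enumerate(fields):
                while len(columns) <= i:
                    columns.append([])
                columns[i].append(field)
        elif columns:
            rows.append([field for column in columns for field in column])
            columns = []
    return rows

sample = "1 7 13\n2 8 14\n3 9 15\n\n19 25\n20 26\n21\n"
for row in pages_to_rows(sample):
    print(" ".join(row))
```

Short last pages work because each field is appended to the column matching its position in the line, so missing trailing columns simply stay shorter.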

C++ generate passwords from master password

I want to have a program that takes 2 arguments, the first a master password and the second an easy to remember string, relevant to the generated password. It processes this information and turns it into a string. So my passwords wouldn't be written anywhere, I would just remember the master password and the easy to remember string for each password. For example something like
get-password --master-pass Gh3vBF2d --name stackoverflow
would get my password for Stackoverflow.
I tried to do it with sha512. It takes a hardcoded salt + master password + the relevant string and goes 60k+ rounds and returns the hash.
This is far from perfect as the hash is hex, so it has low entropy. I'd like the output to consist of alphanumerics, lower case and upper case and some special characters. I tried to convert it to base64 and the output is too short. Not only that, but the generated passwords seem similar, for example: N2Q5MjJkZWM=, N2YzNGRkYWQ=
Does anyone have an idea how I could generate a high-entropy password, about 16-20 characters long, such that the generated passwords do not look similar to each other?

I have been using md5sum to good effect for a while.
Command:
echo -n Gh3vBF2d#stackoverflow | md5sum | cut --bytes=1-20
(Gh3vBF2d is the master password, stackoverflow is the domain.)
Output:
0e8dc2aa9a8d85afc267
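To get beyond hex output, one option is to stretch the master password with a key-derivation function and then rewrite the result in a larger alphabet. The sketch below uses Python's hashlib.pbkdf2_hmac. The alphabet, the iteration count, the use of the name as salt and the function name derive_password are all assumptions for illustration, not a vetted password scheme (the question asks for C++, where the same steps would apply).

```python
import hashlib
import string

# Assumed output alphabet: letters, digits and a few special characters.
ALPHABET = string.ascii_letters + string.digits + "!#$%&*+-=?@_"

def derive_password(master, name, length=20, iterations=100_000):
    """Derive a deterministic password from a master password and a name."""
    # The name doubles as the salt here (an assumption of this sketch).
    key = hashlib.pbkdf2_hmac(
        "sha512", master.encode("utf-8"), name.encode("utf-8"), iterations
    )
    number = int.from_bytes(key, "big")  # 512 bits of derived key material
    chars = []
    for _ in range(length):
        # Peel off one base-len(ALPHABET) digit per output character.
        number, index = divmod(number, len(ALPHABET))
        chars.append(ALPHABET[index])
    return "".join(chars)

print(derive_password("Gh3vBF2d", "stackoverflow"))
```

Because the whole derived key is treated as one large integer, every output character depends on all of the hash bits, so nearby inputs should not produce visibly similar passwords.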

Store 32 bit value as C string in most efficient form

I am trying to find the most efficient way to encode 32 bit hashed string values into text strings for transmission/logging in low bandwidth environments. Complex compression can't be used because the hash values need to be contained in human readable text strings when logged and sent between client and host.
Consider the following contrived examples:
given the key/value map
table[0xFE12ABCD] = "models/texture/red.bmp";
table[0x3EF088AD] = "textures/diagnostics/pink.jpg";
and the string formats:
"Loaded asset (0x%08x)"
"Replaced (0x%08x) with (0x%08x)"
they could be printed as:
"Loaded asset models/texture/red.bmp"
"Replaced models/texture/red.bmp with textures/diagnostics/pink.jpg"
Or if the key/value map is known by the client and server:
"Loaded asset (0xFE12ABCD)"
"Replaced (0xFE12ABCD) with (0x3EF088AD)"
The receiver can then scan for the (0xNNNNNNNN) pattern and expand it locally.
This is what I am doing right now but I would like to find a way to represent the 32 bit value more efficiently. A simple step would be to use a better identifying token:
"Loaded asset $FE12ABCD"
"Replaced $1000DEEE with $3EF088AD"
Which already reduces the length of each token - $ is not used anywhere else so it is reasonable.
However, what other options are there to make that 32 bit value even smaller? I can't use an index - it has to be a full 32 bit value because in some cases the generator of the string has the hash and sometimes it has a string it will hash immediately.

A common solution is to use Base-85 coding. You can code four bytes into five Base-85 digits, since 85^5 > 2^32. Pick 85 printable characters and assign them to the digit values 0..84. Then do base conversion to go either way. Since there are 94 printable characters in ASCII, it is usually easy to find 85 that are "safe" in whatever constrains your strings to be "readable".
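A minimal sketch of that base conversion, in Python rather than C to keep it short. The particular 85 characters are an arbitrary assumption (printable ASCII minus nine characters that tend to need escaping, including the $ used as the token marker in the question), and encode32 and decode32 are made-up names.

```python
# Hypothetical choice of 85 "safe" characters: printable ASCII minus nine
# characters that often need escaping in logs or string literals.
EXCLUDED = "\"$'()\\`,;"
ALPHABET = "".join(chr(c) for c in range(33, 127) if chr(c) not in EXCLUDED)
assert len(ALPHABET) == 85
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

def encode32(value):
    """Encode a 32-bit unsigned integer as exactly five base-85 digits."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value must fit in 32 bits")
    digits = []
    for _ in range(5):
        value, remainder = divmod(value, 85)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))

def decode32(text):
    """Decode five base-85 digits back to the 32-bit integer."""
    value = 0
    for ch in text:
        value = value * 85 + INDEX[ch]
    return value

token = encode32(0xFE12ABCD)
print(token, hex(decode32(token)))
```

With a one-character marker in front, each hash then costs six characters instead of the nine used by $FE12ABCD.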

Is it possible to losslessly compress 32 hexadecimal numbers into 30?

For example, is it possible to compress
002e3483bbdc11ddaae0754822a559f6 into something that takes at most 30 characters?

Yes, you can convert it to a base-32 number. The greatest 32-character hex number, ffffffffffffffffffffffffffffffff, is 7VVVVVVVVVVVVVVVVVVVVVVVVV in base-32, which has only 26 characters. Note that in base-32 you end up with a string containing only these characters: 0123456789ABCDEFGHIJKLMNOPQRSTUV
For example: 002e3483bbdc11ddaae0754822a559f6 is 5OQ87EUS27EQLO3L90HAAMFM in base-32 (the conversion needs exact 128-bit integer arithmetic; floating point loses the low digits).
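A small sketch of this conversion in Python, whose integers are arbitrary precision, so no digits are lost on a 128-bit value. The helper names hex_to_base32 and base32_to_hex are made up for this example.

```python
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUV"

def hex_to_base32(hex_text):
    """Rewrite a hex number in base 32 using exact integer arithmetic."""
    number = int(hex_text, 16)
    if number == 0:
        return "0"
    out = []
    while number:
        number, digit = divmod(number, 32)
        out.append(DIGITS[digit])
    return "".join(reversed(out))

def base32_to_hex(text, width=32):
    """Convert back to hex, zero-padded to the original width."""
    return format(int(text, 32), "0{}x".format(width))

encoded = hex_to_base32("002e3483bbdc11ddaae0754822a559f6")
print(encoded, len(encoded))
print(base32_to_hex(encoded))
```

Leading zeros are dropped by the numeric conversion, which is why the decoder needs the original width to restore them.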

If your question is whether 32 hex digits can be compressed into 30 hex digits: this is impossible for all inputs. There are 16^32 possible inputs but only 16^30 possible outputs, so multiple 32-length hex strings would have to compress to the same 30-length hex string, and you wouldn't know which one it was (the pigeonhole principle).
A less proof-y proof: you'd be able to repeatedly invoke the process on a file of any size to get down to a single 30-length hex string, which doesn't make a whole lot of sense.
Here is an article I just found. Wikipedia says something similar.

Convert hex to binary, then use something like base64 or any other encoding scheme; see Binary-to-text encoding (Wikipedia). This has the advantage of not requiring 128-bit arithmetic like the suggested base32 solution.
Conversion to base64 and back:
$ echo 002e3483bbdc11ddaae0754822a559f6 |xxd -r -ps |openssl base64 -e |tee >(openssl base64 -d |xxd -ps)
AC40g7vcEd2q4HVIIqVZ9g==
002e3483bbdc11ddaae0754822a559f6
Cut the line starting from |tee to get only the encoded output. Most programming languages have core or external libraries for hex-to-binary conversion and base64 encoding.
NB: Conversion to base32 would also be possible, but the base32 binary-to-text encoding pads its output to a multiple of 8 characters, so you would have to trim the padding and re-add the pads (=) on decode.
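As one example of doing this with a standard library instead of the shell, here is a short Python sketch (the helper names are invented). It reproduces the output shown above.

```python
import base64

def hex_to_b64(hex_text):
    """Hex string -> raw bytes -> base64 text."""
    return base64.b64encode(bytes.fromhex(hex_text)).decode("ascii")

def b64_to_hex(text):
    """Base64 text -> raw bytes -> hex string."""
    return base64.b64decode(text).hex()

encoded = hex_to_b64("002e3483bbdc11ddaae0754822a559f6")
print(encoded)              # AC40g7vcEd2q4HVIIqVZ9g==
print(b64_to_hex(encoded))  # 002e3483bbdc11ddaae0754822a559f6
```

The result is 24 characters including the two = pads, so it already fits the 30-character limit from the question.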
