AWS Signature Version 2 Example not reproducible - amazon-web-services

Like the asker of this question (AWS Signature Version 2 - can't reproduce signature from example), I can't reproduce the example for AWS Signature Version 2 (https://docs.aws.amazon.com/general/latest/gr/signature-version-2.html).
We have the string:
GET\nelasticmapreduce.amazonaws.com\n/\nAWSAccessKeyId=AKIAIOSFODNN7EXAMPLE&Action=DescribeJobFlows&SignatureMethod=HmacSHA256&SignatureVersion=2&Timestamp=2011-10-03T15%3A19%3A30&Version=2009-03-31
and the sample secret key
wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
To be independent of any programming language, let's use an online tool for the hash, which is calculated with HmacSHA256: https://www.liavaag.org/English/SHA-Generator/HMAC/
But I get the following hash value:
xgbYI2xegVYMVTvnhoqc8/opbN0v/5Pn+8i9usAQAjk=
which is sadly not the expected value (not URL-encoded here):
i91nKc4PWAt0JJIdXwz9HxZCJDdiy6cf/Mj6vPxyYIs=
What did I do wrong? Why is my calculation of the hash value not correct? Is the initial string correct? If you manage to get the right result with the online tool, please let me know how it was done.

TLDR: It's the newlines
Although some tools and programming languages, particularly those based on C or originating on Unix where C was heavily used, treat \n as a notation for a newline, that webpage does not. If I enter the string from your question in the webpage's 'text' mode, it computes the HMAC of a value containing a literal backslash followed by a lowercase letter 'n', not a newline as the AWS spec requires.
If I enter the correct input (containing newlines) in hex as
4745540a656c61737469636d61707265647563652e616d617a6f6e6177732e636f6d0a2f0a4157534163636573734b657949643d414b4941494f53464f444e4e374558414d504c4526416374696f6e3d44657363726962654a6f62466c6f7773265369676e61747572654d6574686f643d486d6163534841323536265369676e617475726556657273696f6e3d322654696d657374616d703d323031312d31302d3033543135253341313925334133302656657273696f6e3d323030392d30332d3331
or in base64 as
R0VUCmVsYXN0aWNtYXByZWR1Y2UuYW1hem9uYXdzLmNvbQovCkFXU0FjY2Vzc0tleUlkPUFLSUFJT1NGT0ROTjdFWEFNUExFJkFjdGlvbj1EZXNjcmliZUpvYkZsb3dzJlNpZ25hdHVyZU1ldGhvZD1IbWFjU0hBMjU2JlNpZ25hdHVyZVZlcnNpb249MiZUaW1lc3RhbXA9MjAxMS0xMC0wM1QxNSUzQTE5JTNBMzAmVmVyc2lvbj0yMDA5LTAzLTMx
then I get the correct result (and you should too).
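If you prefer to verify it locally instead of through the online tool, here is a minimal sketch in Python (the string to sign and the sample secret key are taken verbatim from the AWS example); the point is that the separators are real newline characters, not a backslash followed by 'n':

import base64
import hashlib
import hmac

secret_key = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"

# Real newline characters between the four parts, exactly as the AWS spec requires.
string_to_sign = (
    "GET\n"
    "elasticmapreduce.amazonaws.com\n"
    "/\n"
    "AWSAccessKeyId=AKIAIOSFODNN7EXAMPLE&Action=DescribeJobFlows"
    "&SignatureMethod=HmacSHA256&SignatureVersion=2"
    "&Timestamp=2011-10-03T15%3A19%3A30&Version=2009-03-31"
)

digest = hmac.new(secret_key.encode("utf-8"),
                  string_to_sign.encode("utf-8"),
                  hashlib.sha256).digest()
print(base64.b64encode(digest).decode("ascii"))
# i91nKc4PWAt0JJIdXwz9HxZCJDdiy6cf/Mj6vPxyYIs=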

Related

CloudSearch wildcard query not working with 2013 API after migration from 2011 API

I've recently upgraded a CloudSearch instance from the 2011 to the 2013 API. Both instances have a field called sid, which is a text field containing a two-letter code followed by some digits e.g. LC12345. With the 2011 API, if I run a search like this:
q=12345*&return-fields=sid,name,desc
...I get back 1 result, which is great. But the sid of the result is LC12345 and that's the way it was indexed. The number 12345 does not appear anywhere else in any of the resulting document fields. I don't understand why it works. I can only assume that this type of query is looking for any terms in any fields that even contain the number 12345.
The reason I'm asking is because this functionality is now broken when I query using the 2013 API. I need to use the structured query parser, but even a comparable wildcard query using the simple parser is not working e.g.
q.parser=simple&q=12345*&return=sid,name,desc
...returns nothing, although the document is definitely there i.e. if I query for LC12345* it finds the document.
If I could figure out how to get the simple query working like it was before, that would at least get me started on how to do the same with the structured syntax.
Why it's not working
CloudSearch v1 (2011) had a different way of tokenizing mixed alpha+numeric strings. Here's the logic as described in the archived docs (emphasis mine).
If a string contains both alphabetic and numeric characters and is at
least three and no more than nine characters long, the alphabetic and
numeric portions of the string are treated as separate tokens. For
example, the string DOC298 is tokenized into two terms: doc 298
CloudSearch v2 (2013) text processing follows Unicode Text Segmentation, which does not specify that behavior:
Do not break within sequences of digits, or digits adjacent to letters (“3a”, or “A3”).
Solution
You should just be able to search *12345 to get back results with any prefix. There may be some edge cases like getting back results you don't want (things with more preceding digits like AB99912345); I don't know enough about your data to say whether those are real concerns.
Another option would be to index the numeric portion separately from the alphabetic portion, but that's additional work that may be unnecessary.
I'm guessing you are using CloudSearch in English, so maybe this isn't your specific problem, but also watch out for stop words in your search queries:
https://docs.aws.amazon.com/cloudsearch/latest/developerguide/configuring-analysis-schemes.html#stopwords
In your example, the word "jo" is a stop word in Danish and other languages; each supported language has a dictionary of stop words covering very common words. If you don't specify a language for a text field, it defaults to English. You can see the dictionaries here: https://docs.aws.amazon.com/cloudsearch/latest/developerguide/text-processing.html#text-processing-settings

What's the format of a CUID in SAP BI/BO?

I'm interfacing with an SAP BI/BO server, and some web services require an input id called "CUID" (Cluster Unique ID). For example, there's a web service getObjectById which requires a CUID as input.
I'm trying to make my code more robust by checking whether the CUID entered by a user makes sense, but I can't find a regular expression that properly describes what a CUID looks like. There is a lot of documentation for GUIDs, but they're not the same. Below are some examples of CUIDs found in our system; it looks like they are well formatted, but I'm not sure:
AQA9CNo0cXNLt6sZp5Uc5P0
AXiYjXk_6cFEo.esdGgGy_w
AZKmxuHgAgRJiducy2fqmv0
ASSn7jfNPCFDm12sv3muJwU
AUmKm2AjdPRMl.b8rf5ILww
AaratKz7EDFIgZEeI06o8Fc
ATjdf_MjcR9Anm6DgSJzxJ8
AaYbXdzZ.8FGh5Lr1R1TRVM
Afda1n_SWgxKkvU8wl3mEBw
AaZBfzy_S8FBvQKY4h9Pj64
AcfqoHIzrSFCnhDLMH854Qc
AZkMAQWkGkZDoDrKhKH9pDU
AaVI1zfn8gRJqFUHCa64cjg
My guess would be: start with a capital A, then add 22 random characters in the range [0-9A-Za-z_.]. But perhaps the A means something else, and after a while it would start using B...
Is anyone familiar with this type of id's and how they are formatted?
(quick side question: do I need to escape the "dot" in the square brackets like this \. to get the actual dot character?)
The definition of the different ID types and their purpose is described in the SAP KB note 1285103: What are the different types of IDs used in the BusinessObjects Enterprise repository?
However, I couldn't find any description of the format of the CUID. I wouldn't make any assumptions about it though, other than the fact that it's alphanumeric.
I did a quick query on a repository and found CUIDs consisting of up to 35 characters and beginning with the letters A, B, C, F, k and M.
If you look at the repository database, more specifically the table CMS_INFOOBJECTS7, you'll notice that the column SI_CUID is defined as a VARCHAR2, 56 bytes in size (Oracle RDBMS).
Thus, a valid regex to match these would be [a-zA-Z0-9\._]+.
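As a sketch only, here is a validator built from the observations above (character set from the regex, 56 as an upper bound taken from the SI_CUID column size); the function name and the length bound are assumptions, not an official format definition. It also answers the side question: inside a character class the dot is literal, so escaping it is optional.

import re

# Assumed pattern: alphanumeric plus '.' and '_', at most 56 characters
# (the size of the SI_CUID column). Inside [...] the dot needs no escaping.
CUID_RE = re.compile(r"^[A-Za-z0-9._]{1,56}$")

def looks_like_cuid(value: str) -> bool:
    """Loose sanity check only; it does not prove the string is a real CUID."""
    return bool(CUID_RE.match(value))

print(looks_like_cuid("AQA9CNo0cXNLt6sZp5Uc5P0"))  # True
print(looks_like_cuid("not a CUID!"))              # False (space and '!')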

Facing difficulties with amazon url signature

I am following each step to create a signature for my url requests to amazon(or at least that's what I think) but it doesn't work.
I am trying to sign an example from Amazon's page (http://docs.aws.amazon.com/AWSECommerceService/latest/DG/rest-signature.html).
I have downloaded the s3-sigtester, a JavaScript file that creates the signatures. The string that I am signing is:
GET \necs.amazonaws.co.uk \n/onca/xml \nAWSAccessKeyId=AKIAJOCH6NNDJFTB4LYA&Actor=Johnny%20Depp&AssociateTag=memagio-21&Operation=ItemSearch&ResponseGroup=ItemAttributes%2COffers%2CImages%2CReviews%2CVariations&SearchIndex=DVD&Service=AWSECommerceService&Sort=salesrank&Timestamp=2014-10-19T21%3A21%3A55Z&Version=2009-01-01
The string above is the result from the sigtester. I am feeding it in hex. I get a signature and then, I am trying to access the following url, in order to get the xml values:
http://ecs.amazonaws.co.uk/onca/xml?AWSAccessKeyId=AKIAJOCH6NNDJFTB4LYA&Actor=Johnny%20Depp&AssociateTag=memagio-21&Operation=ItemSearch&ResponseGroup=ItemAttributes%2COffers%2CImages%2CReviews%2CVariations&SearchIndex=DVD&Service=AWSECommerceService&Signature=vZK%2BhDqtcV1CoTf6%2FN1ohR3Da5M%3D&Sort=salesrank&Timestamp=2014-10-19T21%3A21%3A55Z&Version=2009-01-01
The AWSAccessKeyId and signature are based on the AWS keys that I have created. However, I get an error that the signatures do not match. I think that I am following all the steps and I really don't know what's going on. Thanks.
The string that I am signing is:
GET \necs.amazonaws.co.uk \n/onca/xml \nAWSAccessKeyId=AKIAJOCH6NNDJFTB4LYA&Actor=Johnn
I assume \n denotes a newline character (Unicode U+000A). The problem is that there should not be spaces before the newlines; it needs to be GET\necs.amazonaws.co.uk\n...
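As a rough sketch of how the string-to-sign and final URL could be assembled without stray spaces (Python here for illustration only; the secret key is a placeholder, HMAC-SHA256 is assumed as in the linked signed-requests guide, and the parameter values come from the question):

import base64
import hashlib
import hmac
import urllib.parse

# Sketch only: the secret key is a placeholder; the other values are from the question.
secret_key = "YOUR-SECRET-ACCESS-KEY"
host = "ecs.amazonaws.co.uk"
path = "/onca/xml"

params = {
    "AWSAccessKeyId": "AKIAJOCH6NNDJFTB4LYA",
    "Actor": "Johnny Depp",
    "AssociateTag": "memagio-21",
    "Operation": "ItemSearch",
    "ResponseGroup": "ItemAttributes,Offers,Images,Reviews,Variations",
    "SearchIndex": "DVD",
    "Service": "AWSECommerceService",
    "Sort": "salesrank",
    "Timestamp": "2014-10-19T21:21:55Z",
    "Version": "2009-01-01",
}

# Percent-encode and sort the parameters, then join the four parts with bare
# newlines -- no trailing spaces anywhere.
canonical_query = "&".join(
    f"{urllib.parse.quote(k, safe='')}={urllib.parse.quote(v, safe='')}"
    for k, v in sorted(params.items())
)
string_to_sign = "\n".join(["GET", host, path, canonical_query])

digest = hmac.new(secret_key.encode(), string_to_sign.encode(), hashlib.sha256).digest()
signature = urllib.parse.quote(base64.b64encode(digest).decode(), safe="")
print(f"http://{host}{path}?{canonical_query}&Signature={signature}")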

How to set the ASN1 NumericString type to SubjectDN OID?

I have a working program which generates a CSR from a specified SubjectDN string (example: 2.5.4.3=Name Surname, 1.2.300.38.22=12345678), using the MS Crypto API. I use the function CertStrToName() to encode it, and everything works fine except one thing: all OID values are created with the ASN.1 type PrintableString.
Is there any way to make OID 1.2.300.38.22 of type NumericString?
So, I've found two ways to fix that:
1. programmatically, using the function CryptEncodeObject()
2. my crypto provider supports some specific OIDs, so I could use CertStrToName() with them, without touching the code.
Microsoft's CertStrToName() method is not RFC 4514 compliant. Instead of treating #-encodings as AttributeValue encodings, it treats them as values to be encoded in OctetStrings. This means that not all Distinguished Names can be generated with CertStrToName(); in particular, yours cannot.
The string representation of the distinguished name is the one from RFC 4514: String Representation of Distinguished Names.
Here you can see that if the attribute type is in dotted-decimal form, you are actually supposed to encode the attribute value as a # followed by the BER encoding of the ASN.1 AttributeValue in hexadecimal, i.e.:
2.5.4.3=Name Surname, 1.2.300.38.22=#12083132333435363738
You can also read in the documentation for CertStrToName() that:
A value that starts with a number sign (#) is treated as ASCII
hexadecimal and converted to a CERT_RDN_OCTET_STRING. Embedded white
space is ignored. For example, 1.2.3 = # AB CD 01 is the same as
1.2.3=#ABCD01.
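For illustration, here is a small sketch showing where the #12083132333435363738 value in the RFC 4514 example above comes from (assuming DER short-form lengths, i.e. values shorter than 128 bytes; the helper name is made up):

# Build the "#<hex>" RDN value for a BER/DER-encoded NumericString (tag 0x12).
# Short-form length only, so this sketch works for values shorter than 128 bytes.
def numericstring_rdn_value(digits: str) -> str:
    body = digits.encode("ascii")
    encoded = bytes([0x12, len(body)]) + body
    return "#" + encoded.hex().upper()

print(numericstring_rdn_value("12345678"))  # #12083132333435363738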

RegEx to parse or validate Base64 data

Is it possible to use a RegEx to validate or sanitize Base64 data? That's the simple question, but the factors that drive this question are what make it difficult.
I have a Base64 decoder that cannot fully rely on the input data to follow the RFC specs. So the issues I face are things like Base64 data that may not be broken up into 78-character lines (I think it's 78, I'd have to double-check the RFC, so don't ding me if the exact number is wrong), or lines that may not end in CRLF; they may have only a CR, or only an LF, or maybe neither.
So, I've had a hell of a time parsing Base64 data formatted as such. Due to this, examples like the following become impossible to decode reliably. I will only display partial MIME headers for brevity.
Content-Transfer-Encoding: base64
VGhpcyBpcyBzaW1wbGUgQVNDSUkgQmFzZTY0IGZvciBTdGFja092ZXJmbG93IGV4YW1wbGUu
Ok, so parsing that is no problem, and is exactly the result we would expect. And in 99% of the cases, using any code to at least verify that each char in the buffer is a valid base64 char, works perfectly. But, the next example throws a wrench into the mix.
Content-Transfer-Encoding: base64
http://www.stackoverflow.com
VGhpcyBpcyBzaW1wbGUgQVNDSUkgQmFzZTY0IGZvciBTdGFja092ZXJmbG93IGV4YW1wbGUu
This is a version of Base64 encoding that I have seen in some viruses and other things that attempt to take advantage of some mail readers' desire to parse MIME at all costs, versus ones that go strictly by the book, or rather the RFC, if you will.
My Base64 decoder decodes the second example to the following data stream. And keep in mind here, the original stream is all ASCII data!
[0x]86DB69FFFC30C2CB5A724A2F7AB7E5A307289951A1A5CC81A5CC81CDA5B5C1B19481054D0D
2524810985CD94D8D08199BDC8814DD1858DAD3DD995C999B1BDDC8195E1B585C1B194B8
Anyone have a good way to solve both problems at once? I'm not sure it's even possible, outside of doing two transforms on the data with different rules applied, and comparing the results. However if you took that approach, which output do you trust? It seems that ASCII heuristics is about the best solution, but how much more code, execution time, and complexity would that add to something as complicated as a virus scanner, which this code is actually involved in? How would you train the heuristics engine to learn what is acceptable Base64, and what isn't?
UPDATE:
Due to the number of views this question continues to get, I've decided to post the simple RegEx that I've been using in a C# application for 3 years now, with hundreds of thousands of transactions. Honestly, I like the answer given by Gumbo the best, which is why I picked it as the selected answer. But to anyone using C# and looking for a very quick way to at least detect whether a string or byte[] contains valid Base64 data, I've found the following to work very well for me.
[^-A-Za-z0-9+/=]|=[^=]|={3,}$
And yes, this is just for a STRING of Base64 data, NOT a properly formatted RFC 1341 message. So, if you are dealing with data of this type, please take that into account before attempting to use the above RegEx. If you are dealing with Base16, Base32, Radix or even Base64 for other purposes (URLs, file names, XML encoding, etc.), then it is highly recommended that you read RFC 4648, which Gumbo mentioned in his answer, as you need to be well aware of the charset and terminators used by the implementation before attempting to use the suggestions in this question/answer set.
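The author uses the pattern from C#; purely as an illustration, here is roughly how it behaves (a Python sketch, with the helper name made up). A match anywhere in the input means the string is not a plain Base64 payload:

import re

# A match means "not plain Base64": a character outside the alphabet,
# an '=' that is not trailing padding, or three or more '=' at the end.
NOT_BASE64 = re.compile(r"[^-A-Za-z0-9+/=]|=[^=]|={3,}$")

def is_probably_base64(s: str) -> bool:
    return NOT_BASE64.search(s) is None

print(is_probably_base64("VGhpcyBpcyBzaW1wbGUgQVNDSUk="))  # True
print(is_probably_base64("http://www.stackoverflow.com"))  # False (':' and '.')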
From RFC 4648:
Base encoding of data is used in many situations to store or transfer data in environments that, perhaps for legacy reasons, are restricted to US-ASCII data.
So whether the encoded data should be considered dangerous depends on the purpose it is used for.
But if you’re just looking for a regular expression to match Base64 encoded words, you can use the following:
^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$
This one is good, but it will match an empty string.
This one does not match an empty string:
^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{4})$
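A quick sketch showing the difference between the two (Python's re module used here just for demonstration):

import re

allows_empty = re.compile(
    r"^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$")
rejects_empty = re.compile(
    r"^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{4})$")

print(bool(allows_empty.match("")))                           # True
print(bool(rejects_empty.match("")))                          # False
print(bool(rejects_empty.match("YW55IGNhcm5hbCBwbGVhcw==")))  # True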
The answers presented so far fail to check that the Base64 string has all pad bits set to 0, as required for it to be the canonical representation of Base64 (which is important in some environments, see https://www.rfc-editor.org/rfc/rfc4648#section-3.5) and therefore, they allow aliases that are different encodings for the same binary string. This could be a security problem in some applications.
Here is the regexp that verifies that the given string is not just valid base64, but also the canonical base64 string for the binary data:
^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/][AQgw]==|[A-Za-z0-9+/]{2}[AEIMQUYcgkosw048]=)?$
The cited RFC considers the empty string as valid (see https://www.rfc-editor.org/rfc/rfc4648#section-10) therefore the above regex also does.
The equivalent regular expression for base64url (again, refer to the above RFC) is:
^(?:[A-Za-z0-9_-]{4})*(?:[A-Za-z0-9_-][AQgw]==|[A-Za-z0-9_-]{2}[AEIMQUYcgkosw048]=)?$
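To make the alias problem concrete, here is a small Python sketch using the standard-alphabet canonical regex from above: "QQ==" and "QR==" decode to the same byte (most decoders simply ignore the pad bits), but only "QQ==" is the canonical encoding:

import base64
import re

CANONICAL_B64 = re.compile(
    r"^(?:[A-Za-z0-9+/]{4})*"
    r"(?:[A-Za-z0-9+/][AQgw]==|[A-Za-z0-9+/]{2}[AEIMQUYcgkosw048]=)?$")

for s in ("QQ==", "QR=="):
    # Python's decoder does not check the pad bits, so both yield b'A'.
    print(s, base64.b64decode(s), bool(CANONICAL_B64.match(s)))
# QQ== b'A' True
# QR== b'A' False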
Neither a ":" nor a "." will show up in valid Base64, so I think you can unambiguously throw away the http://www.stackoverflow.com line. In Perl, say, something like
use feature 'say';
use MIME::Base64 qw(decode_base64);
my $sanitized_str = join q{}, grep {!/[^A-Za-z0-9+\/=]/} split /\n/, $str;
say decode_base64($sanitized_str);
might be what you want. It produces
This is simple ASCII Base64 for StackOverflow example.
The best regexp I could find up till now is here:
https://www.npmjs.com/package/base64-regex
which, in its current version, looks like:
module.exports = function (opts) {
  opts = opts || {};
  var regex = '(?:[A-Za-z0-9+\/]{4}\\n?)*(?:[A-Za-z0-9+\/]{2}==|[A-Za-z0-9+\/]{3}=)';

  return opts.exact ? new RegExp('(?:^' + regex + '$)') :
    new RegExp('(?:^|\\s)' + regex, 'g');
};
Here's an alternative regular expression:
^(?=(.{4})*$)[A-Za-z0-9+/]*={0,2}$
It satisfies the following conditions:
The string length must be a multiple of four - (?=(.{4})*$)
The content must be alphanumeric characters or + or / - [A-Za-z0-9+/]*
It can have up to two padding (=) characters on the end - ={0,2}
It accepts empty strings
To validate a base64 image data URL we can use this regex:
/^data:image\/(?:gif|png|jpeg|bmp|webp)(?:;charset=utf-8)?;base64,(?:[A-Za-z0-9]|[+/])+={0,2}/
private validBase64Image(base64Image: string): boolean {
  const regex = /^data:image\/(?:gif|png|jpeg|bmp|webp|svg\+xml)(?:;charset=utf-8)?;base64,(?:[A-Za-z0-9]|[+/])+={0,2}/;
  return !!base64Image && regex.test(base64Image);
}
The shortest regex to check RFC 4648 compliance while enforcing canonical encoding (i.e. all pad bits set to 0):
^(?=(.{4})*$)[A-Za-z0-9+/]*([AQgw]==|[AEIMQUYcgkosw048]=)?$
Actually, this is a mix of this and that answer.
I found a solution that works very well
^(?:([a-z0-9A-Z+\/]){4})*(?1)(?:(?1)==|(?1){2}=|(?1){3})$
It will match the following strings
VGhpcyBpcyBzaW1wbGUgQVNDSUkgQmFzZTY0IGZvciBTdGFja092ZXJmbG93IGV4YW1wbGUu
YW55IGNhcm5hbCBwbGVhcw==
YW55IGNhcm5hbCBwbGVhc3U=
YW55IGNhcm5hbCBwbGVhc3Vy
while it won't match any of the following invalid ones:
YW5#IGNhcm5hbCBwbGVhcw==
YW55IGNhc=5hbCBwbGVhcw==
YW55%%%%IGNhcm5hbCBwbGVhc3V
YW55IGNhcm5hbCBwbGVhc3
YW55IGNhcm5hbCBwbGVhc
YW***55IGNhcm5hbCBwbGVh=
YW55IGNhcm5hbCBwbGVhc==
YW55IGNhcm5hbCBwbGVhc===
My simplified version of a Base64 regex:
^[A-Za-z0-9+/]*={0,2}$
The simplification is that it doesn't check that the length is a multiple of 4. If you need that, use the other answers. Mine focuses on simplicity.
To test it: https://regex101.com/r/zdtGSH/1