What characters are invalid in searchtext for the here.com geocoding REST api? - geocoding

I can't find any documentation for what characters are invalid in the searchtext url parameter. Here.com refers all technical questions here, and it seems no one has asked this before.
This url generates an error with a hash:
https://geocoder.api.here.com/6.2/geocode.json?app_id=myAppID&app_code=myAppCode&searchtext=16055 Sierra Lakes Pkwy #100, Fontana, 92336
Removing the hash fixes the problem, but I was hoping someone has worked out what is and isn't acceptable.

The hash character itself is not a "forbidden" character, it just needs to be properly encoded:
16055%20Sierra%20Lakes%20Pkwy%20%23100%2C%20fontana
On the contrary the hash symbol may be used to denote a secondary address unit designator as it is frequently used in the US and used in the query above.

Related

AWS Signature Version 2 Example not reproducible

Like the guy in this question (AWS Signature Version 2 - can't reproduce signature from example) I can't run the example of AWS Signature Version 2 (https://docs.aws.amazon.com/general/latest/gr/signature-version-2.html).
We have the string:
GET\nelasticmapreduce.amazonaws.com\n/\nAWSAccessKeyId=AKIAIOSFODNN7EXAMPLE&Action=DescribeJobFlows&SignatureMethod=HmacSHA256&SignatureVersion=2&Timestamp=2011-10-03T15%3A19%3A30&Version=2009-03-31
and the sample secret key
wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
To be independent of any programming language, lets take an online tool for the hash, which is calculated with HmacSHA256: https://www.liavaag.org/English/SHA-Generator/HMAC/
But I get the following hash value:
xgbYI2xegVYMVTvnhoqc8/opbN0v/5Pn+8i9usAQAjk=
which is sadly not the expected value (not URL-encoded here):
i91nKc4PWAt0JJIdXwz9HxZCJDdiy6cf/Mj6vPxyYIs=
What did I do wrong? Why is my calculation of the hash value not correct? Is the initial string correct? If you manage to get the right result with the online tool, please let me know how it was done.
TLDR: It's the newlines
Although some tools and programming languages, particularly those based on C or originating on Unix where C was heavily used, treat \n as a notation or representation for newline, that webpage does not. If I enter the string from your Q in the webpage's 'text' mode, it computes the HMAC of a value containing a backslash and a lowercase letter 'en', not a newline as required by the AWS spec.
If I enter the correct input (containing newlines) in hex as
4745540a656c61737469636d61707265647563652e616d617a6f6e6177732e636f6d0a2f0a4157534163636573734b657949643d414b4941494f53464f444e4e374558414d504c4526416374696f6e3d44657363726962654a6f62466c6f7773265369676e61747572654d6574686f643d486d6163534841323536265369676e617475726556657273696f6e3d322654696d657374616d703d323031312d31302d3033543135253341313925334133302656657273696f6e3d323030392d30332d3331
or in base64 as
R0VUCmVsYXN0aWNtYXByZWR1Y2UuYW1hem9uYXdzLmNvbQovCkFXU0FjY2Vzc0tleUlkPUFLSUFJT1NGT0ROTjdFWEFNUExFJkFjdGlvbj1EZXNjcmliZUpvYkZsb3dzJlNpZ25hdHVyZU1ldGhvZD1IbWFjU0hBMjU2JlNpZ25hdHVyZVZlcnNpb249MiZUaW1lc3RhbXA9MjAxMS0xMC0wM1QxNSUzQTE5JTNBMzAmVmVyc2lvbj0yMDA5LTAzLTMx
then I get the correct result (and you should too).

Query string degenerate cases

I am looking around looking for a correct regualr expression for validating URI query strings. I found some answers here or here but I still have doubts on the edge cases, where the key or the value could be empty. For example, should be the following treated as valid query strings?
?&&
?=
?a=
?a=&
?=a
?&=a
I am looking [...] for a correct regular expression for [valid] URI query strings.
Sure thing, no prob. As per RFC 3986, appendix B, here it is:
^([^#]*)$
If you want something more elaborate, you can check section 3.4 for the allowed characters in addition to percent-encoded entities. The regex would look something like this:
^(%[[:xdigit:]]{2}|[[:print:]])*$
As far as RFC 3986 is concerned, all your examples are valid so far. The RFC is telling us how the query string has to be encoded while saying little about how the query string has to be structured. Older RFCs are continuously shifting authority over the structure of query strings between CGI and HTTP without ever formally specifying a grammar (see e.g. RFC 3875, sec. 4.1.7, RFC 2396, sec. 3.4, RFC 1808, sec. 2.1, …).
An interesting note can be found in RFC 7230, section 2.4:
Applications MUST NOT directly specify the syntax of queries, as this can cause operational difficulties for deployments that do not support a particular form of a query.
[…]
HTML constrains the syntax of query strings used in form submission. New form languages SHOULD NOT emulate it, but instead allow creation of a broader variety of URIs
For a full validity check on such query strings, you would have to implement the algorithm for decoding formdata recommended by the W3C. Could be done in regex, but I would advise against it for reasons of sanity.
With regard to your examples: I believe they are all valid. How they are interpreted should be left to the receiving application. Some are not as much of a fringe case as you may think, though: ?&& is simply an empty dictionary while ?=a could map to { "": "a" }.

Enabling Elasticsearch index names with illegal characters

I am trying to create elasticsearch indexes with strings like xxx/yyy and xxx yyy but these are not permitted because they contain illegal characters (/ and ). These names are largely user created and out of my control so changing the names for the sake of fitting into the requirements of elasticsearch is not really an option.
This is the exact error message:
[Error: InvalidIndexNameException[[XXX\%FFZZZ] Invalid index name [XXX\%FFZZZ], must not contain the following characters [\, /, *, ?, ", <, >, |, , ,]]]
Anyways, I've tried URL encoding the strings, but that doesn't work because those include capital letters which are not permitted and backslash escaping is out of the question because it is in the list of illegal characters.
Is there a conventional solution to this problem, or do I have to come up with some sketchy serialization and/or hashing scheme to solve this?
Hmm, letting users have the control on such things like index name is asking for troubles :)
But if you're willing to pursue that route, what I suggest is simply to remove any character that is not alphanumeric and lowercase the result in the process.
In PHP that would be:
$index = preg_replace("/[^a-z0-9]+/i", "", $index);
In Java:
index = index.replace("/[^a-z0-9]+/i", "");
In Javascript:
index = index.replace(/[^a-z0-9]+/i, "");
Please do not allow users to define the index name. You can try to filter out illegal characters, but your regexp might have an issue, and you might run into trouble later.
Also users might not understand why they create problems if one usere uses My_Index and writes stuff in and the next user trying to access yndex accesses the same index.
BTW: The regexp given above is more strict than the list of legal characters asks for. For example _ is legal (but not at the beginning of the name), if you wanted to create a regexp that allows everything that is legal by ES standards, your regexp becomes more complicated and more error prone.

What's the format of a CUID in SAP BI/BO?

I'm interfacing with an SAP BI/BO server and some webservices require an input id, called "CUID" (Cluser Unique ID). for example, there's a webservice getObjectById which reqires a cuid as input.
I'm trying to make my code more robust by checking if the cuid entered by a user makes sense, but I can't find a regular expression that properly describes how a CUID looks like. There is a lot of documentation for GUID, but they're not the same. Below are some examples of CUID's found in our system and it looks like they are well-formatted but I'm not sure:
AQA9CNo0cXNLt6sZp5Uc5P0
AXiYjXk_6cFEo.esdGgGy_w
AZKmxuHgAgRJiducy2fqmv0
ASSn7jfNPCFDm12sv3muJwU
AUmKm2AjdPRMl.b8rf5ILww
AaratKz7EDFIgZEeI06o8Fc
ATjdf_MjcR9Anm6DgSJzxJ8
AaYbXdzZ.8FGh5Lr1R1TRVM
Afda1n_SWgxKkvU8wl3mEBw
AaZBfzy_S8FBvQKY4h9Pj64
AcfqoHIzrSFCnhDLMH854Qc
AZkMAQWkGkZDoDrKhKH9pDU
AaVI1zfn8gRJqFUHCa64cjg
My guess would: start with capital A, then add 22 random characters in range [0-9A-Za-Z_.]. but perhaps it could be the A means something else and after awhile it would be using B...
Is anyone familiar with this type of id's and how they are formatted?
(quick side question: do I need to escape the "dot" in the square brackets like this \. to get the actual dot character?)
The definition of the different ID types and their purpose is described in the SAP KB note 1285103: What are the different types of IDs used in the BusinessObjects Enterprise repository?
However, I couldn't find any description of the format of the CUID. I wouldn't make any assumptions about it though, other than the fact that it's alphanumeric.
I did a quick query on a repository and found CUIDs consisting up to 35 characters and beginning with the letters A,B,C,F,k and M.
If you look at the repository database, more specifically the table CMS_INFOOBJECTS7, you'll notice that the column SI_CUID is defined as a VARCHAR2, 56 bytes in size (Oracle RDBMS).
Thus, a valid regex expression to match these would be [a-zA-Z0-9\._]+.

Regular expression for validating names and surnames?

Although this seems like a trivial question, I am quite sure it is not :)
I need to validate names and surnames of people from all over the world. Imagine a huge list of miilions of names and surnames where I need to remove as well as possible any cruft I identify. How can I do that with a regular expression? If it were only English ones I think that this would cut it:
^[a-z -']+$
However, I need to support also these cases:
other punctuation symbols as they might be used in different countries (no idea which, but maybe you do!)
different Unicode letter sets (accented letter, greek, japanese, chinese, and so on)
no numbers or symbols or unnecessary punctuation or runes, etc..
titles, middle initials, suffixes are not part of this data
names are already separated by surnames.
we are prepared to force ultra rare names to be simplified (there's a person named '#' in existence, but it doesn't make sense to allow that character everywhere. Use pragmatism and good sense.)
note that many countries have laws about names so there are standards to follow
Is there a standard way of validating these fields I can implement to make sure that our website users have a great experience and can actually use their name when registering in the list?
I would be looking for something similar to the many "email address" regexes that you can find on google.
I sympathize with the need to constrain input in this situation, but I don't believe it is possible - Unicode is vast, expanding, and so is the subset used in names throughout the world.
Unlike email, there's no universally agreed-upon standard for the names people may use, or even which representations they may register as official with their respective governments. I suspect that any regex will eventually fail to pass a name considered valid by someone, somewhere in the world.
Of course, you do need to sanitize or escape input, to avoid the Little Bobby Tables problem. And there may be other constraints on which input you allow as well, such as the underlying systems used to store, render or manipulate names. As such, I recommend that you determine first the restrictions necessitated by the system your validation belongs to, and create a validation expression based on those alone. This may still cause inconvenience in some scenarios, but they should be rare.
I'll try to give a proper answer myself:
The only punctuations that should be allowed in a name are full stop, apostrophe and hyphen. I haven't seen any other case in the list of corner cases.
Regarding numbers, there's only one case with an 8. I think I can safely disallow that.
Regarding letters, any letter is valid.
I also want to include space.
This would sum up to this regex:
^[\p{L} \.'\-]+$
This presents one problem, i.e. the apostrophe can be used as an attack vector. It should be encoded.
So the validation code should be something like this (untested):
var name = nameParam.Trim();
if (!Regex.IsMatch(name, "^[\p{L} \.\-]+$"))
throw new ArgumentException("nameParam");
name = name.Replace("'", "'"); //&apos; does not work in IE
Can anyone think of a reason why a name should not pass this test or a XSS or SQL Injection that could pass?
complete tested solution
using System;
using System.Text.RegularExpressions;
namespace test
{
class MainClass
{
public static void Main(string[] args)
{
var names = new string[]{"Hello World",
"John",
"João",
"タロウ",
"やまだ",
"山田",
"先生",
"мыхаыл",
"Θεοκλεια",
"आकाङ्क्षा",
"علاء الدين",
"אַבְרָהָם",
"മലയാളം",
"상",
"D'Addario",
"John-Doe",
"P.A.M.",
"' --",
"<xss>",
"\""
};
foreach (var nameParam in names)
{
Console.Write(nameParam+" ");
var name = nameParam.Trim();
if (!Regex.IsMatch(name, #"^[\p{L}\p{M}' \.\-]+$"))
{
Console.WriteLine("fail");
continue;
}
name = name.Replace("'", "'");
Console.WriteLine(name);
}
}
}
}
I would just allow everything (except an empty string) and assume the user knows what his name is.
There are 2 common cases:
You care that the name is accurate and are validating against a real paper passport or other identity document, or against a credit card.
You don't care that much and the user will be able to register as "Fred Smith" (or "Jane Doe") anyway.
In case (1), you can allow all characters because you're checking against a paper document.
In case (2), you may as well allow all characters because "123 456" is really no worse a pseudonym than "Abc Def".
I would think you would be better off excluding the characters you don't want with a regex. Trying to get every umlaut, accented e, hyphen, etc. will be pretty insane. Just exclude digits (but then what about a guy named "George Forman the 4th") and symbols you know you don't want like ##$%^ or what have you. But even then, using a regex will only guarantee that the input matches the regex, it will not tell you that it is a valid name.
EDIT after clarifying that this is trying to prevent XSS: A regex on a name field is obviously not going to stop XSS on its own. However, this article has a section on filtering that is a starting point if you want to go that route:
s/[\<\>\"\'\%\;\(\)\&\+]//g;
"Secure Programming for Linux and Unix HOWTO" by David A. Wheeler, v3.010 Edition (2003)
v3.72, 2015-09-19 is a more recent version.
BTW, do you plan to only permit the Latin alphabet, or do you also plan to try to validate Chinese, Arabic, Hindi, etc.?
As others have said, don't even try to do this. Step back and ask yourself what you are actually trying to accomplish. Then try to accomplish it without making any assumptions about what people's names are, or what they mean.
I don’t think that’s a good idea. Even if you find an appropriate regular expression (maybe using Unicode character properties), this wouldn’t prevent users from entering pseudo-names like John Doe, Max Mustermann (there even is a person with that name), Abcde Fghijk or Ababa Bebebe.
You could use the following regex code to validate 2 names separeted by a space with the following regex code:
^[A-Za-zÀ-ú]+ [A-Za-zÀ-ú]+$
or just use:
[[:lower:]] = [a-zà-ú]
[[:upper:]] =[A-ZÀ-Ú]
[[:alpha:]] = [A-Za-zÀ-ú]
[[:alnum:]] = [A-Za-zÀ-ú0-9]
It's a very difficult problem to validate something like a name due to all the corner cases possible.
Corner Cases
Anything anything here
Sanitize the inputs and let them enter whatever they want for a name, because deciding what is a valid name and what is not is probably way outside the scope of whatever you're doing; given the range of potential strange - and legal names is nearly infinite.
If they want to call themselves Tricyclopltz^2-Glockenschpiel, that's their problem, not yours.
A very contentious subject that I seem to have stumbled along here. However sometimes it's nice to head dear little-bobby tables off at the pass and send little Robert to the headmasters office along with his semi-colons and SQL comment lines --.
This REGEX in VB.NET includes regular alphabetic characters and various circumflexed european characters. However poor old James Mc'Tristan-Smythe the 3rd will have to input his pedigree in as the Jim the Third.
<asp:RegularExpressionValidator ID="RegExValid1" Runat="server"
ErrorMessage="ERROR: Please enter a valid surname<br/>" SetFocusOnError="true" Display="Dynamic"
ControlToValidate="txtSurname" ValidationGroup="MandatoryContent"
ValidationExpression="^[A-Za-z'\-\p{L}\p{Zs}\p{Lu}\p{Ll}\']+$">
This one worked perfectly for me in JavaScript:
^[a-zA-Z]+[\s|-]?[a-zA-Z]+[\s|-]?[a-zA-Z]+$
Here is the method:
function isValidName(name) {
var found = name.search(/^[a-zA-Z]+[\s|-]?[a-zA-Z]+[\s|-]?[a-zA-Z]+$/);
return found > -1;
}
Steps:
first remove all accents
apply the regular expression
To strip the accents:
private static string RemoveAccents(string s)
{
s = s.Normalize(NormalizationForm.FormD);
StringBuilder sb = new StringBuilder();
for (int i = 0; i < s.Length; i++)
{
if (CharUnicodeInfo.GetUnicodeCategory(s[i]) != UnicodeCategory.NonSpacingMark) sb.Append(s[i]);
}
return sb.ToString();
}
This somewhat helps:
^[a-zA-Z]'?([a-zA-Z]|\.| |-)+$
This one should work
^([A-Z]{1}+[a-z\-\.\']*+[\s]?)*
Add some special characters if you need them.