How to execute a structured query containing symbols in AWS CloudSearch

I'm trying to execute a structured prefix query in Cloudsearch.
Here's a snippet of the query args (csattribute is of type text)
{
"query": "(prefix field=csattribute '12-3')",
"queryParser": "structured",
"size": 5
}
The above query results in No matches for "(prefix field=csattribute '12-3')".
However, if I change my query to
{
"query": "(prefix field=csattribute '12')",
"queryParser": "structured",
"size": 5
}
Then I get the list of results I expect.
I haven't found much in my brief googling. How do I include the - in the query? Does it need to be escaped? Are there other characters that need to be escaped?

I was pointed in the right direction by this SO question: How To search special symbols AWS Search
Below is a snippet from https://docs.aws.amazon.com/cloudsearch/latest/developerguide/text-processing.html
Text Processing in Amazon CloudSearch ... During tokenization, the stream of text in a field is split into separate tokens on detectable boundaries using the word break rules defined in the Unicode Text Segmentation algorithm.
According to the word break rules, strings separated by whitespace such as spaces and tabs are treated as separate tokens. In many cases, punctuation is dropped and treated as whitespace. For example, strings are split at hyphens (-) and the at symbol (@). However, periods that are not followed by whitespace are considered part of the token.
From what I understand, text and text-array fields are tokenized based on the analysis scheme (in my case English). The text was tokenized, and the - symbol is a word break boundary, so '12-3' was split into the separate tokens '12' and '3'.
This field doesn't need to be tokenized. Updating the index type to literal prevents all tokenization on the field, which allows the query in my question to return the expected results.
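For reference, here is a minimal sketch of running the same structured prefix query from Python with boto3, assuming the field has been re-indexed as literal; the search endpoint URL below is a placeholder.
import boto3

# The cloudsearchdomain client must point at the domain's search endpoint
# (placeholder URL below).
client = boto3.client(
    "cloudsearchdomain",
    endpoint_url="https://search-mydomain-xxxxxxxx.us-east-1.cloudsearch.amazonaws.com",
)

resp = client.search(
    query="(prefix field=csattribute '12-3')",
    queryParser="structured",
    size=5,
)
print(resp["hits"]["found"], "matches")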


What is the regular expression for all pages except "/"?

I am using NextAuth with Next.js for session management. In addition, I am using middleware.js to protect my routes from unauthenticated users.
According to https://nextjs.org/docs/advanced-features/middleware#matcher,
if we want to exclude a path, we do something like
export const config = {
  matcher: [
    /*
     * Match all request paths except for the ones starting with:
     * - api (API routes)
     * - static (static files)
     * - favicon.ico (favicon file)
     */
    '/((?!api|static|favicon.ico).*)',
  ],
}
In this example, we exclude /api, /static, and /favicon.ico. However, I want to exclude all paths except the home page, "/". What is the regular expression for that? I tried '/(*)', but it doesn't seem to work.
The regular expression which matches everything but a specific one-character string / is constructed as follows:
we need to match the empty string: empty regex.
we need to match all strings two characters long or longer: ..+
we need to match one-character strings which are not that character: [^/].
Combining these three together with the | branching operator: "|..+|[^/]".
If we are using a regular expression tool that performs substring searching rather than a full match, we need to use its anchoring features; perhaps it supports the ^ and $ notation for that: "^(|..+|[^/])$".
I'm guessing that you might not want to match empty strings; in which case, revise your requirement and drop that branch from the expression.
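As a quick sanity check, here is a small sketch of that full-match pattern (Python is only used here for illustration):
import re

# Full match of: empty string | two or more characters | a single character that is not "/"
pattern = re.compile(r"^(|..+|[^/])$")

for s in ["", "/", "//", "a", "/about", "/api/users"]:
    print(repr(s), bool(pattern.match(s)))
# Only "/" fails to match.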
Suppose we wanted to match all strings, except for a specific fixed word like abc. Without negation support in the regex language, we can use a generalization of the above trick.
Match the empty string, like before, if desired.
Match all one-character strings: .
Match all two-character strings: ..
Match all strings longer than three characters: ....+
Those simple cases taken care of, we focus on matching just those three-symbol strings that are not abc. How can we do that?
Match all three-character strings that don't start with a: [^a]...
Match all three-character strings that don't have a b in the middle: .[^b].
Match all three-character strings that don't end in c: ..[^c].
Combine it all together: "|.|..|....+|[^a]..|.[^b].|..[^c]".
For longer words, we might want to take advantage of the {m,n} notation, if available, to express "match from zero to nine characters" and "match eleven or more characters".
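The same kind of check for the abc construction, again just an illustrative Python sketch:
import re

# empty | 1 char | 2 chars | 4+ chars | 3 chars not starting with a,
# or not having b in the middle, or not ending in c
pattern = re.compile(r"^(|.|..|....+|[^a]..|.[^b].|..[^c])$")

for s in ["", "ab", "abc", "abd", "xbc", "abcd"]:
    print(repr(s), bool(pattern.match(s)))
# Only "abc" fails to match.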
I will need to exclude the signin page and the register page as well, because you get an infinite loop and an error if you don't exclude the signin page. As for the register page, you won't be able to register if you are redirected to the signin page.
So the "/", "/auth/signin", and "/auth/register" will be excluded. Here is what I needed:
export const config = {
  matcher: [
    '/((?!auth).*)(.+)'
  ]
}

Extract id from URL using regex including underscores and alphanumeric characters [duplicate]

I am using a data analysis package that exposes a Regex function for string parsing. I am trying to parse a response from a website that is in the format...
key1=val1&key2=val2&key3=val3 ...
[The keys and values may be percent-encoded, but the current return values are not; they are tokens and other info that are alphanumeric.]
I understand this data to be www-form-urlencoded, or alternatively it might be known as query string format.
The objective is to extract the value for a given key when the order of the keys cannot be relied upon. For example, I might know that one of the keys I should receive is "token", so what regex pattern can I use to extract the value for the key "token"? I have searched for this but cannot find anything that does what I need; if there is a duplicate question, apologies in advance.
In Alteryx, you may use Tokenize with a regex containing a capturing group around the part you need to extract:
The Tokenize Method allows you to specify a regular expression to match on and that part of the string is parsed into separate columns (or rows). When using the Tokenize method, you want to match to the whole token, and if you have a marked group, only that part is returned.
The relevant part of the method description (bolded in the original answer) is that if there is a marked (capturing) group, only that part is returned rather than the whole match.
Thus, you may use
(?:^|[?&])token=([^&]*)
where instead of token you may use any of the keys the value for which you want to extract.
See the regex demo.
Details
(?:^|[?&]) - the start of a string, ? or & (if the string is just a plain key-value pair string, you may omit ? and use (?:^|&) or (?<![^&]))
token - the key
= - an equal sign
([^&]*) - Group 1 (this will get extracted): 0 or more chars other than & (if you do not want to extract empty values, replace * with + quantifier).
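If you end up doing the same extraction outside Alteryx, a rough Python equivalent might look like this (the sample string is made up):
import re
from urllib.parse import parse_qs

raw = "key1=val1&token=abc_123&key3=val3"

m = re.search(r"(?:^|[?&])token=([^&]*)", raw)
print(m.group(1) if m else None)   # -> abc_123

# For well-formed query strings, parse_qs is a regex-free alternative.
print(parse_qs(raw).get("token"))  # -> ['abc_123']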

Regex is not removing websites from text data in preprocessing

I am doing text preprocessing, and my text contains website URLs. I want to remove them, but I haven't been able to.
Below is the sample text:
\n\nWorldwide web (www)\n\nName for the entirety of documents linked
through hyperlinks on the Internet; often used as a synonym for the
latter26.\n\n\n\n\n\n\n\n24\xe2\x80\x83\twww.sicherheitskultur.at,
Information Security Glossary\n\n25\xe2\x80\x83\tSource of text
(partly): KS\xc3\x96: Cyber Risk Matrix -
Glossary\n\n26\xe2\x80\x83\twww.sicherheitskultur.at, Information
Security Glossary\n\n\n\n\n\n23\n'
The websites are visible (they were bolded in the original post) and I want to remove them.
I have tried the code from the StackOverflow answer "Python code to remove HTML tags from a string", but it is not removing these websites.
Below is the code:
import re

def remove_web(text):
    cleanr = re.compile('<.*?.*#>')
    text = re.sub(cleanr, '', text)
    return text
Thanks in advance!
So if you only want to remove this particular URL, you could use this regex:
www\.[a-z]+\.at
(Go with David Amar's solution.)
www(\.\w+)+
Explanation:
- first it reads www
- then at least one block like this: a dot + some text (letters, digits, underscores)
To match more characters in the URL (hyphens, for example), replace \w with a character set such as [a-zA-Z0-9_-].
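Applied to the sample text, a rough sketch could look like this (the exact character class is an assumption and may need widening for real data):
import re

def remove_web(text):
    # Strip bare www... URLs; widen [\w-] if the URLs can contain other characters.
    return re.sub(r"www(\.[\w-]+)+", "", text)

sample = "26 www.sicherheitskultur.at, Information Security Glossary"
print(remove_web(sample))  # -> "26 , Information Security Glossary"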

How to validate accented characters with coffeescript regex?

I need to validate alphabetical characters in a text field. What I have now works fine, but there is a catch: I need to allow accented characters (like āēīūčļ), and on a Latvian keyboard these are obtained by typing the single quote first ('c -> č), so my validator fails if the user types the single quote followed by a disallowed character like a number, producing '1.
I have this CoffeeScript-flavored jQuery validator for a webpage text entry field that only allows alphabetical characters (for entering a name).
allowAlphabeticalEntriesOnly = (target) ->
  target.keypress (e) ->
    regex = new RegExp("[a-zA-Z]")
    str = String.fromCharCode((if not e.charCode then e.which else e.charCode))
    return true if regex.test(str)
    e.preventDefault()
    false
And it gets called with:
allowAlphabeticalEntriesOnly $("#user_name_text")
The code and regex work fine, denying input of almost everything except lowercase and uppercase letters and the single quote, which is where things get tricky.
Is there a way to allow accented characters with the single-quote layout, but deny entry of forbidden characters after the quote?
EDIT: If all else fails, one can implement back-end invalid-character deletion à la .gsub(/[^a-zA-Z\u00C0-\u017F]/, ''), which is what I ended up doing.
Try using [a-zA-Z\u00C0-\u017F] to match a-z and accented characters (all characters within the specified Unicode range).
See: Matching accented characters with Javascript regexes
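For the back-end check mentioned in the EDIT, the same character class can be reused; here is a small illustration (in Python rather than CoffeeScript, with made-up names):
import re

# Accept only Latin letters plus the accented range U+00C0-U+017F.
valid_name = re.compile(r"^[a-zA-Z\u00C0-\u017F]+$")

for name in ["Jānis", "čaks", "Anna", "'1"]:
    print(name, bool(valid_name.match(name)))
# "'1" is rejected because it contains the quote and a digit.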

ElasticSearch Regexp Filter

I'm having problems correctly expressing a regexp for the ElasticSearch Regexp Filter. I'm trying to match anything under "info-for/media" in the url field, e.g. http://mydomain.co.uk/info-for/media/press-release-1. To try and get the regex right I'm using match_all for now, but this will eventually be match_phrase with the user's query string.
POST to localhost:9200/_search
{
  "query" : {
    "match_all" : { },
    "filtered" : {
      "filter" : {
        "regexp": {
          "url": ".*info-for/media.*"
        }
      }
    }
  }
}
This returns 0 hits, but does parse correctly. .*info.* does get results containing the url, but unfortunately is too broad, e.g. matching any urls containing "information". As soon as I add the hyphen in "info-for" back in, I get 0 results again. No matter what combination of escape characters I try, I either get a parse exception, or no matches. Can anybody help explain what I'm doing wrong?
First, to the extent possible, try to never use regular expressions or wildcards that don't have a prefix. The way a search for .*foo.* is done, is that every single term in the index's dictionary is matched against the pattern, which in turn is constructed into an OR-query of the matching terms. This is O(n) in the number of unique terms in your corpus, with a subsequent search that is quite expensive as well.
This article has some more details about that: https://www.found.no/foundation/elasticsearch-from-the-bottom-up/
Secondly, your url is probably tokenized in a way that makes "info-for" and "media" separate terms in your index. Thus, there is no info-for/media-term in the dictionary for the regexp to match.
What you probably want to do is to index the path and the domain separately, with a path_hierarchy-tokenizer to generate the terms.
Here is an example that demonstrates how the tokens are generated: https://www.found.no/play/gist/ecf511d4102a806f350b#analysis
I.e. /foo/bar/baz generates the tokens /foo/bar/baz, /foo/bar, /foo and the domain foo.example.com is tokenized to foo.example.com, example.com, com
A search for anything below /foo/bar could then be a simple term filter matching path:/foo/bar. That's a massively more performant filter, which can also be cached.
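A minimal sketch of that idea, assuming a recent Elasticsearch and hypothetical index/field names (path_hierarchy and the term query are real features; everything else is illustrative):
import requests

ES = "http://localhost:9200"

# Index the path component with a path_hierarchy tokenizer so that
# /info-for/media/press-release-1 also produces the terms /info-for and /info-for/media.
settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "path_analyzer": {"type": "custom", "tokenizer": "path_hierarchy"}
            }
        }
    },
    "mappings": {
        "properties": {
            "url_path": {"type": "text", "analyzer": "path_analyzer"}
        }
    },
}
requests.put(f"{ES}/pages", json=settings)

requests.post(
    f"{ES}/pages/_doc?refresh=true",
    json={"url_path": "/info-for/media/press-release-1"},
)

# A cheap term-level match on the generated path tokens instead of a leading-wildcard regexp.
query = {"query": {"term": {"url_path": "/info-for/media"}}}
print(requests.post(f"{ES}/pages/_search", json=query).json())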