Cloudsearch Fuzzy terms and phrases - amazon-web-services

I am trying to get my head around how fuzzy search works on AWS CloudSearch.
I want to find "Star Wars" but in my search, I spell it
ster wers
The logic of my app will add fuzzy but it never returns Star Wars.
I have tried:
ster~1 wers~1
"ster wers"~2
"ster"~1 "wers"~1
What am I missing here?

Your query doesn't work because of how CloudSearch stems terms. If your field is indexed with the Analysis Scheme set to English, then wars will be stored in its stemmed form as war.
Here's a little demo of how stemming is affecting your query.
Searching with the un-stemmed query ('ster wers'):
Searching with the un-stemmed query requires you to match wers to war, which is off by 2 chars and requires this query: q=ster~1+wers~2.
Searching with the stemmed query ('ster wer'):
Searching with the stemmed version means you're matching wer to war and you're only off by 1 char. Thus ster~1 wer~1 will get the desired result (i.e. it matches Star Wars).
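The "off by 1" and "off by 2" counts above can be checked with a small edit-distance function. This is a minimal sketch using plain Levenshtein distance; it assumes that the ~N operator counts edits in roughly this way, and it does not call CloudSearch at all.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character inserts, deletes or substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


print(edit_distance("ster", "star"))  # 1
print(edit_distance("wers", "war"))   # 2
print(edit_distance("wer", "war"))    # 1
```

So against a stemmed index, wers needs ~2 while wer only needs ~1.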
How to fix:
The use case you described will work if you configure the Analysis Scheme for the field in question to not use any stemming.
To do this, log into the AWS Web Console and go to Analysis Schemes --> Add Analysis Scheme:
Then go to Indexing Options and configure your field to use your new no-stemming analysis scheme:
Submit your changes and re-index.
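If you would rather script these console steps, something like the following boto3 sketch might work. The domain name, field name and scheme name are placeholders I made up, and note the assumption that redefining the index field this way replaces its existing options, so check your current field settings first.

```python
def disable_stemming(client, domain, field, scheme="no_stemming"):
    """Create a no-stemming analysis scheme, attach it to a text field, re-index.

    `client` is expected to be boto3.client("cloudsearch"); `domain`, `field`
    and `scheme` are hypothetical names chosen for illustration.
    """
    client.define_analysis_scheme(
        DomainName=domain,
        AnalysisScheme={
            "AnalysisSchemeName": scheme,
            "AnalysisSchemeLanguage": "en",
            "AnalysisOptions": {"AlgorithmicStemming": "none"},
        },
    )
    client.define_index_field(
        DomainName=domain,
        IndexField={
            "IndexFieldName": field,
            "IndexFieldType": "text",
            "TextOptions": {"AnalysisScheme": scheme},
        },
    )
    # Configuration changes only take effect after the domain is re-indexed.
    client.index_documents(DomainName=domain)


# Usage (needs AWS credentials, so it is left commented out):
# import boto3
# disable_stemming(boto3.client("cloudsearch"), "my-domain", "title")
```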
That will address your issue but of course you'll lose the benefits of stemming. You can't have your cake and eat it too.

Shopify regular expressions for checkout process

I am working with the Google Ads team in my company on a Shopify store and they asked me for some regular expressions for the several steps of the checkout process. I created them and everything was running fine, until the guys noticed that sometimes Analytics added a _ga parameter to the URL query parameters.
My original expressions are:
1. When in cart - no problem here
\/cart
2. First step of checkout - Contact Information - In several lines for easier reading
(
\/([0-9]*)\/checkouts\/([a-z0-9\-]*)$
|
\/([0-9]*)\/checkouts\/([a-z0-9\-]*)\?step=contact_information
)
In this part I added the step=contact_information as an OR option. It isn't normally there, but when you go back to contact information it is added to the URL as the step parameter. I know this is not the ideal way, but I am far from fluent in regex.
3. Shipping information
(
\/([0-9]*)\/checkouts\/([a-z0-9\-]*)\?step=shipping_method
|
\/([0-9]*)\/checkouts\/([a-z0-9\-]*)?(.*)&step=shipping_method
)
In this part it always has step=shipping_method but it can also have previous_step=contact_information. This is also not ideal, but I am not sure how to do it.
4. Payment information
(
\/([0-9]*)\/checkouts\/([a-z0-9\-]*)\?step=payment_method
|
\/([0-9]*)\/checkouts\/([a-z0-9\-]*)?(.*)&step=payment_method
)
The same as step 3, in this case it always has step=payment_method but it can also have previous_step=shipping_method. As points 2 and 3, not ideal.
5. Processing - this part works fine, because I am not interested in the query parameters
\/([0-9]*)\/checkouts\/([a-z0-9\-]*)\/processing
6. Thank you page - this also works fine, because I am not interested in the query parameters
\/([0-9]*)\/checkouts\/([a-z0-9\-]*)\/thank_you
Issue with _ga parameter
Those regular expressions work fine with the regular URLs, but when I add the _ga parameter to the URL they don't match. I think there was a way to match query parameters, but I am not sure how to match certain parameters and exclude others.
The _ga parameter normally persists on the next steps
The list of all the possible matches for points 2., 3. and 4.:
Contact information without and with _ga
/25931284564/checkouts/df24e48ecc81f767583c4a26680bcb82
/25931284564/checkouts/df24e48ecc81f767583c4a26680bcb82?step=contact_information
/25931284564/checkouts/df24e48ecc81f767583c4a26680bcb82?_ga=2.150710640.738515769.1576779089-71346777.1571176760%26_gac%3D1.16451458.1576260301.EAIaIQobChMI9v2c5Zqz5gIVr__jBx1VAgxPEAAYBCAAEgLccPD_BwE&locale=es
/25931284564/checkouts/df24e48ecc81f767583c4a26680bcb82?_ga=2.150710640.738515769.1576779089-71346777.1571176760%26_gac%3D1.16451458.1576260301.EAIaIQobChMI9v2c5Zqz5gIVr__jBx1VAgxPEAAYBCAAEgLccPD_BwE&locale=es&step=contact_information
Shipping method without and with _ga
/25931284564/checkouts/df24e48ecc81f767583c4a26680bcb82?step=shipping_method
/25931284564/checkouts/df24e48ecc81f767583c4a26680bcb82?step=shipping_method&previous_step=contact_information
/25931284564/checkouts/df24e48ecc81f767583c4a26680bcb82?_ga=2.150710640.738515769.1576779089-71346777.1571176760%26_gac%3D1.16451458.1576260301.EAIaIQobChMI9v2c5Zqz5gIVr__jBx1VAgxPEAAYBCAAEgLccPD_BwE&locale=es&step=shipping_method
/25931284564/checkouts/df24e48ecc81f767583c4a26680bcb82?_ga=2.150710640.738515769.1576779089-71346777.1571176760%26_gac%3D1.16451458.1576260301.EAIaIQobChMI9v2c5Zqz5gIVr__jBx1VAgxPEAAYBCAAEgLccPD_BwE&locale=es&step=shipping_method&previous_step=contact_information
Payment method without and with _ga
/25931284564/checkouts/df24e48ecc81f767583c4a26680bcb82?step=payment_method
/25931284564/checkouts/df24e48ecc81f767583c4a26680bcb82?step=payment_method&previous_step=shipping_method
/25931284564/checkouts/df24e48ecc81f767583c4a26680bcb82?_ga=2.150710640.738515769.1576779089-71346777.1571176760%26_gac%3D1.16451458.1576260301.EAIaIQobChMI9v2c5Zqz5gIVr__jBx1VAgxPEAAYBCAAEgLccPD_BwE&locale=es&step=payment_method
/25931284564/checkouts/df24e48ecc81f767583c4a26680bcb82?_ga=2.150710640.738515769.1576779089-71346777.1571176760%26_gac%3D1.16451458.1576260301.EAIaIQobChMI9v2c5Zqz5gIVr__jBx1VAgxPEAAYBCAAEgLccPD_BwE&locale=es&step=payment_method&previous_step=shipping_method
Any ideas how I could solve this? I am pretty sure it's simple, but my brain just doesn't get around more complex regular expressions :)
UPDATE
Just to clear this up a bit more, what I need to achieve with the regular expressions is to identify specifically the step of the funnel.
The Google Ads guys from my team are creating a funnel in Analytics and they add the corresponding steps from the checkout as stages of the funnel.
So basically I just need my regexes to be able to work with or without the _ga query, BUT always detecting a specific step.
UPDATE 2
I added all the possible matches. I need to be able to identify the specific step through the regular expression. So basically I need one regular expression for contact information, one for shipping method and one for payment method, each identifying only the specific step with or without _ga in the URL.
I believe for the checkout url you can simply use this regex:
/([0-9]+)/checkouts/([a-z0-9-]+)(?:.*step=([a-z0-9_-]+))?
no matter if the url is with/without _ga parameter.
Basically it will provide you three groups in a match, the third group will contain step parameter value, e.g.: contact_information
Example:
https://regex101.com/r/C1GuDY/1
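One thing worth checking with the pattern above: because .* is greedy, a URL that has previous_step=... after step=... will put the previous_step value in the third group. Below is a hedged Python sketch of a variant that only accepts step as a whole parameter name (directly after ? or &). The function name and the fallback to contact_information when no step is present are my own assumptions based on the URL list in the question, and the lookahead is Python syntax; I have not verified which constructs Analytics accepts.

```python
import re

# Pattern from the answer above, kept for comparison.
ORIGINAL_RE = re.compile(r"/([0-9]+)/checkouts/([a-z0-9-]+)(?:.*step=([a-z0-9_-]+))?")

# Variant: "step" must directly follow "?" or "&", so previous_step is ignored.
STEP_RE = re.compile(
    r"^/[0-9]+/checkouts/[a-z0-9-]+(?=\?|$)"  # shop id and token, then "?" or end
    r"(?:\?(?:.*&)?step=([a-z0-9_]+))?"       # optional step=<value>
)


def checkout_step(path):
    """Return the funnel step for a checkout path, or None for other pages."""
    m = STEP_RE.match(path)
    if m is None:
        return None  # e.g. /processing or /thank_you
    return m.group(1) or "contact_information"


BASE = "/25931284564/checkouts/df24e48ecc81f767583c4a26680bcb82"
url = BASE + "?step=shipping_method&previous_step=contact_information"
print(ORIGINAL_RE.search(url).group(3))  # contact_information
print(checkout_step(url))                # shipping_method
```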

REGEX in GA view filter (search and replace) to output a numeric ID from URI

I'd like to search and replace in Google Analytics view filter all my Request URIs in such a way that just the article id remains (plus, to add an "a-" before the ID).
Example URIs:
/raksts/sievietem/281750-ilona-balode-par-dzivi-ar-udriti-no-mums-beg-ka-no-grimstosa-kuga
/raksts/zinas/281427-video-izskatas-ka-spelu-automatu-atkariba-baibu-strautmani-nelaiz-vala
/raksts/arzemes/282070-pasauli-savilno-mazas-princeses-sarlotes-emocijas-karaliskajas-kazas
/raksts/izklaide/280379-turpinas-tirisana-jrt-maru-kimeli-atlaiz-hermana-sieva-aiziet-pati
The result I'm after:
a-281750
a-281427
a-282070
a-280379
In Regex checking sites it works like a charm, Regex being:
\/raksts\/\D+(\d+).+
Substitution being:
a-$1
But when I apply them to a GA filter, the checker tells me that the filter wouldn't have changed any data.
Not sure if I need to do this in GA - Data Studio would do too, the endgame being exported article IDs for our IT guys to implement in our editorial interface through Google API.
Probably a dumb question for which I apologize, but I feel quite stuck, so even a hint in the right direction will be greatly appreciated.
Try to add backslash before 1:
a-\1
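For what it's worth, Python's re.sub also uses \1 for the first group, so the pattern and replacement can be sanity-checked outside GA. This is only a sketch of the regex logic against the example URIs; it says nothing about how the GA filter itself evaluates them, and the function name is made up.

```python
import re

ARTICLE_RE = re.compile(r"/raksts/\D+(\d+).+")


def to_article_id(uri):
    """Rewrite a request URI to 'a-<article id>' using the \\1 backreference."""
    return ARTICLE_RE.sub(r"a-\1", uri)


uris = [
    "/raksts/sievietem/281750-ilona-balode-par-dzivi-ar-udriti-no-mums-beg-ka-no-grimstosa-kuga",
    "/raksts/zinas/281427-video-izskatas-ka-spelu-automatu-atkariba-baibu-strautmani-nelaiz-vala",
]
for uri in uris:
    print(to_article_id(uri))
# a-281750
# a-281427
```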

Google Analytics search and replace filter URI with Regex with an exception

Currently our registration form tracks UTM and SEM codes, plus you get a very long string with social sign-ins. I end up with roughly 4k enrollment variations, which are very hard to track outside of goals.
In order to better troubleshoot channels, I've created a separate view where I want to combine everything into just /enrollment while excluding the thank you page. So I would have a list like this:
www.mysite/enrollment
www.mysite/enrollment/
www.mysite/enrollment/sem01
www.mysite/enrollment/sem02
www.mysite/enrollment?adsforefacebook
www.mysite/enrollment?utmforemail
www.mysite/enrollment/thank-you
I've tried using this filter which works in the goal section, but I can't get it to work under filters.
Find
www\.mysite\.com\/enrollment(?!/thank\-you)
Replace
www.mysite.com/enrollment
Theoretically, this should catch everything with enrollment except thank you pages and replace with the new string.
I've tried several variations that include .*, but no go.
Oops. Never mind, I think I figured it out right after I posted this. I don't think the normal find and replace works with exclusion patterns and you have to use the Advanced filter... which worked exactly as expected with the above code.
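For anyone wanting to test the exclusion logic outside GA, here is a rough Python sketch of the same negative lookahead. I added a trailing .* so the whole URI gets replaced rather than just the prefix; that part and the function name are my own assumptions, not what the Advanced filter does internally.

```python
import re

# Match any enrollment URI unless "/thank-you" follows "enrollment" directly.
ENROLL_RE = re.compile(r"^www\.mysite\.com/enrollment(?!/thank-you).*$")


def normalize(uri):
    """Collapse enrollment variations to one URI; leave everything else alone."""
    return ENROLL_RE.sub("www.mysite.com/enrollment", uri)


print(normalize("www.mysite.com/enrollment/sem01"))         # www.mysite.com/enrollment
print(normalize("www.mysite.com/enrollment?utmforemail"))   # www.mysite.com/enrollment
print(normalize("www.mysite.com/enrollment/thank-you"))     # www.mysite.com/enrollment/thank-you
```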

Match all characters in group except for first and last occurrence

Say I request
parent/child/child/page-name
in my browser. I want to extract the parent, children as well as page name. Here are the regular expressions I am currently using. There should be no limit as to how many children there are in the url request. For the time being, the page name will always be at the end and never be omitted.
^([\w-]{1,}){1} -> Match parent (returns 'parent')
(/(?:(?!/).)*[a-z]){1,}/ -> Match children (returns /child/child/)
[\w-]{1,}(?!.*[\w-]{1,}) -> Match page name (returns 'page-name')
The more I play with this, the more I feel how clunky this solution is. This is for a small CMS I am developing in ASP Classic (:(). It is sort of like the MVC routing paths, but instead of calling controllers and functions based on the URL request, I would be travelling down the hierarchy and finding the appropriate page in the database. The database is using the nested set model and is linked by a unique page name for each child.
I have tried using the split function to split with a / delimiter, however I found I was nesting so many split statements together that it became very unreadable.
All said, I need an efficient way to parse out the parent, children as well as page name from a string. Could someone please provide an alternative solution?
To be honest, I'm not even sure if a regular expression is the best solution to my problem.
Thank you.
You could try using:
^([\w-]+)(/.*/)([\w-]+)$
And then access the three matching groups created using Match.SubMatches. See here for more details.
EDIT
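Actually, assuming that you know that [\w-] is all that is used in the names of the parts, you can use ^([\w-]+)(.*/)([\w-]+)$ instead and it will handle the no-child case fine by itself as well. (The middle group needs to keep the trailing /; with a bare (.*) the greedy match would leave only the last character of the page name for the third group.)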
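To illustrate the three-group idea, here is a small sketch in Python rather than ASP Classic, assuming every segment consists only of [\w-] characters. It uses (.*/) for the middle group so the last group captures the whole page name, and it also shows a single split for comparison, which may be all that is needed. Both function names are hypothetical.

```python
import re

PATH_RE = re.compile(r"^([\w-]+)(.*/)([\w-]+)$")


def parse_path(path):
    """Regex version: returns (parent, [children], page_name) or None."""
    m = PATH_RE.match(path)
    if m is None:
        return None
    parent, middle, page = m.groups()
    children = [seg for seg in middle.split("/") if seg]
    return parent, children, page


def parse_path_split(path):
    """Split version: one split, then slice off the first and last parts."""
    parts = path.split("/")
    if len(parts) < 2 or not all(parts):
        return None
    return parts[0], parts[1:-1], parts[-1]


print(parse_path("parent/child/child/page-name"))
# ('parent', ['child', 'child'], 'page-name')
print(parse_path_split("parent/page-name"))
# ('parent', [], 'page-name')
```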

Solr Query Syntax

I just got started looking at using Solr as my search web service. I don't know whether Solr supports these query types:
Startswith
Exact Match
Contain
Doesn't Contain
In the range
Could anyone guide me how to implement those features in Solr?
Cheers,
Samnang
Solr is capable of all those things, but to adequately explain how to do each of them an answer would become a mini-manual for Solr.
I'd suggest you read the actual manual and tutorials linked from the Solr homepage.
In short though:
Startswith can be implemented using Lucene wildcards.
Exact matches will only be found if a field is not tokenized, i.e. the entire field is viewed as a single token.
Contain is the default search format, i.e. a search for "John" will find any document whose search field contains the value "John". Prefixing with - (e.g. "-John") will only find documents that do not contain John.
Ranges (be they date or integer) are possible and quite powerful. For example, date:[* TO NOW] would find any document whose date is not in the future.
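As a rough illustration of the points above, here is what the query strings might look like, built in Python. The field names (name, name_exact, date) and the core URL are made up; name_exact is assumed to be an untokenized string field. Treat it as a sketch and check the manual for your own schema.

```python
from urllib.parse import urlencode

# Hypothetical fields: "name" is tokenized text, "name_exact" is an
# untokenized string field, "date" is a date field.
QUERIES = {
    "starts_with": "name:Joh*",               # wildcard suffix
    "exact_match": 'name_exact:"John Smith"',  # whole field as one token
    "contains": "name:John",                   # default term match
    "does_not_contain": "*:* -name:John",      # everything minus matches
    "in_range": "date:[* TO NOW]",             # not in the future
}


def select_url(base, query):
    """Build a /select request URL with the query URL-encoded in q."""
    return f"{base}/select?{urlencode({'q': query})}"


for label, query in QUERIES.items():
    print(label, select_url("http://localhost:8983/solr/people", query))
```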