ColdFusion cfsearch special characters - coldfusion

I am currently using CF11 cfsearch on our website to retrieve products. However, if the product name has ANY special characters it generates an error.
We have products that have a hash, plus, exclamation signs and some with backslashes. How do I get cfsearch to search using these special symbols?

I used something like this back in the day to solve <cfsearch> special character issues: https://cflib.org/udf/verityClean
You can start with dropping this in, and then tweak the regex if you continue to see issues with other special characters. It also depends if you are using Verity/Solr/etc. to know what characters will cause issues, so you may want to look into what characters cause issues for the engine/collection you are using.

Related

Amazon CloudSearch accented words

I have an index with documents with accented words.
For example this document in Portuguese:
title => 'Ponte metálica'
If i search "metálica" it matches, so no problem.
But usually people search without accents, so it's very usual to search just for "metalica" (note the "a" without accent "á").
But it's not returning any results. I tested it in the AWS console and via endpoint /search. Im using the 2013 API.
I think the Synonyms can't solve this issue since they aren't full words
It looks like you posted the same question in AWS forums and got a reply:
The CloudSearch Portuguese stemmer does not remove accents, so á won't match a, and it does not currently have an option to remove them.
Two work-arounds I can think of:
Remove accents before uploading. (possibly to a different field)
Use a copy field, and the "mulitiple languages" analysis mode. This won't stem words by Portuguese rules, unfortunately, but it does remove accents!
I like the idea of removing the accent before upload, but I also have two other ideas:
Use fuzzy matching, so that you can tolerate one or maybe two "wrong" characters. Might have performance drawback to consider.
Provide an auto-complete/suggestor solution similar to a "did you mean?" type of experience.
I found this Stack Overflow thread from around 2014 that discusses these two possibilities, still using CloudSearch: Implementing "Did you mean?" using Amazon CloudSearch
About the fuzzy matching operator:
You can also perform fuzzy searches with the simple query parser. To perform a fuzzy search, append the ~ operator and a value that indicates how much terms can differ from the user query string and still be considered a match. For example, the specifying planit~1 searches for the term planit and allows matches to differ by up to one character, which means the results will include hits for planet.
And about auto-complete, with fuzzy matching option:
When you request suggestions, Amazon CloudSearch finds all of the documents whose values in the suggester field start with the specified query string—the beginning of the field must match the query string to be considered a match. The return data includes the field value and document ID for each match. You can configure suggesters to find matches for the exact query string, or to perform approximate string matching (fuzzy matching) to correct for typographical errors and misspellings.

Regular Expression to match a specific URL broken up by arbitrary characters

I run a Django-based forum (the framework is probably not important to the question, but still) and it has been increasingly getting spammed with posts that link to a specific website constantly (www.solidwoodkitchen.co.uk - these people are apparently the worst).
I've implemented a string blocking system that stops them posting to the forum if the URL of the website is included in the post, but as spam bots usually do, it has figured out a way around that by breaking up the URL with other characters (eg. w_w_w.s*olid_wood*kit_ch*en._*co.*uk .). So a couple of questions:
Is it even possible to build a regex capable of finding the specific URL within a block of text even when it has been modified like that?
If it is, would this cause a performance hit?
Description
You could break the url into a string of characters, then join them together with [^a-z0-9]*?. So in this case with www.solidwoodkitchen.co.uk the resulting regex would look like:
w[^a-z0-9]*?w[^a-z0-9]*?w[^a-z0-9]*?[.][^a-z0-9]*?s[^a-z0-9]*?o[^a-z0-9]*?l[^a-z0-9]*?i[^a-z0-9]*?d[^a-z0-9]*?w[^a-z0-9]*?o[^a-z0-9]*?o[^a-z0-9]*?d[^a-z0-9]*?k[^a-z0-9]*?i[^a-z0-9]*?t[^a-z0-9]*?c[^a-z0-9]*?h[^a-z0-9]*?e[^a-z0-9]*?n[^a-z0-9]*?[.][^a-z0-9]*?c[^a-z0-9]*?o[^a-z0-9]*?[.][^a-z0-9]*?u[^a-z0-9]*?k
Edit live on Debuggex
This could would basically search for the entire string of characters seperated by zero or more non alphanumeric characters.
Or you could take the input text and strip out all punctuation then simply search for wwwsolidwoodkitchencouk.

hl.regex.pattern not working in solr

I am using solr to fetch data.
I was using below parameters to fetch data:
http://testURL/solr/core0/select?start=10&rows=10&hl.fl=CC&hl.requireFieldMatch=true&hl=on&hl.maxAnalyzedChars=1&hl.fragsize=145&hl.snippets=99&sort=COlumn1+desc&q=CC%3a%28%22test%22~2%29&fl=title120%2ccolumn2%2ccolumn3%2cRL_DateTime%2cSid%2ccolumn4%2cguid%2chour&hl.regex.pattern=^\d+%20%3E%3E%20
Above query is not working with hl.regex.pattern parameter.
If I remove "hl.regex.pattern" than it is providing results in highlight section.
If I provide that regex pattern than it will not.
Regex is working in my c# code.
So am I missing anything here?
It's almost certainly the ^\. Those aren't valid in a URI, so you'll have to escape them.
From RFC 1738:
only alphanumerics, the special characters "$-_.+!*'(),", and
reserved characters used for their reserved purposes may be used
unencoded within a URL.
This is a little dated, since non-Roman alphanumerics like λάμδα are allowed now, but the gist is the same.
Try hl.regex.pattern=%5E%5Cd+%20%3E%3E%20 instead.

Quick regexp question for Railo

I guess the following line of code looks familiar to many...
[^A-Za-z0-9]
And what I'd like to do is to keep a block of "text", alphnumeric as stated above and plus punctuations and other special characters that sql for MS Access can handle, also, the special character of # sign would be replaced with ## (a double of it to escape a single #) for the underlying scripting language I'm using (Railo). So, to sum up, I'd like to remove any character that Access and Railo can't handle prior to writing the string into a db table.
The above alphnumeric is a start. Can you help to make it complete?
Thanks.
Railo can handle every character that I'm aware of. Not sure what the author of the question is implying. The pound character is the only character that I'm aware of that needs to be escaped. If the user used the proper <cfqueryparam> correctly, he should be able to insert just about anything he wants in Access.

How to reject names (people and companies) using whitelists with C# regex's?

I've run into a few problems using a C# regex to implement a whitelist of allowed characters on web inputs. I am trying to avoid SQL injection and XSS attacks. I've read that whitelists of the allowable characters are the way to go.
The inputs are people names and company names.
Some of the problems are:
Company names that have ampersands. Like "Jim & Sons". The ampersand is important, but it is risky.
Unicode characters in names (we have asian customers for example), that enter their names using their character sets. I need to whitelist all these.
Company names can have all kinds of slashes, like "S/A" and "S\A". Are those risky?
I find myself wanting to allow almost every character after seeing all the data that is in the DB already (and being entered by new users).
Any suggestions for a good whitelist that will handle these (and other) issues?
NOTE: It's a legacy system, so I don't have control of all the code. I was hoping to reduce the number of attacks by preventing bad data from getting into the system in the first place.
This SO thread has a lot of good discussion on protecting yourself from injection attacks.
In short:
Filter your input as best as you can
Escape your strings using framework based methods
Parameterize your sql statements
In your case, you can limit the name field to a small character set. The company field will be more difficult, and you need to consider and balance your users need for freedom of entry with your need for site security. As others have said, trying to write your own custom sanitation methods is tricky and risky. Keep it simple and protect yourself through your architecture - don't simply rely on strings being "safe", even after sanitization.
EDIT:
To clarify - if you're trying to develop a whitelist, it's not something that the community can hand out, since it's entirely dependent on the data you want. But let's look at a example of a regex whitelist, perhaps for names. Say I've whitelisted A-Z and a-z and space.
Regex reWhiteList = new Regex("^[A-Za-z ]+$")
That checks to see if the entire string is composed of those characters. Note that a string with a number, a period, a quote, or anything else would NOT match this regex and thus would fail the whitelist.
if (reWhiteList.IsMatch(strInput))
// it's ok, proceed to step 2
else
// it's not ok, inform user they've entered invalid characters and try again
Hopefully this helps some more! With names and company names you'll have a tough-to-impossible time developing a rigorous pattern to check against, but you can do a simple allowable character list, as I showed here.
Do not try to sanitize names, especially with regex!
Just make sure that you are properly escaping the values and saving them safely in your DB, and them escaping them back when presenting in HTML
Company names might have almost any kind of symbol in them, so I don't know how well this is going to work for you. I'd concentrate on shielding yourself directly from various attacks, not hoping that your strings are "naturally" safe.
(Certainly they can have ampersands, colons, semicolons, exclamation points, hyphens, percent signs, and all kinds of other things that could be "unsafe" in a host of contexts.)
Why filter or regex the data at all, or even escape it, you should be using bind variables to access the database.
This way, the customer could enter something like: anything' OR 'x'='x
And your application doesn't care because your SQL code doesn't parse the variable because it's not set when you prepare the statement. I.e.
'SELECT count(username) FROM usertable WHERE username = ? and password = ?'
then you execute that code with those variables set.
This works in PHP, PERL, J2EE applications, and so on.
I think writing your own regexp is not a good idea: it would be very hard. Try leveraging existing functions of your web framework, there is lots of resources on the net. If you say C#, I assume you are using ASP.NET, try the following article:
How To: Protect From Injection Attacks in ASP.NET
This is my current regex WHITELIST for a company name. Any input outside of these characters is rejected:
"^[0-9\p{L} '\-\.,\/\&]{0,50}$"
The \p{L} matches any unicode "letter". So, the accents and asian characters are whitelisted.
The \& is a bit problematic because it potentially allows javascript special characters.
The \' is problematic if not using parameterized queries, because of SQL injection.
The \- could allow "--", also a potential for SQL injection if not using parameterized queries.
Also, the \p{L} won't work client-side, so you can't use it in the ASP.NET regular expression validator without disabling clientside validation:
EnableClientScript="False"