If I enter **Dit is mijn vraag** into Google Translate, whether through the API or the web interface, it translates it into ** This is my question **. It adds spaces around the **, which breaks the Markdown parser we run the output through later...
Has anyone else encountered this, and if so, found a solution?
I have replicated your case with different target languages and different special-character sequences. In all cases, I get the same behavior you describe. I haven't found any information that explains why this happens, but there is an active Issue Tracker entry reporting similar behavior.
This happens for both the html and text format API calls. You can follow the link for more information. As a workaround for now, if you are using the API call, process the response after you get it by finding all **[SPACE] and [SPACE]** character sequences and replacing them with **.
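For example, a quick post-processing pass in Python (the function name and regex here are just illustrative):

import re

def strip_bold_spaces(translated):
    # Turn "** This is my question **" back into "**This is my question**"
    # by removing the spaces Translate inserts inside the ** markers.
    return re.sub(r"\*\*\s+(.*?)\s+\*\*", r"**\1**", translated)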
I'm integrating a project in GCP-Workflows with GCP-Admin, but I'm having trouble working with some of the data. When extracting a date, it is delivered in this format: 2020-12-28T11:20:05.000Z, so I can't turn the string into an int, and apparently there is no function in Workflows like substring() either. I need to use the date in an IF, checking whether it is greater or less than a reference date.
How can I do this?
Workflows is missing a number of built-in functions for now. New ones are coming very soon, but I don't know whether they will solve your problem.
Anyway, with Workflows, the correct pattern when a built-in function isn't implemented is to call an endpoint, for example a Cloud Function or a Cloud Run service, that performs the transformation for you and returns the expected result.
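As a sketch, such a Cloud Function could look like this in Python (the function name and request shape are illustrative, not an official API):

from datetime import datetime

def is_after_reference(request):
    # Expects JSON like {"date": "2020-12-28T11:20:05.000Z",
    #                    "reference": "2020-12-01T00:00:00.000Z"}
    body = request.get_json()
    fmt = "%Y-%m-%dT%H:%M:%S.%fZ"
    value = datetime.strptime(body["date"], fmt)
    reference = datetime.strptime(body["reference"], fmt)
    # The workflow can branch on this boolean in its IF condition.
    return {"is_after": value > reference}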
Quite boring to do, but don't hesitate to open a feature request on the issue tracker; the product team is very reactive and loves user feedback!
The Workflows standard library now includes a text module with functions for searching (including regular expressions), splitting, substrings and case transformations.
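For example, assuming a variable named timestamp holds the string, a substring-based comparison might look roughly like this (step names and the reference value are illustrative; this is an untested sketch):

- extractDate:
    assign:
      # "2020-12-28T11:20:05.000Z" -> "2020-12-28" -> 20201228 for a numeric compare
      - datePart: ${text.substring(timestamp, 0, 10)}
      - dateNumber: ${int(text.replace_all(datePart, "-", ""))}
- checkDate:
    switch:
      - condition: ${dateNumber > 20201201}
        next: afterReference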
I am relatively new to Solr so please forgive me if I'm missing something obvious. I have an application that allows users to search for musical artists. The indexing comes from a read-only database with correct spellings so on the index side I have it figured out.
On the query side, however, I need to anticipate various spelling errors/differences and want to help Solr find those instances. From our old home-grown search solution, I have a list of regexes and the artists they apply to. When I tried to translate those to Solr using the PatternReplaceCharFilterFactory, I noticed that some worked perfectly while others didn't work at all, with seemingly no rhyme or reason between them.
For example:
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="em[ei]n[ei]m" replacement="Eminem"/>
accurately captures the common misspellings of Eminem. But for the band 311:
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[Tt]hree [Ee]leven" replacement="311"/>
does not work. Another example is Nine Inch Nails:
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="((nine|9).*inch.*nails\b)|(n\.? ?i\.? ?n\.?\b)" replacement="Nine Inch Nails"/>
works perfectly for finding the most common patterns for the band's name. But for Eve 6:
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[Ee]ve.{0,4}([Ss]ix|6)" replacement="Eve 6"/>
nothing I do seems to work. Is there something fundamental I'm missing in the usage of this filter? I've tried a number of variations on the regexes mentioned above (even going so far as to use literals like 'three eleven'), but still with no success. I've tried making the filter in question the only PatternReplaceCharFilterFactory in the analyzer. I also know for sure that these items are in the index correctly, because when I search for the correct spelling it returns the proper results.
Any suggestions?
Snowdall
I suspect the problem is not with your char filter, but with what comes after it, specifically the tokenizer. If you use the standard tokenizer, it will get rid of the numbers you have just put into your stream. If you don't need the text to be split into tokens, you could look at KeywordTokenizerFactory instead.
In general, the best way to troubleshoot this in Solr 4+ is the Analysis screen in the Admin UI. It allows you to run your text against a particular field type and see what happens to it after each component in the analysis chain.
I would recommend using the SynonymFilter for the kind of application you describe. It allows you to provide an external file where you list words and their synonyms, like:
eminem, emenem
nine, 9
If you precede this with a LowerCaseFilter, you won't have to fuss about case normalization in your synonyms. You should be able to handle the 311 case too, as long as you don't tokenize (i.e., use a KeywordTokenizer, as Alexander Rafalovitch suggested).
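Putting those pieces together, a query-side analyzer might look something like this (the field type name and synonyms file name are illustrative):

<fieldType name="artist_search" class="solr.TextField">
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="artist-synonyms.txt" ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>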
First time AntiXSS 4 user here. In order to make my application more secure, I've used Microsoft.Security.Application.Encoder.UrlEncode on QueryString parameters and
Microsoft.Security.Application.Encoder.HtmlEncode on a parameter entered into a form field.
I have multiple questions, and I would appreciate it if you could try to answer all of them (it doesn't have to be at once or by the same person; any answers at all would be very helpful).
My first question: am I using these methods appropriately (that is, am I using the appropriate AntiXSS method for each situation)?
My second question: once I've encoded something, should it ever be decoded? I am confused because I know the HttpUtility class provides ways to both encode and decode, so why isn't the same done in AntiXSS? If it helps, the parameters that I've encoded are never going to be treated as anything other than text inside the application.
My third question is related to the second one, but I wanted to emphasize it because it's important (and is probably the source of my overall confusion). I've heard that the .NET framework automatically decodes things like QueryStrings, hence no need for an explicit decode method. If that is so, then what is the point of HTML encoding something in the first place if it is going to be undone? It just... doesn't seem safe. What am I missing, especially since, as mentioned, the HttpUtility class provides for decoding?
And the last question: does AntiXSS help against SQL injection at all, or does it only protect against XSS attacks?
It's hard to say if you're using it correctly. If you use UrlEncode when building a query string which is then output as a link in a page then yes that's correct. If you're Html Encoding when you write something out as a value then yes, that's correct (well kind of, if it's set via an HTML attribute you ought to use HtmlAttributeEncode, but they're pretty much the same.)
The .NET decoders work with AntiXSS's encoded values, so there was no point in me rewriting them. *grin*
The point of encoding is that you do it when you output. So, for example, if a user has input window.alert('Numpty!') on a form and you just put that input raw into your output, the JavaScript would run. If you encoded it first, you would see < become &lt; and so on.
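The idea, sketched in Python for brevity (html.escape standing in for AntiXSS's HtmlEncode):

import html

user_input = "<script>window.alert('Numpty!')</script>"
print(html.escape(user_input))
# &lt;script&gt;window.alert(&#x27;Numpty!&#x27;)&lt;/script&gt;
# The browser renders this as literal text instead of executing it.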
No, SQL injection is an entirely different problem.
How can one allow code snippets to be entered into an editor such as FCKeditor (as Stack Overflow does) while preventing XSS, SQL injection, and related attacks?
Part of the problem here is that you want to allow certain kinds of HTML, right? Links for example. But you need to sanitize out just those HTML tags that might contain XSS attacks like script tags or for that matter even event handler attributes or an href or other attribute starting with "javascript:". And so a complete answer to your question needs to be something more sophisticated than "replace special characters" because that won't allow links.
Preventing SQL injection may be somewhat dependent on your platform choice. My preferred web platform has a built-in tag for parameterizing queries that will mostly prevent SQL injection (cfqueryparam). If you're using PHP and MySQL, there is the similar native mysql_real_escape_string() function. (I'm not sure the PHP function technically creates a parameterized query, but it has worked well for me in preventing SQL injection attempts so far; I've seen a few that were safely stored in the db.)
On the XSS protection, I used to use regular expressions to sanitize input for this kind of reason, but have since moved away from that method because of the difficulty involved in both allowing things like links while also removing the dangerous code. What I've moved to as an alternative is XSLT. Again, how you execute an XSL transformation may vary dependent upon your platform. I wrote an article for the ColdFusion Developer's Journal a while ago about how to do this, which includes both a boilerplate XSL sheet you can use and shows how to make it work with CF using the native XmlTransform() function.
The reason I've chosen to move to XSLT for this is twofold.
First, validating that the input is well-formed XML eliminates the possibility of an XSS attack using certain string-concatenation tricks.
Second, it's easier to manipulate the XHTML packet using XSL and XPath selectors than with regular expressions, because they're designed specifically to work with a structured XML document, whereas regular expressions were designed for raw string manipulation. So it's a lot cleaner and easier, I'm less likely to make mistakes, and if I do find that I've made a mistake, it's easier to fix.
Also when I tested them I found that WYSIWYG editors like CKEditor (he removed the F) preserve well-formed XML, so you shouldn't have to worry about that as a potential issue.
The same rules apply for protection: filter input, escape output.
In the case of input containing code, filtering just means that the string must contain printable characters, and maybe you have a length limit.
When storing text into the database, either use query parameters, or else escape the string to ensure you don't have characters that create SQL injection vulnerabilities. Code may contain more symbols and non-alpha characters, but the ones you have to watch out for with respect to SQL injection are the same as for normal text.
Don't try to duplicate the correct escaping function yourself. Most database libraries already contain a function that does correct escaping for all characters that need it (the rules may be database-specific). It should also handle special issues with character sets. Just use the function provided by your library.
I don't understand why people say "use stored procedures!" Stored procs give no special protection against SQL injection. If you interpolate unescaped values into SQL strings and execute the result, this is vulnerable to SQL injection. It doesn't matter if you are doing it in application code versus in a stored proc.
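A minimal parameterized-query sketch (sqlite3 used for illustration; placeholder syntax varies by driver, e.g. ?, %s, or :name):

import sqlite3

conn = sqlite3.connect("snippets.db")
conn.execute("CREATE TABLE IF NOT EXISTS snippets (body TEXT)")
user_code = "printf(\"hi\"); //'); DROP TABLE snippets; --"
# The value travels separately from the SQL text, so the quote and comment
# characters are stored literally rather than interpreted as SQL.
conn.execute("INSERT INTO snippets (body) VALUES (?)", (user_code,))
conn.commit()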
When outputting to the web presentation, escape HTML-special characters, just as you would with any text.
The best thing that you can do to prevent SQL injection attacks is to make sure that you use parameterized queries or stored procedures when making database calls. Normally, I would also recommend performing some basic input sanitization as well, but since you need to accept code from the user, that might not be an option.
On the other end (when rendering the user's input to the browser), HTML encoding the data will cause any malicious JavaScript or the like to be rendered as literal text rather than executed in the client's browser. Any decent web application server framework should have the capability.
I'd say one could replace all < by &lt;, etc. (using htmlentities in PHP, for example), and then pick the safe tags with some sort of whitelist. The problem is that the whitelist may be a little too strict.
Here is a PHP example
$code = getTheCodeSnippet();
$code = htmlentities($code);
$code = str_ireplace("&lt;br&gt;", "<br>", $code); // example: whitelist <br> tags
// One could also use regular expressions for these tags
To prevent SQL injections, you could replace all ' and \ chars with an "inoffensive" equivalent, like \' and \\, so that the following C line
#include <stdio.h>//'); Some SQL command--
Wouldn't have any negative results in the database.
I'm increasingly becoming aware that there must be major differences in the ways that regular expressions will be interpreted by browsers.
As an example, a co-worker had written this regular expression, to validate that a file being uploaded would have a PDF extension:
^(([a-zA-Z]:)|(\\{2}\w+)\$?)(\\(\w[\w].*))(.pdf)$
This works in Internet Explorer, and in Google Chrome, but does NOT work in Firefox. The test always fails, even for an actual PDF. So I decided that the extra stuff was irrelevant and simplified it to:
^.+\.pdf$
and now it works fine in Firefox, as well as continuing to work in IE and Chrome.
Is this a quirk specific to asp:FileUpload and RegularExpressionValidator controls in ASP.NET, or is it simply due to different browsers supporting regex in different ways? Either way, what are some of the latter that you've encountered?
Regarding the actual question: The original regex requires the value to start with a drive letter or UNC device name. It's quite possible that Firefox simply doesn't include that with the filename. Note also that, if you have any intention of being cross-platform, that regex would fail on any non-Windows system, regardless of browser, as they don't use drive letters or UNC paths. Your simplified regex ("accept anything, so long as it ends with .pdf") is about as good of a filename check as you're going to get.
However, Jonathan's comment to the original question cannot be overemphasized. Never, ever, ever trust the filename as an adequate means of determining its contents. Or the MIME type, for that matter. The client software talking to your web server (which might not even be a browser) can lie to you about anything and you'll never know unless you verify it. In this case, that means feeding the received file into some code that understands the PDF format and having that code tell you whether it's a valid PDF or not. Checking the filename may help to prevent people from trying to submit obviously incorrect files, but it is not a sufficient test of the files that are received.
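Even a minimal server-side check goes further than the filename. For example (a magic-number check only; a real validation would parse the file with a PDF library):

def looks_like_pdf(path):
    # PDF files start with the bytes "%PDF-".
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"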
(I realize that you may know about the need for additional validation, but the next person who has a similar situation and finds your question may not.)
As far as I know, Firefox doesn't give you the full path of an upload, so interpretation of regular expressions seems irrelevant in this case. I have yet to see any difference between modern browsers in regular expression execution.
If you're using JavaScript, not enclosing the regex in slashes causes an error in Firefox.
Try doing var regex = /^(([a-zA-Z]:)|(\\{2}\w+)\$?)(\\(\w[\w].*))(.pdf)$/;
As Dave mentioned, Firefox does not give the path, only the file name. Also as he mentioned, the original regex doesn't account for differences between operating systems. I think the best check you could do is whether the file name ends with .pdf. Even then, this doesn't ensure it's a valid PDF, just that the file name claims so; depending on your needs, you may want to verify that it's actually a PDF by checking the content.
I have not noticed a difference between browsers in regards to the pattern syntax. However, I have noticed a difference between C# and Javascript as C#'s implementation allows back references and Javascript's implementation does not.
I believe JavaScript REs are defined by the ECMA standard, and I doubt there are many differences between JS interpreters. I haven't found any, in my programs, or seen mentioned in an article.
Your message is actually a bit confusing, since you throw ASP stuff in there. I don't see how you conclude it is the browser's fault when you are talking about server-side technology and generated code. Actually, we don't even know if you are talking about JS in the browser, validation of the upload field (you can no longer do that, at least in a simple way, with FF3), or the server side (neither FF nor Opera nor Safari uploads the full path of the uploaded file; I am surprised to learn that Chrome does, like IE...).