Google Cloud Dataflow replacing accents and special chars with '??'

This is going to be quite a hit-or-miss question, as I don't really know which context or piece of code to give you: it's one of those "it works locally" situations (and it does!).
The situation is that I have several services, and at one step messages are put on a Pub/Sub topic, waiting for the Dataflow consumer to pick them up and save them as .parquet files (I also have another consumer which sends the payload to an HTTP endpoint).
The thing is, the message in that service looks correct just before it is sent to the Pub/Sub topic; the Stackdriver logs show all the characters as they should be.
However, when I check the final output in the .parquet files or at the HTTP endpoint, I see, for example, h?? instead of hí, which seems pretty weird since running everything locally produces the correct output.
I can only think of a server-side encoding difference when Dataflow is deployed as a job rather than run locally.
Hope someone can shed some light on something this abstract.

The strange thing is that it works locally.
But as a workaround, the first thing that comes to mind is to force the encoding explicitly.
Are you at some point using a function that converts your string input to bytes?
If yes, you could try to force getBytes() to use UTF-8 encoding by passing it as an argument, as in the following example from this Stack Overflow thread:
byte[] bytes = text.getBytes("UTF-8"); // encode with an explicit charset, not the platform default
// feed bytes to Base64
// get bytes back from Base64
String decoded = new String(bytes, "UTF-8"); // decode with the same explicit charset
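(On Java 7+ you can pass a Charset constant instead of a charset name, which avoids both the checked UnsupportedEncodingException and typos in the name; text here is just a placeholder for whatever string you're converting:)

import java.nio.charset.StandardCharsets;

byte[] bytes = text.getBytes(StandardCharsets.UTF_8);       // never falls back to the platform default
String decoded = new String(bytes, StandardCharsets.UTF_8); // symmetric, explicit decode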
Also:
- Have you tried setting the parquet.enable.dictionary option?
- Are your original files written in utf-8 before conversion?

Google Cloud Dataflow (at least the Java SDK) replaces Spanish characters like 'ñ' or accented vowels like 'á' and 'é' with the symbol �, because the default charset of the JVM installed on the service's workers is US-ASCII. So, if UTF-8 is not explicitly declared when you instantiate strings or transform them to and from byte arrays, the platform default encoding is used and those characters are lost.
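To make the failure mode concrete, here is a minimal, self-contained sketch (the payload value is made up, and US-ASCII is used to simulate the worker JVM's default charset):

import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    public static void main(String[] args) {
        // Pretend this is the Pub/Sub payload: "hí" encoded as UTF-8 upstream.
        byte[] payload = "h\u00ed".getBytes(StandardCharsets.UTF_8);

        // What a worker with a US-ASCII default charset effectively does:
        // the two-byte UTF-8 sequence for 'í' cannot be decoded, so each byte
        // becomes a replacement character -- the "h??" seen in the output.
        System.out.println(new String(payload, StandardCharsets.US_ASCII));

        // Declaring the charset explicitly round-trips correctly regardless
        // of the platform default.
        System.out.println(new String(payload, StandardCharsets.UTF_8));
    }
}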

Related

'TypeError at /api/chunked_upload/ Unicode-objects must be encoded before hashing' error when using botocore in a Django project

I have hit a dead end with this problem. My code works perfectly in development, but when I deploy my project and configure DigitalOcean Spaces as an S3 bucket, I get the following error when uploading media:
TypeError at /api/chunked_upload/
Unicode-objects must be encoded before hashing
I'm using django-chunked-upload and it doesn't play well with botocore.
I'm using Python 3.7
My code is taken from this demo: https://github.com/juliomalegria/django-chunked-upload-demo
Any help will be massively appreciated.
This library was implemented for Python 2, so there might be a couple of things that don't work out of the box with Python 3.
The issue you're facing is one of them: in Python 3, files are read directly as Unicode (py3's str is py2's unicode), and the md5 hashing is the part of the code triggering this exception (this line), because it doesn't expect Unicode strings.
If you have created your own model inheriting from AbstractChunkedUpload, you can override the md5 property to encode the chunks before updating the hash. See this other SO question on how to solve this specific issue.
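The underlying rule is language-independent: hash functions consume bytes, not text, so Unicode strings must be encoded before being fed to the digest. As a sketch of that encode-before-hashing step, here it is in Java (the language of the main question above; the chunk value is hypothetical):

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class Md5Demo {
    public static void main(String[] args) throws Exception {
        String chunk = "some uploaded chunk";               // hypothetical chunk content
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        md5.update(chunk.getBytes(StandardCharsets.UTF_8)); // encode first, then hash
        System.out.printf("%032x%n", new BigInteger(1, md5.digest()));
    }
}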
Hopefully this helped!
Disclaimer: I'm the creator of this library. However, I haven't maintained it in a long time, to the point that it may no longer be usable.

What are the ramifications of null bytes and multipart/form-data?

A third party is sending us a flat file that is supposed to contain exclusively printable ASCII characters. However, we've discovered that there's a string of about 50 0x00 bytes in the middle of the file.
We want to be able to upload the file to our web application, but I've discovered that Django doesn't seem to like the null characters in the multipart/form-data. If I remove the null characters, the upload succeeds. (Sorry I don't have the stack trace available at the moment, but will produce one if necessary)
We can pre-process the file to remove the null characters and/or work with our third party to fix their file generator, but I don't like to leave mysterious problems like this unexplained.
Does this sound like a bug in Django or is there some aspect of multipart/form-data that I don't fully understand? Do I need to set a transfer encoding of some sort so Django doesn't get hung up on the null characters?
Nope, no transfer encoding is needed (or ever used by browsers) on form-data. It's perfectly valid to include a run of 50 null bytes in a multipart/form-data value; indeed, given that most binary files contain a lot of nulls, that situation should arise as often as not with file uploads!
Which makes me question whether it's really a Django bug, or whether something else is going on. Let's have that stack trace!

How can I download a utf-8-encoded web page with libcurl, preserving the encoding?

I'm trying to get libcurl to download a web page that is encoded in UTF-8, which is working fine, except that something converts it to ASCII and screws up some of the characters. Is there an easy way to make it keep the content in UTF-8?
libcurl doesn't translate/convert the data at all, so there's actually nothing in particular you need to do. Just get it.
Check the curl conversion options; they might have been defined at compile time.

How can I get Django to output bad characters instead of returning an error

I've got some weird characters in my database which seem to mess up Django when it renders a page. I get this error:
TemplateSyntaxError at /search/legacy/
Caught an exception while rendering: Could not decode to UTF-8 column 'maker' with text 'i� G�r'
(The actual text is slightly different, but since it is a company name I've changed it.)
How can I get Django to output this text? I'm currently running the site on SQLite (for fast dev); is this the issue?
Also, on a completely unrelated note, is it possible to use a database view?
thanks
Probably not.
Django uses UTF-8 strings internally, and it seems that your database is returning an invalid string. You should fix the data in the database and use UTF-8 exclusively across your whole application (data import, database, templates, source files, ...).
I have a related problem with a site owner who uses Apple's Pages for article creation, then copy-pastes into a Django admin textbox. This process creates 'funny characters' that screw up Django and/or MySQL (you wouldn't believe the number of different double left/right quote characters there are). I can't 'fix' the customer, so I have a function that looks for known strangeness and translates it to something useful beforehand. A complete PITA.
That's a bit of a confusing error message, and without knowing more details I'm not clear what the source of the problem is (the error message phrasing "decode to UTF-8" seems wrong, as normally you would encode to UTF-8). Perhaps Django is expecting to find data in some other encoding and is trying to decode it and re-encode as UTF-8, but is choking on some characters that aren't valid for the encoding it's expecting?
In general, you want to make sure that you're storing UTF-8 in your database, and that internally you're using unicode objects (not str objects) everywhere in your code.
Some other reading that may be helpful:
Unicode in the real world
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Django Tips: UTF-8, ASCII Encoding Errors, Urllib2, and MySQL

BOM not expected in CF but sent by IIS/SharePoint

I'm trying to consume a SharePoint webservice from ColdFusion via cfinvoke ('cause I don't want to deal with (read: parse) the SOAP response itself).
The SOAP response includes a byte-order-mark character (BOM), which produces the following exception in CF:
"Cannot perform web service invocation GetList.
The fault returned when invoking the web service operation is:
'AxisFault
faultCode: {http://www.w3.org/2003/05/soap-envelope}Server.userException
faultSubcode:
faultString: org.xml.sax.SAXParseException: Content is not allowed in prolog."
The standard for UTF-8 encoding optionally includes the BOM character (http://unicode.org/faq/utf_bom.html#29). Microsoft almost universally includes the BOM character with UTF-8 encoded streams, and from what I can tell there's no way to change that in IIS. The XML parser that JRun (ColdFusion) uses by default doesn't handle the BOM character for UTF-8 encoded XML streams. So, it appears that the way to fix this is to change the XML parser used by JRun (http://www.bpurcell.org/blog/index.cfm?mode=entry&entry=942).
Adobe says that it doesn't handle the BOM character (see comments from anoynomous and halL on May 2nd and 5th).
http://livedocs.adobe.com/coldfusion/8/htmldocs/Tags_g-h_09.html#comments
I'm going to say that the answer to your question (is it possible?) is no. I don't know that definitively, but the poster who commented just above halL (in the comments on this page) gave a workaround for the problem -- so I assume it is possible to deal with it when parsing manually.
You say that you're using CFInvoke because you don't want to deal with the soap response yourself. It looks like you don't have any choice.
As Adam Tuttle already said, the workaround is on the page that you linked to:
<!--- Remove the BOM (Chr(65279), i.e. U+FEFF) from the start of the string, if it exists --->
<cfif Left(responseText, 1) EQ Chr(65279)>
    <cfset responseText = Mid(responseText, 2, Len(responseText) - 1)>
</cfif>
It sounds like ColdFusion is using Apache Axis under the covers.
This doesn't apply exactly to your situation, but I've had to deal with this issue once before, when consuming a .NET web service with Apache Axis/Java. The only solution I was able to find (since the owner of the web service was unwilling to change anything on his end) was to write a Handler class that Axis would plug into the pipeline, and which would delete the BOM from the message if it existed.
So perhaps it's possible to configure Axis through ColdFusion? If so, you can add additional Handlers to the message-handling flow.
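For what it's worth, here is a rough sketch of such a handler against the Axis 1.x handler API (the message-rewriting calls are from memory, so treat them as assumptions to verify against your Axis version):

import org.apache.axis.AxisFault;
import org.apache.axis.Message;
import org.apache.axis.MessageContext;
import org.apache.axis.handlers.BasicHandler;

public class StripBomHandler extends BasicHandler {
    public void invoke(MessageContext ctx) throws AxisFault {
        Message msg = ctx.getCurrentMessage();
        String xml = msg.getSOAPPartAsString();
        // A leading U+FEFF is the BOM that makes the parser complain
        // about "Content is not allowed in prolog".
        if (xml != null && !xml.isEmpty() && xml.charAt(0) == '\uFEFF') {
            // Replace the in-flight message with a BOM-less copy.
            ctx.setCurrentMessage(new Message(xml.substring(1)));
        }
    }
}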