Can UTF-16LE be converted into a MySQL LOAD DATA INFILE format without garbling Chinese and other languages? If so, how?

I'm working on a big dataset encoded in UTF-16LE that holds 1 billion records containing text strings in over 50 languages (not all of them known to me).
I need to get these into our MySQL 5.7 database using LOAD DATA INFILE (for import speed), but I just found out that MySQL does not support the UTF-16LE text encoding while trying to load the data with the Workbench import tool; querying this data with Athena also returns no records with this encoding.
What is the best encoding for MySQL 5.7 that handles multiple languages and works with LOAD DATA INFILE?
Will this keep the text safe and not garble the text strings?
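One common approach (not from the original thread; file names here are hypothetical) is to re-encode the file as UTF-8 before loading, since MySQL's utf8mb4 character set covers the full Unicode range, Chinese included, and LOAD DATA INFILE accepts it via an explicit CHARACTER SET clause. A minimal streaming sketch in Python:

# Re-encode UTF-16LE to UTF-8 line by line, so a billion-row file
# never has to fit in memory (file names are hypothetical).
with open("records_utf16le.csv", "r", encoding="utf-16-le", newline="") as src, \
     open("records_utf8.csv", "w", encoding="utf-8", newline="") as dst:
    for line in src:
        dst.write(line)

The load statement would then name the encoding explicitly, e.g. LOAD DATA INFILE ... CHARACTER SET utf8mb4, so the server does not fall back to its default character set.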

Related

PySpark Write Parquet Binary Column with Stats (signed-min-max.enabled)

I found this apache-parquet ticket https://issues.apache.org/jira/browse/PARQUET-686 which is marked as resolved for parquet-mr 1.8.2. The feature I want is the calculated min/max in the parquet metadata for a (string or BINARY) column.
And referencing this is an email https://lists.apache.org/thread.html/%3CCANPCBc2UPm+oZFfP9oT8gPKh_v0_BF0jVEuf=Q3d-5=ugxSFbQ#mail.gmail.com%3E
which uses Java (via parquet-mr) rather than pyspark as an example:
Configuration conf = new Configuration();
conf.set("parquet.strings.signed-min-max.enabled", "true");
Path inputPath = new Path(input);
FileStatus inputFileStatus = inputPath.getFileSystem(conf).getFileStatus(inputPath);
List<Footer> footers = ParquetFileReader.readFooters(conf, inputFileStatus, false);
I've been unable to set this value in pyspark (perhaps I'm setting it in the wrong place?)
Example dataframe:
import random
import string
from pyspark.sql.types import StringType

r = []
for x in range(2000):
    r.append(u''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(10)))
df = spark.createDataFrame(r, StringType())
I've tried a few different ways of setting this option:
df.write.format("parquet").option("parquet.strings.signed-min-max.enabled", "true").save("s3a://test.bucket/option")
df.write.option("parquet.strings.signed-min-max.enabled", "true").parquet("s3a://test.bucket/option")
df.write.option("parquet.strings.signed-min-max.enabled", True).parquet("s3a://test.bucket/option")
But all of the saved parquet files are missing the ST/STATS for the BINARY column. Here is an example output of the metadata from one of the parquet files:
creator: parquet-mr version 1.8.3 (build aef7230e114214b7cc962a8f3fc5aeed6ce80828)
extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"value","type":"string","nullable":true,"metadata":{}}]}
file schema: spark_schema
--------------------------------------------------
value: OPTIONAL BINARY O:UTF8 R:0 D:1
row group 1: RC:33 TS:515
--------------------------------------------------
Also, based on this email chain https://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%3C9DEF4C39-DFC2-411B-8987-5B9C33842974#videoamp.com%3E and the question "Specify Parquet properties pyspark",
I tried sneaking the config in through the pyspark private API:
spark.sparkContext._jsc.hadoopConfiguration().setBoolean("parquet.strings.signed-min-max.enabled", True)
So I am still unable to set the option parquet.strings.signed-min-max.enabled in parquet-mr (or it is set, but something else has gone wrong).
Is it possible to configure parquet-mr from pyspark?
Does pyspark 2.3.x support BINARY column stats?
How do I take advantage of the PARQUET-686 feature to add min/max metadata for string columns in a parquet file?
Historically, Parquet writers wrote wrong min/max values for UTF-8 strings because they compared the underlying bytes as signed values, so new Parquet implementations skip those stats during reading unless parquet.strings.signed-min-max.enabled is set. This setting is therefore a read option that tells the Parquet library to trust the min/max values in spite of their known deficiency. The only case when this setting can be safely enabled is if the strings contain only ASCII characters, because the corresponding bytes will never be negative.
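To make the deficiency concrete, here is a small illustration (my addition, not part of the original answer) of how signed byte comparison inverts the order of non-ASCII UTF-8 strings:

# 'é' encodes to UTF-8 bytes >= 0x80, which are negative when read as
# signed bytes, so a signed comparator sorts 'é' before 'a', the
# reverse of the true string order.
def as_signed(byte):
    return byte - 256 if byte >= 128 else byte

a = list("a".encode("utf-8"))   # [0x61]
e = list("é".encode("utf-8"))   # [0xc3, 0xa9]
print(a < e)                                                  # True: correct unsigned order
print([as_signed(b) for b in e] < [as_signed(b) for b in a])  # True: signed order flips it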
Since you use parquet-tools for dumping the statistics and parquet-tools itself uses the Parquet library, it will ignore string min/max statistics by default. Although it seems that there are no min/max values in the file, in reality they are there, but get ignored.
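For what it's worth, here is a sketch (untested; it builds on the private-API call from the question and assumes Spark hands its Hadoop configuration through to the Parquet readers) of setting the read option before loading, so the present-but-ignored statistics are trusted:

# Set the read-side option on the Hadoop configuration, then read.
spark.sparkContext._jsc.hadoopConfiguration().set(
    "parquet.strings.signed-min-max.enabled", "true")
df = spark.read.parquet("s3a://test.bucket/option")   # path taken from the question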
The proper solution for this problem is PARQUET-1025, which introduces new statistics fields min-value and max-value. These handle UTF-8 strings correctly.

Import Cyrillic CSV file to Vertica via TOAD

I am struggling to import a CSV file with Cyrillic characters into a table in HP Vertica, and every time I get the error
[Vertica][ODBC] (10170) String data right truncation on data from data source: String data is too big for the driver's data buffer
I tried importing the file saved as UTF-8 and as CP1251, but the error is still there.
Any ideas?
The "String data right truncation" error can have two causes:
Target column is not big enough. Remember: (a) Vertica uses a byte-oriented length semantic for CHAR/VARCHAR and (b) it wants UTF-8 in input. So, for example, if you want to store the Euro sign (one, single, character) the target column should be - at least - CHAR(3) because the Euro sign requires three bytes when encoded UTF-8
The other possibility is that your ODBC loader does not allocate enough memory to store your string and/or does not understand field separators
Forget about CP-1251. Vertica wants UTF-8 in input.
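A sketch of the pre-processing this implies (my addition; file names are hypothetical): re-encode the file to UTF-8 and size target columns in bytes, not characters:

# Re-encode CP1251 to UTF-8 before loading; Vertica expects UTF-8 input.
with open("data_cp1251.csv", "r", encoding="cp1251") as src, \
     open("data_utf8.csv", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(line)

# Byte-oriented length semantics: one character may need several bytes.
print(len("€"))                   # 1 character...
print(len("€".encode("utf-8")))   # ...but 3 bytes, hence at least CHAR(3)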

Excel international date formatting

I am having problems formatting Excel datetimes so that they work internationally. Our program is written in C++ and uses COM to export data from our database to Excel, and this includes datetime fields.
If we don't supply a formatting mask, some installations of Excel display these dates as serial numbers (days since 1900-01-01, followed by the time as a 24-hour fraction). This is unreadable to a human, so we have found out that we MUST supply a date formatting mask to be sure that it displays readably.
The problem, as I see it, is that Excel uses international formatting masks. For example, the UK datetime format mask might be "YYYY-MM-DD HH:MM".
But if that format mask is sent to an Excel installed in Sweden, it fails, since the Swedish version of Excel uses "ÅÅÅÅ-MM-DD tt:mm".
It's highly impractical to have 150 different national datetime formatting masks in our application to support different countries.
Is there a way to write formatting masks so that they include the locale, such that we would be allowed to use ONE single mask?
Unless you are using the date functionality in Excel, the easiest way to handle this is to decide on a format, create a string yourself in that format, and set the cell accordingly.
This comic: http://xkcd.com/1179/ might help you choose a standard to go with. Otherwise, clients that open your file in different countries will see differently formatted data. Just pick a standard and force your data to that standard.
Edited to add: There are libraries that can make this really easy for you as well: http://www.libxl.com/read-write-excel-date-time.html
Edited to add, part 2: Basically, what I'm trying to get at is to avoid asking for the mask and just format the data yourself (if that makes sense).
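A sketch of that suggestion (mine, not the answerer's): render each datetime yourself in one fixed, locale-independent format and write the resulting text to the cell, so no Excel format mask is involved at all:

from datetime import datetime

# ISO 8601-style text reads the same in every locale; what varies is only
# how Excel itself would have rendered a native date, which this bypasses.
value = datetime(2014, 3, 1, 14, 30)
cell_text = value.strftime("%Y-%m-%d %H:%M")   # "2014-03-01 14:30"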
I recommend doing the following: create an Excel file with date formatting on a specific cell and save it for your program to use.
When the program runs, it will open this Excel file and retrieve the local date formatting from the specified cell.
When you have multiple formats to save, just use different cells for them.
It is not a nice way, but it will work AFAIK.
Alternatively, you could consider creating an xla(m) file that uses VBA and a command to feed back the local formatting characters through a function like:
Public Function localChar(charIn As Range) As String
    localChar = charIn.NumberFormatLocal
End Function
Also not a very clean method, but it might do the trick for you.

Can I force MySQL to transmit numerical values as binary?

I've been investigating the network performance of our MySQL/C++ application. I found that our clients read all values (ints, doubles, etc.) from the MySQL C Connector as ASCII literals. That means the server is transmitting numbers to the clients as text (e.g., the string "123" for the int 123). All of our data in this application consists of numerical values (ints and doubles), so I am wondering:
Is there a way to force the MySQL server to transmit the numerical values in binary format rather than ASCII?
Use a compressed connection. Look for UseCompression in the docs.
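For illustration (my sketch, not from the thread): in the C API this corresponds to calling mysql_options(conn, MYSQL_OPT_COMPRESS, NULL) before connecting; the equivalent with MySQL Connector/Python is shown below. Note the values still travel as text; compression only shrinks that text on the wire.

import mysql.connector

# Hypothetical credentials; compress=True enables compression of the
# client/server protocol packets.
conn = mysql.connector.connect(
    host="127.0.0.1",
    user="app",
    password="secret",
    database="telemetry",
    compress=True,
)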

binary-to-text encoding, non-printing characters, protocol buffers, mongodb and bson

I have a candidate key (a MongoDB candidate key, __id) that looks like the following in protocol buffers:
message qrs_signature
{
    required uint32 region_id = 1;
    repeated fixed32 urls = 2;
};
Naturally, I can't use a protocol-buffers-encoded string (via SerializeToString(std::string*)) in my BSON document, since it can contain non-printing characters. Therefore, I am using the Ascii85 encoding to encode the data (using this library). I have two questions.
Is Base85 encoding BSON-safe?
What is BSON's binary type for? Is there some way I can put my (binary) string into that field using a MongoDB API call, or is it just syntactic sugar to denote a value type that needs to be processed in some form (i.e., not a native MongoDB entity)?
Edit:
The append-binary APIs show data being encoded as hex (OMG!); Base85 is therefore more space efficient (22 bytes per record in my case).
BSON-safe, yes. The output of Ascii85 encoding is plain ASCII, and therefore also valid UTF-8, IIRC.
It's used to store chunks of binary data. Binary data is an officially supported type, and you should be able to push binary values to BSON fields using the appropriate driver code (BSONObj, in your case). Refer to your driver docs or the source code for details.
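To illustrate the second point (my sketch; a C++ program would use the driver's BSONObjBuilder appendBinData call rather than this): storing raw protobuf bytes in BSON's native binary type via PyMongo, with no text encoding step at all:

from bson.binary import Binary
from pymongo import MongoClient

raw = b"\x08\x2a\x15\xde\xad\xbe\xef"   # stand-in for SerializeToString() output

client = MongoClient()                   # hypothetical local server
client.testdb.signatures.insert_one({"_id": Binary(raw)})   # binary key, no Ascii85 needed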