Azure Web Job: bad encoding in data downloaded from Data Lake Store - azure-webjobs

Currently, I just want to download files from Data Lake Store and store the data in my SQL database, but I have a problem with strings that should contain characters like (ę, ą, ć, ł) but have them replaced by (e, a, c, l). I have tried changing the culture information and the encoding in StreamReader, but that doesn't give me any better result (I still get replaced characters in my string values). So is there any workaround, or any place where I can globally set encoding parameters for my App Service and the WebJobs included in it?

The issue is not related to the WebJob. Reads and writes work at the byte level, so any special character can be read from one place and written to another without loss.
strings that should contain characters like (ę, ą, ć, ł) but have them replaced by (e, a, c, l)
What column type did you define for the special characters in your SQL Server table? If the column type is char or varchar, it will lose data when you store special characters. Changing the column type to nchar or nvarchar will solve this issue.
Here is the test from my side.
Step 1, Define a table using the following SQL statement.
CREATE TABLE [dbo].[mytable]
(
[id] INT NOT NULL PRIMARY KEY,
[text1] varchar(50),
[text2] nvarchar(50)
)
Step 2, Insert a row using the following SQL statement (note the N prefix, which keeps the second literal as Unicode).
insert into mytable (id, text1, text2) values(1, 'ę, ą, ć, ł', N'ę, ą, ć, ł')
Step 3, Query data from mytable using the following SQL statement.
select * from dbo.mytable
Here is the result I got.
According to the result, the value of text1 was changed to 'e,a,c,l' because its column type is varchar, while text2 kept the original characters.
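If your existing table already uses char/varchar for this column, a minimal sketch of the fix (using the example table above; adjust names and lengths to your own schema) is to widen the column to nvarchar and to mark Unicode literals with the N prefix:
ALTER TABLE [dbo].[mytable] ALTER COLUMN [text1] nvarchar(50);
-- the N prefix keeps the literal as Unicode instead of converting it to the database code page
insert into mytable (id, text1, text2) values(2, N'ę, ą, ć, ł', N'ę, ą, ć, ł');
From the WebJob side, write the value through a parameterized command with an nvarchar parameter (SqlDbType.NVarChar) so it stays Unicode end to end.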

Related

Google Big Query, convert legacy sql to standard sql

I'm trying to convert a Legacy SQL query to Standard SQL:
SELECT * FROM
TABLE_QUERY([prod-chap_out],'REGEXP_MATCH(table_id, r"OUT\d+$")')
This query works fine in Legacy SQL; however, the result can't be converted to a JSON response when used through the APIs. I would rather serialize this into JSON than have to work with a bunch of data tables and convert values.
How can this be converted to Standard SQL?
I've tried
REGEXP_CONTAINS(table_id, r"OUT\d+$")
but I get the error that \d is an illegal character.
You can use the wildcard * in your FROM clause and the resulting pseudo column _TABLE_SUFFIX in your WHERE clause:
SELECT
*
FROM
`<project-id>.<dataset-id>.<table-prefix>*`
WHERE
REGEXP_CONTAINS(_TABLE_SUFFIX, r"OUT\d+$")
I'm not entirely sure what your table names look like. Here is the official documentation on transitioning to standard SQL: https://cloud.google.com/bigquery/docs/reference/standard-sql/wildcard-table-reference#the_table_query_function

How to change the data in a column in SAS Data Integration?

I have an existing ETL solution built in SAS Data Integration, where one of the columns is initially set to have all null values. I want to populate that column with actual data. The original column in that table was set to receive numeric values with a specific format and informat. After changing the code (that is the easy part), I noticed that the column doesn't accept character values (I did not get an error; I just noticed the column still has all NULL values).
Can anyone help?
So you have a table that is defined in Data Integration Studio (1) and created by running the job (2) a long time ago with a numeric column. Let us call that table THE_TABLE, that field the_field, and the job that loads data into THE_TABLE, The_Job.
You must be aware of the fundamental difference between
defining THE_TABLE in DI Studio, which creates a description of the table in metadata, and
creating THE_TABLE by running The_Job, which creates a file with data in a folder.
If The_Job really creates THE_TABLE from scratch each time (which is typical for ETL jobs), it is sufficient to change THE_TABLE and The_Job in DI Studio. Your edits will only change the metadata, but the next time you run The_Job, THE_TABLE will be created with the right structure.
However, if The_Job updates THE_TABLE or appends to it, your edits will not change the structure of THE_TABLE, and your job will no longer fit the structure of the THE_TABLE file that still exists in the folder, so you must convert THE_TABLE before running The_Job.
This can be done with a simple program like
data THE_TABLE;
set THE_TABLE (drop=the_field); /* forget about the numeric field */
attrib the_field length=$200 format=$200.; /* and create the character field */
run;
The correct attrib statement might well be in the code generated for The_Job somewhere.
Mind that in a typical setup with a development, test and production environment, you will need that program once in each environment.

Athena shows no value against boolean column, table created using glue crawler

I am using an AWS Glue CSV crawler to crawl an S3 directory containing CSV files. The crawler works fine in the sense that it creates the schema with the correct data type for each column; however, when I query the data from Athena, it doesn't show any value under the boolean column.
A CSV looks like this:
"val","ts","cond"
"1.2841974","15/05/2017 15:31:59","True"
"0.556974","15/05/2017 15:40:59","True"
"1.654111","15/05/2017 15:41:59","True"
And the table created by the crawler is:
Column name Data type
val string
ts string
cond boolean
However, when I run, say, select * from <table_name> limit 10, it returns:
val ts cond
1 "1.2841974" "15/05/2017 15:31:59"
2 "0.556974" "15/05/2017 15:40:59"
3 "1.654111" "15/05/2017 15:41:59"
Does anyone have any idea what might be the reason?
I forgot to add: if I change the data type of the cond column to string, it does show the data as strings, e.g. "True" or "False".
I don't know why Glue classifies the cond column as boolean, because Athena will not understand that value as a boolean. I think this is a bug in Glue, or an artefact of it not targeting Athena exclusively. Athena expects boolean values to be either true or false. I don't remember if that includes different capitalizations of the strings or not, but either way yours will fail because they are quoted. The actual bug is that Glue has not configured your table so that it strips the quotes from the strings, and therefore Athena sees a boolean column containing "True" with quotes and all, and that is not a supported boolean value. Instead you get NULL values.
You could try changing your table to use the OpenCSVSerDe instead; it supports quoted values.
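For example, a table definition along these lines should return the quoted values (the table name and S3 location are placeholders; I have declared every column as string, which is the safe choice with this SerDe, and you can cast or compare in the query as needed):
CREATE EXTERNAL TABLE my_csv_table (
  val string,
  ts string,
  cond string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '"'
)
STORED AS TEXTFILE
LOCATION 's3://<your-bucket>/<csv-prefix>/'
TBLPROPERTIES ('skip.header.line.count' = '1');
Then something like select val, ts, lower(cond) = 'true' as cond from my_csv_table limit 10 gives you back a usable boolean.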
It's surprising that Glue continues to stumble on basic things like this. Glue is unfortunately rarely worth the effort over writing some basic scripts yourself.

Retrieving NULL values from S3 when running a SELECT query with AWS Redshift Spectrum

I am able to unload data to S3 and query the results with Spectrum, but NOT when using the delimiter defined below. This is our standard delimiter, and it works with all of our processing today related to Redshift COPY and UNLOAD commands, so I believe the UNLOAD is working fine. But somewhere between the table definition and the SQL query to retrieve the data, something is not working: we just receive NULLs for all of the fields. Can you look at our example below to determine next steps?
unload ('select * from db.test')
to 's3://awsbucketname/ap_cards/'
iam_role 'arn:aws:iam::123456789101:role/redshiftaccess'
delimiter '\325'
manifest;
CREATE EXTERNAL TABLE db_spectrum.test (
cost_center varchar(100) ,
fleet_service_flag varchar(1)
)
row format delimited
fields terminated by '\325'
stored as textfile
location 's3://awsbucketname/test/';
select * from db_spectrum.test
I got a response from the AWS Support center:
Unfortunately you will need to either process the data externally to change the delimiter or UNLOAD the data again with a different delimiter.
The docs say to Specify a single ASCII character for 'delimiter'.
The ASCII range only goes up to 177 in octal.
We will clarify the docs to note that 177 is the max permissible octal for a delimiter. I can confirm that this is the same in Athena as well.
Thank you for bringing this to our attention.
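So the practical workaround is to UNLOAD again with a delimiter inside the 7-bit ASCII range, for example a pipe, and use the same delimiter in the external table definition. A sketch based on the statements above (same bucket, role and table; only the delimiter is changed, and the table LOCATION is pointed at the same prefix the UNLOAD writes to):
unload ('select * from db.test')
to 's3://awsbucketname/ap_cards/'
iam_role 'arn:aws:iam::123456789101:role/redshiftaccess'
delimiter '|'
manifest;

CREATE EXTERNAL TABLE db_spectrum.test (
cost_center varchar(100),
fleet_service_flag varchar(1)
)
row format delimited
fields terminated by '|'
stored as textfile
location 's3://awsbucketname/ap_cards/';
If the data itself can contain the delimiter, you will need extra handling (for example the UNLOAD ESCAPE option), which this sketch skips.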
You might try using Spectrify for this. It automates a lot of the nastiness currently involved in moving a Redshift table to Spectrum.

Using a single ADO Query to copy data from a text file into another ODBC source

This may seem an odd question as I have a solution; I just don't understand why it works, and that limits me.
I am copying data from various sources into SQL and am using an ADO connection in C++ Builder XE2.
When the data is from MS Access or MS Excel, the code is similar to the following:
//SetupADO..
ADOConn->ConnectionString="Provider=Microsoft.Jet.OLEDB.4.0;Data Source=c:/temp/testdb.mdb";
//Then open it..
ADOConn->Connected = true;
//Build SQL
UnicodeString sSQL = "SELECT * INTO [ODBC;DSN=PostgreSQL30;DATABASE=admin_db;SERVER=192.168.1.10;PORT=5432;UID=user1;PWD=pass1;SSLmode=disable;ReadOnly=0;Protocol=7.4;].[table1] FROM [accesstb]";
//And finally I use the Execute() function of the ADO Connection
ADOConn->Execute(sSQL, iRA, TExecuteOptions() << TExecuteOption::eoExecuteNoRecords);
This works fine for Excel too, but not for CSV files. I'm using the same driver but can only get it working by changing the syntax around.
//SetupADO..
ADOConn->ConnectionString="Provider=Microsoft.Jet.OLEDB.4.0;Data Source=c:\\temp;Extended Properties=\"Text;HDR=Yes;\";Persist Security Info=False";
//Then open it..
ADOConn->Connected = true;
//Build SQL with the IN keyword and start internal ODBC connection with 2 single quotes
UnicodeString sSQL = "SELECT * INTO [table1] IN '' [ODBC;DSN=PostgreSQL30;DATABASE=admin_db;SERVER=192.168.1.10;PORT=5432;UID=user1;PWD=pass1;SSLmode=disable;ReadOnly=0;Protocol=7.4;] FROM [test.csv]";
//And finally Execute() again
ADOConn->Execute(sSQL, iRA, TExecuteOptions() << TExecuteOption::eoExecuteNoRecords);
When using the same SQL syntax as in the Access query, the error "Query input must contain at least one table or query" is returned.
Interestingly, one escaped quote, i.e. \', fails when used in place of the two single quotes. I have also tried writing to another Access database in case the problem was with PostgreSQL, but I had the same results.
Can someone tell me why the IN keyword is required and what the single quotes do?
Extended Properties=\"Text;HDR=Yes;\" tells the Jet provider to treat the folder as a text data source, which is why the connection string is different. The IN clause specifies the external destination database for SELECT ... INTO: the empty string '' is the (empty) database path argument and the bracketed ODBC string is the connect string, so table1 is created in the PostgreSQL database rather than in the text data source.
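For reference, a minimal sketch of the two destination syntaxes (connection details shortened to DSN=PostgreSQL30 here; everything else is as in the question):
SELECT * INTO [ODBC;DSN=PostgreSQL30;].[table1] FROM [accesstb]
SELECT * INTO [table1] IN '' [ODBC;DSN=PostgreSQL30;] FROM [test.csv]
The first form names the destination as a bracketed prefix and works with the Access and Excel source connections; the second uses the IN clause, with '' as the database path argument and the bracketed ODBC string as the connect string, and is the form that works when the source connection is the text folder.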
References
Importing CSV Data and saving it in database - CodeProject