Azure Data Warehouse PolyBase File format

We have a file that looks like this:
Col1,Col2,Col3,Col4,Col5
"Hello,",I,",am",some,data!
It therefore has the following properties:
Comma-separated
Double-quote string delimiter (text qualifier)
Commas inside some of the quoted columns
Now, I am not sure if it's actually possible to ingest this with PolyBase, but wondered if there was a way?
The error we are seeing at present is "Could not find a delimiter after quote", which I guess is because, after a double quote, it is hitting something other than the delimiter it expects.
Here is our current file format, for completeness:
CREATE EXTERNAL FILE FORMAT Comma
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = ',',
        STRING_DELIMITER = '"'
    )
);

Specify it in hex instead.
STRING_DELIMITER = '0x22'
(Based on the problem that someone described at the end of https://msdn.microsoft.com/en-au/library/dn935026.aspx )
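Applied to the file format from the question, that would look something like this (the format name CommaHex is just for illustration):
CREATE EXTERNAL FILE FORMAT CommaHex
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = ',',
        STRING_DELIMITER = '0x22'  -- 0x22 is the double-quote character
    )
);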

Sorted this out in the end by adding an intermediate step to convert the file from CSV to ORC format.
It's a bit clunky (as it leaves a copy of the data behind), but PolyBase then works with the file format:
CREATE EXTERNAL FILE FORMAT Orc
WITH (FORMAT_TYPE = ORC)
works for now, until it is addressed by the product team: https://feedback.azure.com/forums/307516-sql-data-warehouse/suggestions/10600132-polybase-allow-field-row-terminators-within-strin
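The post does not say how the CSV-to-ORC conversion was done; as one illustration only, a small Python sketch using pyarrow (an assumed tool, not necessarily what was used here) could perform that intermediate step:
# assumes: pip install pyarrow; file names are placeholders
import pyarrow.csv as pacsv
import pyarrow.orc as paorc

table = pacsv.read_csv("quoted_input.csv")   # quoted fields with embedded commas are parsed here
paorc.write_table(table, "converted.orc")    # ORC output that the file format above can ingest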

Format timestamp inside a set-column sentence

I'm developing a Data Fusion pipeline. It contains a Wrangler node where I'm trying to create a new field that will contain the system date in timestamp format (yyyy-MM-dd'T'HH-mm-ss).
I've tried using the sentence:
set-column :sysdate (${logicalStartTime(yyyy-MM-dd'T'HH-mm-ss)})
But I receive the error:
Caused by: io.cdap.wrangler.api.DirectiveParseException: Error encountered while parsing 'set-column' : Error encountered while compiling '( 2022 -12-01T16-29-32 ) ' at line '1' and column '14'. Make sure a valid jexl transformation is provided.
Which would be the correct sentence?
I've tried:
set-column :sysdate (${logicalStartTime(yyyy-MM-ddHH-mm-ss)})
Which will result in something like "1877", as it subtracts the numbers. I also tried:
set-column :sysdate (${logicalStartTime(yyyyMMddHHmmss)})
but the format isn't correct and can only be written if the field is a String.
You have the correct method, just incorrect syntax. The syntax you are looking for is set-column :sysdate ${logicalStartTime(yyyy-MM-dd'T'HH-mm-ss)}; you have to remove the (). Then you can convert the string to a datetime with parse-as-datetime :sysdate "yyyy-MM-dd'T'HH-mm-ss".
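Put together, the recipe would read:
set-column :sysdate ${logicalStartTime(yyyy-MM-dd'T'HH-mm-ss)}
parse-as-datetime :sysdate "yyyy-MM-dd'T'HH-mm-ss"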

Azure Data Lake - Conditions

I would like to use an if-else statement to decide to which location I have to export data.
My case is:
I extract several files from Azure Blob Storage (it's possible that there are no files!).
I calculate the count of records in the file set.
If the count of records is > 20, I export the files to a specific report location.
If the file set contains no records, I have to output a dummy empty file to a different location, because I don't want to replace the existing report with an empty one.
The solution may be an IF..ELSE condition. The problem is that when I calculate the count of records I get a rowset variable, and I cannot compare it with a scalar variable.
@RECORDS =
    SELECT COUNT(id) AS IdsCount
    FROM @final;

IF @RECORDS <= 20 THEN // generate dummy empty file
    OUTPUT @final_result
    TO @EMPTY_OUTPUT_FILE
    USING Outputters.Text(delimiter : '\t', quoting : true, encoding : Encoding.UTF8, outputHeader : true, dateTimeFormat : "s", nullEscape : "NULL");
ELSE
    OUTPUT @final_result
    TO @OUTPUT_FILE
    USING Outputters.Text(delimiter : '\t', quoting : true, encoding : Encoding.UTF8, outputHeader : true, dateTimeFormat : "s", nullEscape : "NULL");
END;
U-SQL's IF statement is currently only evaluated at compile time, so you can do something like
IF FILE.EXISTS() THEN
But if you want to output different files depending on the number of records, you would have to handle it at the SDK/CLI level:
The first job writes a temp output file (and maybe a status file that contains the number of rows). Then you check (for example in PowerShell) whether the file is empty (or whatever criterion you want to use); if it is not, copy the result over, otherwise create the empty output file.
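A rough sketch of that check in PowerShell (file names are illustrative, and moving files in and out of the Data Lake store is left out):
# row count written by the first U-SQL job to a status file (illustrative name)
$rowCount = [int](Get-Content .\status.txt)

if ($rowCount -gt 20) {
    # enough records: promote the temp output to the real report location
    Copy-Item .\temp_output.tsv .\reports\report.tsv -Force
}
else {
    # no (or too few) records: create a dummy empty file in the other location
    New-Item -Path .\empty\report.tsv -ItemType File -Force | Out-Null
}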

Why must I run this code a few times before my entire .csv file is converted into a .yaml file?

I am trying to build a tool that can convert .csv files into .yaml files for further use. I found a handy bit of code that does the job nicely from the link below:
Convert CSV to YAML, with Unicode?
which states that the line will take the dict created by opening a .csv file and dump it to a .yaml file:
out_file.write(ry.safe_dump(dict_example,allow_unicode=True))
However, one small kink I have noticed is that when it is run once, the generated .yaml file is typically incomplete by a line or two. In order to have the .csv file exhaustively read through to create a complete .yaml file, the code must be run two or even three times. Does anybody know why this could be?
UPDATE
Per request, here is the code I use to parse my .csv file, which is two columns long (with a string in the first column and a list of two strings in the second column), and will typically be 50 rows long (or maybe more). Also note that it is designed to remove any '\n' or spaces that could potentially cause problems later on in the code.
import csv

csv_contents = {}
with open("example1.csv", "rU") as csvfile:
    green = csv.reader(csvfile, dialect='excel')
    for line in green:
        candidate_number = line[0]
        first_sequence = line[1].replace(' ', '').replace('\r', '').replace('\n', '')
        second_sequence = line[2].replace(' ', '').replace('\r', '').replace('\n', '')
        csv_contents[candidate_number] = [first_sequence, second_sequence]
csv_contents.pop('Header name', None)
Ultimately, it is not that important that I maintain the order of the rows from the original dict, just that all the information within the rows is properly structured.
I am not sure what the cause could be, but you might be running out of memory, as you create the YAML document in memory first and then write it out. It is much better to stream it out directly.
You should also note that the code in the question you link to doesn't preserve the order of the original columns, something easily circumvented by using round_trip_dump instead of safe_dump.
You probably want to make a top-level sequence (list) as in the desired output of the linked question, with each element being a mapping (dict).
The following parses the CSV, taking the first line as keys for mappings created for each following line:
import sys
import csv
import ruamel.yaml as ry
import dateutil.parser  # pip install python-dateutil

def process_line(line):
    """convert lines, trying int, float, date"""
    ret_val = []
    for elem in line:
        try:
            res = int(elem)
            ret_val.append(res)
            continue
        except ValueError:
            pass
        try:
            res = float(elem)
            ret_val.append(res)
            continue
        except ValueError:
            pass
        try:
            res = dateutil.parser.parse(elem)
            ret_val.append(res)
            continue
        except ValueError:
            pass
        ret_val.append(elem.strip())
    return ret_val

csv_file_name = 'xyz.csv'
data = []
header = None
with open(csv_file_name) as inf:
    for line in csv.reader(inf):
        d = process_line(line)
        if header is None:
            header = d
            continue
        data.append(ry.comments.CommentedMap(zip(header, d)))

ry.round_trip_dump(data, sys.stdout, allow_unicode=True)
with input xyz.csv:
id, title_english, title_russian
1, A Title in English, Название на русском
2, Another Title, Другой Название
this generates:
- id: 1
  title_english: A Title in English
  title_russian: Название на русском
- id: 2
  title_english: Another Title
  title_russian: Другой Название
The process_line function is just some sugar that tries to convert the strings in the CSV file to more useful types and to strip leading spaces (resulting in far fewer quotes in your output YAML file).
I have tested the above on files with 1000 rows, without any problems (I won't post the output though).
The above was done using Python 3 as well as Python 2.7, starting with a UTF-8 encoded file xyz.csv. If you are using Python 2, you can try unicodecsv if you need to handle Unicode input and things don't work out as well as they did for me.

'~' leading to null results in python script

I am trying to extract a dynamic value (static characters) from a csv file in a specific column and output the value to another csv.
The data element I am trying to extract is '12385730561818101591' from the value 'callback=B~12385730561818101591' located in a specific column.
I have written the below python script, but the output results are always blank. The regex '=(~[0-9]+)' was validated to successfully pull out the '12385730561818101591' value. This was tested on www.regex101.com.
When I use this in Python, no results are displayed in the output file. I have a feeling the '~' is causing the error. When I tried searching for '~' in the original CSV file, no results were found, but it is there!
Can the community help me with the following:
(1) Determine root cause of no output and validate if '~' is the problem. Could the problem also be the way I'm splitting the rows? I'm not sure if the rows should be split by ';' instead of ','.
import csv
import sys
import ast
import re
filename1 = open("example.csv", "w")
with open('example1.csv') as csvfile:
    data = None
    patterns = '=(~[0-9]+)'
    data1 = csv.reader(csvfile)
    for row in data1:
        var1 = row[57]
        for item in var1.split(','):
            if re.search(patterns, item):
                for data in item:
                    if 'common' in data:
                        filename1.write(data + '\n')
filename1.close()
Here I have tried to write some sample code. Hope this will help you solve the problem:
import re

s = "callback=B~12385730561818101591"
rc = re.match(r'.*=B~([0-9]+)', s)
print(rc.group(1))
Your regex is wrong for your example:
=(~[0-9]+) will never match callback=B~12385730561818101591 because of the B after the = and before the ~.
Also, you include the ~ in the capturing group.
Not exactly sure what your goal is, but this could work. Give more details if you have more restrictions.
=.+~([0-9]+)
EDIT
Following the new information provided:
patterns = '=.+~([0-9]+)'
...
result = re.search(patterns, item)
number = result.group(1)
filename1.write(number + '\n')
...
Concerning your line split on '\t' (tab), you should show an example of the full line.

How can I convert an Excel table into Trac Wiki Table format?

I've seen copy & paste converters to MediaWiki or HTML format, but I could not find one that converts to the Trac wiki format (WikiFormatting), which uses pipes to separate cells, such as:
||Cell 1||Cell 2||Cell 3||
||Cell 4||Cell 5||Cell 6||
You could save your Excel sheet as a CSV file. Then from a command prompt (assuming you are running Windows XP or newer) type this command:
for /f "tokens=1,2,3 delims=," %a in (mycsvfile.csv) do ((echo ^|^|%a^|^|%b^|^|%c^|^|) >> mywikifile.txt)
The number of tokens depends on how many columns you have. You could do up to 26 columns this way in a single pass by increasing the number of tokens and adding the corresponding number of variable names %d, %e, etc.
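For example, a five-column sheet would follow the same pattern (file names are illustrative):
for /f "tokens=1-5 delims=," %a in (mycsvfile.csv) do ((echo ^|^|%a^|^|%b^|^|%c^|^|%d^|^|%e^|^|) >> mywikifile.txt)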
I made a jsFiddle to do just this; just put your CSV text in the HTML box and run the script. The content that you would paste into a Trac wiki page will then be in the Result box.
In case something happens to the jsFiddle, here is the JavaScript I used (I probably didn't need to use jQuery, but it was faster for me than having to think of the non-jQuery way to do it):
var csv = $('body').html().trim();
csv = csv.replace(/,/g, "||");
csv = csv.replace(/$/gm, "||<br />");
csv = csv.replace(/^/gm, "||");

// set to false if you don't want empty cells
if (true) {
    while (csv.indexOf("||||") > -1) {
        csv = csv.replace(/\|\|\|\|/g, "|| ||");
    }
}

$('body').html(csv);
My port of Shan Carter's Mr. Data Converter now supports Wiki in the format you specified. You can copy & paste directly from Excel or from a CSV file.
http://thdoan.github.io/mr-data-converter/
UPDATE: I've added Trac-specific formatting under the "Trac" option.