S3 to Redshift input data format - amazon-web-services

S3 to Redshift input data format - amazon-web-services

I'm trying to run a simple chain s3-pipeline-redshift, but I've got completely stucked with input data format. Here's my file:
1,Toyota Park,Bridgeview,IL
2,Columbus Crew Stadium,Columbus,OH
3,RFK Stadium,Washington,DC
4,CommunityAmerica Ballpark,Kansas City,KS
5,Gillette Stadium,Foxborough,MA
6,New York Giants Stadium,East Rutherford,NJ
7,BMO Field,Toronto,ON
8,The Home Depot Center,Carson,CA
9,Dick's Sporting Goods Park,Commerce City,CO
10,Pizza Hut Park,Frisco,TX
and here's the table I'm using:
create table venue_new(
venueid smallint not null,
venuename varchar(100) not null,
venuecity varchar(30),
venuestate char(2),
venueseats integer not null default '1000');
When I use | as a delimiter, I'm getting error 1214 - Delimiter not found , when I use comma - same thing, when I converted file to utf-8, I'm getting "Invalid digit, Value '.', Pos 0, Type: Short'.
I ran out of ideas. What a heck is wrong with that thing? Can somebody please give me the example of the input file or tell what I'm doing wrong? Thanks in advance.
P.S. I also found that sample files are available in bucket awssampledb, but I have no idea how to get them.

Based on the data in the file example. You need to remember that you have 5 fields in your table, and there is no 5th field in any of your data - yet it is a not null field. Your Copy command needs to reference the 4 columns you are providing at the start of the statement.
copy venue_new(venueid, venuename, venuecity, venuestate)
from 's3://mybucket/data/venue_noseats.txt'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
delimiter ',';
I found the above command (from AWS Docs COPY examples worked successfully for me, leaving me with the default 1000 in the 'venueseats' column.

Related

Parsing a name from a complex string in Tableau

I have a series of values in Tableau that are long strings intermixed with letters and numbers. I am unable to control the data output, but would like to parse the names from these strings. They follow the following format:
Potato 1TByte 4.5 NFA
Board 256GByte 553 NCA
Launch 4 512GByte 4.5 NFA
Launch 4S 512GByte 4.5 NCA
From each of these, I am attempting to capture the following:
"Potato"
"Board"
"Launch 4"
"Launch 4S"
Each string follows the same format: the name, followed by size, followed by some extra information we don't really care about.
I've tried to put together some text parsing strings, but am coming up short, and am still trying to learn regular expressions.
The Tableau calculated field I was trying to work with was something like the following:
LEFT([String], FIND([String], "Byte") - 2)
The issue is that the text and numbers preceding Byte can be anywhere from 4 to 2 characters and I need a way to identify the length of that.
Any help would be greatly appreciated!

One option which uses a regex replacement:
REGEXP_REPLACE('Launch 4 512GByte 4.5 NFA', ' \d+[A-Z]Byte .*$', '')
This strips off everything from the Byte term to the right, leaving us with only the product name.

You could try the following - this seems to work - Screenshot of Tableau output. Find below the formulas for the various derived columns you see in the screenshot (Your source column is called [Name])
Step1 = LEFT([Name],FIND([Name],"Byte")-1)
Step2 = LEN([Step1])-LEN(REPLACE([Step1]," ",""))
Step3 = FINDNTH([Step1]," ",[Step2])
Step4 = LEFT([Step1],[Step3]-1)
And of course you can nest all these in a single calculated field - kept them as separate columns for easier understanding

Char length exceeds DDL length

I am attempting to copy data into redshift from an S3 bucket, however I am getting a 1204 error code 'char length exceeds DDL length'.
copy table_name from '[data source]'
access_key_id '[access key]'
secret_access_key '[secret access key]'
region 'us-east-1'
null as 'NA'
delimiter ','
removequotes;
The error occurs in the very first row, where it tries to put the state abbreviation 'GA' into the data_state column which is defined with the data type char(2). When I query the stl_load_errors table I get the following result:
line_number colname col_length type raw_field_value err_code err_reason
1 data_state 2 char GA 1204 Char length exceeds DDL length
As far as I can tell that shouldn't exceed the length as it is two characters and it is set to char(2). Does anyone know what could be causing this?

Got it to work by changing the data type to char(3) instead, however still not sure why char(2) wouldn't work

Mine did this as well, for a state column too. Redshift defaults char to char(1) - so I had to specify char(2) - are you sure it didn't default back to char(1) because mine did

Open the file up with a Hex editor, or use an online one here, and look at the GA value in the data_state column.
If it has three dots before it like so:
...GA
Then the file (or when it was orignally created) was UTF-8-BOM not just UTF-8.
You can open the file in something like Notepad++ and go to Encoding in the top bar then select Convert to UTF-8.

Azure Data Warehouse PolyBase File format

We have a file that looks like this:
Col1,Col2,Col3,Col4,Col5
"Hello,",I,",am",some,data!
It therefore has the following 'properties':
Comma-separated
Double-quote column delimiter
Commas in some of the columns
Now, I am not sure if it's actually possible to ingest this with PolyBase, but wondered if there was a way?
The error we are seeing at present is "Could not find a delimiter after quote".. which i guess is because after the double quote it is hitting what is an expected delimiter..
Here is our current file format, for completeness:
CREATE EXTERNAL FILE FORMAT Comma
WITH (FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS(
FIELD_TERMINATOR = ',',
STRING_DELIMITER = '"',
)
)

Specify it in hex instead.
STRING_DELIMITER = '0x22'
(Based on the problem that someone described at the end of https://msdn.microsoft.com/en-au/library/dn935026.aspx )

Sorted this out in the end by adding an intermediary step to convert the file from csv to ORC format..
It's a bit clunky (as it leaves a mess of a copy behind), but the PolyBase then does work with the fileformat:
CREATE EXTERNAL FILE FORMAT Orc
WITH (FORMAT_TYPE = ORC)
works for now, until it is addressed by the product team: https://feedback.azure.com/forums/307516-sql-data-warehouse/suggestions/10600132-polybase-allow-field-row-terminators-within-strin

Access 2010 Query add text to end of existing text if condition is met

I have a column of data, diagnosis codes to be exact. the problem is that when the data is imported it turns 111.0 into 111 (or any whole number). I am wondering if there is an update query I can run that will add the ".0" to the end of any value that is 3 characters long. I had a problem of it stripping a value from 008.45 to 8.45 but I figured that part out using:
UPDATE Master SET DIAGNOSIS01 = LEFT("00", 3-LEN(DIAGNOSIS01)) + DIAGNOSIS01
WHERE LEN(DIAGNOSIS01)<3 AND Len(DIAGNOSIS01)>0;
I got that from here on stackoverflow. Is there a variation of this update query I can use to add to the right if it's only 3 digits?
Additional info... formats of the values in this column include xxx.x or xxx.xx with x being a number
When it comes to sql I am very new so please treat me like I'm 3... ;)

UPDATE Master
SET Master.DIAGNOSIS01 = IIf(Len([Master].[DIAGNOSIS01])=3,[Master].[DIAGNOSIS01] & ".0",[Master].[DIAGNOSIS01]);

Format mask for number field items: trailing and 'leading' zero

I'm having some trouble with displaying numbers in apex, but only when i fill them in through code. When numbers are fetched through an automated row fetch, they're fine!
Leading Zero
For example, i have a report where a user can click a link, which runs a javascript function. There i get detailed values for that record through an application process. The returned values are in JSON. Several fields are number fields.
My response looks as follows (fe):
{"AVAILABLE_STOCK": "15818", "WEIGHT": ".001", "VOLUME": ".00009", "BASIC_PRICE": ".06", "COST_PRICE": ".01"}
Already the numbers here 'not correct': values less than one do not have a zero before the .
I kind of hoped that the format mask on the items would catch this. If i specify FM999G990D000 for the item weight, i'd expect it to show '0.001' .
But okay, i suppose it only works that way when it comes through session state, and not when you set an item value through $("#").val() ?
Where do i go wrong? Is my only option to change my select in the app process?
Now:
SELECT '"AVAILABLE_STOCK": "' || AVAILABLE_STOCK ||'", '||
'"WEIGHT": "' || WEIGHT ||'", '||
'"VOLUME": "' || VOLUME ||'", '||
'"BASIC_PRICE": "' || BASIC_PRICE ||'", '||
Do i need to provide my numberfields a to_char with the format mask here (to_char(available_stock, 'FM999G990D000')) ?
Right now i need to put my numbers between quotes ofcourse, or i get invalid json when i parse it.
Trailing Zero
I have an application process on a page on the after header point, right after an automated row fetch. Several fields are calculated here (totals). The variables used are all specified as number(10, 2). All values are correct and rounded to 2 values after the comma. My format masks on the items are also specified as FM999G999G990D00.
However, when one of the calculated values has only one meaningfull value after the comma, the trailing zeros get dropped. Instead of '987.50', it is displayed as '987.5'.
So, i have a number variable, and assign it like this: :P12_NDB_TOTAL_INCL := v_totI;
Would i need to convert my numbers here too, with format mask?
What am i doing wrong, or what am i missing?

If you aren't doing math on it and are more concerned with formatting, I suggest treating it as a varchar/string instead of as a number wherever you can.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

S3 to Redshift input data format - amazon-web-services

Related

Parsing a name from a complex string in Tableau

Char length exceeds DDL length

Azure Data Warehouse PolyBase File format

Access 2010 Query add text to end of existing text if condition is met

Format mask for number field items: trailing and 'leading' zero

Categories

Resources