Char length exceeds DDL length - amazon-web-services

I am attempting to copy data into Redshift from an S3 bucket, but I am getting error 1204, 'Char length exceeds DDL length'.
copy table_name from '[data source]'
access_key_id '[access key]'
secret_access_key '[secret access key]'
region 'us-east-1'
null as 'NA'
delimiter ','
removequotes;
The error occurs on the very first row, where it tries to put the state abbreviation 'GA' into the data_state column, which is defined as char(2). When I query the stl_load_errors table I get the following result:
line_number | colname    | col_length | type | raw_field_value | err_code | err_reason
1           | data_state | 2          | char | GA              | 1204     | Char length exceeds DDL length
As far as I can tell that shouldn't exceed the length, since 'GA' is two characters and the column is char(2). Does anyone know what could be causing this?

I got it to work by changing the data type to char(3) instead, but I'm still not sure why char(2) wouldn't work.

Mine did this as well, also for a state column. Redshift defaults char to char(1), so I had to specify char(2). Are you sure it didn't default back to char(1)? Mine did.
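If you want to rule that out, a minimal sketch for checking what the column was actually created as (assuming the table's schema is on your search_path; 'table_name' is the poster's placeholder):

-- Check the column definition Redshift actually stored
SELECT "column", type
FROM pg_table_def
WHERE tablename = 'table_name'
  AND "column" = 'data_state';

The type value should come back as character(2) if the DDL took effect as written.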

Open the file up with a hex editor (or an online one) and look at the GA value in the data_state column.
If it has three extra bytes before it, shown as something like:
...GA
then the file (or whatever originally created it) is UTF-8 with a BOM, not plain UTF-8.
You can open the file in something like Notepad++, go to Encoding in the top bar, and select Convert to UTF-8.
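This matters because Redshift measures char lengths in bytes, and the UTF-8 BOM is three invisible bytes (EF BB BF), so a leading 'GA' can effectively arrive wider than two bytes. A hedged way to confirm this from inside Redshift, assuming the rejected value (BOM included) was captured in stl_load_errors:

-- Compare character length vs. byte length of the rejected field.
-- If byte_len exceeds the declared char(2) width while the value looks like
-- plain 'GA', there are invisible multibyte characters (such as a BOM) in it.
SELECT colname,
       raw_field_value,
       LEN(RTRIM(raw_field_value))          AS char_len,
       OCTET_LENGTH(RTRIM(raw_field_value)) AS byte_len
FROM stl_load_errors
WHERE err_code = 1204
ORDER BY starttime DESC
LIMIT 10;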

How to COPY data into Redshift using the DELIMITER option when the data itself contains the delimiter?

I have pipe-delimited data which I am trying to copy from S3 to Redshift, but the COPY operation is failing with error code 1202: 'Extra column(s) found'.
When I looked into stl_load_errors, the COPY operation failed for rows that contain the | delimiter inside a field value.
Sample data:
1|hello world|how|are you|
2|"hope|you|are|doing|good"|thank you|
3|I am fine|thank you|
For the data above, the row 2|"hope|you|are|doing|good"|thank you| fails to copy because it contains the | delimiter inside a value, even though that value is enclosed in double quotes.
My COPY command looks like this:
COPY <DATABASE.TABLE NAME>
FROM 's3://path/to/file'
iam_role 'arn:aws:iam:my_role'
delimiter '|'
dateformat 'auto'
IGNOREHEADER 1
MAXERROR 5;
The AWS Redshift documentation has an example of loading this type of data, but it uses the CSV option rather than the DELIMITER option.
How can I solve this issue?
You should add the REMOVEQUOTES parameter.
From Data Conversion Parameters - Amazon Redshift:
Removes surrounding quotation marks from strings in the incoming data. All characters within the quotation marks, including delimiters, are retained.
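In other words, keep the command as it is and just add the parameter; a sketch of the adjusted command, reusing the poster's own placeholders:

COPY <DATABASE.TABLE NAME>
FROM 's3://path/to/file'
iam_role 'arn:aws:iam:my_role'
delimiter '|'
removequotes
dateformat 'auto'
IGNOREHEADER 1
MAXERROR 5;

With REMOVEQUOTES, the pipes inside the double-quoted field are treated as data rather than delimiters, and the surrounding quotes are stripped on load.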

Multiple To clauses in Data step

I have a data step where I have a few columns that need to be tied to one other column.
I have tried using multiple "from" and "to" assignments and a couple of other permutations, but nothing seems to do the trick. The code looks something like this:
data analyze;
set css_email_analysis;
from = bill_account_number;
to = customer_number;
output;
from = bill_account_number;
to = email_addr;
output;
from = bill_account_number;
to = e_customer_nm;
output;
run;
I would like to see two columns, with bill account numbers in the "from" column and the other values in the "to" column, but instead I get a bill account and its customer number, with some "..."s for the other values.
Issue
This is most likely because SAS has only two data types (numeric and character), and the first time the to variable is set up it inherits its type from customer_number. At your second to assignment you attempt to set to to the value of email_addr. Assuming email_addr is a character variable, two things can happen here:
Customer_number is numeric: to has already been set up as numeric, so SAS cannot force to to become character, and a message like this may appear:
NOTE: Invalid numeric data, 'me#mywebsite.com' , at line 15 column 8. to=.
ERROR=1 N=1
Customer_number is character: to has been set up as character, but without an explicitly defined length. If that length happens to be shorter than the value of email_addr, the email address will be truncated, and SAS will not show an error when this happens:
Code:
data _NULL_;
to = 'hiya';
to = 'me#mydomain.com';
put to=;
run;
to=me#m
to is set with a length of 4, and SAS does not expand it to fit the new data.
Detail
The thing to bear in mind here is how SAS works behind the scenes.
The data statement sets up an output location
The set statement adds the variables from the first observation of the dataset specified to a space in memory called the PDV (Program Data Vector), inheriting lengths and data types.
PDV:
bill_account_number|customer_number|email_addr|e_customer_nm
===================================================================
010101 | 758|me#my.com |John Smith
The to statement adds another variable inheriting the characteristics of customer_number
PDV:
bill_account_number|customer_number|email_addr|e_customer_nm|to
===================================================================
010101 | 758|me#my.com |John Smith |758
(to is either char length 3 or a numeric)
Subsequent to assignments will not alter the characteristics of the variable, and SAS will continue processing.
PDV (if customer_number is character = TRUNCATION):
bill_account_number|customer_number|email_addr|e_customer_nm|to
===================================================================
010101 | 758|me#my.com |John Smith |me#
PDV (if customer_number is numeric = DATA ERROR, to set to missing):
bill_account_number|customer_number|email_addr|e_customer_nm|to
===================================================================
010101 | 758|me#my.com |John Smith |.
Resolution
To resolve this issue it's probably easiest to set the length and type of to before your first to statement:
data analyze;
set css_email_analysis;
from = bill_account_number;
length to $200;
to = customer_number;
output;
...
You may get messages like this, where SAS has converted data on your behalf:
NOTE: Numeric values have been converted to character
values at the places given by: (Line):(Column).
27:8
N.B. it's not necessary to explicitly define the length and type of from because, as far as I can see, you only ever populate it from one variable in the source dataset. You could also achieve this with a rename if you don't need to keep the bill_account_number variable:
rename bill_account_number = from;

How to avoid conversion to ASCII when reading

I'm using Python to read values from SQL Server (pypyodbc) and insert them into PostgreSQL (psycopg2).
A value in the NAME field has come up that is causing errors:
Montaño
The value exists in my MSSQL database just fine (SQL_Latin1_General_CP1_CI_AS collation), and can be inserted into my PostgreSQL database (UTF8) just fine using pgAdmin and an INSERT statement.
The problem is selecting it using python causes the value to be converted to:
Monta\xf1o
(0xf1 is the Latin-1 code point for 'Latin small letter n with tilde')
...which is causing the following error to be thrown when trying to insert into PostgreSQL:
invalid byte sequence for encoding "UTF8": 0xf1 0x6f 0x20 0x20
Is there any way to avoid the conversion of the input string to the string that is causing the error above?
Under Python 2 you actually do want to perform a conversion, from a byte string to the unicode type. So, if your code looks something like
sql = """\
SELECT NAME FROM dbo.latin1test WHERE ID=1
"""
mssql_crsr.execute(sql)
row = mssql_crsr.fetchone()
name = row[0]
then you probably want to convert the plain latin1 byte string (retrieved from SQL Server) to unicode before using it as a parameter in the PostgreSQL INSERT; i.e., instead of
name = row[0]
you would do
name = unicode(row[0], 'latin1')

S3 to Redshift input data format

I'm trying to run a simple S3 -> pipeline -> Redshift chain, but I've gotten completely stuck on the input data format. Here's my file:
1,Toyota Park,Bridgeview,IL
2,Columbus Crew Stadium,Columbus,OH
3,RFK Stadium,Washington,DC
4,CommunityAmerica Ballpark,Kansas City,KS
5,Gillette Stadium,Foxborough,MA
6,New York Giants Stadium,East Rutherford,NJ
7,BMO Field,Toronto,ON
8,The Home Depot Center,Carson,CA
9,Dick's Sporting Goods Park,Commerce City,CO
10,Pizza Hut Park,Frisco,TX
and here's the table I'm using:
create table venue_new(
venueid smallint not null,
venuename varchar(100) not null,
venuecity varchar(30),
venuestate char(2),
venueseats integer not null default '1000');
When I use | as the delimiter, I get error 1214 - 'Delimiter not found'; when I use a comma, the same thing; and when I converted the file to UTF-8, I get "Invalid digit, Value '.', Pos 0, Type: Short".
I've run out of ideas. What the heck is wrong with this thing? Can somebody please give me an example of the input file, or tell me what I'm doing wrong? Thanks in advance.
P.S. I also found that sample files are available in the awssampledb bucket, but I have no idea how to get them.
Based on the data in your example file: remember that you have 5 columns in your table but only 4 fields in each row of your data, and the missing venueseats column is declared NOT NULL (with a default). Your COPY command needs to list the 4 columns you are providing at the start of the statement.
copy venue_new(venueid, venuename, venuecity, venuestate)
from 's3://mybucket/data/venue_noseats.txt'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
delimiter ',';
I found the above command (adapted from the AWS Docs COPY examples) worked successfully for me, leaving the default 1000 in the venueseats column.
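As a quick sanity check after the load (a hypothetical follow-up, not part of the original answer), you can confirm that the default was applied to the column you left out:

-- Inspect the loaded rows; venueseats was not in the column list, so the
-- table default should have filled it in.
SELECT venueid, venuename, venuecity, venuestate, venueseats
FROM venue_new
ORDER BY venueid;

Every row should come back with venueseats = 1000, since COPY only populated the four listed columns and the column default supplied the rest.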

Format mask for number field items: trailing and 'leading' zero

I'm having some trouble with displaying numbers in APEX, but only when I fill them in through code. When numbers are fetched through an automated row fetch, they're fine!
Leading Zero
For example, I have a report where a user can click a link, which runs a JavaScript function. There I get detailed values for that record through an application process. The returned values are in JSON. Several fields are number fields.
My response looks as follows (for example):
{"AVAILABLE_STOCK": "15818", "WEIGHT": ".001", "VOLUME": ".00009", "BASIC_PRICE": ".06", "COST_PRICE": ".01"}
Already the numbers here are 'not correct': values less than one do not have a zero before the decimal point.
I kind of hoped that the format mask on the items would catch this. If I specify FM999G990D000 for the weight item, I'd expect it to show '0.001'.
But okay, I suppose it only works that way when the value comes through session state, and not when you set an item value through $("#").val()?
Where am I going wrong? Is my only option to change the SELECT in the application process?
Now:
SELECT '"AVAILABLE_STOCK": "' || AVAILABLE_STOCK ||'", '||
'"WEIGHT": "' || WEIGHT ||'", '||
'"VOLUME": "' || VOLUME ||'", '||
'"BASIC_PRICE": "' || BASIC_PRICE ||'", '||
Do I need to wrap my number fields in a to_char with the format mask here (to_char(available_stock, 'FM999G990D000'))?
Right now I need to put my numbers between quotes, of course, or I get invalid JSON when I parse it.
Trailing Zero
I have an application process on a page at the After Header point, right after an automated row fetch. Several fields are calculated here (totals). The variables used are all declared as number(10, 2). All values are correct and rounded to two decimal places. My format masks on the items are also specified as FM999G999G990D00.
However, when one of the calculated values has only one meaningful digit after the decimal point, the trailing zero gets dropped. Instead of '987.50', it is displayed as '987.5'.
So, I have a number variable and assign it like this: :P12_NDB_TOTAL_INCL := v_totI;
Would I need to convert my numbers here too, with a format mask?
What am I doing wrong, or what am I missing?
If you aren't doing math on it and are more concerned with formatting, I suggest treating it as a varchar/string instead of as a number wherever you can.
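Along those lines, and building on the to_char idea already raised in the question, a hedged sketch of the server-side formatting (the format masks are the ones from the question; the decimal and group characters actually produced depend on your NLS settings):

-- Demonstration of the format masks from the question, assuming '.' is the
-- NLS decimal separator; to_char returns strings, so the leading and trailing
-- zeros survive into the generated JSON.
SELECT to_char(0.001, 'FM999G990D000')    AS weight_fmt,      -- '0.001'
       to_char(987.5, 'FM999G999G990D00') AS total_incl_fmt   -- '987.50'
FROM dual;

The same approach would apply to the After Header assignment: something like :P12_NDB_TOTAL_INCL := to_char(v_totI, 'FM999G999G990D00'); keeps the trailing zero, since page items hold strings anyway.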