Redshift command - Copy adding a column with random numbers

I am doing an unload from Redshift to S3, followed by a copy back into Redshift -
unload (select * from tbl)
to <S3 location>
credentials <creds>
addquotes escape
copy tbl2
from <S3 location>
credentials <creds>
removequotes escape
My table columns are: int, text, text, text.
The COPY command is adding random numbers in the first int column, shifting the remaining columns to the right, and dropping the last column.
Does anyone have any idea why this could happen?
Original table -
col1 col2 col3 col4
1 abc def ghi jkl
2 mno pqr stu vwx
Copy Table -
col1 col2 col3 col4
123 1 abc def ghi
456 2 mno pqr stu
The unloaded data is correct.

At a guess, two things might be wrong. The first is that the column order of your unload and your copy target is different.
I would try the following:
open the file at <S3 location> in S3
copy the header line (line 1)
edit that header text, changing your delimiter to "," if it is not already
paste the edited column list into your copy command:
copy tbl2 (<column list from file>)
from <S3 location>
credentials <creds>
removequotes escape
If your S3 file lacks a header, then go back to your original export process and figure out what the column order is.
Less likely, you may be missing the
IGNOREHEADER 1
parameter on your copy. Let us know what you find.

Related

Regular expression in Hive for a specific string

I have a column in a Hive table which is an address column, and I want to split it into 2.
There are 2 scenarios to take care of.
Example:
Scenario 1:
Input column value:
ABC DEF123 AD
Output column values:
Column 1 should have ABC DEF
Column 2 should have 123 AD
Another example can be like below.
MICHAEL POSTON875 HYDERABAD
In this case the separation should be based on a number that is part of a string value; if a string has a number in it, then the two parts should be separated.
Scenario 2:
Input value: ABC DEFPO BOX 5232
Output:
Column 1:- ABC DEF
Column 2:- PO BOX 5232
Another example can be like below.
Hyderabad jhillsPO BOX 522002
In this case separation should be based on PO BOX
Both kinds of data are in the same column, and I would like to load the data into the target based on the string format, like a CASE statement; I am not sure about the approach.
NOTE: The string length can vary, as this is an address column.
Can someone please provide a Hive query and PySpark code for the same?
Using a CASE expression you can check which pattern the string matches, use regexp_replace to insert a delimiter, then split on that same delimiter.
Demo (Hive):
with mytable as (
select stack(4,
'ABC DEF123 AD',
'MICHAEL POSTON875 HYDERABAD',
'ABC DEFPO BOX 5232',
'Hyderabad jhillsPO BOX 522002'
) as str
) --Use your table instead of this
select columns[0] as col1, columns[1] as col2
from
(
select split(case when (str rlike 'PO BOX') then regexp_replace(str, 'PO BOX','|||PO BOX')
when (str rlike '[a-zA-Z ]+\\d+') then regexp_replace(str,'([a-zA-Z ]+)(\\d+.*)', '$1|||$2')
--add more cases and ELSE part
end,'\\|{3}') columns
from mytable
)s
Result:
col1 col2
ABC DEF 123 AD
MICHAEL POSTON 875 HYDERABAD
ABC DEF PO BOX 5232
Hyderabad jhills PO BOX 522002
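The question also asks for PySpark. A rough, untested sketch of the same approach in PySpark, assuming the strings are in a DataFrame column named str (the sample DataFrame below just reproduces the demo rows), could look like this:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Sample data standing in for your table
df = spark.createDataFrame(
    [("ABC DEF123 AD",),
     ("MICHAEL POSTON875 HYDERABAD",),
     ("ABC DEFPO BOX 5232",),
     ("Hyderabad jhillsPO BOX 522002",)],
    ["str"])

# Same CASE logic as the Hive demo: insert a ||| delimiter, then split on it
delimited = (F.when(F.col("str").rlike("PO BOX"),
                    F.regexp_replace("str", "PO BOX", "|||PO BOX"))
              .when(F.col("str").rlike("[a-zA-Z ]+\\d+"),
                    F.regexp_replace("str", "([a-zA-Z ]+)(\\d+.*)", "$1|||$2")))

result = (df.withColumn("columns", F.split(delimited, "\\|{3}"))
            .select(F.col("columns")[0].alias("col1"),
                    F.col("columns")[1].alias("col2")))
result.show(truncate=False)
The output should match the Hive result shown above.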

Redshift table update using two tables

I have a requirement to work on an update: table B needs to be updated using the data from table A. Please find below sample records from the two tables:
TABLE A
-----------------
colA | colB | colC
-----------------
1 AAA ABC
2 BBB DEF
3 CCC GHI
3 CCC HIJ
TABLE B
-----------------
colA1 | colB1 | colC1
-----------------
1 AAA
2 BBB
3 CCC
3 CCC
I need to update colC1 with the values of colC. The expected output is shown below:
TABLE B
-----------------
colA1 | colB1 | colC1
-----------------
1 AAA ABC
2 BBB DEF
3 CCC GHI
3 CCC HIJ
Do we need to use a cursor for this, or would a simple update statement like the one shown below do?
update table_b
set colC1 = table_a.colC
from table_a
where colA1 = table_a.colA
and colB1 = table_a.colB;
Your SQL seems perfectly fine.
Cursors are normally used for programmatic access to a database, where the program is stepping through the results one-at-a-time, with the cursor pointing to the 'current record'. That isn't needed in normal SQL update statements.
One thing to note: in Amazon Redshift, using an UPDATE on a row causes the existing row to be marked for deletion and a new row to be created. (This is a side-effect of using a columnar database.) If many rows are updated, the disk storage becomes less efficient. It can be improved by occasionally running VACUUM tablename, which will remove the deleted storage.

SAS: Adding observation and fill forward

I want to add an observation in SAS per group at a certain time and fill forward all values (except the time). I don't want to do it manually with datalines and proc append. Is there another way?
In the example: always insert a row per security at exactly 10:00am and use the value from the one above:
Security Time Value
ABC 9:59 2
ABC 10:01 3
.
.
.
DCE 9:58 9
DCE 10:01 3
.
.
Output:
Security Time Value
ABC 9:59 2
ABC 10:00 2
ABC 10:01 3
.
.
.
DCE 9:58 9
DCE 10:00 9
DCE 10:01 3
.
.
Thankful for any help!
Best
You can also use PROC SQL to insert a row:
PROC SQL;
INSERT INTO table_name
VALUES (value1,value2,value3,...);
QUIT;
OR
PROC SQL;
INSERT INTO table_name (column1,column2,column3,...)
VALUES (value1,value2,value3,...);
QUIT;

Hive SQL: adding sort by or distribute by makes the result file size bigger than before

My Hive tables all use LZO compression. I have two Hive SQL statements like this:
[1]
set hive.exec.compress.output=true;
set mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;
insert overwrite table a partition(dt='20160420')
select col1, col2 ... from b where dt='20160420';
Because SQL [1] has no reduce phase, it creates many small files.
[2]
set hive.exec.compress.output=true;
set mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;
insert overwrite table a partition(dt='20160420')
select col1, col2 ... from b where dt='20160420'
sort by col1;
The only difference is the last line: SQL [2] has the "sort by".
The row count and content are the same, but the file size of [2] is much bigger than [1]; our HDFS usage is almost twice what it was before.
Can you help me find the reason?

Compare columns in CSV and override newlist

I'm new to Python. I have two .csv files with identical columns, but my oldlist.csv has been edited so that row[9] contains employee names, while the newlist.csv, when generated, defaults to certain domains for the names. I want to take oldlist.csv, compare it to newlist.csv, and override the column in newlist.csv with the data in row[9] from oldlist.csv. Thanks for your help.
Example:
oldlist: col1, col2    newlist: col1, col2
         1234, Bob              1234, Jane
I want to read oldlist; if col1 matches col1 in newlist, override col2; and I want to continue to write.write(row) for every row, keeping the matching values from oldlist.
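A minimal sketch of that approach (untested; the file names, the merged.csv output, and the two-column layout are illustrative, since in the real files the name column is row[9]):
import csv

# Build a lookup of col1 -> edited name from the old list
with open("oldlist.csv", newline="") as old_file:
    old_names = {row[0]: row[1] for row in csv.reader(old_file) if len(row) > 1}

# Rewrite the new list, overriding col2 wherever col1 matches the old list
with open("newlist.csv", newline="") as new_file, \
     open("merged.csv", "w", newline="") as out_file:
    writer = csv.writer(out_file)
    for row in csv.reader(new_file):
        if row and row[0] in old_names:
            row[1] = old_names[row[0]]
        writer.writerow(row)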