Redshift Copy from s3 using for loop

I have many files in S3 to load.
I have created a manifest file at each prefix.
For instance, at s3://my-bucket/unit_1
I have the following files:
chunk1.csv.gz
chunk2.csv.gz
chunk3.csv.gz
chunk4.csv.gz
unit.manifest
With the COPY command I can load the unit_1 files into Redshift.
However, I have more than 1000 units, so I want to do it in a loop.
The loop should iterate from 1 to 1000 and change only the prefix of the manifest file.
So I wrote the procedure below:
create or replace procedure copy_loop()
language plpgsql
as $$
BEGIN
FOR i in 1..1000 LOOP
COPY mytable
FROM 's3://my-bucket/unit_%/unit.manifest', i
credentials 'aws_iam_role=arn:aws:iam::myrolearn'
MANIFEST
REGION 'ap-northeast-2'
REMOVEQUOTES
IGNOREHEADER 1
ESCAPE
DATEFORMAT 'auto'
TIMEFORMAT 'auto'
GZIP
DELIMITER '|'
ACCEPTINVCHARS '?'
COMPUPDATE FALSE
STATUPDATE FALSE
MAXERROR 0
BLANKSASNULL
EMPTYASNULL
NULL AS '\N'
EXPLICIT_IDS;
END LOOP;
END;
$$;
But I got this message
SQL Error [500310] [42601]: Amazon Invalid operation: syntax error at or near ",";
How can I handle this?

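This is my solution. COPY needs the S3 path as a single string literal, so I build the whole statement as a string and run it with EXECUTE.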
create or replace procedure copy_loop(i1 int, i2 int)
language plpgsql
as $$
DECLARE
prefix TEXT := 's3://mybucket/unit_';
manifest TEXT := '/unit.manifest' ;
manifest_location TEXT ;
copy_commands VARCHAR(2000) ;
copy_options VARCHAR(2000) := 'credentials '|| quote_literal('aws_iam_role=myrolearn')
|| ' MANIFEST '
|| ' REGION ' || quote_literal('ap-northeast-2')
|| ' REMOVEQUOTES '
|| ' IGNOREHEADER 1 '
|| ' ESCAPE '
|| ' DATEFORMAT ' || quote_literal('auto')
|| ' TIMEFORMAT ' || quote_literal('auto')
|| ' GZIP '
|| ' DELIMITER ' || quote_literal('|')
|| ' ACCEPTINVCHARS ' || quote_literal('?')
|| ' COMPUPDATE FALSE '
|| ' STATUPDATE FALSE '
|| ' MAXERROR 0 '
|| ' BLANKSASNULL '
|| ' EMPTYASNULL '
|| ' NULL AS ' || quote_literal('\N')
|| ' EXPLICIT_IDS ';
BEGIN
FOR i in i1..i2 LOOP
manifest_location := prefix || i || manifest;
copy_commands := 'COPY mytable FROM ' || quote_literal(manifest_location) || ' ' || copy_options;
execute copy_commands;
END LOOP;
END;
$$;
Using this procedure, I could copy the files from more than 1000 units.
Passing the start and end numbers of the loop as arguments also helped to divide the loading job. Since loading a large amount of data takes a few hours, I think it is better to run the load in chunks.
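As a rough illustration of the chunking idea, the sketch below splits the unit range into inclusive batches and prints one CALL statement per batch. The helper name unit_batches and the batch size of 250 are assumptions made for this example; only copy_loop(i1, i2) comes from the procedure above.

```python
def unit_batches(first, last, size):
    """Yield inclusive (start, end) ranges that cover first..last in steps of size."""
    for start in range(first, last + 1, size):
        # The last batch may be shorter than size, so clamp its end to last.
        yield start, min(start + size - 1, last)


# One CALL per batch; each can be run as a separate loading job.
calls = [f"CALL copy_loop({a}, {b});" for a, b in unit_batches(1, 1000, 250)]
for call in calls:
    print(call)
```

Each printed statement can then be run on its own, so a failed batch can be retried without repeating the batches that already loaded.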

Related

Handling different escaping sequences?

I'm using ANTLR with Presto grammar in order to parse SQL queries.
This is the original string definition I've used to parse queries:
STRING
: '\'' ( '\\' .
| ~[\\'] // match anything other than \ and '
| '\'\'' // match ''
)*
'\''
;
This worked ok for most queries until I saw queries with different escaping rules. For example:
select
table1(replace(replace(some_col,'\\'',''),'\"' ,'')) as features
from table1
So I've modified my String definition and now it looks like:
STRING
: '\'' ( '\\' .
| '\\\\' . {HelperUtils.isNeedSpecialEscaping(this)}? // match \ followed by any char
| ~[\\'] // match anything other than \ and '
| '\'\'' // match ''
)*
'\''
;
However, this won't work for the query mentioned above as I'm getting
'\\'',''),'
as a single string.
The predicate returns True for the query above.
Any idea how I can handle this query as well?
Thanks,
Nir.

In the end I was able to solve it. This is the expression I was using:
STRING
: '\'' ( '\\\\' . {HelperUtils.isNeedSpecialEscaping(this)}?
| '\\' (~[\\] | . {!HelperUtils.isNeedSpecialEscaping(this)}?)
| ~[\\'] // match anything other than \ and '
| '\'\'' // match ''
)*
'\''
;
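To see the difference between the two rules outside ANTLR, here is a small sketch using Python regular expressions as a stand-in for the lexer. It is only an approximation: Python tries alternatives in order, while ANTLR picks the longest match, and the special rule is applied unconditionally here instead of being gated by HelperUtils.isNeedSpecialEscaping. The names ORIGINAL and SPECIAL are made up for this example.

```python
import re

# Original rule: a backslash escapes the next character, '' is an escaped quote.
ORIGINAL = re.compile(r"'(?:\\.|[^\\']|'')*'")

# Modified rule: two backslashes plus the following character are tried first
# and consumed as one unit, so \\' does not end the string.
SPECIAL = re.compile(r"'(?:\\\\.|\\.|[^\\']|'')*'")

query = r"""replace(replace(some_col,'\\'',''),'\"' ,'')"""

# The first match of ORIGINAL is the over-long string from the question.
print(ORIGINAL.findall(query))

# SPECIAL splits the arguments into four separate string literals.
print(SPECIAL.findall(query))
```

With the special alternative first, the literals come out as '\\'', '', '\"' and '', the same four strings that the query passes to replace.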

grammar Question;
sql
@init {System.out.println("Question last update 2352");}
: replace+ EOF
;
replace
: REPLACE '(' expr ')'
;
expr
: ( replace | ID ) ',' STRING ',' STRING
;
REPLACE : 'replace' DIGIT? ;
ID : [a-zA-Z0-9_]+ ;
DIGIT : [0-9] ;
STRING : '\'' '\\\\\'' '\'' // '\\''
| '\'' '\'\'' '\'' // ''''
| '\'' ~[\\']* '\'\'' ~[\\']* '\'' // 'it is 8 o''clock'
| '\'' .*? '\'' ;
NL : '\r'? '\n' -> channel(HIDDEN) ;
WS : [ \t]+ -> channel(HIDDEN) ;
File input.txt (not having more examples, I can only guess) :
replace1(replace(some_col,'\\'',''),'\"' ,'')
replace2(some_col,'''','')
replace3(some_col,'abc\tdef\tghi','xyz')
replace4(some_col,'abc\ndef','xyz')
replace5(some_col,'it is 8 o''clock','8')
Execution :
$ alias a4='java -jar /usr/local/lib/antlr-4.9-complete.jar'
$ alias grun='java org.antlr.v4.gui.TestRig'
$ a4 Question.g4
$ javac Question*.java
$ grun Question sql -tokens input.txt
[@0,0:7='replace1',<REPLACE>,1:0]
[@1,8:8='(',<'('>,1:8]
[@2,9:15='replace',<REPLACE>,1:9]
[@3,16:16='(',<'('>,1:16]
[@4,17:24='some_col',<ID>,1:17]
[@5,25:25=',',<','>,1:25]
[@6,26:30=''\\''',<STRING>,1:26]
[@7,31:31=',',<','>,1:31]
[@8,32:33='''',<STRING>,1:32]
[@9,34:34=')',<')'>,1:34]
[@10,35:35=',',<','>,1:35]
[@11,36:39=''\"'',<STRING>,1:36]
[@12,40:40=' ',<WS>,channel=1,1:40]
[@13,41:41=',',<','>,1:41]
[@14,42:43='''',<STRING>,1:42]
[@15,44:44=')',<')'>,1:44]
[@16,45:45='\n',<NL>,channel=1,1:45]
[@17,46:53='replace2',<REPLACE>,2:0]
[@18,54:54='(',<'('>,2:8]
[@19,55:62='some_col',<ID>,2:9]
[@20,63:63=',',<','>,2:17]
[@21,64:67='''''',<STRING>,2:18]
[@22,68:68=',',<','>,2:22]
[@23,69:70='''',<STRING>,2:23]
[@24,71:71=')',<')'>,2:25]
[@25,72:72='\n',<NL>,channel=1,2:26]
[@26,73:80='replace3',<REPLACE>,3:0]
[@27,81:81='(',<'('>,3:8]
[@28,82:89='some_col',<ID>,3:9]
[@29,90:90=',',<','>,3:17]
[@30,91:105=''abc\tdef\tghi'',<STRING>,3:18]
[@31,106:106=',',<','>,3:33]
[@32,107:111=''xyz'',<STRING>,3:34]
[@33,112:112=')',<')'>,3:39]
[@34,113:113='\n',<NL>,channel=1,3:40]
[@35,114:121='replace4',<REPLACE>,4:0]
[@36,122:122='(',<'('>,4:8]
[@37,123:130='some_col',<ID>,4:9]
[@38,131:131=',',<','>,4:17]
[@39,132:141=''abc\ndef'',<STRING>,4:18]
[@40,142:142=',',<','>,4:28]
[@41,143:147=''xyz'',<STRING>,4:29]
[@42,148:148=')',<')'>,4:34]
[@43,149:149='\n',<NL>,channel=1,4:35]
[@44,150:157='replace5',<REPLACE>,5:0]
[@45,158:158='(',<'('>,5:8]
[@46,159:166='some_col',<ID>,5:9]
[@47,167:167=',',<','>,5:17]
[@48,168:185=''it is 8 o''clock'',<STRING>,5:18]
[@49,186:186=',',<','>,5:36]
[@50,187:189=''8'',<STRING>,5:37]
[@51,190:190=')',<')'>,5:40]
[@52,191:191='\n',<NL>,channel=1,5:41]
[@53,192:191='<EOF>',<EOF>,6:0]
Question last update 2352

Python remove empty element by list

I have this list that contains empty elements:
list = ['Caramanico Terme', ' ', 'Castellafiume', ' ', 'Castelvecchio Subequo', ' ', 'Falesia di ovindoli', ' ', 'Fara San Martino', ' ', "L'Aquila - Madonna d'Appari", ' ', 'La Palma Pazza (Bisegna AQ)', ' ', 'Liscia Palazzo', ' ', 'Luco dei marsi', ' ', 'Montebello di Bertona', ' ', 'Monticchio', ' ', 'Palena', ' ', 'Pennadomo', ' ', 'Pennapiedimonte', ' ', 'Pescomarrino', ' ', 'Petrella', ' ', 'Pianezza', ' ', 'Pietrasecca', ' ', ' ', 'PietrePiane', ' ', 'Pizzi di Lettopalena (loc. Fonte della Noce)', ' ', 'Placche di Bini', ' ', 'Roccamorice', ' ', 'Sasso di Lucoli', ' ', 'Villetta Barrea', ' ']
How can I remove these ' ' empty elements?
I have tried it this way:
[x for x in list if all(x)]
but the elements are not deleted.
Any help?
Thanks

First of all, make sure not to call your list list. That's a built-in type, and shadowing it will cause problems later. I renamed it to lst. Then you can filter the list the following way:
lst = ['Caramanico Terme', ' ', 'Castellafiume', ' ', 'Castelvecchio Subequo', ' ', 'Falesia di ovindoli', ' ', 'Fara San Martino', ' ', "L'Aquila - Madonna d'Appari", ' ', 'La Palma Pazza (Bisegna AQ)', ' ', 'Liscia Palazzo', ' ', 'Luco dei marsi', ' ', 'Montebello di Bertona', ' ', 'Monticchio', ' ', 'Palena', ' ', 'Pennadomo', ' ', 'Pennapiedimonte', ' ', 'Pescomarrino', ' ', 'Petrella', ' ', 'Pianezza', ' ', 'Pietrasecca', ' ', ' ', 'PietrePiane', ' ', 'Pizzi di Lettopalena (loc. Fonte della Noce)', ' ', 'Placche di Bini', ' ', 'Roccamorice', ' ', 'Sasso di Lucoli', ' ', 'Villetta Barrea', ' ']
filtered = [x for x in lst if len(x.strip()) > 0]
This will remove all kinds of whitespace-only elements, such as ' ' or '   '.
EDIT:
As corn3lius pointed out, this would work too:
filtered = [x for x in lst if x.strip()]
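The short sketch below shows why the all(x) attempt from the question leaves the list unchanged, next to the two filters that do remove something. The sample list items is a shortened stand-in invented for this example.

```python
items = ['Palena', ' ', '', 'Pennadomo']

# all(x) iterates over the characters of x. Every character, a space included,
# is a non-empty string and so truthy, and all('') is True for an empty
# iterable, so nothing is filtered out.
kept_by_all = [x for x in items if all(x)]

# A bare `if x` drops only '', because ' ' is still a non-empty string.
kept_by_truthiness = [x for x in items if x]

# strip() turns whitespace-only strings into '', which is falsy.
kept_by_strip = [x for x in items if x.strip()]

print(kept_by_all)
print(kept_by_truthiness)
print(kept_by_strip)
```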

You can add a condition in a list comprehension:
l = ['Caramanico Terme', ' ', 'Castellafiume', ' ', 'Castelvecchio Subequo', ' ', 'Falesia di ovindoli', ' ', 'Fara San Martino', ' ', "L'Aquila - Madonna d'Appari", ' ', 'La Palma Pazza (Bisegna AQ)', ' ', 'Liscia Palazzo', ' ', 'Luco dei marsi', ' ', 'Montebello di Bertona', ' ', 'Monticchio', ' ', 'Palena', ' ', 'Pennadomo', ' ', 'Pennapiedimonte', ' ', 'Pescomarrino', ' ', 'Petrella', ' ', 'Pianezza', ' ', 'Pietrasecca', ' ', ' ', 'PietrePiane', ' ', 'Pizzi di Lettopalena (loc. Fonte della Noce)', ' ', 'Placche di Bini', ' ', 'Roccamorice', ' ', 'Sasso di Lucoli', ' ', 'Villetta Barrea', ' ']
print([x for x in l if x != ' '])

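Removing every item that is '' (the empty string) is the same as building a new list from the elements whose length is greater than 0. Note that this drops only truly empty strings; an element like ' ' (a single space) needs strip() or an explicit comparison. This one-liner takes care of the empty strings: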
a = ['', 'apple', '', 'peach']
b = [i for i in a if i != '']

To remove empty items from a list when an "empty" item may hold one or more spaces inside the quotes, use strip() in the list comprehension.
Example:
temp_str = ' || 0X0C || 0X00000 || 0X00094 || 0X00E8C || IN_OPER || 000000e8cff7e000 || '
temp_str.split('||')
# result: [' ', ' 0X0C ', ' 0X00000 ', ' 0X00094 ', ' 0X00E8C ', ' IN_OPER ', ' 000000e8cff7e000 ', ' ']
temp_list = [ x for x in temp_str.split('||') if x]
temp_list
# result: [' ', ' 0X0C ', ' 0X00000 ', ' 0X00094 ', ' 0X00E8C ', ' IN_OPER ', ' 000000e8cff7e000 ', ' ']
temp_list = [ x for x in temp_str.split('||') if x.strip()]
temp_list
# result: [' 0X0C ', ' 0X00000 ', ' 0X00094 ', ' 0X00E8C ', ' IN_OPER ', ' 000000e8cff7e000 ']
temp_list = [ x.strip() for x in temp_str.split('||') if x.strip()]
temp_list
# result: ['0X0C', '0X00000', '0X00094', '0X00E8C', 'IN_OPER', '000000e8cff7e000']

How to add an orderby to a typo3 flow queryBuilder query?

This is my code; all I did to the existing working code was add the orderBy:
$queryBuilder->select('pa1')
->from('\SeeThroughWeb\Shop\Domain\Model\ProductArticle', 'pa1')
->join('pa1.productPrices', 'pp1')
->join('pa1.product', 'p')
->where('pp1.salePrice IN (' . $subQueryBuilder . ') AND pa1.status = ' . \SeeThroughWeb\Shop\Domain\Model\ProductArticle::STATUS_ACTIVE . ' AND (pa1.stock > 0 OR pa1.displayOutOfStock = 1) AND p.status = ' . \SeeThroughWeb\Shop\Domain\Model\Product::STATUS_ACTIVE . ' AND p.isFeatured = 1 AND p.deleted = 0')
->groupBy('p')
->orderBy('p.isgiftcard', 'ASC');
$result = $query->execute();
It doesn't seem to work; it gives me the exception:
MetaDataController.php line 176
What am I doing wrong?

string split in LINQ

This is what I tried using Split:
string[] req_info_texts = Regex.Matches(model_file_string_qts_corrected,
"RequirementInfo.*\"")
.OfType<Match>()
.Select(m=> m.Groups[0].Value.Split('\'').ToString())
.ToArray();
The lines in the string model_file_string_qts_corrected that match RequirementInfo.*\" look like this:
RequirementInfo "{'1' 2' 3'4 '5' 6'7' 8'syed_syed' 'SRDD_PFC_047602' } %GIDa_033022bd_8058_4216_8b9d_71454ba5f896"
There are n lines like the one above in the string.
I need syed_syed in the array req_info_texts.
But what I get is an index out of range exception.
Can you tell me what the mistake is?

string[] req_info_texts = Regex.Matches(input, "RequirementInfo.*\"")
.Cast<Match>()
.Select(m => m.Value
.Split('\'')
.Where(x => x.Contains("syed_syed"))
.Single()
).ToArray();

Given your input string is
RequirementInfo "{'other' ' ' '' 'true' 'syed_syed_GRP001' 'klajdskfjadklsjfklsa' } %GIDa_ed66dae7_2d68_4d07_9c67_a1cf1cb614cc" RequirementInfo "{'other' ' ' '' 'true' 'syed_syed_GRP001' 'klajdskfjadklsjfklsa' } %GIDa_b9a766f9_2b2b_4ca8_98f4_f693055b4792" RequirementInfo "{'other' ' ' '' 'true' 'syed_syed_GRP004' 'klajdskfjadklsjfklsa' } %GIDa_271d5326_cb57_4d87_8cd9_66687c0a1d32" RequirementInfo "{'other' ' ' '' 'true' 'syed_syed_GRP03' 'klajdskfjadklsjfklsa' } %GIDa_07ed6119_91d2_41f9_94dc_69d518503d64"
just with newlines, as you said in a comment on another question, you just need two splits:
var infosString = "RequirementInfo \"{'other' ' ' '' 'true' 'syed_syed_GRP001' 'klajdskfjadklsjfklsa' } %GIDa_ed66dae7_2d68_4d07_9c67_a1cf1cb614cc\"\nRequirementInfo \"{'other' ' ' '' 'true' 'syed_syed_GRP001' 'klajdskfjadklsjfklsa' } %GIDa_b9a766f9_2b2b_4ca8_98f4_f693055b4792\"\n RequirementInfo \"{'other' ' ' '' 'true' 'syed_syed_GRP004' 'klajdskfjadklsjfklsa' } %GIDa_271d5326_cb57_4d87_8cd9_66687c0a1d32\"\n RequirementInfo \"{'other' ' ' '' 'true' 'syed_syed_GRP03' 'klajdskfjadklsjfklsa' } %GIDa_07ed6119_91d2_41f9_94dc_69d518503d64";
var result = infosString.Split('\n').Select(line => line.Split('\'')[9]).ToArray();
result is now { "syed_syed_GRP001", "syed_syed_GRP001", "syed_syed_GRP004", "syed_syed_GRP03" }.
The first Split creates an array with the strings starting with RequirementInfo, and the Select splits each of these strings again and takes its 10th item (the one starting with syed_syed).
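For comparison, here is a minimal Python sketch of the same two-split idea, next to a pattern-based variant that does not depend on the value sitting at a fixed position. The two sample lines are shortened stand-ins with the same quoting layout as the input above, and the pattern assumes the wanted value always starts with syed_syed.

```python
import re

# Shortened, made-up lines that keep the quote layout of the real input.
lines = [
    "RequirementInfo \"{'other' ' ' '' 'true' 'syed_syed_GRP001' 'abc' } %GIDa_0001\"",
    "RequirementInfo \"{'other' ' ' '' 'true' 'syed_syed_GRP004' 'abc' } %GIDa_0002\"",
]
infos = "\n".join(lines)

# Two splits: first by line, then by single quote, taking the 10th piece.
by_index = [line.split("'")[9] for line in infos.split("\n")]

# Pattern variant: capture any quoted value that starts with syed_syed.
by_pattern = re.findall(r"'(syed_syed[^']*)'", infos)

print(by_index)
print(by_pattern)
```

The fixed index is shorter, but it breaks as soon as a line has a different number of quoted fields before the value; the pattern only relies on the prefix.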

regex doubt in gawk

My CSV data file looks like this:
title,name,gender
MRS.,MADHU,Female
MRS.,RAJ KUMAR,male
MR.,N,Male
MRS.,SHASHI,Female
MRS.,ALKA,Female
As you can see, I want to filter out rows like lines 2 and 3 (i.e. a valid name has no white space and a length >= 3):
MRS.,RAJ KUMAR,male
MR.,N,Male
and place them in a file called rejected_list.csv; all the rest go into a file called clean_list.csv.
Here is my gawk script for it:
gawk -F ',' '{
if( $2 ~ /\S/ &&
$1 ~ /MRS.|MR.|MS.|MISS.|MASTER.|SMT.|DR.|BABY.|PROF./ &&
$3 ~ /M|F|Male|Female/)
print $1","$2","$3 > "clean_list.csv";
else
print $1","$2","$3 > "rejected_list.csv" } ' \
< DATA_file.csv
My problem is that this script does not recognise the \S character class (any character except whitespace). It selects all names that start with or contain an S and rejects the rest.
A simple regex like /([A-Z])/ in place of /\S/ works perfectly, but as soon as I add a limit of {3,} the script fails:
gawk -F ',' '{
if( $2 ~ /([A-Z]){3,}/ &&
$1 ~ /MRS.|MR.|MS.|MISS.|MASTER.|SMT.|DR.|BABY.|PROF./ &&
$3 ~ /M|F|Male|Female/)
print $1","$2","$3 > "clean_list.csv";
else
print $1","$2","$3 > "rejected_list.csv" } ' \
< DATA_file.csv
I have tried all sorts of combinations of the regex with '*', '+' etc., but I can't get what I want.
Can anyone tell me what the problem is?

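Use [[:graph:]] instead of \S to match any printable, visible character (note the double brackets: [:graph:] is only valid inside a bracket expression). Your gawk evidently does not treat \S as "non-whitespace", so it matches a literal S instead.
Additionally, the {3,} interval expression only works when gawk runs in --posix or --re-interval mode, unless your gawk release enables intervals by default.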

I added one more rejection condition: the line does not have exactly 3 fields.
gawk -F, '
BEGIN {
# Anchored, with the dot escaped, so only an exact title or gender matches.
titles = "^(MRS|MR|MS|MISS|MASTER|SMT|DR|BABY|PROF)\\.$"
genders = "^(M|F|Male|Female)$"
}
$1 !~ titles || $2 ~ /[[:space:]]/ || length($2) < 3 || $3 !~ genders || NF != 3 {
print > "rejected_list.csv"
next
}
{ print > "clean_list.csv" }
' < DATA_file.csv
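The same rules can be sketched in Python to check them against the sample data. This is an illustration only, not a replacement for the gawk script: it tests titles and genders by exact set membership instead of a regex, and the names is_clean and split_rows are invented for the example.

```python
import csv
import io
import re

TITLES = {"MRS.", "MR.", "MS.", "MISS.", "MASTER.", "SMT.", "DR.", "BABY.", "PROF."}
GENDERS = {"M", "F", "Male", "Female"}


def is_clean(row):
    """A row is clean if it has 3 fields, a known title and gender,
    and a name of at least 3 characters with no whitespace."""
    if len(row) != 3:
        return False
    title, name, gender = row
    return (title in TITLES
            and gender in GENDERS
            and len(name) >= 3
            and not re.search(r"\s", name))


def split_rows(text):
    """Return (clean, rejected) lists of rows parsed from CSV text."""
    clean, rejected = [], []
    for row in csv.reader(io.StringIO(text)):
        (clean if is_clean(row) else rejected).append(row)
    return clean, rejected


sample = """title,name,gender
MRS.,MADHU,Female
MRS.,RAJ KUMAR,male
MR.,N,Male
MRS.,SHASHI,Female
MRS.,ALKA,Female
"""
clean, rejected = split_rows(sample)
print(clean)
print(rejected)
```

As in the gawk version, the header line ends up in the rejected list because "title" is not a known title.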
