How to export Snowflake S3 data files to my AWS S3?

The Snowflake S3 data is in .txt.bz2 format. I need to export the data files from this Snowflake S3 location to my own AWS S3, and the exported results must be in the same format as in the source location. This is what I tried:
COPY INTO @mystage/folder from
(select $1||'|'||$2||'|'||$3||'|'||$4||'|'||$5||'|'||$6||'|'||$7||'|'||$8||'|'||$9||'|'||$10||'|'||$11||'|'||$12||'|'||$13||'|'||$14||'|'||$15||'|'||$16||'|'||$17||'|'||$18||'|'||$19||'|'||$20||'|'||$21||'|'||$22||'|'||$23||'|'||$24||'|'||$25||'|'||$26||'|'||$27||'|'||$28||'|'||$29||'|'||$30||'|'||$31||'|'||$32||'|'||$33||'|'||$34||'|'||$35||'|'||$36||'|'||$37||'|'||$38||'|'||$39||'|'||$40||'|'||$41||'|'||$42||'|'||$43
from @databasename)
CREDENTIALS = (AWS_KEY_ID = '*****' AWS_SECRET_KEY = '*****')
PATTERN = '*/*.txt.bz2'
file_format = (TYPE='CSV' COMPRESSION='BZ2');

Right now, Snowflake does not support exporting data to files in bz2.
My suggestion is to set COMPRESSION='gzip'; then you can export the data to your S3 in gzip.
If exporting files in bz2 is a high priority for you, please contact Snowflake support.
If you want to unload the bz2 files from a Snowflake stage to your own S3, you can do something like this:
COPY INTO @myS3stage/folder from
(select $1||'|'||$2||'|'||$3||'|'||$4||'|'||$5||'|'||$6||'|'||$7||'|'||$8||'|'||$9||'|'||$10||'|'||$11||'|'||$12||'|'||$13||'|'||$14||'|'||$15||'|'||$16||'|'||$17||'|'||$18||'|'||$19||'|'||$20||'|'||$21||'|'||$22||'|'||$23||'|'||$24||'|'||$25||'|'||$26||'|'||$27||'|'||$28||'|'||$29||'|'||$30||'|'||$31||'|'||$32||'|'||$33||'|'||$34||'|'||$35||'|'||$36||'|'||$37||'|'||$38||'|'||$39||'|'||$40||'|'||$41||'|'||$42||'|'||$43
from @snowflakeStage (PATTERN => '*/*.txt.bz2'))
CREDENTIALS = (AWS_KEY_ID = '*****' AWS_SECRET_KEY = '*****')
file_format = (TYPE='CSV');
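If gzip output works for you, the same unload can simply use COMPRESSION='GZIP' in the file format. As a minimal sketch of running it from Python (not part of the original answer; it assumes the snowflake-connector-python package, placeholder credentials, and a shortened column list):

import snowflake.connector

# Placeholder connection parameters; replace with your account details.
conn = snowflake.connector.connect(account="<account>", user="<user>", password="<password>")

# Shortened column list for readability; extend it to $43 as in the query above.
unload_sql = """
COPY INTO @myS3stage/folder FROM
  (SELECT $1||'|'||$2||'|'||$3
   FROM @snowflakeStage (PATTERN => '*/*.txt.bz2'))
CREDENTIALS = (AWS_KEY_ID = '*****' AWS_SECRET_KEY = '*****')
FILE_FORMAT = (TYPE='CSV' COMPRESSION='GZIP')
"""
conn.cursor().execute(unload_sql)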

Related

How to upload files to S3 bucket using mime type

I am trying to figure out, using the AWS CLI, how to upload only certain files to an S3 bucket based on their MIME type.
Currently I am using a Terraform script to do that:
locals {
  mime_types = jsondecode(file("${path.module}/mime.json"))
}

resource "aws_s3_object" "frontend_bucket_objects" {
  for_each     = fileset("${var.build_folder}", "**")
  bucket       = "frontend-bucket"
  key          = each.value
  source       = "${var.build_folder}\\${each.value}"
  content_type = lookup(local.mime_types, regex("\\.[^.]+$", "${each.value}"), null)
  etag         = filemd5("${var.build_folder}\\${each.value}")
}
mime.json
{
  ".aac": "audio/aac",
  ".abw": "application/x-abiword",
  ".arc": "application/x-freearc",
  ".avif": "image/avif",
  ".avi": "video/x-msvideo",
  ".azw": "application/vnd.amazon.ebook",
  ".bin": "application/octet-stream",
  ".bmp": "image/bmp",
  ".css": "text/css",
  ".csv": "text/csv",
  ".doc": "application/msword",
  ... etc.
}
I want to upload using the AWS CLI but am not able to figure out how to include files based on MIME type.
This is what I have right now, which uploads the entire source folder:
aws s3 cp build/ s3://frontend-bucket --recursive
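The CLI's --include/--exclude filters match on key patterns rather than MIME types, so one option is a short boto3 script instead. This is only a sketch (not from the original post): it assumes Python's built-in mimetypes module covers your extensions, and the bucket and folder names are placeholders.

import mimetypes
import os

import boto3

s3 = boto3.client("s3")
build_folder = "build"        # placeholder: local build output folder
bucket = "frontend-bucket"    # placeholder: target bucket

for root, _dirs, files in os.walk(build_folder):
    for name in files:
        path = os.path.join(root, name)
        key = os.path.relpath(path, build_folder).replace(os.sep, "/")
        # Guess the MIME type from the file extension; fall back to a generic type.
        content_type, _ = mimetypes.guess_type(name)
        s3.upload_file(path, bucket, key,
                       ExtraArgs={"ContentType": content_type or "application/octet-stream"})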

How to determine if a string is located in AWS S3 CSV file

I have a CSV file in AWS S3.
The file is very large: 2.5 gigabytes.
The file has a single column of strings, with over 120 million rows:
apc.com
xyz.com
ggg.com
dddd.com
...
How can I query the file to determine if the string xyz.com is located in the file?
I only need to know if the string is there or not, I don't need to return the file.
Also, it would be great if I could pass multiple strings to search for and return only the ones that were found in the file.
For example:
Query => ['xyz.com','fds.com','ggg.com']
Will return => ['xyz.com','ggg.com']
The "S3 Select" SelectObjectContent API enables applications to retrieve only a subset of data from an object by using simple SQL expressions. Here's a Python example:
import boto3

client = boto3.client("s3")

res = client.select_object_content(
    Bucket="my-bucket",
    Key="my.csv",
    ExpressionType="SQL",
    InputSerialization={"CSV": {"FileHeaderInfo": "NONE"}},  # or IGNORE, USE
    OutputSerialization={"JSON": {}},
    Expression="SELECT * FROM S3Object s WHERE s._1 IN ('xyz.com', 'ggg.com')",  # _1 refers to the first column
)
See this AWS blog post for an example with output parsing.
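For reference, a minimal sketch of parsing that response (it assumes the res object from the call above; select_object_content returns an event stream in which matching rows arrive as Records events):

# Collect the matched rows from the streamed response.
matches = ""
for event in res["Payload"]:
    if "Records" in event:
        matches += event["Records"]["Payload"].decode("utf-8")
print(matches)  # one JSON object per matching row, e.g. {"_1":"xyz.com"}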
If you use the aws s3 cp command you can send the output to stdout:
aws s3 cp s3://yourbucket/foo.csv - | grep 'apc.com'
The dash (-) sends the output to stdout.
Here are two examples of grep checking for multiple patterns:
aws s3 cp s3://yourbucket/foo.csv - | grep -e 'apc.com' -e 'dddd.com'
aws s3 cp s3://yourbucket/foo.csv - | grep 'apc.com\|dddd.com'
To learn more about grep, please look at the manual: GNU Grep 3.7

Read/write to AWS S3 from Apache Spark Kubernetes container via vpc endpoint giving 400 Bad Request

I am trying to read and write data to AWS S3 from an Apache Spark Kubernetes container via a VPC endpoint.
The Kubernetes container is on premises (data center) in the US region. The following is the PySpark code to connect to S3:
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

conf = (
    SparkConf()
    .setAppName("PySpark S3 Example")
    .set("spark.hadoop.fs.s3a.endpoint.region", "us-east-1")
    .set("spark.hadoop.fs.s3a.endpoint", "<vpc-endpoint>")
    .set("spark.hadoop.fs.s3a.access.key", "<access_key>")
    .set("spark.hadoop.fs.s3a.secret.key", "<secret_key>")
    .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .set("spark.driver.extraJavaOptions", "-Dcom.amazonaws.services.s3.enforceV4=true")
    .set("spark.executor.extraJavaOptions", "-Dcom.amazonaws.services.s3.enableV4=true")
    .set("spark.executor.extraJavaOptions", "-Dcom.amazonaws.services.s3.enforceV4=true")
    .set("spark.fs.s3a.path.style.access", "true")
    .set("spark.hadoop.fs.s3a.server-side-encryption-algorithm", "SSE-KMS")
    .set("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
)
spark = SparkSession.builder.config(conf=conf).getOrCreate()

data = [{"key1": "value1", "key2": "value2"}, {"key1": "val1", "key2": "val2"}]
df = spark.createDataFrame(data)
df.write.format("json").mode("append").save("s3a://<bucket-name>/test/")
Exception Raised:
py4j.protocol.Py4JJavaError: An error occurred while calling o91.save.
: org.apache.hadoop.fs.s3a.AWSBadRequestException: doesBucketExist on <bucket-name>
: com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: <requestID>;
Any help would be appreciated.
Unless your Hadoop S3A client is region aware (3.3.1+), setting that region option won't work. There's an AWS SDK option, aws.region, which you can set as a system property instead.
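As a small sketch of that (not from the original answer; us-east-1 is a placeholder for your bucket's region), the property can be passed through the extraJavaOptions settings already used in the question:

# Append the AWS SDK region system property to the driver and executor JVM options
# (combine it with any -D flags you already pass; us-east-1 is a placeholder).
java_opts = "-Dcom.amazonaws.services.s3.enforceV4=true -Daws.region=us-east-1"
conf = (
    conf
    .set("spark.driver.extraJavaOptions", java_opts)
    .set("spark.executor.extraJavaOptions", java_opts)
)
spark = SparkSession.builder.config(conf=conf).getOrCreate()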

Send cloud-init script to gcp with terraform

How do I send a cloud-init script to a GCP instance using Terraform?
Documentation is very sparse around this topic.
You need the following:
A cloud-init file (say 'conf.yaml')
#cloud-config
# Create an empty file on the system
write_files:
- path: /root/CLOUD_INIT_WAS_HERE
A cloudinit_config data source
gzip and base64_encode must be set to false (they are true by default).
data "cloudinit_config" "conf" {
gzip = false
base64_encode = false
part {
content_type = "text/cloud-config"
content = file("conf.yaml")
filename = "conf.yaml"
}
}
A metadata section under the google_compute_instance resource
metadata = {
  user-data = "${data.cloudinit_config.conf.rendered}"
}

Cannot load lzop-compressed files from S3 into Redshift

I am attempting to copy an lzop-compressed file from S3 to Redshift. The file was originally generated by using S3DistCp with the --outputCodec lzo option.
The S3 file seems to be compressed correctly, since I can successfully download and inflate it at the command line:
lzop -d downloaded_file.lzo
But when I attempt to load it into Redshift, I get an error:
COPY atomic.events FROM 's3://path-to/bucket/' CREDENTIALS 'aws_access_key_id=xxx;aws_secret_access_key=xxx' REGION AS 'eu-west-1' DELIMITER '\t' MAXERROR 1 EMPTYASNULL FILLRECORD TRUNCATECOLUMNS TIMEFORMAT 'auto' ACCEPTINVCHARS LZOP;
ERROR: failed to inflate with lzop: unexpected end of file.
DETAIL:
-----------------------------------------------
error: failed to inflate with lzop: unexpected end of file.
code: 9001
context: S3 key being read : s3://path-to/bucket/
query: 244
location: table_s3_scanner.cpp:348
process: query0_60 [pid=5615]
-----------------------------------------------
Any ideas on what might be causing the load to fail?
Try specifying the exact file name.
s3://path-to/bucket/THE_FILE_NAME.extension
The path you used will make COPY iterate through all the files available under that prefix. It looks like there may be other types of files in the same folder (e.g., a manifest file):
COPY atomic.events
FROM 's3://path-to/bucket/THE_FILE_NAME.extension'
CREDENTIALS 'aws_access_key_id=xxx;aws_secret_access_key=xxx'
REGION AS 'eu-west-1'
DELIMITER '\t'
MAXERROR 1
EMPTYASNULL
FILLRECORD
TRUNCATECOLUMNS
TIMEFORMAT 'auto'
ACCEPTINVCHARS
LZOP;
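To confirm what that prefix actually matches, here is a small sketch (not from the original answer; the bucket and prefix are placeholders) that lists the keys so a manifest or other non-.lzo file stands out:

import boto3

s3 = boto3.client("s3")
# Placeholders: use the bucket and key prefix from your COPY statement.
resp = s3.list_objects_v2(Bucket="your-bucket", Prefix="path/to/prefix/")
for obj in resp.get("Contents", []):
    note = "" if obj["Key"].endswith(".lzo") else "  <-- not an .lzo data file"
    print(obj["Key"], obj["Size"], note)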