I am trying to send data to Amazon Redshift through Kinesis via S3 using the COPY command. The data sits in S3 as well-formed JSON, but when it reaches Redshift, many of the column values are empty. I have tried everything I can think of and haven't been able to fix this for the past few days.
Data format in s3:
{"name":"abc", "clientID":"ahdjxjxjcnbddnn"}
{"name":"def", "clientID":"uhrbdkdndbdbnn"}
Redshift table structure:
User (
    name varchar(20),
    clientID varchar(25)
)
In Redshift, only one of the two fields gets populated.
I have used JSON 'auto' in the COPY command.
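A minimal sketch of the COPY I am running, shown here executed through psycopg2 for testing; the cluster endpoint, credentials, bucket and IAM role below are placeholders:

import psycopg2

# Placeholder cluster endpoint, credentials, bucket and IAM role -- not my real values.
conn = psycopg2.connect(
    host="my-cluster.xxxx.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="dev", user="admin", password="...",
)
with conn, conn.cursor() as cur:
    # "user" is the table from the structure above; it needs quoting because
    # USER is a reserved word in Redshift.
    cur.execute("""
        COPY "user"
        FROM 's3://my-bucket/firehose-prefix/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
        FORMAT AS JSON 'auto';
    """)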
Problem statement: CSV data is currently stored in S3 (extracted from a PostgreSQL RDS instance), and I need to query this S3 data using Athena. To achieve this, I created an AWS Glue database and ran a crawler on the S3 bucket, but the data returned by the Athena query is broken (starting from the columns with large text content). I tried changing the data type in the Glue table schema from string to varchar(1000) and recrawling, but it still breaks.
Data stored in the S3 bucket:
Data coming out of an Athena query on the same bucket (using SELECT *) [note the missing row]:
I also tested loading the S3 data in a Jupyter notebook in AWS Glue Studio with the code snippet below, and the output data looks correct there:
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

dynamicFrame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://..."]},
    format="csv",
    format_options={
        "withHeader": True,
        # "optimizePerformance": True,
    },
)
Any help on this would be greatly appreciated!
I'm querying data from the Glue catalog. For some tables I can see the data, but for others I get the error below:
Error opening Hive split s3://test/sample/run-1-part-r-03 (offset=0, length=1156) using org.apache.hadoop.mapred.TextInputFormat: Permission denied on S3 path: s3://test/sample/run-1-part-r-03
I have given full access to Athena.
Amazon Athena adopts the permissions from the user when accessing Amazon S3.
If the user can access the objects in Amazon S3, then they can access them via Amazon Athena.
Does the user who ran the command have access to those objects?
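One quick way to check is to try to read one of the objects from the error message with the same credentials that user runs Athena with; a rough sketch with boto3, using the bucket and key taken from the error above:

import boto3
from botocore.exceptions import ClientError

# Run this with the same IAM user/role credentials that the Athena query uses.
# Bucket and key come from the "Permission denied" error above.
s3 = boto3.client("s3")
try:
    s3.head_object(Bucket="test", Key="sample/run-1-part-r-03")
    print("This principal can read the object, so the problem is elsewhere.")
except ClientError as err:
    # A 403 here is the same permission problem Athena is reporting.
    print("Access check failed:", err.response["Error"]["Code"])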
I am following the AWS documentation on how to transfer a DDB table from one account to another. There are two steps:
Export DDB table into Amazon S3
Use a Glue job to read the files from the Amazon S3 bucket and write them to the target DynamoDB table
I was able to do the first step. Unfortunately, the instructions don't say how to do the second step. I have worked with Glue a couple of times, but the console UI is very user-unfriendly and I have no idea how to achieve this.
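For reference, I ran the export (step 1) with something along the lines of the snippet below, so the files in S3 are in the standard DynamoDB export layout; the table ARN and bucket are placeholders:

import boto3

# Placeholder table ARN and bucket. The export requires point-in-time recovery
# to be enabled on the source table.
dynamodb = boto3.client("dynamodb")
dynamodb.export_table_to_point_in_time(
    TableArn="arn:aws:dynamodb:us-east-1:111122223333:table/my-source-table",
    S3Bucket="my-export-bucket",
    ExportFormat="DYNAMODB_JSON",
)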
Can somebody please explain how to import the data from S3 into the DDB?
You could use Glue Studio to generate a script.
Log into AWS
Go to Glue
Go to Glue Studio
Set up the source, basically pointing it to S3,
and then use something like the code below. This is for a DynamoDB table with pk and sk as a composite primary key.
It just applies the column mapping to the DynamicFrame and writes it to DynamoDB.
# Map the exported DynamoDB JSON attributes straight through.
ApplyMapping_node2 = ApplyMapping.apply(
    frame=S3bucket_node1,
    mappings=[
        ("Item.pk.S", "string", "Item.pk.S", "string"),
        ("Item.sk.S", "string", "Item.sk.S", "string"),
    ],
    transformation_ctx="ApplyMapping_node2",
)

# Write the mapped frame to the target DynamoDB table.
S3bucket_node3 = glueContext.write_dynamic_frame.from_options(
    frame=ApplyMapping_node2,
    connection_type="dynamodb",
    connection_options={"dynamodb.output.tableName": "my-target-table"},
)
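The mapping above reads from an S3bucket_node1 source node that isn't shown; a sketch of what it could look like, assuming the files are the DYNAMODB_JSON output of the S3 export under its data/ prefix (bucket and export id are placeholders):

# Assumed source node for the mapping above: reads the gzip-compressed
# DynamoDB JSON files that the S3 export writes under its data/ prefix.
# Bucket name and export id are placeholders.
S3bucket_node1 = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-export-bucket/AWSDynamoDB/<export-id>/data/"]},
    format="json",
    transformation_ctx="S3bucket_node1",
)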
I am using the COPY command to populate a Redshift database from an S3 bucket. They are in different regions, so I added
... FORMAT AS PARQUET REGION AS 'us-east-1'
but this gives the error:
psycopg2.errors.FeatureNotSupported: REGION argument is not supported for PARQUET based COPY
Can someone suggest a solution for this?
It's true, the REGION option is not supported for COPY from columnar data formats (ORC and PARQUET).
The docs say: The Amazon S3 bucket must be in the same AWS Region as the Amazon Redshift cluster.
Only the following COPY parameters are supported:
FROM
IAM_ROLE
CREDENTIALS
STATUPDATE
MANIFEST
ACCESS_KEY_ID, SECRET_ACCESS_KEY, and SESSION_TOKEN.
My suggestions: either transfer the data from the source bucket to an S3 bucket in the cluster's region (you can follow this guide from AWS: https://aws.amazon.com/premiumsupport/knowledge-center/move-objects-s3-bucket/),
or launch your cluster in the region where your data is. Copying the data within S3 is usually the more practical option.
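For the first option, the objects can be copied across regions with a short boto3 script like the sketch below (bucket names, regions and prefix are placeholders); once the Parquet files sit in a bucket in the cluster's region, the same COPY works with FORMAT AS PARQUET and no REGION clause.

import boto3

# Placeholder buckets, regions and prefix. The destination bucket must be in the
# same region as the Redshift cluster; list via the source region, copy via the
# destination region.
src_bucket, src_region = "my-data-us-east-1", "us-east-1"
dst_bucket, dst_region = "my-data-us-west-2", "us-west-2"   # cluster's region

src = boto3.resource("s3", region_name=src_region)
dst = boto3.resource("s3", region_name=dst_region)

for obj in src.Bucket(src_bucket).objects.filter(Prefix="parquet/"):
    # Managed server-side copy into the destination bucket.
    dst.Bucket(dst_bucket).copy(
        {"Bucket": src_bucket, "Key": obj.key},
        obj.key,
        SourceClient=src.meta.client,
    )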
Currently, I have several thousand headerless, pipe-delimited, GZIP-compressed files in S3, totaling ~10 TB, all with the same schema. What is the best way, in AWS Glue, to (1) add a header, (2) convert to Parquet format partitioned by week using a "date" field in the files, and (3) have the files added to the Glue Data Catalog so they can be queried in AWS Athena?
1) Create an Athena table pointing at your data on S3:
Create external table on Athena
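If you prefer to do this from code rather than the console, a rough sketch of step 1 via the Athena API; the database, table, column names, delimiter and S3 paths below are assumptions based on the question:

import boto3

# Assumed database/table/column names and S3 locations -- adjust to your data.
# Athena reads the GZIP files transparently; the files are headerless and
# pipe-delimited, so the column order in the DDL supplies the "header".
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.raw_events (
    id string,
    payload string,
    `date` string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION 's3://my-bucket/raw/'
"""

athena = boto3.client("athena")
athena.start_query_execution(
    QueryString=ddl,
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-query-results/"},
)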
2) Create a DynamicFrame from the Glue catalog, using the table you created in the step above.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())
DyF = glueContext.create_dynamic_frame.from_catalog(database="{{database}}", table_name="{{table_name}}")
3) Write the data back to a new S3 location in whatever format you like:
glueContext.write_dynamic_frame.from_options(
    frame=DyF,
    connection_type="s3",
    connection_options={"path": "path to new s3 location"},
    format="parquet",
)
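If you also need the weekly partitioning mentioned in the question, the same write call accepts partitionKeys; a sketch that assumes the "date" column parses as a date (pass the appropriate format to to_date if it is stored differently):

from awsglue.dynamicframe import DynamicFrame
from pyspark.sql import functions as F

# Derive year/week columns from the "date" field, then write Parquet partitioned
# by them. Column names and the date format are assumptions.
df = DyF.toDF().withColumn("dt", F.to_date(F.col("date")))
df = df.withColumn("year", F.year("dt")).withColumn("week", F.weekofyear("dt")).drop("dt")
partitionedDyF = DynamicFrame.fromDF(df, glueContext, "partitionedDyF")

glueContext.write_dynamic_frame.from_options(
    frame=partitionedDyF,
    connection_type="s3",
    connection_options={"path": "path to new s3 location", "partitionKeys": ["year", "week"]},
    format="parquet",
)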
4) Create an Athena table pointing at your Parquet data on S3:
Create external table on Athena
Note: Instead of creating the Athena table manually, you can also use a Glue crawler to create one for you. However, that will incur some charges.
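For completeness, the crawler route can also be scripted; a rough sketch with boto3, where the crawler name, IAM role, database and S3 path are placeholders:

import boto3

# Placeholder crawler name, IAM role, database and S3 path.
glue = boto3.client("glue")
glue.create_crawler(
    Name="parquet-weekly-crawler",
    Role="arn:aws:iam::123456789012:role/MyGlueServiceRole",
    DatabaseName="mydb",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/parquet/"}]},
)
glue.start_crawler(Name="parquet-weekly-crawler")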