I have successfully exported a table with a single row from DynamoDB to S3. I then cleared the table and tried to import the same file back in but I can't get it to work.
Paraphrasing paragraph 5.a of the "Import data from Amazon S3 to DynamoDB" documentation, I should put the file in s3://bucket[/prefix]/tablename/YYYY-MM-DD_HH.MM.
The export generates a different layout of the data, so I moved the file to where the documentation says, i.e. s3://mybucket/dynamodb/mytable/2014-05-29_14.32, and I configured the pipeline to look in s3://mybucket/dynamodb.
I then set up an import job, which ran without returning any error, yet the table was left empty.
The logs generated by the pipeline are not clear unfortunately.
Has anyone managed to import data in DynamoDB format from S3?
When a DynamoDB table is exported, the backup data is written to an S3 path of the form: output_s3_path/region/tableName/time
where
tableName is the DynamoDB table being backed up
region is the region of the table
output_s3_path is the "S3 Output Folder" field in the console/UI
For example, let
tableName = test
region = us-east-1
output_s3_path = s3://test-bucket
The backup is then generated at the S3 path s3://test-bucket/us-east-1/test/2014-05-30_06.08
To import this data, set "S3 Input Folder" to the generated path, i.e. "s3://test-bucket/us-east-1/test/2014-05-30_06.08". "S3 Input Folder" should be the S3 prefix where the data files exist.
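As a rough illustration of the path layout described above, here is a minimal Python sketch that assembles the same path from its parts. The function name backup_path is made up for this example, and the timestamp format is inferred from the example path (YYYY-MM-DD_HH.MM); it is not taken from the Data Pipeline source.

```python
from datetime import datetime


def backup_path(output_s3_path: str, region: str, table_name: str, when: datetime) -> str:
    """Build output_s3_path/region/tableName/time (hypothetical helper)."""
    # Timestamp format inferred from the example path, e.g. 2014-05-30_06.08
    timestamp = when.strftime("%Y-%m-%d_%H.%M")
    return "/".join([output_s3_path.rstrip("/"), region, table_name, timestamp])


path = backup_path("s3://test-bucket", "us-east-1", "test", datetime(2014, 5, 30, 6, 8))
print(path)
```

The printed value is what would go into "S3 Input Folder" for the import.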
I created a table in Athena without a crawler from S3 source. It is showing up in my datacatalog. However, when I try to access it through a python job in Glue ETL, it shows that it has no column or any data. The following error pops up when accessing a column: AttributeError: 'DataFrame' object has no attribute '<COLUMN-NAME>'.
I am trying to access the dynamic frame following the glue way:
datasource = glueContext.create_dynamic_frame.from_catalog(
database="datacatalog_database",
table_name="table_name",
transformation_ctx="datasource"
)
print(f"Count: {datasource.count()}")
print(f"Schema: {datasource.schema()}")
The above logs output: Count: 0 & Schema: StructType([], {}), where the Athena table shows I have around ~800,000 rows.
Sidenotes:
The ETL job concerned has AWSGlueServiceRole attached.
I tried Glue Visual Editor as well, it showed the datacatalog database/table concerned but sadly, same error.
It looks like the S3 bucket has multiple nested folders inside it. For Glue to read these folders, pass additional_options = {"recurse": True} to your from_catalog() call. This makes Glue read records recursively from the files under the S3 path.
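Applied to the call from the question, the suggestion might look like the sketch below. The wrapper read_catalog_table is a hypothetical helper; glue_context is assumed to be the GlueContext already created in the job, and the database and table names are the placeholders from the question.

```python
def read_catalog_table(glue_context, database, table_name, recurse=True):
    """Read a Data Catalog table, descending into nested S3 folders (sketch)."""
    return glue_context.create_dynamic_frame.from_catalog(
        database=database,
        table_name=table_name,
        transformation_ctx="datasource",
        # Ask Glue to read files in all subfolders under the table's S3 location
        additional_options={"recurse": recurse},
    )


# Inside the Glue job (assumes glueContext exists as in the question):
# datasource = read_catalog_table(glueContext, "datacatalog_database", "table_name")
# print(f"Count: {datasource.count()}")
```

If the count is still 0 after this, the table location itself is worth checking in the Data Catalog.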
I have a 2 GB CSV file (pipe separated) in S3.
I ran a Glue crawler on it, which created a new table.
When I run a query from AWS Athena it finds zero records (even though it returns the columns correctly).
I didn't apply any partitioning; I just ran the crawler with settings as close to the defaults as possible.
Any suggestions?
Note: I used the AWS console for all actions.
A possible reason the query returns no data is that you specified the file name in the S3 path when adding the crawler.
Let's say your bucket name is testbucket and the CSV file is test.csv:
when adding the crawler you need to specify the path as s3://testbucket/
and not s3://testbucket/test.csv
Also, if the fields are separated by pipes, they may all be displayed under a single column, because the file extension is .csv (comma separated). So ideally the fields should be comma separated in order to get the proper output.
So try specifying the path as described above; hopefully this returns the data.
If the data is still not returned, try creating a new crawler and, while creating it, create a new IAM role instead of reusing the existing one. Sometimes the IAM policies get in the way of fetching the data.
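To make the folder-versus-file point concrete, here is a small Python sketch that turns an S3 URI pointing at a file into the folder path to give the crawler. The helper crawler_include_path is hypothetical, and treating "last segment contains a dot" as "this is a file" is only a heuristic chosen for this example.

```python
import posixpath
from urllib.parse import urlparse


def crawler_include_path(s3_uri: str) -> str:
    """Return the folder to crawl for a given S3 URI (hypothetical helper)."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {s3_uri!r}")
    key = parsed.path.lstrip("/")
    # Heuristic: a last segment with a dot (test.csv) is treated as a file name
    if key and not key.endswith("/") and "." in posixpath.basename(key):
        key = posixpath.dirname(key)
    key = key.strip("/")
    return f"s3://{parsed.netloc}/{key + '/' if key else ''}"


print(crawler_include_path("s3://testbucket/test.csv"))
```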
I'm looking for a manual and an automatic way to use SQL Workbench to import/load a LOCAL CSV file into an AWS Redshift database.
The manual way could be clicking through a navigation bar and selecting an option.
The automatic way could be a query or command that loads the data when I run it.
Here's my attempt.
It fails with the error "my target table in AWS is not found", but I'm sure the table exists. Does anyone know why?
WbImport -type=text
-file ='C:\myfile.csv'
-delimiter = ,
-table = public.data_table_in_AWS
-quoteChar=^
-continueOnError=true
-multiLine=true
You can use WbImport in SQL Workbench/J to import data.
For more info: http://www.sql-workbench.net/manual/command-import.html
As mentioned in the comments, the COPY command provided by Redshift is the optimal solution. You can COPY from S3, EC2, etc.
S3 Example:
copy <your_table>
from 's3://<bucket>/<file>'
access_key_id 'XXXX'
secret_access_key 'XXXX'
region '<your_region>'
delimiter '\t';
For more examples:
https://docs.aws.amazon.com/redshift/latest/dg/r_COPY_command_examples.html
I have data in a table
select * from my_table
It contains 10k observations. How do I export the data in the table as CSV to an S3 bucket?
(I don't want to export the data to my local machine and then push it to S3.)
Please, please, please STOP labeling your questions with both PostgreSQL and Greenplum. The answer to your question is very different if you are using Greenplum versus PostgreSQL. I can't stress this enough.
If you are using Greenplum, you should use the S3 protocol in External Tables to read and write data to S3.
So your table:
select * from my_table;
And your external table (it must be WRITABLE so that you can INSERT into it):
CREATE WRITABLE EXTERNAL TABLE ext_my_table (LIKE my_table)
LOCATION ('s3://s3_endpoint/bucket_name')
FORMAT 'TEXT' (DELIMITER '|' NULL AS '' ESCAPE AS E'\\');
And then writing to your s3 bucket:
INSERT INTO ext_my_table SELECT * FROM my_table;
You will also need to configure your Greenplum cluster with an s3 configuration file. It goes in every segment directory:
gpseg_data_dir/gpseg-prefixN/s3/s3.conf
Example of the file contents:
[default]
secret = "secret"
accessid = "user access id"
threadnum = 3
chunksize = 67108864
More information on S3 can be found here: http://gpdb.docs.pivotal.io/5100/admin_guide/external/g-s3-protocol.html#amazon-emr__s3_config_file
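Since the same file has to exist in every segment directory, it can help to generate its contents once. Below is a minimal Python sketch using the standard configparser module; the function name render_s3_conf is made up for this example, and the default values simply mirror the sample file above (67108864 bytes is 64 MiB).

```python
import configparser
import io


def render_s3_conf(secret, access_id, threadnum=3, chunksize=64 * 1024 * 1024):
    """Render s3.conf contents as text (hypothetical helper)."""
    # interpolation=None so that a '%' in a secret is written literally
    config = configparser.ConfigParser(interpolation=None)
    config["default"] = {
        "secret": f'"{secret}"',
        "accessid": f'"{access_id}"',
        "threadnum": str(threadnum),
        "chunksize": str(chunksize),
    }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


text = render_s3_conf("secret", "user access id")
print(text)
```

The resulting text can then be copied to the s3/s3.conf location in each segment directory.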
I'd suggest first loading the data onto your master node using WinSCP or another file transfer tool.
Then move the file from your master node to S3 storage.
Moving data from the master node to S3 uses Amazon's bandwidth, so it will be much faster than transferring the file from a local machine to S3 over a local connection.
Part One:
I ran a Glue crawler on a dummy CSV loaded in S3. It created a table, but when I view the table in Athena and query it, it shows "Zero records returned".
The ELB demo data in Athena works fine, though.
Part Two (Scenario):
Suppose I have an Excel file and a data dictionary describing how, and in what format, the data is stored in that file. I want that data to be loaded into AWS Redshift. What would be the best way to achieve this?
I experienced the same issue. You need to give the folder path instead of the real file name to the crawler and run it. I tried with feeding folder name to the crawler and it worked. Hope this helps. Let me know. Thanks,
I experienced the same issue. Try creating a separate folder for each table in the S3 bucket, then rerun the Glue crawler. You will get a new table in the Glue Data Catalog with the same name as the S3 folder.
Delete the crawler and create it again (only one CSV file should be present in S3, no more), then run the crawler.
Important note:
with a single CSV file in place, after running the crawler we can view the records in Athena.
I was indeed providing the S3 folder path instead of the filename and still couldn't get Athena to return any records ("Zero records returned", "Data scanned: 0KB").
Turns out the problem was that the input files (my rotated log files automatically uploaded to S3 from Elastic Beanstalk) start with underscore (_), e.g. _var_log_nginx_rotated_access.log1534237261.gz! Apparently that's not allowed.
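A quick way to spot such files before crawling is to check the object keys. The sketch below is plain Python with no AWS calls; the function name is hypothetical, the first key is the example from above, and the leading-dot check is an addition based on the usual hidden-file convention rather than something stated in this thread.

```python
import posixpath


def is_skipped_by_athena(key: str) -> bool:
    """Return True if the file name looks hidden (hypothetical helper)."""
    name = posixpath.basename(key)
    # Leading underscore is the case reported above; leading dot is an
    # assumption based on the common hidden-file convention.
    return name.startswith("_") or name.startswith(".")


keys = [
    "logs/_var_log_nginx_rotated_access.log1534237261.gz",
    "logs/var_log_nginx_rotated_access.log1534237261.gz",
]
skipped = [k for k in keys if is_skipped_by_athena(k)]
print(skipped)
```

Renaming the flagged objects (for example by dropping the leading underscore) is one way around the problem.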
The structure of the s3 bucket / folder is very important :
s3://<bucketname>/<data-folder>/
/<type-1-[CSVs|Parquets etc]>/<files.[csv or parquet]>
/<type-2-[CSVs|Parquets etc]>/<files.[csv or parquet]>
...
/<type-N-[CSVs|Parquets etc]>/<files.[csv or parquet]>
and specify in the "include path" of the Glue Crawler:
s3://<bucketname e.g my-s3-bucket-ewhbfhvf>/<data-folder e.g data>
Solution: select the path of the folder, even if the folder contains many files. This will generate one table and the data will be displayed.
In many such cases, using an EXCLUDE PATTERN in the Glue Crawler helps me.
It is clear that we should point the crawler to the directory instead of directly to the file. If we still do not get any records after doing so, an exclude pattern comes to the rescue.
You will have to devise a pattern by which only the files you want get crawled and the rest are excluded. (I suggest this instead of creating a different directory for each file; in a production bucket, making such changes is often not feasible.)
I had data in an S3 bucket. There were multiple directories, and inside each directory there was a snappy parquet file and a json file. The json file was causing the issue.
So I ran the crawler on the master directory that contained the many directories, and as the EXCLUDE PATTERN I gave - * / *.json
This time it did not create any table for the json file, and I was able to see the records of the table using Athena.
for reference - https://docs.aws.amazon.com/glue/latest/dg/define-crawler.html
Pointing the Glue crawler to the S3 folder and not the actual file did the trick.
Here's what worked for me: I needed to move all of my CSVs into their own folders, just pointing Glue Crawler to the parent folder ('csv/' for me) was not enough.
csv/allergies.csv -> fails
csv/allergies/allergies.csv -> succeeds
Then, I just pointed AWS Glue Crawler to csv/ and everything was parsed out well.
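The renaming rule above (one folder per CSV, named after the file) can be sketched as a small Python function. The name per_table_key is made up for this example, and the sketch only computes the new key; actually moving the objects in S3 is left out.

```python
import posixpath


def per_table_key(key: str) -> str:
    """Map csv/allergies.csv to csv/allergies/allergies.csv (hypothetical helper)."""
    folder, filename = posixpath.split(key)
    table, _extension = posixpath.splitext(filename)
    # Put the file into a folder named after it, next to its old location
    return posixpath.join(folder, table, filename)


print(per_table_key("csv/allergies.csv"))
```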