I have a 2 GB CSV file (pipe separated) in S3.
I ran a Glue crawler on it, which created a new table.
When I run a query from AWS Athena, it returns zero records (even though it returns the columns correctly).
I didn't apply any partitioning; I just ran the crawler with settings as close to the defaults as possible.
Any suggestions?
Note: I used the AWS console for all actions.
One possible reason the query returns no data is that you included the file name in the S3 path when adding the crawler.
Let's say your bucket name is testbucket and the CSV file is test.csv.
When adding the crawler, you need to specify the path as s3://testbucket/
and not s3://testbucket/test.csv
Also, if the fields are separated by pipes, they may all end up in a single column, because the file extension is .csv (comma separated). So ideally the fields should be comma separated in order to get proper output.
So try specifying the path as described above. Hopefully this will return the data.
If the data is still not returned, try creating a new crawler, and when you create it do not reuse the existing IAM role. Create a new role instead. Sometimes the policies attached to an existing IAM role cause problems when fetching the data.
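The path rule above can be sketched in a few lines of Python. This is only an illustration: the function name crawler_include_path is made up here, and it assumes that a path whose last component has no trailing slash refers to a file rather than a folder.

```python
import posixpath
from urllib.parse import urlparse


def crawler_include_path(s3_uri: str) -> str:
    """Return a folder-level S3 path, dropping a trailing file name if present.

    Assumption: a last path component without a trailing slash is a file.
    """
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {s3_uri!r}")
    key = parsed.path.lstrip("/")
    if key and not key.endswith("/"):
        # Drop the last component, which we treat as the file name.
        key = posixpath.dirname(key)
    key = key.strip("/")
    if key:
        return f"s3://{parsed.netloc}/{key}/"
    return f"s3://{parsed.netloc}/"


print(crawler_include_path("s3://testbucket/test.csv"))
```

For the example from the answer, this prints s3://testbucket/, which is the value to give the crawler.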
Related
I have multiple files stored in an S3 bucket under uniquely named folders, and I would expect AWS Glue to put them into a single table. Instead it creates one table per file. Any ideas how to configure the crawler to get a single table?
The current S3 structure is s3://bucket_name/YYYYMMDDUUID/data.json:
20210801123123cfec/data.json
20210808876551cedc/data.json
....
20210810112313feed/data.json
The JSON schema is definitely not the problem, since it is similar across files. For example, when I change the folder names from the custom names to "1", "2", and so on, I get a single table with multiple partitions.
I'm running a SELECT Athena query on an S3 bucket manifest. I then want to use the results of that query, in .csv format, in an S3 Batch operation.
My query runs fine and I am able to access the .csv output via S3 Batch, but since the first row contains the column headers, S3 Batch throws an unrecoverable error because it thinks the manifest refers to multiple buckets.
How can I easily strip the column headers out of my results? I would prefer to do it in SQL. The file size makes using standard Unix tools prohibitive. I could use AWS Glue, but that seems like overkill for just suppressing headers in a SQL query.
Here's a hacky way to get around it
SELECT bucket as "my-bucket-name", key as "fakekey"
from your_athena_table
This makes your header row look like the rest of the file, so it will not break the S3 Batch copy job. You will just have one failed record, for the key fakekey.
I am trying to query an AWS S3 Inventory list using Athena. I can do this if I have only one source bucket, but I am not sure how to configure it to work with multiple source buckets.
We are using all the default configuration options, with CSV as the data format. The S3 Inventory destination path pattern for Hive looks like this:
destination-prefix/source-bucket/config-ID/hive/dt=YYYY-MM-DD-HH-MM/symlink.txt
So when I create an Athena table I have to use a static Hive path.
CREATE EXTERNAL TABLE your_table_name(
-- column names
)
PARTITIONED BY (dt string)
-- options ignored
LOCATION 's3://destination-prefix/source-bucket/config-ID/hive/';
So if I want to query inventory data for multiple source buckets, it seems like I have to create a table for each "source-bucket".
Alternatively, without using Athena, I am trying to do this with the AWS CLI:
aws s3 ls s3://our-bucket-name/prefix/abc --recursive | awk '$1 > "2019-04-01"'
But this lists every single file first, as there is no --include or --exclude option for "aws s3 ls".
Finally, the questions are:
Can I configure S3 Inventory to generate inventory for multiple S3 buckets so that it puts everything into the same "hive" directory (i.e. ignores the "source-bucket" prefix while generating the inventory)?
Is it possible to configure Athena to read from multiple Hive locations? With new buckets being created and dropped, I guess this gets ugly.
Is there any alternative way to query the inventory list, other than Athena, the AWS CLI, or writing custom code that uses the manifest.json file to find these CSV files?
You can't make S3 Inventory create one inventory for multiple buckets, unfortunately. You can however splice the inventories together into one table.
The guide you link to says to run MSCK REPAIR TABLE … to load your inventories. I recommend not doing that: it creates a table in which each partition represents the inventory at some point in time. That is useful if you want to compare what's in a bucket from day to day or week to week, but most of the time you want to know what's in the bucket right now. Skipping that command is also necessary to get multiple inventories into the same table.
First, change how you create the table slightly:
CREATE EXTERNAL TABLE your_table_name(
-- column names
)
PARTITIONED BY (bucket_name string)
-- options ignored
LOCATION 's3://destination-prefix/source-bucket/config-ID/hive/';
Notice that I changed the partitioning from dt string to bucket_name string.
Next you add the partitions manually:
ALTER TABLE your_table_name
ADD PARTITION (bucket_name = 'some-bucket') LOCATION 's3://destination-prefix/source-bucket/config-ID1/hive/dt=YYYY-MM-DD/'
ADD PARTITION (bucket_name = 'another-bucket') LOCATION 's3://destination-prefix/source-bucket/config-ID2/hive/dt=YYYY-MM-DD/';
The locations should be the S3 URIs to the latest dates under the "hive" prefix of the inventory for each bucket.
The downside of this is that when new inventories are delivered you will need to update the table to point to these new locations. You can do this by first dropping the partitions:
ALTER TABLE your_table_name
DROP PARTITION (bucket_name = 'some-bucket')
DROP PARTITION (bucket_name = 'another-bucket');
and then adding them again using the same SQL as above, but with new S3 URIs.
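The drop-and-re-add step lends itself to a small script. The following is a minimal sketch that only builds the SQL text, assuming you already know the latest inventory location for each bucket. The function name refresh_partition_statements and the example locations are hypothetical, and it emits one statement per partition with IF EXISTS / IF NOT EXISTS rather than the combined statements shown above.

```python
def refresh_partition_statements(table: str, latest_locations: dict) -> list:
    """Build ALTER TABLE statements that repoint each bucket's partition.

    latest_locations maps a bucket name to the S3 URI of its newest
    inventory under the "hive" prefix (supplied by the caller).
    """
    statements = []
    for bucket, location in sorted(latest_locations.items()):
        if "'" in bucket or "'" in location:
            # Keep the sketch simple: refuse values that would need quoting.
            raise ValueError("single quotes are not supported in this sketch")
        statements.append(
            f"ALTER TABLE {table} DROP IF EXISTS "
            f"PARTITION (bucket_name = '{bucket}')"
        )
        statements.append(
            f"ALTER TABLE {table} ADD IF NOT EXISTS "
            f"PARTITION (bucket_name = '{bucket}') LOCATION '{location}'"
        )
    return statements


# Hypothetical example values, following the path pattern from the question.
example = {
    "some-bucket": "s3://destination-prefix/some-bucket/config-ID1/hive/dt=2019-04-01-00-00/",
}
for sql in refresh_partition_statements("your_table_name", example):
    print(sql + ";")
```

Each statement can then be run in Athena one at a time, whenever a new inventory is delivered.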
I am trying to crawl some files that have different (but compatible) schemas using AWS Glue.
According to the AWS documentation, Glue crawlers update the catalog tables for any change in the schema (adding new columns and removing missing ones).
I checked "Update the table definition in the Data Catalog" and "Create a single schema for each S3 path" while creating the crawler.
Example:
Let's say I have a file "File1.csv" as shown below:
name,age,loc
Ravi,12,Ind
Joe,32,US
Say I have another file "File2.csv" as shown below:
name,age,height
Jack,12,160
Jane,32,180
After the crawler ran, the schema was updated to:
name,age,loc,height (this is as expected)
But when I tried to read the files using Athena, or tried writing the content of both files to CSV using a Glue ETL job, I observed that
the output looks like:
name,age,loc,height
Ravi,12,Ind,,
Joe,32,US,,
Jack,12,160,,
Jane,32,180,,
The last two rows should have a blank loc, because the second file has no loc column.
The expected output is:
name,age,loc,height
Ravi,12,Ind,,
Joe,32,US,,
Jack,12,,160
Jane,32,,180
In short, Glue fills the columns positionally, left to right, in the combined output. Is there any way I can get the expected output?
I got the expected output with Parquet files. Initially I was using CSV, but the CSV deserializer doesn't know how to put the values into the correct positions when the schema changes.
Converting the individual CSVs to Parquet and then crawling them one after another helped me accommodate the changing schema.
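To see why name-based formats behave differently from positional CSV, here is a small Python sketch that merges CSV inputs by header name rather than by position. It is not the Parquet approach from the answer, just an illustration of name-based alignment; the helper name merge_csv_by_header is made up, and it assumes every input starts with a header row.

```python
import csv
import io


def merge_csv_by_header(csv_texts):
    """Merge CSV texts by column name, leaving missing columns blank."""
    columns = []
    rows = []
    for text in csv_texts:
        reader = csv.DictReader(io.StringIO(text))
        # Collect column names in first-seen order across all inputs.
        for name in reader.fieldnames or []:
            if name not in columns:
                columns.append(name)
        rows.extend(reader)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


file1 = "name,age,loc\nRavi,12,Ind\nJoe,32,US\n"
file2 = "name,age,height\nJack,12,160\nJane,32,180\n"
print(merge_csv_by_header([file1, file2]))
```

With the two sample files from the question, the heights of Jack and Jane land in the height column and their loc stays blank.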
Part One:
I ran a Glue crawler on a dummy CSV loaded in S3. It created a table, but when I view the table in Athena and query it, it shows "Zero records returned".
The ELB demo data in Athena works fine, though.
Part Two (scenario):
Suppose I have an Excel file and a data dictionary describing how and in what format the data is stored in that file. I want that data loaded into AWS Redshift. What would be the best way to achieve this?
I experienced the same issue. You need to give the crawler the folder path instead of the actual file name and then run it. I tried giving the crawler the folder name and it worked. Hope this helps. Let me know. Thanks,
I experienced the same issue. Try creating a separate folder for each table in the S3 bucket, then rerun the Glue crawler. You will get a new table in the Glue Data Catalog with the same name as the S3 folder.
Delete the crawler and create it again (make sure only one CSV file is present in S3, then run the crawler).
Important note:
with a single CSV file in place, we can view the records in Athena after running the crawler.
I was indeed providing the S3 folder path instead of the filename and still couldn't get Athena to return any records ("Zero records returned", "Data scanned: 0KB").
Turns out the problem was that the input files (my rotated log files automatically uploaded to S3 from Elastic Beanstalk) start with underscore (_), e.g. _var_log_nginx_rotated_access.log1534237261.gz! Apparently that's not allowed.
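A quick way to spot such files before crawling is to check the object names. The sketch below is only an illustration: the function name is_hidden_from_athena is made up, and it relies on Athena treating files whose names start with an underscore or a dot as hidden.

```python
import posixpath


def is_hidden_from_athena(key: str) -> bool:
    """Return True if the object's file name starts with '_' or '.'."""
    name = posixpath.basename(key.rstrip("/"))
    return name.startswith(("_", "."))


keys = [
    "logs/_var_log_nginx_rotated_access.log1534237261.gz",
    "logs/var_log_nginx_rotated_access.log1534237261.gz",
]
for key in keys:
    print(key, is_hidden_from_athena(key))
```

Renaming or copying the flagged objects to names without the leading underscore should make them visible to queries.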
The structure of the S3 bucket / folder is very important:
s3://<bucketname>/<data-folder>/
/<type-1-[CSVs|Parquets etc]>/<files.[csv or parquet]>
/<type-2-[CSVs|Parquets etc]>/<files.[csv or parquet]>
...
/<type-N-[CSVs|Parquets etc]>/<files.[csv or parquet]>
and specify in the "include path" of the Glue Crawler:
s3://<bucketname e.g my-s3-bucket-ewhbfhvf>/<data-folder e.g data>
Solution: select the path of the folder, even if the folder contains many files. This will generate one table, and the data will be displayed.
In many such cases, using an exclude pattern in the Glue crawler has helped me.
It is certainly true that we should point the crawler to the directory instead of directly to the file. When even that returns no records, an exclude pattern comes to the rescue.
You will have to devise a pattern so that only the files you want are crawled and the rest are excluded. (I suggest this instead of creating a separate directory for each file, since such changes are usually not feasible in a production bucket.)
I had data in an S3 bucket with multiple directories, and inside each directory there were a Snappy Parquet file and a JSON file. The JSON file was causing the issue.
So I ran the crawler on the master directory that contained the many directories, and as the exclude pattern I gave - * / *.json
This time it did not create any table for the JSON files, and I was able to see the records of the table using Athena.
for reference - https://docs.aws.amazon.com/glue/latest/dg/define-crawler.html
Pointing the Glue crawler to the S3 folder and not the actual file did the trick.
Here's what worked for me: I needed to move all of my CSVs into their own folders, just pointing Glue Crawler to the parent folder ('csv/' for me) was not enough.
csv/allergies.csv -> fails
csv/allergies/allergies.csv -> succeeds
Then, I just pointed AWS Glue Crawler to csv/ and everything was parsed out well.
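The rename from the flat layout to the one-folder-per-file layout can be computed mechanically. This is a rough sketch that only derives the new key; the function name per_table_key is made up, and actually moving the objects in S3 is left out.

```python
import posixpath


def per_table_key(key: str) -> str:
    """Map 'csv/allergies.csv' to 'csv/allergies/allergies.csv'.

    Keys that already sit in a folder named after the file are returned as-is.
    """
    folder, filename = posixpath.split(key)
    stem, _ext = posixpath.splitext(filename)
    if posixpath.basename(folder) == stem:
        return key
    return posixpath.join(folder, stem, filename)


print(per_table_key("csv/allergies.csv"))
```

For the example above this prints csv/allergies/allergies.csv, the layout that succeeded.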