I need some help combining multiple files from different company partitions in S3 into one file, with the company name as one of the columns.
I am new to this and have not been able to find any information. I also spoke to support, and they say it is not supported, but in DataStage combining multiple files into one is a basic function.
Please shed some light on this.
Regards,
Prakash
If the column names are the same in each file and the number of columns is also the same, Glue will automatically combine them.
Make sure the files you want to combine are in the same folder in S3 and that your Glue crawler is pointing to that folder.
Review the AWS Glue examples, particularly the Join and Relationalize Data in S3 example. It shows you how to use a Python script to do joins and filters with transforms.
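If it helps to see what that can look like in code, here is a minimal PySpark sketch (not the AWS sample itself) of one way to combine per-company files and carry the company name along as a column. The bucket name, the company=&lt;name&gt;/ path layout, and the output location are placeholders, not your real setup.

```python
# Minimal PySpark sketch: read every company's CSVs in one pass, derive the
# company name from the S3 path, and write one combined output.
# Bucket name and path layout ("company=<name>/") are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name, regexp_extract

spark = SparkSession.builder.appName("combine-company-files").getOrCreate()

# Read all companies' files in a single pass
df = spark.read.option("header", "true").csv("s3://my-bucket/company=*/")

# Pull the company name out of each row's source file path and add it as a column
df = df.withColumn(
    "company",
    regexp_extract(input_file_name(), r"company=([^/]+)/", 1),
)

# Coalesce to a single combined output file that includes the company column
df.coalesce(1).write.option("header", "true").csv("s3://my-bucket/combined/")
```

Note that if the folders already follow the Hive-style company=&lt;name&gt;/ convention, a Glue crawler will also surface the company name as a partition column without any code.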
Related
I need to pull two companies' data from their respective AWS S3 buckets, map their columns in Glue, and export them to a specific schema in a Microsoft SQL database. The schema is to have one table, with the companies' data being distinguished with attributes for each of their sites (each company has multiple sites).
I am completely new to AWS and SQL. Would someone mind explaining how to add an attribute to the data, or pointing me to some good literature on this? I feel like manipulating the .csv in the Python script I'm already running to automatically download the data from another site and then upload it to S3 could be an option (deleting NaN columns and adding a column for site name), but I'm not entirely sure.
I apologize if this has already been answered elsewhere. Thanks!
I find this website to generally be pretty helpful with figuring out SQL stuff. I've linked to the ALTER TABLE commands that would allow you to do this through SQL.
If you are running a Python script to edit the .csv to start, then I would edit the data there, personally. Depending on the size of the data sets, you can run your script as a Lambda or Batch job to grab, edit, and then upload to S3. Then you can run your Glue crawler or whatever process you're using to map the columns.
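For the specific edits you mentioned (dropping NaN columns and adding a site name), a small pandas sketch of that pre-upload step might look like the following; the file names, site value, and column name are hypothetical.

```python
# Hypothetical pre-upload step: drop all-NaN columns and tag every row with
# the site name before the file goes to S3.
import pandas as pd

def prepare_csv(in_path: str, out_path: str, site_name: str) -> None:
    df = pd.read_csv(in_path)
    df = df.dropna(axis=1, how="all")   # remove columns that are entirely NaN
    df["site_name"] = site_name         # add the distinguishing attribute
    df.to_csv(out_path, index=False)

prepare_csv("raw_export.csv", "cleaned_export.csv", "CompanyA-Site1")
```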
Below are my S3 paths, under which multiple folders are present. Each folder contains a CSV file, each with a different schema.
The values within the curly braces {} will be dynamic.
s3://test_bucket/{val1}/data/{val2}/input/latest/
s3://test_bucket/{val1}/data/{val2}/input/archived/timestamp={val3}/
I want to create the Athena tables using AWS Glue Crawler. We can have a separate database for input_data both for current and archive.
The tables should be partitioned over val1 and val2 for both current and archive. In the case of the archived data, an additional partition, val3, should also be present in the table.
Kindly help me with any approach I can take to set the configuration for creating tables dynamically. I would really appreciate your time. Please let me know in case more information is needed.
The simplest and most efficient way would be to use partition projection. See the docs: https://docs.aws.amazon.com/athena/latest/ug/partition-projection.html
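Roughly, a projection-based table for the latest/ path could be defined like the sketch below (submitted here through boto3). The table name, columns, row format, and results location are placeholders; with 'injected' projection, queries must pin val1 and val2 in the WHERE clause, and val3 could be added the same way for the archived path.

```python
# Sketch only: define an Athena table whose val1/val2 partitions are resolved
# by partition projection instead of a crawler. Names and columns are placeholders.
import boto3

athena = boto3.client("athena")

ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS input_data.latest_input (
  col_a string,
  col_b string
)
PARTITIONED BY (val1 string, val2 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://test_bucket/'
TBLPROPERTIES (
  'projection.enabled'        = 'true',
  'projection.val1.type'      = 'injected',
  'projection.val2.type'      = 'injected',
  'storage.location.template' = 's3://test_bucket/${val1}/data/${val2}/input/latest/'
)
"""

athena.start_query_execution(
    QueryString=ddl,
    ResultConfiguration={"OutputLocation": "s3://test_bucket/athena-query-results/"},
)
```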
My comment: use the API to create the crawlers with the specific S3 paths to read and the database name to write to.
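If you go that route, a hedged boto3 sketch of creating one crawler per path could look like this; the (val1, val2) pairs, IAM role, and database name are placeholders.

```python
# Sketch: create and start a crawler per dynamic S3 path via the Glue API.
# The value pairs, IAM role, and database name are placeholders.
import boto3

glue = boto3.client("glue")

for val1, val2 in [("companyA", "dataset1"), ("companyB", "dataset2")]:
    name = f"latest-{val1}-{val2}"
    glue.create_crawler(
        Name=name,
        Role="MyGlueCrawlerRole",
        DatabaseName="input_data",
        Targets={"S3Targets": [{"Path": f"s3://test_bucket/{val1}/data/{val2}/input/latest/"}]},
    )
    glue.start_crawler(Name=name)
```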
I've tried to find similar issues on here/online, but came up short.
I have Athena pointing to a folder in Amazon S3 which itself contains folders/partitions each with a single .tsv inside (e.g. s3://my_bucket/partition/file.tsv). Athena is able to collect results for the majority of the files in the bucket, but doesn't collect results for a small number of them.
I've run the repair code (MSCK REPAIR TABLE) and I checked Glue to make sure that it is seeing the partitions (it is). I also checked the Amazon knowledge center (https://aws.amazon.com/premiumsupport/knowledge-center/athena-empty-results/). Not sure what else might be causing the issue.
It turns out that the columns of the tables (pulled from an API) were in a different order for the files that were not working. Running the queries on a different field returned results. The solution was to enforce a consistent column order after collecting the data from the API.
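In case it's useful to anyone hitting the same thing, here is a small sketch of enforcing a fixed column order before writing each .tsv; the column list is a stand-in for the API's real fields.

```python
# Sketch: force every partition's .tsv to use the same column order.
# EXPECTED_COLUMNS is a placeholder for the real field list from the API.
import pandas as pd

EXPECTED_COLUMNS = ["id", "name", "value"]

def write_partition(records, path):
    df = pd.DataFrame(records)
    # reindex() puts columns in a consistent order (missing ones become empty)
    df = df.reindex(columns=EXPECTED_COLUMNS)
    df.to_csv(path, sep="\t", index=False)

write_partition([{"value": 2, "id": 1, "name": "a"}], "partition/file.tsv")
```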
I'm testing this architecture: Kinesis Firehose → S3 → Glue → Athena. For now I'm using dummy data which is generated by Kinesis, each line looks like this: {"ticker_symbol":"NFLX","sector":"TECHNOLOGY","change":-1.17,"price":97.83}
However, there are two problems. First, the Glue crawler creates a separate table per file. I've read that if the schemas match, Glue should produce only one table. As you can see in the screenshots below, the schema is identical. In the crawler options, I tried ticking Create a single schema for each S3 path on and off, but no change.
Files also sit in the same path, which leads me to the second problem: when those tables are queried, Athena doesn't show any data. That's likely because files share a folder - I've read about it here, point 1, and tested several times. If I remove all but one file from S3 folder and crawl, Athena shows data.
Can I force Kinesis to put each file in a separate folder or Glue to record that data in a single table?
File1 / File2: (screenshots of the crawler-inferred schemas; both show the same columns: ticker_symbol, sector, change, price)
Regarding AWS Glue creating separate tables, there could be a few reasons, based on the AWS documentation:
Confirm that these files use the same schema, format, and compression type as the rest of your source data. It seems this isn't your issue, but to make sure, I suggest you test with smaller files by dropping all but a few rows in each file.
Combine compatible schemas when you create the crawler by choosing Create a single schema for each S3 path. For this to work, the file schemas should be similar, the setting should be enabled, and the data should be compatible (see the sketch after this list). For more information, see How to Create a Single Schema for Each Amazon S3 Include Path.
When using CSV data, be sure that you're using headers consistently. If some of your files have headers and some don't, the crawler creates multiple tables.
Another really important point: you should have one folder at the root and, inside it, the partition sub-folders. If you have partitions at the S3 bucket level, it will not create one table (mentioned by Sandeep in this Stack Overflow question).
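If you end up configuring the crawler through the API rather than the console, the grouping option from the second point maps to the crawler's Configuration JSON; a rough sketch (the crawler name is a placeholder):

```python
# Rough sketch: enable "Create a single schema for each S3 path" on an
# existing crawler via its Configuration JSON. The crawler name is a placeholder.
import json
import boto3

glue = boto3.client("glue")

glue.update_crawler(
    Name="firehose-crawler",
    Configuration=json.dumps(
        {"Version": 1.0, "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"}}
    ),
)
```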
I hope this helps you resolve your problem.
So, I've used Glue before, but it's been with a single file <> single folder relationship.
What I'm trying to do now is to have a structure like this create individual tables for each folder:
- Data Bucket
  - Table 1 Folder
    - file1.csv
    - file2.csv
  - Table 2 Folder
    - file1.csv
    - file2.csv
...and so on.
But every time I create the crawler and set the Data Bucket as the data source, I only get a single table created. I've tried every combo of the "create single schema ...etc" I can think of.
I'm hoping that I don't have to add each sub-folder as a separate data source as my ultimate goal is to translate it eventually into an RDS instance. Hoping to keep the high-level bucket as the single data source if possible. I can easily tweak folder/file structure if needed.
And yes, I'm aware of partitioning, but isn't that only applicable to individual tables?
Thanks!
I ran into the same issue, and digging into the Glue docs, I found that setting the table level in the crawler's output configuration does the trick.
The table level seems to be counted from the bucket level; in your case, I believe setting the table level to 2 (the first folder after the root) would do the trick. 2 means that the table definition starts at that point.
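For reference, a minimal boto3 sketch of what that setting looks like when creating the crawler through the API (crawler name, role, database, and bucket are placeholders):

```python
# Minimal sketch: set the table level to 2 so table definitions start at the
# first folder level under the bucket. All names here are placeholders.
import json
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="data-bucket-crawler",
    Role="MyGlueCrawlerRole",
    DatabaseName="my_database",
    Targets={"S3Targets": [{"Path": "s3://data-bucket/"}]},
    Configuration=json.dumps(
        {"Version": 1.0, "Grouping": {"TableLevelConfiguration": 2}}
    ),
)
```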
I've been trying to accomplish the same thing. I was hoping that Glue would magically see the different folders and automatically create separate tables. Glue seems to want to create a single table, especially when the schemas overlap. In my example, I'm using US census data so there are some common fields, especially in the beginning of each file.
In the end, I was able to get this to work by creating multiple data stores in the Glue Crawler. By doing this, it would create the five separate tables I wanted, but I had to add each folder manually. Still hoping to find a way to get Glue to discover them automatically.
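One way to avoid adding each folder by hand might be to list the top-level prefixes in the bucket and build the crawler's data stores from them; a hedged boto3 sketch (bucket, crawler name, role, and database are placeholders):

```python
# Hedged sketch: discover top-level "folders" in the bucket and register each
# one as its own S3 target on a single crawler. All names are placeholders.
import boto3

s3 = boto3.client("s3")
glue = boto3.client("glue")

resp = s3.list_objects_v2(Bucket="data-bucket", Delimiter="/")
prefixes = [p["Prefix"] for p in resp.get("CommonPrefixes", [])]

glue.create_crawler(
    Name="per-folder-tables-crawler",
    Role="MyGlueCrawlerRole",
    DatabaseName="my_database",
    Targets={"S3Targets": [{"Path": f"s3://data-bucket/{p}"} for p in prefixes]},
)
```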