PySpark: list all folders and subfolders under a mount point

I have an issue: I have tables stored in a storage account, and I'm able to create a mount point. The tables have subfolders for different process runs, and I want to list all the subfolders, not the files, just the subfolders. I'm able to do this with a for-each loop, but I have to nest multiple for-each loops manually. Is there another way to do this?
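One alternative is to walk the mount point recursively instead of nesting loops by hand. Below is a minimal sketch, assuming a Databricks notebook (where dbutils is available) and a hypothetical mount path /mnt/tables; directory entries returned by dbutils.fs.ls have paths ending in a slash, which is used here to tell folders from files.

    # Minimal sketch: recursively collect every subfolder under a mount point.
    # Assumes a Databricks environment (dbutils available) and a hypothetical
    # mount path /mnt/tables.
    def list_subfolders(path):
        folders = []
        for entry in dbutils.fs.ls(path):
            # Directory entries returned by dbutils.fs.ls end with a trailing slash.
            if entry.path.endswith("/"):
                folders.append(entry.path)
                folders.extend(list_subfolders(entry.path))
        return folders

    for folder in list_subfolders("/mnt/tables"):
        print(folder)

The same idea can be written iteratively with an explicit stack if the folder tree is very deep.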

Related

Is it required to have one table schema per S3 folder so that the crawler can pick up the data in AWS Glue?

When I put multiple files with different table schemas in one S3 folder and use that location to create multiple tables with a crawler and AWS Glue, Athena doesn't detect any data and returns blank results. However, if the folder contains files with only a single table schema (tables with the same column structure), the data is detected correctly. So the question is: is there a way for Athena to create multiple tables with different structures from the same S3 folder?
I have tried creating different folders for different files; the crawler then picks up the table schemas correctly and gives the expected results. However, this is not feasible, since creating separate folders for hundreds of files is not a solution. Hence I am searching for another way.
When defining a table in Amazon Athena (and AWS Glue), the location parameter should point to a folder path in an Amazon S3 bucket.
When running a query, Athena reads every file in that folder, including files in sub-folders.
Therefore, you should only keep files of the same format (and schema) in that directory and all of its subdirectories. All of these files will populate the one table.
Do not put multiple files in the same directory if they are meant to populate different tables or have different schemas.
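For example, a layout along these lines (the bucket and table names are illustrative) gives every table its own LOCATION and keeps the schemas separated:

    s3://data-bucket/customers/   <- LOCATION for the customers table (customer files only)
    s3://data-bucket/orders/      <- LOCATION for the orders table (order files only)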

Having trouble setting up multiple tables in AWS glue from a single bucket

So, I've used Glue before, but only with a one-file-to-one-folder relationship.
What I'm trying to do now is have a structure like the following create an individual table for each folder:
- Data Bucket
  - Table 1 Folder
    - file1.csv
    - file2.csv
  - Table 2 Folder
    - file1.csv
    - file2.csv
...and so on.
But every time I create the crawler and set the Data Bucket as the data source, I only get a single table created. I've tried every combination of the "create a single schema..." settings I can think of.
I'm hoping that I don't have to add each sub-folder as a separate data source, as my ultimate goal is to eventually translate it into an RDS instance. Hoping to keep the high-level bucket as the single data source if possible. I can easily tweak the folder/file structure if needed.
And yes, I'm aware of partitioning, but isn't that only applicable to individual tables?
Thanks!
I ran into the same issue, and digging into the Glue docs I found that setting the table level in the crawler's output configuration does the trick.
The table level seems to be counted from the bucket level; in your case, I believe setting the table level to 2 (the first folder level below the root) would do the trick. A value of 2 means that the table definitions start at that depth.
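If you create the crawler programmatically rather than in the console, the same setting goes into the crawler's Configuration JSON. A minimal sketch with boto3, using hypothetical crawler, role, database, and bucket names:

    # Minimal sketch: create a Glue crawler whose table level is 2, so each
    # first-level folder under the bucket root becomes its own table.
    # The names (my-crawler, my-glue-role, my_database, data-bucket) are placeholders.
    import json
    import boto3

    glue = boto3.client("glue")

    glue.create_crawler(
        Name="my-crawler",
        Role="my-glue-role",
        DatabaseName="my_database",
        Targets={"S3Targets": [{"Path": "s3://data-bucket/"}]},
        Configuration=json.dumps(
            {"Version": 1.0, "Grouping": {"TableLevelConfiguration": 2}}
        ),
    )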
I've been trying to accomplish the same thing. I was hoping that Glue would magically see the different folders and automatically create separate tables. Glue seems to want to create a single table, especially when the schemas overlap. In my example I'm using US census data, so there are some common fields, especially at the beginning of each file.
In the end, I was able to get this to work by creating multiple data stores in the Glue Crawler. By doing this, it would create the five separate tables I wanted, but I had to add each folder manually. Still hoping to find a way to get Glue to discover them automatically.

How to use multiple file formats in Athena

I have multiple files with different formats (CSV, JSON, and Parquet) in an S3 bucket directory (all files are in the same directory). All files have the same structure. How can I use these files to create an Athena table?
Is there a provision to specify a different SerDe for each format while creating the table?
Edit: The table gets created, but there is no data when I preview the table.
There are a few options, but in my opinion it is best to create separate paths (folders) for each file type and run a Glue Crawler on each of them. You will end up with multiple tables, but you can consolidate them by using Athena views, or you can convert the files to a single format by using Glue (for instance).
If you want to keep the files in one folder, you can use include and exclude patterns in the Glue Crawler (a rough sketch follows the link below). In this case too you will have to create a separate table for each type of file.
https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html
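As a rough illustration of the exclude-pattern approach, here is a minimal boto3 sketch with hypothetical names: one crawler per format, each pointed at the same folder but excluding the other extensions, so each resulting table only sees files of a single format (and therefore a single SerDe).

    # Minimal sketch: a crawler that only picks up the CSV files in a mixed folder.
    # Names and paths (csv-only-crawler, my-glue-role, my_database, data-bucket)
    # are placeholders; create similar crawlers for the JSON and Parquet files.
    import boto3

    glue = boto3.client("glue")

    glue.create_crawler(
        Name="csv-only-crawler",
        Role="my-glue-role",
        DatabaseName="my_database",
        Targets={
            "S3Targets": [
                {
                    "Path": "s3://data-bucket/mixed-folder/",
                    # Exclude the other formats so only CSV ends up in this table.
                    "Exclusions": ["**.json", "**.parquet"],
                }
            ]
        },
    )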

How to combine multiple S3 files into one using Glue

I need some help combining multiple files from different company partitions in S3 into one file, with the company name included as one of the columns.
I am new to this and not able to find any information. I also spoke to support, and they say it is not supported. But in DataStage it is a basic function to combine multiple files into one.
Please shed some light on this.
Regards,
Prakash
If the column names are the same in the files and the number of columns is also the same, Glue will automatically combine them.
Make sure the files you want to combine are in the same folder on S3 and your Glue crawler is pointing to that folder.
Review the AWS Glue examples, particularly the Join and Rationalize Data in S3 example. It shows you how to use a Python script to do joins and filters with transforms.
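As a rough illustration of the company-name part, here is a minimal PySpark sketch. It assumes a hypothetical layout such as s3://data-bucket/companies/company=acme/file1.csv; when the folders follow the key=value naming convention, Spark's partition discovery surfaces company as a regular column, and coalesce(1) writes the combined result as a single part file.

    # Minimal sketch: read all company folders, keep the company name as a column,
    # and write the combined data back out. Bucket and folder names are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Folders named company=<name> are picked up as a "company" partition column.
    df = spark.read.option("header", "true").csv("s3://data-bucket/companies/")

    # coalesce(1) produces an output directory containing a single part file.
    (df.coalesce(1)
       .write.mode("overwrite")
       .option("header", "true")
       .csv("s3://data-bucket/combined/"))

If the folders are not named with the key=value pattern, the company can instead be derived from pyspark.sql.functions.input_file_name().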

Automate data loading from a folder to a SAS library

I want to automate the process of loading data from a folder to a SAS LASR Server (but I expect it to be similar to loading data into a normal SAS library). I have a folder where users can put their data (let's say *.csv files with the same structure). I want to create some sort of process that will automatically scan this folder, check whether there are any new files, and if there are, append them to the existing data and upload it so that it is available to all users for further analysis.
I know how to read a single CSV into a SAS dataset, and I'm looking for the easiest way to solve two problems: comparing the current CSVs with those already uploaded, and scheduling this process.
Many thanks in advance for any help!
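The SAS-specific loading is out of scope here, but the "which files are new?" half of the problem is language-agnostic. Below is a rough sketch in Python with hypothetical paths: a small manifest file records the CSVs that have already been loaded, and an external scheduler (cron, Windows Task Scheduler, or a scheduled SAS job) reruns the script periodically.

    # Minimal sketch: find CSVs in the drop folder that have not been loaded yet.
    # Paths are placeholders; the print is where the real append/upload step goes.
    from pathlib import Path

    INCOMING = Path("/data/incoming")        # folder users drop *.csv files into
    MANIFEST = Path("/data/processed.txt")   # one already-loaded file name per line

    processed = set(MANIFEST.read_text().splitlines()) if MANIFEST.exists() else set()
    new_files = [p for p in sorted(INCOMING.glob("*.csv")) if p.name not in processed]

    for path in new_files:
        print(f"would append and upload {path}")   # placeholder for the real load
        processed.add(path.name)

    MANIFEST.write_text("\n".join(sorted(processed)) + "\n")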