I'm trying to load several CSV files from a folder that aren't connected to each other.
It's not allowing me to load them without combining them.
Get Data > File > Folder
When they load, it automatically tries to combine them, which creates NULLs in the rows of one of the files.
When you get data from a folder, it will combine all files into one table. To load each file into its own table, you should get data from a Text/CSV file instead and import each of the files you want.
We have an AWS Glue job that is attempting to read data from an Athena table that is populated by Hudi. Unfortunately, we are running into an error related to create_dynamic_frame.from_catalog trying to read from these tables.
An error occurred while calling o82.getDynamicFrame. s3://bucket/folder/.hoodie/20220831153536771.commit is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [32, 125, 10, 125]
This appears to be a known issue on GitHub: https://github.com/apache/hudi/issues/5891
Unfortunately, no workaround was provided. We are attempting to exclude either the .hoodie folder or the *.commit file(s) within the additional_options of the create_dynamic_frame.from_catalog connection, but we are not having any success with either the file or the folder exclusion. Note: we have .hoodie files in the root directory as well as a .hoodie folder that contains a .commit file, among other files. We would prefer to exclude them all.
Per AWS:
"exclusions": (Optional) A string containing a JSON list of Unix-style glob patterns to exclude. For example, "["**.pdf"]" excludes all PDF files. For more information about the glob syntax that AWS Glue supports, see Include and Exclude Patterns.
Question: how do we exclude both file and folder from a connection?
Folder
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database=args['ENV'] + "_some_database",
    table_name="some_table",
    transformation_ctx="datasource_x1",
    additional_options={"exclusions": "[\".hoodie/**\"]"}
)
File
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database=args['ENV'] + "_some_database",
    table_name="some_table",
    transformation_ctx="datasource_x1",
    additional_options={"exclusions": "[\"**.commit\"]"}
)
It turns out the originally attempted solution of {"exclusions": "[\"**.commit\"]"} worked. Unfortunately, I wasn't paying close enough attention, and there were multiple tables that needed exclusions. After hacking through all of the file types, here are two working solutions:
Exclude folder
additional_options={"exclusions": "[\"s3://bucket/folder/.hoodie/*\"]"}
Exclude file(s)
additional_options={"exclusions": "[\"**.commit\",\"**.inflight\",\"**.properties\"]"}
I'm building an app where the user uploads 1-4 files to Shiny through fileInput. An issue arises where, should the user not drag/select multiple files in one go, the app will not be able to access them. For example, if the user has 4 files saved locally in 4 different folders and tries uploading them one by one, the app will not function. This happens because when the files are uploaded, fileInput creates a dataframe where one column (datapath) contains the path to a temp file which you can then reference in the server. In the documentation it states...
datapath
The path to a temp file that contains the data that was uploaded. This file may be deleted if the user performs another upload operation.
https://shiny.rstudio.com/reference/shiny/1.6.0/fileInput.html
Is there any way around this problem, either to prevent the datapath from being deleted or to store the temp file somewhere so it won't be lost should the user upload another file?
I had considered multiple fileInput boxes, but that just makes the app messy.
There is a reproducible example in the example section of the documentation above.
I am trying to load CSV files from a folder, but I need to apply several custom steps to each file, including dropping the PromoteHeaders default.
I have a custom query that can load a single file successfully. How do I turn it into a query that loads all files in a folder?
By default, the folder connector's PromoteHeaders step messes up my data because of a missing column name (which my custom query fixes).
The easiest way to create a function that reads a specific template of file is to actually build it: create the M to read one file, then right-click the query and transform it into a function.
After that, it is really simple to adjust your M so it uses parameters.
You can create a blank query and replace the code with the example below, then customize it with more steps to deal with your file requirements.
= (myFile) =>
    let
        Source = Csv.Document(myFile, [Delimiter=",", Columns=33, Encoding=1252, QuoteStyle=QuoteStyle.None])
    in
        Source
And then use Invoke Custom Function for each file, with the file's Content column as the parameter.
I am using models.FileField to upload files to the server. When uploading a new file with the name of an existing file, Django appends some random characters to the name, but I need to retain the filename. Hence I changed the upload_to entry in models.FileField to upload_to='str(uuid.uuid4())'. It works fine, but occasionally the above problem still occurs. Are there any other methods to correct it?
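One common approach, sketched below under my own assumptions (the Document model and upload_to_unique_dir names are made up for illustration), is to pass a callable to upload_to that places each upload in its own UUID-named directory. The callable runs for every new upload, so the original filename is kept and collisions with existing files are avoided.

import uuid

from django.db import models


def upload_to_unique_dir(instance, filename):
    # Each upload gets its own UUID directory, so the original
    # filename never collides with an earlier upload.
    return "uploads/{}/{}".format(uuid.uuid4(), filename)


class Document(models.Model):
    # upload_to accepts a callable taking (instance, filename);
    # Django calls it each time a new file is saved.
    file = models.FileField(upload_to=upload_to_unique_dir)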
I have a talend job which is simple like below:
tS3Connection -> tS3Get -> tFileInputDelimited -> tMap -> tAmazonMysqlOutput.
Now the scenario here is that sometimes I get the file in .txt format and sometimes I get it in a zip file.
So I want to use tFileUnarchive to unzip the file if it is a zip, or bypass the tFileUnarchive component if the file arrives already unzipped, i.e. only in .txt format.
Any help on this is greatly appreciated.
The trick here is to break the file retrieval and potential unzipping into one sub job and then the processing of the files into another sub job afterwards.
Here's a simple example job:
As normal, you connect to S3, and then you might list all the relevant objects in the bucket using the tS3List component and pass each key to tS3Get. Alternatively, you might have another way of passing the relevant object key that you want to download to tS3Get.
In the above job I set tS3Get up to fetch every object that is iterated on by the tS3List component by setting the key as:
((String)globalMap.get("tS3List_1_CURRENT_KEY"))
and then downloading it to:
"C:/Talend/5.6.1/studio/workspace/S3_downloads/" + ((String)globalMap.get("tS3List_1_CURRENT_KEY"))
The extra bit I've added starts with a Run If conditional link from tS3Get to the tFileUnarchive component, with the condition:
((String)globalMap.get("tS3List_1_CURRENT_KEY")).endsWith(".zip")
This checks whether the file being downloaded from S3 is a .zip file.
The tFileUnarchive component then just needs to be told what to unzip, which will be the file we've just downloaded:
"C:/Talend/5.6.1/studio/workspace/S3_downloads/" + ((String)globalMap.get("tS3List_1_CURRENT_KEY"))
and where to extract it to:
"C:/Talend/5.6.1/studio/workspace/S3_downloads"
This then puts any extracted files in the same place as the ones that didn't need extracting.
From here we can now iterate through the downloads folder with a tFileList component, looking for the file types we want, by setting the directory to "C:/Talend/5.6.1/studio/workspace/S3_downloads" and the filemask (glob expression) to "*.csv" in my case, as I wanted to read in only the CSV files (including the zipped ones) I had in S3.
Finally, we then read the delimited files by setting the file to be read by the tFileInputDelimited component as:
((String)globalMap.get("tFileList_1_CURRENT_FILEPATH"))
And in my case I then simply printed this to the console, but obviously you would want to perform some transformation before uploading to your AWS RDS instance.