Informatica Developer 10.5.2.1 - To read multiple part parquet files from Azure DataLake Storage Gen2 - informatica

I have a folder in ADLS Gen2 that contains several part Parquet files. I need to read all of these files in one go with Informatica Developer and write them to another folder in ADLS Gen2.
Do you have any suggestions?
Thank you
Ozge
1- I took only the last day's data from a folder in ADLS Gen2 that holds exactly one file per day (using parameterization). Because I run this mapping on Databricks, the output ends up as multiple part Parquet files.
2- As a second step, I need to read all of these part Parquet files. I expected that reusing the data object I created in step 1 would read all of the files, but it does not.
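To make the goal of step 2 concrete: a Databricks run typically leaves a set of part files in the target folder next to marker files such as _SUCCESS, so whatever reads the folder has to pick up every part file and skip the markers. The sketch below is plain Python on a local folder, not an Informatica setting; the function name list_part_files and the part-*.parquet naming pattern are assumptions made for illustration.

```python
from pathlib import Path
from typing import List


def list_part_files(folder: str, pattern: str = "part-*.parquet") -> List[Path]:
    """Return every part file in `folder`, sorted by name.

    Marker files such as _SUCCESS do not match the pattern and are skipped.
    The default pattern is an assumption about how the part files are named.
    """
    return sorted(p for p in Path(folder).glob(pattern) if p.is_file())
```

The returned list is what a "read the whole folder" step would need to consume, instead of a single file path.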

Related

inserting and reading data to/from athena tables

I apologise if the title is a bit misleading for the question I am going to ask. I am trying to understand how Athena works a bit more clearly.
I have a daily job that uploads files to an S3 location. I have created an Athena table that reads from that S3 location. Every day the data is updated and new files (i.e. new data) are uploaded to the location. (New doesn't necessarily mean overwriting; it can also mean adding more files.)
My issue is that when I try to read the latest data from the Athena GUI, it returns nothing but an empty table.
How do I read the latest data? Do I have to run another command like ALTER TABLE or INSERT INTO after uploading files to S3? My understanding was that uploading files to that S3 location is akin to inserting data into the table and vice versa, i.e. running ALTER TABLE/INSERT INTO is akin to uploading files to S3.

Glue crawler is not combining data - also no visible data in tables

I'm testing this architecture: Kinesis Firehose → S3 → Glue → Athena. For now I'm using dummy data which is generated by Kinesis, each line looks like this: {"ticker_symbol":"NFLX","sector":"TECHNOLOGY","change":-1.17,"price":97.83}
However, there are two problems. First, a Glue Crawler creates a separate table per file. I've read that if the schemas match, Glue should create only one table. As you can see in the screenshots below, the schema is identical. In the Crawler options, I tried ticking Create a single schema for each S3 path on and off, but nothing changed.
Files also sit in the same path, which leads me to the second problem: when those tables are queried, Athena doesn't show any data. That's likely because files share a folder - I've read about it here, point 1, and tested several times. If I remove all but one file from S3 folder and crawl, Athena shows data.
Can I force Kinesis to put each file in a separate folder or Glue to record that data in a single table?
[Screenshots: File1, File2]
Regarding AWS Glue creating separate tables, there could be several reasons, based on the AWS documentation:
Confirm that these files use the same schema, format, and compression type as the rest of your source data. This doesn't seem to be your issue, but to make sure, I suggest testing with smaller files by dropping all but a few rows in each file.
Combine compatible schemas when you create the crawler by choosing Create a single schema for each S3 path. For this to work, the file schemas should be similar, the setting should be enabled, and the data should be compatible. For more information, see How to Create a Single Schema for Each Amazon S3 Include Path.
When using CSV data, be sure that you're using headers consistently. If some of your files have headers and some don't, the crawler creates multiple tables.
Another really important point: you should have one folder at the root and, inside it, the partition sub-folders. If you have partitions at the S3 bucket level, the crawler will not create one table (mentioned by Sandeep in this Stack Overflow question).
I hope this helps you resolve your problem.
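As an illustration of the last point, here is a minimal sketch of an object key layout with a single root folder and Hive-style date partitions beneath it. The root name tickers, the file name, and the helper partitioned_key are hypothetical; how the delivery stream is configured to produce such prefixes is not covered here.

```python
from datetime import datetime, timezone


def partitioned_key(root: str, ts: datetime, filename: str) -> str:
    """Build an object key under one root folder with year/month/day partitions."""
    return (
        f"{root.strip('/')}/"
        f"year={ts:%Y}/month={ts:%m}/day={ts:%d}/"
        f"{filename}"
    )


# Hypothetical root folder and file name, for illustration only.
key = partitioned_key(
    "tickers", datetime(2020, 3, 7, tzinfo=timezone.utc), "batch-0001.json"
)
print(key)  # tickers/year=2020/month=03/day=07/batch-0001.json
```

With this layout the crawler would be pointed at the root folder (tickers/ in the sketch) rather than at individual files.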

AWS Athena - What happens when you add new files to S3 folder

I have a sample working where I put a file in S3.
What I'm confused about is what happens when I add new CSV files (with the same format) to that folder.
Are they instantly available in queries? Or do you have to run Glue or something else to process them? For example, what if I set up a Lambda function to extract a new CSV every hour, or even every 5 minutes, to that same S3 directory?
Does Athena actually load the data into some database somewhere in order to do fast performing queries?
If your table is not partitioned, or you add a file to an existing partition, the data will be available right away.
However, if you constantly add files, you may want to consider partitioning your table to optimize query performance; see:
Table Location in Amazon S3
Partitioning Data
Athena itself doesn't have any caching; every query hits the S3 location of the table.
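For the partitioned case, a new partition folder generally has to be registered in the table metadata before queries see it, unless something like a crawler or partition projection takes care of that. A minimal sketch of building such a statement follows; the helper add_partition_ddl, the table name, and the bucket path are hypothetical, and the generated SQL would still have to be submitted to Athena by other means.

```python
from typing import Dict


def add_partition_ddl(table: str, partition: Dict[str, str], location: str) -> str:
    """Build an ALTER TABLE ... ADD PARTITION statement as a string.

    Values are inserted as-is, so this is only suitable for trusted input.
    """
    spec = ", ".join(f"{col}='{val}'" for col, val in partition.items())
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({spec}) LOCATION '{location}'"
    )


# Hypothetical table and bucket names.
ddl = add_partition_ddl(
    "events", {"dt": "2020-03-07"}, "s3://my-bucket/events/dt=2020-03-07/"
)
print(ddl)
```

Files dropped into a folder that is already registered as a partition need no such statement, which matches the answer above.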

Big Query can't query some csvs in Cloud Storage bucket

I created a permanent BigQuery table that reads some CSV files from a Cloud Storage bucket; the files share the same name prefix (filename*.csv) and the same schema.
However, some of the CSVs make BigQuery queries fail with a message like the following: "Error while reading table: xxxx.xxxx.xxx, error message: CSV table references column position 5, but line starting at position:10 contains only 2 columns."
By removing the CSVs from the bucket one by one, I identified the file responsible.
This csv file doesn't have 10 lines...
I found this ticket, BigQuery error when loading csv file from Google Cloud Storage, so I thought the issue was an empty line at the end. But other CSVs in my bucket have one too, so this can't be the reason.
On the other hand, this CSV is the only one with content type text/csv; charset=utf-8; all the others are text/csv, application/vnd.ms-excel, or application/octet-stream.
Furthermore, after downloading this CSV to my local Windows machine and uploading it again to Cloud Storage, the content type is automatically converted to application/vnd.ms-excel.
Then, even with the missing line, BigQuery can query the permanent table based on filename*.csv.
Is it possible that BigQuery has issues querying CSVs with UTF-8 encoding, or is it just a coincidence?
Use Google Cloud Dataprep to load your CSV file. Once the file is loaded, analyze the data and clean it if required.
Once all the rows are cleaned, you can then sink that data into BQ.
Dataprep is a GUI-based ETL tool that runs a Dataflow job internally.
Do let me know if any more clarification is required.
To clarify the issue: the CSV file had gzip as its encoding, which is why BigQuery did not interpret it as a CSV file.
According to documentation BigQuery expects CSV data to be UTF-8 encoded:
"encoding": "UTF-8"
In addition, since this issue is related to the metadata of the files in GCS, you can edit the metadata directly from the Console.
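When hunting for a single offending file like this, a quick local check can help narrow things down. The sketch below is only an assumption about a useful diagnostic, not BigQuery's own parsing: it tests whether a downloaded payload starts with the gzip magic bytes, and lists rows that have fewer columns than expected. The names looks_gzipped and short_rows are hypothetical.

```python
import csv
import io
from typing import List, Tuple

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def looks_gzipped(data: bytes) -> bool:
    """True if the payload starts with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


def short_rows(data: bytes, expected_columns: int) -> List[Tuple[int, int]]:
    """Return (line_number, column_count) for non-empty rows with too few columns."""
    text = data.decode("utf-8")
    reader = csv.reader(io.StringIO(text, newline=""))
    return [
        (reader.line_num, len(row))
        for row in reader
        if row and len(row) < expected_columns
    ]
```

Running looks_gzipped on the raw bytes of each file first would flag a compressed file before any row-level check is attempted.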

BigQuery table: loading .7z file from cloud platform

I am trying to upload a compressed file from my GCS bucket into BigQuery.
In the new UI it is not clear how I should specify that the file needs to be uncompressed.
I get an error as if gs://bucket/folder/file.7z were a .csv file.
Any help?
Unfortunately, .7z files are not supported by BigQuery; only gzip files are, and decompression happens automatically after you select the data format and create the table.
If you think BigQuery should accept 7z files too, you could file a feature request so the BigQuery engineers keep it in mind for future releases.
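One possible workaround is to recompress the data as gzip before uploading it. The sketch below assumes the CSV has already been extracted from the .7z archive with a separate tool (extraction is not shown), and the helper name gzip_file and any paths passed to it are hypothetical.

```python
import gzip
import shutil


def gzip_file(src_path: str, dest_path: str) -> None:
    """Write a gzip-compressed copy of src_path to dest_path."""
    with open(src_path, "rb") as src, gzip.open(dest_path, "wb") as dest:
        shutil.copyfileobj(src, dest)
```

The resulting .csv.gz file can then be uploaded to the bucket in place of the .7z archive.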