What is AWS S3 dataset? - amazon-web-services

Looking at documentation of awswrangler.s3.to_csv or awswrangler.s3.to_parquet, there is a dataset parameter.
From testing, it looks like setting dataset=True allows, among other things, appending new data to an already existing set. It also looks like when dataset=True, I can't specify the file name, and AWS autogenerates the names for the files which are added to the specified path.
Apart from that, I can't find more information on what dataset means. Is it just referring to the general concept or is there a specific meaning within the context of AWS? What exactly is dataset and when should it be set to True?

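The dataset=True option stores the output as a dataset rather than as ordinary file(s): a collection of files under a path prefix, together with the extra options listed below (partitioning, write mode, catalog metadata, and so on).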
The dataset parameter documentation:
dataset (bool) – If True store as a dataset instead of ordinary file(s) If True, enable all follow arguments: partition_cols, mode, database, table, description, parameters, columns_comments, concurrent_partitioning, catalog_versioning, projection_enabled, projection_types, projection_ranges, projection_values, projection_intervals, projection_digits, catalog_id, schema_evolution.
Note all those extra things that get saved when you save a dataset. All that information, like columns_comments, concurrent_partitioning, projection_values, will be lost when you save to an ordinary CSV or Parquet file (dataset=False). But on the other hand, those values are probably only useful if you plan to do further manipulation of the data via awswrangler/pandas at some later date.
Also note that if you set dataset=True you have to give it a file name prefix instead of a single file name, because the output generated will be spread across multiple files.
If you want to use the data in any other tool besides Pandas, such as loading the CSV into Excel, then you most likely want to set dataset=False and output to a single file.
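As a rough illustration of the dataset mode described above, here is a minimal sketch. It assumes awswrangler is installed and AWS credentials are configured; the function name append_to_dataset is hypothetical, and the caller supplies the S3 prefix and partition column.

```python
from typing import Any, Dict, List


def append_to_dataset(df: Any, prefix: str, partition_cols: List[str]) -> Dict[str, Any]:
    """Append a pandas DataFrame to a Parquet dataset stored under an S3 prefix."""
    # Imported here so this module can be loaded where awswrangler is not installed.
    import awswrangler as wr

    return wr.s3.to_parquet(
        df=df,
        path=prefix,                    # a prefix, not a single file name
        dataset=True,                   # enables partition_cols and mode
        mode="append",                  # add new files next to the existing ones
        partition_cols=partition_cols,  # one sub-prefix per distinct value
    )
```

A call such as append_to_dataset(df, "s3://my-bucket/sales/", ["year"]) (bucket name made up for the example) should return a dictionary listing the autogenerated file paths and the partition values that were written.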

Related

Amazon Redshift: Check column names during COPY

Can I check column names during a COPY from S3 to Redshift?
For example, I have "good" CSV:
name ,sur_name
BOB , FISCHER
And I have "wrong" CSV:
sur_name,name
FISCHER , BOB
Can I check the column names during the COPY command?
I don't want to use AWS Glue or AWS Lambda for checks because I don't want to open/load/save the same file many times.
(The same problem applies to other files with column names.)
This is a very simple check, so Redshift should allow it, but I can't find any information about it.
If this is not possible, can you give me some idea of how to do it without reading the whole files?
(For example, a Lambda function that reads only the headers without fetching the whole file.)
From Column mapping options - Amazon Redshift:
You can specify a comma-separated list of column names to load source data fields into specific target columns. The columns can be in any order in the COPY statement, but when loading from flat files, such as in an Amazon S3 bucket, their order must match the order of the source data.
Therefore, the only way to read such files would be to specify the column names in their correct order. This requires you to look inside the file to determine the order of the columns.
When reading an object from Amazon S3, it is possible to specify the range of bytes to be read. So, instead of reading the entire file, it could read just the first 200 bytes (or whatever size would be sufficient to include the header row). An AWS Lambda function could read these bytes, extract the column names, then generate a COPY command that would import the columns in the correct order (without having to read the entire file first).
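The approach above could be sketched roughly as follows. This is only an illustration: the function names, the 200-byte default, and the table and IAM role values are assumptions, and s3_client is expected to be a boto3 S3 client (for example boto3.client("s3")) passed in by the caller.

```python
import csv
import io
from typing import Any, List


def read_header_columns(s3_client: Any, bucket: str, key: str, max_bytes: int = 200) -> List[str]:
    """Fetch only the first max_bytes of the object and parse its CSV header row."""
    response = s3_client.get_object(
        Bucket=bucket, Key=key, Range=f"bytes=0-{max_bytes - 1}"
    )
    # errors="replace" guards against a multi-byte character cut at the range boundary.
    first_chunk = response["Body"].read().decode("utf-8", errors="replace")
    header_line = first_chunk.splitlines()[0]
    return [name.strip() for name in next(csv.reader(io.StringIO(header_line)))]


def build_copy_command(table: str, columns: List[str], bucket: str, key: str, iam_role: str) -> str:
    """Build a COPY statement whose column list follows the order found in the file."""
    column_list = ", ".join(columns)
    return (
        f"COPY {table} ({column_list}) "
        f"FROM 's3://{bucket}/{key}' "
        f"IAM_ROLE '{iam_role}' "
        "CSV IGNOREHEADER 1;"
    )
```

If the header row is longer than max_bytes, the last column name will be truncated, so the range should be sized generously. The returned column names can also be compared against the expected set before the COPY is issued.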

Why are tables segmented when exporting to parquet from AWS RDS

We use Python's boto3 library to execute start_export_task to trigger an RDS snapshot export (to S3). This successfully generates a directory in S3 that has a predictable, expected structure. Traversing down through that directory to any particular table directory (as in export_identifier/database_name/schema_name.table_name/) I see several .parquet files.
I download several of these files and convert them to pandas dataframes so I can look at them. They are all structured the same and seem to clearly be pieces of the same table. But they range in size from 100KB to 8MB in seemingly unpredictable size segments. Do these files/'pieces' of the table account for all its rows? Do they repeat/overlap at all? Why are they segmented so (seemingly) randomly? What parameters control this segmenting?
Ultimately I'm looking for documentation on this part of parquet folder/file structure. I've found plenty of information on how individual files are structured and partitioning. But I think this falls slightly outside of those topics.
You're not going to like this, but from AWS' perspective this is an implementation detail and according to the docs:
The file naming convention is subject to change. Therefore, when reading target tables we recommend that you read everything inside the base prefix for the table.
— docs
Most of the tools that work with Parquet don't really care about the number or file names of the parquet files. You just point something like Spark or Athena to the prefix of the table and it will read all the files and figure out how they fit together.
In the API there are also no parameters to influence this behavior. If you prefer a single file for aesthetic reasons or others, you could use something like a Glue Job to read the table prefixes, coalesce the data per table into a single file and write it to S3.
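If you do read the files yourself, the "read everything inside the base prefix" advice can be sketched like this. The function name is hypothetical, and s3_client is assumed to be a boto3 S3 client supplied by the caller; no assumption is made about how the individual files are named.

```python
from typing import Any, Iterator


def iter_parquet_keys(s3_client: Any, bucket: str, table_prefix: str) -> Iterator[str]:
    """Yield every .parquet object key found under a table's base prefix."""
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=table_prefix):
        # Pages with no matching objects have no "Contents" entry.
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".parquet"):
                yield obj["Key"]
```

The resulting keys can then be loaded one by one and concatenated, which yields the whole table regardless of how the export split it.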

How do I ensure that the AWS Glue crawler I've written is using the OpenCSV SerDe instead of the LazySimpleSerDe?

For context: I skimmed this previous question but was dissatisfied with the answer for two reasons:
I'm not writing anything in Python; in fact, I'm not writing any custom scripts for this at all as I'm relying on a crawler and not a Glue script.
The answer is not as complete as I require since it's just a link to some library.
I'm looking to leverage AWS Glue to accept some CSVs into a schema, and using Athena, convert that CSV table into multiple Parquet-formatted tables for ETL purposes. The data I'm working with has quotes embedded in it, which would be okay save for the fact that one record I have has a value of:
"blablabla","1","Freeman,Morgan","bla bla bla"
It seems that Glue is tripping over itself when it encounters the "Freeman,Morgan" piece of data.
If I use the standard Glue crawler, I get a table created with the LazySimpleSerDe, which truncates the record above in its column to:
"Freeman,
...which is obviously not desirable.
How do I force the crawler to output the file with the correct SerDe?
[Unpleasant] Constraints:
Looking to not accomplish this with a Glue script, since for that to work I believe I have to have a table beforehand, whereas the crawler will create the table on my behalf.
If I have to do this all through Amazon Athena, I'd feel like that would largely defeat the purpose but it's a tenable solution.
This is going to turn into a very dull answer, but apparently AWS provides its own set of rules for classifying if a file is a CSV.
To be classified as CSV, the table schema must have at least two columns and two rows of data. The CSV classifier uses a number of heuristics to determine whether a header is present in a given file. If the classifier can't determine a header from the first row of data, column headers are displayed as col1, col2, col3, and so on. The built-in CSV classifier determines whether to infer a header by evaluating the following characteristics of the file:
Every column in a potential header parses as a STRING data type.
Except for the last column, every column in a potential header has content that is fewer than 150 characters. To allow for a trailing delimiter, the last column can be empty throughout the file.
Every column in a potential header must meet the AWS Glue regex requirements for a column name.
The header row must be sufficiently different from the data rows. To determine this, one or more of the rows must parse as other than STRING type. If all columns are of type STRING, then the first row of data is not sufficiently different from subsequent rows to be used as the header.
I believed that I had met all of these requirements, given that the column names are wildly divergent from the actual data in the CSV, and ideally there shouldn't be much of an issue there.
However, in spite of my belief that it would satisfy the AWS Glue regex (which I can't find a definition for anywhere), I elected to move away from commas and to pipes instead. The data now loads as I expect it to.
Use glueContext.create_dynamic_frame_from_options() to read the CSV and convert it to Parquet, then run the crawler over the Parquet data.
df = glueContext.create_dynamic_frame_from_options("s3", {"paths": [src]}, format="csv")
Default separator is ,
Default quoteChar is "
If you wish to change then check https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format.html
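To see why the quote character matters for the record in the question, here is a small stand-alone comparison. It uses Python's csv module as an analogy for a quote-aware parser; it is not the code that Glue or either SerDe actually runs.

```python
import csv
import io

line = '"blablabla","1","Freeman,Morgan","bla bla bla"'

# A naive split on the delimiter ignores quoting, so the name is cut in two.
naive_fields = line.split(",")

# A quote-aware parser treats the comma inside the quotes as data.
quoted_fields = next(csv.reader(io.StringIO(line), delimiter=",", quotechar='"'))

print(naive_fields[2])   # "Freeman
print(quoted_fields[2])  # Freeman,Morgan
```

The naive split produces five fields instead of four, with the third one truncated in the same way as the value shown in the question.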

Google Dataprep: Save GCS file name as one of the column

I have a Dataprep flow configured. The Dataset is a GCS folder (all files from it). Target is BigQuery table.
Since the data is coming from multiple files, I want to have the filename as one of the columns in the resulting data.
Is that possible?
UPDATE: There's now a source metadata reference called $filepath—which, as you would expect, stores the local path to the file in Cloud Storage (starting at the top-level bucket). You can use this in formulas or add it to a new formula column and then do anything you want in additional recipe steps. (If your data source sample was created before this feature, you'll need to generate a new sample in order to see it in the interface)
Full notes for these metadata fields are available here: https://cloud.google.com/dataprep/docs/html/Source-Metadata-References_136155148
Original Answer
This is not currently possible out of the box. If you're manually merging datasets with UNION, you could first process them to add a column with the source so that it's then present in the combined output.
If you're bulk-ingesting files, that doesn't help, but there is an open feature request that you can comment on and/or follow for updates:
https://issuetracker.google.com/issues/74386476

How do I manage SAS formats from various sources?

I am wondering how I can efficiently manage formats in SAS for a reporting office that takes in data from various sources, some with proper lookup tables / metadata, and some without.
For data sources that have proper metadata, joining tables for value descriptions works fine, but when metadata doesn't exist and needs to be maintained separately, how should that be done? Some straightforward examples/ideas:
Plain .sas files with a native PROC FORMAT step that is maintained separately.
External files (e.g., Excel, CSV) that are maintained separately and imported into SAS to create a format library.
Database tables maintained separately that can be read from to create a format library.
In addition to just the formatted values, managing value changes (i.e., effective dates for certain values) is also a concern.
Any help in conventions or standards that work well for this type of task is greatly appreciated.
I'm not sure there's a single best solution here - it depends largely on your environment, your users, etc.
If you have fairly naive users, then I'd definitely recommend a single complete repository if possible; whether that is a .sas7bcat file if you are using a single SAS version/OS/bitness, or a ready-made table/dataset to input into PROC FORMAT (and a .sas file included in their autoexec to do the importing). The biggest drawbacks to this are that you have to manage it actively (you cannot allow users to write their own formats to the master format dataset, for example, as they may overwrite other ones), and that there will be additional work to ensure format names do not conflict - YNF. might be 1=YES 2=NO or 1=YES 0=NO or something else. This also doesn't allow you to very easily handle effective dates; but it's possible this is better for your users (and then just handle the documentation separately).
If you have more advanced users, then you might consider a table/dataset that is more relational in nature. A hybrid approach might include a dataset with columns:
Dataset Name (qualified as needed to ensure uniqueness)
Format Name
Start
Label
Other elements (Type, HLO, etc.)
Effective date
That would allow users to make their own modifications (assuming you trust them enough to add dataset name properly, anyway - or set up a stored proc to do the adding from a temp table that checked for conflicts) and allow you to handle format names that conflicted. You'd still have to have a way for the user to handle using multiple datasets, if that's necessary (such as by adding some unique element to the format name by default, like 'dataset ID').
In my mind, however, the best option is using a data dictionary to handle the metadata, which combines self-documentation with metadata management. Similar to above, you have a table with dataset and format elements, but add columns for descriptive text (question description, for example) and other useful information, depending on your use cases. This can be maintained in a database table or dataset, or perhaps more usefully in an Excel or similar document that can be shared with non-programmers and easily edited. I use this method for several projects, and it has paid off by allowing my users to help write the documentation for my code, keeping my programs accurate and up to date, while minimizing back-and-forth discussions of updates. I just import the spreadsheet and run a PROC FORMAT each time I run my data.
You can then have one spreadsheet per dataset, one tab, or one full spreadsheet with all datasets in them - whichever is easiest to use. This easily handles 'effective date' type issues as well - or even versioning, as that can be handled in the spreadsheet.