What happens when, after creating a table in AWS Athena for files on S3, the structure of the files on S3 changes?
For example:
If the files had 5 columns when the table was created and new files later start arriving with 1 extra column:
a) at the end?
b) in between?
What happens when some columns are not available in new files?
What happens when the columns remain the same but the column order changes?
Can we alter Athena tables to adjust to these changes?
1 - Athena is not a NoSQL solution, and it does not have a dynamic schema either. If you change the schema, all your files in a particular folder should reflect that change; Athena won't magically update the table to include the new column.
2 - Then it'll be a problem and it'll break. You should include NULL or an empty field (,,) in the new files to keep the columns aligned.
3 - Athena reads the data by column order, not really by name. If your column order changes, it'll probably break (the types will no longer line up).
4 - Yes. You can always recreate an Athena table easily by dropping it and creating a new one.
If you have files with varying schemas, you should put them into different folders so that each folder represents one consistent schema. You can then unify this later on in Athena with a union or similar to create a condensed, simplified table that you can apply the consistent schema to.
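A minimal sketch of that union approach, assuming hypothetical tables events_v1 (columns c1-c5, pointing at the folder with the old files) and events_v2 (columns c1-c6, pointing at the folder with the new files):

-- View that presents both folder-specific tables with one consistent schema
CREATE OR REPLACE VIEW events_unified AS
SELECT c1, c2, c3, c4, c5, CAST(NULL AS varchar) AS c6
FROM events_v1
UNION ALL
SELECT c1, c2, c3, c4, c5, c6
FROM events_v2;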
It depends on the file format you are using and the setup (whether the schema is read by field order or by field name). All the details are here: https://docs.aws.amazon.com/athena/latest/ug/handling-schema-updates-chapter.html
Take note that if the data is nested or in arrays, schema updates will completely break your data; to quote from that page:
Schema updates described in this section do not work on tables with complex or nested data types, such as arrays and structs.
Workflow
In a data import workflow, we create a staging table using a CREATE TABLE LIKE statement.
CREATE TABLE abc_staging (LIKE abc INCLUDING DEFAULTS);
Then, we run COPY to import CSV data from S3 into the staging table.
The data in the CSV is incomplete. Namely, the fields partition_0, partition_1 and partition_2 are missing from the CSV file; we fill them in like this:
UPDATE
abc_staging
SET
partition_0 = 'BUZINGA',
partition_1 = '2018',
partition_2 = '07';
Problem
This query is expensive (it often takes ≈20 minutes), and I would like to avoid it. That would be possible if I could configure DEFAULT values on these columns when creating the abc_staging table. I did not find any way to do that, nor any explicit indication that it is impossible. So perhaps it is still possible and I am just missing how to do it?
Alternative solutions I considered
Drop these columns and add them again
That would be easy to do, but ALTER TABLE ADD COLUMN only adds columns to the end of the column list. In the abc table they are not at the end of the column list, which means the schemas of abc and abc_staging would mismatch. That breaks the ALTER TABLE APPEND operation that I use to move data from the staging table to the main table.
Note: reordering the columns in the abc table to alleviate this difficulty would require recreating the huge abc table, which I'd like to avoid.
Generate the staging table creation script programmatically with proper columns and get rid of CREATE TABLE LIKE
I will have to do that if I do not find any better solution.
Fill in the partition_* fields in the original CSV file
That is possible but would break backwards compatibility (I already have perhaps hundreds of thousands of files in there). Harder but manageable.
As you are finding, you are not creating a table exactly LIKE the original, and Redshift doesn't let you ALTER a column's default value. Your proposed path is likely the best (define the staging table explicitly).
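A minimal sketch of such an explicit definition, with hypothetical names and types standing in for abc's real columns and a placeholder bucket/IAM role; the idea is that the partition_* columns get constant DEFAULTs, and a COPY that lists only the columns present in the CSV leaves the omitted columns at those defaults:

CREATE TABLE abc_staging (
    id          BIGINT,
    payload     VARCHAR(1024),
    partition_0 VARCHAR(64) DEFAULT 'BUZINGA',
    partition_1 VARCHAR(8) DEFAULT '2018',
    partition_2 VARCHAR(8) DEFAULT '07'
);

-- the partition_* columns are omitted from the column list, so they take their defaults
COPY abc_staging (id, payload)
FROM 's3://your-bucket/your-prefix/'
IAM_ROLE 'arn:aws:iam::123456789012:role/your-copy-role'
FORMAT AS CSV;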
Since I don't know your exact situation, other paths might be better, so let me explore a bit. First off, when you UPDATE the staging table you are in fact reading every row in the table, invalidating that row, and writing a new row (with the new information) at the end of the table. This leads to a lot of invalidated rows. When you then do ALTER TABLE APPEND, all these invalidated rows are added to your main table, unless you vacuum the staging table beforehand. So you may not be getting the value you want out of ALTER TABLE APPEND.
You may be better off INSERTing the data into your main table with an ORDER BY clause. This is slower than the ALTER TABLE APPEND statement, but you won't have to do the UPDATE, so the overall process could be faster, and you could come out further ahead because of the reduced need to VACUUM. Your situation will determine whether this is better or not. Just another option for your list.
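A rough sketch of that INSERT-based path, again with hypothetical column names; the partition values are supplied as constants in the SELECT so the UPDATE is no longer needed, and the ORDER BY would ideally match abc's sort key:

INSERT INTO abc (id, payload, partition_0, partition_1, partition_2)
SELECT id, payload, 'BUZINGA', '2018', '07'
FROM abc_staging
ORDER BY id;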
I am curious about your UPDATE speed. This just needs to read and then write every row in the staging table. Unless the staging table is very large it doesn't seem like this should take 20 min. Other activity could be creating this slowdown. Just curious.
Another option would be to change your main table to have these 3 columns last (yes this would be some work). This way you could add the columns to the staging table and things would line up for ALTER TABLE APPEND. Just another possibility.
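If you did reorder abc that way, the per-import steps might look roughly like this (hypothetical types; Redshift lets you supply a DEFAULT when adding a column, which also fills the rows already loaded by COPY):

ALTER TABLE abc_staging ADD COLUMN partition_0 VARCHAR(64) DEFAULT 'BUZINGA';
ALTER TABLE abc_staging ADD COLUMN partition_1 VARCHAR(8) DEFAULT '2018';
ALTER TABLE abc_staging ADD COLUMN partition_2 VARCHAR(8) DEFAULT '07';

-- with the main table's columns reordered so these three come last, the schemas line up
ALTER TABLE abc APPEND FROM abc_staging;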
The easiest solution turned out to be adding the necessary partition_* fields to the source CSV files.
After making that change and removing the UPDATE from the importer pipeline, performance improved greatly. Imports now take ≈10 minutes each in total (that encompasses COPY, DELETE of duplicates and ALTER TABLE APPEND).
Disk space is no longer climbing up to 100%.
Thanks, everyone, for the help!
The main question:
I can't seem to find definitive info about how $path works when used in a where clause in Athena.
select * from <database>.<table> where "$path" = 'known/path/'
Given a table definition at the top level of a bucket, if there are no partitions specified but the bucket is organized using prefixes, does it scan the whole table? Or does it limit the scan to the specified path in a similar way to partitions? Any reference to an official statement on this?
The specific case:
I have information being stored in S3; this information needs to be counted and queried once or twice a day. The prefixes are two different IDs (s3://bucket/IDvalue1/IDvalue2/) and then the file with the relevant data. On a given day any number of new folders might be created (on busy days it could be tens of thousands), or new files might be added to existing prefixes. So maintaining the partition catalog up to date seems a little complicated.
One proposed approach to avoid partitions is using $path when getting data for a known combination of IDs, but I cannot seem to find out whether such an approach would actually limit the amount of data scanned per query. I read a comment saying it does not, but I cannot find it in the documentation and was wondering if anyone knows how it works and can point to the proper reference.
So far googling and reading the docs has not clarified this.
Athena does not have any optimisation for limiting the files scanned when using $path in a query. You can verify this for yourself by running SELECT * FROM some_table and SELECT * FROM some_table WHERE "$path" = '…' and comparing the bytes scanned: they will be the same (if there were an optimisation they would differ, assuming there is more than one file, of course).
See Query by "$path" field and Athena: $path vs. partition
For your use case I suggest using partition projection with the injected type. This way you can limit the prefixes on S3 that Athena will scan, while at the same time not have to explicitly add partitions.
You could use something like the following table properties to set it up (use the actual column names in place of id_col_1 and id_col_2, obviously):
CREATE EXTERNAL TABLE some_table
…
TBLPROPERTIES (
"projection.enabled" = "true",
"projection.id_col_1.type" = "injected",
"projection.id_col_2.type" = "injected",
"storage.location.template" = "s3://bucket/${id_col_1}/${id_col_2}/"
)
Note that when querying a table that uses partition projection with the injected type, all queries must contain explicit values for the projected columns, for example:
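A query against such a table might look like the following (hypothetical ID values); with the injected type, Athena combines these equality conditions with storage.location.template so it only scans s3://bucket/IDvalue1/IDvalue2/ rather than the whole bucket:

SELECT COUNT(*)
FROM some_table
WHERE id_col_1 = 'IDvalue1'
  AND id_col_2 = 'IDvalue2';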
Hi, I have a bunch of CSVs located in S3 and a crawler set up via AWS Glue. The crawler builds about 10 tables as it scans 10 folders, and in only 1 of them the headers are not being detected. The structure of the CSV is the same as all the others. Advice please?
The AWS Glue crawler interprets the header based on multiple rules. If the first line in your file doesn't satisfy those rules, the crawler won't detect the first line as a header and you will need to set it manually. It's a very common problem, and we integrated a fix for this within our own code as part of our data pipeline.
Excerpt from the AWS documentation:
To be classified as CSV, the table schema must have at least two columns and two rows of data. The CSV classifier uses a number of heuristics to determine whether a header is present in a given file. If the classifier can't determine a header from the first row of data, column headers are displayed as col1, col2, col3, and so on. The built-in CSV classifier determines whether to infer a header by evaluating the following characteristics of the file:
- Every column in a potential header parses as a STRING data type.
- Except for the last column, every column in a potential header has content that is fewer than 150 characters. To allow for a trailing delimiter, the last column can be empty throughout the file.
- Every column in a potential header must meet the AWS Glue regex requirements for a column name.
- The header row must be sufficiently different from the data rows. To determine this, one or more of the rows must parse as other than STRING type. If all columns are of type STRING, then the first row of data is not sufficiently different from subsequent rows to be used as the header.
You can also create the table yourself and, instead of pointing the crawler at an S3 path, crawl based on the existing table. That is the approach used when a crawler is not detecting the schema correctly, especially just the column headings.
Also check whether skip.header.line.count=1 is being added automatically; if not, you can add it manually and then update the schema to the correct one you require. On subsequent runs of your crawler, you can change its properties so that it ignores schema updates and only performs partition updates to your table.
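For instance, a minimal sketch of setting that property via Athena DDL, assuming the table is in the Glue Data Catalog and is named my_csv_table (a hypothetical name); the same property can also be edited on the table in the Glue console:

ALTER TABLE my_csv_table SET TBLPROPERTIES ('skip.header.line.count' = '1');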
You could use a custom classifier on your crawler to solve this problem: https://docs.aws.amazon.com/glue/latest/dg/custom-classifier.html
Normally, choosing Has headings in the Column headings section of the classifier options will do the trick; if not, it may be necessary to enter a list of headings in the text box provided for that purpose.
Because your columns are all classified as strings, it's likely that the header row violates those rules. In my case, I had a column name that was greater than 150 characters, so Glue read the first row as data, as opposed to a header, and then assumed all columns were strings.
I have an Athena table of data in S3 that acts as a source table, with columns id, name, event. For every unique name value in this table, I would like to output a new table with all of the rows corresponding to that name value, and save it to a different bucket in S3. This will result in n new files stored in S3, where n is the number of unique name values in the source table.
I have tried single Athena queries in Lambda using PARTITION BY and CTAS queries, but can't seem to get the result I want. It seems that AWS Glue may be able to produce my expected result, but I've read online that it's more expensive, and that perhaps I may be able to get the same result using Lambda.
How can I store a new file (JSON format, preferably) that contains all rows corresponding to each unique name in S3?
Preferably I would run this once a day to update the data stored by name, but the question above is the main concern for now.
When you write your Spark/Glue code you will need to partition the data by the name column. However, this will result in paths having the format below:
s3://bucketname/folder/name=value/file.json
This should give a separate set of files for each name value, but if you want to access each one as a separate table you might need to get rid of that = sign from the key before you crawl the data and make it available via Athena.
If you do use a Lambda, the operation involves going through the data and partitioning it, similar to what Glue does.
I guess it all depends on the volume of data that needs to be processed. Glue, if using Spark, may have a little bit of extra start-up time; Glue Python shell jobs have comparatively better start-up times.
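Since the question mentions CTAS: a hedged sketch of how a partitioned CTAS in Athena could produce one set of JSON files per name (hypothetical table names and output bucket). The partition column must come last in the SELECT, Athena still writes name=value prefixes, and a single CTAS can only create a limited number of partitions, so a very large number of distinct names may need batching with INSERT INTO:

CREATE TABLE source_by_name
WITH (
    format = 'JSON',
    external_location = 's3://output-bucket/by-name/',
    partitioned_by = ARRAY['name']
) AS
SELECT id, event, name
FROM source_table;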
For context: I skimmed this previous question but was dissatisfied with the answer for two reasons:
I'm not writing anything in Python; in fact, I'm not writing any custom scripts for this at all as I'm relying on a crawler and not a Glue script.
The answer is not as complete as I require since it's just a link to some library.
I'm looking to leverage AWS Glue to accept some CSVs into a schema, and using Athena, convert that CSV table into multiple Parquet-formatted tables for ETL purposes. The data I'm working with has quotes embedded in it, which would be okay save for the fact that one record I have has a value of:
"blablabla","1","Freeman,Morgan","bla bla bla"
It seems that Glue is tripping over itself when it encounters the "Freeman,Morgan" piece of data.
If I use the standard Glue crawler, I get a table created with the LazySimpleSerDe, which truncates the record above in its column to:
"Freeman,
...which is obviously not desirable.
How do I force the crawler to output the file with the correct SerDe?
[Unpleasant] Constraints:
Looking to not accomplish this with a Glue script, since for that to work I believe I have to have a table beforehand, whereas the crawler will create the table on my behalf.
If I have to do this all through Amazon Athena, I'd feel like that would largely defeat the purpose but it's a tenable solution.
This is going to turn into a very dull answer, but apparently AWS provides its own set of rules for classifying whether a file is a CSV:
To be classified as CSV, the table schema must have at least two columns and two rows of data. The CSV classifier uses a number of heuristics to determine whether a header is present in a given file. If the classifier can't determine a header from the first row of data, column headers are displayed as col1, col2, col3, and so on. The built-in CSV classifier determines whether to infer a header by evaluating the following characteristics of the file:
- Every column in a potential header parses as a STRING data type.
- Except for the last column, every column in a potential header has content that is fewer than 150 characters. To allow for a trailing delimiter, the last column can be empty throughout the file.
- Every column in a potential header must meet the AWS Glue regex requirements for a column name.
- The header row must be sufficiently different from the data rows. To determine this, one or more of the rows must parse as other than STRING type. If all columns are of type STRING, then the first row of data is not sufficiently different from subsequent rows to be used as the header.
I believed that I had met all of these requirements, given that the column names are wildly divergent from the actual data in the CSV, and ideally there shouldn't be much of an issue there.
However, in spite of my belief that it would satisfy the AWS Glue regex (which I can't find a definition for anywhere), I elected to move away from commas and to pipes instead. The data now loads as I expect it to.
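If switching delimiters is not an option, another route that is often suggested (not what the answer above did) is to skip the crawler for this table and define it by hand with the OpenCSVSerDe, which handles quoted fields that contain the delimiter, so a value like "Freeman,Morgan" stays in one column. A rough sketch with hypothetical column names and location:

CREATE EXTERNAL TABLE quoted_csv_table (
    col_a string,
    col_b string,
    actor string,
    col_d string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
    'separatorChar' = ',',
    'quoteChar' = '"'
)
LOCATION 's3://your-bucket/your-prefix/'
TBLPROPERTIES ('skip.header.line.count' = '1');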
Use glueContext.create_dynamic_frame_from_options() when converting the CSV to Parquet, and then run the crawler over the Parquet data.
df = glueContext.create_dynamic_frame_from_options("s3", {"paths": [src]}, format="csv")
The default separator is , and the default quoteChar is ". If you wish to change them (via the format_options parameter), check https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format.html