I am able to unload data to S3 and query the results with Spectrum, but NOT when using the delimiter defined below. This is our standard delimiter and it works with all of our existing Redshift COPY and UNLOAD processing, so I believe the UNLOAD itself is working fine. But somewhere between the table definition and the SQL query to retrieve the data, something breaks: we just receive NULLs for all of the fields. Can you look at our example below to determine next steps?
unload ('select * from db.test')
to 's3://awsbucketname/ap_cards/'
iam_role 'arn:aws:iam::123456789101:role/redshiftaccess'
delimiter '\325'
manifest;
CREATE EXTERNAL TABLE db_spectrum.test (
cost_center varchar(100) ,
fleet_service_flag varchar(1)
)
row format delimited
fields terminated by '\325'
stored as textfile
location 's3://awsbucketname/test/';
select * from db_spectrum.test
I got a response from the AWS Support Center:
Unfortunately you will need to either process the data externally to change the delimiter or UNLOAD the data again with a different delimiter.
The docs say to specify a single ASCII character for 'delimiter'.
The ASCII range only goes up to 177 in octal.
We will clarify the docs to note that 177 is the max permissible octal for a delimiter. I can confirm that this is the same in Athena as well.
Thank you for bringing this to our attention.
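Based on that guidance, a minimal sketch of the fix is to re-run the UNLOAD with a delimiter inside the permitted octal range and declare the same delimiter on the external table. Here I assume Ctrl-A (octal \001, Hive's default field terminator) and reuse the bucket, role, and table names from the example above, with the external table LOCATION assumed to point at the UNLOAD target:
unload ('select * from db.test')
to 's3://awsbucketname/ap_cards/'
iam_role 'arn:aws:iam::123456789101:role/redshiftaccess'
delimiter '\001'  -- single ASCII character; octal values only go up to \177
manifest;
CREATE EXTERNAL TABLE db_spectrum.test (
cost_center varchar(100),
fleet_service_flag varchar(1)
)
row format delimited
fields terminated by '\001'  -- must match the UNLOAD delimiter
stored as textfile
location 's3://awsbucketname/ap_cards/';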
You might try using Spectrify for this. It automates a lot of the nastiness currently involved in moving a Redshift table to Spectrum.
I have a file structure such as:
gs://BUCKET/Name/YYYY/MM/DD/Filename.csv
Every day my cloud functions create another path with another file in it, corresponding to that day's date; so for today, the 5th of August, we would have gs://BUCKET/Name/2022/08/05/Filename.csv.
I need to find a way to query this data in BigQuery automatically, so that if I want to query it for 'manual inspection' I can select, for example, data from all 3 months in one query by doing CREATE TABLE with gs://BUCKET/Name/2022/{06,07,08}/*/*.csv.
How can I achieve this? I know that BigQuery does not support more than one wildcard, but maybe there is a way around that.
To query data inside GCS from BigQuery you can use an external table. The problem is that the following will fail, because you cannot have a comma (,) as part of the URI list:
CREATE EXTERNAL TABLE `bigquerydevel201912.foobar`
OPTIONS (
format='CSV',
uris = ['gs://bucket/2022/{1,2,3}/data.csv']
)
You have to specify the 3 CSV file locations like this:
CREATE EXTERNAL TABLE `bigquerydevel201912.foobar`
OPTIONS (
format='CSV',
uris = [
'gs://inigo-test1/2022/1/data.csv',
'gs://inigo-test1/2022/2/data.csv',
'gs://inigo-test1/2022/3/data.csv']
)
Since you're using this sporadically, it probably makes more sense to create a temporary external table.
I found a solution that works, at least for my use case, without using an external table.
When creating the table in the BigQuery dataset, use "Create table from: Google Cloud Storage" and for the URI pattern use gs://BUCKET/Name/2022/*. As long as the filename is the same in each subfolder and the schema is identical, BigQuery will load everything, and you can then perform date operations directly in BigQuery (I have a column with the ingestion date).
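For reference, the same single-wildcard pattern can also be expressed in DDL as an external table rather than a load job. This is only a sketch: the project/dataset/table name is an assumption, and the column list is omitted as in the examples above so the schema is auto-detected:
CREATE EXTERNAL TABLE `myproject.mydataset.name_raw`
OPTIONS (
format='CSV',
-- a single trailing wildcard is allowed and covers every YYYY/MM/DD subfolder
uris = ['gs://BUCKET/Name/2022/*']
)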
I am currently utilizing the UNLOAD feature in AWS Athena, where I query something like:
UNLOAD (SELECT * FROM sometable) TO 's3://<location>' WITH (format = 'TEXTFILE', field_delimiter = ',')
This generates a bunch of .gz files. My issue is that all the null/empty values have been converted to \N. Is there a way to replace this with just an empty string?
I did notice that if I just do SELECT * FROM sometable, i.e. without using UNLOAD, the output is what I want (no \N). I would like to get similar results with UNLOAD, if it's possible. There seems to be a SERDEPROPERTIES setting called serialization.null.format (for creating a table), but I'm not sure how to use it with UNLOAD.
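One hedged workaround, assuming the unloaded columns are strings (or are cast to strings), is to replace the NULLs inside the SELECT itself so the TEXTFILE writer never sees them; the column names below are placeholders:
UNLOAD (
SELECT COALESCE(col1, '') AS col1,  -- hypothetical columns
COALESCE(col2, '') AS col2
FROM sometable
)
TO 's3://<location>'
WITH (format = 'TEXTFILE', field_delimiter = ',')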
I seemingly cannot get Athena partition projection to work.
When I add partitions the "old fashioned" way and then run a MSCK REPAIR TABLE testparts; I can query the data.
When I drop the table and recreate it with the partition projection settings below, it fails to return any data at all. The queries that do run take a very long time and return no results, or they time out like the query below.
For the sake of argument I followed the AWS documentation:
select distinct year from testparts;
I get:
HIVE_EXCEEDED_PARTITION_LIMIT: Query over table 'mydb.testparts' can potentially read more than 1000000 partitions.
I have ~7500 files in there at the moment in the file structures indicated in the table setups below.
I have:
Tried defining the date parts as a date-type projection with the format "yyyy-MM-dd", and it still did not work (including deleting and restructuring my S3 paths as well). I then split the dates into separate folders and defined them as integers (which you see below), and it still did not work.
Given that I can get it to work "manually" by repairing the table and then successfully querying my structures, I must be doing something wrong at a fundamental level with partition projection.
I have also changed user from the injected type to an enum (not ideal given that it's a plain old string, but I did it for testing purposes).
Table creation:
CREATE EXTERNAL TABLE `testparts`(
`thisdata` array<struct<thistype:string,amount:float,identifiers:map<string,struct<id:string,type:string>>,selections:map<int,array<int>>>> COMMENT 'from deserializer')
PARTITIONED BY (
`year` int,
`month` int,
`day` int,
`user` string,
`thisid` int,
`account` int)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION
's3://testoutputs/partitiontests/responses'
TBLPROPERTIES (
'classification'='json',
'projection.day.digits'='2',
'projection.day.range'='1,31',
'projection.day.type'='integer',
'projection.enabled'='true',
'projection.account.range'='1,50',
'projection.account.type'='integer',
'projection.month.digits'='2',
'projection.month.range'='1,12',
'projection.month.type'='integer',
'projection.thisid.range'='1,15',
'projection.thisid.type'='integer',
'projection.user.type'='enum',
'projection.user.values'='usera,userb',
'projection.year.range'='2017,2027',
'projection.year.type'='integer',
'storage.location.template'='s3://testoutputs/partitiontests/responses/year=${year}/month=${month}/day=${day}/user=${user}/thisid=${thisid}/account=${account}/',
'transient_lastDdlTime'='1653445805')
If you run a query like SELECT * FROM testparts, Athena will generate all permutations of possible values for the partition keys and list the corresponding locations on S3. For your table that means 11 × 12 × 31 × 2 × 15 × 50 = 6,138,000 listings.
I don't believe there is any optimization for SELECT DISTINCT year FROM testparts that would skip building the list of partition key values, so something similar would happen with that query too. Similarly, if you use "Preview table" to run SELECT * FROM testparts LIMIT 10, there is no optimization that skips building the list of partitions or skips listing the locations on S3.
Try running a query that doesn't wildcard any of the partition keys to validate that your config is correct.
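For example (the partition values here are only assumptions for illustration, and "user" is double-quoted as a precaution since it collides with keywords in some engines), pinning every partition key means Athena only has to list a single S3 prefix:
SELECT *
FROM testparts
WHERE year = 2022
AND month = 5
AND day = 25
AND "user" = 'usera'
AND thisid = 1
AND account = 1
LIMIT 10;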
Partition projection works differently from adding partitions to the catalog, and some care needs to be taken with wildcards. When partitions are in the catalog, non-existent partitions can be eliminated cheaply, but with partition projection S3 has to be listed for every permutation of partition keys that remains after predicates have been applied.
Partition projection works best when there are never wildcards on partition keys, to minimize the number of S3 listings that need to happen.
I am using an AWS Glue CSV crawler to crawl an S3 directory containing CSV files. The crawler works fine in the sense that it creates the schema with correct data types for each column; however, when I query the data from Athena, it doesn't show any values in the boolean-type column.
A csv looks like this:
"val","ts","cond"
"1.2841974","15/05/2017 15:31:59","True"
"0.556974","15/05/2017 15:40:59","True"
"1.654111","15/05/2017 15:41:59","True"
And the table created by the crawler is:
Column name Data type
val string
ts string
cond boolean
However, when I run say select * from <table_name> limit 10 it returns:
val ts cond
1 "1.2841974" "15/05/2017 15:31:59"
2 "0.556974" "15/05/2017 15:40:59"
3 "1.654111" "15/05/2017 15:41:59"
Does anyone have any idea what might be the reason?
I forgot to add: if I change the data type of the cond column to string, it does show the data as a string, e.g. "True" or "False".
I don't know why Glue classifies the cond column as boolean, because Athena will not understand that value as a boolean. I think this is a bug in Glue, or an artefact of it not targeting Athena exclusively. Athena expects boolean values to be either true or false. I don't remember if that includes different capitalizations of the strings or not, but either way yours will fail because they are quoted. The actual bug is that Glue has not configured your table so that it strips the quotes from the strings, and therefore Athena sees a boolean column containing "True" with quotes and all, and that is not a supported boolean value. Instead you get NULL values.
You could try changing your table to use the OpenCSVSerDe instead; it supports quoted values.
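A minimal sketch of what that might look like (the table name and S3 location are placeholders; note that OpenCSVSerDe reads every column as a string, so the flag can be interpreted in the query instead):
CREATE EXTERNAL TABLE my_csv_data (
val string,
ts string,
cond string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '"'
)
LOCATION 's3://my-bucket/path/'
TBLPROPERTIES ('skip.header.line.count'='1');
-- then, for example:
SELECT val, ts, lower(cond) = 'true' AS cond FROM my_csv_data LIMIT 10;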
It's surprising that Glue continues to stumble on basic things like this. Glue is unfortunately rarely worth the effort over writing some basic scripts yourself.
For context: I skimmed this previous question but was dissatisfied with the answer for two reasons:
I'm not writing anything in Python; in fact, I'm not writing any custom scripts for this at all as I'm relying on a crawler and not a Glue script.
The answer is not as complete as I require since it's just a link to some library.
I'm looking to leverage AWS Glue to accept some CSVs into a schema, and using Athena, convert that CSV table into multiple Parquet-formatted tables for ETL purposes. The data I'm working with has quotes embedded in it, which would be okay save for the fact that one record I have has a value of:
"blablabla","1","Freeman,Morgan","bla bla bla"
It seems that Glue is tripping over itself when it encounters the "Freeman,Morgan" piece of data.
If I use the standard Glue crawler, I get a table created with the LazySimpleSerDe, which truncates the record above in its column to:
"Freeman,
...which is obviously not desirable.
How do I force the crawler to output the file with the correct SerDe?
[Unpleasant] Constraints:
Looking to not accomplish this with a Glue script, since for that to work I believe I have to have a table beforehand, whereas the crawler will create the table on my behalf.
If I have to do this all through Amazon Athena, I'd feel like that would largely defeat the purpose but it's a tenable solution.
This is going to turn into a very dull answer, but apparently AWS provides its own set of rules for classifying whether a file is a CSV.
To be classified as CSV, the table schema must have at least two columns and two rows of data. The CSV classifier uses a number of heuristics to determine whether a header is present in a given file. If the classifier can't determine a header from the first row of data, column headers are displayed as col1, col2, col3, and so on. The built-in CSV classifier determines whether to infer a header by evaluating the following characteristics of the file:
Every column in a potential header parses as a STRING data type.
Except for the last column, every column in a potential header has content that is fewer than 150 characters. To allow for a trailing delimiter, the last column can be empty throughout the file.
Every column in a potential header must meet the AWS Glue regex requirements for a column name.
The header row must be sufficiently different from the data rows. To determine this, one or more of the rows must parse as other than STRING type. If all columns are of type STRING, then the first row of data is not sufficiently different from subsequent rows to be used as the header.
I believed that I had met all of these requirements, given that the column names are wildly divergent from the actual data in the CSV, and ideally there shouldn't be much of an issue there.
However, in spite of my belief that it would satisfy the AWS Glue regex (which I can't find a definition for anywhere), I elected to move away from commas to pipes as the delimiter. The data now loads as I expect it to.
Use glueContext.create_dynamic_frame_from_options() to convert the CSV to Parquet, and then run the crawler over the Parquet data.
df = glueContext.create_dynamic_frame_from_options("s3", {"paths": [src]}, format="csv")
The default separator is , and the default quoteChar is ". If you wish to change them, see https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format.html