I have a query:
copy quality_temp.temp_qa_leave_list_193de282ea194c169a97bb82a8fcc3b9 (exclusion_value)
from 's3://quality-staging1/quality-staging1/qa_leave/industry_copy_5eb3f0f88f5fe.csv'
iam_role ''
IGNOREBLANKLINES
delimiter '\^'
MAXERROR 1
But when I look at the Redshift console, the query shows as 'Aborted'.
I can't figure out why it isn't working. Can anyone help?
I'm trying to load a table, stored in S3 in Apache Iceberg format, into an EMR Spark cluster from the Glue catalog. The table is correctly created, because I can query it from AWS Athena. At cluster creation I set this configuration:
[{"classification":"iceberg-defaults","properties":{"iceberg.enabled":"true"}}]
I have tried running SQL queries from Spark against tables in other formats (CSV) and it works, but when I try to read Iceberg tables I get this error:
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table table_name. StorageDescriptor#InputFormat cannot be null for table: table_name(Service: null; Status Code: 0; Error Code: null; Request ID: null; Proxy: null)
This is the code in the notebook:
%%configure -f
{
"conf":{
"spark.sql.extensions":"org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
"spark.sql.catalog.dev":"org.apache.iceberg.spark.SparkCatalog",
"spark.sql.catalog.dev.type":"hadoop",
"spark.sql.catalog.dev.warehouse":"s3://pyramid-streetfiles-sbx/iceberg_test/"
}
}
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
import pyspark.sql.types as t
spark = SparkSession.builder.getOrCreate()
# This query works and shows the iceberg table I want to read
spark.sql("show tables from iceberg_test").show(truncate=False)
# Here shows the error
spark.sql("select * from iceberg_test.table_name limit 10").show(truncate=False)
How can I read Apache Iceberg tables in an EMR cluster with Spark and the Glue catalog?
You need to prefix the table with the Glue catalog name.
Example: glue_catalog.<your_database_name>.<your_table_name>
https://docs.aws.amazon.com/pt_br/glue/latest/dg/aws-glue-programming-etl-format-iceberg.html
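For illustration, a minimal sketch of the notebook setup based on the linked AWS documentation. The catalog name glue_catalog is just the label used in that doc, and the warehouse path simply reuses the one from the question; adjust both to your setup:
%%configure -f
{
"conf":{
"spark.sql.extensions":"org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
"spark.sql.catalog.glue_catalog":"org.apache.iceberg.spark.SparkCatalog",
"spark.sql.catalog.glue_catalog.catalog-impl":"org.apache.iceberg.aws.glue.GlueCatalog",
"spark.sql.catalog.glue_catalog.io-impl":"org.apache.iceberg.aws.s3.S3FileIO",
"spark.sql.catalog.glue_catalog.warehouse":"s3://pyramid-streetfiles-sbx/iceberg_test/"
}
}
# The table is then addressed through the catalog prefix instead of the bare database name.
spark.sql("select * from glue_catalog.iceberg_test.table_name limit 10").show(truncate=False)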
I'm testing some pyspark code in an EMR notebook before I deploy it and keep running into this strange error with Spark SQL. I have all my tables and metadata integrated with the AWS Glue catalog so that I can read and write to them through spark.
The first part of the code reads some data from S3/Glue, does some transformations and what not, then writes the resulting dataframe to S3/Glue like so:
df.repartition('datekey','coeff')\
.write\
.format('parquet')\
.partitionBy('datekey','coeff')\
.mode('overwrite')\
.option("path", S3_PATH)\
.saveAsTable('hive_tables.my_table')
I then try to access this table with Spark SQL, but when I run something as simple as
spark.sql('select * from hive_tables.my_table where datekey=20210506').show(),
it throws this:
An error was encountered:
"org.apache.hadoop.hive.metastore.api.InvalidObjectException: Unknown type : 'double' (Service: AWSGlue; Status Code: 400; Error Code: InvalidInputException; Request ID: 43ff3707-a44f-41be-b14a-7b9906d8d8f9; Proxy: null);"
Traceback (most recent call last):
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 778, in saveAsTable
self._jwrite.saveAsTable(name)
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: "org.apache.hadoop.hive.metastore.api.InvalidObjectException: Unknown type : 'double' (Service: AWSGlue; Status Code: 400; Error Code: InvalidInputException; Request ID: 43ff3707-a44f-41be-b14a-7b9906d8d8f9; Proxy: null);"
I've learned this happens only when specifying the datekey partition. For example, both of the following commands work fine:
spark.sql('select * from hive_tables.my_table where coeff=0.5').show() and
spark.sql('select * from hive_tables.my_table').show()
I've verified through Spark SQL that the partitions exist and have data in them. The datekey query also works fine through AWS Athena - just not Spark SQL.
Also Glue definitely has the two partition columns recognized:
datekey: int
coeff: double
Any ideas here? I've tried everything I can think of and it just isn't making any sense.
I had the same error in EMR 6.3.0 (Spark 3.1.1).
After upgrading to EMR 6.5.0 (Spark 3.1.2), it was solved.
I would still like a straight-forward solution to this, but currently this workaround suffices:
I first read the table straight from the S3 path
temp_df = spark.read.parquet(S3_PATH)
so that it doesn't use the Glue catalog as the metadata. Then I create a temp table for the session:
temp_df.createGlobalTempView('my_table')
which allows me to query it using Spark SQL with the global_temp database:
spark.sql('select * from global_temp.my_table where datekey=20210506').show()
and this works.
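Putting the workaround together, a self-contained sketch (S3_PATH is a hypothetical placeholder for the same output path used in the saveAsTable call above):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical placeholder: the same S3 location the table was written to.
S3_PATH = 's3://your-bucket/path/to/my_table'

# Read the parquet files directly, bypassing the Glue catalog for metadata.
temp_df = spark.read.parquet(S3_PATH)

# Register a temp view for this Spark application...
temp_df.createGlobalTempView('my_table')

# ...and query it through the global_temp database.
spark.sql('select * from global_temp.my_table where datekey=20210506').show()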
I had a similar issue in a similar environment (EMR cluster + Spark SQL + AWS Glue catalog). The query was like this:
select *
from ufd.core_agg_data
where year <> date_format(current_timestamp, 'yyyy')
This is a table partitioned by "year", and "year" is a string. Note that "year" is used in the filter.
I got
User class threw exception: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.metastore.api.InvalidObjectException: Unknown operator '!='
Then I "modified" the query to this one, and it worked!
select *
from ufd.core_agg_data
where year in (select date_format(current_timestamp, 'yyyy'))
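For reference, a small sketch of the same rewrite submitted through spark.sql; the table and column names come from the query above, and the session is assumed to already be configured against the Glue catalog:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Fails: Glue rejects the '<>' / '!=' operator in the pushed-down partition filter.
# spark.sql("select * from ufd.core_agg_data "
#           "where year <> date_format(current_timestamp, 'yyyy')").show()

# Works: the IN-subquery form avoids handing the unsupported operator to Glue.
spark.sql("select * from ufd.core_agg_data "
          "where year in (select date_format(current_timestamp, 'yyyy'))").show()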
I have Python 3.6.8 on GNU/Linux 3.10 on GCP, and I'm trying to load data from Hive into CloudSQL.
# Assumed imports for this snippet; quote is taken to be shlex.quote.
import subprocess
import sys
from shlex import quote

gc_cmd_import_csv_p1 = subprocess.Popen(['gcloud', 'sql', 'import', 'csv',
                                         '{}'.format(quote(cloudsql_instance)),
                                         '{}'.format(quote(load_csv_files)),
                                         '--database={}'.format(quote(cloudsql_db)),
                                         '--table={}'.format(quote(cloudsql_table_name)),
                                         '--user={}'.format(quote(db_user_name)),
                                         '--quiet'],
                                        stdout=subprocess.PIPE,
                                        stderr=subprocess.PIPE,
                                        universal_newlines=True)
import_cmd_op, import_cmd_error = gc_cmd_import_csv_p1.communicate()
import_cmd_return_code = gc_cmd_import_csv_p1.returncode
if import_cmd_return_code:
    print("""[ERROR] Unable to import data from Hive to CloudSQL.
    Error description: {}
    Error Code(s): {}
    Issue file name: {}
    """.format(import_cmd_error, import_cmd_return_code, load_csv_files))
    sys.exit(9)
print("[INFO] Data Import completed from HIVE to CloudSQL.")
In case of any error above, I get a message like:
Error description: ERROR: (gcloud.sql.import.csv) HTTPError 403: The client is not authorized to make this request.Error Code(s): 1
But when I actually run the same import command directly as shown below:
gcloud sql import csv test-cloud-sql-instance gs://test-server-12345/app1/data/lookup_table/000000_0 --database=test_db --table=name_lookup --user=test_user --quiet
I'm getting the actual error like below:
ERROR: (gcloud.sql.import.csv) [ERROR_RDBMS] ERROR: extra data after last expected column CONTEXT: COPY name_lookup, line 16902:
I want this message (Extra data after last expected column... line 16902:) to be shown in the Python script instead of the HTTPError 403 error. How can I capture that?
Please note: there is no authentication issue, despite what the HTTP error suggests.
So after a long discussion with our GCP admin, we found the issue.
We tried executing the same import command using os.system() and again got the HTTP error. The admin then revisited the GCP IAM documentation and created a role for the P-SQL user. The issue is resolved now.
I have added a delimiter ',' but I am still getting an error.
Code:
"copy %s.%s_tmp
from '%s'
CREDENTIALS 'aws_access_key_id=%s;aws_secret_access_key=%s'
REMOVEQUOTES
ESCAPE
ACCEPTINVCHARS
ENCODING AS UTF8
DELIMITER ','
GZIP
ACCEPTANYDATE
region '%s'"
% (schema, table, s3_path, access_key, secret_key, region)
Error:
InternalError: Load into table 'my_table' failed. Check 'stl_load_errors' system table for details.
In the stl_load_errors table in Redshift, the error is 'Delimiter not found'.
How can I fix this?
One of the raw lines is in this format:
1122,"",4332345,"2016-07-28 15:00:09","2032-09-28
15:00:09",19.00,"","some string","","som string","abc","abc","abc"
Try using the MAXERROR parameter in the COPY command. It will let the load partially succeed even if some records are in error.
Also try this version of COPY, with an explicit column list:
copy tblname(col1,col2,col3...) from s3 path
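For illustration, a hedged sketch of that suggestion in the same Python string style as the question; col1, col2, col3 stand in for the real column list, and MAXERROR 10 is an arbitrary threshold:
copy_sql = """copy %s.%s_tmp (col1, col2, col3)
from '%s'
CREDENTIALS 'aws_access_key_id=%s;aws_secret_access_key=%s'
REMOVEQUOTES
ESCAPE
ACCEPTINVCHARS
ENCODING AS UTF8
DELIMITER ','
GZIP
ACCEPTANYDATE
MAXERROR 10
region '%s'""" % (schema, table, s3_path, access_key, secret_key, region)
Rows that still fail (up to the MAXERROR limit) are recorded in stl_load_errors, which shows the offending column and the raw field value.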
I am trying to copy a text file from S3 to Redshift using the command below, but I am getting this error:
Error:
Missing newline: Unexpected character 0xffffffe2 found at location 177
copy table from 's3://abc_def/txt_006'
credentials '1234567890'
DELIMITER '|'
NULL AS 'NULL'
NULL AS '' ;
The text file has no header and the field delimiter is |.
I tried passing the ACCEPTINVCHARS parameter, but Redshift shows the same error, with error code 1216: invalid input line.
Can anyone advise how to resolve this issue?
Thanks in advance.
Is your file in UTF-8 format? If not, convert it and try reloading.
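A rough sketch of that conversion in Python; 'latin-1' is only a guess at the file's current encoding, so replace it with whatever encoding the file was actually exported in:
# Re-encode a local copy of the file as UTF-8 before uploading it to S3 for the COPY.
src_path = 'txt_006'       # hypothetical local copy of the file
dst_path = 'txt_006_utf8'

with open(src_path, 'r', encoding='latin-1') as src, \
        open(dst_path, 'w', encoding='utf-8') as dst:
    for line in src:
        dst.write(line)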
I am assuming the path to the text file is correct, and that you generated the text file with some tool and uploaded it to S3 manually.
I faced the same issue, and the problem was whitespace. I recommend generating the text file after nulling and trimming the whitespace.
Your query should be select RTRIM(LTRIM(NULLIF({columnname}, ''))), ..., from {table}; generate the output of this query into the text file.
If you are using SQL Server, export the table using BCP.exe, passing the above query with all the columns and functions.
Then, after uploading the txt file to S3, use the COPY command below:
copy {table}
from 's3://{path}.txt'
access_key_id '{value}'
secret_access_key '{value}' -- you can alternatively use CREDENTIALS as mentioned above
delimiter '|' COMPUPDATE ON
removequotes
acceptinvchars
emptyasnull
trimblanks
BLANKSASNULL
FILLRECORD
;
commit;
This solved my problem. Please let us know if you are facing anything else.