AWS Aurora ALTER TABLE not working

I'm trying to add a new column to a table that is about 20 GB using:
ALTER TABLE ... ALGORITHM=INPLACE
After about one hour of processing, the ALTER command fails and returns the following error without adding the column:
ERROR 1034 (HY000): Incorrect key file for table '[TABLE]'; try to repair it
Any idea why this is happening?

Seems to be an issue related to temporary disk space.
It's a known problem in Aurora: https://forums.aws.amazon.com/message.jspa?messageID=691512
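For reference, a sketch of the statement shape involved (table and column names are placeholders, not the OP's actual schema): even with ALGORITHM=INPLACE the table rebuild needs temporary space on the instance's local storage, so freeing local storage or moving to a larger instance class (which comes with more of it) is a common way around the error.
ALTER TABLE my_big_table
  ADD COLUMN new_col VARCHAR(255) NULL,
  ALGORITHM=INPLACE, LOCK=NONE;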

Related

HIVE_PARTITION_SCHEMA_MISMATCH in Athena due to different order in struct?

Full error
HIVE_PARTITION_SCHEMA_MISMATCH: There is a mismatch between the table and partition schemas.
The types are incompatible and cannot be coerced. The column 'ein_verification' in table 'dynamodb_etl_dev.widget_user_snapshots' is declared as type
'struct<status:string,unlocktimestamp:bigint,message:string,lastverifiedtimestamp:bigint,datelastverified:bigint>', but partition 'snapshot_time=2022-08-03T18%3A41' declared column 'ein_verification' as type
'struct<status:string,unlocktimestamp:bigint,lastverifiedtimestamp:bigint,message:string,datelastverified:bigint>'.
It looks like the only difference is that the order of message:string and lastverifiedtimestamp:bigint is reversed; they are otherwise the same.
I know there are settings for updating the table definition and updating existing partitions with metadata from the table, but I'd like to understand why this is happening and possibly prevent it from happening at all.
Also, it appears Athena is not querying the latest partition, as there is a newer partition with a more recent timestamp in this S3 bucket. I'm stuck on how to proceed: I can run this job once, get a single partition, and it works fine, but every time I have run it a second time so far, I get the struct-out-of-order error.
Found the answer to the partition error here. In particular, there's a comment on how to do it in Terraform that helped me get it running.
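For completeness, one manual workaround is to drop the mismatched partition and re-register it so it inherits the table-level schema. This is only a sketch (the S3 location is a placeholder); the crawler/Terraform setting mentioned above is the more durable fix:
ALTER TABLE dynamodb_etl_dev.widget_user_snapshots
DROP IF EXISTS PARTITION (snapshot_time = '2022-08-03T18:41');
ALTER TABLE dynamodb_etl_dev.widget_user_snapshots
ADD PARTITION (snapshot_time = '2022-08-03T18:41')
LOCATION 's3://my-bucket/widget_user_snapshots/snapshot_time=2022-08-03T18%3A41/';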

Error while saving transformation in pentaho spoon

I am getting the below error while saving a transformation in Pentaho Spoon:
Error saving transformation to repository!
Error updating batch
Cannot insert duplicate key row in object 'dbo.R_STEP_ATTRIBUTE' with unique index 'IDX_RSAT'. The duplicate key value is (2314, PARTITIONING_SCHEMA, 0).
Everything was working fine before I ran a job that creates multiple Excel files. While this job was running, a memory issue suddenly occurred and the job was aborted. After that I tried to save my file, but it was never actually saved, so I lost the job I created.
Please help me understand the reason.
The last save of the directory did not end gracefully.
There is a small chance that you can repair it by erasing the db cache file in the .kettle directory.
If that does not work, create a new repository and copy the current one into it using the global repository export/import. Then erase the old repository and repeat the process from the freshly rebuilt repository.
The intermediary repository may be file-based rather than database-based.
If it is the first time you do this, plan for one to two hours.
There is an easy way to recover from this.
As AlainD says, the problem occurs when you save or delete a transformation and suddenly lose the connection or hit a problem with Kettle.
When that happens, you will find a lot of orphaned step records in the table R_STEP_ATTRIBUTE. In the error shown, the [ID_TRANSFORMATION] is 2314.
So, if you check the table R_TRANSFORMATION for [ID_TRANSFORMATION] = 2314, you probably won't find any transformation with that ID.
After checking that, you can delete all the records related to that [ID_TRANSFORMATION], for example:
delete from R_STEP_ATTRIBUTE where ID_TRANSFORMATION=2314
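Before running that DELETE, a quick sanity check against the repository database can confirm the ID really is orphaned (a sketch using the ID from the error above):
SELECT * FROM R_TRANSFORMATION WHERE ID_TRANSFORMATION = 2314;
-- no rows returned means the R_STEP_ATTRIBUTE rows for 2314 are orphaned and safe to delete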
We just solved this issue by executing the following SQL statement:
DELETE
FROM R_STEP_ATTRIBUTE
WHERE ID_STEP NOT IN (SELECT ID_STEP FROM R_STEP)

Query hive table with Spark

I am a newbie to Apache Hive and Spark. I have some existing Hive tables sitting on my Hadoop server that I can query with HQL commands using hive or beeline, e.g., selecting the first 5 rows of my table. Instead of that, I want to use Spark to achieve the same goal. My Spark version on the server is 1.6.3.
Using the code below (I replaced my actual database and table names with database and table):
sc = SparkContext(conf = config)
sqlContext = HiveContext(sc)
query = sqlContext.createDataFrame(sqlContext.sql("SELECT * from database.table LIMIT 5").collect())
df = query.toPandas()
df.show()
I get this error:
ValueError: Some of types cannot be determined after inferring.
Error:root: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line string', (1, 0))
However, I can use beeline with same query and see the results.
After a day of googling and searching, I modified the code to:
table_ccx = sqlContext.table("database.table")
table_ccx.registerTempTable("temp")
sqlContext.sql("SELECT * FROM temp LIMIT 5").show()
Now the error is gone but all the row values are null except one or two dates and column names.
I also tried
sqlContext.refreshTable("database.table")
and it did not help. Is there a setting or configuration that I need to ask my IT team to do? I appreciate any help.
EDIT: Having said that, my Python code works for some of the tables on Hadoop. Could the problem be caused by certain entries in the table? If so, how come the corresponding beeline/Hive command works?
As it came out in the comments, straightening up the code a little bit makes the thing work.
The problem lies on this line of code:
query = sqlContext.createDataFrame(sqlContext.sql("SELECT * from database.table LIMIT 5").collect())
What you are doing here is:
asking Spark to query the data source (which creates a DataFrame),
collecting everything on the driver as a local collection, and
parallelizing the local collection back into Spark with createDataFrame.
In general the approach should work, although it's evidently unnecessarily convoluted.
The following will do:
query = sqlContext.sql("SELECT * from database.table LIMIT 5")
I'm not entirely sure why the original version breaks your code, but it does (as came out in the comments), and the simpler version also improves it.

Redshift copy command failure

I am using Amazon Redshift COPY command to insert new rows into a table.
The COPY command fails and the following error message comes up:
index "pg_toast_16408_index" is not a btree
I have noticed that the problem occurs because of a description field that contains a long string. When I try to copy without this field, it works!
Does anyone know why that is? How can I overcome this issue?
Use the TRUNCATECOLUMNS parameter:
Truncates data in columns to the appropriate number of characters so that it fits the column specification. Applies only to columns with a VARCHAR or CHAR data type, and rows 4 MB or less in size.
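A hedged sketch of how that might look in practice (table, bucket, and IAM role are placeholders; adjust the format options to match your files):
COPY my_table
FROM 's3://my-bucket/path/to/data/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
TRUNCATECOLUMNS;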

Redshift COPY command delimiter not found

I'm trying to load some text files to Redshift. They are tab delimited, except after the final row value. That's causing a delimiter not found error. I only see a way to set the field delimiter in the COPY statement, not a way to set a row delimiter. Any ideas that don't involve processing all my files to add a tab to the end of each row?
Thanks
I don't think the problem is with a missing <tab> at the end of lines. Are you sure that ALL lines have the correct number of fields?
Run the query:
select le.starttime, d.query, d.line_number, d.colname, d.value,
le.raw_line, le.err_reason
from stl_loaderror_detail d, stl_load_errors le
where d.query = le.query
order by le.starttime desc
limit 100
to get the full error report. It will show the filename with errors, the offending line number, and the error details.
This will help to find where the problem lies.
You can get the delimiter not found error if your row has fewer columns than expected. Some CSV generators may just output a single quote at the end if the last columns are null.
To solve this you can use the FILLRECORD parameter in the Redshift COPY options.
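For example, a hedged sketch for a tab-delimited load (table, bucket, and role names are placeholders):
COPY my_table
FROM 's3://my-bucket/path/to/data/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
DELIMITER '\t'
FILLRECORD;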
From my understanding, the Delimiter not found error message may also be caused by not specifying the COPY command correctly, in particular by not specifying the data format parameters: https://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html
In my case I was trying to load Parquet data with this expression:
COPY my_schema.my_table
FROM 's3://my_bucket/my/folder/'
IAM_ROLE 'arn:aws:iam::my_role:role/my_redshift_role'
REGION 'my-region-1';
and I received the Delimiter not found error message when looking into the system table stl_load_errors. But specifying that I'm dealing with Parquet data, like this:
COPY my_schema.my_table
FROM 's3://my_bucket/my/folder/'
IAM_ROLE 'arn:aws:iam::my_role:role/my_redshift_role'
FORMAT AS PARQUET;
solved my problem and I was able to correctly load the data.
I know this was answered, but I just dealt with the same error and I had a simple solution, so I'll share it.
This error can also be solved by stating the specific columns of the table that are copied from the S3 files (if you know which columns are in the data on S3).
In my case the data had fewer columns than the number of columns in the table.
Madahava's answer with the FILLRECORD option DID solve the issue for me, but then I noticed that a column that was supposed to be filled with default values remained null.
COPY <table> (col1, col2, col3) from 's3://somebucket/file' ...
This may not be directly related to the OP's question but I received the same Delimiter not found error which was caused by newline characters within one of the fields.
For any field that you think may have newline characters you can remove them with:
replace(my_field, chr(10), '')
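If the load file is itself produced by a SQL query, a hedged sketch of where that cleanup could go (table and column names are placeholders; chr(13) handles carriage returns the same way):
SELECT id,
       replace(description, chr(10), ' ') AS description
FROM source_table;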
When you send fewer fields than the destination table expects, it will also throw this error.
I'm sure there are multiple scenarios that would return this error. I just came across one that I don't see mentioned in the other answers while I was debugging someone else's code. The COPY had the EXPLICIT_IDS option listed, the table it was trying to import into had a column with a data type of identity(1,1), but the file it was trying to import into Redshift did not have an ID field. It made sense for me to add the identity field to the file. But, I imagine removing the EXPLICIT_IDS option would also have fixed the issue.
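A hedged sketch of the combination being described (names and types are placeholders): an identity column on the table plus EXPLICIT_IDS in the COPY means the file has to supply the id values itself.
CREATE TABLE my_table (
    id   INT IDENTITY(1,1),
    name VARCHAR(50)
);
COPY my_table
FROM 's3://my-bucket/data.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
CSV
EXPLICIT_IDS;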
So recently I came across this Delimiter not found error in Redshift SQL while loading data with the COPY command. In my case, the problem was with the column count.
I had created a table with 20 columns but I was loading the file with 21 columns.
I corrected it by adding the missing column to the table, then re-loaded the data, and boom, it worked.
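In case it helps, the fix can be as simple as adding the missing column before re-running the COPY (a sketch with placeholder names):
ALTER TABLE my_table ADD COLUMN extra_col VARCHAR(256);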
Hope it will be helpful to those who are facing the same kind of problem.
Ta-da
Sometimes this pops up when you don't specify the file type, for example CSV.
Ref: https://docs.aws.amazon.com/redshift/latest/dg/tutorial-loading-run-copy.html
copy "dev"."my"."table" from 's3://bucket/myfile_upload.csv' credentials 'aws_iam_role=arn:aws:iam::2112277888:role/RedshiftAccessRole' IGNOREHEADER 1 csv;