I am running the following code to get the number of records in a parquet file placed inside an S3 bucket.
import boto3
import os

s3 = boto3.client('s3')

sql_stmt = """SELECT count(*) FROM s3object s"""

req_fact = s3.select_object_content(
    Bucket='test_hadoop',
    Key='counter_db.cm_workload_volume_sec.dt=2023-01-23.cm_workload_volume_sec+2+000000347262.parquet',
    ExpressionType='SQL',
    Expression=sql_stmt,
    InputSerialization={'Parquet': {}},
    OutputSerialization={'JSON': {}})

for event in req_fact['Payload']:
    if 'Records' in event:
        print(event['Records']['Payload'].decode('utf-8'))
    elif 'Stats' in event:
        print(event['Stats'])
However, I get this error: botocore.exceptions.ClientError: An error occurred (XNotImplemented) when calling the SelectObjectContent operation: This node does not support SelectObjectContent.
What is the issue?
I ran your code against a known good (uncompressed) parquet file with no errors.
resp = s3.select_object_content(
    Bucket='my-test-bucket',
    Key='counter_db.cm_workload_volume_sec.dt=2023-01-23.cm_workload_volume_sec+2+000000347262.parquet',
    ExpressionType='SQL',
    Expression="SELECT count(*) FROM s3object s",
    InputSerialization={'Parquet': {}},
    OutputSerialization={'JSON': {}},
)
Output:
{"_1":4}
In the AWS console you can navigate to the S3 bucket, find the file, select it, and run S3 Select on it (Actions > Query with S3 Select). That lets you validate that the file can be queried with S3 Select (which I think is your issue here).
Note the following: Amazon S3 Select does not support whole-object compression for Apache Parquet objects.
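If you want to do a similar sanity check from Python, here is a minimal sketch, assuming pyarrow is installed and you have read access to the object (the bucket/key are the ones from the question and the region is a placeholder, so adjust as needed). It reads only the Parquet footer and reports the row count plus the column-chunk compression, which helps confirm the object is a plain Parquet file rather than a whole-object-compressed one:

import pyarrow.fs as pafs
import pyarrow.parquet as pq

# Adjust region/credentials for your environment.
fs = pafs.S3FileSystem(region="us-east-1")
path = ("test_hadoop/"
        "counter_db.cm_workload_volume_sec.dt=2023-01-23."
        "cm_workload_volume_sec+2+000000347262.parquet")

with fs.open_input_file(path) as f:
    meta = pq.ParquetFile(f).metadata
    print("rows:", meta.num_rows)
    # Column-chunk compression (e.g. SNAPPY/GZIP) is supported by S3 Select;
    # whole-object compression (a gzipped .parquet file) is not.
    print("compression:", meta.row_group(0).column(0).compression)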
I'm trying to load a table in Apache Iceberg format, stored in S3 and registered in the Glue catalog, from Spark on an EMR cluster. The table is correctly created, because I can query it from AWS Athena. On cluster creation I set this configuration:
[{"classification":"iceberg-defaults","properties":{"iceberg.enabled":"true"}}]
I have tried running SQL queries from Spark against tables in other formats (CSV) and it works, but when I try to read Iceberg tables I get this error:
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table table_name. StorageDescriptor#InputFormat cannot be null for table: table_name(Service: null; Status Code: 0; Error Code: null; Request ID: null; Proxy: null)
This is the code in the notebook:
%%configure -f
{
    "conf": {
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        "spark.sql.catalog.dev": "org.apache.iceberg.spark.SparkCatalog",
        "spark.sql.catalog.dev.type": "hadoop",
        "spark.sql.catalog.dev.warehouse": "s3://pyramid-streetfiles-sbx/iceberg_test/"
    }
}
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
import pyspark.sql.types as t
spark = SparkSession.builder.getOrCreate()
# This query works and shows the iceberg table I want to read
spark.sql("show tables from iceberg_test").show(truncate=False)
# Here shows the error
spark.sql("select * from iceberg_test.table_name limit 10").show(truncate=False)
How can I read Apache Iceberg tables on an EMR cluster with Spark and the Glue catalog?
You need to qualify the table name with the Glue catalog name.
Example: glue_catalog.<your_database_name>.<your_table_name>
https://docs.aws.amazon.com/pt_br/glue/latest/dg/aws-glue-programming-etl-format-iceberg.html
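For illustration, a minimal sketch of what that can look like in the notebook, based on the Iceberg/Glue catalog properties described in the documentation linked above; the catalog name glue_catalog is just a label, and the warehouse path shown is the one from the question, so adjust both to your setup:

%%configure -f
{
    "conf": {
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        "spark.sql.catalog.glue_catalog": "org.apache.iceberg.spark.SparkCatalog",
        "spark.sql.catalog.glue_catalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        "spark.sql.catalog.glue_catalog.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        "spark.sql.catalog.glue_catalog.warehouse": "s3://pyramid-streetfiles-sbx/iceberg_test/"
    }
}

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Reference the table through the Glue-backed catalog prefix.
spark.sql("select * from glue_catalog.iceberg_test.table_name limit 10").show(truncate=False)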
I have Python 3.6.8 on GNU/Linux 3.10 on GCP and I'm trying to load data from Hive to CloudSQL.
import subprocess
import sys
from shlex import quote

gc_cmd_import_csv_p1 = subprocess.Popen(['gcloud', 'sql', 'import', 'csv',
                                         '{}'.format(quote(cloudsql_instance)),
                                         '{}'.format(quote(load_csv_files)),
                                         '--database={}'.format(quote(cloudsql_db)),
                                         '--table={}'.format(quote(cloudsql_table_name)),
                                         '--user={}'.format(quote(db_user_name)),
                                         '--quiet'],
                                        stdout=subprocess.PIPE,
                                        stderr=subprocess.PIPE,
                                        universal_newlines=True)
import_cmd_op, import_cmd_error = gc_cmd_import_csv_p1.communicate()
import_cmd_return_code = gc_cmd_import_csv_p1.returncode

if import_cmd_return_code:
    print("""[ERROR] Unable to import data from Hive to CloudSQL.
    Error description: {}
    Error Code(s): {}
    Issue file name: {}
    """.format(import_cmd_error, import_cmd_return_code, load_csv_files))
    sys.exit(9)

print("[INFO] Data Import completed from HIVE to CloudSQL.")
In case of any error above, I get a message like this:
Error description: ERROR: (gcloud.sql.import.csv) HTTPError 403: The client is not authorized to make this request.
Error Code(s): 1
But when I actually run the same import command directly as shown below:
gcloud sql import csv test-cloud-sql-instance gs://test-server-12345/app1/data/lookup_table/000000_0 --database=test_db --table=name_lookup --user=test_user --quiet
I get the actual error, shown below:
ERROR: (gcloud.sql.import.csv) [ERROR_RDBMS] ERROR: extra data after last expected column CONTEXT: COPY name_lookup, line 16902:
I want this message (extra data after last expected column... line 16902:) to be shown in the Python script instead of the HTTPError 403 error. How can I capture that?
Please note: there is no authentication issue, despite what the HTTP error suggests.
So after a long discussion with the GCP admin, we found the issue.
We tried to execute the same import command using os.system() and again got the HTTP error. The admin then revisited the GCP IAM documentation and created a role for the P-SQL user. The issue is resolved now.
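As a side note on debugging this kind of mismatch (the interactive command works but the scripted one returns HTTPError 403): it is often because the script runs under a different identity than the interactive shell. A small, hedged sketch of how to check that from the script itself, assuming gcloud is on the PATH:

import subprocess

# Print the account the scripted gcloud calls run as, so it can be compared
# with the account used in the interactive shell where the command succeeds.
active_account = subprocess.run(
    ['gcloud', 'config', 'get-value', 'account'],
    stdout=subprocess.PIPE, stderr=subprocess.PIPE, universal_newlines=True
).stdout.strip()
print("[INFO] gcloud active account:", active_account)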
The problem:
I'm trying to get Cassandra to work properly with Python. I've been using a toy dataset to practice uploading a CSV file into Cassandra, with no luck.
Cassandra seems to work fine when I am not using COPY FROM for csv files.
My intention is to use this dataset as a test to make sure that I can load a csv file's information into Cassandra, so I can then load 5 csv files totaling 2 GB into it for my originally intended project.
Note: Whenever I use CREATE TABLE and then run SELECT * FROM tvshow_data, the columns don't appear in the order that I set them. Is this going to affect anything, or does it not matter?
Info about my installations and usage:
I've tried running both cqlsh and cassandra with an admin powershell.
I have Python 2.7 installed inside of the apache-cassandra-3.11.6 folder.
I have Cassandra version 3.11.6 installed.
I have cassandra-driver 3.18.0 installed, with conda.
I use Python 3.7 for everything other than Cassandra's directory.
I have tried both CREATE TABLE tvshow and CREATE TABLE tvshow.tvshow_data.
My Python script:
from cassandra.cluster import Cluster

cluster = Cluster()
session = cluster.connect()

create_and_add_file_to_tvshow = [
    "DROP KEYSPACE tvshow;",
    "CREATE KEYSPACE tvshow WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};",
    "USE tvshow;",
    "CREATE TABLE tvshow.tvshow_data (id int PRIMARY KEY, title text, year int, age int, imdb decimal, rotten_tomatoes int, netflix int, hulu int, prime_video int, disney_plus int, is_tvshow int);",
    "COPY tvshow_data (id, title, year, age, imdb, rotten_tomatoes, netflix, hulu, prime_video, disney_plus, is_tvshow) FROM 'C:tvshows.csv' WITH HEADER = true;"
]

print('\n')
for query in create_and_add_file_to_tvshow:
    session.execute(query)
    print(query, "\nsuccessful\n")
Resulting Python error:
This is the error I get when I run my code in PowerShell with the following command: python cassandra_test.py.
cassandra.protocol.SyntaxException: <Error from server: code=2000 [Syntax error in
CQL query] message="line 1:0 no viable alternative at input 'COPY' ([
Resulting cqlsh error:
Running the previously stated cqlsh statements from the create_and_add_file_to_tvshow variable in PowerShell, after launching cqlsh from the apache-cassandra-3.11.6/bin/ directory, produces the following error.
Note: The following output is only the first few lines of the error plus the last few lines; I chose not to include all of it since it was several hundred lines long. If necessary I will include it.
Starting copy of tvshow.tvshow_data with columns [id, title, year, age, imdb, rotten_tomatoes, netflix, hulu, prime_video, disney_plus, is_tvshow].
Failed to import 0 rows: IOError - Can't open 'C:tvshows.csv' for reading: no matching file found, given up after 1 attempts
Process ImportProcess-44:
Process ImportProcess-41:
Process ImportProcess-42:
Traceback (most recent call last):
...
    cls._loop.add_timer(timer)
AttributeError: 'NoneType' object has no attribute 'add_timer'
AttributeError: 'NoneType' object has no attribute 'add_timer'
AttributeError: 'NoneType' object has no attribute 'add_timer'
AttributeError: 'NoneType' object has no attribute 'add_timer'
AttributeError: 'NoneType' object has no attribute 'add_timer'
Processed: 0 rows; Rate: 0 rows/s; Avg. rate: 0 rows/s
0 rows imported from 0 files in 1.974 seconds (0 skipped).
A sample of the first 10 lines of the CSV file I am trying to import.
I have also tried creating a CSV file with just the first two data lines below, as an even smaller test, since I couldn't get anything else to work.
id,title,year,age,imdb,rotten_tomatoes,netflix,hulu,prime_video,disney_plus,is_tvshow
0,Breaking Bad,2008,18+,9.5,96%,1,0,0,0,1
1,Stranger Things,2016,16+,8.8,93%,1,0,0,0,1
2,Money Heist,2017,18+,8.4,91%,1,0,0,0,1
3,Sherlock,2010,16+,9.1,78%,1,0,0,0,1
4,Better Call Saul,2015,18+,8.7,97%,1,0,0,0,1
5,The Office,2005,16+,8.9,81%,1,0,0,0,1
6,Black Mirror,2011,18+,8.8,83%,1,0,0,0,1
7,Supernatural,2005,16+,8.4,93%,1,0,0,0,1
8,Peaky Blinders,2013,18+,8.8,92%,1,0,0,0,1
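Worth noting: COPY FROM is a cqlsh shell command rather than a server-side CQL statement, which is why the driver rejects it with the "no viable alternative at input 'COPY'" syntax error. As a rough sketch of an alternative under that assumption (and assuming the tvshow.tvshow_data table above already exists; the CSV path is a placeholder), the rows can be inserted through the driver itself:

import csv
from decimal import Decimal

from cassandra.cluster import Cluster

cluster = Cluster()
session = cluster.connect('tvshow')

# Prepared INSERT matching the tvshow_data schema from the question.
insert = session.prepare(
    "INSERT INTO tvshow_data (id, title, year, age, imdb, rotten_tomatoes, "
    "netflix, hulu, prime_video, disney_plus, is_tvshow) "
    "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)"
)

# 'C:/tvshows.csv' is a placeholder path. The schema declares rotten_tomatoes
# as int, so the trailing '%' from the sample data is stripped here.
with open('C:/tvshows.csv', newline='') as f:
    for row in csv.DictReader(f):
        session.execute(insert, (
            int(row['id']), row['title'], int(row['year']), row['age'],
            Decimal(row['imdb']), int(row['rotten_tomatoes'].rstrip('%')),
            int(row['netflix']), int(row['hulu']), int(row['prime_video']),
            int(row['disney_plus']), int(row['is_tvshow']),
        ))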
I've used the sample from the BQ documentation to read a BQ table into a pandas dataframe using this query:
query_string = """
SELECT
CONCAT(
'https://stackoverflow.com/questions/',
CAST(id as STRING)) as url,
view_count
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE tags like '%google-bigquery%'
ORDER BY view_count DESC
"""
dataframe = (
bqclient.query(query_string)
.result()
.to_dataframe(bqstorage_client=bqstorageclient)
)
print(dataframe.head())
url view_count
0 https://stackoverflow.com/questions/22879669 48540
1 https://stackoverflow.com/questions/13530967 45778
2 https://stackoverflow.com/questions/35159967 40458
3 https://stackoverflow.com/questions/10604135 39739
4 https://stackoverflow.com/questions/16609219 34479
However, the minute I try to use any other non-public dataset, I get the following error:
google.api_core.exceptions.FailedPrecondition: 400 there was an error creating the session: the table has a storage format that is not supported
Is there some setting I need to set in my table so that it can work with the BQ Storage API?
This works:
query_string = """SELECT funding_round_type, count(*) FROM `datadocs-py.datadocs.investments` GROUP BY funding_round_type order by 2 desc LIMIT 2"""
>>> bqclient.query(query_string).result().to_dataframe()
funding_round_type f0_
0 venture 104157
1 seed 43747
However, when I set it to use the bqstorageclient I get that error:
>>> bqclient.query(query_string).result().to_dataframe(bqstorage_client=bqstorageclient)
Traceback (most recent call last):
File "/Users/david/Desktop/V/lib/python3.6/site-packages/google/api_core/grpc_helpers.py", line 57, in error_remapped_callable
return callable_(*args, **kwargs)
File "/Users/david/Desktop/V/lib/python3.6/site-packages/grpc/_channel.py", line 533, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/Users/david/Desktop/V/lib/python3.6/site-packages/grpc/_channel.py", line 467, in _end_unary_response_blocking
raise _Rendezvous(state, None, None, deadline)
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
status = StatusCode.FAILED_PRECONDITION
details = "there was an error creating the session: the table has a storage format that is not supported"
debug_error_string = "{"created":"#1565047973.444089000","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"there was an error creating the session: the table has a storage format that is not supported","grpc_status":9}"
>
I experienced the same issue as of 06 Nov 2019, and it turns out that the error you are getting is a known issue with the Read API: it currently cannot handle result sets smaller than 10MB. I came across this issue, which sheds some light on the problem:
GitHub.com - GoogleCloudPlatform/spark-bigquery-connector - FAILED_PRECONDITION: there was an error creating the session: the table has a storage format that is not supported #46
I have tested it with a query that returns a result set larger than 10MB, and it works fine for me against a dataset in the EU multi-regional location.
Also, you will need to install fastavro in your environment for this functionality to work.
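Until that limitation is lifted, one possible workaround (a sketch based on my own usage, not an official recommendation) is to try the Storage API path first and fall back to the regular REST-based download when the read session cannot be created; small result sets download quickly through the normal API anyway:

from google.api_core.exceptions import FailedPrecondition

def query_to_dataframe(bqclient, bqstorageclient, query_string):
    """Run a query, preferring the BigQuery Storage API but falling back to
    the standard REST download for result sets it cannot handle."""
    result = bqclient.query(query_string).result()
    try:
        return result.to_dataframe(bqstorage_client=bqstorageclient)
    except FailedPrecondition:
        # Known limitation: small result sets cannot create a read session,
        # so re-fetch the rows through the standard API instead.
        return bqclient.query(query_string).result().to_dataframe()

dataframe = query_to_dataframe(bqclient, bqstorageclient, query_string)
print(dataframe.head())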