Cassandra COPY FROM query error, with a CSV file - python-2.7

The problem:
I'm trying to get Cassandra working properly with Python. I've been using a toy dataset to practice uploading a CSV file into Cassandra, with no luck.
Cassandra seems to work fine as long as I am not using COPY FROM for CSV files.
My intention is to use this dataset as a test to make sure I can load a CSV file's contents into Cassandra, so that I can then load 5 CSV files totaling 2 GB into it for my originally intended project.
Note: Whenever I use CREATE TABLE and then run SELECT * FROM tvshow_data, the columns don't appear in the order that I defined them. Is this going to affect anything, or does it not matter?
Info about my installations and usage:
I've tried running both cqlsh and cassandra with an admin powershell.
I have Python 2.7 installed inside of the apache-cassandra-3.11.6 folder.
I have Cassandra version 3.11.6 installed.
I have cassandra-driver 3.18.0 installed, with conda.
I use Python 3.7 for everything other than Cassandra's directory.
I have tried both CREATE TABLE tvshow and CREATE TABLE tvshow.tvshow_data.
My Python script:
from cassandra.cluster import Cluster
cluster = Cluster()
session = cluster.connect()
create_and_add_file_to_tvshow = [
    "DROP KEYSPACE tvshow;",
    "CREATE KEYSPACE tvshow WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};",
    "USE tvshow;",
    "CREATE TABLE tvshow.tvshow_data (id int PRIMARY KEY, title text, year int, age int, imdb decimal, rotten_tomatoes int, netflix int, hulu int, prime_video int, disney_plus int, is_tvshow int);",
    "COPY tvshow_data (id, title, year, age, imdb, rotten_tomatoes, netflix, hulu, prime_video, disney_plus, is_tvshow) FROM 'C:tvshows.csv' WITH HEADER = true;"
]
print('\n')
for query in create_and_add_file_to_tvshow:
    session.execute(query)
    print(query, "\nsuccessful\n")
Resulting Python error:
This is the error I get when I run my code in PowerShell with the command python cassandra_test.py.
cassandra.protocol.SyntaxException: <Error from server: code=2000 [Syntax error in
CQL query] message="line 1:0 no viable alternative at input 'COPY' ([
Resulting cqlsh error:
Running the CQL statements from the create_and_add_file_to_tvshow variable directly in cqlsh (started in PowerShell from the apache-cassandra-3.1.3/bin/ directory) produces the following error.
Note: I am only including the first few lines and the last few lines of the output, since the full error was several hundred lines long. If necessary I will include all of it.
Starting copy of tvshow.tvshow_data with columns [id, title, year, age, imdb, rotten_tomatoes, netflix, hulu, prime_video, disney_plus, is_tvshow].
Failed to import 0 rows: IOError - Can't open 'C:tvshows.csv' for reading: no matching file found, given up after 1 attempts
Process ImportProcess-44:
Process ImportProcess-41:
Process ImportProcess-42:
Traceback (most recent call last):
...
...
...
    cls._loop.add_timer(timer)
AttributeError: 'NoneType' object has no attribute 'add_timer'
AttributeError: 'NoneType' object has no attribute 'add_timer'
AttributeError: 'NoneType' object has no attribute 'add_timer'
Processed: 0 rows; Rate: 0 rows/s; Avg. rate: 0 rows/s
0 rows imported from 0 files in 1.974 seconds (0 skipped).
A sample of the first 10 lines of the CSV file used for the import
I have also tried creating a CSV file with just these first two lines, as a toy's-toy test, since I couldn't get anything else to work.
id,title,year,age,imdb,rotten_tomatoes,netflix,hulu,prime_video,disney_plus,is_tvshow
0,Breaking Bad,2008,18+,9.5,96%,1,0,0,0,1
1,Stranger Things,2016,16+,8.8,93%,1,0,0,0,1
2,Money Heist,2017,18+,8.4,91%,1,0,0,0,1
3,Sherlock,2010,16+,9.1,78%,1,0,0,0,1
4,Better Call Saul,2015,18+,8.7,97%,1,0,0,0,1
5,The Office,2005,16+,8.9,81%,1,0,0,0,1
6,Black Mirror,2011,18+,8.8,83%,1,0,0,0,1
7,Supernatural,2005,16+,8.4,93%,1,0,0,0,1
8,Peaky Blinders,2013,18+,8.8,92%,1,0,0,0,1
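In case it helps, here is the driver-side fallback I am considering if COPY can't be issued through the Python driver at all: reading the CSV with Python's csv module and inserting the rows with a prepared statement. This is only a sketch I put together and have not run; the file path is a placeholder, and the conversions for the age and rotten_tomatoes columns (stripping the '+' and '%') are my own guesses.
import csv
from decimal import Decimal
from cassandra.cluster import Cluster

cluster = Cluster()
session = cluster.connect('tvshow')  # keyspace created by the script above

# Prepared INSERT matching the tvshow_data table definition above.
insert = session.prepare(
    "INSERT INTO tvshow_data (id, title, year, age, imdb, rotten_tomatoes, "
    "netflix, hulu, prime_video, disney_plus, is_tvshow) "
    "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)")

with open('C:/tvshows.csv') as f:  # placeholder path
    for row in csv.DictReader(f):
        session.execute(insert, (
            int(row['id']),
            row['title'],
            int(row['year']),
            int(row['age'].rstrip('+')),              # '18+' -> 18
            Decimal(row['imdb']),                     # imdb column is a decimal
            int(row['rotten_tomatoes'].rstrip('%')),  # '96%' -> 96
            int(row['netflix']),
            int(row['hulu']),
            int(row['prime_video']),
            int(row['disney_plus']),
            int(row['is_tvshow']),
        ))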

Related

Spark SQL error from EMR notebook with AWS Glue table partition

I'm testing some pyspark code in an EMR notebook before I deploy it and keep running into this strange error with Spark SQL. I have all my tables and metadata integrated with the AWS Glue catalog so that I can read and write to them through Spark.
The first part of the code reads some data from S3/Glue, does some transformations and what not, then writes the resulting dataframe to S3/Glue like so:
df.repartition('datekey','coeff')\
    .write\
    .format('parquet')\
    .partitionBy('datekey','coeff')\
    .mode('overwrite')\
    .option("path", S3_PATH)\
    .saveAsTable('hive_tables.my_table')
I then try to access this table with Spark SQL, but when I run something as simple as
spark.sql('select * from hive_tables.my_table where datekey=20210506').show(),
it throws this:
An error was encountered:
"org.apache.hadoop.hive.metastore.api.InvalidObjectException: Unknown type : 'double' (Service: AWSGlue; Status Code: 400; Error Code: InvalidInputException; Request ID: 43ff3707-a44f-41be-b14a-7b9906d8d8f9; Proxy: null);"
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 778, in saveAsTable
    self._jwrite.saveAsTable(name)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: "org.apache.hadoop.hive.metastore.api.InvalidObjectException: Unknown type : 'double' (Service: AWSGlue; Status Code: 400; Error Code: InvalidInputException; Request ID: 43ff3707-a44f-41be-b14a-7b9906d8d8f9; Proxy: null);"
I've learned this happens only when specifying the datekey partition. For example, both of the following commands work fine:
spark.sql('select * from hive_tables.my_table where coeff=0.5').show() and
spark.sql('select * from hive_tables.my_table').show()
I've verified through Spark SQL that the partitions exist and have data in them. The datekey query also works fine through AWS Athena - just not Spark SQL.
Also Glue definitely has the two partition columns recognized:
datekey: int
coeff: double
Any ideas here? I've tried everything I can think of and it just isn't making any sense.
I had the same error in EMR 6.3.0 (Spark 3.1.1).
After upgrading to EMR 6.5.0 (Spark 3.1.2), it was resolved.
I would still like a straightforward solution to this, but for now this workaround suffices:
I first read the table straight from the S3 path
temp_df = spark.read.parquet(S3_PATH)
so that it doesn't use the Glue catalog as the metadata. Then I create a temp table for the session:
temp_df.createGlobalTempView('my_table')
which allows me to query it using Spark SQL with the global_temp database:
spark.sql('select * from global_temp.my_table where datekey=20210506').show()
and this works.
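Putting the workaround together in one place (just a sketch on my side; S3_PATH and the table name stand in for the same placeholders used above):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
S3_PATH = 's3://my-bucket/path/to/my_table'  # placeholder, same role as S3_PATH above

# Read the parquet files directly from S3 so the Glue catalog metadata is bypassed.
temp_df = spark.read.parquet(S3_PATH)

# Register a session-scoped view; global temp views live in the global_temp database.
temp_df.createGlobalTempView('my_table')

spark.sql('select * from global_temp.my_table where datekey=20210506').show()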
I had a similar issue in a similar environment (EMR cluster + Spark SQL + AWS Glue catalog). The query was like this:
select *
from ufd.core_agg_data
where year <> date_format(current_timestamp, 'yyyy')
This is a table partitioned by "year", and "year" is a string. Note that "year" is used in the filter.
I got
User class threw exception: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.metastore.api.InvalidObjectException: Unknown operator '!='
Then I "modified" the query to this one, and it worked!
select *
from ufd.core_agg_data
where year in (select date_format(current_timestamp, 'yyyy'))

BigQuery Storage API: the table has a storage format that is not supported

I've used the sample from the BQ documentation to read a BQ table into a pandas dataframe using this query:
query_string = """
SELECT
CONCAT(
'https://stackoverflow.com/questions/',
CAST(id as STRING)) as url,
view_count
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE tags like '%google-bigquery%'
ORDER BY view_count DESC
"""
dataframe = (
    bqclient.query(query_string)
    .result()
    .to_dataframe(bqstorage_client=bqstorageclient)
)
print(dataframe.head())
url view_count
0 https://stackoverflow.com/questions/22879669 48540
1 https://stackoverflow.com/questions/13530967 45778
2 https://stackoverflow.com/questions/35159967 40458
3 https://stackoverflow.com/questions/10604135 39739
4 https://stackoverflow.com/questions/16609219 34479
However, the minute I try and use any other non-public data-set, I get the following error:
google.api_core.exceptions.FailedPrecondition: 400 there was an error creating the session: the table has a storage format that is not supported
Is there some setting I need to set in my table so that it can work with the BQ Storage API?
This works:
query_string = """SELECT funding_round_type, count(*) FROM `datadocs-py.datadocs.investments` GROUP BY funding_round_type order by 2 desc LIMIT 2"""
>>> bqclient.query(query_string).result().to_dataframe()
funding_round_type f0_
0 venture 104157
1 seed 43747
However, when I set it to use the bqstorageclient I get that error:
>>> bqclient.query(query_string).result().to_dataframe(bqstorage_client=bqstorageclient)
Traceback (most recent call last):
  File "/Users/david/Desktop/V/lib/python3.6/site-packages/google/api_core/grpc_helpers.py", line 57, in error_remapped_callable
    return callable_(*args, **kwargs)
  File "/Users/david/Desktop/V/lib/python3.6/site-packages/grpc/_channel.py", line 533, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/Users/david/Desktop/V/lib/python3.6/site-packages/grpc/_channel.py", line 467, in _end_unary_response_blocking
    raise _Rendezvous(state, None, None, deadline)
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
	status = StatusCode.FAILED_PRECONDITION
	details = "there was an error creating the session: the table has a storage format that is not supported"
	debug_error_string = "{"created":"#1565047973.444089000","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"there was an error creating the session: the table has a storage format that is not supported","grpc_status":9}"
>
I experienced the same issue as of 06 Nov 2019, and it turns out that the error you are getting is a known issue with the Read API: it currently cannot handle result sets smaller than 10MB. I came across this issue, which sheds some light on the problem:
GitHub.com - GoogleCloudPlatform/spark-bigquery-connector - FAILED_PRECONDITION: there was an error creating the session: the table has a storage format that is not supported #46
I have tested it with a query that returns a result set larger than 10MB, and it seems to work fine for me with an EU multi-regional location for the dataset I am querying.
Also, you will need to install fastavro in your environment for this functionality to work.
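For reference, this is roughly the client setup I was testing with (a sketch only; the project id is a placeholder, and it assumes google-cloud-bigquery, google-cloud-bigquery-storage and fastavro are installed, with the client classes as they exist in the 2019 libraries):
from google.cloud import bigquery
from google.cloud import bigquery_storage_v1beta1

# Credentials come from the environment (GOOGLE_APPLICATION_CREDENTIALS) as usual.
bqclient = bigquery.Client(project='your-project-id')
bqstorageclient = bigquery_storage_v1beta1.BigQueryStorageClient()

query_string = """
SELECT id, view_count
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE tags LIKE '%google-bigquery%'
"""

dataframe = (
    bqclient.query(query_string)
    .result()
    .to_dataframe(bqstorage_client=bqstorageclient)
)
print(dataframe.head())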

Turi Create - Please use dropna() to drop rows

I am having issues with Apple's Turi Create and its image classifier. I have successfully created a model with 22 categories. I recently added 5 more categories, and the console is giving me this error:
Please use dropna() to drop rows with missing target values.
The full console log looks like this:
[16:30:30] src/nnvm/legacy_json_util.cc:190: Loading symbol saved by previous version v0.8.0. Attempting to upgrade...
[16:30:30] src/nnvm/legacy_json_util.cc:198: Symbol successfully upgraded!
Resizing images...
Performing feature extraction on resized images...
Premature end of JPEG file
Completed 512/1883
Completed 1024/1883
Completed 1536/1883
Completed 1883/1883
PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
You can set ``validation_set=None`` to disable validation tracking.
[ERROR] turicreate.toolkits._main: Toolkit error: Target column has missing value.
Please use dropna() to drop rows with missing target values.
Traceback (most recent call last):
  File "train.py", line 8, in <module>
    model = tc.image_classifier.create(train_data, target='label', max_iterations=1000)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/turicreate/toolkits/image_classifier/image_classifier.py", line 132, in create
    verbose=verbose)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/turicreate/toolkits/classifier/logistic_classifier.py", line 312, in create
    seed=seed)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/turicreate/toolkits/_supervised_learning.py", line 397, in create
    options, verbose)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/turicreate/toolkits/_main.py", line 75, in run
    raise ToolkitError(str(message))
turicreate.toolkits._main.ToolkitError: Target column has missing value.
Please use dropna() to drop rows with missing target values.
I have upgraded turicreate and coremltools to the latest versions, but I don't know where I should call dropna() in my code. I only found this reference and followed its code.
It looks like this:
data.py
import turicreate as tc
image_data = tc.image_analysis.load_images('images', with_path=True)
labels = ['A', 'B', 'C', 'D']
def get_label(path, labels=labels):
    for label in labels:
        if label in path:
            return label
image_data['label'] = image_data['path'].apply(get_label)
#import os
#image_data['label'] = image_data['path'].apply(lambda path: os.path.dirname(path).split('/')[-1])
image_data.save('boxes.sframe')
image_data.explore()
train.py
import turicreate as tc
data = tc.SFrame('boxes.sframe')
data.dropna()
train_data, test_data = data.random_split(0.8)
model = tc.image_classifier.create(train_data, target='label', max_iterations=1000)
predictions = model.classify(test_data)
results = model.evaluate(test_data)
print "Accuracy : %s" % results['accuracy']
print "Confusion Matrix : \n%s" % results['confusion_matrix']
model.save('boxes.model')
How do I drop all the empty columns and rows, please? Does max_iterations=1000 also have an effect on the error?
Thank you for any suggestions.
data.dropna() is not done in place; you need to assign the result back: data = data.dropna()
See the documentation here: https://apple.github.io/turicreate/docs/api/generated/turicreate.SFrame.dropna.html
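Applied to the train.py above, the fix is just the reassignment; a minimal sketch of the relevant lines:
import turicreate as tc

data = tc.SFrame('boxes.sframe')
data = data.dropna()  # dropna() returns a new SFrame, so the result must be assigned back

train_data, test_data = data.random_split(0.8)
model = tc.image_classifier.create(train_data, target='label', max_iterations=1000)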

Python 2.x unicode & pymssql

I want to upload a unicode string to my SQL Server. I'm using Python 2.7.6, sqlalchemy-migrate 0.7.2, and pymssql 2.1.2.
But when I save my object I get an OperationalError from SQLAlchemy:
OperationalError: (OperationalError) (105, "Unclosed quotation mark
after the character string '\xd8\xa3\xd8\xb3\xd8\xb1\xd8\xa7\xd8\xb1
\xd8\xaa\xd8\xad\xd8\xaf\xd9\x8a\xd8\xaf\xd8\xa7\xd9\x84\xd9\x88\xd8
\xac\xd9\x87 - \xd9\x81\xd9\x82\xd8\xb7 \xd9\x84\xd8\xaf\xd9\x89\xd9
\x85\xd8\xad\xd9\x84\xd8\xa7\xd8\xaa \xd9\x88\xd8\xac\xd9\x88\xd9\x87
\xe2\x9c\x8e '.DB-Lib error message 105, severity 15:\nGeneral SQL
Server error: Check messages from the SQL Server\n") 'INSERT INTO...
In more detail, I'm guessing it comes from my Description value:
...\u0648\u062c\u0648\u0647 \u270e \U0001f38'}
Notice the big U rather than a lowercase u: that character is the gift emoji just before the closing quote. The "\u270e" works fine and shows a pencil. I strongly suspect the problem is the escape with 8 hex digits versus 4 for the others.
But how can I prevent this error?
The Description column in my DB is an nvarchar(2000).
I'm using Flask-RESTful's reqparse to parse the arguments, then I create/sync the object from the DB and save it:
parser_edit.add_argument('Name',
                         type=unicode,
                         required=True,
                         location='json')
parser_edit.add_argument('Description',
                         type=unicode,
                         required=False,
                         location='json')
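For what it's worth, the workaround I am experimenting with (not a confirmed fix) is to drop any character that needs an 8-digit \U escape, i.e. anything outside the Basic Multilingual Plane, before the value reaches pymssql. strip_non_bmp below is my own helper, not part of any library, and I am assuming the truncated \U0001f38... above is the wrapped-present emoji U+1F381:
# -*- coding: utf-8 -*-
import re

# Matches surrogates (how non-BMP characters appear on "narrow" Python 2 builds)
# as well as real code points above U+FFFF ("wide" builds).
NON_BMP = re.compile(u'[\ud800-\udfff]|[^\u0000-\uffff]')

def strip_non_bmp(text):
    """Remove characters outside the Basic Multilingual Plane."""
    return NON_BMP.sub(u'', text)

description = u'\u0648\u062c\u0648\u0647 \u270e \U0001F381'
print(repr(strip_non_bmp(description)))  # the pencil stays, the gift emoji is gone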

How do I store a DataFrame into a BigTable in Google DataLab?

I have a DataFrame df, and I create a BigQuery table:
# Create the schema, using the convenience of basing it on example DataFrame
schema = bq.Schema.from_dataframe(df)
# Create the dataset
bq.DataSet('ids').create()
# Create the table
suri_table = bq.Table('ids.suri').create(schema = schema, overwrite = True)
project = gcp.Context.default().project_id
There is a Pandas function [to_gbq()][1] which I want to use to store the DataFrame.
df.to_gbq(df, 'ids.suri', project)
This returns a "Not found exception" although the table exists; I just created it in the code above. Could someone help me figure out what the problem really is?
NotFoundException: Invalid Table Name. Should be of the form
'datasetId.tableId'
If I do:
from pandas.io import gbq
df.to_gbq('ids.suri', project_id=projectid)
I get:
/usr/lib/python2.7/dist-packages/pkg_resources.pyc in resolve(self, requirements, env, installer, replace_conflicting)
637 # unfortunately, zc.buildout uses a str(err)
638 # to get the name of the distribution here..
--> 639 raise DistributionNotFound(req)
640 to_activate.append(dist)
641 if dist not in req:
DistributionNotFound: google-api-python-client
[1]: http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.io.gbq.to_gbq.html
You are conflating the Cloud Datalab way with the gbq way. You should use one or the other. To do this from Cloud Datalab, once you have created the data, you can just use:
suri_table.insert_data(df)
There are a couple of options if you want to include the index, etc; see http://googlecloudplatform.github.io/datalab/gcp.bigquery.html#gcp.bigquery.Table.insert_data
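Putting the Datalab-only path together (a sketch; it assumes the same df, dataset and table names as in the question):
import gcp.bigquery as bq

# Build the schema from the example DataFrame and create the dataset and table,
# exactly as in the question.
schema = bq.Schema.from_dataframe(df)
bq.DataSet('ids').create()
suri_table = bq.Table('ids.suri').create(schema=schema, overwrite=True)

# Load the DataFrame through the Datalab API instead of pandas' to_gbq().
suri_table.insert_data(df)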