I am using the Python Spark API and have an issue when getting the file name. For example:
recordDataFrame=originalDataFrame \
.withColumn('file_name',input_file_name())
This adds the file_name column to the DataFrame, and the column is present:
recordDataFrame.take(1)
The above shows the column with a value.
But when I convert the DataFrame to an RDD, or loop over the RDD, the file_name column doesn't have any value.
For example,
rdd_data=recordDataFrame.rdd
print(rdd_data.take(1))
This shows the file_name column with a blank value.
Or if I loop over the DataFrame directly, file_name again has no value:
recordDataFrame.foreach(process_data)
But if I pass a static value to file_name instead of using input_file_name() when adding the column, everything works fine.
This is a bug that was resolved in Spark 2.0.0.
Disclaimer:
These are serious hacks and should be avoided unless you're desperate. Also, none of these have been properly tested. If you can, it is better to update.
Trigger a shuffle after loading the data, for example with:
recordDataFrame.repartition("file_name")
or
recordDataFrame.orderBy("file_name")
Truncate lineage as shown in high-performance-spark/high-performance-spark-examples
(code is GPL licensed so it cannot be reproduced here but the main idea is to access internal Java RDD, cache it and recreate DataFrame):
cutLineage(recordDataFrame)
I am reading JSON files from GCS and I have to load the data into different BigQuery tables. These files may have multiple records for the same customer with different timestamps. I have to pick the latest among them for each customer. I am planning to achieve this as follows:
Read files
Group by customer id
Apply a DoFn to compare the timestamps of the records in each group and keep only the latest one
Flatten it, convert to table rows, and insert into BQ.
But I am unable to proceed with step 1. I see GroupByKey.create() but am unable to make it use the customer id as the key.
I am implementing this in Java. Any suggestions would be of great help. Thank you.
Before you GroupByKey you need to have your dataset in key-value pairs. It would be good if you had shown some of your code, but without knowing much, you'd do the following:
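PCollection<JsonObject> objects = p.apply(FileIO.read(....)).apply(FormatData...)
// Once we have the data in JsonObjects, we key by customer ID.
// A lambda needs an explicit output type, hence into(...) before via(...):
PCollection<KV<String, Iterable<JsonObject>>> groupedData =
    objects.apply(MapElements
            .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptor.of(JsonObject.class)))
            .via((JsonObject elm) -> KV.of(elm.getString("customerId"), elm)))
        .apply(GroupByKey.create());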
Once that's done, you can check timestamps and discard all but the most recent, as you were thinking.
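The "keep only the most recent record per customer" step can be sketched independently of Beam. The following is a minimal plain-Python illustration, not Beam Java code. The field names customerId and timestamp and the ISO-8601 timestamp format are assumptions, since the question does not show the record layout.

```python
from collections import defaultdict
from datetime import datetime


def latest_per_customer(records):
    """Group records by customer id and keep the newest record of each group."""
    grouped = defaultdict(list)
    for record in records:
        # Assumed field name: "customerId".
        grouped[record["customerId"]].append(record)

    latest = {}
    for customer_id, group in grouped.items():
        # Assumed field name and format: "timestamp" as an ISO-8601 string.
        latest[customer_id] = max(
            group, key=lambda r: datetime.fromisoformat(r["timestamp"])
        )
    return latest


# Hypothetical sample data, for illustration only.
records = [
    {"customerId": "c1", "timestamp": "2019-01-01T10:00:00", "plan": "basic"},
    {"customerId": "c1", "timestamp": "2019-03-01T09:30:00", "plan": "pro"},
    {"customerId": "c2", "timestamp": "2019-02-15T12:00:00", "plan": "basic"},
]
result = latest_per_customer(records)
```

In a Beam pipeline the same comparison would presumably live inside the DoFn that receives each KV<String, Iterable<JsonObject>> after the GroupByKey.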
Note that you will need to set coders, etc - if you get stuck with that we can iterate.
As a hint / tip, you can consider this example of a Json Coder.
I am using the AWS Glue CSV crawler to crawl an S3 directory containing CSV files. The crawler works fine in the sense that it creates the schema with the correct data type for each column. However, when I query the data from Athena, it doesn't show any value under the boolean column.
A csv looks like this:
"val","ts","cond"
"1.2841974","15/05/2017 15:31:59","True"
"0.556974","15/05/2017 15:40:59","True"
"1.654111","15/05/2017 15:41:59","True"
And the table created by crawler is:
Column name    Data type
val            string
ts             string
cond           boolean
However, when I run say select * from <table_name> limit 10 it returns:
     val            ts                       cond
1    "1.2841974"    "15/05/2017 15:31:59"
2    "0.556974"     "15/05/2017 15:40:59"
3    "1.654111"     "15/05/2017 15:41:59"
Does anyone have any idea what the reason might be?
I forgot to add: if I change the data type of the cond column to string, it does show the data as a string, e.g. "True" or "False".
I don't know why Glue classifies the cond column as boolean, because Athena will not understand that value as a boolean. I think this is a bug in Glue, or an artefact of it not targeting Athena exclusively. Athena expects boolean values to be either true or false. I don't remember if that includes different capitalizations of the strings or not, but either way yours will fail because they are quoted. The actual bug is that Glue has not configured your table so that it strips the quotes from the strings, and therefore Athena sees a boolean column containing "True" with quotes and all, and that is not a supported boolean value. Instead you get NULL values.
You could try changing your tables to use the OpenCSVSerDe instead; it supports quoted values.
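To see why the quotes matter, here is a small illustration in plain Python (it is not Athena or Glue code). It contrasts a naive split on commas, which leaves the quote characters in each value, with a quote-aware CSV parser, which strips them. The parse_boolean helper is hypothetical and only mimics a strict parser that accepts true or false. Athena's exact rules may differ.

```python
import csv
import io

line = '"1.2841974","15/05/2017 15:31:59","True"'

# Naive split: the quote characters stay part of each value.
naive_fields = line.split(",")

# Quote-aware parsing: the surrounding quotes are removed.
parsed_fields = next(csv.reader(io.StringIO(line)))


def parse_boolean(value):
    """Hypothetical strict parser: returns None (NULL) for anything unrecognised."""
    lowered = value.lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    return None


naive_cond = parse_boolean(naive_fields[2])    # value still carries its quotes
parsed_cond = parse_boolean(parsed_fields[2])  # value is the bare word True
```

With the quotes left in place the value is not recognised and comes back as None, which corresponds to the NULLs seen in the query result.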
It's surprising that Glue continues to stumble on basic things like this. Glue is unfortunately rarely worth the effort over writing some basic scripts yourself.
For context: I skimmed this previous question but was dissatisfied with the answer for two reasons:
I'm not writing anything in Python; in fact, I'm not writing any custom scripts for this at all as I'm relying on a crawler and not a Glue script.
The answer is not as complete as I require since it's just a link to some library.
I'm looking to leverage AWS Glue to accept some CSVs into a schema, and using Athena, convert that CSV table into multiple Parquet-formatted tables for ETL purposes. The data I'm working with has quotes embedded in it, which would be okay save for the fact that one record I have has a value of:
"blablabla","1","Freeman,Morgan","bla bla bla"
It seems that Glue is tripping over itself when it encounters the "Freeman,Morgan" piece of data.
If I use the standard Glue crawler, I get a table created with the LazySimpleSerDe, which truncates the record above in its column to:
"Freeman,
...which is obviously not desirable.
How do I force the crawler to output the file with the correct SerDe?
[Unpleasant] Constraints:
Looking to not accomplish this with a Glue script, since for that to work I believe I have to have a table beforehand, whereas the crawler will create the table on my behalf.
If I have to do this all through Amazon Athena, I'd feel like that would largely defeat the purpose but it's a tenable solution.
This is going to turn into a very dull answer, but apparently AWS provides its own set of rules for classifying if a file is a CSV.
To be classified as CSV, the table schema must have at least two columns and two rows of data. The CSV classifier uses a number of heuristics to determine whether a header is present in a given file. If the classifier can't determine a header from the first row of data, column headers are displayed as col1, col2, col3, and so on. The built-in CSV classifier determines whether to infer a header by evaluating the following characteristics of the file:
Every column in a potential header parses as a STRING data type.
Except for the last column, every column in a potential header has content that is fewer than 150 characters. To allow for a trailing delimiter, the last column can be empty throughout the file.
Every column in a potential header must meet the AWS Glue regex requirements for a column name.
The header row must be sufficiently different from the data rows. To determine this, one or more of the rows must parse as other than STRING type. If all columns are of type STRING, then the first row of data is not sufficiently different from subsequent rows to be used as the header.
I believed that I had met all of these requirements, given that the column names are wildly divergent from the actual data in the CSV, and ideally there shouldn't be much of an issue there.
However, in spite of my belief that it would satisfy the AWS Glue regex (which I can't find a definition for anywhere), I elected to move away from commas and to pipes instead. The data now loads as I expect it to.
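The truncation described in the question is what any quote-unaware splitter does with an embedded comma. The following plain-Python illustration does not use or reproduce the Hive SerDes. It takes the record from the question and compares a naive split, a quote-aware parse, and a pipe-delimited variant. The pipe-delimited line is an assumed rendering of the same record.

```python
import csv
import io

record = '"blablabla","1","Freeman,Morgan","bla bla bla"'

# Splitting on every comma breaks the quoted name into two fields.
naive = record.split(",")

# A quote-aware parser keeps the quoted name together.
quote_aware = next(csv.reader(io.StringIO(record)))

# With pipes as the delimiter, the embedded comma is no longer special.
piped = '"blablabla"|"1"|"Freeman,Morgan"|"bla bla bla"'
pipe_split = [field.strip('"') for field in piped.split("|")]
```

The naive split yields five fields instead of four, with the name cut at the comma, while the other two approaches keep Freeman,Morgan intact.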
Use glueContext.create_dynamic_frame_from_options() when converting the CSV to Parquet, and then run the crawler over the Parquet data.
df = glueContext.create_dynamic_frame_from_options("s3", {"paths": [src]}, format="csv")
The default separator is ,
The default quoteChar is "
If you wish to change these, see https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format.html
I am a newbie to Apache Hive and Spark. I have some existing Hive tables on my Hadoop server, and I can run HQL commands against them with hive or beeline to get what I want, e.g. selecting the first 5 rows of a table. I want to use Spark to achieve the same goal instead. The Spark version on the server is 1.6.3.
Using the code below (I replaced my database name and table with database and table):
sc = SparkContext(conf = config)
sqlContext = HiveContext(sc)
query = sqlContext.createDataFrame(sqlContext.sql("SELECT * from database.table LIMIT 5").collect())
df = query.toPandas()
df.show()
I get this error:
ValueError: Some of types cannot be determined after inferring.
Error:root: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line string', (1, 0))
However, I can use beeline with same query and see the results.
After a day of googling and searching I modified the code as:
table_ccx = sqlContext.table("database.table")
table_ccx.registerTempTable("temp")
sqlContext.sql("SELECT * FROM temp LIMIT 5").show()
Now the error is gone but all the row values are null except one or two dates and column names.
I also tried
table_ccx.refreshTable("database.table")
and it did not help. Is there a setting or configuration that I need to ask my IT team to do? I appreciate any help.
EDIT: Having said that, my Python code works for some of the tables on Hadoop. I do not know whether the problem is caused by some entries in the table. If it is, how come the corresponding beeline/Hive command works?
As it came out in the comments, straightening up the code a little bit makes the thing work.
The problem lies on this line of code:
query = sqlContext.createDataFrame(sqlContext.sql("SELECT * from database.table LIMIT 5").collect())
What you are doing here is:
asking Spark to query the data source (which creates a DataFrame)
collecting everything on the driver as a local collection
parallelizing the local collection on Spark with createDataFrame
In general the approach should work, although it's evidently unnecessarily convoluted.
The following will do:
query = sqlContext.sql("SELECT * from database.table LIMIT 5")
I'm not entirely sure why the original version breaks, but it does (as came out in the comments), and the simpler query is an improvement in any case.
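One plausible explanation, offered as an assumption rather than something confirmed in the thread: createDataFrame on a collected list has to infer the schema from the Python values, and a column that is None in every collected row gives it nothing to infer from. The sketch below imitates that idea in plain Python. The infer_column_types helper and the sample rows are hypothetical, and this is not PySpark's implementation.

```python
def infer_column_types(rows):
    """Infer one Python type per column; fail if a column is None in every row."""
    column_types = {}
    for row in rows:
        for name, value in row.items():
            # Only fill in a type while the column is still undetermined.
            if column_types.get(name) is None:
                column_types[name] = None if value is None else type(value)

    undetermined = sorted(n for n, t in column_types.items() if t is None)
    if undetermined:
        raise ValueError(
            "cannot determine types for columns: %s" % ", ".join(undetermined)
        )
    return column_types


# Hypothetical rows: "note" has a value in one row here...
rows_ok = [{"id": 1, "note": None}, {"id": 2, "note": "x"}]
# ...and is None in every row here.
rows_all_null = [{"id": 1, "note": None}, {"id": 2, "note": None}]

types_ok = infer_column_types(rows_ok)
try:
    infer_column_types(rows_all_null)
    failed = False
except ValueError:
    failed = True
```

Keeping the result of sqlContext.sql(...) as a DataFrame avoids this kind of inference, because the schema comes from the table definition rather than from the collected values.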
We are working to upgrade our application to a more current version of Ruby & Rails. Our app integrates with a legacy database (SQL Server 2008 R2) that has a table with a column of image data type (we are unable to change this column to varbinary(max)). Previously we were able to save a binary into the image column. However now we are getting conversion errors.
We are working to upgrade to the following (among others):
Rails 4.2.1
ActiveRecord_SQLServer_Adapter (4.2.4)
tiny_tds (0.6.3.rc1)
freeTDS (v0.91.112)
When we now attempt to save into the image column, we get errors similar to:
TinyTds::Error: Unclosed quotation mark after the character string
After researching various issues within tiny_tds and activerecord_sqlserver_adapter, we decided to create a second table that matched the first but with the data type changed from image to varbinary(max). We can save a binary into that column.
The code causing the challenge is in a background job where we grab images from s3, store them locally and then push the image into the database. Again, we don't control the legacy database and thus can't change the data type (or confront the issue of why we are storing the image in the db in the first place).
...
@d = Doc.new
...
open("#{Rails.root}/cache/pictures/image.png", "wb") do |file|
  file << open(r.image.url).read
end
@d.document = File.binread("#{Rails.root}/cache/pictures/image.png")
@d.save!
Given the upgrade has broken our saving images, we are trying to figure out how best to determine a fix. We could obviously roll back until we find a version that works. However we hope to find a fix. Anyone have any ideas?
Update:
We added the following configuration as we had triggers on the table being inserted: ActiveRecord::ConnectionAdapters::SQLServerAdapter.use_output_inserted = true
When we remove this configuration we get the following error:
TinyTds::Error: The target table 'doc' of the DML statement cannot have any enabled triggers if the statement contains an OUTPUT clause without INTO clause.
Note: We are unable to make any modifications to the triggers.
Per feedback on the ActiveRecord_SQLServer_Adapter site, we rolled back to 4.1.11 and we are now able to save into the image column.
We also had to add this snippet to overcome the issue with the triggers.