Pig cannot dump after new bag generated - casting

This is my script:
Customer = LOAD '/home/hduser/PigSandbox/PigOut/Customer_Base.json/part-r-00000' USING JsonLoader();
CustomerPost = FOREACH Customer GENERATE ((chararray)Email
,DateModified
,Age
,City
,Name
,Occupation
,State
,Address1
,Address2
,Company
,Country
,DateOfBirth
,Fax
,Phone
,PostalCode);
DUMP CustomerPost;
This is the error I get:
java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: ERROR 1075: Received a bytearray from the UDF or Union from two different Loaders. Cannot determine how to convert the bytearray to string.
All I want to do is cast the Email field from a bytearray to a chararray. If there is a better way of doing this than generating a new bag, please let me know. If not, how can I get what I have working? Or at least, what does this error mean?
My Pig version is 0.14.0, from Nov 16 2014.
Here is an example row:
{"Email" : "foo#bar.com", "DateModified":"2014-12-27T07:36:38.000-05:00","Age":"10-20","City":"Some Location","Name":"foo","Occupation":"engineer","State":"ok","Address1":"foo lane","Address2":null,"Company":"lorem","Country":"us","DateOfBirth":"1969","Fax":"555-555-5555","Phone":"555-555-5555","PostalCode":"12345"}
and here is the schema file:
{"fields":[{"name":"Email","type":50,"description":"autogenerated from Pig Field Schema","schema":null},{"name":"DateModified","type":30,"description":"autogenerated from Pig Field Schema","schema":null},{"name":"Age","type":55,"description":"autogenerated from Pig Field Schema","schema":null},{"name":"City","type":55,"description":"autogenerated from Pig Field Schema","schema":null},{"name":"Name","type":55,"description":"autogenerated from Pig Field Schema","schema":null},{"name":"Occupation","type":55,"description":"autogenerated from Pig Field Schema","schema":null},{"name":"State","type":55,"description":"autogenerated from Pig Field Schema","schema":null},{"name":"Address1","type":55,"description":"autogenerated from Pig Field Schema","schema":null},{"name":"Address2","type":55,"description":"autogenerated from Pig Field Schema","schema":null},{"name":"Company","type":55,"description":"autogenerated from Pig Field Schema","schema":null},{"name":"Country","type":55,"description":"autogenerated from Pig Field Schema","schema":null},{"name":"DateOfBirth","type":55,"description":"autogenerated from Pig Field Schema","schema":null},{"name":"Fax","type":55,"description":"autogenerated from Pig Field Schema","schema":null},{"name":"Phone","type":55,"description":"autogenerated from Pig Field Schema","schema":null},{"name":"PostalCode","type":55,"description":"autogenerated from Pig Field Schema","schema":null}}

pyspark.sql.utils.AnalysisException: Reference 'title' is ambiguous, could be: title, title

I am using Glue version 3.0, Python version 3, and Spark version 3.1.
I am extracting data from XML, creating a dataframe, and writing the data to an S3 path in CSV format.
Before writing the dataframe I printed the schema and one record with show(1); up to this point everything was fine.
But while writing it to a CSV file in the S3 location I got a "duplicate column found" error, because my dataframe had two columns named "Title" and "title".
I tried to add a new column title2 that would hold the content of title, planning to drop title later, with the command below:
from pyspark.sql import functions as f
df = df.withColumn('title2', f.expr("title"))
but I got the error:
Reference 'title' is ambiguous, could be: title, title
I also tried
df = df.withColumn('title2', f.col("title"))
and got the same error.
Any help or approach to resolve this, please?
By default Spark is case-insensitive; we can make it case-sensitive by setting spark.sql.caseSensitive to True.
from pyspark.sql import functions as f
df = spark.createDataFrame([("CaptializedTitleColumn", "title_column", ), ], ("Title", "title", ))
spark.conf.set('spark.sql.caseSensitive', True)
df.withColumn('title2', f.expr("title")).show()
Output
+--------------------+------------+------------+
| Title| title| title2|
+--------------------+------------+------------+
|CaptializedTitleC...|title_column|title_column|
+--------------------+------------+------------+

Expression Error Key didn't Match Any Rows

I am trying to get today's date and format it as yyMMdd, because my table name changes daily, e.g. MICRINFO210616 today and MICRINFO210617 tomorrow.
When I run the code below I get the following error:
Expression.Error: The key didn't match any rows in the table.
Key=
Schema=dbo
Item=MICRINFO210617
Table=[Table]
code:
let
    Source = Sql.Database("TEST", "TEST"),
    formattedDate = Date.ToText(DateTime.Date(DateTime.LocalNow()), "yyMMdd"),
    combine = "MICRINFO" & formattedDate,
    dbo_MICRINFO210616 = Source{[Schema="dbo", Item=combine]}[Data]
in
    dbo_MICRINFO210616
Make sure the account you're using has at least read permission on the new table.
Check that the structure of both tables is the same (same number of columns, same data types).

import data from HBase insert to another table using Pig UDF

I'm using Pig UDFs in Python to read data from an HBase table, process and parse it, and finally insert it into another HBase table, but I'm facing some issues.
Pig's map is equivalent to Python's dictionary.
My Python script takes (rowkey, some_mapped_values) as input and returns two strings, "key" and "content", with the output schema #outputSchema('tuple:(rowkey:chararray, some_values:chararray)').
The core of my Python script takes a rowkey, parses it, transforms it into another rowkey, and turns the mapped data into another string, returning the variables (key, content).
But when I try to insert those new values into the other HBase table, I face two problems:
The processing itself works, but the script inserts "new_rowkey+content" as the rowkey and leaves the cell empty.
Second, how do I specify the new column family to insert into?
Here is a snippet of my Pig script:
register 'parser.py' using org.apache.pig.scripting.jython.JythonScriptEngine as myParser;
data = LOAD 'hbase://source' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('values:*', '-loadKey true') AS
(rowkey:chararray, curves:map[]);
data_1 = FOREACH data GENERATE myParser.table_to_xml(rowkey,curves);
STORE data_1 INTO 'hbase://destination' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('a_column_family:curves:*');
and my Python script takes (rowkey, curves) as input:
#outputSchema('tuple:(rowkey:chararray, curves:chararray)')
def table_to_xml(rowkey,curves):
    key = some_processing_which_is_correct
    for k in curves:
        content = some_processing_which_is_also_correct
    return (key,content)
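For reference, here is a sketch (an assumption on my part, not from the original post) of how these two issues are commonly handled: FLATTEN the tuple returned by the UDF so its two fields become top-level columns, and name the destination column explicitly in the HBaseStorage constructor; on a STORE, HBaseStorage uses the first field as the rowkey and maps the remaining fields, in order, to the listed columns. The names new_rowkey, content and a_column_family:content are illustrative only.
-- Sketch: FLATTEN splits the (key, content) tuple into two separate fields;
-- the first becomes the HBase rowkey, the second is stored under the
-- hypothetical column a_column_family:content.
data_1 = FOREACH data GENERATE FLATTEN(myParser.table_to_xml(rowkey, curves))
    AS (new_rowkey:chararray, content:chararray);
STORE data_1 INTO 'hbase://destination'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('a_column_family:content');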

How to parse through a column in Pig to create additional columns

New Apache Pig user here. I have data in the format below and need to split it into 6 columns to create my desired schema, then load it into Pig so my existing script can run.
Sorry if the format below is untidy; I can't upload a picture due to my reputation score.
The existing format has 3 columns:
User-Equipment values::key:bytearray values:value:bytearray
user1-mobile 20130306-AC 9
user1-mobile 20130306-AT 21
user2-laptop 20130306-BC 0
Required format:
User Equipment Date Type "Count or Time" Value
user1 mobile 20130306 A C 9
user1 mobile 20130306 A T 21
Any suggestions on how to get this done? Is there a regex I need to write?
The tricky thing here is that all the columns have a delimiter (-) between them except "Type" and the "C or T" column.
If you don't have a common delimiter I can think of two possibilities:
You could implement your own LoadFunc as explained here: http://ofps.oreilly.com/titles/9781449302641/load_and_store_funcs.html
You could use REGEX_EXTRACT_ALL as explained here: Apache Pig: Extra query parameters from web log
Here you go for option 2:
A = LOAD 'abc.txt' AS (line:CHARARRAY);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line, '^(.+?)\\-(.+?)\\s(.+?)\\-(.)(.)\\s(.+)$')) AS (User:CHARARRAY,Equipment:CHARARRAY,Date:CHARARRAY,Type:CHARARRAY,CountorTime:CHARARRAY,Value:CHARARRAY);

Rails 4 + MongoDB + Search query LIKE does not give correct output

In Rails, I am trying to fetch data from MongoDB with a LIKE-style query by providing a regular expression, but I'm still not getting the correct output.
Model: User
_id, name, display_name, age, address, nick_name
a1, Johny, Johny K, 12, New York, John
b1, James, James Waltor, 15, New York, James
c1, Joshua, Joshua T, 13, California, Josh
Now I have these 3 records.
Query 1: Search for users whose name starts with 'Jo'
User.where(name: /^jo/i)
Output: only one record, instead of two.
Query 2: Match the text against all column values
User.where($where: /^jo/i)
This does not give the proper output.
OK, on Query 1, can you output the documents? I believe one of your records has a character, such as whitespace, in front of the 'name' value. I just ran the same query locally and it pulled multiple records back.
Try this:
User.where(name: /(.*)jo(.*)/i).count and see what that returns. It should match 2. If that works, then you'll need to look at what is incorrect with the stored value.
On Query 2, where have you seen this syntax? $where expects a string containing a JS function to execute to match records. In your case, to match any field within the document against an expression, you would need a recursive function across each field in each document.
For Query 2, to match against all fields:
One solution, although inefficient, is to do it within the Rails app instead of in a MongoDB query.
e.g.
User.all.select { |user| user.attributes.values.grep(/^jo/i).any? }