Error in executing PIg script

Error in executing PIg script - hdfs

I am trying to execute a pig script which is placed in HDFS. I am getting an error.
Pig Stack Trace
ERROR 2999: Unexpected internal error. null
java.lang.NullPointerException
at org.apache.pig.impl.io.FileLocalizer.fetchFilesInternal(FileLocalizer.java:734)
at org.apache.pig.impl.io.FileLocalizer.fetchFiles(FileLocalizer.java:699)
at org.apache.pig.PigServer.registerJar(PigServer.java:522)
at org.apache.pig.tools.grunt.GruntParser.processRegister(GruntParser.java:473)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:546)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
at org.apache.pig.Main.run(Main.java:613)
at org.apache.pig.Main.main(Main.java:158)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
But If I execute the same script statement by statement in grunt shell then it works fine. Following command I am using to execute a script.
pig -x mapreduce hdfs://quickstart.cloudera:8020/user/cloudera/oozie/PigScript/tweetAnalysis.pig
Here is the script
--REGISTER required Json Jars for parsing Json
REGISTER '/usr/lib/pig/lib/json-simple-1.1.jar'
REGISTER '/home/cloudera/Desktop/hadoopProgram/jar/elephant-bird-hadoop-compat-4.1.jar'
REGISTER '/home/cloudera/Desktop/hadoopProgram/jar/elephant-bird-pig-4.1.jar'
-- Parsing and loading Company.cfg file
parsing_company = LOAD 'company' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
flatten_company = FOREACH parsing_company GENERATE FLATTEN($0#'relevanceInfo');
flatten_company_1 = FOREACH flatten_company GENERATE FLATTEN($0) AS mymap;
extract_company_details= FOREACH flatten_company_1 GENERATE mymap#'companyName' as companyName,FLATTEN(mymap#'names') AS mymapNew;
company_results = FOREACH extract_company_details GENERATE companyName,mymapNew#'name' as Names;
-- Parsing and loading AllsightTweets file
tweets_test = LOAD 'AllsightTweets' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS myMap;
extract_tweets_details = FOREACH tweets_test GENERATE myMap#'text' as text;
-- Parsing and loading keywords file
load_keyword = LOAD 'keywords' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
foreach_keywords = FOREACH load_keyword GENERATE FLATTEN($0#'test') AS mymap;
flatten_keywords= FOREACH foreach_keywords GENERATE FLATTEN(mymap#'keywords') AS mymapNew;
result_keywords = FOREACH flatten_keywords GENERATE mymapNew#'keyword' as keyword,mymapNew#'category' as category;
-- Cross product between company and tweet relations
cross_company_tweet = cross company_results,extract_tweets_details;
-- Cross product between the outcome of first relation with keyword relations
cross_company_tweet_keywords = cross cross_company_tweet,result_keywords;
-- Filter the records where tweet matches with keywords and company
res = FILTER cross_company_tweet_keywords BY ((text MATCHES CONCAT(CONCAT('.*',company_results::companyName),'.*')) AND (text MATCHES CONCAT(CONCAT('.*',result_keywords::keyword),'.*')));
-- Group the result based on company name and category
res_group = Group res by (company_results::companyName,result_keywords::category);
res_group_count = FOREACH res_group GENERATE FLATTEN(group) as (company_results::companyName,result_keywords::category), COUNT($1);
-- store the result in HDFS
store res_group_count into 'PigResultLatest' using PigStorage (',');

Related

Cassandra COPY FROM query error, with a CSV file

The problem:
I'm trying to get it so I can use Cassandra to work with Python properly. I've been using a toy dataset to practice uploading a csv file into Cassandra with no luck.
Cassandra seems to work fine when I am not using COPY FROM for csv files.
My intention is to use this dataset as a test to make sure that I can load a csv file's information into Cassandra, so I can then load 5 csv files totaling 2 GB into it for my originally intended project.
Note: Whenever I use CREATE TABLE and then run SELECT * FROM tvshow_data, the columns don't appear in the order that I set them, is this going to affect anything, or does it not matter?
Info about my installations and usage:
I've tried running both cqlsh and cassandra with an admin powershell.
I have Python 2.7 installed inside of the apache-cassandra-3.11.6 folder.
I have Cassandra version 3.11.6 installed.
I have cassandra-driver 3.18.0 installed, with conda.
I use Python 3.7 installed for everything other than Cassandra's directory.
I have tried both CREATE TABLE tvshow and CREATE TABLE tvshow.tvshow_data.
My Python script:
from cassandra.cluster import Cluster
cluster = Cluster()
session = cluster.connect()
create_and_add_file_to_tvshow = [
"DROP KEYSPACE tvshow;",
"CREATE KEYSPACE tvshow WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};",
"USE tvshow;",
"CREATE TABLE tvshow.tvshow_data (id int PRIMARY KEY, title text, year int, age int, imdb decimal, rotten_tomatoes int, netflix int, hulu int, prime_video int, disney_plus int, is_tvshow int);",
"COPY tvshow_data (id, title, year, age, imdb, rotten_tomatoes, netflix, hulu, prime_video, disney_plus, is_tvshow) FROM 'C:tvshows.csv' WITH HEADER = true;"
]
print('\n')
for query in create_and_add_file_to_tvshow:
session.execute(query)
print(query, "\nsuccessful\n")
Resulting python error:
This is the error I get when I run my code in the powershell with the following command, python cassandra_test.py.
cassandra.protocol.SyntaxException: <Error from server: code=2000 [Syntax error in
CQL query] message="line 1:0 no viable alternative at input 'COPY' ([
Resulting cqlsh error:
Running the previously stated cqlsh code in the create_and_add_file_to_tvshow variable in powershell after running cqlsh in the apache-cassandra-3.1.3/bin/ directory, creates the following error.
Note: The following error is only the first few lines to the code as well as the last new lines, I choose not to include it since it was several hundred lines long. If necessary I will include it.
Starting copy of tvshow.tvshow_data with columns [id, title, year, age, imdb, rotten_tomatoes, netflix, hulu, prime_video, disney_plus, is_tvshow].
Failed to import 0 rows: IOError - Can't open 'C:tvshows.csv' for reading: no matching file found, given up after 1 attempts
Process ImportProcess-44:
PTrocess ImportProcess-41:
raceback (most recent call last):
PTPProcess ImportProcess-42:
...
...
...
AA cls._loop.add_timer(timer)
AAttributeError: 'NoneType' object has no attribute 'add_timer'
ttributeError: 'NoneType' object has no attribute 'add_timer'
AttributeError: 'NoneType' object has no attribute 'add_timer'
ttributeError: 'NoneType' object has no attribute 'add_timer'
ttributeError: 'NoneType' object has no attribute 'add_timer'
Processed: 0 rows; Rate: 0 rows/s; Avg. rate: 0 rows/s
0 rows imported from 0 files in 1.974 seconds (0 skipped).
A sample of the first 10 lines of the csv file used to import
I have tried creating a csv file with just these first two lines, for a toy's toy test, since I couldn't get anything else to work.
id,title,year,age,imdb,rotten_tomatoes,netflix,hulu,prime_video,disney_plus,is_tvshow
0,Breaking Bad,2008,18+,9.5,96%,1,0,0,0,1
1,Stranger Things,2016,16+,8.8,93%,1,0,0,0,1
2,Money Heist,2017,18+,8.4,91%,1,0,0,0,1
3,Sherlock,2010,16+,9.1,78%,1,0,0,0,1
4,Better Call Saul,2015,18+,8.7,97%,1,0,0,0,1
5,The Office,2005,16+,8.9,81%,1,0,0,0,1
6,Black Mirror,2011,18+,8.8,83%,1,0,0,0,1
7,Supernatural,2005,16+,8.4,93%,1,0,0,0,1
8,Peaky Blinders,2013,18+,8.8,92%,1,0,0,0,1

How do I transform a python list to a dynamic frame that can be used to create a csv file in s3

I'm using aws glue with a custom pyspark script which loads data from an aurora instance and transforms it. Due to the nature of my data source (I need to recursively run sql commands on a list of ids) I ended up with a normal python list which contains lists of tuples. the list looks something like this:
list = [[('id': 1), ('name1': value2),('name2': value2')], [('id': 2),...]
I've tried to convert it into a normal data frame using spark's createDataFrameMethod:
listDataFrame = spark.createDataFrame(list)
and converting that list to a dynamic Frame using the the fromDF method on the DynamicFrame class, ending up with something like this:
listDynamicFrame = fromDF(dataframe = listDataFrame, glue_ctx = glueContext, name = listDynamicFrame)
and then passing that to the from_options method:
datasink2 = glueContext.write_dynamic_frame.from_options(frame =
listDynamicFrame, connection_type = "s3", connection_options = {"path": "s3://glue.xxx.test"},
format = "csv", transformation_ctx = "datasink2")
job.commit()
So yes, unfortunately, this doesn't seem to work: I'm getting the following error message:
1554379793700 final status: FAILED tracking URL: http://169.254.76.1:8088/cluster/app/application_1554379233180_0001 user: root
Exception in thread "main"
org.apache.spark.SparkException: Application application_1554379233180_0001 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1122)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1168)
at org.apache.spark.deploy.yarn.Client.main(Client.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:775)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
19/04/04 12:10:00 INFO ShutdownHookManager: Shutdown hook called

create and insert into database tables using luigi

I am trying to understand the correct way to drop and recreate a table and insert data into the newly created table using luigi. I have multiple CSV files passed from the prior task to the task which should be inserted into the database.
My current code looks like this.
class CreateTables(sqla.CopyToTable):
connection_string = DatabaseConfig().data_mart_connection_string
table = DatabaseConfig().table_name
def requires(self):
return CustomerJourneyToCSV()
def output(self):
return SQLAlchemyTarget(
connection_string=self.connection_string,
target_table="customerJourney_1",
update_id=self.update_id(),
connect_args=self.connect_args,
echo=self.echo)
def create_table(self, engine):
base = automap_base()
Session = sessionmaker(bind=engine)
session = Session()
metadata = MetaData(engine)
base.prepare(engine, reflect=True)
# Drop existing tables
for i in range(1, len(self.input())+1):
for t in base.metadata.sorted_tables:
if t.name in "{0}_{1}".format(self.table, i):
t.drop(engine)
# Create new tables and insert data
i = 1
for f in self.input():
df = pd.read_csv(f.path, sep="|")
df.fillna(value="", inplace=True)
ts = define_table_schema(df)
t = Table("{0}_{1}".format(self.table, i), metadata, *[Column(*c[0], **c[1]) for c in ts])
t.create(engine)
# TODO: Need to remove head and figure out how to stop the connection from timing out
my_insert = t.insert().values(df.head(500).to_dict(orient="records"))
session.execute(my_insert)
i +=1
session.commit()
The code works creates the tables and inserts the data but falls over with the following error.
C:\Users\simon\AdviceDataMart\lib\site-packages\luigi\worker.py:191:
DtypeWarning: Columns (150) have mixed types. Specify dtype option on import
or set low_memory=False.
new_deps = self._run_get_new_deps()
File "C:\Users\simon\AdviceDataMart\lib\site-packages\luigi\worker.py", line
191, in run
new_deps = self._run_get_new_deps()
File "C:\Users\simon\AdviceDataMart\lib\site-packages\luigi\worker.py", line
129, in _run_get_new_deps
task_gen = self.task.run()
File "C:\Users\simon\AdviceDataMart\lib\site-
packages\luigi\contrib\sqla.py", line 375, in run
for row in itertools.islice(rows, self.chunk_size)]
File "C:\Users\simon\AdviceDataMart\lib\site-
packages\luigi\contrib\sqla.py", line 363, in rows
with self.input().open('r') as fobj:
AttributeError: 'list' object has no attribute 'open'
I am not sure what is causing this and am not able to easily debug a luigi pipeline. I am not sure if this has to do with implementation of the run method or the output method?

Python MYSQL to Text File - Headings Required

I have some code to query a MYSQL database and send the output to a text file.
The code below prints out the first 7 columns of data and sends it to a text file called Test
My question is, how do i also obtain the column HEADINGS from the database as well to display in the text file?
I am using Python 2.7 with a MYSQL database.
import MySQLdb
import sys
connection = MySQLdb.connect (host="localhost", user = "", passwd = "", db =
"")
cursor = connection.cursor ()
cursor.execute ("select * from tablename")
data = cursor.fetchall ()
OutputFile = open("C:\Temp\Test.txt", "w")
for row in data :
print>>OutputFile, row[0],row[1],row[2],row[3],row[4],row[5],row[6]
OutputFile.close()
cursor.close ()
connection.close ()
sys.exit()

The best way to get the details of the column name is by using INFORMATION_SCHEMA
SELECT `COLUMN_NAME`
FROM `INFORMATION_SCHEMA`.`COLUMNS`
WHERE `TABLE_SCHEMA`='yourdatabasename'
AND `TABLE_NAME`='yourtablename';
or by using the SHOW command of mySQL
SHOW columns FROM your-table;
This command is only mySQL specific.
and then to get the data you can use the .fetchall() function to get the details.

Loading Data to Pig with a pipe hyphen pipe |-| delimiter

When I try to load file.txt to Pig I am getting the following error:
pig script failed to validate: java.lang.RuntimeException: could not instantiate 'PigStorage' with arguments '[\|-\|]'
A sample line from the file is:
text|-|text|-|text
I am using the following command:
bag = LOAD 'file.txt' USING PigStorage('\\|-\\|') AS (v1:chararray, v2:chararray, v3:chararray);
Is it the delimiter? My regex?

If you don't want to write a custom LOAD function,you could probably load your records using '-' as the delimiter and then add another step to replace all the '|' in your fields.
bag = LOAD 'file.txt' USING PigStorage('-') AS (v1:chararray, v2:chararray, v3:chararray);
bag_new = FOREACH bag GENERATE
REPLACE(v1,'|','') as v1_new,
REPLACE(v2,'|','') as v2_new,
REPLACE(v3,'|','') as v3_new;

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Error in executing PIg script - hdfs

Related

Cassandra COPY FROM query error, with a CSV file

How do I transform a python list to a dynamic frame that can be used to create a csv file in s3

create and insert into database tables using luigi

Python MYSQL to Text File - Headings Required

Loading Data to Pig with a pipe hyphen pipe |-| delimiter

Categories

Resources