Regex on io.Text RDD using scala

I have a problem. I need to extract some data from a file like this:
(3269,
<page>
<title>Anarchism</title>
<ns>0</ns>
<id>12</id>
<revision>...
)
(194712,
<page>
<title>AssistiveTechnology</title>
<ns>0</ns>
<id>23</id>..
) etc...
This file was generated using:
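import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
val conf = new Configuration
conf.set("textinputformat.record.delimiter", "</page>")
val rdd = sc.newAPIHadoopFile("sample.bz2", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
rdd.map { case (k, v) => (k.get(), new String(v.copyBytes())) }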
I need to obtain the title content. I'm using a regex, but the output file is always empty. My code looks like this:
val xx = rdd.map(x => x._2).filter(x => x.matches(".*<title>([A-Za-z]+)<\\/title>.*"))
I also tried this pattern:
".*<title>([A-Za-z]+)</title>.*"
And using this:
val reg = ".*<title>([\\w]+)</title>.*".r
val xx = rdd.map(x => x._2).filter(x => reg.pattern.matcher(x).matches)
I build the .jar with sbt and run it with spark-submit.
BTW, it works in spark-shell :S
I need your help, please. Thanks.

You could use Scala's built-in XML support. Something like:
import scala.xml._
rdd.map(x => (XML.loadString(x._2) \ "title").text)
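If you would rather stay with a regex, one likely cause of the empty output is that each record spans several lines: in Java/Scala regexes "." does not match line breaks by default, and matches() has to cover the whole string. Note also that filter only keeps or drops records; it does not extract the title. The records are probably not complete XML documents either (the "</page>" delimiter is consumed when the input is split), so XML.loadString may reject them. Below is a minimal sketch in plain Scala, without Spark. The names TitleExtractor and extractTitle are hypothetical, and the sample record is modelled on the one in the question.

```scala
import scala.util.matching.Regex

object TitleExtractor { // hypothetical helper, not part of the original code
  // No leading/trailing ".*": findFirstMatchIn searches anywhere in the record,
  // so the line breaks around the <title> line do not matter.
  private val TitlePattern: Regex = "<title>([^<]+)</title>".r

  // Returns the captured title text, or None when the record has no <title>.
  def extractTitle(record: String): Option[String] =
    TitlePattern.findFirstMatchIn(record).map(_.group(1))

  def main(args: Array[String]): Unit = {
    // Sample record modelled on the question; real records may differ.
    val record = "\n<page>\n<title>Anarchism</title>\n<ns>0</ns>\n<id>12</id>\n<revision>..."

    println(extractTitle(record)) // Some(Anarchism)

    // The original pattern fails on this record because "." stops at line breaks.
    println(record.matches(".*<title>([A-Za-z]+)</title>.*")) // false

    // The DOTALL flag (?s) lets "." cross line breaks.
    println(record.matches("(?s).*<title>([A-Za-z]+)</title>.*")) // true
  }
}
```

Assuming rdd holds the (offset, text) pairs, something like rdd.flatMap { case (_, text) => TitleExtractor.extractTitle(text) } should yield only the titles. This does not explain why the same code behaves differently in spark-shell, so treat it as one thing to check rather than a confirmed diagnosis.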

Related

Ruby regex for XML attributes

I am trying to create a regex for a Fluent Bit parser and am not sure how to drop specific characters from a string.
<testsuite name="Activity moved" tests="1" errors="0" failures="0" skipped="0" time="151.109" timestamp="2022-09-05T16:22:53.184000">
Above is the input, which I have as a string, and I want to make multiple keys out of it.
Expected output:
name: Activity moved
tests: 1
errors: 0
failures: 0
skipped: 0
timestamp: 2022-09-05T16:22:53.184000
How can I achieve this, please?

Try this:
str = "<testsuite name=\"Activity moved\" tests=\"1\" errors=\"0\" failures=\"0\" skipped=\"0\" time=\"151.109\" timestamp=\"2022-09-05T16:22:53.184000\">"
regexp = /(\w*)="(.*?)"/ # there's your regexp
str.scan(regexp).to_h # and this is how you make the requested hash
# => {"name"=>"Activity moved", "tests"=>"1", "errors"=>"0", "failures"=>"0", "skipped"=>"0", "time"=>"151.109", "timestamp"=>"2022-09-05T16:22:53.184000"}

Of course you can write your own parser, but maybe it's more convenient to use Nokogiri?
require 'nokogiri'
doc = Nokogiri::XML(File.open("your.file", &:read))
puts doc.at("testsuite").attributes.map { |name, value| "#{name}: #{value}" }

How to use the RegexMatcher in SparkNLP

Here is the case. I want to run SparkNLP on JupyterLab with a Scala kernel, and I want to use the RegexMatcher annotator. I saved the patterns in a file named patterns.txt in an S3 bucket and tried the implementation below:
import com.johnsnowlabs.nlp.util.io.ExternalResource
import com.johnsnowlabs.nlp.util.io.ReadAs.LINE_BY_LINE
val document = new DocumentAssembler().setInputCol("text").setOutputCol("document")
val regexmatcher = new RegexMatcher().
setInputCols(Array("document")).
setOutputCol("match").
setStrategy("MATCH_ALL").
setRules(ExternalResource("s3://bucket_name/patterns.txt", LINE_BY_LINE, Map("format" -> "text", "delimiter" -> " ")))
val pipeline_regex = new Pipeline().setStages(Array(document, regexmatcher))
val regex_match = pipeline_regex.fit(dev_data)
regex_match.transform(dev_data).select('match).show(false)
However, it seems this doesn't work at all, and patterns.txt is not used. How can I fix it?

Task not serializable - Regex

I have a movie which has a title. The title contains the year of the movie, like "Movie (Year)". I want to extract the year, and I'm using a regex for this.
case class MovieRaw(movieid:Long,genres:String,title:String)
case class Movie(movieid:Long,genres:Set[String],title:String,year:Int)
val regexYear = ".*\\((\\d*)\\)".r
moviesRaw.map{case MovieRaw(i,g,t) => Movie(i,g,t,t.trim() match { case regexYear(y) => Integer.parseInt(y)})}
When executing the last command I get the following error:
java.io.NotSerializableException: org.apache.spark.SparkConf
Running in the Spark/Scala REPL, with this SparkContext:
val conf = new SparkConf(true).set("spark.cassandra.connection.host", "localhost")
val sc = new SparkContext(conf)

As Dean explained, the cause of the problem is that the REPL creates a class out of the code added to the REPL and, in this case, the other variables in the same context are being "pulled" into the closure by the regex declaration.
Given the way you're creating the context, a simple way to avoid that serialization issue would be to declare the SparkConf and SparkContext transient:
@transient val conf = new SparkConf(true).set("spark.cassandra.connection.host", "localhost")
@transient val sc = new SparkContext(conf)
You don't even need to recreate the Spark context in the REPL just to connect to Cassandra:
spark-shell --conf spark.cassandra.connection.host=localhost

You probably have this code in a larger Scala class or object (a type), right? If so, in order to serialize the regexYear, the whole enclosing type gets serialized, but you probably have the SparkConf defined in that type.
This is a very common and confusing problem and efforts are underway to prevent it, given the constraints of the JVM and languages on top of it, like Java.
The solution (for now) is to put regexYear inside a method or another object:
object MyJob {
  def main(...) = {
    case class MovieRaw(movieid:Long,genres:String,title:String)
    case class Movie(movieid:Long,genres:Set[String],title:String,year:Int)
    val regexYear = ".*\\((\\d*)\\)".r
    moviesRaw.map{case MovieRaw(i,g,t) => Movie(i,g,t,t.trim() match { case regexYear(y) => Integer.parseInt(y)})}
    ...
  }
}
or
...
object small {
  case class MovieRaw(movieid:Long,genres:String,title:String)
  case class Movie(movieid:Long,genres:Set[String],title:String,year:Int)
  val regexYear = ".*\\((\\d*)\\)".r
  moviesRaw.map{case MovieRaw(i,g,t) => Movie(i,g,t,t.trim() match { case regexYear(y) => Integer.parseInt(y)})}
}
Hope this helps.
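Separately from the serialization issue, the pattern match inside the map is not exhaustive: a title that does not end in "(Year)" raises a MatchError, and \d* can capture an empty string, which Integer.parseInt rejects. Here is a minimal sketch of a more defensive extraction, kept in its own object as suggested above. The names YearExtractor and yearOf, and the assumption that a year is exactly four digits, are mine and not from the original post.

```scala
object YearExtractor { // hypothetical name, for illustration only
  // Assumes the year is exactly four digits in parentheses at the end of the title.
  private val RegexYear = """.*\((\d{4})\)""".r

  // Some(year) when the title ends with "(YYYY)", None otherwise (no MatchError).
  def yearOf(title: String): Option[Int] = title.trim match {
    case RegexYear(y) => Some(y.toInt)
    case _            => None
  }

  def main(args: Array[String]): Unit = {
    println(yearOf("Toy Story (1995)")) // Some(1995)
    println(yearOf("Untitled"))         // None
    println(yearOf("Broken ()"))        // None
  }
}
```

Whether to drop movies without a year or to store an Option[Int] in Movie is a design choice the original post does not make.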

Try passing in the cassandra option on the command line for spark-shell like this:
spark-shell [other options] --conf spark.cassandra.connection.host=localhost
And that way you won't have to recreate the SparkContext -- you can use the SparkContext (sc) that gets instantiated automatically with spark-shell.

add product prestashop webservice. So difficult?

I'm searching for a tutorial or documentation that explains how to use the PrestaShop web service (v1.5.6.0).
I'd simply like to add or edit (update) a product.
There isn't a clear tutorial or example on how to use PrestaShop's API.
Could you help me, please?
For example, I'd like to add an object:
define('PS_SHOP_PATH', 'http://localhost/myshop'); // Root path of your PrestaShop store
define('PS_WS_AUTH_KEY', '****'); // Auth key (Get it in your Back Office)
require_once('api/PSWebServiceLibrary.php');
$webService = new PrestaShopWebservice(PS_SHOP_PATH, PS_WS_AUTH_KEY, DEBUG);
$opt = array('resource' => 'products');
Now, how can I set my values for a new object? The example only shows how to insert the required values.
Could you help me?
And what about updating?
Please don't just link me to the PrestaShop documentation; I have already read it. I'm asking for your help.
Thanks, and sorry for my bad English.

Here is the link to the official documentation. You will find all the information you need there.
Basically, you'll need to create an XML document that represents the object you'd like to send (POST to create, PUT to update).
Let's say you want to create a new category.
First, you need to get the schema:
$xml = $webService->get(array('url' => PS_SHOP_PATH.'/api/categories?schema=blank'));
which looks something like this:
<prestashop>
<category>
</category>
</prestashop>
Then you'll have to set the content of the XML ($resources):
$resources = $xml->children()->children();
$resources->active = true;
$resources->..........
etc.
Finally,
try
{
    $opt = array('resource' => 'categories');
    $opt['postXml'] = $xml->asXML();
    $xml = $webService->add($opt);
}
catch (PrestaShopWebserviceException $e)
{
    echo 'Something went wrong: '.$e->getMessage();
}

Perl: can not post xml data to web service

The web service accepts XML data and returns its values as XML. I am trying to post the XML data to the web service, without any success. I need to do it using Perl. The following is the code I tried:
use SOAP::Lite ;
my $URL = "http://webservice.com:7011/webServices/HealthService.jws?WSDL=";
my $xml_data = '<Request>HealthCheck</Request>' ;
my $result = SOAP::Lite -> service($xml_data);
print $result ;
I tried another approach with proxy:
use SOAP::Lite +trace => 'debug';
my $URI = 'webServices/HealthService' ;
my $URL = "http://webservice.com:7011/webServices/HealthService.jws?WSDL=" ;
my $test = SOAP::Lite -> uri($URI)
-> proxy($URL) ;
my $xml_data = '<Request>HealthCheck</Request>' ;
my $result = $test -> healthRequest($xml_data);
print $result ;
However this is throwing the following error:
Can't locate class method "http://webservice.com:7011/healthRequest" via package "SOAP::Lite\" at 7.pl line 4. BEGIN failed--compilation aborted at 7.pl line 4.
The web service provides only one method, HealthRequest. I am not sure why it is trying to find the class method in SOAP::Lite. I get the same error with both approaches.
Is there any other method to achieve the same using Perl?

Try something like this. I have not tested it, so run it and see what happens; you should at least not get the PM error.
use strict;
use SOAP::Lite;
my $xml_data = '<Request>HealthCheck</Request>' ;
my $soap = SOAP::Lite
->uri("webServices/HealthService")
->proxy("http://webservice.com:7011/webServices/HealthService.jws?WSDL=");
print $soap->service($xml_data),"\n";

If you want to create the XML yourself and not delegate that task to SOAP::Lite, you need to let SOAP::Lite know what you are doing:
$soap = SOAP::Lite->ns( $URI )->proxy( $URL );
$soap->HealthCheck( SOAP::Data->type( xml => $xml_data ) );
I have my doubts, though, that this will work with your XML.
If your request really has no variable parameters, this may work:
$soap = SOAP::Lite->ns( $URI )->proxy( $URL );
$soap->HealthCheck;
PS: Are you sure that your web service is a SOAP service?
