Strip or regex function in Spark 1.3 DataFrame

I have some code from PySpark 1.5 that I unfortunately have to port back to Spark 1.3. I have a column with elements that are alphanumeric, but I only want the digits.
An example of an element in 'old_col' of 'df' is:
'125 Bytes'
In Spark 1.5 I was able to use
df.withColumn('new_col',F.regexp_replace('old_col','(\D+)','').cast("long"))
However, I cannot seem to come up with a solution using old 1.3 methods like SUBSTR or RLIKE. The reason is that the number of digits in front of "Bytes" varies, so what I really need is the 'replace' or 'strip' functionality that I can't find in Spark 1.3.
Any suggestions?

As long as you use HiveContext, you can execute the corresponding Hive UDFs, either with selectExpr:
df.selectExpr("regexp_extract(old_col,'([0-9]+)', 1)")
or with plain SQL:
df.registerTempTable("df")
sqlContext.sql("SELECT regexp_extract(old_col,'([0-9]+)', 1) FROM df")
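A minimal end-to-end sketch of that approach on Spark 1.3, assuming a SparkContext named sc; the BIGINT cast mirrors the .cast("long") from the original 1.5 code, and column/table names follow the question:
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)  # HiveContext is needed for the Hive regexp_extract UDF
# selectExpr variant: pull out the leading digits and cast them to a long
df_new = df.selectExpr(
    "old_col",
    "CAST(regexp_extract(old_col, '([0-9]+)', 1) AS BIGINT) AS new_col")
# plain-SQL variant
df.registerTempTable("df")
df_new_sql = sqlContext.sql(
    "SELECT old_col, CAST(regexp_extract(old_col, '([0-9]+)', 1) AS BIGINT) AS new_col FROM df")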

Related

PySpark Write Parquet Binary Column with Stats (signed-min-max.enabled)

I found this apache-parquet ticket https://issues.apache.org/jira/browse/PARQUET-686 which is marked as resolved for parquet-mr 1.8.2. The feature I want is the calculated min/max in the parquet metadata for a (string or BINARY) column.
Referencing this is an email https://lists.apache.org/thread.html/%3CCANPCBc2UPm+oZFfP9oT8gPKh_v0_BF0jVEuf=Q3d-5=ugxSFbQ#mail.gmail.com%3E
which uses Java (parquet-mr) instead of pyspark as an example:
Configuration conf = new Configuration();
conf.set("parquet.strings.signed-min-max.enabled", "true");
Path inputPath = new Path(input);
FileStatus inputFileStatus = inputPath.getFileSystem(conf).getFileStatus(inputPath);
List<Footer> footers = ParquetFileReader.readFooters(conf, inputFileStatus, false);
I've been unable to set this value in pyspark (perhaps I'm setting it in the wrong place?)
Example dataframe:
import random
import string
from pyspark.sql.types import StringType
r = []
for x in range(2000):
    r.append(u''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(10)))
df = spark.createDataFrame(r, StringType())
I've tried a few different ways of setting this option:
df.write.format("parquet").option("parquet.strings.signed-min-max.enabled", "true").save("s3a://test.bucket/option")
df.write.option("parquet.strings.signed-min-max.enabled", "true").parquet("s3a://test.bucket/option")
df.write.option("parquet.strings.signed-min-max.enabled", True).parquet("s3a://test.bucket/option")
But all of the saved parquet files are missing the ST/STATS for the BINARY column. Here is an example output of the metadata from one of the parquet files:
creator: parquet-mr version 1.8.3 (build aef7230e114214b7cc962a8f3fc5aeed6ce80828)
extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"value","type":"string","nullable":true,"metadata":{}}]}
file schema: spark_schema
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
value: OPTIONAL BINARY O:UTF8 R:0 D:1
row group 1: RC:33 TS:515
---------------------------------------------------------------------------------------------------
Also, based on this email chain https://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%3C9DEF4C39-DFC2-411B-8987-5B9C33842974#videoamp.com%3E and this question, Specify Parquet properties pyspark,
I tried sneaking the config in through the pyspark private API:
spark.sparkContext._jsc.hadoopConfiguration().setBoolean("parquet.strings.signed-min-max.enabled", True)
So I am still unable to set the conf parquet.strings.signed-min-max.enabled in parquet-mr (or it is set, but something else has gone wrong).
Is it possible to configure parquet-mr from pyspark?
Does pyspark 2.3.x support BINARY column stats?
How do I take advantage of the PARQUET-686 feature to add min/max metadata for string columns in a parquet file?
Since historically Parquet writers wrote wrong min/max values for UTF-8 strings, new Parquet implementations skip those stats during reading, unless parquet.strings.signed-min-max.enabled is set. So this setting is a read option that tells the Parquet library to trust the min/max values in spite of their known deficiency. The only case when this setting can be safely enabled is if the strings only contain ASCII characters, because the corresponding bytes for those will never be negative.
Since you use parquet-tools for dumping the statistics and parquet-tools itself uses the Parquet library, it will ignore string min/max statistics by default. Although it seems that there are no min/max values in the file, in reality they are there, but get ignored.
The proper solution for this problem is PARQUET-1025, which introduces new statistics fields min-value and max-value. These handle UTF-8 strings correctly.
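If the strings are known to be ASCII-only, a minimal sketch of enabling the read option from pyspark, reusing the private hadoopConfiguration() hook from the question (whether the setting actually reaches the Parquet reader depends on how Spark propagates its Hadoop configuration, so treat this as a sketch rather than a guaranteed fix):
# Read-side sketch: tell the Parquet library to trust the legacy string min/max
# stats. Only safe when the column contains pure-ASCII strings.
spark.sparkContext._jsc.hadoopConfiguration().set(
    "parquet.strings.signed-min-max.enabled", "true")
df = spark.read.parquet("s3a://test.bucket/option")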

Searching a Mongo database using PyMongo, while using regex

I currently have a PyMongo collection with around 100,000 documents. I need to perform a regex search on each of these documents, checking each document against around 1,800 values to see if a particular field (which is an array) contains one of the 1,800 strings. After testing a variety of ways of using regex, such as compiling into a regular expression, multiprocessing and multi-threading, the performance is still abysmal, and takes around 30-45 minutes.
The current regex I'm using to find the value at the end of the string is:
rgx = re.compile(string_To_Be_Compared + '$')
And then this is run using a standard pymongo find query:
coll.find( { 'field' : rgx } )
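For context, the per-value loop being described looks roughly like this (a sketch; values_to_check is an assumed name for the list of ~1,800 strings, and coll is the collection from above):
import re
results = []
for string_To_Be_Compared in values_to_check:
    # one anchored regex per value, i.e. roughly 1,800 separate queries
    rgx = re.compile(string_To_Be_Compared + '$')
    results.extend(coll.find({'field': rgx}))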
I was wondering if anyone had any suggestions for querying these values in a more optimal way? Ideally the search to return all the values should take less than 5 minutes. Would the best course of action be to use something like ElasticSearch, or am I missing something basic?
Thanks for your time.

How to store Array or Blob in SnappyData?

I'm trying to create a table with two columns like below:
CREATE TABLE test (col1 INT ,col2 Array<Decimal>) USING column options(BUCKETS '5');
The table is created successfully, but when I try to insert data into it, it does not accept any format of array. I've tried the following queries:
insert into test1 values(1,Array(Decimal("1"), Decimal("2")));
insert into test1 values(1,Array(1,2));
insert into test1 values(1,[1,2,1]);
insert into test1 values(1,"1,2,1");
insert into test1 values(1,<1,2,1>);
etc..
Please help!
There is an open ticket for this: https://jira.snappydata.io/browse/SNAP-1284, which will be addressed in the next release for VALUES strings (JSON strings and Spark-compatible strings).
The Spark Catalyst compatible format will work:
insert into test1 select 1, array(1, 2);
When selecting, the data is by default shipped in serialized form and shown as binary. For now you have to use the "complexTypeAsJson" hint to show it as JSON:
select * from test1 --+complexTypeAsJson(true);
Support for displaying in a simpler string format by default will be added in the next release.
One other thing to note in your example is the prime value for BUCKETS. This was documented as preferred in previous releases, but as of the 1.0 release it is recommended to use a power of two or some other even number (e.g. the total number of cores in your cluster can be a good choice); perhaps some examples are still using the older recommendation.

How to count the number of times value exists in line?

My data looks like this:
[123:1000,156,132,123,156,123]
[123:1009,392,132,123,156,123]
[234:987,789,132,123,156,123]
[234:8765,789,132,123,156,123]
I need to count the number of times "123" exists in each line using Expression Language in NiFi.
I need to do it in Expression Language only. How can I count it?
Any help appreciated.
You should use the SplitContent processor to split the flowfile content into individual flowfiles per line, then use ExtractText with a regex like pattern=(123), which will result in an attribute being added to the flowfile for each matching group:
[123:1009,392,132,123,156,123] -> pattern.1, pattern.2, pattern.3
[234:987,789,132,123,156,123] -> pattern.1, pattern.2
Finally, you can use a ScanAttribute processor to detect the attribute with the highest group count in each of the flowfiles and route it to an UpdateAttribute to put that value into a common flowfile attribute (i.e. count). You could also replace some steps with an ExecuteStreamCommand and use a variety of OS-level tools (grep/awk/sed/cut/etc.) to perform the count, return that value, and update the content of the flowfile.
It would probably be simpler for you to perform this count action within an ExecuteScript processor, as it could be done in 1-2 lines of Groovy, Ruby, Python, or Javascript, and would not require multiple processors. Apache NiFi is designed for data routing and simple transformation, not complex event processing, so there are no standard processors developed for these tasks. There is an open Jira for "Add processor to perform simple aggregations" with a patch available, which may be useful for you.
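If you do go the ExecuteScript route, the counting logic itself is tiny; here is a minimal standalone Python sketch of it (outside NiFi, with the sample lines and the target value hard-coded for illustration):
import re
lines = [
    "[123:1000,156,132,123,156,123]",
    "[123:1009,392,132,123,156,123]",
    "[234:987,789,132,123,156,123]",
    "[234:8765,789,132,123,156,123]",
]
for line in lines:
    # split on the [, ], : and , delimiters, then count exact matches of "123"
    fields = [f for f in re.split(r"[\[\]:,]", line) if f]
    print("%s -> %d" % (line, fields.count("123")))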
According to the documentation a count is done like this:
${allMatchingAttributes(".*"):contains("123"):count()}

Dealing with a list of tuples - SQLite3 & Python 2.7

I am using a database to return a couple of values I place there. Let's just say the data is google, yahoo, bing.
The Code
dbCursor.execute('''SELECT ticker FROM SearchEngines''')
engines = dbCursor.fetchall()
for engine in engines:
    print engine
Yields the following result:
(u'google',)
(u'yahoo',)
(u'bing',)
This is troublesome because I require the result to be appended to a url in a string format. Does anybody know a way around this?
Thanks
fetchall() always returns each row as a tuple, even if you're just selecting one field. So...
for engine in engines:
    print engine[0]
Or:
for (engine,) in engines:
    print engine
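And for the original goal of appending each value to a URL as a string, a minimal sketch (the URL template is hypothetical):
base_url = 'http://example.com/quote?engine='  # hypothetical URL template
for (engine,) in engines:
    print base_url + engine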
Hope this helps.