Extract a string from another column using regexp_extract - regex

I want to extract the value of s[0] from "column1":
sada/object=fan/sn=dadfs/s[0]=gsf,sdfs,sfdgs,/s[1]=dfsd,sdg,hte,/redirect=sdgfd/
The output should be the value of s[0]:
gsf,sdfs,sfdgs
I tried escaping the brackets with \, but it's not working:
REGEXP_EXTRACT(column1, 's\\[0\\] = ([^&]+)')
This is in PySpark.

Input:
from pyspark.sql import functions as F
# Spark dataframe:
df = spark.createDataFrame([("sada/object=fan/sn=dadfs/s[0]=gsf,sdfs,sfdgs,/s[1]=dfsd,sdg,hte,/redirect=sdgfd/",)], ["column1"])
# SQL table:
df.createOrReplaceTempView("df")
The pattern in the question fails because the data has no spaces around = and no & terminator; instead, match non-greedily up to the ,/ that closes the s[0] field.
PySpark:
df.select(F.regexp_extract('column1', r's\[0\]=(.*?),/', 1).alias('match')).show()
# +--------------+
# | match|
# +--------------+
# |gsf,sdfs,sfdgs|
# +--------------+
SQL:
spark.sql("select regexp_extract(column1, r's\\[0\\]=(.*?),/', 1) as match from df").show()
# +--------------+
# | match|
# +--------------+
# |gsf,sdfs,sfdgs|
# +--------------+
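As a small extension (a sketch using the same sample dataframe), the pattern can be parameterized on the index to pull out s[1] as well:
df.select(
    F.regexp_extract('column1', r's\[0\]=(.*?),/', 1).alias('s0'),
    F.regexp_extract('column1', r's\[1\]=(.*?),/', 1).alias('s1'),
).show()
# +--------------+------------+
# |            s0|          s1|
# +--------------+------------+
# |gsf,sdfs,sfdgs|dfsd,sdg,hte|
# +--------------+------------+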

Related

Change values within AWS Glue DynamicFrame columns

I am trying to change values within some columns of my DynamicFrame in an AWS Glue job.
I see there is a Map function that seems useful for the task, but I cannot make it work.
This is my code:
def map_values_in_columns(self, df):
    df = Map.apply(frame=df, f=self._map_values_in_columns)
    return df

def _map_values_in_columns(self, rec):
    for k, v in self.config['value_mapping'].items():
        column_name = self.config['value_mapping'][k]['column_name']
        values = self.config['value_mapping'][k]['values']
        for old_value, new_value in values.items():
            if rec[column_name] == old_value:
                rec[column_name] = new_value
    return rec
My config file is a yaml file with this structure:
value_mapping:
  column_1:
    column_name: asd
    values:
      - old_value_1: new_value_1
      - old_value_2: new_value_2
  column_2:
    column_name: dsa
    values:
      - old_value_1: new_value_1
      - old_value_2: new_value_2
The above method throws a serialisation error:
_pickle.PicklingError: Could not serialize object: Py4JError: An error occurred while calling o81.__getstate__. Trace:
py4j.Py4JException: Method __getstate__([]) does not exist
I am not sure if this is due to how I am implementing the Map method, or if I should use a completely different approach.
So the question:
How can I change multiple values within multiple columns using AWS DynamicFrame, trying to avoid conversion back and forth between DynamicFrames and DataFrames?
There are a few problems in your code and yaml config, and I'm not going to debug them all here. The pickling error most likely comes from passing the bound method self._map_values_in_columns to Map.apply: pickling the function drags the whole self along, including JVM-backed Glue/Spark objects that cannot be serialized. See the working sample below, which can also be executed locally in a Jupyter notebook.
I have simplified the yaml to keep the parsing complexity low.
from awsglue.context import GlueContext
from awsglue.transforms import *
from pyspark.context import SparkContext
from awsglue.dynamicframe import DynamicFrame

glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session  # needed for createDataFrame below

columns = ["id", "asd", "dsa"]
data = [("1", "retain", "old_val_dsa_1"), ("2", "old_val_asd_1", "old_val_dsa_2"), ("3", "old_val_asd_2", "retain"), ("4", None, "")]
df = spark.createDataFrame(data).toDF(*columns)
dyF = DynamicFrame.fromDF(df, glueContext, "test_dyF")

import yaml
config = yaml.safe_load('''value_mapping:
  asd:
    old_val_asd_1: new_val_asd_1
    old_val_asd_2: new_val_asd_2
  dsa:
    old_val_dsa_1: new_val_dsa_1
    old_val_dsa_2: new_val_dsa_2''')

def map_values(rec):
    for k, v in config['value_mapping'].items():
        if rec[k] is not None:
            replacement_val = v.get(rec[k])
            if replacement_val is not None:
                rec[k] = replacement_val
    return rec
print("-- dyF --")
dyF.toDF().show()
mapped_dyF = Map.apply(frame = dyF, f = map_values)
print("-- mapped_dyF --")
mapped_dyF.toDF().show()
-- dyF --
+---+-------------+-------------+
| id| asd| dsa|
+---+-------------+-------------+
| 1| retain|old_val_dsa_1|
| 2|old_val_asd_1|old_val_dsa_2|
| 3|old_val_asd_2| retain|
| 4| null| |
+---+-------------+-------------+
-- mapped_dyF --
+-------------+-------------+---+
| asd| dsa| id|
+-------------+-------------+---+
| retain|new_val_dsa_1| 1|
|new_val_asd_1|new_val_dsa_2| 2|
|new_val_asd_2| retain| 3|
| null| | 4|
+-------------+-------------+---+
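If you want to keep the class-based structure from the question, one way to sidestep the pickling problem is to build the mapping function as a closure over a plain dict, so that self never ends up in the serialized closure. A minimal sketch under that assumption (make_value_mapper is a hypothetical helper, reusing the config and dyF defined above):
def make_value_mapper(value_mapping):
    # value_mapping is a plain dict (e.g. config['value_mapping'] above),
    # so the returned function only closes over picklable data.
    def map_values(rec):
        for column, replacements in value_mapping.items():
            if rec[column] is not None:
                new_value = replacements.get(rec[column])
                if new_value is not None:
                    rec[column] = new_value
        return rec
    return map_values

mapped_dyF = Map.apply(frame=dyF, f=make_value_mapper(config['value_mapping']))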

PySpark UDF on withColumn to replace column

This UDF is written to replace a column's value with a variable. Python 2.7; Spark 2.2.0
import pyspark.sql.functions as func

def updateCol(col, st):
    return func.expr(col).replace(func.expr(col), func.expr(st))

updateColUDF = func.udf(updateCol, StringType())
Variables L_1 to L_3 hold the updated values for each row.
This is how I am calling it:
updatedDF = orig_df.withColumn("L1", updateColUDF("L1", func.format_string(L_1))) \
    .withColumn("L2", updateColUDF("L2", func.format_string(L_2))) \
    .withColumn("L3", updateColUDF("L3", func.format_string(L_3))) \
    .withColumn("NAME", func.format_string(name)) \
    .withColumn("AGE", func.format_string(age)) \
    .select("id", "ts", "L1", "L2", "L3", "NAME", "AGE")
The error is:
return Column(sc._jvm.functions.expr(str))
AttributeError: 'NoneType' object has no attribute '_jvm'
I tried creating a sample dataframe and then making use of the lit function in PySpark.
It seems to work fine; this is using a Databricks notebook.
The error is because you are using pyspark functions inside a udf. It would also be very helpful to know the content of your L1, L2, ... variables.
However, if I understand what you want to do correctly, you don't need a udf. I am assuming L1, L2, etc. are constants, right? If not, let me know and I'll adjust the code accordingly. Here's an example:
from pyspark import SparkConf
from pyspark.sql import SparkSession, functions as F

conf = SparkConf()
spark_session = SparkSession.builder \
    .config(conf=conf) \
    .appName('test') \
    .getOrCreate()

data = [{'L1': "test", 'L2': "data"}, {'L1': "other test", 'L2': "other data"}]
df = spark_session.createDataFrame(data)
df.show()
# +----------+----------+
# | L1| L2|
# +----------+----------+
# | test| data|
# |other test|other data|
# +----------+----------+
L1 = 'some other data'
updatedDF = df.withColumn(
    "L1",
    F.lit(L1)
)
updatedDF.show()
# +---------------+----------+
# | L1| L2|
# +---------------+----------+
# |some other data| data|
# |some other data|other data|
# +---------------+----------+
# or if you need to replace the value in a more complex way
pattern = r'\w+'
updatedDF = updatedDF.withColumn(
    "L1",
    F.regexp_replace(F.col("L1"), pattern, "testing replace")
)
updatedDF.show()
# +--------------------+----------+
# | L1| L2|
# +--------------------+----------+
# |testing replace t...| data|
# |testing replace t...|other data|
# +--------------------+----------+
# or even something more complicated:
# set the L2 column to the value of the L1 variable when L2 equals 'data'; otherwise leave L2 as it is
updatedDF = df.withColumn(
    "L2",
    F.when(F.col('L2') == 'data', L1).otherwise(F.col('L2'))
)
updatedDF.show()
# +----------+---------------+
# | L1| L2|
# +----------+---------------+
# | test|some other data|
# |other test| other data|
# +----------+---------------+
So your example would be:
DF = orig_df.withColumn("L1", func.lit(L_1))
...
Also, please make sure you have an active spark session before this point.
I hope this helps.
Edit: If L1, L2, etc. are lists, then one option is to create a dataframe with them and join it to the initial df. Unfortunately we'll need indexes for the join, and since your dataframe is quite big, I don't think this is a very performant solution. We could also use broadcasts with a udf, or broadcasts with a join (see the broadcast sketch after the join example below).
Here's a (suboptimal, I think) example of how to do the join:
L1 = ['row 1 L1', 'row 2 L1']
L2 = ['row 1 L2', 'row 2 L2']
# create a df with indexes
to_update_df = spark_session.createDataFrame([{"row_index": i, "L1": row[0], "L2": row[1]} for i, row in enumerate(zip(L1, L2))])
# add indexes to the initial df
indexed_df = updatedDF.rdd.zipWithIndex().toDF()
indexed_df.show()
# +--------------------+---+
# |                  _1| _2|
# +--------------------+---+
# |[test, some other...|  0|
# |[other test, othe...|  1|
# +--------------------+---+
# bring the df back to its initial form
indexed_df = indexed_df.withColumn('row_number', F.col("_2")) \
    .withColumn('L1', F.col("_1").getItem('L1')) \
    .withColumn('L2', F.col("_1").getItem('L2')) \
    .select('row_number', 'L1', 'L2')
indexed_df.show()
# +----------+----------+---------------+
# |row_number| L1| L2|
# +----------+----------+---------------+
# | 0| test|some other data|
# | 1|other test| other data|
# +----------+----------+---------------+
# join with your results and keep the updated columns
final_df = indexed_df.alias('initial_data').join(to_update_df.alias('other_data'), F.col('row_index')==F.col('row_number'), how='left')
final_df = final_df.select('initial_data.row_number', 'other_data.L1', 'other_data.L2')
final_df.show()
# +----------+--------+--------+
# |row_number| L1| L2|
# +----------+--------+--------+
# | 0|row 1 L1|row 1 L2|
# | 1|row 2 L1|row 2 L2|
# +----------+--------+--------+
This ^ can definitely be improved in terms of performance.
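And here is a minimal sketch of the broadcasts-and-udf option mentioned above, assuming L1 and L2 are lists indexed by row position and reusing the indexed_df with its row_number column built above (bc_L1, bc_L2 and the lookup udfs are names introduced here purely for illustration):
from pyspark.sql.types import StringType

bc_L1 = spark_session.sparkContext.broadcast(L1)
bc_L2 = spark_session.sparkContext.broadcast(L2)

# look up the replacement value for a row by its index in the broadcast lists
lookup_L1 = F.udf(lambda i: bc_L1.value[i], StringType())
lookup_L2 = F.udf(lambda i: bc_L2.value[i], StringType())

final_df = indexed_df.withColumn("L1", lookup_L1(F.col("row_number"))) \
    .withColumn("L2", lookup_L2(F.col("row_number")))
final_df.show()
# +----------+--------+--------+
# |row_number|      L1|      L2|
# +----------+--------+--------+
# |         0|row 1 L1|row 1 L2|
# |         1|row 2 L1|row 2 L2|
# +----------+--------+--------+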

How to apply regexp_replace with unicode characters in Spark Hive

I'm trying to count the number of occurrences of emoticons in a string column of a Spark dataframe.
I use SQLTransformer.
My statement:
select LENGTH(regexp_replace(text, '[^\\uD83C-\\uDBFF\\uDC00-\\uDFFF]+', '')) as count_emoji from __THIS__
But this statement doesn't work.
What am I doing wrong?
It looks like your SQL statement is working. Please find the code below.
import org.apache.spark.sql.SparkSession

object SparkHiveExample extends App {

  val spark = SparkSession
    .builder()
    .master("local")
    .appName("Spark Hive Example")
    .getOrCreate()

  import spark.implicits._

  // Prepare test data
  val df = Seq("hello, how are you?\uD83D\uDE0A\uD83D\uDE0A\uD83D\uDE0A")
    .toDF("text")
  df.show(false)

  // +-------------------------+
  // |text                     |
  // +-------------------------+
  // |hello, how are you?😊😊😊|
  // +-------------------------+

  df.createOrReplaceTempView("__THIS__")
  val finalDf = spark.sql("select LENGTH(regexp_replace(text,'[^\\\\uD83C-\\\\uDBFF\\\\uDC00-\\\\uDFFF]+', '')) as count_emoji from __THIS__")
  finalDf.show(false)

  // +-----------+
  // |count_emoji|
  // +-----------+
  // |3          |
  // +-----------+
}
If you want to read data from a Hive table, instantiate the SparkSession with Hive support enabled. Hive is configured by placing your hive-site.xml, core-site.xml (for security configuration), and hdfs-site.xml (for HDFS configuration) files in conf/.
import java.io.File

// warehouseLocation points to the default location for managed databases and tables
val warehouseLocation = new File("spark-warehouse").getAbsolutePath

val spark = SparkSession
  .builder()
  .appName("Spark Hive Example")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport()
  .getOrCreate()
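Since the question runs the statement through SQLTransformer, here is a minimal PySpark sketch of the same statement (assuming a dataframe df with a text column like the sample above; sql_trans is just an illustrative name):
from pyspark.ml.feature import SQLTransformer

sql_trans = SQLTransformer(statement=(
    "select LENGTH(regexp_replace(text, "
    "'[^\\\\uD83C-\\\\uDBFF\\\\uDC00-\\\\uDFFF]+', '')) as count_emoji "
    "from __THIS__"))
sql_trans.transform(df).show(truncate=False)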

Using split function in PySpark

I am trying to search for a particular line in a very big log file, and I am able to find the line.
Now I want to split that line on whitespace and create a dataframe from it, but I am unable to do that. I have tried the code below without success.
from pyspark import SparkConf, SparkContext
from pyspark import SQLContext
from pyspark.sql.types import *
from pyspark.sql import *

conf = SparkConf().setMaster("local").setAppName("invparsing")
sc = SparkContext(conf=conf)
sql = SQLContext(sc)

def f(x): print(x)

data_frame_schema = StructType([
    StructField("Typeof", StringType()),
    # StructField("Produt_mod", StringType()),
    # StructField("Col2", StringType()),
    # StructField("Col3", StringType()),
    # StructField("Col4", StringType()),
    # StructField("Col5", StringType()),
])

path = "C:/rk/IBMS/inv.log"
lines = sc.textFile(path)
NodeStr = lines.filter(lambda x: 'Node :RBS6301' in x).map(lambda x: x.split(" +"))
NodeStr.foreach(f)
Nodedf = sql.createDataFrame(NodeStr, data_frame_schema)
Nodedf.show(truncate=False)
Now I am getting only one single string as output. I want to split the value on whitespace.
[u'Node: RBS6301 XP10521/26 R30F L17A.4-6 (C17.0_LSV_PS4)']
+-------------------------------------------------------------+
|Typesof |
+-------------------------------------------------------------+
|Node: RBS6301 XP10521/26 R30F L17A.4-6 (C17.0_LSV_PS4)
+-------------------------------------------------------------+
Expected output:
Typeof Produt_mod Col2 Col3 Col4 COL5
Node RBS6301 XP10521/26 R30F L17A.4-6 C17.0_LSV_PS4
The first mistake you made is here:
lambda x: x.split(" +")
str.split takes a constant string, not a regular expression. To split on whitespace you should just omit the separator:
lines = sc.parallelize(["Node: RBS6301 XP10521/26 R30F L17A.4-6 (C17.0_LSV_PS4)"])
lines.map(lambda s: s.split()).first()
# ['Node:', 'RBS6301', 'XP10521/26', 'R30F', 'L17A.4-6', '(C17.0_LSV_PS4)']
Once you've done that you can just filter and convert to a DataFrame:
df = lines.map(lambda s: s.split()).filter(lambda x: len(x) == 6).toDF(
    ["col1", "col2", "col3", "col4", "col5", "col6"]
)
df.show()
# +-----+-------+----------+----+--------+---------------+
# | col1| col2| col3|col4| col5| col6|
# +-----+-------+----------+----+--------+---------------+
# |Node:|RBS6301|XP10521/26|R30F|L17A.4-6|(C17.0_LSV_PS4)|
# +-----+-------+----------+----+--------+---------------+
and filter:
df[df["col2"] == "RBS6301"].show()
# +-----+-------+----------+----+--------+---------------+
# | col1| col2| col3|col4| col5| col6|
# +-----+-------+----------+----+--------+---------------+
# |Node:|RBS6301|XP10521/26|R30F|L17A.4-6|(C17.0_LSV_PS4)|
# +-----+-------+----------+----+--------+---------------+
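To get closer to the expected output from the question, a small sketch (assuming the six-token structure above) renames the columns to the ones from the question's schema and strips the trailing ":" and the parentheses:
from pyspark.sql import functions as F

expected_cols = ["Typeof", "Produt_mod", "Col2", "Col3", "Col4", "Col5"]
df2 = df.toDF(*expected_cols) \
    .withColumn("Typeof", F.regexp_replace("Typeof", ":$", "")) \
    .withColumn("Col5", F.regexp_replace("Col5", r"[()]", ""))
df2.show()
# Typeof now contains "Node" and Col5 contains "C17.0_LSV_PS4"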

Retrieve multiple tiers of data structure

Suppose a text like this:
In [1]: import re
In [2]: with open('text.md', 'r') as f:
   ...:     cont = f.read()
In [3]: cont
Out[3]: '- ## First steps[¶](https://docs.djangoproject.com/en/2.0/#first-steps)\n\n Are you new to Django or to programming? This is the place to start!\n\n - **From scratch:** [Overview](https://docs.djangoproject.com/en/2.0/intro/overview/) | [Installation](https://docs.djangoproject.com/en/2.0/intro/install/)\n - **Tutorial:** [Part 1: Requests and responses](https://docs.djangoproject.com/en/2.0/intro/tutorial01/) | [Part 2: Models and the admin site](https://docs.djangoproject.com/en/2.0/intro/tutorial02/) | [Part 3: Views and templates](https://docs.djangoproject.com/en/2.0/intro/tutorial03/) | [Part 4: Forms and generic views](https://docs.djangoproject.com/en/2.0/intro/tutorial04/) | [Part 5: Testing](https://docs.djangoproject.com/en/2.0/intro/tutorial05/) | [Part 6: Static files](https://docs.djangoproject.com/en/2.0/intro/tutorial06/) | [Part 7: Customizing the admin site](https://docs.djangoproject.com/en/2.0/intro/tutorial07/)\n - **Advanced Tutorials:** [How to write reusable apps](https://docs.djangoproject.com/en/2.0/intro/reusable-apps/) | [Writing your first patch for Django](https://docs.djangoproject.com/en/2.0/intro/contributing/)\n\n ## The model layer[¶](https://docs.djangoproject.com/en/2.0/#the-model-layer)\n\n Django provides an abstraction layer (the "models") for structuring and manipulating the data of your Web application. Learn more about it below:\n\n - **Models:** [Introduction to models](https://docs.djangoproject.com/en/2.0/topics/db/models/) | [Field types](https://docs.djangoproject.com/en/2.0/ref/models/fields/) | [Indexes](https://docs.djangoproject.com/en/2.0/ref/models/indexes/) | [Meta options](https://docs.djangoproject.com/en/2.0/ref/models/options/) | [Model class](https://docs.djangoproject.com/en/2.0/ref/models/class/)\n - **QuerySets:** [Making queries](https://docs.djangoproject.com/en/2.0/topics/db/queries/) | [QuerySet method reference](https://docs.djangoproject.com/en/2.0/ref/models/querysets/) | [Lookup expressions](https://docs.djangoproject.com/en/2.0/ref/models/lookups/)\n - **Model instances:** [Instance methods](https://docs.djangoproject.com/en/2.0/ref/models/instances/) | [Accessing related objects](https://docs.djangoproject.com/en/2.0/ref/models/relations/)\n - **Migrations:** [Introduction to Migrations](https://docs.djangoproject.com/en/2.0/topics/migrations/) | [Operations reference](https://docs.djangoproject.com/en/2.0/ref/migration-operations/) | [SchemaEditor](https://docs.djangoproject.com/en/2.0/ref/schema-editor/) | [Writing migrations](https://docs.djangoproject.com/en/2.0/howto/writing-migrations/)\n - **Advanced:** [Managers](https://docs.djangoproject.com/en/2.0/topics/db/managers/) | [Raw SQL](https://docs.djangoproject.com/en/2.0/topics/db/sql/) | [Transactions](https://docs.djangoproject.com/en/2.0/topics/db/transactions/) | [Aggregation](https://docs.djangoproject.com/en/2.0/topics/db/aggregation/) | [Search](https://docs.djangoproject.com/en/2.0/topics/db/search/) | [Custom fields](https://docs.djangoproject.com/en/2.0/howto/custom-model-fields/) | [Multiple databases](https://docs.djangoproject.com/en/2.0/topics/db/multi-db/) | [Custom lookups](https://docs.djangoproject.com/en/2.0/howto/custom-lookups/) |[Query Expressions](https://docs.djangoproject.com/en/2.0/ref/models/expressions/) | [Conditional Expressions](https://docs.djangoproject.com/en/2.0/ref/models/conditional-expressions/) | [Database Functions](https://docs.djangoproject.com/en/2.0/ref/models/database-functions/)\n - **Other:** 
[Supported databases](https://docs.djangoproject.com/en/2.0/ref/databases/) | [Legacy databases](https://docs.djangoproject.com/en/2.0/howto/legacy-databases/) | [Providing initial data](https://docs.djangoproject.com/en/2.0/howto/initial-data/) | [Optimize database access](https://docs.djangoproject.com/en/2.0/topics/db/optimization/) | [PostgreSQL specific features](https://docs.djangoproject.com/en/2.0/ref/contrib/postgres/)'
Its chapters are retrieved by:
In [9]: chapters = re.findall(r'## (.+)\[', cont)
In [10]: chapters
Out[10]: ['First steps', 'The model layer']
Its sections are obtained by:
In [21]: sections = re.findall(r'- \*\*(.+)\*\*',cont)
In [23]: sections
Out[23]:
['From scratch:',
'Tutorial:',
'Advanced Tutorials:',
'Models:',
'QuerySets:',
'Model instances:',
'Migrations:',
'Advanced:',
'Other:']
I'd like to output a data structure like:
['First steps',['From scratch:',
'Tutorial:',
'Advanced Tutorials:'],
'The model layer',['Models:',
'QuerySets:',
'Model instances:',
'Migrations:',
'Advanced:',
'Other:']]
How to accomplish such a task?
Find both chapters and sections simultaneously:
>>> results = re.findall(r'## (.+)\[|- \*\*(.+)\*\*', cont)
Then put them in your desired structure:
>>> structure = []
>>> for c, s in results:
...     if c:
...         structure.extend([c, []])
...     elif s:
...         structure[-1].append(s)
...
This results in:
>>> structure
['First steps', ['From scratch:', 'Tutorial:', 'Advanced Tutorials:'], 'The model layer', ['Models:', 'QuerySets:', 'Model instances:', 'Migrations:', 'Advanced:', 'Other:']]