Using regex function on date in PySpark

I need to validate dates (in string format) in a PySpark DataFrame, and I need to remove additional characters or notations from the date if they are present. How can I validate like that?
I came across this code
regex_string='\/](19|[2-9][0-9])\d\d$)|(^29[\/]02[\/](19|[2-9][0-9])(00|04|08|12|16|20|24|28|32|36|40|44|48|52|56|60|64|68|72|76|80|84|88|92|96)$)'
df.select(regexp_extract(col("date"),regex_string,0).alias("cleaned_map"),col('date')).show()
Below is my output
+-----------+-----------+
|cleaned_map| date|
+-----------+-----------+
| |01/06/w2020|
| |02/06/2!020|
| 02/06/2020| 02/06/2020|
| 03/06/2020| 03/06/2020|
| 04/06/2020| 04/06/2020|
| 05/06/2020| 05/06/2020|
| 02/06/2020| 02/06/2020|
+-----------+-----------+
My expected output
+-----------+-----------+
|cleaned_map| date|
+-----------+-----------+
| 01/06/2020|01/06/w2020|
| 02/06/2020|02/06/20!20|
| 03/06/2020| 03/06/2020|
| 04/06/2020| 04/06/2020|
| 05/06/2020| 05/06/2020|
| 06/06/2020| 06/06/2020|
| 07/06/2020| 07/06/2020|
+-----------+-----------+

Try this-
val df = Seq("01/06/w2020",
"02/06/2!020",
"02/06/2020",
"03/06/2020",
"04/06/2020",
"05/06/2020",
"02/06/2020",
"//01/0/4/202/0").toDF("date")
df.withColumn("cleaned_map", regexp_replace($"date", "[^0-9T]", ""))
.withColumn("date_type", to_date($"cleaned_map", "ddMMyyyy"))
.show(false)
/**
* +--------------+-----------+----------+
* |date |cleaned_map|date_type |
* +--------------+-----------+----------+
* |01/06/w2020 |01062020 |2020-06-01|
* |02/06/2!020 |02062020 |2020-06-02|
* |02/06/2020 |02062020 |2020-06-02|
* |03/06/2020 |03062020 |2020-06-03|
* |04/06/2020 |04062020 |2020-06-04|
* |05/06/2020 |05062020 |2020-06-05|
* |02/06/2020 |02062020 |2020-06-02|
* |//01/0/4/202/0|01042020 |2020-04-01|
* +--------------+-----------+----------+
*/
Enrich this pattern, e.g. "[^0-9/T]", if you want to exclude additional characters from being removed.
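Since the question is tagged PySpark and the snippet above is Scala, here is a minimal PySpark sketch of the same idea (my own rough translation, reusing the question's date column and the ddMMyyyy pattern):
from pyspark.sql import functions as F

# strip everything that is not a digit, then parse the remaining ddMMyyyy string
df.withColumn("cleaned_map", F.regexp_replace("date", "[^0-9]", "")) \
  .withColumn("date_type", F.to_date("cleaned_map", "ddMMyyyy")) \
  .show(truncate=False)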

Try regexp_replace to remove additional character notations.
df.show()
# +-----------+
# | date|
# +-----------+
# |01/06/w2020|
# |02/06/2!020|
# | 02/06/2020|
# +-----------+
df.withColumn("cleaned_map", F.regexp_replace("date", r'[^\d\/]','')).show()
# +-----------+-----------+
# | date|cleaned_map|
# +-----------+-----------+
# |01/06/w2020| 01/06/2020|
# |02/06/2!020| 02/06/2020|
# | 02/06/2020| 02/06/2020|
# +-----------+-----------+
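If you also need to validate the cleaned strings, one option (a sketch of mine, relying on Spark's default non-ANSI behaviour where to_date returns null for strings that do not match the format) is to parse them and drop or flag the rows that come back null:
from pyspark.sql import functions as F

df2 = df.withColumn("cleaned_map", F.regexp_replace("date", r'[^\d\/]', ''))
# values that do not parse as dd/MM/yyyy become null, which marks them as invalid
df2 = df2.withColumn("parsed", F.to_date("cleaned_map", "dd/MM/yyyy"))
df2.filter(F.col("parsed").isNotNull()).show()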

Related

Django: concatenate one column's values into a single annotated column; Subquery returns more than 1 row

Hello, my models are below:
class IP(models.Model):
    subnet = models.ForeignKey(Subnet, verbose_name="SUBNET", on_delete=models.CASCADE, related_name="ip_set")
    ip = models.GenericIPAddressField(verbose_name="IP", protocol="both", unpack_ipv4=True, unique=True)
    asset = models.ManyToManyField(Asset, verbose_name="HOSTNAME", through="AssetIP", related_name="ip_set", blank=True)
    description = models.CharField(verbose_name="DESCRIPTION", max_length=50, default="", null=True, blank=True)

class AssetIP(models.Model):
    TYPE_CHOICES = [
        ("GATEWAY-IP", "GATEWAY-IP"),
        ("MGT-IP", "MGT-IP"),
        ("PRIMARY-IP", "PRIMARY-IP"),
        ("OTHER-IP", "OTHER-IP"),
    ]
    ip_type = models.CharField(verbose_name="IP TYPE", max_length=30, choices=TYPE_CHOICES)
    ip = models.ForeignKey(IP, verbose_name="IP", on_delete=models.CASCADE, related_name="asset_ip_set")
    asset = models.ForeignKey(Asset, verbose_name="HOSTNAME", on_delete=models.CASCADE, related_name="asset_ip_set")

class Asset(models.Model):
    barcode = models.CharField(verbose_name="Barcode", max_length=60, blank=True, null=True, unique=True)
    hostname = models.CharField(verbose_name="Hostname", max_length=30)
The data in these models is below.
IP Model
| IP | Asset | Description |
|:---- |:------:| -----:|
| 10.10.10.2 | A_HOST,B_HOST,C_HOST | - |
| 10.10.10.3 | A_HOST,B_HOST | - |
| 10.10.10.4 | A_HOST | - |
| 10.10.10.5 | A_HOST | - |
AssetIP through Model
| IP | Asset | IP_TYPE |
|:---- |:------:| -----:|
| 10.10.10.2 | A_HOST | OTHER-IP |
| 10.10.10.2 | B_HOST | OTHER-IP |
| 10.10.10.2 | C_HOST | OTHER-IP |
| 10.10.10.3 | A_HOST | OTHER-IP |
| 10.10.10.4 | A_HOST | OTHER-IP |
| 10.10.10.5 | A_HOST | PRIMARY-IP |
So an Asset query currently returns this:
Result = Asset.objects.all()
In this result the fields are:
Asset = {
    barcode: "ddd",
    hostname: "A_HOST",
}
I want the fields and the result to be:
Asset = {
    barcode: "ddd",
    hostname: "A_HOST",
    primary_ip: "10.10.10.5",
    other_ip: "10.10.10.2, 10.10.10.3, 10.10.10.4"
}
I tried this query, but the queryset is not filtering on "OTHER-IP":
assets = Asset.objects.annotate(other_ips=GroupConcat('asset_ip_set__ip__ip'))
assets[0].other_ips
result : '10.10.10.2,10.10.10.3,10.10.10.4,10.10.10.5'
and I also tried this queryset:
filtered_ips = AssetIP.objects.filter(asset=OuterRef('pk'), ip_type="OTHER-IP").values_list('ip__ip', flat=True)
Asset.objects.filter(asset_ip_set__ip_type="OTHER-IP").annotate(
    other_ips=GroupConcat(
        Subquery(filtered_ips),
        delimiter=', '
    )
)
result : django.db.utils.OperationalError: (1242, 'Subquery returns more than 1 row')
Help me....
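One possible direction, sketched under the assumption that GroupConcat is implemented as a Django Aggregate subclass (and therefore accepts the filter argument available since Django 2.0), is to aggregate conditionally instead of using a Subquery:
from django.db.models import Q

assets = Asset.objects.annotate(
    primary_ip=GroupConcat('asset_ip_set__ip__ip',
                           filter=Q(asset_ip_set__ip_type="PRIMARY-IP")),
    other_ip=GroupConcat('asset_ip_set__ip__ip',
                         filter=Q(asset_ip_set__ip_type="OTHER-IP")),
)
# assets[0].other_ip -> "10.10.10.2,10.10.10.3,10.10.10.4" (the delimiter depends on GroupConcat)
This is only a sketch; if GroupConcat does not accept filter, the same effect can be approximated by wrapping the value in Case/When before aggregating.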

Django annotate StrIndex for empty fields

I am trying to use Django's StrIndex to find all rows whose value is a substring of a given string.
Eg:
my table contains:
+----------+------------------+
| user | domain |
+----------+------------------+
| spam1 | spam.com |
| badguy+ | |
| | protonmail.com |
| spammer | |
| | spamdomain.co.uk |
+----------+------------------+
but the query
SpamWord.objects.annotate(idx=StrIndex(models.Value('xxxx'), 'user')).filter(models.Q(idx__gt=0) | models.Q(domain='spamdomain.co.uk')).first()
matches <SpamWord: *#protonmail.com>
The generated query is SELECT `spamwords`.`id`, `spamwords`.`user`, `spamwords`.`domain`, INSTR('xxxx', `spamwords`.`user`) AS `idx` FROM `spamwords` WHERE (INSTR('xxxx', `spamwords`.`user`) > 0 OR `spamwords`.`domain` = 'spamdomain.co.uk')
It should be <SpamWord: *#spamdomain.co.uk>
This is happening because
INSTR('xxxx', '') => 1
(and also INSTR('xxxxasd', 'xxxx') => 1, which is correct)
How can I write this query in order to get entry #5 (spamdomain.co.uk)?
The order of the parameters of StrIndex [Django-doc] is swapped. The first parameter is the haystack, the string in which you search, and the second one is the needle, the substring you are looking for.
You thus can annotate with:
from django.db.models import Q, Value
from django.db.models.functions import StrIndex

SpamWord.objects.annotate(
    idx=StrIndex('user', Value('xxxx'))
).filter(
    Q(idx__gt=0) | Q(domain='spamdomain.co.uk')
).first()
Alternatively, just filter out the rows where user is empty:
(~models.Q(user='') & models.Q(idx__gt=0)) | models.Q(domain='spamdomain.co.uk')
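For completeness, a sketch that plugs that filter into a full queryset (my own combination, keeping the annotation from the question that the filter above was written against):
from django.db.models import Q, Value
from django.db.models.functions import StrIndex

SpamWord.objects.annotate(
    idx=StrIndex(Value('xxxx'), 'user')  # INSTR('xxxx', user)
).filter(
    (~Q(user='') & Q(idx__gt=0)) | Q(domain='spamdomain.co.uk')
).first()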

Using PySpark, the regex between two characters gets text and numbers, but not the date

Using PySpark's regexp_extract() I can extract the substring between two characters in the string. It grabs the text and numbers, but not the dates.
from pyspark.sql.functions import col, regexp_extract

data = [('2345', '<Date>1999/12/12 10:00:05</Date>'),
        ('2398', '<Crew>crewIdXYZ</Crew>'),
        ('2328', '<Latitude>0.8252644369443788</Latitude>'),
        ('3983', '<Longitude>-2.1915840465066916<Longitude>')]
df = sc.parallelize(data).toDF(['ID', 'values'])
df.show(truncate=False)
+----+-----------------------------------------+
|ID |values |
+----+-----------------------------------------+
|2345|<Date>1999/12/12 10:00:05</Date> |
|2398|<Crew>crewIdXYZ</Crew> |
|2328|<Latitude>0.8252644369443788</Latitude> |
|3983|<Longitude>-2.1915840465066916<Longitude>|
+----+-----------------------------------------+
df_2 = df.withColumn('vals', regexp_extract(col('values'), '(.)((?<=>)[^<:]+(?=:?<))', 2))
df_2.show(truncate=False)
+----+-----------------------------------------+-------------------+
|ID |values |vals |
+----+-----------------------------------------+-------------------+
|2345|<Date>1999/12/12 10:00:05</Date> | |
|2398|<Crew>crewIdXYZ</Crew> |crewIdXYZ |
|2328|<Latitude>0.8252644369443788</Latitude> |0.8252644369443788 |
|3983|<Longitude>-2.1915840465066916<Longitude>|-2.1915840465066916|
+----+-----------------------------------------+-------------------+
What can I add to the regex statement to get the date as well?
@jxc Thanks. Here is what made it work:
df_2 = df.withColumn('vals', regexp_extract(col('values'), '(.)((?<=>)[^>]+(?=:?<))', 2))
df_2.show(truncate=False)
+----+-----------------------------------------+-------------------+
|ID |values |vals |
+----+-----------------------------------------+-------------------+
|2345|<Date>1999/12/12 10:00:05</Date> |1999/12/12 10:00:05|
|2398|<Crew>crewIdXYZ</Crew> |crewIdXYZ |
|2328|<Latitude>0.8252644369443788</Latitude> |0.8252644369443788 |
|3983|<Longitude>-2.1915840465066916<Longitude>|-2.1915840465066916|
+----+-----------------------------------------+-------------------+
You may use
>([^<>]+)<
See the regex demo. The regex matches a >, then captures into Group 1 any one or more chars other than < and >, and then matches <. The group index argument (the third argument to regexp_extract) should be set to 1 since the value you need is in Group 1:
df_2 = df.withColumn('vals', regexp_extract(col('values'), '>([^<>]+)<', 1))
df_2.show(truncate=False)
+----+-----------------------------------------+-------------------+
|ID |values |vals |
+----+-----------------------------------------+-------------------+
|2345|<Date>1999/12/12 10:00:05</Date> |1999/12/12 10:00:05|
|2398|<Crew>crewIdXYZ</Crew> |crewIdXYZ |
|2328|<Latitude>0.8252644369443788</Latitude> |0.8252644369443788 |
|3983|<Longitude>-2.1915840465066916<Longitude>|-2.1915840465066916|
+----+-----------------------------------------+-------------------+
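If the extracted value then needs to be a real timestamp rather than a string, a small follow-up sketch (my own addition, relying on Spark's default non-ANSI behaviour where strings that do not match the pattern become null, so only the Date row gets a value):
from pyspark.sql.functions import to_timestamp

# only '1999/12/12 10:00:05' matches; the crew/latitude/longitude rows become null
df_2.withColumn('ts', to_timestamp('vals', 'yyyy/MM/dd HH:mm:ss')).show(truncate=False)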

Mockito verifying method invocation without using equals method

While using Spock i can do something like this:
when:
12.times {mailSender.send("blabla", "subject", "content")}
then:
12 * javaMailSender.send(_)
When I tried to do the same in Mockito:
verify(javaMailSender,times(12)).send(any(SimpleMailMessage.class))
I got an error that SimpleMailMessage has null values, so I had to initialize it in the test:
SimpleMailMessage simpleMailMessage = new SimpleMailMessage()
simpleMailMessage.setTo("blablabla")
simpleMailMessage.subject = "subject"
simpleMailMessage.text = "content"
verify(javaMailSender, times(12)).send(simpleMailMessage)
Now it works, but it's a lot of boilerplate and I really don't care about equality. What if SimpleMailMessage gets many more arguments, or other objects with other arguments? Is there any way to check that the send method was simply called X times?
EDIT: added implementation of send method.
private fun sendEmail(recipient: String, subject: String, content: String) {
    val mailMessage = SimpleMailMessage()
    mailMessage.setTo(recipient)
    mailMessage.subject = subject
    mailMessage.text = content
    javaMailSender.send(mailMessage)
}
There are two senders: mailSender is my custom object, and javaMailSender is from another library.
Stacktrace:
Mockito.verify(javaMailSender,
Mockito.times(2)).send(Mockito.any(SimpleMailMessage.class))
| | | | |
| | | | null
| | | Wanted but not invoked:
| | | javaMailSender.send(
| | | <any org.springframework.mail.SimpleMailMessage>
| | | );
| | | -> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
| | |
| | | However, there were exactly 2 interactions with this mock:
| | | javaMailSender.send(
| | | SimpleMailMessage: from=null; replyTo=null; to=blabla; cc=; bcc=; sentDate=null; subject=subject; text=content
| | | );
| | | -> at MailSenderServiceImpl.sendEmail(MailSenderServiceImpl.kt:42)
| | |
| | | javaMailSender.send(
| | | SimpleMailMessage: from=null; replyTo=null; to=blabla; cc=; bcc=; sentDate=null; subject=subject; text=content
| | | );
If you don't care for the parameter of send, leave any() empty:
verify(javaMailSender,times(12)).send(any())

Building Dojo generates multiple files

I'm using the Dojo build tool to generate a single dojo.js file, but I don't know why I'm getting multiple files.
This is my example profile:
var profile = (function(){
    return {
        basePath: "../../../",
        releaseDir: "./app",
        releaseName: "lib",
        action: "release",
        layerOptimize: "closure",
        optimize: "closure",
        mini: true,
        stripConsole: "warn",
        selectorEngine: "lite",
        defaultConfig: {
            hasCache: {
                "dojo-built": 1,
                "dojo-loader": 1,
                "dom": 1,
                "host-browser": 1,
                "config-selectorEngine": "lite"
            },
            async: 1
        },
        staticHasFeatures: {
            'dojo-trace-api': 0,
            'dojo-log-api': 0,
            'dojo-publish-privates': 0,
            'dojo-sync-loader': 0,
            'dojo-test-sniff': 0
        },
        packages: ['dojo'],
        layers: {
            "dojo/dojo": {
                include: ["dojo/domReady"],
                customBase: true,
                boot: true
            }
        }
    };
})();
This is my .bat:
./util/buildscripts/build profile=cgl-dojo
After executing it, this is the release folder:
app
\---lib
\---dojo
+---cldr
| \---nls
| +---ar
| +---ca
| +---cs
| +---da
| +---de
| +---el
| +---en
| +---en-au
| +---en-ca
| +---en-gb
| +---es
| +---fi
| +---fr
| +---fr-ch
| +---he
| +---hu
| +---it
| +---ja
| +---ko
| +---nb
| +---nl
| +---pl
| +---pt
| +---pt-pt
| +---ro
| +---ru
| +---sk
| +---sl
| +---sv
| +---th
| +---tr
| +---zh
| +---zh-hant
| +---zh-hk
| \---zh-tw
+---data
| +---api
| \---util
+---date
+---dnd
+---errors
+---fx
+---io
+---nls
| +---ar
| +---az
| +---bg
| +---ca
| +---cs
| +---da
| +---de
| +---el
| +---es
| +---fi
| +---fr
| +---he
| +---hr
| +---hu
| +---it
| +---ja
| +---kk
| +---ko
| +---nb
| +---nl
| +---pl
| +---pt
| +---pt-pt
| +---ro
| +---ru
| +---sk
| +---sl
| +---sv
| +---th
| +---tr
| +---uk
| +---zh
| \---zh-tw
+---promise
+---request
+---resources
| \---images
+---router
+---rpc
+---selector
+---store
| +---api
| \---util
+---_base
\---_firebug
I need a release folder with only one file, please help me.
The entire tree of registered packages is always built because the build tool has no way of knowing whether or not you are conditionally requiring other modules within your application. There is no way to make the build system only output one file, and in fact a single file is a bad idea because each locale has its own set of localisation rules. If you want to reduce the number of files after a build, you can just delete all the ones you don’t want.