Extract value from a string occurring after a particular word - regex

A JSON document is passed as a string and I need to extract the numeric value after content_id for further mapping. Sample data below:
{"url": {"phone": "videos/hssportint/hssport/jocaasd/6_3818e20a9e/19098311205/phone", "tv": "/mnt/c81292786e1e368e12144c302007/output/", "sample_aspect_ratio": "1:1", "subsample": 25, "content_id": "1000231205", "encryption_enabled": false, "non_ad_time_intervals": [2330.68, 2898.36]], "packager_path": "/opt/bento4"}}], "vmaf_path": "/vmaf"}
The parameters are dynamic, so I can't extract the value with a substr function or by counting occurrences of a special character.

The JSON in your example is malformed: it contains an extra ] and a trailing fragment after the closing }. For well-formed JSON you can use get_json_object, for example:
select get_json_object(src_json, '$.url.content_id') from
(
  select '{"url": {"phone": "videos/hssportint/hssport/jocaasd/6_3818e20a9e/19098311205/phone", "tv": "/mnt/c81292786e1e368e12144c302007/output/", "sample_aspect_ratio": "1:1", "subsample": 25, "content_id": "1000231205", "encryption_enabled": false, "non_ad_time_intervals": [2330.68, 2898.36], "packager_path": "/opt/bento4"}}' as src_json
) s;
Result:
OK
1000231205
Time taken: 21.606 seconds, Fetched: 1 row(s)

You can use the regexp_extract function in Hive with a matching regex to extract only the digits of content_id.
Example:
select regexp_extract(col1, '"content_id":\\s"(\\d+)"', 1) from (
  select string('{"url": {"phone": "videos/hssportint/hssport/jocaasd/6_3818e20a9e/19098311205/phone", "tv": "/mnt/c81292786e1e368e12144c302007/output/", "sample_aspect_ratio": "1:1", "subsample": 25, "content_id": "1000231205", "encryption_enabled": false, "non_ad_time_intervals": [2330.68, 2898.36]], "packager_path": "/opt/bento4"}}], "vmaf_path": "/vmaf"}') col1
) t;
+-------------+--+
| _c0 |
+-------------+--+
| 1000231205 |
+-------------+--+
Regex description:
"content_id":\\s"(\\d+)" //match literal "content_id": + any space + "digit inside quotes"

I found an expensive way to do it through a combination of regex and substring functions (the outer substr starting at position 1 is effectively a no-op):
substr(split(regexp_extract(message, 'content_id([^&]*)'), '"')[3], 1) as content_id

Related

How to extract date from String column using regex in Spark

I have a dataframe which consists of a file name, email and other details. I need to get the dates out of the file name column.
Ex: File name: Test_04_21_2019_34600.csv
Need to extract the date: 04_21_2019
Dataframe
import spark.implicits._  // needed for .toDF on a local Seq outside spark-shell

val df1 = Seq(
  ("Test_04_21_2018_1200.csv", "abc#gmail.com", 200),
  ("home/server2_04_15_2020_34610.csv", "abc1#gmail.com", 300),
  ("/server1/Test3_01_2_2019_54680.csv", "abc2#gmail.com", 800)
).toDF("file_name", "email", "points")
Expected output:
date       | email          | points
04_21_2018 | abc#gmail.com  | 200
04_15_2020 | abc1#gmail.com | 300
01_2_2019  | abc2#gmail.com | 800
Can we use regex on a Spark dataframe to achieve this, or is there another way? Any help will be appreciated.
You can use the regexp_extract function to extract the date as below:
import org.apache.spark.sql.functions.regexp_extract

val resultDF = df1.withColumn("date",
  regexp_extract($"file_name", "\\d{1,2}_\\d{1,2}_\\d{4}", 0)
)
Output:
+--------------------+--------------+------+----------+
| file_name| email|points| date|
+--------------------+--------------+------+----------+
|Test_04_21_2018_1...| abc#gmail.com| 200|04_21_2018|
|home/server2_04_1...|abc1#gmail.com| 300|04_15_2020|
|/server1/Test3_01...|abc2#gmail.com| 800| 01_2_2019|
+--------------------+--------------+------+----------+
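To match the expected output shape (date, email, points, without file_name), a final select does it; a small sketch reusing the names above:
// keep only the requested columns; show(false) just avoids truncating values
resultDF.select($"date", $"email", $"points").show(false)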

django rawsql postgres nested json/jsonb query

I have a jsonb column on Postgres named data where each row (there are around 3 million of them) looks like this:
[
{
"number": 100,
"key": "this-is-your-key",
"listr": "20 Purple block, THE-CITY, Columbia",
"realcode": "LA40",
"ainfo": {
"city": "THE-CITY",
"county": "Columbia",
"street": "20 Purple block",
"var_1": ""
},
"booleanval": true,
"min_address": "20 Purple block, THE-CITY, Columbia LA40"
},
.....
]
I would like to query the min_address field in the fastest possible way. In Django I tried to use:
APModel.objects.filter(data__0__min_address__icontains=search_term)
but this takes ages to complete (also, "THE-CITY" is in uppercase, so I have to use icontains here). I tried dropping to raw SQL like so:
cursor.execute(
    """
    SELECT * FROM "apmodel_ap_model"
    WHERE ("apmodel_ap_model"."data"
           #>> array['0', 'min_address'])
          #> %s
    """,
    [json.dumps([{'min_address': search_term}])]
)
but this throws strange errors like:
LINE 4: #> '[{"min_address": "some lane"}]'
^
HINT: No operator matches the given name and argument type(s). You might need to add explicit type casts.
I am wondering what the fastest way is to query the min_address field using raw SQL cursors.
Late answer; it probably won't help the OP anymore. Also, I'm not at all an expert in Postgres/JSONB, so this might be a terrible idea.
Given this setup:
so49263641=# \d apmodel_ap_model;
Table "public.apmodel_ap_model"
Column | Type | Collation | Nullable | Default
--------+-------+-----------+----------+---------
data | jsonb | | |
so49263641=# select * from apmodel_ap_model ;
data
-------------------------------------------------------------------------------------------
[{"number": 1, "min_address": "Columbia"}, {"number": 2, "min_address": "colorado"}]
[{"number": 3, "min_address": " columbia "}, {"number": 4, "min_address": "California"}]
(2 rows)
The following query "expands" objects from data arrays to individual rows. Then it applies pattern matching to the min_address field.
so49263641=# SELECT element->'number' as number, element->'min_address' as min_address
FROM apmodel_ap_model ap, JSONB_ARRAY_ELEMENTS(ap.data) element
WHERE element->>'min_address' ILIKE '%col%';
number | min_address
--------+---------------
1 | "Columbia"
2 | "colorado"
3 | " columbia "
(3 rows)
However, I doubt it will perform well on large datasets, as the min_address values are cast to text before pattern matching.
Edit: Some great advice here on indexing JSONB data for search https://stackoverflow.com/a/33028467/1284043
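Building on that link, a sketch of one indexing option, assuming the pg_trgm extension is available and that searching only the first array element is acceptable (which is what the original data__0__min_address filter did):
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- expression index over the first element's min_address, cast to text by #>>
CREATE INDEX apmodel_min_address_trgm
ON apmodel_ap_model
USING gin ((data #>> '{0,min_address}') gin_trgm_ops);

-- queries of this shape can then use the trigram index for ILIKE
SELECT * FROM apmodel_ap_model
WHERE data #>> '{0,min_address}' ILIKE '%col%';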

SQLite extract string from text in column

I have a Spatialite Database and I've imported OSM Data into this database.
With the following query I get all motorways:
SELECT * FROM lines
WHERE other_tags GLOB '*A [0-9]*'
AND highway='motorway'
I use GLOB '*A [0-9]*' here, because in Germany every Autobahn begins with A, followed by a number (like A 73).
There is a column called other_tags with information about the motorway part:
"bdouble"=>"yes","hazmat"=>"designated","lanes"=>"2","maxspeed"=>"none","oneway"=>"yes","ref"=>"A 73","width"=>"7"
If you look closely, there is the part "ref"=>"A 73".
I want to extract the A 73 as the name for the motorway.
How can I do this in sqlite?
If the format doesn't change, meaning you can expect the other_tags field to look like %"ref"=>"A 73","width"=>"7"%, then you can use instr and substr (note that 8 is the length of "ref"=>"):
SELECT substr(other_tags,
              instr(other_tags, '"ref"=>"') + 8,
              instr(other_tags, '","width"') - 8 - instr(other_tags, '"ref"=>"')) AS name
FROM lines
WHERE other_tags GLOB '*A [0-9]*'
  AND highway = 'motorway'
The result will be
name
A 73
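If "ref" is not always followed by "width", a variant that only assumes the value ends at the next double quote may be safer. A sketch (the rest alias is just a helper name):
SELECT substr(rest, 1, instr(rest, '"') - 1) AS name
FROM (
  -- everything after "ref"=>"
  SELECT substr(other_tags, instr(other_tags, '"ref"=>"') + 8) AS rest
  FROM lines
  WHERE other_tags GLOB '*A [0-9]*'
    AND highway = 'motorway'
) AS t;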
Check with the following conditions:
other_tags LIKE 'A%' -- begins with 'A'
abs(substr(other_tags, 3, 2)) <> 0.0 -- characters 3-4 are numeric
length(other_tags) = 4 -- the length of other_tags is 4
So here is how your query should be:
SELECT *
FROM lines
WHERE other_tags LIKE 'A%'
AND abs(substr(other_tags, 3,2)) <> 0.0
AND length(other_tags) = 4
AND highway = 'motorway'

R - grab the exact 8-digit number in a string and transform it

I have 2 problems in extracting and transforming data using R.
Here's the dataset:
messageID | msg
1111111111 | hey id 18271801, fix it asap
2222222222 | please fix it soon id12901991 and 91222911. dissapointed
3333333333 | wow $300 expensive man, come on
4444444444 | number 2837169119 test
The first problem:
I want to grab only numbers that are exactly 8 digits long. In the dataset above, message 3333... (300 has 3 digits) and 4444... (2837169119 has 10 digits) should not be included. Here's my best shot so far:
as.matrix(unlist(apply(df[2], 1, function(x) { regmatches(x, gregexpr('([0-9]){8}', x)) })))
However, with this line of code, message 444... is included because it contains a number longer than 8 digits.
The second problem: transform the data to another form like this:
message_id | customer_ID
1111111111 | 18271801
2222222222 | 12901991
2222222222 | 91222911
I don't know how to efficiently transform the data.
The output of dput(df):
structure(list(id = c(1111111111, 2222222222, 3333333333, 4444444444
), msg = c("hey id 18271801, fix it asap", "please fix it soon id12901991 and 91222911. dissapointed",
"wow $300 expensive man, come on", "number 2837169119 test")), .Names = c("id",
"msg"), row.names = c(NA, 4L), class = "data.frame")
Thanks
Use rebus to create your regular expression, and stringr to extract the matches.
You may need to play with the exact form of the regular expression. This code works on your examples, but you'll probably need to adapt it for your dataset.
library(rebus)
library(stringr)
# Create regex
rx <- negative_lookbehind(DGT) %R%
  dgt(8) %R%
  negative_lookahead(DGT)
rx
## <regex> (?<!\d)[\d]{8}(?!\d)
# Extract the IDs
extracted_ids <- str_extract_all(df$msg, perl(rx))
# Stuff the IDs into a data frame.
data.frame(
  messageID = rep(
    df$id,
    vapply(extracted_ids, length, integer(1))
  ),
  extractedID = unlist(extracted_ids, use.names = FALSE)
)
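If you'd rather not depend on rebus, the same lookarounds can be written directly. A sketch using stringr alone (stringi's ICU engine supports lookbehind and lookahead):
library(stringr)
# 8 digits not preceded or followed by another digit
ids <- str_extract_all(df$msg, "(?<!\\d)\\d{8}(?!\\d)")
# one row per extracted ID, repeating the message id as needed
data.frame(
  messageID = rep(df$id, lengths(ids)),
  customer_ID = unlist(ids, use.names = FALSE)
)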

REGEXP_SUBSTR is taking a long time to execute in Oracle

I am trying to split a comma-separated email string into individual email IDs, still comma-separated but with each email ID enclosed in single quotes.
My input is: 'one#gmail.com,two#gamil.com,three#gmail.com,four#gmail.com'
My output should be: 'one#gmail.com','two#gamil.com','three#gmail.com','four#gmail.com'
I am going to use the output string above in an Oracle query WHERE condition like...
Where EmailId's in ( 'one#gmail.com','two#gamil.com','three#gmail.com','four#gmail.com');
I am using the following code to achieve this:
WHERE EMAIL IN
(REGEXP_SUBSTR('one#gmail.com,two#gamil.com,three#gmail.com,four#gmail.com' ,'[^,]+', 1, LEVEL))
CONNECT BY LEVEL <= LENGTH('one#gmail.com,two#gamil.com,three#gmail.com,four#gmail.com' ) - LENGTH(REPLACE('one#gmail.com,two#gamil.com,three#gmail.com,four#gmail.com' , ',', '')) +1;
But the above query takes 60 seconds to return only 16 records. Can anyone suggest a better approach for this?
Try this; generating the tokens from dual in a subquery keeps the CONNECT BY away from the main table and avoids the row multiplication that made the original query slow:
WHERE email IN (
  SELECT regexp_substr('one#gmail.com,two#gamil.com,three#gmail.com,four#gmail.com', '[^,]+', 1, level)
  FROM dual
  CONNECT BY regexp_substr('one#gmail.com,two#gamil.com,three#gmail.com,four#gmail.com', '[^,]+', 1, level) IS NOT NULL
);
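A variant with a bind variable avoids repeating the long literal four times; a sketch, assuming Oracle 11g+ for regexp_count (the :email_list bind name is hypothetical):
WHERE email IN (
  SELECT regexp_substr(:email_list, '[^,]+', 1, level)
  FROM dual
  CONNECT BY level <= regexp_count(:email_list, '[^,]+')
);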