Data Preperation Identify String using Regex and move to new column

Data Preperation Identify String using Regex and move to new column - regex

Hello I am using Talend to prepare product data for import into DB. I want to use the extract string parts function for Talend.
I have the following data in one cell. (The length of the data varies not a fixed width format)
Measurement: Ring Head Width: 6.8 Ring Height: 5.5 Ring Shank Width: 1.1 Ladies Band Width: 2.5 Ladies band shank Width: 1.2
I need help creating a regex format to match each measurement value and extract it to a new column.
What would be Regex to match the following text ?
Ring Head Width: 6.8
and extract the numeric value following it, which is
6.8
Similarly I want to create regex for all the above measurements. I am assuming the format will be the same.
Thank for your time and help.

If you don't bother using multiple actions to acheive this result I suggest that you use:
the "Split text in parts" action on ":"
and then use "remove whitespaces" to have a clean value.
If you really need to keep one action, you have the "Remove part of the text" action on regex that is based on the java Pattern.
Using regex ".*:\s" works fine

Related

Hour:Minute format on an APEX chart is not possible

I use Oracle APEX (v22.1) and on a page I created a (line) chart, but I have the following problem for the visualization of the graphic:
On the y-axis it is not possible to show the values in the format 'hh:mi' and I need a help for this.
Details for the axis:
x-axis: A date column represented as a string: to_char(time2, 'YYYY-MM')
y-axis: Two date columns and the average of the difference will be calculated: AVG(time2 - time1); the date time2 is the same as the date in the x-axis.
So I have the following SQL query for the visualization of the series:
SELECT DISTINCT to_char(time2, 'YYYY-MM') AS YEAR_MONTH --x-axis,
AVG(time2 - time1) AS AVERAGE_VALUE --y-axis
FROM users
GROUP BY to_char(time2, 'YYYY-MM')
ORDER BY to_char(time2, 'YYYY-MM')
I have another problem to solve it in another way: I am not familiar with JavaScript, if the solution is only possible in this way. Because I started new with APEX, but I have seen in different tutorials that you can use JS. So, when JS is the only solution, I would be happy to get a short description what I must do on the page.
(I don't know if this point is important for this case: The values time1 and time2 are updated daily.)
On the attributes of the chart I enabled the 'Time Axis Type' under Settings
On the y-axis I change the format to "Time - Short" and I tried with different pattern like ##:## but in this case you see for every value and also on the y-axis the value '01:00' although the line chart was represented in the right way. But when I change the format to Decimal the values are shown correct as the line chart.
I also tried it with the EXTRACT function for the value like 'EXTRACT(HOUR FROM AVG(time2 - time1))|| ':' || EXTRACT(MINUTE FROM AVG(time2 - time1))' but in this case I get an error message
So where is my mistake or is it more difficult to solve this?

ROUND(TRUNC(avg(time2 - time1)/60) + mod(avg(time2 - time1),60)/100, 2) AS Y
will get close to what you want, you can set Y Axis minimum 0 maximum 24
then 12.23 means 12 hour and 23 minutes.

Regex to extract shoe size from string column

I have a database with string column product_name which has data like:
Vans Classic Slip-On Black & White Checkerboard/ White - veľkosť (US) : 6 (EUR: 38)
Vans Old Skool - čierna - veľkosť (US) : 9.5 (EUR: 42.5)
I am trying to extract the US size...
SELECT REGEXP_SUBSTR("product_name", ...) AS "size"
...with desired output like this.
size
6
9.5
I have tried this, but to no avail
SELECT REGEXP_SUBSTR("product_name", '(US)(\d+)') AS "size"

I need to agree with B001, this might not be the best way of saving your information. However, if you are sure your strings are going to have this format, you could use this regex
\(US\) ?: ?(\d+\.?\d*) \(EUR: ?(\d+\.?\d*)\)
This will match the US shoe size first and then the EUR one.
Here is a visual explaination of the regex
Please note that this regex will match BOTH sizes, I'm not sure which one you prefer
You can test more cases in this regex101

When working in the web UI I had to double slash my slashes. Thus the following worked as you want.
select REGEXP_SUBSTR(str, '\\(US\\)\\s\\:\\s(\\d+\\.?\\d*)',1,1,'i',1)
from values ('Vans Classic Slip-On Black & White Checkerboard/ White - veľkosť (US) : 6 (EUR: 38)'),
('Vans Old Skool - čierna - veľkosť (US) : 9.5 (EUR: 42.5)') v(str);
gives:
REGEXP_SUBSTR(STR, '\\(US\\)\\S\\:\\S(\\D+\\.?\\D*)',1,1,'I',1)
6
9.5

Extracting the 'end' of a string, conditions or regular expression?

I have a data table - can be extracted to text or spreadsheet - The column has random text with areas in square metres that I want to copy to a new column in hectares. (So parse text and divide by 10,000).
e.g.
Deposited Plan 172499, 53,310 m2
Deposited Plan 166167, 853 m2
This plan has no area stated
Section 21 Block I Wellington District, 403,573 m2
Output column should have:
5.3310
0.0853
40.3573
Is there a way I can automate this in LibreOffice Calc, or with a regular expression editor like TextCrawler? Or perhaps using an AutoIt script?

Try with this
/([0-9]+,*[0-9]+\sm2)$/

Get a string after a specific word, using a program that has limited regex features?

Looking for help on building a regex that captures a 1-line string after a specific word.
The challenge I'm running into is that the program where I need to build this regex uses a single line format, in other words dot matches new line. So the formula I created isn't working. See more details below. Any advice or tips?
More specific regex task:
I'm trying to grab the line that comes after the word Details from entries like below. The goal is pull out 100% Silk, or 100% Velvet. This is the material of the product that always comes after Details.
Raw data:
<p>Loose fitted blouse green/yellow lily print.
V-neck opening with a closure string.
Small tie string on left side of top.</p>
<h3>Details</h3> <p>100% Silk.</p>
<p>Made in Portugal.</p> <h3>Fit</h3>
<p>Model is 5‰Ûª10,‰Û size 2 wearing size 34.</p> <p>Size 34 measurements</p>
OR
<p>The velvet version of this dress. High waist fit with hook and zipper closure.
Seams run along edges of pants to create a box-like.</p>
<h3>Details</h3> <p>100% Velvet.</p>
<p>Made in the United States.</p>
<h3>Fit</h3> <p>Model is 5‰Ûª10‰Û, size 2 and wearing size M pants.</p> <p>Size M measurements Length: 37.5"åÊ</p>
<p>These pants run small. We recommend sizing up.</p>
Here is the current formula I created that's not working:
Replace (.)(\bDetails\s+(.)) with $3
The output gives the below:
<p>100% Silk.</p>
<p>Made in Portugal.</p>
<h3>Fit</h3>
<p>Model is 5‰Ûª10,‰Û size 2 wearing size 34.</p>
<p>Size 34 measurements</p>
OR
<p>100% Velvet.</p>
<p>Made in the United States.</p>
<h3>Fit</h3> <p>Model is 5‰Ûª10‰Û, size 2 and wearing size M pants.</p> <p>Size M measurements Length: 37.5"åÊ</p>
<p>These pants run small. We recommend sizing up.</p>
`
How do I capture just the desired string? Let me know if you have any tips! Thank you!

Difficult to provide a working solution in your situation as you mention your program has "limited regex features" but don't explain what limitations.
Here is a Regex you can try to work with to capture the target string
^(?:<h3>Details<\/h3>)(.*)$

I would personally use BeautifulSoup for something like this, but here are two solutions you could use:
Match the line after "Details", then pull out the data.
matches = re.findall('(?<=Details<).*$', text)
matches = [i.strip('<>') for i in matches]
matches = [i.split('<')[0] for i in [j.split('>')[-1] for j in matches]]
Replace "Details<...>data" with "Detailsdata", then find the data.
text = re.sub('Details<.*?<.*>', '', text)
matches = re.findall('(?<=Details).*?(?=<)', text)

Neo4j regex string matching not returning expected results

I'm trying to use the Neo4j 2.1.5 regex matching in Cypher and running into problems.
I need to implement a full text search on specific fields that a user has access to. The access requirement is key and is what prevents me from just dumping everything into a Lucene instance and querying that way. The access system is dynamic and so I need to query for the set of nodes that a particular user has access to and then within those nodes perform the search. I would really like to match the set of nodes against a Lucene query, but I can't figure out how to do that so I'm just using basic regex matching for now. My problem is that Neo4j doesn't always return the expected results.
For example, I have about 200 nodes with one of them being the following:
( i:node {name: "Linear Glass Mosaic Tiles", description: "Introducing our new Rip Curl linear glass mosaic tiles. This Caribbean color combination of greens and blues brings a warm inviting feeling to a kitchen backsplash or bathroom. The colors work very well with white cabinetry or larger tiles. We also carry this product in a small subway mosaic to give you some options! SOLD OUT: Back in stock end of August. Call us to pre-order and save 10%!"})
This query produces one result:
MATCH (p)-->(:group)-->(i:node)
WHERE (i.name =~ "(?i).*mosaic.*")
RETURN i
> Returned 1 row in 569 ms
But this query produces zero results even though the description property matches the expression:
MATCH (p)-->(:group)-->(i:node)
WHERE (i.description=~ "(?i).*mosaic.*")
RETURN i
> Returned 0 rows in 601 ms
And this query also produces zero results even though it includes the name property which returned results previously:
MATCH (p)-->(:group)-->(i:node)
WITH i, (p.name + i.name + COALESCE(i.description, "")) AS searchText
WHERE (searchText =~ "(?i).*mosaic.*")
RETURN i
> Returned 0 rows in 487 ms
MATCH (p)-->(:group)-->(i:node)
WITH i, (p.name + i.name + COALESCE(i.description, "")) AS searchText
RETURN searchText
>
...
SotoLinear Glass Mosaic Tiles Introducing our new Rip Curl linear glass mosaic tiles. This Caribbean color combination of greens and blues brings a warm inviting feeling to a kitchen backsplash or bathroom. The colors work very well with white cabinetry or larger tiles. We also carry this product in a small subway mosaic to give you some options! SOLD OUT: Back in stock end of August. Call us to pre-order and save 10%!
...
Even more odd, if I search for a different term, it returns all of the expected results without a problem.
MATCH (p)-->(:group)-->(i:node)
WITH i, (p.name + i.name + COALESCE(i.description, "")) AS searchText
WHERE (searchText =~ "(?i).*plumbing.*")
RETURN i
> Returned 8 rows in 522 ms
I then tried to cache the search text on the nodes and I added an index to see if that would change anything, but it still didn't produce any results.
CREATE INDEX ON :node(searchText)
MATCH (p)-->(:group)-->(i:node)
WHERE (i.searchText =~ "(?i).*mosaic.*")
RETURN i
> Returned 0 rows in 3182 ms
I then tried to simplify the data to reproduce the problem, but in this simple case it works as expected:
MERGE (i:node {name: "Linear Glass Mosaic Tiles", description: "Introducing our new Rip Curl linear glass mosaic tiles. This Caribbean color combination of greens and blues brings a warm inviting feeling to a kitchen backsplash or bathroom. The colors work very well with white cabinetry or larger tiles. We also carry this product in a small subway mosaic to give you some options! SOLD OUT: Back in stock end of August. Call us to pre-order and save 10%!"})
WITH i, (
i.name + " " + COALESCE(i.description, "")
) AS searchText
WHERE searchText =~ "(?i).*mosaic.*"
RETURN i
> Returned 1 rows in 630 ms
I tried using the CYPHER 2.1.EXPERIMENTAL tag as well but that didn't change any of the results. Am I making incorrect assumptions on how the regex support works? Is there something else I should try or some other way to debug the problem?
Additional information
Here is a sample call that I make to the Cypher Transactional Rest API when creating my nodes. This is the actual plain text that is sent (other than some formatting for easier reading) when adding nodes to the database. Any string encoding is just standard URL encoding that is performed by Go when creating a new HTTP request.
{"statements":[
{
"parameters":
{
"p01":"lsF30nP7TsyFh",
"p02":
{
"description":"Introducing our new Rip Curl linear glass mosaic tiles. This Caribbean color combination of greens and blues brings a warm inviting feeling to a kitchen backsplash or bathroom. The colors work very well with white cabinetry or larger tiles. We also carry this product in a small subway mosaic to give you some options! SOLD OUT: Back in stock end of August. Call us to pre-order and save 10%!",
"id":"lsF3BxzFdn0kj",
"name":"Linear Glass Mosaic Tiles",
"object":"material"
}
},
"resultDataContents":["row"],
"statement":
"MATCH (p:project { id: { p01 } })
WITH p
CREATE UNIQUE (p)-[:MATERIAL]->(:materials:group {name: \"Materials\"})-[:MATERIAL]->(m:material { p02 })"
}
]}
If it is an encoding issue, why does a search on name work, description not work, and name + description not work? Is there any way to examine the database to see if/how the data was encoded. When I perform searches, the text returned appears correct.

just a few notes:
probably replace create unique with merge (which works a bit differently)
for your fulltext search I would go with the lucene legacy index for performance, if your group restriction is not limiting enough to keep the response below a few ms
I just tried your exact json statement, and it works perfectly.
inserted with
curl -H accept:application/json -H content-type:application/json -d #insert.json \
-XPOST http://localhost:7474/db/data/transaction/commit
json:
{"statements":[
{
"parameters":
{
"p01":"lsF30nP7TsyFh",
"p02":
{
"description":"Introducing our new Rip Curl linear glass mosaic tiles. This Caribbean color combination of greens and blues brings a warm inviting feeling to a kitchen backsplash or bathroom. The colors work very well with white cabinetry or larger tiles. We also carry this product in a small subway mosaic to give you some options! SOLD OUT: Back in stock end of August. Call us to pre-order and save 10%!",
"id":"lsF3BxzFdn0kj",
"name":"Linear Glass Mosaic Tiles",
"object":"material"
}
},
"resultDataContents":["row"],
"statement":
"MERGE (p:project { id: { p01 } })
WITH p
CREATE UNIQUE (p)-[:MATERIAL]->(:materials:group {name: \"Materials\"})-[:MATERIAL]->(m:material { p02 }) RETURN m"
}
]}
queried:
MATCH (p)-->(:group)-->(i:material)
WHERE (i.description=~ "(?i).*mosaic.*")
RETURN i
returns:
name: Linear Glass Mosaic Tiles
id: lsF3BxzFdn0kj
description: Introducing our new Rip Curl linear glass mosaic tiles. This Caribbean color combination of greens and blues brings a warm inviting feeling to a kitchen backsplash or bathroom. The colors work very well with white cabinetry or larger tiles. We also carry this product in a small subway mosaic to give you some options! SOLD OUT: Back in stock end of August. Call us to pre-order and save 10%!
object: material
What you can try to check your data is to look at the json or csv dumps that the browser offers (little download icons on the result and table-result)
Or you use neo4j-shell with my shell-import-tools to actually output csv or graphml and check those files.
Or use a bit of java (or groovy) code to check your data.
There is also the consistency-checker that comes with the neo4j-enterprise download. Here is a blog post on how to run it.
java -cp 'lib/*:system/lib/*' org.neo4j.consistency.ConsistencyCheckTool /tmp/foo
I added a groovy test script here: https://gist.github.com/jexp/5a183c3501869ee63d30
One more idea: regexp flags
Sometimes there is a multiline thing going on, there are two more flags:
multiline (?m) which also matches across multiple lines and
dotall (?s) which allows the dot also to match special chars like newlines
So could you try (?ism).*mosaic.*

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Data Preperation Identify String using Regex and move to new column - regex

Related

Hour:Minute format on an APEX chart is not possible

Regex to extract shoe size from string column

Extracting the 'end' of a string, conditions or regular expression?

Get a string after a specific word, using a program that has limited regex features?

Neo4j regex string matching not returning expected results

Categories

Resources