How to extract only the weight from the title using regular expressions?

I need to get the weight from the titles below. Since they don't all follow the same format, I couldn't get the desired results.
Pommes d'Aquitaine et mangue (dès 4 mois) - 2 x 130 g (Babybio)
Céréales vanille avec quinoa - à partir de 6 mois - 220 g (Babybio)
Ratatouille riz (dès 12 mois) 2 x 200 g (Babybio)
Pomme de terre, petits pois et jambon (dès 8 mois) 2 x 200 g
Fondue de carotte et maïs doux au quinoa (dès 12 mois) 230 g (Babybio)
Gourdes de fruits: pomme d'Aquitaine, poire et pêche - dès 6 mois - 4 x 90 g (Babybio)
Douceur de panais du Val de Loire, carotte et riz (dès 12 mois) 230 g (Babybio)
Expected results:
2 x 130 g
220 g
2 x 200 g
230 g
4 x 90 g
230 g
I tried this pattern:
[0-9]+ x \d+ g

Try this:
\d+( ?x ?\d+)? g
This matches, for example:
220 g
220x10 g
10 x23 g
45x 78 g
...
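
As a quick check, the pattern above can be run against the titles from the question. This is only a minimal sketch in Python: the helper name extract_weight is made up for illustration, and the group is written as non-capturing, (?:...), so that the whole match is returned. That is a small change from the pattern as posted.

```python
import re

# Same pattern as in the answer, with a non-capturing group (an assumption
# made here so that the full match is what gets returned).
WEIGHT = re.compile(r"\d+(?: ?x ?\d+)? g")


def extract_weight(title):
    """Return the first weight found in the title, or None if there is none."""
    match = WEIGHT.search(title)
    return match.group(0) if match else None


titles = [
    "Pommes d'Aquitaine et mangue (dès 4 mois) - 2 x 130 g (Babybio)",
    "Céréales vanille avec quinoa - à partir de 6 mois - 220 g (Babybio)",
    "Ratatouille riz (dès 12 mois) 2 x 200 g (Babybio)",
    "Fondue de carotte et maïs doux au quinoa (dès 12 mois) 230 g (Babybio)",
]

for title in titles:
    print(extract_weight(title))
```

The ages such as "4 mois" are skipped because the digits are not followed by " g".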

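This will work for the titles above. Your regex only missed the cases where there is no x:
[0-9]+( x )?\d+ g
The (...) groups everything enclosed in it, and the ? matches one occurrence or none. Note that when there is no x, this pattern needs at least two digits (one for [0-9]+ and one for \d+), so a single-digit weight such as 5 g would not match.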

Related

Counting sequence by group

Let's say I have a dataset as follows:
ID Cat
101 G
101 G
101 F
101 G
102 F
102 F
102 G
102 F
102 F
I want to create a sequence variable by the group variables ID and Cat (notsorted).
A count can be produced this way:
data X1; set have; by ID cat notsorted;
if first.cat then count=1; else count+1; run;
ID Cat count
101 G 1
101 G 2
101 F 1
101 G 1
102 F 1
102 F 2
102 G 1
102 F 1
102 F 2
However, what I am looking for is:
ID Cat Seq
101 G 1
101 G 1
101 F 2
101 G 3
102 F 1
102 F 1
102 G 2
102 F 3
102 F 3

You can just use
seq+first.cat;
So every time a new CAT value starts, SEQ increments by one.
To reset for each ID, add this after the sum statement:
if first.id then seq=1;
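
The same logic can be sketched outside SAS. The Python below is only an illustration of the idea (restart at 1 on a new ID, add one whenever Cat changes within an ID); the function name sequence_by_group is made up for this example.

```python
def sequence_by_group(rows):
    """Return (id, cat, seq) tuples for rows given in their original order.

    seq restarts at 1 for each new id and increases by one every time
    cat changes within that id (the data does not need to be sorted by cat).
    """
    result = []
    prev_id = prev_cat = None
    seq = 0
    for row_id, cat in rows:
        if row_id != prev_id:
            seq = 1          # like: if first.id then seq=1
        elif cat != prev_cat:
            seq += 1         # like: seq+first.cat
        result.append((row_id, cat, seq))
        prev_id, prev_cat = row_id, cat
    return result


rows = [
    (101, "G"), (101, "G"), (101, "F"), (101, "G"),
    (102, "F"), (102, "F"), (102, "G"), (102, "F"), (102, "F"),
]

for row in sequence_by_group(rows):
    print(*row)
```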

How to move a word position in a sentence in pyspark

I have the following street addresses:
- KR 71D 6 94 SUR LC 1709
- KR 24B 15 20 SUR AP 301
- KR 72F 39 42 SUR
- KR 72F SUR 39 42
- KR 72 SUR 39 42
What I need is to detect the word SUR only when it comes after the address plate, remove it, and place it right after the main address instead. For example:
- KR 71D 6 94 SUR LC 1709 <-- Change it to: KR 71D SUR 6 94 LC 1709
- KR 24B 15 20 SUR AP 301 <-- Change it to: KR 24B SUR 15 20 AP 301
- KR 72F 39 42 SUR <-- Change it to: KR 72F SUR 39 42
- KR 72F SUR 39 42 <-- It is ok, leave it this way
- KR 72 SUR 39 42 <-- It is ok, leave it this way
Thanks a lot, and I hope somebody could help me.

You can try this:
import re
lyst = ["KR 71D 6 94 SUR LC 1709","KR 24B 15 20 SUR AP 301","KR 72F 39 42 SUR","KR 72F SUR 39 42","KR 72 SUR 39 42"]
comp = re.compile(r'([a-zA-Z]+)(\s)(\w+)\s(\d+)\s(\d+)\s([a-zA-Z]+)(.*)$')
Logic:
Each pair of parentheses captures one piece of the address. The words (letters or digits) are separated by single spaces, and the first space is captured too so that it can be reused in the replacement. SUR is the fifth word and has to move to the third position; it ends up in group \6 (one more than 5 because the space takes group \2). Everything after it is picked up in a single group by (.*), which is \7. The replacement is done with sub from the re module. For the last two strings the pattern does not match, so nothing is replaced and they stay as they are.
newlyst = []
for items in lyst:
    newlyst.append(re.sub(comp, r'\1\2\3\2\6\2\4\2\5\7', items))
You can print the newlyst to see the output:
Output:
['KR 71D SUR 6 94 LC 1709', 'KR 24B SUR 15 20 AP 301', 'KR 72F SUR 39 42', 'KR 72F SUR 39 42', 'KR 72 SUR 39 42']
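
Note that the pattern above moves any alphabetic word found in the fifth position, not just SUR. If only the literal word SUR should be moved, a stricter variant could look like the sketch below. It assumes every address starts with two tokens (street type and street number) followed by two numeric plate tokens; the name move_sur is hypothetical.

```python
import re

# Assumption: two leading tokens, then two numeric plate tokens, then the
# literal word SUR. Anything after SUR is kept unchanged.
SUR_AFTER_PLATE = re.compile(r"^(\w+\s\w+)\s(\d+\s\d+)\sSUR\b(.*)$")


def move_sur(address):
    """Move SUR from after the plate to right after the main address."""
    return SUR_AFTER_PLATE.sub(r"\1 SUR \2\3", address)


addresses = [
    "KR 71D 6 94 SUR LC 1709",
    "KR 24B 15 20 SUR AP 301",
    "KR 72F 39 42 SUR",
    "KR 72F SUR 39 42",
    "KR 72 SUR 39 42",
]

for address in addresses:
    print(move_sur(address))
```

Addresses that already have SUR in the right place do not match the pattern and are returned as they are.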

TOPN in PowerBI DAX not arranging values in proper order

I have been running into some issues with the TOPN function in DAX in PowerBI.
Below is the original dataset:
regions sales
--------------
a 1191
b 807
c 1774
d 376
e 899
f 1812
g 1648
h 6
i 1006
j 1780
k 243
l 777
m 747
n 61
o 1637
p 170
q 1319
r 1437
s 493
t 1181
u 118
v 1787
w 1396
x 102
y 104
z 656
So now, I want to get the Top 5 sales in descending order.
I used the following code:
Table = TOPN(5, SUMMARIZE(Sheet1, Sheet1[regions], Sheet1[sales]), Sheet1[sales], DESC)
The resulting table is as follows:
regions sales
--------------
g 1648
j 1780
c 1774
v 1787
f 1812
Any idea why this is happening?

According to the Microsoft documentation, this is working as intended:
https://msdn.microsoft.com/en-us/query-bi/dax/topn-function-dax
Remarks
TOPN does not guarantee any sort order for the results.
What you can do is create a rank with RANKX and sort by it.
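
The idea of ranking first and sorting by the rank afterwards can be illustrated outside DAX. The Python below is not DAX and does not show how RANKX is written; it is only a sketch of the two steps (rank every row, then keep and order the top 5) using the data from the question. The function name rank_descending is made up.

```python
sales = {
    "a": 1191, "b": 807, "c": 1774, "d": 376, "e": 899, "f": 1812,
    "g": 1648, "h": 6, "i": 1006, "j": 1780, "k": 243, "l": 777,
    "m": 747, "n": 61, "o": 1637, "p": 170, "q": 1319, "r": 1437,
    "s": 493, "t": 1181, "u": 118, "v": 1787, "w": 1396, "x": 102,
    "y": 104, "z": 656,
}


def rank_descending(values_by_key):
    """Give each key a rank, where 1 is the largest value (ties share a rank)."""
    ordered = sorted(values_by_key.values(), reverse=True)
    return {key: ordered.index(value) + 1 for key, value in values_by_key.items()}


ranks = rank_descending(sales)

# Selecting the top 5 and ordering them are two separate steps.
top5 = [region for region in sales if ranks[region] <= 5]
top5_sorted = sorted(top5, key=ranks.get)

for region in top5_sorted:
    print(region, sales[region], ranks[region])
```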

When reading from a file using getline() spaces appear between each character?

I am designing a program that takes a directory list using "dir > music.txt". My goal is to remove the permissions and the date from the file and alphabetize the list of artists. When using getline(file, input), spaces appear between each character when the "input" variable is sent to the screen. Here is the code:
#include "stdafx.h"
#include <fstream>
#include <iostream>
#include <string>
using namespace std;
void AlphaSort(string (&data)[300], int size);
void PrintArray(string (&data)[300]);
// This program will read in an array, the program will then remove any text
// that is before 59 characters, then the program will remove any spaces that
// are not succeeded by letters.
int main()
{
    fstream file;
    string input;
    string data[300];
    file.open("music.txt");
    while (getline(file, input))
    {
        cout << input << endl;
        // Scroll through the entire file. Copy the lines into memory
        // (size_t index and "<" avoid unsigned underflow on an empty line)
        for (size_t i = 0; i < input.length(); i++)
        {
            // Process input here...
        }
    }
    // The array has been loaded into memory, run the sort
    //AlphaSort(data, 300);
    //PrintArray(data);
    return 0;
}
Below is sample of the output:
d - - - - - 8 / 7 / 2 0 1 7 1 1 : 1 5 A M E i f f e l 6 5
d - - - - - 8 / 7 / 2 0 1 7 1 1 : 1 9 A M O n e R e p u b l i c
d - - - - - 8 / 7 / 2 0 1 7 1 1 : 1 8 A M M a r o o n 5
d - - - - - 8 / 7 / 2 0 1 7 1 1 : 1 8 A M L u m i n e e r s
d - - - - - 8 / 7 / 2 0 1 7 1 1 : 1 8 A M M y C h e m i c a l R o m a n c e
d - - - - - 8 / 7 / 2 0 1 7 1 1 : 1 4 A M B o b M a r l e y
d - - - - - 8 / 7 / 2 0 1 7 1 1 : 1 9 A M P a r a m o r e
d - - - - - 8 / 7 / 2 0 1 7 1 1 : 1 7 A M I n c u b u s
d - - - - - 8 / 7 / 2 0 1 7 1 1 : 1 4 A M C a r p e n t e r s
d - - - - - 8 / 7 / 2 0 1 7 1 1 : 1 5 A M F a i t h N o M o r e
d - - - - - 8 / 7 / 2 0 1 7 1 1 : 1 2 A M B a s t i l l e
d - - - - - 8 / 7 / 2 0 1 7 1 1 : 1 6 A M F r a n k i e G o e s T o H o l l y w o o d
d - - - - - 8 / 7 / 2 0 1 7 1 1 : 1 7 A M H o o b a s t a n k
As you can see, there are spaces between each character. I have been looking at this for the past hour. Is there a more "correct" way to read input from a file? Below is the input file, which does not contain spaces between each character:
d----- 8/7/2017 11:15 AM Eiffel 65
d----- 8/7/2017 11:19 AM One Republic
d----- 8/7/2017 11:18 AM Maroon 5
d----- 8/7/2017 11:18 AM Lumineers
d----- 8/7/2017 11:18 AM My Chemical Romance
d----- 8/7/2017 11:14 AM Bob Marley
d----- 8/7/2017 11:19 AM Paramore
d----- 8/7/2017 11:17 AM Incubus
d----- 8/7/2017 11:14 AM Carpenters
d----- 8/7/2017 11:15 AM Faith No More
d----- 8/7/2017 11:12 AM Bastille
d----- 8/7/2017 11:16 AM Frankie Goes To Hollywood
d----- 8/7/2017 11:17 AM Hoobastank
d----- 8/7/2017 11:21 AM Young The Giant
d----- 8/7/2017 11:15 AM Disturbed
d----- 8/7/2017 11:12 AM Authority Zero

I am designing a program that takes a directory list using "dir > music.txt".
Don't. If you want to work with the directory contents, work with it directly.
My goal is to remove the permissions and the date from the file and alphabetize the list of artists.
If you didn't use dir, you wouldn't have the permissions and date information in the listing in the first place.
Also, C++ is not a good tool to use for this task. You seem to be on Windows, so try writing a PowerShell script, or if you have Unix-ish tools installed, bash. It should be pretty simple.
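
To illustrate "work with it directly", here is a minimal sketch in Python rather than PowerShell or bash. It reads the folder names straight from the file system, so there are no permissions or dates to strip and no intermediate text file to decode. The function name sorted_subdirectories is made up for this example. As a side note, and only as an assumption since the thread does not confirm it: the spaced-out output is consistent with music.txt having been written as UTF-16, which is what the redirection operator in Windows PowerShell produces by default.

```python
from pathlib import Path


def sorted_subdirectories(root):
    """Return the names of the folders directly under root, alphabetized.

    The sort is case-insensitive so that "bastille" and "Bastille" end up
    next to each other.
    """
    return sorted(
        (entry.name for entry in Path(root).iterdir() if entry.is_dir()),
        key=str.casefold,
    )


if __name__ == "__main__":
    # "." is just a placeholder; point it at the music folder.
    for name in sorted_subdirectories("."):
        print(name)
```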

Creating A Dataframe From A Text Dataset

I have a dataset that has hundreds of thousands of fields. The following is a simplified dataset
dataSet <- c("Plnt SLoc Material Description L.T MRP Stat Auto MatSG PC PN Freq Qty CFreq CQty Cur.RPt New.RPt CurRepl NewRepl Updt Cost ServStock Unit OpenMatResb DFStorLocLevel",
"0231 0002 GB.C152260-00001 ASSY PISTON & SEAL/O-RING 44 PD X A A A 18 136 30 29 50 43 24.88 51.000 EA",
"0231 0002 WH.112734 MOTOR REDUCER, THREE-PHAS 41 PD X B B A 16 17 3 3 5 4 483.87 1.000 EA X",
"0231 0002 WH.920569 SPINDLE MOTOR MINI O 22 PD X A A A 69 85 15 9 25 13 680.91 21.000 EA",
"0231 0002 GB.C150583-00001 VALVE-AIR MDI 64 PD X A A A 16 113 50 35 80 52 19.96 116.000 EA",
"0231 0002 FG.124-0140 BEARING 32 PD X A A A 36 205 35 32 50 48 21.16 55.000 EA",
"0231 0002 WP.254997 BEARING,BALL .9843 X 2.04 52 PD X A A A 18 155 50 39 100 58 2.69 181.000 EA"
)
I would like to create a dataframe out of this dataSet for further calculation. The approach I am following is as follows:
I split the dataSet by space and then recombine it.
dataSetSplit <- strsplit(dataSet, "\\s+")
The header (which is the first line) splits correctly and produces 25 character strings. This can be seen with the str() function.
str(dataSetSplit)
I then intend to combine all the rows together using the following script:
combinedData <- data.frame(do.call(rbind, dataSetSplit))
Please note that the above script ("combinedData") errors because the split did not produce an equal number of fields in every row.
For this approach to work, every row must split correctly into 25 fields.
If you think this is a sound approach, please let me know how to split the rows into 25 fields.
It is worth mentioning that I do not like the approach of splitting the data set with strsplit(). It is an extremely time-consuming step when used with a large data set. Can you please recommend an alternative approach to create a data frame from the supplied data?

By the looks of it, you have a header row that is actually helpful. You can easily use gregexpr to calculate your "widths" to use with read.fwf.
Here's how:
## Use gregexpr to find the position of consecutive runs of spaces
## This will tell you the starting position of each column
Widths <- gregexpr("\\s+", dataSet[1])[[1]]
## `read.fwf` doesn't need the starting position, but the width of
## each column. We can use `diff` to calculate this.
Widths <- c(Widths[1], diff(Widths))
## Since there are no spaces after the last column, we need to calculate
## a reasonable width for that column too. We can do this with `nchar`
## to find the widest row in the data. From this, subtract the `sum`
## of all the previous values.
Widths <- c(Widths, max(nchar(dataSet)) - sum(Widths))
Let's also extract the column names. We could do this in read.fwf, but it would require us to substitute the spaces in the first line with a "sep" character.
Names <- scan(what = "", text = dataSet[1])
Now, read in everything except the first line. You would use the actual file instead of textConnection, I would suppose.
read.fwf(textConnection(dataSet), widths=Widths, strip.white = TRUE,
         skip = 1, col.names = Names)
# Plnt SLoc Material Description L.T MRP Stat Auto MatSG PC PN Freq Qty
# 1 231 2 GB.C152260-00001 ASSY PISTON & SEAL/O-RING 44 PD NA X A A A 18 136
# 2 231 2 WH.112734 MOTOR REDUCER, THREE-PHAS 41 PD NA X B B A 16 17
# 3 231 2 WH.920569 SPINDLE MOTOR MINI O 22 PD NA X A A A 69 85
# 4 231 2 GB.C150583-00001 VALVE-AIR MDI 64 PD NA X A A A 16 113
# 5 231 2 FG.124-0140 BEARING 32 PD NA X A A A 36 205
# 6 231 2 WP.254997 BEARING,BALL .9843 X 2.04 52 PD NA X A A A 18 155
# CFreq CQty Cur.RPt New.RPt CurRepl NewRepl Updt Cost ServStock Unit OpenMatResb
# 1 NA NA 30 29 50 43 NA 24.88 51 EA <NA>
# 2 NA NA 3 3 5 4 NA 483.87 1 EA X
# 3 NA NA 15 9 25 13 NA 680.91 21 EA <NA>
# 4 NA NA 50 35 80 52 NA 19.96 116 EA <NA>
# 5 NA NA 35 32 50 48 NA 21.16 55 EA <NA>
# 6 NA NA 50 39 100 58 NA 2.69 181 EA <NA>
# DFStorLocLevel
# 1 NA
# 2 NA
# 3 NA
# 4 NA
# 5 NA
# 6 NA
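
The header-driven approach above can also be sketched in Python, purely as an illustration of the idea and not as a replacement for read.fwf. It assumes, as the R code does, that every value starts at the same position as its column name in the header. The function name parse_fixed_width and the trimmed four-column sample are made up for this example; the sample rows reuse values from the question.

```python
import re


def parse_fixed_width(lines):
    """Split fixed-width text into dicts, using the header row to find columns."""
    header, *rows = lines
    matches = list(re.finditer(r"\S+", header))
    names = [m.group() for m in matches]
    starts = [m.start() for m in matches]
    ends = starts[1:] + [None]  # the last column runs to the end of the line
    return [
        {name: row[start:end].strip() for name, start, end in zip(names, starts, ends)}
        for row in rows
    ]


# Hypothetical fixed-width layout: each field is padded to a set width.
layout = "{:<5}{:<17}{:<26}{}"
sample = [
    layout.format("Plnt", "Material", "Description", "Qty"),
    layout.format("0231", "WH.112734", "MOTOR REDUCER, THREE-PHAS", "17"),
    layout.format("0231", "FG.124-0140", "BEARING", "205"),
]

for record in parse_fixed_width(sample):
    print(record)
```

All values are kept as strings here, so a leading zero such as the one in 0231 is preserved; converting the numeric columns is left out of the sketch.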

Many thanks to Ananda Mahto, who provided many pieces of this answer.
widthMinusFirst <- diff(gregexpr('(\\s[A-Z])+', dataSet[1])[[1]])
widthFirst <- gregexpr('\\s+', dataSet[1])[[1]][1]
Width <- c(widthFirst, widthMinusFirst)
Widths <- c(Width, max(nchar(dataSet)) - sum(Width))
columnNames <- scan(what = "", text = dataSet[1])
read.fwf(textConnection(dataSet[-1]), widths = Widths, strip.white = FALSE,
         skip = 0, col.names = columnNames)
