R - split string before two last digits in each column cell - regex

I have a csv with usernames in a column, followed by each user's feedback rating, out of 100.
E.g. James89
I hope to find a way to split the name and the rating, e.g. by inserting a comma before the two last digits using regex. Is this possible? And/or is there a better way to do this?
df1 = data.frame(Product = c(rep("ARCH78"), rep("AUSFUNGUY91"), rep("AddiesAndXans96"), rep("AfroBro79")))
The code above is a tiny excerpt of the data I'm dealing with. I hope to get this output:
ARCH 78
AUSFUNGUY 91
AddiesAndXans 96
AfroBro 79
I've tried this code (inspired by another answer):
df1$P2 <- gsub("(.*?)(..)", "\\1", df1$Product)
It seems to be working, but there's something wrong with the output:
ARCH78 AR
AUSFUNGUY91 AUUNY
AddiesAndXans96 AdesdXs
AfroBro79 AfBr9

As for the following:
I hope to find a way to split the name and the rating, e.g. by inserting a comma before the two last digits using regex.
You can achieve it with a mere
df1 = data.frame(Product = c(rep("ARCH78"), rep("AUSFUNGUY91"), rep("AddiesAndXans96"), rep("AfroBro79")))
gsub("(\\d{2})$",",\\1",df1$Product)
## => [1] "ARCH,78" "AUSFUNGUY,91" "AddiesAndXans,96" "AfroBro,79"
You can further adjust the replacement ",\\1" that features a backreference \1 to the last 2 digits.

Related

Retrieving the 12th through 14th characters from a long string using ONLY regex - Grafana variable

I have a small issue, I am trying to get specific characters from a long string using regex but I am having trouble.
Workflow
Prometheus --> Grafana --> Variable (using regex)
I can't use anything other than Regex expressions to achieve this result
I am currently using this expression to grab the long string from some json output:
.*channel_id="(.*?)".*
FROM THIS
{account_id="XXXXXXX-xxxx-xxxx-xxxx-xxxxxxxxxx",account_name="testalpha",channel_id="s0022110430col0901241usa",channel_abbr="s0022109430col}
This returns a string that's ALWAYS 24 characters long:
s0022110430col0901241usa
PROBLEM:
I need to grab the 3-letter codes 'col' and 'usa', as they are the two teams that are playing. Ideally I would be able to pipe the result of the first regex into a second step to get these values. The position is key: the first value will ALWAYS be the 12th through 14th characters, and the second value is the last 3 characters. If possible, I would like to output these values in uppercase with the string "vs" in between, to create a string such as:
COL vs USA
or
ARG vs BRA
I am open to any and every suggestion anyone may have
Thank you!
PS - The uppercase thing is 'nice to have' BUT not needed
I'm still learning RegEx, so this is all I could come up with:
For the col (first team):
(?<=(channel_id=".{11}))\w{3}
For the usa (second team):
(?<=(channel_id=".{21}))\w{3}
Can you define the channel_id?
It begins with 's' and then there are many numbers. If they are always numbers, you can use this regex:
channel_id=".[0-9]+([a-z]+)[0-9]+([a-z]+)
You will get 2 groups, one with "col" and the other with "usa".
Edit:
Or, if you know that the string always has the same length, you can use something like:
channel_id=".{11}([a-z]+).{7}([a-z]+)
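As a quick check of the fixed-width pattern above, here is a minimal Python sketch. It runs outside Grafana, so it only confirms what the two capture groups contain. The sample string is the one from the question with the truncated channel_abbr label left out, and the uppercasing is done in Python rather than by the regex, which is an assumption about how the result might be post-processed.

```python
import re

# Sample label string from the question (truncated channel_abbr label omitted).
labels = ('{account_id="XXXXXXX-xxxx-xxxx-xxxx-xxxxxxxxxx",'
          'account_name="testalpha",channel_id="s0022110430col0901241usa"}')

# Skip 11 characters, capture letters, skip 7 characters, capture letters.
pattern = re.compile(r'channel_id=".{11}([a-z]+).{7}([a-z]+)')

match = pattern.search(labels)
first_team, second_team = match.groups()

# Uppercasing happens here in Python, not inside the regex.
matchup = f"{first_team.upper()} vs {second_team.upper()}"
print(matchup)
```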

Extract data from dataset

I need to extract the title from the name, but I cannot understand how the code works. I have provided the code below:
combine = [traindata, testdata]
for dataset in combine:
    dataset["title"] = dataset["Name"].str.extract(r' ([A-Za-z]+)\.', expand=False)
There is no error, but I need to understand how the above code works.
Name
Braund, Mr. Owen Harris
Cumings, Mrs. John Bradley (Florence Briggs Thayer)
Heikkinen, Miss. Laina
Futrelle, Mrs. Jacques Heath (Lily May Peel)
Allen, Mr. William Henry
Moran, Mr. James
Above is the Name feature from the CSV file; dataset["title"] stores the title of each name, that is Mr, Miss, Master, etc.
Your code extracts the title from the name using the pandas.Series.str.extract function, which uses regex.
pandas.Series.str.extract - Extract capture groups in the regex pat as columns in a DataFrame.
' ([A-Za-z]+)\.' is the regex pattern in your code. It finds the run of letters in Name that is preceded by a space and followed by a '.'.
[A-Za-z] - this part of the pattern matches a single letter in the ranges a-z and A-Z
+ states that there must be one or more such letters
\. matches the literal . that follows the letters
( ) - the parentheses form the capture group, so only the letters (not the space or the .) are returned
An example is provided in the pandas documentation for this function, where it extracts parts of a string and puts them in separate columns.
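To see what the pattern captures, here is a small sketch that uses Python's standard re module instead of pandas, so it runs without any dependencies. The helper name extract_title is hypothetical. It mirrors str.extract in that it takes the first match only and yields nothing when the pattern does not match.

```python
import re

# Same pattern as in the question, written as a raw string.
TITLE_PATTERN = re.compile(r' ([A-Za-z]+)\.')

names = [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
]

def extract_title(name):
    """Return the first capture group, or None when nothing matches."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else None

titles = [extract_title(name) for name in names]
print(titles)
```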
I found this response and its link very helpful for understanding how to use the str.extract method, and how changing expand from True to False returns the strings as a Series instead of as DataFrame columns.

Removing unmatched text and building a table with the remaining matches

I have 30000 lines that look like the one below.
342800005013000 CON N GORE PT LOT 31 RP 11R2284 PT PART 1 RP 11R4541 PT PART 2
I would like to capture the 15 digit number at the beginning and any "11R***" numbers.
In Notepad++ I've used \d{15}|(11R\d*)* to match everything that I want. Ultimately I would like to get all the matched results into excel. What would be the best way to do so?
Thanks for your help.
You could try this one
(^[0-9]*)|(11R[0-9A-Za-z]*)
Edit: check it now; the code formatting displays the regex correctly.
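For the step of getting the matches into Excel, one option is to do the matching outside Notepad++ and write a CSV file that Excel can open. The sketch below is only an illustration under some assumptions: the pattern is a slightly stricter variant of the ones above (exactly 15 leading digits, and "11R" followed by at least one letter or digit), the helper name extract_ids is hypothetical, and the output goes to an in-memory buffer rather than a real file.

```python
import csv
import io
import re

# A 15-digit number at the start of the line, or any token starting with 11R.
ID_PATTERN = re.compile(r'^\d{15}|11R\w+')

def extract_ids(line):
    """Return the leading 15-digit number and every 11R... token, in order."""
    return ID_PATTERN.findall(line)

sample = ("342800005013000 CON N GORE PT LOT 31 RP 11R2284 "
          "PT PART 1 RP 11R4541 PT PART 2")

# Write one CSV row per input line; replace the buffer with an open file as needed.
buffer = io.StringIO()
writer = csv.writer(buffer)
for line in [sample]:
    writer.writerow(extract_ids(line))

csv_text = buffer.getvalue()
print(csv_text)
```

Each input line becomes one row, with the 15-digit number in the first column and the 11R numbers in the columns after it.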

Find repeating gps using regular expression

I work with text files, and I need to be able to see when the gps (last 3 columns of csv) "hangs up" for more than a few lines.
So for example, usually, part of a text file looks like this:
5451,1667,180007,35.7397387,97.8161897,375.8
5448,1053z,180006,35.7397407,97.8161814,375.7
5444,1667,180005,35.7397445,97.8161674,375.6
5439,1668,180004,35.7397483,97.8161526,375.5
5435,1669,180003,35.7397518,97.8161379,375.5
5431,1669,180002,35.7397554,97.8161269,375.6
5426,1054z,180001,35.7397584,97.8161115,375.6
5420,1670,175959,35.7397649,97.8160931,375.9
But sometimes there is an error with the gps and it looks like this:
36859,1598,202603.00,35.8867316,99.2515545,555.700
36859,1598,202608.00,35.8867316,99.2515545,555.700
36859,1142z,202610.00,35.8867316,99.2515545,555.700
36859,1597,202612.00,35.8867316,99.2515545,555.700
36859,1597,202614.00,35.8867316,99.2515545,555.700
36859,1596,202616.00,35.8867316,99.2515545,555.700
36859,1595,202618.00,35.8867316,99.2515545,555.700
I need to be able to figure out a way to search for matching strings of 7 different numbers, (the decimal portion of the gps) but so far I've only been able to figure out how to search for repeating #s or consecutive numbers.
Any ideas?
If you were to find such repetitions in an editor (such as Notepad++), you could use the following regex to find 4 or more repeating lines:
([^,]+(?:,[^,]+){2})\v+(?:(?:[^,]+,){3}\1(?:\v+|$)){3,}
To go a bit into detail
([^,]+(?:,[^,]+){2})\v+ is a group consisting of one or more non-commas followed, twice, by a comma and one or more non-commas; the group is followed by vertical whitespace (a line break) that is not part of the group (e.g. 1,1,1\n)
(?:[^,]+,){3} matches one or more non-commas followed by a comma, three times (your columns that don't have to be considered)
\1 is a backreference to group 1, matching only the exact text that group 1 captured
(?:\v+|$) matches either more vertical whitespace or the end of the text
{3,} for 3 or more repetitions - increase it if you want more
However, if you are using any programming language to check this, I wouldn't go down the regex path, as checking for those repetitions can be done a lot more easily. Here is one example in Python; I hope you can adapt it to your needs:
oldcoords = [0,0,0]
repetitions = 0
# read all lines, stripping the trailing newline
with open(r'C:\temp\gps.csv') as f:
    lines = [line.rstrip('\n') for line in f]
for line in lines:
    # the last three columns hold the gps coordinates
    gpscoords = line.split(',')[3:6]
    if gpscoords == oldcoords:
        repetitions += 1
    else:
        oldcoords = gpscoords
        repetitions = 0
    if repetitions == 4: #or however you define more than a few
        print(', '.join(gpscoords) + ' is repeated')
If you can use perl, and if I understood you correctly:
perl -ne 'm/^[^,]*,[^,]*,[^,]*,([^,]*,[^,]*,[^,]*$)/g; $current_line=$1; ++$line_number; if ($prev_line eq $current_line){$equals++} else {if ($equals>=6){ print "Last three fields in lines ".($line_number-$equals-1)." to ".($line_number-1)." are equals to:\n$prev_line" } ; $equals=0}; $prev_line=$current_line' < onlyreplacethiswithyourfilepath
should do the trick.
Sample output:
Last three fields in lines 1 to 7 are equals to:
35.8867316,99.2515545,555.700
Last three fields in lines 16 to 22 are equals to:
37.8782116,99.7825545,572.810
Last three fields in lines 31 to 44 are equals to:
36.6868916,77.2594245,581.358
Last three fields in lines 57 to 63 are equals to:
35.5128764,71.2874545,575.631

Split sentence by words with regex in R

I'm using (or I'd like to use) R to extract some information. I have the following sentence and I'd like to split. In the end, I'd like to extract only the number 24.
Here's what I have:
doc <- "Hits 1 - 10 from 24"
And I want to extract the number "24". I know how to extract the number once I can reduce the sentence in "Hits 1 - 10 from" and "24". I tried using this:
n_docs <- unlist(str_split(key_n_docs, ".\\from"))[1]
But this leaves me with: "Hits 1 - 10"
Obviously the split works somehow, but I'm interested in the part after "from" not the one before. All the help is appreciated!
If you want to extract from a single character string:
strsplit(key_n_docs, "from")[[1]][2]
or the equivalent expression used by @BastiM (sorry, I saw your answer after I submitted mine)
unlist(strsplit(key_n_docs, "from"))[2]
If you want to extract from a vector of character strings:
sapply(strsplit(key_n_docs, "from"),`[`, 2)
str_split returns a list whose first element is a character vector holding the pieces. After unlist, the part before "from" is at index 1 and the part after it is at index 2, so you need index 2 rather than 1. Using
unlist(strsplit("Hits 1 - 10 from 24", "from"))[2]
works like a charm for me.
You can use str_extract from stringr:
library(stringr)
numbers <- str_extract(doc, "[0-9]+$")
This will give only the digits at the end of the sentence.
numbers
"24"
You can use sub to extract the number:
sub(".*from *(\\d+).*", "\\1", doc)
# [1] "24"