Group repeated regex? - regex

I would like to match the following text:
PokerStars Hand #95528134282: Tournament #2013004001, $0.10+$0.01 USD Hold'em No Limit - Level VI (100/200) - 2013/03/14 15:35:36 WET [2013/03/14 11:35:36 ET]
Table '2013004001 5898' 9-max Seat #1 is the button
Seat 1: Pucharrin (7250 in chips)
Seat 2: pahol (24180 in chips)
Seat 3: dno16 (2000 in chips)
Seat 4: sogd20i07 (150 in chips) is sitting out
Seat 5: koaollie (13680 in chips)
Seat 6: vovik770 (6307 in chips)
Seat 7: gab341978 (6920 in chips)
Seat 8: 19gow63 (1000 in chips)
Seat 9: pokerplayer (9048 in chips)
pahol: posts small blind 100
dno16: posts big blind 200
*** HOLE CARDS ***
Dealt to pokerplayer [3s 9d]
sogd20i07: folds
koaollie: folds
vovik770: folds
gab341978: folds
19gow63: raises 800 to 1000 and is all-in
pokerplayer: folds
Pucharrin: folds
pahol: raises 1000 to 2000
dno16: calls 1800 and is all-in
*** FLOP *** [4s 7c Ah]
*** TURN *** [4s 7c Ah] [Qs]
*** RIVER *** [4s 7c Ah Qs] [Ks]
*** SHOW DOWN ***
pahol: shows [Qc Qd] (three of a kind, Queens)
dno16: shows [6h 2h] (high card Ace)
pahol collected 2000 from side pot
19gow63: shows [Kd 2s] (a pair of Kings)
pahol collected 3000 from main pot
19gow63 re-buys and receives 1000 chips for $0.10
dno16 re-buys and receives 2000 chips for $0.20
*** SUMMARY ***
Total pot 5000 Main pot 3000. Side pot 2000. | Rake 0
Board [4s 7c Ah Qs Ks]
Seat 1: Pucharrin (button) folded before Flop (didn't bet)
Seat 2: pahol (small blind) showed [Qc Qd] and won (5000) with three of a kind, Queens
Seat 3: dno16 (big blind) showed [6h 2h] and lost with high card Ace
Seat 4: sogd20i07 folded before Flop (didn't bet)
Seat 5: koaollie folded before Flop (didn't bet)
Seat 6: vovik770 folded before Flop (didn't bet)
Seat 7: gab341978 folded before Flop (didn't bet)
Seat 8: 19gow63 showed [Kd 2s] and lost with a pair of Kings
Seat 9: pokerplayer folded before Flop (didn't bet)
And I would like to capture the lines starting with "Seat somenumber: someplayername (somenumber in chips) someoptionaltext" as a group.
I tried the following regex:
PokerStars.*?Level .+? \(\d+\/(\d+)\) (?:.|\s)*?((?:Seat \d+: .*? \(\d+ in chips\)(?:.|\s)*?)+)(?:.|\s)*?(.*?) collected (\d+)
but it only captures the first occurrence, "Seat 1: Pucharrin (7250 in chips) ".
How can I change it to capture the group?
Thanks.

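Here is a regexp that captures the data you need from each seat line. If you specify the syntax/ordering of those elements, I can refine it for you.
(Seat\s(\d+):\s+([\w\d\-\_\.]+)\s+\((\d+)\s+in\s+chips\)(.*))
Note that the seat number is captured as (\d+); writing (\d)+ would keep only the last digit of a multi-digit seat number. I tested it with your service and input, and it seems to work fine.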
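To show how a per-line pattern sidesteps the repeated-group problem, here is a minimal Python sketch. It assumes the hand history is held in a string (the variable name hand and the shortened sample are made up for illustration) and that the seat number is captured as (\d+). With re.MULTILINE, ^ and $ anchor at every line, so each seat line produces its own match rather than one group that is overwritten on each repetition.

```python
import re

# Shortened sample: three seat lines from the header, one action line,
# and one summary line that must NOT match (it has no "in chips" part).
hand = """Table '2013004001 5898' 9-max Seat #1 is the button
Seat 1: Pucharrin (7250 in chips)
Seat 2: pahol (24180 in chips)
Seat 4: sogd20i07 (150 in chips) is sitting out
pahol: posts small blind 100
Seat 1: Pucharrin (button) folded before Flop (didn't bet)
"""

SEAT_LINE = re.compile(
    r"^Seat\s(\d+):\s+([\w\-.]+)\s+\((\d+)\s+in\s+chips\)(.*)$",
    re.MULTILINE,
)

# One tuple per seat line: (seat number, player, chips, optional trailing text).
seats = [
    (int(m.group(1)), m.group(2), int(m.group(3)), m.group(4).strip())
    for m in SEAT_LINE.finditer(hand)
]

for seat in seats:
    print(seat)
```

The summary lines such as "Seat 1: Pucharrin (button) folded ..." are skipped because the parenthesised part is not "<number> in chips".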
-- EDIT --
Unless you want the entire block starting with PokerStars and stopping at the last )
Then maybe:
if (preg_match('/((?:PokerStars.*\s*Table.*[\r\n]*)(?:Seat.*\s+)+)/i', $subject, $regs)) {
    $result = $regs[0];
} else {
    $result = "";
}

Related

Extracting Multiple Blocks of Similar Text

I am trying to parse a report. The following is a sample of the text that I need to parse:
7605625112 DELIVERED N 1 GORDON CONTRACTORS I SIPLAST INC Freight Priority 2000037933 $216.67 1,131 ROOFING MATERIALS
04/23/2021 02:57 PM K WRISHT N 4 CAPITOL HEIGHTS, MD ARKADELPHIA, AR Prepaid 2000037933 -$124.23 170160-00
04/27/2021 12:41 PM 2 40 20743-3706 71923 $.00 055 $.00
2 WBA HOT $62.00 0
$12.92 $92.44
$167.36
7605625123 DELIVERED N 1 SECHRIST HALL CO SIPLAST INC Freight Priority 2000037919 $476.75 871 PAIL,UN1263,PAINT,3,
04/23/2021 02:57 PM S CHAVEZ N 39 HARLINGEN, TX ARKADELPHIA, AR Prepaid 2000037919 -$378.54
04/27/2021 01:09 PM 2 479 78550 71923 $.00 085 $95.35
2 HRL HOT $62.00 21
$13.55 $98.21
$173.76
The report is comprised of two or more blocks, each starting with "[0-9]{10}\sDELIVERED" and ending with the last currency string before the next block.
If I test with "(?s)([0-9]{10}\sDELIVERED)(.*)(?<=\$167.36\n)" I successfully get the first block, but if I use "(?s)([0-9]{10}\sDELIVERED)(.*)(?<=\$\d\d\d.\d\d\n)" it grabs everything.
If someone can show me the changes that I need to make to return two or more blocks I would greatly appreciate it.
* is a greedy operator, so it will try to match as many characters as possible. See also Repetition with Star and Plus.
For fixing it, you can use this regex:
(?s)(\d{10}\sDELIVERED)((.(?!\d{10}\sDELIVERED))*)(?<=\$\d\d\d.\d\d)
in which I replaced .* with (.(?!\d{10}\sDELIVERED))*, so that each character is consumed only if it is not immediately followed by \d{10}\sDELIVERED. The match therefore stops just before the next block begins.
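As a quick check of that pattern, here is a minimal Python sketch. It assumes a shortened version of the report (only a few lines of each block are kept, and the variable names are made up for illustration) and that Python's re module is an acceptable stand-in for the original regex engine; the lookbehind is fixed-width, so re accepts it.

```python
import re

# Shortened report: two blocks, each ending in a "$ddd.dd" amount.
report = "\n".join([
    "7605625112 DELIVERED N 1 GORDON CONTRACTORS I SIPLAST INC",
    "$12.92 $92.44",
    "$167.36",
    "7605625123 DELIVERED N 1 SECHRIST HALL CO SIPLAST INC",
    "$13.55 $98.21",
    "$173.76",
]) + "\n"

# Same pattern as above: every character is consumed only when the next
# block header does not start right after it.
BLOCK = re.compile(
    r"(?s)(\d{10}\sDELIVERED)((.(?!\d{10}\sDELIVERED))*)(?<=\$\d\d\d.\d\d)"
)

blocks = [m.group(0) for m in BLOCK.finditer(report)]

for block in blocks:
    print(block)
    print("-----")
```

Each element of blocks runs from a block header up to and including that block's final amount.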

substring characters from a column in a data.table in R

Is there a more "r" way to substring two meaningful characters out of a longer string from a column in a data.table?
I have a data.table that has a column with "degree strings"... shorthand code for the degree someone got and the year they graduated.
> srcDT<- data.table(
alum=c("Paul Lennon","Stevadora Nicks","Fred Murcury"),
degree=c("W72","WG95","W88")
)
> srcDT
alum degree
1: Paul Lennon W72
2: Stevadora Nicks WG95
3: Fred Murcury W88
I need to extract the digits of the year from the degree, and put it in a new column called "degree_year"
No problem:
> srcDT[,degree_year:=substr(degree,nchar(degree)-1,nchar(degree))]
> srcDT
alum degree degree_year
1: Paul Lennon W72 72
2: Stevadora Nicks WG95 95
3: Fred Murcury W88 88
If only it were always that simple.
The problem is, the degree strings only sometimes look like the above. More often, they look like this:
srcDT<- data.table(
alum=c("Ringo Harrison","Brian Wilson","Mike Jackson"),
degree=c("W72 C73","WG95 L95","W88 WG90")
)
I am only interested in the 2 numbers next to the characters I care about: W & WG (and if both W and WG are there, I only care about WG)
Here's how I solved it:
x <-srcDT$degree ##grab just the degree column
z <-character() ## create an empty character vector
degree.grep.pattern <-c("WG[0-9][0-9]","W[0-9][0-9]")
## define a vector of regex's, in the order
## I want them
for(i in 1:length(x)){ ## loop thru all elements in degree column
matched=F ## at the start of the loop, reset flag to F
for(j in 1:length(degree.grep.pattern)){
## loop thru all elements of the pattern vector
if(length(grep(degree.grep.pattern[j],x[i]))>0){
## see if you get a match
m <- regexpr(degree.grep.pattern[j],x[i])
## if you do, great! grab the index of the match
y<-regmatches(x[i],m) ## then subset down. y will equal "WG95"
matched=T ## set the flag to T
break ## stop looping
}
## if no match, go on to next element in pattern vector
}
if(matched){ ## after finishing the loop, check if you got a match
yr <- substr(y,nchar(y)-1,nchar(y))
## if yes, then grab the last 2 characters of it
}else{
## if you run thru the whole list and don't match any pattern at all, just
## take the last two characters from the affiliation
yr <- substr(x[i],nchar(as.character(x[i]))-1,nchar(as.character(x[i])))
}
z<-c(z,yr) ## add this result (95) to the character vector
}
srcDT$degree_year<-z ## set the column to the results.
> srcDT
alum degree degree_year
1: Ringo Harrison W72 C73 72
2: Brian Wilson WG95 L95 95
3: Mike Jackson W88 WG90 90
This works. 100% of the time. No errors, no mis-matches.
The problem is: it doesn't scale. Given a data table with 10k rows, or 100k rows, it really slows down.
Is there a smarter, better way to do this? This solution is very "C" to me. Not very "R."
Thoughts on improvement?
Note: I gave a simplified example. In the actual data, there are about 30 different possible combinations of degrees, and combined with different years, there are something like 540 unique combinations of degree strings.
Also, I gave the degree.grep.pattern with only 2 patterns to match. In the actual work I'm doing, there are 7 or 8 patterns to match.
Since it seems (per the OP's comments) that a "WG" degree never comes before a "W" degree in the same string, a simple regex solution should do the job:
srcDT[ , degree_year := gsub(".*WG?(\\d+).*", "\\1", degree)]
srcDT
# alum degree degree_year
# 1: Ringo Harrison W72 C73 72
# 2: Brian Wilson WG95 L95 95
# 3: Mike Jackson W88 WG90 90
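For comparison, the priority rule from the question (try WG first, then W, otherwise fall back to the last two characters) can be written without depending on the order of the degrees in the string. This is a minimal sketch in Python rather than R; the function name degree_year and the pattern list are assumptions made for illustration, and the real data with 7 or 8 patterns would need a longer list.

```python
import re

# Patterns in priority order: the first one that matches wins.
PATTERNS = [
    re.compile(r"\bWG(\d{2})\b"),
    re.compile(r"\bW(\d{2})\b"),
]

def degree_year(degree):
    """Return the two-digit year of the highest-priority degree."""
    for pattern in PATTERNS:
        match = pattern.search(degree)
        if match:
            return match.group(1)
    # No pattern matched: fall back to the last two characters.
    return degree[-2:]

degrees = ["W72 C73", "WG95 L95", "W88 WG90"]
years = [degree_year(d) for d in degrees]
print(years)
```

Because the patterns are tried in a fixed order, a string such as "WG90 W88" would still return the WG year.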
Here's a solution based on the assumption that you want the most recent degree with W in it:
regex <- "(?<=W|(?<=W)G)[0-9]{2}"
srcDT[ , degree_year :=
sapply(regmatches(degree,
gregexpr(regex, degree, perl = TRUE)),
function(x) max(as.integer(x)))]
> srcDT
alum degree degree_year
1: Ringo Harrison W72 C73 72
2: Brian Wilson WG95 L95 95
3: Mike Jackson W88 WG90 90
You said:
I gave the degree.grep.pattern with only 2 patterns to match. In the actual work I'm doing, there are 7 or 8 patterns to match.
But I'm not sure what this means. Are there more options besides W and WG?
Here is one quick hack:
# split all words from degree and order so that WG is before W
words <- lapply(strsplit(srcDT$degree, " "), sort, decreasing=TRUE)
# obtain tags for each row (getting only first. But works since ordered)
tags <- mapply(Find, list(function(x) grepl("^WG|^W", x)), words)
# simple gsub to remove WG and W
(result <- gsub("^WG|^W", "", tags))
[1] "72" "95" "90"
Fast with 100k rows.
Here is a solution without regular expressions. It's quite slow because it creates a sparse table, but it's clean and flexible, so I'll leave it here.
First I split the degree strings by space, then loop through the pieces and build a clean, structured table with one column per degree, which I fill with the years.
degreeyear_split <- sapply(srcDT$degree,strsplit," ")
for(i in 1:nrow(srcDT)){
for (degree_year in degreeyear_split[[i]]){
n <- nchar(degree_year)
degree <- substr(degree_year,1,n-2)
year <- substr(degree_year,n-1,n)
srcDT[i,degree] <- year
}}
Now that I have my structured table, I copy the W year into a new year column, then overwrite it with the WG year wherever one exists.
srcDT$year <- srcDT$W
srcDT$year[srcDT$WG!=""]<-srcDT$WG[srcDT$WG!=""]
Then here's your result:
srcDT
alum degree W C WG L year
1: Ringo Harrison W72 C73 72 73 72
2: Brian Wilson WG95 L95 95 95 95
3: Mike Jackson W88 WG90 88 90 90

Sentence detection and extraction into same data frame

I have the following data frame:
reviews <- data.frame(value = c("Product was received in excellent condition. Made with high quality materials. Very Good product",
"Inexpensive. An improvement over integrated graphics.",
"I love that product so excite. I will order again if I need more .",
"Excellent card, great graphics."),
user = c(1,2,3,4),
Review_Id = c("101968","101968","210546","112546"),
stringsAsFactors = FALSE)
and I need the following output:
user review_Id sentence
1 101968 Made with high quality materials.
1 101968 Very Good product
2 101968 Inexpensive.
2 101968 An improvement over integrated graphics.
3 210546 I love that product so excite.
3 210546 I will order again if I need more .
4 112546 Excellent card, great graphics.
I was wondering about something like this: sent_detect(reviews$value)
But how could I combine that function with the data frame to get the desired output?
If your data really are so tidy, you can just use cSplit from my "splitstackshape" package.
library(splitstackshape)
cSplit(reviews, "value", ".", direction = "long")
# value user Review_Id
# 1: Product was received in excellent condition 1 101968
# 2: Made with high quality materials 1 101968
# 3: Very Good product 1 101968
# 4: Inexpensive 2 101968
# 5: An improvement over integrated graphics 2 101968
# 6: I love that product so excite 3 210546
# 7: I will order again if I need more 3 210546
# 8: Excellent card, great graphics 4 112546
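Note that splitting on "." drops the periods themselves. If they need to be kept, one option is to split on the whitespace that follows a period. The following is a minimal sketch in Python rather than R, using a plain list of dicts in place of the data frame; the names split_sentences and rows are made up for illustration, and the rule is naive (it would also split after abbreviations such as "e.g.").

```python
import re

reviews = [
    {"user": 1, "review_id": "101968",
     "value": "Product was received in excellent condition. "
              "Made with high quality materials. Very Good product"},
    {"user": 2, "review_id": "101968",
     "value": "Inexpensive. An improvement over integrated graphics."},
    {"user": 3, "review_id": "210546",
     "value": "I love that product so excite. "
              "I will order again if I need more ."},
    {"user": 4, "review_id": "112546",
     "value": "Excellent card, great graphics."},
]

def split_sentences(text):
    # Split on whitespace preceded by a period; the period stays attached.
    return [s for s in re.split(r"(?<=\.)\s+", text.strip()) if s]

# Long format: one row per sentence, with user and review id repeated.
rows = [
    {"user": r["user"], "review_id": r["review_id"], "sentence": s}
    for r in reviews
    for s in split_sentences(r["value"])
]

for row in rows:
    print(row["user"], row["review_id"], row["sentence"])
```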

LibreCalc Search and Replace, Search for [] and replace it, along with its contents

I am compiling a list of video games.
I am currently using Wikipedia to do so.
When I copied the PS3 games over to LibreCalc, the copied titles included citation brackets at the end of the line. Rather than remove these line by line, I am trying to search for and replace the brackets and their contents.
I continue to fail in this endeavor. An example is below:
Rune Factory: Tides of Destiny[629]
Fight Night Champion[268]
Dragon Age II[209]
Major League Baseball 2K11[427]
MLB 11: The Show[459]
Warriors: Legends of Troy[817]
Dynasty Warriors 7[222]
Homefront[334]
Top Spin 4[773]
MotorStorm: Apocalypse[474]
Crysis 2[164]
Lego Star Wars III: The Clone Wars
The Tomb Raider Trilogy[765]
NASCAR 2011: The Game[488]
Shift 2: Unleashed[650]
Tiger Woods PGA Tour 12: The Masters[746]
WWE All Stars[839]
Michael Jackson: The Experience[448]
Rio[614]
Mortal Kombat[469]
Portal 2[563]
SOCOM 4: U.S. Navy SEALs[20]
AFL Live[16]
Operation Flashpoint: Red River[542]
Man vs. Wild[430]
Sniper: Ghost Warrior[679]
El Shaddai: Ascension of the Metatron[233]
Virtua Tennis 4[808]
Thor: God of Thunder[740]
MX vs. ATV Alive[478]
Brink[116]
Lego Pirates of the Caribbean: The Video Game[391]
Battle vs. Chess[82]
L.A. Noire[379]
Dirt 3[196]
Kung Fu Panda 2[377]
Hunted: The Demon's Forge[336]
Infamous 2[345]
Red Faction: Armageddon[599]
Yakuza: Dead Souls[849]
Duke Nukem Forever[217]
Alice: Madness Returns[29]
Child of Eden[146]
Transformers: Dark of the Moon[777]
Dungeon Siege III[218]
Cars 2: The Video Game[138]
F.E.A.R. 3[247]
Shadows of the Damned[647]
Atelier Meruru: The Apprentice of Arland[67]
Bleach: Soul Resurrección[108]
Angel Love Online[38]
Angel Senki
Air Conflicts: Secret Wars[24]
Harry Potter and the Deathly Hallows: Part II[322]
NCAA Football 12[511]
Captain America: Super Soldier[137]
Call of Juarez: The Cartel[136]
Phineas and Ferb: Across the 2nd Dimension[558]
Hyperdimension Neptunia Mk2[338]
Deus Ex: Human Revolution[191]
Bodycount[111]
Madden NFL 12[415]
Driver: San Francisco[216]
Dead Island[175]
Resistance 3[609]
Warhammer 40000: Space Marine[815]
NHL 12[526]
Tales of Xillia[718]
God of War: Origins Collection[298]
Tom Clancy's Splinter Cell Classic Trilogy HD[762]
Supremacy MMA[712]
Dark Souls[169]
Ico & Shadow of the Colossus Collection[340]
FIFA 12[263]
PES 2012: Pro Evolution Soccer[557]
Dynasty Warriors 7: Xtreme Legends[223]
Ra.One: The Game[584]
Crysis[163]
Rage[586]
Spider-Man: Edge of Time[692]
NBA 2K12[498]
The Cursed Crusade[733]
Ace Combat: Assault Horizon[12]
Skylanders: Spyro's Adventure[675]
Batman: Arkham City[79]
Ratchet & Clank: All 4 One[591]
Rocksmith[627]
The Sims 3: Pets[658]
The Adventures of Tintin: The Secret of the Unicorn[14]
Back to the Future: The Game[71]
Battlefield 3[83]
Dragon Ball Z: Ultimate Tenkaichi[212]
Puss in Boots[581]
The Idolmaster 2[736]
Uncharted 3: Drake's Deception[795]
GoldenEye 007: Reloaded[301]
The Lord of the Rings: War in the North[401]
Sonic Generations[683]
Call of Duty: Modern Warfare 3[131]
Metal Gear Solid HD Collection[445]
The Elder Scrolls V: Skyrim[236]
Lego Harry Potter: Years 5–7[388]
Assassin's Creed: Revelations[63]
Jurassic Park: The Game[358]
Cartoon Network: Punch Time Explosion XL[141]
Need for Speed: The Run[516]
Saints Row: The Third[633]
Apache: Air Assault[42]
After Hours Athletes[21]
Ni no Kuni[531]
WWE '12[838]
The King of Fighters XIII[371]
Just Dance 3[361]
Order Up![543]
Final Fantasy XIII-2[273]
Zack Zero[853]
Armored Core V[52]
NeverDead[520]
Soulcalibur V[689]
Kingdoms of Amalur: Reckoning[374]
The Darkness II[735]
Grand Slam Tennis 2[306]
Twisted Metal[787]
UFC Undisputed 3[792]
Binary Domain[95]
Asura's Wrath[64]
Syndicate[714]
Gal*Gun[288]
Naruto Shippuden: Ultimate Ninja Storm Generations[485]
SSX[698]
One Piece: Pirate Warriors[539][540]
Blades of Time[101]
Major League Baseball 2K12[428]
Mass Effect 3[436]
MLB 12: The Show[460]
Street Fighter X Tekken[706]
Top Gun: Hard Lock[771]
Mobile Suit Gundam Unicorn[465]
FIFA Street[265]
Silent Hill: Downpour[653]
Silent Hill HD Collection[654]
Ninja Gaiden 3[535]
Resident Evil: Operation Raccoon City[605]
Ridge Racer Unbounded[613]
Battleship[87]
Prototype 2[577]
Max Payne 3[438]
Dragon's Dogma[214]
Tom Clancy's Ghost Recon: Future Soldier[756]
Dirt: Showdown[197]
Inversion[350]
Tokyo Jungle[753]
Part of my problem seems to be that brackets are special characters in regular expressions.
Can someone assist me, or point me in the right direction to solve this problem?
You can escape the brackets with a backslash so they are treated as regular characters. On that basis, you could use the following regex to match all square brackets containing only digits:
\[[:digit:]*\]
If you leave the Replace with box empty, a search/replace run should remove all footnote marks in your example.
Since only the opening bracket is a special character for LO Calc, the following should work, too:
\[[:digit:]*]
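The same idea can be checked outside of Calc. Here is a minimal Python sketch, assuming the titles are available as a list of strings; the name strip_citations is made up for illustration, and \d* plays the role of [:digit:]*.

```python
import re

# An escaped "[" followed by zero or more digits and an escaped "]".
CITATION = re.compile(r"\[\d*\]")

def strip_citations(title):
    # Every bracketed footnote mark is removed, including repeated ones.
    return CITATION.sub("", title)

titles = [
    "Rune Factory: Tides of Destiny[629]",
    "Lego Star Wars III: The Clone Wars",
    "One Piece: Pirate Warriors[539][540]",
]
cleaned = [strip_citations(t) for t in titles]

for title in cleaned:
    print(title)
```

Titles without a citation are left unchanged, and a title with two citations loses both.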

Replace words in program generated txt files in batch with loop

The poker client generates hand-history logs as txt files. I would like a program that edits these hand histories automatically, adding a "$" sign in front of the numbers and writing the results as new txt files in a new directory, so that I can have the new files analyzed by other software.
Below is a sample hand-history log:
Full Tilt Poker Game #23461961057: Table .COM Play 463 (deep) - 3000/6000 - No Limit Hold'em - 15:16:29 ET - 2010/08/29
Seat 2: Player1 (795,425)
Seat 5: Player2 (1,200,000)
Player1 posts the small blind of 3,000
Player2 posts the big blind of 6,000
The button is in seat #2
**** HOLE CARDS ****
Dealt to Player1 [Ac 4c]
Player1 raises to 12,000
Player2 raises to 687,000
Player1 raises to 795,425, and is all in
Player2 folds
Player2 adds 687,000
Uncalled bet of 108,425 returned to Player1
Player1 mucks
Player1 wins the pot (1,374,000)
*** SUMMARY ***
Total pot 1,374,000 | Rake 0
Seat 2: Player1 (small blind) collected (1,374,000), mucked
Seat 5: Player2 (big blind) folded before the Flop
Below is the processed file I'd like:
Full Tilt Poker Game #23461961057: Table .COM 463 (deep) - $3000/$6000 - No Limit Hold'em - 15:16:29 ET - 2010/08/29
Seat 2: Player1 ($795,425)
Seat 5: Player2 ($1,200,000)
Player1 posts the small blind of $3,000
Player2 posts the big blind of $6,000
The button is in seat #2
*** HOLE CARDS ***
Dealt to Player1 [Ac 4c]
Player1 raises to $12,000
Player2 raises to $687,000
Player1 raises to $795,425, and is all in
Player2 folds
Player2 adds $687,000
Uncalled bet of $108,425 returned to Player1
Player1 mucks
Player1 wins the pot ($1,374,000)
*** SUMMARY ***
Total pot $1,374,000 | Rake $0
Seat 2: Player1 (small blind) collected ($1,374,000), mucked
Seat 5: Player2 (big blind) folded before the Flop
I did some research and found AutoHotKey as a possible tool for this, but I am a newbie when it comes to programming, and regular expressions are frying my brain as I type this. Any help would be appreciated.
(?<!Seat )(?<![a-zA-Z#0-9])([0-9]+(?:,[0-9]+)*)
Replace with $\1
Essentially, this finds all numbers, optionally separated by commas, that are not preceded by '#', a letter, or "Seat " (since I noticed that you didn't add $ in the strings "Full Tilt Poker Game #23461961057" and "Seat 2"). The digits in the second lookbehind keep a match from starting in the middle of a number, so "#23461961057" is left alone instead of becoming "#2$3461961057". The quantifiers are greedy, so a number divided by commas is matched as a whole.
If you're using JavaScript, note that lookbehind is only available in engines that support ES2018 or later.
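Since the question also asks for a batch run over many files, here is a minimal sketch of how the substitution could be wrapped in a loop. It is written in Python rather than AutoHotKey, and the function names and directory arguments are hypothetical. The lookbehind class here includes digits (0-9) so that a match cannot begin in the middle of a number such as the game ID. The pattern is still a blunt tool: times, dates, and card ranks such as "4c" are also prefixed with "$", so it would need further exclusions before real use.

```python
import re
from pathlib import Path

NUMBER = re.compile(r"(?<!Seat )(?<![a-zA-Z#0-9])([0-9]+(?:,[0-9]+)*)")

def add_dollar_signs(text):
    # "$" is a literal character in a Python replacement string.
    return NUMBER.sub(r"$\1", text)

def convert_directory(src_dir, dst_dir):
    """Write a converted copy of every .txt file in src_dir to dst_dir."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for path in Path(src_dir).glob("*.txt"):
        text = path.read_text(encoding="utf-8")
        (dst / path.name).write_text(add_dollar_signs(text), encoding="utf-8")

sample = [
    "Seat 5: Player2 (1,200,000)",
    "Player1 raises to 795,425, and is all in",
    "Total pot 1,374,000 | Rake 0",
    "The button is in seat #2",
]
converted = [add_dollar_signs(line) for line in sample]

for line in converted:
    print(line)
```

A call such as convert_directory("handhistory", "handhistory_usd") (both directory names are placeholders) would then process a whole folder while leaving the original files untouched.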