Splitting a title into separate parts - regex

I need to split a string of the form
2,9.1,The Godfather (1972), (it's a csv line)
to:
2
9.1
The Godfather
1972
Any ideas for a good regular expression?
BTW, if you know a good tool that builds regular expressions from examples you provide, that would be great.
I'm a bit new to this.
Thanks!

(\d+),(\d+\.\d+),(.*?) \((\d{4})\)
^^^^^ ^^^^^^^^^^ ^^^^^   ^^^^^^^
2     9.1        Title   Year

I wouldn't recommend using a regex to split CSV files, since it can't handle comma escaping well. That said, how about using the simplest available solution?
A simple regex like this should solve your problem:
'(.*?),(.*?),(.*?) \((\d+)\)'

A little time with Google gave me this: /,(?!(?:[^",]|[^"],[^"])+")/. Seems to split CSV just fine.
>>> '2,9.1,The Godfather (1972)'.split(/,(?!(?:[^",]|[^"],[^"])+")/)
["2", "9.1", "The Godfather (1972)"]

If you are sure that the format is static, you can use this:
(\d+),(\d+\.\d+),(.*?) \((\d+)\)
But if it can contain more information, use a real CSV parser to read the line and then just split The Godfather (1972) using (.*?) \((\d+)\).

CSV has a lot of corner cases, and a regexp approach might take you into a world of pain.
For example, if the title has a comma in it, the title would be double-quoted, which would break all of the regexps given so far.
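A minimal Python sketch of the "real CSV parser first, regex second" approach suggested above. The function name parse_line and the second sample line are made up for illustration; the only assumptions are that the first three fields are rank, rating and "Title (Year)", and that the year has four digits.

```python
import csv
import io
import re

# Matches "Title (1972)" and captures the title and the four-digit year.
TITLE_YEAR = re.compile(r"^(.*?) \((\d{4})\)$")

def parse_line(line):
    """Parse one CSV line into (rank, rating, title, year)."""
    # csv.reader understands quoting, so a comma inside a quoted title is safe.
    rank, rating, title_field = next(csv.reader(io.StringIO(line)))[:3]
    match = TITLE_YEAR.match(title_field.strip())
    if match is None:
        raise ValueError(f"unexpected title format: {title_field!r}")
    title, year = match.groups()
    return int(rank), float(rating), title, int(year)

print(parse_line('2,9.1,The Godfather (1972),'))
# Hypothetical line with a quoted title that contains a comma.
print(parse_line('5,8.9,"The Good, the Bad and the Ugly (1966)"'))
```

The regex only ever sees the already-isolated title field, so the quoting and escaping rules stay the CSV parser's problem.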

Related

Regex to emulate GitHub autolink references in Markdown

What would be the regex emulating GitHub's autolinked references?
It takes Markdown on input and outputs enriched Markdown where strings like #123 are converted to [#123](https://github.com/owner/repo/issues/123).
These are some examples of the transformations that I'd like the regex to do:
Input:
1. #123
2. https://github.com/owner/repo/issues/123
3. https://github.com/owner/repo/pull/456
4. owner/repo#123
5. https://github.com/owner/repo/issues/123#issuecomment-123456789
Output:
1. [#123](https://github.com/owner/repo/issues/123)
2. [#123](https://github.com/owner/repo/issues/123)
3. [#456](https://github.com/owner/repo/pull/456)
4. [owner/repo#123](https://github.com/owner/repo/issues/123)
5. [#123 (comment)](https://github.com/owner/repo/issues/123#issuecomment-123456789)
I'd prefer one giant regex if possible (I know it's not going to be nice but would allow me to process Markdown in a couple of my favorite editors directly).
If you don't mind changing the format a little (using [#123-comment] instead of [#123 (comment)] for comments), you may use this:
(?:(owner/repo)?#(\d+)\b|https?://github\.com/([^/]+/[^/]+/(?:issues|pull))/(\d+)(#issue(comment)(-)\d+)?)
Replace by: [\1#\2\4\7\6](https://github.com/owner/repo/issues/\2\4\5)
You have a demo here.
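If a script is acceptable instead of a single editor replacement, the five cases can be handled with one pattern plus a replacement function. This is only a sketch in Python, not GitHub's actual algorithm: DEFAULT_REPO, the group names and the autolink function are invented here, full URLs are labelled with the number found in the URL, and text that is already a Markdown link is not detected, so running it twice would wrap links again.

```python
import re

# Assumption: the repository that a bare "#123" refers to.
DEFAULT_REPO = "owner/repo"

AUTOLINK = re.compile(
    # Full issue or pull request URL, optionally with a comment anchor.
    r"https?://github\.com/(?P<url_repo>[^/\s]+/[^/\s]+)/(?P<kind>issues|pull)/(?P<url_num>\d+)"
    r"(?P<comment>#issuecomment-\d+)?"
    # Short reference: "#123" or "owner/repo#123".
    r"|(?P<ref_repo>[\w.-]+/[\w.-]+)?#(?P<ref_num>\d+)\b"
)

def _link(match):
    if match.group("url_num"):
        # Keep the original URL as the link target and shorten the label.
        label = "#" + match.group("url_num")
        if match.group("comment"):
            label += " (comment)"
        return f"[{label}]({match.group(0)})"
    repo = match.group("ref_repo") or DEFAULT_REPO
    number = match.group("ref_num")
    return f"[{match.group(0)}](https://github.com/{repo}/issues/{number})"

def autolink(markdown):
    """Wrap GitHub-style references in Markdown links."""
    return AUTOLINK.sub(_link, markdown)

print(autolink("See #123 and owner/repo#123."))
print(autolink("https://github.com/owner/repo/issues/123#issuecomment-123456789"))
```

The URL alternative is listed first so that the "#issuecomment-..." fragment is consumed as part of the URL rather than being picked up as a short reference.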
I'd still prefer a (complex) regex but if anyone is looking for the same post-processing like me, this package can solve it in a Node.js script:
https://github.com/remarkjs/remark-github

How do I capture text between two string values? [duplicate]

The Problem
I have a hobby project that aims to sync multiple calendars by requesting the ics file from each person's calendar in a group, in order to find the optimal time to plan a meeting. A sort of lazy way to schedule meetings :)
But I got stuck on reading the ics file for two reasons:
I don't really understand regex.
And I don't know how to achieve my goal with string manipulation.
The ics file is already structured, so I know that I want to start from BEGIN:VEVENT and gather the text down to END:VEVENT.
At a later stage I want every event to become a class, so I can read the data and come up with a decision to present to the end user.
Background
I tried the regex BEGIN:VEVENT(?:[\w\s\:\#\.\;\-\=\ä\å\ö\\\,\/\#]*)END:VEVENT but that is not a valid approach, because it gathers all of the events into one match and does not divide them into separate groups.
I have been using regexer.com to test my regex.
Not code, but what I am working with
This is some of the text from the ics file:
BEGIN:VEVENT
DTSTART:20121220T180000Z
DTEND:20121220T190000Z
DTSTAMP:20190503T064840Z
UID:SomeHash@google.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED;CN=Name;X-NUM-GUESTS=0:mailto:MailAddress
CREATED:20121212T061002Z
LAST-MODIFIED:20121212T061003Z
LOCATION:ALocation
SEQUENCE:1
STATUS:TENTATIVE
SUMMARY:SomeText
TRANSP:OPAQUE
CATEGORIES:http://schemas.google.com/g/2005#event
END:VEVENT
BEGIN:VEVENT
DTSTART:20121213T143000Z
DTEND:20121213T153000Z
DTSTAMP:20190503T064840Z
UID:SomeHash@google.com
ATTENDEE;CUTYPE=INDIVIDUAL;ROLE=REQ-PARTICIPANT;PARTSTAT=ACCEPTED;CN=Name;X-NUM-GUESTS=0:mailto:MailAddress
CREATED:20121212T061146Z
LAST-MODIFIED:20121212T061146Z
LOCATION:ALocation
SEQUENCE:1
STATUS:TENTATIVE
SUMMARY:SomeText
TRANSP:OPAQUE
CATEGORIES:http://schemas.google.com/g/2005#event
END:VEVENT
Desired outcome
I want to get an array of matches containing the event text, so I can split it further and create classes.
Disclaimer
As this is a hobby project, I want to take on the challenge and not use a plugin or helper library. But links to these are appreciated if I can see how they solve the problem.
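Try using
BEGIN:VEVENT([\s\S]*?)END:VEVENT
The lazy quantifier *? stops at the first END:VEVENT, so each event becomes its own match. I use Regex101.com for testing regexes. Hope this helps.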
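To see the pattern above in action, here is a small Python sketch (the question is about C#, so treat it as an illustration of the regex only). ICS_SAMPLE is a shortened version of the data in the question, and split_events is a made-up helper name. It assumes no folded (continued) lines, and if a property repeats within an event, the last value wins.

```python
import re

# Shortened sample based on the events in the question.
ICS_SAMPLE = (
    "BEGIN:VEVENT\r\n"
    "DTSTART:20121220T180000Z\r\n"
    "SUMMARY:SomeText\r\n"
    "END:VEVENT\r\n"
    "BEGIN:VEVENT\r\n"
    "DTSTART:20121213T143000Z\r\n"
    "SUMMARY:SomeText\r\n"
    "END:VEVENT\r\n"
)

# [\s\S] matches any character including newlines; *? keeps each match as short as possible.
EVENT = re.compile(r"BEGIN:VEVENT([\s\S]*?)END:VEVENT")

def split_events(ics_text):
    """Return one dict of property name -> value per VEVENT block."""
    events = []
    for body in EVENT.findall(ics_text):
        props = {}
        for line in body.strip().splitlines():
            # Split at the first colon; parameters such as ";CN=Name" stay in the key.
            key, _, value = line.partition(":")
            props[key] = value
        events.append(props)
    return events

events = split_events(ICS_SAMPLE)
print(len(events))           # 2
print(events[1]["DTSTART"])  # 20121213T143000Z
```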
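Regular expressions are usually a good fit for extracting text. But in this simple case you could try something like this:
var preambleLength = "BEGIN:VEVENT\r\n".Length;
// Takes everything between the first BEGIN:VEVENT line and the last END:VEVENT, so it assumes the text starts with BEGIN:VEVENT and holds a single event.
var text = ics.Substring(preambleLength, ics.LastIndexOf("\r\nEND:VEVENT") - preambleLength);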

Emacs regexp - replacing text strings, query replace regexp

It seems simple enough but I can't get it done.
My text file looks like this :
Johnson Cary, 2009, This important article, 109 pages.
Smith Tom, 2003, Much ado about nothing: a study, 89 pages.
I need this :
Johnson Cary%2009%This important article%109 pages.
Any special character unlikely to appear in text will do. The end goal is to end up with a .csv then a .xls file.
I am using
^\([^,]+\)\([,]\)
to find the first occurring comma but when I try to replace with
\1 %
it does not work, nor any kind of close combination of that sort for that matter.
Any help will be dearly welcome!
Thank you much in advance.
Replace this:
^\([^,]*\), \([^,]*\), \([^,]*\), \(.*\)$
with this:
\1%\2%\3%\4
to get the correct result.
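For anyone doing the same conversion outside Emacs, the substitution above translates almost directly to Python. This is just a sketch of the same pattern in Python's regex syntax (unescaped parentheses for groups), and like the Emacs version it assumes exactly three ", " separators per line, so a title that itself contains a comma would be split in the wrong place.

```python
import re

text = (
    "Johnson Cary, 2009, This important article, 109 pages.\n"
    "Smith Tom, 2003, Much ado about nothing: a study, 89 pages.\n"
)

# re.MULTILINE makes ^ and $ match at the start and end of every line.
pattern = re.compile(r"^([^,]*), ([^,]*), ([^,]*), (.*)$", re.MULTILINE)
converted = pattern.sub(r"\1%\2%\3%\4", text)
print(converted)
```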

Using regex to eliminate chunks in a file (categorized events in iCal file)

I have one .ics file from which I would like to create individual new .ics files depending on the event categories (I can't get egroupware to export only events of one category, I want to create new calendars depending on category). My intended approach is to repeatedly eliminate all events but those of one category and then save the file using EditPad Lite 7 (Windows).
I am struggling to get the regular expression right. .+? is still too greedy and negating the string (e.g. to eliminate all but events from one category) doesn't work either.
Sample
BEGIN:VEVENT
DESCRIPTION:Event 2
END:VEVENT
BEGIN:VEVENT
DESCRIPTION:Event 3
CATEGORIES:Sports
END:VEVENT
BEGIN:VEVENT
DESCRIPTION:Event 4
END:VEVENT
The regex BEGIN:VEVENT.+?CATEGORIES:Sports.+?END:VEVENT should only match sports events but it catches everything from the first BEGIN to the first END following the category.
Edit: negating doesn't work either: BEGIN:VEVENT.+?((?!CATEGORIES:Sports).).+?END:VEVENT.
What am I missing? Any pointers are highly appreciated.
I guess newlines are removed or ignored, because your regex does not care about them.
I only have a correction to the match after CATEGORIES
BEGIN:VEVENT.+?CATEGORIES:Sports.*?END:VEVENT
                                 ^
                                 Zero or more
The first part of your regex looks good; maybe the regex engine in EditPad is not so good.
Try it with a different editor or scripting language (like Eclipse, Perl, Notepad++ or Notepad2).
You could split the input and then grep the matching Sports events
@sportevents = grep /Sports/, split /END:VEVENT/, $input;
map $_.="END:VEVENT", @sportevents;
This was Perl; maybe you can launch a script from EditPad to do it.
The second line just restores the END:VEVENT that was stripped during split.
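The split-and-filter idea can be sketched in Python as well, alongside a single-pattern variant. The single pattern is not taken from the answers above: it uses a lookahead, (?:(?!END:VEVENT).)*?, so the part before CATEGORIES can never run past an END:VEVENT. In the original regex a lazy .+? is still allowed to cross event boundaries when that is the only way to reach CATEGORIES:Sports, which is probably why it grabs the preceding events too. The ICS string below mirrors the sample in the question.

```python
import re

ICS = (
    "BEGIN:VEVENT\nDESCRIPTION:Event 2\nEND:VEVENT\n"
    "BEGIN:VEVENT\nDESCRIPTION:Event 3\nCATEGORIES:Sports\nEND:VEVENT\n"
    "BEGIN:VEVENT\nDESCRIPTION:Event 4\nEND:VEVENT\n"
)

# Approach 1: cut the text into whole events first, then filter by category.
EVENT = re.compile(r"BEGIN:VEVENT.*?END:VEVENT", re.DOTALL)
sports = [event for event in EVENT.findall(ICS) if "CATEGORIES:Sports" in event]

# Approach 2: one pattern. Each character before CATEGORIES is only consumed
# if END:VEVENT does not start at that position.
TEMPERED = re.compile(
    r"BEGIN:VEVENT(?:(?!END:VEVENT).)*?CATEGORIES:Sports.*?END:VEVENT",
    re.DOTALL,
)
sports_single = TEMPERED.findall(ICS)

print(sports == sports_single)  # True
print(sports[0])
```

Whether the second pattern works inside EditPad depends on its regex engine supporting lookahead and a "dot matches newline" option, which is an assumption to check there.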
OK. Solved it. I found something here which can be used to split ics files. I tweaked it to use the category rather than the summary in the file name and then merged the individually generated files according to category. I added the usual ics header and footer to all files and, voilà, I had individual calendar files.

Transform a date using a regular expression from 24Dec to 24/12 or 24/12/2009

I get travel confirmations that look like this:
"SQ 966 E 27JUL SINCGK"
= "Airline Space Flight Space BookingClass Space Date_with_Month_as_name Space 3LetterFrom 2LetterTo".
I can chop all this into pieces using a regex to submit it to a website. But instead of 27JUL the site expects 27/07/2009, or at least 27/07. Is there a way to transform a regex result based on a piece of the input: Jan -> 01, Feb -> 02 ... Dec -> 12?
(Regex flavour is Java)
DateFormat is a more appropriate class:
DateFormat output = new SimpleDateFormat("dd/MM", Locale.US);
DateFormat input = new SimpleDateFormat("dd MMM", Locale.US);
System.out.println(output.format(input.parse("24 Dec")));
output:
24/12
In Perl syntax (s{pattern}{replacement}):
s{([0-9][0-9])JAN}{\1/01}
s{([0-9][0-9])FEB}{\1/02}
s{([0-9][0-9])MAR}{\1/03}
s{([0-9][0-9])APR}{\1/04}
s{([0-9][0-9])MAY}{\1/05}
s{([0-9][0-9])JUN}{\1/06}
s{([0-9][0-9])JUL}{\1/07}
s{([0-9][0-9])AUG}{\1/08}
s{([0-9][0-9])SEP}{\1/09}
s{([0-9][0-9])OCT}{\1/10}
s{([0-9][0-9])NOV}{\1/11}
s{([0-9][0-9])DEC}{\1/12}
(Yes this is long and ugly, but it would probably work).
I would be very careful with doing this with regular expressions as they don't tell you how the conversion went.
Extract every bit of information manually. Sanity check everything, and then use the SimpleDateFormat parser to get a Date object you can use from there on.
It isn't a regex solution, but you could use SimpleDateFormat to help you with your final formatting. Note from the JavaDoc that it is not thread-safe out of the box.
Alternatively, you could use DateFormatSymbols.getShortMonths() and iterate over the months to identify the index* and format your string manually.
*don't forget to add 1 ;)
edit:
I am not sure what you are looking for is possible in Java regex without the ability to make code changes. The conditional constructs that Perl supports are not supported by Java because Java provides if-then-else support as a language feature.
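The month-lookup idea (find the month's index, add 1, format manually) needs a little code around the regex in any language. Here is a rough sketch in Python rather than Java, purely to show the shape: MONTHS, MONTH_NUMBER and convert_dates are names invented for this example, the month names are assumed to be upper-case English abbreviations as in the sample, and the day is not range-checked.

```python
import re

MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]
# Month name -> 1-based month number (the "don't forget to add 1" step).
MONTH_NUMBER = {name: index + 1 for index, name in enumerate(MONTHS)}

# Two digits followed directly by a month abbreviation, e.g. "27JUL".
DATE = re.compile(r"\b(\d{2})(" + "|".join(MONTHS) + r")\b")

def convert_dates(text, year=None):
    """Rewrite ddMMM tokens as dd/MM, or dd/MM/yyyy when a year is given."""
    def replace(match):
        day = match.group(1)
        month = MONTH_NUMBER[match.group(2)]
        result = f"{day}/{month:02d}"
        return f"{result}/{year}" if year is not None else result
    return DATE.sub(replace, text)

print(convert_dates("SQ 966 E 27JUL SINCGK"))        # SQ 966 E 27/07 SINCGK
print(convert_dates("SQ 966 E 27JUL SINCGK", 2009))  # SQ 966 E 27/07/2009 SINCGK
```

The lookup table does the work that a pure regex replacement cannot, namely computing the replacement text from what was matched.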