Separate text from hyphens line - regex

I am wondering if there is a better approach than what I am currently taking to parse this file. I have a string that is in the general format of:
[Chunk of text]
--------------------
[Another chunk of text]
(There can be multiple chunks of text with the same separator between them)
I am trying to parse the chunks of text into elements of a list, which I can do with data.split('-'*20) [in this case]. However, if there are not exactly 20 hyphens, the split will not work as intended. I have been playing around with regex, but I am currently unsure of a proper regex that could be used.
Are there any better methods that I should use in this situation, or is there a regex I should use as opposed to the .split() method?

You want a regex split. I'm not python-literate, but I found the function in the official 2.7.10 documentation and modified it for your case:
>>> re.split(r'\n-{4,}\n', data)
4 is the minimum number of dashes you want to match.
The \n on each side are the newlines before and after the dashes. You probably don't want those in your text.
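As a minimal sketch of the split described above (the sample string and the name data are made up for illustration), the separator lines below have different lengths and the split still works:

```python
import re

# Hypothetical sample: two separator lines of different lengths.
data = "chunk one\n--------------------\nchunk two\n-------\nchunk three"

# Split on any line made of four or more hyphens, consuming the
# newline on each side so it does not end up in the chunks.
chunks = re.split(r"\n-{4,}\n", data)

print(chunks)  # ['chunk one', 'chunk two', 'chunk three']
```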

I would try to use re.split() with the regex --+ which means:
- - one hyphen
-+ - one or more hyphens
This way it would not match a single hyphen, only runs of two or more. Alternatively, you could use -{2,}, which also means two or more.
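One caveat worth checking, sketched here under the assumption that a chunk may itself contain a double hyphen: an unanchored -{2,} also splits inside the text, while requiring a newline on each side limits the split to separator lines. The sample string is hypothetical.

```python
import re

# Hypothetical sample where the first chunk contains "--" in its text.
data = "first -- chunk\n----------\nsecond chunk\n---\nthird chunk"

# Unanchored: also splits on the "--" inside the first chunk.
loose = re.split(r"-{2,}", data)

# Requiring a newline on both sides only matches whole separator lines.
strict = re.split(r"\n-{2,}\n", data)

print(len(loose))  # 4
print(strict)      # ['first -- chunk', 'second chunk', 'third chunk']
```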

Related

Regex: Trying to extract all values (separated by new lines) within an XML tag

I have a project that demands extracting data from XML files (values inside the <Number>... </Number> tag), however, in my regular expression, I haven't been able to extract lines that had multiple data separated by a newline, see the below example:
As you can see above, I couldn't replicate the multiple lines detection by my regular expression.
If you are using a script somewhere, your first plan should be to use an XML parser. Almost every language has one, and it should be far more accurate than using regex. However, if you just want to use regex to search for strings inside Notepad++ (npp), then you can use \s to match the newlines between values:
<Number>(\d+\s)+<\/Number>
https://regex101.com/r/MwvBxz/1
I'm not sure I fully understand what you are trying to do so if this doesn't do it then let me know what you are going for.
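For the parser route, here is a rough sketch using Python's standard xml.etree.ElementTree. The question's sample file is not shown, so the document below is invented: only the Type and Number tag names come from the question, while Root and Item are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the real file layout is not shown in the question.
xml_text = """<Root>
  <Item>
    <Type>1</Type>
    <Number>123
456
789</Number>
  </Item>
</Root>"""

root = ET.fromstring(xml_text)

# Collect the whitespace-separated values inside every <Number> element.
numbers = [(node.text or "").split() for node in root.iter("Number")]

print(numbers)  # [['123', '456', '789']]
```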
You can use this find+replace combo to remove everything which is not a digit in between the <Number> tag:
Find:
.*?<Number>(.*?)<\/Number>.*
Replace:
$1
Finally I was able to find the right regular expression. I'll leave it below in case anyone needs it:
<Type>\d</Type>\n<Number>(\d+\n)+(\d+</Number>)
Explanation:
\d: Shortcut for digits, same as [0-9]
\n: Newline.
+: Match the previous element one or more times.
Have a good day everybody,
After giving it some more thought I decided to write a second answer.
You can make use of look arounds:
(?<=<Number>)[\d\s]+(?=<\/Number>)
https://regex101.com/r/FiaTKD/1
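The lookaround pattern also works in Python's re module, because the lookbehind has a fixed width. A small sketch on an invented fragment (the sample text is an assumption):

```python
import re

# Hypothetical fragment with newline-separated values in <Number>.
text = "<Type>1</Type>\n<Number>123\n456\n789</Number>"

# The lookbehind and lookahead assert the tags without consuming them,
# so the match is only the digits and whitespace in between.
match = re.search(r"(?<=<Number>)[\d\s]+(?=</Number>)", text)
values = match.group().split() if match else []

print(values)  # ['123', '456', '789']
```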

Are there any characters that are not allowed/used in regex

I have the somewhat odd requirement that several regexes should be passed as one single string to a Jenkins plugin.
They should be entered in one single text field, and I have to split this string into a list of regexes later on.
Now the issue is that I can't think of any way to delimit the regexes in the string so that I can split it later, since a character like , could also be part of a regex itself.
E.g. if I'd use a , for the two regex "(\d+,?\s+\d{1})\.xls" and "\w+\.exe" :
"(\d+,?\s+\d{1})\.xls,\w+\.exe"
would be split into 3 regexes: "(\d+", "?\s+\d{1})\.xls" and "\w+\.exe"
where the first 2 are obviously invalid.
So my actual question is, are there any characters, that can never appear in a regex which I could use to delimit my regexes?
No, any and all characters can appear in a regex. Use any serialisation format to serialise your list of strings into a clearly expressed list format, e.g. JSON:
["(\\d+,?\\s+\\d{1})\\.xls", "\\w+\\.exe"]
Alternatively CSV or anything else that can express a list of things and properly escapes characters used to denote item separators.
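A minimal round-trip sketch of the JSON approach, written in Python for illustration (the plugin itself would presumably do the equivalent in Java or Groovy, which is an assumption). The two patterns are the ones from the question:

```python
import json
import re

# The two regexes from the question, as raw strings.
patterns = [r"(\d+,?\s+\d{1})\.xls", r"\w+\.exe"]

# Serialise to one string suitable for a single text field.
serialised = json.dumps(patterns)

# Later: parse the string back into a list and compile each entry.
restored = [re.compile(p) for p in json.loads(serialised)]

print(json.loads(serialised) == patterns)  # True
```

The comma inside the first regex survives because JSON quotes each item, so the separator is never confused with pattern content.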

Regex to match reoccurring character groups

I'm trying to write a regex that would match groups of exactly three characters, that reoccur within the text at least one time.
What I came up with is this simple regex: (.{3}).*\g1, using the g (global) and s (dot also matches newline) flags. However, it is clearly faulty, as it only finds some of the groups I'm hoping to capture. Any idea how I can improve it? Here is the link to an example input: https://regex101.com/r/Cuiva1/2
Edit: Here's the full list of groups I was hoping to capture, as requested in the comments: GLT,VIW,IWK,KTL,GLT,LTK,LIS,KTX,TXK,XDL,KTL
If your input is always multiple triplets of uppercase characters and you're only looking for ones that repeat, then you need something more complex to avoid backtracking into a previous triplet:
/(?>[^A-Z]*+([A-Z]{3}))(?=(?:[^A-Z]*+[A-Z]{3})*?\1)|(?>[^A-Z]*+[A-Z]{3})/g
The matches at index 1 will hold what you want. If your strings are not that well formatted (i.e. they may contain strings of any length in between repeating patterns), then you can use a simpler pattern, but you'll get inconsistent results and miss some matches.
I re-read your desired output; you're not going to achieve this with regex. VIW and IWK are overlapping, which won't work in a single preg_match_all(). Just use string functions.
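A sketch of the "string functions" route in Python, assuming every overlapping three-character window counts as a candidate group. The function name repeated_triplets and the sample inputs are made up, and the linked input is not reproduced here, so this may not match the asker's exact expectations:

```python
from collections import Counter

def repeated_triplets(text):
    """Return each 3-character window that occurs at least twice,
    in order of first appearance. Windows may overlap."""
    windows = [text[i:i + 3] for i in range(len(text) - 2)]
    counts = Counter(windows)
    return [w for w in counts if counts[w] > 1]

print(repeated_triplets("ABCDABCD"))  # ['ABC', 'BCD']
```

Sliding one character at a time is what lets overlapping groups such as ABC and BCD both be reported.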

"Multiple pass" RegEx for fixing space indentations

Searched and found some apparently similar questions, that weren't quite.
I often find myself needing to replace leading 4-space indentations with tabs. I always do this with RegEx ^(\t*) {4}, replacing with $1\t. And then I just do multiple passes to catch nested indents. It works, it's easy. But I'm wondering, is it possible to write a RegEx that can do this in one pass (to handle nested indents)?
EDIT
Apologies for lack of input/output examples, I was in a hurry. Here's an example, let s mean space and t mean tab:
SMA
ssssRTP
ssssssssATR
ssssssssOLN
ssssOWH
ssssERE
TOGO
Output:
SMA
tRTP
ttATR
ttOLN
tOWH
tERE
TOGO
Essentially, the RegEx would need to allow for arbitrarily deeply nested chunks of 4 spaces. It does not need to allow for tabs following spaces in the initial input.
PCRE
(^\t*|\G) {4} replace with $1\t or (^|\G)( {4}|\t) replace with \t. You should use multiline mode.
How this works:
^\t*: matches the start of a line (in multiline mode) followed by any number of tabs.
\G: matches at the end of the previous match.
 {4} (a literal space followed by {4}): matches four spaces.
So this regular expression matches four spaces at the start of a line, or four spaces immediately following four spaces already matched by this regular expression.
Tested this with .NET's regex engine. JavaScript's (at least Mozilla's) won't work, though; it relies on lookbehind, which isn't available. PCRE wants fixed-length lookbehinds, so this won't work there either, unfortunately.
(?<=^( {4}|\t)*) {4}
Basic idea is to match four spaces preceded by the beginning of a line plus all the spots where a previous match would naturally go. Since replacement is done atomically, there's no chance of missing such a spot; all such matches are gathered at once. Then make sure you're using Multiline flag and replace with a single tab character and you're good to go.
Test data, which is just random pseudocode in a vaguely Pythonesque style:
def a:
return true
# comment with embedded spaces etc.
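Python's built-in re module has no \G and only fixed-width lookbehind, so neither pattern above carries over directly. As a different technique (not the answers' method), a single re.sub pass with a replacement function can handle nested indents. The name spaces_to_tabs is made up for this sketch:

```python
import re

def spaces_to_tabs(text):
    """Convert leading groups of four spaces to tabs in one pass."""
    return re.sub(
        # Leading run made of 4-space groups and/or tabs on each line.
        r"^(?: {4}|\t)+",
        # Inside that run, turn every group of four spaces into a tab.
        lambda m: m.group().replace(" " * 4, "\t"),
        text,
        flags=re.MULTILINE,
    )

# The example from the question, with real spaces instead of "s".
sample = "SMA\n    RTP\n        ATR\n        OLN\n    OWH\n    ERE\nTOGO"
print(spaces_to_tabs(sample) == "SMA\n\tRTP\n\t\tATR\n\t\tOLN\n\tOWH\n\tERE\nTOGO")  # True
```

Spaces after the indentation (for example inside a comment) are left alone because the pattern is anchored to the start of each line.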

REGEX: How to match several lines?

I have a huge CSV list that needs to be broken up into smaller pieces (say, groups of 100 values each). How do I match 100 lines? The following does not work:
(^.*$){100}
If you must, you can use (flags: multi-line, not global):
(^.*[\r\n]+){100}
But, realistically, using regex to find lines probably is the worst-performing method you could come up with. Avoid.
You don't need regex for this. There should be other tools in your language, and even if there are none, you can do simple string processing to get those lines.
However this is the regular expression that should match 100 lines:
/([^\n]+\n){100}/
But you really shouldn't use that; it's just to show how to do such a task if ever needed. It matches one or more non-newline characters [^\n]+ followed by a newline \n, repeated {100} times.
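A sketch of the non-regex route in Python, assuming the whole file fits in memory. The function name chunk_lines is hypothetical:

```python
def chunk_lines(text, size=100):
    """Split text into lists of at most `size` lines each."""
    lines = text.splitlines()
    return [lines[i:i + size] for i in range(0, len(lines), size)]

# 250 made-up CSV rows give two full groups and one partial group.
rows = "\n".join(f"value{i}" for i in range(250))
groups = chunk_lines(rows)
print([len(g) for g in groups])  # [100, 100, 50]
```

Unlike the {100} patterns above, slicing also returns the final partial group when the line count is not a multiple of 100.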