I have about fifty 9 GB .mer files that look like this:
"xxxxx";"123\t123\t123\v234\t234\v234\t224\t234\v"
"yyyyy";"123\t234\t224\v234\t234\v234\t224\t234\v"
"zzzzz";"123\t456\t565\v234\t774"
A UUID followed by ";", then possibly some tab-separated entries, followed by a vertical-tab-delimited list of further tab-separated entries, all enclosed in quotes. I've shown them as 3-digit numbers here, but in reality they're variable-length strings, which can include doubled quotes ("").
I need to turn them into this:
xxxxx\t123\t123\t123
xxxxx\t234\t234
xxxxx\t234\t224\t234
yyyyy\t123\t234\t224
yyyyy\t234\t234
yyyyy\t234\t224\t234
zzzzz\t123\t456\t565
zzzzz\t234\t774
That is, split the lines on the vertical tabs, prefixing each line with the first field of the line that it came from.
Currently, I'm using a noddy regex, which does at least work, but requires multiple runs and manual checking.
How would I do this using awk or sed? I've tried adapting the current answers below, but I'm having trouble finding out what the ;P and ;D suffixes mean.
(Note: I'm using Git Bash on Windows, so I guess that's GNU sed and awk?)
awk -F';' -v OFS='\t' '{          # input field separator is ";", output separator is a tab
    for(i=1;i<=NF;i++)            # we have 2 fields; strip the leading and trailing quote from each
        gsub(/^"|"$/,"",$i)
    c=split($2,a,"\v")            # split field 2 on vertical tabs, saving the result in array "a"
    for(i=1;i<=c;i++)             # for each non-empty element, print field 1 (the uuid)
        if(a[i]) print $1, a[i]   # and the element, separated by a tab
}' file
The explanation is inline.
It outputs:
xxxxx 123 123 123
xxxxx 234 234
xxxxx 234 224 234
yyyyy 123 234 224
yyyyy 234 234
yyyyy 234 224 234
zzzzz 123 456 565
zzzzz 234 774
This might work for you (GNU sed):
sed -r 's/^((\S*\t)\S*)\v/\1\n\2/;P;D' file
Replace the first \v with a newline followed by the first field and a tab. P prints the pattern space up to the first embedded newline, D deletes up to and including that newline and restarts the cycle, so the substitution is repeated until no \v is left; then the remainder is printed and the next input line is read.
EDIT: As per the revised question:
sed -r '/\n/!s/"(")?/\1/g;/\n/!s/;/\t/;s/^((\S*\t)[^\v]*)\v/\1\n\2/;/\t$/!P;D' file
Remove single double quotes (replacing doubled double quotes with single ones) and replace the semicolon with a tab character. Then replace each \v with a newline followed by the first field and a tab, and repeat as above.
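Since the question asks what the ;P;D suffix means: P prints the pattern space up to the first embedded newline, and D deletes up to and including that newline and then restarts the cycle without reading a new input line (if no newline is left, it behaves like d). A minimal illustration of the idiom, not taken from the answer above (GNU sed assumed):
echo 'one;two;three' | sed 's/;/\n/;P;D'
This prints one, two and three on separate lines: each cycle turns one ";" into a newline, prints the first piece, deletes it and goes round again.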
You can use this awk command for this output:
awk 'BEGIN{FS=OFS="\t"} n = split($2, a, "\x0b") {
for (i=1; i<=n; i++) print $1, a[i]}' file
195a664e-e0d0-4488-99d6-5504f9178115 1234
195a664e-e0d0-4488-99d6-5504f9178115 1412
195a664e-e0d0-4488-99d6-5504f9178115 1231
195a664e-e0d0-4488-99d6-5504f9178115 4324
195a664e-e0d0-4488-99d6-5504f9178115 1421
195a664e-e0d0-4488-99d6-5504f9178115 3214
a1d61289-7864-40e6-83a7-8bdb708c459e 1412
a1d61289-7864-40e6-83a7-8bdb708c459e 6645
a1d61289-7864-40e6-83a7-8bdb708c459e 5334
a1d61289-7864-40e6-83a7-8bdb708c459e 3453
a1d61289-7864-40e6-83a7-8bdb708c459e 5453
How it works:
BEGIN{FS=OFS="\t"} # sets input and output field separator as tab
n = split($2, a, "\x0b") # splits second field using Hex 0B (ASCII 11) i.e. vertical tab
for (i=1; i<=n; i++) ... # prints pair of field 1 with each item from split array a
GNU sed:
sed 's/"\|..$//g;s/;/\t/;:r;s/^\([^\t]*\)\t\(.*\)\\v/\1\t\2\n\1\t/;t r;s/\\t/\t/g;' YourFile
Recursive replace of \v with the first "field" plus a tab, cleaning up extra characters along the way.
Another solution using awk
awk '
BEGIN{FS="[\v;]"}
{
    gsub("[\"]","");
    for(i=2; i<=NF; ++i)
        if($i) printf "%s\t%s\n", $1, $i;
}' file.mer
Another solution using sed
sed -r 's/\v\n/\v/g; s/"//g;
        :a; s/([^;]*);([^\v]*)\v/\1;\2\n\1;/g; ta;
        s/;/\t/g;' file.mer | sed -r '/^[^\t]+\t$/d'
You get:
xxxxx 123 123 123
xxxxx 234 234
xxxxx 234 224 234
yyyyy 123 234 224
yyyyy 234 234
yyyyy 234 224 234
zzzzz 123 456 565
zzzzz 234 774
Okay, I waited on purpose until Kent's answer was accepted and awarded the bounty, because the question was about awk/sed. Thus, my answer might be somewhat off-topic, but anyway, here is my Java solution, which I did just for fun as a kata.
MER input file generator:
I thought it would be nice to generate some sample input files with randomised values. Each line consists of
a UUID,
0-9 groups, separated by vertical tabs,
within each group, 1-4 strings, separated by horizontal tabs,
each string consisting of 1-20 characters, among them double quotes escaped by other double quotes, i.e. "".
I thought this was diverse enough to get some good test data.
package de.scrum_master.stackoverflow;
import org.apache.commons.lang.RandomStringUtils;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Random;
import java.util.UUID;
public class RandomFileGenerator {
  private static final int BUFFER_SIZE = 1024 * 1024;
  private final static Random RANDOM = new Random();
  private final static char VERTICAL_TAB = '\u000b';
  private final static char[] LEGAL_CHARS =
    "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzäöüÄÖÜß. -\""
      .toCharArray();

  public static void main(String[] args) throws IOException {
    long startTime = System.currentTimeMillis();
    // final long maxOutputSize = 9L * 1024 * 1024 * 1024;
    // final String outputFile = "src/main/resources/sample-9gb.mer";
    final long maxOutputSize = 1L * 1024 * 1024;
    final String outputFile = "src/main/resources/sample-1mb.mer";
    long totalOutputSize = 0;
    long lineCount = 0;
    String line;
    try (PrintWriter writer = new PrintWriter(new BufferedWriter(new FileWriter(outputFile), BUFFER_SIZE))) {
      while (totalOutputSize < maxOutputSize) {
        line = generateLine();
        writer.println(line);
        totalOutputSize += line.length() + 1;
        lineCount++;
      }
    }
    System.out.println(lineCount);
    System.out.println(totalOutputSize);
    System.out.println((System.currentTimeMillis() - startTime) / 1000.0);
  }

  private static String generateLine() {
    StringBuilder buffer = new StringBuilder();
    buffer
      .append('"')
      .append(UUID.randomUUID().toString())
      .append("\";\"");
    int numItems = RANDOM.nextInt(10);
    for (int i = 0; i < numItems; i++) {
      int numSubItems = 1 + RANDOM.nextInt(4);
      for (int j = 0; j < numSubItems; j++) {
        buffer.append(
          RandomStringUtils.random(1 + RANDOM.nextInt(20), 0, LEGAL_CHARS.length, false, false, LEGAL_CHARS)
            .replaceAll("\"", "\"\"")
        );
        if (j + 1 < numSubItems)
          buffer.append('\t');
      }
      if (i + 1 < numItems) {
        buffer.append(VERTICAL_TAB);
      }
    }
    buffer.append('"');
    return buffer.toString();
  }
}
You can see that it is easy to create test files of the desired file size, e.g.
1 MB: maxOutputSize = 1L * 1024 * 1024
9 GB: maxOutputSize = 9L * 1024 * 1024 * 1024
I mostly used the smaller one to check the algorithms during development and the really big one for performance tuning.
File splitter in 4 different variants:
The variants shown here use different approaches, but what they have in common is that they all read from a BufferedReader with Java streams via reader.lines(). Switching from streams to a simple for loop made it slower, BTW. All solutions write the results to a PrintWriter.
reader.lines().forEach() and then regex matching + splitting. This solution has the best trade-off between readability, brevity and performance.
reader.lines().flatMap(), i.e. using sub-streams for the vertical-tab-separated groups after the UUID, also using regex matching + splitting. This solution is also quite short and elegant, but a little harder to read than #1 and also about 15% slower.
Because regex matching calls like replace() and split() can be pretty expensive, I developed a solution that instead iterates over the string and uses indexOf() and substring() rather than regexes. This was substantially faster than #1 and #2, but the code is harder to read in a way I started to dislike. This should only be done if performance is really important, i.e. if the file splitter is used regularly. For a one-time solution or if it runs just once a month, I think it is not really worth it from a maintainability standpoint.
A further optimised version of #3 which avoids some more overhead and is again a little bit faster, but not substantially. Now the code really needs source code comments so as to convey to the reader what the algorithm does. From a clean code perspective this is a nightmare. (Don't do this at home, kids!)
package de.scrum_master.stackoverflow;
import java.io.*;
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class FileSplitter {
  private static final int BUFFER_SIZE = 1024 * 1024;
  private static final Pattern LINE_PATTERN = Pattern.compile("^\"([^\"]+)\";\"(.*)\"$");
  private final static char VERTICAL_TAB = '\u000b';

  public static void main(String[] args) throws IOException {
    long startTime = System.currentTimeMillis();
    String inputFile = "src/main/resources/sample-9gb.mer";
    String outputFile = inputFile.replaceFirst("mer$", "txt");
    try (
      BufferedReader reader = new BufferedReader(new FileReader(inputFile), BUFFER_SIZE);
      PrintWriter writer = new PrintWriter(new BufferedWriter(new FileWriter(outputFile), BUFFER_SIZE))
    ) {
      // forEachVariant(reader, writer);
      // flatMapVariant(reader, writer);
      noRegexSimpleVariant(reader, writer);
      // noRegexOptimisedVariant(reader, writer);
    }
    System.out.println((System.currentTimeMillis() - startTime) / 1000.0);
  }

  private static void forEachVariant(BufferedReader reader, PrintWriter writer) {
    Matcher matcher = LINE_PATTERN.matcher("dummy");
    reader.lines()
      .forEach(line -> {
        matcher.reset(line).matches();
        for (String record : matcher.group(2).replace("\"\"", "\"").split("\\v"))
          writer.println(matcher.group(1) + "\t" + record);
      });
  }

  private static void flatMapVariant(BufferedReader reader, PrintWriter writer) {
    Matcher matcher = LINE_PATTERN.matcher("dummy");
    reader.lines()
      .flatMap(line -> {
        matcher.reset(line).matches();
        return Arrays
          .stream(matcher.group(2).replace("\"\"", "\"").split("\\v"))
          .map(record -> matcher.group(1) + "\t" + record);
      })
      .forEach(writer::println);
  }

  private static void noRegexSimpleVariant(BufferedReader reader, PrintWriter writer) {
    reader.lines()
      .forEach(line -> {
        final int lineLength = line.length();
        // UUID + '\t'
        int indexLeft = 1;
        int indexRight = line.indexOf('"', indexLeft);
        final String uuid = line.substring(indexLeft, indexRight) + "\t";
        indexLeft = indexRight + 3;
        String record;
        int quoteIndex;
        while (indexLeft < lineLength) {
          writer.print(uuid);
          indexRight = line.indexOf(VERTICAL_TAB, indexLeft);
          if (indexRight == -1)
            indexRight = lineLength - 1;
          while (indexLeft < indexRight) {
            quoteIndex = line.indexOf('"', indexLeft);
            if (quoteIndex == -1 || quoteIndex >= indexRight)
              quoteIndex = indexRight;
            else
              quoteIndex++;
            record = line.substring(indexLeft, quoteIndex);
            writer.print(record);
            indexLeft = quoteIndex + 1;
          }
          writer.println();
          indexLeft = indexRight + 1;
        }
      });
  }

  private static void noRegexOptimisedVariant(BufferedReader reader, PrintWriter writer) throws IOException {
    reader.lines()
      .forEach(line -> {
        // UUID + '\t'
        int indexLeft = 1;
        int indexRight = line.indexOf('"', indexLeft);
        final String uuid = line.substring(indexLeft, indexRight) + "\t";
        // Skip '";"' after UUID
        indexLeft = indexRight + 3;
        final int lineLength = line.length();
        String recordChunk;
        int quoteIndex;
        // If search for '"' has once reached end of line, search no more
        boolean doQuoteSearch = true;
        // Iterate over records per UUID, separated by vertical tab
        while (indexLeft < lineLength) {
          writer.print(uuid);
          indexRight = line.indexOf(VERTICAL_TAB, indexLeft);
          if (indexRight == -1)
            indexRight = lineLength - 1;
          // Search for '""' within record incrementally, + replace each of them by '"'.
          // BTW, if '"' is found, it actually always will be an escaped '""'.
          while (indexLeft < indexRight) {
            if (doQuoteSearch) {
              // Only search for quotes if we never reached the end of line before
              quoteIndex = line.indexOf('"', indexLeft);
              assert quoteIndex != -1;
              if (quoteIndex >= lineLength - 1)
                doQuoteSearch = false;
              if (quoteIndex >= indexRight)
                quoteIndex = indexRight;
              else
                quoteIndex++;
            }
            else {
              // No more '"' within record
              quoteIndex = indexRight;
            }
            // Write record chunk, skipping 2nd '"'
            recordChunk = line.substring(indexLeft, quoteIndex);
            writer.print(recordChunk);
            indexLeft = quoteIndex + 1;
          }
          // Do not forget newline before reading next line/UUID
          writer.println();
          indexLeft = indexRight + 1;
        }
      });
  }
}
Updated awk script:
One note: each Java variant writes out a bare UUID with no content if there was none in the input file. That would be easy to avoid, but I did it on purpose. It is the only difference to this slightly updated awk script (based on Dave's, but it also replaces "" with "), which I used as a benchmark:
#!/usr/bin/awk -f
{
    for(i=1;i<=NF;i++) {
        gsub(/^"|"$/,"",$i)
        gsub(/""/,"\"",$i)
    }
    c=split($2,a,"\\v")
    for(i=1;i<=c;i++)
        print $1,a[i]
}
Performance results:
I measured both parsing and writing performance.
Parsing means reading a 9 GB file from disk and splitting it, but writing the output to /dev/null or not at all.
Writing means reading the same 9 GB file and writing it back onto the same disk partition (hybrid HD + SSD type), i.e. this could be further optimised by writing onto another physical disk. The output file has a size of 18 GB.
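To make the two timing modes concrete for the awk baseline, they correspond roughly to invocations like these (my own illustration; splitter.awk is a placeholder name for the script above, run with the same separators as Dave's version):
time awk -F';' -v OFS='\t' -f splitter.awk sample-9gb.mer > /dev/null          # "parsing only"
time awk -F';' -v OFS='\t' -f splitter.awk sample-9gb.mer > sample-9gb.txt     # "parsing + writing"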
Reading file, splitting into lines but not parsing the lines: 66 s
Awk
  parsing only: 533 s
  parsing + writing: 683 s
reader.lines().forEach() and then regex matching + splitting
  parsing only: 212 s
  parsing + writing: 425 s
reader.lines().flatMap(), i.e. using sub-streams
  parsing only: 245 s
  parsing + writing: not measured
Without using regex, but using String.replace("\"\"", "\"") (not shown in code here)
  parsing only: 154 s
  parsing + writing: 369 s
No regex, no replace(), simple version
  parsing only: 86 s
  parsing + writing: 342 s
No regex, no replace(), optimised version
  parsing only: 84 s
  parsing + writing: 342 s
Sorry for the lengthy treatise, but I wanted to share my findings for others reading the question and the other answers, speculating whether Java (or C?) might be faster than awk. Yes, it is, by a substantial bit, but not by an order of magnitude, because disk performance is also a factor. And I think it is a warning to those who tend to over-optimise for optimisation's sake: it is not worth it if you go too far, just try to hit the sweet spot between effort, readability and performance. Amen.
Related
Suppose I have a few million (possibly long) strings, and need to know if each of them contains any of these given patterns:
regex1
regex2
regex3
...
Performance wise, would it be better to:
Test each string against the "full" regular expression /regex1|regex2|regex3|.../, or
Test each string against regex1, if didn't match then test against regex2, and so on...?
I was wondering about this and, as my knowledge of regex implementations is very limited, I have no idea whether these would exhibit similar behavior or not.
Edit: I just did some quick benchmarking. I didn't think about it too much, just blurted out some code. Please point out anything that might be biasing the output.
This is JavaScript, and I did the test with Node.js.
Note: I tried running with 5 million strings and 500 regexes, but the process ran out of memory, so I lowered the numbers
"use strict";
var strMinSize = 50;
var strMaxSize = 500;
var howManyStrings = 100000; // hundred thousand
var howManyRegex = 50; // fifty
var possible = " ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
function makestr() {
    var text = "";
    var strSize = Math.floor(Math.random() * strMaxSize) + strMinSize;
    for (var i=0; i < strSize; i++) {
        text += possible.charAt(Math.floor(Math.random() * possible.length));
    }
    return text;
}
function makeregex() {
    var regexstr = "";
    var regexSize = Math.floor(Math.random() * 50) + 5;
    for (var i=0; i < regexSize; ++i) {
        regexstr += possible.charAt(Math.floor(Math.random() * possible.length));
    }
    return regexstr;
}
var stringList = [];
for (var i=0; i < howManyStrings; ++i) {
    stringList.push(makestr());
}
var regexList = [];
var fullRegex = ""; // aux to build the disjunction
for (var i=0; i < howManyRegex; ++i) {
    var localRegex = makeregex();
    regexList.push(new RegExp(localRegex));
    fullRegex += '|' + localRegex;
}
fullRegex = new RegExp(fullRegex.substr(1));
// let's do this...
for (var kase=1; kase < 10; ++kase) {
    // Test 1: one disjunction with every regex
    var time1 = 0;
    var time2 = 0;
    var start = new Date().getTime();
    stringList.forEach( function(str) {
        fullRegex.test(str);
    });
    var end = new Date().getTime();
    time1 = end - start;
    // Test 2: one regex at a time
    start = new Date().getTime();
    stringList.forEach( function(str) {
        regexList.every( function(rx) {
            if (rx.test(str)) {
                return false;
            } else {
                return true;
            }
        });
    });
    end = new Date().getTime();
    time2 = end - start;
    console.log(time1 + ";" + time2);
}
The running times were:
+--------+---------+
| Test 1 | Test 2 |
+--------+---------+
| 813 | 1817 |
| 558 | 1750 |
| 566 | 1756 |
| 558 | 1783 |
| 560 | 1755 |
| 559 | 1736 |
| 551 | 1749 |
| 552 | 1743 |
| 558 | 1746 |
+--------+---------+
So, as I suspected, the second alternative is way worse... But why so much?
One regex will always be faster, because each regex test requires a pass over the input, and even though the combined regex is (slightly) more complex than the individual expressions, it is still a constant time computation.
Expressing the problem using "big O" notation:
single regex evaluation at given location in input = O(1)
combined regex evaluation at given location in input = effectively O(1)
regex match on string = O(n) (where n = string length)
From these, we can say that individual passes for each term = O(n * k) where k is the number of regexes/terms, but one regex is O(n).
This is borne out by your tests, which show the separate regexes running roughly 3 times slower.
This all hinges on the premise that the combined regex is "about as fast" as a simple one. This is the case because the regex state machine is extremely efficient, making the execution time for a simple alternation practically the same as for a plain pattern. It is a little slower, but nowhere near slow enough to warrant separate passes for separate regexes, no matter how long the list of terms becomes.
I have an AWK script to write specific values matching a specific pattern to a .csv file.
The code is as follows:
BEGIN{print "Query Start,Query End, Target Start, Target End,Score, E,P,GC"}
/^\>g/ { Query=$0 }
/Query =/{
    split($0,a," ")
    query_start=a[3]
    query_end=a[5]
    query_end=gsub(/,/,"",query_end)
    target_start=a[8]
    target_end=a[10]
}
/Score =/{
    split($0,a," ")
    score=a[3]
    score=gsub(/,/,"",score)
    e=a[6]
    e=gsub(/,/,"",e)
    p=a[9]
    p=gsub(/,/,"",p)
    gc=a[12]
    printf("%s,%s,%s,%s,%s,%s,%s,%s\n",query_start, query_end,target_start,target_end,score,e,p,gc)
}
The input file is as follows:
>gi|ABCDEF|
Plus strand results:
Query = 100 - 231, Target = 100 - 172
Score = 20.92, E = 0.01984, P = 4.309e-08, GC = 51
But I received the output in a .csv file as provided below:
100 0 100 172 0 0 0 51
The program failed to copy the values of:
Query end
Score
E
P
(Note: all the failed values are the ones that appear before a comma (,) in the input.)
Any help to obtain the right output will be great.
Best regards,
Amit
As @Jidder mentioned, you don't need to call split(), and as @jaypal mentioned, you're using gsub() incorrectly, but also you don't need to call gsub() at all if you just include , in your FS.
Try this:
BEGIN {
    FS = "[[:space:],]+"
    OFS = ","
    print "Query Start","Query End","Target Start","Target End","Score","E","P","GC"
}
/^\>g/ { Query=$0 }
/Query =/ {
    query_start=$4
    query_end=$6
    target_start=$9
    target_end=$11
}
/Score =/ {
    score=$4
    e=$7
    p=$10
    gc=$13
    print query_start,query_end,target_start,target_end,score,e,p,gc
}
Does that work? Note that the field numbers are bumped up by 1 because, when you don't use the default FS, awk no longer skips leading white space, so there's an empty field before the white space in your input.
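To see the off-by-one effect for yourself (my own illustration, assuming the input lines carry the leading whitespace described above):
echo '  Query = 100 - 231, Target = 100 - 172' |
awk 'BEGIN{FS="[[:space:],]+"} {print "NF=" NF, "$1=[" $1 "]", "$4=" $4, "$6=" $6, "$9=" $9, "$11=" $11}'
which prints NF=11 $1=[] $4=100 $6=231 $9=100 $11=172, where the empty $1 is what pushes Query Start to $4, Query End to $6, and so on.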
Obviously, you are not using your Query variable so the line that populates it is redundant.
I am using Access and VBA to tidy up a database before a migration. One field is going from text to an INT, so I need to convert and possibly add up some numbers which exist in a single field.
Examples:
F/C 3 other 8 should become 11
Calender-7 should become 7
21 F/C and 1 other should become 22
29 (natural ways) should become 29
The second and fourth lines are simple enough, just use the following regex in VBA:
Dim rgx As New RegExp
Dim inputText As String
Dim outputText As String
rgx.Pattern = "[^0-9]*"
rgx.Global = True
inputText = "29 (natural ways)"
outputText = rgx.Replace(inputText, "")
The downside is if I use it on option 1 or 3:
F/C 3 other 8 will become 38
Calender-7 will become 7
21 F/C and 1 other will become 211
29 (natural ways) will become 29
This is simple enough in bash: I can keep the spaces by adding one to [^0-9 ]* and then pipe the result into awk, which will add up every field using a space as the delimiter, like so:
sed 's/[^0-9 ]*//g' | awk -F' ' 's=0; {for (i=1; i<=NF; i++) s=s+$i; print s}'
F/C 3 other 8 will become 11
21 F/C and 1 other will become 22
The problem is I cannot use bash, and there are far too many values to do it by hand. Is there any way to use VBA to accomplish this?
Instead of using the replace method, just capture and then add up all the numbers. For example:
Option Explicit
Function outputText(inputText)
    Dim rgx As RegExp
    Dim mc As MatchCollection, m As Match
    Dim I As Integer
    Set rgx = New RegExp
    rgx.Pattern = "[0-9]+"
    rgx.Global = True
    Set mc = rgx.Execute(inputText)
    For Each m In mc
        I = I + CInt(m) 'may need to be cast as an int in Access VBA; not required in Excel VBA
    Next m
    outputText = I
End Function
I'm not sure if there is an easier way for your question. Here I've written a small function for you.
Requirement: add all numbers in a string, identify "consecutive" digits as one number.
Pseudo:
Loop through the given text
    find the first number and check/loop whether the following chars are numbers
    if the following chars are numbers, treat them as one number, else pass on the result
    continue searching from the last point and add the result to the total
in code:
Public Function ADD_NUMB(iText As String) As Long
    Dim I, J As Integer
    Dim T As Long
    Dim TM As String
    For I = 1 To Len(iText)
        If (InStr(1, "1234567890", Mid$(iText, I, 1)) >= 1) Then
            TM = Mid(iText, I, 1)
            For J = I + 1 To Len(iText)
                If (InStr(1, "1234567890", Mid$(iText, J, 1)) >= 1) Then
                    TM = TM & Mid$(iText, J, 1)
                Else
                    Exit For
                End If
            Next J
            T = T + Val(Nz(TM, 0))
            I = J
        End If
    Next I
    ADD_NUMB = T
End Function
usage:
dim total as integer
total = ADD_NUMB("21 F/C and 1 other")
not sure about performance but it will get you what you need :)
I'm hoping someone might know of a script that can take an arbitrary word list and generate the shortest regex that could match that list exactly (and nothing else).
For example, suppose my list is
1231
1233
1234
1236
1238
1247
1256
1258
1259
Then the output should be:
12(3[13468]|47|5[589])
This is an old post, but for the benefit of those finding it through web searches as I did, there is a Perl module that does this, called Regexp::Optimizer, here: http://search.cpan.org/~dankogai/Regexp-Optimizer-0.23/lib/Regexp/Optimizer.pm
It takes a regular expression as input, which can consist just of the list of input strings separated with |, and outputs an optimal regular expression.
For example, this Perl command-line:
perl -mRegexp::Optimizer -e "print Regexp::Optimizer->new->optimize(qr/1231|1233|1234|1236|1238|1247|1256|1258|1259/)"
generates this output:
(?^:(?^:12(?:3[13468]|5[689]|47)))
(assuming you have installed Regexp::Optimizer), which matches the OP's expectation quite well.
Here's another example:
perl -mRegexp::Optimizer -e "print Regexp::Optimizer->new->optimize(qr/314|324|334|3574|384/)"
And the output:
(?^:(?^:3(?:[1238]|57)4))
For comparison, an optimal trie-based version would output 3(14|24|34|574|84). In the above output, you can also search and replace (?: and (?^: with just ( and eliminate redundant parentheses, to obtain this:
3([1238]|57)4
You are probably better off saving the entire list or, if you want to get fancy, creating a Trie:
1231
1234
1247
    1
    |
    2
   / \
  3   4
 / \   \
1   4   7
Now when you take a string, check if it reaches a leaf node. If it does, it's valid.
If you have variable length overlapping strings (eg: 123 and 1234) you'll need to mark some nodes as possibly terminal.
You can also use the trie to generate the regex if you really like the regex idea:
Nodes from the root to the first branching are fixed (eg: 12)
Branches create |: (eg: 12(3|4))
Leaf nodes generate a character class (or single character) that follows the parent node: (eg: 12(3[14]|47))
This might not generate the shortest regex; to do that you might need some extra work:
"Compact" ranges if you find them (eg: [12345] becomes [1-5])
Add quantifiers for repeated elements (eg: [1234][1234] becomes [1234]{2})
???
I really don't think it's worth it to generate the regex.
This project generates a regexp from a given list of words: https://github.com/bwagner/wordhierarchy
It does almost the same as the JavaScript solution below, but avoids certain superfluous parentheses.
It only uses "|", non-capturing group "(?:)" and option "?".
There's room for improvement when there's a row of single characters:
Instead of e.g. (?:3|8|1|6|4) it could generate [38164].
The generated regexp could easily be adapted to other regexp dialects.
Sample usage:
java -jar dist/wordhierarchy.jar 1231 1233 1234 1236 1238 1247 1256 1258 1259
-> 12(?:5(?:6|9|8)|47|3(?:3|8|1|6|4))
Here's what I came up with (JavaScript). It turned a list of 20,000 6-digit numbers into a 60,000-character regular expression. Compared to a naive (word1|word2|...) construction, that's almost 60% "compression" by character count.
I'm leaving the question open, as there's still a lot of room for improvement and I'm holding out hope that there might be a better tool out there.
var list = new listChar("");
function listChar(s, p) {
    this.char = s;
    this.depth = 0;
    this.parent = p;
    this.add = function(n) {
        if (!this.subList) {
            this.subList = {};
            this.increaseDepth();
        }
        if (!this.subList[n]) {
            this.subList[n] = new listChar(n, this);
        }
        return this.subList[n];
    }
    this.toString = function() {
        var ret = "";
        var subVals = [];
        if (this.depth >=1) {
            for (var i in this.subList) {
                subVals[subVals.length] = this.subList[i].toString();
            }
        }
        if (this.depth === 1 && subVals.length > 1) {
            ret = "[" + subVals.join("") + "]";
        } else if (this.depth === 1 && subVals.length === 1) {
            ret = subVals[0];
        } else if (this.depth > 1) {
            ret = "(" + subVals.join("|") + ")";
        }
        return this.char + ret;
    }
    this.increaseDepth = function() {
        this.depth++;
        if (this.parent) {
            this.parent.increaseDepth();
        }
    }
}
function wordList(input) {
    var listStep = list;
    while (input.length > 0) {
        var c = input.charAt(0);
        listStep = listStep.add(c);
        input = input.substring(1);
    }
}
words = [/* WORDS GO HERE*/];
for (var i = 0; i < words.length; i++) {
    wordList(words[i]);
}
document.write(list.toString());
Using
words = ["1231","1233","1234","1236","1238","1247","1256","1258","1259"];
Here's the output:
(1(2(3[13468]|47|5[689])))
My time string may be in one of the following formats (x and y are integer numbers, h and m are literal symbols):
xh ym
xh
ym
y
Examples:
1h 20m
45m
2h
120
What regular expression should I write to get the x and y numbers from such a string?
(\d+)([mh]?)(?:\s+(\d+)m)?
You can then inspect groups 1-3. For your examples those would be
('1', 'h', '20')
('45', 'm', '')
('2', 'h', '')
('120', '', '')
As always, you might want to use some anchors ^, $, \b...
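Not part of this answer, but a quick way to see those groups from a shell, with \d and the non-capturing group rewritten as POSIX ERE for GNU sed (the minutes then land in group 4):
echo '1h 20m' | sed -E 's/^([0-9]+)([mh]?)( +([0-9]+)m)?$/x=\1 unit=\2 y=\4/'
This prints x=1 unit=h y=20; for 45m you get x=45 unit=m y= and for 120 you get x=120 unit= y=.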
I'm going to assume you're using .NET due to your username. :)
I think in this case, it's easier to use TimeSpan.ParseExact for this task.
You can specify a list of permitted formats (see the documentation on custom TimeSpan format strings for these) and ParseExact will read in the TimeSpan according to them.
Here is an example:
var formats = new[]{"h'h'", "h'h 'm'm'", "m'm'", "%m"};
// I have assumed that a single number means minutes
foreach (var item in new[]{"23","1h 45m","1h","45m"})
{
    TimeSpan timespan;
    if (TimeSpan.TryParseExact(item, formats, CultureInfo.InvariantCulture, out timespan))
    {
        // valid
        Console.WriteLine(timespan);
    }
}
Output:
00:23:00
01:45:00
01:00:00
00:45:00
The only problem with this is that it is rather inflexible. Additional whitespace in the middle will fail to validate. A more robust solution using Regex is:
var items = new[]{"23","1h 45m", "45m", "1h", "1h 45", "1h 45", "1h45m"};
foreach (var item in items)
{
    var match = Regex.Match(item, @"^(?=\d)((?<hours>\d+)h)?\s*((?<minutes>\d+)m?)?$", RegexOptions.ExplicitCapture);
    if (match.Success)
    {
        int hours;
        int.TryParse(match.Groups["hours"].Value, out hours); // hours == 0 on failure
        int minutes;
        int.TryParse(match.Groups["minutes"].Value, out minutes);
        Console.WriteLine(new TimeSpan(0, hours, minutes, 0));
    }
}
Breakdown of the regex:
^ - start of string
(?=\d) - must start with a digit (do this because both parts are marked optional, but we want to make sure at least one is present)
((?<hours>\d+)h)? - hours (optional, capture into named group)
\s* - whitespace (optional)
((?<minutes>\d+)m?)? - minutes (optional, capture into named group, the 'm' is optional too)
$ - end of string
I would say that mhyfritz' solution is simple, efficient and good if your input is only what you've shown.
If you ever need to handle corner cases, you can use a more discriminative expression:
^(\d+)(?:(h)(?:\s+(\d+)(m))?|(m?))$
But it can be overkill...
(get rid of ^ and $ if you need to detect such pattern in a larger body of text, of course).
Try this one: ^(?:(\d+)h\s*)?(?:(\d+)m?)?$:
var s = new[] { "1h 20m", "45m", "2h", "120", "1m 20m" };
foreach (var ss in s)
{
    var m = Regex.Match(ss, @"^(?:(\d+)h\s*)?(?:(\d+)m?)?$");
    int hour = m.Groups[1].Value == "" ? 0 : int.Parse(m.Groups[1].Value);
    int min = m.Groups[2].Value == "" ? 0 : int.Parse(m.Groups[2].Value);
    if (hour != 0 || min != 0)
        Console.WriteLine("Hours: " + hour + ", Mins: " + min);
    else
        Console.WriteLine("No match!");
}
In bash:
echo "$string" | awk '{for(i=1;i<=NF;i++) print $i}' | sed 's/[hm]//g'