Related
EDITED
I have about 50x9Gb .mer files that look like this:
"xxxxx";"123\t123\t123\v234\t234\v234\t224\t234\v"
"yyyyy";"123\t234\t224\v234\t234\v234\t224\t234\v"
"zzzzz";"123\t456\t565\v234\t774"
A uuid followed by ";" and then possibly extra tabbed entries, followed by a vertical tab delimited list of further multiple tabbed entries, all enclosed in quotes. I've shown them as 3-digit numbers here, but in reality they're variable length strings, which can include doubled quotes "".
I need to turn them into this:
xxxxx\t123\t123\t123
xxxxx\t234\t234
xxxxx\t234\t224\t234
yyyyy\t123\t234\t224
yyyyy\t234\t234
yyyyy\t234\t224\t234
zzzzz\t123\t456\t565
zzzzz\t234\t774
That is, split the lines on the vertical tabs, prefixing each line with the first field of the line that it came from.
Currently, I'm using a noddy regex, which does at least work, but requires multiple runs and manual checking.
How would I do this using awk or sed ? I've tried adapting the the current answers below, but I'm having trouble finding out what the ;P and ;D suffixes mean.
(Note: I'm using GitBash on Windows, so I guess that's gnu sed and awk ?)
awk -F';' -v OFS='\t' #set Field separator is ';',
'{for(i=1;i<=NF;i++) #then we have 2 fields, remove leading and trailing doubled qoutes
gsub(/^"|"$/,"",$i)
c=split($2,a,"\v") #split by vertical tab, save result in array 'a'
for(i=1;i<=c;i++) #for each element in a, if it is not empty, print field1 (the uuid)
if(a[i])print $1,a[i]}' file #and the element, separated by Tab
explanation is inline.
It outputs:
xxxxx 123 123 123
xxxxx 234 234
xxxxx 234 224 234
yyyyy 123 234 224
yyyyy 234 234
yyyyy 234 224 234
zzzzz 123 456 565
zzzzz 234 774
This might work for you (GNU sed):
sed -r 's/^((\S*\t)\S*)\v/\1\n\2/;P;D' file
Replace each \v by a newline, the first field and a tab. Print and Delete the first line and repeat.
EDIT: As per the new question;
sed -r '/\n/!s/"(")?/\1/g;/\n/!s/;/\t/;s/^((\S*\t)[^\v]*)\v/\1\n\2/;/\t$/!P;D' file
Remove any single double quotes (replace double double quotes by single double quotes) and replace the semi colon by a tab character. Then replace any \v's by a newline and the first field and a tab and repeat.
You can use this awk command for this output:
awk 'BEGIN{FS=OFS="\t"} n = split($2, a, "\x0b") {
for (i=1; i<=n; i++) print $1, a[i]}' file
195a664e-e0d0-4488-99d6-5504f9178115 1234
195a664e-e0d0-4488-99d6-5504f9178115 1412
195a664e-e0d0-4488-99d6-5504f9178115 1231
195a664e-e0d0-4488-99d6-5504f9178115 4324
195a664e-e0d0-4488-99d6-5504f9178115 1421
195a664e-e0d0-4488-99d6-5504f9178115 3214
a1d61289-7864-40e6-83a7-8bdb708c459e 1412
a1d61289-7864-40e6-83a7-8bdb708c459e 6645
a1d61289-7864-40e6-83a7-8bdb708c459e 5334
a1d61289-7864-40e6-83a7-8bdb708c459e 3453
a1d61289-7864-40e6-83a7-8bdb708c459e 5453
How it works:
BEGIN{FS=OFS="\t"} # sets input and output field separator as tab
n = split($2, a, "\x0b") # splits second field using Hex 0B (ASCII 11) i.e. vertical tab
for (i=1; i<=n; i++) ... # prints pair of field 1 with each item from split array a
gnu sed
sed 's/"\|..$//g;s/;/\t/;:r;s/^\([^\t]*\)\t\(.*\)\\v/\1\t\2\n\1\t/;t r;s/\\t/\t/g;' YourFile
recursive replace of \v by first "field" + tab + cleaning extra char on the way
Another solution using awk
awk '
BEGIN{FS="[\v;]"}
{
gsub("[\"]","");
for(i=2; i<=NF; ++i)
if($i) printf "%s\t%s\n", $1, $i;
}' file.mer
Another solution using sed
sed -r 's/\v\n/\v/g; s/"//g;
:a; s/([^;]*);([^\v]*)\v/\1;\2\n\1;/g; ta;
s/;/\t/g;' file.mer | sed -r '/^[^\t]+\t$/d'
you get,
xxxxx 123 123 123
xxxxx 234 234
xxxxx 234 224 234
yyyyy 123 234 224
yyyyy 234 234
yyyyy 234 224 234
zzzzz 123 456 565
zzzzz 234 774
Okay, I waited on purpose until Kent's answer was accepted and awarded the bounty because the question was about awk/sed. Thus, my answer might be somewhat off-topic, but anyway here is my Java solution which I did just for fun as a kata.
MER input file generator:
I thought it would be nice to generate some sample input files with randomised values. Each line consists of
a UUID,
0-9 groups, separated by vertical tabs,
within each group, 1-4 strings, separated by horizontal tabs,
each string consisting of 1-20 characters, among them double quotes escaped by other double quotes, i.e. "".
I thought this is diverse enough to get some good test data.
package de.scrum_master.stackoverflow;
import org.apache.commons.lang.RandomStringUtils;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Random;
import java.util.UUID;
public class RandomFileGenerator {
private static final int BUFFER_SIZE = 1024 * 1024;
private final static Random RANDOM = new Random();
private final static char VERTICAL_TAB = '\u000b';
private final static char[] LEGAL_CHARS =
"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzäöüÄÖÜß. -\""
.toCharArray();
public static void main(String[] args) throws IOException {
long startTime = System.currentTimeMillis();
// final long maxOutputSize = 9L * 1024 * 1024 * 1024;
// final String outputFile = "src/main/resources/sample-9gb.mer";
final long maxOutputSize = 1L * 1024 * 1024;
final String outputFile = "src/main/resources/sample-1mb.mer";
long totalOutputSize = 0;
long lineCount = 0;
String line;
try (PrintWriter writer = new PrintWriter(new BufferedWriter(new FileWriter(outputFile), BUFFER_SIZE))) {
while (totalOutputSize < maxOutputSize) {
line = generateLine();
writer.println(generateLine());
totalOutputSize += line.length() + 1;
lineCount++;
}
}
System.out.println(lineCount);
System.out.println(totalOutputSize);
System.out.println((System.currentTimeMillis() - startTime) / 1000.0);
}
private static String generateLine() {
StringBuilder buffer = new StringBuilder();
buffer
.append('"')
.append(UUID.randomUUID().toString())
.append("\";\"");
int numItems = RANDOM.nextInt(10);
for (int i = 0; i < numItems; i++) {
int numSubItems = 1 + RANDOM.nextInt(4);
for (int j = 0; j < numSubItems; j++) {
buffer.append(
RandomStringUtils.random(1 + RANDOM.nextInt(20), 0, LEGAL_CHARS.length, false, false, LEGAL_CHARS)
.replaceAll("\"", "\"\"")
);
if (j + 1 < numSubItems)
buffer.append('\t');
}
if (i + 1 < numItems) {
buffer.append(VERTICAL_TAB);
}
}
buffer.append('"');
return buffer.toString();
}
}
You can see that it is easy to create test files of the desired file size, e.g.
1 MB: maxOutputSize = 1L * 1024 * 1024
9 GB: maxOutputSize = 9L * 1024 * 1024 * 1024
I mostly used the smaller one to check the algorithms during development and the really big one for performance tuning.
File splitter in 4 different variants:
The variants shown here use different approaches, but in common they have that they read from a BufferedReader with Java streams via reader.lines(). Switching from streams to a simple for loop made it slower, BTW. All solutions write the results into a PrintWriter.
reader.lines().forEach() and then regex matching + splitting. This solution has the best trade-off between readability, brevity and performance.
reader.lines().flatMap(), i.e. using sub-streams for the vertical-tab-separated groups after the UUID, also using regex matching + splitting. This solution is also quite short and elegant, but a little harder to read than #1 and also about 15% slower.
Because regex matching calls like replace() and split() can be pretty expensive, I developed a solution that rather iterates over the string and uses indexOf() and substring() instead of regexes. This was substantially faster than #1 and #2, but the code is harder to read in a way I started to dislike. This should only be done if performance is really important, i.e if the file splitter is used regularly. For a one-time solution or if it runs just once a month, I think it is not really worth it from a maintainability standpoint.
A further optimised version of #3 which avoids some more overhead and is again a little bit faster, but not substantially. Now the code really needs source code comments so as to convey to the reader what the algorithm does. From a clean code perspective this is a nightmare. (Don't do this at home, kids!)
package de.scrum_master.stackoverflow;
import java.io.*;
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class FileSplitter {
private static final int BUFFER_SIZE = 1024 * 1024;
private static final Pattern LINE_PATTERN = Pattern.compile("^\"([^\"]+)\";\"(.*)\"$");
private final static char VERTICAL_TAB = '\u000b';
public static void main(String[] args) throws IOException {
long startTime = System.currentTimeMillis();
String inputFile = "src/main/resources/sample-9gb.mer";
String outputFile = inputFile.replaceFirst("mer$", "txt");
try (
BufferedReader reader = new BufferedReader(new FileReader(inputFile), BUFFER_SIZE);
PrintWriter writer = new PrintWriter(new BufferedWriter(new FileWriter(outputFile), BUFFER_SIZE))
) {
// forEachVariant(reader, writer);
// flatMapVariant(reader, writer);
noRegexSimpleVariant(reader, writer);
// noRegexOptimisedVariant(reader, writer);
}
System.out.println((System.currentTimeMillis() - startTime) / 1000.0);
}
private static void forEachVariant(BufferedReader reader, PrintWriter writer) {
Matcher matcher = LINE_PATTERN.matcher("dummy");
reader.lines()
.forEach(line -> {
matcher.reset(line).matches();
for (String record : matcher.group(2).replace("\"\"", "\"").split("\\v"))
writer.println(matcher.group(1) + "\t" + record);
});
}
private static void flatMapVariant(BufferedReader reader, PrintWriter writer) {
Matcher matcher = LINE_PATTERN.matcher("dummy");
reader.lines()
.flatMap(line -> {
matcher.reset(line).matches();
return Arrays
.stream(matcher.group(2).replace("\"\"", "\"").split("\\v"))
.map(record -> matcher.group(1) + "\t" + record);
})
.forEach(writer::println);
}
private static void noRegexSimpleVariant(BufferedReader reader, PrintWriter writer) {
reader.lines()
.forEach(line -> {
final int lineLength = line.length();
// UUID + '\t'
int indexLeft = 1;
int indexRight = line.indexOf('"', indexLeft);
final String uuid = line.substring(indexLeft, indexRight) + "\t";
indexLeft = indexRight + 3;
String record;
int quoteIndex;
while (indexLeft < lineLength) {
writer.print(uuid);
indexRight = line.indexOf(VERTICAL_TAB, indexLeft);
if (indexRight == -1)
indexRight = lineLength - 1;
while (indexLeft < indexRight) {
quoteIndex = line.indexOf('"', indexLeft);
if (quoteIndex == -1 || quoteIndex >= indexRight)
quoteIndex = indexRight;
else
quoteIndex++;
record = line.substring(indexLeft, quoteIndex);
writer.print(record);
indexLeft = quoteIndex + 1;
}
writer.println();
indexLeft = indexRight + 1;
}
});
}
private static void noRegexOptimisedVariant(BufferedReader reader, PrintWriter writer) throws IOException {
reader.lines()
.forEach(line -> {
// UUID + '\t'
int indexLeft = 1;
int indexRight = line.indexOf('"', indexLeft);
final String uuid = line.substring(indexLeft, indexRight) + "\t";
// Skip '";"' after UUID
indexLeft = indexRight + 3;
final int lineLength = line.length();
String recordChunk;
int quoteIndex;
// If search for '"' has once reached end of line, search no more
boolean doQuoteSearch = true;
// Iterate over records per UUID, separated by vertical tab
while (indexLeft < lineLength) {
writer.print(uuid);
indexRight = line.indexOf(VERTICAL_TAB, indexLeft);
if (indexRight == -1)
indexRight = lineLength - 1;
// Search for '""' within record incrementally, + replace each of them by '"'.
// BTW, if '"' is found, it actually always will be an escaped '""'.
while (indexLeft < indexRight) {
if (doQuoteSearch) {
// Only search for quotes if we never reached the end of line before
quoteIndex = line.indexOf('"', indexLeft);
assert quoteIndex != -1;
if (quoteIndex >= lineLength - 1)
doQuoteSearch = false;
if (quoteIndex >= indexRight)
quoteIndex = indexRight;
else
quoteIndex++;
}
else {
// No more '"' within record
quoteIndex = indexRight;
}
// Write record chunk, skipping 2nd '"'
recordChunk = line.substring(indexLeft, quoteIndex);
writer.print(recordChunk);
indexLeft = quoteIndex + 1;
}
// Do not forget newline before reading next line/UUID
writer.println();
indexLeft = indexRight + 1;
}
});
}
}
Updated awk script:
Besides: Each Java solution writes out one UUID without any content in case there was none in the input file. This would be easy to avoid, but I did it on purpose. This is the only difference to this slightly updated awk script (based on Dave's, but also replaces "" by ") which I used as a benchmark:
#!/usr/bin/awk
{
for(i=1;i<=NF;i++) {
gsub(/^"|"$/,"",$i)
gsub(/""/,"\"",$i)
}
c=split($2,a,"\\v")
for(i=1;i<=c;i++)
print $1,a[i]
}
Performance results:
I measured both parsing and writing performance.
Parsing means reading a 9 GB file from disk and splitting it, but writing the output to /dev/null or not at all.
Writing means reading the same 9 GB file and writing it back onto the same disk partition (hybrid HD + SSD type), i.e. this could be further optimised by writing onto another physical disk. The output file has a size of 18 GB.
Reading file, splitting into lines but not parsing the lines: 66 s
Awk
parsing only: 533 s
parsing + writing: 683 s
reader.lines().forEach() and then regex matching + splitting
parsing only: 212 s
parsing + writing: 425 s
reader.lines().flatMap(), i.e. using sub-streams
parsing only: 245 s
parsing + writing: not measured
Without using regex, but using String.replace("\"\"", "\"") (not shown in code here)
parsing only: 154 s
parsing + writing: 369 s
No regex, no replace(), simple version
parsing only: 86 s
parsing + writing: 342 s
No regex, no replace(), optimised version
parsing only: 84 s
parsing + writing: 342 s
Sorry for the lengthy treatise, but I wanted to share my findings for others reading the question and the other answers, speculating whether Java (or C?) might be faster than awk - yes, it is by a substantial bit, but not by an order of magnitude because disk performance is also a factor. And I think it is a warning to those who tend to over-optimise for optimisation's sake. It is not worth it if you go to far, just try to hit the sweet spot between effort, readability and performance. Amen.
I have a word document which contains 6 series of numbers (plain text, not numbered style) as following:
1) blah blah blah
2) again blah blah blah
.
.
.
20) something
And this pattern has been repeated six times. How can I used Regex and serialise all numbers before parentheses so that they start with 1 and end up with 120?
You can use VBA - add this to the ThisDocument module:
Public Sub FixNumbers()
Dim p As Paragraph
Dim i As Long
Dim realCount As Long
realCount = 1
Set p = Application.ActiveDocument.Paragraphs.First
'Iterate through paragraphs with Paragraph.Next - using For Each doesn't work and I wouldn't trust indexing since we're making changes
Do While Not p Is Nothing
digitCount = 0
For i = 1 To Len(p.Range.Text)
'Keep track of how many characters are in the number
If IsNumeric(Mid(p.Range.Text, i, 1)) Then
digitCount = digitCount + 1
Else
'We check the first non-number character we find to see if it is the list delimiter ")" and we make sure that there were some digits before it
If Mid(p.Range.Text, i, 1) = ")" And digitCount > 0 Then
'If so, we get rid of the original number and put the correct one
p.Range.Text = realCount & Right(p.Range.Text, Len(p.Range.Text) - digitCount) 'It's important to note that a side effect of assigning the text is that p is set to p.Next
'realCount holds the current "real" line number - everytime we assign a line, we increment it
realCount = realCount + 1
Exit For
Else
'If not, we skip the line assuming it's not part of the list numbering
Set p = p.Next
Exit For
End If
End If
Next
Loop
End Sub
You can run it by clicking anywhere inside of the code and clicking the "play" button in the VBA IDE.
I'm hoping someone might know of a script that can take an arbitrary word list and generated the shortest regex that could match that list exactly (and nothing else).
For example, suppose my list is
1231
1233
1234
1236
1238
1247
1256
1258
1259
Then the output should be:
12(3[13468]|47|5[589])
This is an old post, but for the benefit of those finding it through web searches as I did, there is a Perl module that does this, called Regexp::Optimizer, here: http://search.cpan.org/~dankogai/Regexp-Optimizer-0.23/lib/Regexp/Optimizer.pm
It takes a regular expression as input, which can consist just of the list of input strings separated with |, and outputs an optimal regular expression.
For example, this Perl command-line:
perl -mRegexp::Optimizer -e "print Regexp::Optimizer->new->optimize(qr/1231|1233|1234|1236|1238|1247|1256|1258|1259/)"
generates this output:
(?^:(?^:12(?:3[13468]|5[689]|47)))
(assuming you have installed Regex::Optimizer), which matches the OP's expectation quite well.
Here's another example:
perl -mRegexp::Optimizer -e "print Regexp::Optimizer->new->optimize(qr/314|324|334|3574|384/)"
And the output:
(?^:(?^:3(?:[1238]|57)4))
For comparison, an optimal trie-based version would output 3(14|24|34|574|84). In the above output, you can also search and replace (?: and (?^: with just ( and eliminate redundant parentheses, to obtain this:
3([1238]|57)4
You are probably better off saving the entire list, or if you want to get fancy, create a Trie:
1231
1234
1247
1
|
2
/ \
3 4
/ \ \
1 4 7
Now when you take a string check if it reaches a leaf node. It does, it's valid.
If you have variable length overlapping strings (eg: 123 and 1234) you'll need to mark some nodes as possibly terminal.
You can also use the trie to generate the regex if you really like the regex idea:
Nodes from the root to the first branching are fixed (eg: 12)
Branches create |: (eg: 12(3|4)
Leaf nodes generate a character class (or single character) that follows the parent node: (eg 12(3[14]|47))
This might not generate the shortest regex, to do that you'll might some extra work:
"Compact" ranges if you find them (eg [12345] becomes [1-4])
Add quantifiers for repeated elements (eg: [1234][1234] becomes [1234]{2}
???
I really don't think it's worth it to generate the regex.
This project generates a regexp from a given list of words: https://github.com/bwagner/wordhierarchy
It almost does the same as the above JavaScript solution, but avoids certain superfluous parentheses.
It only uses "|", non-capturing group "(?:)" and option "?".
There's room for improvement when there's a row of single characters:
Instead of e.g. (?:3|8|1|6|4) it could generate [38164].
The generated regexp could easily be adapted to other regexp dialects.
Sample usage:
java -jar dist/wordhierarchy.jar 1231 1233 1234 1236 1238 1247 1256 1258 1259
-> 12(?:5(?:6|9|8)|47|3(?:3|8|1|6|4))
Here's what I came up with (JavaScript). It turned a list of 20,000 6-digit numbers into a 60,000-character regular expression. Compared to a naive (word1|word2|...) construction, that's almost 60% "compression" by character count.
I'm leaving the question open, as there's still a lot of room for improvement and I'm holding out hope that there might be a better tool out there.
var list = new listChar("");
function listChar(s, p) {
this.char = s;
this.depth = 0;
this.parent = p;
this.add = function(n) {
if (!this.subList) {
this.subList = {};
this.increaseDepth();
}
if (!this.subList[n]) {
this.subList[n] = new listChar(n, this);
}
return this.subList[n];
}
this.toString = function() {
var ret = "";
var subVals = [];
if (this.depth >=1) {
for (var i in this.subList) {
subVals[subVals.length] = this.subList[i].toString();
}
}
if (this.depth === 1 && subVals.length > 1) {
ret = "[" + subVals.join("") + "]";
} else if (this.depth === 1 && subVals.length === 1) {
ret = subVals[0];
} else if (this.depth > 1) {
ret = "(" + subVals.join("|") + ")";
}
return this.char + ret;
}
this.increaseDepth = function() {
this.depth++;
if (this.parent) {
this.parent.increaseDepth();
}
}
}
function wordList(input) {
var listStep = list;
while (input.length > 0) {
var c = input.charAt(0);
listStep = listStep.add(c);
input = input.substring(1);
}
}
words = [/* WORDS GO HERE*/];
for (var i = 0; i < words.length; i++) {
wordList(words[i]);
}
document.write(list.toString());
Using
words = ["1231","1233","1234","1236","1238","1247","1256","1258","1259"];
Here's the output:
(1(2(3[13468]|47|5[689])))
My time string may be in one of the following formates (x and y - integer numbers, h and m - symbols):
xh ym
xh
ym
y
Examples:
1h 20m
45m
2h
120
What regular expression should I write to get x and y numbers from such string?
(\d+)([mh]?)(?:\s+(\d+)m)?
You can then inspect groups 1-3. For your examples those would be
('1', 'h', '20')
('45', 'm', '')
('2', 'h', '')
('120', '', '')
As always, you might want to use some anchors ^, $, \b...
I'm going to assume you're using .NET due to your username. :)
I think in this case, it's easier to use TimeSpan.ParseExact for this task.
You can specify a list of permitted formats (see here for the format for these) and ParseExact will read in the TimeSpan according to them.
Here is an example:
var formats = new[]{"h'h'", "h'h 'm'm'", "m'm'", "%m"};
// I have assumed that a single number means minutes
foreach (var item in new[]{"23","1h 45m","1h","45m"})
{
TimeSpan timespan;
if (TimeSpan.TryParseExact(item, formats, CultureInfo.InvariantCulture, out timespan))
{
// valid
Console.WriteLine(timespan);
}
}
Output:
00:23:00
01:45:00
01:00:00
00:45:00
The only problem with this is that it is rather inflexible. Additional whitespace in the middle will fail to validate. A more robust solution using Regex is:
var items = new[]{"23","1h 45m", "45m", "1h", "1h 45", "1h 45", "1h45m"};
foreach (var item in items)
{
var match = Regex.Match(item, #"^(?=\d)((?<hours>\d+)h)?\s*((?<minutes>\d+)m?)?$", RegexOptions.ExplicitCapture);
if (match.Success)
{
int hours;
int.TryParse(match.Groups["hours"].Value, out hours); // hours == 0 on failure
int minutes;
int.TryParse(match.Groups["minutes"].Value, out minutes);
Console.WriteLine(new TimeSpan(0, hours, minutes, 0));
}
}
Breakdown of the regex:
^ - start of string
(?=\d) - must start with a digit (do this because both parts are marked optional, but we want to make sure at least one is present)
((?<hours>\d+)h)? - hours (optional, capture into named group)
\s* - whitespace (optional)
((?<minutes>\d+)m?)? - minutes (optional, capture into named group, the 'm' is optional too)
$ - end of string
I would say that mhyfritz' solution is simple, efficient and good if your input is only what you shown.
If you ever need to handle corner cases, you can use a more discriminative expression:
^(\d+)(?:(h)(?:\s+(\d+)(m))?|(m?))$
But it can be overkill...
(get rid of ^ and $ if you need to detect such pattern in a larger body of text, of course).
Try this one: ^(?:(\d+)h\s*)?(?:(\d+)m?)?$:
var s = new[] { "1h 20m", "45m", "2h", "120", "1m 20m" };
foreach (var ss in s)
{
var m = Regex.Match(ss, #"^(?:(\d+)h\s*)?(?:(\d+)m?)?$");
int hour = m.Groups[1].Value == "" ? 0 : int.Parse(m.Groups[1].Value);
int min = m.Groups[2].Value == "" ? 0 : int.Parse(m.Groups[2].Value);
if (hour != 0 || min != 0)
Console.WriteLine("Hours: " + hour + ", Mins: " + min);
else
Console.WriteLine("No match!");
}
in bash
echo $string | awk '{for(i=1;i<=NF;i++) print $i}' | sed s/[hm]/""/g
I have this
/([^\/\|\#\<\(\>\;\s][0-9]*[\s][KB]{2})
in order to be specific i had to use [KB]{2}
I get the value needed, but can I convert the final print to MB?
Exemple: match= 2000KB = 2MB?
Thanks
Sure you can; capture the unit and the number separately like this:
/[^\/\|\#\<\(\>\;\s]([0-9]*)[\s]([KB]{2})
Assuming your original regex is correct, of course. Then:
if ($2 eq "KB" && $1 > 1024) {
$1 /= 1024;
$2 = "MB";
}
lol on the sting "2000 KK" your regex matches
$1 = 000
$2 = KK
better try this one ;)
/(\d+)\s*(KB)/