Regular expression for csv with commas and no quotes - regex

I'm trying to parse a really complicated CSV file, which is generated without any quotes around columns that contain commas.
The only hint I have is that commas with whitespace before or after them are part of the field.
Jake,HomePC,Microsoft VS2010, Microsoft Office 2010
Should be parsed to
Jake
HomePC
Microsoft VS2010, Microsoft Office 2010
Can anybody please advise how to include "\s," and ",\s" in the column body?

If your language supports lookbehind assertions, split on
(?<!\s),(?!\s)
In C#:
string[] splitArray = Regex.Split(subjectString,
@"(?<!\s) # Assert that the previous character isn't whitespace
, # Match a comma
(?!\s) # Assert that the following character isn't whitespace",
RegexOptions.IgnorePatternWhitespace);

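Split on r"(?<!\s),(?!\s)". The first assertion has to be a lookbehind, so that whitespace before the comma is checked as well.
In Python you can do this like so:
import re
re.split(r"(?<!\s),(?!\s)", s) # s is your string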
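As a quick check of the lookaround split, here is a small self-contained Python sketch run against the sample line from the question. The helper name `split_unquoted_csv` is made up for illustration, and the sketch assumes that a comma is a separator only when there is no whitespace directly before or after it.

```python
import re

# A comma separates fields only if it has no whitespace on either side.
FIELD_SEPARATOR = re.compile(r"(?<!\s),(?!\s)")


def split_unquoted_csv(line):
    """Split one line on separator commas (hypothetical helper)."""
    return FIELD_SEPARATOR.split(line)


fields = split_unquoted_csv("Jake,HomePC,Microsoft VS2010, Microsoft Office 2010")
print(fields)
# ['Jake', 'HomePC', 'Microsoft VS2010, Microsoft Office 2010']
```

Python's `re` only allows fixed-width lookbehind, which is why the pattern uses `\s` rather than `\s+` inside the assertions.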

Try this. It gave me the desired result that you mentioned.
StringBuilder testt = new StringBuilder("Jake,HomePC,Microsoft VS2010, Microsoft Office 2010,Microsoft VS2010, Microsoft Office 2010");
// Lookarounds keep the neighbouring characters out of the match, so single-character fields are split too
Pattern varPattern = Pattern.compile("(?<=[a-z0-9]),(?=[a-z0-9])", Pattern.CASE_INSENSITIVE);
Matcher varMatcher = varPattern.matcher(testt);
List<String> list = new ArrayList<String>();
int startIndex = 0, endIndex = 0;
boolean found = false;
while (varMatcher.find()) {
endIndex = varMatcher.start(); // index of the separating comma
if (startIndex == 0) {
list.add(testt.substring(startIndex, endIndex));
} else {
startIndex++;
list.add(testt.substring(startIndex, endIndex));
}
startIndex = endIndex;
found = true;
}
// Add the last field; when no separator was found this is the whole input
list.add(testt.substring(found ? startIndex + 1 : 0));
for (String s : list) {
System.out.println(s);
}
Please note that the code is in Java.

Related

Eliminate newlines in google app script using regex

I'm trying to write part of an add-on for Google Docs that eliminates newlines within selected text using replaceText. The obvious text.replaceText("\n",""); gives the error Invalid argument: searchPattern. I get the same error with text.replaceText("\r","");. The following attempts do nothing: text.replaceText("/\n/","");, text.replaceText("/\r/","");. I don't know why Google Apps Script does not allow for the recognition of newlines in regex.
I am aware that there is an add-on that does this already, but I want to incorporate this function into my add-on.
This error occurs even with the basic
DocumentApp.getActiveDocument().getBody().replaceText("\n","");
My full function:
function removeLineBreaks() {
var selection = DocumentApp.getActiveDocument().getSelection();
if (selection) {
var elements = selection.getRangeElements();
for (var i = 0; i < elements.length; i++) {
var element = elements[i];
// Only deal with text elements
if (element.getElement().editAsText) {
var text = element.getElement().editAsText();
if (element.isPartial()) {
text.replaceText("\n","");
}
// Deal with fully selected text
else {
text.replaceText("\n","");
}
}
}
}
// No text selected
else {
DocumentApp.getUi().alert('No text selected. Please select some text and try again.');
}
}
It seems that in replaceText, to remove soft returns entered with Shift-ENTER, you can use \v:
.replaceText("\\v+", "")
If you want to remove all "other" control characters (C0, DEL and C1 control codes), you may use
.replaceText("\\p{Cc}+", "")
Note that the \v pattern is a construct supported by the JavaScript regex engine, and it is treated as matching a vertical tab character (≡ \013) by the RE2 regex library used in most Google products.
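To illustrate what the Cc class covers outside Apps Script, here is a minimal Python sketch. Python's built-in `re` module does not support `\p{Cc}`, so this version uses `unicodedata` instead; the function name `strip_control_chars` is hypothetical. Like `\p{Cc}+`, it also removes ordinary tabs and newlines, since those are control characters too.

```python
import unicodedata


def strip_control_chars(text):
    """Drop every character in Unicode category Cc (C0, DEL and C1 controls)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cc")


# \x0b is the vertical tab, \x7f is DEL, \x85 is a C1 control code.
print(strip_control_chars("line one\x0bline two\x7f\x85"))
# line oneline two
```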
The Google Apps Script function replaceText() still doesn't accept escape characters, but I was able to get around this by using getText(), then the generic JavaScript replace(), then setText():
var doc = DocumentApp.getActiveDocument();
var body = doc.getBody();
var bodyText = body.getText();
//DocumentApp.getUi().alert( "Does document contain \\t? " + /\t/.test( bodyText ) ); // \n true, \r false, \t true
bodyText = bodyText.replace( /\n/g, "" );
bodyText = bodyText.replace( /\t/g, "" );
body.setText( bodyText );
This worked within a Doc. Not sure if the same is possible within a Sheet (and, even if it were, you'd probably have to run this one cell at a time).
Here is my pragmatic solution for eliminating newlines in Google Docs or, more exactly, for eliminating newlines from Gmail message.getPlainBody().
It looks like Google uses '\r\n\r\n' as a plain EOL and '\r\n' as a manual line feed (Shift-Enter). The code should be self-explanatory.
It might help to get along with the newline problem in Docs.
The solution is possibly not very elegant, but it works like a charm :-)
function GetEmails2Doc() {
var doc = DocumentApp.getActiveDocument();
var body = doc.getBody();
var pc = 0; // Paragraph Counter
var label = GmailApp.getUserLabelByName("_Send2Sheet");
var threads = label.getThreads();
var i = threads.length;
// Loop over the messages within each thread
for (i=threads.length-1; i>=0; i--) {
var messages = threads[i].getMessages();
for (var j = 0; j < messages.length; j++) {
var message = messages[j];
/* Here I do some ...
body.insertParagraph(pc++, Utilities.formatDate(message.getDate(), "GMT",
"dd.MM.yyyy (HH:mm)")).setHeading(DocumentApp.ParagraphHeading.HEADING4)
str = message.getFrom() + ' to: ' + message.getTo();
if (message.getCc().length >0) str = str + ", Cc: " + message.getCc();
if (message.getBcc().length >0) str = str + ", Bcc: " + message.getBcc();
body.insertParagraph(pc++,str);
*/
// Body !!
var str = processBody(message.getPlainBody()).split("pEOL");
Logger.log(str.length + " EOLs");
for (var k=0; k<str.length; k++) body.insertParagraph(pc++,str[k]);
}
}
}
function processBody(tx) {
var s = tx.split(/\r\n\r\n/g);
// it looks like message.getPlainBody() [of mail] uses \r\n\r\n as EOL
// so, I first substitute the 'EOL's with the string pattern "pEOL"
// to be replaced with body.insertParagraph in the main function
tx = '';
for (var k = 0; k < s.length; k++) tx = tx + s[k] + "pEOL";
// then replace all remaining simple \r\n with a blank
s = tx.split(/\r\n/g);
tx = '';
for (var k = 0; k < s.length; k++) tx = tx + s[k] + " ";
return tx;
}
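The two-pass idea in processBody can be sketched more compactly. The following Python version is only an illustration under the same assumption the answer makes (that '\r\n\r\n' ends a paragraph and a lone '\r\n' is a soft line break); that convention is the answer's observation, not documented behaviour, and `split_plain_body` is a made-up name. It returns the paragraphs as a list instead of joining them with a marker string.

```python
def split_plain_body(body):
    """Split a plain-text mail body into paragraphs and unwrap soft breaks."""
    # Pass 1: a blank line (\r\n\r\n) is treated as the end of a paragraph.
    paragraphs = body.split("\r\n\r\n")
    # Pass 2: any remaining single \r\n becomes a space.
    return [paragraph.replace("\r\n", " ") for paragraph in paragraphs]


print(split_plain_body("Hi,\r\nthis is one paragraph.\r\n\r\nSecond one."))
# ['Hi, this is one paragraph.', 'Second one.']
```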
I have now found out through much trial and error -- and some much needed help from Wiktor Stribiżew (see other answer) -- that there is a solution to this, but it relies on the fact that Google Script does not recognise \n or \r in regex searches. The solution is as follows:
function removeLineBreaks() {
var selection = DocumentApp.getActiveDocument()
.getSelection();
if (selection) {
var elements = selection.getRangeElements();
for (var i = 0; i < elements.length; i++) {
var element = elements[i];
// Only deal with text elements
if (element.getElement()
.editAsText) {
var text = element.getElement()
.editAsText();
if (element.isPartial()) {
var start = element.getStartOffset();
var finish = element.getEndOffsetInclusive();
var oldText = text.getText()
.slice(start, finish + 1); // finish is inclusive, slice() excludes its end index
if (oldText.match(/\r/)) {
var number = oldText.match(/\r/g)
.length;
for (var j = 0; j < number; j++) {
var location = oldText.search(/\r/);
text.deleteText(start + location, start + location);
text.insertText(start + location, ' ');
var oldText = oldText.replace(/\r/, ' ');
}
}
}
// Deal with fully selected text
else {
text.replaceText("\\v+", " ");
}
}
}
}
// No text selected
else {
DocumentApp.getUi()
.alert('No text selected. Please select some text and try again.');
}
}
Explanation
Google Docs allows searching for vertical tabs (\v), which match newlines.
Partial text is a whole other problem. The solution to dealing with partially selected text above finds the location of newlines by extracting a text string from the text element and searching in that string. It then uses these locations to delete the relevant characters. This is repeated until the number of newlines in the selected text has been reached.
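The offset bookkeeping for a partial selection can be shown on a plain string. This is a minimal Python sketch, not Apps Script: it assumes the line breaks appear as '\r' in the extracted text (as the code above searches for) and that the end offset is inclusive; `replace_breaks_in_range` is a hypothetical helper name.

```python
def replace_breaks_in_range(text, start, end_inclusive, replacement=" "):
    """Replace '\\r' only between start and end_inclusive (both inclusive)."""
    end = end_inclusive + 1  # convert the inclusive offset to a slice bound
    selected = text[start:end].replace("\r", replacement)
    return text[:start] + selected + text[end:]


# Offsets 2..4 cover "b\rc"; breaks outside that range are left alone.
print(repr(replace_breaks_in_range("a\rb\rc\rd", 2, 4)))
# 'a\rb c\rd'
```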
This Stack Overflow answer removes, specifically, "\n". It may help, it helped me indeed.

Program hangs due to too many regex-based replacements

In my input file, I need to do lots of string manipulations (find/replace) using Regex, depending on various conditions. For example, if one block of the content meets a condition, I need to go back to the previous block and do a replacement there.
For this reason, I split the content into many substrings, so that I can move back to the previous block (here, the previous substring) and do the Regex replacement.
But the program hangs partway through if the file content is large (or maybe if the number of substrings exceeds some limit).
Here is the code snippet.
string content = string.Empty;
string target_content = string.Empty;
string[] active_doc_nos;
byte[] content_bytes;
FileInfo input_fileinfo = new FileInfo(input_file);
long file_length = input_fileinfo.Length;
using (FileStream fs_read = new FileStream(input_file, FileMode.Open, FileAccess.Read))
{
content_bytes = new byte[Convert.ToInt32(file_length)];
fs_read.Read(content_bytes, 0, Convert.ToInt32(file_length));
fs_read.Close();
}
content = ASCIIEncoding.ASCII.GetString(content_bytes);
if (Regex.IsMatch(content, "<\\?CLG.MDFO ([^>]*) LEVEL=\"STRUCTURE\""))
{
#region Logic-1: TWO PAIRS of MDFO-MDFC-s one pair following the other
content = Regex.Replace(content, "(<\\?CLG.MDFO)([^>]*)(LEVEL=\"STRUCTURE\")", "<MDFO_VALIDATOR>$1$2$3");
string[] MDFO_Lines = Regex.Split(content, "<MDFO_VALIDATOR>");
active_doc_nos = new string[MDFO_Lines.GetLength(0)];
active_doc_nos[0] = Regex.Match(MDFO_Lines[0], "ACTIVE DOC=\"([^>]*)\"\\s+").ToString();
for (int i = 1; i < MDFO_Lines.GetLength(0); i++)
{
active_doc_nos[i] = Regex.Match(MDFO_Lines[i], "ACTIVE DOC=\"([^>]*)\"\\s+").ToString();
if (Regex.IsMatch(MDFO_Lines[i - 1], "(<\\?CLG.MDFC)([^>]*)(\\?>)(<\\S*\\s*\\S*>)*$"))
{
MDFO_Lines[i - 1] = Regex.Replace(MDFO_Lines[i - 1], "(<\\?CLG.MDFC)([^>]*)(\\?>)(<\\S*\\s*\\S*>)*$", "<?no_smark?>$1$2$3$4");
if (Regex.IsMatch(MDFO_Lines[i - 1], "^<\\?CLG.MDFO ([^>]*) ACTION=\"DELETED\""))
{
MDFO_Lines[i - 1] = Regex.Replace(MDFO_Lines[i - 1], "^<\\?CLG.MDFO ([^>]*) ACTION=\"DELETED\"", "<?no_bmark?><?CLG.MDFO $1 ACTION=\"DELETED\"");
}
if (active_doc_nos[i] == active_doc_nos[i - 1])
{
MDFO_Lines[i] = Regex.Replace(MDFO_Lines[i], "^<\\?CLG.MDFO ([^>]*) " + active_doc_nos[i], "<?no_smark?><?CLG.MDFO $1 " + active_doc_nos[i]);
}
}
}
foreach (string str_piece in MDFO_Lines)
{
target_content += str_piece;
}
byte[] target_bytes = ASCIIEncoding.ASCII.GetBytes(target_content);
using (FileStream fs_write = new FileStream(input_file, FileMode.Create, FileAccess.Write))
{
fs_write.Write(target_bytes, 0, target_bytes.Length);
fs_write.Close();
}
Do I have any other option to achieve this task??
Hard to say without seeing your data, but I have a suspicion that this part of some of your regexes may be the culprit:
(<\\S*\\s*\\S*>)*
Because \S can also match < and >, because everything is optional, and because you've got nested quantifiers, it's possible that this part of the regex leads to catastrophic backtracking.
What happens if you replace these parts with (?>(<\\S*\\s*\\S*>))*?
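The ambiguity described above is easy to see in isolation. In this Python sketch, one iteration of the loose sub-pattern can swallow two adjacent tags, because `\S` also matches `<` and `>`. The stricter character class is an assumption about what the tags look like (no nested angle brackets), not something taken from the question's data, and it only removes this particular ambiguity.

```python
import re

# Sub-pattern from the question: \S may also consume '<' and '>'.
LOOSE_TAG = re.compile(r"<\S*\s*\S*>")
# Assumed alternative: angle brackets are not allowed inside a tag.
STRICT_TAG = re.compile(r"<[^<>\s]*\s*[^<>\s]*>")

sample = "<a><b>"
print(LOOSE_TAG.fullmatch(sample) is not None)   # True: one "tag" spans both
print(STRICT_TAG.fullmatch(sample) is not None)  # False: a tag ends at the first '>'
```

With the loose version wrapped in `( ... )*`, the engine has many ways to carve the same run of tags into iterations, which is what makes a failing match so expensive.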

c# and regular expression

I want to extract 100 and example from this string:
?connect:100/username:example/
I searched on Google but could not find any useful regex patterns for my problem.
Please help.
try {
Regex RegexObj = new Regex(":(?<Number>\\d+)/.+?:(?<Text>\\w+)/");
Match MatchResults = RegexObj.Match(SubjectString);
while (MatchResults.Success) {
for (int i = 1; i < MatchResults.Groups.Count; i++) {
Group GroupObj = MatchResults.Groups[i];
if (GroupObj.Success) {
}
}
MatchResults = MatchResults.NextMatch();
}
} catch (ArgumentException ex) {
// Syntax error in the regular expression
}
This is the regex:
\?connect:([0-9]+)/username:([^/]*)/
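For reference, the same pattern can be tried out in Python; the two capture groups hold the number and the user name. The function name `parse_connect` is made up for this sketch.

```python
import re

PATTERN = re.compile(r"\?connect:([0-9]+)/username:([^/]*)/")


def parse_connect(text):
    """Return (number, username), or None if the pattern is not found."""
    match = PATTERN.search(text)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


print(parse_connect("?connect:100/username:example/"))
# (100, 'example')
```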
You don't need to use a regex for this; use LINQ:
var url = "?connect:100/username:example/";
var data = url.Substring(1, url.Length-2).Split('/')
.Select(x => x.Split(':'))
.ToDictionary(x => x[0], x => x[1]);
Console.WriteLine(data["connect"]); // 100
Console.WriteLine(data["username"]); // example
You could remove the Substring(1, url.Length-2) call if you got the string back without the starting ? and trailing /.
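The same split-based approach, sketched in Python for comparison. It assumes, like the code above, that the input always starts with `?`, ends with `/`, and that every part contains a colon; `parse_pairs` is a hypothetical name.

```python
def parse_pairs(query):
    """Turn '?key:value/key:value/' into a dictionary."""
    trimmed = query[1:-1]  # drop the leading '?' and the trailing '/'
    return dict(part.split(":", 1) for part in trimmed.split("/"))


data = parse_pairs("?connect:100/username:example/")
print(data["connect"])   # 100
print(data["username"])  # example
```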

GWT Regex and empty string

Could someone explain why this snippet:
// import com.google.gwt.regexp.shared.MatchResult;
// import com.google.gwt.regexp.shared.RegExp;
RegExp regExp = RegExp.compile("^$");
MatchResult matcher;
while ((matcher = regExp.exec("")) != null)
{
System.out.println("match " + matcher);
}
gives an enormous number of matches? I tested the different modifiers allowed by the GWT implementation of compile(): g, i and m. It works only with m (multiline).
I just want to check for empty string.
[EDIT] the new method
private ArrayList<MatchResult> getMatches(String input, String pattern)
{
ArrayList<MatchResult> matches = new ArrayList<MatchResult>();
if(null == regExp)
{
regExp = RegExp.compile(pattern, "g");
}
if(input.isEmpty())
{
// empty string: just check whether the pattern validates and
// don't try to extract matches: it will result in an infinite
// loop.
if(regExp.test(input))
{
matches.add(new MatchResult(0, "", new ArrayList<String>(0)));
}
}
else
{
for(MatchResult matcher = regExp.exec(input); matcher != null; matcher = regExp
.exec(input))
{
matches.add(matcher);
}
}
return matches;
}
Your regExp.exec("") with RegExp.compile("^$") will never return null, as the empty string "" is a match for regex ^$, which reads "nothing between beginning and the end of line/string".
So your while loop is infinite.
Also, your print statement is
System.out.println("match " + matcher);
...but you probably wanted to use
System.out.println("match " + matcher.getGroup(0));
Also see GWT checking if textbox is empty.
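The underlying point is that checking for an empty string needs one test, not a loop over matches. Here is a small Python sketch of that idea (Python, not GWT, so the API differs): `is_empty` is a hypothetical helper, and `\A\Z` is used instead of `^$` because in Python `$` can also match just before a trailing newline.

```python
import re

EMPTY = re.compile(r"\A\Z")


def is_empty(text):
    """True only for the empty string; a single test, no match loop."""
    return EMPTY.match(text) is not None


# An iterator that advances past each zero-width match terminates:
matches = list(re.finditer(r"^$", ""))
print(len(matches))  # 1
print(is_empty(""), is_empty("x"))  # True False
```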

How do I check if a filename matches a wildcard pattern

I've got a wildcard pattern, perhaps "*.txt" or "POS??.dat".
I also have list of filenames in memory that I need to compare to that pattern.
How would I do that, keeping in mind I need exactly the same semantics that IO.DirectoryInfo.GetFiles(pattern) uses.
EDIT: Blindly translating this into a regex will NOT work.
I have a complete answer in code for you that's 95% like FindFiles(string).
The 5% that isn't there is the short names/long names behavior in the second note on the MSDN documentation for this function.
If you would still like to get that behavior, you'll have to complete a computation of the short name of each string you have in the input array, and then add the long name to the collection of matches if either the long or short name matches the pattern.
Here is the code:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
namespace FindFilesRegEx
{
class Program
{
static void Main(string[] args)
{
string[] names = { "hello.t", "HelLo.tx", "HeLLo.txt", "HeLLo.txtsjfhs", "HeLLo.tx.sdj", "hAlLo20984.txt" };
string[] matches;
matches = FindFilesEmulator("hello.tx", names);
matches = FindFilesEmulator("H*o*.???", names);
matches = FindFilesEmulator("hello.txt", names);
matches = FindFilesEmulator("lskfjd30", names);
}
public static string[] FindFilesEmulator(string pattern, string[] names)
{
List<string> matches = new List<string>();
Regex regex = FindFilesPatternToRegex.Convert(pattern);
foreach (string s in names)
{
if (regex.IsMatch(s))
{
matches.Add(s);
}
}
return matches.ToArray();
}
internal static class FindFilesPatternToRegex
{
private static Regex HasQuestionMarkRegEx = new Regex(@"\?", RegexOptions.Compiled);
private static Regex IllegalCharactersRegex = new Regex("[" + @"\/:<>|" + "\"]", RegexOptions.Compiled);
private static Regex CatchExtentionRegex = new Regex(@"^\s*.+\.([^\.]+)\s*$", RegexOptions.Compiled);
private static string NonDotCharacters = @"[^.]*";
public static Regex Convert(string pattern)
{
if (pattern == null)
{
throw new ArgumentNullException();
}
pattern = pattern.Trim();
if (pattern.Length == 0)
{
throw new ArgumentException("Pattern is empty.");
}
if(IllegalCharactersRegex.IsMatch(pattern))
{
throw new ArgumentException("Pattern contains illegal characters.");
}
bool hasExtension = CatchExtentionRegex.IsMatch(pattern);
bool matchExact = false;
if (HasQuestionMarkRegEx.IsMatch(pattern))
{
matchExact = true;
}
else if(hasExtension)
{
matchExact = CatchExtentionRegex.Match(pattern).Groups[1].Length != 3;
}
string regexString = Regex.Escape(pattern);
regexString = "^" + Regex.Replace(regexString, @"\\\*", ".*");
regexString = Regex.Replace(regexString, @"\\\?", ".");
if(!matchExact && hasExtension)
{
regexString += NonDotCharacters;
}
regexString += "$";
Regex regex = new Regex(regexString, RegexOptions.Compiled | RegexOptions.IgnoreCase);
return regex;
}
}
}
}
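The part of the conversion that is easiest to miss is the extension rule: a pattern without `?` whose extension is exactly three characters also matches longer extensions that start with those three characters. Below is a rough Python port of just that logic, as a sketch; it follows the C# above rather than the Windows documentation, leaves out the illegal-character check, and `find_files_pattern_to_regex` is a made-up name. It relies on `re.escape` escaping `*` and `?` with a backslash.

```python
import re

_EXTENSION = re.compile(r"^\s*.+\.([^.]+)\s*$")


def find_files_pattern_to_regex(pattern):
    """Rough port of the Convert method above (illustrative only)."""
    pattern = pattern.strip()
    if not pattern:
        raise ValueError("Pattern is empty.")
    ext = _EXTENSION.match(pattern)
    if "?" in pattern:
        match_exact = True
    elif ext is not None:
        # Only a three-character extension gets the "longer extension" rule.
        match_exact = len(ext.group(1)) != 3
    else:
        match_exact = False
    body = re.escape(pattern).replace(r"\*", ".*").replace(r"\?", ".")
    if not match_exact and ext is not None:
        body += "[^.]*"  # allow extra non-dot characters after the extension
    return re.compile(body, re.IGNORECASE)


names = ["hello.t", "HelLo.tx", "HeLLo.txt", "HeLLo.txtsjfhs", "HeLLo.tx.sdj"]
regex = find_files_pattern_to_regex("*.txt")
print([n for n in names if regex.fullmatch(n)])
# ['HeLLo.txt', 'HeLLo.txtsjfhs']
```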
You can simply do this. You do not need regular expressions.
using Microsoft.VisualBasic.CompilerServices;
if (Operators.LikeString("pos123.txt", "pos?23.*", CompareMethod.Text))
{
Console.WriteLine("Filename matches pattern");
}
Or, in VB.Net,
If "pos123.txt" Like "pos?23.*" Then
Console.WriteLine("Filename matches pattern")
End If
In c# you could simulate this with an extension method. It wouldn't be exactly like VB Like, but it would be like...very cool.
You could translate the wildcards into a regular expression:
*.txt -> ^.*\.txt$
POS??.dat -> ^POS..\.dat$
Use the Regex.Escape method to escape the characters that are not wildcards into literal strings for the pattern (e.g. converting ".txt" to "\.txt").
The wildcard * translates into .* (it can match zero characters), and ? translates into .
Put ^ at the beginning of the pattern to match the beginning of the string, and $ at the end to match the end of the string.
Now you can use the Regex.IsMatch method to check if a file name matches the pattern.
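The translation steps above can be sketched in a few lines of Python. This is a plain wildcard match and does not reproduce the special cases of GetFiles; the names `wildcard_to_regex` and `matches` are hypothetical, `*` is mapped to `.*` so that it can match zero characters, and the case-insensitive flag is an assumption based on Windows file name conventions.

```python
import re


def wildcard_to_regex(pattern):
    """Escape the literal parts, then turn * and ? back into regex wildcards."""
    escaped = re.escape(pattern)  # '*' becomes '\*', '?' becomes '\?'
    body = escaped.replace(r"\*", ".*").replace(r"\?", ".")
    return re.compile(body, re.IGNORECASE)


def matches(filename, pattern):
    # fullmatch anchors at both ends, like ^...$ in the description above.
    return wildcard_to_regex(pattern).fullmatch(filename) is not None


print(matches("notes.TXT", "*.txt"))       # True
print(matches("POS12.dat", "POS??.dat"))   # True
print(matches("POS123.dat", "POS??.dat"))  # False
```

Python's standard library also ships the `fnmatch` module for shell-style patterns, which may be enough when exact Windows semantics are not required.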
Just call the Windows API function PathMatchSpecExW().
[Flags]
public enum MatchPatternFlags : uint
{
Normal = 0x00000000, // PMSF_NORMAL
Multiple = 0x00000001, // PMSF_MULTIPLE
DontStripSpaces = 0x00010000 // PMSF_DONT_STRIP_SPACES
}
class FileName
{
[DllImport("Shlwapi.dll", SetLastError = false)]
static extern int PathMatchSpecExW([MarshalAs(UnmanagedType.LPWStr)] string file,
[MarshalAs(UnmanagedType.LPWStr)] string spec,
MatchPatternFlags flags);
/*******************************************************************************
* Function: MatchPattern
*
* Description: Matches a file name against one or more file name patterns.
*
* Arguments: file - File name to check
* spec - Name pattern(s) to search for
* flags - Flags to modify search condition (MatchPatternFlags)
*
* Return value: Returns true if name matches the pattern.
*******************************************************************************/
public static bool MatchPattern(string file, string spec, MatchPatternFlags flags)
{
if (String.IsNullOrEmpty(file))
return false;
if (String.IsNullOrEmpty(spec))
return true;
int result = PathMatchSpecExW(file, spec, flags);
return (result == 0);
}
}
Some kind of regex/glob is the way to go, but there are some subtleties; your question indicates you want identical semantics to IO.DirectoryInfo.GetFiles. That could be a challenge, because of the special cases involving 8.3 vs. long file names and the like. The whole story is on MSDN.
If you don't need an exact behavioral match, there are a couple of good SO questions:
glob pattern matching in .NET
How to implement glob in C#
For anyone who comes across this question now that it is years later, I found over at the MSDN social boards that the GetFiles() method will accept * and ? wildcard characters in the searchPattern parameter. (At least in .Net 3.5, 4.0, and 4.5)
Directory.GetFiles(string path, string searchPattern)
http://msdn.microsoft.com/en-us/library/wz42302f.aspx
Please try the code below.
static void Main(string[] args)
{
string _wildCardPattern = "*.txt";
List<string> _fileNames = new List<string>();
_fileNames.Add("text_file.txt");
_fileNames.Add("csv_file.csv");
Console.WriteLine("\nFilenames that matches [{0}] pattern are : ", _wildCardPattern);
foreach (string _fileName in _fileNames)
{
CustomWildCardPattern _patetrn = new CustomWildCardPattern(_wildCardPattern);
if (_patetrn.IsMatch(_fileName))
{
Console.WriteLine("{0}", _fileName);
}
}
}
public class CustomWildCardPattern : Regex
{
public CustomWildCardPattern(string wildCardPattern)
: base(WildcardPatternToRegex(wildCardPattern))
{
}
public CustomWildCardPattern(string wildcardPattern, RegexOptions regexOptions)
: base(WildcardPatternToRegex(wildcardPattern), regexOptions)
{
}
private static string WildcardPatternToRegex(string wildcardPattern)
{
string patternWithWildcards = "^" + Regex.Escape(wildcardPattern).Replace("\\*", ".*");
patternWithWildcards = patternWithWildcards.Replace("\\?", ".") + "$";
return patternWithWildcards;
}
}
For searching against a specific pattern, it might be worth using File Globbing which allows you to use search patterns like you would in a .gitignore file.
See here: https://learn.microsoft.com/en-us/dotnet/core/extensions/file-globbing
This allows you to add both inclusions & exclusions to your search.
Please see below the example code snippet from the Microsoft Source above:
Matcher matcher = new Matcher();
matcher.AddIncludePatterns(new[] { "*.txt" });
IEnumerable<string> matchingFiles = matcher.GetResultsInFullPath(filepath);
The use of RegexOptions.IgnoreCase will fix it.
public class WildcardPattern : Regex {
public WildcardPattern(string wildCardPattern)
: base(ConvertPatternToRegex(wildCardPattern), RegexOptions.IgnoreCase) {
}
public WildcardPattern(string wildcardPattern, RegexOptions regexOptions)
: base(ConvertPatternToRegex(wildcardPattern), regexOptions) {
}
private static string ConvertPatternToRegex(string wildcardPattern) {
string patternWithWildcards = Regex.Escape(wildcardPattern).Replace("\\*", ".*");
patternWithWildcards = string.Concat("^", patternWithWildcards.Replace("\\?", "."), "$");
return patternWithWildcards;
}
}