Regex to remove double-quotes in CSV fields that are delineated by double-quotes


This is for a VB.NET project. My existing method converts a comma-delimited file to a pipe-delimited file. It got a little challenging because some of the fields had commas within them, so those fields had double-quotes around the field contents.
Here's the working code (thanks a million to The Blue Dog for the research on this):
Imports System.IO
Imports System.Text.RegularExpressions

Private Function ConvertCommaSepToPipeSep() As Boolean
    Dim line, result As String
    ' Matches a comma plus the field that follows it; the field may contain
    ' one quoted run, and that quoted run is allowed to hold commas.
    Dim pattern As String = ",([^,""]*(?:""[^""]*"")?[^,""]*)(?=,|$)"
    Dim replacement As String = "|$1"
    Dim rgx As New Regex(pattern)
    'Console.WriteLine("Conversion start time: " & DateTime.Now.ToLongTimeString())
    Try
        Using sw As New StreamWriter("output.csv")
            Using sr As New StreamReader("source.csv")
                While Not sr.EndOfStream
                    line = sr.ReadLine
                    result = rgx.Replace(line, replacement)
                    ' Strip every remaining double-quote (Chr(34)) from the converted line.
                    sw.WriteLine(result.Replace(Chr(34), ""))
                End While
            End Using
        End Using
    Catch ex As Exception
        MessageBox.Show("There was a problem converting the file." & vbCrLf & ex.Message)
        Return False
    End Try
    'Console.WriteLine("Conversion end time: " & DateTime.Now.ToLongTimeString())
    Return True
End Function
I found out, however, that some of the fields have double-quotes within them as well.
Here are some sample lines from the source file that I am converting.
122749,JOHN DOE,ACS155,7/5/2014,P,SCH/RC Activation Week 2,HRLY,1299577,Scheduler IT,2204,CVISA-Client Activation,1220000,Svcs Clin Implement,34
110310,JANE DOE,ACS150,2/8/2014,P,"Developed Employee Interface""",HRLY,1267305,Project Management - Client Implementation Services,2500,PJM -Project Management,1410000,Tech Services Development,8
110310,MARY DOE,ACS160,2/8/2014,P,EDManage+ CSV data extract,HRLY,1527401,Project Management - Client Implementation Services,2500,PJM -Project Management,1410000,Tech Services Development,8
129084,ROBERT SMITH,ACS80,9/27/2014,P,,PTO,0,Company General Services,1030,"Time Off - PTO, Holiday, Personal Holiday, FTO",1100000,Client Services Technical,40
117592,HARRY JOHNSON,ACS64,5/10/2014,P,"helped penny post AP ""E"" cks",HRLY,1554404,General Financials IT,2120,CCON-Client Conference Call,1100000,Client Services Technical,1.5
110310,MARK WILSON,ACS130,2/8/2014,P,"""Charge Vs Payment""",HRLY,1267305,Project Management - Clinical Implementation Services,2500,PJM -Project Management,1410000,Tech Services Development,8
Those same rows need to be converted to look like this:
122749|JOHN DOE|ACS155|7/5/2014|P|SCH/RC Activation Week 2|HRLY|1299577|Scheduler IT|2204|CVISA-Client Activation|1220000|Svcs Clin Implement|34
110310|JANE DOE|ACS150|2/8/2014|P|Developed Employee Interface""|HRLY|1267305|Project Management - Client Implementation Services|2500|PJM -Project Management|1410000|Tech Services Development|8
110310|MARY DOE|ACS160|2/8/2014|P|EDManage+ CSV data extract|HRLY|1527401|Project Management - Client Implementation Services|2500|PJM -Project Management|1410000|Tech Services Development|8
129084|ROBERT SMITH|ACS80|9/27/2014|P||PTO|0|Company General Services|1030|Time Off - PTO, Holiday, Personal Holiday, FTO|1100000|Client Services Technical|40
117592|HARRY JOHNSON|ACS64|5/10/2014|P|helped penny post AP E cks|HRLY|1554404|General Financials IT|2120|CCON-Client Conference Call|1100000|Client Services Technical|1.5
110310|MARK WILSON|ACS130|2/8/2014|P|Charge Vs Payment|HRLY|1267305|Project Management - Clinical Implementation Services|2500|PJM -Project Management|1410000|Tech Services Development|8
In this CSV, fields that contain commas are wrapped in double-quotes, and the regex above accounts for that. But I found out that some fields also contain double-quotes of their own. Any double-quotes inside a field can be removed. However, a field can also start or end with a double-quote, which produces three double-quotes in a row, and I can't simply strip every double-quote up front because the outer ones mark where fields that contain commas start and end.
What needs to be added to the regex to do that?

The "" are supposed to be converted to a single ". Are you sure you want to remove them completely?
– nhahtdh
Can't you just csvString = csvString.Replace( ... ) before you run the RE
– Alex K.
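
As an alternative to extending the regex, a CSV parser can handle the quoting rules, after which any quotes left inside a field are simply stripped. The following is a minimal sketch in Python (not VB.NET), shown only to illustrate the idea; `convert_line` is a hypothetical name, and the sketch assumes no field contains a line break or a pipe character. It removes every double-quote from the field text, so a field such as "Developed Employee Interface""" comes out with no trailing quotes.

```python
import csv


def convert_line(line: str) -> str:
    """Convert one comma-separated line to pipe-separated, dropping all double-quotes."""
    # csv.reader understands quoted fields and doubled ("") quotes inside them.
    fields = next(csv.reader([line]))
    # Remove any literal quotes that survive parsing, as the question asks.
    return "|".join(field.replace('"', "") for field in fields)


if __name__ == "__main__":
    sample = '117592,HARRY JOHNSON,ACS64,5/10/2014,P,"helped penny post AP ""E"" cks",HRLY'
    print(convert_line(sample))
```

If a single literal quote should be kept, as the first comment suggests, dropping the `replace` call leaves the parsed field text untouched.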

Related

Can I make Excel save after a specific day of the week?

My idea is that I want my workbook to save automatically the first time it is opened after each Sunday. So if I open the workbook on Monday morning, it should save the workbook to a folder with the new week number in the name, every week.
My first thought was to do it with IF statements, but I'm not sure that's the way.

If you wish to go the VBA route, you can start with something like this:
First, save your initial workbook as filename.xlsm (Excel with macros enabled). Otherwise nothing will work.
Then open the VBA editor with ALT-F11. Double-click "ThisWorkbook" in the project tree and create a Workbook_Open event macro.
You can use this code as a skeleton:
Const myBaseName As String = "opopen"
Const myBasePath As String = "c:\temp\"

Private Sub Workbook_Open()
    Dim d As String, newname As String
    ' get a new date, formatted as a timestamp accurate to seconds
    d = Format(Now(), "yyyymmdd_hhnnss")
    newname = myBasePath & myBaseName & "_" & d & ".xlsm"
    MsgBox "NEW NAME IS ==> " & newname, vbOKOnly, "Information"
    ActiveWorkbook.SaveAs newname
End Sub
Obviously you can / should add some logic to make this change file only once per week. Use some date formatting to get week number, check file existence etc.
In my example, I make a new filename based on time, accurate to seconds - to prove the concept.
The weeknumber can be acquired using
Dim wk As Integer
Dim wks As String
wk = Application.WorksheetFunction.WeekNum(Now())
wks = CStr(wk) ' as string
If wk < 10 Then
    wks = "0" & wk
End If
' use wks for weeknumbers, formatted to two digits.
The first time you open this file you will have to confirm activation of macros. If you do SaveAs from VBA, you should know that:
- you immediately work with the new filename; you do not "save a copy as"
- the new file will have VBA macros enabled as well
- if you rename the file from Windows, you will have to reconfirm that macros are enabled.
Is this enough to get you started ?
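
The "only once per week" logic mentioned above can be sketched independently of Excel. The snippet below is in Python rather than VBA and is only an illustration: `weekly_filename` and `needs_weekly_save` are hypothetical names, and it assumes ISO week numbering (weeks start on Monday), which can differ from the default numbering of Excel's WEEKNUM function. The idea is to derive the file name from the year and week, then save only when that file does not exist yet.

```python
from datetime import date
from pathlib import Path


def weekly_filename(base_path: str, base_name: str, today: date) -> Path:
    """Build a file name that changes once per ISO week, e.g. opopen_2017_W10.xlsm."""
    iso_year, iso_week, _ = today.isocalendar()
    return Path(base_path) / f"{base_name}_{iso_year}_W{iso_week:02d}.xlsm"


def needs_weekly_save(base_path: str, base_name: str, today: date) -> bool:
    """True when this week's copy has not been written yet."""
    return not weekly_filename(base_path, base_name, today).exists()


if __name__ == "__main__":
    print(weekly_filename("c:/temp", "opopen", date.today()))
```

In the Workbook_Open handler the equivalent check would run before SaveAs, so reopening the file later in the same week does nothing.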

How to insert a new line after each occurrence of a particular format in a text field

I have a system that I can output a spreadsheet from. I then take this outputted spreadsheet and import it into MS Access. There, I run some basic update queries before merging the final result into a SharePoint 2013 Linked List.
The spreadsheet I output has an unfortunate Long Text field which has some comments in it, which are vital. On the system that hosts the spreadsheet, these comments are nicely formatted. When the spreadsheet is output though, the field turns into a long, very unpretty string like so:
09:00 on 01/03/2017, Firstname Surname. :- Have responded to request for more information. 15:12 on 15/02/2017, Firstname Surname. :- Need more information to progress request. 17:09 on 09/02/2017, Firstname Surname. :- Have placed request.
What I would like to do is run a query (either in MS Access or MS Excel) which can scan this field, detect occurrences of "##:## on ##/##/####, Firstname Surname. :-" and then automatically insert a line break before them, so this text is more neatly formatted. It would obviously skip the first occurrence of this format, as otherwise it would enter a new line at the start of the field. Ideal end result would be:
09:00 on 01/03/2017, Firstname Surname. :- Have responded to request
for more information.
15:12 on 15/02/2017, Firstname Surname. :- Need more information to progress request.
17:09 on 09/02/2017, Firstname Surname. :- Have placed request.
To be honest, I haven't tried much myself so far, as I really don't know where to start. I don't know if this can be done without regular expressions, or within a simple query versus VBA code.
I did start building a regular expression, like so:
[0-9]{2}:[0-9]{2}\s[o][n]\s[0-9]{2}\/[0-9]{2}\/[0-9]{4}\,\s
But this looks a little ridiculous and I'm fairly certain I'm going about it in a very unnecessary way. From what I can see from the text, detecting the next occurrence of "##:## on ##/##/####" should be enough. If I take a new line after this, that will suffice.

You have your RegExp pattern; now you need a function that inserts your extra delimiter in front of each item it finds.
Look at the function below. It takes your long string, finds each date-stamp using your pattern, and puts your delimiter before it.
Ideally, I would run each line twice and add delimiters after each column so you have a string like
datestamp;firstname lastname;comment
You can then use arr = VBA.Split(text, ";") to get your data into an array and use it as
dateStamp = arr(0)
personName = arr(1)
comment = arr(2)
Public Function FN_REGEX_REPLACE(iText As String, iPattern As String, iDelimiter As String) As String
    Dim objRegex As Object
    Dim allmatches As Variant
    Dim I As Long
    On Error GoTo FN_REGEX_REPLACE_Error
    Set objRegex = CreateObject("vbscript.regexp")
    With objRegex
        .Multiline = True
        .Global = True
        .IgnoreCase = True
        .Pattern = iPattern
        If .test(iText) Then
            Set allmatches = .Execute(iText)
            If allmatches.count > 0 Then
                For I = 1 To allmatches.count - 1 ' for i = 0 to count will start from first match
                    iText = VBA.Replace(iText, allmatches.item(I), iDelimiter & allmatches.item(I))
                Next I
            End If
        End If
    End With
    FN_REGEX_REPLACE = Trim(iText)
    Set objRegex = Nothing
    On Error GoTo 0
    Exit Function
FN_REGEX_REPLACE_Error:
    MsgBox Err.description
End Function
Use the function above like this:
mPattern = "[0-9]{2}:[0-9]{2}\s[o][n]\s[0-9]{2}\/[0-9]{2}\/[0-9]{4}\,"
replacedText = FN_REGEX_REPLACE(originalText,mPattern,vbnewline)
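
The same idea can also be expressed as a single substitution: replace the whitespace in front of every date-stamp with a line break. The first stamp sits at the start of the field with no whitespace before it, so it is skipped automatically. This is a sketch in Python rather than VBA, for illustration only; `break_before_stamps` is a hypothetical name, and the default CRLF line break is an assumption based on the text ending up in Access.

```python
import re

# Whitespace that is immediately followed by "##:## on ##/##/####,".
STAMP_GAP = re.compile(r"\s+(?=\d{2}:\d{2} on \d{2}/\d{2}/\d{4},)")


def break_before_stamps(text: str, newline: str = "\r\n") -> str:
    """Start a new line before every date-stamp except one at the very start."""
    return STAMP_GAP.sub(newline, text.strip())


if __name__ == "__main__":
    field = (
        "09:00 on 01/03/2017, Firstname Surname. :- Have responded. "
        "15:12 on 15/02/2017, Firstname Surname. :- Need more information."
    )
    print(break_before_stamps(field, "\n"))
```

Because the match is keyed on position rather than on the matched text, two entries with an identical date-stamp are each handled exactly once.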

Excel uses LF for linebreaks, Access uses CRLF.
So it should suffice to run a simple replacement query:
UPDATE myTable
SET LongTextField = Replace([LongTextField], Chr(10), Chr(13) & Chr(10))
WHERE <...>
You need to make sure that this runs only once on newly imported records, not repeatedly on all records.
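
The "only once" caveat exists because a plain LF-to-CRLF replacement run twice turns CRLF into CR CR LF. A replacement that matches either form and always writes CRLF is safe to repeat. The sketch below shows that idea in Python for illustration only (`to_crlf` is a hypothetical name); it is not a drop-in replacement for the Access query above.

```python
import re


def to_crlf(text: str) -> str:
    """Normalise line breaks to CRLF; running it again changes nothing."""
    # Match an existing CRLF or a lone LF, and emit CRLF either way.
    return re.sub(r"\r?\n", "\r\n", text)


if __name__ == "__main__":
    print(repr(to_crlf("first\nsecond\r\nthird")))
```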

Find-Replace text contained in textboxes and tables

I'm hoping I can get some help from a programmer.
What I want to do is translate a Word report generated by a piece of software, so I turned to macros. I already have a Word file containing the original words/phrases and the translated ones.
I 'stole' the code to translate from some forum online, which works great with normal text. My problem is that the text of the report I want to translate is within various "text boxes" and "tables".
I was able to manually remove the tables, but keep the text. This totally ruined the formatting, but I can deal with that later.
Now, unfortunately I cannot do the same with text boxes. There is no "delete, but keep the text" function for text boxes.
I can send you the macro code, the original report automatically generated by the software and the file to get all translated words from.
I really appreciate your time.

Ok. This is code that translates normal text.
Sub Translate()
    Dim oChanges As Document, oDoc As Document
    Dim oTable As Table
    Dim oRng As Range
    Dim rFindText As Range, rReplacement As Range
    Dim i As Long
    Dim sFname As String
    'Change the path in the line below to reflect the path of the table document
    sFname = "C:\Users\user\Desktop\Dictionary.doc"
    Set oDoc = ActiveDocument
    Set oChanges = Documents.Open(FileName:=sFname, Visible:=False)
    Set oTable = oChanges.Tables(1)
    For i = 1 To oTable.Rows.Count
        Set oRng = oDoc.Range
        Set rFindText = oTable.Cell(i, 1).Range
        rFindText.End = rFindText.End - 1
        Set rReplacement = oTable.Cell(i, 2).Range
        rReplacement.End = rReplacement.End - 1
        With oRng.Find
            .ClearFormatting
            .Replacement.ClearFormatting
            Do While .Execute(findText:=rFindText, _
                              MatchWholeWord:=True, _
                              MatchWildcards:=False, _
                              Forward:=True, _
                              Wrap:=wdFindContinue) = True
                oRng.Text = rReplacement
            Loop
        End With
    Next i
    oChanges.Close wdDoNotSaveChanges
End Sub
I'm guessing you'd need to see the format of the document that is being translated, which contains all the tables and text boxes. But it is too large, and I'm not sure whether I can send it as an attachment here (sorry, it's my first time on this forum). Any advice?
Thanks a lot
JD

Regular Expression Rules in Outlook 2007?

Is it possible to create rules in Outlook 2007 based on a regex string?
I'm trying to add a filter for messages containing a string such as: 4000-10, a four digit number followed by a dash and then a two digit number, which can be anything from 0000-00 to 9999-99.
I was using this as a regex: \b[0-9]{4}\-[0-9]{2}\b but the filter isn't working. I've tried a few other modifications as well with no luck. I wasn't able to find anything concrete online about whether Outlook even supports entering regexes into a rule, though, so I figured I would ask here in case I'm wasting my time.
EDIT: Thanks to Chris's comment below, I was able to implement this filter via a macro. I thought I would share my code below in case it is able to help anyone else:
Sub JobNumberFilter(Message As Outlook.MailItem)
    Dim MatchesSubject, MatchesBody
    Dim RegEx As New RegExp
    'e.g. 1000-10'
    RegEx.Pattern = "([0-9]{4}-[0-9]{2})"
    'Check for pattern in subject and body'
    If (RegEx.Test(Message.Subject) Or RegEx.Test(Message.Body)) Then
        Set MatchesSubject = RegEx.Execute(Message.Subject)
        Set MatchesBody = RegEx.Execute(Message.Body)
        If Not (MatchesSubject Is Nothing And MatchesBody Is Nothing) Then
            'Assign "Job Number" category'
            Message.Categories = "Job Number"
            Message.Save
        End If
    End If
End Sub
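
Note that the pattern in the macro has no \b word boundaries, unlike the one first tried in the rule, so it also fires on longer digit runs such as 14000-10 or 4000-100. The sketch below shows the effect of the boundaries. It is written in Python only so the behaviour is easy to check, `has_job_number` is a hypothetical name, and it assumes the VBScript regex engine treats \b the same way for plain ASCII digits.

```python
import re

# Exactly four digits, a dash, exactly two digits, not embedded in a longer word.
JOB_NUMBER = re.compile(r"\b[0-9]{4}-[0-9]{2}\b")


def has_job_number(subject: str, body: str = "") -> bool:
    """True when the subject or the body contains a stand-alone job number."""
    return bool(JOB_NUMBER.search(subject) or JOB_NUMBER.search(body))


if __name__ == "__main__":
    for text in ("Re: job 4000-10 update", "ref 14000-10", "order 4000-100"):
        print(text, "->", has_job_number(text))
```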

I do not know if a regex can be used directly in a rule, but you can have a rule trigger a script and the script can use regexes. I hate Outlook.
First, you have to open the script editor via Tools - Macro - Open Visual Basic Editor (Alt-F11 is the shortcut).
The editor will open. It should contain a project outline in a small panel in the top-left corner. The project will be listed as VBAProject.OTM. Expand this item to reveal Microsoft Office Outlook Objects. Expand that to reveal ThisOutlookSession. Double-click ThisOutlookSession to open the code editing pane (which will probably be blank).
Next select Tools menu | References and enable the RegExp references called something like "Microsoft VBScript Regular Expressions 5.5"
You can now create a subroutine to perform your filtering action. Note that a subroutine called by a rule must have a single parameter of type Outlook.MailItem. For example:
' note that Stack Overflow's syntax highlighting doesn't understand VBScript's
' comment character (the single quote) - it treats it as a string delimiter. To
' make the code appear correctly, each comment must be closed with another single
' quote so that the syntax highlighter will stop coloring everything as a string.'
Public Enum Actions
    ACT_DELIVER = 0
    ACT_DELETE = 1
    ACT_QUARANTINE = 2
End Enum

Sub MyNiftyFilter(Item As Outlook.MailItem)
    Dim Matches, Match
    Dim RegEx As New RegExp
    RegEx.IgnoreCase = True
    ' assume mail is good'
    Dim Message As String: Message = ""
    Dim Action As Actions: Action = ACT_DELIVER
    ' SPAM TEST: Illegal word in subject'
    RegEx.Pattern = "(v\|agra|erection|penis|boner|pharmacy|painkiller|vicodin|valium|adderol|sex med|pills|pilules|viagra|cialis|levitra|rolex|diploma)"
    If Action = ACT_DELIVER Then
        If RegEx.Test(Item.Subject) Then
            Action = ACT_QUARANTINE
            Set Matches = RegEx.Execute(Item.Subject)
            Message = "SPAM: Subject contains restricted word(s): " & JoinMatches(Matches, ",")
        End If
    End If
    ' other tests'
    Select Case Action
        Case Actions.ACT_QUARANTINE
            Dim ns As Outlook.NameSpace
            Set ns = Application.GetNamespace("MAPI")
            Dim junk As Outlook.Folder
            Set junk = ns.GetDefaultFolder(olFolderJunk)
            Item.Subject = "SPAM: " & Item.Subject
            If Item.BodyFormat = olFormatHTML Then
                Item.HTMLBody = "<h2>" & Message & "</h2>" & Item.HTMLBody
            Else
                Item.Body = Message & vbCrLf & vbCrLf & Item.Body
            End If
            Item.Save
            Item.Move junk
        Case Actions.ACT_DELETE
            ' similar to above, but grab Deleted Items folder as destination of move'
        Case Actions.ACT_DELIVER
            ' do nothing'
    End Select
End Sub
Private Function JoinMatches(Matches, Delimeter)
    ' Join the matched values into one string, separated by Delimeter.'
    Dim Match
    Dim RVal: RVal = ""
    For Each Match In Matches
        If Len(RVal) <> 0 Then
            RVal = RVal & Delimeter & Match.Value
        Else
            RVal = RVal & Match.Value
        End If
    Next
    JoinMatches = RVal
End Function
Next, you have to create a rule (Tools - Rules and Alerts) to trigger this script. Click the New Rule button on the dialog to launch the wizard. Select a template for the rule. Choose the "Check messages when they arrive" template from the "Start from a blank rule" category. Click Next.
Choose the "On this machine only" condition (intuitive isn't it?) and click next.
Choose the "run a script" option. At the bottom of the wizard where it shows your new rule, it should read:
Apply this rule after the message arrives
on this machine only
run a script
The phrase "a script" is a clickable link. Click it and Outlook will display a dialog that should list the subroutine you created earlier. Select your subroutine and click the OK button.
You can click Next to add exceptions to the rule or click Finish if you have no exceptions.
Now, as though that process was not convoluted enough, this rule will deactivate every time you stop and restart Outlook unless you sign the script with a code signing key.
If you don't already have a code signing key, you can create one with OpenSSL.
Did I mention that I hate Outlook?

Microsoft Outlook does not support regular expressions. You can perform wildcard searches, although for some inexplicable reason the wildcard character is %, not *.

How to do ANDing of conditions in a regular expression?

I want to match and modify part of a string if the following conditions are true:
I want to capture information about a project, such as project duration, client, technologies used, etc.
So I want to select a string that starts with the word "project", or with phrases such as "details of project", "project details" or "project #1".
The regex should first look for the word "project", and it should select the string only when some or all of the following words are found after it.
1) client
2) duration
3) environment
4) technologies
5) role
I want to select a string if it matches at least 2 of the above words. Words can appear in any order and if the string contains ANY two or three of these words, then the string should get selected.
I have sample text given below.
Details of Projects :
*Project #1: CVC – Customer Value Creation (Sep 2007 – till now) Time
Warner Cable is the world's leading
media and entertainment company, Time
Warner Cable (TWC) makes coaxial
quiver.
Client : Time Warner Cable,US. ETL
Tool : Informatica 7.1.4
Database : Oracle 9i.
Role : ETL Developer/Team Lead.
O/S : UNIX.
Responsibilities: Created Test Plan and Test Case Book. Peer reviewed team members Mappings. Documented Mappings. Leading the Development Team. Sending Reports to onsite. Bug fixing for Defects, Data and Performance related.
Details of Project #2: MYER – Sales
Analysis system (Nov 2005 – till now)
Coles Myer is one of Australia's largest retailers with more than 2,000 stores throughout Australia,
Client : Coles Myer
Retail, Australia. ETL Tool :
Informatica 7.1.3 Database : Oracle
8i. Role : ETL Developer. O/S :
UNIX. Responsibilities: Extraction,
Transformation and Loading of the data
using Informatica. Understanding the
entire source system.
Created and Run Sessions and
Workflows. Created Sort files using
Syncsort Application.*
Does anyone know how to achieve this using regular expressions?
Any clues or regular expressions are welcome!
Many thanks!

(client|duration|environment|technologies|role).+(?!\1)(client|duration|environment|technologies|role)

I would break it down into a few simpler regexes to get these results. The first would select only the chunk of text between projects: (?=Project #).*(?<=Project #)
With the match that this produces, I would run a separate regex to ask whether it contains any of those words: client | duration | environment | technologies | role
If this comes back with at least 2 distinct matches, you know to select the original string!
Edit:
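// Requires: using System.Linq; using System.Text.RegularExpressions;
string originalText = ""; // assign the text to search here
MatchCollection projectDescriptions = Regex.Matches(originalText, "(?=Project #).(?:(?!Project #).)*", RegexOptions.IgnoreCase | RegexOptions.Singleline);
foreach (Match projectDescription in projectDescriptions)
{
    MatchCollection keyWordMatches = Regex.Matches(projectDescription.Value, @"\b(?:client|duration|environment|technologies|role)\b", RegexOptions.IgnoreCase);
    // Count each keyword once, ignoring case, so a repeated keyword does not inflate the total.
    int distinctKeywords = keyWordMatches.Cast<Match>().Select(m => m.Value.ToLowerInvariant()).Distinct().Count();
    if (distinctKeywords >= 2)
    {
        //At this point, do whatever you need to with the original projectDescription match, the Match object will give you the index etc of the match inside the original string.
    }
}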

Maybe you need to break those requirements into two steps: first, take your key/value pairs from your string, then apply your filter.
string input = @"Project #...";
Regex projects = new Regex(@"(?<key>\S+).:.(?<value>.*?\.)");
foreach (Match project in projects.Matches(input))
{
    Console.WriteLine("{0} : {1}",
        project.Groups["key"].Value,
        project.Groups["value"].Value);
}

Try
^(details of )?project.*?((client|duration|environment|technologies|role).*?){2}.*$
One note: This will also match if only one of the terms appears twice.
In C#:
foundMatch = Regex.IsMatch(subjectString, @"\A(?:(details of )?project.*?((client|duration|environment|technologies|role).*?){2}.*)\Z", RegexOptions.Singleline | RegexOptions.IgnoreCase);
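
Since a single regex has trouble requiring two different keywords, the two-step approach suggested in the answers can be sketched as follows: cut the text at each "Project #n" heading, then count the distinct keywords in each section. This is an illustration in Python rather than C#. `select_projects` is a hypothetical name, and the sketch assumes every project section begins with "Project #" followed by a number.

```python
import re

KEYWORDS = ("client", "duration", "environment", "technologies", "role")
KEYWORD_RE = re.compile(r"\b(" + "|".join(KEYWORDS) + r")\b", re.IGNORECASE)
PROJECT_RE = re.compile(r"project\s*#\d+", re.IGNORECASE)


def select_projects(text: str, minimum: int = 2) -> list:
    """Return the project sections that mention at least `minimum` distinct keywords."""
    starts = [m.start() for m in PROJECT_RE.finditer(text)]
    ends = starts[1:] + [len(text)]  # each section runs up to the next heading
    selected = []
    for start, end in zip(starts, ends):
        section = text[start:end]
        # A set keeps each keyword once, so "client ... client" counts as one.
        distinct = {m.group(1).lower() for m in KEYWORD_RE.finditer(section)}
        if len(distinct) >= minimum:
            selected.append(section)
    return selected


if __name__ == "__main__":
    sample = "Project #1: CVC. Client : TWC. Role : Lead. Project #2: MYER. Client : Coles."
    for section in select_projects(sample):
        print(section)
```

Counting with a set addresses the caveat noted above, where one term appearing twice would otherwise satisfy the {2} quantifier.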
