I've seen a bunch of examples on this online but I can't seem to find the one I'm looking for which uses Regex. I've seen many that use a loop and use a lot of lines of code but I'd like to see an example of Regex.
What I'm trying to create is an app that will connect to a webpage, take the source, search for a keyword and, once it's found, copy the text from that keyword to another keyword and save it into a string, a textbox, or whatever.
I'm already using a web request to get the information and put it into a string; I just need to search the string for what I am looking for.
The reason for this app is to search a webpage for an updated version of some software I'm using. I want to monitor for updates and have the app notify me when an update is available. It's just a simple app, but I'm having issues searching for what I need.
For Example:
first words to search for: Server 64-bit
second words/characters to search for: </div>
Grab the first words, everything in between, and the last word, and save it all into a string.
EDIT: The information I am trying to grab is this....
Server 64-bit
<span class="version">
3.0.13.6
</span>
</h3>
<div class="checksum">SHA256: c7eeb1937b0bce0b99e7c7e20de030a4b71adcaf09750481801cfa361433522f</div>
You can use the following code with Regex to return the whole text, including the two keywords you provide:
Dim str As String = "first words to search for: Server 64-bit second words/characters to search for: </div>"
str = str.Replace(vbNewLine,"|")
Dim strA As String = Regex.Match(str, "Server 64-bit(.*?)</div>", RegexOptions.Singleline).Value
Msgbox(strA)
Or you can use the following expression to get only the value between these two keywords (the lazy .*? stops at the first </div> rather than the last one):
Dim strA As String = Regex.Match(str, "(?<=Server 64-bit)(.*?)(?=</div>)", RegexOptions.Singleline).Value
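The same lazy-match idea can be sketched in Python against the fragment from the question's EDIT. This is only an illustration: the helper name `between` is made up here, and the second call that narrows the result down to the version number is an assumption about what is ultimately wanted.

```python
import re

# Sample page fragment copied from the question's EDIT.
page = """Server 64-bit
<span class="version">
3.0.13.6
</span>
</h3>
<div class="checksum">SHA256: c7eeb1937b0bce0b99e7c7e20de030a4b71adcaf09750481801cfa361433522f</div>"""


def between(text, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    # re.escape keeps characters such as '.' or '/' in the keywords literal;
    # re.DOTALL lets '.' cross line breaks, like RegexOptions.Singleline.
    m = re.search(re.escape(start) + r"(.*?)" + re.escape(end), text, re.DOTALL)
    return m.group(1) if m else None


# First cut: everything from "Server 64-bit" up to the first </div>.
block = between(page, "Server 64-bit", "</div>")
# Second cut (assumed goal): just the version number inside the span.
version = between(block, '<span class="version">', "</span>").strip()
print(version)
```

Comparing `version` with the last value the app stored would then be enough to decide whether to show an update notification.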
Maybe not the prettiest solution, but I would save the page source into a string. Then check it with String.Contains("Server 64-bit"), split it at that keyword and keep the part after it, then split that remainder at the next "</div>" and keep only the first part.
Dim Information As String = "" ' the page source from your web request goes here
Dim StartKey As String = "Server 64-bit"
Dim EndKey As String = "</div>"
If Information.Contains(StartKey) Then
    ' Split on the whole keyword (not a single character) and keep everything after it
    Dim SplitString As String = Information.Split(New String() {StartKey}, 2, StringSplitOptions.None)(1)
    If SplitString.Contains(EndKey) Then
        ' Keep everything before the first closing keyword
        Dim ResultString As String = SplitString.Split(New String() {EndKey}, 2, StringSplitOptions.None)(0)
        'Displaying Result in a MsgBox
        MsgBox(ResultString)
    End If
End If
I'm currently on my phone, so I can't actually test this, but it should work.
I have the following tag from an XML file:
<msg><![CDATA[Method=GET URL=http://test.de:80/cn?OP=gtm&Reset=1(Clat=[400441379], Clon=[-1335259914], Decoding_Feat=[], Dlat=[0], Dlon=[0], Accept-Encoding=gzip, Accept=*/*) Result(Content-Encoding=[gzip], Content-Length=[7363], ntCoent-Length=[15783], Content-Type=[text/xml; charset=utf-8]) Status=200 Times=TISP:270/CSI:-/Me:1/Total:271]]>
I am trying to extract Clon, Dlat, Dlon and Clat from this message.
So far I have created the following regex:
(?<=Clat=)[\[\(\d+\)\n\n][^)n]+]
The problem is that I would like to get only the numbers, without the brackets. I have tried some other expressions without success.
Do you know how I can extend this expression to get only the values without the brackets?
Thank you very much in advance.
Best regards
The regex
(clon|dlat|dlon|clat)=\[(-?\d+)\]
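matches each name/value pair, capturing the name in group 1 and the number (without the brackets) in group 2. Match case-insensitively, because the names in the message are capitalized (Clat, Clon, Dlat, Dlon).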
As I stated before, if you use this regex to extract the information out of this CDATA element, that's okay. But you really want to get to the contents of that element using an XML parser.
Example usage
// requires: using System.Text.RegularExpressions;
Regex r = new Regex(@"(clon|dlat|dlon|clat)=\[(-?\d+)\]", RegexOptions.IgnoreCase);
string s = ".. here's your cdata content .. ";
foreach (Match match in r.Matches(s))
{
    var name = match.Groups[1].Value; // the name as written in the input, e.g. "Clon", "Dlat", "Dlon" or "Clat"
    var inner_value = match.Groups[2].Value; // the value inside the square brackets, e.g. "400441379"
    // Do something with the matches
}
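To illustrate the advice about combining an XML parser with the regex, here is a small Python sketch. It assumes the tag is closed with `</msg>` (the question does not show the closing tag) and uses a shortened copy of the message; the variable names are made up for the example.

```python
import re
import xml.etree.ElementTree as ET

# Shortened copy of the tag from the question; the closing </msg> is assumed.
xml_text = (
    "<msg><![CDATA[Method=GET URL=http://test.de:80/cn?OP=gtm&Reset=1"
    "(Clat=[400441379], Clon=[-1335259914], Decoding_Feat=[], Dlat=[0], Dlon=[0])"
    " Status=200]]></msg>"
)

# The parser unwraps the CDATA section and hands back its plain text.
cdata = ET.fromstring(xml_text).text

# Group 1 is the name, group 2 the number without the square brackets.
pattern = re.compile(r"(clon|dlat|dlon|clat)=\[(-?\d+)\]", re.IGNORECASE)
values = {name: int(number) for name, number in pattern.findall(cdata)}
print(values)
```

Empty entries such as `Decoding_Feat=[]` are skipped because the pattern only lists the four wanted names and requires at least one digit.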
I have a very short xml String passed to my app from another app and I'm only interested in extracting the content between the "level" tags. Which solution is better between these two:
String xmlString =
    "<type>" +
    "<perm>" +
    "<date>99999999</date>" +
    "<level>admin</level>" +
    "</perm>" +
    "</type>";
String level = xmlString.substring(xmlString.indexOf("<level>") + "<level>".length(),
xmlString.indexOf("</level>"));
or
Pattern p1 = Pattern.compile("<level>(\\S+)</level>");
Matcher m = p1.matcher(xmlString);
if (m.find()) {
String level = m.group(1);
}
Have you tried benchmarking this on your own? From what I've read, the general advice is to try a regex first and fall back to substring only if you can't get the regex fast enough. However, I'm a little confused about why you aren't using something like XmlObject.Factory to handle your XML parsing: https://xmlbeans.apache.org/docs/2.0.0/reference/org/apache/xmlbeans/XmlObject.Factory.html
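For comparison, the three options discussed here (index arithmetic, a regex, and a real XML parser) can be put side by side in a few lines of Python. This is an illustrative sketch only: it uses the standard-library `xml.etree.ElementTree` as a stand-in for a parser, not XMLBeans, and it says nothing about relative speed.

```python
import re
import xml.etree.ElementTree as ET

xml_string = "<type><perm><date>99999999</date><level>admin</level></perm></type>"

# 1. Index arithmetic, as in the first snippet of the question.
start = xml_string.index("<level>") + len("<level>")
by_index = xml_string[start:xml_string.index("</level>")]

# 2. Regular expression, as in the second snippet.
match = re.search(r"<level>(\S+)</level>", xml_string)
by_regex = match.group(1) if match else None

# 3. XML parser: findtext returns the text of the first <level> descendant.
by_parser = ET.fromstring(xml_string).findtext(".//level")

print(by_index, by_regex, by_parser)
```

The parser version is the only one of the three that keeps working unchanged if the sender later adds attributes or whitespace to the `level` tag.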
I have a text file that is in a comma separated format, delimited by " on most fields. I am trying to get that into something I can enumerate through (Generic Collection, for example). I don't have control over how the file is output nor the character it uses for the delimiter.
In this case, the fields are separated by a comma and text fields are enclosed in " marks. The problem I am running into is that some fields have quotation marks in them (i.e. 8" Tray) and are accidentally being picked up as the next field. In the case of numeric fields, they don't have quotes around them, but they do start with a + or a - sign (depicting a positive/negative number).
I was thinking of a RegEx, but my skills aren't that great so hopefully someone can come up with some ideas I can try. There are about 19,000 records in this file, so I am trying to do it as efficiently as possible. Here are a couple of example rows of data:
"00","000000112260 ","Pie Pumpkin ","RET","6.99 "," ","ea ",+0000000006.99000
"00","000000304078 ","Pie Apple caramel ","RET","9.99 "," ","ea ",+0000000009.99000
"00","StringValue here","8" Tray of Food ","RET","6.99 "," ","ea ",-00000000005.3200
There are a lot more fields, but you can get the picture....
I am using VB.NET and I have a generic List set up to accept the data. I have tried using CSVReader and it seems to work well until you hit a record like the third one (with a quote in the text field). If I could somehow get it to handle the additional quotes, then the CSVReader option would work great.
Thanks!
I recommend looking at the TextFieldParser class in .NET. You need to include
Imports Microsoft.VisualBasic.FileIO
Here's a quick sample:
Dim afile As FileIO.TextFieldParser = New FileIO.TextFieldParser(FileName)
Dim CurrentRecord As String() ' this array will hold each line of data
afile.TextFieldType = FileIO.FieldType.Delimited
afile.Delimiters = New String() {","}
afile.HasFieldsEnclosedInQuotes = True
' parse the actual file
Do While Not afile.EndOfData
Try
CurrentRecord = afile.ReadFields
Catch ex As FileIO.MalformedLineException
Stop
End Try
Loop
From here:
Encoding fileEncoding = GetFileEncoding(csvFile);
// get rid of all doublequotes except those used as field delimiters
string fileContents = File.ReadAllText(csvFile, fileEncoding);
string fixedContents = Regex.Replace(fileContents, @"([^\^,\r\n])""([^$,\r\n])", @"$1$2");
using (CsvReader csv =
    new CsvReader(new StringReader(fixedContents), true))
{
    // ... parse the CSV
}
As this link says... Don't roll your own CSV parser!
Use TextFieldParser as Avi suggested. Microsoft has already done this for you. If you ended up writing one, and you find a bug in it, consider replacing it instead of fixing the bug. I did just that recently and it saved me a lot of time.
Give a look to the FileHelpers library.
You could give CsvHelper (a library I maintain) a try and it's available via NuGet. It follows the RFC 4180 standard for CSV. It will be able to handle any content inside of a field including commas, quotes, and new lines.
CsvHelper is simple to use, but it's also easy to configure it to work with many different types of delimited files.
CsvReader csv = new CsvReader( streamToFile );
IEnumerable<MyObject> myObjects = csv.GetRecords<MyObject>();
If you want to read CSV files on a lower level, you can use the parser directly, which will return each row as a string array.
var parser = new CsvParser( myTextReader );
while( true )
{
string[] line = parser.ReadLine();
if( line == null )
{
break;
}
}
I am posting this as an answer so I can explain how I did it and why.... The answer from Mitch Wheat was the one that gave me the best solution for this case and I just had to modify it slightly due to the format this data was exported in.
Here is the VB Code:
Dim fixedContents As String = Regex.Replace(
File.ReadAllText(csvFile, fileEncoding),
"(?<!,)("")(?!,)",
AddressOf ReplaceQuotes)
The RegEx that was used is what I needed to change because certain fields had non-escaped quotes in them and the RegEx provided didn't seem to work on all examples. This one uses 'Look Ahead' and 'Look Behind' to see if the quote is just after a comma or just before. In this case, they are both negative (meaning show me where the double quote is not before or after a comma). This should mean that the quote is in the middle of a string.
In this case, instead of doing a direct replacement, I am using the function ReplaceQuotes to handle that for me. The reason I am using this is because I needed a little extra logic to detect whether it was at the beginning of a line. If I would have spent even more time on it, I am sure I could have tweaked the RegEx to take into consideration the beginning of the line (using MultiLine, etc) but when I tried it quickly, it didn't seem to work at all.
With this in place, using CSV reader on a 32MB CSV file (about 19000 rows), it takes about 2 seconds to read the file, perform the regex, load it into the CSV Reader, add all the data to my generic class and finish. Real quick!!
RegEx to exclude first and last quote would be (?<!^)(?<!,)("")(?!,)(?!$). Of course, you need to use RegexOptions.Multiline.
That way there is no need for evaluator function. My code replaces undesired double quotes with single quotes.
Complete C# code is as below.
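// Note: this pattern uses ';' as the field delimiter; for comma-delimited data use ',' in both lookarounds.
string fixedCSV = Regex.Replace(
    File.ReadAllText(fileName),
    @"(?<!^)(?<!;)("")(?!;)(?!$)", "'", RegexOptions.Multiline);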
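To see what the lookaround pattern does to the sample rows from the question, here is a hedged Python sketch. It uses a comma in the lookarounds to match those rows, trims the rows to five fields for brevity, and feeds the result to the standard `csv` module as a stand-in for CsvReader.

```python
import csv
import io
import re

# Two rows shortened from the question; the second has a stray inner quote.
raw = (
    '"00","000000112260 ","Pie Pumpkin ","RET",+0000000006.99000\n'
    '"00","StringValue here","8" Tray of Food ","RET",-00000000005.3200\n'
)

# A quote that is not at a line start or line end and has no comma on either
# side is assumed to be part of the field text and becomes a single quote.
inner_quote = re.compile(r'(?<!^)(?<!,)"(?!,)(?!$)', re.MULTILINE)
fixed = inner_quote.sub("'", raw)

rows = list(csv.reader(io.StringIO(fixed)))
for row in rows:
    print(row)
```

The heuristic still cannot tell a stray quote from a delimiter when the quote sits directly next to a comma inside the text, so it only helps for data like the rows shown.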
There are at least ODBC drivers for CSV files. But there are different flavors of CSV.
What produced these files? It's not unlikely that there's a matching driver based on the requirements of the source application.
Your problem with CSVReader is that the quote in the third record isn't escaped with another quote (aka double quoting). If you don't escape them, then how would you expect to handle ", in the middle of a text field?
http://en.wikipedia.org/wiki/Comma-separated_values
(I did end up having to work with files (with different delimiters) but the quote characters inside a text value weren't escaped and I ended up writing my own custom parser. I do not know if this was absolutely necessary or not.)
The logic of this custom approach is: read through the file one line at a time, split each line on the comma, remove the first and last character of each quoted field (removing the outer quotes but not affecting any inside quotes), then add the data to your generic list. It's short and very easy to read and work with, but it assumes that no field contains a comma.
Dim fr As StreamReader = Nothing
Dim FileString As String = ""
Dim LineItemsArr() as String
Dim FilePath As String = HttpContext.Current.Request.MapPath("YourFile.csv")
fr = New System.IO.StreamReader(FilePath)
While fr.Peek <> -1
FileString = fr.ReadLine.Trim
If String.IsNullOrEmpty(FileString) Then Continue While 'Empty Line
LineItemsArr = FileString.Split(",")
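For Each Item As String In LineItemsArr
    'Quoted text fields begin and end with " (quote), so cut only the first and
    'last characters; quotes inside the value (e.g. 8" Tray) are kept.
    'Numeric fields (+/-) are not quoted and are left unchanged.
    Dim UpdatedItem As String = Item
    If Item.Length >= 2 AndAlso Item.StartsWith("""") AndAlso Item.EndsWith("""") Then
        UpdatedItem = Item.Substring(1, Item.Length - 2)
    End If
    'Then stick the data into your Generic List (Of String()?)
Next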
End While
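The split-and-trim logic described above can be sketched compactly in Python. The function name `parse_line` is hypothetical, and the sketch relies on the same assumption as the VB code: no field contains a comma.

```python
def parse_line(line):
    """Split one line on commas and drop only the outer quotes of each field."""
    fields = []
    for item in line.strip().split(","):
        # Only quoted fields are trimmed; numeric fields (+/-) pass through.
        if len(item) >= 2 and item.startswith('"') and item.endswith('"'):
            item = item[1:-1]  # inner quotes such as 8" are left alone
        fields.append(item)
    return fields


sample = '"00","StringValue here","8" Tray of Food ","RET","6.99 "," ","ea ",-00000000005.3200'
record = parse_line(sample)
print(record)
```

Because the inner quote is never interpreted, the third field keeps its original text, which is the main difference from the regex-based fixes that rewrite it.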
public static Encoding GetFileEncoding(String fileName)
{
Encoding Result = null;
FileInfo FI = new FileInfo(fileName);
FileStream FS = null;
try
{
FS = FI.OpenRead();
Encoding[] UnicodeEncodings = { Encoding.BigEndianUnicode, Encoding.Unicode, Encoding.UTF8 };
for (int i = 0; Result == null && i < UnicodeEncodings.Length; i++)
{
FS.Position = 0;
byte[] Preamble = UnicodeEncodings[i].GetPreamble();
bool PreamblesAreEqual = true;
for (int j = 0; PreamblesAreEqual && j < Preamble.Length; j++)
{
PreamblesAreEqual = Preamble[j] == FS.ReadByte();
}
if (PreamblesAreEqual)
{
Result = UnicodeEncodings[i];
}
}
}
catch (System.IO.IOException)
{
}
finally
{
if (FS != null)
{
FS.Close();
}
}
if (Result == null)
{
Result = Encoding.Default;
}
return Result;
}
I'm having trouble figuring out how to transform a string into camel case in groovy. Say I start out with a string that looks like "1-800 FOO.BAR". Ultimately, I want this to turn into "1800FooDotBar". I've been able to get 1800FOODotBar by doing the following:
String str = "1-800 FOO.BAR"
String tempStr = str.replaceAll(/(?i)\.com/, "DotCom")
String newStr = tempStr.replaceAll(/\\W/, "")
I'm just not sure how to get rid of those capital letters in the middle. I've come across some information about a capitalize() method that should be able to help, but I'm just not familiar enough with Groovy to know how to use it. I think I need to split the string into individual strings for each word and then capitalize the first letter of each of those strings, but then how do I build the end result back up? I know that similar questions have been asked, but I'm just not seeing how to take that information and make complete Groovy code from it. Thanks in advance!
Very roughly:
String str = "1-800 FOO.BAR"
println str.replaceAll(/\./, " Dot ").split(/[^\w]/).collect { it.toLowerCase().capitalize() }.join("")
=> 1800FooDotBar
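The same pipeline (spell out the dot, split on non-word characters, capitalize each piece, join) reads like this in Python. It is a sketch for comparison, not Groovy, and the function name `camel_case` is made up for the example.

```python
import re


def camel_case(text):
    """Turn a string such as '1-800 FOO.BAR' into '1800FooDotBar'."""
    # Spell out dots first so they survive the split as their own word.
    spelled = text.replace(".", " Dot ")
    # Split on runs of non-word characters and drop empty pieces.
    words = [w for w in re.split(r"\W+", spelled) if w]
    # str.capitalize upper-cases the first letter and lower-cases the rest.
    return "".join(w.capitalize() for w in words)


print(camel_case("1-800 FOO.BAR"))
```

Splitting on runs (`\W+`) rather than single characters avoids the empty strings that appear when two separators sit next to each other.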