Screenscraping with ASP.Net?



  • So I'm working on something to screen-scrape web pages, looking for upcoming events and then dropping them into a database. My problem is when there are multiple events on a page.

     

    For example, I'm scraping a calendar page, matching on certain tags that appear in the code before and after dates, venues, times, etc. I read the entire page into a string with StreamReader, but I haven't found a way to step through it properly. I'd like to pull out everything in between the tags that start with <a class="Title" and end with >. All I can manage is to pull out the first item, and I'm not up on ASP.NET enough to step through them all. Someone suggested Regex.Matches, but I'm not quite sure how that would work. Any ideas?

    J



  • Aren't there HTML/XML readers for ASP.NET? If you happen to stumble over one that supports XPath, scraping becomes child's play.

    It's probably possible to do this via regex and a lot of additional code, but I think it would become very painful, because regexes normally can't extract structures that are nested arbitrarily deep, like XML/HTML.
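
To make the nesting problem concrete, here is a minimal sketch in JavaScript (the markup string is made up for illustration). It shows how a lazy pattern cuts a nested element short, while a greedy one only works as long as nothing else follows on the page.

```javascript
// Hypothetical markup: an outer <div> that contains an inner <div>.
const html = '<div id="outer">A<div id="inner">B</div>C</div>';

// A lazy match stops at the FIRST closing tag, which belongs to the inner div,
// so the outer element is cut off before "C".
const lazy = html.match(/<div\b[^>]*>.*?<\/div>/)[0];

// A greedy match runs to the LAST closing tag. That happens to work here, but
// it would also swallow any sibling <div> elements later in the page.
const greedy = html.match(/<div\b[^>]*>.*<\/div>/)[0];

console.log(lazy);
console.log(greedy);
```

Neither pattern tracks which closing tag pairs with which opening tag, which is why a real parser tends to be the safer choice for nested markup.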



  • Assuming you just want to rip all the anchor tags out of the page, you can do a simple regex match and loop. Here's an example in JavaScript; a similar example would be easy in .NET.

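    // Grab the page markup and find every opening anchor tag.
    var data = document.body.outerHTML;
    // \b keeps <abbr>, <address>, etc. from matching; the i flag covers <a> and <A>.
    // match() returns null when nothing is found, so fall back to an empty array.
    var matches = data.match(/<a\b[^>]*>/gi) || [];
    var msg = "";

    for (var i = 0; i < matches.length; i++) {
      msg += matches[i] + "\r\n";
    }
    alert(msg);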
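
For the original question (every item marked with class="Title", not just the first), the same loop idea can be combined with a capture group. This is only a sketch: the sample markup and the names `page`, `titlePattern`, and `titles` are made up for illustration, and the pattern assumes the class attribute is written exactly as `class="Title"`. In .NET, `Regex.Matches` with `RegexOptions.IgnoreCase` plays the same role, and the captured text would be read from each match's `Groups[1].Value`.

```javascript
// Hypothetical page source; in a real scraper this would be the downloaded HTML.
const page = [
  '<a class="Title" href="/e/1">Jazz Night</a>',
  '<a class="Other" href="/x">Not an event</a>',
  '<A CLASS="Title" HREF="/e/2">Film Club</A>',
].join('\n');

// Group 1 captures the text between the opening tag and the closing </a>.
// Flags: g = find every match, i = ignore case, s = let . cross newlines.
const titlePattern = /<a\b[^>]*\bclass="Title"[^>]*>(.*?)<\/a>/gis;

// matchAll yields one match object per hit; m[1] is the captured text.
const titles = Array.from(page.matchAll(titlePattern), (m) => m[1].trim());

console.log(titles);
```

Each entry in `titles` could then be written to the database one at a time. If the pages put other tags inside the link text, the nesting caveat raised earlier in the thread applies and a parser is likely the better tool.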



  • You might want to check this out:

    http://www.codeplex.com/htmlagilitypack 

    It's a nice open source .NET library for parsing HTML, even when the HTML isn't well-formed.
     

