Race conditions with recursion



  • I've been trying to make life easier for my company by writing up a simple web crawler that allows people to search our entire active site (thousands of pages) for any terms.

    I wrote up a simple crawler in C#, and it works great, except for one minor issue.

    We have some pages that append a referrer querystring to their links, and some of them link back to themselves with this appended. Of course, this creates a race condition in the crawler, as it continues to visit the page, forever appending to the querystring, or at least until the buffer overflows 🙂

    It does not recognize these pages as previously visited, because each visit appends to the query string again, e.g.:

     Pass 1:

    www.foo.com/page?referrer=www.foo.com/page 

     Pass 2:

    www.foo.com/page?referrer=www.foo.com/page?referrer=www.foo.com/page 

    And so on 

     Any ideas?
     



  • I should note that this is an internal tool, not a public-facing search engine. It's for searching the output HTML from our framework, not for searching the pages themselves.



  • What does this have to do with race conditions? They are a different thing. Anyway, you can just strip out the referrer parameter, can't you?

    (And the real WTF^H^H^Hquestion is: why is such a parameter being appended when browsers happily send a Referer HTTP header?)
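
    Stripping the parameter before comparing URLs could look something like the following. This is only a sketch in C# (the crawler's language): the UrlNormalizer class and StripReferrer method are made-up names, and it assumes the parameter is literally called referrer and that everything after the first ? is query string. A referrer value containing an unescaped & would leave stray parameters behind, and fragments (#...) are not handled.

```csharp
using System;
using System.Linq;

public static class UrlNormalizer
{
    // Hypothetical helper: returns the URL with any "referrer" query parameter removed.
    public static string StripReferrer(string url)
    {
        int queryStart = url.IndexOf('?');
        if (queryStart < 0)
        {
            return url; // no query string, nothing to strip
        }

        string path = url.Substring(0, queryStart);
        string query = url.Substring(queryStart + 1);

        // Keep every parameter except referrer. Everything after the first '?'
        // counts as query string, so nested referrers disappear in one pass.
        string[] kept = query
            .Split('&')
            .Where(p => p.Length > 0
                        && !p.StartsWith("referrer=", StringComparison.OrdinalIgnoreCase)
                        && !p.Equals("referrer", StringComparison.OrdinalIgnoreCase))
            .ToArray();

        return kept.Length == 0 ? path : path + "?" + string.Join("&", kept);
    }
}
```

    With this, both the Pass 1 and Pass 2 URLs from the first post reduce to www.foo.com/page, so the crawler can compare them against pages it has already seen.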



  • I'd probably do something like…

    // Requires: using System.Text.RegularExpressions;
    // Take the referrer element (if any) out of the prospective link to follow
    var linkUrlWithoutReferrer = Regex.Replace(linkUrl, "[?&]referrer=[^&]*", "");
    // Take the referrer element (if any) out of the current page
    var currentUrlWithoutReferrer = Regex.Replace(currentUrl, "[?&]referrer=[^&]*", "");
    
    // Are they the same? If not, follow the link.
    if (linkUrlWithoutReferrer != currentUrlWithoutReferrer) {
        FollowLink(linkUrl); // FollowLink stands in for the crawler's own fetch routine
    }
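
    The check above only catches a page linking straight back to itself. A more general guard is to keep a set of normalized URLs already visited, and to use a queue instead of recursion. The following is a rough C# sketch, not a drop-in crawler: the Crawler class, the Crawl method and the getLinks delegate are hypothetical names, and link extraction is assumed to be supplied by the caller.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class Crawler
{
    // Removes the referrer parameter so repeat visits map to the same key.
    private static string Normalize(string url)
    {
        return Regex.Replace(url, "[?&]referrer=[^&]*", "");
    }

    // getLinks is a placeholder for whatever fetches a page and returns its links.
    public static List<string> Crawl(string startUrl, Func<string, IEnumerable<string>> getLinks)
    {
        var visited = new HashSet<string>();
        var crawled = new List<string>();
        var pending = new Queue<string>();
        pending.Enqueue(startUrl);

        while (pending.Count > 0)
        {
            string url = Normalize(pending.Dequeue());

            // HashSet.Add returns false when the URL is already in the set.
            if (!visited.Add(url))
            {
                continue;
            }

            crawled.Add(url);
            foreach (string link in getLinks(url))
            {
                pending.Enqueue(link);
            }
        }

        return crawled;
    }
}
```

    Because every URL is normalized before the visited check, a page that keeps linking to itself with a growing referrer is fetched once and then skipped.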
    
