Race conditions with recursion



  • I've been trying to make life easier for my company by writing up a simple web crawler that allows people to search our entire active site (thousands of pages) for any terms.

    I wrote up a simple crawler in C#, and it works great, except for one minor issue.

     We have some pages that append a referrer query string to their links, and some of them link back to themselves with this appended. Of course, this creates a race condition in the crawler, as it keeps revisiting the page, forever appending the referrer to the query string, or at least until the buffer overflows :)

     It does not recognize these pages as previously visited, because each pass appends to the query string again, e.g.:

     Pass 1:

    www.foo.com/page?referrer=www.foo.com/page 

     Pass 2:

    www.foo.com/page?referrer=www.foo.com/page?referrer=www.foo.com/page 

    And so on 

     Any ideas?
     



  • I should note that this is an internal tool, not a public-facing search engine. It's for searching the output HTML from our framework, not for searching the pages themselves.



  • What does this have to do with race conditions? They are a different thing. Anyway, you can just strip out the referrer parameter, can't you?

    (And the real WTF^H^H^Hquestion is: why is such a parameter being appended when browsers happily send a Referer HTTP header?)
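
To make the "strip out the referrer parameter" suggestion concrete, here is a minimal C# sketch. `UrlCleaner` and `StripQueryParameter` are hypothetical names, not part of the original crawler, and the `id=7` URL is made up for illustration. The sketch assumes URLs carry no `#fragment` and that the referrer value contains no unescaped `&`; a nested, unescaped `?referrer=...` like the one in the question is handled because only the first `?` starts the query string.

```csharp
using System;
using System.Linq;

public static class UrlCleaner
{
    // Hypothetical helper: returns `url` with every query parameter called `name` removed.
    public static string StripQueryParameter(string url, string name)
    {
        int q = url.IndexOf('?');
        if (q < 0) return url;                      // no query string at all

        string path = url.Substring(0, q);
        string query = url.Substring(q + 1);

        // Keep every "key=value" pair whose key is not the one being stripped.
        string[] kept = query.Split('&')
            .Where(pair => pair.Length > 0)
            .Where(pair => !pair.Split('=')[0].Equals(name, StringComparison.OrdinalIgnoreCase))
            .ToArray();

        return kept.Length == 0 ? path : path + "?" + string.Join("&", kept);
    }
}

public static class Program
{
    public static void Main()
    {
        Console.WriteLine(UrlCleaner.StripQueryParameter(
            "www.foo.com/page?referrer=www.foo.com/page?referrer=www.foo.com/page", "referrer"));
        // prints "www.foo.com/page"

        Console.WriteLine(UrlCleaner.StripQueryParameter(
            "www.foo.com/page?id=7&referrer=www.foo.com/other", "referrer"));
        // prints "www.foo.com/page?id=7"
    }
}
```

Splitting the query into pairs, rather than deleting a substring, keeps any other parameters intact when the referrer is not the only one.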



  • I'd probably do something like…

    // C# version; needs: using System.Text.RegularExpressions;
    // Take the referrer parameter (if any) out of the prospective link to follow
    var linkUrlWithoutReferrer = Regex.Replace(linkUrl, @"[?&]referrer=[^&]*", "");
    // Take the referrer parameter (if any) out of the current page's URL
    var currentUrlWithoutReferrer = Regex.Replace(currentUrl, @"[?&]referrer=[^&]*", "");

    // Are they the same? Only follow links that lead away from the current page.
    if (linkUrlWithoutReferrer != currentUrlWithoutReferrer) {
        FollowLink(linkUrl); // FollowLink stands in for whatever the crawler uses to fetch a page
    }
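
Comparing a link only against the current page catches direct self-links, but the question also mentions pages not being recognized as previously visited. One possible extension, sketched below in C#, is to keep a set of referrer-free keys for every page seen so far. `VisitedTracker` and `TryMarkVisited` are hypothetical names, the `www.foo.com/other` URL is invented for the example, and the pattern is in the spirit of the one in the snippet above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public class VisitedTracker
{
    // Matches "?referrer=..." or "&referrer=..." up to the next '&' (or end of string).
    private static readonly Regex ReferrerParam = new Regex(@"[?&]referrer=[^&]*");

    // Referrer-free keys of every URL offered so far.
    private readonly HashSet<string> _seen = new HashSet<string>();

    // Returns true the first time a page is offered, false for any repeat,
    // regardless of what referrer was appended to it.
    public bool TryMarkVisited(string url)
    {
        string key = ReferrerParam.Replace(url, "");
        return _seen.Add(key);   // HashSet<T>.Add returns false if the key already exists
    }
}

public static class Program
{
    public static void Main()
    {
        var tracker = new VisitedTracker();

        Console.WriteLine(tracker.TryMarkVisited("www.foo.com/page"));
        // prints "True"
        Console.WriteLine(tracker.TryMarkVisited("www.foo.com/page?referrer=www.foo.com/page"));
        // prints "False"
        Console.WriteLine(tracker.TryMarkVisited("www.foo.com/other?referrer=www.foo.com/page"));
        // prints "True"
    }
}
```

The key is used only for comparison and is never fetched, so it does not matter that removing a leading `?referrer=...` in front of other parameters leaves a string that is not a valid URL. The crawler would still request the original link, and only when `TryMarkVisited` returns true.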

