Race conditions with recursion
-
I've been trying to make life easier for my company by writing up a simple web crawler that allows people to search our entire active site (thousands of pages) for any terms.
I wrote up a simple crawler in C#, and it works great, except for one minor issue.
We have some pages that append a referrer querystring to their links, and some of those pages link back to themselves with it appended. Of course, this creates a race condition in the crawler, as it keeps revisiting the page, appending to the querystring each time, forever, or at least until the buffer overflows :)
It does not recognize these pages as previously visited, because the query string is different on every pass. For example:
Pass 1:
www.foo.com/page?referrer=www.foo.com/page
Pass 2:
www.foo.com/page?referrer=www.foo.com/page?referrer=www.foo.com/page
And so on
Any ideas?
-
I should note that this is an internal tool, not a public-facing search engine. It's for searching the output HTML from our framework, not for searching the pages themselves.
-
What does this have to do with race conditions? They are a different thing; what you describe is an infinite loop. Anyway, you can just strip out the referrer parameter, can't you?
(And the real WTF^H^H^Hquestion is: why is such a parameter being appended when browsers happily send a Referer HTTP header?)
-
I'd probably do something like this (C#; FollowLink is just a placeholder for however your crawler fetches a page):
using System.Text.RegularExpressions;
// Take the referrer element (if any) out of the prospective link to follow.
// [?&] matches the separator in front of the parameter; [^&]* eats its value.
var linkUrlWithoutReferrer = Regex.Replace(linkUrl, @"[?&]referrer=[^&]*", "");
// Take the referrer element (if any) out of the current page
var currentUrlWithoutReferrer = Regex.Replace(currentUrl, @"[?&]referrer=[^&]*", "");
// Are they the same?
if (linkUrlWithoutReferrer != currentUrlWithoutReferrer) {
    FollowLink(linkUrl);
}
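That only compares a link against the page it was found on. To also cover the "previously visited" check from the question, here is a rough sketch of keeping a set of normalized URLs. The class and method names (CrawlFilter, Normalize, ShouldVisit) are made up for illustration, and it assumes the parameter is literally named "referrer" and that its value contains no unescaped "&".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the original crawler.
public static class CrawlFilter
{
    // Normalized forms of every URL that has already been accepted.
    private static readonly HashSet<string> Visited = new HashSet<string>();

    // Returns the URL without its "referrer" query parameter.
    // Other query parameters are kept in their original order.
    public static string Normalize(string url)
    {
        int q = url.IndexOf('?');
        if (q < 0) return url; // no query string at all

        string path = url.Substring(0, q);
        var kept = url.Substring(q + 1)
            .Split('&')
            .Where(p => p.Length > 0 &&
                        !p.StartsWith("referrer=", StringComparison.OrdinalIgnoreCase));

        string query = string.Join("&", kept);
        return query.Length == 0 ? path : path + "?" + query;
    }

    // True the first time a given normalized URL is seen, false afterwards.
    public static bool ShouldVisit(string url)
    {
        return Visited.Add(Normalize(url));
    }
}
```

With this, both passes from the question normalize to www.foo.com/page, so ShouldVisit returns true for the first and false for the second, and the crawler stops looping on that page.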