So, hmm, Baiduspider
-
I have a small VPS that serves my personal homepage, and there's really not much in the way of content. By accident, I kinda noticed something odd:
$ ls -lh nginx_logs
-rw-r--r-- 1 10101 10101 1,3G 15 mar 12.43 access-DOMAIN
-rw-r--r-- 1 10101 10101 1,8G 15 mar 12.43 error-DOMAIN
So, that's a bit odd, especially considering
$ head nginx_logs/access-DOMAIN
180.76.15.29 - - [03/Feb/2016:time] "GET URL HTTP/1.1" 200 6580 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)" "2.59"
...
I.e., the first log entry is from the third of February this year (which is pretty much the last time I did some house cleaning).
A quick glance shows that Baiduspider is kinda pretty prominent in those logs. In fact:
$ grep Baiduspider access-DOMAIN | wc -l
6051058
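Doing the back-of-the-envelope arithmetic on that count (nothing here is measured live, just the grep total spread over the log window):

```python
# Sanity-check sketch: total Baiduspider requests over the 42-day window.
total_requests = 6051058
days = 42

per_day = total_requests / days
per_hour = per_day / 24

print(round(per_day))   # ~144073 requests/day
print(round(per_hour))  # ~6003 requests/hour
```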
That's around 140000 requests every day, or around 6000 each hour, for the last 42 days. Hadn't noticed, the server is running quite snappy. Neither nginx nor the FastCGI backend that's serving the page is really draining any resources (other than wasted network bandwidth). In fact,
$ lxc-info --name nginx
Name: nginx
State: RUNNING
PID: 2422
IP: 10.0.X.3
CPU use: 13346.65 seconds
Memory use: 26.35 MiB
KMem use: 0 bytes
Link: vethBLAH
TX bytes: 80.71 GiB
RX bytes: 6.94 GiB
Total bytes: 87.65 GiB
and
$ lxc-info --name homepage
Name: homepage
State: RUNNING
PID: 7421
IP: 10.0.X.6
CPU use: 49959.60 seconds
Memory use: 30.33 MiB
KMem use: 0 bytes
Link: vethBLEH
TX bytes: 882 bytes
RX bytes: 19.65 KiB
Total bytes: 20.51 KiB
It seems that the framework I'm using rewrites internal links from "/blah" to "/blah?foo=RANDOMGARBAGE". Baiduspider seems to think that each RANDOMGARBAGE indicates a unique page (eh, fair enough, I guess?). I found a config file that lists user agents that are to be considered spiders, and Baiduspider was missing from that list. Fixed that - it apparently prevents the link-rewriting and serves the non-AJAXy version of the page.
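For anyone curious, the entry looked roughly like this. This is a sketch based on Wt's wt_config.xml; the exact element names may differ between versions, so check the reference config shipped with your install:

```xml
<!-- Hypothetical sketch of a bot user-agent list in wt_config.xml.
     Matching user agents get the plain, non-rewritten version of the page. -->
<user-agents type="bot">
  <user-agent>.*Googlebot.*</user-agent>
  <user-agent>.*Baiduspider.*</user-agent>
</user-agents>
```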
Now, considering I'm not an expert on webdev-things, I'm fairly sure that I'm the WTF here (if nothing else, then for using Wt (a C++ framework) to serve my homepage). Nevertheless, do I really need to manually maintain a list of spiders that might run across my page? Shouldn't Baiduspider figure out that after the first 1M-or-so kinda identical pages the next 5M shouldn't matter too much?
Also, now that I've (hopefully?) fixed the issue, how long should I let the Baidubombardment continue - i.e., will it eventually pick up on the changes? Should I (temporarily?) blacklist Baiduspider and/or associated IPs? Other actions?
-
(if nothing else, then for using Wt (a C++ framework) to serve my homepage).
Yes.
Nevertheless, do I really need to manually maintain a list of spiders that might run across my page?
Not if your code doesn't do silly things like the one you mentioned. Appending random query params is typically done to forcibly invalidate caches, but that's only necessary for aggressively cached static content; application-generated pages should just consume and generate proper cache headers.
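To make "proper cache headers" concrete, here's a minimal sketch: a validator (ETag) lets clients and spiders revalidate cheaply, with no cache-busting query parameters needed. All names here are illustrative, not any particular framework's API:

```python
import hashlib

def cache_headers(body, max_age=300):
    # Derive a validator from the content; any stable digest works.
    etag = '"%s"' % hashlib.sha256(body).hexdigest()[:16]
    return {"Cache-Control": "max-age=%d" % max_age, "ETag": etag}

def respond(body, if_none_match=None):
    headers = cache_headers(body)
    if if_none_match == headers["ETag"]:
        return 304, headers, b""   # client's copy is still good, send nothing
    return 200, headers, body

status1, headers, _ = respond(b"<html>hi</html>")
status2, _, _ = respond(b"<html>hi</html>", headers["ETag"])
print(status1, status2)  # 200 304
```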
There should be a public list of bots around somewhere, though.
Should I (temporarily?) blacklist Baiduspider and/or associated IPs?
They might drop the listing if the server stops responding. If it doesn't visibly strain your resources then it's not really necessary anyway.
Consider creating a sitemap and maybe disabling logging of those requests, though.
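A sitemap is just an XML file listing your URLs. A minimal sketch (example.com and the paths are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2016-03-15</lastmod>
  </url>
  <url>
    <loc>https://example.com/blah</loc>
  </url>
</urlset>
```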
-
generate proper cache headers.
I'm guessing that "Cache-control: no-cache, no-store, must-revalidate" isn't the right answer here, then.
Guess I need to take a look at that. I was hoping that the framework wouldn't do anything too stupid there, but that's apparently putting a bit too much faith in it. Or probably rather a case of not RTFM:ing properly.
In my defense, the thing has been online for around 1.5 years or so, and this is the first time this problem pops up. I did keep a closer eye on things during the first 3-4 months.
They might drop the listing if the server stops responding. If it doesn't visibly strain your resources then it's not really necessary anyway.
As said, it's a personal homepage, so I could live without being listed by Baidu. But, no, it doesn't really strain the resources of the server, so I guess I'll leave it and see if the problem eventually goes away with listing Baidu explicitly as a spider.
Consider creating a sitemap and maybe disabling logging of those requests, though.
That almost sounds like work. Also, I'd need to figure out what exactly a sitemap is and how to create one. (But, yeah, this is actually in the TODO-list; I'm just not spending much (any) time on doing anything with it.)
-
I'm guessing that "Cache-control: no-cache, no-store, must-revalidate" isn't the right answer here, then.
If you don't want those pages to be cached at all, then it's fine. If you're doing that then you definitely don't need the random query arguments, though.
-
If you don't want those pages to be cached at all, then it's fine. If you're doing that then you definitely don't need the random query arguments, though.
Sorry, maybe should have mentioned this. The random query arguments are -I believe- generated by the framework to pass around some per-session data (or, more likely, a per-session ID), when it fails to Javascriptify the page and can't use cookies for some reason.
The default assumption is probably that neither cookies nor JS work, so the initial page anybody gets contains those for that reason (at least that's what I'm guessing is going on).
Whether or not that's a good idea, I don't really have any opinion on; it's something the framework does automagically. It means that the page is viewable without JS but uses some fancy AJAXy loading stuff when it's available. Not that that's really needed either way (but I get it for "free", so yay?).
-
It seems that the framework I'm using rewrites internal links from "/blah" to "/blah?foo=RANDOMGARBAGE".
.... why?
EDIT: to answer the actual question:
Also, now that I've (hopefully?) fixed the issue, how long should I let the Baidubombardment continue - i.e., will it eventually pick up on the changes? Should I (temporarily?) blacklist Baiduspider and/or associated IPs? Other actions?
You already answered it:
Hadn't noticed, the server is running quite snappy.
If you hadn't even noticed it, then it's not an issue. Just let it ride.
But do fix your framework so it's not doing stupid things. Because Baidu is in the right here, and your site is clearly in the wrong.
-
.... why?
See post above -- at least that's what I think is going on.
But do fix your framework so it's not doing stupid things. Because Baidu is in the right here, and your site is clearly in the wrong.
I checked: with Baidu in the list of spiders and that user agent, a version without the random garbage appended is served. I guess it'll fix the current set of symptoms (~140k daily requests by Baidu).
Fixing the problem "for good" is a bit tougher. I'm not the author of the framework, and not really in the mood to go hacking around in it (and so far I've not had to touch that code at all - the docs and examples were good enough to make do with just that).
The path of least resistance would be to just keep an eye on the logs (probably a good idea anyway), and whenever the problem re-appears for some other bot/spider, I'll update the list of spiders/bots. (Or perhaps find a list by somebody else and use that.)
But since I now know that the site is misbehaving on my end, I can see if I can come up with a reasonable/proper fix eventually.
-
See post above -- at least that's what I think is going on.
It's still broken.
If it's using a URL param as a session ID, that means if you paste a link to someone via IM or something, they're suddenly in your session also. Bubye credit card. If it's using URL param + IP, maybe... maybe... that's barely acceptable. But... then you're breaking a session if a legit user's IP changes (say, his cellphone moves towers.) So it's just annoying, not like personal-data-leaking. (EDIT: of course the person you IM could also be on the same IP, via VPN or NAT or what-not, in which case we're back to bubye credit card.)
Cookies were invented for a good reason. If you don't have cookies, you really just can't do sessions and shouldn't even try.
EDIT: Oh and BTW, if the guy who wrote your web framework didn't know any of the above, are you sure you want to be using it? Because, seriously, it's his job to know shit like this. Who knows what else he got wrong.
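To make the cookie point concrete, here's a minimal sketch of why cookie-based sessions avoid the link-sharing problem: the session ID travels in a cookie header, never in the URL, so pasting a link to someone can't hand them your session. All names are illustrative:

```python
import secrets

SESSIONS = {}  # server-side session store

def start_session():
    sid = secrets.token_urlsafe(32)              # unguessable ID
    SESSIONS[sid] = {"cart": []}                 # per-session state lives server-side
    # HttpOnly keeps scripts away from it; Secure keeps it off plain HTTP.
    set_cookie = "session=%s; HttpOnly; Secure; SameSite=Lax" % sid
    return sid, set_cookie

def lookup(cookie_header):
    # Parse "name=value" pairs from the Cookie header and match a live session.
    for part in cookie_header.split(";"):
        name, _, value = part.strip().partition("=")
        if name == "session" and value in SESSIONS:
            return SESSIONS[value]
    return None

sid, cookie = start_session()
print(lookup("session=" + sid) is SESSIONS[sid])  # True
print(lookup("session=bogus"))                    # None
```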
-
It's still broken.
<snip />
Cookies were invented for a good reason. If you don't have cookies, you really just can't do sessions and shouldn't even try.
Ok, that's a fair point. A quick look into the config file reveals that there's a setting for the tracking. Currently it's actually set to URL only, but the only other option seems to be "Auto", which defaults to cookies but falls back to URL-rewriting.
The docs state:
tracking
How session tracking is implemented: automatically (using cookies when available, otherwise using URL rewriting) or strictly using URL rewriting (which allows multiple concurrent sessions from one user).
Not exactly a lot of choice there.
EDIT: Oh and BTW, if the guy who wrote your web framework didn't know any of the above, are you sure you want to be using it? Because, seriously, it's his job to know shit like this. Who knows what else he got wrong.
FWIW, there's an entirely separate module for actual user authentication. A quick look at that seems to indicate that sessions and user-authentication are tracked separately (i.e., a single logged-in user can have several sessions, even from the same browser). I haven't looked at that in much detail (since I don't need anything like that at the moment).
As for what else he's got wrong? Probably not much more (or less) than the vast majority of the other frameworks out there. But, I'm not planning on using it for anything even remotely critical, and if I did, well ... let's not go there, because honestly, I'm not the right person to write a webservice that handles any critical data. Not my area of expertise, and I'm aware of that.
-
EDIT: Oh and BTW, if the guy who wrote your web framework didn't know any of the above, are you sure you want to be using it? Because, seriously, it's his job to know shit like this. Who knows what else he got wrong.
Or he misunderstood EU cookie law. Seems to be common.
-
The law doesn't matter; putting a session identifier in the URL has always been technically wrong.
-
I think that spider honours robots.txt as well. It's suggested in their help center.
Seems you can try to stop robots from following links with specific querystrings too.
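A robots.txt along those lines might look like this sketch. Note that the `*` wildcard in Disallow is an extension to the original robots.txt convention, though the big crawlers generally document support for it:

```
# Keep Baiduspider away from query-string variants of pages.
User-agent: Baiduspider
Disallow: /*?

# Everyone else: no restrictions.
User-agent: *
Disallow:
```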
-
It's always been wrong because the web was never intended for this, HTTP is stateless and has no notion of sessions. Cookies are an after-market bodge that appeared some time afterwards, so in the meantime sessions in URLs were the only option; not that that made it right.
-
I think that spider honours robot.txt as well. It's suggested in their help center.
I don't see them requesting robots.txt; they've probably cached some old version of it.
Anyway, the pages being served to them now don't include the query params. OTOH, if they sit on a stash of to-be-processed URLs from the 6M pages that they've previously requested, this might keep going on for a while...
Since it's not significantly impacting me, I'm OK with waiting it out. Call it an extended stress test. If I ever happen to get famous, I now know that my personal homepage can handle over 1M monthly page requests. ;-)
-
Currently it's actually set to URL only, but the only other option seems to be "Auto", which defaults to cookies but falls back to URL-rewriting.
URL rewriting is a miserable way of doing this. The only sane reason for pushing a varying value into the query is to work around dumb-ass caches, and those should be simple random values, not the session token.
Set it to auto, please.
-
It's always been wrong because the web was never intended for this, HTTP is stateless
It's always interesting to me how many people forget this. I figure, design as much of your API as you can without sessions, then add them where needed. It reduces bloat and cruft.
-
Well "after-market bodge" or not, cookies are literally the only way to have sessions. There's nothing else that both works, and is universally supported by all browsers.
-
No, the only way to have it work everywhere is fudge it into the URL. For a long time after cookies were a thing, Google couldn't actually use them correctly. Certainly in 2005 it couldn't send them correctly with requests though no doubt it does today.
This thread is actually evidence that Baidu doesn't get cookies right, since if it did, I see no reason to be fudging the session into the URL like that framework does unless specifically told to ignore it...
-
No, the only way to have it work everywhere is fudge it into the URL.
Dude, I just told you why that doesn't work. If you're going to say I'm wrong at least drop a few facts backing-up your opinion.
-
drop a few facts backing-up your opinion.
Not sourced, but there's some fact dropping in relation:
For a long time after cookies were a thing, Google couldn't actually use them correctly. Certainly in 2005 it couldn't send them correctly with requests though no doubt it does today.
-
Well yeah but I have no idea what "Google couldn't actually use cookies correctly" means.
-
Didn't send them at all.
-
what "Google couldn't actually use cookies correctly" means.
I inferred it to indicate that despite cookies being a thing, entities as popular as Google were not able to make use of them, indicating that the URL fallback was indeed the only way to have sessions (at least while using said programs).
-
... send them to whom? For what purpose? What Google product are we talking about? When did this happen?
-
... send them to whom?
The server and/or client?
For what purpose?
To fulfill the request.
What Google product are we talking about?
Any of them that exhibits the mentioned symptoms.
When did this happen?
Certainly in 2005
Okay, now you're being intentionally ignorant.
-
So. Google has at least one product which, certainly in 2005, failed to store cookies?
And because I have no idea what Arantor or you are talking about, I'm being "intentionally ignorant". Ok. Time for the mute button.
-
Googlebot itself, you know, the one for their search engine, the one everyone colloquially refers to simply as Google.
Sorry for being terse, I'm on mobile but wish I hadn't bothered having an opinion on the subject. I keep feeling like this.
Wake me when we're on NodeBB.
-
Ok. Time for the mute button.
Finally! Took you long enough...
I keep feeling like this.
Blakey has that effect on people, it's not just you.
-
On the other hand, it doesn't sound wise to me to send an object holding 5000+ fields on each postback, or send sensitive information across each time :P
These things are best held in Session.
-
Anyway, the pages that they are being served now (to them) don't include the query params. OTOH, if they sit on a stash of to-be-processed-URLs from the 6M pages that they've previously requested, this might keep going on for a while...
Google takes 90 days to forget about a deleted page, and the other bots seem to behave similarly. I have a lot of deleted pages that the bots keep indexing too, from long ago.
Other than whatever you did to solve the problem, you could have a meta tag with a canonical URL, as explained at https://support.google.com/webmasters/answer/139066. That should signal any bot that any page with the same canonical URL is the same page.
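For the record, the canonical hint is actually a `<link>` element in the page head rather than a meta tag. A sketch, with example.com as a placeholder:

```html
<!-- In the <head> of every /blah?foo=... variant: -->
<link rel="canonical" href="https://example.com/blah" />
```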
I suspect you're not being really crawled by Baidu anyway. Search engines don't drown your server in requests like this.
You're probably being crawled by an e-mail harvesting bot or something like that. These don't respect anything. You can try a firewall, mod_security, bad-behavior, etc.
-
I remember that if you want a URL to be forgotten by Google, all you need to do is configure your webserver to return HTTP 410 (Gone). The next time Google's bot visits the URL and sees that status code, it'll remove the URL from the index.
Not sure if Baidu implemented that too, but they seem to respond to other HTTP status codes that you could try.
And BTW, seems Baidu will only refresh your page stats on a 2-4 week interval, referenced from question 8 of their CS site.
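In nginx terms (the OP's server), the 410 trick is a one-liner. A sketch with a placeholder path:

```nginx
# Answer a retired path with 410 Gone so crawlers drop it from their index.
location = /old-page {
    return 410;
}
```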
-
Say, phpinfo.php ...
-
I suspect you're not being really crawled by Baidu anyway. Search engines don't drown your server in requests like this.
Might be, but
whois
on the originating IPs (e.g., 180.76.15.X) returns stuff that points towards Baidu. Searching for that IP range (on Google, not Baidu) doesn't immediately reveal anything bad (except for a post on Reddit, where somebody seems to have run into a similar problem as I have).
Other than whatever you did to solve the problem, you could have a meta tag with a canonical URL as explained at https://support.google.com/webmasters/answer/139066 That should signal any bot that any page with the same canonical URL is the same page.
That sounds like a good idea either way.
-
Set it to auto, please.
Will do, probably during the weekend. I don't actually use the sessions for anything ATM, so it shouldn't break anything, but I want to have a bit of time for testing, just to make sure that nothing has broken too badly.
Filed under: I've heard "this shouldn't break anything" often enough to know that that's probably not the case.