So, hmm, Baiduspider
-
I have a small VPS that serves my personal homepage, and there's really not much in the way of content. By accident, I kinda noticed something odd:
$ ls -lh nginx_logs
-rw-r--r-- 1 10101 10101 1,3G 15 mar 12.43 access-DOMAIN
-rw-r--r-- 1 10101 10101 1,8G 15 mar 12.43 error-DOMAIN
So, that's a bit odd, especially considering
$ head nginx_logs/access-DOMAIN
180.76.15.29 - - [03/Feb/2016:time] "GET URL HTTP/1.1" 200 6580 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)" "2.59"
...
I.e., the first log entry is from the third of February this year (which is pretty much the last time I did some house cleaning).
A quick glance shows that Baiduspider is kinda pretty prominent in those logs. In fact:
$ grep Baiduspider access-DOMAIN | wc -l
6051058
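Doing the back-of-the-envelope arithmetic on that count (nothing here is measured live, just the grep total spread over the log window):

```python
# Sanity-check sketch: total Baiduspider requests over the 42-day window.
total_requests = 6051058
days = 42

per_day = total_requests / days
per_hour = per_day / 24

print(round(per_day))   # ~144073 requests/day
print(round(per_hour))  # ~6003 requests/hour
```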
That's around 140000 requests every day, or around 6000 each hour, for the last 42 days. Hadn't noticed, the server is running quite snappy. Neither nginx nor the FastCGI backend that's serving the page is really draining any resources (other than wasted network bandwidth). In fact,
$ lxc-info --name nginx
Name: nginx
State: RUNNING
PID: 2422
IP: 10.0.X.3
CPU use: 13346.65 seconds
Memory use: 26.35 MiB
KMem use: 0 bytes
Link: vethBLAH
TX bytes: 80.71 GiB
RX bytes: 6.94 GiB
Total bytes: 87.65 GiB
and
$ lxc-info --name homepage
Name: homepage
State: RUNNING
PID: 7421
IP: 10.0.X.6
CPU use: 49959.60 seconds
Memory use: 30.33 MiB
KMem use: 0 bytes
Link: vethBLEH
TX bytes: 882 bytes
RX bytes: 19.65 KiB
Total bytes: 20.51 KiB
It seems that the framework I'm using rewrites internal links from "/blah" to "/blah?foo=RANDOMGARBAGE". Baiduspider seems to think that each RANDOMGARBAGE indicates a unique page (eh, fair enough, I guess?). I found a config file that lists user agents that are to be considered spiders, and Baiduspider was missing from that list. Fixed that - it apparently prevents the link-rewriting and serves the non-AJAXy version of the page.
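For anyone curious, the entry looked roughly like this. This is a sketch based on Wt's wt_config.xml; the exact element names may differ between versions, so check the reference config shipped with your install:

```xml
<!-- Hypothetical sketch of a bot user-agent list in wt_config.xml.
     Matching user agents get the plain, non-rewritten version of the page. -->
<user-agents type="bot">
  <user-agent>.*Googlebot.*</user-agent>
  <user-agent>.*Baiduspider.*</user-agent>
</user-agents>
```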
Now, considering I'm not an expert on webdev-things, I'm fairly sure that I'm the WTF here (if nothing else, then for using Wt (a C++ framework) to serve my homepage). Nevertheless, do I really need to manually maintain a list of spiders that might run across my page? Shouldn't Baiduspider figure out that after the first 1M-or-so kinda identical pages the next 5M shouldn't matter too much?
Also, now that I've (hopefully?) fixed the issue, how long should I let the Baidubombardment continue - i.e., will it eventually pick up on the changes? Should I (temporarily?) blacklist Baiduspider and/or associated IPs? Other actions?
-
(if nothing else, then for using Wt (a C++ framework) to serve my homepage).
Yes.
Nevertheless, do I really need to manually maintain a list of spiders that might run across my page?
Not if your code doesn't do silly things like the one you mentioned. Appending random query params is typically done to forcibly invalidate caches, but that's only necessary for aggressively cached static content; application-generated pages should just consume and generate proper cache headers.
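To make "proper cache headers" concrete, here's a minimal sketch: a validator (ETag) lets clients and spiders revalidate cheaply, with no cache-busting query parameters needed. All names here are illustrative, not any particular framework's API:

```python
import hashlib

def cache_headers(body, max_age=300):
    # Derive a validator from the content; any stable digest works.
    etag = '"%s"' % hashlib.sha256(body).hexdigest()[:16]
    return {"Cache-Control": "max-age=%d" % max_age, "ETag": etag}

def respond(body, if_none_match=None):
    headers = cache_headers(body)
    if if_none_match == headers["ETag"]:
        return 304, headers, b""   # client's copy is still good, send nothing
    return 200, headers, body

status1, headers, _ = respond(b"<html>hi</html>")
status2, _, _ = respond(b"<html>hi</html>", headers["ETag"])
print(status1, status2)  # 200 304
```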
There should be a public list of bots around somewhere, though.
Should I (temporarily?) blacklist Baiduspider and/or associated IPs?
They might drop the listing if the server stops responding. If it doesn't visibly strain your resources then it's not really necessary anyway.
Consider creating a sitemap and maybe disabling logging of those requests, though.
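A sitemap is just an XML file listing your URLs. A minimal sketch (example.com and the paths are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2016-03-15</lastmod>
  </url>
  <url>
    <loc>https://example.com/blah</loc>
  </url>
</urlset>
```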
-
generate proper cache headers.
I'm guessing that "Cache-control: no-cache, no-store, must-revalidate" isn't the right answer here, then.
Guess I need to take a look at that. I was hoping that the framework wouldn't do anything too stupid there, but that's apparently putting a bit too much faith in it. Or probably rather a case of not RTFM:ing properly.
In my defense, the thing has been online for around 1.5 years or so, and this is the first time this problem pops up. I did keep a closer eye on things during the first 3-4 months.
They might drop the listing if the server stops responding. If it doesn't visibly strain your resources then it's not really necessary anyway.
As said, it's a personal homepage, so I could live without being listed by Baidu. But, no, it doesn't really strain the resources of the server, so I guess I'll leave it and see if the problem eventually goes away with listing Baidu explicitly as a spider.
Consider creating a sitemap and maybe disabling logging of those requests, though.
That almost sounds like work. Also, I'd need to figure out what exactly a sitemap is and how to create one. (But, yeah, this is actually in the TODO-list; I'm just not spending much (any) time on doing anything with it.)
-
I'm guessing that "Cache-control: no-cache, no-store, must-revalidate" isn't the right answer here, then.
If you don't want those pages to be cached at all, then it's fine. If you're doing that then you definitely don't need the random query arguments, though.
-
If you don't want those pages to be cached at all, then it's fine. If you're doing that then you definitely don't need the random query arguments, though.
Sorry, maybe should have mentioned this. The random query arguments are -I believe- generated by the framework to pass around some per-session data (or, more likely, a per-session ID), when it fails to Javascriptify the page and can't use cookies for some reason.
The default assumption is probably that neither cookies nor JS work, so the initial page anybody gets contains those for that reason (at least that's what I'm guessing is going on).
Whether or not that's a good idea, I don't really have any opinion on; it's something the framework does automagically. It means that the page is viewable without JS but uses some fancy AJAXy loading stuff when it's available. Not that that's really needed either way (but I get it for "free", so yay?).
-
It seems that the framework I'm using rewrites internal links from "/blah" to "/blah?foo=RANDOMGARBAGE".
.... why?
EDIT: to answer the actual question:
Also, now that I've (hopefully?) fixed the issue, how long should I let the Baidubombardment continue - i.e., will it eventually pick up on the changes? Should I (temporarily?) blacklist Baiduspider and/or associated IPs? Other actions?
You already answered it:
Hadn't noticed, the server is running quite snappy.
If you hadn't even noticed it, then it's not an issue. Just let it ride.
But do fix your framework so it's not doing stupid things. Because Baidu is in the right here, and your site is clearly in the wrong.
-
.... why?
See post above -- at least that's what I think is going on.
But do fix your framework so it's not doing stupid things. Because Baidu is in the right here, and your site is clearly in the wrong.
I checked: with Baidu in the list of spiders and that user agent, a version without the random garbage appended is served. I guess it'll fix the current set of symptoms (~140k daily requests by Baidu).
Fixing the problem "for good" is a bit tougher. I'm not the author of the framework, and not really in the mood to go hacking around in it (and so far I've not had to touch that code at all - the docs and examples were good enough to make do with just that).
The path of least resistance would be to just keep an eye on the logs (probably a good idea anyway), and whenever the problem re-appears for some other bot/spider, I'll update the list of spiders/bots. (Or perhaps find a list by somebody else and use that.)
But since I now know that the site is misbehaving on my end, I can see if I can come up with a reasonable/proper fix eventually.
-
See post above -- at least that's what I think is going on.
It's still broken.
If it's using a URL param as a session ID, that means if you paste a link to someone via IM or something, they're suddenly in your session also. Bubye credit card. If it's using URL param + IP, maybe... maybe... that's barely acceptable. But... then you're breaking a session if a legit user's IP changes (say, his cellphone moves towers.) So it's just annoying, not like personal-data-leaking. (EDIT: of course the person you IM could also be on the same IP, via VPN or NAT or what-not, in which case we're back to bubye credit card.)
Cookies were invented for a good reason. If you don't have cookies, you really just can't do sessions and shouldn't even try.
EDIT: Oh and BTW, if the guy who wrote your web framework didn't know any of the above, are you sure you want to be using it? Because, seriously, it's his job to know shit like this. Who knows what else he got wrong.
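To make the cookie point concrete, here's a minimal sketch of why cookie-based sessions avoid the link-sharing problem: the session ID travels in a cookie header, never in the URL, so pasting a link to someone can't hand them your session. All names are illustrative:

```python
import secrets

SESSIONS = {}  # server-side session store

def start_session():
    sid = secrets.token_urlsafe(32)              # unguessable ID
    SESSIONS[sid] = {"cart": []}                 # per-session state lives server-side
    # HttpOnly keeps scripts away from it; Secure keeps it off plain HTTP.
    set_cookie = "session=%s; HttpOnly; Secure; SameSite=Lax" % sid
    return sid, set_cookie

def lookup(cookie_header):
    # Parse "name=value" pairs from the Cookie header and match a live session.
    for part in cookie_header.split(";"):
        name, _, value = part.strip().partition("=")
        if name == "session" and value in SESSIONS:
            return SESSIONS[value]
    return None

sid, cookie = start_session()
print(lookup("session=" + sid) is SESSIONS[sid])  # True
print(lookup("session=bogus"))                    # None
```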
-
It's still broken.
<snip />
Cookies were invented for a good reason. If you don't have cookies, you really just can't do sessions and shouldn't even try.
Ok, that's a fair point. A quick look into the config file reveals that there's a setting for the tracking. Currently it's actually set to URL only, but the only other option seems to be "Auto", which defaults to cookies but falls back to URL-rewriting.
The docs state:
tracking
How session tracking is implemented: automatically (using cookies when available, otherwise using URL rewriting) or strictly using URL rewriting (which allows multiple concurrent sessions from one user).
Not exactly a lot of choice there.
EDIT: Oh and BTW, if the guy who wrote your web framework didn't know any of the above, are you sure you want to be using it? Because, seriously, it's his job to know shit like this. Who knows what else he got wrong.
FWIW, there's an entirely separate module for actual user authentication. A quick look at that seems to indicate that sessions and user-authentication are tracked separately (i.e., a single logged-in user can have several sessions, even from the same browser). I haven't looked at that in much detail (since I don't need anything like that at the moment).
As for what else he's got wrong? Probably not much more (or less) than the vast majority of the other frameworks out there. But, I'm not planning on using it for anything even remotely critical, and if I did, well ... let's not go there, because honestly, I'm not the right person to write a webservice that handles any critical data. Not my area of expertise, and I'm aware of that.
-
EDIT: Oh and BTW, if the guy who wrote your web framework didn't know any of the above, are you sure you want to be using it? Because, seriously, it's his job to know shit like this. Who knows what else he got wrong.
Or he misunderstood EU cookie law. Seems to be common.
-
The law doesn't matter; putting a session identifier in the URL has always been technically wrong.
-
I think that spider honours robots.txt as well. It's suggested in their help center.
Seems you can try to stop robots from following links with specific querystrings too.
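A robots.txt along those lines might look like this sketch. Note that the `*` wildcard in Disallow is an extension to the original robots.txt convention, though the big crawlers generally document support for it:

```
# Keep Baiduspider away from query-string variants of pages.
User-agent: Baiduspider
Disallow: /*?

# Everyone else: no restrictions.
User-agent: *
Disallow:
```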
-
It's always been wrong because the web was never intended for this, HTTP is stateless and has no notion of sessions. Cookies are an after-market bodge that appeared some time afterwards, so in the meantime sessions in URLs were the only option; not that that made it right.
-
I think that spider honours robot.txt as well. It's suggested in their help center.
I don't see them requesting robots.txt; they've probably cached some old version of it.
Anyway, the pages being served to them now don't include the query params. OTOH, if they sit on a stash of to-be-processed URLs from the 6M pages that they've previously requested, this might keep going on for a while...
Since it's not significantly impacting me, I'm OK with waiting it out. Call it an extended stress test. If I ever happen to get famous, I now know that my personal homepage can handle over 1M monthly page requests. ;-)
-
Currently it's actually set to URL only, but the only other option seems to be "Auto", which defaults to cookies but falls back to URL-rewriting.
URL rewriting is a miserable way of doing this. The only sane reason for pushing a varying value into the query is to work around dumb-ass caches, and those should be simple random values, not the session token.
Set it to auto, please.
-
It's always been wrong because the web was never intended for this, HTTP is stateless
It's always interesting to me how many people forget this. I figure, design as much of your API as you can without sessions, then add them where needed. It reduces bloat and cruft.
-
Well "after-market bodge" or not, cookies are literally the only way to have sessions. There's nothing else that both works, and is universally supported by all browsers.
-
No, the only way to have it work everywhere is fudge it into the URL. For a long time after cookies were a thing, Google couldn't actually use them correctly. Certainly in 2005 it couldn't send them correctly with requests though no doubt it does today.
This thread is actually evidence that Baidu doesn't get cookies right, since if it did, I see no reason to be fudging the session into the URL like that framework does unless specifically told to ignore it...
-
No, the only way to have it work everywhere is fudge it into the URL.
Dude, I just told you why that doesn't work. If you're going to say I'm wrong at least drop a few facts backing-up your opinion.
-
drop a few facts backing-up your opinion.
Not sourced, but there's some fact dropping in relation:
For a long time after cookies were a thing, Google couldn't actually use them correctly. Certainly in 2005 it couldn't send them correctly with requests though no doubt it does today.
-
Well yeah but I have no idea what "Google couldn't actually use cookies correctly" means.
-
Didn't send them at all.
-
what "Google couldn't actually use cookies correctly" means.
I inferred it to indicate that despite cookies being a thing, entities as popular as Google were not able to make use of them, indicating that the URL fallback was indeed the only way to have sessions (at least while using said programs).
-
... send them to whom? For what purpose? What Google product are we talking about? When did this happen?
-
... send them to whom?
The server and/or client?
For what purpose?
To fulfill the request.
What Google product are we talking about?
Any of them that exhibits the mentioned symptoms.
When did this happen?
Certainly in 2005
Okay, now you're being intentionally ignorant.
-
So. Google has at least one product which, certainly in 2005, failed to store cookies?
And because I have no idea what Arantor or you are talking about, I'm being "intentionally ignorant". Ok. Time for the mute button.
-
Googlebot itself, you know, the one for their search engine, the one everyone colloquially refers to simply as Google.
Sorry for being terse, I'm on mobile but wish I hadn't bothered having an opinion on the subject. I keep feeling like this.
Wake me when we're on NodeBB.
-
Ok. Time for the mute button.
Finally! Took you long enough...
I keep feeling like this.
Blakey has that effect on people, it's not just you.
-
On the other hand, it doesn't sound wise to me to send an object holding 5000+ fields on each postback, or send sensitive information across each time :P
These things are best held in Session.
-
Anyway, the pages that they are being served now (to them) don't include the query params. OTOH, if they sit on a stash of to-be-processed-URLs from the 6M pages that they've previously requested, this might keep going on for a while...
Google takes 90 days to forget about a deleted page, and the other bots seem to behave similarly. I have a lot of deleted pages that the bots keep indexing too, from long ago.
Other than whatever you did to solve the problem, you could have a meta tag with a canonical URL, as explained at https://support.google.com/webmasters/answer/139066. That should signal any bot that any page with the same canonical URL is the same page.
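For the record, the canonical hint is actually a `<link>` element in the page head rather than a meta tag. A sketch, with example.com as a placeholder:

```html
<!-- In the <head> of every /blah?foo=... variant: -->
<link rel="canonical" href="https://example.com/blah" />
```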
I suspect you're not being really crawled by Baidu anyway. Search engines don't drown your server in requests like this.
You're probably being crawled by an e-mail harvesting bot or something like that. These don't respect anything. You can try a firewall, mod_security, bad-behavior, etc.
-
I remember that if you want a URL to be forgotten by Google, all you need to do is configure your webserver to return HTTP 410 (Gone). The next time Google's bot visits the URL and sees that status code, it'll remove the URL from the index.
Not sure if Baidu implemented that too, but they seem to respond to other HTTP status codes that you could try.
And BTW, seems Baidu will only refresh your page stats on a 2-4 week interval, referenced from question 8 of their CS site.
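In nginx terms (the OP's server), the 410 trick is a one-liner. A sketch with a placeholder path:

```nginx
# Answer a retired path with 410 Gone so crawlers drop it from their index.
location = /old-page {
    return 410;
}
```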
-
Say, phpinfo.php ...
-
I suspect you're not being really crawled by Baidu anyway. Search engines don't drown your server in requests like this.
Might be, but
whois
on the originating IPs (e.g., 180.76.15.X) returns stuff that points towards Baidu. Searching for that IP range (on Google, not Baidu) doesn't immediately reveal anything bad (except for a post on Reddit, where somebody seems to have run into a similar problem as I have).
Other than whatever you did to solve the problem, you could have a meta tag with a canonical URL as explained at https://support.google.com/webmasters/answer/139066 That should signal any bot that any page with the same canonical URL is the same page.
That sounds like a good idea either way.
-
Set it to auto, please.
Will do, probably during the weekend. I don't actually use the sessions for anything ATM, so it shouldn't break anything, but I want to have a bit of time for testing, just to make sure that nothing has broken too badly.
Filed under: I've heard "this shouldn't break anything" often enough to know that that's probably not the case.