ArchiveBot - "that's a nice robots.txt file you have there. Shame if anything should happen to it..."


  • Discourse touched me in a no-no place

    Continuing the discussion from So... I am no longer blind about performance here ...:

    @Kuro said:

    You fool! I only visit this forum via the wayback-machine! You are totally breaking my workflow!

    archive.org isn't a problem. These fucknuggets on the other hand...

    Paraphrased:

    We think there's only one valid reason to use robots.txt and that reason is because you're using GET requests to modify data. No other possible reason for wanting to filter automated requests could possibly ever exist.

    In fact, I think this discussion of AT/AB deserves a thread all by itself....


    People who are unaware of the context of this, please start here and follow stuff mentioning ArchiveTeam/ArchiveBot:

    https://what.thedailywtf.com/t/so-i-am-no-longer-blind-about-performance-here/49472/138?u=pjh



  • We think there's only one valid reason to use robots.txt and that reason is because you're using GET requests to modify data. No other possible reason for wanting to filter automated requests could possibly ever exist.

    I guess that makes some sense if you assume static content.


  • SockDev

    @boomzilla said:

    I guess that makes some sense if you assume static content.

    what about if you assume that the webmaster had a reason for asking that spiders leave certain parts of the site alone?



  • @accalia said:

    what about if you assume that the webmaster had a reason for asking that spiders leave certain parts of the site alone?

    They're just starting with the assertion that web developers are morons and drawing from that.

    I see very little problem with this...



  • @accalia said:

    what about if you assume that the webmaster had a reason for asking that spiders leave certain parts of the site alone?

    Possibly. What's the reason? If they don't want it to be public then they probably shouldn't make it public.


  • SockDev

    @Maciejasjmj said:

    They're just starting with the assertion that web developers are morons and drawing from that.

    if your web designers are the ones dictating the contents of robots.txt then you are indeed a moron. that should be determined by a proper webmaster.

    and for the purpose of clarity: a proper webmaster is an actual competent sysadmin who is dedicated to and specializes in running web systems. you know, the kind of person who knows what robots.txt is actually for


  • SockDev

    @boomzilla said:

    Possibly. What's the reason? If they don't want it to be public then they probably shouldn't make it public.

    i can think of several actually.

    robots.txt should be able to

    • block badly behaved spiders (such as ArchiveBot that places excessive load on the server)
    • prevent bots from indexing pages with highly volatile data (data that should not be indexed because by the time a user visits it the data will have changed)
    • act as a second layer of protection for secure sections of the site (ask the bot not to index /secure so they never make the request that you will reject as unauthorized, particularly if there is any anti-DoS code that should block the bot outright after a certain number of authorization failures)
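    For illustration, a robots.txt covering those three cases might look like the sketch below; the paths are only examples, not anyone's actual site layout:

```text
# Shut out one badly behaved crawler completely
User-agent: ArchiveBot
Disallow: /

# Everyone else: skip the volatile and secure areas
User-agent: *
Disallow: /live-data/
Disallow: /secure/
```

    Note the more specific `User-agent` group wins for a matching bot, so ArchiveBot gets only the blanket `Disallow: /`.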


  • @accalia said:

    block badly behaved spiders (such as ArchiveBot that places excessive load on the server)

    OK...I guess that's a more specific version of my original statement (though I suppose big static images [or whatever] could clog up pipes, too).

    @accalia said:

    prevent bots from indexing pages with highly volatile data (data that should not be indexed because by the time a user visits it the data will have changed)

    That makes sense.

    @accalia said:

    act as a second layer of protection for secure sections of the site

    Meh...



  • @accalia said:

    what about if you assume that the webmaster had a reason for asking that spiders leave certain parts of the site alone?

    Yeah, robots.txt used to be good for doing stuff like that. Unfortunately the Bad Boys soon learnt that it was a convenient list for where the good stuff's at. The best that the Authors can advise is not to use robots.txt to hide stuff. They even go so far as to recognise that there is no "allow" verb, but appear quite happy not to change anything. That's a :wtf: right there.

    Better a (long) list of where you can go, than a short list of where you can't.

    I recall reading somewhere a discussion of moving the functionality of robots.txt to somewhere more obscure, like the Headers or something. But it is still providing a list of the things you want to hide.

    Of course, the real solution is to construct your web site with only the stuff you want to be seen in your webroot directory.

    Go here to get it from the horse's mouth


  • SockDev

    @boomzilla said:

    Meh...

    security in depth! security in depth!

    the more layers you have the better, and that one is super simple and easy to put into place. makes a nice layer on top of your real proper armour.


  • I survived the hour long Uno hand

    @accalia said:

    act as a second layer of protection for secure sections of the site

    Bad idea

    @accalia said:

    particularly if there is any anti-DoS code that should block the bot outright

    Good idea. "Don't try to hit this section or your bot will get blocked and be unable to finish crawling the site" is a totally legitimate warning to give IMO


  • SockDev

    @Yamikuronue said:

    Bad idea

    @Yamikuronue said:

    Good idea.

    :wtf: those are the same two situations.... mostly

    "hey, you won't be able to read that over there, so save yourself the effort and just skip it"
    is pretty much the same thing (from the spider's perspective) to my mind as
    "hey, if you try to read that over there you'll lock yourself out of the whole site"

    you'll still need the security on the secure section, but i'd much rather have it there and have robots.txt in place, so when i go trawling through the logs i don't see the spiders trying to get at the secure section of the site and possibly miss a real intrusion attempt in the noise.


  • Discourse touched me in a no-no place

    @Yamikuronue said:

    Good idea. "Don't try to hit this section or your bot will get blocked and be unable to finish crawling the site" is a totally legitimate warning to give IMO

    Ah - forgot about that as a defense against maliciously acting crawlers (stick a Disallow /url_not_referenced_anywhere_else in there as a honeypot and blacklist with a vengeance anything pulling index.php from it.)

    Dumb crawlers: just hide a URL that's invisible to humans somewhere/everywhere and use the same script or a variation on it.
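    The blacklisting half could be sketched like this, assuming Apache-style access logs; the log format and the honeypot path are assumptions for illustration, not anyone's actual setup:

```python
# Sketch: blacklist any client that requests the robots.txt honeypot path.
import re

HONEYPOT = "/url_not_referenced_anywhere_else/"
# Matches the client IP and request path in a common/combined log line
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) (\S+)')

def harvest_blacklist(log_lines):
    """Return the set of client IPs that ever touched the honeypot."""
    banned = set()
    for line in log_lines:
        m = LOG_LINE.match(line)
        if m and m.group(2).startswith(HONEYPOT):
            banned.add(m.group(1))
    return banned
```

    Feed the result into whatever firewall or deny-list your server consults, and only clients that read robots.txt and deliberately ignored it end up on it.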


  • SockDev

    @PJH said:

    Ah - forgot about that as a defense against maliciously acting crawlers (stick a Disallow /url_not_referenced_anywhere_else in there as a honeypot and blacklist with a vengeance anything pulling index.php from it.)
    ooooh.... i like that... i'll have to remember that trick.

    hmm........ i should put that as a honeypot for servercooties (autogenerate a robots.txt with one of those as the first and only disallow and then serve up a special error page if anyone hits it.... (maybe for version 2.0))



  • Re: Honeypots.

    Many years ago I ran a personal webserver, and for the purposes of brevity of this story, I had configured Apache to run my code rather than its own code for 404 handling.

    Every time various popular open source server utilities were updated, I used to be plagued by bots trying to take advantage of the inevitable new exploits.

    I had a two-pronged defence strategy, which, with minor variations, was:

    • Gave them a fake 200 status with a blank page.
    • Harvested their IPs so when they came back to my server at any time in the future, I would redirect any and all of their requests to drive A: and then forget to put any media in it.
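    The two prongs could be sketched as a toy in-memory handler like this; every name here is invented for illustration, and the real punishment (the drive A: redirect) is stood in for by a blank response:

```python
# Toy sketch of the two-pronged defence: exploit probes get a fake blank 200,
# and the probing IP is remembered and punished on every later visit.
EXPLOIT_PATTERNS = ("/phpmyadmin", "/wp-login.php", "/cgi-bin/php")
harvested_ips = set()

def handle_request(client_ip, path):
    """Return (status, body) for a request."""
    if client_ip in harvested_ips:
        # Previously caught probing: stall them (the "drive A:" treatment).
        return (200, "")
    if any(path.lower().startswith(p) for p in EXPLOIT_PATTERNS):
        harvested_ips.add(client_ip)
        return (200, "")  # fake success, blank page: the bot learns nothing
    return (200, "<html>real content</html>")
```

    The fake 200 matters: a 404 tells the bot to move on to the next exploit, while a blank success page wastes its time.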


  • @loose said:

    I would redirect any and all of their requests to drive A: and then forget to put any media in it.

    Wouldn't that constant scratching noise get annoying after a while?


  • SockDev

    Who said drive A: is a floppy drive? :stuck_out_tongue:



  • It's called a joke you idiot.


  • SockDev

    Oh, of course! I'm so glad you came here specifically to point that out! I mean, it's not like it's possible I was talking about something even more annoying than a floppy drive! How stupid of me to forget that there's no such thing as a tape drive?



  • What the fuck are you talking about?

    No, nevermind, I don't care.



  • @loose said:

    Harvested their IP's so when they came back to my Server at any time in the future, I would ~~redirect~~ reverse-proxy any and all of their requests to ~~drive A: and then forget to put any media in it~~ a list of random shock sites.

    Evil Ideas Threaded that for you.


  • I survived the hour long Uno hand

    @accalia said:

    "hey, you won't be able to read that over there, so save yourself the effort and just skip it" is pretty much the same thing (from the spider's perspective) to my mind as "hey, if you try to read that over there you'll lock yourself out of the whole site"

    but they're both very different from the naive reading of your post, which is "Hey, please don't look in this folder, it's got sensitive information in it"



  • No actually it's true, and yes it was a floppy. My home server spent most of its time sitting at home alone.

    After a while the scratching noises stopped, so I assumed something eventually died from the lack of nutrition.

    What would have been annoying would have been to give them something to read from the floppy, as that produces additional whines and grumbles. But I was sorely tempted.


  • Discourse touched me in a no-no place

    Would you two go argue elsewhere.



  • @PJH said:

    archive.org isn't a problem.

    Archive.org's use of robots.txt fucking well is a problem.

    If the current version of a site has a "no archiving" setup in robots.txt, then archive.org will hide the entire history(1) of the site, even those versions that didn't have this setup. (Well, OK, it did when I noticed it a few years ago. A site I maintained for a while got picked up by a placeholder squatting domain buyer who put in a robots.txt that blocked archive.org. None of the older versions (from when I was maintaining the site) were visible because of the squatter's robots.txt.)

    (1) Well, they don't hide the history of their visits, but they do hide the history of the pages.



  • @anonymous234 said:

    Wouldn't that constant scratching noise get annoying after a while?

    Not to mention: Who runs apache on Windows?

    INB4: or DOS‽



  • You have no idea how much I lusted after a tape drive, any tape drive, back in those days



  • It was a custom-built, fully RAIDed, dual (physical) CPU machine running Red Hat


  • SockDev

    @Yamikuronue said:

    but they're both very different from the naive reading of your post, which is "Hey, please don't look in this folder, it's got sensitive information in it"

    hmm.... ok, that i see.

    i underspecified what i meant with

    @accalia said:

    act as a second layer of protection for secure sections of the site

    i meant that if the bot requests /secure they will be rejected, and additionally you list /secure in robots.txt so that the bot knows that the request will fail and so can avoid requesting it in the first place.

    two layers are better than one layer. ;-)



  • Oh, the "technique" that I was developing (quite successfully, in the end) required that I muck about with the 404 handling.

    In essence, any and every request to the server got a 200. And I mean any. I was amused one day, trawling through the logs, to find that somebody had actually realised this and commenced to amuse himself by requesting random pages - some of which were quite inventive.
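    For the curious: one plausible way to wire that up in Apache is `ErrorDocument 404 /cgi-bin/catchall.py`, with the handler emitting a CGI `Status:` header to override the 404. The helper below is an invented sketch of such a handler's response, not loose's actual code:

```python
def catchall_response(body=""):
    """Build the CGI response a catch-all 404 handler would emit.

    Apache's ErrorDocument hands the missed request to the script;
    printing a CGI "Status: 200 OK" header makes the client see a
    plain success instead of a 404, whatever URL it probed.
    """
    headers = ["Status: 200 OK", "Content-Type: text/html"]
    return "\r\n".join(headers) + "\r\n\r\n" + body
```

    A real CGI script would simply print this string to stdout for every request that missed a file.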


  • Discourse touched me in a no-no place

    @loose said:

    In essence, any and every request to the server got a 200. And I mean any.

    Hopefully you were just customising 404 responses. Doing so for the other errors would be really annoying.



  • Yes, the server didn't have any physical web pages.


  • Discourse touched me in a no-no place

    @loose said:

    Yes, the server didn't have any physical web pages.

    Of course not. They're entirely virtual concepts in the first place. :stuck_out_tongue: :trolleybus:

    But seriously, I've built systems where the entire content as viewed by clients was completely separated from how the content was stored on disk. The web application that mediated between the two did some clever caching (and was fast at rendering in the first place).



  • @dkf said:

    They're entirely ~~virtual~~ visual concepts in the first place

    FTFY



  • "We will delete anything you say about this issue" followed a few paragraphs later by "hey, you should really try discussing this with us".


  • SockDev

    @boomzilla said:

    Not to mention: Who runs apache on Windows?

    INB4: or DOS‽

    waves hand in the air Oh I know, I know, pick me! :P



  • @Arantor said:

    Who runs apache on Windows?

    People that want to run a web server that has one hand tied behind its back and is forced to hop on one leg?



  • XAMPP makes a decent enough development environment. Beats having to run a Linux VM...



  • That’s not [url=http://zx81-siggi.endoftheinternet.org/index.html]hardcore[/url] enough.



  • Yes it does, and you don't have to faff around under the skirts of a 'nix system to make things work (yes I know about 'apt install' etc) :thumbsup:



  • Wow! That is hardcore and dedication.



  • @loose said:

    Yes it does, and you don't have to faff around under the skirts of a 'nix system to make things work (yes I know about 'apt install' etc)

    Sorry, Replied to the wrong post :flushed:


  • SockDev

    It is a shame you have not yet achieved Lounge access, because then you would know on what basis I know about Apache on Windows.

    But I will agree with the sentiment that WampServer, XAMPP et al are good for a basic dev instance without a VM.



  • Oh, right, secret squirrel stuff :wink:

    I'll make a note to look that up when I drop in (the Lounge).

    Hey! I bet there is a "Staff Room" that not even the denizens of the Lounge know about. It probably requires you to have been there long enough for mould and lichen to have grown over you, before they let you know the place even exists.


  • SockDev

    @loose said:

    I bet there is a "Staff Room" that not even the denizens of the Lounge know about.

    TL4s get Turn Left, and yes there is a Staff-only section.

    And no, this isn't secret knowledge.



  • @RaceProUK said:

    this isn't secret knowledge

    Any more than Area 51 is :wink: :wink:


  • SockDev

    Not exactly sekrit squirrel, just would prefer not to be so public about it since it isn't anonymised as much as, say, the front page is.


  • area_deu

    :arrow_left: is the best subforum.



  • @Arantor said:

    It is a shame you have not yet achieved Lounge access, because then you will know on what basis I know about Apache on Windows.

    TL;DR: You work with crazy people. They could be using friggin' Star Trek technology and still fuck this up.

    @RaceProUK said:

    TL4s get Turn Left, and yes there is a Staff-only section.

    Also a super-secret authors-only section. So far, we have no idea what to do with it.



  • @Maciejasjmj said:

    So far, we have no idea what to do with it.

    I am quite sure that suggestions will now come pouring in :P

