Please ease up with the bots
-
The NGINX logs are flooded with bot activity. I just engaged some rate limiting to stop some of the bleeding and keep the site running a bit better.
That said, some bots (SockAdept 1.0.0) are flooding the site at DoS rates. If you are writing a bot, please have some etiquette about the number of requests you make; 10 a minute, as opposed to 10 a second, would be a good start.
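For anyone curious what "engaged some rate limiting" looks like, it's usually a couple of lines of nginx config. The zone name, rate, and burst values below are illustrative, not the site's actual settings:

```nginx
# Illustrative only -- not this forum's actual config.
# limit_req_zone goes in the http {} block: track clients by IP,
# allow roughly 10 requests per minute per client.
limit_req_zone $binary_remote_addr zone=bots:10m rate=10r/m;

server {
    location / {
        # Allow short bursts, reject anything beyond that.
        limit_req zone=bots burst=20 nodelay;
        # Over-limit requests get 429 instead of the default 503.
        limit_req_status 429;
    }
}
```

The 429s the bot authors report hitting further down the thread are consistent with a setup along these lines.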
-
All my instances of sockbot (uses SockAdept as user agent) have been killed. I'll look through the code and make sure they are tuned way back before i turn them back on.
i know others are using Sockbot base code as well. I'll bump the version number in the user agent so we can tell those apart and filter/kill as appropriate.
-
Both of mine are supposed to poll every 10 seconds; I need to see if that's actually being honoured.
-
Ok, that should have made sockbot much less of a resource hog. I increased the polling delay 20x, so it should poll about once every 10 seconds now, +/- the processing time of notifications (plus an additional 5 seconds if it attempts a post as a result of the notification).
I gave the two instances of sockbot different user agents so they can be told apart if they cause more problems (SockAdept 1.1.0 and SockAdept 1.2.0; if you are getting any more SockAdept 1.0.0, that's not an instance run by me). Should sockbot continue to cause problems, please send me a PM or email (i'm in the database); my intention with sockbot is to have fun, not to DoS the site.
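The timing rule described above (a ~10 second base poll, plus an extra 5 seconds after attempting a post) can be sketched as a tiny helper. The names and constants here are mine for illustration, not sockbot's actual internals:

```javascript
// Hypothetical sketch of the polling schedule described above.
// Constants and names are illustrative, not sockbot's real code.
const BASE_POLL_MS = 10 * 1000;   // poll roughly every 10 seconds
const POST_PENALTY_MS = 5 * 1000; // wait an extra 5 seconds after posting

// How long to sleep before the next notification poll.
function nextPollDelay(attemptedPost) {
  return BASE_POLL_MS + (attemptedPost ? POST_PENALTY_MS : 0);
}

// A polling loop would then look something like this:
async function pollLoop(checkNotifications) {
  for (;;) {
    // checkNotifications() returns true if it attempted a post.
    const attemptedPost = await checkNotifications();
    await new Promise((resolve) => setTimeout(resolve, nextPollDelay(attemptedPost)));
  }
}
```

At ~6 requests a minute per instance this is a big improvement over the 10-per-second rates mentioned at the top of the thread.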
-
They're dead for now - can't even get them to log in without hitting a 429 error, despite it being only the second request and 10 seconds after the first.
-
hmm... an excellent point. I'll be back. I'mma going to bake into sockbot that a 429 response shall result in a delay of 60 seconds before processing continues.
not sure why he's not getting 429s and you are though. i think @sam would have to answer that one?
-
I'm leaving them dead for now, in case it's x requests over y time and maybe their previous activity puts them over the limit for now.
I tried adding a 30 second wait after a 429 before a retry but it just gets in a loop.
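One way to avoid that retry loop is to cap the number of retries and back off harder each time instead of retrying on a fixed 30-second delay. This is a generic sketch of that idea, not the actual sockbot or discoursebot code; the constants are illustrative:

```javascript
// Illustrative sketch: bounded, increasing delays on HTTP 429.
// Returns the delay (ms) before retry `attempt` (1-based), or null to give up.
function delayFor429(attempt, { baseMs = 30000, maxRetries = 3 } = {}) {
  if (attempt > maxRetries) return null;      // stop: don't loop forever
  return baseMs * Math.pow(2, attempt - 1);   // 30s, 60s, 120s, ...
}

// Wrapper that retries a request function until success or give-up.
// `sleep` is injectable so the logic can be tested without real waiting.
async function withBackoff(doRequest, sleep = (ms) => new Promise((r) => setTimeout(r, ms))) {
  for (let attempt = 1; ; attempt++) {
    const res = await doRequest();
    if (res.status !== 429) return res;
    const delay = delayFor429(attempt);
    if (delay === null) throw new Error("rate limited: giving up after retries");
    await sleep(delay);
  }
}
```

The key point is the `null` branch: a fixed wait-and-retry can spin forever if the limit is "x requests over y time", while a bounded backoff eventually gives up and leaves the window to reset.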
-
hmm.... well, it's not user-agent based. I tested switching sockbot back to using SockAdept 1.0.0 as a user agent: no 429s. He's back to 1.1.0 where he should be now. Also, a delay of 60 seconds after a 429 is now enforced; should be out to Git later tonight.
-
I've swapped discoursebot back to a static message too, as calculating the last bug logged adds more requests.
Might just swap to your SockBot code (but change the user agent); looking at it, I've got a better idea of what it's doing than I have of the current code, which is making dealing with the 429s more difficult.
-
you are welcome to fork me. but if you can wait till Wednesday i should have the new format of sock_modules finished and some documentation done that will help! ;-)
specifically, i'm hoping to sort out this issue, or at least the new sock_module format i'll need to implement it, over lunch tomorrow.
-
It'll be Weds before I have chance to tackle this again anyway.
-
Then i'll poke you when i make that commit and when i finish that documentation i wanted to write as well.
-
Just so we are clear on what is going on: due to the 10 req/sec bot activity, the nginx logs are 5 GB per day, on a cloud instance with 60 GB of total disk space. Only the last 2 days are kept uncompressed, and I think 7 days before log rotation, but it is a lot.
And daily site db backups are 1.2 GB including images.
-
Stress testing (both the software, and its author).
-
Or, you know, disable logs? That's what sane people do on high-traffic sites. I have nginx serving 100+ req a second for a PHP websocket application just fine with logs turned off.
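For reference, turning access logging off in nginx is a one-liner, and there's a middle ground of logging only errors. Both variants below are illustrative, with a hypothetical log path:

```nginx
# Illustrative: drop access logging entirely (error_log still works).
access_log off;

# Middle ground: only log requests that didn't return 2xx/3xx.
# The map goes in the http {} block; the log path is hypothetical.
map $status $loggable {
    ~^[23]  0;
    default 1;
}
access_log /var/log/nginx/access.log combined if=$loggable;
```

The conditional form would keep the 429s visible for debugging the bot situation while cutting the bulk of the 5 GB/day.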
-
But Metrics!
-
What useless metrics could you possibly get from an access log? Your application should handle any real metrics internally. Worst case is you try and recreate the Google Analytics wheel and roll your own IP access lookup with user agents.
-
suckers... my "bot" doesn't count and is unaffected by the limiting ;) course, I'm not polling anything, just reading the site.
-
I think i've finished what i wanted to do to sockbot.
You'll probably want to look at the notifyprint.js and summon.js sock_modules to tweak for your bots.
i need to write docs for sockbot. i'll do that after i update reader.js tomorrow.
shout if you have questions. i'll be happy to fix issues or explain what i didn't make clear (and then write or fix the docs).
-
you are welcome to fork me.
Giggity.
@codinghorror, I thought that bots were an impossibility. So how is this happening in the first place?
-
he said it couldn't be done in a room full of geeks, nerds, and CompSci professionals.
what did you think was going to happen? ;-P
Here's my github. You can see how i do what i do, if you are willing to read the slightly idiosyncratic way that Node.js does async programming (and my total lack of comments in like 90% of the code. I need to fix that very soon; hopefully it's self-documenting until then).
-
I said spam was much harder to achieve in Discourse due to the all JS nature of the site. I never said bots attached to registered accounts were impossible.
When was the last time you saw spam on this site, versus the old forums?
-
Bots = Spam
Funny spam, but spam. We have them posting. Self-registering seems rather trivial now.
-
When was the last time you saw spam on this site, versus the old forums?
Pretty much every day now, constantly.
-
Hot, sexy MILFs looking for you here
-
It's only now that I've posted it that I realise I inadvertently created "emotse"
-
due to the all JS nature of the site.
It doesn't matter how many jewels you encrust your site in; it is no harder to read the basic network POST & GET commands that are logged in every half-decent browser available. That's the only part needed; it's not like the bot needs to replicate the full UI. Bonus points because the calls almost all return JSON results, so the old page-scraping techniques can be ditched for simply reading pre-created objects directly in the bot code. Botmakers don't need to know or care about how the actual dicsource site javascript works at all.
4 steps to spam bot
- craft a post in the UI, hit "Reply", read network request header
- mimic network request header in language of choice
- ???
- Profit
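The "mimic the network request" step really is that mechanical. Here's a hedged sketch of rebuilding a captured reply request; the endpoint path, field names, and CSRF header are guesses for illustration, not the real Discourse API:

```javascript
// Illustrative only: rebuild a captured "create post" request.
// The path, body fields, and CSRF header below are hypothetical,
// copied from whatever the browser's network tab showed.
function buildReplyRequest(topicId, raw, csrfToken) {
  return {
    method: "POST",
    path: "/posts", // whatever path the captured request used
    headers: {
      "Content-Type": "application/json",
      "X-CSRF-Token": csrfToken,        // lifted from the captured header
      "User-Agent": "SockAdept 1.1.0",  // so admins can tell the bot apart
    },
    body: JSON.stringify({ topic_id: topicId, raw: raw }),
  };
}

// A bot would feed this straight into http.request / fetch and read
// the JSON response back -- no page scraping, no UI automation.
```

Which is the whole point: the "all JS" nature of the site is irrelevant to a bot that speaks HTTP directly.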
-
Sure, someone could "spam" using APIs too. But when you think of spammers, do you think of people writing apps that call APIs? That is an issue of proper API design, and a combination of native and upstream (nginx, haproxy) rate limits.
-
If a target (and/or its market share) is lucrative enough, it will be. I'd also think there'd be some who build tools/scripts and then sell access to the tool to others (if not just share amongst a tight group of fellow spammers).
-
Sure, someone could "spam" using APIs too. But when you think of spammers, do you think of people writing apps that call APIs? That is an issue of proper API design, and a combination of native and upstream (nginx, haproxy) rate limits.
Proper API design eh? And yet this forum is having issues with requests?
-
Proper API design eh? And yet this forum is having issues with requests?
You have been rate limited for Doing It Wrong™. Pray you are not rate limited further.
-
I realise I tested switching sockbot is an access to set their priority both the last bug was much harder to registered accounts like the 10 seconds before I just gets in the new format of people writing apps that are using the community to evaluate! faoileag So yes if they are getting any real metrics internally. Worst case is having issues or fix the attach to be ditched for now.
I tried adding a combination of people do to the bot please send me a good idea! We have made sockbot continue to sort of spammers do what my previous bug logged in every half-decent browser available. That's a lot. And RegexBuddy is unaffected by me. Giggity. https://github.com/AccaliaDeElementia the last time of geeks nerds and recreate the software to the attach to sort of spammers. I think of create gmail spam on this site versus the full of reqs you make. @sam would have wanted to set priority
-
Mostly nonsense post with a link.
Yup, spambots seem possible.
-
Mostly nonsense post with a link.
We already have a Markov chain running around. There's absolutely nothing that mitigates the spammers, maybe except for the fact that the API is so terrible nobody sane would want to figure it out even for a million dollars.
-
Proper API design eh? And yet this forum is having issues with requests?
yeah.... I was doing it wrong. I've fixed that now. i think.....
-
We already have a Markov chain running around. There's absolutely nothing that mitigates the spammers,
To be fair, MottBott is pretty easy to recognize as a Markov chain.
Though if someone were particularly evil and pointed a couple dozen instances at the same Discourse instance simultaneously, I wonder what kind of admin effort would be required to resolve that.
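For the curious, a word-level Markov chain of the MottBott sort fits in a few lines. This is a generic sketch, not MottBott's actual code; the injectable `rng` parameter is mine, so the output can be made deterministic for testing:

```javascript
// Generic word-level Markov chain sketch (not MottBott's real code).
// buildChain maps each word to the list of words that followed it.
function buildChain(text) {
  const chain = {};
  const words = text.split(/\s+/);
  for (let i = 0; i < words.length - 1; i++) {
    (chain[words[i]] = chain[words[i]] || []).push(words[i + 1]);
  }
  return chain;
}

// Walk the chain from `start`, picking a random successor each step.
function generate(chain, start, maxWords, rng = Math.random) {
  const out = [start];
  let word = start;
  for (let i = 1; i < maxWords && chain[word]; i++) {
    const nexts = chain[word];
    word = nexts[Math.floor(rng() * nexts.length)];
    out.push(word);
  }
  return out.join(" ");
}
```

Feed it a corpus of existing forum posts and it produces exactly the kind of almost-grammatical nonsense quoted above, which is why a dozen instances at once would be painful to moderate.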
-
To be fair, MottBott is pretty easy to recognize as a Markov chain.
Spammers, OTOH, are rather hard to discern from Markov chains.
-
Though if someone were particularly evil and pointed a couple dozen instances at the same Discourse instance simultaneously, I wonder what kind of admin effort would be required to resolve that.
Considering that you cannot ban IPs? It might be a bit difficult.
-
can't ban IPs? you've never met my friend iptables, have you?
granted @PJH couldn't do it but some people do have access to the server firewall (no, i don't have that access. I don't want it either)
-
-
oh. right. that would work too....
-
Delete?
It is a delete in a GUI and not a CLI, but does it use a recycle bin? And what if you shift-click it?
-
That button relates to the backup on the left of that line. No idea what happens on the host when clicked - it gets permanently removed from the GUI however (after an 'Are You Sure?' dialog.)
Shift-click does bugger all.
-
-
If a target (and/or its market share) is lucrative enough, it will be. I'd also think there'd be some who build tools/scripts and then sell access to the tool to others (if not just share amongst a tight group of fellow spammers).
Exactly it. The more Discourse takes off (and, frankly, the figures we've heard touted thus far are not that significant), the more lucrative it becomes for someone to add the necessary GET/POST requests to things like Xrumer.
In fact, I'd even go as far as suggesting the primary reason Discourse hasn't had a shitton of spam bots is that, with the limited userbase, it's simply not worth the effort yet.
-
The software [xrumer] is also capable of avoiding detection by making posts in off-topic, spam and overflow sections of forums thus attempting to keep its activities in high activity low content areas of the targeted forum.
seriously though, the Likes Thread will be the #1 target of all time!
perhaps that's the reason for the poor search engine optimization of dicsourse: it discourages xrumer-style spammers, because they aren't going to get as big a boost from spamming forums that don't translate well into google pageranks?
-
Nah, the reason for poor SEO is simple: incompetence.
-
seriously though, the Likes Thread will be the #1 target of all time!
I don't see a problem with this.
For fame! For glory! For 30k/100k/1000k/beyond!
-
i want to see that topic get to 1 megapost!
-