Bad Gateway *SPANK SPANK*


  • I survived the hour long Uno hand

    @ben_lubar said in Bad Gateway *SPANK SPANK*:

    basically we're screwed

    as in "we can't find the source of the problem" screwed or "we will keep having crashes" screwed?


  • Impossible Mission - B

    @Yamikuronue said in Bad Gateway *SPANK SPANK*:

    as in "we can't find the source of the problem" screwed or "we will keep having crashes" screwed?

    Doesn't the former imply the latter?


  • I survived the hour long Uno hand

    @masonwheeler I should have said "we can't find the source of the problem this way". As in, is what's broken the debugging method, or the actual code? Should we be looking for new approaches or is this the source of the problem?



  • @Yamikuronue said in Bad Gateway *SPANK SPANK*:

    @ben_lubar said in Bad Gateway *SPANK SPANK*:

    basically we're screwed

    as in "we can't find the source of the problem" screwed or "we will keep having crashes" screwed?

    I went with "compile our own GDB", so we're not screwed after all! Hooray!



  • Thread 1 (Thread 0x7fa78192b740 (LWP 44)):
    #0  0x00007fa7805ffc1d in open64 () from /lib/x86_64-linux-gnu/libc.so.6
    #1  0x00007fa780598ca2 in _IO_file_open () from /lib/x86_64-linux-gnu/libc.so.6
    #2  0x00007fa780598e1a in _IO_file_fopen () from /lib/x86_64-linux-gnu/libc.so.6
    #3  0x00007fa78058e064 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
    #4  0x0000000000e10511 in v8::base::OS::FOpen(char const*, char const*) ()
    #5  0x0000000000ca8c49 in v8::internal::WriteBytes(char const*, unsigned char const*, int, bool) ()
    #6  0x0000000000a046d5 in v8::internal::GDBJITInterface::EventHandler(v8::JitCodeEvent const*) ()
    #7  0x0000000000b50304 in v8::internal::JitLogger::LogRecordedBuffer(v8::internal::Code*, v8::internal::SharedFunctionInfo*, char const*, int) ()
    #8  0x0000000000b55527 in v8::internal::Logger::CodeCreateEvent(v8::internal::Logger::LogEventsAndTags, v8::internal::Code*, v8::internal::SharedFunctionInfo*, v8::internal::CompilationInfo*, v8::internal::Name*, int, int) ()
    #9  0x0000000000982909 in v8::internal::RecordFunctionCompilation(v8::internal::Logger::LogEventsAndTags, v8::internal::CompilationInfo*, v8::internal::Handle<v8::internal::SharedFunctionInfo>) ()
    #10 0x0000000000988481 in v8::internal::Compiler::GetConcurrentlyOptimizedCode(v8::internal::OptimizedCompileJob*) ()
    #11 0x0000000000ba84b8 in v8::internal::OptimizingCompileDispatcher::InstallOptimizedFunctions() ()
    #12 0x00000000009d4fe7 in v8::internal::StackGuard::HandleInterrupts() ()
    #13 0x0000000000c1a0eb in v8::internal::Runtime_StackGuard(int, v8::internal::Object**, v8::internal::Isolate*) ()
    #14 0x00002fc2d86068fb in ?? ()
    #15 0x00002fc2da448d31 in LazyCompile:*Readable.on _stream_readable.js:679 (ev=..., fn=...) at _stream_readable.js:1
    #16 0x00002fc2d9e9f1a3 in LazyCompile:*parserOnHeadersComplete _http_common.js:43 (versionMajor=..., versionMinor=..., headers=..., method=..., url=..., statusCode=..., statusMessage=..., upgrade=..., shouldKeepAlive=...) at _http_common.js:1
    



  • Discourse touched me in a no-no place

    @ben_lubar Check that you have permissions to write to the temp directory.

    (I get a couple of calls a year where that's the answer I give people--they have multiple people writing reports to the same directory, and people have write but not delete permissions, so when one person runs a report and then someone else runs the same report, the second person can't delete the first person's file.)

    More seriously, did you get a memory dump so you can, for example, examine the contents of the parameters in, say, #4?



  • @FrostCat said in Bad Gateway *SPANK SPANK*:

    Check that you have permissions to write to the temp directory.

    That's just where the process was when it got killed by GDB after an HTTP request took more than 5 seconds. It definitely can write to that temp directory because #15 and #16 have symbols. The name of the file it was opening is /var/tmp/elfdump[numbers]-[numbers].o.

    So basically, this gives us 0 information about the problem.
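    For readers following along: a minimal sketch of what such a watchdog could look like (purely illustrative; the actual monitoring script isn't shown in this thread, and the probe URL, port, and the decision to kill the process are assumptions based on the posts here):

    // Hypothetical watchdog: if an HTTP probe takes longer than 5 seconds,
    // attach gdb in batch mode to the node process and dump every thread's
    // backtrace. Names, URL, and port are illustrative, not the real script.
    'use strict';
    var http = require('http');
    var execFile = require('child_process').execFile;

    var NODE_PID = Number(process.argv[2]);   // pid of the NodeBB process
    var TIMEOUT_MS = 5000;                    // the 5-second threshold mentioned above

    function dumpBacktraces() {
      execFile('gdb', [
        '-p', String(NODE_PID),        // attach to the running process
        '-batch',                      // run the -ex commands, then exit
        '-ex', 'thread apply all bt'   // backtrace of every thread
      ], { maxBuffer: 64 * 1024 * 1024 }, function (err, stdout, stderr) {
        // The real monitor evidently goes further and kills the process
        // ("Kill the program being debugged? ... [answered Y]" later in the
        // thread); with a plain attach like this, gdb simply detaches on exit.
        console.log(stdout);
        if (err) { console.error(stderr); }
      });
    }

    var req = http.get('http://localhost:4567/recent', function (res) {
      res.resume(); // drain the body; only the timing matters here
    });
    req.setTimeout(TIMEOUT_MS, function () {
      req.abort();        // give up on the slow request...
      dumpBacktraces();   // ...and grab stack traces from the stuck process
    });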


  • ♿ (Parody)

    @ben_lubar said in Bad Gateway *SPANK SPANK*:

    That's just where the process was when it got killed by GDB after an HTTP request took more than 5 seconds.

    Maybe we should bump up the timeout? It will mean more noticeable cooties, but maybe fewer false positives if some of those recover.

    @accalia what's the threshold for servercooties sending notifications?


  • Discourse touched me in a no-no place

    @ben_lubar said in Bad Gateway *SPANK SPANK*:

    when it got killed by GDB after an HTTP request took more than 5 seconds.

    Derp. Isn't it possible to keep that from happening? You might be able to deduce more about the problem if the process/thread/whatever isn't killed.


  • FoxDev

    @boomzilla said in Bad Gateway *SPANK SPANK*:

    @accalia what's the threshold for servercooties sending notifications?

    complicated.

    cooties occur when average score drops below 50

    score for an individual check is calculated as:

    • HTTP response other than 200 or 204: 0
    • Response > 12 seconds: 0
    • Response > 3 seconds: 50
    • otherwise 100

    average score is simply the latest three checks averaged together (see the sketch below).

    .... i have got to come up with a better system for servercooties.io 2.0.....
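    A minimal sketch of that scoring as described above (an illustration of the rules only, not the actual servercooties.io code):

    // Score one check per the rules above, then average the latest three checks.
    function scoreCheck(statusCode, responseSeconds) {
      if (statusCode !== 200 && statusCode !== 204) { return 0; }
      if (responseSeconds > 12) { return 0; }
      if (responseSeconds > 3) { return 50; }
      return 100;
    }

    function hasCooties(latestThreeScores) {
      var sum = latestThreeScores.reduce(function (a, b) { return a + b; }, 0);
      return sum / latestThreeScores.length < 50; // "cooties occur when average score drops below 50"
    }

    // Example: one 13-second check and two 5-second checks average to about 33, so cooties.
    hasCooties([scoreCheck(200, 13), scoreCheck(200, 5), scoreCheck(200, 5)]); // true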


  • Discourse touched me in a no-no place

    @Yamikuronue said in Bad Gateway *SPANK SPANK*:

    As in, is what's broken the debugging method, or the actual code?

    I'm guessing “both”. The debugging method has problems, but so does the underlying code. At this stage, the key task is to try to figure out what's really triggering the problem; there's an open question of whether there's a bug in the JIT engine (not something that can be ruled out) but there's no guarantee that that is what's really wrong. Bleah.

    I guess the first real thing to do would be to turn off the JIT. The code shouldn't be CPU-bound under normal conditions — something is horribly wrong if that's the case — so running without JIT shouldn't overload anything. If that fixes the problem, it's (probably) a JIT bug. And if the problem still occurs without the JIT, it'll probably be easier to debug without the extra gumph of the JIT.
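    For reference, "turning off the JIT" on V8 of that era effectively means disabling the optimizing tier; as far as I know there was no interpreter-only mode, and the optimizing tier is the one that shows up in the captured backtrace (OptimizingCompileDispatcher / GetConcurrentlyOptimizedCode). A hedged sketch, assuming this Node build accepts the stock V8 --noopt flag:

    // Hedged sketch: disable V8's optimizing compiler tier, the one visible in
    // the backtrace above. Flag names and behaviour vary by V8 version, so
    // treat this as an experiment, not a known fix.
    //
    // Preferably pass the flag at startup:  node --noopt app.js
    // Setting it from inside the process is a fallback and should happen as
    // early as possible, before any hot code gets optimized.
    var v8 = require('v8');
    v8.setFlagsFromString('--noopt');

    If the cooties stop with the optimizer off, the concurrent-recompilation path becomes a prime suspect; if they don't, the backtraces should at least get simpler to read.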



  • @ben_lubar Wait, so what exactly is going on here? When you attached GDB, was it just writing debugging information for GDB by random coincidence?

    Do you have any more of these stack traces? Are they identical?

    Finally ... is there something like a coverage-based profiler for node? I.e., if you can start coverage-based profiling when it stops responding, you should see what parts of the code it spends most time in.


  • Fake News

    @ben_lubar said in Bad Gateway *SPANK SPANK*:

    It's a 464 byte generated variable name for a regular expression.


  • FoxDev

    @dkf said in Bad Gateway *SPANK SPANK*:

    @Yamikuronue said in Bad Gateway *SPANK SPANK*:

    As in, is what's broken the debugging method, or the actual code?

    I'm guessing “both”. The debugging method has problems, but so does the underlying code. At this stage, the key task is to try to figure out what's really triggering the problem; there's an open question of whether there's a bug in the JIT engine (not something that can be ruled out) but there's no guarantee that that is what's really wrong. Bleah.

    I guess the first real thing to do would be to turn off the JIT. The code shouldn't be CPU-bound under normal conditions — something is horribly wrong if that's the case — so running without JIT shouldn't overload anything. If that fixes the problem, it's (probably) a JIT bug. And if the problem still occurs without the JIT, it'll probably be easier to debug without the extra gumph of the JIT.

    that's been suggested a few times..... why haven't we done that again?


  • ♿ (Parody)

    @cvi said in Bad Gateway *SPANK SPANK*:

    if you can start coverage-based profiling when it stops responding, you should see what parts of the code it spends most time in.

    We've done this. It didn't help.



  • @cvi said in Bad Gateway *SPANK SPANK*:

    When you attached GDB, was it just writing debugging information for GDB by random coincidence?

    When it JIT compiles code with --gdbjit --gdbjit_dump, it writes the symbols to a file. I just tested locally and GDB can load symbols on an attached process even without gdbjit_dump, so that'll help a little. GDB attaching to the process doesn't make the V8 GDBJITInterface code run at all.

    @cvi said in Bad Gateway *SPANK SPANK*:

    Finally ... is there something like a coverage-based profiler for node? I.e., if you can start coverage-based profiling when it stops responding, you should see what parts of the code it spends most time in.

    We were using something like that, but it wasn't giving us any coherent results. We got stack traces, but they were completely unrelated to each other every time. You can see some SVGs in the server cooties tracking thread (they're inside tgz archives).



  • @boomzilla said in Bad Gateway *SPANK SPANK*:

    @cvi said in Bad Gateway *SPANK SPANK*:

    if you can start coverage-based profiling when it stops responding, you should see what parts of the code it spends most time in.

    We've done this. It didn't help.

    ... on a per-function basis with call stacks? Only after it stops responding?

    Well, if that doesn't help ... :-(



  • I know, why don't we use the TimePod to go back in time and pick forum software written in reliable technology instead of broken garbage?



  • @ben_lubar said in Bad Gateway *SPANK SPANK*:

    We were using something like that, but it wasn't giving us any coherent results. We got stack traces, but they were completely unrelated to each other every time. You can see some SVGs in the server cooties tracking thread (they're inside tgz archives).

    Hmm, I'll have a quick look. Not sure I'd be any help with the JS parts, but whatever.


  • Impossible Mission - B

    @blakeyrat said in Bad Gateway *SPANK SPANK*:

    I know, why don't we use the TimePod to go back in time and pick forum software written in reliable technology instead of broken garbage?

    Sure, I'll go fire up my TimePod right now. Which forum would you recommend?


  • I survived the hour long Uno hand

    @cvi said in Bad Gateway *SPANK SPANK*:

    Only after it stops responding?

    The flamegraphs we're looking at end up in different functions every time it stops responding



  • @ben_lubar said in Bad Gateway *SPANK SPANK*:

    #9 0x0000000000982909 in v8::internal::...

    Wow, I could have had a V8®!


  • Discourse touched me in a no-no place

    @blakeyrat said in Bad Gateway *SPANK SPANK*:

    why don't we use the TimePod

    Are you volunteering the use of your TimePod?



  • Hey @accalia. I noticed you added /nodebb.min.js to servercooties.io. Would you consider changing /recent to something else? It's the same as / for logged-out users, so it's not really giving any useful information.



  • @accalia also, it would be nice if servercooties.io allowed compression in its requests so NodeBB didn't have to send half a megabyte over the internet every 5 seconds to someone who just discards it. 150 KB compressed versus 525 KB uncompressed.


  • FoxDev

    @ben_lubar said in Bad Gateway *SPANK SPANK*:

    Would you consider changing /recent to something else?

    got a suggestion?

    @ben_lubar said in Bad Gateway *SPANK SPANK*:

    , it would be nice if servercooties.io allowed compression in its requests so NodeBB

    added gzip: true to the config for request. that should help.

    also every five seconds? should be every 20ish seconds it gets around to checking that js file. it checks one of the endpoints every five seconds and there are four it's checking right now..... so 20 seconds, not 5 seconds ;-)
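    For anyone curious, a rough sketch of that change (gzip: true is a real option of the request npm module; the 5-second interval and the rotation across four endpoints are from this post, but only three of the endpoints are named in the thread and the rest of the plumbing is illustrative):

    // Rotate through the monitored endpoints, one request every 5 seconds,
    // with gzip negotiation enabled so NodeBB can send a compressed body.
    // Illustrative only; not the actual servercooties.io source.
    var request = require('request');

    // Four endpoints are checked in rotation; only these three are named in
    // the thread, so the fourth is left out of this sketch.
    var endpoints = ['/', '/recent', '/nodebb.min.js'];
    var next = 0;

    setInterval(function () {
      var path = endpoints[next];
      next = (next + 1) % endpoints.length; // each endpoint comes around every (5 seconds x endpoint count)
      var started = Date.now();
      request({
        url: 'https://what.thedailywtf.com' + path,
        gzip: true,     // "added gzip: true to the config for request"
        timeout: 12000  // give up past the 12-second "score 0" line
      }, function (err, res) {
        var seconds = (Date.now() - started) / 1000;
        console.log(path, err ? err.code : res.statusCode, seconds.toFixed(1) + 's');
        // feed statusCode and seconds into the scoring from the earlier sketch
      });
    }, 5000);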


  • area_can

    @blakeyrat said in Bad Gateway *SPANK SPANK*:

    I know, why don't we use the TimePod to go back in time and pick forum software written in reliable technology instead of broken garbage?

    I just tried this. all forums suck. unfortunately, this is the best post-transition universe.


  • FoxDev

    @ben_lubar said in Bad Gateway *SPANK SPANK*:

    NodeBB didn't have to send half a megabyte over the internet every 5 seconds

    [uploaded screenshot]

    huh.... that's not the speed increase i expected from that change.......


  • ♿ (Parody)

    I posit that this is evidence in favor of my theory that it's something at the VM / OS level fucking us over:

      PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                    
    15410 root      20   0 1044832 996.5m   2620 R  99.8 25.2   1:44.98 gdb                                                                        
    15525 root      20   0    7504    684    588 R  92.4  0.0   0:11.00 find  
    

    Captured during the recent cooties storm. And the most recent Thread 1 dump (though there was one just before that was all ?? frames):

    Thread 1 (Thread 0x7f1ed6dd0740 (LWP 25642)):
    #0  0x00007f1ed5aad809 in syscall () from /lib/x86_64-linux-gnu/libc.so.6
    #1  0x0000000000e0aa0a in uv__epoll_wait (epfd=<optimized out>, events=events@entry=0x7ffc59030490, nevents=nevents@entry=1024, timeout=timeout@entry=15) at ../deps/uv/src/unix/linux-syscalls.c:321
    #2  0x0000000000e08b24 in uv__io_poll (loop=loop@entry=0x14b5680 <default_loop_struct>, timeout=15) at ../deps/uv/src/unix/linux-core.c:243
    #3  0x0000000000df9614 in uv_run (loop=0x14b5680 <default_loop_struct>, mode=UV_RUN_ONCE) at ../deps/uv/src/unix/core.c:351
    #4  0x0000000000d7e038 in node::Start(int, char**) ()
    #5  0x00007f1ed59eab45 in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
    #6  0x0000000000681108 in _start ()
    Kill the program being debugged? (y or n) [answered Y; input not from terminal]
    

  • Notification Spam Recipient

    @boomzilla said in Bad Gateway *SPANK SPANK*:

    I posit that this is evidence in favor of my theory that it's something at the VM / OS level fucking us over

    Probably unrelated, but this reminds me of a situation with VirtualBox in which a VM would lock up trying to access the attached virtual disks, due to a glitch that's fixed by telling VirtualBox to use the host's I/O cache instead of the one provided by VirtualBox.

    This coupled with the logs talking about accessing files it should definitely have access to....
    Maybe nothing.


  • ♿ (Parody)

    @Tsaukpaetra Hmmm...a related issue that I haven't brought up...we also use docker, so it could be a docker thing.



  • @boomzilla said in Bad Gateway *SPANK SPANK*:

    I posit that this is evidence in favor of my theory that it's something at the VM / OS level fucking us over:

    Not really. The GDB thing is because it loads a shitton of symbols when we're grabbing the stack trace. The find invocation will go away with today's update (in about 5 minutes).

    If anyone knows how to make GDB only load the JIT symbols for the current stack, we can make the GDB part really fast.


  • ♿ (Parody)

    @ben_lubar It appears that we're back to this:

    Thread 1 (Thread 0x7f96181ce740 (LWP 43)):
    #0  0x00000aba4f9d270f in ?? ()
    #1  0x000000000000043a in ?? ()
    #2  0x00000000000007e8 in ?? ()
    ...
    


  • @boomzilla I have it on --gdbjit and not --gdbjit_full because the latter never finished starting up GDB when I tested it. So we only get symbols for JavaScript source code and not for any of the other possible generated code.

