More Go Chum In The WTF Ocean



  •  @morbiuswilters said:

    @drurowin said:
    A statically-linked build of p7zip is a whopping 275% faster at compression on my machine...

    I call bullshit. There's no way static linking is going to give performance gains like that. If you're not: 1) lying; or 2) unable to do a simple benchmark accurately, then I'd say it's a result of something else you did.

    In fact, this illustrates just how clueless you are. Anybody who understands compilers, linking and PIC would never claim a 275% increase in performance from static linking. Even a knowledgeable person in favor of static linking (if such a thing exists) would instantly realize that that large of a performance gain is ridiculous and something must be wrong. This just shows that, once again, you have no fucking clue what you are talking about.

    The methodology was "time 7z a test.file /path/to/10/gb/of/data; time 7zstatic a test2.file /path/to/10/gb/of/data".  I swear I am not making this up.  I'll rerun the test later and even post a video.



  • @drurowin said:

    I'll rerun the test later and even post a video.

    Why don't you finish drawing and coloring your avatar first.



  • @Ronald said:

    Why don't you finish drawing and coloring your avatar first.

    He probably has too much trouble gripping the pens with this noodle-y right arm there.



  • @Ronald said:

    @drurowin said:

    I'll rerun the test later and even post a video.

    Why don't you finish drawing and coloring your avatar first.


    I haven't got my tablet working with Solaris yet.




  • @Ronald said:

    @drurowin said:

    I'll rerun the test later and even post a video.

    Why don't you finish drawing and coloring your avatar first.

    Dear Strong Bad,

    Why don't you creat a montage?



  • @Ben L. said:

    @Ronald said:
    @drurowin said:

    I'll rerun the test later and even post a video.

    Why don't you finish drawing and coloring your avatar first.

    Dear Strong Bad,

    Why don't you creat a montage?


    I'd like to see ya twy.




  • @drurowin said:

    The methodology was "time 7z a test.file /path/to/10/gb/of/data; time 7zstatic a test2.file /path/to/10/gb/of/data".  I swear I am not making this up.

    You fucking idiot, the first run would prime the disk cache and result in the second run being significantly faster. This is what I meant when I said "unable to do a simple benchmark accurately."

    If you wanted to test the performance of static linking, you should have done: 1) several run-throughs so you can throw out any outliers; and 2) a pre-run before each real run to prime the disk cache.

    Once again, a person with any competence or experience or sense at all would have seen a 275% increase in performance and said "Yep, I fucked up the test somehow", even if they were rabidly pro-static-linking. The problem is you had no clue what you were doing, which isn't shocking considering the opinions you hold.



  • @morbiuswilters said:

    @drurowin said:
    The methodology was "time 7z a test.file /path/to/10/gb/of/data; time 7zstatic a test2.file /path/to/10/gb/of/data".  I swear I am not making this up.

    You fucking idiot, the first run would prime the disk cache and result in the second run being significantly faster. This is what I meant when I said "unable to do a simple benchmark accurately."

    If you wanted to test the performance of static linking, you should have done: 1) several run-throughs so you can throw out any outliers; and 2) a pre-run before each real run to prime the disk cache.

    Once again, a person with any competence or experience or sense at all would have seen a 275% increase in performance and said "Yep, I fucked up the test somehow", even if they were rabidly pro-static-linking. The problem is you had no clue what you were doing, which isn't shocking considering the opinions you hold.


    Well at least knowing what you're doing isn't a requirement for posting here.  I'll try it later with disk caching disabled.

    Edit: Hmm, it looks like I can't disable ZFS's caching.  Any suggestions for how to do this without being accused of "fucking up the test", to show that static linking has definite solid performance advantages?



  • @drurowin said:

    Edit: Hmm, it looks like I can't disable ZFS's caching.  Any suggestions for how to do this without being accused of "fucking up the test", to show that static linking has definite solid performance advantages?


    @morbiuswilters said:

    you should have done: 1) several run-throughs so you can throw out any outliers; and 2) a pre-run before each real run to prime the disk cache.



  • @Salamander said:

    @drurowin said:

    Edit: Hmm, it looks like I can't disable ZFS's caching.  Any suggestions for how to do this without being accused of "fucking up the test", to show that static linking has definite solid performance advantages?

    @morbiuswilters said:

    you should have done: 1) several run-throughs so you can throw out any outliers; and 2) a pre-run before each real run to prime the disk cache.


    He'll still say I didn't do it right, and it irks me that I can't just disable caching.  But I'll give that a try tonight when I'm done with real work.



  • @drurowin said:

    Edit: Hmm, it looks like I can't disable ZFS's caching.  Any suggestions for how to do this without being accused of "fucking up the test", to show that static linking has definite solid performance advantages?

    Why would you disable disk caching?

    Look: write a script that does this:

    1) Run command
    2) Run command (store time in new variable)
    3) Repeat previous 10 times
    4) Output median of 10 stored times

    Then run that for both commands and compare the results.

    It's not fucking hard.
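
A minimal sketch of that harness in Go, since this is a Go thread; the command line is a placeholder for whatever binary is under test, and it adds the warm-up pre-run morbiuswilters suggested before the timed runs:

    package main

    import (
    	"fmt"
    	"os/exec"
    	"sort"
    	"time"
    )

    func main() {
    	// Placeholder command line; substitute the binary under test.
    	args := []string{"./7zstatic", "a", "test.file", "/path/to/data"}

    	run := func() time.Duration {
    		start := time.Now()
    		if err := exec.Command(args[0], args[1:]...).Run(); err != nil {
    			panic(err)
    		}
    		return time.Since(start)
    	}

    	run() // warm-up pass to prime the disk cache; its time is discarded

    	times := make([]time.Duration, 10)
    	for i := range times {
    		times[i] = run()
    		fmt.Printf("run %d: %v\n", i+1, times[i])
    	}

    	// Median of 10 values: average the 5th and 6th after sorting.
    	sort.Slice(times, func(i, j int) bool { return times[i] < times[j] })
    	fmt.Printf("median: %v\n", (times[4]+times[5])/2)
    }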



  • @blakeyrat said:

    @drurowin said:
    Edit: Hmm, it looks like I can't disable ZFS's caching.  Any suggestions for how to do this without being accused of "fucking up the test", to show that static linking has definite solid performance advantages?

    Why would you disable disk caching?

    Look: write a script that does this:

    1) Run command
    2) Run command (store time in new variable)
    3) Repeat previous 10 times
    4) Output median of 10 stored times

    Then run that for both commands and compare the results.

    It's not fucking hard.


    Well, then I'm going to have to use a set smaller than 10 GB of data; otherwise your parents will have made all of you go to bed before I get done.




  • @Salamander said:

    @drurowin said:

    Edit: Hmm, it looks like I can't disable ZFS's caching.  Any suggestions for how to do this without being accused of "fucking up the test", to show that static linking has definite solid performance advantages?


    @morbiuswilters said:

    you should have done: 1) several run-throughs so you can throw out any outliers; and 2) a pre-run before each real run to prime the disk cache.

    Right, you don't want to disable the disk cache for a test like this. You're not testing the throughput of your I/O controller or the performance of ZFS' I/O scheduler. You want to test this all in-memory.

    In fact, whatever file you're zipping (and the zip it outputs) should fit entirely in memory. Maybe make a tmpfs to hold the uncompressed input and the compressed output. You only need a few GB. What you want to avoid is testing the performance of the I/O system.
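
To reproduce that setup, a rough sketch that stages test data into a tmpfs; the mount point /mnt/ramtest is hypothetical (on Linux, something like `mount -t tmpfs -o size=8g tmpfs /mnt/ramtest` would create it), and the data repeats a single random block so the compressor has real work to do:

    package main

    import (
    	"bufio"
    	"math/rand"
    	"os"
    )

    func main() {
    	// Hypothetical tmpfs path; both input and output should live here.
    	f, err := os.Create("/mnt/ramtest/input.dat")
    	if err != nil {
    		panic(err)
    	}
    	defer f.Close()

    	w := bufio.NewWriter(f)
    	defer w.Flush()

    	// One random 1 MiB block, written 4096 times (4 GiB total).
    	// Repeated blocks are compressible; pure random data is not.
    	block := make([]byte, 1<<20)
    	rand.Read(block)
    	for i := 0; i < 4096; i++ {
    		if _, err := w.Write(block); err != nil {
    			panic(err)
    		}
    	}
    }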



  • @morbiuswilters said:

    @Salamander said:
    @drurowin said:

    Edit: Hmm, it looks like I can't disable ZFS's caching.  Any suggestions for how to do this without being accused of "fucking up the test", to show that static linking has definite solid performance advantages?

    @morbiuswilters said:

    you should have done: 1) several run-throughs so you can throw out any outliers; and 2) a pre-run before each real run to prime the disk cache.

    Right, you don't want to disable the disk cache for a test like this. You're not testing the throughput of your I/O controller or the performance of ZFS' I/O scheduler. You want to test this all in-memory.

    In fact, whatever file you're zipping (and the zip it outputs) should fit entirely in memory. Maybe make a tmpfs to hold the uncompressed input and the compressed output. You only need a few GB. What you want to avoid is testing the performance of the I/O system.


    I have 32 GB RAM, my existing 10 GB test data should fit.  I'll try a tmpfs.




  • @drurowin said:

    @blakeyrat said:

    @drurowin said:
    Edit: Hmm, it looks like I can't disable ZFS's caching.  Any suggestions for how to do this without being accused of "fucking up the test", to show that static linking has definite solid performance advantages?

    Why would you disable disk caching?

    Look: write a script that does this:

    1) Run command
    2) Run command (store time in new variable)
    3) Repeat previous 10 times
    4) Output median of 10 stored times

    Then run that for both commands and compare the results.

    It's not fucking hard.


    Well, then I'm going to have to use a set smaller than 10 GB of data; otherwise your parents will have made all of you go to bed before I get done.


    Yes, you do. You want the set (input and output) to fit in memory. Also, instead of storing the median, output all 10 times. The reason is you might have some really crazy outliers that need to be thrown out. You could probably get away with something like throwing out the top and bottom results, then taking the median of the remaining 8.

    If the performance increase you see is anything more than 5% you probably fucked up somewhere. Seriously, PIC isn't quite as fast as statically-linked code, but the penalty is not nearly what you seem to think. Do you really think every OS on the planet has been using a linking strategy which made programs 4 times slower and you're the first person to figure out the truth?
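
In Go, the throw-out-the-extremes step is only a few lines; a sketch with made-up timings, using the mean of the middle 8 (per the correction a few posts down):

    package main

    import (
    	"fmt"
    	"sort"
    )

    func main() {
    	// Hypothetical wall-clock times in seconds for 10 runs.
    	times := []float64{41.2, 39.8, 40.5, 40.1, 97.3, 40.9, 39.5, 40.2, 40.7, 40.0}

    	sort.Float64s(times)
    	trimmed := times[1 : len(times)-1] // drop the fastest and slowest run

    	sum := 0.0
    	for _, t := range trimmed {
    		sum += t
    	}
    	fmt.Printf("trimmed mean of %d runs: %.2fs\n", len(trimmed), sum/float64(len(trimmed)))
    }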



  • @morbiuswilters said:

    If the performance increase you see is anything more than 5% you probably fucked up somewhere. Seriously, PIC isn't quite as fast as statically-linked code, but the penalty is not nearly what you seem to think. Do you really think every OS on the planet has been using a linking strategy which made programs 4 times slower and you're the first person to figure out the truth?

    Performance vs your (valid) argument about maintainability.




  • @morbiuswilters said:

    You could probably get away with something like throwing out the top and bottom results, then taking the median of the remaining 8.

    ... yes. That would be a completely different result than taking the median of all 10 values.

    (Morbs, it's not often I can call you an idiot. But today! Today! You are the idiot. Sorry.)



  • @blakeyrat said:

    @morbiuswilters said:
    You could probably get away with something like throwing out the top and bottom results, then taking the median of the remaining 8.

    ... yes. That would be a completely different result than taking the median of all 10 values.

    (Morbs, it's not often I can call you an idiot. But today! Today! You are the idiot. Sorry.)

    What if the top result is really fat and you can only take away a third of it?



  • @drurowin said:

    @morbiuswilters said:

    If the performance increase you see is anything more than 5% you probably fucked up somewhere. Seriously, PIC isn't quite as fast as statically-linked code, but the penalty is not nearly what you seem to think. Do you really think every OS on the planet has been using a linking strategy which made programs 4 times slower and you're the first person to figure out the truth?

    Performance vs your (valid) argument about maintainability.

    Of course there's some performance gain to avoiding PIC, but if it was 4-fold, then no OS would use dynamic linking.



  • @blakeyrat said:

    @morbiuswilters said:
    You could probably get away with something like throwing out the top and bottom results, then taking the median of the remaining 8.

    ... yes. That would be a completely different result than taking the median of all 10 values.

    (Morbs, it's not often I can call you an idiot. But today! Today! You are the idiot. Sorry.)

    Dammit, I meant "mean", not "median". But, yeah, taking the median of all 10 should be fine, too. This isn't the most precise test in the world, anyway.



  • ben@loads foo$ ls -l
    total 1932
    -rw-rw-r--. 1 ben ben    1018 May 27 22:12 foo.go
    -rwxrwxr-x. 1 ben ben   36583 May 27 22:12 foo-shared
    -rwxrwxr-x. 1 ben ben 1934936 May 27 22:12 foo-static
    ben@loads foo$ file *
    foo.go:     C source, ASCII text
    foo-shared: ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), dynamically linked (uses shared libs), for GNU/Linux 2.6.32, BuildID[sha1]=0x0a6d08210da35d09861b705dda0680ae9c00d242, not stripped
    foo-static: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, not stripped
    ben@loads foo$ ldd foo-*
    foo-shared:
    	linux-vdso.so.1 =>  (0x00007fff04ffe000)
    	libgo.so.0 => /lib64/libgo.so.0 (0x00007f73057dd000)
    	libm.so.6 => /lib64/libm.so.6 (0x0000003625e00000)
    	libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003627200000)
    	libc.so.6 => /lib64/libc.so.6 (0x0000003625a00000)
    	/lib64/ld-linux-x86-64.so.2 (0x0000003625600000)
    	libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003626200000)
    foo-static:
    	not a dynamic executable
    ben@loads foo$ cat foo.go 
    package main
    
    import (
    	"bytes"
    	"fmt"
    	"crypto/sha1"
    	"compress/gzip"
    )
    
    func main() {
    	var buf bytes.Buffer
    	defer func() {
    		sha := sha1.New()
    		sha.Write(buf.Bytes())
    		fmt.Printf("% x\n", sha.Sum(nil))
    	}()
    
    	w1, _ := gzip.NewWriterLevel(&buf, gzip.BestCompression)
    	defer w1.Close()
    	w2, _ := gzip.NewWriterLevel(w1, gzip.BestCompression)
    	defer w2.Close()
    	w3, _ := gzip.NewWriterLevel(w2, gzip.BestCompression)
    	defer w3.Close()
    	w4, _ := gzip.NewWriterLevel(w3, gzip.BestCompression)
    	defer w4.Close()
    	w5, _ := gzip.NewWriterLevel(w4, gzip.BestCompression)
    	defer w5.Close()
    	w6, _ := gzip.NewWriterLevel(w5, gzip.BestCompression)
    	defer w6.Close()
    	w7, _ := gzip.NewWriterLevel(w6, gzip.BestCompression)
    	defer w7.Close()
    	w8, _ := gzip.NewWriterLevel(w7, gzip.BestCompression)
    	defer w8.Close()
    	w9, _ := gzip.NewWriterLevel(w8, gzip.BestCompression)
    	defer w9.Close()
    
     	var b []byte // sha.Sum appends a 20-byte digest each iteration, so this buffer keeps growing
    	sha := sha1.New()
    	for i := 0; i < 1000; i++ {
    		fmt.Printf("%d ", i)
    		b = sha.Sum(b)
    		sha.Write(b)
    		w9.Write(b)
    	}
    }
    

    Won't be a jiffy.

    I have a Pentium 4, so it will be quite a few jiffies.

    Edit: I'm changing that loop to 10k iterations instead of 1k.



  • @morbiuswilters said:

    If the performance increase you see is anything more than 5% you probably fucked up somewhere.

    Reran it 15 times; the average is 7.7% faster.  The outliers were 284% faster and 0.3% faster, which I removed because the 284% was WAY the fuck out there.  The next highest one was in the 20s of percent.  Still, though, it's a measurable gain.



  • + for i in '{0..4}'
    + /usr/bin/time -v ./foo-shared
    	Command being timed: "./foo-shared"
    	User time (seconds): 3387.62
    	System time (seconds): 24.01
    	Percent of CPU this job got: 84%
    	Elapsed (wall clock) time (h:mm:ss or m:ss): 1:07:28
    	Average shared text size (kbytes): 0
    	Average unshared data size (kbytes): 0
    	Average stack size (kbytes): 0
    	Average total size (kbytes): 0
    	Maximum resident set size (kbytes): 2285928
    	Average resident set size (kbytes): 0
    	Major (requiring I/O) page faults: 46559
    	Minor (reclaiming a frame) page faults: 1218681
    	Voluntary context switches: 49327
    	Involuntary context switches: 1101848
    	Swaps: 0
    	File system inputs: 2664640
    	File system outputs: 560
    	Socket messages sent: 0
    	Socket messages received: 0
    	Signals delivered: 0
    	Page size (bytes): 4096
    	Exit status: 0
    + /usr/bin/time -v ./foo-static
    	Command being timed: "./foo-static"
    	User time (seconds): 2107.56
    	System time (seconds): 17.13
    	Percent of CPU this job got: 80%
    	Elapsed (wall clock) time (h:mm:ss or m:ss): 44:05.69
    	Average shared text size (kbytes): 0
    	Average unshared data size (kbytes): 0
    	Average stack size (kbytes): 0
    	Average total size (kbytes): 0
    	Maximum resident set size (kbytes): 2351512
    	Average resident set size (kbytes): 0
    	Major (requiring I/O) page faults: 46519
    	Minor (reclaiming a frame) page faults: 978369
    	Voluntary context switches: 283808
    	Involuntary context switches: 774157
    	Swaps: 0
    	File system inputs: 2821440
    	File system outputs: 704
    	Socket messages sent: 0
    	Socket messages received: 0
    	Signals delivered: 0
    	Page size (bytes): 4096
    	Exit status: 0
    + for i in '{0..4}'
    + /usr/bin/time -v ./foo-shared
    	Command being timed: "./foo-shared"
    	User time (seconds): 3669.83
    	System time (seconds): 27.48
    	Percent of CPU this job got: 80%
    	Elapsed (wall clock) time (h:mm:ss or m:ss): 1:16:11
    	Average shared text size (kbytes): 0
    	Average unshared data size (kbytes): 0
    	Average stack size (kbytes): 0
    	Average total size (kbytes): 0
    	Maximum resident set size (kbytes): 2345112
    	Average resident set size (kbytes): 0
    	Major (requiring I/O) page faults: 43531
    	Minor (reclaiming a frame) page faults: 1034713
    	Voluntary context switches: 45576
    	Involuntary context switches: 1590604
    	Swaps: 0
    	File system inputs: 2477144
    	File system outputs: 872
    	Socket messages sent: 0
    	Socket messages received: 0
    	Signals delivered: 0
    	Page size (bytes): 4096
    	Exit status: 0
    + /usr/bin/time -v ./foo-static
    	Command being timed: "./foo-static"
    	User time (seconds): 2307.04
    	System time (seconds): 22.42
    	Percent of CPU this job got: 74%
    	Elapsed (wall clock) time (h:mm:ss or m:ss): 52:21.97
    	Average shared text size (kbytes): 0
    	Average unshared data size (kbytes): 0
    	Average stack size (kbytes): 0
    	Average total size (kbytes): 0
    	Maximum resident set size (kbytes): 2394384
    	Average resident set size (kbytes): 0
    	Major (requiring I/O) page faults: 73384
    	Minor (reclaiming a frame) page faults: 1268979
    	Voluntary context switches: 367508
    	Involuntary context switches: 1035768
    	Swaps: 0
    	File system inputs: 4471096
    	File system outputs: 632
    	Socket messages sent: 0
    	Socket messages received: 0
    	Signals delivered: 0
    	Page size (bytes): 4096
    	Exit status: 0
    + for i in '{0..4}'
    + /usr/bin/time -v ./foo-shared
    	Command being timed: "./foo-shared"
    	User time (seconds): 3890.41
    	System time (seconds): 31.58
    	Percent of CPU this job got: 78%
    	Elapsed (wall clock) time (h:mm:ss or m:ss): 1:23:32
    	Average shared text size (kbytes): 0
    	Average unshared data size (kbytes): 0
    	Average stack size (kbytes): 0
    	Average total size (kbytes): 0
    	Maximum resident set size (kbytes): 2293712
    	Average resident set size (kbytes): 0
    	Major (requiring I/O) page faults: 41250
    	Minor (reclaiming a frame) page faults: 1121608
    	Voluntary context switches: 43162
    	Involuntary context switches: 1921375
    	Swaps: 0
    	File system inputs: 2349232
    	File system outputs: 232
    	Socket messages sent: 0
    	Socket messages received: 0
    	Signals delivered: 0
    	Page size (bytes): 4096
    	Exit status: 0
    + /usr/bin/time -v ./foo-static
    	Command being timed: "./foo-static"
    	User time (seconds): 2488.11
    	System time (seconds): 25.71
    	Percent of CPU this job got: 70%
    	Elapsed (wall clock) time (h:mm:ss or m:ss): 59:13.04
    	Average shared text size (kbytes): 0
    	Average unshared data size (kbytes): 0
    	Average stack size (kbytes): 0
    	Average total size (kbytes): 0
    	Maximum resident set size (kbytes): 2415208
    	Average resident set size (kbytes): 0
    	Major (requiring I/O) page faults: 72966
    	Minor (reclaiming a frame) page faults: 1165381
    	Voluntary context switches: 406809
    	Involuntary context switches: 1508734
    	Swaps: 0
    	File system inputs: 4456288
    	File system outputs: 1160
    	Socket messages sent: 0
    	Socket messages received: 0
    	Signals delivered: 0
    	Page size (bytes): 4096
    	Exit status: 0
    + for i in '{0..4}'
    + /usr/bin/time -v ./foo-shared
    	Command being timed: "./foo-shared"
    	User time (seconds): 4139.62
    	System time (seconds): 34.59
    	Percent of CPU this job got: 70%
    	Elapsed (wall clock) time (h:mm:ss or m:ss): 1:38:58
    	Average shared text size (kbytes): 0
    	Average unshared data size (kbytes): 0
    	Average stack size (kbytes): 0
    	Average total size (kbytes): 0
    	Maximum resident set size (kbytes): 2299820
    	Average resident set size (kbytes): 0
    	Major (requiring I/O) page faults: 52247
    	Minor (reclaiming a frame) page faults: 1017613
    	Voluntary context switches: 55211
    	Involuntary context switches: 2499290
    	Swaps: 0
    	File system inputs: 2825400
    	File system outputs: 1280
    	Socket messages sent: 0
    	Socket messages received: 0
    	Signals delivered: 0
    	Page size (bytes): 4096
    	Exit status: 0
    + /usr/bin/time -v ./foo-static
    	Command being timed: "./foo-static"
    	User time (seconds): 2571.57
    	System time (seconds): 29.90
    	Percent of CPU this job got: 62%
    	Elapsed (wall clock) time (h:mm:ss or m:ss): 1:08:51
    	Average shared text size (kbytes): 0
    	Average unshared data size (kbytes): 0
    	Average stack size (kbytes): 0
    	Average total size (kbytes): 0
    	Maximum resident set size (kbytes): 2331084
    	Average resident set size (kbytes): 0
    	Major (requiring I/O) page faults: 90333
    	Minor (reclaiming a frame) page faults: 1165370
    	Voluntary context switches: 476299
    	Involuntary context switches: 1960679
    	Swaps: 0
    	File system inputs: 5391832
    	File system outputs: 992
    	Socket messages sent: 0
    	Socket messages received: 0
    	Signals delivered: 0
    	Page size (bytes): 4096
    	Exit status: 0
    


  • Ben L.: Master of pointless information.
    Seriously, damn near half of that is just the same lines repeated over and over again.



  • @Salamander said:

    Ben L.: Master of pointless information.

    Seriously, damn near half of that is just the same lines repeated over and over again.


    Fine, since you're obviously incapable of reading information and discarding the parts you deem unimportant, here's a slimmed-down version:

    SHARED
    	User time (seconds): 3387.62
    	System time (seconds): 24.01
    
    STATIC
    	User time (seconds): 2107.56
    	System time (seconds): 17.13
    
    SHARED
    	User time (seconds): 3669.83
    	System time (seconds): 27.48
    
    STATIC
    	User time (seconds): 2307.04
    	System time (seconds): 22.42
    
    SHARED
    	User time (seconds): 3890.41
    	System time (seconds): 31.58
    
    STATIC
    	User time (seconds): 2488.11
    	System time (seconds): 25.71
    
    SHARED
    	User time (seconds): 4139.62
    	System time (seconds): 34.59
    
    STATIC
    	User time (seconds): 2571.57
    	System time (seconds): 29.90
    
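
For what it's worth, the shared build costs roughly 1.6x the static build's user time in all four runs; a quick check using the numbers above:

    package main

    import "fmt"

    func main() {
    	// User times in seconds, copied from the four runs above.
    	shared := []float64{3387.62, 3669.83, 3890.41, 4139.62}
    	static := []float64{2107.56, 2307.04, 2488.11, 2571.57}

    	for i := range shared {
    		fmt.Printf("run %d: shared/static = %.2f\n", i+1, shared[i]/static[i])
    	}
    }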

  • A pretty dramatic difference. Also, if I'm following correctly, both binaries were compiled from Go? So I guess it's a myth then that Go produces only statically linked binaries.



  • And if you can't read:




  • @joe.edwards said:

    A pretty dramatic difference. Also, if I'm following correctly, both binaries were compiled from Go? So I guess it's a myth then that Go produces only statically linked binaries.


    There's a compiler called gccgo, which I used for the shared one.



  • @Ben L. said:

    Fine, since you're obviously incapable of reading information and discarding the parts you deem unimportant...

    Some of us have jobs that we're trying to avoid doing. We don't want to spend time sifting through your console output.

    As for your results.. is this some Go program? I thought Go didn't support dynamic linking..



  • @Ben L. said:

    @joe.edwards said:

    A pretty dramatic difference. Also, if I'm following correctly, both binaries were compiled from Go? So I guess it's a myth then that Go produces only statically linked binaries.


    There's a compiler called gccgo, which I used for the shared one.

    But then you're testing something significantly different than static vs. dynamic linking. (How do you people not understand how to do a simple experiment??) You're testing a completely different compiler for a language that supposedly doesn't support anything but static linking (so who knows what weird hoops the dynamic-linking compiler is jumping through?)

