TRWTF is the comments. The code does a strict superset of the work of the non-FAST_MODE path; it performs all the same loads, so it cannot possibly be "cache locality". Even x86 has enough registers to make that trick worthless, and this code isn't targeted at a 16-bit 8086. The only possible benefit is SIMD -- and if they wanted that, they could just write two lines of intrinsics.
Posts made by Dark_Shikari
-
RE: #if FAST_MODE
-
#if FAST_MODE
From the Tandberg/Nokia/Ericsson H.265 video compression proposal (A119 here). Bonus points: their PowerPoint says "Clean and fast software written from scratch using C".
/*-----------------------------------------------------------------------------------------
  Function:   SAD_64x64
  Purpose:    Calculate SAD for a 64x64 block
  Input:      *a      - Pointer to first block
              *b      - Pointer to second block
              stridea - stride of first block
              strideb - stride of second block
  Return:     sad - SAD value
  Parameters: None
-------------------------------------------------------------------------------------------*/
static inline unsigned int SAD_64x64(const unsigned char *a,
                                     const unsigned char *b,
                                     const int stridea,
                                     const int strideb)
{
    int i, j;
    unsigned int sad = 0;
#if FAST_MODE
    for (i = 0; i < 64; i++)
    {
        unsigned char b_[64];
        for (j = 0; j < 64; j++)
            b_[j] = b[i*strideb+j];
        for (j = 0; j < 64; j++)
            sad += abs(a[stridea*i+j] - b_[j]);
    }
#else
    for (i = 0; i < 64; i++)
        for (j = 0; j < 64; j++)
            sad += abs(a[stridea*i+j] - b[i*strideb+j]);
#endif
    return(sad);
}

This "FAST_MODE" pattern is repeated for about 10 different copy-pastes of the same function with different input sizes. Sometimes they use memcpy instead of a for loop. I still am not quite sure what the author of this was trying to accomplish.
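A minimal sketch of the point being made (block size cut to 8x8 and function names invented purely for illustration): the FAST_MODE branch does every load and every subtraction the plain branch does, plus an extra row copy, so the two necessarily return the identical sum.

```c
#include <stdlib.h>

enum { N = 8 };  /* 8x8 instead of 64x64, just to keep the sketch short */

/* the #else branch: a straight double loop */
static unsigned sad_plain(const unsigned char *a, const unsigned char *b,
                          int stridea, int strideb)
{
    unsigned sad = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sad += abs(a[stridea*i + j] - b[strideb*i + j]);
    return sad;
}

/* the FAST_MODE branch: the same loads, preceded by a pointless row copy */
static unsigned sad_fast(const unsigned char *a, const unsigned char *b,
                         int stridea, int strideb)
{
    unsigned sad = 0;
    for (int i = 0; i < N; i++) {
        unsigned char b_[N];
        for (int j = 0; j < N; j++)   /* copy the row of b... */
            b_[j] = b[strideb*i + j];
        for (int j = 0; j < N; j++)   /* ...then do the identical arithmetic */
            sad += abs(a[stridea*i + j] - b_[j]);
    }
    return sad;
}
```

Any input demonstrates the equivalence; the copy buys nothing on any machine with a cache.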
-
RE: Erightsoft: web site wtf horror exemplified
@ender said:
@Dark Shikari said:
Ogg wasn't "designed", it was thrown together in the same fashion that a 4-year-old cleans up his room by hiding all the toys in the closet.
One of the design goals of Ogg was to be able to simply cat files together.
-
RE: Erightsoft: web site wtf horror exemplified
@Tacroy said:
Erm, if you're dealing with mpeg video fragments, you can usually just concatenate them together. e.g,
cat vid1 vid2 vid3 > vid123
(assuming the fragments aren't too huge, otherwise use dd). In Windows,
type vid1 vid2 vid3 > vid123
should work, but I've never tried it and Windows tends to mess this sort of thing up.
In fact, it should work as long as all of the fragments use the same encoding scheme, and it isn't ridiculously old. Modern encoding schemes are based on a packet-stream model, and don't really care about file boundaries.
This won't work for anything with a global header, i.e. almost any modern video container format. It'll work for MPEG-2 TS, and maybe PS depending on the phase of the moon and how much you like desynced audio, and that's about it. It definitely won't work for AVI, MP4, or MKV, and probably won't work for FLV, OGG, or WMV/ASF.
-
RE: Erightsoft: web site wtf horror exemplified
SUPER is a pile of unusable buggy garbage. There are a gajillion better GUIs, and of course you can just go and use the libraries directly via the commandline.
A tiny selection of better (free) GUI applications:
Handbrake
Staxrip
Ripbot264
HDConvertToX
Avidemux
MeGUI
AutoMen
ASXGui
And CLI:
x264
ffmpeg
mencoder
handbrake-cli
Keep in mind that nearly every single freeware H.264 encoder in the entire world uses x264 with about half a dozen exceptions total, so they're all using the same encoding library anyways.
-
RE: Cadillac WTF
@fennec said:
I believe that Popular Science or Popular Mechanics had an issue about a nucleonic airplane concept (thorium strobed with X-Rays).
Recently, too (not just in the 60's, in the 90's or 2000's).
From what I recall, both the US and Soviets tried this one out: imagine an airplane that could fly around for months without refueling? It's the nuclear tactician's dream. And I mean a full nuclear reactor on a plane.
It actually worked, but there were two "small" problems:
1. Enough shielding to protect a full crew would make the plane far too heavy to take off. This could be solved by having a single crew member, or even making the plane autonomous.
2. It seems almost possible at this point... except for one problem. It turns out that most of the atoms in the air are not easily neutron-activatable; oxygen, carbon, etc. won't generally become radioactive from neutron bombardment. But xenon, a trace impurity, does. A lot. And so the plane would become a gigantic machine spewing highly radioactive xenon wherever it flew.
They decided that this was probably a bad idea and gave up.
-
Pls post me sorce conde
The log speaks for itself.
16:59 -!- mohamedferose [n=mohamedf@219.64.67.150] has joined #x264
16:59 < mohamedferose> Hi Zarxrax
16:59 < mohamedferose> How is going on
16:59 < mohamedferose> Z0rc
17:00 < mohamedferose> Topic About x264 is so boring
17:00 < mohamedferose> how u implemets
17:00 < mohamedferose> pls post me sorce conde
17:00 < mohamedferose> whaer is it c
17:01 -!- mohamedferose [n=mohamedf@219.64.67.150] has quit [Client Quit]
I find it amazing that not only do they insist that someone "post me sorce conde", but that if they don't get their "sorce conde" within 1 minute, they leave.
-
RE: Bad Website design
@Rootbeer said:
Speaking of Super(TM)(R)(C)(MRUEQ), can anyone recommend a GUI frontend for video encoding that DOESN'T make me bleed out of the eyes and anus with its utter and all-consuming interface awfulness?
Ripbot264? AutoMKV? Staxrip? Handbrake? Megui?
If you need Vdub-like editing functionality (trim, etc), Avidemux?
-
RE: JIRA
@Someone You Know said:
That was the greatest presentation ever.
Chicken chicken chicken. Chicken chicken? Chicken.
I mean, chicken chicken chicken chicken, chicken chicken, chicken chicken chicken!
-
RE: Video Coding Format with 13 GB per Hour
@Soviut said:
If only DV compression was lossless. It's 4:2:2 compression, not 4:4:4, so it holds the greens and blues well, but kills reds and other colours that fall into the brown ranges.
I lol'd hard at this line.
-
RE: Video Coding Format with 13 GB per Hour
@GettinSadda said:
I know people who work at or have worked at Ateme, DivX, Mainconcept, Harmonic, and Digital Fountain, and none had anything even close to that--such an NDA is absurd.
@Dark Shikari said:
An NDA that prevents you from saying where you work? FFS, this is video encoding, not an intelligence agency.
Funnily enough I have both an NDA stating that I must not mention any of the projects I work on without prior agreement, and a clause in my contract saying that I must not discuss my work in public forums.
It gets quite interesting when you play with certain litigious Big-Boys!
Edit: They are fairly OK with me discussing the general field as long as I do not bring them into it by saying who I work for (and best to keep my real name out of it)
Of course, this is one of the joys of working on open source encoders professionally--my employer is quite limited in how strictly they can NDA me ;)
-
RE: Video Coding Format with 13 GB per Hour
@GettinSadda said:
An NDA that prevents you from saying where you work? FFS, this is video encoding, not an intelligence agency.
@Dark Shikari said:
So you work for the JVT? In that case, I have a very long list of stupid decisions made in the H.264 standard that I would like rectified ;)
(and if you don't work for the JVT, then you're irrelevant, video-standards-wise. Nobody uses VP7 or RV30/40, and even VC-1 is not very popular. I'm also pretty sure the JVT doesn't "worst-case" test anything; from what I can tell all they do is throw Foreman.cif-like test sequences at things over and over and over until something works.)
Not JVT, but you are getting close with something else - can't say which if I want to keep my job!
-
RE: Video Coding Format with 13 GB per Hour
@GettinSadda said:
Ah, now I see the disjoint.
So you work for the JVT? In that case, I have a very long list of stupid decisions made in the H.264 standard that I would like rectified ;)
You are talking about writing code to implement compression systems that others have designed.
I am talking from the perspective of one of those engineers that designs the compression systems themselves - and we like to thrash the living daylights out of the system to see how it would perform in "worse-than-the-worst-case-you-can-imagine" situations. This is where you get images that are close to /dev/random because shipping a fix for an encoder application that can't cope with some strange new sequence is easy compared to shipping a whole new standard that obsoletes all existing implementations!
(and if you don't work for the JVT, then you're irrelevant, video-standards-wise. Nobody uses VP7 or RV30/40, and even VC-1 is not very popular. I'm also pretty sure the JVT doesn't "worst-case" test anything; from what I can tell all they do is throw Foreman.cif-like test sequences at things over and over and over until something works.)
-
RE: Video Coding Format with 13 GB per Hour
@GettinSadda said:
When you are designing compression algorithms for professional content creation use
That isn't much of a qualification; such encoders almost always range from "bad" to "output suggests the encoder was designed by a bunch of monkeys on typewriters."
Unsurprisingly, the encoder I develop for completely trashes every commercial solution I've put it against. I'm not sure whether this speaks for our effectiveness or whether everyone else just sucks. Though my own research suggests it's because everyone seems to think that PSNR == quality, which is a recipe for visual disaster.
Back on topic, I would be shocked if you can find a real photograph (i.e. not a contrived /dev/random example) that cannot be compressed losslessly by more than 20% by FFV1 or a similar context-based arithmetic-coding compressor.
-
RE: Video Coding Format with 13 GB per Hour
@GettinSadda said:
Grainy film camera footage isn't high-entropy enough for you? Then what are you talking about, /dev/random?
@Dark Shikari said:
@GettinSadda said:
Well I'm a video professional ... I am not aware of any mainstream lossless video compression, and in fact, having worked on developing video compression technologies, I would not expect there to be anything that can be lossless with a reliable ratio better than ~1.2:1.
If you are actually serious about this post, I feel sorry for the person who was dumb enough to hire you as a "video professional."
Where I work we would not regard these as "high entropy" sources - PM me if you want more details. You and I are talking about very different things!
-
RE: Video Coding Format with 13 GB per Hour
Huffyuv/Lagarith achieve about 2.3-2.5x compression--on completely uncompressed raw video. Try it on YUV sequences, such as parkrun, crowdrun, etc. Running lossless on already-compressed video doesn't even necessarily get better compression than on the original source; in some cases it actually gets worse, depending on the kind of artifacts the encoder produced. I did a test the other day where this was exactly the case and the PNG files for the lossy sources were larger than those for the lossless original.
FFV1 does even better; around 2.8x. You can get a bit better if you go for inter compression, but right now I don't know of any good format that allows it. H.264 lossless sucks for many reasons and is generally considerably worse than FFV1 except in the case of an extremely clean source where inter prediction makes up for the lack of coding efficiency.
@GettinSadda said:
Well I'm a video professional ... I am not aware of any mainstream lossless video compression, and in fact, having worked on developing video compression technologies, I would not expect there to be anything that can be lossless with a reliable ratio better than ~1.2:1.
If you are actually serious about this post, I feel sorry for the person who was dumb enough to hire you as a "video professional."
-
RE: Exemptions caught in the Act
(This post has been withheld because of exemptions in the Freedom of Information Act 2000).
-
SQL madness: Oh not again!
I think the link and the resulting page speak for themselves...
-
RE: The new memcpy!
We've done more profiling than you can shake a stick at, don't worry about that. I'm almost sick of staring at oprofile results at this point! And obviously before I actually implemented such a function, I'd check out the exact number of clock cycles spent on memcpy with some quick "bench.h" magic.
The entropy_encode() function has to write things back into the struct because the struct represents the state of the context-adaptive encoder, and every bit depends on past bits. We already take a shortcut by avoiding writing certain things to the bitstream that don't cause such a dependency, and instead calculating how many bits they *would* have cost had we written them.
In some modes this function can take up 20-30% of execution time, and the function (the one that calls entropy_encode) takes about 10k clocks--so shaving off 500 would be an acceptable speed boost.
Clearly, though, it's a great idea to intentionally break the Intel C compiler ;)
-
The new memcpy!
It's well known that memcpy on 32-bit Linux in glibc isn't exactly the most optimized piece of code--while the 64-bit version uses all sorts of nice MMX and prefetch tricks, the 32-bit version isn't as great. And it's likely not as efficient on Windows, either. So in a heavily performance-intensive program where every bit counts, and there's lots of memcpying, it's worth it to write one's own routine for the specific task. One example in the program I'm working on is copying an image frame in memory--it uses a combination of prefetch and mmx registers to very quickly copy the frame by prefetching the next row ahead of time.
So, we were discussing the fact that a struct of a decent size (a few hundred bytes) was memcpy'd in a very commonly called function, and might make up a decent percentage of execution time. So I mention the idea of writing a simple assembly routine to copy the struct. It'd be easy enough, since the struct consists primarily of a big array of bytes, and a couple other variables which can be easily copied with C code.
The actual function looks something like this:
CopyOfCABACStruct = ActualCABACStruct
entropy_encode(CopyOfCABACStruct)
bits += CopyOfCABACStruct.bits
In other words, a struct is being copied, passed to a function, and then thrown away. The copy is used so that we can run the function without affecting the actual struct.
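The copy-then-trial pattern can be sketched in a few lines of C (struct layout, sizes, and function names here are invented for illustration -- the real struct is x264's CABAC state, not this):

```c
typedef struct {
    unsigned char state[320];   /* context-adaptive coder state: every bit
                                   written depends on all earlier bits */
    int bits;                   /* bits emitted so far */
} cabac_t;

/* stand-in for the real entropy coder: mutates whatever state it is given */
static void entropy_encode(cabac_t *c)
{
    c->state[0] ^= 1;
    c->bits += 1;
}

/* "copy the struct, encode the copy, read the cost, throw the copy away" */
static int trial_bits(const cabac_t *actual)
{
    cabac_t copy = *actual;     /* the struct assignment == the memcpy */
    entropy_encode(&copy);      /* only the throwaway copy is mutated */
    return copy.bits;           /* the actual state is untouched */
}
```

The whole point of the copy is that the trial run must not disturb the live encoder state -- which is exactly why it is tempting (and dangerous) to optimize the copy away.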
So this chat ensued:
[01:44:17] <pengvado> because 320 bytes stay in register throughout the CABAC encode.
[01:44:28] <Dark_Shikari> Wait--you mean to back up the original?
[01:44:37] <Dark_Shikari> so instead of copying it, and passing the copy
[01:44:40] <Dark_Shikari> you back up the original?
[01:44:43] <pengvado> yes
[01:44:51] <Dark_Shikari> and pass the original
Pengvado had proposed an "ingenious" idea: instead of memcpying the struct, using it in the function, and throwing it away... you copy most of the struct to the mmx/SSE registers on the processor, store the last few bytes in memory, and pass the original struct to the function! Then, when the function finishes, you empty the vector registers back into memory. This would definitely be faster, since it means you only have to copy a small portion of the bytes of the struct, and the rest you just back up in the registers.
Of course, I think we both saw the WTF in this insanity...
[01:45:07] <Dark_Shikari> and this relies on knowing GCC will never use an mm or xmm register during CABAC?
[01:45:13] <pengvado> yes
[01:45:17] <Dark_Shikari> what if someone decides to compile with Intel?
[01:45:24] <pengvado> then it breaks
Fortunately, being an experienced coder, pengvado wasn't entirely serious about this one.
-
RE: The 350-line #define
@Isuwen said:
Eh. The macro is being used to generate code. Isn't this less of a wtf than duplicating a 350-line function a bunch of times?
Oh, I agree fully; there's a perfectly good reason for the #define, and indeed code duplication would be more of a WTF in the sense that it's a bad idea.
However, even though something is a good idea and makes sense code-wise doesn't stop it from making your eyes jump out of their sockets when you see it.
-
RE: The 350-line #define
Actually, AFAIK, the assembly functions here are not inlined (the macro is used only to define them). This gave a significant speed boost when the change was made to not inline them, because of their massive size and the number of times they were called.
-
RE: The 350-line #define
Correct, the macro was just for convenience; the speedup was from days of work rewriting the motion compensation functions in SSE.
The functions themselves got a 20-35% increase in speed over MMX.
-
The 350-line #define
A week or so ago, I heard a complaint coming from Loren on IRC. In particular, he was angry that you couldn't use #ifdef in #defines; he wanted to do #ifdef x64 {do 64-bit stuff} #else {do 32-bit stuff} in an assembly #define.
Roughly a day later, I noticed a large diff in libavcodec. He had ported the MMX assembly for motion compensation to SSE2 and SSSE3, boosting H.264 decoding performance on Core 2s by 4%. Pretty impressive.
I look at it, and I see something that begins to set off one of those odd gut feelings of foreboding.
#define QPEL_H264_HL2_XMM(OPNAME, OP, MMX)
It takes 3 arguments: the operation to do (qpel or hpel interpolation), the size of the operation (16, 8, or 4 pixels), and what [b]instruction set[/b] to use. Yes, it has the [b]instruction set[/b] as its argument.
It's a 350-line #define.
Look upon these works, ye mighty, and despair!
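A toy version of the pattern (names invented; the real macro emits hundreds of lines of inline asm, while this one emits a one-line C body so the shape is visible): a #define that takes the instruction-set suffix as an argument and stamps out one function per suffix via token pasting.

```c
/* generate one function per "instruction set" suffix via ## token pasting */
#define DEFINE_AVG(ISA) \
    static int avg_ ## ISA(int a, int b) { return (a + b + 1) >> 1; }

DEFINE_AVG(mmx2)   /* defines avg_mmx2() */
DEFINE_AVG(sse2)   /* defines avg_sse2() */
```

In the real code each expansion selects different asm for its instruction set; multiply the body by 350 lines and the gut feeling of foreboding follows naturally.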
-
Google Translate WTF
1. Go to Google Translate (English to Spanish).
2. Type in "Heath Ledger is dead."
3. Translate.
(I'm pretty sure this result is a googlebombing of the "suggest a better translation button")
-
RE: A story of bug and WTF.
@PSWorx said:
Unrelated to the WTF, I vote
"Your new adaptive quantization patch results in a corrupt stream when interlaced encoding is activated!"
for the Technobabble Sentence of the Week.
I can do much better than that:
"Logarithmically-scaled variance-based complexity-masking adaptive quantization with Hadamard-weighted automatic sensitivity."
or
"Hadamard-thresholded rate-distortion optimized inter-macroblock partition decision"
or
"Rate-distortion optimized quantizer lookahead with adaptive range and scenecut detection."
Dontcha just love video compression? (Though I agree with the poster below; it's not actually technobabble, since it means something.)
-
A story of bug and WTF.
Company: "Hey, $developer, your new adaptive quantization patch results in a corrupt stream when interlaced encoding is activated! We use interlaced encoding for our live broadcasts, so though your adaptive quantization patch looks astounding, we can't use it until you fix this."
Me: "Wait, what? I didn't even touch that part of the codebase. My code is totally unrelated to that. WTF? This doesn't make any sense, is there some strange memory overflow that only activates on interlaced encoding or something?"
*30 minutes of trying various options ensues*
Interesting, so it only occurs with CABAC compression. This makes things a bitch to debug, because if you make a single encoding mistake on CABAC, it affects all future blocks decoded, corrupting the entire stream.
*30 minutes of CABAC debugging and comparing output to that of the reference decoder*
So it breaks after just a few blocks in the stream... but I have no idea where the corruption actually occurred.
*an hour of frantic CABAC debugging, yielding nothing other than that the stream corruption is in the mb_qp_delta variable in the second macroblock*
Another developer looks over the function I'd narrowed it down to.
"Wait a minute. It's supposed to choose context 0 if it's the first macroblock. But it's choosing context 0 if it has no neighbor--which in interlaced, is for the first two blocks, not the first. Oops."
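The mix-up can be sketched as follows (a hypothetical simplification, not x264's actual context-selection code): in interlaced coding the first two macroblocks have no decoded neighbor, so keying the context on "has no neighbor" instead of "is the first macroblock" picks the wrong context for block 1.

```c
/* buggy: context 0 whenever the macroblock has no decoded neighbor;
   in interlaced coding that is true for the first TWO macroblocks */
static int ctx_buggy(int mb_index, int interlaced)
{
    int first_with_neighbor = interlaced ? 2 : 1;
    return (mb_index < first_with_neighbor) ? 0 : 1;
}

/* fixed: context 0 only for the very first macroblock */
static int ctx_fixed(int mb_index)
{
    return (mb_index == 0) ? 0 : 1;
}
```

The two predicates agree everywhere in progressive coding, which is exactly why the bug hid until interlaced encoding with differing quantizers exercised block 1.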
And so after hours of debugging and generally stupid shit, I find that the encoder was actually broken to begin with, and it's just that it had never generated the situation that broke it before: when the quantizer differs between the first two blocks of an interlaced frame. And I fixed it with a single change to a single line of code.
-
RE: New Year's Bug
@tray said:
Running on a veeeeery slow PC, I guess.
Nope. The system time rolled over to the new year while it was running. Since the program (MeGUI) ignores the year when calculating the time, it shows an atrociously wrong runtime.
-
New Year's Bug
[img]http://i17.tinypic.com/7x8opzd.jpg[/img]
I'm guessing most of you can figure out what happened here ;)
(Not mine, taken from a bug report on Doom9)
-
RE: The biggest WTF of the last 4 years.
@djork said:
One of my friends (OK, most of them) isn't really up on current events. On the other hand, I like to think that I am. Anyway, we're talking the other day and he throws this curveball:
"So, I'm not really sure about this: why are we at war in Iraq?"
Really, how do people explain this to people that don't know? Really, I mean, WTF? How do you explain it without sounding like a conspiracy nut?
Though I don't think political discussion really belongs on these forums, it's actually a good question, because the administration's justification has changed roughly every 6 months (since each justification has been progressively proven false by intelligence reports).
-
RE: Eve Online Followup
@dlikhten said:
@PSWorx said:
Nice! I can only agree with the comments of that article. If more MMOs would treat their users like that, the world would be a better place.
I played it a week... My god, if other MMOs were this BORING then yeah, they would treat their users like gold. In this game you basically spend 2/3 of the time "traveling" on autopilot. And by 2/3 of the time I mean I set a course, go watch a show for 30 minutes or an hour, then play for 15 minutes, then set another course. The game has a "learning" system where you learn by setting a skill on "train" and it trains in, say... 1 day. I needed to log in from work to start skill gain, then log off... The game rewards people who have had an account for 2 years, and that in itself makes that account powerful. I assure you, if say... any EA game was that slow-paced, EA would treat every customer as the last man on earth who has money! (though they already should)
The real WTF is that people can be smart enough to install EVE Online on their computer and still retarded enough to completely misunderstand the game, fail to realize the entire basic reason why the game was created (player interaction), and then whine about how it's really boring when they play it as if there are no other players in the game.
There is a reason why EVE Online is the most popular independently-published MMOG in history, and it appears it went right over your head, out the window, and looked back and stuck its tongue out at you.
-
RE: Typo in EVE Online patch results in Windows installation corruption
@asuffield said:
@m0ffx said:
Awful. Truly, truly awful. If CCP are being this sloppy with their game's code, are they making similarly stupid mistakes with the code that handles the billing?
As a general rule, all software related to games is appallingly badly written. It's to do with their attitude towards scheduling and deadlines, and the fact that nowadays it's a "release once, release one or two patches, and then abandon all maintenance forever" system.
This statement does not apply to MMOGs for obvious reasons.
-
RE: Typo in EVE Online patch results in Windows installation corruption
@Thief^ said:
It's a horrible mistake, but it didn't affect 200,000 people, only the people who downloaded and installed it between 10pm (or whenever the game came back up) and about 8am (possibly a few hours earlier). It was due to come back up at 2am, so most people would have waited until after it was fixed. It also only applied to the classic->premium upgrade patch. The premium full installer and the old build->trinity classic patch were both fine.
That's why I said "up to". I'd guess the actual numbers are around a few tens of thousands.
-
Typo in EVE Online patch results in Windows installation corruption
WTF 1: Naming a game data file "boot.ini"
WTF 2: Accidentally adding a backslash, resulting in the patch deleting "\boot.ini", aka "C:\boot.ini".
WTF 3: This getting through QA and ending up in a patch run by up to two hundred thousand people.
WTF 4: The fact that Windows allows applications to overwrite boot.ini.
-
RE: WikiWhine
@Carnildo said:
@BlueKnot said:
Ok, so it's more of a "Hwaaaa?" than a WTF, but it still struck me as odd.
The bolding of the "low resolution" part of the message is to the uploader, not to anyone else. It means "do not upload your 600dpi scans of copyrighted material", and it comes from the generic fair-use message, which is used for everything from comic strips to CD covers.
This is correct. For a logo, "low resolution" is completely pointless, for obvious reasons. Low resolution is generally used for posters, screenshots, etc.
-
RE: Opensource? and free!
I find it an interesting claim that all the "good stuff" of open source comes out for Linux rather than Windows; if this were true, there would be an equivalent of Avisynth on Linux ;)
But since there isn't, and development of Avisynth 3 is basically stalled, Linux is a completely useless environment for video editing and processing, with mencoder serving as a pale shadow of what one can do on Windows.
It's quite annoying being forced to use Windows because its open-source software suite, at least in some categories, is better than Linux's. So much for cross-platform.
-
RE: Opensource? and free!
@SeekerDarksteel said:
Unless a lot of people start using them, so that the total connections increase by 10 fold across the board, increasing overhead and decreasing everyone's net throughput.
Of course. Download managers are the equivalent of cutting in line.
-
RE: Opensource? and free!
@asuffield said:
"Download managers" are universally a load of half-truths and bad ideas. Most of what they do is just busywork, playing off the idea that more complicated things are somehow better.
The basic concept of download managers is sound: splitting up a download into multiple chunks and running multiple HTTP streams at once, along with tracking blocks to allow pausing and resuming without failure.
This is extremely effective on overloaded websites; splitting the file into 10 chunks can easily boost your download speed by 1000% when your connection is 600KBps and you're only getting 30KBps.
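The chunking arithmetic behind this is trivial; here is a sketch (illustrative only -- names are invented, and a real manager would issue one HTTP Range request per computed chunk) of carving an N-byte file into k inclusive byte ranges:

```c
typedef struct { long start, end; } byte_range;  /* inclusive, as in HTTP Range: */

/* split a size-byte file into k contiguous ranges for parallel download */
static void split_ranges(long size, int k, byte_range *out)
{
    long chunk = size / k;
    for (int i = 0; i < k; i++) {
        out[i].start = (long)i * chunk;
        /* the last chunk absorbs the remainder */
        out[i].end = (i == k - 1) ? size - 1 : (long)(i + 1) * chunk - 1;
    }
}
```

Each range then becomes a separate `Range: bytes=start-end` request, and completed ranges can be recorded to support pause/resume.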
Of course, most are terribly written, I agree.
DownThemAll is a very simple good one.
-
RE: WTF: Intel CPUs
@misguided said:
@Dark Shikari said:
I assume that the Xbox processor is probably using MMX for decoding, not SSE, given that MMX is likely faster on the Coppermine core.
I thought coppermine was a p3?
Ah yeah, those were the early P3 line.
-
RE: WTF: Intel CPUs
@misguided said:
so... would this 16-copy hack apply a performance increase in ffmpeg even to, say, a certain coppermine-based SSE1 mobile celeron? The XBMC community would be really pleased to find that kind of optimization in 264 playback... there are some scattered reports of 720p30 264 being achievable on the xbox, and an extra 10% or 20% in processor headroom might really make that possible, especially over smb, which is apparently currently not happening.
I assume that the Xbox processor is probably using MMX for decoding, not SSE, given that MMX is likely faster on the Coppermine core.
It would be nice to be able to test timings on that thing to see if it has the same issue as Pentium 3s do (we didn't test any chips older than the Pentium 3s for the cache line issue). I would suspect it does, however, making this just as useful.
Squeezing true 720p out of the Xbox would be quite nice :)
One thing I'm quite curious about is how "known" this issue is in general; I've never heard of it before this, so I wonder how many proprietary programs have similar code.
-
WTF: Intel CPUs
Here's a story about some recent development that reveals quite an annoying WTF about Intel CPUs.
A few weeks ago, one of the other x264 developers was doing some timing tests on the 16x16 sum of absolute differences operation on his Core 2 Duo, and noticed something odd. The timings looked something like this (in clock cycles):
3500 48 48 48 240 48 48 48 240 48 48 48 240 [this pattern (48-48-48-240) repeats 64 times total] 3500 48 48 48 240...
Interesting, he thought. After some testing, it became clear that what was happening is that since the assembly code was forced to use unaligned loads due to the nature of the data, 1/4 of the time the data crossed cache boundaries, and 1/64 of cache boundaries were page boundaries, resulting in this pattern. Loading across a cache boundary was so expensive that the assembly code was nearly as slow as the C code! So we were curious: what about other CPUs? We tested it on a whole bucketload of other processors. The results were shocking: all Intel chips had the same problem, not just on SSE2 but on all loads back to MMX! Athlons had no penalty for loading across cache lines, and a mere 5% penalty for page lines! This was somewhat shocking to us--when we looked at amortized time, the Athlon actually considerably outperformed the Core 2 clock-for-clock due to the cache line misses dragging down the speed.
So, we figured, there must be a way to deal with this. The most obvious was the SSE3 LDDQU operation, which performs an unaligned 128-bit load without the cacheline-crossing penalty. We were only loading 16 bytes, but it would still be vastly faster in the case when the data crossed a cache line. A simple branch would swap between LDDQU and the regular load. We tested it on the Core 2... and there was no speed change. Weird, we thought. So we tested it on all other Intel CPUs with SSE3: the Core 1 and the Prescott... and it worked perfectly; LDDQU entirely eliminated the cacheline penalty. In other words, Intel allowed quite a nasty regression in the Core 2!
So we had to come up with ways to deal with this on the Core 2, and just as importantly, on non-SSE3 Intel CPUs. For the Core 2, a hack involving PALIGNR was used. For non-SSE3 chips, a genius hack was contrived involving 16 copies of the load statement, one for each possible misalignment; this was actually still drastically faster than loading across a cacheline.
The result was a near 50% reduction in clock cycles for the SAD operation on Intel CPUs. Interestingly enough, this could also apply to motion compensation, since it has the same issue of unaligned loading. If the hack was implemented in FFDshow, for example, it could speed up H.264 playback by a solid 10-20%.
The moral of the story: Never trust the processor documentation. Never assume that every assembly operation does what it should on every chip. And never trust averaged clock cycles; they could be averaging two separate sets of numbers together to yield a third number of clock cycles that the operation never actually takes (i.e. 48 48 48 averaged with 240 to get 100 or so).
Of course, even though SAD is the most commonly used operation in the entire program, the Core 2s historically have trashed the Athlon 64s clock-for-clock, even without this change.
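The averaging trap, in the numbers from the timings above: three 48-cycle loads plus one 240-cycle cacheline-crossing load amortize to 96 cycles, a cost no single load ever actually has.

```c
/* naive amortized-cycle average over a set of per-call timings */
static int amortized_cycles(const int *samples, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += samples[i];
    return sum / n;
}
```

Feed it the 48-48-48-240 pattern and it reports ~96 cycles -- which is why per-call timing, not an averaged figure, exposed the cacheline penalty in the first place.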
P.S. For those masochistic enough to want to look at the assembly code itself, check out [url="http://trac.videolan.org/x264/changeset/696"]the patch itself[/url] here.
-
RE: Meeeeeee toooooooo
@Sunstorm said:
You know, I wonder how come no one's built a robot yet to automatically vandalise/erase wikipedia pages. All of them. At the same time.
I mean, sure they can be reverted, but that would be a lot of work!
Any edit dumb enough to be made by a bot can also be detected by an anti-vandal bot... and the former can be blocked, the latter won't be :)
More importantly, though, administrators have a button that with a single click will roll back every single contribution made by a user.
-
RE: Sharp OSA, the idea makes me cry.
@IHateEverybody said:
The best part is the 4:3 ad runs set in the frame of a 16:9 monitor on that second link. Who cares what the ad was for, the awful presentation was humorous enough.
The real WTF is that they didn't print out each of the frames and put them on a wooden table, then take pictures of each of them and string them back together to make the video.
-
RE: WTF WTF?
@Brendan Kidwell said:
Stop blocking ads. It's not fair to web site operators. If you don't like animated ads, follow my instructions for killing the animation. Then if you allow static, non-moving ads, it's pretty easy to ignore them or maybe even occasionally read them, but they don't interfere with your enjoyment of the site, and everybody wins.
(Note: it's true that if you follow my instructions, Flash ads are effectively blocked, but that's just too bad. I've never seen a Flash ad I could tolerate, but pages are welcome to try to detect the fact that Flash didn't load and put an image or text block in its place.)
Ads do not magically earn website owners money.
They earn website owners money if I click on them.
Since I have probably not clicked on an internet ad in at least a year, I don't think I'm losing TheDailyWTF any revenue by blocking ads.
-
RE: Firefox Add-on Irony
@asuffield said:
@aythun said:
@sootzoo said:
@LightningDragon said:
the only time I restart FF is for extension installation (I love hibernation)
...and memory leaks?
What memory leaks?
It's not actually a memory leak. It's just a badly misguided caching policy.
A memory leak is when the application allocates memory but then loses track of it, so it's permanently allocated and unusable. Firefox knows exactly what is in all that memory: cached, prerendered copies of all the pages you've visited recently, on the grounds that modern systems have a lot of memory and nobody runs more than one application at once.
...which is limited to a value which can be changed easily by the user. I think the default is roughly 60 megabytes. There is also the "fast-back" pre-rendered previous pages, which are roughly 5 megabytes per page for I think 5 pages by default.
I for one increased these values considerably, as I am willing to sacrifice a bit of memory for extra speed.
-
RE: SID playing: Looking stuff up from a flat file has never been easier
@asuffield said:
The problem with free software has always been that any idiot can create it.
Doesn't this apply equally to proprietary software, as proven by this website? ;)
-
RE: IE hack magic
[quote user="joe.edwards@imaginuity.com"]
A hack in Internet Explorer? Inconceivable!
[/quote]
You keep using that word. I don't think it means what you think it means.