Read HighPart, read LowPart, read HighPart2?

aib

Looking through the sources of a sample [NT] kernel driver I wrote, I came across this:

    getEnterTime:
        mov edx, [0xFFDF000C]
        mov enterTime.HighPart, edx

        mov eax, [0xFFDF0008]
        mov enterTime.LowPart, eax

        cmp edx, dword ptr [0xFFDF0010]
        jne getEnterTime

I remember copying this method off some library/kernel function; but I can't remember where from, what 0xFFDF0010 is supposed to contain (or why I did not write any comments).

Has anyone seen something like this? Where could this be from? (I checked the kernel's NtQueryPerformanceCounter function, not there)

I think [0xFFDF0010] =?= [0xFFDF000C] is used for synchronization; some function writes the High:Low parts of the timer into *10 and *18, then writes them again to *08 and *OC. Thus, if the Low part overflows as you read the value, you can go back and re-read.

Wonder where I came up with the magic numbers, though; google comes up with nothing.

bstorer

0xFFDF0000 is the location of the KUSER_SHARED_DATA structure for x86 systems. 0xFFDF0008 is then the location of the interrupt time structure, which is a constantly increasing timer tracking 100ns intervals since boot. 0xFFDF0000 is defined by KI_USER_SHARED_DATA. I think this stuff is in ntddk.h from the Driver Development Kit. I don't know if this stuff extends to Vista, but I expect it's similar, if not the same. Hope that helps.

aib

A-ha!

From: http://www.dcl.hpi.uni-potsdam.de/research/WRK/?p=34

"As you can see, the time is stored as a 64-bit value: High1Time:LowPart. But why is there another HighPart (High2Time) field in the structure (line 51)? To be short, it is for synchronization purposes. The three time values are updated directly by the clock interrupt service routine (ISR). An ISR must not acquire any lock because it must complete as fast as possible and thus must not block. However, to ensure applications (Win32) to read a consistent time value, the order in which the members of KSYSTEM_TIME are updated has a very strict manner. The ISR first updates High2Time, then LowPart, and finally High1Time. A consuming application must read the structure the same strict but inverse order, that is, it first reads High1Time, then LowPart, and finally High2Time. If High1Time and High2Time are equal, the High1Time:LowPart is a consistent value. Otherwise the application was interrupted by the clock interrupt and needs to read the structure again."

So I got that part right. And the name of the struct was all I needed, thanks. Below is for reference:

Looking through the 2600.1106 (XP) DDK, I found that struct _KUSER_SHARED_DATA is defined in inc/ddk/wxp/ntddk.h line #9451

offset +8 is KSYSTEM_TIME InterruptTime

Apparently, KSYSTEM_TIME is defined with two high parts (inc/ddk/wdm/wxp/wdm.h line #959), so all the kernel timers use this method.

Thank you

Now, on to replacing the magic numbers...

asuffield

@aib said:

"As you can see, the time is stored as a 64-bit value: High1Time:LowPart. But why is there another HighPart (High2Time)
field in the structure (line 51)? To be short, it is for
synchronization purposes. The three time values are updated directly by
the clock interrupt service routine (ISR). An ISR must not acquire any
lock because it must complete as fast as possible and thus must not
block. However, to ensure applications (Win32) to read a consistent
time value, the order in which the members of KSYSTEM_TIME are updated has a very strict manner. The ISR first updates High2Time, then LowPart, and finally High1Time. A consuming application must read the structure the same strict but inverse order, that is, it first reads High1Time, then LowPart, and finally High2Time. If High1Time and High2Time are equal, the High1Time:LowPart is a consistent value. Otherwise the application was interrupted by the clock interrupt and needs to read the structure again."

Sheesh. Ingenious, hopelessly overcomplicated, and the wrong solution to the wrong problem. Who came up with that?

The problem with this approach is that it only works if you set the relevant page to be non-cacheable (since modern CPUs will re-order reads to cacheable memory, and that breaks the whole thing), and that does nasty things to performance. Overall it's much slower than using a lock.

bstorer

@asuffield said:

Sheesh. Ingenious, hopelessly overcomplicated, and the wrong solution to the wrong problem. Who came up with that?

I'll give you a hint: M _ _ _ _ S _ _ _

Actually, they probably stole it from someone else.

aib

@asuffield said:

@aib said:
...[write HighPartCopy][write LowPart][write HighPart] "solution" to a synchronization issue...
Sheesh. Ingenious, hopelessly overcomplicated, and the wrong solution to the wrong problem. Who came up with that?
The problem with this approach is that it only works if you set the relevant page to be non-cacheable (since modern CPUs will re-order reads to cacheable memory, and that breaks the whole thing), and that does nasty things to performance. Overall it's much slower than using a lock.

I'm sorry, could you elaborate a bit on that?

Seems like a rather nice hack to me -- the LowPart gets updated quite frequently (probably once per scheduler tick) and the HighPart not so much. What would be the point of caching the page?

Wouldn't using locks overcomplicate this? (Unless, you mean a hard lock as in "LOCK CMPXCHG8B" -- which I don't see the point of?)

asuffield

@aib said:

@asuffield said:
Sheesh. Ingenious, hopelessly overcomplicated, and the wrong solution to the wrong problem. Who came up with that?
The problem with this approach is that it only works if you set the relevant page to be non-cacheable (since modern CPUs will re-order reads to cacheable memory, and that breaks the whole thing), and that does nasty things to performance. Overall it's much slower than using a lock.
I'm sorry, could you elaborate a bit on that?
Seems like a rather nice hack to me -- the LowPart gets updated quite frequently (probably once per scheduler tick) and the HighPart not so much. What would be the point of caching the page?

CPU cache is hundreds of times faster than SDRAM, and the only thing that can keep up with the rate at which the CPU executes instructions. Having to fetch the cache line from memory every time you want to read it, and worse, to flush the line to memory once per scheduler tick? That's going to hurt, badly.

Wouldn't using locks overcomplicate this?

Here's an example of 'read' code for a locking variant (linux 2.6.22.6):

         do {
                seq = read_seqbegin(&xtime_lock);
                offset = time_interpolator_get_offset();
                sec = xtime.tv_sec;
                nsec = xtime.tv_nsec;
        } while (unlikely(read_seqretry(&xtime_lock, seq)));

The 'write' code is even simpler. Hardly complicated.

aib

@asuffield said:

CPU cache is hundreds of times faster than SDRAM, and the only thing that can keep up with the rate at which the CPU executes instructions. Having to fetch the cache line from memory every time you want to read it, and worse, to flush the line to memory once per scheduler tick? That's going to hurt, badly.

Yes, of course, but what is Linux doing that is bypassing the SDRAM (and, if I understood you correctly, using cacheable pages?)

The 'read' code doesn't seem that much different, especially considering that the 'seq' checks check if a write was done under the reader's nose, i.e. same thing as the HighPartCopy, only with a lock and a sequence value. (And with proper memory barriers)

I can't find the 'writer' code, but I can't really imagine it doing anything significantly different than writing two words to the memory somewhere. And as long as you have two volatile words, what's it about adding a third one that breaks the mechanism so suddenly?

@asuffield said:

Wouldn't using locks overcomplicate this?
Here's an example of 'read' code for a locking variant (linux 2.6.22.6):

         do {
                seq = read_seqbegin(&xtime_lock);
                offset = time_interpolator_get_offset();
                sec = xtime.tv_sec;
                nsec = xtime.tv_nsec;
        } while (unlikely(read_seqretry(&xtime_lock, seq)));

The 'write' code is even simpler. Hardly complicated.

It boils down to the same thing - read seq1, fence, do stuff, read seq2, loop if not equal

And I wouldn't argue about using proper locks versus hacking a system structure to add a weird synchronization primitive.

asuffield

@aib said:

@asuffield said:
CPU cache is hundreds of times faster than SDRAM, and the only thing that can keep up with the rate at which the CPU executes instructions. Having to fetch the cache line from memory every time you want to read it, and worse, to flush the line to memory once per scheduler tick? That's going to hurt, badly.
Yes, of course, but what is Linux doing that is bypassing the SDRAM (and, if I understood you correctly, using cacheable pages?)

Nothing special...

I can't find the 'writer' code, but I can't really imagine it doing anything significantly different than writing two words to the memory somewhere. And as long as you have two volatile words, what's it about adding a third one that breaks the mechanism so suddenly?

The windows hack relies on the CPU executing memory reads in a specific order. Modern Intel and AMD CPUs will always re-order reads to cacheable pages, so in order for this hack to work at all you have to either mark the page as uncacheable or flush it explicitly (which amounts to the same thing, for performance purposes).

You can mark pages as "cache but enforce write-ordering" on both breeds of chip, but that is not sufficient for this algorithm to work. There is no "cache but enforce read-ordering" mode, since that would be incompatible with the SMP mechanisms they all use.

Hatshepsut

@asuffield said:

The windows hack relies on the CPU executing memory reads in a specific order. Modern Intel and AMD CPUs will always re-order reads to cacheable pages, so in order for this hack to work at all you have to either mark the page as uncacheable or flush it explicitly (which amounts to the same thing, for performance purposes).
You can mark pages as "cache but enforce write-ordering" on both breeds of chip, but that is not sufficient for this algorithm to work. There is no "cache but enforce read-ordering" mode, since that would be incompatible with the SMP mechanisms they all use.

Perhaps someone should tell the author of the third paragraph of section 6.3 of [url=http://download.microsoft.com/download/e/b/a/eba1050f-a31d-436b-9281-92cdfeae4b45/MP_issues.doc]this[/url] before it's too late...

But why should the original driver code be reading the kernel system time structure directly anyway - surely there's a kernel routine to do that for you (and hide whatever synchronisation issues/hacks that may exist)?

asuffield

@Hatshepsut said:

@asuffield said:
The windows hack relies on the CPU executing memory reads in a specific order. Modern Intel and AMD CPUs will always re-order reads to cacheable pages, so in order for this hack to work at all you have to either mark the page as uncacheable or flush it explicitly (which amounts to the same thing, for performance purposes).
You can mark pages as "cache but enforce write-ordering" on both breeds of chip, but that is not sufficient for this algorithm to work. There is no "cache but enforce read-ordering" mode, since that would be incompatible with the SMP mechanisms they all use.
Perhaps someone should tell the author of the third paragraph of section 6.3 of [url=http://download.microsoft.com/download/e/b/a/eba1050f-a31d-436b-9281-92cdfeae4b45/MP_issues.doc]this[/url] before it's too late...

Yes. Oh dear. That's quite horribly wrong. That paragraph says:

<style type="text/css">

</style>
On x86-based, x64-based and Itanium-based hardware, reordering might take place when a write operation for one location precedes a read operation for a different location. Processor reordering might move the read operation ahead of the write operation on the same CPU, thus effectively reversing their order in code. These architectures do not reorder read operations followed by read operations or write operations followed by write operations.

While the AMD64 programmer's manual, Vol 1, Section 3.9.1 says:

Out-of-order reads are allowed. Out-of-order reads can occur as
a result of out-of-order instruction execution. The processor
can read memory out-of-order to prevent stalling
instructions that are executed out-of-order.

(The other manuals say similar things, but I don't have them on hand)

And furthermore, all of them use SMP cache coherency protocols that do not enforce read ordering at all, except on the strict-ordering instructions (which are the basis for locks).

But why should the original driver code be reading the kernel system time structure directly anyway - surely there's a kernel routine to do that for you (and hide whatever synchronisation issues/hacks that may exist)?

That is an excellent question. The whole thing is insane.

aib

@asuffield said:

But why should the original driver code be reading the kernel system time structure directly anyway - surely there's a kernel routine to do that for you (and hide whatever synchronisation issues/hacks that may exist)?
That is an excellent question. The whole thing is insane.

What, my code? It was a SYSENTER hook so (I thought) I couldn't afford the function call. The same reasoning may apply to other code as well. (In Linux you can duplicate the original routine as you have access to the source code and locks are macros. Here I did the same thing with assembly code, though I don't have any read fences (not sure if the original code did.))

@asuffield said:

On x86-based, x64-based and Itanium-based hardware, reordering
might take place when a write operation for one location precedes a
read operation for a different location. Processor reordering might
move the read operation ahead of the write operation on the same CPU,
thus effectively reversing their order in code. These architectures
do not reorder read operations followed by read operations or write
operations followed by write operations.

While the AMD64 programmer's manual, Vol 1, Section 3.9.1 says:
Out-of-order reads are allowed. Out-of-order reads can occur as
a result of out-of-order instruction execution. The processor
can read memory out-of-order to prevent stalling
instructions that are executed out-of-order.

IA-32 Intel Architecture Software Developer's Manual, Vol 3., §7.2.2 (Memory ordering in P4, Xeon, P6) :

1. Reads can be carried out speculatively and in any order.