Real-time video encoding on limited hardware

Kermos

Anyone ever deal with having to encode video on limited hardware (520MHz processor, Intel X-Scale, ARM core with SSE/MMX support)?

Video data is as follows:

30 Frames per second

320x200 in YCC 4:2:2 format from camera, can drop down to YCC 4:2:0 (though would have to do that in software)

I'm trying to find a way to somehow squeeze this into something a 512kbps out connection can handle. Yes, I know, I (or rather, my boss) wants the impossible.

x264 has been a consideration, but currently I can't find anything that either doesn't require XP/Vista with a GPU to do the encoding in hardware, or in case of software solutions, that isn't GPL tainted.

All hardware H.264 encoders we've found so far are way too expensive, as they are aimed at high-end content and are far more powerful than anything we need.

Here's what I've done so far:

Raw video in YCC 4:2:2 is ~4.5megs/second. Since the hardware on the X-Scale end only has an 18-bit TFT, I'm using RGB565 format on the PC end as well. This allows me to drop the lower 2 bits in the YCC data without any significant visual loss that isn't already incurred by the final RGB565 data.

I then took this data, run-length encoded it and compressed it using Huffman coding. This got me down to roughly 1.8megs/second. However, this is where I'm currently stuck.

Even with piss-poor quality jpeg compression, I still hit 1meg/second. So I'm pretty happy with my 1.8megs/second with far superior quality and an over 50% compression ratio. But that is still beyond what a 512kbps connection could ever handle... and sadly, we have to assume the PC end of the connection does not have more outgoing bandwidth than that.
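To put the gap in numbers, here is a rough budget calculation. It assumes a 320x240 frame, which is what the byte counts later in the thread imply (the post above says 320x200), and it reads "512kbps" as 512,000 bits per second. Both are assumptions, so treat the result as a ballpark only.

```python
# Rough bandwidth budget. WIDTH and HEIGHT are assumptions (320x240).
WIDTH, HEIGHT, FPS = 320, 240, 30
BYTES_PER_PIXEL_422 = 2          # YCC 4:2:2 averages 16 bits per pixel
LINK_BITS_PER_SECOND = 512_000   # assumed meaning of "512kbps"

raw_bytes_per_second = WIDTH * HEIGHT * BYTES_PER_PIXEL_422 * FPS
link_bytes_per_second = LINK_BITS_PER_SECOND // 8
required_ratio = raw_bytes_per_second / link_bytes_per_second

print(raw_bytes_per_second)   # raw camera rate in bytes/second
print(link_bytes_per_second)  # what the link can carry in bytes/second
print(required_ratio)         # compression ratio needed to fit the link
```

Under these assumptions the raw stream has to shrink by roughly a factor of 70 to fit, which is why a 2:1 or even 8:1 lossless scheme cannot be enough on its own.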

If anyone has any suggestions, I'd appreciate it. Though, I pretty much figure that there just isn't that much more I can do.

tster

If there is something that will work which is under GPL, just build a separate app which uses that library and which is also GPL. Then in your system which you don't want to be GPLed you can interface with that app (just don't make it a part of the system) and you are home free.

Kermos

@tster said:

If there is something that will work which is under GPL, just build a separate app which uses that library and which is also GPL. Then in your system which you don't want to be GPLed you can interface with that app (just don't make it a part of the system) and you are home free.

Given that both apps would be running on the same embedded Linux device, I'm a bit worried that wouldn't fly, as it'd probably be considered part of the system because of that.

However, I've made progress. After trying some ideas I had, I'm now down to ~450-500kb/s with near-lossless quality, using just Huffman coding (no more need for the RLE).

It is still too much bandwidth, but it is a lot closer than anything I had before.

The problem I think I'm running into now is that I'm using 8-bit Huffman coding. That means the best achievable compression ratio is 8:1, or in my case, ~420kb/s, so I'm already close to the best possible compression on each frame. I think I need to see if I can give the Huffman coding larger chunks of data to work with in a meaningful way to improve that ratio a bit.
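To see where the 8:1 ceiling comes from, here is a minimal sketch of byte-wise Huffman code-length construction. The function names (`huffman_code_lengths`, `compressed_bits`) and the sample data are made up for illustration and are not the encoder from this thread. Even for very skewed input, the most frequent byte still costs at least 1 bit, so 8 input bits can never shrink below 1 output bit.

```python
import heapq
from collections import Counter

def huffman_code_lengths(data):
    """Return {symbol: code length in bits} for a Huffman code over data."""
    freq = Counter(data)
    if len(freq) == 1:
        # A lone symbol still needs one bit per occurrence.
        return {next(iter(freq)): 1}
    # Heap entries: (weight, tie_breaker, {symbol: depth so far}).
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

def compressed_bits(data):
    """Total size in bits of data under its own Huffman code (table excluded)."""
    lengths = huffman_code_lengths(data)
    return sum(lengths[b] for b in data)

# Hypothetical delta frame: almost all zeros with a few small changes.
sample = bytes([0] * 1000 + [4] * 20 + [252] * 10 + [8] * 5)
ratio = (len(sample) * 8) / compressed_bits(sample)
print(huffman_code_lengths(sample))
print(ratio)  # stays at or below 8 for byte-wise symbols
```

The only ways past that ceiling are symbols larger than one byte or removing data before the entropy coder sees it.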

arty

Just out of curiosity, I tried jpeg-ing some frames after doing a simple frame difference with a base "I" frame. The scheme is, we take a frame and encode it normally, then the next 7 frames have each component of each pixel as 128 + (me / 2) - (base / 2). This saves about 10% in the jpeg in my test at Q=30. Not much, but not bad IMO for just subtraction. Maybe this kind of idea will help?

This is the little ppm image processor

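import sys

class img:
    def __init__(self, imgfile):
        # Parse a binary PPM (P6) header: magic, "width height", max value.
        hdr = imgfile.readline()
        res = imgfile.readline().split()
        mxv = imgfile.readline()
        self.x = int(res[0])
        self.y = int(res[1])
        # Everything after the header is raw RGB bytes.
        self.dat = imgfile.read()
        imgfile.close()

    def subtract(self, other):
        # Halved difference against the base frame, biased by 128 so the
        # result (1..255) always fits in one byte. bytearray keeps this
        # working on both Python 2 and Python 3 and avoids the quadratic
        # string concatenation.
        mine = bytearray(self.dat)
        base = bytearray(other.dat)
        newdat = bytearray(len(mine))
        for i in range(len(mine)):
            newdat[i] = 128 + (mine[i] // 2) - (base[i] // 2)
        self.dat = bytes(newdat)

    def save(self, name):
        # Binary mode, so pixel bytes are never newline-translated.
        imgfile = open(name, 'wb')
        imgfile.write(b'P6\n')
        imgfile.write(('%d %d\n' % (self.x, self.y)).encode('ascii'))
        imgfile.write(b'255\n')
        imgfile.write(self.dat)
        imgfile.close()

if __name__ == '__main__':
    imgct = 0
    fullimg = None
    for arg in sys.argv[1:]:
        if imgct % 8 == 0:
            # Full picture: kept as the base for the next 7 frames
            fullimg = img(open(arg, 'rb'))
        else:
            # Diff picture: overwritten in place with the difference image
            diffimg = img(open(arg, 'rb'))
            diffimg.subtract(fullimg)
            diffimg.save(arg)
        imgct += 1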

Here's the result

arty@slave ~/data $ for file in 00000*.png ; do pngtopnm $file >`echo $file | sed -e 's/png/ppm/g'` ; done
arty@slave ~/data $ for file in 00000*.ppm ; do cjpeg -quality 30 $file >`echo $file | sed -e 's/ppm/jpg/g'` ; done
arty@slave ~/data $ ls -l 0000*.jpg | awk '{x=x+$5;} END {print x;}'

330727

arty@slave ~/data $ python imgdiff.py *.ppm

arty@slave ~/data $ for file in 00000*.ppm ; do cjpeg -quality 30 $file >`echo $file | sed -e 's/ppm/jpg/g'` ; done
arty@slave ~/data $ ls -l 0000*.jpg | awk '{x=x+$5;} END {print x;}'

298635

arty@slave ~/data $ python -c "print (330727 - 298635) / 330727.0"
0.0970347144321

Nelle

The company where I used to work used MJPEG a lot on embedded devices ...

Kermos

@arty said:

Just out of curiosity, I tried jpeg-ing some frames after doing a simple frame difference with a base "I" frame. The scheme is, we take a frame and encode it normally, then the next 7 frames have each component of each pixel as 128 + (me / 2) - (base / 2). This saves about 10% in the jpeg in my test at Q=30. Not much, but not bad IMO for just subtraction. Maybe this kind of idea will help?

Thing is, Q=30 is utterly horrid. I tried jpeg at Q=30 and that still chewed around 1 meg/second of bandwidth while looking utterly horrible in quality. Even with a 10% saving from your method I would still be lightyears off the goal.

What I ended up implementing was, instead of encoding the frame, I encode the *differences* between the frames. Which, on the first frame, is the actual frame (since the initial reference buffer is cleared to 0).

On all subsequent frames, I only transmit the delta between the reference and the new frame. That gives me a very tight set of values, usually only 10-20 distinct ones, that I can compress very nicely with Huffman coding. After some additional optimizations, such as identifying 2x2 pixel blocks that did not change between frames (easy to do now, no mem compares needed anymore, just count 0s), this dropped me to around the 500kb/second range. However, I am still 8-10 times above what a 512kbit outbound connection could handle.
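A minimal sketch of the delta scheme described above, assuming 8-bit samples and modulo-256 differences so each delta still fits in one byte. The helper names (`encode_delta`, `apply_delta`) and the tiny 8-sample frames are hypothetical; the real encoder works on full YCC frames.

```python
def encode_delta(reference, frame):
    """Byte-wise difference (mod 256) between the new frame and the reference."""
    return bytes((new - old) & 0xFF for new, old in zip(frame, reference))

def apply_delta(reference, delta):
    """Receiver side: rebuild the frame from its reference and the delta."""
    return bytes((old + d) & 0xFF for old, d in zip(reference, delta))

# The reference starts cleared to 0, so the first delta is the frame itself.
reference = bytes(8)
frame1 = bytes([16, 16, 20, 20, 128, 128, 64, 64])
frame2 = bytes([16, 16, 24, 20, 128, 124, 64, 64])

delta1 = encode_delta(reference, frame1)
delta2 = encode_delta(frame1, frame2)

print(list(delta1))        # same bytes as frame1
print(list(delta2))        # mostly zeros
print(delta2.count(0))     # unchanged samples
print(len(set(delta2)))    # number of distinct values left to Huffman-code
```

Because the scheme is lossless, the encoder can use the raw camera frame as its next reference and it will still match what the receiver reconstructs.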

About the only remaining idea I've got is to feed the Huffman coder 2x2 pixel blocks instead of byte values and encode those instead. It seems that out of 19200 blocks per frame, I generally only have roughly 3000 unique changes. I'll have to try how that would compress. However, there is still the issue that the value/frequency pairs I have to transmit along for the receiver to be able to reconstruct the tree now take an excessive amount of space. They need 8 bytes each (6 bytes per 2x2 pixel block, 2 bytes sufficient for frequency), so that is ~703kb overhead per second. That's more than my entire compressed data right now on a byte level!
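The overhead figure can be checked with a few lines of arithmetic. The constants come straight from the post; reading "kb" as 1024 bytes is an assumption, though it is the reading that reproduces the ~703 figure.

```python
UNIQUE_BLOCKS_PER_FRAME = 3000   # "roughly 3000 unique changes" from the post
BYTES_PER_BLOCK = 6              # 2x2 block in 4:2:0: 4 Y + 1 Cb + 1 Cr
BYTES_PER_FREQUENCY = 2
FPS = 30

entry_bytes = BYTES_PER_BLOCK + BYTES_PER_FREQUENCY
table_bytes_per_second = UNIQUE_BLOCKS_PER_FRAME * entry_bytes * FPS

print(entry_bytes)                    # bytes per value/frequency pair
print(table_bytes_per_second)         # table overhead in bytes/second
print(table_bytes_per_second / 1024)  # same, in units of 1024 bytes
```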

The only way around that I see is to generate the dictionary, transmit it to the client, and then use this same dictionary for all future transmissions as well. For blocks that have no match, just compress on a byte level instead. Update the dictionary if I start getting too many misses.
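One possible shape for that shared-dictionary idea, as a hedged sketch only. The names (`build_dictionary`, `encode_blocks`, `decode_blocks`, `ESCAPE`), the dictionary size and the block values are all illustrative assumptions; the post does not specify a wire format.

```python
from collections import Counter

ESCAPE = None  # hypothetical marker meaning "literal block follows"

def build_dictionary(blocks, max_entries):
    """Map the most common blocks to small integer indices."""
    common = Counter(blocks).most_common(max_entries)
    return {block: index for index, (block, _) in enumerate(common)}

def encode_blocks(blocks, dictionary):
    """Emit a dictionary index per block, or (ESCAPE, raw block) on a miss."""
    out, misses = [], 0
    for block in blocks:
        if block in dictionary:
            out.append(dictionary[block])
        else:
            # A miss: this raw block would fall back to byte-level coding.
            out.append((ESCAPE, block))
            misses += 1
    return out, misses / len(blocks)

def decode_blocks(symbols, dictionary):
    """Receiver side: look indices up in the same dictionary."""
    by_index = {index: block for block, index in dictionary.items()}
    return [s[1] if isinstance(s, tuple) else by_index[s] for s in symbols]

# 6-byte 2x2 block deltas (made-up values).
same = bytes(6)
edge = bytes([4, 4, 0, 0, 0, 0])
rare = bytes([8, 252, 4, 0, 4, 0])

training = [same] * 50 + [edge] * 10
dictionary = build_dictionary(training, max_entries=2)

frame = [same, edge, rare, same]
symbols, miss_rate = encode_blocks(frame, dictionary)
print(symbols)
print(miss_rate)  # a high miss rate would be the cue to resend the dictionary
```

The indices themselves could then go through the Huffman coder, with the table sent once instead of once per frame.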

In the end though, no idea if that'll be efficient enough. Will need to try and test.

tster

I don't know if this would be possible, but if you have a high density of unchanged pixels (or even 2x2 pixel squares), there will probably be a lot of pixels that stay unchanged for 3 or more frames in a row.

So if you had a single pixel for 10 frames with values {FAFA, FABA, FABA, FABA, FABA, FABA, FABA, D000, D000, D000}

right now you send {FAFA, -40, 0, 0, 0, 0, 0, -2ABA, 0, 0} (compressed with other 0s from other pixels)

modify it so it sends

{FAFA, -40, 0x5, -2ABA, 0, 0}

Of course this will probably fuck up your compression strategy so it's probably best to standardize on a number of 0s (like 4) which would be

{FAFA, -40, 0x4, 0, -2ABA, 0, 0}

then you might be able to still compress across pixels.

Of course maybe none of this would work....
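For what it's worth, the fixed-length variant can be sketched like this. The token name `ZERO_RUN` and both function names are hypothetical, and the values are treated as plain hex integers for a single pixel over 10 frames, as in the example above.

```python
ZERO_RUN = "Z4"   # hypothetical token standing for four zero deltas in a row
RUN_LENGTH = 4

def temporal_deltas(values):
    """First value as-is, then the change from each value to the next."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def pack_zero_runs(deltas):
    """Replace every full run of RUN_LENGTH zeros with one ZERO_RUN token."""
    out, zeros = [], 0
    for d in deltas:
        if d == 0:
            zeros += 1
            if zeros == RUN_LENGTH:
                out.append(ZERO_RUN)
                zeros = 0
        else:
            out.extend([0] * zeros)   # run too short: keep literal zeros
            zeros = 0
            out.append(d)
    out.extend([0] * zeros)           # flush any trailing short run
    return out

pixel = [0xFAFA] + [0xFABA] * 6 + [0xD000] * 3
deltas = temporal_deltas(pixel)
packed = pack_zero_runs(deltas)
print([hex(d) for d in deltas])
print([d if d == ZERO_RUN else hex(d) for d in packed])
```

With a run length of 4 this reproduces the second form above: one run token, one leftover 0, then the big change and two literal zeros.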

Kermos

I'm already doing exactly that and it is what had stabilized me into the 500ish kb/second range.

Since I'm only using 16 bit RGB on the target, I can drop off the lower 2 bits on the YCC data without noticeable loss in quality. That allows me to use the lower 2 bits for other information. So in my case, when I see 6 0's (completely unchanged 2x2 block), I send a 0x01 in its place.

Thanks :)
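A small sketch of why the 0x01 marker is safe, under the assumption that every sample has its lower 2 bits cleared before the deltas are taken. In that case every delta is a multiple of 4 (mod 256), so 0x01 can never occur as real data. Function names and sample values are made up for illustration.

```python
UNCHANGED_BLOCK = 0x01   # marker value from the post
BLOCK_BYTES = 6          # one 2x2 block: six delta bytes

def quantize(sample):
    """Drop the lower 2 bits, so every stored value is a multiple of 4."""
    return sample & 0xFC

def mark_unchanged_blocks(deltas):
    """Replace each all-zero 6-byte block with a single 0x01 byte."""
    out = bytearray()
    for i in range(0, len(deltas), BLOCK_BYTES):
        block = deltas[i:i + BLOCK_BYTES]
        if len(block) == BLOCK_BYTES and not any(block):
            out.append(UNCHANGED_BLOCK)
        else:
            out.extend(block)
    return bytes(out)

def unmark_unchanged_blocks(packed):
    """Receiver side: expand each 0x01 back into six zero deltas."""
    out = bytearray()
    for b in packed:
        if b == UNCHANGED_BLOCK:
            out.extend(bytes(BLOCK_BYTES))
        else:
            out.append(b)
    return bytes(out)

# Two 2x2 blocks; only the first sample of the second block changes.
old = bytes(quantize(v) for v in [17, 18, 19, 20, 130, 131, 50, 51, 52, 53, 90, 91])
new = bytes(quantize(v) for v in [17, 18, 19, 20, 130, 131, 58, 51, 52, 53, 90, 91])
deltas = bytes((n - o) & 0xFF for n, o in zip(new, old))
packed = mark_unchanged_blocks(deltas)
print(list(deltas))
print(list(packed))
```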

tster

@Kermos said:

I'm already doing exactly that and it is what had stabilized me into the 500ish kb/second range.
Since I'm only using 16 bit RGB on the target, I can drop off the lower 2 bits on the YCC data without noticeable loss in quality. That allows me to use the lower 2 bits for other information. So in my case, when I see 6 0's (completely unchanged 2x2 block), I send a 0x01 in its place.
Thanks :)

Hmm, I'd have to actually look at example data to try and figure out specific ways to compress it more. One idea would be if there are lots of low numbers (pixels changing color only slightly) then change those numbers to 0 up to a certain tolerance and then send the change after it.

For instance:

if a pixel has {1000, 1002, 1005, 1007, 1007, 1006, 1009, 100B, 100D, 1040}

and your tolerance is 10 send:

{1000, 0x6, B, 0, 35}
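Spelled out as code, under one reading of the example: the tolerance is decimal 10, differences are measured against the last value actually sent, and "0x6" stands for a run of six zeros. Those are assumptions, but they are the ones that reproduce the numbers above. The function names are hypothetical.

```python
def encode_with_tolerance(values, tolerance):
    """Send 0 while a value stays within tolerance of the last value sent."""
    reference = values[0]        # what the receiver currently holds
    out = [reference]
    for v in values[1:]:
        diff = v - reference
        if abs(diff) <= tolerance:
            out.append(0)        # receiver keeps showing `reference`
        else:
            out.append(diff)
            reference = v        # encoder tracks the receiver's state
    return out

def decode(stream):
    """Receiver side: accumulate the differences."""
    current = stream[0]
    frames = [current]
    for d in stream[1:]:
        current += d
        frames.append(current)
    return frames

pixel = [0x1000, 0x1002, 0x1005, 0x1007, 0x1007,
         0x1006, 0x1009, 0x100B, 0x100D, 0x1040]
stream = encode_with_tolerance(pixel, tolerance=10)
print([hex(d) for d in stream])
print([hex(v) for v in decode(stream)])
```

Note that the encoder has to keep its reference equal to what the receiver reconstructs, not to the raw camera value, otherwise small errors would accumulate.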

Kermos

@tster said:

@Kermos said:
I'm already doing exactly that and it is what had stabilized me into the 500ish kb/second range.
Since I'm only using 16 bit RGB on the target, I can drop off the lower 2 bits on the YCC data without noticeable loss in quality. That allows me to use the lower 2 bits for other information. So in my case, when I see 6 0's (completely unchanged 2x2 block), I send a 0x01 in its place.
Thanks :)

Hmm, I'd have to actually look at example data to try and figure out specific ways to compress it more. One idea would be if there are lots of low numbers (pixels changing color only slightly) then change those numbers to 0 up to a certain tolerance and then send the change after it.

For instance:
if a pixel has {1000, 1002, 1005, 1007, 1007, 1006, 1009, 100B, 100D, 1040}
and your tolerance is 10 send:
{1000, 0x6, B, 0, 35}

I've considered this, but so far not tried it due to the fact that it increases my complexity on the encoder.

Right now, after computing the deltas, I can simply take the incoming camera frame and assign it to the reference frame to use the next time around. If I start using tolerances as you're describing, I can no longer do that, as the reference frame would no longer match the receiver's. I would then have to explicitly perform the same operation on every byte that the client does in order for my reference frame to match.

Also, there is another issue.

Huffman coding that encodes only 1 byte at a time can at best compress at a ratio of 8:1, because even with the shortest key of 1 bit, you can't put more than 8 items in a byte. If the value for the 1-bit key is 1 byte in size, that means you can't stuff more than 8 bytes into 1 byte.

Applying that to my data means:

1 frame in YCC420 is 115200 bytes, which means the most optimal compression (assuming all bytes had the same value) would be 14400 bytes the way things are right now. That times 30 frames a second is roughly 421kb/second. I'm currently at around 500, so I'm really close to that.
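The same arithmetic as a quick check. The frame size and frame rate are from the post; reading "kb" as 1024 bytes is an assumption, and the Huffman table overhead is ignored here.

```python
FRAME_BYTES_420 = 115200   # from the post: one YCC 4:2:0 frame
FPS = 30
BITS_PER_BYTE = 8
MIN_BITS_PER_SYMBOL = 1    # shortest possible Huffman code word

best_ratio = BITS_PER_BYTE // MIN_BITS_PER_SYMBOL
best_frame_bytes = FRAME_BYTES_420 // best_ratio
best_bytes_per_second = best_frame_bytes * FPS

print(best_ratio)                    # 8
print(best_frame_bytes)              # 14400
print(best_bytes_per_second)         # 432000
print(best_bytes_per_second / 1024)  # 421.875
```

As a side note, 115200 bytes per 4:2:0 frame corresponds to 320x240 at 1.5 bytes per pixel.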

I've tried applying run-length encoding prior to the huffman coding but it doesn't help much and the above reason explains why. So for that reason, I'm really focusing right now on expanding my huffman encoder to look at more than 1 byte at a time. It's almost working, but I seem to have some bug in there somewhere that's causing occasional data corruption I gotta find. :)

Weng

Have you looked into MPEG4/xvid encoding? And your worries about GPL pollution are a bit unfounded - as long as you build it separately and interface with it as an external module, you'll be fine - no worse than basing the entire platform itself on Linux.