How the heck does Google Reader do it?



  • I'm guessing the answer is "millions of dollars worth of servers", but bear with me here.

    So Google Reader is closing down because Google is evil and horrible and wants to discourage me from ever using their products ever again. Fine. I've made my peace with this. Those idiots.

    I looked into creating my own little personal Google Reader clone as a result, figuring I could host it on Amazon AWS and maybe if it turned out any good in 6-10 months I could actually productize it and put some ads on it or sell access or something. I quickly coded up a quick and dirty app that could basically make web API calls to load RSS feeds and shove them back at the client. I came up with a database schema to download and store the RSS items for quick access (and historical searching). So far so good, and it works... kinda.

    Issue 1: My Google Reader list is about 150 feeds. When I make my API call, the poor server has to ping 150 other servers to get the most recent feed items. Now there's a few ways I could optimize this; for example, if the database says the last feed refresh was less than 5 minutes ago, don't bother to do a new fetch and just serve up what's in the DB already. But doing this with 150 feeds is simply not going to be sustainable in the long run.

    Any ideas on how Google's doing that work and returning the goddamned data so goddamned quickly? I was thinking you should "shard" databases so that, for example, users created before 5/1/2013 get database A and users after 5/1/2013 are on database B, which if managed correctly could keep the load manageable. But the disadvantage is they can't share the same data warehouse of downloaded feed items, so you have a ton of duplicate data, and of course the hosting would quickly become expensive.

    Issue 2: Keeping track of read and unread items. How do you do this so the storage of your "unread list" isn't insane? My solution is that instead of storing a database row for each item you've read (which is obviously unsustainable), what you could do is just keep track of the *range* of read items in each feed-- for example for CNN.com's feed, you might say "this user has read every item from 1/1/2013 to 5/1/2013". You'd allow multiples so you could say, "this user has read every item from 1/1/2013 to 3/1/2013, and every item from 4/1/2013 to 5/1/2013". This data could be passed directly to the front-end which could mark read and unread using it.

    But how to handle items the user specifically marks as unread (or "stars" for that matter)? My idea is that you could have a separate unread list that would contain exceptions to the main list. So your dialog with the DB would be like "this user has read every item from 1/1/2013 to 5/1/2013 except items 2031, 2034 and 2055". (Except the numbers would be the GUIDs assigned to the feed items.)
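
    A minimal sketch of the ranges-plus-exceptions idea, assuming each item has a GUID and a publish date. The `ReadState` class and its method names are made up for illustration, and the state lives in memory rather than in a database.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ReadState:
    """Read state for one user and one feed: date ranges plus exceptions."""
    ranges: list = field(default_factory=list)       # (start, end) dates, inclusive
    unread_guids: set = field(default_factory=set)   # items manually marked unread

    def mark_range_read(self, start: date, end: date) -> None:
        """Add a read range, merging it with any ranges it overlaps."""
        merged = []
        for s, e in sorted(self.ranges + [(start, end)]):
            if merged and s <= merged[-1][1]:
                # Overlaps the previous range, so extend that one.
                merged[-1] = (merged[-1][0], max(merged[-1][1], e))
            else:
                merged.append((s, e))
        self.ranges = merged

    def is_read(self, guid: str, published: date) -> bool:
        """Exceptions win over ranges."""
        if guid in self.unread_guids:
            return False
        return any(s <= published <= e for s, e in self.ranges)
```

    Merging on insert keeps the range list short for the "reads everything" user, which is the case the whole scheme is betting on.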

    Is this the best way of doing it? Can you think of a better way?

    Issue 3: Obviously feeds can be updated even when nobody's logged-in to the site or reading the site. The problem is feeds only return the last X items, so if you don't check regularly you could miss items, which is no good. Due to this I know I need some kind of server-side process to periodically check feeds to ensure my own database has the full list of items. What would be a good algorithm to determine how often to check feeds? I was thinking just finding the average time between feed posts, then the number of posts returned with each fetch, then basically just multiplying that out so you're checking "twice as often" as needed to get every post. On average.

    Does that sound reasonable? It runs the risk of missing items, but, eh, so does the entire RSS protocol anyway so I don't see that as such a huge deal.



  • @blakeyrat said:

    I'm guessing the answer is "millions of dollars worth of servers", but bear with me here.

    This. And they spent a lot of engineering hours optimizing the crap out of it so it could scale.

    @blakeyrat said:

    Issue 1: My Google Reader list is about 150 feeds. When I make my API call, the poor server has to ping 150 other servers to get the most recent feed items. Now there's a few ways I could optimize this; for example, if the database says the last feed refresh was less than 5 minutes ago, don't bother to do a new fetch and just serve up what's in the DB already. But doing this with 150 feeds is simply not going to be sustainable in the long run.

    Your best bet is to try to avoid pulling data "live" whenever possible. Try to pull from your local cache and keep the cache always up-to-date with background processes.

    @blakeyrat said:

    Any ideas on how Google's doing that work and returning the goddamned data so goddamned quickly? I was thinking you should "shard" databases so that, for example, users created before 5/1/2013 get database A and users after 5/1/2013 are on database B, which if managed correctly could keep the load manageable. But the disadvantage is they can't share the same data warehouse of downloaded feed items, so you have a ton of duplicate data, and of course the hosting would quickly become expensive.

    Shard by feed and user. User-specific data (which feeds they're subscribed to, read/unread, etc.) can be sharded by date the user was created. The feeds themselves can also be sharded the same way, with the actual historical feed data being split up based on when the feed was added to the system (and tracked in a "master" database of all feeds.)
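
    As a rough sketch of date-based routing: the shard names and the boundary date below are placeholders taken from the example in the question, and a real "master" lookup would live in a database rather than in two lists.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical layout: each shard holds records created on or after its start date.
SHARD_STARTS = [date(2000, 1, 1), date(2013, 5, 1)]
SHARD_NAMES = ["database_a", "database_b"]


def shard_for(created: date) -> str:
    """Pick the shard whose start date is the latest one not after `created`."""
    index = bisect_right(SHARD_STARTS, created) - 1
    if index < 0:
        raise ValueError("creation date precedes the first shard")
    return SHARD_NAMES[index]
```

    The same function could route users by sign-up date and feeds by the date they were added, as long as each has its own boundary list.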

    Also keep in mind: Google's probably not only using relational databases for this. They're probably using NoSQL-type datastores, so it might be worth looking into something like Redis. NoSQL makes it easier to work with large, distributed datasets.

    @blakeyrat said:

    Issue 2: Keeping track of read and unread items. How do you do this so the storage of your "unread list" isn't insane? My solution is that instead of storing a database row for each item you've read (which is obviously unsustainable), what you could do is just keep track of the range of read items in each feed-- for example for CNN.com's feed, you might say "this user has read every item from 1/1/2013 to 5/1/2013". You'd allow multiples so you could say, "this user has read every item from 1/1/2013 to 3/1/2013, and every item from 4/1/2013 to 5/1/2013". This data could be passed directly to the front-end which could mark read and unread using it.

    But how to handle items the user specifically marks as unread (or "stars" for that matter)? My idea is that you could have a separate unread list that would contain exceptions to the main list. So your dialog with the DB would be like "this user has read every item from 1/1/2013 to 5/1/2013 except items 2031, 2034 and 2055". (Except the numbers would be the GUIDs assigned to the feed items.)

    Ranges can work, but also can quickly become cumbersome. Another possibility would be to simply have a list of every read or starred item and split it up by month. This is a situation where NoSQL would be ideal: have an object with the key "USERID::2012-05" which would be the list of read or starred items for May 2012 for that user ID. Most NoSQL databases will hash the key and use that to determine which node it lives on (basically a distributed hash table), which automatically shards the data evenly for you. Since you're only storing a month-per-key, it's a much smaller dataset to work with. This also means you're not having to mess with querying a huge number of read or starred items when you're just showing the most recent feeds. And since your data is almost always browsed linearly, by date, it's pretty easy to move backwards through the months, querying keys as needed.
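
    Something like this, as a sketch only: a plain dict stands in for the key-value store, and the function names are invented. The key format is the "USERID::2012-05" one described above.

```python
from collections import defaultdict
from datetime import date

# Stand-in for the NoSQL store: key -> set of item GUIDs.
store = defaultdict(set)


def bucket_key(user_id: str, day: date) -> str:
    """Build the per-user, per-month key, e.g. '42::2012-05'."""
    return f"{user_id}::{day:%Y-%m}"


def mark_read(user_id: str, item_guid: str, published: date) -> None:
    """Record a read item in the bucket for the month it was published."""
    store[bucket_key(user_id, published)].add(item_guid)


def read_items_for_month(user_id: str, year: int, month: int) -> set:
    """Fetch one month's bucket without creating an empty one."""
    return store.get(bucket_key(user_id, date(year, month, 1)), set())
```

    Showing the most recent items only ever touches the current month's key, and paging back through history is one key lookup per month.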

    One wrinkle is if you have "jump to page links", since trying to calculate where Page 5 or Page 127 are will be kind of a pain, but it's also going to be a bit of a pain with SQL, and a relational database is going to have trouble scaling to handle billions of tiny rows easily.

    @blakeyrat said:

    Issue 3: Obviously feeds can be updated even when nobody's logged-in to the site or reading the site. The problem is feeds only return the last X items, so if you don't check regularly you could miss items, which is no good. Due to this I know I need some kind of server-side process to periodically check feeds to ensure my own database has the full list of items. What would be a good algorithm to determine how often to check feeds? I was thinking just finding the average time between feed posts, then the number of posts returned with each fetch, then basically just multiplying that out so you're checking "twice as often" as needed to get every post. On average.

    Your server-side job to keep feeds up-to-date is going to be critical to this working well, I think. I would try to only pull from the cache and never do a "live" lookup, especially since someone might have dozens of feeds and that's going to be very slow. Since you don't want feeds to be stale, checking frequently is important. You want to time it so you can get every new feed post soon after it is published, while doing as few lookups as possible. I would probably do something like taking the average time between posts you've calculated, and checking that often. However, be careful to have a lower bound on it, since a lot of people will get pissy if you're checking more frequently than every 15 mins.
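
    A sketch of how the two suggestions could combine, under my own assumptions: poll at the average gap between posts, never wait longer than half the time it takes the feed's window to roll over, and never poll more often than every 15 minutes. The function names are hypothetical.

```python
from datetime import datetime, timedelta

MIN_INTERVAL = timedelta(minutes=15)  # politeness floor from the discussion


def average_gap(post_times):
    """Mean time between consecutive posts; needs at least two timestamps."""
    if len(post_times) < 2:
        raise ValueError("need at least two post timestamps")
    ordered = sorted(post_times)
    return (ordered[-1] - ordered[0]) / (len(ordered) - 1)


def poll_interval(post_times, items_per_fetch, min_interval=MIN_INTERVAL):
    """Choose how long to wait between checks of one feed."""
    gap = average_gap(post_times)
    rollover = gap * items_per_fetch       # time for the feed window to turn over
    candidate = min(gap, rollover / 2)     # "twice as often as needed" ceiling
    return max(min_interval, candidate)    # the floor wins, even for busy feeds
```

    Note that the floor can still lose items on a feed that posts faster than its window allows; that is the trade-off the 15-minute rule implies.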

    However, if you do get people using your site, then you're actually a big help to feed publishers since you can check their feed once and serve it to lots of people, instead of having those people each running a feed reader that's polling every 15 minutes.



  • @morbiuswilters said:

    Your best bet is to try to avoid pulling data "live" whenever possible. Try to pull from your local cache and keep the cache always up-to-date with background processes.

    Roger, so you're thinking when I log in with my 150 feeds, the first query should only pull the cached shit, and refreshing those feeds should go into a queue for some other process to do, then the front-end should re-query in 15 seconds or something to get the new new items.

    That has the disadvantage that the items won't show in order if you're viewing multiple feeds in the same ... aggregated feed? I guess would be the term? But that's fairly minor, as long as the read/unread markers are ok.
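
    Roughly what I have in mind, as a sketch: the dict and deque below are in-memory stand-ins for the database and the work queue, and the 5-minute staleness threshold is the one from my first post.

```python
from collections import deque
from datetime import datetime, timedelta

STALE_AFTER = timedelta(minutes=5)

# Hypothetical stand-ins for the feed-item table and the refresh queue.
cache = {}               # feed_url -> {"fetched_at": datetime, "items": [...]}
refresh_queue = deque()  # feed URLs waiting for a background fetch


def get_cached_items(feed_urls, now):
    """Return whatever is cached right away and queue stale feeds for refresh."""
    items = []
    for url in feed_urls:
        entry = cache.get(url)
        if entry is not None:
            items.extend(entry["items"])
        if entry is None or now - entry["fetched_at"] > STALE_AFTER:
            if url not in refresh_queue:
                refresh_queue.append(url)
    return items
```

    The request never waits on a remote server; a separate worker drains `refresh_queue`, and the front-end asks again a few seconds later.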

    @morbiuswilters said:

    Also keep in mind: Google's probably not only using relational databases for this. They're probably using NoSQL-type datastores, so it might be worth looking into something like Redis. NoSQL makes it easier to work with large, distributed datasets.

    I get that of course, but my budget for this is basically $0. So I want something I can get in place quickly with very little outlay of moneys... unfortunately for AWS, that means Linux VMs with MySQL databases. (They have MS SQL available of course, but it's many times more expensive.)

    Amazon RedShift would be *perfect* for this application, but we're talking $7500/year. Unless you pay for 3 years in advance, that's the only way it becomes reasonable ("reasonable" meaning $2000/year.) That leaves me with either MySQL, or something like MongoDB I could set up and maintain myself. Which frankly isn't gonna happen.

    @morbiuswilters said:

    Ranges can work, but also can quickly become cumbersome. Another possibility would be to simply have a list of every read or starred item and split it up by month. This is a situation where NoSQL would be ideal: have an object with the key "USERID::2012-05" which would be the list of read or starred items for May 2012 for that user ID.

    That would be pretty storage-heavy? The "normal" situation is every feed is 100% read at all times, except since the last time the user logged-in, so I think I'm still leaning towards the date ranges.

    @morbiuswilters said:

    One wrinkle is if you have "jump to page links", since trying to calculate where Page 5 or Page 127 are will be kind of a pain, but it's also going to be a bit of a pain with SQL, and a relational database is going to have trouble scaling to handle billions of tiny rows easily.

    Google Reader don't do it so I don't neither.

    @morbiuswilters said:

    However, if you do get people using your site, then you're actually a big help to feed publishers since you can check their feed once and serve it to lots of people, instead of having those people each running a feed reader that's polling every 15 minutes.

    Yeah, that's the goal. Well, and enabling people to search feeds back as far as I have data for them.



  • @blakeyrat said:

    Roger, so you're thinking when I log in with my 150 feeds, the first query should only pull the cached shit, and refreshing those feeds should go into a queue for some other process to do, then the front-end should re-query in 15 seconds or something to get the new new items.

    I'd probably just keep some threads always running in the background to constantly keep feeds up-to-date. But your way works, too. Your way is better if you don't expect feeds to be read as often (so you're not doing tens of thousands of requests for a feed that's glanced at once-a-week). Mine makes it a bit more up-to-date if your feeds are read regularly.

    @blakeyrat said:

    I get that of course, but my budget for this is basically $0. So I want something I can get in place quickly with very little outlay of moneys... unfortunately for AWS, that means Linux VMs with MySQL databases.

    Redis is free, but if you don't want to maintain it, MySQL's fine. Honestly, MySQL's gonna support you until you have a lot of feeds and users, so if it does start taking off you'll have the ability to go back and refactor into something that scales better than MySQL.

    @blakeyrat said:

    That would be pretty storage-heavy? The "normal" situation is every feed is 100% read at all times, except since the last time the user logged-in, so I think I'm still leaning towards the date ranges.

    Yeah, if things will never be read out-of-order, then ranges are best. I was thinking it would end up about even with storing a marker for each read post if you had lots of broken-up ranges, but that probably won't happen in the real world. People are either reading regularly or they're not, and either way you get big, contiguous ranges.

    @blakeyrat said:

    Yeah, that's the goal. Well, and enabling people to search feeds back as far as I have data for them.

    Searching in MySQL is another thing that won't scale well, but should work on a small scale. If you do get to the point where fulltext MySQL searches are too slow, there's stuff like Lucene and Solr.



  • @morbiuswilters said:

    I'd probably just keep some threads always running in the background to constantly keep feeds up-to-date. But your way works, too. Your way is better if you don't expect feeds to be read as often (so you're not doing tens of thousands of requests for a feed that's glanced at once-a-week). Mine makes it a bit more up-to-date if your feeds are read regularly.

    Yeah, but remember I'm going to get to the point where there's 340,000 stored feeds and 24 logged-in users (looking at maybe 2000 feeds); I want the logged-in users to get priority on feed updates. (i.e. their feeds should update every 15 mins on the dot, the other feeds just need to update enough so we're not missing items.)

    Coming up with an algorithm to ensure feed updates is going to be interesting work, especially when I scale horizontally and there could be multiple servers stomping over themselves doing it...
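
    A first stab at the single-server version, as a sketch: a heap ordered by next-due time, where feeds with a logged-in reader get the 15-minute interval and everything else uses its own slower interval. All names here are made up, and it does nothing about multiple servers claiming the same feed.

```python
import heapq
from datetime import datetime, timedelta

ACTIVE_INTERVAL = timedelta(minutes=15)


def next_due(last_fetched, background_interval, has_active_reader):
    """Feeds someone is looking at refresh every 15 minutes; others wait longer."""
    interval = ACTIVE_INTERVAL if has_active_reader else background_interval
    return last_fetched + interval


def build_schedule(feeds):
    """feeds: iterable of (url, last_fetched, background_interval, has_active_reader)."""
    heap = []
    for url, last_fetched, interval, active in feeds:
        heapq.heappush(heap, (next_due(last_fetched, interval, active), url))
    return heap


def pop_due(heap, now):
    """Remove and return the URLs whose due time has passed, earliest first."""
    due = []
    while heap and heap[0][0] <= now:
        due.append(heapq.heappop(heap)[1])
    return due
```

    Scaling out would need some kind of lease or lock per feed so two workers don't fetch the same one; that part is not shown.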



  • @blakeyrat said:

    Issue 2: Keeping track of read and unread items. How do you do this so the storage of your "unread list" isn't insane? My solution is that instead of storing a database row for each item you've read (which is obviously unsustainable), what you could do is just keep track of the range of read items in each feed-- for example for CNN.com's feed, you might say "this user has read every item from 1/1/2013 to 5/1/2013". You'd allow multiples so you could say, "this user has read every item from 1/1/2013 to 3/1/2013, and every item from 4/1/2013 to 5/1/2013". This data could be passed directly to the front-end which could mark read and unread using it.

    I'm not sure that'd work - You may get readers who only read 33% of the items from a feed, you may get readers who read 95% of them. If you create a row for every article for the 95% readers then you will indeed end up unsustainable, but on the flip side you'd have a colossal exception list of unread articles for the 33% readers which generates an unsustainable number of rows too, or you'd have to toil away with converting ranges with exceptions to an unsustainable list of unread articles.

    Obviously, a one-man project with a budget of $0 isn't going to compete with Google's attempts, but (and I've never used Google Reader so this is conjecture) perhaps the problems you're addressing here are why Google has killed it? If they don't drive any revenue from it and its performance deteriorates maybe they've taken the bad press now rather than endure a bad reputation over a longer term.



  • @nosliwmas said:

    perhaps the problems you're addressing here are why Google has killed it? If they don't drive any revenue from it

    BTW this is exactly why it pisses me off so much: Google doesn't drive any revenue from it because they never tried. If Google's concern was "hey Google Reader's expensive/difficult to run, and we don't gain anything from it", why would the default decision be, "shut it down" instead of "monetize it"?

    THEY NEVER EVEN TRIED TO MONETIZE IT. They don't let me pay cash money to keep my account alive. They don't put ads on it. They don't have "premium content". So yes, of course it's unprofitable, duh.

    GRUMP! End of rant.

    Anyway, as for this point:

    @nosliwmas said:

    I'm not sure that'd work - You may get readers who only read 33% of the items from a feed, you may get readers who read 95% of them. If you create a row for every article for the 95% readers then you will indeed end up unsustainable, but on the flip side you'd have a colossal exception list of unread articles for the 33% readers which generates an unsustainable number of rows too, or you'd have to toil away with converting ranges with exceptions to an unsustainable list of unread articles.

    I'm counting on the above-mentioned assumption that people keep their feeds read most of the time and won't use the star/bookmark/whatever feature incredibly often. It might turn out these assumptions are wrong and stupid, but I won't know until I have customers, if I ever have customers.



  • STOP.

    STOP.

    STOP.

    Reader used pubsubhubbub. Don't poll. Push.


  • @Ben L. said:

    Reader used pubsubhubbub. Don't poll. Push.

    What the fuck is this bullshit.



  • @Ben L. said:

    STOP.

    STOP.

    STOP.

    Reader used pubsubhubbub. Don't poll. Push.

    Yeah, that works if you have a hub server that you can subscribe to that has every feed you want on it. If not, then you're going to still have to poll because RSS wasn't meant to be used.

    And I'm going to go out on a limb here and assume Reader was able to utilize pushes for maybe 10% of their feeds. I honestly have no idea how many of their feeds were pushed, but I'm betting most people weren't installing pubsubhubbub on their blogs or sites.
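
    For anyone wondering what subscribing actually involves, here is a sketch from my recollection of the PubSubHubbub spec: the subscriber POSTs a form with `hub.mode`, `hub.topic` and `hub.callback` to the hub, and the hub then sends a GET to the callback with a `hub.challenge` that must be echoed back. The function names are mine, and the exact required fields differ between spec versions, so check the one your hub implements.

```python
from urllib.parse import parse_qs, urlencode


def subscription_body(topic_url, callback_url):
    """Form-encoded body for the subscribe POST sent to the hub."""
    return urlencode({
        "hub.mode": "subscribe",
        "hub.topic": topic_url,        # the feed you want pushed to you
        "hub.callback": callback_url,  # your endpoint the hub will call
    })


def verification_response(query_string, expected_topic):
    """Return the challenge to echo for a valid verification GET, else None."""
    params = parse_qs(query_string)
    if params.get("hub.topic", [None])[0] != expected_topic:
        return None  # not a subscription we asked for
    return params.get("hub.challenge", [None])[0]
```

    None of this helps for feeds that don't advertise a hub, which is the point above: those still have to be polled.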



  • @blakeyrat said:

    @Ben L. said:
    Reader used pubsubhubbub. Don't poll. Push.

    What the fuck is this bullshit.

    Here is a list of hubs you can use!

    Please ignore the fact that all comments on that page made in the last 2 years are spam. Spam that nobody has bothered deleting.



  • @morbiuswilters said:

    And I'm going to go out on a limb here and assume Reader was able to utilize pushes for maybe 10% of their feeds. I honestly have no idea how many of their feeds were pushed, but I'm betting most people weren't installing pubsubhubbub on their blogs or sites.

    More to the point, it was impossible to have a web page "subscribe" until web sockets came along very recently. So whatever they were doing was a nasty hack, probably pinging a server every 10 seconds or something stupid.

    And websockets won't work for me because I have to scale horizontally-- i.e. every server has to be able to respond to every request.



  • @morbiuswilters said:

    Redis is free,

    BTW Morbs, I've decided to make this a learning experience and try out Redis as my database.

    Edit: oh wait they don't support Windows. Well fuck them then.



  • @blakeyrat said:

    @morbiuswilters said:
    Redis is free,

    BTW Morbs, I've decided to make this a learning experience and try out Redis as my database.

    Edit: oh wait they don't support Windows. Well fuck them then.

    Couchbase is free AND supports Windows.


  • @morbiuswilters said:

    RSS wasn't meant to be used

    QFT



  • @Ben L. said:

    Couchbase is free AND supports Windows.

    Since MongoDB won't fucking install as a service without shitting all over itself, and also shits files all over my home directory*, I will try this option.

    *) WHY DO ALL OPEN SOURCE PROJECTS GET THIS FUCKING WRONG!???!?!? IS THERE A SINGLE FUCKING HUMAN FUCKING BEING IN THE ENTIRE FUCKING OPEN SOURCE ECOSYSTEM WHO KNOWS HOW FUCKING THE MOST POPULAR OS ON THE PLANET'S FUCKING PERMISSIONS WORK!? JESUS FUCK I WANT TO MURDER ERVERYBODFYADIY WI! HEY GUYS YOU KNOW WHY I ALWAYS CALL OPEN SOURCE PROGRAMS SHIT? IT'S BECAUSE THEY'RE ALL SHIT! TURDS! FUCKING USELESS BUGGY PIECES OF SHIT!!!!



  • What's the difference between Couchbase and Apache CouchDB? Are they basically two ports of the same code or what's going on here?



  • Ok I answered my own question. Couchbase is basically an older product named "Membase" renamed when it was made (more-or-less) compatible with CouchDB.

    It's also not free for commercial use, and the license agreement has a lot of scary legalese. (Basically, you can only put it on 2 non-production servers before you need to buy a license. If you are using Couchbase at all, they have the right to audit your servers to ensure compliance-- yikes!)

    This shockingly-not-crap explanation at Stack Overflow lays it all out.

    I have to admit CouchDB looks pretty exciting. In theory at least, other than the back-end feed updater service, I could write the entire app in JS and have it talk *directly* to the server for all its needs-- that's pretty damned slick honestly. And being Apache, it's free-free.



  • @blakeyrat said:

    I looked into creating my own little personal Google Reader clone as a result, figuring I could host it on Amazon AWS and maybe if it turned out any good in 6-10 months I could actually productize it and put some ads on it or sell access or something.

    If you're after a self-hosting solution, try TT-RSS - it's the one I installed upon learning of Google's reluctance to carry on with their reader; never looked back.

    @blakeyrat said:

    Issue 1: My Google Reader list is about 150 feeds. When I make my API call, the poor server has to ping 150 other servers to get the most recent feed items. Now there's a few ways I could optimize this; for example, if the database says the last feed refresh was less than 5 minutes ago, don't bother to do a new fetch and just serve up what's in the DB already. But doing this with 150 feeds is simply not going to be sustainable in the long run.

    PubSubHubbub

    @blakeyrat said:

    Is this the best way of doing it? Can you think of a better way?

    It's a solved problem - stop reinventing pentagonal wheels.



  • @PJH said:

    If you're after a self-hosting solution, try TT-RSS

    Wow what a jackass. I wouldn't even use this software on principle after reading that.

    @PJH said:

    PubSubHubbub

    What the fuck is this bullshit.

    I'm sorry, does this product give you some sort of disease that prevents you from explaining what it is and how it helps the situation?

    @PJH said:

    It's a solved problem - stop reinventing pentagonal wheels.

    Well since I haven't found a Google Reader clone I like (or even one that behaves even vaguely like Google Reader itself), it's demonstrably not a solved problem.

    New thread rule: no more linking to vaguely-defined, Linux-only bullshit. The dev machine runs Windows 7. The software must run on Windows 7. This is a requirement.



  • @blakeyrat said:

    Wow what a jackass. I wouldn't even use this software on principle after reading that.

    By FOSS standards, that's not even that bad.



  • @blakeyrat said:

    Well since I haven't found a Google Reader clone I like (or even one that behaves even vaguely like Google Reader itself), it's demonstrably not a solved problem.

    New thread rule: no more linking to vaguely-defined, Linux-only bullshit. The dev machine runs Windows 7. The software must run on Windows 7. This is a requirement.

    Go take look at Feedly. I am using for past 1 year and it is better and goodier than google reader.



  • @Nagesh said:

    Go take look at Feedly. I am using for past 1 year and it is better and goodier than google reader.

    I have heard of it. I rejected it because it doesn't work on my phone. It doesn't work on my phone because it's not a web app. Which is fucking stupid.



  • @blakeyrat said:

    Issue 1: My Google Reader list is about 150 feeds. When I make my API call, the poor server has to ping 150 other servers to get the most recent feed items.

    And this is a problem exactly... why? Unless you're pinging them one after the other, how would the number of feeds affect the time your API call needs to complete? Also, dumb question, are you employing all the usual HTTP headers? Expires, conditional GET, etc etc...
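
    To illustrate both points, a sketch assuming you stored the `ETag` and `Last-Modified` values from the previous response for each feed. The `fetch_one(url, headers)` callable is a placeholder for whatever HTTP client you use; it is expected to return a `(status, body)` pair.

```python
from concurrent.futures import ThreadPoolExecutor


def conditional_headers(etag=None, last_modified=None):
    """Build request headers for a conditional GET from stored validators."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_all(feeds, fetch_one, max_workers=20):
    """Fetch every feed in parallel.

    feeds: dict mapping url -> (etag, last_modified) from the previous fetch.
    Returns a dict mapping url -> body, or None when the server answered 304.
    """
    def task(url):
        etag, last_modified = feeds[url]
        status, body = fetch_one(url, conditional_headers(etag, last_modified))
        return url, (None if status == 304 else body)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(task, feeds))
```

    With the requests running in parallel, the total time is roughly that of the slowest feed rather than the sum of all 150, and a 304 costs almost nothing to process.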

    As for databases, I've worked with CouchDB and it's pretty much written for applications like this. It can handle loads of concurrent gets and updates without slowing down and throwing in another server is trivial. It comes at the cost of having rather... peculiar transaction and query models. Or, to put it bluntly, it can't do joins. Period. The development team says it's out of scope and won't ever be implemented. In fact, thanks to their transaction model of "eventual consistency", you can't even reliably join stuff yourself in your client, after you've pulled it out of the DB. You can actually work around this for a large number of use cases (which should include yours) but if you realize you NEED joins, don't try to hack them in or it WILL drive you insane. Just switch to something else.

    @blakeyrat said:
    How the heck does Google Reader do it?

    A large fraction of blogs manage their RSS through FeedBurner. Another large fraction is hosted on Blogger. Guess which company runs both services? The Google Reader guys didn't need pubsubhubbub to get feeds pushed to them, they just had to shoot an e-mail to the neighbor department.



  • @blakeyrat said:

    @Nagesh said:
    Go take look at Feedly. I am using for past 1 year and it is better and goodier than google reader.

    I have heard of it. I rejected it because it doesn't work on my phone. It doesn't work on my phone because it's not a web app. Which is fucking stupid.

    Then clever thing is to write an app version of feedly, not go around trying build wheel, that will make you feel like Superman in your mind!



  • @Nagesh said:

    Then clever thing is to write an app version of feedly, not go around trying build wheel, that will make you feel like Superman in your mind!

    That's what this thread is about, buddy. I'm way ahead of you.

    Feeling like Superman would be huge, considering I usually just feel like Aquaman.

    @Nagesh said:

    Filed under: superman is not real.

    Fuck! I'll stick with Aquaman then.



  • @blakeyrat said:

    @Nagesh said:
    Then clever thing is to write an app version of feedly, not go around trying build wheel, that will make you feel like Superman in your mind!

    That's what this thread is about, buddy. I'm way ahead of you.

    Feeling like Superman would be huge, considering I usually just feel like Aquaman.

    @Nagesh said:

    Filed under: superman is not real.

    Fuck! I'll stick with Aquaman then.

    there is one good thing about aquaman. he will never have to look for public toilet (sulabh shauchalya)



  • Quick question for anybody still subscribing to this thread:

    What's the BEST (most technically-correct) way of stripping <SCRIPT> tags from a block of HTML? I could run the whole thing through the C# web control and traverse the DOM but that strikes me as super-slow, there has to be a better way, right?

    What else should I sanitize off of RSS-supplied HTML being fed into a DIV? Styles that aren't inline perhaps?



  • @blakeyrat said:

    Quick question for anybody still subscribing to this thread:

    What's the BEST (most technically-correct) way of stripping <SCRIPT> tags from a block of HTML? I could run the whole thing through the C# web control and traverse the DOM but that strikes me as super-slow, there has to be a better way, right?

    What else should I sanitize off of RSS-supplied HTML being fed into a DIV? Styles that aren't inline perhaps?

    For Java, NekoHTML and TagSoup work pretty well - they're specialized SAX parsers, so they'll pass a stream of "tag open" and "tag close" events to you instead of a fully constructed DOM, which does wonders for your memory usage. I don't know about similar projects for .NET, but there might be a port if you're lucky.
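
    The same event-stream idea, sketched in Python with the standard library's `html.parser` purely for illustration (it is not one of the Java libraries above). It drops `script` and `style` elements along with their contents and passes everything else through untouched, so attributes like `onclick` survive; it is a demonstration of the streaming approach, not a complete sanitizer. Comments are dropped as a side effect.

```python
from html.parser import HTMLParser

DROP_WITH_CONTENT = {"script", "style"}


class ScriptStripper(HTMLParser):
    """Re-emit the input as it streams by, skipping script/style elements."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
        elif not self.skip_depth:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>: emit once, with no separate end tag.
        if tag not in DROP_WITH_CONTENT and not self.skip_depth:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")


def strip_scripts(html):
    parser = ScriptStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

    No DOM is ever built; the parser only keeps a counter and the output buffer, which is where the memory savings come from.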


  • @blakeyrat said:

    What else should I sanitize off of RSS-supplied HTML being fed into a DIV? Styles that aren't inline perhaps?

    Of course you're already stripping the usual suspects - onXXX attributes, javascript:* URLs, data:* URLS, Iframes, Object tags, harmful inline styles etc etc - are you? Also link and meta tags which, as per the latest iteration of HTML5 now can be anywhere inside the body. Actually, I'd consider stripping everything except a whitelist. That seems to be the safest bet for now.
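
    A sketch of what the whitelist check could look like once you have a parsed tag and its attributes. The tag and attribute lists are examples I picked, not a vetted policy, and `filter_tag` is a hypothetical name.

```python
from urllib.parse import urlparse

# Example whitelist only; anything not listed is dropped.
ALLOWED_TAGS = {"a", "b", "i", "u", "em", "strong", "p", "br", "img", "pre",
                "h1", "h2", "h3", "ul", "ol", "li", "blockquote"}
ALLOWED_ATTRS = {"a": {"href", "title"}, "img": {"src", "alt", "title"}}
URL_ATTRS = {"href", "src"}
SAFE_SCHEMES = {"http", "https", ""}  # "" covers relative URLs


def url_scheme(value):
    """Scheme of a URL after removing whitespace and control characters,
    which browsers ignore inside the scheme (e.g. 'java<TAB>script:')."""
    cleaned = "".join(ch for ch in (value or "") if ch > " ")
    return urlparse(cleaned).scheme.lower()


def filter_tag(tag, attrs):
    """Return the attributes to keep for a tag, or None to drop the tag."""
    tag = tag.lower()
    if tag not in ALLOWED_TAGS:
        return None
    kept = []
    for name, value in attrs:
        name = name.lower()
        if name not in ALLOWED_ATTRS.get(tag, set()):
            continue  # drops onXXX handlers, style, and everything unlisted
        if name in URL_ATTRS and url_scheme(value) not in SAFE_SCHEMES:
            continue  # drops javascript: and data: URLs
        kept.append((name, value))
    return kept
```

    Because unknown attributes are dropped by default, event handlers and inline styles never need their own blacklist entries.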



    Alternatively, you could see whether seamless, sandboxed iframes work out for you. I'm not sure how well-supported they are yet, though.



  • You need OBJECT and EMBED for RSS feeds that embed YouTube videos.

    What are "harmful inline styles"?



  • @blakeyrat said:

    What are "harmful inline styles"?

    Stuff that affects content outside the styled element. Absolute positioning, floats, etc. Uh, don't be CS basically.



  • That could get nasty, I didn't think of that. I guess I could just axe all inline CSS referring to positioning in any way. But even then you could write CSS to set the font size to 2,000 point which would screw with stuff...



  • Looking through my feeds, it seems like Reader strips a lot out. I see header tags, links, breaks, imgs, bold/italic/underline, and some <pre> for code samples (which have coloring, so <style> tags, right? been a while for me). I think you'd be good allowing those (remove everything but color stuff from the style tags) and stripping everything else.



  • Ok coming back to this project finally.

    Here's the HTML cleaning specs I've come up with:

    Tags:
    SCRIPT - Remove
    STYLE - Remove
    OBJECT - Keep (plus all attributes) (plus all inner tags)
    EMBED - Keep (plus all attributes)
    All Others - Keep

    Attributes:
    alt - Keep
    title - Keep
    All Others - Remove

    YouTube embeds are annoying though, they're now in iframes like so:

    <iframe width="560" height="315" src="http://www.youtube.com/embed/kFPYTsqtEno?list=FL7UdKZ9ujF1sFlP93dfjMgA" frameborder="0" allowfullscreen></iframe>

    So now I'm thinking I need a whitelist of iframes to keep based on their src URL? Does that sound reasonable? Wouldn't it be a nightmare to maintain as new video streaming services came along? Maybe I should just allow *all* iframes as long as their height and width isn't ridonkulous?
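
    Both ideas (a src whitelist and a size cap) can be combined. A minimal sketch, assuming the iframe's attributes have already been parsed into a dict; the host list and the 640x480 cap are made-up values, and clean_iframe is a hypothetical helper name:

```python
from urllib.parse import urlparse

# Hypothetical whitelist and size cap; neither value comes from the thread.
ALLOWED_IFRAME_HOSTS = {"www.youtube.com", "player.vimeo.com"}
MAX_WIDTH, MAX_HEIGHT = 640, 480


def clean_iframe(attrs):
    """Return sanitized iframe attributes, or None if the iframe should be dropped."""
    parsed = urlparse(attrs.get("src", ""))
    if parsed.scheme not in ("http", "https"):
        return None
    if parsed.hostname not in ALLOWED_IFRAME_HOSTS:
        return None

    def clamp(value, limit):
        try:
            return str(max(1, min(int(value), limit)))
        except (TypeError, ValueError):
            # Missing or non-numeric sizes (e.g. "100%") fall back to the cap.
            return str(limit)

    # Only src, width and height survive; every other attribute is discarded.
    return {
        "src": attrs["src"],
        "width": clamp(attrs.get("width"), MAX_WIDTH),
        "height": clamp(attrs.get("height"), MAX_HEIGHT),
    }
```

    The maintenance worry is real: every new video host means another entry in ALLOWED_IFRAME_HOSTS. Dropping the host check and keeping only the size clamp would be the "allow all iframes" variant.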



  • A few other things you might want to do:

    • Convert all anchor URLs to absolute.
    • Add (or set) target="_blank" to all anchors.
    • Convert all img URLs to absolute.
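
    The three steps above are short with Python's urllib.parse.urljoin. This is a sketch that assumes tag attributes are already available as dicts and that base_url is the URL of the feed item or site; fix_anchor and fix_img are hypothetical helper names.

```python
from urllib.parse import urljoin


def fix_anchor(attrs, base_url):
    """Make the href absolute and force the link to open in a new tab."""
    fixed = dict(attrs)
    if "href" in fixed:
        fixed["href"] = urljoin(base_url, fixed["href"])
    fixed["target"] = "_blank"  # added if missing, overwritten if present
    return fixed


def fix_img(attrs, base_url):
    """Make the src absolute so the image loads from the feed's own host."""
    fixed = dict(attrs)
    if "src" in fixed:
        fixed["src"] = urljoin(base_url, fixed["src"])
    return fixed
```

    urljoin leaves URLs that are already absolute untouched, so it is safe to run on every anchor and image.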

    The problem with sandboxing iframes is that they don't size to the content, so for content where you have no idea about the space it needs, it's not an option.



  • Good point about relative paths, I'd hope RSS feeds didn't have them, but best not to make assumptions.

    Re: iframes - Well you know how big YouTube iframes are but that just brings back the problem of "so you need to keep track of EVERY video site ever for all time?" If I just arbitrarily set a max size and adjust the width and height properties if they exceed it, and turn on scroll bars... I dunno how else to handle that





  • @blakeyrat said:

    Good point about relative paths, I'd hope RSS feeds didn't have them, but best not to make assumptions.

    Depends on the feed. Blogs which like to put release pictures, etc. hosted on their server almost always use relative URLs.

    One other thing you might want to do is sanitize anchor URLs that resolve to something on your reader interface. So if I make a post with a URL called "http://blakyerat.com/reader/delete?user=Arnavion" (I assume you won't make delete a GET, but why take chances?) you might want to nuke that.

    @blakeyrat said:

    Re: iframes - Well you know how big YouTube iframes are but that just brings back the problem of "so you need to keep track of EVERY video site ever for all time?" If I just arbitrarily set a max size and adjust the width and height properties if they exceed it, and turn on scroll bars... I dunno how else to handle that

    Oh, I wasn't talking about your Youtube iframes. I was talking about PSWorx's suggestion earlier about putting the entire post content in one as a sandbox.

    Also, what are you using to parse the feeds? XML parser? Because a lot of misbehaved feeds (Wordpress-generated, etc.) like to have HTML entities in them, even outside CDATA sections, which your parser may not like.
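
    One possible workaround for the entity problem, sketched with Python's standard library: rewrite HTML-only named entities as numeric character references before handing the text to a strict XML parser. The function names are hypothetical, and this is a pre-processing hack rather than a proper fix.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five entities XML itself defines must be left alone.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def replace_html_entities(xml_text):
    """Turn HTML-only entities such as &nbsp; into numeric references (&#160;)."""
    def repl(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave XML entities and unknown names as-is
        return f"&#{name2codepoint[name]};"

    # Limitation: this also rewrites entities inside CDATA sections, where they
    # are literal text. That is harmless if the text is later rendered as HTML.
    return ENTITY_RE.sub(repl, xml_text)


def parse_feed(xml_text):
    return ET.fromstring(replace_html_entities(xml_text))
```

    Without the replacement step, ElementTree rejects a feed containing &nbsp; with a ParseError for an undefined entity.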



  • @Arnavion said:

    One other thing you might want to do is sanitize anchor URLs that resolve to something on your reader interface. So if I make a post with a URL called "http://blakyerat.com/reader/delete?user=Arnavion" (I assume you won't make delete a GET, but why take chances?) you might want to nuke that.

    Avoiding CSRF. You're doing it wrong.


  • @Ben L. said:

    Avoiding CSRF. You're doing it wrong.

    True, it's extremely hacky and deserves some thought, but giving it thought is exactly what we're doing here. If you have a better idea, I'm all ears.

    Oh wait, your idea is probably to bundle a Go runtime compiled to JS with the web page. Because Go is so good.



  • @Arnavion said:

    @Ben L. said:

    Avoiding CSRF. You're doing it wrong.

    True, it's extremely hacky and deserves some thought, but giving it thought is exactly what we're doing here. If you have a better idea, I'm all ears.

    Wow, you really ARE a moron.



  • @Arnavion said:

    If you have a better idea, I'm all ears.
     

    • Don't make deleting an account a GET.
    • Check cookies to make sure the user is logged in and has the appropriate permissions before deleting someone's account.
    • Don't just give those permissions to anyone, either.
    • Don't just delete an account when someone visits http://blakyerat.com/reader/delete?user=Arnavion. Send a confirmation message saying "Are you sure?" where the "Yes" button includes a randomly generated ID - so, a link to /reader/delete?user=Arnavion&confirm=486a7f58e04bdcc5. If the confirmation codes don't match, don't delete the account.
    • I'm forgetting something. But this is not a bad place to start.
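
    The checklist above could look something like this in code. It is a framework-free sketch with assumptions: the in-memory dict stands in for real session storage, session_id stands in for "logged in with the right permissions", and both function names are hypothetical.

```python
import hmac
import secrets

# session_id -> pending confirmation token (hypothetical in-memory store)
_pending = {}


def issue_confirm_token(session_id):
    """Called when rendering the "Are you sure?" page for a logged-in user."""
    token = secrets.token_hex(8)  # 16 hex characters, randomly generated
    _pending[session_id] = token
    return token


def confirm_delete(method, session_id, submitted_token):
    """Return True only if the delete request should be honored."""
    if method != "POST":
        return False  # deleting is never a GET
    expected = _pending.pop(session_id, None)  # tokens are single-use
    if expected is None or not submitted_token:
        return False  # no session, or no confirmation page was ever shown
    # Constant-time comparison avoids leaking the token through timing.
    return hmac.compare_digest(expected.encode(), submitted_token.encode())
```

    Because the token is tied to the session and consumed on first use, a link planted in someone else's page or feed cannot guess or replay it.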

     



  • @Snowyowl said:

    @Arnavion said:

    If you have a better idea, I'm all ears.
     

    • Don't make deleting an account a GET.
    • Check cookies to make sure the user is logged in and has the appropriate permissions before deleting someone's account.
    • Don't just give those permissions to anyone, either.
    • Don't just delete an account when someone visits http://blakyerat.com/reader/delete?user=Arnavion. Send a confirmation message saying "Are you sure?" where the "Yes" button includes a randomly generated ID - so, a link to /reader/delete?user=Arnavion&confirm=486a7f58e04bdcc5. If the confirmation codes don't match, don't delete the account.
    • I'm forgetting something. But this is not a bad place to start.

     


    @Arnavion said:

    (I assume you won't make delete a GET, *but why take chances*?)

    The point was not how to secure the delete API, but how to sanitize the feed HTML to make sure that even *if* there is a slip in the API, the feed HTML can't exploit it. Defensive coding and all that. I'm fully aware that this is not an excuse to not protect the API from CSRF in the first place.



  • @Arnavion said:

    The point was not how to secure the delete API, but how to sanitize the feed HTML to make sure that even if there is a slip in the API, the feed HTML can't exploit it. Defensive coding and all that. I'm fully aware that this is not an excuse to not protect the API from CSRF in the first place.

    If they put a link like that in an RSS feed, that's their problem, not mine. An RSS reader reads RSS items, it's not "magical universal fix every site's security issues for them magically" software. In any case, I don't have any way of telling apart a GET param that deletes an account from one that, say, displays a particular image.



  • @blakeyrat said:

    An RSS reader reads RSS items, it's not "magical universal fix every site's security issues for them magically" software.

    Well the "every site" is your site.

    @blakeyrat said:

    In any case, I don't have any way of telling apart a GET param that deletes an account from one that, say, displays a particular image.

    Perhaps I wasn't explaining myself clearly.

    Your site is http://blakeyrat.com/reader/ . Suppose you screw up and have some API - http://blakeyrat.com/reader/settings/deleteme that deletes the currently logged in user - that is callable via a GET (it should really be a POST with appropriate CSRF protection, but we're assuming you screwed up really bad).

    Somebody puts a link to http://blakeyrat.com/reader/settings/deleteme on their site, http://evil.com/ If a user clicks it on their site while they're logged in to yours, they end up having their account deleted.

    Suppose you had the foresight to check that all requests to your API (and I assume all your API is under ~/settings/) have a referer header that's your site. Now somebody puts a link to http://blakeyrat.com/reader/settings/deleteme on their site, http://evil.com/ and a user subscribes to their feed. You end up embedding the anchor in your own site. The referer won't protect you.

    Of course having a CSRF token protects you. And of course having a referer check protects you. And making management API endpoints require POST instead of GET. But unless you're completely confident about your code, how sure are you that you didn't screw up and expose some endpoint that can be called from all the third-party unsafe HTML that you're embedding?

    You're embedding third-party HTML directly into your web page and thus bypassing basically every protection the browser gives your users. You have to be in the mindset that the contents of the feed that you end up showing on your site are as if you'd written them yourself. It pays to be a bit paranoid for the sake of the user. Stripping all links from the feed that point to your site gives you a tiny and redundant safety measure on top of all the measures you already should have.
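
    The link check itself is small. A sketch, assuming the reader lives on blakeyrat.com as in the example above; points_at_reader is a hypothetical helper, and the sanitizer would drop or neuter any anchor for which it returns True.

```python
from urllib.parse import urljoin, urlparse

READER_HOST = "blakeyrat.com"  # the reader's own host, taken from the example above


def points_at_reader(href, feed_base_url):
    """True if a feed-supplied link resolves to the reader's own site."""
    # Resolve relative and protocol-relative links against the feed's URL first.
    host = urlparse(urljoin(feed_base_url, href)).hostname or ""
    # Match the host itself and any subdomain, but not lookalike domains.
    return host == READER_HOST or host.endswith("." + READER_HOST)
```

    Comparing parsed hostnames rather than doing a substring search on the URL avoids being fooled by things like http://evil.com/?x=blakeyrat.com.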



  • @Arnavion said:

    Your site is http://blakeyrat.com/reader/ . Suppose you screw up and have some API - http://blakeyrat.com/reader/settings/deleteme that deletes the currently logged in user - that is callable via a GET (it should really be a POST with appropriate CSRF protection, but we're assuming you screwed up really bad).

    ... why would you assume I'm that stupid? I mean I get what you're saying, but come on man. Duh.

    @Arnavion said:

    Stripping all links from the feed that point to your site gives you a tiny and redundant safety measure on top of all the measures you already should have.

    I would "fix" this to avoid 404 errors from badly-written RSS content, not because I'm afraid of the security implications for my own site.



  • @blakeyrat said:

    ... why would you assume I'm that stupid? I mean I get what you're saying, but come on man. Duh.

    I wasn't. Don't take it that way. When it comes to security I always treat myself as stupid so that I make sure to be really careful. (The same attitude applies to programming in general, I think.)



  • @blakeyrat said:

    Here's the HTML cleaning specs I've come up with:

    Tags:
    SCRIPT - Remove
    STYLE - Remove
    OBJECT - Keep (plus all attributes) (plus all inner tags)
    EMBED - Keep (plus all attributes)
    All Others - Keep

    Attributes:
    alt - Keep
    title - Keep
    All Others - Remove

    Perhaps you also want to keep width and height on images?

    For <video> and <audio> tags you either want to remove them altogether or strip the autoplay attribute (it's a boolean attribute, so even autoplay="false" still autoplays).

    I went and implemented a whitelist-based solution instead for my own project. Let's see how it turns out.

    One annoyance I noted is that img tags with width set to "100%" (a common thing in the feeds I follow) end up being xboxhueg. Reader doesn't have this problem because it typesets feeds about 500px wide. Mine just gives the whole horizontal width to the feed so the images tend to become almost a screen in height. I'm contemplating also searching for percentage width/height and changing them to half their value or something...



  • @Arnavion said:

    Perhaps you also want to keep width and height on images?

    I thought about that but it means diving into the style attribute and making/finding a CSS parser, so I might be too lazy for that.

    @Arnavion said:

    For <video> and <audio> tags you either want to remove them altogether or strip the autoplay attribute

    Good catch, these new-fangled HTMLs!!! Give me my good ol' HTML 2.0 when a man was a man!

    @Arnavion said:

    I'm contemplating also searching for percentage width/height and changing them to half their value or something...

    Create a max pixel width and just change the percentage to the pixel width, if it's 100%. Of course if it's like 98% then you're still screwed, but I'm guessing that's pretty rare.
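
    A sketch of that idea, slightly generalized so that 98% is handled too: every percentage is scaled against the cap instead of special-casing 100%. The 500px cap is an assumption borrowed from the Reader column width mentioned earlier, and normalize_width is a hypothetical helper name.

```python
import re

MAX_IMG_WIDTH_PX = 500  # assumed cap, roughly the width Reader typesets feeds at


def normalize_width(value, max_px=MAX_IMG_WIDTH_PX):
    """Convert an img width attribute to a pixel count no larger than max_px.

    Returns None when the value isn't understood, so the attribute can be dropped.
    """
    value = (value or "").strip()
    percent = re.fullmatch(r"(\d+(?:\.\d+)?)%", value)
    if percent:
        # Treat the cap as the "100%" reference width.
        fraction = min(float(percent.group(1)), 100.0) / 100
        return str(round(max_px * fraction))
    pixels = re.fullmatch(r"(\d+)(?:px)?", value)
    if pixels:
        return str(min(int(pixels.group(1)), max_px))
    return None
```

    For example, normalize_width("100%") gives "500" and normalize_width("1200") is clamped to "500", while a width of "300" passes through unchanged.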

    What language are you building your solution in?

