Messing Around With Too Much Data



  • So, I'm at work now, and don't have time to go get my code and show it off, but I decided to try out ASP.Net Core 2 and try to build a site for consuming the Path of Exile public stash tab API.

    Basically, the developers of Path of Exile do not want to make trading too easy, and want player interaction in every transaction, so they do not intend to add an auction house to the game. To make things easier, though, they have made it possible to query a web API for all the stash tabs people have marked as public, and see the items they contain and their preferred prices. A good example of a use of this data is the highly used http://poe.trade or http://poe.ninja .

    So, I decided to try making a generator, which is an IEnumerable that, when it runs out of tabs, asks for the next set of tabs from the API again.

    Now, if you look at the above posted poe.ninja, you'll notice that the full API is rather large. I have no idea how someone can possibly deal with that many items! Even querying for items of a specific base type would use up insane amounts of memory.

    How do you deal with a monstrous dataset like that?



  • Without fully understanding what you're asking, my suggestions would be:

    • Use a database
    • Pagination


  • @magus The only stats I can find on that site are these:

    Stash Tabs proccessed 381419889 stash tabs
    GGG API download 15301.17 GB

    So we know they've downloaded a bit over 15 TB, but we have no idea how many data items they actually need to keep track of.

    We know they've queried stash tabs 381419889 times (there's no way there's that many unique tabs in the game, so we also know they're very bad at writing clear website copy). Again: we have no idea how much of that data has to actually be stored and available. For all we know, 99% of them were empty.

    So the conclusion is: I have no conclusion.

    But I doubt they've outgrown the limits of MS SQL Server. It would surprise me very very much to learn that. What I'd like to see is the number of records they've tracked in toto, the number "active" at any one time, and the disk space taken up by that data.


    EDIT: I should add that if you have to do stuff in memory all at once, you should switch to a better programming language. C# is great at streaming data, either by using the explicit Stream types or by making use of yield return. If you already have an IEnumerable, then you should have no problem converting it to use yield.



  • @blakeyrat said in Messing Around With Too Much Data:

    C# is great at streaming data

    You do have to be careful, though. If you have 15 TB of data that makes up 381419889 records, your algorithm better be O(n) and not O(n2) or you'll be pulling five and a half zettabytes of data from disk.



  • @magus said in Messing Around With Too Much Data:

    How do you deal with a monstrous dataset like that?

    What do you intend to do with it, and what level of support does the API provide?

    You probably don't need the entire dataset in memory at once. If you aggregate (with things like sum, average) you can do it page by page and keep partial results. If you show a list, the API should be able to present you a particular page in a particular order.

    But it really depends on what your plan is.



  • @blakeyrat said in Messing Around With Too Much Data:

    C# is great at streaming data, either by using the explicit Stream types or by making use of yield return. If you already have an IEnumerable, then you should have no problem converting it to use yield.

    If the API is suitably advanced (and needs eg. joins between endpoints), then I'd even go for implementing it as a IQueryable and translating LINQ to API calls. Advanced stuff, but possibly worth the hassle.



  • @blakeyrat said in Messing Around With Too Much Data:

    381419889

    I don't know how many user accounts they have, but even if the average player buys one or two tabs, if they never clear those tabs out when they reach Standard, it's likely that players average on 12 or so tabs each, public or not, across Standard and the active leagues.

    Tabs will generally hold somewhere between 32 and 144 items, so the total number of items is definitely going to be rather large. But yeah, that's the most I know on that.



  • @maciejasjmj Basically, you can request pages, and they stick in as many tabs as they can, and then the token to continue your query is the only parameter. But yeah, I may be able to filter much more aggressively before I try doing any serious processing.

    I don't have any concrete plans yet, I mostly just want to be able to interact with this data in a reasonable way for now, and display my results.



  • @magus said in Messing Around With Too Much Data:

    @maciejasjmj Basically, you can request pages, and they stick in as many tabs as they can, and then the token to continue your query is the only parameter.

    Ah, @cartman82 school of API design 🚎

    I think the API expects to be dumped page by page into a database or some sort of storage and then polled occasionally when you're done - there are no filtering facilities or anything like that. I don't think you'll be able to just get by with an IEnumerable, since on average a simple query to get the info about a particular tab will have you downloading 7,5TB of data and discarding it.



  • @maciejasjmj said in Messing Around With Too Much Data:

    @magus said in Messing Around With Too Much Data:

    @maciejasjmj Basically, you can request pages, and they stick in as many tabs as they can, and then the token to continue your query is the only parameter.

    Ah, @cartman82 school of API design 🚎

    I think the API expects to be dumped page by page into a database or some sort of storage and then polled occasionally when you're done - there are no filtering facilities or anything like that. I don't think you'll be able to just get by with an IEnumerable, since on average a simple query to get the info about a particular tab will have you downloading 7,5TB of data and discarding it.

    Also what you'd get from the API if you just enumerated it could contain duplicates, where the latter entry is an updated version of an entry that has already been retrieved. You'd need to keep and overwrite that instead of yielding both.

    I guess if you just wanted to play around you could fetch a few pages and juggle them in memory, and if you were collecting stats then maybe you could get by keeping just the aggregate value (but you'd still need to pipe all the data through you and you'd need to eat the duplicates), but pretty much anything more than that will require you to dump that data to a database and write yourself a service to poll for updates.



  • @maciejasjmj It really leaves me wondering how poe.trade does it; they seem to get updates within 10 minutes of someone moving an item into one of those tabs.



  • @magus said in Messing Around With Too Much Data:

    @maciejasjmj It really leaves me wondering how poe.trade does it; they seem to get updates within 10 minutes of someone moving an item into one of those tabs.

    That's what the duplicates are - esentially once you've fetched all the data you start getting empty sets with the same token until someone makes a change, and that change is then transmitted (bumping the token).

    They explain it on the API site - the token is a sequential ID, and each change causes the change ID property on the tab to be reassigned to the next value from that sequence.



  • @maciejasjmj That definitely sounds like the only way to make it work is to download all the things. Unfortunate. I don't think I can be bothered to deal with the space costs :/



  • @magus said in Messing Around With Too Much Data:

    @maciejasjmj That definitely sounds like the only way to make it work is to download all the things. Unfortunate. I don't think I can be bothered to deal with the space costs :/

    Or see if someone already is doing the legwork of hosting a proper API to that data.



  • @magus said in Messing Around With Too Much Data:

    @maciejasjmj It really leaves me wondering how poe.trade does it; they seem to get updates within 10 minutes of someone moving an item into one of those tabs.

    Based on the info the other site provides on their "stats" page, it looks like one of the API endpoints is a "all changes since X date" endpoint.

    If you go to poe.ninja's site and click the Stats page and click the GUID of the "last update" call it'll show you the actual JSON it received.



  • @magus said in Messing Around With Too Much Data:

    @maciejasjmj That definitely sounds like the only way to make it work is to download all the things. Unfortunate. I don't think I can be bothered to deal with the space costs

    Do you know the space costs?

    EDIT: also their JSON is verbose as shit. It seems to report the text description of every item every time it appears. You only need to store that in your DB once.


  • Considered Harmful

    @blakeyrat said in Messing Around With Too Much Data:

    @magus said in Messing Around With Too Much Data:

    @maciejasjmj That definitely sounds like the only way to make it work is to download all the things. Unfortunate. I don't think I can be bothered to deal with the space costs

    Do you know the space costs?

    https://i.imgur.com/S8UN8sM.jpg



  • @blakeyrat said in Messing Around With Too Much Data:

    EDIT: also their JSON is verbose as shit. It seems to report the text description of every item every time it appears. You only need to store that in your DB once.

    That is actually an extremely good point. I may indeed play around with this some more.


  • sekret PM club

    @blakeyrat said in Messing Around With Too Much Data:

    EDIT: also their JSON is verbose as shit. It seems to report the text description of every item every time it appears. You only need to store that in your DB once.

    Eh, I think this depends on the item. Since this is a Diablo-ish style game, you can end up with two items named exactly the same that rolled completely different sub-stats, which is probably why all that's included. Someone using it for trade purposes will want to know the stats on the items they're looking at.



  • @e4tmyl33t http://www.pathofexile.com/api/public-stash-tabs?id=96917148-101696284-95530975-110260464-102925256

    Click. Let it load.

    Type control-F. Paste in: "Place into an item socket of the right colour to gain this skill. Right click to remove from a socket." and hit enter.

    Watch the scrollbar FILL with yellow.

    Hard to justify needing that exact string that many times unless your API is verbose as shit.


  • Impossible Mission - B

    @blakeyrat Sounds like they should have used YAML. 🐠



  • @e4tmyl33t I think what he's saying is, "6% increased fire damage" is fully spelled out every time, as opposed to something like "dmg_f_inc(0.06)" or "increase": { "amount": "0.06", "type": "fire" }, the latter being maybe easier to deal with. Either way, I should be able to track stuff with a column name, and not repeat this every time.



  • Did not see...but ...

    1. async
    2. Multi-Threaded (probably multi-process or even multi-machine) parallel processing.


  • Decided to mess around with trying to get all the implicit and explicit mod names, so I can come up with sane database columns, if I get to that point. It doesn't work, but here it is:

    // <copyright file="TabCollection.cs" company="BlackOmen">
    // Copyright (c) BlackOmen. All rights reserved.
    // </copyright>
    
    namespace BlackOmen.RothProcurementIndustries.Models
    {
    	using System.Collections;
    	using System.Collections.Generic;
    	using System.Linq;
    	using System.Net.Http;
    	using System.Threading.Tasks;
    	using Newtonsoft.Json.Linq;
    
    	public class TabCollection : IEnumerable<string>
    	{
    		public IEnumerator<string> GetEnumerator()
    		{
    			const string apiUrl = "http://api.pathofexile.com/public-stash-tabs";
    
    			using (var client = new HttpClient())
    			{
    				var nextChangeId = apiUrl;
    				var mods = new List<string>();
    
    				while (true)
    				{
    					if (!mods.Any())
    					{
    						var content = Task.Run(() => client.GetStringAsync(nextChangeId)).GetAwaiter().GetResult();
    
    						var json = JObject.Parse(content);
    						var changeId = json["next_change_id"].Value<string>();
    
    						var modList =
    							json["stashes"]
    								.Where(tab => tab["public"].Value<bool>())
    								.SelectMany(tab => tab["items"])
    								.SelectMany(item =>
    									item?["explicitMods"]
    										?.Select(mod => mod.Value<string>()) ?? new string[] { }
    										.Concat(item?["implicitMods"]?.Select(mod => mod.Value<string>()) ?? new string[] { }))
    								.Select(mod => new string(mod.Where(character => character < 0 || character > 9).ToArray()))
    								.Distinct();
    
    						nextChangeId = $"{apiUrl}?id={changeId}";
    
    						mods.AddRange(modList);
    					}
    
    					var firstMod = mods.First();
    					mods.Remove(firstMod);
    
    					yield return firstMod;
    				}
    			}
    		}
    
    		IEnumerator IEnumerable.GetEnumerator()
    			=> this.GetEnumerator();
    	}
    }
    


  • One of us hasn't had enough coffee, but can't you just:

    while (true)
    {
        var mods = FetchData(); //HTTP and filtering go here
        foreach (var mod in mods) 
        {
            yield return mod;
        } 
    }
    

  • Considered Harmful

    @magus Seems to be some bug in NodeBB; your spaces have been replaced with tabs.


  • Java Dev

    @magus said in Messing Around With Too Much Data:

    @e4tmyl33t I think what he's saying is, "6% increased fire damage" is fully spelled out every time, as opposed to something like "dmg_f_inc(0.06)" or "increase": { "amount": "0.06", "type": "fire" }, the latter being maybe easier to deal with. Either way, I should be able to track stuff with a column name, and not repeat this every time.

    Yeah, that bit looks pretty bad. You're going to have to parse all of those manually, especially if you also want to be able to search for things like "Provides at least +5% fire damage" later on.

    Do you have an inkling of how many properties you'll get? Can your DB engine handle that many columns in one table?



  • @maciejasjmj No. The fetch code is in an if, and will only be called if the cache is empty. I can't think of another way to do it with the system they've got. To the best of my knowledge, what you suggest would cause it to call the API every time, and when I had it implemented that way, the API sometimes complained about receiving too many requests.



  • @pleegwat Right now, I just want to get all the things, rip the digits out, and find out how many unique strings I end up with.



  • @magus said in Messing Around With Too Much Data:

    To the best of my knowledge, what you suggest would cause it to call the API every time

    Hm? No. You fetch the data, you loop over the portion you got yielding each item, when you reach the end of the portion you fall out of the foreach and get back to the FetchData part.



  • @maciejasjmj Ah, maybe you're right and I just don't remember how these things work properly.

    EDIT: Yeah, you're right. This really does work in the most obvious way. Thanks!

    ...Now I just need to find out why the digit removal isn't working.


Log in to reply