Recommend to me a 3DZ/nD geospatial database for less than Oracle's $36M



  • One of my friends sucked me into a space trucker game, and I'm running into issues going from point A to point B. The in-game map usually works, but sometimes leaves you stranded without fuel when plotting a multi-hop journey, and outright refuses to plot to systems more than 1,000 light years away.There are a few online tools to help with this, but they have similar or even more restrictive regulations. Since I'm a coder, I rolled up my sleeves to make my own. Fortunately, most of these sites participate in a real-time data gathering system, and provide daily dumps of all data collected to date, so I have more than ample source material.

    Problem 1: Loading all the data into RAM (~10M stellar systems, ~200K planets/moons/belts, ~20K space stations and other facilities) takes about 12 minutes, plus another 5 minutes to build an octree for reasonably-efficient "what's nearby" lookups. Once this 17-minute loading process is complete, route planning can be done pretty quickly (<0.1 sec for a typical six-hop trade run, <0.5 sec for going from Sol to newbie-ville, about 2 minutes to go from the center of the galaxy to one of its extrema in a ~100ly/jump ship at about 2,200 jumps). However, this consumes way too much RAM and has too long an initial load time to be usable.

    Problem 2: I loaded this all into SQL Server for the regular data plus (since SQL Server doesn't do 3D spatial) a separate, custom octree index. After "tuning" Entity Framework to act solely as a database abstraction layer, it took about 20 minutes. Now, my software loads practically instantly, but can't reasonably return results. Without the octree, the six-hop trade run lookup now takes about 40 seconds, since SQL Server runs a full table scan to find the hop candidates. With the octree, after several minutes of badly-done string concatenation, ADO.NET throws an exception that is "[...] a rare event and only expected for extremely complex queries [...]" and won't provide an answer. It looks like what happens is that the octree narrows it down from ~10M to ~20K candidates, and EF uses an "WHERE ID IN (...)" clause, but textually substitutes all 20K ID numbers, rendering the query unrunnable. My next step is to move the index into a SQL-Server-hosted CLR object with a table-valued function, but that gets... messy. Especially teaching EF how to call into it.

    I don't think I can do this using the resources I know about on hand. MS SQL, MySQL, MariaDB, and the default builds of pgSQL and SQLite don't support 3D spatial indexing, only 2D at best. Oracle considers it an enterprise-only feature (hence the price tag in the title). SQLite does offer n-dimentional R*-tree indexing if you compile it from source, but even when you follow the instructions to create a shared object / DLL, it spits out a static library with no exports. There's something called PostGIS for pgSQL, and it explicitly has exactly what I need, but I can't figure out how to get it installed (and I'm pretty hopeless at pgSQL anyway). I've been picking the brains of a Databricks engineer, and there are 2D geospatial extensions for Spark, but not 3D geospatial. There do exist dedicated servers for handling XYM and TXYM geospatial datasets, but typically they're designed for massively parallel systems, don't offer a working Z coordinate, and start at $20K.



  • @twelvebaud buy more RAM. Or if it's gonna be used by many people, put it on a server with lots of RAM. Alternatively, roll your own solution for streaming parts of full octree from disk (should be moderately straightforward given that you have working in-memory implementation).



  • @gąska I've got no problem loading the octree; it's like 40MB. It's matching it against the ~7GB of original data that's giving me fits. I suppose I could keep all the data in RAM all the time, but it sucks that in the event it goes down it needs the time to download the data from upstream again plus 17 minutes to ETL it, which is why I'm trying to keep it in a traditional database.



  • @twelvebaud said in Recommend to me a 3DZ/nD geospatial database for less than Oracle's $36M:

    I suppose I could keep all the data in RAM all the time, but it sucks that in the event it goes down it needs the time to download the data from upstream again plus 17 minutes to ETL it, which is why I'm trying to keep it in a traditional database.

    Why are "keep the data in RAM" and "keep the data in a traditional database" mutually-exclusive?



  • @blakeyrat I'm using Entity Framework so I can deal with plain old CLR objects, and it really doesn't like trying to create ten million of them at once, like what I'd get if I ToList()ed my data table. I've shut off most of EF's features like change tracking and autoproxies, but even after they've been detached I can only get about 100,000 in memory at once.



  • @twelvebaud said in Recommend to me a 3DZ/nD geospatial database for less than Oracle's $36M:

    I'm using Entity Framework so I can deal with plain old CLR objects, and it really doesn't like trying to create ten million of them at once, like what I'd get if I ToList()ed my data table.

    So don't do that?

    SQL isn't SO hard that you need a Entity Framework for a problem like this. Man-up and write code.


  • Impossible Mission - B

    @blakeyrat said in Recommend to me a 3DZ/nD geospatial database for less than Oracle's $36M:

    SQL isn't SO hard that you need a Entity Framework for a problem like this. Man-up and write code.

    Agreed.

    Try something like Dapper. EF is way overkill for something like this.



  • @blakeyrat said in Recommend to me a 3DZ/nD geospatial database for less than Oracle's $36M:

    @twelvebaud said in Recommend to me a 3DZ/nD geospatial database for less than Oracle's $36M:

    I'm using Entity Framework so I can deal with plain old CLR objects, and it really doesn't like trying to create ten million of them at once, like what I'd get if I ToList()ed my data table.

    So don't do that?

    SQL isn't SO hard that you need a Entity Framework for a problem like this. Man-up and write code.

    In fact, I think most of the work should be done with stored procedure or SQLCLR.

    Since EntityFramework make use of SqlConnection internally when your target is MSSQL server, using System.Data.SqlClient.* directly can help you load less library.



  • I'm pretty sure Postgres has support for geometric indexes.



  • @twelvebaud said in Recommend to me a 3DZ/nD geospatial database for less than Oracle's $36M:

    @gąska I've got no problem loading the octree; it's like 40MB. It's matching it against the ~7GB of original data that's giving me fits.

    I don't really get it. What's the original data here? How is it related/unrelated to octree? Why can't you chunk it and store in files that you load partially from disk?



  • @ben_lubar said in Recommend to me a 3DZ/nD geospatial database for less than Oracle's $36M:

    I'm pretty sure Postgres has support for geometric indexes.

    Yep, the documentation references "n-dimensional", so I assume that means 3D is possible.


  • Discourse touched me in a no-no place

    @ben_lubar said in Recommend to me a 3DZ/nD geospatial database for less than Oracle's $36M:

    @ben_lubar said in Recommend to me a 3DZ/nD geospatial database for less than Oracle's $36M:

    I'm pretty sure Postgres has support for geometric indexes.

    Yep, the documentation references "n-dimensional", so I assume that means 3D is possible.

    PostGIS would be the first thing I'd look at as well. (Except I've never done any geospatial work at all, so I don't really know the right questions to ask.)



  • @twelvebaud said in Recommend to me a 3DZ/nD geospatial database for less than Oracle's $36M:

    @blakeyrat I'm using Entity Framework so I can deal with plain old CLR objects, and it really doesn't like trying to create ten million of them at once, like what I'd get if I ToList()ed my data table. I've shut off most of EF's features like change tracking and autoproxies, but even after they've been detached I can only get about 100,000 in memory at once.

    If you're hand-building index-lookalikes anyway, I'd suggest to just use a plain data file. Set it up so you can memory map it and write code to use it in that state, rather than having to load the whole thing into memory explicitly.



  • @gąska Elite has about ten million star systems I can fly between, with information like "this is named Sol", or "this is part of the Typhon Expanse cluster", or "the police here are run by the Federation", or "this has a neutron star / black hole / yellow sun", along with X, Y, and Z coordinates. I've got a gigantic JSON dump of all this information, and loading it up in a traditional database lets me use the database's indexes to easily sort and filter based on this information -- for example, I can provide autocomplete for someone typing in "Sag..." near instantly because the database engine can find systems whose name starts with that quickly: it has an index for that, and knows where all the S's are without having to look for them individually.

    Where that breaks down is trying to find systems near other systems. When route planning, the biggest query I have is "find me star systems within a 20-light-year radius of Sol, ordered by closest to Sagittarius A*". A traditional database's index can get it down to a 40ly-high "slice" of the universe, but within that slice systems that are very close to Sol get mixed in with systems near the extremities -- even if those systems are very close in one axis, they're not close at all in 3D space. Several database engines have special spatial indexes that take that into account, and partition based on rectangles or cuboids. SQL Server, which I was using because it was easy to set up, only takes into account 2D space, which narrows that down from a universe-wide slice to a universe-long cylinder, but that's still a lot of systems to chew through.

    I've made my own index that takes the full 3D space into account. At the leaf level, it has a list of star system IDs within a cuboid of space. Since I suck at reading research papers, the best I could do is an octree (3D version of a binary search tree with fixed pivots), which isn't that great but is still more performant in theory than scanning the entire table or intersecting three linear index lookups. For this particular kind of query only, I'm trying to use this external index to guide the database, the same way it guides itself when looking up star systems by name or by star class.

    I can definitely move everything back into CLR objects I can save/load in chunks from disk, but at that point I'm just writing a database engine from scratch, and I know I can't do as good a job as the manyears of effort behind MS SQL, Oracle, etc.

    @blakeyrat @masonwheeler You're right; I really hate writing that sort of code but I can't deny it's effective. Driving ADO.NET manually, I can load all 10M star systems at once in about 45 seconds, and spatial lookups are back to a reasonable amount of time.



  • @ben_lubar @dkf That's absolutely the correct solution, and what I would go with immediately if I knew how to use PostgreSQL. Sadly, my experience with it comes from trying to troubleshoot Discourse performance problems and getting set up to do that felt like I was shaving yaks, in the dark, during a hurricane, with a pair of pinking shears. And I recognize that's entirely on me and not indicative of a problem with pgSQL, but it also means my chance to succeed at that solution is pretty low.



  • @twelvebaud said in Recommend to me a 3DZ/nD geospatial database for less than Oracle's $36M:

    Since I suck at reading research papers, the best I could do is an octree (3D version of a binary search tree with fixed pivots), which isn't that great but is still more performant in theory than scanning the entire table or intersecting three linear index lookups.

    I suck at SQL and that kind of stuff, but acceleration structures like octrees are kinda my thing. If you any questions regarding that, I'm happy to help. (My first attempt would usually involve a BVH rather than an octree: building a semi-decent BVH is very easy, and range queries tend to be fairly efficient against it. You can make it pointerless, so serializing it is easy too.)


  • Discourse touched me in a no-no place

    @twelvebaud said in Recommend to me a 3DZ/nD geospatial database for less than Oracle's $36M:

    adly, my experience with it comes from trying to troubleshoot Discourse performance problems and getting set up to do that felt like I was shaving yaks, in the dark, during a hurricane, with a pair of pinking shears.

    Toby Faire, that was with the frankly disgusting indexing configuration that undefined et al were using. They appeared to be assuming that if you'd optimised a DB for one particular engine with one particular workload, you'd have an optimal configuration for all engines and workloads.

    The main reason I first thought of PostGIS (which I emphasise that I really don't know how to use) was that it's used a lot by friends of mine who do complex mapping work, and it appears (from a small amount of searching) to not just assume that the dataset is a flat plane, or even planar at all (unless you use a function that specifically does). What's more, I've never heard them complain about it as a piece of technology, and they have quite… strong opinions on other software systems. 😉



  • @twelvebaud have you thought of making two separate databases - one for positions of objects, and one for everything else? The first might be custom octree implementation with just IDs and positions, avoiding use of any stock RDBMS, and the latter would be flat database without any spatial divisions. You'd do route planning using exclusively the first one, and everything else using exclusively the second. Would that work for you?



  • I don't know much about geospatial indexes, but my first thought is to create something kind of like a hashtable. If you divide your space into some number of units in each dimension and then populate a DB table with 1 row for each cell, having columns for id and x, y, and z coords. Index that table by the x, y, and z, and populate each table that has a location with the id of the cell in your index table and create an index for that table by that id.

    Then in your code, you can compute which block an exact location is in, query the table for the id of that block, then query the item table by that ID to find items in that block. And step the block by 1 in each direction and repeat those queries to find items a little further.

    Then again, I found PostgreSQL to be really easy to install on Windows, much faster and easier than SQL Server, oddly enough. And supposedly PostGIS can be installed easily with that.

    Related WTF - why does installing Microsoft's own Database on their own OS take 10 times longer than installing a random open-source database?


  • Discourse touched me in a no-no place

    @undergroundcode said in Recommend to me a 3DZ/nD geospatial database for less than Oracle's $36M:

    I don't know much about geospatial indexes, but my first thought is to create something kind of like a hashtable.

    Hashtables are great for equality tests, but not distance comparisons where the values are very much not the same but are merely similarish. Quadtrees/octrees (and their relatives that use adaptive node-set splitting) are a much better solution for nearest neighbour searching.



  • @dkf said in Recommend to me a 3DZ/nD geospatial database for less than Oracle's $36M:

    @undergroundcode said in Recommend to me a 3DZ/nD geospatial database for less than Oracle's $36M:

    I don't know much about geospatial indexes, but my first thought is to create something kind of like a hashtable.

    Hashtables are great for equality tests, but not distance comparisons where the values are very much not the same but are merely similarish. Quadtrees/octrees (and their relatives that use adaptive node-set splitting) are a much better solution for nearest neighbour searching.

    I wasn't thinking of the hashcode aspect, but the part about splitting the range into buckets. Seems like, if the bucket size was chosen wisely, it should be a decent way to quickly get items that are close to a given point without doing expensive calculations on the millions that aren't.



  • I feel like vector distance calculations are a pretty easy thing to do...


  • Discourse touched me in a no-no place

    @undergroundcode said in Recommend to me a 3DZ/nD geospatial database for less than Oracle's $36M:

    I wasn't thinking of the hashcode aspect, but the part about splitting the range into buckets.

    What do you think an octree does? 😉


  • Discourse touched me in a no-no place

    @magus said in Recommend to me a 3DZ/nD geospatial database for less than Oracle's $36M:

    I feel like vector distance calculations are a pretty easy thing to do...

    On a one-by-one basis, sure. Over the dataset of ten million items mentioned above? Well…


  • Impossible Mission Players - A

    @undergroundcode said in Recommend to me a 3DZ/nD geospatial database for less than Oracle's $36M:

    Related WTF - why does installing Microsoft's own Database on their own OS take 10 times longer than installing a random open-source database?

    Short Story: Because merely blindly copying files and writing registry entries is Insufficient.



  • This post is deleted!


  • Maybe you should write a microservice app for storing spacial data.



  • @gąska I thought about that (and that's effectively what I'm doing now since @blakeyrat told me to man up and write code). The problem with stripping the non-route-planning information is that a surprising amount of information can be relevant. "I can refuel up to my 20ly jump range only in systems with main-sequence stars; my ship is too large to dock at a station. Also, I have a very large bounty on my head in Empire space, so I can't enter any Empire-controlled system. Finally, I'm transporting VIP passengers with queasy stomachs, so no neutron stars. Given those requirements, how do I get from HIP 2222 to ..."

    I think I've managed to get spatial lookups fast enough that I can afford to filter additional restrictions like those above the database level, but if someone has a tuned Beluga Spaceliner and tries to plot a route using 100ly hops, it might become a problem again, so I'm still trying to figure it out.

    @magus It's been done, and the last time I attempted this someone ( @accalia maybe?) recommended I use one that cost "only" $33K, but I don't think I have the expertise to do so. @cvi would probably be much better than I.

    Speaking of @cvi, I definitely want to use something more conformant than an octree. Since I anticipate the overall number of new datapoints to be low and in underfilled leaves anyway, I can use something with a preloading strategy, like R* with sort-tile-recursive loading. When I read the R* paper though, it provided very general algorithms that seemed to state the end goals rather than the means to accomplish them; I entered having very little idea of how to select a pivot point and direction, and left with many more words but even less idea.

    @UndergroundCode MSSQL is so time-consuming to install because of the extent of the changes it tries to make. Aside from the OS-level changes like tweaking the scheduler to favor longer quanta and letting it handle its own caching and working set management, it also installs a bunch of extras for service advertisement and discovery, extension and integration points for CLR code (instead of just C), prep work for installing other services, possibly IIS and SharePoint integration if you install Reporting Services, integration into Event Tracing for Windows, et cetera. Random open source databases tend to be "just the database, ma'am", which reduces the footprint considerably, and tend to avoid painful Windowsisms because they need to work on platforms that aren't Windows.


  • Discourse touched me in a no-no place



  • @twelvebaud said in Recommend to me a 3DZ/nD geospatial database for less than Oracle's $36M:

    @magus It's been done, and the last time I attempted this someone ( @accalia maybe?) recommended I use one that cost "only" $33K, but I don't think I have the expertise to do so. @cvi would probably be much better than I.

    What I mean is, if you take, say, Service Fabric, you can have actors that persist a position and an ID, and add a query to the service that activates them, that sends out a request to all of them, and ones close enough return back.

    I don't know exactly how easy it is to do, but it shouldn't cost that much.



  • @twelvebaud said in Recommend to me a 3DZ/nD geospatial database for less than Oracle's $36M:

    I think I've managed to get spatial lookups fast enough that I can afford to filter additional restrictions like those above the database level, but if someone has a tuned Beluga Spaceliner and tries to plot a route using 100ly hops, it might become a problem again, so I'm still trying to figure it out.

    Aren't you using some exact algorithms when some heuristic would do just fine?


  • Discourse touched me in a no-no place

    I know that "rare" extremely complex query error well. It comes from using EF in any application other than a CRUD app.

    Dump EF. It's junk for heavy lifting.

    I probably have some other advice, but I'll have to look at the problem space some more.



Looks like your connection to What the Daily WTF? was lost, please wait while we try to reconnect.