Finding distinct ranges in a collection - C#/SQL


  • kills Dumbledore

    I have a fairly complex database structure representing value ranges and the parameters that apply to those ranges. So each row has a minimum and a maximum, and foreign keys out to tables that hold the parameters.

    The complexity comes from the fact that among the top level ranges there are sub ranges and sub-sub ranges which override the top level if present. If a value is within a sub-sub range, the parameters for that are used, if not but it's within a sub range then it's those parameters, otherwise it uses the top level (which covers all possible values without gaps). None of the lower level ranges overlap and a lower level range will always be within one parent range, not crossing borders. So there won't be a subrange from 100 to 200 if the top level ranges are 0 - 150 and 151 - 300, and there won't be ranges on the same level like 100 - 200 and 150 - 250.

    I can't change the database structure but have free rein over the query to extract them and the handling of the resultant data. What I need out of the other side is a JSON document with an array of all ranges with no overlap, e.g. if there's a top level range of 1 to 1000, two sub ranges 100 - 200 and 250 - 350 with a sub-sub range of 300 - 350 I need the JSON to contain 1 - 99, 100 - 200, 201 - 249, 250 - 299, 300 - 350, 351 - 1000, with the appropiate parameters for each range.

    I was thinking of having one query return all ranges of all levels with their parameters, then sort out the distinct ranges in code, but either I'm too tired to come up with a sensible way to organise this, or it's just a bad way to approach the problem. I don't think trying to handle the ranges in SQL is a good approach, so maybe three different queries and combine the resulting collections in code? I'm still not sure what sort of approach would work with that.

    Any ideas?



  • @Jaloopa how big is your data set? if you've got a million or fewer ranges, it's probably not worth trying to optimize this.



  • here's the example as an SQL database if anyone wants to try stuff I guess: http://sqlfiddle.com/#!15/b2cb6


  • kills Dumbledore

    @ben_lubar said in Finding distinct ranges in a collection - C#/SQL:

    @Jaloopa how big is your data set? if you've got a million or fewer ranges, it's probably not worth trying to optimize this.

    Less than a hundred top level, a thousand sub ranges and about 80 sub sub ranges, so not particularly large.


  • Considered Harmful

    If I understand the preconditions correctly, you could just do something like this (I can't C#):

    • read the [min,max,n] triples ordered by n
    • if the list of triples is empty, insert the new one
    • if it is not, go through the list (or binary-search for O(nlogn)) until you find the one where min'<=min AND max'>=max
    • replace that one with the three triples [min',min-1,n'], [min,max,n], [max+1,max',n']. Add some special-casing to get fewer triples if a subrange can start or end on the same number as the higher-order one.

    Edit: here's Perl, :kneeling_warthog: for special-casing

    use strict;
    use warnings;
    use 5.024;
    use DBI;
    use List::MoreUtils 'firstidx';
    use constant VAL => 0;
    use constant MIN => 1;
    use constant MAX => 2;
    use constant SQL => 'SELECT * FROM ranges ORDER by fk';
        
    my $dbh = DBI->connect("dbi:SQLite:dbname=foo.sqlite", "", "") or die;
    
    my @r;
    for my $row ( $dbh->selectall_arrayref( SQL )->@* ) {
        my ($val, $min, $max) = @$row;
        @r or do { push @r, $row; next };
        my $i = firstidx { $_->[MIN] <= $min and $_->[MAX] >= $max } @r;
        splice(@r, $i, 1,
            [ $r[$i][VAL], $r[$i][MIN], $min-1 ],
            [ $val, $min, $max ],
            [ $r[$i][VAL], $max+1, $r[$i][MAX] ] 
        );
    }
    #use Data::Dumper; say Dumper(\@r);
    say join(", ", @$_) for @r;
    

  • Banned

    1. Get main ranges.
    2. Get beginnings and ends of sub- and subsubranges.
    3. Add 1 to all sub-ends.
    4. Join beginnings and end-plus-ones into one list.
    5. Split main range into two on each beginning-or-end-plus-one index.

    I hope it doesn't have to be a single query.



  • Basically, you can walk through the ranges and use a stack to keep track of the properties set. Create a list of start and end points in the ranges, sort the list by start/end point value, then loop over them. For each item, finish the previous range by setting the end point and create a new range. If the item starts a new range, push the item to the stack and set the new range's properties to the item's properties. If the item ends a range, pop from the stack and set the new range's properties to the properties of the item now at the top of the stack.

    Here's some Python code (all I have available at the moment) that seems to work.

    query_result = [(1, 1000, 1), (100, 200, 2), (250, 350, 3), (300, 350, 4)]
    range_endpoints = list()
    
    for i in query_result:
        t = ("min", i[0], i[2])
        range_endpoints.append(t)
        t = ("max", i[1], i[2])
        range_endpoints.append(t)
    
    def keyfunc(t):
        return t[1]
    
    range_endpoints.sort(key=keyfunc)
    print(range_endpoints)
    stack = list()
    stack.append((None, None, None)) # Dummy value that marks the start and end of the set of ranges
    
    final_ranges = list()
    first_endpoint = range_endpoints.pop(0)
    current_range = [first_endpoint[1], None, first_endpoint[2]]
    stack.append(first_endpoint)
    
    for t in range_endpoints:
        if t[0] == "min":
            current_range[1] = t[1] - 1
            final_ranges.append(current_range)
            current_range = [t[1], None, t[2]]
            stack.append(t)
        elif t[0] == "max":
            # Check for end of a range that ends at the same point as the previous range
            if t[1] > final_ranges[-1][1]:
                current_range[1] = t[1]
                final_ranges.append(current_range)
            
            stack.pop()
            
            current_range = [t[1] + 1, None, stack[-1][2]]
    
    print(final_ranges)
    
    

Log in to reply