Interesting Query Problem

revenade

Hi,

I have a SQL query problem. I have a table with two columns like this one;

    A            32
   A            67
   B            32
   B            28
   C            32
   C            67
   C            74
   A            112
   A            344

How can I retrieve the records of column1 which has both '32' and '67'? I mean this output;

     A            32
   A            67
     C            32
   C            67

B is excluded because it does not have both '32' and '67'..

Thx..

ammoQ

@revenade said:

How can I retrieve the records of column1 which has both '32' and '67'? I mean this output;

select a,b from mytable mt1 where b in (32,67) and exists (select 'x' from mytable mt2 where mt2.a=mt1.a and mt2.b in (32,67) and mt2.b<>mt1.b);

revenade

That's so cool :)) thank you

asuffield

You should really reconsider your database structure. If you are doing complex queries on multiple fields that are not the key, you're probably doing something wrong (and if there's no way to avoid this, what you're doing wrong is using a relational database for something it isn't good at).

rmr

I'm not sure I follow. One of the advantages of a database is precisely that it allows you to do "complex queries on multiple fields that are not the key". What do you suggest he change?

There are more normalized ways to store the data, but they don't make the query any easier. Or am I missing something?

lpope187

@rmr said:

I'm not sure I follow. One of the advantages of a database is precisely that it allows you to do "complex queries on multiple fields that are not the key". What do you suggest he change?

There are more normalized ways to store the data, but they don't make the query any easier. Or am I missing something?

I agree with your statement. Wouldn't the OP original question be relevant to User/Group memberships? For example, you have the following schema:

Users Table
UserId int
UserName nvarchar(50)

Groups Table
GroupId int
GroupName nvarchar(50)

User/Group Table
UserId int
GroupId int

Where the existence of a record in the User/Group table indicates that the user is a member of a group.

Now isn't the OPs question exactly analogous to the question "Show me all users who are members of both the Sales and Administrator groups"? There are plenty of examples where this type of question might be asked and the schema would be very similar. How about "Show me all orders that had both Part A and Part B as line items" - Maybe I'm looking at bundling Part A and B to boost sales and/or minimize logistics.

asuffield

@lpope187 said:

Now isn't the OPs question exactly analogous to the question "Show me all users who are members of both the Sales and Administrator groups"?

No - write out that query, compare it to ammoQ's solution, and you'll see why. This question can be answered with a simple join, because all the things which you are interested in are part of the key to at least one of the tables. The OP's question is difficult only because nothing involved is a key.

asuffield

@rmr said:

I'm not sure I follow. One of the advantages of a database is precisely that it allows you to do "complex queries on multiple fields that are not the key". What do you suggest he change?

Without knowing anything about the structure of the database, it's impossible to suggest how to rearrange the tables sanely. It probably needs to be split into several tables.

Relational databases are good at doing complex operations on keys. They're quite clumsy at doing them on queries that don't involve any keys, as this example demonstrates. If you can't work at least one of these fields into a key, then some other kind of database may be more appropriate. Again, with only the information provided, it's impossible to suggest what kind.

msntfs

I am pretty lame at database, so can not he just use join with the table itself?

ammoQ

@msntfs said:

I am pretty lame at database, so can not he just use join with the table itself?

Definitely. Looking back, I must have been a bit dazed when proposing a subquery. So let's make that

select mt1.a,mt1.b from mytable mt1, mytable mt2 where mt1.b in (32,67) and mt1.a = mt2.a and mt2.b in (32,67) and mt2.b<>mt1.b;

But since he is really interested in column a only,

select mt1.a from mytable mt1, mytable mt2 where mt1.b = 32 and mt1.a = mt2.a and mt2.b = 67;

is what we are really looking for.

rmr

Re. assufield:

If by clumsy you mean slow, you can always add an index. If you mean something else, then I simply disagree with you. This type of operation is going to be easier with a DB than with flat files or spreadsheets or whatever other method you come up with.

asuffield

@rmr said:

Re. assufield:
If by clumsy you mean slow, you can always add an index. If you mean something else, then I simply disagree with you. This type of operation is going to be easier with a DB than with flat files or spreadsheets or whatever other method you come up with.

Did you even read what I wrote? A relational database is a specific kind of database that is good at certain things and bad at others. Notably, it is bad at operations not involving any keys. There are many other kinds of database, few of which share this problem. The world is not limited to relational databases, flat files, and spreadsheets. It's impossible to suggest a better method without more detail on the problem, but in general a likely answer is an object database or a hierarchical database - most problems are effectively solved by at least one of the relational/object/hierarchical set of database models. Failing that, a raw database like Berkeley DB (from Oracle) will handle just about anything.

Turns out that ammoQ missed the simple approach in this specific case and I didn't notice either, but if he'd wanted to select all the rows then you have to do that set selection/compare dance, and that is quite resistant to effective indexing in a relational database. In general, any attempt to construct a complex query involving no keys will give you that kind of trouble, because your query will have to range over multiple rows in the same table, and that has polynomial complexity in a relational database.

The worst-case scenario for the relational model is queries which involve ranging over an unbounded number of rows; they have O(N^R) complexity for examining R rows (per output row) from a table containing N rows, and cannot be indexed if R is not constant. In this example, R is the maximum number of rows with the same letter A/B/C, and we haven't been told what that number is. If it has no fixed maximum, and the intent is to construct queries for any two arbitrary numbers and not just 32+67, there is no possible index for the entire query. (You can construct partial indexes, but they don't really help for large values of R)

rmr

I'm perfectly willing to believe that I am completely misguided here. So lets assume he takes your advice and moves this data to a different format, say an object database. Where is the payoff? What would the simpler query look like? You certainly have enough of his schema information to show that.

I press this not because I'm some relational database evangelist, but that I think the essential difficulties of this query will remain pretty much regardless of the way you store the data. If I am incorrect about that I would very much like to know, as it could certainly make my day to day work easier.

asuffield

@rmr said:

I'm perfectly willing to believe that I am completely misguided here. So lets assume he takes your advice and moves this data to a different format, say an object database. Where is the payoff? What would the simpler query look like?

It's not the query code itself that's simpler, but rather the total effort involved in running the query - rather than fighting with an RDBMS and trying to persuade it to run a query of this complexity in a vaguely sane and efficient manner, with an object database you would implement a lookup algorithm rather than a declarative query.

There's nothing like SQL involved when you do that, just some code in your target language, so it could be anything really. Well-implemented object databases feel very similar to persistent native data structures - there's more going on behind the scenes, but that's what the interface looks like. Pseudocode might possibly be this (or something completely different, since I really don't know enough about the problem to give a good example - this could be entirely the wrong tool for the job and is pure guesswork on my part):

for each letter x:

does the number set contain all the numbers we're interested in?

Where the database structure would be records of this form:

{

letter x;

set of numbers relating to this letter;

}

and the 'set' object is something vaguely like perl's hashes, ruby's Set, or whatever, so we make that part of the query with one basic function call. The notable difference from a relational database is that we are (sub)querying a nested structure here, while relational tables are flat - that's the primary advantage of an object database: they accept the same kind of data structures that you use everywhere else.

This is not a good example.

Far more likely is that a relational database with a different schema would also solve the problem, but we have no idea. All I'm reasonably sure about is that what he's currently doing is in some way not right.

If I am incorrect about that I would very much like to know, as it could certainly make my day to day work easier.

Object databases are not usually worth the effort. The cases in which they really shine, and the problem cannot be restated differently to work on a relational database, are quite rare - you are unlikely to ever encounter one. They do exist, though.

Most of the time, you either want to rethink your schema, or use something like Berkeley DB which allows arbitrary structure (so you can build your own hierarchial/network/whatever system).

(PostgreSQL and Oracle both have some limited ability to function as an object-relational database, which attempts to handle both forms at once, usually with very little success. Their implementations of it are decidedly unpleasant to work with and not particularly efficient.)

Jeff_S

Can you explain again what is wrong with his schema? I did not see it written anywhere what the primary key of that table is (I assume it is a composite PK of both columns), and I did not see it written anywhere that perhaps BOTH of the columns are foreign keys to other tables. Either way, it is irrelevant. how you can suggest it is a bad design and needs to change when a) we have no idea about what this table structure is physically, or even if it is a table (it may be a resultset of a query) and b) we have no idea what he is logically trying to model.

Or, are you confused because you don't see a unique value like an identity or an autonumber in each row, so you just assume that the table cannot possibly have a primary key in place, like so many others who don't understand the concept of composite keys?

If he is modeling a relation between two entities, (as a previous poster suggested with a great example), this is EXACTLY how you do in a relational database; or if you are storing multiple flags or attributes or values for a single entity (assuming A and B and C are fk relations to an entity table, of course), you would do it this way as well. Maybe you are hung up on the fact that you assume this table DEFINES the entities A,B and C and it is not just a related table to those entities?

asuffield

@Jeff S said:

Can you explain again what is wrong with his schema?

No, because of the lack of information. I can, and have, explained why the example as given has terrible (exponential) complexity. Either the schema is wrong (because of facts we do not know about the structure of the data, which would permit a more efficient schema - likely, because problems with this complexity are rare) or a relational database is the wrong tool for the task.

If he is modeling a relation between two entities, (as a previous
poster suggested with a great example), this is EXACTLY how you do in a
relational database

Yes, if that's what he happens to be doing, there is no better approach than O(N^R) for this kind of query within a relational database and so you really want to use something else - there are efficient solutions to that problem, you just can't really do them within the constraints of a relational database. That is why I suggested the possibility of using something else in the first place.

I still don't think it's likely, we probably just don't have a complete description of the problem.

Jeff_S

@asuffield said:

Yes, if that's what he happens to be doing, there is no better approach than O(N^R) for this kind of query within a relational database and so you really want to use something else - there are efficient solutions to that problem, you just can't really do them within the constraints of a relational database. That is why I suggested the possibility of using something else in the first place.

huh ?

Assuming a PK of (col1,col2) -- a reasonable assumption -- all you need is a simple, 1 pass query that can fully use indexes to return all entities that have the attributes that you need. For example, if you are looking for those with 1,2 and 3, you just write:

select col1
from tbl
where col2 in (1,2,3)
group by col1
having count(*) = 3

Without the pk on those columns, you'd need to do a less efficient but equally correct and effective HAVING COUNT(distinct col2)=3 condition.

It's very short, easy, quick and efficient if you know SQL, definitely nowhere close to O(n^r), it the same cost as querying any table on a few (indexed) columns .... i have no idea why you'd think modeling this any other way would be more efficient.

It is also very easy to generalize this to work with any set of attributes you want to find.

asuffield

Read the thread. We have already established that the specific question asked is easier than we first thought (although it still becomes difficult quite rapidly). I'm not going to repeat explanations already given.

JvdL

assufield, as much as I value your frequent comments and admire your vast knowledge, this time you're out of your league.

The DB structure and this type of query are common enough. Yet you say that "you don't know enough about the problem to say something meaningful" even though the question was clear from the OP and the user/group or item/part example make perfect examples. I can only conclude you mean you don't know enough about RDBMS. You say, "Relational databases only work good on keys". That's crap and you know it. You say, the query runs O(N^R). Even in the worst case, single table queries will never run more than O(N) in any RDB that I know. The given solutions are all O(log N). You say, "A relational database (ie., SQL) is not meant for searching". As the name suggest, it was designed for searching, which is not to say that it is the best tool for it.

One does not choose to use a DB because it is the fastest querying engine, but because it is ACID, portable, and open. More likely, one uses the DB because it is the legacy you have to deal with. If someone has a problem working out a SQL query, it serves no one to spit out bile and bigotry that is unfounded and untrue. Or if you really know a better way to do this with, say, regular expressions, than share it instead of hiding behind "I don't know enough".

asuffield

@JvdL said:

You say, the query runs O(N^R). Even in the worst case, single table queries will never run more than O(N) in any RDB that I know.

You have a table containing integers, you want to select the ones that have no factors in the table (ie, the largest pairwise relatively prime subset). A single table, and if you have an O(N) solution, there's a lot of money waiting for you, and you've cracked most modern encryption algorithms. (There exists at least an O(N!) solution to this problem, and it is an unsolved problem as to whether any solutions exist with lower O() bounds)

There exists a large class of problems which do not have efficient solutions in relational databases. That's why we don't use relational databases for everything. This is well-established in the literature. One notable class is the set of problems which must find subsets of rows fulfilling certain (non-trivial) criteria, because a relational database can only do it by trying every permutation, giving O(N^R) or O(N!) behaviour, but faster solutions are known to exist for many of them.

You say, "A relational database (ie., SQL) is not meant for searching".

Now I know that you're just trolling, because that is not a quote (and in fact, the string "search" had not previously appeared in this thread before your post, as anybody can see for themselves).

JvdL

@asuffield said:

@JvdL said:
You say, the query runs O(N^R). Even in the worst case, single table queries will never run more than O(N) in any RDB that I know.

You have a table containing integers, you want to select the ones that have no factors in the table (ie, the largest pairwise relatively prime subset). A single table, and if you have an O(N) solution, there's a lot of money waiting for you, and you've cracked most modern encryption algorithms. (There exists at least an O(N!) solution to this problem, and it is an unsolved problem as to whether any solutions exist with lower O() bounds)

There exists a large class of problems which do not have efficient solutions in relational databases. That's why we don't use relational databases for everything. This is well-established in the literature. One notable class is the set of problems which must find subsets of rows fulfilling certain (non-trivial) criteria, because a relational database can only do it by trying every permutation, giving O(N^R) or O(N!) behaviour, but faster solutions are known to exist for many of them.

I agree I typed a bit fast when saying "any single table query", however I fail to see the relation with finding factors of integers,because that's not a query problem but an algorithmic problem. The query version of your problem would be a table with two columns: for each integer there would be multiple records, with the integer in one column and one of its prime factors in another. Finding all primes in that table is easy: select intcolumn where primecolumn = intcolumn, this finds all the primes in O((log N)^2), where N is the number of records, not the number of integers, assuming that the table is indexed. Obviously, that won't make me any money because it takes O(N!) to generate table, but that's not the querier's problem.

I agree with you that RDB is not the solution to every problem, anybody who would say that is as biased someone saying an RDB won't solve any problem

@asuffield said:

@JvdL said:

You say, "A relational database (ie., SQL) is not meant for searching".

Now I know that you're just trolling, because that is not a quote (and in fact, the string "search" had not previously appeared in this thread before your post, as anybody can see for themselves).

From your first comment

[quote user="asuffield"]

If you are doing complex queries on multiple fields that are not the key, you're probably doing something wrong (and if there's no way to avoid this, what you're doing wrong is using a relational database for something it isn't good at).

[/quote]

Forgive me for reading this as "A relational database isn't good at doing complex queries", and equating "query" with "search"

asuffield

@JvdL said:

@asuffield said:
@JvdL said:
You say, the query runs O(N^R). Even in the worst case, single table queries will never run more than O(N) in any RDB that I know.

You have a table containing integers, you want to select the ones that have no factors in the table (ie, the largest pairwise relatively prime subset). A single table, and if you have an O(N) solution, there's a lot of money waiting for you, and you've cracked most modern encryption algorithms. (There exists at least an O(N!) solution to this problem, and it is an unsolved problem as to whether any solutions exist with lower O() bounds)

I agree I typed a bit fast when saying "any single table query", however I fail to see the relation with finding factors of integers,because that's not a query problem but an algorithmic problem. The query version of your problem would be a table with two columns: for each integer there would be multiple records, with the integer in one column and one of its prime factors in another. Finding all primes in that table is easy: select intcolumn where primecolumn = intcolumn, this finds all the primes in O((log N)^2), where N is the number of records, not the number of integers, assuming that the table is indexed. Obviously, that won't make me any money because it takes O(N!) to generate table, but that's not the querier's problem.

I used the relatively prime factor problem because it's well-known; you get the same behaviour with any problem that requires the pairwise consideration of multiple rows at once, if the consideration function is not trivial (and therefore not indexable). The point is that relational databases don't have any options for this class of problems other than searching all pairs, even though there may exist algorithms which allow you to consider just some of the possible pairs - you can't express that sort of thing in relational algebra, because it's not a relation, it's some other kind of math.

However, it's worth noting that you have approached the problem by altering the schema, and that is exactly what I suggested the OP should think about doing. Pairwise comparison problems very rarely occur in databases, and you can normally change the schema to avoid them.

JvdL

@asuffield said:

I used the relatively prime factor problem because it's well-known; you get the same behaviour with any problem that requires the pairwise consideration of multiple rows at once, if the consideration function is not trivial (and therefore not indexable). The point is that relational databases don't have any options for this class of problems other than searching all pairs, even though there may exist algorithms which allow you to consider just some of the possible pairs - you can't express that sort of thing in relational algebra, because it's not a relation, it's some other kind of math.
However, it's worth noting that you have approached the problem by altering the schema, and that is exactly what I suggested the OP should think about doing. Pairwise comparison problems very rarely occur in databases, and you can normally change the schema to avoid them.

I altered the schema in the minimal way to transform your arithmetic problem into a query problem. Your line of reasong is "RDBs perform badly because cryptography is difficult". Duh. You might as well say "Grass is not green because the sky is blue". It bears no relationship.

Allow me to go slightly OT. You say that if I find an O(N) algorithm to factor primes I'll be a millionaire. Since I don't object to getting rich, explain me what's wrong with the following reasoning (I'll share the millions with you). Many cryptograhy schemes like public/private key pairs are based on the practical impossibility to factor a large number N = p*q into its prime factors, if p and q are big (~N/2). Yet there is a simple O(N) algorithm to solve it: just iterate all integers up to N and see if they divide N, you'll find p and q. That's easy! Of course, the problem is that N is so big you'll never get there in a lifetime...

Back to the topic, I'm still unable out to figure how an RDB would possibly manage to go O(N^R) on the original problem. Let's see you have records [A,1], [A,2], ... [Z,1], [Z,2]. That's N=52 records. All letters that have both a 1 and and a 2, so R=26. So N^R ~ 10^44.

Let's say an RDB takes 1 nanosecond to do one interation. Then it will take 10^35 seconds to perform the query, that is to say, about 10,000,000,000,000,000,000,000,000,000 years to search these 52 records for the answer. Somehow that doesn't match my experience.

RaspenJho

@ammoQ said:

select mt1.a from mytable mt1, mytable mt2 where mt1.b = 32 and mt1.a = mt2.a and mt2.b = 67;

Of course, if you have SQL2K5, you could do this for 1/2 the query cost...

[code]SELECT A
FROM mytable
PIVOT ( COUNT(B) FOR B IN ([32], [67])) PVT
WHERE PVT.[32] = 1
AND PVT.[67] = 1[/code]

asuffield

@JvdL said:

Allow me to go slightly OT. You say that if I find an O(N) algorithm to factor primes I'll be a millionaire. Since I don't object to getting rich, explain me what's wrong with the following reasoning (I'll share the millions with you). Many cryptograhy schemes like public/private key pairs are based on the practical impossibility to factor a large number N = p*q into its prime factors, if p and q are big (~N/2). Yet there is a simple O(N) algorithm to solve it: just iterate all integers up to N and see if they divide N, you'll find p and q. That's easy! Of course, the problem is that N is so big you'll never get there in a lifetime...

That is only O(N) if you can test for divisibility in constant time, which you can't for unlimited-precision numbers. However, it's actually not the same problem as the one I described - finding the true primes is much "easier" (O(N) combined with the complexity of divide, for the Sieve of Erastothenes or Atkin, basically what you described) than finding all the coprimes in a given set of integers (O(N!) by all-pairs comparison). If you can solve the coprime problem in O(N), where N is the number of integers in the set (not the value of the largest integer in the set) then you have a viable attack on all the coprime-based algorithms (like RSA).

Back to the topic, I'm still unable out to figure how an RDB would possibly manage to go O(N^R) on the original problem. Let's see you have records [A,1], [A,2], ... [Z,1], [Z,2]. That's N=52 records. All letters that have both a 1 and and a 2, so R=26. So N^R ~ 10^44.

Let's say an RDB takes 1 nanosecond to do one interation. Then it will take 10^35 seconds to perform the query, that is to say, about 10,000,000,000,000,000,000,000,000,000 years to search these 52 records for the answer. Somehow that doesn't match my experience.

Your experience is quite unlikely to have involved doing an all-permutations search through such a table. In this example, to get R=26, you would have to be constructing all permutations of length 26. The first combination that you examined would be:

[A,1] [A,1] [A,1] [A,1] [A,1] [A,1] [A,1] [A,1] [A,1] [A,1] [A,1] [A,1] [A,1] [A,1] [A,1] [A,1] [A,1] [A,1] [A,1] [A,1] [A,1] [A,1] [A,1] [A,1] [A,1] [A,1]

And the next would be:

[A,1] [A,1] [A,1] [A,1] [A,1] [A,1] [A,1] [A,1] [A,1] [A,1] [A,1] [A,1] [A,1] [A,1] [A,1] [A,1] [A,1] [A,1] [A,1] [A,1] [A,1] [A,1] [A,1] [A,1] [A,1] [A,2]

It should be easy to see that there are 52^26 combinations to inspect.

I cannot imagine why you would want a query like this. But that is how long it would take to run.

JvdL

@asuffield said:

That is only O(N) if you can test for divisibility in constant time, which you can't for unlimited-precision numbers. However, it's actually not the same problem as the one I described - finding the true primes is much "easier" (O(N) combined with the complexity of divide, for the Sieve of Erastothenes or Atkin, basically what you described) than finding all the coprimes in a given set of integers (O(N!) by all-pairs comparison). If you can solve the coprime problem in O(N), where N is the number of integers in the set (not the value of the largest integer in the set) then you have a viable attack on all the coprime-based algorithms (like RSA).

The problem I described is factoring integers, not finding primes. The former is much more difficult than the latter and is the basis of RSA public key cryptography. As usual, you stand me corrected about the O, it's between O(N^2) and O(N^3), depending on the method of division you apply - still a long way from O(N!).

@asuffield said:

Your experience is quite unlikely to have involved doing an all-permutations search through such a table. [...]

My point is that RDBs are smart enough not do an all-permutations search on the OP query.

asuffield

@JvdL said:

@asuffield said:

Your experience is quite unlikely to have involved doing an all-permutations search through such a table. [...]

My point is that RDBs are smart enough not do an all-permutations search on the OP query.

The OP didn't post a query. I doubt that either postgres or oracle can avoid doing an N^2 all-pairs search for ammoQ's first query. In general, queries of this form are too difficult for any of the RDBMSes to find faster solutions (although a few degenerate ones may be special-cased).

Jeff_S

@asuffield said:

@JvdL said:
@asuffield said:

Your experience is quite unlikely to have involved doing an all-permutations search through such a table. [...]

My point is that RDBs are smart enough not do an all-permutations search on the OP query.

The OP didn't post a query. I doubt that either postgres or oracle can avoid doing an N^2 all-pairs search for ammoQ's first query. In general, queries of this form are too difficult for any of the RDBMSes to find faster solutions (although a few degenerate ones may be special-cased).

I already gave you an example to demonstrate that it is very efficient and easy to query sets of data using pretty much any relational database that supports indexes, joins and sql syntax. Searching through data to find primes and factors through calculations of course isn't efficeint with a relational database, but that has absolutely NOTHING to do with the original question or anyone's comments on this thread.

asuffield

@Jeff S said:

Searching through data to find primes and factors through calculations of course isn't efficeint with a relational database, but that has absolutely NOTHING to do with the original question or anyone's comments on this thread.

Still obsessively wrong, despite repeated direction to read the thread - we've already established that it's not directly related to the original question, only to the other comments.

Really bored with this by now. Learn to read.

JvdL

@asuffield said:

I doubt that either postgres or oracle can avoid doing an N^2 all-pairs search for ammoQ's first query.

I can't believe I'm responding to this. Let's take a table with letters A..Z and numbers 1..100000, full density. That's N=2,600,000 rows. R=100,000 is the maximum number of rows with the same letter A/B/C, as defined by assufield in one of his or her previous posts. According to that post, an RDB will use 26000000^100000 iterations to solve this conundrum. For unexplained reasons assufield suddenly dropped that claim to 26000000^2 ~ 10^12, still a sizable number. I haven't got Oracle or Postgress around, so let's do it on SQL Server, using a mix of T-SQL and C#

Let's try ammoQ's offending query on my five-year old memory-constrained laptop. It runs at 2GHz, that's 2 clock cycles per nanosecond. Even under the impossible assumption that each iteration takes one clock cycle, assufield's betting it will take 10^(12-9) = 1000 seconds...

Let's put this claim to the test, using an enterprisey mix of mix of T-SQL, C# and XML.

-- create table:

CREATE TABLE dbo.assufield (letter varchar (50) NOT NULL, number int NOT NULL );
// Write 2,600,000 rows to tab separated file

string file = Path.GetTempFileName();

using(StreamWriter writer = new StreamWriter(file)) {

for(int i=1; i <= 100000; i++ )

for(char a='A'; a <= 'Z'; a++ )

writer.WriteLine("'{0}'\t{1}",a,i);

}
// Import the flat file to the DB

using( Transformer transformer = new Transformer(session) ) {

transformer.ImportFlatFileToSql(file,session.Database,"assufield",Transformer.FF_TSV);

}
-- index on letter and number:

CREATE INDEX IX_letter ON assufield(letter) ON [PRIMARY];

CREATE INDEX IX_number ON assufield(number) ON [PRIMARY];
<!-- ammoQ's badly nested query, and his joined improvement -->

<Queries>

<Query name="assufield/ammoQ-nested">

<Param name="@one" type="int" defaultValue="32" />

<Param name="@two" type="int" defaultValue="67" />

-- The offending query from ammoQ:

SELECT letter,number FROM [%DATABASE%].dbo.assufield a1

WHERE number in (@one,@two)

AND EXISTS (SELECT 'x' from [%DATABASE%].dbo.assufield a2

WHERE a2.letter=a1.letter AND a2.number in (@one,@two) AND a2.number<>a1.number);

/*

SET SHOWPLAN_TEXT ON

|--Nested Loops(Left Semi Join, OUTER REFERENCES:([a1].[letter], [a1].[number]))

|--Bookmark Lookup(BOOKMARK:([Bmk1000]), OBJECT:([assufield] AS [a1]))

|	|--Index Seek(OBJECT:([assufield].[IX_number] AS [a1]), SEEK:([a1].[number]=32 OR [a1].[number]=67) ORDERED FORWARD)

|--Filter(WHERE:([a2].[number]<>[a1].[number] AND ([a2].[number]=32 OR [a2].[number]=67)))

|--Bookmark Lookup(BOOKMARK:([Bmk1001]), OBJECT:([assufield] AS [a2]))

|--Index Seek(OBJECT:([assufield].[IX_letter] AS [a2]), SEEK:([a2].[letter]=[a1].[letter]) ORDERED FORWARD)

*/

</Report>
&lt;Query name="assufield/ammoQ-joined"&gt;
	&lt;Param name="@one" type="int" defaultValue="32" /&gt;
	&lt;Param name="@two" type="int" defaultValue="67" /&gt;
	-- The improved query from ammoQ:
	SELECT a1.letter FROM [%DATABASE%].dbo.assufield a1, [%DATABASE%].dbo.assufield a2
		WHERE a1.number = @one
		AND a1.letter = a2.letter
		AND a2.number = @two;
	/*
	SET SHOWPLAN_TEXT ON
	|--Hash Match(Inner Join, HASH:([a2].[letter])=([a1].[letter]), RESIDUAL:([a2].[letter]=[a1].[letter]))
		|--Bookmark Lookup(BOOKMARK:([Bmk1001]), OBJECT:([assufield] AS [a2]))
		|	|--Index Seek(OBJECT:([assufield].[IX_number] AS [a2]), SEEK:([a2].[number]=67) ORDERED FORWARD)
		|--Bookmark Lookup(BOOKMARK:([Bmk1000]), OBJECT:([assufield] AS [a1]))
			|--Index Seek(OBJECT:([assufield].[IX_number] AS [a1]), SEEK:([a1].[number]=32) ORDERED FORWARD)
	*/
&lt;/Report&gt;

</Queries>
// Run the queries

using( Session.Connection conn = session.CheckOpenConnection() )  {

foreach( string query in Queries ) {

long start, stop;

SqlCommand comm = session.QueryPool[query].CreateCommand(conn);

start = DateTime.Now.Ticks;

using( SqlDataReader reader = comm.ExecuteReader() ) {

while(reader.Read() )

Console.WriteLine(reader.GetValue(0));

}

stop = DateTime.Now.Ticks;

Console.WriteLine("Execute Query {0}: {1}ms",query,stop-start);

}

}

And what does it say in the console?
Execute Query assufield/ammoQ-nested: 0ms
Execute Query assufield/ammoQ-joined: 0ms

I rest my case.

Jeff_S

@asuffield said:

@Jeff S said:
Searching through data to find primes and factors through calculations of course isn't efficeint with a relational database, but that has absolutely NOTHING to do with the original question or anyone's comments on this thread.
Still obsessively wrong, despite repeated direction to read the thread - we've already established that it's not directly related to the original question, only to the other comments.
Really bored with this by now. Learn to read.

I responded and quoted DIRECTLY to a statement that you made, and you
ignored it completely and threw in an insult asking ME to read the
thread ???

Humor me. You are clearly much smarter than myself and everyone else here, please do me and society a favor by summarizing, in a clear, precise sentence or two, what your POINT is. I know, I know, I'm "not worth the time" , I "can't read so I still won't get it" and so on and so forth, I get it, but just for shits and giggles, it shouldn't take more than a few moments of your time, please humor me and others and write brief one or two sentence summary of what you are trying to say in this thread. yes, yes, if I "could actually read I could figure it myself" and "it will be way over my head since I am such a moron" and all that, etc, etc, please just take a moment and state your point. Thank you in advance.

Jeff_S

@JvdL said:

And what does it say in the console?

Execute Query assufield/ammoQ-nested: 0ms

Execute Query assufield/ammoQ-joined: 0ms

I rest my case.

JvdL -- if you put my simple SQL statement (repeated below, everyone seems to have missed it) in there, and examine the execution plan to the others, you will see it is even more efficient then either of those -- it only requires 1 pass through the entire table, O(n) at worst, with an index O (log n) at best -- further proving your (our) point. But I think it doesn't matter -- Ausfield is arguing about something he knows very little about, I doubt he will understand or acknowledge your efforts to explain it to him.

select a.letter
from assufield a
where a.number in (@one,@two)
having count(*) = 2

Note that if we *really* need to return results that echo out the numbers passed in per letter (I doubt it -- I am sure the key thing you are after is the letters that satisfy the condition of having both numbers passed in), you just wrap it up like this:

select a.letter, n.number
from (the above sql here ) a
cross join (select @one as number union all select @two) n

Ausfield -- Don't get confused by that cross join! It's not O(n^2) !! :)

JvdL

@Jeff S said:

JvdL -- if you put my simple SQL statement (repeated below, everyone seems to have missed it) in there, and examine the execution plan to the others, you will see it is even more efficient then either of those -- it only requires 1 pass through the entire table, O(n) at worst, with an index O (log n) at best -- further proving your (our) point.

You're right, credit where credit's due. To assufield's demise and Microsoft's glory, the tests show that RDB's are perfectly able and willing to optimize WTF queries (sorry, ammoQ) on large datasets into snappy algorithms.

ammoQ

@JvdL said:

WTF queries (sorry, ammoQ)

shame where shame is due

asuffield

@JvdL said:

@asuffield said:
I doubt that either postgres or oracle can avoid doing an N^2 all-pairs search for ammoQ's first query.

I can't believe I'm responding to this. Let's take a table with letters A..Z and numbers 1..100000, full density.

You're still obsessing over the OP's scenario, despite repeated statements that it's not directly relevant. You're not going to get it.

Jeff_S

@asuffield said:

@JvdL said:
@asuffield said:
I doubt that either postgres or oracle can avoid doing an N^2 all-pairs search for ammoQ's first query.

I can't believe I'm responding to this. Let's take a table with letters A..Z and numbers 1..100000, full density.
You're still obsessing over the OP's scenario, despite repeated statements that it's not directly relevant. You're not going to get it.

So, please MAKE YOUR POINT !!! We can only respond to what you are writing, and when we respond directly to your words, you keep telling us we are not listening or reading what you are writing. Dumb it down for us, make it clear and simple for us morons, what is your POINT ????

Jeff_S

Here is a summary of Ausfield's "words of wisdom" regarding relational databases.

Everything that is completely and unarguably WRONG as demonstrated over and over in this thread is highlighted in bold. Advice to Ausfield: read up on these silly little things called INDEXES.

Enjoy!

You should really reconsider your database structure. If you are doing complex queries on multiple fields that are not the key, you're probably doing something wrong (and if there's no way to avoid this, what you're doing wrong is using a relational database for something it isn't good at).
...
Without knowing anything about the structure of the database, it's impossible to suggest how to rearrange the tables sanely. It probably needs to be split into several tables.

...

Relational databases are good at doing complex operations on keys. They're quite clumsy at doing them on queries that don't involve any keys, as this example demonstrates. If you can't work at least one of these fields into a key, then some other kind of database may be more appropriate. Again, with only the information provided, it's impossible to suggest what kind.

...

Did you even read what I wrote? A relational database is a specific kind of database that is good at certain things and bad at others. Notably, it is bad at operations not involving any keys. There are many other kinds of database, few of which share this problem. The world is not limited to relational databases, flat files, and spreadsheets. It's impossible to suggest a better method without more detail on the problem, but in general a likely answer is an object database or a hierarchical database - most problems are effectively solved by at least one of the relational/object/hierarchical set of database models. Failing that, a raw database like Berkeley DB (from Oracle) will handle just about anything.

Turns out that ammoQ missed the simple approach in this specific case and I didn't notice either, but if he'd wanted to select all the rows then you have to do that set selection/compare dance, and that is quite resistant to effective indexing in a relational database. In general, any attempt to construct a complex query involving no keys will give you that kind of trouble, because your query will have to range over multiple rows in the same table, and that has polynomial complexity in a relational database.

The worst-case scenario for the relational model is queries which involve ranging over an unbounded number of rows; they have O(N^R) complexity for examining R rows (per output row) from a table containing N rows, and cannot be indexed if R is not constant. In this example, R is the maximum number of rows with the same letter A/B/C, and we haven't been told what that number is. If it has no fixed maximum, and the intent is to construct queries for any two arbitrary numbers and not just 32+67, there is no possible index for the entire query. (You can construct partial indexes, but they don't really help for large values of R)

...

I used the relatively prime factor problem because it's well-known; you get the same behaviour with any problem that requires the pairwise consideration of multiple rows at once, if the consideration function is not trivial (and therefore not indexable). The point is that relational databases don't have any options for this class of problems other than searching all pairs, even though there may exist algorithms which allow you to consider just *some* of the possible pairs - you can't express that sort of thing in relational algebra, because it's not a relation, it's some other kind of math.

However, it's worth noting that you have approached the problem by altering the schema, and that is exactly what I suggested the OP should think about doing. Pairwise comparison problems very rarely occur in databases, and you can normally change the schema to avoid them.

...
>>Can you explain again what is wrong with his schema?
No, because of the lack of information. I can, and have, explained why the example as given has terrible (exponential) complexity. Either the schema is wrong (because of facts we do not know about the structure of the data, which would permit a more efficient schema - likely, because problems with this complexity are rare) or a relational database is the wrong tool for the task.
>> If he is modeling a relation between two entities, (as a previous poster suggested with a great example), this is EXACTLY how you do in a relational database

Yes, if that's what he happens to be doing, there is no better approach than O(N^R) for this kind of query within a relational database and so you really want to use something else - there are efficient solutions to that problem, you just can't really do them within the constraints of a relational database. That is why I suggested the possibility of using something else in the first place.

...

There exists a large class of problems which do not have efficient solutions in relational databases. That's why we don't use relational databases for everything. This is well-established in the literature. One notable class is the set of problems which must find subsets of rows fulfilling certain (non-trivial) criteria, because a relational database can only do it by trying every permutation, giving O(N^R) or O(N!) behaviour, but faster solutions are known to exist for many of them.

...

The point is that relational databases don't have any options for this class of problems other than searching all pairs, even though there may exist algorithms which allow you to consider just *some* of the possible pairs - you can't express that sort of thing in relational algebra, because it's not a relation, it's some other kind of math.

Whew! that's an awful lot of completely, 100% wrong statements to make in a single thread. I think we have a new record.

By the way -- you are obsessing about finding matching pairs for some reason, and telling us how impossible it is to do efficiently with a relational database without "keys". My simple, junior programmer sql statement is easily modified to return matching rows for any number of attributes, not just two, and the cost is still O (N log N) at the WORST!

(Note: I was wrong previously saying the worst case, no index efficiency was O(N), since if there are no indexes at all, the table must be sorted).

But seriously, dude, stick to topics that you have some knowledge of. To discuss databases and not know basic SQL or the basics of indexes is embarrassing and really a waste of everyone's time here.

asuffield

@Jeff S said:

@asuffield said:
You're still obsessing over the OP's scenario, despite repeated statements that it's not directly relevant. You're not going to get it.
So, please MAKE YOUR POINT !!! We can only respond to what you are writing, and when we respond directly to your words, you keep telling us we are not listening or reading what you are writing. Dumb it down for us, make it clear and simple for us morons, what is your POINT ????

My rule of thumb is that everybody gets at most two explanations (counting only durable mediums that they can review at leisure, not unrecorded conversations) of the same thing, and then I quit bothering, because the chances of them getting it on any successive attempts are extremely low. You've had yours.

asuffield

@Jeff S said:

Everything that is completely and unarguably WRONG as demonstrated over and over in this thread is highlighted in bold.

Oh yes, let's play the game where we all sit around and say that the other people are wrong, and then say it again and again, waving our arms in the air and screaming "WRONG WRONG WRONG". I like to call this game "US politics".

Or better yet, let's not. It is a silly game.

Jeff_S

@asuffield said:

@Jeff S said:
Everything that is completely and unarguably WRONG as demonstrated over and over in this thread is highlighted in bold.

Oh yes, let's play the game where we all sit around and say that the other people are wrong, and then say it again and again, waving our arms in the air and screaming "WRONG WRONG WRONG". I like to call this game "US politics".
Or better yet, let's not. It is a silly game.

Nah, no thanks. Since I am not playing that game and didn't even come remotely close to doing it, no point in starting now. Nowhere did I shout "wrong, wrong" -- everything I quoted is something you wrote, and everything highlighted has been clearly and easily proven wrong in this very thread or is easily provable by anyone who has basic, common knowledge of SQL and relational databases. Maybe you should start actually reading what you write? Or, maybe you could actually stop being so damn condescending and actually join the discussion and try to actually provide concrete examples or proofs instead of shouting over and over that we are not reading what you are writing and that you are "tiring" of us? Just a thought. It would be great to have an intelligent discussion about this stuff, I am willing to listen and learn something new and accept the fact that I don't know everything, are you?

You refuse to address direct questions and examples that are in direct response to things you write, you refuse to clarify any of your positions, and you refuse to provide any specific examples at all to back up anything that you are saying. You're just throwing stuff out there to make yourself sound smart and whenever you are asked to actually back up your statements, you keep either a) ignoring what we write completely or b) accusing us not reading what you are writing. I think maybe what you should be accusing us of, instead, is a) not reading your mind to understand what you are thinking or b) refusing to simply accept your "alternate reality" and having the annoying tendency to question completely ignorant and incorrect blanket statements about the efficiency of relational databases from someone who has so far demonstrated very little knowledge about the topic. Feel free to prove us wrong. Otherwise, we can chalk this up to you simply bs'ing and trying to look and sound smart and hoping that maybe no one is paying close enough attention to expose you. Which is, unfortunately, what has happened so far .... until you simply decide to write a clear, concise, focused sentence or two and/or provide a simple example that shows otherwise, of course.

By the way, if you insist (just ask nicely), I would be happy to address your false statements one by one with either references or simple examples. Just let me know, it would be no problem at all.

asuffield

@Jeff S said:

Feel free to prove us wrong. Otherwise, we can chalk this up to you simply bs'ing and trying to look and sound smart and hoping that maybe no one is paying close enough attention to expose you.

See, there's the difference between us. You're playing to the crowd and fussing over what other people think, but I was only interested in talking about the OP's problem (or related problems), and I have no interest in a long argument about whose genitalia are larger. I really don't give a damn what you or anybody else thinks, and have only bothered to respond to those parts that were either direct questions or things that I particularly wanted to comment on (and usually only very briefly, most of this thread hasn't been worth more than a minute or two of writing). Primarily I just provided enough information for you to figure it out if you wanted to, because I have an overdeveloped sense of intellectual responsibility ("if you know something interesting or important, you should pass it on"). I have better things to do than "prove you wrong", so I'm not going to bother.

Random observation: if everybody who looked at "I have a truly marvellous proof of this proposition which this margin is too narrow to contain" had said "he doesn't know anything, he's just trying to sound smart" then we wouldn't have a lot of very interesting results in a whole range of fields of mathematics. That kind of thinking does not lead to progress.

Jeff_S

@asuffield said:

@Jeff S said:
Feel free to prove us wrong. Otherwise, we can chalk this up to you simply bs'ing and trying to look and sound smart and hoping that maybe no one is paying close enough attention to expose you.
See, there's the difference between us. You're playing to the crowd and fussing over what other people think, but I was only interested in talking about the OP's problem (or related problems), and I have no interest in a long argument about whose genitalia are larger.

Are you reading what you are writing? Did you even read this thread at all? You spent the ENTIRE thread over and over talking about things OTHER than what the OP needed! You made blanket statements about relational databases and how they cannot solve his problem or any related problem with any efficiency since "keys were not involved", you were quoted directly and questioned directly on this and each time you refused to address it! I don't give a crap what you think about me or my opinions, I haven't stated a single opinion this ENTIRE thread -- only you have! We have all been giving facts and concrete examples, that all DIRECTLY PERTAIN to the OP's question and directly related questions, you only go on and on about random theoretical bullshit that has no connection at all to how databases work and that directly contradicts the workings of simple indexes and simple SQL statements and execution plans! I don't care if anyone thinks I am smart, hell I definitely am not, I just know the absolute basics and I know enough to know that YOU are giving wrong and misleading information regarding how you THINK databases work to someone asking for help and now you are trying so damn hard to defend your statements by using insults, refusing to answer direct questions, and changing the subject over and over! Instead of simply saying "oh, that's all you need, some indexes and a a simple SELECT with a GROUP BY? Ah, how did I miss that, makes sense to me!" (which pertains DIRECTLY to the OP's question, of course) you spend the entire thread misdirecting and making a fool of yourself. I even SAID over and over that my solution is simple, basic SQL and we proved that it is not as inefficient as you think. Where have I ever implied that I gave a crap who thinks I smart or clever? I'm not! I said as much! This is SQL 101 stuff, you learn about it in day 10 of "learning sql in 21 days". It's truly sad that you are not capable of recognizing or acknowledging this.

What's even sadder is, this is a programming forum, and I have asked over and over for you to simply help US understand with a simple, specific example a situation where a sql database is not good for this situation, where it truly will require O(n^r) to find simple matching attributes for data, and you REFUSE to answer! You can't even do it! I am not looking to prove you wrong, I am looking to learn and find the right answer, it's really that simple for me. But, I guess you are not here to help people, or to learn, or to give sound advice, or to intelligently debate some topics -- you are simply here to demonstrate to us your intelligence. Unfortunately for you, you have succeeded in doing that.

revenade

Hello Guys,

Actually I haven't check the forum after I got my answer but today I saw that I got really amazing answers from all of you. I want to make it clear that the table is a result of a big query which I put into the temp table to get the results that I wanted.

Thanks for all answers again.

ammoQ

@Jeff S said:

select a.letter
from assufield a
where a.number in (@one,@two)
having count(*) = 2

Isn't there a "group by a.letter" missing, or does SQLServer insert that automagically? (Just curious - Oracle wouldn't like that query, but on the other hand, the group by clause could easily be determined automatically)

Of course this solution depends on the uniquenss of (letter,number); we might turn that to (Oracle syntax)

select a.letter
from assufield a
where a.number in (:one, :two)
group by a.letter
having min(a.number)=least(:one,:two)
and max(a.number)=greatest(:one,:two)

Jeff_S

@ammoQ said:

@Jeff S said:
select a.letter
from assufield a
where a.number in (@one,@two)
having count(*) = 2
Isn't there a "group by a.letter" missing, or does SQLServer insert that automagically? (Just curious - Oracle wouldn't like that query, but on the other hand, the group by clause could easily be determined automatically)
Of course this solution depends on the uniquenss of (letter,number); we might turn that to (Oracle syntax)
select a.letter
from assufield a
where a.number in (:one, :two)
group by a.letter
having min(a.number)=least(:one,:two)
and max(a.number)=greatest(:one,:two)

Yes! good catch, my mistake -- you need to group by letter. I had that in my original sql demonstrating the technique, but forgot it in the last two .... Thanks!! (see how easy it is to admit when you make a mistake or oversight, Ausiffield? )

As for the alternate solution, you'd have to check the execution plans to see which is more efficient, but that would certainly work. The big difference is that will only work for if you need to find letters that match exactly 2 numbers, and not for any number of numbers, so it is a bit more specific. Of course, it all depends on what you need.

JvdL

America's State Unit For Federal Intelligence Erratic Logic Database = a very expensive turing machine, that matches any input against the digitized version of Donald E. Knuth's The Art of Computer Programming using an O(N!) all-pairs matching algorithm. Unfortunately, the algorithm usually runs out of memory and halts with output that has only a 1/N! probability of being remotely related to the input. Some evil hax0r seems to have connected this machine to an RSS feed of WTF...

Jeff_S

@JvdL said:

America's State Unit For Federal Intelligence Erratic Logic Database = a very expensive turing machine, that matches any input against the digitized version of Donald E. Knuth's The Art of Computer Programming using an O(N!) all-pairs matching algorithm. Unfortunately, the algorithm usually runs out of memory and halts with output that has only a 1/N! probability of being remotely related to the input. Some evil hax0r seems to have connected this machine to an RSS feed of WTF...

Well done! :)