Best way to do this in SQL?

Jaloopa

I'm trying to put together a database model for a process that involves a series of steps to be performed in order. There are multiple step types (around 10) that all require different information, and each process instance needs an ordered list of steps to run through.

I don't want a single step table with columns for each type, most of which will end up being null. I also don't think it's complex enough to justify going full EAV, especially since I can't think of an obvious way to ensure that every Type1Step has all attributes required by a Type1Step, and no others.

I started going down the route of a table for each step type, with a StepTypeID column in each table that has a default and check constraint to effectively fix it to a particular value for each table, but then I'd lose the ability to have a foreign key relationship linking the two.

This doesn't seem like an unheard of use case. Is there a pattern name I could be searching for or some standardised way of approaching this sort of problem?

dkf

I'd be tempted to make each step type be its own table, but without any values that are fixed for that step type. I'd then make a view that combines the various steps together with their defaults (such as a step type code) and yes, I'd expect plenty of NULL columns in there; a Type2FooStep doesn't know what to do with a Type3BarStep's type-specific columns. However, I don't know if you can have a FK relationship with a view.

AlexMedia

Do you always have the same amount of steps and amount of information needed, or is there a chance that steps may be skipped, that parts of the step can be skipped, or that different responses can be given based on previously provided responses?

If the number of steps and sub-steps is fixed, the most sane approach would probably be to keep a series of Step#Tables and a master Process-table which links to the PK's of the rows of the individual steps. If it can all be dynamic, you're looking at some sort of questionnaire and you should adopt a table structure usually seen with those kinds of products. You might even want to consider using NoSQL if you feel that that approach better suits your model.

Jaloopa

@AlexMedia said in Best way to do this in SQL?:

Do you always have the same amount of steps and amount of information needed

It's an automated testing framework to allow a user to define a process and check certain parameters along the way. Each step is a simple action like "go to this URL", "enter text in this box", "check the value of this element is what you expect", etc.

The idea is that testers can build up a set of tests, each of which is a different combination of separate steps, so the number in each process instance could vary a lot.

gleemonk

If you have a separate table per step type, adding a step type is expensive. How likely are you to add new step types, how many are there anyway?

Is it a requirement that step parameter types and constraints are in the DB schema, or is that just a design decision and step parameters could be validated before storing them? A simple solution would be to store parameter values as serialized string in a single field. Sure that makes parameter values opaque to SQL, so you couldn't manage them directly in SQL. Do you need that?

Bonus: To improve your WTF score, you could use a different format per step type. So one step serializes its parameters to XML, another to JSON, and the third to YAML. There are so many formats to choose that you can add a lot of step types before you run out.

Jaloopa

@gleemonk said in Best way to do this in SQL?:

How likely are you to add new step types, how many are there anyway?

If any are added after the initial design it should be a fairly rare occurrence. Hopefully.

@gleemonk said in Best way to do this in SQL?:

Is it a requirement that step parameter types and constraints are in the DB schema, or is that just a design decision and step parameters could be validated before storing them?

Design decision. They need to be stored, and reinveniting the kind of checks SQL does seemed silly. Of course, now I'm fleshing it out more the constraints aren't really what SQL is optimised for.

@gleemonk said in Best way to do this in SQL?:

A simple solution would be to store parameter values as serialized string in a single field. Sure that makes parameter values opaque to SQL, so you couldn't manage them directly in SQL. Do you need that?

This should all be viewable from the front end. The backend storage is kind of irrelevant, but SQL seems less WTFy than making an XML schema or something

@gleemonk said in Best way to do this in SQL?:

Bonus: To improve your WTF score, you could use a different format per step type. So one step serializes its parameters to XML, another to JSON, and the third to YAML. There are so many formats to choose that you can add a lot of step types before you run out.

I hate you

Dragnslcr

@gleemonk said in Best way to do this in SQL?:

A simple solution would be to store parameter values as serialized string in a single field. Sure that makes parameter values opaque to SQL, so you couldn't manage them directly in SQL. Do you need that?

It might not be as bad as you think, depending on which database system you're using. Postgres, Oracle, and MSSQL have XML functions, and at least Postgres has JSON functions, so you can still unpack the single string into separate values if you need to. It's probably 50/50 on whether that's better or worse than having a bunch of NULL values in every row, though.

PleegWat

@Jaloopa I think you need both. You need a STEP table which has the common columns for all step types (at least PROCESS_ID, STEP_ID, STEP_TYPE, STEP_SEQNO). You also want separate tables per step type for the type-specific information (FK to STEP)

You may not strictly need STEP_ID (use PROCESS_ID+STEP_SEQNO instead) or STEP_TYPE, but your life is probably easier if you include them.

Jaloopa

@PleegWat that's the sort of approach I was moving towards I think. Or a check constraint with a function to effectively duplicate the foreign key behaviour

CREATE FUNCTION CheckStep(@StepTypeId INT, @StepId INT)
RETURNS INT
AS BEGIN
	DECLARE @StepExists INT = 0
	IF @StepTypeId = 1
	BEGIN
		SET @StepExists = CASE WHEN EXISTS (SELECT * FROM GoToURLStep WHERE ID = @StepId) THEN 1 ELSE 0 END
	END
	ELSE IF @StepTypeId = 2
	BEGIN
		SET @StepExists = CASE WHEN EXISTS (SELECT * FROM FillInputStep WHERE ID = @StepId) THEN 1 ELSE 0 END
	END
	ELSE BEGIN
		SET @StepExists = 0
	END
	RETURN @StepExists
END

PleegWat

@Jaloopa Huh, where is that getting the step ID+type from if you don't have a STEP table?

You may want a constraint to verify the appropriate child table contains a record and the others do not, but that will be tricky since you also want an FK check on the child table. Your check constraint is the other way round. However such a 'reverse' FK check may be tricky when inserting things (disable constraint when inserting, or only check on commit).

Jaloopa

@PleegWat Step ID in this is the primary key of the sub step table. Step type is a column each sub type has, constrained to a different value for each table. The function is run as a check constraint on the process table, so checks that the step type and ID entered exist in the correct table.

masonwheeler

@Jaloopa said in Best way to do this in SQL?:

It's an automated testing framework to allow a user to define a process and check certain parameters along the way.

Then why are you trying to model it as a database table rather than as a script?

anotherusername

@Jaloopa said in Best way to do this in SQL?:

with a StepTypeID column in each table that has a default and check constraint to effectively fix it to a particular value for each table, but then I'd lose the ability to have a foreign key relationship linking the two.

Composite foreign keys are a thing.

Also, this isn't really an SQL question; it's a database schema question.

Jaloopa

@anotherusername yeah, but not foreign keys that can reference multiple tables AFAIK

Jaloopa

@masonwheeler said in Best way to do this in SQL?:

Then why are you trying to model it as a database table rather than as a script?

In code, I'll have a selection of test objects, each of which has a collection of step objects. Maybe I've been using too much Entity Framework recently, but that looks like a collection of relational items to me, so a database makes sense

PleegWat

@Jaloopa The foreign key is from the STEP_* tables to the PROCESS (or TEST) table. The reason to have a global STEP table is so you can centrally define STEP_SEQNO, making sure it doesn't have clashes.
The only reason to have foreign keys onto STEP_* is if you have some additional (child) table which references steps. But I don't see why the test itself would reference a step.

Maybe if you have branching logic and one or more NEXT_STEP columns in your STEP_* rather than a STEP_SEQNO? In that case I'd say you definitely want a central administration of steps.

djls45

@Jaloopa A composite FK can reference a composite PK for another table, which itself could include a FK to a third table.

djls45

@anotherusername said in Best way to do this in SQL?:

database schema question

@Jaloopa, would a layout like this be useful?

Table	Process	StepType	Step	ProcessRun	StepValues
Columns	Process_ID (PK)	Process_ID (FK) (PK)	Process_ID (FK) (PK)	Process_ID (FK) (PK)	Process_ID (FK) (PK)
	...	StepType_ID (PK)	StepType_ID (FK) (PK)	Run_ID (PK)	Run_ID (FK) (PK)
		...	Step_ID (PK)	Run_DateTime	Step_ID (FK) (PK)
			Step_Order or Next_Step_ID (FK)	Running_User	Entered_Value
			...	...	...

cartman82

If all you need is to store step information, maybe you can get away with just a single steps table with common columns, and store anything specific for each
step type as a blob of JSON. If your DB can query JSON, even better.

The proper relational way to do this would be to store these additional blobs of data in FK-connected tables per step type. Depends on how normalized you want to get.

anotherusername

@cartman82 said in Best way to do this in SQL?:

The proper relational way to do this would be to store these additional blobs of data in FK-connected tables per step type. Depends on how normalized you want to get.

Yeah... I think he's asking how to do that.

cartman82

@anotherusername said in Best way to do this in SQL?:

Yeah... I think he's asking how to do that.

CREATE TABLE steps {
  id INT PRIMARY KEY,
  seq_id INT,
  seq_number: INT,
  step_type: VARCHAR,
  FOREIGN KEY (seq_id) REFERENCES sequences (id)
}

CREATE TABLE step_metadata_type_a {
  id INT PRIMARY KEY,
  step_id INT,
  column1 VARCHAR,
  column2 TEXT,
  FOREIGN KEY (step_id) REFERENCES steps(id)
}

CREATE TABLE step_metadata_type_b {
  id INT PRIMARY KEY,
  step_id INT,
  some_other_data BLOB
  FOREIGN KEY (step_id) REFERENCES steps(id)
}

...

INSERT INTO sequences(id) VALUES (1000);
INSERT INTO steps (id, seq_id, step_number, step_type) VALUES (1, 1000, 1, 'a');
INSERT INTO steps (id, seq_id, step_number, step_type) VALUES (2, 1000, 2, 'b');
INSERT INTO step_metadata_type_a (id, step_id, column1) VALUES (1, 1, 'whatever');

etc..

The JSON solution would involve discarding the metadata tables, and keeping that in a metadata blob field in the steps table.

blakeyrat

@PleegWat said in Best way to do this in SQL?:

I think you need both. You need a STEP table which has the common columns for all step types (at least PROCESS_ID, STEP_ID, STEP_TYPE, STEP_SEQNO). You also want separate tables per step type for the type-specific information (FK to STEP)

Basically a star-schema off the "Step" table.

@PleegWat said in Best way to do this in SQL?:

You may not strictly need STEP_ID (use PROCESS_ID+STEP_SEQNO instead) or STEP_TYPE, but your life is probably easier if you include them.

Please always add a auto-generated ID column, even if you do end up joining on process + sequence number.

PleegWat

@djls45 I'm not sure how your StepType/StepTypeDefinitions works; are you going EAV there?

As I understand it he has a base class ProcessStep which is subclassed to ProcessStepTypeA, ProcessStepTypeB, etc. So defining step types generically doesn't really make sense.

djls45

@PleegWat said in Best way to do this in SQL?:

@djls45 I'm not sure how your StepType/StepTypeDefinitions works; are you going EAV there?

I guess it's something sort of like that. It's not quite as simple as that, though; it might be kind of mixing two levels of EAV together. (I've updated the table above to correct the expression of how I envisioned this scenario.)

The Process is the primary entity. The attributes for Process include the Steps for each process and the StepTypes that are used in them. Each Step contains a link to the StepType that should be performed at that Step, so the Steps are a kind of value for the StepTypes, but they are also themselves attributes for the Process. If a process can be run multiple times, the Process_Run table keeps track of each run, and then the StepValues stores the actual values for each Step during a particular run.

The EAV connections can get kind of complicated if you try to think of it that way, but the two primary ones would be Process-StepType-Step and ProcessRun-Step-StepValues. The first one is set up when a new kind of process is added and configured. It would rarely be reconfigured, but could be. The second stores the information for each execution of the process.

UndergroundCode

It sounds like your users will be defining a bunch of processes, each of which holds a list of various steps in various orders. So you clearly gotta have a process table, and a steps table with a FK out to the process table and an order column, so you can query the steps table by the process id and order by the step order to get your steps in order.

Sounds like the trick is how to store the specific info for each step type then. There's several options, depending on your DB, ORM, and how you're inclined to do things. Main ones I can think of:

Your steps table has a column for step type and a column for a blob of JSON or XML or something. You serialize all of the step info out to your blob column, and on the way back in, you instantiate the appropriate step type and populate it with the blob. Some DBs can separate these out into columns for you if that makes sense.
Your steps table has a column for step type, and there's an EAV table for the info unique to step types. Can grab 'em all with a SQL query for each step. Also means either an extra query per step, or you get a big blob back on the main query and have to sort it out in code. Shouldn't be too big a deal if there's on the order of a dozen steps in most processes.
A steps table and a separate table for each step type holding the info for that type, each with a FK back to the main step table. For better or worse, each step property is a SQL table column, with possible DB constraints and requiring migrations to change. May be good when it forces you to think about how to add that property to steps already defined.

djls45

@UndergroundCode Wouldn't #3 allow multiple step types for any given step? And even if that was the desired format, they would all have to be queried/joined to make sure none of the them are missed. I can see lots of maintenance headaches doing it that way.

Captain

@AlexMedia said in Best way to do this in SQL?:

Do you always have the same amount of steps and amount of information needed,

Those are "applicative functors".

or is there a chance that steps may be skipped, that parts of the step can be skipped, or that different responses can be given based on previously provided responses?

You need a monad for that. That's a clue to use an imperative language, not SQL, or you'll be stuck implementing a monad on a category of queries, using a language wholly unsuited for the task.

UndergroundCode

@djls45 The ORMs that I've worked with that handle polymorphism all handle it like that. Yes, it does lead to rather long queries, since you have to left join every type table into the query and check for null on every type table's primary key to find the type, at least if your ORM isn't already doing it for you. In that scheme, I don't know any way to forbid creating multiple type rows for a step at the DB level, so it's pretty much "don't do that" or use the ORM that won't let you do that. It isn't perfect, but I don't know of any better way to hold the info for objects that could be multiple types like that in a typed RDBMS schema, vs. the other two basic ideas, being all strings or JSON blobs.

Dragnslcr

@Jaloopa What database system are you using? It might be worth experimenting with table inheritance to see if it will handle everything you need.

Jaloopa

@Dragnslcr said in Best way to do this in SQL?:

What database system are you using?

SQL Server.

@Dragnslcr said in Best way to do this in SQL?:

table inheritance

I haven't heard of that. I'll have to read up on it

bjolling

@Jaloopa I've done this using the table-per-type pattern, but we only had a few types of steps so that was quite manageable. But I think the problem is already in your question: "how to do this in SQL". Seems a lot simpler to solve in a NoSQL when you want the flexibility of many different types of steps.

Dragnslcr

@Jaloopa said in Best way to do this in SQL?:

@Dragnslcr said in Best way to do this in SQL?:

What database system are you using?

SQL Server.

@Dragnslcr said in Best way to do this in SQL?:

table inheritance

I haven't heard of that. I'll have to read up on it

I couldn't find anything with a quick search on Google about SQL Server supporting table inheritance. Maybe it's a feature that's specific to Postgres.