A 908,000+ record insert query (PHP issue)

Consul

Although calling this a PHP issue is kinda moot.

I was hired to do up a quick import script to pull third party information into a database. I was given some test data, access to the server I'm importing to, and a good luck message. In short order, I managed to figure out the schema, and get a PHP script working that could import to it properly. This information updates every day, some of the records (one per line, tab-delimited) update, someare removed, and some new ones are added. Because the third party simply publishes all of its information in one large file (no diffs or anything that would make my life easy), I was asked to simply delete all information from this third party from the database before importing the new data, and it would always be accurate. Not my ideal way of operating, but fine. Their money spends as well as anyone else's.

Well, today I finally got to deploy the script with real data. I received information on how to automatically pull the data from the third party's FTP server. I went and FTPd manually to their server, grabbed the data file, and took a look at it locally. It was a 75MB zip file. It expanded out to a 760MB text file.

The fact that I couldn't even open it in any of the text editors I had handy should've been clue number one. Well, actually, it was, but I should've trusted myself more. Finally, thanks to Linux and the less command, I was able to take a look, confirm the data looks the way I expected it to, then I tried running the script.

30 minutes later, I called to have the process killed.

For grins, and perhaps because I should've done this to begin with, I did a wc -l on the file, to count the lines. 908,556. One of those is a header, so you can remove one. In 30 minutes, we managed to import about 41,200. And that was without the delete needing to be run first. This clearly has not been my day.

The server is a dedicated managed server of unknown spec, running MySQL 5, PHP 5, and Linux of unknown flavor. I'm not the world's greatest PHP coder or admin, but I have a feeling it will never be able to handle the kind of load we're asking, no matter what I do. And this is where I turn to you, the ever-so-smarter-than-me WTF crowd, to see if there is anything I can do to make this work.

Obviously, simply deleting every entry and recreating it is out (to be fair to myself, I reeled from this suggestion in the first place, but as I said, it's their money). It's going to be hard enough to finally get that first import done. After that, if I can't get diffs from the third party, I'll have to try to make my own, which will take up a lot of disk space, and possibly still not work.

In short, any ideas? Is this doomed to failure? Thank you for your help, and for letting me post this story for my first ever post.

morbiuswilters

You're doing this as one-line at a time PHP inserts? You can make it more efficient by doing multi-inserts. This is done with they syntax "INSERT INTO tbl (field1, field2) VALUES ('value1a', 'value1b'), ('value2a', 'value2b'), ..." and so on. You should limit this to around 256k per insert statement, though, as MySQL will not accept a command larger than 512k or so by default. Additionally, if the table is InnoDB, it would be beneficial to do the entire thing as one transaction. PHP should not be the bottleneck here, it's clearly MySQL. Generally you should be able to load about ~5GB an hour into MySQL when using single-transaction inserts in InnoDB. For MyISAM, the number is about the same but you need to disable index updates before you begin and enable them afterwards. Check out the MySQL manual for info on disable keys.

If you check out the mysqldump and mysql commands you can see how they optimize the quick loading of data into MySQL. Generally they use the same tricks I mentioned above, but it can give you a good idea of how to approach things. Once again, your bottleneck here is going to be completely on the DB end. Finally, if you are truly deleting every row from the table before inserting, it would be best to do a drop table or truncate instead of a delete. Deletes get added to the transaction log and have to be processed per-row whereas truncates and drops tend to be instant.

Consul

Well, I can certainly modify the script to do multi-inserts. I realize that that is the most efficient way to work, but to be honest, before today, it's never been an issue. I've also never dealt with transactions (again, I've never dealt with this scale before), but I'll certainly read up on it.

I most definitely cannot drop the table or depend on truncation, though, as there are other users that insert and update information for this application. There is other data in this table that can appear from anyone at any time, in other words. This script is supposed to run from cron at night. We're definitely going to look into getting incremental updates from our third-party provider

Thank you very much for taking the time to answer my question.

lpope187

Well as I see it, you've got a few choices.

Bulk insert into a staging table (load from infile is the clause I believe), do an update on matching records, deletes on those that are in existing but not staging, and then an insert on those that are only in staging.
Select existing table in to hash table, read file into hash table and do same as above in code rather than sql.
Get provider to add tracking columns (DateCreated, DateModified, DateDeleted), store last run timestamp somewhere and only process records that have been modified since last run date.

There might be other options but off the top of my head those are what came to me. I'd probably go for the bulk insert first, then see if the provider can supply tracking columns in the future - preferably only those records matching the given last run date / timestamp.

bjolling

@Consul said:

I was asked to simply delete all information from this third party from the database before importing the new data

@Consul said:

I most definitely cannot drop the table or depend on truncation, though, as there are other users that insert and update information for this application. There is other data in this table that can appear from anyone at any time, in other words.

Now you've lost me.If you cannot do a "TRUNCATE [tablename]" then you also cannot do a "DELETE [tablename]" for exactly the same reasons...

brazzy

@bjolling said:

@Consul said:
I was asked to simply delete all information from this third party from the database before importing the new data

@Consul said:
I most definitely cannot drop the table or depend on truncation, though, as there are other users that insert and update information for this application. There is other data in this table that can appear from anyone at any time, in other words.

Now you've lost me.If you cannot do a "TRUNCATE [tablename]" then you also cannot do a "DELETE [tablename]" for exactly the same reasons...

I believe he was talking about DELETE FROM [tablename] WHERE submitter_ID='42'

brazzy

@Consul said:

I've also never dealt with transactions (again, I've never dealt with this scale before), but I'll certainly read up on it.

You really, really should. To be honest, it should be considered gross negligence to let anyone who doesn't understand transactions touch a database. It's not a amtter of performance (except that, as in this case, having one big transaction is better than having hundreds of thousands of implicit smaller ones). It's a matter of not destroying the data.

dhromed

@bjolling said:

Now you've lost me.If you cannot do a "TRUNCATE [tablename]" then you also cannot do a "DELETE [tablename]" for exactly the same reasons...

If the target table's contents are not exclusively what the OP is importing into it, you cannot boldly truncate. Truncating would then not only delete the (partially) out of date third-party data, but also the data that already lived there, and belonged to the OP's own users.

morbiuswilters

@brazzy said:

To be honest, it should be considered gross negligence to let anyone who doesn't understand transactions touch a database. It's not a amtter of performance (except that, as in this case, having one big transaction is better than having hundreds of thousands of implicit smaller ones). It's a matter of not destroying the data.

I don't think transactions are always necessary for every scenario, including the one that the OP is referring to. It's probably just a case of the table defaulting to Inno and nobody bothering to change it. In the OP's case the transactions aren't so much for data integrity as for performance due to the default configuration the customer is using.

Consul

To be honest, I agree on the transactions issue. I'm a coder, not a database admin. (Dammit, Jim!...) Nevertheless, I'll have to be more careful in the future, and will read up on transactions, as I've never dealt with this amount of data before. I was completely blindsided by it, actually.

I'm still left with the issue, though, of how to manage this data efficiently, or at all, for that matter. I'm trying to get in touch with the third party provider to see if there are other ways they can export the data for us.

Consul

Guess what? We're running MyISAM tables! So, no transactions anyway.

danixdefcon5

@Consul said:

Guess what? We're running MyISAM tables! So, no transactions anyway.

ALTER TABLE whatever ENGINE=InnoDB;

That will enable transactions for you. Anyway, you can do either morbius' multi-insert approach, or use Prepared Statements (I really don't know how to do that in PHP, I've used them on Java though). Prepared Statements are cached server-side, so you only send the INSERT INTO table VALUES (?,?,?,?) once, and all subsequent calls only send the "?" values.

I strongly recommend transactions, especially when you're doing something like a DELETE FROM ... which, depending on what solution you take, could leave you with half-updated stuff which may or may not have an impact when you re-start the process. I had to deal with something like this 3 years ago, my solution was to use a temp table which loaded the file, flagged all records in original table as "old", then go through the whole temp table to INSERT/REPLACE the "new" data. Finally, DELETE all rows which were still flagged as "old" (the insert/replace cleared that flag) and COMMIT. The final step (emptying the temp table) could be done now with a simple TRUNCATE TABLE.

This process could work without transactions, but I'd recommend using them anyway.

Why can't you make a program that takes yesterdays data and does a diff on it with todays data?

morbiuswilters

@danixdefcon5 said:

ALTER TABLE whatever ENGINE=InnoDB;

He probably isn't supposed to modify the table type.

@danixdefcon5 said:

Prepared Statements are cached server-side, so you only send the INSERT INTO table VALUES (?,?,?,?) once, and all subsequent calls only send the "?" values.

Prepared statements usually give very little performance benefit. The bottleneck should not be network I/O or the speed of the CPU parsing the queries, but instead will be writing all of that data to disk. I can't see prepared statements making that big of a difference.

@danixdefcon5 said:

I strongly recommend transactions, especially when you're doing something like a DELETE FROM ... which, depending on what solution you take, could leave you with half-updated stuff which may or may not have an impact when you re-start the process.

For this much data, I don't think transactions are going to be feasible. I also don't think they're all that necessary. He's doing the inserts and updates first and I doubt the client wants him to roll back if for some reason the process finishes. Transactions make sense when you are dealing with individual pieces of data that are linked, but with this much data it seems like rolling back 500,000 inserts would just be absurd.

morbiuswilters

@tster said:

Why can't you make a program that takes yesterdays data and does a diff on it with todays data?

I thought of that, but unless the dump was sorted the same every day it would be hard to tell line-by-line where things had been updated, added or deleted. If you're not doing it line-by-line then you are essentially just making a table with yesterday's dump and then running today's dump against it, which is going to be just as slow or slower than hitting the DB. Finally, it also sounds like the data is being updated from another source so he can't assume that yesterday's dump is the state the data will be in when he goes to process today's dump.

ajp

can't you use LOAD DATA INFILE syntax?

Daid

Run the script over the weekend. It just needs two days to finish.

piptheGeek

No problem then! we have a database upgrade running (3rd party app, nothing to do with me) that is expected to take a few months to convert all the old data into the new database.

Being slightly more helpful, we have a database that holds about 27 million records and needs to be updated from CSV file every quarter. We load the data into a temp table, do some processing, then run set based statements to process inserts, updates and deletes. I don't know much about MySQL as we are MS SQL but the idea of bulk insert (using LOAD DATA INFILE) to temp table then set based operations to merge the data with the main table feels like a good way to go.

Consul

Morbiuswilters is pretty much right on the money. The 3rd party data provider does no sorting, instead caring only about the efficiency of their export. I've been in email contact with their database admins, and they seem confused as to why we're having so much trouble with wanting to delete and re-load this data every night. Apparently they don't understand that we don't have a 16-core Unix machine dedicated to our database. Also, I seriously doubt I'll be allowed to touch the table type on the database, but that sounds like it might be a moot point, anyway.

The only solution that might be feasable is to dedicate a new table to this data, and then see if I can modify our front-end application (which is also third party, has been functioning well, and will be a hard sell) to use this new table for this third party data. That way, I can do a truncate every night, and hope for a more effiecient way of reloading it. I'll be discussing this with the clients today.

@morbiuswilters said:

@tster said:
Why can't you make a program that takes yesterdays data and does a diff on it with todays data?
I thought of that, but unless the dump was sorted the same every day it would be hard to tell line-by-line where things had been updated, added or deleted. If you're not doing it line-by-line then you are essentially just making a table with yesterday's dump and then running today's dump against it, which is going to be just as slow or slower than hitting the DB. Finally, it also sounds like the data is being updated from another source so he can't assume that yesterday's dump is the state the data will be in when he goes to process today's dump.

All the data is 750MB. There are about 1,000,000 records. I don't see the problem with finding a machine with 2 GB of ram that just reads in all yesterdays data and puts it in a hashtable or something based on id, and then goes through the new data and checks every record against what's in the hashtable. Should take less than an hour on a moderately fast machine.

savar

Are there non-unique indexes on the table that you're importing into? Or any other constraints? You might speed things up by disabling those, loading the data, and then re-enabling them.

Another idea would be to preprocess the data file into CSV or something similar, and then use LOAD DATA INFILE.

Edit: like Tster said, additional RAM might help. Run the script and a system monitor simultaneous to see if you are getting lots of page outs. That would be a sign that you're thrashing VM.

morbiuswilters

@savar said:

Are there non-unique indexes on the table that you're importing into? Or any other constraints? You might speed things up by disabling those, loading the data, and then re-enabling them.
Another idea would be to preprocess the data file into CSV or something similar, and then use LOAD DATA INFILE.

Edit: like Tster said, additional RAM might help. Run the script and a system monitor simultaneous to see if you are getting lots of page outs. That would be a sign that you're thrashing VM.

How would disabling constraints help? If he has to re-enable them at the end and some rows violate those contraints he would have to delete them before the contraints were successfully added.

morbiuswilters

@tster said:

All the data is 750MB. There are about 1,000,000 records. I don't see the problem with finding a machine with 2 GB of ram that just reads in all yesterdays data and puts it in a hashtable or something based on id, and then goes through the new data and checks every record against what's in the hashtable. Should take less than an hour on a moderately fast machine.

I'm sure the amount of data will not increase over time. I'm also positive that the OP has the ability to request a machine with 2 GB of dedicated ram when he can't even get basic requirements from the company. Finally, as he said, yesterday's export is not necessarily authorative on what data is currently in the table. So this will not work. Otherwise, yes it will work.

vr602

I agree with lpope: you really need to get the provider to add "last_updated" column to your incoming data. There's no point thrashing away at a million lines if only a couple of hundred have been updated. However hard you try, you will still be doing a thousand times more work than you need to. So insist on tracking column, sort the file by that column, read updated lines only into a work table on your database, then just (for simplicity) delete and insert from there.

lpope187

@morbiuswilters said:

@tster said:
All the data is 750MB. There are about 1,000,000 records. I don't see the problem with finding a machine with 2 GB of ram that just reads in all yesterdays data and puts it in a hashtable or something based on id, and then goes through the new data and checks every record against what's in the hashtable. Should take less than an hour on a moderately fast machine.
I'm sure the amount of data will not increase over time. I'm also positive that the OP has the ability to request a machine with 2 GB of dedicated ram when he can't even get basic requirements from the company. Finally, as he said, yesterday's export is not necessarily authorative on what data is currently in the table. So this will not work. Otherwise, yes it will work.

It really depends on whether you have a reasonable business key in the source data. Example, in a data warehouse you have a Customer Dimension which gets it's own unique identifier (usually auto-incremented int); the business key is the source system identifier usually also a auto-incremented int. If you have a suitable business key, and you don't care about looking for which columns have changed you just need to store the business keys in each of the hash tables. There is no way that is going to be anywhere near the 750MB but more along the lines of 5MB and also shouldn't take an hour to process but mere minutes. Streaming from the file and db is going to be the bottle neck in this case.

Without knowing the data definition and/or getting further information on data ownership issues, we can't really make any determination on whether this method will be practical or not.

LoztInSpace

TRWTF is that the database is so crap that ideas include buying an entire machine, writing database type functionality outside of the "database", sacrificing ACID, disabling integrity etc. in order to achieve something that should be a simple procedure using utilities provided by the database.

TRRWTF:MySQL.

morbiuswilters

@LoztInSpace said:

TRWTF is that the database is so crap that ideas include buying an entire machine, writing database type functionality outside of the "database", sacrificing ACID, disabling integrity etc. in order to achieve something that should be a simple procedure using utilities provided by the database.

TRRWTF:MySQL.

Are you a moron? What does this have to do with MySQL? The problem is that the guy is getting a massive dump of data instead of just the changes and he has to do a ton of selects, inserts, updates and deletes to reconcile the data in the database with the dump. What utility is going to help here? The customer is using MyISAM tables already and obviously they don't need transactions. As far as ref. integrity goes, the OP is operating on a single table which presumably isn't linked directly with anything else. Once again, how is the customer providing slop data and not utilitzing features of the RDBMS a problem with said RDBMS?

Consul

Let me try to answer some of the speculation here:

1) The server is a managed server at GoDaddy. I'm pretty much stuck with what I have. I don't know what its specs are. I was brought in on this recently, and after many of these decisions were made. Likewise, the software my client is running is a decision made a long time ago, and is pretty much set. That software vendor is who designed the database I'm importing into. I had nothing to do with any of that.

2) The Real WTF here is that, given the nature of this data, I should've known from the beginning that there would be this much of it, and fought for a more intelligent way of handling it. Instead, I blindly went forth with building what they asked for, instead of what they needed. In the end, I blame my own apparent lack of competence in this matter. Nevertheless, I have this experience under my belt now and will try not to let it happen again.

3) The third party does not offer a delta feed, although they implied that they were working on one.

4) There IS a unique identifier in the text dump from the third party that I can sort the data with. I've given some thought to writing a sorting program in Perl (or seeing if someone out there already has) and using Linux's diff to get the difference between yesterday's and today's file.

The adventure continues...

LoztInSpace

@morbiuswilters said:

@LoztInSpace said:

TRWTF is that the database is so crap that ideas include buying an entire machine, writing database type functionality outside of the "database", sacrificing ACID, disabling integrity etc. in order to achieve something that should be a simple procedure using utilities provided by the database.

TRRWTF:MySQL.

Are you a moron? What does this have to do with MySQL? The problem is that the guy is getting a massive dump of data instead of just the changes and he has to do a ton of selects, inserts, updates and deletes to reconcile the data in the database with the dump. What utility is going to help here? The customer is using MyISAM tables already and obviously they don't need transactions. As far as ref. integrity goes, the OP is operating on a single table which presumably isn't linked directly with anything else. Once again, how is the customer providing slop data and not utilitzing features of the RDBMS a problem with said RDBMS?

Because importing a text file into a database is a solved problem in most cases.

morbiuswilters

@LoztInSpace said:

Because importing a text file into a database is a solved problem in most cases.

Have you read any of this thread? He isn't importing a text file, he is updating, deleting and inserting records based on the dump he gets. He cannot just blow away the old data and load the new file.

LoztInSpace

@morbiuswilters said:

..he is updating, deleting and inserting records based on the dump he gets.

Are you suggesting this isn't a solved problem?

morbiuswilters

@LoztInSpace said:

Are you suggesting this isn't a solved problem?

Obviously it has been solved before, but the real problem here is the data the OP is getting is slop and he didn't really think through his system before implementing it. I'd love to hear what your solution to the OP's problem would be.

danixdefcon5

@LoztInSpace said:

Because importing a text file into a database is a solved problem in most cases.

You mean LOAD DATA INFILE?

Loading text into the DB isn't the problem. Its that the full 908,000-record table is being sent, overwriting the old table (DELETE WHERE supplier = 666), then re-loading the full table (LOAD DATA INFILE, or a crapload of INSERT INTO's).

No matter how big your RDBMS is, this is unnecesarilly taxing on resources, when the customer could send one full file once, then only send the differential files on subsequent updates.

lpope187

@morbiuswilters said:

@LoztInSpace said:
Are you suggesting this isn't a solved problem?
Obviously it has been solved before, but the real problem here is the data the OP is getting is slop and he didn't really think through his system before implementing it. I'd love to hear what your solution to the OP's problem would be.

This is a solved problem and is quite often seen in the ETL (Extract, Transfer, & Load) processes in data warehouses. It's seen quite often when loading data from legacy applications where for whatever reason cannot be changed to easily support partial extracts. There are plenty of ETL Packages out there that can do field by field comparisons and do the insert/update/delete/nothing automatically. I've only worked with SSIS (Sql Server Integration Services), but there are others. I don't know how easily it would be to connect to the hosted database from SSIS, but if you can attach via OLEDB or ODBC, it should work.

I would seriously look at utilizing SSIS for this as it will have most of the pieces prebuilt. Basically you just need to put them in the right order with the right config.

morbiuswilters

@lpope187 said:

This is a solved problem and is quite often seen in the ETL (Extract, Transfer, & Load) processes in data warehouses. It's seen quite often when loading data from legacy applications where for whatever reason cannot be changed to easily support partial extracts. There are plenty of ETL Packages out there that can do field by field comparisons and do the insert/update/delete/nothing automatically. I've only worked with SSIS (Sql Server Integration Services), but there are others. I don't know how easily it would be to connect to the hosted database from SSIS, but if you can attach via OLEDB or ODBC, it should work.
I would seriously look at utilizing SSIS for this as it will have most of the pieces prebuilt. Basically you just need to put them in the right order with the right config.

The problem here isn't that such a solution doesn't exist. The OP already wrote his own which was probably pretty simple and easy. There's no reason to buy some other product for what he is doing. The limitation is the DB speed which will definitely not be solved by these products.

bjolling

@lpope187 said:

I would seriously look at utilizing SSIS for this as it will have most of the pieces prebuilt.

SSIS really helps implementing ETL processes in an easy way, but it is just a tool and you still need to figure out your design yourself.

I've solved a similar problem using DTS (SSIS's predecessor) by performing a bulk insert of the complete file into a staging table and then using a "Data Driven Query" Task. The task basically loops over all records and uses a user-provided script to decide if he needs to execute the insert, update or delete statement. But I'm not sure if the performance is good enough for the kind of file sizes the OP is referring to.

lpope187

@bjolling said:

SSIS really helps implementing ETL processes in an easy way, but it is just a tool and you still need to figure out your design yourself.
I've solved a similar problem using DTS (SSIS's predecessor) by performing a bulk insert of the complete file into a staging table and then using a "Data Driven Query" Task. The task basically loops over all records and uses a user-provided script to decide if he needs to execute the insert, update or delete statement. But I'm not sure if the performance is good enough for the kind of file sizes the OP is referring to.

You've stated my position exactly. SSIS provides a lot of the building blocks necessary which you assemble to do what you need. IMO, why write a perl script that does some kind of matching via hashtable lookups when you can buy a tool for < $1000 US that does that for you and also provides robust logging, checkpointing, and data visualization/debugging.

Trust me you can get good performance out of SSIS, as I have an extract from our SAP system that does the same type of processing and does 1.5M records in the span of about 6 minutes. The bulk of that time is streaming from the two databases, metadata validation, and codepage conversion.

Jeff_S

What about importing the data into a staging table first? And then from there moving only updates/inserts (and processing deletes) on your live data based on what is in the staging table? This is typically how we write large imports where the source data is a flat file or doesn't have good indexing. The import to the staging table can be slow but it will not affect the live data, you can index and clean up and manipulate the staging data as necessary and then simpy do quick, indexable table-to-table operations (with transactions if necessary) when finally updating the data on your live tables?

(I apologize if this option was mentioned already, but I didn't see it)

@lpope187 said:

You've stated my position exactly. SSIS provides a lot of the building blocks necessary which you assemble to do what you need. IMO, why write a perl script that does some kind of matching via hashtable lookups when you can buy a tool for < $1000 US that does that for you and also provides robust logging, checkpointing, and data visualization/debugging.

Because with perl you can do what you need to do for free in about 3 minutes.

Honestly, at this point if the OP could post some example data somewhere I would write a perl script that takes two files and produces a diff. Just let me know what the key is. Time and space will be O(n) where n is the number of records. As far as worrying about data getting larger in the future. I'm assuming this 900,000 records is a lot of times worth of data. Assuming they add perhaps 1000 records per day, that means it will take almost 3 years to double the size of the data. At that rate I think you can just buy more memory to keep up with the demand. And you don't need a dedicated machine. Hell, run it on your dev machine at night.

morbiuswilters

@tster said:

Because with perl you can do what you need to do for free in about 3 minutes.

Honestly, at this point if the OP could post some example data somewhere I would write a perl script that takes two files and produces a diff. Just let me know what the key is. Time and space will be O(n) where n is the number of records. As far as worrying about data getting larger in the future. I'm assuming this 900,000 records is a lot of times worth of data. Assuming they add perhaps 1000 records per day, that means it will take almost 3 years to double the size of the data. At that rate I think you can just buy more memory to keep up with the demand. And you don't need a dedicated machine. Hell, run it on your dev machine at night.

I agree that writing your own app would be quick and much cheaper and easier than buying some 3rd party program. If it is available, perl is definitely a better choice for the OP as it is faster than PHP at large-scale text file processing. If the unique ID is an auto-incrementing int, be sure you leave a blank line whenever you encounter a missing row. Otherwise diff isn't going to be able to handle large stretches of deleted files because it only reads ahead so far for a matching row. It sounds like you're pretty much on-target for what the OP needs, though.