SQLite in Python



  • As mentioned in a Lounge thread, I'm scraping some public data from a US government website (so the data is totally in the public domain). There may be a way to download the data in a more machine-friendly way, but I couldn't find it, so scraping the web it is.

    The data is big, and I'm having memory problems. My whole WSL VM is crashing when the data hits about 4 GB. Somebody suggested sticking it in a SQLite db, and that makes sense for how I want to query it when I finish munging it out of the HTML, so I'm trying that.

    The script that's doing the munging is working fine (as much as it can without running out of memory; it works if I limit it to a small subset to test), so I don't need any Python help (at the moment, anyway); I'm just mentioning that for context.

    But I'm not a db guy. I've done a little with MySQL, years ago, so my SQL is limited and very rusty, and I'm brand new to SQLite. This is the main part I need help with.

    The Python is producing a nested dictionary (although I don't really need that if I stuff the data into the db on the fly). The data structure is like this: The dictionary key is a string; call it name. The value is another dictionary with 3 key-value pairs:

    • 'count': integer specifying the number of times name occurs in the data;
    • 'type': list of strings
    • 'place': Yet another dictionary:
      • Key: string foo
      • Value: Count of times name occurs in foo
    { 
      'Fred': {'count': 17, 'type': [ ... ], 'place': {'here': 3, 'there': 14} } , 
      'Polygeekery': {'count': 1, 'type': ['flaming'], 'place': {'TDWTF': 1} } , 
    ...
    }
    

    The relationship between 'name' and 'type' is many-to-many. There are maybe a couple of dozen types, and each name may have 0 or more items from that set. I know how to create this many-to-many relationship in SQL.

    There are 50-ish valid places. Each name may be associated with 1 or more places. Each place will be associated with many names. If that were all I cared about, that would be a basic many-to-many relationship, just like the 'name'-'type', but the count per place is kind of throwing me.

    The raw data is ordered by place. All of the names for place1 will be added to the db with multiple occurrences of name for some names, then all the names for place2, again with multiple occurrences. So, I need to make sure the counts associated with name itself and the combination of name and place get incremented, rather than getting inserted again.

    Actually, writing out this explanation for you guys, I think I've mostly figured it out, but it's still a bit fuzzy. In particular, I'm uncertain of the exact syntax for incrementing the counts if name or name+place are already in the db or inserting them if not.



  • @HardwareGeek said in SQLite in Python:

    In particular, I'm uncertain of the exact syntax for incrementing the counts if name or name+place are already in the db or inserting them if not.

    You mean something like UPSERT?

    See the first example in §2.1.
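
    The general shape is roughly this (not the docs' example verbatim, and the table here is made up just to show the syntax; needs SQLite 3.24+):

    import sqlite3

    con = sqlite3.connect("scratch.db")
    # Hypothetical table, purely to show the ON CONFLICT ... DO UPDATE shape
    con.execute("CREATE TABLE IF NOT EXISTS names (name TEXT PRIMARY KEY, count INTEGER NOT NULL)")
    con.execute(
        "INSERT INTO names (name, count) VALUES (?, 1) "
        "ON CONFLICT(name) DO UPDATE SET count = count + 1",
        ("Fred",),
    )
    con.commit()
    con.close()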



  • @Watson Thanks, that's very helpful. And I think I figured out how to add the uniqueness constraints to make that work. Basically, the column or pair of columns I want to be unique is what I'm going to be selecting by, so they should all be indexed anyway; adding the unique constraint to the index is trivial.

    If it isn't all as simple as it seems at the moment, I'll be back with more questions. Also, it's midnight, so I may not remember any of what I think I've figured out by whatever time I get a chance to work on this tomorrow.


  • 🚽 Regular

    @HardwareGeek said in SQLite in Python:

    There are 50-ish valid places. Each name may be associated with 1 or more places. Each place will be associated with many names. If that were all I cared about, that would be a basic many-to-many relationship, just like the 'name'-'type', but the count per place is kind of throwing me.

    Isn't it just a new column in the name_place table, along with name_id and place_id?

    In any case, if you've kept the JSON files, I hate to be that guy who throws links around hoping one will be helpful even though I've only had limited experience trying what's in them; but I'll do it anyway.

    DuckDB is similar to SQLite, in that there is a single executable CLI requiring no installation, and there's also a Python binding, among many others. It can also read SQLite files.

    Having said that, SQLite has its own JSON extensions, which you might be able to use if you want to avoid the whole need of building separate "physical" tables to add structure to the data, unless that's something you specifically need. It should be possible to create indices on JSON selections, like in here: https://dgl.cx/2020/06/sqlite-json-support
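
    From skimming that page, the gist is something like this (table and column names are mine, and it assumes the JSON1 functions are compiled in, which they usually are):

    import json
    import sqlite3

    con = sqlite3.connect("scratch.db")
    con.execute("CREATE TABLE IF NOT EXISTS raw (name TEXT PRIMARY KEY, data TEXT)")
    con.execute(
        "INSERT OR REPLACE INTO raw VALUES (?, ?)",
        ("Fred", json.dumps({"count": 17, "type": [], "place": {"here": 3, "there": 14}})),
    )
    # Drill into the JSON with the built-in json_extract()
    print(con.execute("SELECT json_extract(data, '$.place.here') FROM raw").fetchone())  # (3,)
    # Index a JSON expression so lookups don't have to scan every blob
    con.execute("CREATE INDEX IF NOT EXISTS raw_count_idx ON raw (json_extract(data, '$.count'))")
    con.commit()
    con.close()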

    On the other hand, storing the original JSON inside SQLite may or may not make your memory problems worse. I'm honestly not sure how much of it SQLite will pull into memory. 😬

    Anyway, I'm just spitballing here. If this was unhelpful, just ignore me.


  • Discourse touched me in a no-no place

    @Zecc said in SQLite in Python:

    I'm honestly not sure how much of it SQLite will pull into memory.

    Usually enough pages to hold whatever data is changing in the current transaction, but it might write some early when you slab in big chunks of data at once. (Reads don't need so much unless you force them to.) The authors of SQLite have heard of mmap() (and its Windows equivalent) and will use it as appropriate; part of the reason you use a library like this for that sort of thing is then you don't have to figure that shit out for yourself.
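
    If memory does turn out to be the pinch point, there are a couple of knobs you can turn from Python anyway; roughly this (the numbers are placeholders, tune to taste):

    import sqlite3

    con = sqlite3.connect("my_db.db")
    # Cap the page cache at roughly 64 MB (a negative value means "this many KiB")
    con.execute("PRAGMA cache_size = -65536")
    # Limit how much of the file may be memory-mapped (0 disables mmap entirely)
    con.execute("PRAGMA mmap_size = 268435456")
    con.close()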



  • @HardwareGeek I wouldn't use sqlite for anything that big, it likes getting corrupted. It's easy enough to spin up a postgresql docker and set it up to persist the data.

    Of course the real WTF is not using a Mac. WSL is such a hog, I couldn't handle running windows anymore on my fairly high end work laptop. So much happier with a Mac. This is coming from a guy who's never had an iPhone and has hated Apple for a good 20+ years. The new ARM architecture is really something.



  • @Zecc said in SQLite in Python:

    Isn't it just a new column in the name_place table, along with name_id and place_id?

    Yeah, that's what I figured out as I was writing the OP.

    In any case, if you've kept the JSON files, I hate to be that guy

    There aren't any JSON files. The original data source is HTML. JSON is what I was going to use to write out the processed data, but [redacted, because it was a Lounge post; if he wants to take credit for the suggestion, that's up to him, but I won't leak the info] suggested SQLite before I got around to adding import json to the script, much less actually implementing it.


  • Java Dev

    @dangeRuss said in SQLite in Python:

    I wouldn't use sqlite for anything that big, it likes getting corrupted.

    It's write-once-read-many. Surely readonly transactions won't corrupt it?

    Personally, I'd suggest picking a tech stack you want to learn more about. You could even let out the NIH monkey if you feel like it.



  • @PleegWat said in SQLite in Python:

    It's write-once-read-many. Surely readonly transactions won't corrupt it?

    And if it does become corrupt somehow, it's not business-critical (it's probably the most uber-nerdy and unnecessary personal project ever) and very easy (slow, but easy) to recreate the thing from scratch.



  • @PleegWat said in SQLite in Python:

    I'd suggest picking a tech stack you want to learn more about.

    To the extent I care about learning an RDBMS at all, the decision of which one is pretty meh. The SQL itself is the hard part, and if SQLite gives me trouble, it should be fairly easy to switch. From what I've read, most of the quirks in SQLite's dialect aren't relevant for the simple stuff I need for this project, or at least shouldn't be a big problem since I'm not relying on the legacy behavior those quirks allow.



  • @HardwareGeek also if sqlite is an in-memory DB, you might be better off with a real DB, like postgresql or mongodb.

    If you're going to use JSON you're probably better off with something like Mongo or redis if you're just looking to cache the data.



  • @dangeRuss said in SQLite in Python:

    if sqlite is an in-memory DB

    It can be, but I'm not using it that way.

    If you're going to use JSON ... if you're just looking to cache the data

    I'm not doing either of those. I need to persist the processed data permanently, to be retrieved using a separate script. JSON was just my first idea of a simple way to persist the processed data. Using an RDBMS is a better way, because I don't have to parse GBs of JSON just to retrieve a dozen rows.



  • So, accessing SQLite from Python is working. I haven't finished the script, but I've created a db and some tables in the interactive Python interpreter. It's just a matter of polishing the SQL and typing it into my script.
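
    For the record, the interactive bit is just the standard-library sqlite3 module; roughly this, trimmed to two of the tables:

    import sqlite3

    con = sqlite3.connect("my_db.db")
    cur = con.cursor()
    cur.executescript("""
        CREATE TABLE IF NOT EXISTS names  (id INTEGER PRIMARY KEY, name TEXT UNIQUE, count INTEGER);
        CREATE TABLE IF NOT EXISTS places (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    """)
    con.commit()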

    However, accessing SQLite without Python is a problem:

    $ sqlite3 my_db.db
    
    Command 'sqlite3' not found, but can be installed with:
    
    sudo apt install sqlite3
    
    $ sudo apt install sqlite3
    [sudo] password for [redacted]:
    Reading package lists... Done
    Building dependency tree
    Reading state information... Done
    The following packages were automatically installed and are no longer required:
      fakeroot libalgorithm-diff-perl libalgorithm-diff-xs-perl libalgorithm-merge-perl libdpkg-perl libfakeroot
      libfile-fcntllock-perl libfwupdplugin1
    Use 'sudo apt autoremove' to remove them.
    Suggested packages:
      sqlite3-doc
    The following NEW packages will be installed:
      sqlite3
    0 upgraded, 1 newly installed, 0 to remove and 24 not upgraded.
    Need to get 861 kB of archives.
    After this operation, 2803 kB of additional disk space will be used.
    Ign:1 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 sqlite3 amd64 3.31.1-4ubuntu0.3
    Err:1 http://security.ubuntu.com/ubuntu focal-updates/main amd64 sqlite3 amd64 3.31.1-4ubuntu0.3
      404  Not Found [IP: 185.125.190.39 80]
    E: Failed to fetch http://security.ubuntu.com/ubuntu/pool/main/s/sqlite3/sqlite3_3.31.1-4ubuntu0.3_amd64.deb  404  Not Found [IP: 185.125.190.39 80]
    E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
    

    What's going on here? How do I install the SQLite command line if apt install can't find the package?



  • @HardwareGeek

    Try running apt-get update first, sounds like your cache is out of date.



  • @HardwareGeek I haven't really read much of this thread - but one tool I find useful is SQLite Studio, for tinkering about with a database without dealing with the coding yet. Once I had a schema, I was able to copy it from Studio and drop it into my program, which was nice.


  • Discourse touched me in a no-no place

    @dangeRuss said in SQLite in Python:

    I wouldn't use sqlite for anything that big, it likes getting corrupted.

    I've only heard of that being a problem where someone's been stupid enough to try and have multiple computers accessing the DB for writes on the one network drive. Don't do that. Networks are :trwtf: :trollface: and so doing anything reliable on networked filesystems is asking for trouble (it's because doing locks and synchs over a network have some awful failure modes).

    And you also need to be a bit careful if you're using an Apple system. Their filesystem implementation tells lies (specifically, it returns from a synch before the data has hit storage that will survive a power cycle) unless you enable the "please don't throw away my data" flag. :rolleyes:
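
    (If memory serves, the SQLite-side knob for that is the fullfsync pragma; something like this, though I haven't checked it on a Mac recently:)

    import sqlite3

    con = sqlite3.connect("my_db.db")
    # Ask SQLite to use F_FULLFSYNC where the OS supports it (i.e. macOS),
    # so a sync really means the data reached stable storage
    con.execute("PRAGMA fullfsync = 1")
    con.execute("PRAGMA checkpoint_fullfsync = 1")
    con.close()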

    OTOH, I've not had many problems with running big data ingestions and running multithreaded services against it. (There are simple tricks to optimize all that to the point where you hit whatever hardware limits you've got.) The main limitation is you don't want to try to do multiple write (or otherwise exclusive) transactions at once, as there's no row-level locking.


  • Discourse touched me in a no-no place

    @HardwareGeek said in SQLite in Python:

    if he wants to take credit for the suggestion, that's up to him

    👋🏼


  • Discourse touched me in a no-no place

    @dangeRuss said in SQLite in Python:

    @HardwareGeek also if sqlite is an in-memory DB, you might be better off with a real DB, like postgresql or mongodb.

    It's an on-disk DB unless you open the connection in memory-only mode. Those are for people who don't want to keep the data in the DB at all.

    I'm not sure I'd characterise Mongo as a real DB. A pawn in game of life though...

    If you're going to use JSON you're probably better off with something like Mongo or redis if you're just looking to cache the data.

    I'd assume that querying is going to be useful sometime. :tro-pop:



  • @dkf said in SQLite in Python:

    I'm not sure I'd characterise Mongo as a real DB. A pawn in game of life though...

    I think it qualifies as a DB (then again I work with Universe all day, so I might be biased)



  • @dkf said in SQLite in Python:

    @dangeRuss said in SQLite in Python:

    I wouldn't use sqlite for anything that big, it likes getting corrupted.

    I've only heard of that being a problem where someone's been stupid enough to try and have multiple computers accessing the DB for writes on the one network drive. Don't do that. Networks are :trwtf: :trollface: and so doing anything reliable on networked filesystems is asking for trouble (it's because doing locks and synchs over a network have some awful failure modes).

    And you also need to be a bit careful if you're using an Apple system. Their filesystem implementation tells lies (specifically, it returns from a synch before the data has hit storage that will survive a power cycle) unless you enable the "please don't throw away my data" flag. :rolleyes:

    OTOH, I've not had many problems with running big data ingestions and running multithreaded services against it. (There are simple tricks to optimize all that to the point where you hit whatever hardware limits you've got.) The main limitation is you don't want to try to do multiple write (or otherwise exclusive) transactions at once, as there's no row-level locking.

    What I don't get is the usefulness of such a db when you can fairly easily run a real DB in a docker. It's not very heavy, and it's leaps and bounds better than any embedded db.



  • @dangeRuss For me, for this project, I guess the advantages are that SQLite and its Python bindings are installed as part of Python by default. PostgreSQL, its Python bindings, and Docker are not. I'd have to install all 3 of those and keep them up to date. And I'd have to learn how to use Docker. Not that any of that is particularly difficult, but it's more moving pieces for what is a stupid, (supposed to be) low-effort personal project.



  • @dangeRuss said in SQLite in Python:

    @dkf said in SQLite in Python:

    @dangeRuss said in SQLite in Python:

    I wouldn't use sqlite for anything that big, it likes getting corrupted.

    I've only heard of that being a problem where someone's been stupid enough to try and have multiple computers accessing the DB for writes on the one network drive. Don't do that. Networks are :trwtf: :trollface: and so doing anything reliable on networked filesystems is asking for trouble (it's because doing locks and synchs over a network have some awful failure modes).

    And you also need to be a bit careful if you're using an Apple system. Their filesystem implementation tells lies (specifically, it returns from a synch before the data has hit storage that will survive a power cycle) unless you enable the "please don't throw away my data" flag. :rolleyes:

    OTOH, I've not had many problems with running big data ingestions and running multithreaded services against it. (There are simple tricks to optimize all that to the point where you hit whatever hardware limits you've got.) The main limitation is you don't want to try to do multiple write (or otherwise exclusive) transactions at once, as there's no row-level locking.

    What I don't get is the usefulness of such a db when you can fairly easily run a real DB in a docker. It's not very heavy, and it's leaps and bounds better than any embedded db.

    Why add the extra layer of indirection of docker retardery when it's in no way needed?



  • @Carnage said in SQLite in Python:

    @dangeRuss said in SQLite in Python:

    @dkf said in SQLite in Python:

    @dangeRuss said in SQLite in Python:

    I wouldn't use sqlite for anything that big, it likes getting corrupted.

    I've only heard of that being a problem where someone's been stupid enough to try and have multiple computers accessing the DB for writes on the one network drive. Don't do that. Networks are :trwtf: :trollface: and so doing anything reliable on networked filesystems is asking for trouble (it's because doing locks and synchs over a network have some awful failure modes).

    And you also need to be a bit careful if you're using an Apple system. Their filesystem implementation tells lies (specifically, it returns from a synch before the data has hit storage that will survive a power cycle) unless you enable the "please don't throw away my data" flag. :rolleyes:

    OTOH, I've not had many problems with running big data ingestions and running multithreaded services against it. (There are simple tricks to optimize all that to the point where you hit whatever hardware limits you've got.) The main limitation is you don't want to try to do multiple write (or otherwise exclusive) transactions at once, as there's no row-level locking.

    What I don't get is the usefulness of such a db when you can fairly easily run a real DB in a docker. It's not very heavy, and it's leaps and bounds better than any embedded db.

    Why add the extra layer of indirection of docker retardery when it's in no way needed?

    I've had to recover way too many sqlite dbs to ever trust them with my data.

    Plus I'm guessing postgresql is actually going to perform.



  • This is simple, but I'm having a hard time getting the syntax right.

    I've created name and place tables:

    CREATE TABLE names (id INTEGER PRIMARY KEY, name TEXT UNIQUE, count INTEGER);
    CREATE TABLE places (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE name_place (name_id INTEGER, place_id INTEGER, count INTEGER, FOREIGN KEY(name_id) REFERENCES names(id), FOREIGN KEY(place_id) REFERENCES places(id));
    CREATE UNIQUE INDEX name_place_idx ON name_place (name_id, place_id);
    

    I've populated the names and places tables:

    sqlite> select * from names;
    1|foo|1
    2|bar|2
    3|baz|2
    4|qwerty|1
    10|asdf|1
    11|jkl|1
    12|uiop|1
    sqlite> select * from places;
    1|here
    2|there
    3|everywhere
    

    Now I want to populate the name_place table. (Actually, I'll be populating it one-at-a-time as I'm populating the names table, but of course the row in names will be upserted first. The places table will also be populated incrementally, but the script will only need to touch the DB the first time it encounters the place name.)
    I need to get the row ids corresponding to the name and place, but my Google/DuckDuckGo-fu is failing me on finding a good example of the syntax for INSERT ... SELECT ... where the SELECT references multiple tables.


  • 🚽 Regular

    @HardwareGeek Do yourself a favor and run the following commands in sqlite3:

    .headers on
    
    .mode column
    

    and the results will be returned as

    sqlite> select * from name;
    id  name    count
    --  ------  -----
    1   foo     1
    2   bar     2
    3   baz     2
    4   qwerty  1
    10  asdf    1
    11  jkl     1
    12  uiop    1
    sqlite> select * from places;
    id  name
    --  ----------
    1   here
    2   there
    3   everywhere
    

    You might as well run .help mode to learn about other output formats. I just learned it can output HTML and Markdown!

    sqlite> .mode markdown
    sqlite> select * from places;
    | id |    name    |
    |----|------------|
    | 1  | here       |
    | 2  | there      |
    | 3  | everywhere |

  • And then the murders began.

    @HardwareGeek

    Annoyingly, SQLite doesn't tell you when you create name_place that you typoed name as names in the foreign key. So first, you'll have to fix that.

    If you really want to do it as a SELECT after the fact, this is how you could do it:

    INSERT INTO name_place (name_id, place_id, count)
    SELECT n.id, p.id, 3 FROM name n
    JOIN places p WHERE n.name = 'foo' AND p.name = 'here'
    

    Not sure if that's the best way to write that. That JOIN is kinda ugly. (In SQL Server I'd use CROSS APPLY, but SQLite doesn't have that. Either way, it feels wrong.)

    Normally, instead of doing it as a SELECT from multiple tables, I will save the identity value from name after the INSERT, and then use that for the name_place INSERT:

    INSERT INTO name (name, count) VALUES ('foo', 1);
    INSERT INTO name_place (name_id, place_id, count) SELECT last_insert_rowid(), p.id, 3 FROM places p WHERE p.name = 'here'
    


  • @Unperverted-Vixen said in SQLite in Python:

    SQLite doesn't tell you when you create name_place that you typoed name as names in the foreign key.

    I think I've been consistent in the script, but messing around with the sqlite3 command line, I have been unintentionally switching between singular and plural table names. I have tables named both name and names in my test db, and apparently I pasted the wrong one into my post. The foreign key is actually correct.

    I've been reading a bunch of confusing stuff about how to get the id of the last inserted row, and I think it doesn't work if the last operation was an update, not an insert, and maybe not if the insert turned into an update due to the ON CONFLICT clause. But it's worth a try.


  • And then the murders began.

    @HardwareGeek Ugh, yeah, that's not going to work with the UPDATE case. Ah well. Then just use the first query. Working is more important than ugly. :kneeling_warthog:



  • Yeah, I just verified that it no worky.

    @Unperverted-Vixen said in SQLite in Python:

    Working is more important than ugly. :kneeling_warthog:

    QFT

    Worst case, I can get the id(s) in the script and pass them into the query, but that's less efficient. Your first example was what I was looking for when I asked the question, but my brain just kind of skipped over that when I first read your post.



  • What's wrong with this?

    The ON CONFLICT works for this:

    db_cur.execute('INSERT INTO names(name, count) VALUES (?, ?) ON CONFLICT(name) DO UPDATE SET count=count+1;', (name, 1))
    

    but not for this:

    db_cur.execute("INSERT INTO name_place(name_id, place_id, count) SELECT n.id, p.id, 1 FROM names n JOIN places p WHERE n.name = ? AND p.name = 'here' ON CONFLICT DO UPDATE SET count=count+1;", (name,))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    sqlite3.OperationalError: near "UPDATE": syntax error
    

    Is it because it doesn't include a conflict target? The docs say I don't need one.

    The conflict target may be omitted on the last ON CONFLICT clause in the INSERT statement, but is required for all other ON CONFLICT clauses.
    ... If the last ON CONFLICT clause omits the conflict target, then it will fire if any uniqueness constraint fails which is not captured by prior ON CONFLICT clauses.

    Because the uniqueness constraint is on the multi-column index, not a single column, I don't know how to specify the conflict target; I didn't run across any examples like that. It's the only uniqueness constraint for that table, and there's only one ON CONFLICT clause in the query, so in theory, at least, I shouldn't need to specify the conflict target. That's how I read it, anyway. I wish SQLite offered something more detailed than just "syntax error".


  • And then the murders began.

    @HardwareGeek said in SQLite in Python:

    It's the only uniqueness constraint for that table, and there's only one ON CONFLICT clause in the query, so in theory, at least, I shouldn't need to specify the conflict target. That's how I read it, anyway. I wish SQLite offered something more detailed than just "syntax error".

    Old version of SQLite?

    The syntax was generalized to permit multiple ON CONFLICT clauses and to allow DO UPDATE resolution without a conflict target in SQLite version 3.35.0 (2021-03-12).



  • @Unperverted-Vixen 3.31.1, and apt says that's the newest version available.



  • With a little experimentation, I found the conflict target that works. It did the upsert correctly. Now to copy-paste that into the script, and repeat that for the other tables.
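
    It ends up looking something like this, give or take: the conflict target just names the columns of the unique index (db_cur and name are the same as in the earlier snippets; place stands in for whatever place string applies):

    db_cur.execute(
        "INSERT INTO name_place (name_id, place_id, count) "
        "SELECT n.id, p.id, 1 FROM names n JOIN places p "
        "WHERE n.name = ? AND p.name = ? "
        "ON CONFLICT(name_id, place_id) DO UPDATE SET count = count + 1;",
        (name, place),
    )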



  • Progress: The code in the script works when I paste it into the Python interactive interpreter. Now I just need to call it from the appropriate place in the script.

    # FIXME And define the function to populate the place table. It gets populated separately from and prior to the function that upserts a name and its related fields, because it's encountered at a different place in the source data and only once for possibly thousands of names.

    However, it's now 01:15, and that needs to wait until sometime later, after sleep. Um, and work. Yes, definitely after work. 👀:seye:

    Also FIXME, I should probably commit once in a while, since autocommit is off by default. And add some error handling.


  • Discourse touched me in a no-no place

    @HardwareGeek said in SQLite in Python:

    I've been reading a bunch of confusing stuff about how to get the id of the last inserted row, and I think it doesn't work if the last operation was an update, not an insert, and maybe not if the insert turned into an update due to the ON CONFLICT clause. But it's worth a try.

    That's when you might want a RETURNING clause. Assuming the SQLite version in that Python installation is new enough.
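
    i.e. something along these lines (3.35+ only, so not on that 3.31 install; db_cur and name as in your earlier snippets):

    row = db_cur.execute(
        "INSERT INTO names (name, count) VALUES (?, 1) "
        "ON CONFLICT(name) DO UPDATE SET count = count + 1 "
        "RETURNING id;",
        (name,),
    ).fetchone()
    name_id = row[0]  # the id whether the row was inserted or updated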


  • Discourse touched me in a no-no place

    @HardwareGeek said in SQLite in Python:

    3.31.1, and apt says that's the newest version available.

    Version 3.42.0 (2023-05-16) is the newest available right now. I hate dealing with distributions that sit massively behind.


  • Discourse touched me in a no-no place

    @HardwareGeek said in SQLite in Python:

    autocommit is off by default

    :sideways_owl:

    The classic, documented way of doing it from Python is:

    con = sqlite3.connect("mydb.sqlite3")
    
    # The context is the transaction boundary
    # Outside of this, you're in autocommit mode unless you do a BEGIN yourself
    with con:
        # Doing an insert in a transaction
        con.execute("INSERT INTO foo(bar) VALUES (?)", [123])
    
    con.close()
    

    It's only if you're doing multithreaded write-heavy workloads that you need to be more careful, and Python just sucks for that sort of thing anyway.

    (I've noticed that many database drivers for various languages impose strange transaction semantics. The strangeness doesn't matter when there's only one connection to the DB so it is relatively hard to envision, but nailing down when transaction boundaries really happen — which is just a little bit important in some production scenarios! — can really expose that database driver authors like the old mind-bending drugs. You shouldn't need to care about that.)


  • Java Dev

    @HardwareGeek said in SQLite in Python:

    Worst case, I can get the id(s) in the script and pass them into the query, but that's less efficient.

    If you need to query the name from DB every time, probably. But depending on the total size of the name table, it may be feasible to keep a name→id map in memory.
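
    Something like this, say (a sketch; it assumes the name row has already been upserted by the time you ask for its id):

    name_ids = {}  # name -> rowid cache, lives for the length of the run

    def get_name_id(db_cur, name):
        # Hit the DB only the first time a given name shows up
        if name not in name_ids:
            row = db_cur.execute("SELECT id FROM names WHERE name = ?", (name,)).fetchone()
            name_ids[name] = row[0]
        return name_ids[name]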



  • @dkf said in SQLite in Python:

    It's only if you're doing multithreaded write-heavy workloads that you need to be more careful, and Python just sucks for that sort of thing anyway.

    Why do you think python sucks at this? Python handles multithreading just fine, as long as you don't need to use more than a single CPU.



  • @dkf said in SQLite in Python:

    @HardwareGeek said in SQLite in Python:

    autocommit is off by default

    :sideways_owl:

    The classic, documented way of doing it from Python is:

    None of the documentation I found showed that way of doing it. The clearest example I found (I think it was the official python.org documentation for the sqlite3 module, but I'm not sure offhand; I've got about 50 tabs open just for SQLite) created a connection, then a cursor, and used the cursor for everything after that. (Except commit(), which is a method only of the connection object, not the cursor object, but I don't think it ever shows calling commit().) Not a with in sight. And I'm not sure which tab talks about transactions and autocommit.

    There's another page that shows using the connection object for everything (but still not using with), but con.execute() implicitly creates a new cursor every time it's called, which seems inefficient compared to explicitly creating and reusing a cursor.

    It's only if you're doing multithreaded write-heavy workloads that you need to be more careful, and Python just sucks for that sort of thing anyway.

    Well, it's definitely a write-heavy workload. And it was multithreaded while I was bouncing back and forth between Python and the SQLite command line. Some page said writes from Python wouldn't be visible to the other process without an explicit con.commit(), and I found that to be true.

    I have a Python function that creates the db tables and indices (after nuking the existing db from orbit, if it exists) and returns the connection and cursor objects. Another function upserts a name and its associated place and type relationships (which are slightly more complicated than I've discussed here; I thought of a couple more name attributes I want to store, but they're just more of the same). That's the logical place for a transaction boundary. There is also a function to insert rows into the place table; unlike the type table, that isn't done when processing the name relationships, because the place is associated with the source data file, not the individual records in it, and applies to all of those records, so the place table is updated when opening a new source file. Obviously, that should be its own transaction. But I'm not sure how to do that without an explicit commit() (or rollback() in case of error).
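
    Although, looking at @dkf's example again, wrapping each of those functions' DB work in a with con: block might be exactly that: it commits when the block finishes and rolls back if anything inside raises. Something like this, maybe:

    def add_place(con, place):
        # One logical operation == one transaction; commits on success,
        # rolls back automatically if anything inside raises
        with con:
            con.execute(
                "INSERT INTO places (name) VALUES (?) ON CONFLICT(name) DO NOTHING;",
                (place,),
            )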


  • BINNED

    @dangeRuss said in SQLite in Python:

    Why do you think python sucks at this? Python handles multithreading just fine, as long as you don't need to use more than a single CPU.

    Because people want to use more than a single CPU? That's what multithreading is for.



  • @PleegWat said in SQLite in Python:

    But depending on the total size of the name table, it may be feasible to keep a name→id map in memory.

    That will likely wind up being at least hundreds of thousands, maybe millions, of names, but almost certainly smaller than the 4GB that was crashing the WSL VM, so maybe. But I've got the INSERT ... SELECT... working, so I'm not going to mess with it.



  • @dkf said in SQLite in Python:

    @HardwareGeek said in SQLite in Python:

    3.31.1, and apt says that's the newest version available.

    Version 3.42.0 (2023-05-16) is the newest available right now. I hate dealing with distributions that sit massively behind.

    Based on the apt output earlier, it's Ubuntu 20.04, which is a 3-year-old long-term stable version. Ubuntu 22.04 (the most recent LTS) has 3.37, and Ubuntu 23.04 has 3.40.



  • @topspin said in SQLite in Python:

    @dangeRuss said in SQLite in Python:

    Why do you think python sucks at this? Python handles multithreading just fine, as long as you don't need to use more than a single CPU.

    Because people want to use more than a single CPU? That's what multithreading is for.

    Not necessarily. If you're doing a lot of heavy processing in your threads then yes, you want more than a single CPU. If all you're doing is IO, you're probably fine with a single CPU.



  • @Dragnslcr said in SQLite in Python:

    @dkf said in SQLite in Python:

    @HardwareGeek said in SQLite in Python:

    3.31.1, and apt says that's the newest version available.

    Version 3.42.0 (2023-05-16) is the newest available right now. I hate dealing with distributions that sit massively behind.

    Based on the apt output earlier, it's Ubuntu 20.04, which is a 3-year-old long-term stable version. Ubuntu 22.04 (the most recent LTS) has 3.37, and Ubuntu 23.04 has 3.40.

    How can I update WSL to use the more recent LTS? (I admit I haven't even tried to google this, yet, but since you just mentioned it, I might as well ask.)


  • Discourse touched me in a no-no place

    @dangeRuss said in SQLite in Python:

    Why do you think python sucks at this?

    Too much damn experience. Python has the world's worst implementation of multithreading. Even if you use CPU affinity, it still sucks like a galactic-class black hole.


  • Discourse touched me in a no-no place

    @HardwareGeek said in SQLite in Python:

    There's another page that shows using the connection object for everything (but still not using with), but con.execute() implicitly creates a new cursor every time it's called, which seems inefficient compared to explicitly creating and reusing a cursor.

    You're worrying about the efficiency of a cursor in Python? Hah.

    Let me let you into a secret. SQLite itself doesn't have cursors at all. It has statement objects that step through result sets. When you worry about the cursor, you're just worrying about a Python object that is wrapping the underlying API. (Statement preparation costs are usually lower in SQLite than most other DBs, but that's an implementation quirk that I'd expect any other in-process DB to share.)



  • @HardwareGeek said in SQLite in Python:

    @Dragnslcr said in SQLite in Python:

    @dkf said in SQLite in Python:

    @HardwareGeek said in SQLite in Python:

    3.31.1, and apt says that's the newest version available.

    Version 3.42.0 (2023-05-16) is the newest available right now. I hate dealing with distributions that sit massively behind.

    Based on the apt output earlier, it's Ubuntu 20.04, which is a 3-year-old long-term stable version. Ubuntu 22.04 (the most recent LTS) has 3.37, and Ubuntu 23.04 has 3.40.

    How can I update WSL to use the more recent LTS? (I admit I haven't even tried to google this, yet, but since you just mentioned it, I might as well ask.)

    I think you just install a new one. You can have multiple WSLs. Not sure you can just upgrade from one to another.

    For all intents and purposes you should just get a Mac if you can. The M1 Air was like $750 recently and it's probably lightyears better than whatever you're using. Of course I could be biased since I'm on a M1 Max.



  • @dkf said in SQLite in Python:

    @dangeRuss said in SQLite in Python:

    Why do you think python sucks at this?

    Too much damn experience. Python has the world's worst implementation of multithreading. Even if you use CPU affinity, it still sucks like a galactic-class black hole.

    I don't know about worst. Node can't use more than a single CPU either. Although I did write a script to parse my gmail export file and the node version was lightyears faster than the python one.

    You can use multiprocessing in python if you really need that much CPU. It gets a little tricky with passing the data around, but it mostly works.
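
    A toy example, roughly (each chunk gets its own process, so each gets its own CPU and its own GIL):

    from multiprocessing import Pool

    def munge(chunk):
        # Stand-in for the CPU-heavy part; must be a picklable, top-level function
        return sum(chunk)

    if __name__ == "__main__":
        chunks = [range(0, 1_000_000), range(1_000_000, 2_000_000)]
        with Pool() as pool:
            print(pool.map(munge, chunks))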



  • @dangeRuss said in SQLite in Python:

    lightyears better than whatever you're using.

    Not to get into a dick-measuring contest, but I'm using an i9-11900 (8 cores — par with M1, half of M2), 64 GB of RAM (4x the max, 8x the default for M1 Air, and 2.67x the max for M2) and 7.5 TB of storage (6.5 internal, 1 external; 2.5 SSD, 4.5 spinning rust — the best M1 or M2 Air can do is 2TB internal), driving dual 4k monitors. So the M2 processor might be better, but the whole system is lightyears short of lightyears better than mine, and I sure as hell couldn't get an Air that's even a little better than my system for $700.

