InfluxDB "improvements"

cartman82

Powers that be decided we are to use something called InfluxDB to store our time-series style data.

On the surface, it seemed like a reasonable choice, TBH. Influx is specifically designed for that use case.

I coded my part of this 6 months ago, tested it and moved on.

Fast forward to yesterday.

I'm tasked with making a few unrelated final tweaks before the big launch this Monday.

I check out the code, do the documented setup steps (apt-get install influxdb), all normal. Try to start up the project and... CRASH. Errors galore, all related to influx DAL.

An hour of frantic debugging and googling reveals the following.

They changed the ports you use to connect to the database.
They removed the support for nulls. You have nulls in your data, that you planned to store in influx? TOO BAD!
Is this explained anywhere? NOPE! I found it in a laconic answer to a confused Stack Overflow question. Seriously, check out the guy's breakdown.

I'm personally still confused by all this. So what's the current best practice regarding nulls? This guy doesn't know. DOES ANYONE KNOW?
They added a whole new concept of indexed columns, called "tags". OK, so maybe tags support nulls? What's the overhead of indexing? How many columns should I index on average? Any examples of real world usage? NOPE! THINK OF IT AS AN ADVENTURE!
Of course, with all these changes, my node.js driver is no longer compatible with the current (easily installed and supported) version of the database. Upgraded the driver and voila! My code now breaks, because the driver itself changed API. Deprecation warnings? NOPE!
In case this wasn't clear, NONE OF THIS WAS EXPLAINED ANYWHERE. I had to hunt this info down from github commits and issue comment threads. For all the shit we give ember.js devs, every deprecation they made was meticulously documented and properly deprecated in advance.
Some commenters are advocating moving to an alternative driver. No idea how much work would be involved with that (still investigating).
Did I mention we launch this Monday? The day after tomorrow?

I can probably fix this crap, but man...

I wouldn't entrust a lemonade stand to these jokers, let alone my data.

dse

It is very new, like few months new. The whole concept is new as well as the implementation. There is also CityzenData if you are willing to pay, but it has its own weirdness, and is closed source service.

Why do you want nulls in a timed time series? If it is not there it is null, and if you must use a placeholder use nan

LB_

TR :WTF: is expecting deprecation warnings when the major version number is 0...

Did I mention the major version number is 0? You're launching knowing this?

anonymous234

Hey, can't find the bugs in a product if no one's using it!

Just ask .

loopback0

@LB_ said:

Did I mention the major version number is 0? You're launching knowing this?

Using version 0 Open Source software on a Production system that people want to work reliably? Couldn't possibly go wrong.

Just another example of the OSS attitude of not giving a shit about users.

LB_

@loopback0 said:

Just another example of the OSS attitude of not giving a shit about users.

It is literally a spelled-out thing: http://semver.org/

4. Major version zero (0.y.z) is for initial development. Anything may change at any time. The public API should not be considered stable.

loopback0

TRWTF is still using it for a system that actually matters.

edit: Irregardless of what most people understand to be version 0 products, even without that link, their own website:

cartman82

@LB_ said:

TR is expecting deprecation warnings when the major version number is 0...

Did I mention the major version number is 0? You're launching knowing this?

Version 0 could mean nothing or everything. Node.js that I'm using is still version 0.

Anyway, not my decision.

cartman82

On skype with the main dev.

-- "How are you dealing with nulls in influ-"

Queue 3 screens worth of rant about what a piece of shit influx is and how, if it was up to him, we'd just use MySQL for everything.

Highlights:

It has some internal clustering. Every few weeks it triggers some kind of rewrite, which causes queries to start returning duplicate data. He deals with that in code.
This clustering can't be disabled, it can only be delayed (like Win10 updates). He maxed it out at like 3 months.
If you insert a non-existent column, it is created. If you select a non-existent column, it throws an error and you can't extract the data. He deals with it by CURL-ing in a new empty row.
Migration is almost impossible. Not sure why.

cartman82

Agreed with the lead dev to just revert to 0.8.x for now.

Installed it using a deb from their site. Opened the local web interface, to create a db...

Broken.

Still says "0.9.3".

So now I need to hunt down the leftovers from the previous install and purge them.

dse

0.8 and 0.9 are two completely different products. Still not certain why you need nulls, and why you cannot use nan instead.

cartman82

@dse said:

0.8 and 0.9 are two completely different products. Still not certain why you need nulls, and why you cannot use nan instead.

Wait, are you like a maintainer or something?

Also what's "nan"?

dse

No, we started using it very early and I compared it with some other products to base our time series db on. It was too early at the time, so much that even input protocol was still json. I did not use any driver or anything, we directly recorded our data.

nan is not-a-number is a floating point numerical value, it is common to use it when some data in time series is missing.

cartman82

@dse said:

No, we started using it very early and I compared it with some other products to base our time series db on. It was too early at the time, so much that even input protocol was still json. I did not use any driver or anything, we directly recorded our data.

Damn, there goes my siren badge.

@dse said:

nan is not-a-number is a floating point numerical value, it is common to use it when some data in time series is missing.

There's zero mention of nan anywhere in their crappy documentation. Also, how would I even represent that in javascript? NaN doesn't seem supported.

And from the web interface, all I get "cannot find column nan" when I try using it in search.

As for why do I need NULL-s.... duh, if I don't have data for a particular column, I need to use something instead, do I? Why do people use NULL-s in general, since ancient times.

LB_

@cartman82 said:

Also, how would I even represent that in javascript?

http://www.w3schools.com/jsref/jsref_number_nan.asp ? (IMO NaN is not an appropriate substitute for null)

dse

Well, it could be called NaN or something else in JS, I think you should open an issue and ask how to do it in JS.
We did not use driver anyways, the protocol is simple, it gets a single line with a PUT and records it, no need to any driver.

@cartman82 said:

As for why do I need NULL-s.... duh, if I don't have data for a particular column, I need to use something instead, do I? Why do people use NULL-s in general, since ancient times.

You are thinking like a traditional database, InfluxDB is a timeseries DB. You should re-think the data model, or you are not fully using the potential of a timeseries database, you may not even need it. It is good for metrics (big data), that happen in given time. And that tags thingy you mentioned is the most important feature of 0.9 because without it everything will start to crawl when yo have huge amount of data.

cartman82

@dse said:

Well, it could be called NaN or something else in JS, I think you should open an issue and ask how to do it in JS. We did not use driver anyways, the protocol is simple, it gets a single line with a PUT and records it, no need to any driver.

I know the protocol.

I'm telling you, their own interface doesn't support it. You can't do SELECT * FROM series WHERE column = nan, it throws an error.

If there was a nan datatype in whatever pre-alpha version you used, it went the way of null in the current one.

@dse said:

You are thinking like a traditional database, InfluxDB is a timeseries DB. You should re-think the data model, or you are not fully using the potential of a timeseries database, you may not even need it. It is good for metrics (big data), that happen in given time.

We basically use it to log events, with like 20-30 columns of additional data. Some of which we may not have at the moment.

And yes, we would have been fine with an SQL. But architect astronauts got involved.

dse

If an entire row in a series is null, one should just not record that row. You can query the time series db like WHERE t0 < time < t1, it is not exactly SQL, and is why it is called SQL-like. If you re-think the db as a gigantic sparse array, then it makes sense to use nan when there is a hole in the data. Also it makes much more sense to avoid having multiple columns in InfluxDB, give each metric its own name and record when you need to.

dse

@cartman82 said:

And yes, we would have been fine with an SQL. But architect astronauts got involved.

Well that is bad, if you do not need it, it will be in the way. If somehow it is broken, can you avoid multiple columns? just record each data in its own measurement. You can query them with regex on the measurement name if you want.

One more thing, maybe you should try SELECT * FROM series WHERE column IS nan

_{my experience is with pre-alpha, I do not know if things are now broken but I suggest you at least open an issue}

cartman82

@dse said:

Well that is bad, if you do not need it, it will be in the way. If somehow it is broken, can you avoid multiple columns? just record each data in its own measurement. You can query them with regex on the measurement name if you want.

That would mean using timestamp as a primary key for joining data.

Heheh.

No.

@dse said:

my experience is with pre-alpha, I do not know if things are now broken but I suggest you at least open an issue

Or even better, switch over to something sane the first chance we get.

Hopefully, I won't be involved with the expectedly disastrous data migration.

@dse said:

One more thing, maybe you should try SELECT * FROM series WHERE column IS nan

Execute QueryERROR: syntax error, unexpected SIMPLE_NAME, expecting $end

Telling you, you got it mixed up with something else.

dse

@cartman82 said:

That would mean using timestamp as a primary key for joining data.

Heheh.

No.

Why not? that is the whole point of a timeseries db. It is crazy in traditional DB but not here.

cartman82

@dse said:

Why not? that is the whole point of a timeseries db. It is crazy in traditional DB but not here.

Didn't look into this, I just did a quick job according to the spec they gave me.
But wouldn't a duplicate timestamp from 2 events ruin my join?
Also, wouldn't that mean like 20 different queries ever time I wanted to insert a single record?

Also, are you gonna tell that poor PHP guy he needs to rewrite his whole db code from scratch?

dse

I think the architect astronaut did not even read the specs about the technology to see how things should change! It is not a substitute for existing SQL databases.
Even the queries on a timeseries db are different, joins and all are implemented for convenience of developers who are familiar with them, but with a different syntax (and implementation) that is all around time. The performance will not be hit, and that is the only reason one wants to have a timeseries db.

cartman82

The point is, I don't see how I can break apart discrete records into their own series and then later put them back together reliably based on timestamps.

Circuitsoft

If you could have two unrelated things with the same timestamp, then it's not time series data.

dse

Something similar to SELECT * FROM /series*/ WHERE t0 < time < t1 basically ask for the measurements in a time span.

blakeyrat

An open-source

That is all the words I needed to see on their homepage before knowing this product would be utter shit. AND LO AND BEHOLD IT WAS.

ben_lubar

C# is open source.

dkf

@dse said:

Still not certain why you need nulls, and why you cannot use nan instead.

Are there aggregate functions? Summing all the widgets known sold in a period works differently when you've got nulls to when you've got NaNs involved. You also probably can't look for NaN easily since the silly programmers probably think that every value is equal to itself…

NaN is not absence of data.

boomzilla

@dkf said:

You also probably can't look for NaN easily since the silly programmers probably think that every value is equal to itself…

Just like nulls (in SQL, at least).

julmu

@Circuitsoft said:

If you could have two unrelated things with the same timestamp, then it's not time series data.

From the frontpage of influxdb.com:

If you have more than one user to track then you're

dse

@dkf said:

Are there aggregate functions? Summing all the widgets known sold in a period works differently when you've got nulls to when you've got NaNs involved.

Yes, there are aggregates like this for convenient downsampling:

SELECT MEAN(field_key) FROM measurement WHERE time > now() - 1d GROUP BY time(10m)

@dkf said:

You also probably can't look for NaN easily since the silly programmers probably think that every value is equal to itself

Dealing with nan always requires special methods. There are usually isnan functions, even C++11 added one because silly programmers had no choice but to look at bit patterns . Because there is some overhead in properly aggregating series with nan some languages have normal aggregates such as mean and nan-aware aggregates like nanmean.

@dkf said:

NaN is not absence of data.

It depends how you want to use it so it becomes more useful. Pandas treats nan as missing value, numpy also has masked arrays as well as sparse arrays. If you have basic numerical elements, to model missing elements you have to take either of the above method: use a placeholder (nan seems good), use a mask (increases memory usage), or use a more sophisticated data structure like sparse arrays (can miss opportunity for vectorization by hardware if it is too sparse).

dse

@julmu said:

If you have more than one user to track then you're

If you do not read documentation you're . It is like someone who has used Excel all the time, then decides to use SQL without reading is SQL at all.

But just to clarify, you can track multiple users, and many measurements.
I do not suggest using it because it is version 0, I myself decided to wait till 1 for many reasons (mainly financial but also that the whole concept is too new).
However, I did read the docs and blogs to familiarize myself with the technology and its performance. To select any OSS I suggest you do the same. Test it, read the fucking docs and blog posts, open an issue or send in ML to see if developers are responsive and courteous (otherwise you end up hating yourself), then finally read the code to see if it is high quality.

ben_lubar

@dse said:

because silly programmers had no choice

- The Go Programming Language

dse

I guess a proper implementation should know if the target has floating point hardware, but I doubt anyone uses Go on something that does not

33		// To avoid the floating-point hardware, could use:
34		//	x := Float64bits(f);
35		//	return uint32(x>>shift)&mask == mask && x != uvinf && x != uvneginf

@dse said:

C++11 added one because silly programmers had no choice

Just one more reason C++11 is a better language than C++. I hate the latter but begin to like the former.

ben_lubar

GoArm

The Go programming language. Contribute to golang/go development by creating an account on GitHub.

dse

Interesting! I think you are well familiar because you are writing your own language to give you advantage on dwarf fortress?

The test to see if f != f is enough then. The only problem is to tell the compiler not to be too smart and optimize that simple-looking comparison, but it is more elegant.

ben_lubar

@dse said:

because you are writing your own language

It's actually Prof. John Boyland's language. I'm taking CS854 this semester. Also, there are no floats in Cool.

@dse said:

to give you advantage on dwarf fortress

No, that one is written in C++:

GitHub - DFHack/dfhack: Memory hacking library for Dwarf Fortress and a set of tools that use it

Memory hacking library for Dwarf Fortress and a set of tools that use it - DFHack/dfhack

@dse said:

The only problem is to tell the compiler not to be too smart and optimize that simple-looking comparison

Floating point breaks a lot of assumptions. For example, addition is no longer commutative.

blakeyrat

@ben_lubar said:

Floating point breaks a lot of assumptions. For example, addition is no longer commutative.

Yeah and those dwarves have to have the full 64-bits of precision.

sloosecannon

@ben_lubar said:

No, that one is written in C++:

TIL you work on DFHack

dkf

@dse said:

The only problem is to tell the compiler not to be too smart and optimize that simple-looking comparison

All compilers get that right. Except the Microsoft compilers, apparently.

antiquarian

@dse said:

If you do not read documentation you're .

Paging @blakeyrat.

Eldelshell

This is why enterprises use Java or .Net. If everything breaks, you don't want your report to state that the company lost 10MM because the driver you used was something put together by some guy on GitHub. This is specially worry in Node land. I mean, just look at the database drivers for well known RDBMS.

dkf

If everything breaks, it's your fault. Maybe because it is your bad code. Maybe because you foolishly used someone else's bad code. Doesn't matter; your responsibility, your fault.