NULL in SQL is a notorious source of errors and confusion (particularly when it ...

da_chicken · on Dec 4, 2018

That's not really NULL's fault that it causes confusion in SQL. That's just ternary logic. People who don't handle NULLs in SQL aren't really mishandling the NULL. NULL is just a value. They're simply failing to understand the Boolean value of UNKNOWN and what that means. They're so used to thinking only in bivalent logic that the additional complexity throws them off.

However, "It's more complex for me to think about," or, "I don't understand the convention," or even, "It's easy to forget the convention," are not a very convincing arguments. It's similar to arguments about little endian vs big endian. Yes, big endian is how we write our positional numbers, but little endian makes casts a noop. Or arguments about zero-based array indexing. These concepts aren't difficult. They're just more complex. Negative numbers aren't difficult, but they're more complicated than just cardinal numbers. Fractions and decimals aren't difficult, but they're more complex than integers. Multiplication and division aren't difficult, but they're more complex than addition and subtraction.

setr · on Dec 4, 2018

>That's not really NULL's fault that it causes confusion in SQL. That's just ternary logic

The problem is that its not simply ternary logic. It's a ternary logic that gets mapped onto a boolean algebra, which leads to the usual strange repercussions (particularly, the presence of nulls creates both false positives, and false negatives, silently).

The SQL language goes out of its way to pretend its not ternary, though in fact it is. You have to actively keep in mind when writing SQL that the database is trying to trick you. This is not a good thing, and it's hard to blame the programmer when they get tricked.

lmm · on Dec 4, 2018

> Or arguments about zero-based array indexing. These concepts aren't difficult. They're just more complex. Negative numbers aren't difficult, but they're more complicated than just cardinal numbers. Fractions and decimals aren't difficult, but they're more complex than integers. Multiplication and division aren't difficult, but they're more complex than addition and subtraction.

We usually consider it a good thing when programming languages let you opt out of the complex thing. In a good language, you can do integer arithmetic if you don't want to deal with fractions or decimals. You can do cardinal arithmetic if you don't want to deal with negative numbers. You can do ordinary Boolean logic if you don't want to do ternary logic.

The problem with SQL isn't that it has NULL. It's that it's too hard to not have NULL. Which is the problem with null in general.

da_chicken · on Dec 4, 2018

Yeah, but saying "I want to use SQL" and "I don't want to use NULL or ternary logic" is a bit like saying "I want to use the existing datetime types" and "I want all years to have 10 months, all months to have 30 days, etc." Or like saying, "I want everything to use integers" and "I need fractional components." Your requirements break the abstraction not because the system is constrained, but because you're breaking the conceptual model that's the foundation of what you're trying to use. It's not a language problem. It's not a data problem. It's not a computing problem at all. It's applying the wrong conceptual model to meet your needs. That isn't a problem with the conceptual model, either, since plenty of people use it very successfully.

lmm · on Dec 4, 2018

> Your requirements break the abstraction not because the system is constrained, but because you're breaking the conceptual model that's the foundation of what you're trying to use.

How so? Elsewhere in the thread it's claimed that the original relational model didn't have nulls, which is what I'd expect.

da_chicken · on Dec 4, 2018

Relational algebra doesn't have nulls, but there's a difference between the mathematical theory and concepts and the reality of a relational system.

As I mention elsewhere, Codd's own list of rules for a relational database [0] explicitly require nulls (see Rule 3).

[0]: https://en.wikipedia.org/wiki/Codd%27s_12_rules

lmm · on Dec 5, 2018

I don't see any entanglement with the rest of the rules, or with what makes a relational database a relational database. "A systemic way to represent missing and inapplicable information" may be necessary, but better alternatives to null are imaginable. A relational database without nulls sounds like an ML without exceptions: actually a pretty good idea.

bunderbunder · on Dec 4, 2018

I guess. My tendency is to think that it's more a problem for developers who are new to SQL, and are surprised to find out that, despite having the same name, nulls in SQL don't have the same semantics as nulls in other languages.

Once you get a handle on the semantics, though, they make a lot of sense. The trick is to understand how SQL's NULL is rooted in mathematical formalism, not the pragmatics of dealing with pointers. It has more in common with NaN in floating-point numbers. So, for SQL, "null <> null" behaves like "NaN <> NaN". For C and friends, "null == null" for the same reason that "0 == 0".

nayuki · on Dec 4, 2018

In SQL, NULL<>NULL yields false. You use IS NULL / IS NOT NULL to test for NULL values. In programming languages, NaN!=NaN yields true. You use x!=x to test for NaN values.

Saying that SQL NULL is rooted in mathematical formalism doesn't explain anything, because anything (even nullptr and NaN) can be explained in mathematical formalism. What we want is a simple semantic model that a human can understand and one that lacks nasty unintuitive surprises.

da_chicken · on Dec 4, 2018

> In SQL, NULL<>NULL yields false.

No, NULL <> NULL yields UNKNOWN. That's why NULL <> NULL and NOT (NULL <> NULL) behave the same: they have the same value. UNKNOWN is a first-order truth value in ternary logic.

The key is that in a WHERE clause, a record is only returned if the WHERE clause evaluates to TRUE. Not TRUE or UNKNOWN. TRUE.

masklinn · on Dec 4, 2018

> In SQL, NULL<>NULL yields false.

It yields NULL, not false. So do NULL = NULL or NOT NULL.

da_chicken · on Dec 4, 2018

NULL isn't a Boolean value in ternary logic any more than 3.2 or 'Hello' or December 12, 2018, have Boolean values. It's UNKNOWN. UNKNOWN is related to NULL, but they don't work identically.

NULL is a value that any column data type can potentially have. NULL is what comparison and evaluation operators work with. UNKNOWN is a ternary Boolean type, and the Boolean type is what Boolean operators work with (AND, OR, NOT) and nothing else. This Boolean type in an RDBMS is unavailable to the user and is for internal evaluation purposes only. RDBMSs that support a "bool" type are not implementing the same thing. You can never say UPDATE MyTable SET Col = Value1 AND Value2. That's not going to work. Many RDBMSs have a documentation page that explains this difference, like this one[0] from Microsoft SQL Server.

Notably, NULL + 3 and NULL * 5 are both NULL. Any mathematic operation on NULL is NULL. But UNKNOWN AND FALSE is FALSE, and UNKNOWN OR TRUE is TRUE.

[0]: https://docs.microsoft.com/en-us/sql/t-sql/language-elements...

icedchai · on Dec 4, 2018

NULL is an alias for UNKNOWN on many systems (like MySQL.) Other DBs don't even have UNKNOWN.

UPDATE table set col=value1 and value2 works fine IF value1 and value2 are booleans.

da_chicken · on Dec 4, 2018

That's a great example of MySQL creating a proprietary extension of ANSI SQL that does little more than deliberately mislead users.

icedchai · on Dec 4, 2018

According to https://en.wikipedia.org/wiki/Null_%28SQL%29#BOOLEAN_data_ty... NULL is the same as UNKNOWN. The standard also asserts that NULL and UNKNOWN "may be used interchangeably to mean exactly the same thing"

In 20+ years of DB work, I have NEVER seen anyone use UNKNOWN. It is always NULL. Always.

da_chicken · on Dec 5, 2018

Alright, I will withdraw my criticism of MySQL on this issue.

However....

> In 20+ years of DB work, I have NEVER seen anyone use UNKNOWN.

I mean, I've already shown where Microsoft does just that [0]. Oracle pretty clearly does the same [1] [2]. People don't use it because you can almost never refer to it directly. The language intentionally hides it. About the only place I know that you can is PostgreSQL [3], which supports the "boolean_expression IS UNKNOWN" predicate.

> The standard also asserts

I assume you've got the 2003 draft standard that's around [4]. I will use that because I don't see any more recent version of 9075-2 that's freely available.

Yes, the standard does say under 4.5 Boolean types:

> This specification does not make a distinction between the null value of the boolean data type and the truth value Unknown that is the result of an SQL <predicate>, <search condition>, or <boolean value expression>; they may be used interchangeably to mean exactly the same thing.

However, that's in the context of describing the Boolean user data type, a.k.a., BOOLEAN. You can tell because 4.2 describes character strings (CHAR, VARCHAR, etc), 4.3 describes binary strings, 4.4 describes the numeric data type, 4.6 describes DATETIME, and 4.7 describes user-defined types.

The standard is not saying that UNKNOWN and NULL are the same. It's saying that the Boolean user data type can use NULL to represent UNKNOWN. It's saying that if you choose implement a BOOLEAN user data type, you can use NULL to represent UNKNOWN. If you choose to assign a boolean expression to a column, that is. Nevertheless, an SQL <predicate>, <search condition>, or <boolean value expression> has a value of True, False, or Unknown. This shown by looking at 6.34 <boolean value expression>:

  <truth value> ::=
      TRUE
      | FALSE
      | UNKNOWN

Or by searching section 8 and seeing where every time they talk about one of the value expressions being the null value, then the predicate "is Unknown".

[0]: https://docs.microsoft.com/en-us/sql/t-sql/language-elements...

[1]: https://docs.oracle.com/cd/B19306_01/server.102/b14200/condi...

[2]: https://docs.oracle.com/cd/B19306_01/server.102/b14200/sql_e...

[3]: https://www.postgresql.org/docs/11/functions-comparison.html

[4]: http://www.wiscorp.com/sql_2003_standard.zip

Edit: Bit of cleanup.

icedchai · on Dec 5, 2018

Ok, I will concede you are technically correct! However, I never seen a developer use "is unknown", even with Postgres. (I have been working with Postgres for over 15 years.) They always use "is null", which is, for all intents and purposes, the same thing from a developer perspective.

da_chicken · on Dec 5, 2018

I've only seen it once that I can think of, and I don't remember where. It might've been an example when they added or explained that predicate. I recall something like (Column1 = Column2) IS NOT UNKNOWN, but I don't know why you wouldn't use Column1 IS NOT NULL AND Column2 IS NOT NULL instead. I guess it might save a bit of rewriting, but it still seems pretty narrow.

It's really not useful unless you're talking about the value of a boolean expression or the underlying concepts of SQL, and most RDBMSs don't let you manipulate that directly with DML (MySQL is the first one I've seen that let you do it, and you just taught me that was the case). It's somewhat hidden because of that.

mercer · on Dec 4, 2018

I thought in most (many?) languages NaN != NaN ?

lmm · on Dec 4, 2018

My degree's in mathematics and I share your disdain for pointer bit-twiddling. I still find SQL nulls difficult to reason about or diagnose. I'm sure there are times and places when their behaviour is what you want but most of the time they're just a big extra complication that you don't want or need.

lugg · on Dec 4, 2018

In the tables I define everything is not null with sane defaults by default.

The places I do allow null are few and far between (e.g. updated_at) and I'm struggling to think of instances I've used them as anything other than absence indicators.

In fact I don't think I ever treat it as anything other than that in code either.

Was the purpose of null ever to mean anything other than I have not been defined/set?

All my objects are statically typed so I never run into the issue of testing is thing.x a thing, it's always a thing, or it's a compile error. It's either set or not set, and thanks to the database convention I only have to worry about certain values having null, most of the time it makes sense anyway. Is updated_at turthy doubles as has been updated tests.

Am I incorrect in this method? With this method I fail to see big extra complication. Will switching to option types help me? I debate they will not. But I'm happy to be convinced. I do avoid nulls. I just haven't seen a problem with them in my own code. (Not true for others)

Specifically, with the caching problem, provided you constrain the cache to reason about null == not set. I see no problem.

    Cache.get(K) // null
    Cache.set(K,3) // void
    Cache.get(K) // 3
    Cache.set(K, null) // deletes, void
    Cache.get(K) // null
    Cache.set(K, false)
    Cache.get(K) // false

Only certain values of mine are going to potentially be null from the database, all of which will be contained within serialised objects.

I just never see the issue the author has. The times K do see it are when people get too clever with default values.

I understand it, I just don't see it in practice. Certainly my not frequently enough to make language changes.

Title should just read "Stop abusing null" because the only time I've seen it be an issue is when people are dual encoding meaning.

lmm · on Dec 5, 2018

> In fact I don't think I ever treat it as anything other than that in code either.

> Was the purpose of null ever to mean anything other than I have not been defined/set?

Different people understand null differently (it might mean "error", "value not in map", "invalid user input", ...) and there's never been a clear consensus. If "null" only ever has one meaning anywhere in your codebase, and any third-party libraries you use only ever use it to mean the same thing, you're probably ok. But as soon as there are multiple meanings you'll have confusion and bugs.

> Specifically, with the caching problem, provided you constrain the cache to reason about null == not set. I see no problem.

If you only have the one cache, sure. As soon as you have a two-level cache you start to have problems (you can no longer cache absent results, since you're using the same representation for absence from your outer cache). Or as soon as null shows up anywhere else.

It's the same problem as stuff that relies on evaluation order, or threadlocals: it's ok most of the time, as long as you're not combining it with something else that does the same thing. The trouble is the times when these noncompositional constructs break down are when you have complex nested code - which is precisely when you most need everything to work the way you'd expect.

lugg · on Dec 7, 2018

> Different people understand null differently (it might mean "error", "value not in map", "invalid user input", ...)

Null only has one meaning, null. That's the point.

As soon as you start applying more to it than that you get problems.

> and there's never been a clear consensus.

This is simply not true. Null is null. That is all. Period. End of story. It has never been more than that.

If you have libraries, functions or existing code that ignores this, then that's on you, the developer, to reason about.

I guess I'm taking your meaning a bit out of context, I do understand years of common practice have resulted in much abuse but I don't think the language designers would have ever denoted double meanig in null values.

> If you only have one cache, sure

I feel like you're missing my point. If you need to handle more meaning than null == notset/unset/absent then you need to resort to a new data type. Null only has one meaning, null. You can't get two meanings out of one.

You, as the consumer of the cache must then decide on how to represent or encode further meaning. Either using the Some<T> pattern or an empty string or something like that.

This serialisation can easily be wrapped around a base cache class that just deals with simple storage where null == not set or unset (absent value.)

But the underlying pattern shouldn't involve itself with further concerns than it needs to.

This opinion is precisely because I've seen this sort of oh I'll just add a has function, oh and then I'll add a sub par serialisation library. Ok now I need a sometimes unserialize. Oh some legacy? Ok now I need to deserialize once to one level and twice to all levels. Oh yea, now I should throw not found.

Ok now every single call to cache.get must be wrapped in try catch and we must cast some values to false and others to empty string, oh yea and you have to call has before every get even if you just want to take advantage of a dynamically typed language and test for that falsely value. Cache.get(test) && okdothing();

It's two functions set sets the thing. Get gets it. If it's not there it returns null. That is the whole contract. Why do people try to over complicate the base contract? It's just crazy over engineering.

> which is precisely when you most need everything to work the way you'd expect

Indeed. And I expect null == null.

Not null == not yet set, set but cleared, set but empty, false, error, not found, or anything else for that matter.

I can use null to represent that my cache does not have a value for that key because that is the design I chose that

lmm · on Dec 9, 2018

> I guess I'm taking your meaning a bit out of context, I do understand years of common practice have resulted in much abuse but I don't think the language designers would have ever denoted double meanig in null values.

The language designers didn't give any single clear meaning to null. They just put it in the language, and so different library authors (entirely understandably) used it for different things, and it's now impossible to standardise on any one universal meaning.

> I feel like you're missing my point. If you need to handle more meaning than null == notset/unset/absent then you need to resort to a new data type. Null only has one meaning, null. You can't get two meanings out of one.

Indeed, because null is a language-level special case. (Whereas using Option you wouldn't have any problem: Option is just another normal user-defined type in the language, so Option<Option<T>> works no differently from any other Option).

> You, as the consumer of the cache must then decide on how to represent or encode further meaning. Either using the Some<T> pattern or an empty string or something like that.

So you have a bunch of awkward complexity in precisely the case where you least want extra trouble. You don't know how many places the cache might assume that null values have its particular meaning, and you have no way to know whether you've got them all. The most dangerous pitfalls in programming are things that usually work.

> It's two functions set sets the thing. Get gets it. If it's not there it returns null. That is the whole contract. Why do people try to over complicate the base contract? It's just crazy over engineering.

There's nothing complicated about using option. Set sets the thing. Get gives you an option that's either some if the thing was set, none if it wasn't. Perfectly normal datatype like you'd write yourself, no special cases anywhere.

> I can use null to represent that my cache does not have a value for that key because that is the design I chose that

Only if you write all you own code and never use anyone else's libraries. And even then, you have to remember all the things you used it to mean in all the places you used it. There's only one null and there's no way to define a user-defined thing that works like null, so it begs to be abused (I'd argue to use it at all is to abuse it, given that it has no particular meaning defined in the language).