Thomas Tempelmann | REALbasic - Proposal for a better compiler

I like to propose a few changes to the REALbasic language semantics that should deal with some of the things I mention on the page REALbasic Annoyances.

My Background

I had worked on improving a Modula-2 compiler (called Megamax Modula-2) in the mid- and late 80s, facing the very same issues as we're now having with the REALbasic compiler: I added a new expressions evaluator and code generator to it, dealing with type conversion, constant folding and optimizations. I also was a member of the ISO standards committee for M-2 for a while, discussing solutions with a lot of bright people.

Proposed changes to the compiler and language

There are several steps that can be implemented one after one, improving user experience each time.

Step 1

This deals with the current problems of the huge bunch of unnecessary warnings when having code like "if unsignedIntVar = 0":

The compiler, if it encounters a sub-expression such as "a + b", must make sure that both have the same type.

If they have different types, and if one of them is a literal, its type is converted to the other's type. That way, if you write "if a > 0", the 0 gets the same type as that of "a", and then the compiler goes its way as before. This gets rid of the "precision loss" and similar warnings we're currently (as of RB 2008r4) facing. This change can be done in the compiler rather easily. It's static and requires no extra code generation.

Update (10 Apr 2010): Turns out this is more complicated as it should be, due to other mistakes that were done in the compiler, in which it tried to be "smart", now leading to problems for this fix. See end of this article for details.

Step 2

If a sub-expression such as "a + b" has different types, and none of them is a literal, it can lead to unexpected results currently. Some cases can be solved better as shown in Step 4, but for those where this is not possible, a more stringent rule has to be introduced:

If there's an ambiguity in how the calculation could be done internally, i.e. either signed or unsigned, it should be rejected as a compile error. For example, this would not compile any more:

  dim ui as UInt32 = 5
  dim si as Integer = -1
  ui = ui * si

This is is nonsense in arithmetic terms anyways. If the user meant to create a binary 2's complement value, then he should instead use the proper operators to do that, e.g. BitXOr. He could even multiply by &hFFFFFFFF, but then has to disable overflow checking (see Step 3).

Here's another example:

  dim ui as UInt32 = 5
  dim si as Integer = 2
  ui = ui * si

Here, the result seems obvious - there's no error to expect, right? Yet, the compiler has to decide which kind of CPU instruction to use - signed or unsigned. It could create code that decides this at runtime, based on the actual values, but that would lead to ineffient code even if the programmer knows what's the safe way to do this calculation. Hence, to prevent the compiler from guessing and creating always inefficient code, the user is forced to tell the compiler, e.g. by writing the expression like this: ui = ui * UInt32(si)

This means that some code will need changing, but it should be rare, and it would improve the predictability of calculations in the long run.

Step 3

There are cases where values do not fit, causing incorrect results. Examples are making an arithmetic calculation that overflows and assigning a negative number to a unsigned variable.

Currently, the compiler can give us warnings on demand only. But these warnings will not tell us if the overflow really occurs, only that it can happen.

Programs that depend on calculations need a definitive knowledge if a value overflows, though.

Therefore, we need extra protection which can only be provided by runtime exceptions. If a value overflows its storage range, or if a calculation overflows, we should get a Runtime Exception notifying us of the wrong outcome. Of course, there are cases were a overflow is actually anticipated, but that's rather the exeption. As such, the programmer should be able to locally disable the overflow checking for such exceptional cases, either by a pragma (which I find a rather ugly solution but may be more convenient in some areas) or by using special keywords and functions to indicate which operations shouldn't be checked.

Step 4

This step will change the way how expressions get evaluated. Think of the statement "int64var = int32var * int32var". Why shouldn't we be allowed to expect that the computation calculates the product of the two 32 bit values and assigns the result, which could be up to 64 bits now, to the 64 bit variable? Currently the compiler, however, calculates the product with a 32 bit multiplication and only keeps the lower 32 bit of the result before assigning it to the 64 bit var. This can be justified by the compiler writer but is still not user friendly.

Here's another even less obvious example:

  dim i1 as UInt8 = 200
  dim i2 as UInt8 = 2
  dim i3 as Integer
  i3  = i1 * i2

The experienced C programmer expects to find 400 in i3. However, RB calculates the result as 144. That's because RB does the calculation on the biggest type of the involved types. Meaning it calculates the product of 200 * 2 as a unsigned 8 bit value, which overflows.

No one wants that - unless it's binary magic, which is the exception and should be explicitly identified as such in the source code to avoid misinterpretation. For the normal arithmetical cases, however, this is not a wanted outcome.

Therefore, ideally, all calculations, should be done numerically correct, internally meaning on the data type that can hold the result without loss. Due to limitations of the CPU this can't always be done without the cost of efficiency, though. Therefore, a reasonable compromise has to be implemented:

As long as the involved numbers remain inside a 32bit range, do the calculation on 32 bit. This deals with the second example above (the one with 2*200 with 8 bit values).

If the result can exceed the 32bit range, it's getting complicated. Consider this:

  dim i1, i2, i3 as integer
  i1 = 10000000 // 10 million
  i2 = 20000000 // 20 million
  i3 = 1000000 // 1 million
  i1 = i1 * i2 / i3

While the final result fits into a Integer, the interim result exceeds 32 bit at the time of the multiplication. Hence, to get the correct result here, the calculation would need to be done 64 bits wide.

Since the compiler can't predict the involved values, it would have to turn _any_ addition and multiplication that is followed by a reduction (subtraction, division) in 64 bit to be on the safe side.

However, 64 bit calculations are a bit more time consuming than 32 bit operations on Intel CPUs (not on PPC, though) because they require several instructions instead of just one. But this is the only way to get a correct result in such cases where the values might overflow. If speed is of highest priority, the programmer can avoid this automatic promotion to 64 bit operations by explicitly forcing 32 bit on the interim results, like this: i1 = Integer(i1 * i2) / i3

Plus, a warning could let the user know of this possible (but small) performance hit so that he can identifiy these areas of optimization.

Agreed, there are still overflows possible once the results exceed even 64 bit, but that's much less likely to occur, and then there's still the overflow checking at runtime letting the programmer know if the value should indeed overflow.

Why the above is not enough (10 Apr 2010)

Norman Palardy showed an example of what the compiler currently allows a user to program:

    dim i as Uint32 = 10
    do
         i = i - 1
    loop until i < 0

This loop should never end, because the unsigned "i" variable can never become negative. Yet, the compiler makes it end nonetheless, by performing the comparison against zero in signed arithmetics. This means that the same "bug" that makes the expression "dim ui as UInt32 = 3000000000 : if ui > 0 then" return the wrong (i.e. commonly unexpected) result, makes this loop work where it shouldn't (i.e. due to integer expression rules with common languages such as C).

Fixing the compiler to use the common rules as proposed here would break code like that loop above, turning it into an endless loop. And such a breaking code existing code would be unacceptable because users would not know why their apps would suddenly misbehave, and it would be very difficult to find all these problemetic cases by hand in existing code.

If the compiler would be changed, the compiler would have need to flag these formerly working code cases in a way that the user is pointed to them so that he knows where and how to apply the code fixes (in fact, all the user has to do is to change the type of the involved variables just the same way the compiler now does it implicitly). The changes would therefore even remain backward compatible.

But for the compiler to be able to detect these cases, he'd have to detect these, based on at least these criteria:

The if expression is constant (i.e. will never be true)
An unsigned int and a literal or signed int are involved

If these checks apply, this could be flagged as an error where the user would be told to fix the type of the involved variables (similar to Step 2 above). But maybe there are other cases are well, I am not sure.

Of course, this would require that the compiler would be able to fold constant expressions during compile time. (i.e. turn "5+6" into "11" instead of generating code that does the addition), and that's still another missing part of the compiler, I believe.

REALbasic, Programming, Compilers