Monday, March 23, 2009

I don't think I posted this one yet..

Universal Programming Syntax

Any programming language syntax can basically be decomposed into nested lists, something like XML. The lists would include parameters, keywords, function calls, whatever. It's essentially taking every structure, ordering its terms in the same way, and converting every operator to hierarchical structure or keywords. The idea is that if we could make a generalized language grammar, somewhat like XML but easier to type/read and perhaps richer in structures, we could express any programming language in this form. That way learning a new language would be much easier, because you don't have to learn a new syntax or grammar--merely its constructs and functions--and also you wouldn't have to put up with really ugly syntax.

It isn't necessarily that every new language would have to use this specification, but that people could write front-ends that can convert from this specification to given languages and back, preferably as IDE plugins.
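
To make the nested-list idea a bit more concrete, here is a minimal sketch in Python of what one small piece of such a front-end might look like: it takes an expression held as nested lists and writes it out as C-like text. The operator names ("le", "plus", and so on), the "call" marker, and the `to_c` function are all made up for this illustration; they aren't part of any real specification.

```python
# hypothetical back-end: nested lists -> C-like source text.
# the operator names ("le", "plus", ...) are invented for this sketch.
INFIX = {"plus": "+", "minus": "-", "times": "*", "le": "<=", "lt": "<", "and": "&&"}

def to_c(node):
    """Render one nested-list expression as a C-like string."""
    if not isinstance(node, list):
        return str(node)                      # a name or a literal
    head, *args = node
    if head in INFIX and len(args) == 2:
        left, right = (to_c(a) for a in args)
        return f"({left} {INFIX[head]} {right})"
    if head == "not" and len(args) == 1:
        return f"!{to_c(args[0])}"
    if head == "call":                        # ["call", name, arg, ...]
        name, *call_args = args
        return f"{name}({', '.join(to_c(a) for a in call_args)})"
    raise ValueError(f"unknown construct: {head!r}")

tree = ["and", ["le", "x", 255], ["not", "y"]]
print(to_c(tree))   # ((x <= 255) && !y)
```

Going the other way (from C back to the lists) would need a real parser for C, which is where most of the work in such a plugin would probably be.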

How exactly this language should be designed is hypothetical--I could take a shot at it, but that doesn't mean that my suggestion for a universal language is inextricably linked to my particular idea of an implementation of it.

One thing that comes to mind is that, although every nested structure in the program could be nested in the universal language in the same way, that could make it much less readable.
take, for example: for(int x=0;x<=255 && !y;x++) {do_this(exp((x+1),2)+3); }
you could write it as

for:
    initial:
        declare int x 0
    compare:
        and:
            le x 255
            not: y
    action:
        inc x
    do:
        function do_this:
            sum:
                exponent:
                    sum: x 1
                    2
                3


and the above works okay for control structures, but is horrible for and's and or's and math--basically any operators.

and on the other hand, you could do it like this:

for(int(x,0), and(le(x,255),not(y)), inc x, function(do_this, sum(exp(sum(x,1), 2), 3)))

which is a little better for operators, but isn't so good for control structures.
and, of course, you could simply allow arbitrary line breaks and do it like this


for(
    int x 0,
    and(le(x,255),not y),
    inc x,
    function(do_this, sum(exp(sum(x,1), 2), 3))
)


but that still could be made a little bit more elegant, by allowing two forms of nesting:


for:
    int x 0
    and(le(x,255),not y)
    inc x
    function(do_this, sum(exp(sum(x,1), 2), 3))


(there, indentation is being used as a grouping mechanism.)
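
indentation as a grouping mechanism is fairly cheap to parse. here's a rough sketch in Python, assuming indentation is done with spaces only and that each line is kept as a raw string (splitting a line into its own list items would be a separate step). the function name `parse_indented` and the [line, children] layout are just choices made for this example.

```python
def parse_indented(text):
    """Turn indented lines into nested lists of the form [line, [children...]]."""
    root = []
    stack = [(-1, root)]          # (indent width, children list) per open level
    for raw in text.splitlines():
        if not raw.strip():
            continue              # blank lines carry no structure
        indent = len(raw) - len(raw.lstrip(" "))
        while stack[-1][0] >= indent:
            stack.pop()           # close every group at this depth or deeper
        children = []
        stack[-1][1].append([raw.strip(), children])
        stack.append((indent, children))
    return root

source = """\
for:
    int x 0
    and(le(x,255),not y)
    inc x
"""
tree = parse_indented(source)
print(tree[0][0])                    # for:
print([c[0] for c in tree[0][1]])    # ['int x 0', 'and(le(x,255),not y)', 'inc x']
```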
and even further, we could be more kind to operators, and technically we wouldn't even be changing the definition of the universal language:


for:
    int x 0
    ((x le 255) and not y)
    inc x
    function(do_this, ((x plus 1) exp 2) plus 3)


although it might do to make some standards about how things in lists are ordered, so for example, you can't have the function/operator name be the 4th element in the list unless there are only three elements in which case it's the first element but only on tuesdays and depending on the price of beans as declared earlier in the source.

one thing we should not allow, though, is implicit operator precedence. all nesting should be explicit, so that you don't have to learn the order of precedence for the particular language or think about it when you read some source code. an exception should maybe be made, though, for basic numerical operators, since everyone learns in elementary school or junior high that it goes: explicit grouping, then ^, then * and /, then + and -. it's still on the table whether or not symbolic operators should be allowed in the specification at all. in some cases they make code more readable; in other cases words would make the meaning more obvious. one solution would be to allow only >, <, <=, >=, *, /, +, -, . (namespaces), and either <> or !=. ^ shouldn't be allowed, since it means exponent in some languages and XOR in others, and % can mean percent, modulus, string interpolation, etc. i'm being strict about it to make things easier for those who haven't studied the language at all. it could instead be made a language for people who do a little bit of studying, which would make it a little more concise but a little less 'accessible'.
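
as a sketch of what "all nesting must be explicit" could mean in practice, here's a small Python function that turns explicitly grouped infix lists into prefix lists and refuses anything that would need a precedence rule to read. this is the strict version, with no exception for arithmetic. the operator names and the function name `to_prefix` are invented for the example.

```python
# hypothetical word operators; a real specification would pick its own set
OPERATORS = {"plus", "minus", "times", "over", "exp", "le", "lt", "and", "or"}

def to_prefix(expr):
    """Convert explicitly grouped infix lists to prefix lists.

    [a, op, b] becomes [op, a, b]; ["not", a] stays as it is. Anything longer,
    such as [a, "plus", b, "times", c], would need precedence rules, so it is rejected.
    """
    if not isinstance(expr, list):
        return expr                            # a name or a literal
    if len(expr) == 2 and expr[0] == "not":
        return ["not", to_prefix(expr[1])]
    if len(expr) == 3 and isinstance(expr[1], str) and expr[1] in OPERATORS:
        left, op, right = expr
        return [op, to_prefix(left), to_prefix(right)]
    raise ValueError(f"ambiguous or unknown grouping: {expr!r}")

# ((x plus 1) exp 2) plus 3
print(to_prefix([[["x", "plus", 1], "exp", 2], "plus", 3]))
# ['plus', ['exp', ['plus', 'x', 1], 2], 3]
```

something like `to_prefix(["x", "plus", 1, "times", 2])` raises a ValueError, which is the point: the reader never has to guess which operator binds first.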

while it's up to whomever to specify how a particular language is translated into the universal language, there should probably be some guidelines set to foster consistency at little cost. for example, for loops exist in most every language, and we could dictate that for loops should start with the name 'for' as the first item. which they would probably do anyway, but perhaps there are other cases that are less normative. and more than just the 'for' would be specified.
common elements of a for loop include:

initialization
comparison
incrementation or whatever
variable name(s)
list you're selecting from
what to do

different languages would use different items of that list. each item could be given an official name, and a language uses whichever items are appropriate. it would be somewhat like the first example of code in this text, rather than the later examples where i just allowed positions in the list to determine meanings.
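
a rough Python sketch of that: one fixed set of official part names, and each language's 'for' picks the ones it needs. the particular names below ("increment", "iterable", "body", etc.) are my guesses at names for the items listed above, and `make_for` is a made-up helper, not part of any real tool.

```python
# hypothetical official names for the parts of a for loop
FOR_PARTS = ("initialization", "comparison", "increment", "variables", "iterable", "body")

def make_for(**parts):
    """Build a 'for' node from named parts, rejecting names outside FOR_PARTS."""
    unknown = set(parts) - set(FOR_PARTS)
    if unknown:
        raise ValueError(f"not an official 'for' part: {sorted(unknown)}")
    # keep the official order so loops from every language read the same way
    return ["for"] + [[name, parts[name]] for name in FOR_PARTS if name in parts]

# a C-style loop and a Python-style loop use different subsets of the same names
c_style = make_for(initialization="int x 0", comparison="x le 255",
                   increment="inc x", body="do_this x")
py_style = make_for(variables="x", iterable="items", body="do_this x")
print([name for name, _ in py_style[1:]])   # ['variables', 'iterable', 'body']
```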

obviously mechanisms for literal strings and also comments need to be included. i'm a fan of Python's flexibility when it comes to literals. for comments i like C's; I think they visually stand out well as being extraneous to the code. even more so if it's all //'s, but then you need an editor that can block comment and uncomment for convenience.

you may have noticed that i pulled some tricks with being able to use spaces to separate list items in some cases and commas in others. basically i tried to allow as much flexibility for the programmer in that as possible while making sure it can still be parsed unambiguously. so the three levels of separators/grouping would be spaces, commas and newlines, but they can be shifted up or down at whim. and parentheses can help too.

i suppose other things that really demand symbols are dereferencers and subscripts. more so dereferencers, because a[10] can be handled as (a 10), a 10 or a(10), or even a sub 10, but dereferencers might get tedious with having to type ptr ptr a, ptr ptr (ptr b), etc. however, instead of doing that we can do this: p2 a, p2(p b), etc. or _p _p (_p b) isn't too bad anyway.

Should we have a mechanism for distinguishing language keywords from arbitrary names? this mechanism should probably be some non-enforced kind of Hungarian notation defined by the language translator. for example, keywords could always be all caps.

another remaining issue is string literals. in what universal way should they be implemented? I would go for Python's syntax, with the possible exception that the 'u' modifier might become superfluous, as we could make everything always unicode, then translate to ascii or other encodings when necessary in the language translation. also we could add PHP's nowdoc syntax.

one other issue: the plain list vs. named sections formats, for example the way i did the 'for' command the first time vs. the subsequent times. should the language itself determine which one one uses, or should the user be able to use both styles for any given language? the parser could specify the components needed in a way similar to how Python defines function parameters, such that arguments may be passed by name or just listed in order, and if the particular grammar allows it, names can even be passed that weren't pre-defined.
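
since the comparison is to Python's function parameters, one way to try the idea out is to literally borrow Python's own binding rules through the standard `inspect` module. this is only a sketch: `for_spec` and `bind_parts` are hypothetical names, and the parts are kept as plain strings.

```python
import inspect

def for_spec(initial, compare, action, do):
    """Hypothetical declaration of the parts a 'for' node accepts."""

def bind_parts(spec, *positional, **named):
    """Match listed and named parts against a spec, the way Python binds call arguments."""
    bound = inspect.signature(spec).bind(*positional, **named)
    return dict(bound.arguments)

# plain-list style: meaning comes from position
listed = bind_parts(for_spec, "int x 0", "x le 255", "inc x", "do_this x")

# named-section style: any order, same result
named = bind_parts(for_spec, do="do_this x", initial="int x 0",
                   action="inc x", compare="x le 255")

print(listed == named)   # True
```

a missing or unknown part raises TypeError, just as a bad function call would. a spec that ends in a `**extra` parameter would accept names that weren't pre-defined, which covers the "if the grammar allows" case.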

for those familiar with compiler technologies, yes, this is basically just a flexible, human-friendly way of specifying abstract syntax trees.
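
for a look at what an existing abstract syntax tree is like, Python ships one in its standard `ast` module. the snippet below parses a Python version of the expression from the earlier examples and pokes at the nested nodes; it only shows Python's own tree, not the universal format proposed here.

```python
import ast

tree = ast.parse("do_this((x + 1) ** 2 + 3)")
call = tree.body[0].value            # the Call node inside the expression statement
print(type(call).__name__)           # Call
print(call.func.id)                  # do_this
print(type(call.args[0]).__name__)   # BinOp (the outer "+ 3")
print(ast.dump(call.args[0].left))   # the nested (x + 1) ** 2 subtree
```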

2 comments:

TabAtkins said...

You've just independently invented another Lisp dialect, only with horrible syntax. Don't worry, you are in company with many great thinkers who also didn't realize that what they wanted was invented in the 60s, and is still used by a decent group of people (and is even experiencing a bit of a renaissance in recent years).

To be more specific and less snarky, Lisp is essentially programming directly into an abstract syntax tree, which is exactly what you correctly guess you are doing at the end of your post.

Your problem, though, is that you're missing the *real* benefit of such a generic syntax. When the syntax is *this* simple and regular (well, most Lisps have a simple, regular syntax), you can *rewrite* that syntax easily.

You touch on this at the beginning, when you talk about writing a programming language in XML (such things exist, by the way...). XML has *structure*, and more importantly, the structure is *machine-readable* (and machine-writable). This means that you can have programs which take data and then write more programs for you.

Sound silly? It's a core part of Lisp (code that can do this is called a macro), and it's why the language is so powerful. Other languages recognize this as well in limited ways, though they can never do it reliably due to their complex syntax.

For example, Lisp has an extremely useful abstraction contained in the setf macro, which is nearly impossible in most languages. setf sets a variable to a value, similar to the = operator in C, with a simple (setf var val) call. The difference is that setf can do more than just set a variable, it can set a *place*. Frex, say you have a list with ten elements. The function that returns a particular element of the list is elt. You use it like (elt list 1) to read the element with index 1. Now, in a traditional language, if you wanted to *change* the element with index 1, you'd need a special setter function. Not in Lisp. You just type (setf (elt list 1) foo) and it'll automatically set the second element to the contents of the foo variable.

C-like languages can do this in limited ways, so this might not be especially impressive. Usually they allow you to set an array element directly, such as "a[1] = foo;". But that's usually it. If you want to go any further, you can't. If you create a class, you *have* to create both a getter and a setter for each property, and you can't use the convenient = operator to do it; "foo.slot1 = bar;" will usually give you a syntax error, rather than setting the slot1 field of the object "foo".

Lisp, though, lets you do this. You just have to tell setf how to transform the getter call into a setter call, and from then on you can just use the getter for both reading and writing. You can even do crazy nesting, like (setf (first (cell-slot (gethash "bar" foo))) "baz") to set the first element of the class member named slot of the object stored in the hashtable foo with the key "bar" to the value "baz". And all setf has to know is how to transform each of those calls (first, cell-slot, and gethash) individually, so you can mix in your own custom getters as well.

inhahe said...

i did know that already about lisp -- that the specification is essentially the syntax tree and that this allows a lot of neat self-modification. i imagine lisp to be unique among languages in this respect.. i've been (sort-of-maybe) meaning to learn it for a while now.

the point of my post was really less about any particular programming language and more about a universal syntax that can be used for all languages -- so the user can forgo half of the effort it takes to learn a new language. i suppose a lisp-like syntax *could* be used for this, but i think mine is more flexible, and while it may be slightly less readable than an original language it would be used for, lisp always seemed even *less* readable to me.

just to give a concrete example of what my intention is -- you would be able to use my syntax specification to code in, say, perl, if you so desired, if you didn't like perl's native syntax or you didn't want to learn it. independently of perl's ability or inability (namely inability) to self-program the syntax tree like lisp does.
(either the authors of Perl would gratuitously offer an alternative parser that understands this syntax, or an ide plugin or other tool-chain utility could be used to translate to perl prior to passing it to the interpreter.)

i figured there was probably already a way to program in xml, but probably not universally, and either way xml *is* a horrible syntax to program in.

of course, if you (or others) really think my syntax is that horrible, my fundamental idea here doesn't demand the use of my particular suggestion for the specifics of such a syntax. but i happen to like my syntax!:P