Detecting Code Indentation (medium.com/firefox-developer-tools)
62 points by ingve on March 3, 2016 | hide | past | favorite | 30 comments


People love to hate tabs, but in a world where we represent logic in megabytes of alpha-numeric characters, doesn't a semantic notation of indentation make just a little bit of sense?

Tabs are for indentation, spaces for alignment. These are different concepts, but spacers[1] just think whitespace is whitespace, and are forever double-tapping their spacebar like some sick mental tic that can't be cured.

Your editor should not need to detect tab size. You should be able to size your tabs however you want. A tab is an indentation, whether you want it to be 2 chars or 20. It shouldn't matter.

The fact that a neural network is being used to detect tab size, just shows how much ground tabbers have lost to the space infidels[2].

[1] http://www.jwz.org/doc/tabs-vs-spaces.html

[2] http://blog.codinghorror.com/death-to-the-space-infidels/


We use the word "the" a lot in English. It would make sense to just replace it with a single character.

But joking aside, I think the point is that the spacers (a lot of them, at least) don't want a semantic space that people can adjust; they want the code to look exactly the same for everyone. I'm reminded of Frank Lloyd Wright, who would design every aspect of a building, even the furniture.


The spacers are right. Even if you use tabs, there is a correct tab size that must be used, due to inner alignment:

   if (foo) {    /* aligned
      blah();       comment */
My autotab.c relies on this as one of its heuristics to determine the proper hard tab size for a file. A tab size that produces good inner alignment gets a score boost.

The wrong tab size will wreck inner alignment.

http://www.kylheku.com/cgit/c-snippets/tree/autotab.c


> We use the word "the" a lot in English. It would make sense to just replace it with a single character.

Indeed, the Shavian alphabet [1] has a single-character abbreviation for "the" (it should have been a th-e digraph). Of course, Shavian failed to catch on, as did other spelling reforms.

[1] https://en.wikipedia.org/wiki/Shavian_alphabet


In Portuguese, "the" is "o" (masculine) or "a" (feminine). "and" is "e". Pretty simple, eh?

Before you ask, no, "i" and "u" have no meaning :-)


You shouldn't use tabs for new projects, because spaces are more popular. It's like endianness: it doesn't matter which you use, as long as it's the same as everybody else.

Right now, "everybody else" is using spaces. So get on board.


Big-endian and little-endian convey exactly the same information. Tabs and spaces don't; tabs impose fewer requirements on presentation and discourage people from wasting effort on fragile ASCII art that mostly just causes spurious merge conflicts.


New projects should standardize on whether they want tabs or spaces out of the gate, preferably enforced by something like editorconfig, and make that decision based on the preferences of the actual people involved at the time.


> Tabs are for indentation, spaces for alignment.

Tabs for indentation will wreck the alignment produced by spaces, unless the tab size is set to the size used by the file's author. Alignment is only preserved for those lines which have the same number of tabs for indentation. Alignment of inner material across lines that are indented to different levels is screwed up.
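To make that concrete, here is a minimal Python sketch that renders two lines at different tab sizes and reports where the comment lands. The sample lines and the helper name `comment_columns` are made up for illustration; the only assumption is that the author lined things up with a tab size of 4.

```python
def comment_columns(lines, tab_size):
    """Return the column at which '/*' starts on each line for a given tab size."""
    return [line.expandtabs(tab_size).index("/*") for line in lines]


# Tabs indent, spaces align. The second line sits one indentation level deeper,
# so the alignment crosses indentation levels.
lines = [
    "if (foo)" + " " * 8 + "/* comment */",
    "\tbar();" + " " * 6 + "/* comment */",
]

for size in (2, 4, 8):
    print(size, comment_columns(lines, size))
```

Only the tab size the author used puts both comments in the same column; the other sizes shift the deeper line left or right.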

The whole concept of every coder using his or her favorite tab size for the same file is completely, utterly moronic.

If the coding convention calls for hard tabs, it must firmly establish their size and everyone must use that size.

Ironically, tabs are for alignment historically, on typewriters, and also in word processors! In a word processor, you don't produce alignment with spaces unless you're a slobbering moron, because the font is proportionally sized. You set up tabs by sliding little wedgies on a ruler. Those things are inspired by hardware tabs on a typewriter. The word tab is related to tabular and tabulate: producing vertically aligned columns.

So you have that backwards.

Programmers who use hard tabs for indentation also use them for inner alignment, like this:

   if (foo)        /* comment */
     bar();        /* comment */
                   ^ this is a hard tab stop
If foo is changed to xy_foo, the comment will probably stay aligned with the one below. Of course, if too much material is added, then the comment will suddenly jump by 8 characters, and a tab has to be deleted, which restores the alignment.

When files use this kind of alignment, they are slightly more resilient in the face of a tab size change. Some of the alignments stay the same; some others jump by a full tab size and have to be fixed.

> forever double tapping their spacebar

That's just due to using a crappy editor, or not knowing how to use a decent one.

In Vim, Ctrl-T in insert mode indents by the shiftwidth amount (when this isn't automatic). This produces an indentation with tabs, spaces or a combination of both, based on whether expandtab is enabled and on the value of tabstop. Ctrl-D deindents. For instance, if you have tabstop=8, shiftwidth=4, noexpandtab, and you are currently indented two widths (one hard tab) and type Ctrl-D to deindent, the tab is replaced by four spaces.


> The whole concept of every coder using his or her favorite tab size for the same file is completely, utterly moronic.

> If the coding convention calls for hard tabs, it must firmly establish their size and everyone must use that size.

Letting coders and readers of code choose their own tab size isn't moronic, and you don't need to firmly establish the tab size. That's only an issue if you use column alignment after the initial indent.

Column alignment is not a requirement in any of the programming languages most of us work in. It's just a stylistic choice, and it causes many problems including the one you describe.

If you stop using column alignment, it doesn't matter what tab size someone uses when they view or edit the code. It doesn't even matter if they use a proportional font!

Take your example:

  if (foo)        /* comment */
    bar();        /* comment */
                  ^ this is a hard tab stop
and rewrite it without alignment:

  // Plenty of room for a longer comment here
  if( foo )
    bar();
Or if you really need a comment for each line:

  // Plenty of room for a longer comment here
  if( foo ) {
    // And room for a long comment here too
    bar();
  }
Either way, you're no longer dependent on the tab size, and not dependent on monospaced fonts either. The code is consistently formatted in any font, even if you set your tabs to display as something odd like five spaces:

  // Plenty of room for a longer comment here
  if( foo ) {
       // And room for a long comment here too
       bar();
  }
Here's a more detailed discussion of the problems that column alignment causes and the benefits of giving it up:

https://news.ycombinator.com/item?id=10206860


Yeah, I hate comments at the end of a line of code. No good can ever come from having a comment on the same line as code, it only encourages completely meaningless comments like:

  int a = 0; /* assign 0 to a */
  a += 5;    /* add 5 to a */
Until you realize that you were really meant to add 6 to a and change the code and not the comment.

Comments should be written on their own line(s) and only to explain the intention or warn against uninformed changes (or both):

  // WARNING! There be dragons here. It *may* look like we're reading
  // uninitialized memory here but it's all part of a greater plan,
  // the plan to get some good entropy into the random pool. So don't
  // remove the reads you don't understand, even if lint tells you
  // they're bad, or you'll severely limit the entropy of all users'
  // private keys from now on.


> Letting coders and readers of code choose their own tab size isn't moronic

I don't like 32 bit IP addresses or 128 bits. I think I will make them 48 bits on my machine. My personal preference beats conventions and standards; that's why I went into computing in the first place!

Why stop at tabs? Every developer should choose their own character encoding. Networking guy Ardeshir is from Iran; let him code from right to left. And let Shigeru use Shift-JIS identifiers for functions, in the middle of UTF-8 source files.

Conformance was yesterday; today, personal preference is king!


You said..

> Programmers who use hard tabs for indentation also use them for inner alignment, like this:

But note that OP said..

> > Tabs are for indentation, spaces for alignment.

The distinction is important. Programmers who use both can actually realize the best of both worlds. I know this is a religious topic, but I've seen it work very well. Hear me out.

I managed teams of game programmers for about 15 years. We used mostly C, C++, and later, C#. It took me some work to get everyone on the same page with this, but once we achieved it, everyone was happy.

The end result is that indentation and alignment worked for everyone on the team. It didn't matter what tab size each individual configured in their editor of choice, everything just worked.


You can use variable sized tabs if your leading whitespace consists of tabbed indentation, followed by alignment spaces. E.g.:

  [TAB][TAB]object.function(argument1, argument2,
  [TAB][TAB]                argument3);
For this to work, all alignment among lines must occur only among lines which are at the same indentation level; alignment must not cross indentation levels.

However, internal alignment must not use tabs. Only indentation may use tabs; all else is spaces.

If you allow tabs for indentation, internal tabs will sneak in, because some people use dumb editors or don't know how to use their good ones. You just need some repository commit hooks to catch that.
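A commit hook for this could be as small as the following Python sketch. It is only an illustration of the rule above (leading tabs, then optional alignment spaces, then no further tabs); the name `violations` and the regular expression are assumptions of this sketch, not part of any existing tool.

```python
import re

# Leading tabs (indentation), then optional spaces (alignment),
# then the rest of the line, which must not contain any tab.
_TABS_THEN_SPACES = re.compile(r"\t* *(?:[^\t ][^\t]*)?")


def violations(text):
    """Return 1-based numbers of lines that break 'tabs indent, spaces align'."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if not _TABS_THEN_SPACES.fullmatch(line)
    ]


sample = (
    "\t\tobject.function(argument1, argument2,\n"
    "\t\t                argument3);\n"
    "\tfoo();\t/* tab used for alignment */\n"
    "  \tbar();\n"
)
print(violations(sample))
```

A hook would run this over the changed files and reject the commit if the list is non-empty. Note that it flags any literal tab after the indentation, including one inside a string literal.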

However, the whole thing is misguided. If tabs are used at all, tab stops should be set to the de facto standard 8 by everyone. And then these tab stops should be freely used for internal alignment between lines, not only for indentation. Either fully invest in tabs and embrace them, or don't use them at all.


> forever double tapping their spacebar

???

I don't know anyone who does this.


People who develop in notepad, presumably. b^)


Tabs are for making TABles. Spaces are for reserving spaces. It's that simple. I would use Excel if I want to make a table. I also wouldn't put tables in my source code normally.


I have always wanted some editor plugins to help with 2-space code indentation. I just can't understand how some developers think that 2-space indentation is readable.

I don't have that many strong feelings on code style with one exception... code blocks should be indented at minimum of 4-spaces.

My feeling is that visually impaired people who prefer 4 spaces, like myself, have a physical, perhaps even medical, handicap/challenge reading 2 spaces, whereas the 2-space folks just prefer 2 spaces so they can shove as many things into one line as possible, or have some sort of archaic 80-chars-per-line argument... or worse, have shit loads of indentation (this is especially the case for JavaScript and Scala... see the "Taming Dangerously Deep Nesting" section of Steve McConnell's "Code Complete").

And when it comes to diffing.. 2-space becomes so egregious that I have to take the code and replace with tabs at times.


Switch to indenting with tabs. Problem gone.


Oh, I like tabs as well, but many battles have been fought and tabs are losing ground in the overall war. Plus there are some fairly good arguments for languages that support variable indentation to not use tabs.

So I concede that loss and concentrate energy on the "please god do not use 2-spaces" battle.


Interesting, I've not seen this discussed before. JOE uses a variation of the GCD method:

https://sourceforge.net/p/joe-editor/mercurial/ci/default/tr...

Create a histogram of the indentation of the last 250 lines of the file (on the theory that there are more comments at the beginning of the file). The GCD of the three most popular indentations is selected. This way one oddly indented line does not screw it up. It does not ignore single space indentations, since I used to use that. Instead it ignores lines which begin with comment characters, like asterisk or '/'. It can still be confused by block comments like this:

    /* first
       second
       third */
The editor does in theory know the syntax, so it could be enhanced to just ignore comment lines.

Also, it determines if the user prefers to use tabs or spaces for indentation.
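Roughly, in Python, that histogram-and-GCD idea could look like the sketch below. This is not JOE's actual code: the function name, the decision to skip unindented and tab-indented lines, and the exact set of comment characters are assumptions made for illustration.

```python
from collections import Counter
from functools import reduce
from math import gcd


def guess_width_from_top_indents(text, tail=250, top=3):
    """GCD of the `top` most frequent space indentations in the last `tail` lines."""
    widths = Counter()
    for line in text.splitlines()[-tail:]:
        body = line.lstrip(" ")
        indent = len(line) - len(body)
        # Skip blank and unindented lines, tab-indented lines, and lines
        # that begin with a comment character such as '*' or '/'.
        if indent == 0 or not body or body[0] in "*/\t":
            continue
        widths[indent] += 1
    popular = [width for width, _ in widths.most_common(top)]
    return reduce(gcd, popular, 0)  # 0 means "no indented lines found"


sample = "\n".join(
    ["    a"] * 3 + ["        b"] * 2 + ["            c"] * 2
    + ["     odd"] + ["   * comment"] * 5
)
print(guess_width_from_top_indents(sample))
```

Because only the three most popular indentations feed the GCD, the single line indented by five spaces does not affect the result.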


In Chocolat we take a similar approach, but the histogram is of the GCDs, not of the indentations.

So we just take a sample of 1000 lines, then take the GCD of every pair. The indentation width is simply the most common GCD.

If it turns out that one bin is not clearly more dominant than the others, then we default to whatever was used last time for that language.
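As a sketch of that approach in Python (not Chocolat's code): the sample here is simply the first 1000 lines, unindented lines are dropped, and the `dominance` threshold and `fallback` parameter are hypothetical stand-ins for "clearly more dominant" and "whatever was used last time".

```python
from collections import Counter
from itertools import combinations
from math import gcd


def guess_width_from_pair_gcds(text, sample=1000, fallback=4, dominance=1.5):
    """Histogram the GCD of every pair of indentations; return the top bin."""
    indents = []
    for line in text.splitlines()[:sample]:
        body = line.lstrip(" ")
        indent = len(line) - len(body)
        # Keep only non-blank lines indented with spaces.
        if indent and body and not body.startswith("\t"):
            indents.append(indent)
    histogram = Counter(gcd(a, b) for a, b in combinations(indents, 2))
    ranked = histogram.most_common(2)
    if not ranked:
        return fallback
    # Fall back when the best bin is not clearly ahead of the runner-up.
    if len(ranked) == 2 and ranked[0][1] < dominance * ranked[1][1]:
        return fallback
    return ranked[0][0]


sample_text = "\n".join(["    a"] * 3 + ["        b"] * 2 + ["            c"])
print(guess_width_from_pair_gcds(sample_text))
```

Taking every pair is quadratic in the number of sampled lines, which is still cheap for a sample of this size.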


One thing I think is a bit annoying is that SublimeText doesn't seem to be able to properly detect the 4-space tab settings in Python if you have some visually aligned elements that are indented to a multiple of 2 - it always incorrectly assumes 2 for those files.

Given that Python itself relies on consistent indentation, once you know it's Python, you should be able to detect the file settings by finding an indented construct and checking the indentation of the next block. That or ignore lines that come between brackets, parens, quotes, etc, when doing the calculation.

I assume there is a plugin for this, but a cursory search hasn't turned it up.
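In the meantime, here is a rough Python sketch of the idea using the standard library's tokenize module, which emits an INDENT token only when a new block starts and none for continuation lines inside brackets or parens. The function name is made up, and this says nothing about how SublimeText does its own detection.

```python
import io
import tokenize
from collections import Counter


def python_indent_width(source):
    """Most common indentation step between a block and its enclosing block."""
    steps = Counter()
    stack = [0]  # indentation widths of the currently open blocks
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.INDENT:
            # tok.string is the new block's full leading whitespace.
            width = len(tok.string.expandtabs(8))
            steps[width - stack[-1]] += 1
            stack.append(width)
        elif tok.type == tokenize.DEDENT:
            stack.pop()
    return steps.most_common(1)[0][0] if steps else None


source = (
    "def f(a,\n"
    "      b):\n"
    "    if a:\n"
    "        return [\n"
    "          1,\n"
    "          2,\n"
    "        ]\n"
    "    return b\n"
)
print(python_indent_width(source))
```

The visually aligned `b):` line and the list items indented by two extra spaces are ignored, because they sit inside brackets. The tokenizer raises an error on source it cannot tokenize, so a real plugin would need a fallback.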


"gofmt" makes this a non-issue for Go. In Python, indentation has meaning, and the compiler is smart enough to understand when tabs vs spaces is an ambiguity, so it's less of a problem.

The last time I ran a C++ project, we just ran everything through Artistic Style [1] in "ansi" mode.

Fussing over this too much is bikeshedding.

[1] http://astyle.sourceforge.net/


The indent-detective plugin [1], which implements this feature for Atom, is directly based on this post. The compare-lines heuristic works really nicely.

[1] https://atom.io/packages/indent-detective


It's a great plugin :)

I would really love it if that plugin had some options. For example, never detect 1 space or more than 8 spaces.


I wrote a JS module for detecting indentation some years ago. I ended up with an algorithm [0] that looks for the most common difference between two consecutive non-empty lines.

[0]: https://github.com/sindresorhus/detect-indent#algorithm
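A simplified Python rendering of that idea, for readers who don't want to dig through the JS: it counts only space indentation and is not the module's actual code (the function name is made up, and the linked README describes the real algorithm).

```python
from collections import Counter


def most_common_indent_change(text):
    """Most frequent non-zero indentation difference between consecutive non-empty lines."""
    changes = Counter()
    previous = None
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines don't count as neighbours
        indent = len(line) - len(line.lstrip(" "))
        if previous is not None and indent != previous:
            changes[abs(indent - previous)] += 1
        previous = indent
    return changes.most_common(1)[0][0] if changes else 0


sample = "if x:\n  a\n  if y:\n    b\n\n    c\n  d\nfoo(1,\n    2)\n"
print(most_common_indent_change(sample))
```

The aligned continuation line at the end contributes a single change of 4, which is outvoted by the four changes of 2.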


It would be nice if the language had a spec on how the code should be formatted. I think Rust is heading in this direction with rustfmt.

https://github.com/rust-lang-nursery/rustfmt


See here: http://www.kylheku.com/cgit/c-snippets/tree/autotab.c

Autotab detects how wide tabs should be and what the indentation is, using a variety of heuristics.


Why are people still manually indenting in the first place? Just use an IDE that does it for you. It doesn't matter what convention it's applying because you don't have to do it. It'll automatically re-indent all your old code to match whatever its own default way is and you can do programming instead of formatting.



