Pygments

generic syntax highlighter


Ticket #474 (new defect)

Opened 7 months ago

Last modified 4 months ago

Ruby: Non-ASCII Method Names Not Recognised

Reported by: guest
Owned by: gbrandl
Priority: minor
Milestone: Someday
Component: lexers
Keywords:
Cc:

Description

Ruby 1.9 allows method names to include non-ASCII characters with the following caveats:

* The characters must be valid in the file's source encoding.

* A legal method name that does not end with '!', '?', or '=' may have one of these characters appended.

* The ASCII punctuation characters of which operator methods consist (e.g. [*%&^`~+-/\[<>=]) must not appear in any other permutation, with the exception of the above case.

Pygments does not recognise such method names, lexing the first non-ASCII character as an error. Examples of unrecognised method names are given in http://pygments.org/demo/3147/ .

Change History

Changed 4 months ago by thatch

  • milestone set to Someday

Do you have any reference to those rules, or perhaps the grammar itself? I checked the existing RubyLexer's rules and they're super-complicated:

            (r'(?:([a-zA-Z_][a-zA-Z0-9_]*)(\.))?'
             r'([a-zA-Z_][\w_]*[\!\?]?|\*\*?|[-+]@?|'
             r'[/%&|^`~]|\[\]=?|<<|>>|<=?>|>=?|===?)',
             bygroups(Name.Class, Operator, Name.Function), '#pop'),
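To make the failure concrete, here is a minimal sketch that compiles just the regex portion of that rule with plain `re`, outside of Pygments. The names `METHOD_NAME` and `method_name` are made up for illustration, and the lexer's surrounding state handling is ignored. Because the name alternative starts with `[a-zA-Z_]`, a method name whose first character is non-ASCII cannot match at all:

```python
import re

# Regex portion of the RubyLexer rule quoted above, compiled on its own.
# Groups: 1 = optional receiver/class name, 2 = the dot, 3 = method name.
METHOD_NAME = re.compile(
    r'(?:([a-zA-Z_][a-zA-Z0-9_]*)(\.))?'
    r'([a-zA-Z_][\w_]*[\!\?]?|\*\*?|[-+]@?|'
    r'[/%&|^`~]|\[\]=?|<<|>>|<=?>|>=?|===?)'
)


def method_name(text):
    """Return the method-name group matched at the start of text, or None."""
    m = METHOD_NAME.match(text)
    return m.group(3) if m else None


if __name__ == '__main__':
    for candidate in ('empty?', 'Foo.bar!', '<=>', 'ünïcode'):
        print(repr(candidate), '->', repr(method_name(candidate)))
```

How the later characters of a name are treated depends on how `\w` is interpreted (Python version and regex flags), so the sketch only relies on the first-character case.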

Changed 4 months ago by thatch

I did some digging. I still can't find a formal announcement, but local rubyers confirm that such support was "rumored."

Checking the source (Ruby 1.9 snapshot, parse.y), I see some code for this:

#define is_identchar(p,e,enc) (rb_enc_isalnum(*p,enc) || (*p) == '_' || !ISASCII(*p))
#define parser_is_identchar() (!parser->eofp && is_identchar((lex_p-1),lex_pend,parser->enc))
...

    mb = ENC_CODERANGE_7BIT;
    do {
        if (!ISASCII(c)) mb = ENC_CODERANGE_UNKNOWN;
        if (tokadd_mbchar(c) == -1) return 0;
        c = nextc();
    } while (parser_is_identchar());
    switch (tok()[0]) {
      case '@': case '$':
        pushback(c);
        break;
      default:
        if ((c == '!' || c == '?') && !peek('=')) {
            tokadd(c);
        }
        else {
            pushback(c);
        }
    }
    tokfix();
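
As a rough sketch of how the parse.y logic above might translate to a Python regex (not a tested patch for the lexer): an identifier character is an ASCII alphanumeric, an underscore, or anything non-ASCII, and a trailing '!' or '?' is taken only when it is not followed by '='. The names `IDENT_START`, `IDENT_CHAR`, `RUBY19_NAME` and `match_name` are hypothetical. The rule that a name cannot start with a digit is my assumption, since the excerpt does not show how the first character is chosen, and the '=' suffix for setter names is not covered here.

```python
import re

# Mirrors is_identchar(): ASCII alphanumeric, '_' or any non-ASCII character.
IDENT_CHAR = r'(?:[A-Za-z0-9_]|[^\x00-\x7F])'
# Assumption: same set minus digits for the first character.
IDENT_START = r'(?:[A-Za-z_]|[^\x00-\x7F])'

# Optional '!' or '?' suffix, unless the next character is '='
# (the `!peek('=')` check, so that "foo!=bar" lexes as foo, !=, bar).
RUBY19_NAME = re.compile(IDENT_START + IDENT_CHAR + r'*(?:[!?](?!=))?')


def match_name(text):
    """Return the identifier matched at the start of text, or None."""
    m = RUBY19_NAME.match(text)
    return m.group(0) if m else None


if __name__ == '__main__':
    for candidate in ('résumé?', '日本語', 'λ! x', 'foo!=bar', '1abc'):
        print(repr(candidate), '->', repr(match_name(candidate)))
```

This works on decoded text, so it sidesteps the source-encoding check that the C code does per byte; whether that is acceptable for the lexer is an open question.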