HTML encoded characters in markup-tag #585

Closed
oliverschneider opened this issue Jun 1, 2015 · 7 comments

@oliverschneider

I have some xml-tags with html encoded characters:

<pre>
<code class="language-markup">
<Springer>c6</Springer>
<Läufer>e5</Läufer>
</code>
</pre>
<pre>
<code class="language-markup">
&lt;Springer&gt;c6&lt;/Springer&gt;
&lt;L&auml;ufer&gt;e5&lt;/L&auml;ufer&gt;
</code>
</pre>

Prism doesn't recognize the second tag (Läufer) correctly and doesn't highlight it. Is there a way to add all German umlauts to the markup plugin?

@uranusjr
Contributor

uranusjr commented Jun 1, 2015

This happens because Prism uses \w+ to match a tag name, but in JavaScript \w matches only ASCII letters, digits, and the underscore. There’s no easy solution here; to override this you will need to explicitly specify what you want to match, which is doable but far from pretty. If all you want are characters with umlauts, you can include just those. For example, changing line 10 in the above file from

pattern: /^<\/?[\w:-]+/i,

to

pattern: /^<\/?[\w\u00e4:-]+/i,

and Prism will correctly highlight the above snippet. I am not sure what a reasonable, more general fix would look like, though. Would something like /^<\/?[^ >]+/ work?
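As a rough illustration of the point above, the following sketch runs the three tag-name patterns quoted in this comment against a tag containing an umlaut. The variable names are made up for this example and are not part of Prism; the snippet only exercises the regular expressions themselves, not Prism's tokenizer.

```javascript
// Patterns quoted in the comment above; the variable names are illustrative only.
const asciiOnly = /^<\/?[\w:-]+/i;         // original tag-name pattern
const withUmlaut = /^<\/?[\w\u00e4:-]+/i;  // adds "ä" (U+00E4) explicitly
const general = /^<\/?[^ >]+/;             // the more general form asked about

// "ä" is written as an escape so the string is guaranteed to be precomposed.
const sample = '<L\u00e4ufer>';

// \w covers only [A-Za-z0-9_], so the first match stops right before "ä".
const asciiMatch = asciiOnly.exec(sample)[0];    // "<L"
const umlautMatch = withUmlaut.exec(sample)[0];  // "<Läufer"
const generalMatch = general.exec(sample)[0];    // "<Läufer"

console.log(asciiMatch, umlautMatch, generalMatch);
```

The explicit character class only helps for the characters listed in it, which is why the negated class is attractive as a general fix.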

@apfelbox
Contributor

apfelbox commented Jun 1, 2015

HTML and XML don't share the same grammar: XML allows additional characters in tag names, while HTML allows only ASCII alphanumeric characters.

We could either create a separate XML language that relaxes some of the rules of HTML or, in the spirit of "highlighter, not a linter", allow some invalid HTML.

@oliverschneider
Author

Thank you for your comments. If I change line 10 as suggested by @uranusjr, nothing changes. If I add the characters to line 7, the syntax gets highlighted, but the highlighter thinks that the attr-name starts after the umlaut (see the picture below). How can I fix that?

Line 7 now looks like this:

pattern:/&lt;\/?[\w\u00c4\u00d6\u00dc\u00e4\u00f6\u00fc:-]+\s*(?:\s+[\w\u00c4\u00d6\u00dc\u00e4\u00f6\u00fc:-]+(?:=(?:("|')(\\?[\w\W])*?\1|[^\s'">=]+))?\s*)*\/?>/gi,

[Screenshot: 2015-06-01 at 16:40:16]
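One plausible reading of the symptom described above, sketched under the assumption that only the outer pattern (line 7) was extended while the inner tag-name pattern (line 10) and the attr-name pattern stayed as quoted elsewhere in this thread. The sketch uses a literal < instead of the masked &lt;, drops the g flag, and invents its variable names; it mimics the nested matching by hand rather than calling Prism.

```javascript
// Assumption: only the outer pattern was extended with the umlauts.
const outer = /<\/?[\w\u00c4\u00d6\u00dc\u00e4\u00f6\u00fc:-]+\s*(?:\s+[\w\u00c4\u00d6\u00dc\u00e4\u00f6\u00fc:-]+(?:=(?:("|')(\\?[\w\W])*?\1|[^\s'">=]+))?\s*)*\/?>/i;
// Assumption: the nested patterns are still the ASCII-only ones.
const innerTagName = /^<\/?[\w:-]+/i;
const attrName = /[\w:-]+/;

const source = '<L\u00e4ufer>e5</L\u00e4ufer>';

// The outer pattern now accepts the whole opening tag...
const wholeTag = outer.exec(source)[0];          // "<Läufer>"
// ...but the nested tag-name pattern still stops before "ä"...
const tagName = innerTagName.exec(wholeTag)[0];  // "<L"
// ...so the remainder is left for the other nested patterns,
const rest = wholeTag.slice(tagName.length);     // "äufer>"
// and the attr-name pattern picks up the letters after the umlaut.
const leftover = attrName.exec(rest)[0];         // "ufer"

console.log(wholeTag, tagName, leftover);
```

If this reading is right, both the outer and the nested tag-name pattern need the same character class for the tag to be highlighted as one unit.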

@apfelbox
Contributor

apfelbox commented Jun 1, 2015

You could try something like this [^\s>\/]+ instead of [\w:-]+:

'tag': {
               // ↓-------↓ here   
    pattern: /<\/?[^\s>\/]+\s*(?:\s+[\w:-]+(?:=(?:("|')(\\?[\w\W])*?\1|[^\s'">=]+))?\s*)*\/?>/i,
    inside: {
        'tag': {
                        // ↓-------↓ and here   
            pattern: /^<\/?[^\s>\/]+/i,
            inside: {
                'punctuation': /^<\/?/,
                'namespace': /^[\w-]+?:/
            }
        },
        // ...

Which translates to "a tag name is anything that is not a whitespace character, a closing bracket > or a slash /". Not sure whether this produces other weird special cases (by being too greedy), but it solves your issue.
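A quick way to probe the "too greedy" concern is to run the relaxed tag-name pattern against a few inputs. This is only a sketch of the regular expression's behavior in isolation; the helper name `tagNameOf` is hypothetical and Prism itself is not involved.

```javascript
// The relaxed tag-name pattern from the snippet above.
const relaxedTagName = /^<\/?[^\s>\/]+/i;

// Hypothetical helper: returns the matched prefix, or null if nothing matches.
function tagNameOf(tag) {
    const match = relaxedTagName.exec(tag);
    return match ? match[0] : null;
}

const results = {
    umlautOpen: tagNameOf('<L\u00e4ufer>'),    // "<Läufer"
    umlautClose: tagNameOf('</L\u00e4ufer>'),  // "</Läufer"
    selfClosing: tagNameOf('<br/>'),           // "<br" (stops at the slash)
    withAttribute: tagNameOf('<a href="x">'),  // "<a" (stops at whitespace)
    greedyCase: tagNameOf('<a=b>')             // "<a=b" (anything else is accepted)
};

console.log(results);
```

The last case shows the trade-off: characters that are not valid in a tag name are swallowed too, which fits the "highlighter, not a linter" stance mentioned earlier in the thread.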

@LeaVerou
Member

LeaVerou commented Jun 5, 2015

@apfelbox is right. We should relax the markup grammar a bit, as long as it doesn’t result in incorrect highlighting of HTML examples.

@oliverschneider
Author

So, I used @apfelbox's solution and came up with this (with < masked as &lt;), which seems to work pretty well:

Prism.languages.markup = {
    'comment': /&lt;!--[\w\W]*?-->/g,
    'prolog':/&lt;\?.+?\?>/,
    'doctype': /&lt;!DOCTYPE.+?>/,
    'cdata': /&lt;!\[CDATA\[[\w\W]*?]]>/i,
    'tag': {
        pattern: /&lt;\/?[^\s>\/]+\s*(?:\s+[\w:-]+(?:=(?:("|')(\\?[\w\W])*?\1|[^\s'">=]+))?\s*)*\/?>/i,
        inside: {
            'tag': {
                pattern: /^&lt;\/?[^\s&>\/]+/i,
                inside: {
                    'punctuation': /^&lt;\/?/,
                    'namespace': /^[\w-]+?:/
                }
            },
            'attr-value': {
                pattern: /=(?:('|")[\w\W]*?(\1)|[^\s>]+)/i,
                inside: {
                    'punctuation': /=|>|"/
                }
            },
            'punctuation': /\/?>/,
            'attr-name': {
                pattern: /[\w:-]+/,
                inside: {
                    'namespace': /^[\w-]+?:/
                }
            }

        }
    },
    'entity': /&#?[\da-z]{1,8};/i
};

@Golmote
Contributor

Golmote commented Jun 12, 2015

Fixed and added in the examples:
http://puu.sh/imjG2/296a063039.png

service-paradis pushed a commit to service-paradis/prism that referenced this issue Jun 23, 2015