psychic stalker
Community Member
- Posted: Thu, 01 Jul 2010 00:48:25 +0000
The way Gaia handles BBCode right now is badly broken, and it has not been improving in the last three years or so. The recent troubles with the quote tag were fixed, yes, but the "fixes" introduced a significant number of new problems, and several other problems have either gotten worse or remained unfixed.
On that note, the word filter is badly broken as well. It makes it harder to discuss HTML and JavaScript in C&T, and it causes confusing and undesirable alterations to certain snippets of text. Most of this, devs have told me, is related to the XSS protection.
A rewrite is overdue.
Some examples of what's wrong:
- URLs that are not both preceded and followed by whitespace (at the beginning or end of a post, or at the beginning or end of a quote) are not detected by the parser for auto-linking. See Sonic's post below for an example of the problem.
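- A double-quote immediately before a right-paren is parsed into a " followed by a wink smiley, presumably because the quote is first encoded as &quot; and the resulting ;) is then matched as a smiley.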
- Inside code tags, semicolons immediately following a right-paren are stripped. This makes sharing C, C++, PHP, Java, or similar code impossible on Gaia.
- Backslashes are improperly stripped or unquoted. No one can type a Windows path without all of the pathnames getting scrunched together. C-style escape characters (like backslash-n and friends) are also unquoted improperly.
- Unclosed tags break quotes.
- Prematurely-closed quotes break posts.
- Excessively-nested quotes break browsers. (I think someone demonstrated that a couple years ago with 31 nested quote tags.)
- The words window, expression, JavaScript, applet, and several others are redacted if they are followed by a colon or a period under certain conditions. If they're in "quotes," backslashes appear. I realize the reason for this is to prevent cross-site scripting attacks, but in this here Real World we live in, it's not only not doing that, it's also "fighting" an "enemy" that doesn't exist and can't exist when the text is being properly escaped and handled (which it isn't - at all - right now).
- HTML in code tags is inappropriately mangled with ampersands. It should only be passed through an entity encoder (something like PHP's htmlspecialchars()), not whatever is done to it now.
- On that note, most HTML entities are mangled or destroyed in whatever process is being done to them. The only one that seems to be preserved correctly is "&amp;", though it gets inappropriately "unencoded" in many cases.
- Any percent sign followed by at least two letters or numbers gets stripped, mangled, or "URL-decoded." This makes it impossible to discuss Windows or DOS-style environment variables (%PATH%, for example).
- A parenthetical statement that happens to end in 8 is inappropriately turned into a smiley. (This is easy to fix if you write an actual parser instead of just doing regex matching, and has been a problem since long before 2008.)
- URLs that the URL matcher does detect are incorrectly wrapped in url tags when they appear inside code tags.
- Code tags strip leading whitespace. They should behave like <pre>, not as they do now. This makes it impossible to discuss Python programs (for which whitespace is significant) and makes sharing snippets of Haskell code inconvenient (since they have to be one-liners or use the ugly brace syntax).
- Code tags in the journal are completely unstyled, in addition to the above problems.
- Several tags break when nested. For example, bold italics.
- The supposed XSS protection is not. It "protects" against attacks that are not possible and "fixes" problems that don't exist, while real holes are left wide open for any intelligent scammer to exploit.
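To make a few of the points above concrete, here is a rough Python sketch of what "escape once on output, and match smilies as whole tokens" could look like. It is only an illustration under my own assumptions, not how Gaia's code works: the smiley table, the function names render_text and render_code, and the <img alt="..."> output are all made up for the example.

```python
import html
import re

# Hypothetical smiley table; the real codes and images are assumptions here.
SMILIES = {"8)": "cool", ";)": "wink"}


def render_text(raw: str) -> str:
    """Render ordinary post text.

    Smilies are matched against the RAW text, one whitespace-delimited
    token at a time, and everything else is HTML-escaped exactly once.
    """
    out = []
    open_parens = 0  # unclosed "(" seen so far
    # The capturing group keeps the whitespace runs, so spacing is preserved.
    for token in re.split(r"(\s+)", raw):
        closes_paren = token.endswith(")") and open_parens > 0
        if token in SMILIES and not closes_paren:
            out.append(f'<img alt="{SMILIES[token]}">')
            continue
        # Track parentheses so "(version 8)" is not read as a smiley.
        open_parens = max(open_parens + token.count("(") - token.count(")"), 0)
        out.append(html.escape(token))
    return "".join(out)


def render_code(raw: str) -> str:
    """Render the body of a code tag: escape once and change nothing else."""
    return "<pre>" + html.escape(raw) + "</pre>"
```

With this approach render_text("(version 8)") comes back unchanged, while render_text("nice 8)") turns the last token into the smiley. Because escaping happens after smiley matching, a " followed by ) never turns into a wink, and render_code leaves backslashes, percent signs, semicolons, and leading whitespace alone. No word filter is needed for any of it, since the escaped text cannot be interpreted as markup.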
I could probably go on for days, but this is just off the top of my head, and just what I've encountered in C&T.
Time/effort requirements: Rewriting this parser should take a single competent developer no more than about 2 weeks. This includes both the time to implement a proper lookahead parser and AST, as well as the implementation of the syntax itself, for both BBCode and smilies. (And while they're at it, adding the capability to disable smilies for a given post.)
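For a sense of scale, here is a minimal sketch in Python of the kind of parser I mean: a small stack-based parser that builds a tree (the AST) and then renders it, looking ahead only to find the end of a code tag. Everything specific in it is an assumption for illustration: the tag set, the MAX_DEPTH limit, and the names Node, parse, and render are hypothetical, and a real implementation would need tag attributes, URLs, and smilies on top.

```python
import html
import re
from dataclasses import dataclass, field

# Hypothetical tag set and nesting limit, chosen only for this sketch.
TAGS = {"b": "strong", "i": "em", "quote": "blockquote", "code": "pre"}
MAX_DEPTH = 8

TOKEN = re.compile(r"\[(/?)([a-z]+)\]", re.IGNORECASE)
CODE_END = re.compile(r"\[/code\]", re.IGNORECASE)


@dataclass
class Node:
    tag: str                                       # "" for the root
    children: list = field(default_factory=list)   # Node or str items


def parse(text: str) -> Node:
    root = Node("")
    stack = [root]            # currently open nodes, innermost last
    pos = 0
    while pos < len(text):
        m = TOKEN.search(text, pos)
        if m is None:
            stack[-1].children.append(text[pos:])
            break
        if m.start() > pos:
            stack[-1].children.append(text[pos:m.start()])
        closing, name = m.group(1) == "/", m.group(2).lower()
        pos = m.end()
        if name not in TAGS:
            stack[-1].children.append(m.group(0))      # unknown tag: literal
        elif closing:
            if name in [n.tag for n in stack[1:]]:
                # Close unclosed inner tags along with the matching one.
                while stack.pop().tag != name:
                    pass
            else:
                stack[-1].children.append(m.group(0))  # stray close: literal
        elif name == "code":
            # Lookahead: everything up to [/code] is raw text, never parsed.
            end = CODE_END.search(text, pos)
            body_end = end.start() if end else len(text)
            stack[-1].children.append(Node("code", [text[pos:body_end]]))
            pos = end.end() if end else len(text)
        elif len(stack) > MAX_DEPTH:
            stack[-1].children.append(m.group(0))      # too deep: literal
        else:
            node = Node(name)
            stack[-1].children.append(node)
            stack.append(node)
    return root  # anything still open is closed implicitly by the tree


def render(node) -> str:
    if isinstance(node, str):
        return html.escape(node)   # text is escaped exactly once, here
    inner = "".join(render(child) for child in node.children)
    if not node.tag:
        return inner
    element = TAGS[node.tag]
    return f"<{element}>{inner}</{element}>"
```

Because the output is generated from a tree, it is always well-formed: render(parse("[quote][b]oops[/quote] after")) gives "<blockquote><strong>oops</strong></blockquote> after", a stray [/quote] is shown as plain text instead of breaking the post, and 31 nested quote tags produce at most MAX_DEPTH nested elements.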