rc3.org

Strong opinions, weakly held

Fun with MediaWiki markup

At work, we run several instances of MediaWiki that we use for internal collaboration and to produce content that will eventually be published on a public Web site. My job generally involves building custom content management systems for various types of content, and I chose to build them using Ruby on Rails. Because all of our users are already getting used to MediaWiki, we thought it would be cool to support MediaWiki’s markup format within our custom applications.

I assigned a developer to build a parser for MediaWiki markup in Ruby, and after a couple of weeks of work with Racc, he threw up his hands. The parser in MediaWiki is written using regular expressions and has some oddly inconsistent behaviors. Furthermore, the language is so complex that he told me it would basically take him forever to finish. (There were also some features missing from Racc that were greatly complicating the task.)

Unfortunately, we told our users about our exciting MediaWiki parser project before I learned that writing such a parser was not going to be worth our time (or money). Rather than giving up, we instead created a Ruby on Rails plugin that actually shells out to PHP and calls the parser built into MediaWiki. A developer who knows PHP and Ruby on Rails well wrote the plugin in a couple of days. I still haven’t decided whether this is the worst hack in the history of the world or a brilliant solution to a thorny problem. I guess it’s probably a little of both.
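To give a feel for the approach, here is a minimal sketch of shelling out from Ruby to an external renderer. It is not the plugin described above: the class name, the `render_wikitext.php` script, and its path are all assumptions made up for illustration, and the script would still need to load MediaWiki and call its parser.

```ruby
require "open3"

# Hypothetical wrapper, not the actual plugin: pipes wiki markup to an
# external command on stdin and returns whatever it prints on stdout.
class ExternalMarkupRenderer
  class RenderError < StandardError; end

  # command is an argv array, for example
  # ["php", "vendor/mediawiki/render_wikitext.php"] (an assumed script
  # that would call MediaWiki's own parser and print HTML).
  def initialize(command)
    @command = command
  end

  def render(markup)
    # Passing an argv array avoids the shell, so the markup is never
    # interpolated into a command line.
    html, errors, status = Open3.capture3(*@command, stdin_data: markup)
    raise RenderError, errors unless status.success?
    html
  end
end
```

Sending the markup over stdin rather than as a command-line argument sidesteps both quoting problems and argument-length limits for long pages.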

The main downside is that now we have a full MediaWiki installation sitting in the vendor directory of our Rails applications, and any server that hosts our Rails application has to have PHP installed and working as well. The upside is that we didn’t have to write the parser ourselves and we don’t have to worry about keeping up with changes to MediaWiki markup as it evolves over time.

In order to forestall performance problems, the content is only parsed when it’s saved, and the HTML version of the content is stored in the database so that we don’t have to run an external shell script every time we render a piece of content.
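The render-on-save idea can be sketched in plain Ruby as follows. This is an illustration under stated assumptions, not the real code: `Page`, `body_html`, and `CountingRenderer` are hypothetical names, and in a Rails app the same logic would more likely live in a `before_save` callback with `body_html` as a database column.

```ruby
# Plain-Ruby stand-in for a Rails model (hypothetical names throughout).
class Page
  attr_reader :body, :body_html

  # renderer is any object with a render(markup) method that returns HTML.
  def initialize(renderer)
    @renderer = renderer
    @dirty = false
  end

  def body=(markup)
    @dirty = true if markup != @body
    @body = markup
  end

  # Runs the (expensive) renderer only when the markup has changed.
  def save
    if @dirty
      @body_html = @renderer.render(@body)
      @dirty = false
    end
    true
  end
end

# Toy renderer that counts how often it is called.
class CountingRenderer
  attr_reader :calls

  def initialize
    @calls = 0
  end

  def render(markup)
    @calls += 1
    "<p>#{markup}</p>"
  end
end

renderer = CountingRenderer.new
page = Page.new(renderer)
page.body = "Hello"
page.save
page.save            # nothing changed, so the renderer is not called again
puts page.body_html  # reading the stored HTML never touches the renderer
```

The trade-off is that the stored HTML can go stale if the renderer itself changes, so a real application would probably want some way to re-render existing rows.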

The next step is to package this thing up and see if anybody else finds it useful.

11 Comments

  1. I know of at least one other Rails app that shells out to PHP, though my fave I’ve seen recently is the one that uses Parrot for its NLP.

    But the really Railsy solution would have been to use a transparent MediaWiki web service via ActiveResource and worry about scaling it later 🙂

  2. That’s a good idea for version 2.0.

  3. Fun with MediaWiki markup

    Rafe integrates MediaWiki markup into Ruby on Rails: I still haven’t decided whether this is the worst hack in the history of the world or a brilliant solution to a thorny problem. I guess it’s probably a little of both. Technorati Tags: rails

  4. I recently wrote this to the Ruby mailinglist:

    I’m involved in a few research projects, and like to keep my information well organized. I usually get most of it from Wikipedia; however, I hate printing HTML articles to PDF. I’d rather have them in pure, well laid out text. And I’m sure others would too. Being able to master one’s knowledge provides a warm inner peace.

    Hence I’ve tried dumping the output from text browsers such as w3m, elinks, lynx etc. I am, however, only interested in the articles themselves, not their links, views, toolboxes, search bars, other available languages and so on. I tried running a whole bunch of regular expressions over the output, but that really felt like the hard way.

    So some guy gave me this:

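    #!/usr/bin/env ruby

    require 'rexml/document'
    require 'cgi'
    require 'tempfile'
    require 'open-uri'

    # Build the Special:Export URL from the article title given on the command line.
    url = 'http://en.wikipedia.org/wiki/Special:Export/' + CGI::escape(ARGV.join(" ").strip.squeeze(' ').tr(' ', '_')).gsub(/%3[Aa]/,':').gsub(/%2[Ff]/,'/').gsub(/%23/,'#')

    # Fetch the export XML and print the raw wikitext of the first <text> element.
    # (URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs.)
    URI.open(url) { |f| puts REXML::XPath.first(REXML::Document::new(f.class == Tempfile ? f.open : f), '//text').text }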

    This seems to take advantage of Wikipedia’s special export feature, which is really cool. However, there are a few issues. First, the script looks kinda complex. I’m sure there’s a simpler way of writing it. Second, it does not yet output the kind of pure and well laid out text that it should. For instance, on http://en.wikipedia.org/wiki/GNU_Hurd, it outputs:

    #### BEGIN

    {{Infobox_Software
    | name = GNU Hurd
    | logo = [[Image:Hurd-logo.png]]
    | developer = [[Thomas Bushnell| Michael (now Thomas) Bushnell]] (original developer) and various contributors
    | latest_release_version =
    | latest_release_date =
    | operating_system = [[GNU]]
    | genre = [[Kernel (computer science)|Kernel]]
    | family = [[POSIX]]-conformant [[Unix]]-Clones
    | kernel_type = [[Microkernel]]
    | license = [[GNU General Public License|GPL]]
    | source_model = [[Free software]]
    | working_state = In production / development
    | website = [http://www.gnu.org/software/hurd/hurd.html http://www.gnu.org]
    }}
    {{redirect|Hurd}}
    '''The GNU Hurd''' is a computer operating system [[Kernel (computer science)|kernel]]. It consists of a set of [[Server (computing)|servers]] (or [[daemon (computer software)|daemons]], in [[Unix]]-speak) that work on top of either the [[GNU Mach]] [[microkernel]] or the [[L4 microkernel family|L4 microkernel]]; together, they form the [[kernel (computer science)|kernel]] of the [[GNU]] [[operating system]]. It has been under development since [[1990]] by the [[GNU]] Project and is distributed as [[free software]] under the [[GNU General Public License|GPL]]. The Hurd aims to surpass [[Unix]] kernels in functionality, security, and stability, while remaining largely compatible with them. This is done by having the Hurd track the [[POSIX]] specification, while avoiding arbitrary restrictions on the user.

    “HURD” is an indirectly [[recursive acronym]], standing for “HIRD of [[Unix]]-Replacing [[Daemon (computer software)|Daemons]]”, where “HIRD” stands for “HURD of Interfaces Representing Depth”. It is also a play of words to give “[[herd]] of [[wildebeest|gnus]]” reflecting how it works.

    ==Development history==
    Development on the GNU operating system began in 1984 and progressed rapidly. By the early 1990s, the only major component missing was the kernel.

    Development on the Hurd began in [[1990]], after an abandoned kernel attempt started from the finished research [[Trix (kernel)|Trix]] operating system developed by Professor [[Steve Ward (Computer Scientist)| Steve Ward]] and his group at [[Massachusetts Institute of Technology| MIT]]’s [[Laboratory for Computer Science]] (LCS). According to [[Thomas Bushnell| Michael (now Thomas) Bushnell]], the initial Hurd architect, their early plan was to adapt the [[BSD]] 4.4-Lite kernel and, in hindsight, “It is now perfectly obvious to me that this would have succeeded splendidly and the world would be a very different place today”.{{cite web | url = http://www.groklaw.net/article.php?story=20050727225542530 | title = The Hurd and BSDI|accessdate = 2006-08-08 | author = Peter H. Salus | work = The Daemon, the GNU and the Penguin}} However, due to a lack of cooperation from the [[University of California, Berkeley|Berkeley]] programmers, [[Richard Stallman]] decided instead to use the [[Mach microkernel]], which subsequently proved unexpectedly difficult, and the Hurd’s development proceeded slowly.

    #### END

    This should instead be something like:

    #### BEGIN

    http://en.wikipedia.org/wiki/GNU_Hurd

    Name = GNU Hurd
    Developer = Thomas Bushnell (original developer) and various contributors
    Operating_system = GNU
    Genre = Kernel (computer science)
    Family = POSIX-conformant Unix-Clones
    Kernel type = Microkernel
    License = GNU General Public License
    Source model = Free software
    Working state = In production / development
    Website = http://www.gnu.org/software/hurd/hurd.html http://www.gnu.org

    The GNU Hurd is a computer operating system kernel. It consists of a set of servers (or daemons, in Unix-speak) that work on top of either the GNU Mach microkernel or the L4 microkernel; together, they form the kernel of the GNU operating system. It has been under development since 1990 by the GNU Project and is distributed as free software under the GPL. The Hurd aims to surpass Unix kernels in functionality, security, and stability, while remaining largely compatible with them. This is done by having the Hurd track the POSIX specification, while avoiding arbitrary restrictions on the user.

    “HURD” is an indirectly recursive acronym, standing for “HIRD of Unix-Replacing Daemons”, where “HIRD” stands for “HURD of Interfaces Representing Depth”. It is also a play of words to give “herd of gnus” reflecting how it works.

    Development history

    Development on the GNU operating system began in 1984 and progressed rapidly. By the early 1990s, the only major component missing was the kernel.

    Development on the Hurd began in 1990, after an abandoned kernel attempt started from the finished research Trix operating system developed by Professor Steve Ward and his group at MIT’s Laboratory for Computer Science (LCS). According to Michael (now Thomas) Bushnell, the initial Hurd architect, their early plan was to adapt the BSD 4.4-Lite kernel and, in hindsight, “It is now perfectly obvious to me that this would have succeeded splendidly and the world would be a very different place today”. However, due to a lack of cooperation from the Berkeley programmers, Richard Stallman decided instead to use the Mach microkernel, which subsequently proved unexpectedly difficult, and the Hurd’s development proceeded slowly.

    #### END

    Looks real gorgeous doesn’t it? Had I only been skilled enough to do this myself. Which brings me to my question: Is anybody out there willing to help me fix my script?

    Thanks a lot, Kyrre

  5. Just an FYI, the “Racc” link in your post points to “meta.wikimedia.org/wiki/MediaWiki_Markup”. I think you wanted to link Racc to http://i.loveruby.net/en/projects/racc/

  6. Take a look at the MediaCloth project. Does just what you want. http://mediacloth.rubyforge.org/

  7. From the looks of it, it seems like rc3.org was actually the one who created MediaCloth. Or am I assuming too much?

  8. Wasn’t from here 🙂

  9. Can you guys post the Ruby on Rails plugin you used to call MediaWiki’s parse function? I was looking for a solution and was actually using MediaCloth, but it does not support all of MediaWiki’s grammar. Thanks.

  10. Yes, please post the plugin. Don’t just tease us with it.

  11. I’ve developed a plugin that allows a Ruby script or Rails app to create/delete multiple MediaWiki instances living inside the same database. I’ll be posting it on my blog and GitHub soon.

© 2016 rc3.org
