Fun with MediaWiki markup

August 2, 2006 / Rafe / 11 Comments

At work, we run several instances of MediaWiki that we use for internal collaboration and to produce content that will eventually be published on a public Web site. My job generally involves building custom content management systems for various types of content. I chose to build them using Ruby on Rails. Because all of our users are already getting used to MediaWiki, we thought it would be cool to support MediaWiki’s markup format within our custom applications.

I assigned a developer to build a parser for MediaWiki markup in Ruby, and after a couple of weeks of work with Racc, he threw up his hands. The parser in MediaWiki is written using regular expressions and has some oddly inconsistent behaviors. Furthermore, the language is so complex that he told me it would basically take him forever to finish. (There were also some features missing from Racc that greatly complicated the task.)

Unfortunately, we told our users about our exciting MediaWiki parser project before I learned that writing such a parser was not going to be worth our time (or money). Rather than giving up, we instead created a Ruby on Rails plugin that actually shells out to PHP and calls the parser built into MediaWiki. A developer who knows PHP and Ruby on Rails well wrote the plugin in a couple of days. I still haven’t decided whether this is the worst hack in the history of the world or a brilliant solution to a thorny problem. I guess it’s probably a little of both.
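As a rough illustration of the shell-out approach (not the actual plugin), the sketch below pipes markup to an external command and reads HTML back. The wrapper script path `vendor/mediawiki/render_wikitext.php`, the method name, and the error class are all hypothetical; the only assumption is that some PHP script reads wikitext on standard input and prints the parsed HTML on standard output.

```ruby
require 'open3'

# Hypothetical command: a small PHP wrapper that reads wikitext on STDIN
# and prints HTML on STDOUT. The script path is an assumption.
DEFAULT_RENDER_COMMAND = ['php', 'vendor/mediawiki/render_wikitext.php'].freeze

class WikitextRenderError < StandardError; end

# Pipe the markup to an external command and return whatever it prints.
# Passing the command as an array bypasses the shell, so the markup is
# never interpolated into a command line.
def render_wikitext(markup, command: DEFAULT_RENDER_COMMAND)
  stdout, stderr, status = Open3.capture3(*command, stdin_data: markup)
  raise WikitextRenderError, stderr unless status.success?
  stdout
end
```

Sending the markup over standard input rather than as a command-line argument sidesteps quoting problems and argument length limits for long articles.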

The main downside is that now we have a full MediaWiki installation sitting in the vendor directory of our Rails applications, and any server that hosts our Rails application has to have PHP installed and working as well. The upside is that we didn’t have to write the parser ourselves and we don’t have to worry about keeping up with changes to MediaWiki markup as it evolves over time.

In order to forestall performance problems, the content is only parsed when it’s saved, and the HTML version of the content is stored in the database so that we don’t have to run an external shell script every time we render a piece of content.
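The parse-on-save idea might look something like the following plain-Ruby sketch. The `Article` class and its attribute names are hypothetical stand-ins; in a Rails application the same logic would more likely live in a `before_save` callback on a model, with `body_html` as a database column.

```ruby
# Plain-Ruby stand-in for a model. All names here are hypothetical.
class Article
  attr_reader :body, :body_html

  # renderer: any object that responds to call(markup) and returns HTML,
  # for example a lambda wrapping an external parser.
  def initialize(renderer)
    @renderer = renderer
    @dirty = false
  end

  # Remember that the markup changed so the next save re-renders it.
  def body=(markup)
    @dirty = true if markup != @body
    @body = markup
  end

  # Run the expensive parse only when the markup changed since the last save.
  def save
    if @dirty
      @body_html = @renderer.call(@body)
      @dirty = false
    end
    true
  end
end
```

Rendering code then reads `body_html` directly, so the external parser runs once per edit rather than once per page view.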

The next step is to package this thing up and see if anybody else finds it useful.

Commentary

11 Comments

kellan
August 2, 2006 at 4:24 pm

I know of at least one other Rails app that shells out to PHP, though my fave I’ve seen recently is the one that uses Parrot for its NLP.

But the really Railsy solution would have been to use a transparent MediaWiki web service via ActiveResource and worry about scaling it later 🙂
Rafe
August 2, 2006 at 5:43 pm

That’s a good idea for version 2.0.
paulrobinson.net
August 2, 2006 at 7:50 pm

Fun with MediaWiki markup

Rafe integrates MediaWiki markup into Ruby on Rails: I still haven’t decided whether this is the worst hack in the history of the world or a brilliant solution to a thorny problem. I guess it’s probably a little of both. Technorati Tags: rails
Kyrre Nygård
October 11, 2006 at 9:05 am

I recently wrote this to the Ruby mailinglist:

I’m involved in a few research projects, and like to keep my information well organized. I usually get most of it from Wikipedia; however, I hate printing HTML articles to PDF. I’d rather have them in pure, well laid out text. And I’m sure others would too. Being able to master one’s knowledge provides a warm inner peace.

Hence I’ve tried dumping the output from text browsers such as w3m, elinks, lynx etc. I am, however, only interested in the articles themselves, not their links, views, toolboxes, search bars, other available languages and so on. I tried running a whole bunch of regular expressions over the output, but that really felt like the hard way.

So some guy gave me this:

#!/usr/bin/env ruby

require 'rexml/document'
require 'cgi'
require 'open-uri'

# Build the Special:Export URL: collapse whitespace, turn spaces into
# underscores, and restore ':', '/' and '#' after escaping.
title = ARGV.join(' ').strip.squeeze(' ').tr(' ', '_')
url = 'http://en.wikipedia.org/wiki/Special:Export/' + CGI.escape(title).gsub(/%3[Aa]/, ':').gsub(/%2[Ff]/, '/').gsub(/%23/, '#')

# Fetch the export XML and print the raw wikitext held in its <text> element.
URI.open(url) { |f| puts REXML::XPath.first(REXML::Document.new(f.read), '//text').text }

This seems to take advantage of Wikipedia’s special export feature, which is really cool. However, there are a few issues. First, the script looks kind of complex. I’m sure there’s a simpler way of writing it. Second, it does not yet output the kind of pure, well laid out text that it should. For instance, on http://en.wikipedia.org/wiki/GNU_Hurd, it outputs:

#### BEGIN

{{Infobox_Software | name = GNU Hurd | logo = [[Image:Hurd-logo.png]]
| developer = [[Thomas Bushnell| Michael (now Thomas) Bushnell]] (original developer) and various contributors | latest_release_version = | latest_release_date = | operating_system = [[GNU]] | genre = [[Kernel (computer science)|Kernel]] | family = [[POSIX]]-conformant [[Unix]]-Clones | kernel_type = [[Microkernel]] | license = [[GNU General Public License|GPL]] | source_model = [[Free software]] | working_state = In production / development | website = [http://www.gnu.org/software/hurd/hurd.html http://www.gnu.org] }} {{redirect|Hurd}} '''The GNU Hurd''' is a computer operating system [[Kernel (computer science)|kernel]]. It consists of a set of [[Server (computing)|servers]] (or [[daemon (computer software)|daemons]], in [[Unix]]-speak) that work on top of either the [[GNU Mach]] [[microkernel]] or the [[L4 microkernel family|L4 microkernel]]; together, they form the [[kernel (computer science)|kernel]] of the [[GNU]] [[operating system]]. It has been under development since [[1990]] by the [[GNU]] Project and is distributed as [[free software]] under the [[GNU General Public License|GPL]]. The Hurd aims to surpass [[Unix]] kernels in functionality, security, and stability, while remaining largely compatible with them. This is done by having the Hurd track the [[POSIX]] specification, while avoiding arbitrary restrictions on the user.

“HURD” is an indirectly [[recursive acronym]], standing for “HIRD of [[Unix]]-Replacing [[Daemon (computer software)|Daemons]]”, where “HIRD” stands for “HURD of Interfaces Representing Depth”. It is also a play of words to give “[[herd]] of [[wildebeest|gnus]]” reflecting how it works.

==Development history==
Development on the GNU operating system began in 1984 and progressed rapidly. By the early 1990s, the only major component missing was the kernel.

Development on the Hurd began in [[1990]], after an abandoned kernel attempt started from the finished research [[Trix (kernel)|Trix]] operating system developed by Professor [[Steve Ward (Computer Scientist)| Steve Ward]] and his group at [[Massachusetts Institute of Technology| MIT]]’s [[Laboratory for Computer Science]] (LCS). According to [[Thomas Bushnell| Michael (now Thomas) Bushnell]], the initial Hurd architect, their early plan was to adapt the [[BSD]] 4.4-Lite kernel and, in hindsight, “It is now perfectly obvious to me that this would have succeeded splendidly and the world would be a very different place today”.{{cite web | url = http://www.groklaw.net/article.php?story=20050727225542530 | title = The Hurd and BSDI|accessdate = 2006-08-08 | author = Peter H. Salus | work = The Daemon, the GNU and the Penguin}} However, due to a lack of cooperation from the [[University of California, Berkeley|Berkeley]] programmers, [[Richard Stallman]] decided instead to use the [[Mach microkernel]], which subsequently proved unexpectedly difficult, and the Hurd’s development proceeded slowly.

#### END

This should instead be something like:

#### BEGIN

http://en.wikipedia.org/wiki/GNU_Hurd

Name = GNU Hurd
Developer = Thomas Bushnell (original developer) and various contributors
Operating_system = GNU
Genre = Kernel (computer science)
Family = POSIX-conformant Unix-Clones
Kernel type = Microkernel
License = GNU General Public License
Source model = Free software
Working state = In production / development
Website = http://www.gnu.org/software/hurd/hurd.html http://www.gnu.org

The GNU Hurd is a computer operating system. It consists of a set of servers (or daemons, in Unix-speak) that work on top of either the GNU Mach microkernel or the L4 microkernel; together, they form the kernel of the GNU operating system. It has been under development since 1990 by the GNU Project and is distributed as free software under the GPL. The Hurd aims to surpass Unix kernels in functionality, security, and stability, while remaining largely compatible with them. This is done by having the Hurd track the POSIX specification, while avoiding arbitrary restrictions on the user.

“HURD” is an indirectly recursive acronym, standing for “HIRD of Unix-Replacing Daemons”, where “HIRD” stands for “HURD of Interfaces Representing Depth”. It is also a play of words to give “herd of gnus” reflecting how it works.

Development history

Development on the GNU operating system began in 1984 and progressed rapidly. By the early 1990s, the only major component missing was the kernel.

Development on the Hurd began in 1990, after an abandoned kernel attempt started from the finished research Trix operating system developed by Professor Steve Ward and his group at MIT’s Laboratory for Computer Science (LCS). According to Michael (now Thomas) Bushnell, the initial Hurd architect, their early plan was to adapt the BSD 4.4-Lite kernel and, in hindsight, “It is now perfectly obvious to me that this would have succeeded splendidly and the world would be a very different place today”. However, due to a lack of cooperation from the Berkeley programmers, Richard Stallman decided instead to use the Mach microkernel, which subsequently proved unexpectedly difficult, and the Hurd’s development proceeded slowly.

#### END

Looks really gorgeous, doesn’t it? If only I were skilled enough to do this myself. Which brings me to my question: Is anybody out there willing to help me fix my script?

Thanks a lot, Kyrre
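
One possible starting point for the cleanup asked about above is a handful of regular expressions over the exported wikitext. This is only a rough sketch under simplifying assumptions: the method name `strip_wikitext` is made up, templates such as the infobox and `{{cite web}}` are dropped entirely rather than turned into "Name = value" lines, and tables, references, and HTML tags are not handled at all.

```ruby
# Rough wikitext-to-text pass; it handles only a few common constructs.
def strip_wikitext(markup)
  text = markup.dup
  # Remove innermost {{...}} templates repeatedly so nested ones go too.
  nil while text.gsub!(/\{\{[^{}]*\}\}/, '')
  # [[target|label]] -> label, [[target]] -> target
  text.gsub!(/\[\[(?:[^\[\]|]*\|)?\s*([^\[\]|]*?)\s*\]\]/, '\1')
  # [http://example.org label] -> label
  text.gsub!(%r{\[https?://\S+\s+([^\]]+)\]}, '\1')
  # Drop the runs of apostrophes used for bold and italic.
  text.gsub!(/'{2,}/, '')
  # ==Heading== -> Heading
  text.gsub!(/^=+[ \t]*(.*?)[ \t]*=+[ \t]*$/, '\1')
  text.strip
end
```

Feeding it the text printed by the export script above would be the obvious use; anything the patterns do not recognize passes through unchanged.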
George Rypysc
September 14, 2007 at 12:20 am

Just an FYI, the “Racc” link in your post points to “meta.wikimedia.org/wiki/MediaWiki_Markup”. I think you wanted to link Racc to http://i.loveruby.net/en/projects/racc/
Eric Armstrong
December 13, 2007 at 2:51 pm

Take a look at the MediaCloth project. Does just what you want. http://mediacloth.rubyforge.org/
David Lee
December 18, 2007 at 6:15 pm

From the looks of it, it seems like rc3.org was actually the one who created MediaCloth. Or am I assuming too much?
Rafe
December 18, 2007 at 9:13 pm

Wasn’t from here 🙂
David Lee
December 19, 2007 at 4:04 pm

Can you guys post the ruby on rails plugin you used to call to mediawiki’s parse function? I was looking for a solution and was actually using mediacloth but it does not support all of mediawiki’s grammar. Thanks.
Jennifer Bell
May 30, 2008 at 3:10 pm

Yes, please post the plugin. Don’t just tease us with it.
Nate Burba
January 13, 2010 at 12:02 am

I’ve developed a plugin that allows a Ruby script or Rails app to create/delete multiple Mediawiki instances living inside the same database. I’ll be posting it on my blog and Github soon.

rc3.org

Strong opinions, weakly held
