Tuesday, 1 September 2009

A regular expression to find XML tags

A common task when dealing with XML files that are not well-formed is to preprocess them to fix the problems, which means opening them and extracting the XML tags. You can do this with a regular expression. The difficulty is that the regex has to find all the XML tags without simply matching everything between < and >, because the text inside the XML may itself contain angle brackets. I couldn't find a regex on the web that handles those cases, so I had to create one.
This regex covers most of these cases (except one...):

</?[A-Za-z][A-Za-z0-9]*(\\s+[a-zA-Z0-9]+=(\'|")?\\w*(\'|")?)*\\s*/?>

So, it matches any string that starts with < and ends with >, with an optional / at the beginning or at the end, and whose name starts with a letter followed by zero or more letters or digits. If there are attributes, it looks for at least one whitespace character, then a word, an = and another word. The quotes (single or double) are left optional, since I also want to catch not-so-well-formed tags.
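To make the pieces described above easier to follow, here is the same pattern spelled out with one comment per part. This is only a sketch in Python (the post does not say which language the regex was written for), and it assumes that the doubled backslashes and the escaped single quote in the pattern above come from a string literal, so they are written here as they would appear in a raw string. The name TAG_RE is made up for this example.

```python
import re

# Assumption: the original pattern was copied from a string literal, so
# "\\s" and "\\w" there stand for the regex tokens \s and \w.
# TAG_RE is a hypothetical name used only in this sketch.
TAG_RE = re.compile(r"""
    <                     # opening angle bracket
    /?                    # optional slash of a closing tag
    [A-Za-z][A-Za-z0-9]*  # tag name: a letter, then letters or digits
    (                     # zero or more attributes, each one being:
        \s+               #   at least one whitespace character
        [a-zA-Z0-9]+      #   attribute name
        =                 #   equals sign
        ('|")?            #   optional opening quote
        \w*               #   attribute value (may be empty)
        ('|")?            #   optional closing quote
    )*
    \s*                   # optional whitespace before the end
    /?                    # optional slash of a self-closing tag
    >                     # closing angle bracket
""", re.VERBOSE)

print(TAG_RE.fullmatch('<TAG AT="OK">') is not None)  # True
```

Note that the opening and closing quotes are checked independently, so a value with mismatched or missing quotes is still accepted, which is what the regex is meant to tolerate.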

So, for example, it matches <TAG>, </TAG>, <TAG/>, <TAG />, <TAG AT="OK">, <TAG AT=OK/> and so on.
It will not, however, match text like '< 10 mg and > 8 mg', '<100 and > 50' and so on.
The only problematic case is text like '<beta and=gamma > delta', where '<beta and=gamma >' is wrongly matched as a tag. It's quite an unusual and weird case, so it isn't really a problem.
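
The examples above can be checked with a few lines of Python. Again this is just a sketch: Python is my choice for illustration, the pattern is written as a raw string (assuming the doubled backslashes above come from a string literal), and find_tags is a hypothetical helper name.

```python
import re

# Same pattern as above, written as a Python raw string (an assumption
# about how the doubled backslashes should be read).
TAG_RE = re.compile(
    r"""</?[A-Za-z][A-Za-z0-9]*(\s+[a-zA-Z0-9]+=('|")?\w*('|")?)*\s*/?>"""
)

def find_tags(text):
    """Return every substring of text that the regex takes for a tag."""
    # finditer + group(0) is used because the pattern has capturing
    # groups, and findall would return the groups instead of the match.
    return [m.group(0) for m in TAG_RE.finditer(text)]

# Angle brackets in ordinary text are ignored, real tags are found.
print(find_tags('<TAG>a < 10 mg and > 8 mg</TAG>'))  # ['<TAG>', '</TAG>']
print(find_tags('<100 and > 50'))                    # []

# The one known false positive.
print(find_tags('<beta and=gamma > delta'))          # ['<beta and=gamma >']
```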