Ensuring Well-Formed Markup
Gentle Well-Formedness
We've all seen bad markup in our day. I'm talking about markup that
doesn't bother to close tags. I'm talking about putting <p>
tags
inside <p>
tags, and putting content into an <img>
tag.
Nokogiri
corrects bad markup like a boss, similarly to how a browser
would before rendering.
badly_formed = <<-EOXML
<root>
<open>foo
<closed>bar</closed>
</root>
EOXML
bad_doc = Nokogiri::XML badly_formed
puts bad_doc # => <?xml version="1.0"?>
# <root>
# <open>foo
# <closed>bar</closed>
# </open>
# </root>
And Nokogiri will even keep track of what the errors were, if the parse option NOERRORS and NOWARNINGS are turned off (the default for XML documents).
puts bad_doc.errors # => Opening and ending tag mismatch: open line 2 and root
# Premature end of data in tag root line 1
Thus, you could use errors.empty?
to determine whether the document was well-formed.
Strict Well-Formedness
Being friendly and fixing markup is all well and good, but sometimes you need to be a Markup Nazi.
If you demand compliance from your XML, then you can configure Nokogiri into "strict" parsing mode, in which it will raise an exception at the first sign of malformedness:
begin
bad_doc = Nokogiri::XML(badly_formed) { |config| config.strict }
rescue Nokogiri::XML::SyntaxError => e
puts "caught exception: #{e}"
end
# => caught exception: Premature end of data in tag root line 1