Class: Nokogiri::XML::ParseOptions
Relationships & Source Files | |
Inherits: | Object |
Defined in: | lib/nokogiri/xml/parse_options.rb |
Overview
Class to contain options for parsing XML or HTML4 (but not HTML5).
💡 Note that HTML5 parsing has a separate, orthogonal set of options due to the API of the HTML5 library used. See ::Nokogiri::HTML5.
About the Examples
Examples on this page assume that the following code has been executed:
```
require 'nokogiri' # Make Nokogiri available.
include Nokogiri # Allow omitting leading 'Nokogiri::'.
xml_s = "<root />\n" # String containing XML.
File.write('t.xml', xml_s) # File containing XML.
html_s = "<html />\n" # String containing HTML.
File.write('t.html', html_s) # File containing HTML.
```
Examples executed via IRB (interactive Ruby) display ParseOptions instances using method #inspect.
Parsing Methods
Each of the parsing methods performs parsing for an XML or HTML4 source:
-
Each requires a leading argument that specifies the source of the text to be parsed; except as noted, the argument's value may be either:
- A string.
- An open IO stream (must respond to methods `read` and `close`).
Examples:
```
XML::parse(xml_s)
HTML4.parse(html_s)
XML::parse(File.open('t.xml'))
HTML4.parse(File.open('t.html'))
```
-
Each accepts a trailing optional argument #options (or keyword argument #options) that specifies parsing options; the argument's value may be either:
- An integer: see [Bitmap Constants](ParseOptions@Bitmap+Constants).
- An instance of ParseOptions: see ParseOptions.new.
Examples:
```
XML::parse(xml_s, options: XML::ParseOptions::STRICT)
HTML4::parse(html_s, options: XML::ParseOptions::BIG_LINES)
XML::parse(xml_s, options: XML::ParseOptions.new.strict)
HTML4::parse(html_s, options: XML::ParseOptions.new.big_lines)
```
-
Each (except as noted) accepts a block that allows parsing options to be specified; see [Options-Setting Blocks](ParseOptions@Options-Setting+Blocks).
Certain other parsing methods use different options; see HTML5.
⚠ Not all parse options are supported on JRuby. Nokogiri attempts to invoke the equivalent behavior in Xerces/NekoHTML on JRuby when it's possible.
Bitmap Constants
Each of the [parsing methods](ParseOptions@Parsing+Methods) discussed here accepts an integer argument #options that specifies parsing options.
That integer value may be constructed using the bitmap constants defined in ParseOptions.
Except for STRICT (see note below), each of the bitmap constants has a non-zero value that represents a bit in an integer value; to illustrate, here are a few of the constants, displayed in binary format (base 2):
```
ParseOptions::RECOVER.to_s(2)  # => "1"
ParseOptions::NOENT.to_s(2)    # => "10"
ParseOptions::DTDLOAD.to_s(2)  # => "100"
ParseOptions::DTDATTR.to_s(2)  # => "1000"
ParseOptions::DTDVALID.to_s(2) # => "10000"
```
Any of these constants may be used alone to specify a single option:
```
ParseOptions.new(ParseOptions::DTDLOAD)
# => #<Nokogiri::XML::ParseOptions: ... strict, dtdload>
ParseOptions.new(ParseOptions::DTDATTR)
# => #<Nokogiri::XML::ParseOptions: ... strict, dtdattr>
```
Multiple constants may be ORed together to specify multiple options:
```
options = ParseOptions::BIG_LINES | ParseOptions::COMPACT | ParseOptions::NOCDATA
ParseOptions.new(options)
# => #<Nokogiri::XML::ParseOptions: ... strict, nocdata, compact, big_lines>
```
Note: The value of constant STRICT is zero; it may be used alone to turn all options off:
```
XML.parse('<root />') {|options| puts options.inspect }
#<Nokogiri::XML::ParseOptions: recover, nonet, big_lines, default_schema, default_xml>
XML.parse('<root />', nil, nil, ParseOptions::STRICT) {|options| puts options.inspect }
#<Nokogiri::XML::ParseOptions: strict>
```
The single-option bitmask constants are: BIG_LINES, COMPACT, DTDATTR, DTDLOAD, DTDVALID, HUGE, NOBASEFIX, NOBLANKS, NOCDATA, NODICT, NOENT, NOERROR, NONET, NOWARNING, NOXINCNODE, NSCLEAN, OLD10, PEDANTIC, RECOVER, SAX1, STRICT, XINCLUDE.
There are also several "shorthand" constants that can set multiple options: DEFAULT_HTML, DEFAULT_SCHEMA, DEFAULT_XML, DEFAULT_XSLT.
Examples:
```
ParseOptions.new(ParseOptions::DEFAULT_HTML)
# => #<Nokogiri::XML::ParseOptions: ... recover, nowarning, nonet, big_lines, default_schema, noerror, default_html, default_xml>
ParseOptions.new(ParseOptions::DEFAULT_SCHEMA)
# => #<Nokogiri::XML::ParseOptions: ... strict, nonet, big_lines, default_schema>
ParseOptions.new(ParseOptions::DEFAULT_XML)
# => #<Nokogiri::XML::ParseOptions: ... recover, nonet, big_lines, default_schema, default_xml>
ParseOptions.new(ParseOptions::DEFAULT_XSLT)
# => #<Nokogiri::XML::ParseOptions: ... recover, noent, dtdload, dtdattr, nonet, nocdata, big_lines, default_xslt, default_schema, default_xml>
```
Nokogiri itself uses these shorthand constants for its parsing, and they are generally most suitable for Nokogiri users' code.
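The bit arithmetic behind the shorthand constants can be checked without loading Nokogiri. The following is a minimal sketch in plain Ruby: the constants are local copies whose values are taken from the Constant Summary on this page, not the library's own definitions.

```ruby
# Local copies of bit values listed in the Constant Summary on this page.
# These are stand-ins; the real constants live in Nokogiri::XML::ParseOptions.
RECOVER   = 1 << 0
NOERROR   = 1 << 5
NOWARNING = 1 << 6
NONET     = 1 << 11
BIG_LINES = 1 << 22

# A shorthand mask is a plain OR of single-bit constants.
DEFAULT_XML  = RECOVER | NONET | BIG_LINES
DEFAULT_HTML = RECOVER | NOERROR | NOWARNING | NONET | BIG_LINES

puts DEFAULT_XML   # 4196353
puts DEFAULT_HTML  # 4196449

# A bit is on when ANDing the mask with its constant gives the constant back.
puts((DEFAULT_XML & NONET) == NONET)     # true
puts((DEFAULT_XML & NOERROR) == NOERROR) # false
```

The two integers match the `@options=` values shown by #inspect for the XML and HTML4 defaults in the Options-Setting Blocks examples.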
Options-Setting Blocks
Many of the [parsing methods](ParseOptions@Parsing+Methods) discussed here accept an options-setting block.
The block is called with a new instance of ParseOptions created with the defaults for the specific method:
```
XML.parse(xml_s) {|options| puts options.inspect }
#<Nokogiri::XML::ParseOptions: @options=4196353 recover, nonet, big_lines, default_xml, default_schema>
HTML4.parse(html_s) {|options| puts options.inspect }
#<Nokogiri::XML::ParseOptions: @options=4196449 recover, nowarning, nonet, big_lines, default_html, default_xml, noerror, default_schema>
```
When the block returns, the parsing is performed using those #options.
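The yield-then-use pattern described here can be modeled in a few lines of plain Ruby. This is only a sketch: `SketchOptions` and `sketch_parse` are hypothetical names, not part of Nokogiri, and the sketch assumes that the object passed to the block (as mutated by the block), rather than the block's return value, is what the parser ends up using.

```ruby
# Local copies of bit values from the Constant Summary on this page.
RECOVER   = 1 << 0
NONET     = 1 << 11
BIG_LINES = 1 << 22

# Hypothetical stand-in for a ParseOptions-like object.
class SketchOptions
  attr_reader :bits

  def initialize(bits)
    @bits = bits
  end

  # Mirrors the documented #strict: clears the RECOVER bit and returns self.
  def strict
    @bits &= ~RECOVER
    self
  end
end

# Hypothetical parse-like method: builds the method's defaults, lets the
# caller's block adjust them, then returns the bits it would parse with.
def sketch_parse(_source)
  options = SketchOptions.new(RECOVER | NONET | BIG_LINES)
  yield options if block_given?
  options.bits
end

puts sketch_parse("<root />")                  # 4196353
puts sketch_parse("<root />") { |o| o.strict } # 4196352
```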
The block may modify those options, which affects parsing:
```
bad_xml = '<root>' # End tag missing.
XML.parse(bad_xml) # No error because option RECOVER is on.
XML.parse(bad_xml) {|options| options.strict } # Raises SyntaxError because option STRICT is on.
```
Convenience Methods
A ParseOptions object has three sets of convenience methods, each based on the name of one of the constants:
-
Setters: each is the downcase of an option name, and turns on an option:
```
options = ParseOptions.new
# => #<Nokogiri::XML::ParseOptions: ... strict>
options.big_lines
# => #<Nokogiri::XML::ParseOptions: ... strict, big_lines>
options.compact
# => #<Nokogiri::XML::ParseOptions: ... strict, compact, big_lines>
```
-
Unsetters: each begins with `no`, and turns off an option. Note that there is no unsetter `nostrict`, but the setter `recover` serves the same purpose:
```
options.nobig_lines
# => #<Nokogiri::XML::ParseOptions: ... strict, compact>
options.nocompact
# => #<Nokogiri::XML::ParseOptions: ... strict>
options.recover # Functionally equivalent to nostrict.
# => #<Nokogiri::XML::ParseOptions: ... recover>
options.noent # Set NOENT.
# => #<Nokogiri::XML::ParseOptions: ... recover, noent>
options.nonoent # Unset NOENT.
# => #<Nokogiri::XML::ParseOptions: ... recover>
```
💡 Note that some option names begin with `no`, leading to the logical but perhaps unintuitive double negative:
```
po.nocdata # Set the NOCDATA parse option
po.nonocdata # Unset the NOCDATA parse option
```
-
Queries: each ends with `?`, and returns whether an option is on or off:
```
options.recover? # => true
options.strict? # => false
```
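To make the naming scheme concrete, here is a minimal sketch of how setter, unsetter, and query methods of this shape could be generated from a table of flags. `MiniOptions` and its `FLAGS` table are hypothetical and cover only four options; this is not a copy of Nokogiri's implementation, only an illustration of the behavior described above.

```ruby
# Hypothetical miniature of a ParseOptions-like class (not Nokogiri code).
class MiniOptions
  # Bit values copied from the Constant Summary on this page.
  FLAGS = {
    recover:   1 << 0,
    noent:     1 << 1,
    compact:   1 << 16,
    big_lines: 1 << 22,
  }.freeze

  attr_reader :bits

  def initialize(bits = 0)
    @bits = bits
  end

  FLAGS.each do |name, bit|
    define_method(name) { @bits |= bit; self }          # setter, e.g. noent
    define_method(:"no#{name}") { @bits &= ~bit; self } # unsetter, e.g. nonoent
    define_method(:"#{name}?") { (@bits & bit) == bit } # query, e.g. noent?
  end

  # "strict" is modeled here as the absence of the RECOVER bit.
  def strict?
    !recover?
  end
end

opts = MiniOptions.new.noent.big_lines
puts opts.noent?         # true
puts opts.strict?        # true
puts opts.nonoent.noent? # false
```

Because the unsetter simply prefixes `no` to the option name, an option such as NOENT yields the double negative `nonoent`.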
Each setter and unsetter method returns `self`, so the methods may be chained:
```
options.compact.big_lines
# => #<Nokogiri::XML::ParseOptions: ... strict, compact, big_lines>
```
Constant Summary
-
BIG_LINES =
Support line numbers up to `long int` (default is a `short int`). On by default for Document, DocumentFragment, ::Nokogiri::HTML4::Document, ::Nokogiri::HTML4::DocumentFragment, ::Nokogiri::XSLT::Stylesheet, and Schema.
1 << 22
-
COMPACT =
Compact small text nodes. Off by default.
⚠ No modification of the DOM tree is allowed after parsing.
1 << 16
-
DEFAULT_HTML =
RECOVER | NOERROR | NOWARNING | NONET | BIG_LINES
-
DEFAULT_SCHEMA =
# File 'lib/nokogiri/xml/parse_options.rb', line 342
NONET | BIG_LINES
-
DEFAULT_XML =
Shorthand options mask useful for parsing ::Nokogiri::XML: sets RECOVER, NONET, BIG_LINES.
RECOVER | NONET | BIG_LINES
-
DEFAULT_XSLT =
RECOVER | NONET | NOENT | DTDLOAD | DTDATTR | NOCDATA | BIG_LINES
-
DTDATTR =
Default DTD attributes. On by default for ::Nokogiri::XSLT::Stylesheet.
1 << 3
-
DTDLOAD =
Load external subsets. On by default for ::Nokogiri::XSLT::Stylesheet.
⚠ It is UNSAFE to set this option when parsing untrusted documents.
1 << 2
-
DTDVALID =
Validate with the DTD. Off by default.
1 << 4
-
HUGE =
Relax any hardcoded limit from the parser. Off by default.
⚠ It is UNSAFE to set this option when parsing untrusted documents.
1 << 19
-
NOBASEFIX =
Do not fix up XInclude xml:base URIs. Off by default.
1 << 18
-
NOBLANKS =
Remove blank nodes. Off by default.
1 << 8
-
NOCDATA =
Merge CDATA as text nodes. On by default for ::Nokogiri::XSLT::Stylesheet.
1 << 14
-
NODICT =
Do not reuse the context dictionary. Off by default.
1 << 12
-
NOENT =
Substitute entities. Off by default.
⚠ This option enables entity substitution, contrary to what the name implies.
⚠ It is UNSAFE to set this option when parsing untrusted documents.
1 << 1
-
NOERROR =
Suppress error reports. On by default for ::Nokogiri::HTML4::Document and ::Nokogiri::HTML4::DocumentFragment.
1 << 5
-
NONET =
Forbid network access. On by default for Document, DocumentFragment, ::Nokogiri::HTML4::Document, ::Nokogiri::HTML4::DocumentFragment, ::Nokogiri::XSLT::Stylesheet, and Schema.
⚠ It is UNSAFE to unset this option when parsing untrusted documents.
1 << 11
-
NOWARNING =
Suppress warning reports. On by default for ::Nokogiri::HTML4::Document and ::Nokogiri::HTML4::DocumentFragment.
1 << 6
-
NOXINCNODE =
Do not generate XInclude START/END nodes. Off by default.
1 << 15
-
NSCLEAN =
Remove redundant namespace declarations. Off by default.
1 << 13
-
OLD10 =
Parse using XML-1.0 before update 5. Off by default.
1 << 17
-
PEDANTIC =
Enable pedantic error reporting. Off by default.
1 << 7
-
RECOVER =
Recover from errors in input; no strict parsing.
1 << 0
-
SAX1 =
Use the SAX1 interface internally. Off by default.
1 << 9
-
STRICT =
Strict parsing; do not recover from errors in input.
0
-
XINCLUDE =
Implement XInclude substitution. Off by default.
1 << 10
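Several entries above carry a warning about untrusted documents. As an illustration only, the following plain-Ruby sketch inspects an integer options value for those conditions. The helper `risky_flags` is hypothetical (it is not a Nokogiri method), the constants are local copies of the values listed above, and the list of checks simply mirrors the warnings on this page.

```ruby
# Local copies of bit values from the Constant Summary above.
RECOVER   = 1 << 0
NOENT     = 1 << 1
DTDLOAD   = 1 << 2
DTDATTR   = 1 << 3
NONET     = 1 << 11
NOCDATA   = 1 << 14
HUGE      = 1 << 19
BIG_LINES = 1 << 22

# Hypothetical helper: returns the conditions that this page marks as
# unsafe for untrusted input, given an integer options bitmap.
def risky_flags(bits)
  risky = []
  risky << :noent_set   if (bits & NOENT) != 0
  risky << :dtdload_set if (bits & DTDLOAD) != 0
  risky << :huge_set    if (bits & HUGE) != 0
  risky << :nonet_unset if (bits & NONET) == 0
  risky
end

default_xml  = RECOVER | NONET | BIG_LINES
default_xslt = RECOVER | NONET | NOENT | DTDLOAD | DTDATTR | NOCDATA | BIG_LINES

p risky_flags(default_xml)  # []
p risky_flags(default_xslt) # [:noent_set, :dtdload_set]
p risky_flags(0)            # [:nonet_unset]
```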
Class Method Summary
-
.new(options = ParseOptions::STRICT) ⇒ ParseOptions
constructor
Instance Attribute Summary
Instance Method Summary
-
#==(object)
Returns true if the same options are set in self and object.
-
#inspect
Returns a string representation of self that includes the numeric value of @options.
Constructor Details
.new(options = ParseOptions::STRICT) ⇒ ParseOptions
Returns a new ParseOptions object with options as specified by integer argument #options. The value of #options may be constructed using [Bitmap Constants](ParseOptions@Bitmap+Constants).
With the simple constant STRICT (the default), all options are off (`strict` means `norecover`):
```
ParseOptions.new
# => #<Nokogiri::XML::ParseOptions: ... strict>
```
With a different simple constant, one option may be set:
```
ParseOptions.new(ParseOptions::RECOVER)
# => #<Nokogiri::XML::ParseOptions: ... recover>
ParseOptions.new(ParseOptions::COMPACT)
# => #<Nokogiri::XML::ParseOptions: ... strict, compact>
```
With multiple ORed constants, multiple options may be set:
```
options = ParseOptions::COMPACT | ParseOptions::RECOVER | ParseOptions::BIG_LINES
ParseOptions.new(options)
# => #<Nokogiri::XML::ParseOptions: ... recover, compact, big_lines>
```
# File 'lib/nokogiri/xml/parse_options.rb', line 389
def initialize(options = STRICT)
  @options = options
end
Instance Attribute Details
#options (rw) Also known as: #to_i
# File 'lib/nokogiri/xml/parse_options.rb', line 354
attr_accessor :options
#strict (readonly)
Turns off option `recover`:
```
options = ParseOptions.new.recover.compact.big_lines
# => #<Nokogiri::XML::ParseOptions: ... recover, compact, big_lines>
options.strict
# => #<Nokogiri::XML::ParseOptions: ... strict, compact, big_lines>
```
# File 'lib/nokogiri/xml/parse_options.rb', line 424
def strict
  @options &= ~RECOVER
  self
end
#strict? ⇒ Boolean
(readonly)
#to_i (readonly)
Alias for #options.
# File 'lib/nokogiri/xml/parse_options.rb', line 462
alias_method :to_i, :options
Instance Method Details
#==(object)
Returns true if the same options are set in `self` and `object`.
```
options = ParseOptions.new
# => #<Nokogiri::XML::ParseOptions: ... strict>
options == options.dup # => true
options == options.dup.recover # => false
```
#inspect
Returns a string representation of `self` that includes the numeric value of `@options`:
```
options = ParseOptions.new
options.inspect
# => "#<Nokogiri::XML::ParseOptions: @options=0 strict>"
```
In general, the returned string also includes the (downcased) names of the options that are on (but omits the names of those that are off):
```
options.recover.big_lines
options.inspect
# => "#<Nokogiri::XML::ParseOptions: @options=4194305 recover, big_lines>"
```
The exception is that either `recover` (i.e., *not strict*) or the pseudo-option #strict is always reported:
```
options.norecover
options.inspect
# => "#<Nokogiri::XML::ParseOptions: @options=4194304 strict, big_lines>"
```
# File 'lib/nokogiri/xml/parse_options.rb', line 495
def inspect
  options = []
  self.class.constants.each do |k|
    options << k.downcase if send(:"#{k.downcase}?")
  end
  super.sub(/>$/, " " + options.join(", ") + ">")
end
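The integer reported as `@options=` can also be decoded by hand. The sketch below is plain Ruby and does not call Nokogiri: `BITS` is a local table built from the single-bit values in the Constant Summary, and `option_names` is a hypothetical helper. Unlike #inspect, it lists only single-bit names (no shorthand names such as `default_xml`), and its ordering follows the table rather than the library.

```ruby
# Single-bit option values copied from the Constant Summary on this page.
BITS = {
  recover: 1 << 0,    noent: 1 << 1,      dtdload: 1 << 2,
  dtdattr: 1 << 3,    dtdvalid: 1 << 4,   noerror: 1 << 5,
  nowarning: 1 << 6,  pedantic: 1 << 7,   noblanks: 1 << 8,
  sax1: 1 << 9,       xinclude: 1 << 10,  nonet: 1 << 11,
  nodict: 1 << 12,    nsclean: 1 << 13,   nocdata: 1 << 14,
  noxincnode: 1 << 15, compact: 1 << 16,  old10: 1 << 17,
  nobasefix: 1 << 18, huge: 1 << 19,      big_lines: 1 << 22,
}.freeze

# Hypothetical helper: names of the bits that are on, with :strict
# reported whenever the RECOVER bit is off.
def option_names(bits)
  names = BITS.select { |_name, bit| (bits & bit) == bit }.keys
  names.unshift(:strict) unless names.include?(:recover)
  names
end

p option_names(0)       # [:strict]
p option_names(4194305) # [:recover, :big_lines]
p option_names(4194304) # [:strict, :big_lines]
```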