
Class: Nokogiri::HTML4::Document

Relationships & Source Files
Inherits: Nokogiri::XML::Document
Defined in: ext/nokogiri/html4_document.c,
lib/nokogiri/html4/document.rb

Constant Summary

::Nokogiri::XML::PP::Node - Included

COLLECTIONS

::Nokogiri::XML::Searchable - Included

LOOKS_LIKE_XPATH

::Nokogiri::ClassResolver - Included

VALID_NAMESPACES

::Nokogiri::XML::Node - Inherited

ATTRIBUTE_DECL, ATTRIBUTE_NODE, CDATA_SECTION_NODE, COMMENT_NODE, DECONSTRUCT_KEYS, DECONSTRUCT_METHODS, DOCB_DOCUMENT_NODE, DOCUMENT_FRAG_NODE, DOCUMENT_NODE, DOCUMENT_TYPE_NODE, DTD_NODE, ELEMENT_DECL, ELEMENT_NODE, ENTITY_DECL, ENTITY_NODE, ENTITY_REF_NODE, HTML_DOCUMENT_NODE, IMPLIED_XPATH_CONTEXTS, NAMESPACE_DECL, NOTATION_NODE, PI_NODE, TEXT_NODE, XINCLUDE_END, XINCLUDE_START

::Nokogiri::XML::Document - Inherited

IMPLIED_XPATH_CONTEXTS, NCNAME_CHAR, NCNAME_RE, NCNAME_START_CHAR, OBJECT_CLONE_METHOD, OBJECT_DUP_METHOD

Class Method Summary

::Nokogiri::XML::Document - Inherited

.new

Create a new document with version (defaults to “1.0”).

.parse

Parse XML input from a String or IO object, and return a new ::Nokogiri::XML::Document.

.read_io

Create a new document from an IO object.

.read_memory

Create a new document from a String.

.wrap

⚠ This method is only available when running JRuby.

.empty_doc?

::Nokogiri::XML::Node - Inherited

.new

documented in lib/nokogiri/xml/node.rb.

Instance Attribute Summary

::Nokogiri::XML::Document - Inherited

#encoding

Get the encoding for this Document.

#encoding=

Set the encoding string for this Document.

#errors

The errors found while parsing a document.

#namespace_inheritance

When true, reparented elements without a namespace will inherit their new parent’s namespace (if one exists).

#root

Get the root node for this document.

#root=

Set the root element on this document.

::Nokogiri::XML::Node - Inherited

#blank?

Returns true if the node is an empty or whitespace-only text or cdata node, else false.

#cdata?

Returns true if this is a CDATA.

#children

Get the list of children for this node as a NodeSet.

#children=

Set the content for this Node to node_or_tags.

#comment?

Returns true if this is a Comment.

#content

Returns the contents of all the text nodes in this node’s subtree, concatenated into a single String.

#content=

Set the content of this node to input.

#default_namespace=

Adds a default namespace supplied as a string url href, to self.

#document

Returns the Document associated with this Node.

#document?

Returns true if this is a Document.

#elem?
#element?

Returns true if this is an Element node.

#fragment?

Returns true if this is a DocumentFragment.

#html?

Returns true if this is an HTML4::Document or ::Nokogiri::HTML5::Document node.

#inner_html

Get the inner_html for this node’s Node#children

#inner_html=

Set the content for this Node to node_or_tags.

#inner_text
#lang

Searches the language of a node, i.e.

#lang=

Set the language of a node, i.e.

#line

Returns the line number of this Node.

#line=

Sets the line for this Node.

#name
#namespace

Returns the Namespace of the element or attribute node, or nil if there is no namespace.

#namespace=

Set the default namespace on this node (as would be defined with an “xmlns=” attribute in XML source), as a Namespace object ns.

#native_content=

Set the content of this node to input.

#next
#next=
#node_name

Returns the name for this Node.

#node_name=

Set the name for this Node.

#parent

Get the parent Node for this Node.

#parent=

Set the parent Node for this Node.

#previous
#previous=
#processing_instruction?

Returns true if this is a ProcessingInstruction node.

#read_only?

Is this a read only node?

#text
#text?

Returns true if this is a Text node.

#to_str
#xml?

Returns true if this is an ::Nokogiri::XML::Document node.

#prepend_newline?, #data_ptr?

Instance Method Summary

::Nokogiri::XML::Document - Inherited

#<<
#add_child,
#canonicalize

Canonicalize a document and return the results.

#clone

Clone this node.

#collect_namespaces

Recursively get all namespaces from this node and its subtree and return them as a hash.

#create_cdata

Create a CDATA Node containing string

#create_comment

Create a Comment Node containing string

#create_element

Create a new Element with name belonging to this document, optionally setting contents or attributes.

#create_entity

Create a new entity named name.

#create_text_node

Create a Text Node with string

#deconstruct_keys

Returns a hash describing the Document, to use in pattern matching.

#decorate

Apply any decorators to node

#decorators

Get the list of decorators given key

#document

A reference to self

#dup

Duplicate this node.

#fragment

Create a ::Nokogiri::XML::DocumentFragment from tags. Returns an empty fragment if tags is nil.

#name

The name of this document.

#namespaces

Get the hash of namespaces on the root ::Nokogiri::XML::Node

#remove_namespaces!

Remove all namespaces from all nodes in the document.

#slop!

Explore a document with shortcut methods.

#to_java

⚠ This method is only available when running JRuby.

#to_xml
#url

Get the url name for this document.

#validate

Validate this Document against its DTD.

#version

Get the XML version for this Document.

#xpath_doctype

Returns the document type which determines CSS-to-XPath translation.

#inspect_attributes,
#initialize


::Nokogiri::XML::Node - Inherited

#<<

Add node_or_tags as a child of this Node.

#<=>

Compare two Node objects with respect to their Document.

#==

Test to see if this Node is equal to other.

#[]

Fetch an attribute from this node.

#[]=

Update the attribute name to value, or create the attribute if it does not exist.

#accept

Accept a visitor.

#add_child

Add node_or_tags as a child of this Node.

#add_class

Ensure HTML CSS classes are present on self.

#add_namespace
#add_namespace_definition

Adds a namespace definition to this node with prefix, using href as its value.

#add_next_sibling

Insert node_or_tags after this Node (as a sibling).

#add_previous_sibling

Insert node_or_tags before this Node (as a sibling).

#after

Insert node_or_tags after this node (as a sibling).

#ancestors

Get a list of ancestor Node for this Node.

#append_class

Add HTML CSS classes to self, regardless of duplication.

#attr

Alias for XML::Node#[].

#attribute

Returns the Attr belonging to this node with the given name.

#attribute_nodes

Returns the Attr objects belonging to this node.

#attribute_with_ns

Returns the Attr belonging to this node with the given name and namespace.

#attributes

Fetch this node’s attributes.

#before

Insert node_or_tags before this node (as a sibling).

#canonicalize,
#child

Returns the first child node of this node, or nil if there are no children.

#classes

Fetch CSS class names of a Node.

#clone

Clone this node.

#create_external_subset

Create an external subset.

#create_internal_subset

Create the internal subset of a document.

#css_path

Get the path to this node as a CSS expression.

#deconstruct_keys

Returns a hash describing the Node, to use in pattern matching.

#decorate!

Decorate this node with the decorators set up in this node’s Document.

#delete
#description

Fetch the ElementDescription for this node.

#do_xinclude

Do xinclude substitution on the subtree below node.

#dup

Duplicate this node.

#each

Iterate over each attribute name and value pair for this Node.

#element_children

Returns the child nodes of this node that are elements.

#elements
#encode_special_chars

Encode any special characters in string

#external_subset

Get the external subset.

#first_element_child

Returns the first child Node that is an element.

#fragment

Create a DocumentFragment containing tags that is relative to this context node.

#get_attribute

Alias for XML::Node#[].

#has_attribute?

Alias for XML::Node#key?.

#initialize

Create a new node with name that belongs to document.

#internal_subset

Get the internal subset.

#key?

Returns true if attribute is set.

#keys

Get the attribute names for this Node.

#kwattr_add

Ensure that values are present in a keyword attribute.

#kwattr_append

Add keywords to a Node’s keyword attribute, regardless of duplication.

#kwattr_remove

Remove keywords from a keyword attribute.

#kwattr_values

Fetch values from a keyword attribute of a Node.

#last_element_child

Returns the last child Node that is an element.

#matches?

Returns true if this Node matches selector

#namespace_definitions

Returns the Namespace objects defined on this node.

#namespace_scopes

Returns an Array of all the Namespaces on this node and its ancestors.

#namespaced_key?

Returns true if attribute is set with namespace

#namespaces

Fetch all the namespaces on this node and its ancestors.

#next_element

Returns the next ::Nokogiri::XML::Element type sibling node.

#next_sibling

Returns the next sibling node.

#node_type

Get the type for this Node.

#parse

Parse string_or_io as a document fragment within the context of this node.

#path

Returns the path associated with this Node.

#pointer_id

Returns a unique id for this node based on the internal memory structures.

#prepend_child

Add node_or_tags as the first child of this Node.

#previous_element

Returns the previous ::Nokogiri::XML::Element type sibling node.

#previous_sibling

Returns the previous sibling node.

#remove

Alias for XML::Node#unlink.

#remove_attribute

Remove the attribute named name

#remove_class

Remove HTML CSS classes from this node.

#replace

Replace this Node with node_or_tags.

#serialize

Serialize Node using options.

#set_attribute

Alias for XML::Node#[]=.

#swap

Swap this Node for node_or_tags

#to_html

Serialize this Node to HTML.

#to_s

Turn this node in to a string.

#to_xhtml

Serialize this Node to XHTML using options

#to_xml

Serialize this Node to XML using options.

#traverse

Yields self and all children to block recursively.

#type
#unlink

Unlink this node from its current context.

#value?

Returns true if this Node’s attributes include value.

#values

Get the attribute values for this Node.

#wrap

Wrap this Node with the node parsed from markup or a dup of the node.

#write_html_to

Write Node as HTML to io with options.

#write_to

Serialize this node or document to io.

#write_xhtml_to

Write Node as XHTML to io with options

#write_xml_to

Write Node as XML to io with options.

#add_child_node_and_reparent_attrs, #add_sibling,
#compare

Compare this Node to other with respect to their Document.

#dump_html

Returns the Node as html.

#get

Get the value for attribute

#html_standard_serialize,
#in_context

TODO: DOCUMENT ME.

#inspect_attributes, #keywordify,
#native_write_to

Write this Node to io with encoding and options

#process_xincludes

Loads and substitutes all xinclude elements below the node.

#set

Set the property to value

#set_namespace

Set the namespace to namespace

#to_format, #write_format_to, #add_child_node, #add_next_sibling_node, #add_previous_sibling_node, #replace_node

::Nokogiri::ClassResolver - Included

#related_class

Find a class constant within the.

::Nokogiri::XML::Searchable - Included

#%
#/
#>

Search this node’s immediate children using the CSS selector selector.

#at

Search this object for paths, and return only the first result.

#at_css

Search this object for CSS rules, and return only the first match.

#at_xpath

Search this node for XPath paths, and return only the first match.

#css

Search this object for CSS rules.

#search

Search this object for paths.

#xpath

Search this node for XPath paths.

#css_internal, #css_rules_to_xpath, #xpath_impl, #xpath_internal, #xpath_query_from_css_rule, #extract_params

::Nokogiri::XML::PP::Node - Included

Constructor Details

.new(uri=nil, external_id=nil) → HTML4::Document

Create a new empty document with base URI uri and external ID external_id.

  
# File 'ext/nokogiri/html4_document.c', line 14

static VALUE
rb_html_document_s_new(int argc, VALUE *argv, VALUE klass)
{
  VALUE uri, external_id, rest, rb_doc;
  htmlDocPtr doc;

  rb_scan_args(argc, argv, "0*", &rest);
  uri = rb_ary_entry(rest, (long)0);
  external_id = rb_ary_entry(rest, (long)1);

  doc = htmlNewDoc(
          RTEST(uri) ? (const xmlChar *)StringValueCStr(uri) : NULL,
          RTEST(external_id) ? (const xmlChar *)StringValueCStr(external_id) : NULL
        );
  rb_doc = noko_xml_document_wrap_with_init_args(klass, doc, argc, argv);
  return rb_doc ;
}

Class Method Details

.parse(input) {|options| ... } ⇒ Document
.parse(input, url:, encoding:, options:) ⇒ Document

Parse HTML4 input from a String or IO object, and return a new Document.

Required Parameters
  • input (String | IO) The content to be parsed.

Optional Keyword Arguments
  • url: (String) The base URI for this document.

  • encoding: (String) The name of the encoding that should be used when processing the document. When not provided, the encoding will be determined based on the document content.

  • options: (Nokogiri::XML::ParseOptions) Configuration object that determines some behaviors during parsing. See ParseOptions for more information. The default value is ParseOptions::DEFAULT_HTML.

Yields

If a block is given, a Nokogiri::XML::ParseOptions object is yielded to the block which can be configured before parsing. See Nokogiri::XML::ParseOptions for more information.

Returns

Document


  
# File 'lib/nokogiri/html4/document.rb', line 189

def parse(
  input,
  url_ = nil, encoding_ = nil, options_ = XML::ParseOptions::DEFAULT_HTML,
  url: url_, encoding: encoding_, options: options_
)
  options = Nokogiri::XML::ParseOptions.new(options) if Integer === options
  yield options if block_given?

  url ||= input.respond_to?(:path) ? input.path : nil

  if input.respond_to?(:encoding)
    unless input.encoding == Encoding::ASCII_8BIT
      encoding ||= input.encoding.name
    end
  end

  if input.respond_to?(:read)
    if input.is_a?(Pathname)
      # resolve the Pathname to the file and open it as an IO object, see #2110
      input = input.expand_path.open
      url ||= input.path
    end

    unless encoding
      input = EncodingReader.new(input)
      begin
        return read_io(input, url, encoding, options.to_i)
      rescue EncodingReader::EncodingFound => e
        encoding = e.found_encoding
      end
    end
    return read_io(input, url, encoding, options.to_i)
  end

  # read_memory pukes on empty docs
  if input.nil? || input.empty?
    return encoding ? new.tap { |i| i.encoding = encoding } : new
  end

  encoding ||= EncodingReader.detect_encoding(input)

  read_memory(input, url, encoding, options.to_i)
end
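
The signature above lets each keyword argument default to its legacy positional counterpart, so both calling styles reach the same code path. A minimal pure-Ruby sketch of that pattern follows; `describe_source` is a hypothetical stand-in, not part of Nokogiri.

```ruby
# Hypothetical stand-in for the signature pattern used by .parse:
# each keyword defaults to the positional parameter declared before it.
def describe_source(input, url_ = nil, encoding_ = nil, url: url_, encoding: encoding_)
  { input: input, url: url, encoding: encoding }
end

positional = describe_source("<p>hi</p>", "http://example.com/", "UTF-8")
keyword    = describe_source("<p>hi</p>", url: "http://example.com/", encoding: "UTF-8")

puts positional == keyword # → true
```

An explicitly passed keyword takes precedence over the positional value, because the positional value is only used as the keyword's default.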

.read_io(io, url, encoding, options)

Read the HTML document from io with the given url, encoding, and options. See Nokogiri::HTML4.parse.

  
# File 'ext/nokogiri/html4_document.c', line 39

static VALUE
rb_html_document_s_read_io(VALUE klass, VALUE rb_io, VALUE rb_url, VALUE rb_encoding, VALUE rb_options)
{
  VALUE rb_doc;
  VALUE rb_error_list = rb_ary_new();
  htmlDocPtr c_doc;
  const char *c_url = NIL_P(rb_url) ? NULL : StringValueCStr(rb_url);
  const char *c_encoding = NIL_P(rb_encoding) ? NULL : StringValueCStr(rb_encoding);
  int options = NUM2INT(rb_options);

  xmlSetStructuredErrorFunc((void *)rb_error_list, noko__error_array_pusher);

  c_doc = htmlReadIO(noko_io_read, noko_io_close, (void *)rb_io, c_url, c_encoding, options);

  xmlSetStructuredErrorFunc(NULL, NULL);

  /*
   * If EncodingFound has occurred in EncodingReader, make sure to do
   * a cleanup and propagate the error.
   */
  if (rb_respond_to(rb_io, id_encoding_found)) {
    VALUE encoding_found = rb_funcall(rb_io, id_encoding_found, 0);
    if (!NIL_P(encoding_found)) {
      xmlFreeDoc(c_doc);
      rb_exc_raise(encoding_found);
    }
  }

  if ((c_doc == NULL) || (!(options & XML_PARSE_RECOVER) && (RARRAY_LEN(rb_error_list) > 0))) {
    VALUE rb_error ;

    xmlFreeDoc(c_doc);

    rb_error = rb_ary_entry(rb_error_list, 0);
    if (rb_error == Qnil) {
      rb_raise(rb_eRuntimeError, "Could not parse document");
    } else {
      VALUE exception_message = rb_funcall(rb_error, id_to_s, 0);
      exception_message = rb_str_concat(rb_str_new2("Parser without recover option encountered error or warning: "),
                                        exception_message);
      rb_exc_raise(rb_class_new_instance(1, &exception_message, cNokogiriXmlSyntaxError));
    }

    return Qnil;
  }

  rb_doc = noko_xml_document_wrap(klass, c_doc);
  rb_iv_set(rb_doc, "@errors", rb_error_list);
  return rb_doc;
}

.read_memory(string, url, encoding, options)

Read the HTML document contained in string with the given url, encoding, and options. See Nokogiri::HTML4.parse.

  
# File 'ext/nokogiri/html4_document.c', line 97

static VALUE
rb_html_document_s_read_memory(VALUE klass, VALUE rb_html, VALUE rb_url, VALUE rb_encoding, VALUE rb_options)
{
  VALUE rb_doc;
  VALUE rb_error_list = rb_ary_new();
  htmlDocPtr c_doc;
  const char *c_buffer = StringValuePtr(rb_html);
  const char *c_url = NIL_P(rb_url) ? NULL : StringValueCStr(rb_url);
  const char *c_encoding = NIL_P(rb_encoding) ? NULL : StringValueCStr(rb_encoding);
  int html_len = (int)RSTRING_LEN(rb_html);
  int options = NUM2INT(rb_options);

  xmlSetStructuredErrorFunc((void *)rb_error_list, noko__error_array_pusher);

  c_doc = htmlReadMemory(c_buffer, html_len, c_url, c_encoding, options);

  xmlSetStructuredErrorFunc(NULL, NULL);

  if ((c_doc == NULL) || (!(options & XML_PARSE_RECOVER) && (RARRAY_LEN(rb_error_list) > 0))) {
    VALUE rb_error ;

    xmlFreeDoc(c_doc);

    rb_error = rb_ary_entry(rb_error_list, 0);
    if (rb_error == Qnil) {
      rb_raise(rb_eRuntimeError, "Could not parse document");
    } else {
      VALUE exception_message = rb_funcall(rb_error, id_to_s, 0);
      exception_message = rb_str_concat(rb_str_new2("Parser without recover option encountered error or warning: "),
                                        exception_message);
      rb_exc_raise(rb_class_new_instance(1, &exception_message, cNokogiriXmlSyntaxError));
    }

    return Qnil;
  }

  rb_doc = noko_xml_document_wrap(klass, c_doc);
  rb_iv_set(rb_doc, "@errors", rb_error_list);
  return rb_doc;
}

Instance Attribute Details

#meta_encoding (rw)

Get the meta tag encoding for this document. If there is no meta tag, then nil is returned.


  
# File 'lib/nokogiri/html4/document.rb', line 12

def meta_encoding
  if (meta = at_xpath("//meta[@charset]"))
    meta[:charset]
  elsif (meta = meta_content_type)
    meta["content"][/charset\s*=\s*([\w-]+)/i, 1]
  end
end
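
The fallback branch extracts the charset from a Content-Type meta tag with a regular expression. The pure-Ruby sketch below applies the same pattern to a few sample strings; `charset_from` is a hypothetical helper introduced only for illustration.

```ruby
# The same pattern that #meta_encoding applies to a Content-Type meta tag's content.
CHARSET_PATTERN = /charset\s*=\s*([\w-]+)/i

# Hypothetical helper: returns the first capture group, or nil when nothing matches.
def charset_from(content)
  content[CHARSET_PATTERN, 1]
end

puts charset_from("text/html; charset=Shift_JIS") # → Shift_JIS
puts charset_from("text/html; CHARSET = utf-8")   # → utf-8
puts charset_from("text/html").inspect            # → nil
```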

#meta_encoding=(encoding) (rw)

Set the meta tag encoding for this document.

If a meta encoding tag is already present, its content is replaced with the given encoding.

Otherwise, this method tries to create one in an appropriate place, supplying head and/or html elements as necessary: inside a head element if there is one, and before any text node or content element (typically <body>) if there is one.

The result when trying to set an encoding that is different from the document encoding is undefined.

Beware that in CRuby, libxml2 automatically inserts a meta tag into a head element.

  
# File 'lib/nokogiri/html4/document.rb', line 36

def meta_encoding=(encoding)
  if (meta = meta_content_type)
    meta["content"] = format("text/html; charset=%s", encoding)
    encoding
  elsif (meta = at_xpath("//meta[@charset]"))
    meta["charset"] = encoding
  else
    meta = XML::Node.new("meta", self)
    if (dtd = internal_subset) && dtd.html5_dtd?
      meta["charset"] = encoding
    else
      meta["http-equiv"] = "Content-Type"
      meta["content"] = format("text/html; charset=%s", encoding)
    end

    if (head = at_xpath("//head"))
      head.prepend_child(meta)
    else
      set_metadata_element(meta)
    end
    encoding
  end
end

#title (rw)

Get the title string of this document. Returns nil if there is no title tag.

  
# File 'lib/nokogiri/html4/document.rb', line 70

def title
  (title = at_xpath("//title")) && title.inner_text
end

#title=(text) (rw)

Set the title string of this document.

If a title element is already present, its content is replaced with the given text.

Otherwise, this method tries to create one in an appropriate place, supplying head and/or html elements as necessary: inside a head element if there is one, right after a meta encoding/charset tag if there is one, and before any text node or content element (typically <body>) if there is one.

  
# File 'lib/nokogiri/html4/document.rb', line 85

def title=(text)
  tnode = XML::Text.new(text, self)
  if (title = at_xpath("//title"))
    title.children = tnode
    return text
  end

  title = XML::Node.new("title", self) << tnode
  if (head = at_xpath("//head"))
    head << title
  elsif (meta = at_xpath("//meta[@charset]") || meta_content_type)
    # better put after charset declaration
    meta.add_next_sibling(title)
  else
    set_metadata_element(title)
  end
end

Instance Method Details

#fragment(tags = nil)


  
# File 'lib/nokogiri/html4/document.rb', line 149

def fragment(tags = nil)
  DocumentFragment.new(self, tags, root)
end

#meta_content_type (private)


  
# File 'lib/nokogiri/html4/document.rb', line 60

def meta_content_type
  xpath("//meta[@http-equiv and boolean(@content)]").find do |node|
    node["http-equiv"] =~ /\AContent-Type\z/i
  end
end

#serialize(options = {})

Serialize Node using options. Save options can also be set using a block.

See also ::Nokogiri::XML::Node::SaveOptions and the “Serialization and Generating Output” section of Node.

These two statements are equivalent:

node.serialize(:encoding => 'UTF-8', :save_with => FORMAT | AS_XML)

or

node.serialize(:encoding => 'UTF-8') do |config|
  config.format.as_xml
end

  
# File 'lib/nokogiri/html4/document.rb', line 142

def serialize(options = {})
  options[:save_with] ||= XML::Node::SaveOptions::DEFAULT_HTML
  super
end

#set_metadata_element(element) (private)


  
# File 'lib/nokogiri/html4/document.rb', line 103

def set_metadata_element(element) # rubocop:disable Naming/AccessorMethodName
  if (head = at_xpath("//head"))
    head << element
  elsif (html = at_xpath("//html"))
    head = html.prepend_child(XML::Node.new("head", self))
    head.prepend_child(element)
  elsif (first = children.find do |node|
           case node
           when XML::Element, XML::Text
             true
           end
         end)
    # We reach here only if the underlying document model
    # allows <html>/<head> elements to be omitted and does not
    # automatically supply them.
    first.add_previous_sibling(element)
  else
    html = add_child(XML::Node.new("html", self))
    head = html.add_child(XML::Node.new("head", self))
    head.prepend_child(element)
  end
end

#type

The type for this document


  
# File 'ext/nokogiri/html4_document.c', line 144

static VALUE
rb_html_document_type(VALUE self)
{
  htmlDocPtr doc = noko_xml_document_unwrap(self);
  return INT2NUM(doc->type);
}

#xpath_doctype() → Nokogiri::CSS::XPathVisitor::DoctypeConfig

Returns the document type which determines CSS-to-XPath translation.

See XPathVisitor for more information.

  
# File 'lib/nokogiri/html4/document.rb', line 159

def xpath_doctype
  Nokogiri::CSS::XPathVisitor::DoctypeConfig::HTML4
end