
Class: LibXML::XML::HTMLParser

Relationships & Source Files
Inherits: Object
Defined in: ext/libxml/ruby_xml_html_parser.c,
lib/libxml/html_parser.rb

Overview

The HTML parser implements an HTML 4.0 non-verifying parser with an API compatible with the Parser. In contrast with the Parser, it can parse “real world” HTML, even if it is severely broken from a specification point of view.

The HTML parser creates an in-memory document object that consists of any number of Node instances. This is a simple and powerful model, but it has the major limitation that the size of the document that can be processed is bounded by the amount of memory available.

Using the HTML parser is simple:

parser = XML::HTMLParser.file('my_file')
doc = parser.parse

You can also parse documents (see XML::HTMLParser.document), strings (see .string) and io objects (see .io).
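As a slightly fuller illustration of the usage above, the following sketch parses an in-memory string and reads one element back out of the resulting document. The helper name page_title is hypothetical, the gem is assumed to load via require 'libxml-ruby', and the begin/rescue guard exists only so the snippet still loads on a machine where the gem is missing.

```ruby
# Load the binding if it is installed. The guard only keeps this sketch
# loadable without the gem; real code would normally just require it.
begin
  require 'libxml-ruby'
rescue LoadError
  nil
end

# Hypothetical helper: returns the text of the first <title> element,
# or nil when there is none (or when the gem is unavailable).
def page_title(html)
  return nil unless defined?(LibXML::XML::HTMLParser)

  parser = LibXML::XML::HTMLParser.string(html) # build a parser for the string
  doc    = parser.parse                         # in-memory XML::Document
  title  = doc.find_first('//title')            # XPath lookup on the tree
  title && title.content
end

puts page_title('<html><head><title>Hello</title></head><body><p>Hi</p></body></html>').inspect
```

Because the whole document is held in memory, this approach suits pages of ordinary size rather than very large inputs.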

Constructor Details

XML::HTMLParser.initialize ⇒ parser

Initializes a new parser instance from an XML::Parser::Context. An ArgumentError is raised if no context is passed.


  
# File 'ext/libxml/ruby_xml_html_parser.c', line 39

static VALUE rxml_html_parser_initialize(int argc, VALUE *argv, VALUE self)
{
  VALUE context = Qnil;

  rb_scan_args(argc, argv, "01", &context);

  if (context == Qnil)
  {
    rb_raise(rb_eArgError, "An instance of a XML::Parser::Context must be passed to XML::HTMLParser.new");
  }

  rb_ivar_set(self, CONTEXT_ATTR, context);
  return self;
}
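In Ruby terms, the C function above behaves roughly like the sketch below. DemoParser and try_new are hypothetical stand-ins written for illustration, not part of the library; the instance variable name @context is taken from the Ruby setters elsewhere in this class.

```ruby
# Hypothetical stand-in that mirrors the argument handling of the C
# initializer; it is not the real XML::HTMLParser class.
class DemoParser
  attr_reader :context

  # The "01" format in rb_scan_args means: zero required arguments and
  # one optional argument. A missing (nil) context is then rejected.
  def initialize(context = nil)
    if context.nil?
      raise ArgumentError, 'a context must be passed to DemoParser.new'
    end

    @context = context
  end
end

# Returns the new parser, or the error message when construction fails.
def try_new(*args)
  DemoParser.new(*args)
rescue ArgumentError => e
  e.message
end

puts try_new                # prints the ArgumentError message
puts try_new(:ctx).context  # prints "ctx"
```

In practice the class methods .file, .io and .string build the context, so callers rarely need to call new directly.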

Class Method Details

XML::HTMLParser.file(path) ⇒ HTMLParser
XML::HTMLParser.file(path, encoding: XML::Encoding::UTF_8, options: ...) ⇒ HTMLParser

Creates a new parser for the specified file or URI.

Parameters:

path - Path to file to parse
encoding - The document encoding, defaults to nil. Valid values
           are the encoding constants defined on XML::Encoding.
options - Parser options.  Valid values are the constants defined on
          XML::HTMLParser::Options.  Multiple options can be combined
          by using bitwise OR (|).

  
# File 'lib/libxml/html_parser.rb', line 21

def self.file(path, encoding: nil, options: nil)
  context = XML::HTMLParser::Context.file(path)
  context.encoding = encoding if encoding
  context.options = options if options
  self.new(context)
end
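The options parameter is a bit mask, so several options are combined with | and tested with &. The sketch below shows only that arithmetic. DemoOptions, combine and flag_set? are hypothetical stand-ins; the numeric values are assumed to follow libxml2's htmlParserOption enum, and real code should use the constants on XML::HTMLParser::Options instead of redefining them.

```ruby
# Stand-in flags for illustration only (assumed to match libxml2's
# htmlParserOption values). Use XML::HTMLParser::Options in real code.
module DemoOptions
  RECOVER   = 1 << 0   # keep parsing after errors
  NOERROR   = 1 << 5   # suppress error reports
  NOWARNING = 1 << 6   # suppress warning reports
  NONET     = 1 << 11  # forbid network access
end

# Combine any number of flags into a single mask with bitwise OR.
def combine(*flags)
  flags.reduce(0) { |mask, flag| mask | flag }
end

# True when every bit of +flag+ is present in +mask+.
def flag_set?(mask, flag)
  (mask & flag) == flag
end

LENIENT = combine(DemoOptions::RECOVER, DemoOptions::NOERROR, DemoOptions::NOWARNING)

puts LENIENT                                # 97
puts flag_set?(LENIENT, DemoOptions::NONET) # false
```

The resulting integer is what would be passed as options: to .file, .io or .string.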

XML::HTMLParser.io(io) ⇒ HTMLParser
XML::HTMLParser.io(io, encoding: XML::Encoding::UTF_8, options: ..., base_uri: ...) ⇒ HTMLParser

Creates a new parser for the specified IO object.

Parameters:

io - IO object that contains the HTML to parse
base_uri - The base URL for the parsed document.
encoding - The document encoding, defaults to nil. Valid values
           are the encoding constants defined on XML::Encoding.
options - Parser options.  Valid values are the constants defined on
          XML::HTMLParser::Options.  Multiple options can be combined
          by using bitwise OR (|).

  
# File 'lib/libxml/html_parser.rb', line 45

def self.io(io, base_uri: nil, encoding: nil, options: nil)
  context = XML::HTMLParser::Context.io(io)
  context.base_uri = base_uri if base_uri
  context.encoding = encoding if encoding
  context.options = options if options
  self.new(context)
end
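Any object that behaves like an IO can serve as the source. The sketch below feeds a StringIO to .io and collects link targets from the parsed document. The helper link_targets is hypothetical, and the snippet assumes the gem loads via require 'libxml-ruby' and that XPath results are enumerable; the guard only keeps it loadable when the gem is absent.

```ruby
require 'stringio'

# Load the binding if it is installed; the guard is only for portability
# of this sketch.
begin
  require 'libxml-ruby'
rescue LoadError
  nil
end

# Hypothetical helper: returns the href of every <a> element read from
# +io+, or an empty array when the gem is unavailable.
def link_targets(io)
  return [] unless defined?(LibXML::XML::HTMLParser)

  doc = LibXML::XML::HTMLParser.io(io).parse
  doc.find('//a').map { |anchor| anchor['href'] }.compact
end

source = StringIO.new('<html><body><a href="/one">1</a><a href="/two">2</a></body></html>')
p link_targets(source)
```

An open File object could be passed in the same way as the StringIO used here.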

XML::HTMLParser.string(string) ⇒ HTMLParser
XML::HTMLParser.string(string, encoding: XML::Encoding::UTF_8, options: ..., base_uri: ...) ⇒ HTMLParser

Creates a new parser by parsing the specified string.

Parameters:

string - String to parse
base_uri - The base URL for the parsed document.
encoding - The document encoding, defaults to nil. Valid values
           are the encoding constants defined on XML::Encoding.
options - Parser options.  Valid values are the constants defined on
          XML::HTMLParser::Options.  Multiple options can be combined
          by using bitwise OR (|).

  
# File 'lib/libxml/html_parser.rb', line 70

def self.string(string, base_uri: nil, encoding: nil, options: nil)
  context = XML::HTMLParser::Context.string(string)
  context.base_uri = base_uri if base_uri
  context.encoding = encoding if encoding
  context.options = options if options
  self.new(context)
end

Instance Attribute Details

#file=(value) (writeonly)

Deprecated; use XML::HTMLParser.file instead. This method is for internal use only.

  
# File 'lib/libxml/html_parser.rb', line 80

def file=(value)
  warn("XML::HTMLParser#file is deprecated.  Use XML::HTMLParser.file instead")
  @context = XML::HTMLParser::Context.file(value)
end

#input (readonly)

Attributes

#io=(value) (writeonly)

Deprecated; use XML::HTMLParser.io instead. This method is for internal use only.

  
# File 'lib/libxml/html_parser.rb', line 85

def io=(value)
  warn("XML::HTMLParser#io is deprecated.  Use XML::HTMLParser.io instead")
  @context = XML::HTMLParser::Context.io(value)
end

#string=(value) (writeonly)

Deprecated; use XML::HTMLParser.string instead. This method is for internal use only.

  
# File 'lib/libxml/html_parser.rb', line 90

def string=(value)
  warn("XML::HTMLParser#string is deprecated.  Use XML::HTMLParser.string instead")
  @context = XML::HTMLParser::Context.string(value)
end

Instance Method Details

#parse ⇒ XML::Document

Parses the input and creates an XML::Document with its content. If an error occurs and the parser is not in recovery mode, XML::Parser::ParseError is raised.


  
# File 'ext/libxml/ruby_xml_html_parser.c', line 62

static VALUE rxml_html_parser_parse(VALUE self)
{
  xmlParserCtxtPtr ctxt;
  VALUE context = rb_ivar_get(self, CONTEXT_ATTR);
  
  Data_Get_Struct(context, xmlParserCtxt, ctxt);

  if (htmlParseDocument(ctxt) == -1 && ! ctxt->recovery)
  {
    rxml_raise(&ctxt->lastError);
  }

  rb_funcall(context, rb_intern("close"), 0);

  return rxml_document_wrap(ctxt->myDoc);
}
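The C source above raises only when htmlParseDocument fails and the context is not in recovery mode. One way to take advantage of that from Ruby is sketched below: request recovery through the options mask and still guard the call with a rescue. The helper parse_leniently is hypothetical. The option names RECOVER, NOERROR and NOWARNING are assumed to exist on XML::HTMLParser::Options, and the rescued class LibXML::XML::Error is an assumption about what rxml_raise produces; adjust it if your version raises XML::Parser::ParseError as documented.

```ruby
# Load the binding if it is installed; the guard is only for portability
# of this sketch.
begin
  require 'libxml-ruby'
rescue LoadError
  nil
end

# Hypothetical helper: parses +html+ in recovery mode and returns the
# document, or nil if parsing still fails or the gem is unavailable.
def parse_leniently(html)
  return nil unless defined?(LibXML::XML::HTMLParser)

  opts  = LibXML::XML::HTMLParser::Options
  flags = opts::RECOVER | opts::NOERROR | opts::NOWARNING # assumed names
  LibXML::XML::HTMLParser.string(html, options: flags).parse
rescue LibXML::XML::Error => e # assumed error class
  warn "parse failed: #{e.message}"
  nil
end

doc = parse_leniently('<p>unclosed <b>bold')
puts doc.nil? ? 'no document' : 'parsed'
```

Returning nil on failure is a choice made for this sketch; callers that need the error details may prefer to let the exception propagate.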