Class: REXML::Parsers::BaseParser

Relationships & Source Files
Namespace Children
Modules: `Private`
Inherits:	Object
Defined in:	lib/rexml/parsers/baseparser.rb

Overview

Using the Pull Parser

This API is experimental, and subject to change.

parser = PullParser.new( "<a>text<b att='val'/>txet</a>" )
while parser.has_next?
  res = parser.next
  puts res[1]['att'] if res.start_tag? and res[0] == 'b'
end

See the PullEvent class for information on the content of the results. The data is identical to the arguments passed for the various events to the ::REXML::StreamListener API.

Notice that:

parser = PullParser.new( "<a>BAD DOCUMENT" )
while parser.has_next?
  res = parser.next
  raise res[1] if res.error?
end

Nat Price gave me some good ideas for the API.

Constant Summary

ATTDEF =
# File 'lib/rexml/parsers/baseparser.rb', line 108
```
"\\s#{NAME}\\s#{ATTTYPE}\\s+#{DEFAULTDECL}"
```
ATTDEF_RE =
# File 'lib/rexml/parsers/baseparser.rb', line 109
```
/#{ATTDEF}/
```
ATTLISTDECL_PATTERN =
# File 'lib/rexml/parsers/baseparser.rb', line 111
```
/\A\s*<!ATTLIST\s+#{NAME}(?:#{ATTDEF})*\s*>/um
```
ATTLISTDECL_START =
# File 'lib/rexml/parsers/baseparser.rb', line 110
```
/\A\s*<!ATTLIST/um
```
ATTRIBUTE_PATTERN =
# File 'lib/rexml/parsers/baseparser.rb', line 81
```
/\s*(#{QNAME_STR})\s*=\s*(["'])(.*?)\4/um
```

ATTTYPE =
# File 'lib/rexml/parsers/baseparser.rb', line 105

"(CDATA|ID|IDREF|IDREFS|ENTITY|ENTITIES|NMTOKEN|NMTOKENS|#{ENUMERATEDTYPE})"

ATTVALUE =
# File 'lib/rexml/parsers/baseparser.rb', line 106

"(?:\"((?:[^<&\"]|#{REFERENCE})*)\")|(?:'((?:[^<&']|#{REFERENCE})*)')"

CDATA_END =
# File 'lib/rexml/parsers/baseparser.rb', line 85
```
/\A\s*\]\s*>/um
```
CDATA_PATTERN =
# File 'lib/rexml/parsers/baseparser.rb', line 86
```
/<!\[CDATA\[(.*?)\]\]>/um
```
CDATA_START =
# File 'lib/rexml/parsers/baseparser.rb', line 84
```
/\A<!\[CDATA\[/u
```
CLOSE_MATCH =
# File 'lib/rexml/parsers/baseparser.rb', line 92
```
/\A\s*<\/(#{QNAME_STR})\s*>/um
```
COMBININGCHAR =

TODO

# File 'lib/rexml/parsers/baseparser.rb', line 61
```
''
```
COMMENT_PATTERN =
# File 'lib/rexml/parsers/baseparser.rb', line 83
```
//um
```
COMMENT_START =
# File 'lib/rexml/parsers/baseparser.rb', line 82
```
/\A<!--/u
```
DEFAULTDECL =
# File 'lib/rexml/parsers/baseparser.rb', line 107
```
"(#REQUIRED|#IMPLIED|(?:(#FIXED\\s+)?#{ATTVALUE}))"
```

DEFAULT_ENTITIES =
# File 'lib/rexml/parsers/baseparser.rb', line 136

{
  'gt' => [/&gt;/, '&gt;', '>', />/],
  'lt' => [/&lt;/, '&lt;', '<', /</],
  'quot' => [/&quot;/, '&quot;', '"', /"/],
  "apos" => [/&apos;/, "&apos;", "'", /'/]
}

DIGIT =
# File 'lib/rexml/parsers/baseparser.rb', line 59
```
'[:digit:]'
```
DOCTYPE_END =
# File 'lib/rexml/parsers/baseparser.rb', line 80
```
/\A\s*\]\s*>/um
```
DOCTYPE_START =
# File 'lib/rexml/parsers/baseparser.rb', line 79
```
/\A\s*<!DOCTYPE\s/um
```
ELEMENTDECL_PATTERN =
# File 'lib/rexml/parsers/baseparser.rb', line 100
```
/\A\s*(<!ELEMENT.*?)>/um
```
ELEMENTDECL_START =
# File 'lib/rexml/parsers/baseparser.rb', line 99
```
/\A\s*<!ELEMENT/um
```
ENCODING =
# File 'lib/rexml/parsers/baseparser.rb', line 95
```
/\bencoding\s*=\s*["'](.*?)['"]/um
```
ENTITYDECL =
# File 'lib/rexml/parsers/baseparser.rb', line 127
```
/\s*(?:#{GEDECL})|\s*(?:#{PEDECL})/um
```
ENTITYDEF =
# File 'lib/rexml/parsers/baseparser.rb', line 124
```
"(?:#{ENTITYVALUE}|(?:#{EXTERNALID}(#{NDATADECL})?))"
```

ENTITYVALUE =
# File 'lib/rexml/parsers/baseparser.rb', line 122

%Q{((?:"(?:[^%&"]|#{PEREFERENCE}|#{REFERENCE})*")|(?:'([^%&']|#{PEREFERENCE}|#{REFERENCE})*'))}

ENTITY_START =
# File 'lib/rexml/parsers/baseparser.rb', line 98
```
/\A\s*<!ENTITY/
```
ENUMERATEDTYPE =
# File 'lib/rexml/parsers/baseparser.rb', line 104
```
"(?:(?:#{NOTATIONTYPE})|(?:#{ENUMERATION}))"
```
ENUMERATION =
# File 'lib/rexml/parsers/baseparser.rb', line 102
```
"\$\\s*#{NMTOKEN}(?:\\s*\\|\\s*#{NMTOKEN})*\\s*\$"
```
EREFERENCE =
# File 'lib/rexml/parsers/baseparser.rb', line 134
```
/&(?!#{NAME};)/
```
EXTENDER =

TODO

# File 'lib/rexml/parsers/baseparser.rb', line 62
```
''
```

EXTERNALID =
# File 'lib/rexml/parsers/baseparser.rb', line 119

"(?:(?:(SYSTEM)\\s#{SYSTEMLITERAL})|(?:(PUBLIC)\\s#{PUBIDLITERAL}\\s+#{SYSTEMLITERAL}))"

EXTERNAL_ID_PUBLIC =
# File 'lib/rexml/parsers/baseparser.rb', line 130
```
/\A\s*PUBLIC\s#{PUBIDLITERAL}\s#{SYSTEMLITERAL}\s*/um
```
EXTERNAL_ID_SYSTEM =
# File 'lib/rexml/parsers/baseparser.rb', line 131
```
/\A\s*SYSTEM\s+#{SYSTEMLITERAL}\s*/um
```
GEDECL =
# File 'lib/rexml/parsers/baseparser.rb', line 126
```
"<!ENTITY\\s#{NAME}\\s#{ENTITYDEF}\\s*>"
```
INSTRUCTION_PATTERN =
# File 'lib/rexml/parsers/baseparser.rb', line 90
```
/<\?#{NAME}(\s+.*?)?\?>/um
```
INSTRUCTION_START =
# File 'lib/rexml/parsers/baseparser.rb', line 89
```
/\A<\?/u
```
LETTER =
# File 'lib/rexml/parsers/baseparser.rb', line 58
```
'[:alpha:]'
```
NAME =
# File 'lib/rexml/parsers/baseparser.rb', line 73
```
"([\\w:]#{NAMECHAR}*)"
```
NAMECHAR =
# File 'lib/rexml/parsers/baseparser.rb', line 72
```
'[\-\w\.:]'
```
NCNAME_STR =
# File 'lib/rexml/parsers/baseparser.rb', line 64
```
"[#{LETTER}_][-[:alnum:]._#{COMBININGCHAR}#{EXTENDER}]*"
```
NDATADECL =
# File 'lib/rexml/parsers/baseparser.rb', line 120
```
"\\sNDATA\\s#{NAME}"
```
NMTOKEN =
# File 'lib/rexml/parsers/baseparser.rb', line 74
```
"(?:#{NAMECHAR})+"
```
NMTOKENS =
# File 'lib/rexml/parsers/baseparser.rb', line 75
```
"#{NMTOKEN}(\\s+#{NMTOKEN})*"
```
NOTATIONDECL_START =
# File 'lib/rexml/parsers/baseparser.rb', line 129
```
/\A\s*<!NOTATION/um
```
NOTATIONTYPE =
# File 'lib/rexml/parsers/baseparser.rb', line 103
```
"NOTATION\\s+\$\\s*#{NAME}(?:\\s*\\|\\s*#{NAME})*\\s*\$"
```
PEDECL =
# File 'lib/rexml/parsers/baseparser.rb', line 125
```
"<!ENTITY\\s(%)\\s#{NAME}\\s+#{PEDEF}\\s*>"
```
PEDEF =
# File 'lib/rexml/parsers/baseparser.rb', line 123
```
"(?:#{ENTITYVALUE}|#{EXTERNALID})"
```
PEREFERENCE =
# File 'lib/rexml/parsers/baseparser.rb', line 121
```
"%#{NAME};"
```
PUBIDCHAR =

::REXML::Entity constants

# File 'lib/rexml/parsers/baseparser.rb', line 116
```
"\x20\x0D\x0Aa-zA-Z0-9\\-()+,./:=?;!*@$_%#"
```
PUBIDLITERAL =
# File 'lib/rexml/parsers/baseparser.rb', line 118
```
%Q{("[#{PUBIDCHAR}']*"|'[#{PUBIDCHAR}]*')}
```
PUBLIC_ID =
# File 'lib/rexml/parsers/baseparser.rb', line 132
```
/\A\s*PUBLIC\s+#{PUBIDLITERAL}\s*/um
```
QNAME =
# File 'lib/rexml/parsers/baseparser.rb', line 66
```
/(#{QNAME_STR})/
```
QNAME_STR =
# File 'lib/rexml/parsers/baseparser.rb', line 65
```
"(?:(#{NCNAME_STR}):)?(#{NCNAME_STR})"
```
REFERENCE =
# File 'lib/rexml/parsers/baseparser.rb', line 76
```
"&(?:#{NAME};|#\\d;|#x[0-9a-fA-F];)"
```
REFERENCE_RE =
# File 'lib/rexml/parsers/baseparser.rb', line 77
```
/#{REFERENCE}/
```
STANDALONE =
# File 'lib/rexml/parsers/baseparser.rb', line 96
```
/\bstandalone\s*=\s*["'](.*?)['"]/um
```
SYSTEMENTITY =
# File 'lib/rexml/parsers/baseparser.rb', line 101
```
/\A\s*(%.*?;)\s*$/um
```
SYSTEMLITERAL =
# File 'lib/rexml/parsers/baseparser.rb', line 117
```
%Q{((?:"[^"]*")|(?:'[^']*'))}
```
TAG_MATCH =
# File 'lib/rexml/parsers/baseparser.rb', line 91
```
/\A<((?>#{QNAME_STR}))/um
```
TEXT_PATTERN =
# File 'lib/rexml/parsers/baseparser.rb', line 113
```
/\A([^<]*)/um
```
UNAME_STR =

Just for backward compatibility. For example, kramdown uses this. It’s not used in ::REXML.

# File 'lib/rexml/parsers/baseparser.rb', line 70
```
"(?:#{NCNAME_STR}:)?#{NCNAME_STR}"
```
VERSION =
# File 'lib/rexml/parsers/baseparser.rb', line 94
```
/\bversion\s*=\s*["'](.*?)['"]/um
```
XMLDECL_PATTERN =
# File 'lib/rexml/parsers/baseparser.rb', line 88
```
/<\?xml\s+(.*?)\?>/um
```
XMLDECL_START =
# File 'lib/rexml/parsers/baseparser.rb', line 87
```
/\A<\?xml\s/u
```

Class Method Summary

.new(source) ⇒ BaseParser constructor

Instance Attribute Summary

#empty? ⇒ Boolean readonly

Returns true if there are no more events.
#entity_expansion_count readonly
#entity_expansion_limit=(value) writeonly
#entity_expansion_text_limit=(value) writeonly
#has_next? ⇒ Boolean readonly

Returns true if there are more events.
#source readonly
#stream=(source) writeonly

Instance Method Summary

#add_listener(listener)
#entity(reference, entities)
#normalize(input, entities = nil, entity_filter = nil)

Escapes all possible entities.
#peek(depth = 0)

Peek at the depth event in the stack.
#position
#pull

Returns the next event.
#reset
#unnormalize(string, entities = nil, filter = nil)

Unescapes all possible entities.
#unshift(token)

Push an event back on the head of the stream.
#add_namespace(prefix, uri) private
#need_source_encoding_update?(xml_declaration_encoding) ⇒ Boolean private
#parse_attributes(prefixes) private
#parse_id(base_error_message, accept_external_id:, accept_public_id:) private
#parse_id_invalid_details(accept_external_id:, accept_public_id:) private
#parse_name(base_error_message) private
#pop_namespaces_restore private
#process_instruction private
#pull_event private
#push_namespaces_restore private
#record_entity_expansion(delta = 1) private
#scan_quote private

See additional method definition at line 773.

Constructor Details

.new(source) ⇒ `BaseParser`

[ GitHub ]

# File 'lib/rexml/parsers/baseparser.rb', line 163


def initialize( source )
  self.stream = source
  @listeners = []
  @prefixes = Set.new
  @entity_expansion_count = 0
  @entity_expansion_limit = Security.entity_expansion_limit
  @entity_expansion_text_limit = Security.entity_expansion_text_limit
  @source.ensure_buffer
end

Instance Attribute Details

#empty? ⇒ `Boolean` (readonly)

Returns true if there are no more events

[ GitHub ]

# File 'lib/rexml/parsers/baseparser.rb', line 208


def empty?
  return (@source.empty? and @stack.empty?)
end

#entity_expansion_count (readonly)

[ GitHub ]

# File 'lib/rexml/parsers/baseparser.rb', line 178


attr_reader :entity_expansion_count

#entity_expansion_limit=(value) (writeonly)

[ GitHub ]

# File 'lib/rexml/parsers/baseparser.rb', line 179


attr_writer :entity_expansion_limit

#entity_expansion_text_limit=(value) (writeonly)

[ GitHub ]

# File 'lib/rexml/parsers/baseparser.rb', line 180


attr_writer :entity_expansion_text_limit

#has_next? ⇒ `Boolean` (readonly)

Returns true if there are more events. Synonymous with !empty?

[ GitHub ]

# File 'lib/rexml/parsers/baseparser.rb', line 213


def has_next?
  return !(@source.empty? and @stack.empty?)
end

#source (readonly)

[ GitHub ]

# File 'lib/rexml/parsers/baseparser.rb', line 177


attr_reader :source

#stream=(source) (writeonly)

[ GitHub ]

# File 'lib/rexml/parsers/baseparser.rb', line 182


def stream=( source )
  @source = SourceFactory.create_from( source )
  reset
end

Instance Method Details

#add_listener(listener)

[ GitHub ]

# File 'lib/rexml/parsers/baseparser.rb', line 173


def add_listener( listener )
  @listeners << listener
end

#add_namespace(prefix, uri) (private)

[ GitHub ]

# File 'lib/rexml/parsers/baseparser.rb', line 619


def add_namespace(prefix, uri)
  @namespaces_restore_stack.last[prefix] = @namespaces[prefix]
  if uri.nil?
    @namespaces.delete(prefix)
  else
    @namespaces[prefix] = uri
  end
end

#entity(reference, entities)

[ GitHub ]

# File 'lib/rexml/parsers/baseparser.rb', line 543


def entity( reference, entities )
  return unless entities

  value = entities[ reference ]
  return if value.nil?

  record_entity_expansion
  unnormalize( value, entities )
end

#need_source_encoding_update?(xml_declaration_encoding) ⇒ `Boolean` (private)

[ GitHub ]

# File 'lib/rexml/parsers/baseparser.rb', line 652


def need_source_encoding_update?(xml_declaration_encoding)
  return false if xml_declaration_encoding.nil?
  return false if /\AUTF-16\z/i =~ xml_declaration_encoding
  true
end

#normalize(input, entities = nil, entity_filter = nil)

Escapes all possible entities

[ GitHub ]

# File 'lib/rexml/parsers/baseparser.rb', line 554


def normalize( input, entities=nil, entity_filter=nil )
  copy = input.clone
  # Doing it like this rather than in a loop improves the speed
  copy.gsub!( EREFERENCE, '&amp;' )
  entities.each do |key, value|
    copy.gsub!( value, "&#{key};" ) unless entity_filter and
                                entity_filter.include?(entity)
  end if entities
  copy.gsub!( EREFERENCE, '&amp;' )
  DEFAULT_ENTITIES.each do |key, value|
    copy.gsub!( value[3], value[1] )
  end
  copy
end

#parse_attributes(prefixes) (private)

[ GitHub ]

# File 'lib/rexml/parsers/baseparser.rb', line 791


def parse_attributes(prefixes)
  attributes = {}
  expanded_names = {}
  closed = false
  while true
    if @source.match?(">", true)
      return attributes, closed
    elsif @source.match?("/>", true)
      closed = true
      return attributes, closed
    elsif match = @source.match(QNAME, true)
      name = match[1]
      prefix = match[2]
      local_part = match[3]

      unless @source.match?(/\s*=\s*/um, true)
        message = "Missing attribute equal: <#{name}>"
        raise REXML::ParseException.new(message, @source)
      end
      unless quote = scan_quote
        message = "Missing attribute value start quote: <#{name}>"
        raise REXML::ParseException.new(message, @source)
      end
      start_position = @source.position
      value = @source.read_until(quote)
      unless value.chomp!(quote)
        @source.position = start_position
        message = "Missing attribute value end quote: <#{name}>: <#{quote}>"
        raise REXML::ParseException.new(message, @source)
      end
      @source.match?(/\s*/um, true)
      if prefix == "xmlns"
        if local_part == "xml"
          if value != Private::XML_PREFIXED_NAMESPACE
            msg = "The 'xml' prefix must not be bound to any other namespace "+
              "(http://www.w3.org/TR/REC-xml-names/#ns-decl)"
            raise REXML::ParseException.new( msg, @source, self )
          end
        elsif local_part == "xmlns"
          msg = "The 'xmlns' prefix must not be declared "+
            "(http://www.w3.org/TR/REC-xml-names/#ns-decl)"
          raise REXML::ParseException.new( msg, @source, self)
        end
        add_namespace(local_part, value)
      elsif prefix
        prefixes << prefix unless prefix == "xml"
      end

      if attributes[name]
        msg = "Duplicate attribute #{name.inspect}"
        raise REXML::ParseException.new(msg, @source, self)
      end

      unless prefix == "xmlns"
        uri = @namespaces[prefix]
        expanded_name = [uri, local_part]
        existing_prefix = expanded_names[expanded_name]
        if existing_prefix
          message = "Namespace conflict in adding attribute " +
                    "\"#{local_part}\": " +
                    "Prefix \"#{existing_prefix}\" = \"#{uri}\" and " +
                    "prefix \"#{prefix}\" = \"#{uri}\""
          raise REXML::ParseException.new(message, @source, self)
        end
        expanded_names[expanded_name] = prefix
      end

      attributes[name] = value
    else
      message = "Invalid attribute name: <#{@source.buffer.split(%r{[/>\s]}).first}>"
      raise REXML::ParseException.new(message, @source)
    end
  end
end

#parse_id(base_error_message, accept_external_id:, accept_public_id:) (private)

[ GitHub ]

# File 'lib/rexml/parsers/baseparser.rb', line 671


def parse_id(base_error_message,
             accept_external_id:,
             accept_public_id:)
  if accept_external_id and (md = @source.match(EXTERNAL_ID_PUBLIC, true))
    pubid = system = nil
    pubid_literal = md[1]
    pubid = pubid_literal[1..-2] if pubid_literal # Remove quote
    system_literal = md[2]
    system = system_literal[1..-2] if system_literal # Remove quote
    ["PUBLIC", pubid, system]
  elsif accept_public_id and (md = @source.match(PUBLIC_ID, true))
    pubid = system = nil
    pubid_literal = md[1]
    pubid = pubid_literal[1..-2] if pubid_literal # Remove quote
    ["PUBLIC", pubid, nil]
  elsif accept_external_id and (md = @source.match(EXTERNAL_ID_SYSTEM, true))
    system = nil
    system_literal = md[1]
    system = system_literal[1..-2] if system_literal # Remove quote
    ["SYSTEM", nil, system]
  else
    details = parse_id_invalid_details(accept_external_id: accept_external_id,
                                       accept_public_id: accept_public_id)
    message = "#{base_error_message}: #{details}"
    raise REXML::ParseException.new(message, @source)
  end
end

#parse_id_invalid_details(accept_external_id:, accept_public_id:) (private)

[ GitHub ]

# File 'lib/rexml/parsers/baseparser.rb', line 699


def parse_id_invalid_details(accept_external_id:,
                             accept_public_id:)
  public = /\A\s*PUBLIC/um
  system = /\A\s*SYSTEM/um
  if (accept_external_id or accept_public_id) and @source.match?(/#{public}/um)
    if @source.match?(/#{public}(?:\s+[^'"]|\s*[\[>])/um)
      return "public ID literal is missing"
    end
    unless @source.match?(/#{public}\s+#{PUBIDLITERAL}/um)
      return "invalid public ID literal"
    end
    if accept_public_id
      if @source.match?(/#{public}\s#{PUBIDLITERAL}\s[^'"]/um)
        return "system ID literal is missing"
      end
      unless @source.match?(/#{public}\s#{PUBIDLITERAL}\s#{SYSTEMLITERAL}/um)
        return "invalid system literal"
      end
      "garbage after system literal"
    else
      "garbage after public ID literal"
    end
  elsif accept_external_id and @source.match?(/#{system}/um)
    if @source.match?(/#{system}(?:\s+[^'"]|\s*[\[>])/um)
      return "system literal is missing"
    end
    unless @source.match?(/#{system}\s+#{SYSTEMLITERAL}/um)
      return "invalid system literal"
    end
    "garbage after system literal"
  else
    unless @source.match?(/\A\s*(?:PUBLIC|SYSTEM)\s/um)
      return "invalid ID type"
    end
    "ID type is missing"
  end
end

#parse_name(base_error_message) (private)

[ GitHub ]

# File 'lib/rexml/parsers/baseparser.rb', line 658


def parse_name(base_error_message)
  md = @source.match(Private::NAME_PATTERN, true)
  unless md
    if @source.match?(/\S/um)
      message = "#{base_error_message}: invalid name"
    else
      message = "#{base_error_message}: name is missing"
    end
    raise REXML::ParseException.new(message, @source)
  end
  md[0]
end

#peek(depth = 0)

Peek at the depth event in the stack. The first element on the stack is at depth 0. If depth is -1, will parse to the end of the input stream and return the last event, which is always :end_document. Be aware that this causes the stream to be parsed up to the depth event, so you can effectively pre-parse the entire document (pull the entire thing into memory) using this method.

[ GitHub ]

# File 'lib/rexml/parsers/baseparser.rb', line 229


def peek depth=0
  raise %Q[Illegal argument "#{depth}"] if depth < -1
  temp = []
  if depth == -1
    temp.push(pull()) until empty?
  else
    while @stack.size+temp.size < depth+1
      temp.push(pull())
    end
  end
  @stack += temp if temp.size > 0
  @stack[depth]
end

#pop_namespaces_restore (private)

[ GitHub ]

# File 'lib/rexml/parsers/baseparser.rb', line 634


def pop_namespaces_restore
  namespaces_restore = @namespaces_restore_stack.pop
  namespaces_restore.each do |prefix, uri|
    if uri.nil?
      @namespaces.delete(prefix)
    else
      @namespaces[prefix] = uri
    end
  end
end

#position

[ GitHub ]

# File 'lib/rexml/parsers/baseparser.rb', line 198


def position
  if @source.respond_to? :position
    @source.position
  else
    # FIXME
    0
  end
end

#process_instruction (private)

[ GitHub ]

# File 'lib/rexml/parsers/baseparser.rb', line 737


def process_instruction
  name = parse_name("Malformed XML: Invalid processing instruction node")
  if @source.match?(/\s+/um, true)
    match_data = @source.match(/(.*?)\?>/um, true)
    unless match_data
      raise ParseException.new("Malformed XML: Unclosed processing instruction", @source)
    end
    content = match_data[1]
  else
    content = nil
    unless @source.match?("?>", true)
      raise ParseException.new("Malformed XML: Unclosed processing instruction", @source)
    end
  end
  if name == "xml"
    if @document_status
      raise ParseException.new("Malformed XML: XML declaration is not at the start", @source)
    end
    version = VERSION.match(content)
    version = version[1] unless version.nil?
    encoding = ENCODING.match(content)
    encoding = encoding[1] unless encoding.nil?
    if need_source_encoding_update?(encoding)
      @source.encoding = encoding
    end
    if encoding.nil? and /\AUTF-16(?:BE|LE)\z/i =~ @source.encoding
      encoding = "UTF-16"
    end
    standalone = STANDALONE.match(content)
    standalone = standalone[1] unless standalone.nil?
    return [ :xmldecl, version, encoding, standalone ]
  end
  [:processing_instruction, name, content]
end

#pull

Returns the next event. This is a PullEvent object.

[ GitHub ]

# File 'lib/rexml/parsers/baseparser.rb', line 244


def pull
  @source.drop_parsed_content

  pull_event.tap do |event|
    @listeners.each do |listener|
      listener.receive event
    end
  end
end

#pull_event (private)

[ GitHub ]

# File 'lib/rexml/parsers/baseparser.rb', line 254


def pull_event
  if @closed
    x, @closed = @closed, nil
    return [ :end_element, x ]
  end
  if empty?
    if @document_status == :in_doctype
      raise ParseException.new("Malformed DOCTYPE: unclosed", @source)
    end
    unless @tags.empty?
      path = "/" + @tags.join("/")
      raise ParseException.new("Missing end tag for '#{path}'", @source)
    end
    return [ :end_document ]
  end
  return @stack.shift if @stack.size > 0
  #STDERR.puts @source.encoding
  #STDERR.puts "BUFFER = #{@source.buffer.inspect}"

  @source.ensure_buffer
  if @document_status == nil
    start_position = @source.position
    if @source.match?("<?", true)
      return process_instruction
    elsif @source.match?("<!", true)
      if @source.match?("--", true)
        md = @source.match(/(.*?)-->/um, true)
        if md.nil?
          raise REXML::ParseException.new("Unclosed comment", @source)
        end
        if /--|-\z/.match?(md[1])
          raise REXML::ParseException.new("Malformed comment", @source)
        end
        return [ :comment, md[1] ]
      elsif @source.match?("DOCTYPE", true)
        base_error_message = "Malformed DOCTYPE"
        unless @source.match?(/\s+/um, true)
          if @source.match?(">")
            message = "#{base_error_message}: name is missing"
          else
            message = "#{base_error_message}: invalid name"
          end
          @source.position = start_position
          raise REXML::ParseException.new(message, @source)
        end
        name = parse_name(base_error_message)
        @source.match?(/\s*/um, true) # skip spaces
        if @source.match?("[", true)
          id = [nil, nil, nil]
          @document_status = :in_doctype
        elsif @source.match?(">", true)
          id = [nil, nil, nil]
          @document_status = :after_doctype
          @source.ensure_buffer
        else
          id = parse_id(base_error_message,
                        accept_external_id: true,
                        accept_public_id: false)
          if id[0] == "SYSTEM"
            # For backward compatibility
            id[1], id[2] = id[2], nil
          end
          @source.match?(/\s*/um, true) # skip spaces
          if @source.match?("[", true)
            @document_status = :in_doctype
          elsif @source.match?(">", true)
            @document_status = :after_doctype
            @source.ensure_buffer
          else
            message = "#{base_error_message}: garbage after external ID"
            raise REXML::ParseException.new(message, @source)
          end
        end
        args = [:start_doctype, name, *id]
        if @document_status == :after_doctype
          @source.match?(/\s*/um, true)
          @stack << [ :end_doctype ]
        end
        return args
      else
        message = "Invalid XML"
        raise REXML::ParseException.new(message, @source)
      end
    end
  end
  if @document_status == :in_doctype
    @source.match?(/\s*/um, true) # skip spaces
    start_position = @source.position
    if @source.match?("<!", true)
      if @source.match?("ELEMENT", true)
        md = @source.match(/(.*?)>/um, true)
        raise REXML::ParseException.new( "Bad ELEMENT declaration!", @source ) if md.nil?
        return [ :elementdecl, "<!ELEMENT" + md[1] ]
      elsif @source.match?("ENTITY", true)
        match_data = @source.match(Private::ENTITYDECL_PATTERN, true)
        unless match_data
          raise REXML::ParseException.new("Malformed entity declaration", @source)
        end
        match = [:entitydecl, *match_data.captures.compact]
        ref = false
        if match[1] == '%'
          ref = true
          match.delete_at 1
        end
        # Now we have to sort out what kind of entity reference this is
        if match[2] == 'SYSTEM'
          # External reference
          match[3] = match[3][1..-2] # PUBID
          match.delete_at(4) if match.size > 4 # Chop out NDATA decl
          # match is [ :entity, name, SYSTEM, pubid(, ndata)? ]
        elsif match[2] == 'PUBLIC'
          # External reference
          match[3] = match[3][1..-2] # PUBID
          match[4] = match[4][1..-2] # HREF
          match.delete_at(5) if match.size > 5 # Chop out NDATA decl
          # match is [ :entity, name, PUBLIC, pubid, href(, ndata)? ]
        elsif Private::PEREFERENCE_PATTERN.match?(match[2])
          raise REXML::ParseException.new("Parameter entity references forbidden in internal subset: #{match[2]}", @source)
        else
          match[2] = match[2][1..-2]
          match.pop if match.size == 4
          # match is [ :entity, name, value ]
        end
        match << '%' if ref
        return match
      elsif @source.match?("ATTLIST", true)
        md = @source.match(Private::ATTLISTDECL_END, true)
        raise REXML::ParseException.new( "Bad ATTLIST declaration!", @source ) if md.nil?
        element = md[1]
        contents = "<!ATTLIST" + md[0]

        pairs = {}
        values = md[0].strip.scan( ATTDEF_RE )
        values.each do |attdef|
          unless attdef[3] == "#IMPLIED"
            attdef.compact!
            val = attdef[3]
            val = attdef[4] if val == "#FIXED "
            pairs[attdef[0]] = val
            if attdef[0] =~ /^xmlns:(.*)/
              @namespaces[$1] = val
            end
          end
        end
        return [ :attlistdecl, element, pairs, contents ]
      elsif @source.match?("NOTATION", true)
        base_error_message = "Malformed notation declaration"
        unless @source.match?(/\s+/um, true)
          if @source.match?(">")
            message = "#{base_error_message}: name is missing"
          else
            message = "#{base_error_message}: invalid name"
          end
          @source.position = start_position
          raise REXML::ParseException.new(message, @source)
        end
        name = parse_name(base_error_message)
        id = parse_id(base_error_message,
                      accept_external_id: true,
                      accept_public_id: true)
        @source.match?(/\s*/um, true) # skip spaces
        unless @source.match?(">", true)
          message = "#{base_error_message}: garbage before end >"
          raise REXML::ParseException.new(message, @source)
        end
        return [:notationdecl, name, *id]
      elsif md = @source.match(/--(.*?)-->/um, true)
        case md[1]
        when /--/, /-\z/
          raise REXML::ParseException.new("Malformed comment", @source)
        end
        return [ :comment, md[1] ] if md
      end
    elsif match = @source.match(/(%.*?;)\s*/um, true)
      return [ :externalentity, match[1] ]
    elsif @source.match?(/\]\s*>/um, true)
      @document_status = :after_doctype
      return [ :end_doctype ]
    end
    if @document_status == :in_doctype
      raise ParseException.new("Malformed DOCTYPE: invalid declaration", @source)
    end
  end
  if @document_status == :after_doctype
    @source.match?(/\s*/um, true)
  end
  begin
    start_position = @source.position
    if @source.match?("<", true)
      # :text's read_until may remain only "<" in buffer. In the
      # case, buffer is empty here. So we need to fill buffer
      # here explicitly.
      @source.ensure_buffer
      if @source.match?("/", true)
        @namespaces_restore_stack.pop
        last_tag = @tags.pop
        md = @source.match(Private::CLOSE_PATTERN, true)
        if md and !last_tag
          message = "Unexpected top-level end tag (got '#{md[1]}')"
          raise REXML::ParseException.new(message, @source)
        end
        if md.nil? or last_tag != md[1]
          message = "Missing end tag for '#{last_tag}'"
          message += " (got '#{md[1]}')" if md
          @source.position = start_position if md.nil?
          raise REXML::ParseException.new(message, @source)
        end
        return [ :end_element, last_tag ]
      elsif @source.match?("!", true)
        md = @source.match(/([^>]*>)/um)
        #STDERR.puts "SOURCE BUFFER = #{source.buffer}, #{source.buffer.size}"
        raise REXML::ParseException.new("Malformed node", @source) unless md
        if md[0][0] == ?-
          md = @source.match(/--(.*?)-->/um, true)

          if md.nil? || /--|-\z/.match?(md[1])
            raise REXML::ParseException.new("Malformed comment", @source)
          end

          return [ :comment, md[1] ]
        else
          md = @source.match(/\[CDATA\[(.*?)\]\]>/um, true)
          return [ :cdata, md[1] ] if md
        end
        raise REXML::ParseException.new( "Declarations can only occur "+
          "in the doctype declaration.", @source)
      elsif @source.match?("?", true)
        return process_instruction
      else
        # Get the next tag
        md = @source.match(Private::TAG_PATTERN, true)
        unless md
          @source.position = start_position
          raise REXML::ParseException.new("malformed XML: missing tag start", @source)
        end
        tag = md[1]
        @document_status = :in_element
        @prefixes.clear
        @prefixes << md[2] if md[2]
        push_namespaces_restore
        attributes, closed = parse_attributes(@prefixes)
        # Verify that all of the prefixes have been defined
        for prefix in @prefixes
          unless @namespaces.key?(prefix)
            raise UndefinedNamespaceException.new(prefix,@source,self)
          end
        end

        if closed
          @closed = tag
          pop_namespaces_restore
        else
          if @tags.empty? and @have_root
            raise ParseException.new("Malformed XML: Extra tag at the end of the document (got '<#{tag}')", @source)
          end
          @tags.push( tag )
        end
        @have_root = true
        return [ :start_element, tag, attributes ]
      end
    else
      text = @source.read_until("<")
      if text.chomp!("<")
        @source.position -= "<".bytesize
      end
      if @tags.empty?
        unless /\A\s*\z/.match?(text)
          if @have_root
            raise ParseException.new("Malformed XML: Extra content at the end of the document (got '#{text}')", @source)
          else
            raise ParseException.new("Malformed XML: Content at the start of the document (got '#{text}')", @source)
          end
        end
        return pull_event if @have_root
      end
      return [ :text, text ]
    end
  rescue REXML::UndefinedNamespaceException
    raise
  rescue REXML::ParseException
    raise
  rescue => error
    raise REXML::ParseException.new( "Exception parsing",
      @source, self, (error ? error : $!) )
  end
  return [ :dummy ]
end

#push_namespaces_restore (private)

[ GitHub ]

# File 'lib/rexml/parsers/baseparser.rb', line 628


def push_namespaces_restore
  namespaces_restore = {}
  @namespaces_restore_stack.push(namespaces_restore)
  namespaces_restore
end

#record_entity_expansion(delta = 1) (private)

[ GitHub ]

# File 'lib/rexml/parsers/baseparser.rb', line 645


def record_entity_expansion(delta=1)
  @entity_expansion_count += delta
  if @entity_expansion_count > @entity_expansion_limit
    raise "number of entity expansions exceeded, processing aborted."
  end
end

#reset

[ GitHub ]

# File 'lib/rexml/parsers/baseparser.rb', line 187


def reset
  @closed = nil
  @have_root = false
  @document_status = nil
  @tags = []
  @stack = []
  @entities = []
  @namespaces = {"xml" => Private::XML_PREFIXED_NAMESPACE}
  @namespaces_restore_stack = []
end

#scan_quote (private)

See additional method definition at line 773.

[ GitHub ]

# File 'lib/rexml/parsers/baseparser.rb', line 777


def scan_quote
  @source.match(/(['"])/, true)&.[](1)
end

#unnormalize(string, entities = nil, filter = nil)

Unescapes all possible entities

[ GitHub ]

# File 'lib/rexml/parsers/baseparser.rb', line 570


def unnormalize( string, entities=nil, filter=nil )
  if string.include?("\r")
    rv = string.gsub( Private::CARRIAGE_RETURN_NEWLINE_PATTERN, "\n" )
  else
    rv = string.dup
  end
  matches = rv.scan( REFERENCE_RE )
  return rv if matches.size == 0
  rv.gsub!( Private::CHARACTER_REFERENCES ) {
    m=$1
    if m.start_with?("x")
      code_point = Integer(m[1..-1], 16)
    else
      code_point = Integer(m, 10)
    end
    [code_point].pack('U*')
  }
  matches.collect!{|x|x[0]}.compact!
  if filter
    matches.reject! do |entity_reference|
      filter.include?(entity_reference)
    end
  end
  if matches.size > 0
    matches.tally.each do |entity_reference, n|
      entity_expansion_count_before = @entity_expansion_count
      entity_value = entity( entity_reference, entities )
      if entity_value
        if n > 1
          entity_expansion_count_delta =
            @entity_expansion_count - entity_expansion_count_before
          record_entity_expansion(entity_expansion_count_delta * (n - 1))
        end
        re = Private::DEFAULT_ENTITIES_PATTERNS[entity_reference] || /&#{entity_reference};/
        rv.gsub!( re, entity_value )
        if rv.bytesize > @entity_expansion_text_limit
          raise "entity expansion has grown too large"
        end
      else
        er = DEFAULT_ENTITIES[entity_reference]
        rv.gsub!( er[0], er[2] ) if er
      end
    end
    rv.gsub!( Private::DEFAULT_ENTITIES_PATTERNS['amp'], '&' )
  end
  rv
end

#unshift(token)

Push an event back on the head of the stream. This method has (theoretically) infinite depth.

[ GitHub ]

# File 'lib/rexml/parsers/baseparser.rb', line 219


def unshift token
  @stack.unshift(token)
end

Class: REXML::Parsers::BaseParser

Overview

Using the Pull Parser

Constant Summary

Class Method Summary

Instance Attribute Summary

Instance Method Summary

Constructor Details

.new(source) ⇒ BaseParser

Instance Attribute Details

#empty? ⇒ Boolean (readonly)

#entity_expansion_count (readonly)

#entity_expansion_limit=(value) (writeonly)

#entity_expansion_text_limit=(value) (writeonly)

#has_next? ⇒ Boolean (readonly)

#source (readonly)

#stream=(source) (writeonly)

Instance Method Details

#add_listener(listener)

#add_namespace(prefix, uri) (private)

#entity(reference, entities)

#need_source_encoding_update?(xml_declaration_encoding) ⇒ Boolean (private)

#normalize(input, entities = nil, entity_filter = nil)

#parse_attributes(prefixes) (private)

#parse_id(base_error_message, accept_external_id:, accept_public_id:) (private)

#parse_id_invalid_details(accept_external_id:, accept_public_id:) (private)

#parse_name(base_error_message) (private)

#peek(depth = 0)

#pop_namespaces_restore (private)

#position

#process_instruction (private)

#pull

#pull_event (private)

#push_namespaces_restore (private)

#record_entity_expansion(delta = 1) (private)

#reset

#scan_quote (private)

#unnormalize(string, entities = nil, filter = nil)

#unshift(token)

.new(source) ⇒ `BaseParser`

#empty? ⇒ `Boolean` (readonly)

#has_next? ⇒ `Boolean` (readonly)

#need_source_encoding_update?(xml_declaration_encoding) ⇒ `Boolean` (private)