memo.xight.org

Daily Notes

Forcibly processing HTML that Mechanize cannot parse

Summary

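Mechanize uses Mechanize::Page to parse HTML responses (Content-Type: text/html).
However, if the HTML is not well-formed, parsing can fail.
Registering a Mechanize::File subclass as the parser for text/html skips HTML parsing and exposes the raw response body instead.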

Processing with Mechanize::Page

require 'mechanize'

# No custom parser is registered, so text/html responses go through the default Mechanize::Page.
agent = Mechanize.new
page = agent.get("http://ill-formed.com/")
p page


#<Mechanize::Page
 {url #<URI::HTTP http://ill-formed.com/>}
 {meta_refresh}
 {title "ill-formed.com"}
 {iframes}
 {frames}
 {links}
 {forms}>

Processing as plain text

require 'mechanize'

# Mechanize::File keeps the response body as-is and does not parse it as HTML.
class PlainFile < Mechanize::File; end

agent = Mechanize.new
agent.pluggable_parser['text/html'] = PlainFile
page = agent.get("http://ill-formed.com/")
p page


#<PlainFile:0x00007fb8d2077910
 @body=
  "<!doctype html>\n" +
  "<html>\n" +
  "<head>\n" +
  "    <title>ill-formed.com</title>\n" +
  "\n" +
  "    <meta charset=\"utf-8\" />\n" +
  "    <meta http-equiv=\"Content-type\" content=\"text/html; charset=utf-8\" />\n" +
  "    <meta name=\"viewport\" content=\"width=device-width, initial-scale=1\" />\n" +
  "\t<link rel=\"stylesheet\" type=\"text/css\" href=\"main.css\" />\n" +
  "</head>\n" +
  "\n" +
  "<body>\n" +
  "<div>\n" +
  "    <h1>ill-formed.com</h1>\n" +
  "\t<p>This domain is established to be used for ill-formed HTML test. <!-- </p> -->\n" +
  "<!-- </div> -->\n" +
  "<!-- </body> -->\n" +
  "<!-- </html> -->\n",
 @code="200",
 @filename="index.html",
 @full_path=false,
 @response=
  {"server"=>"GitHub.com",
   "date"=>"Thu, 05 Oct 2017 14:55:52 GMT",
   "content-type"=>"text/html; charset=utf-8",
   "transfer-encoding"=>"chunked",
   "last-modified"=>"Thu, 05 Oct 2017 14:55:35 GMT",
   "access-control-allow-origin"=>"*",
   "expires"=>"Thu, 05 Oct 2017 15:05:52 GMT",
   "cache-control"=>"max-age=600",
   "content-encoding"=>"gzip",
   "x-github-request-id"=>"F5F8:1F5B:54E8B4B:7C0A75B:59D647F8"},
 @uri=#<URI::HTTP http://ill-formed.com/>>
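Once the response is held as a PlainFile, the raw HTML is available through page.body and can be processed with ordinary string operations. The following is a minimal sketch of that idea, not a definitive implementation: extract_title is a hypothetical helper (it is not part of Mechanize), and the body string is a shortened stand-in for page.body so that the example runs without network access.

```ruby
# Hypothetical helper (not part of Mechanize): pull the <title> text out of
# raw HTML with a regular expression, without building a DOM.
# Returns nil when no <title> element is found.
def extract_title(html)
  match = html.match(%r{<title[^>]*>(.*?)</title>}im)
  match && match[1].strip
end

# Stand-in for page.body. With the PlainFile setup above, this string would
# come from agent.get(...).body instead.
body = <<~HTML
  <!doctype html>
  <html>
  <head>
      <title>ill-formed.com</title>
  </head>
  <body>
  <div>
      <h1>ill-formed.com</h1>
      <p>This domain is established to be used for ill-formed HTML test. <!-- </p> -->
  <!-- </div> -->
  <!-- </body> -->
  <!-- </html> -->
HTML

puts extract_title(body)
```

A regular expression is only suitable for simple extractions like this one. It works here because the unclosed tags do not affect the head section of the document.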