Summary
Mechanize uses Mechanize::Page to parse HTML responses (Content-Type: text/html). However, when the HTML is not well-formed, parsing fails.
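To make "not well-formed" concrete, here is a rough sketch that lists elements which are opened but never closed. It uses only the Ruby standard library; `VOID_ELEMENTS`, `unclosed_tags`, and the shortened `sample` markup are hypothetical names and data introduced for illustration, and the check is a crude heuristic rather than an HTML validator.

```ruby
# Hypothetical helper, not part of Mechanize.
# Elements that never take a closing tag (partial list, for illustration).
VOID_ELEMENTS = %w[meta link br img hr input].freeze

# Returns the names of elements that were opened but never closed.
def unclosed_tags(html)
  without_comments = html.gsub(/<!--.*?-->/m, "") # commented-out tags do not count
  stack = []
  without_comments.scan(%r{<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*?(/?)>}) do |closing, name, self_closing|
    name = name.downcase
    next if VOID_ELEMENTS.include?(name) || self_closing == "/"
    if closing == "/"
      # Pop back to the most recent matching open tag, if there is one.
      index = stack.rindex(name)
      stack.slice!(index..-1) if index
    else
      stack.push(name)
    end
  end
  stack
end

# Shortened markup with the same defect: the closing tags are commented out.
sample = <<~HTML
  <!doctype html>
  <html>
  <head>
    <title>ill-formed.com</title>
    <meta charset="utf-8" />
  </head>
  <body>
  <div>
    <h1>ill-formed.com</h1>
    <p>ill-formed HTML test. <!-- </p> -->
  <!-- </div> -->
  <!-- </body> -->
  <!-- </html> -->
HTML

p unclosed_tags(sample)
```

Because the closing tags for `p`, `div`, `body`, and `html` only appear inside comments, the helper reports all four as unclosed.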
Processing with Mechanize::Page
require 'mechanize'
agent = Mechanize.new
# No pluggable parser is registered, so text/html is handled by the default Mechanize::Page.
page = agent.get("http://example.com/")
p page
#<Mechanize::Page
{url #<URI::HTTP http://ill-formed.com/>}
{meta_refresh}
{title "ill-formed.com"}
{iframes}
{frames}
{links}
{forms}>
Processing as plain text
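require 'mechanize'
# Mechanize::File keeps the response body as a raw string and does not parse it.
class PlainFile < Mechanize::File; end
agent = Mechanize.new
# Route text/html responses to PlainFile instead of Mechanize::Page.
agent.pluggable_parser['text/html'] = PlainFile
page = agent.get("http://example.com/")
p page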
#<PlainFile:0x00007fb8d2077910
@body=
"<!doctype html>\n" +
"<html>\n" +
"<head>\n" +
" <title>ill-formed.com</title>\n" +
"\n" +
" <meta charset=\"utf-8\" />\n" +
" <meta http-equiv=\"Content-type\" content=\"text/html; charset=utf-8\" />\n" +
" <meta name=\"viewport\" content=\"width=device-width, initial-scale=1\" />\n" +
"\t<link rel=\"stylesheet\" type=\"text/css\" href=\"main.css\" />\n" +
"</head>\n" +
"\n" +
"<body>\n" +
"<div>\n" +
" <h1>ill-formed.com</h1>\n" +
"\t<p>This domain is established to be used for ill-formed HTML test. <!-- </p> -->\n" +
"<!-- </div> -->\n" +
"<!-- </body> -->\n" +
"<!-- </html> -->\n",
@code="200",
@filename="index.html",
@full_path=false,
@response=
{"server"=>"GitHub.com",
"date"=>"Thu, 05 Oct 2017 14:55:52 GMT",
"content-type"=>"text/html; charset=utf-8",
"transfer-encoding"=>"chunked",
"last-modified"=>"Thu, 05 Oct 2017 14:55:35 GMT",
"access-control-allow-origin"=>"*",
"expires"=>"Thu, 05 Oct 2017 15:05:52 GMT",
"cache-control"=>"max-age=600",
"content-encoding"=>"gzip",
"x-github-request-id"=>"F5F8:1F5B:54E8B4B:7C0A75B:59D647F8"},
@uri=#<URI::HTTP http://ill-formed.com/>>
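Once the response is held as a PlainFile, the raw HTML is available as a string through `page.body`, and any extraction has to be done by hand. The following is a minimal sketch of that idea using only the Ruby standard library. `extract_title` and `extract_hrefs` are hypothetical helpers, not Mechanize methods, and the regular expressions are assumptions that only suit simple markup like the body shown above.

```ruby
# Hypothetical helpers, not part of Mechanize. They work on a raw HTML string,
# such as the value returned by page.body when the page is a PlainFile.

# Returns the text of the first <title> element, or nil when there is none.
def extract_title(html)
  match = html.match(%r{<title[^>]*>(.*?)</title>}im)
  match && match[1].strip
end

# Returns every quoted href attribute value found in the string.
def extract_hrefs(html)
  html.scan(/href\s*=\s*["']([^"']*)["']/i).flatten
end

# A shortened copy of the response body, used here in place of page.body.
body = <<~HTML
  <!doctype html>
  <html>
  <head>
    <title>ill-formed.com</title>
    <link rel="stylesheet" type="text/css" href="main.css" />
  </head>
  <body>
  <div>
    <h1>ill-formed.com</h1>
    <p>This domain is established to be used for ill-formed HTML test. <!-- </p> -->
  <!-- </div> -->
HTML

puts extract_title(body)
p extract_hrefs(body)
```

Regular expressions do not depend on the tags being balanced, which is why this approach still works on the ill-formed body. It is not a substitute for a real HTML parser on more complex pages.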