Change Proposal to HTML WG to fix the algorithm for generating Atom feeds from HTML content

Discussion:

Julian Reschke

2010-04-06 21:17:22 UTC

FYI:

this relates to an HTML-WG discussion about the algorithm to create Atom
feeds from HTML (<http://dev.w3.org/html5/spec/Overview.html#atom>).

See <http://www.w3.org/Bugs/Public/show_bug.cgi?id=7806> and
<http://www.w3.org/html/wg/tracker/issues/86> for more context on how we
got here.

Best regards, Julian

Hi,
below is a change proposal for this issue.
Note that an obvious alternative to fixing the algorithm would be to
remove the section completely.
Best regards,
Julian
-- snip --
SUMMARY
The HTML5 spec contains an algorithm for producing an Atom (RFC4287)
feed document from an HTML page.
The definition both relaxes a MUST-level requirement from RFC4287 and
adds a needless restriction.
Also, it's not clear *at all* whether this is a feature that people
really want, and if they do, whether it needs to be part of HTML5. Given
the fact that it's non-trivial to generate a valid Atom feed from HTML,
but the reverse *is* trivial, we should also consider removing this
feature altogether (I'd be happy to write a 2nd change proposal if
people want to see that as well).
RATIONALE
Instructions to derive a secondary format from HTML documents shouldn't
be misleading, and also should make clear which conditions need to be
met to produce valid documents.
DETAILS
There are two problems, both with the following step (4.15.1, step 15.9):
"Otherwise
Let id be a user-agent-defined undereferenceable yet globally unique
valid absolute URL. The same absolute URL should be generated for each
run of this algorithm when given the same input. Let has-alternate be
false."
Problem #1: RFC 4287 does not require the ID to be undereferenceable.
This was a conscious decision of the IETF AtomPub WG. There's absolutely
no point in adding this requirement, except for the spec author's
distaste for URIs that are both dereferenceable *and* act as a globally
unique and stable identifier.
Note from RFC 4287, Section 4.2.6:
"...Though the IRI might use a dereferencable scheme, Atom Processors
MUST NOT assume it can be dereferenced."
Problem #2: RFC 4287 makes it a MUST-level requirement to generate the
same atom:id for a given resource. From RFC 4287, Section 4.2.6:
"When an Atom Document is relocated, migrated, syndicated, republished,
exported, or imported, the content of its atom:id element MUST NOT
change. Put another way, an atom:id element pertains to all
instantiations of a particular Atom entry or feed; revisions retain the
same content in their atom:id elements. It is suggested that the atom:id
element be stored along with the associated resource."
HTML5 relaxes this to a should-level requirement.
I do agree that generating valid Atom feeds from HTML *is* hard, but
violating a MUST-level requirement from the Atom spec is not acceptable.
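Change
"Let id be a user-agent-defined undereferenceable yet globally unique
valid absolute URL."
to
"Let id be a user-agent-defined yet globally unique valid absolute URL."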
Change
"The same absolute URL should be generated for each run of this
algorithm when given the same input."
to
"The same absolute URL must be generated for each run of this algorithm
when given the same input. If this requirement can not be fulfilled,
then generating a valid Atom feed is not possible and this algorithm
should be aborted."
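To illustrate the requirement above, here is a minimal sketch in Python of one way a generator could produce the same id on every run. It is not taken from the HTML5 spec or from RFC 4287. The function names are hypothetical, and it assumes the document has a canonical URL that stays fixed for the life of the resource. It derives a name-based UUID (version 5) from that URL, and contrasts it with an id derived from the page content, which changes with every revision and so cannot satisfy the atom:id rule.

```python
import hashlib
import uuid


def stable_atom_id(canonical_url: str) -> str:
    """Return a urn:uuid id derived only from the canonical URL.

    Assumption: canonical_url never changes for this resource, so every
    run, and every revision of the page, yields the same id.
    """
    return uuid.uuid5(uuid.NAMESPACE_URL, canonical_url).urn


def content_derived_id(html_source: str) -> str:
    """Counter-example: an id derived from the page content.

    Any edit to the page changes the digest, so different revisions of
    the same resource get different ids.
    """
    digest = hashlib.sha256(html_source.encode("utf-8")).hexdigest()
    return uuid.uuid5(uuid.NAMESPACE_URL, digest).urn


if __name__ == "__main__":
    url = "http://example.org/news"  # placeholder URL for illustration
    print(stable_atom_id(url))
    print(stable_atom_id(url) == stable_atom_id(url))
    print(content_derived_id("<p>v1</p>") == content_derived_id("<p>v2</p>"))
```

If no stable input such as a canonical URL is available, a generator has nothing to derive the id from, which is the case where the proposed text says the algorithm should be aborted.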
IMPACT
1. Positive Effects
Consistency between the applicable specs. Also, authors are correctly
informed about what it takes to generate proper Atom feeds.
2. Negative Effects
None.
3. Conformance Classes Changes
Atom feed generators are actually required to generate valid Atom
documents (with respect to atom:id).
4. Risks
None.
REFERENCES
Inline.

James Holderness

2010-04-07 05:37:48 UTC

Post by Julian Reschke
this relates to an HTML-WG discussion about the algorithm to create Atom
feeds from HTML (<http://dev.w3.org/html5/spec/Overview.html#atom>).

I don't get why this was added to the spec in the first place. Under what
conditions would a UA want or need to generate a feed from a webpage? Do you
have a pointer to a thread or something where this idea was initially
proposed?

As for the id bug itself, I think that's the least of your problems. From my
brief reading, the algorithm looks to me to be riddled with errors, not
including the parts that are just plain bad.

Frankly I'd toss the whole thing out. Then again, I'd say the same for a
good deal of HTML5.

Regards
James

Julian Reschke

2010-04-07 08:16:26 UTC

Post by James Holderness

Post by Julian Reschke
this relates to an HTML-WG discussion about the algorithm to create
Atom feeds from HTML (<http://dev.w3.org/html5/spec/Overview.html#atom>).

I think the reason that was presented is that people prefer to publish
just HTML instead of HTML + a feed. Should be somewhere in the WHATWG
archives.

Personally, I don't buy that. HTML pages that qualify as material for
news feeds are usually generated from a different source, such as a
database of news entries, or actually a feed document (this is how I do
it). Publishing both is simple, and actually saves bandwidth for those
just needing the Atom version.

Post by James Holderness
As for the id bug itself, I think that's the least of your problems.
From my brief reading, the algorithm looks to me to be riddled with
errors, not including the parts that are just plain bad.

Please point those out; as you can guess, I'm getting tired of this
stuff. Optimally in the HTML WG, but if you prefer to do it here, I'll
relay your feedback.

Post by James Holderness
Frankly I'd toss the whole thing out. Then again, I'd say the same for a
good deal of HTML5.

Same here.

Best regards, Julian

James Holderness

2010-04-07 22:12:26 UTC

Post by Julian Reschke
I think the reason that was presented is that people prefer to publish
just HTML instead of HTML + a feed. Should be somewhere in the WHATWG
archives.

Ok, I suppose I can imagine how someone might find that useful, but even if
there were a demand for such functionality, I don't think the algorithm in
question actually addresses that issue.

An algorithm for interpreting a web page as some form of feed (which is
really what you would need), and an algorithm for converting a web page into
an Atom document (which is what is described) are really quite different
processes.

Post by Julian Reschke

Please point those out; as you can guess, I'm getting tired of this stuff.
Optimally in the HTML WG, but if you prefer to do it here, I'll relay your
feedback.

I'm sorry, but that would require way more effort than I'm willing to expend
on something that I believe should never have been part of the HTML spec in
the first place.

The point I was trying to make was that (IMO) you're wasting your time
trying to get this id issue fixed. Because even if you could eventually get
Hickson to fix it, after months of arguing, you'd still be stuck with a crap
algorithm that no sane person would use.

By all means, push for the whole thing to be dropped from the spec if you
have the enthusiasm. Otherwise, my advice would just be to give it up as a
lost cause.

Regards
James

Asbjørn Ulsberg

2010-05-27 20:11:56 UTC

Post by Julian Reschke
I think the reason that was presented is that people prefer to publish
just HTML instead of HTML + a feed. Should be somewhere in the WHATWG
archives.
Personally, I don't buy that. HTML pages that qualify as material for
news feeds are usually generated from a different source, such as a
database of news entries, or actually a feed document (this is how I do
it). Publishing both is simple, and actually saves bandwidth for those
just needing the Atom version.

This is what I'm thinking as well. If there's no back-end store and all you
have is text of some sort stored in the file system, 99 out of 100 times
I'd choose to store Atom documents in the file system and transform these
to HTML with XSLT rather than the other way around, especially when the
algorithm defined to do so is as ridiculous as it is.
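As a rough illustration of how little the Atom-to-HTML direction needs, here is a minimal sketch. It uses Python's standard library rather than the XSLT mentioned above, and the function name, sample feed and output markup are all made up for the example. It reads only the feed title plus each entry's title and first link.

```python
import xml.etree.ElementTree as ET
from html import escape

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace from RFC 4287

# Made-up sample feed, for illustration only.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example News</title>
  <id>urn:uuid:60a76c80-d399-11d9-b93c-0003939e0af6</id>
  <entry>
    <title>First</title>
    <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
    <link href="http://example.org/1"/>
  </entry>
</feed>"""


def atom_to_html(atom_xml: str) -> str:
    """Render a feed as a heading followed by a list of linked titles."""
    feed = ET.fromstring(atom_xml)
    feed_title = feed.findtext(ATOM + "title", default="")
    items = []
    for entry in feed.iter(ATOM + "entry"):
        title = entry.findtext(ATOM + "title", default="")
        link = entry.find(ATOM + "link")
        href = link.get("href", "") if link is not None else ""
        # Escape text and attribute values before placing them in HTML.
        items.append(f'<li><a href="{escape(href)}">{escape(title)}</a></li>')
    return f"<h1>{escape(feed_title)}</h1>\n<ul>\n" + "\n".join(items) + "\n</ul>"


if __name__ == "__main__":
    print(atom_to_html(SAMPLE_FEED))
```

The feed already carries its ids, titles and links as explicit elements, so nothing has to be guessed, which is the asymmetry with going from HTML to Atom.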

Post by Julian Reschke

Post by James Holderness
Frankly I'd toss the whole thing out. Then again, I'd say the same for a
good deal of HTML5.

Same here.

+1.

--
Asbjørn Ulsberg -=|=- asbjorn-Cn+***@public.gmane.org
«He's a loathsome offensive brute, yet I can't look away»

Tim Bray

2010-04-08 00:34:22 UTC

I must say, the notion that you can guarantee that a URI can't be
dereferenced is charmingly naive.

I also agree with Julian that it's a big hairy problem that the
language blesses generating two atom entries with different atom:ids
from different revisions of the same HTML resource.

-Tim

Post by Julian Reschke
this relates to an HTML-WG discussion about the algorithm to create Atom
feeds from HTML (<http://dev.w3.org/html5/spec/Overview.html#atom>).
See <http://www.w3.org/Bugs/Public/show_bug.cgi?id=7806> and
<http://www.w3.org/html/wg/tracker/issues/86> for more context on how we got
here.
Best regards, Julian

Julian Reschke

2010-04-08 12:00:43 UTC

Post by Tim Bray
I must say, the notion that you can guarantee that a URI can't be
dereferenced is charmingly naive.
I also agree with Julian that it's a big hairy problem that the
language blesses generating two atom entries with different atom:ids
from different revisions of the same HTML resource.
...

Actually, even for multiple generations based on the same revision...

Best regards, Julian

Julian Reschke

2010-04-09 14:24:45 UTC

FYI:

I submitted a 2nd change proposal, asking for removal of the whole section.

Feedback from members of the Atom community would be appreciated; simply
send email to the WG mailing list
(<http://lists.w3.org/Archives/Public/public-html/>) or the public
comments mailing list
(<http://lists.w3.org/Archives/Public/public-html-comments/>).

Best regards, Julian

Also, it's not clear *at all* whether this is a feature that people
really want, and if they do, whether it needs to be part of HTML5.
Given the fact that it's non-trivial to generate a valid Atom feed
from HTML, but the reverse *is* trivial, we should also consider
removing this feature altogether (I'd be happy to write a 2nd change
proposal if people want to see that as well). (See [2])

Since a number of people have expressed interest, I think it would be
helpful to provide a second proposal along these lines.

SUMMARY
The HTML5 spec contains an algorithm for producing an Atom (RFC4287)
feed document from an HTML page.
There are many problems with this, summarized under RATIONALE.
This Change Proposal removes the complete section defining this algorithm.
RATIONALE
1) It's not clear that a sufficient number of people are interested in
this. HTML pages that would be candidates for this usually are generated
from a different source, like an article database, or even a feed
document. Therefore, providing both simply is not a problem for the
author. Defining a feature that is of little use increases the spec size
(more to review) and the risk of getting things wrong because of poor
review (see below).
2) Defining a mapping between both formats *is* interesting. Other
parties have done it before. This is even mentioned in HTML5. There's no
reason why another variant of this needs to be in HTML5.
3) The mapping as currently specified contradicts the Atom specification
(RFC 4287) in several aspects. If this Change Proposal does not get
applied, the individual problems with the mapping still will need to be
fixed. There's a separate Change Proposal ([1]) which is focused on
fixing some of these issues.
DETAILS
Remove all of 4.15.1 ("Atom"). Also remove 4.15 ("Converting HTML to
other formats"), which otherwise would be empty.
Note: the removal of this part should be applied to all variants of the
spec, be it in W3C space or not. Otherwise, the algorithm will need
proper review, and I'd recommend encouraging the members of the
atom-syntax mailing list to do that.
IMPACT
1. Positive Effects
Removal of spec text which is believed to be non-essential,
controversial, in contradiction with other applicable specs, and
potentially buggy.
2. Negative Effects
None.
3. Conformance Classes Changes
None (there was no requirement to implement this anyway).
4. Risks
None.
REFERENCES
[1] <http://lists.w3.org/Archives/Public/public-html/2010Apr/0291.html>

Julian Reschke

2010-05-27 16:33:45 UTC

In the meantime, the HTML WG has decided to drop this part of the spec.
See <http://lists.w3.org/Archives/Public/public-html/2010May/0391.html>.

Best regards, Julian