Missing h1 heading if <header> outside of <article> #642

Open
chrisgoddard opened this issue Jul 11, 2024 · 2 comments
Labels
question Further information is requested

Comments

@chrisgoddard

I'm consistently having trouble getting the extracted article content to include the main h1 heading when it sits in a <header> element that is outside of the <article>.

Common example is WaPo - e.g. https://www.washingtonpost.com/dc-md-va/2024/05/14/maryland-democratic-senate-primary/

The extracted content begins at "Prince George’s County Executive Angela D. Alsobrooks..." - it misses the h1 as well as the subheading right below it.

I've been trying to do some preprocessing of the HTML (basically moving the h1 element into the <article>), but I can't get it working. Going through the code, I can't quite figure out why exactly it's being filtered out in the first place.
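The preprocessing step described above could be sketched roughly as follows. This is only an illustration under assumptions: it uses the standard-library `xml.etree.ElementTree` on a small, well-formed sample, and the function name `move_h1_into_article` is hypothetical. Real-world pages are rarely well-formed XML, so in practice one would more likely parse with `lxml.html`, whose elements offer the same `find`, `remove` and `insert` methods.

```python
import xml.etree.ElementTree as ET

# Well-formed sample mirroring the reported structure (title outside <article>).
SAMPLE = """<body>
  <main>
    <header><h1>Article Title</h1></header>
    <article><p>Article content</p></article>
  </main>
</body>"""


def move_h1_into_article(root):
    """Move the first <h1> in the tree to the top of the first <article>.

    Hypothetical helper; returns True if the tree was modified.
    """
    article = root.find(".//article")
    if article is None or article.find(".//h1") is not None:
        return False  # no article, or the article already contains a title
    h1 = root.find(".//h1")
    if h1 is None:
        return False
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    parent = parents.get(h1)
    if parent is None:
        return False
    parent.remove(h1)
    h1.tail = None  # drop whitespace that followed the heading
    article.insert(0, h1)
    return True


root = ET.fromstring(SAMPLE)
changed = move_h1_into_article(root)
html = ET.tostring(root, encoding="unicode", method="html")
```

The resulting `html` string could then be handed to the extractor in place of the original markup. Whether the heading then survives extraction is not verified here.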

Any thoughts?

@adbar adbar added the question Further information is requested label Jul 16, 2024
@adbar
Owner

adbar commented Jul 16, 2024

It is debatable whether titles are part of the main content; they are not always included in benchmarks. That being said, the main title should also be present in the metadata. Would that be a solution?

@chrisgoddard
Author

chrisgoddard commented Jul 30, 2024

I would say no, because in instances where this happens (like Washington Post), there might be other information, like a subtitle (or, you could imagine, author etc.), in the <header> block too.

I agree it's an incorrect usage of the <header> element and its semantics, but I have seen it across a few sites (WaPo just being the most prominent).

Anecdotally, whenever I've seen it, it has always been within a <main> tag, i.e. the structure is:

<body>
    <main>
        <header>
            <h1>Article Title</h1>
        </header>
        <article>
            <p>Article content</p>
        </article>
        <div class="related-rail">
            <aside>
                <h2>Related Articles</h2>
                <ul>
                    <li>Related Article 1</li>
                    <li>Related Article 2</li>
                </ul>
            </aside>
        </div>  
    </main>
</body>

So maybe there could be a heuristic for cases where that pattern of a <header> next to an <article> inside <main> appears. Otherwise, would you have a suggestion for a preprocessing step that could make the existing implementation work?
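One possible shape for such a heuristic, written as a preprocessing step rather than a change to the library, is sketched below. It is an assumption-laden illustration: the function name `hoist_detached_headers` and the `subtitle` paragraph in the sample are hypothetical, and the standard-library `xml.etree.ElementTree` is used only because the sample is well-formed (for real pages, `lxml.html` elements expose the same methods). The idea is to move the whole `<header>` block, not just the `<h1>`, so that a subtitle or byline travels with the title.

```python
import xml.etree.ElementTree as ET


def hoist_detached_headers(root):
    """Move a <header> holding an <h1> into a later sibling <article>.

    Only direct children of <main> are considered. Hypothetical helper;
    returns the number of headers moved.
    """
    moved = 0
    # Snapshot the <main> elements first, since the tree is mutated below.
    for main in list(root.iter("main")):
        children = list(main)
        for i, node in enumerate(children):
            if node.tag != "header" or node.find(".//h1") is None:
                continue
            # Find the next sibling <article> after this header.
            article = next(
                (c for c in children[i + 1:] if c.tag == "article"), None
            )
            if article is None or article.find(".//h1") is not None:
                continue  # no target, or the article already has a title
            main.remove(node)
            node.tail = None  # drop whitespace that followed the header
            article.insert(0, node)
            moved += 1
    return moved


# Sample based on the structure above; the subtitle is an invented example.
SAMPLE = """<body><main>
<header><h1>Article Title</h1><p class="subtitle">Subtitle</p></header>
<article><p>Article content</p></article>
<div class="related-rail"><aside><h2>Related Articles</h2></aside></div>
</main></body>"""

root = ET.fromstring(SAMPLE)
count = hoist_detached_headers(root)
html = ET.tostring(root, encoding="unicode", method="html")
```

Restricting the match to direct children of `<main>` keeps the rule narrow, so site-wide page headers elsewhere in the document are left alone.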
