Rework handling general entity references (&entity;) #766

Open
wants to merge 6 commits into
base: master

Mingun
Collaborator

@Mingun Mingun commented Jun 21, 2024

This is a big change in how general entity references and character references are handled. I am opening the PR early to get feedback.

With these changes we can correctly parse the document

<!DOCTYPE root [
  <!ENTITY root "<root/>">
]>
&root;

as the equivalent normalized document

<root/>
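
To make the normalization concrete, here is a minimal standalone sketch of textual entity expansion. It does not use the library's API: `expand_entities` and its `max_depth` guard are hypothetical names chosen for illustration, and the PR itself streams events from the replacement text rather than building a string.

```rust
use std::collections::HashMap;

/// Replaces every `&name;` in `input` with its replacement text from `entities`.
/// Replacement text is expanded recursively, because a parsed entity may itself
/// contain references. Unknown names are copied through unchanged.
/// `max_depth` is a hypothetical guard against self-referencing entities.
fn expand_entities(input: &str, entities: &HashMap<&str, &str>, max_depth: usize) -> String {
    let mut out = String::with_capacity(input.len());
    let mut rest = input;
    while let Some(amp) = rest.find('&') {
        // Copy the text before the `&` verbatim.
        out.push_str(&rest[..amp]);
        let after = &rest[amp + 1..];
        match after.find(';') {
            Some(semi) => {
                let name = &after[..semi];
                match entities.get(name) {
                    Some(replacement) if max_depth > 0 => {
                        out.push_str(&expand_entities(replacement, entities, max_depth - 1));
                    }
                    _ => {
                        // Unknown entity or depth exhausted: keep the reference as written.
                        out.push('&');
                        out.push_str(name);
                        out.push(';');
                    }
                }
                rest = &after[semi + 1..];
            }
            None => {
                // No closing `;`: copy the remainder as written.
                out.push('&');
                out.push_str(after);
                rest = "";
            }
        }
    }
    out.push_str(rest);
    out
}
```

With a map containing `root` mapped to `<root/>`, calling `expand_entities("&root;", &map, 8)` yields the normalized document shown above.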

The updated custom_entities example shows how the specification's requirements for parsed general entities can be implemented. The serde deserializer has not been updated yet, because that part is not trivial and will probably be done in another PR.

Of course, such a change probably makes performance worse; I have not measured the impact yet.

Closes #667

@codecov-commenter

codecov-commenter commented Jun 30, 2024

Codecov Report

Attention: Patch coverage is 61.13537% with 178 lines in your changes missing coverage. Please review.

Project coverage is 60.13%. Comparing base (7558577) to head (eb90e9f).
Report is 94 commits behind head on master.

File                             Patch %   Lines
examples/custom_entities.rs        0.00%   117 Missing ⚠️
src/events/mod.rs                 38.98%    36 Missing ⚠️
src/reader/buffered_reader.rs     83.09%    12 Missing ⚠️
benches/macrobenches.rs            0.00%     4 Missing ⚠️
src/de/mod.rs                     88.46%     3 Missing ⚠️
src/errors.rs                      0.00%     3 Missing ⚠️
benches/microbenches.rs            0.00%     1 Missing ⚠️
src/reader/slice_reader.rs        97.56%     1 Missing ⚠️
src/writer/async_tokio.rs          0.00%     1 Missing ⚠️

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #766      +/-   ##
==========================================
- Coverage   61.81%   60.13%   -1.69%     
==========================================
  Files          41       41              
  Lines       16798    16492     -306     
==========================================
- Hits        10384     9917     -467     
- Misses       6414     6575     +161     
Flag Coverage Δ
unittests 60.13% <61.13%> (-1.69%) ⬇️

@Mingun
Collaborator Author

Mingun commented Jun 30, 2024

I have finished work on the base part of the entity support. This PR adds a new Event::GeneralRef together with a new BytesRef struct, which represents any &...; reference, including entity references and character references. Character references can be resolved by calling BytesRef::resolve_char_ref(); entity references can be resolved by mapping the content of BytesRef to its replacement text. Both usages are shown in the updated custom_entities example.
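
The two resolution paths can be sketched without the library. In the sketch below, `Ref` is a hypothetical stand-in for the content of a reference (the text between `&` and `;`), not the real `BytesRef`, and the signatures are assumptions. It also skips the XML rule that forbids some code points, such as `#0`.

```rust
/// Hypothetical stand-in for the content of a `&...;` reference,
/// i.e. the text between `&` and `;`. Not the real `BytesRef` type.
struct Ref<'a>(&'a str);

impl<'a> Ref<'a> {
    /// Resolves `#NNN` (decimal) and `#xHHH` (hexadecimal) character references.
    /// Returns `None` for entity references and for malformed or invalid numbers.
    fn resolve_char_ref(&self) -> Option<char> {
        let num = self.0.strip_prefix('#')?;
        let (digits, radix) = match num.strip_prefix('x') {
            Some(hex) => (hex, 16),
            None => (num, 10),
        };
        // Reject empty input and signs, which the integer parser would tolerate.
        if digits.is_empty() || !digits.chars().all(|c| c.is_digit(radix)) {
            return None;
        }
        let code = u32::from_str_radix(digits, radix).ok()?;
        char::from_u32(code)
    }
}

/// Resolves a reference to its replacement text: character references first,
/// then a lookup of the entity name in a caller-supplied table.
fn resolve(reference: &Ref<'_>, entities: &[(&str, &str)]) -> Option<String> {
    if let Some(ch) = reference.resolve_char_ref() {
        return Some(ch.to_string());
    }
    entities
        .iter()
        .find(|(name, _)| *name == reference.0)
        .map(|(_, text)| text.to_string())
}
```

Keeping the entity table outside the reference type mirrors the description above: the reference only knows its own content, and the caller decides what each entity name means.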

@dralley
Collaborator

dralley commented Jun 30, 2024

I won't have a chance to fully review this for a couple of days. Quick question though, am I correct in thinking that this PR will mean that any time a text block contains one or more entity references, instead of the developer receiving one Event::Text containing everything between the opening and closing tags, they will receive a series of Event::Text and Event::GeneralRef which they will then need to merge back together themselves into the original text?

@Mingun
Collaborator Author

Mingun commented Jun 30, 2024

Yes, you are correct. But that does not mean that users will need to construct the complete text themselves. In the next PR I plan to rename Reader to RawReader and make it return borrow-only RawEvents, and in another PR to introduce a new Reader which will automatically merge all consecutive Text, CData and GeneralRef events. This should be much more convenient for the average user. RawReader will only be needed for very fine control.
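
Until such a merging reader exists, the merge step could look roughly like the sketch below. The `Event` enum here is a simplified, owned stand-in defined only for this example (the real events carry borrowed bytes), and `merge_text` and its `resolve` callback are hypothetical names. It assumes replacement text is plain text, not markup.

```rust
/// Simplified, hypothetical event type used only in this sketch.
#[derive(Debug, PartialEq)]
enum Event {
    Start(String),
    End(String),
    Text(String),
    CData(String),
    /// Content between `&` and `;`.
    GeneralRef(String),
}

/// Merges each run of consecutive `Text`, `CData` and `GeneralRef` events
/// into a single `Text` event. `resolve` maps reference content to its
/// replacement text; unresolved references are written back as `&name;`.
fn merge_text(events: Vec<Event>, resolve: impl Fn(&str) -> Option<String>) -> Vec<Event> {
    let mut out = Vec::new();
    // Text accumulated for the current run, if a run is in progress.
    let mut pending: Option<String> = None;
    for event in events {
        match event {
            Event::Text(t) | Event::CData(t) => {
                pending.get_or_insert_with(String::new).push_str(&t);
            }
            Event::GeneralRef(name) => {
                let buf = pending.get_or_insert_with(String::new);
                match resolve(&name) {
                    Some(text) => buf.push_str(&text),
                    None => {
                        buf.push('&');
                        buf.push_str(&name);
                        buf.push(';');
                    }
                }
            }
            other => {
                // Any non-text event ends the current run.
                if let Some(text) = pending.take() {
                    out.push(Event::Text(text));
                }
                out.push(other);
            }
        }
    }
    if let Some(text) = pending.take() {
        out.push(Event::Text(text));
    }
    out
}
```

A caller that only cares about element text would run the raw event stream through a step like this and see one `Text` event between the start and end tags again.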

Because the rename affects a great many places, I want to do it in a separate PR to reduce noise in the PR with the new Reader.

Borrow-only, reader-only events will also be useful because I plan to add an offset member to them to track the event's position in the stream. When you construct an event for writing you obviously do not have a position, and I think it is better not to have a dummy value for it, so that you cannot mistakenly use a writer event in a reading context and get the wrong position.

Because the new Reader will have a stack of RawReaders (in the same way as demonstrated in the custom_entities example), it will be simple to recreate readers when we need to change the encoding, so I think that #158 is very close to being resolved.
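
The stack-of-readers idea can be illustrated with plain string slices. This is only a sketch under stated assumptions: `Cursor`, `Piece`, `read_all` and `MAX_DEPTH` are hypothetical names, each cursor plays the role of one reader, and the output is collected into a string instead of being streamed as events.

```rust
use std::collections::HashMap;

/// Hypothetical guard against entities that refer to themselves.
const MAX_DEPTH: usize = 16;

/// One piece of input: plain text or the name inside `&...;`.
enum Piece<'a> {
    Text(&'a str),
    Ref(&'a str),
}

/// Hypothetical minimal reader over a single input string.
struct Cursor<'a> {
    rest: &'a str,
}

impl<'a> Cursor<'a> {
    fn next_piece(&mut self) -> Option<Piece<'a>> {
        let rest: &'a str = self.rest;
        if rest.is_empty() {
            return None;
        }
        if let Some(after) = rest.strip_prefix('&') {
            if let Some(semi) = after.find(';') {
                self.rest = &after[semi + 1..];
                return Some(Piece::Ref(&after[..semi]));
            }
            // No closing `;`: hand back the remainder as text.
            self.rest = "";
            return Some(Piece::Text(rest));
        }
        let end = rest.find('&').unwrap_or(rest.len());
        let (text, tail) = rest.split_at(end);
        self.rest = tail;
        Some(Piece::Text(text))
    }
}

/// Reads `doc`, descending into entity replacement text by pushing a new
/// cursor onto the stack and popping it once it is exhausted.
fn read_all(doc: &str, entities: &HashMap<&str, &str>) -> Result<String, String> {
    let mut stack = vec![Cursor { rest: doc }];
    let mut out = String::new();
    while let Some(top) = stack.last_mut() {
        match top.next_piece() {
            Some(Piece::Text(text)) => out.push_str(text),
            Some(Piece::Ref(name)) => match entities.get(name) {
                Some(replacement) if stack.len() < MAX_DEPTH => {
                    stack.push(Cursor { rest: *replacement });
                }
                Some(_) => return Err(format!("entity `{}` is nested too deeply", name)),
                None => {
                    // Unknown entity: keep the reference as written.
                    out.push('&');
                    out.push_str(name);
                    out.push(';');
                }
            },
            None => {
                stack.pop();
            }
        }
    }
    Ok(out)
}
```

Because each level of the stack is an independent reader, swapping one of them out (for example after an encoding change) does not disturb the others, which is the property the comment above relies on.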

Comment on lines 263 to 289
Ok((_, false)) => {
    // We want to report error at `&`, but offset was increased,
    // so return it back (-1 for `&`)
    $self.state.last_error_offset = start - 1;
    Err(Error::Syntax(SyntaxError::UnclosedReference))
}
Collaborator Author

@Mingun Mingun Jul 1, 2024

I found code in ruffle (ruffle-rs/ruffle#10471, from @Aaron1011) that will break (the custom_unescape function), because currently a dangling & always returns SyntaxError::UnclosedReference. I think I should add a new configuration option for this here.
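
A rough standalone sketch of what such an option could mean for a tokenizer is given below. `Token`, `tokenize` and the error payload are hypothetical; the rule used here (a reference is closed only if `;` appears before the next `&`) is an assumption of the sketch, not necessarily what the library does. The reported offset points at the `&`, as in the code under review.

```rust
/// Hypothetical token type: plain text or the content of a `&...;` reference.
#[derive(Debug, PartialEq)]
enum Token<'a> {
    Text(&'a str),
    Ref(&'a str),
}

/// Hypothetical error type; the payload is the byte offset of the offending `&`.
#[derive(Debug, PartialEq)]
enum SyntaxError {
    UnclosedReference(usize),
}

/// Splits `input` into text and references. With `allow_dangling_amp` set,
/// a `&` that is not closed by `;` is kept as literal text instead of
/// being reported as an error.
fn tokenize(input: &str, allow_dangling_amp: bool) -> Result<Vec<Token<'_>>, SyntaxError> {
    let mut tokens = Vec::new();
    let mut text_start = 0; // start of the pending text run
    let mut pos = 0; // where to look for the next `&`
    while let Some(found) = input[pos..].find('&') {
        let amp = pos + found;
        let after = amp + 1;
        // Assumption: the reference is closed only if `;` comes before the next `&`.
        match input[after..].find(|c: char| c == ';' || c == '&') {
            Some(len) if input[after + len..].starts_with(';') => {
                if text_start < amp {
                    tokens.push(Token::Text(&input[text_start..amp]));
                }
                tokens.push(Token::Ref(&input[after..after + len]));
                pos = after + len + 1;
                text_start = pos;
            }
            _ if allow_dangling_amp => {
                // Leave the `&` inside the current text run and keep scanning.
                pos = after;
            }
            _ => return Err(SyntaxError::UnclosedReference(amp)),
        }
    }
    if text_start < input.len() {
        tokens.push(Token::Text(&input[text_start..]));
    }
    Ok(tokens)
}
```

With the flag off, `AT&T` is rejected with the offset of the `&`; with the flag on, the same input comes back as a single text token, which is the behaviour code like the custom_unescape function mentioned above depends on.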

@Mingun
Collaborator Author

Mingun commented Jul 21, 2024

@dralley, what do you think about this?

Mingun and others added 4 commits July 23, 2024 20:35
…onstruction in a text

failures (16):
  serde-de (9):
    borrow::escaped::element
    borrow::escaped::top_level
    resolve::resolve_custom_entity
    trivial::text::byte_buf
    trivial::text::bytes
    trivial::text::string::field
    trivial::text::string::naked
    trivial::text::string::text
    xml_schema_lists::element::text::string
  serde-migrated (1):
    test_parse_string
  serde-se (5):
    with_root::char_amp
    with_root::char_gt
    with_root::char_lt
    with_root::str_escaped
    with_root::tuple
  --doc (1):
    src\de\resolver.rs - de::resolver::EntityResolver (line 13)
Text events produced by the Reader can no longer contain escaped data;
all such data is represented by Event::GeneralRef
Mingun and others added 2 commits July 23, 2024 20:47
Fixed (18):
  serde-de (9):
    borrow::escaped::element
    borrow::escaped::top_level
    resolve::resolve_custom_entity
    trivial::text::byte_buf
    trivial::text::bytes
    trivial::text::string::field
    trivial::text::string::naked
    trivial::text::string::text
    xml_schema_lists::element::text::string
  serde-migrated (1):
    test_parse_string
  serde-se (5):
    with_root::char_amp
    with_root::char_gt
    with_root::char_lt
    with_root::str_escaped
    with_root::tuple
  --doc (3):
    src\de\resolver.rs - de::resolver::EntityResolver (line 13)
@@ -24,6 +24,9 @@ XML specification. See the updated `custom_entities` example!

- [#766]: Allow to parse resolved entities as XML fragments and stream events from them.
- [#766]: Added new event `Event::GeneralRef` with content of [general entity].
- [#766]: Added new configuration option `allow_dangling_amp` which allows to have
a `&` not followed by `;` in the textual data which is required for some applications
for compatibility reasons.
Collaborator

Which applications?

Collaborator Author

I meant the case from #719 here.

Development

Successfully merging this pull request may close these issues.

How would I parse character references as literal bytes and not codepoints?
3 participants