
About Preprocessing in Juman.apply_to_sentence #121

Closed
tealgreen0503 opened this issue May 12, 2023 · 4 comments

@tealgreen0503

Jumanpp.apply_to_sentence appears to preprocess its input before running morphological analysis. For example, half-width spaces are replaced with full-width spaces, and line breaks are removed.

import rhoknp

juman = rhoknp.Jumanpp()

# Leading half-width space ("This is a half-width space."): returned as U+3000.
text = " これは半角スペースです。"
print([morpheme.surf for morpheme in juman.apply_to_sentence(text).morphemes])
# ['\u3000', 'これ', 'は', '半角', 'スペース', 'です', '。']

# Leading line break ("This is a line break."): dropped from the output.
text = "\nこれは改行です。"
print([morpheme.surf for morpheme in juman.apply_to_sentence(text).morphemes])
# ['これ', 'は', '改行', 'です', '。']

Are there any other preprocessing steps like these?

@hkiyomaru
Member

The preprocessing steps performed by Jumanpp.apply_to_sentence include:

  1. Replacing half-width spaces with full-width spaces.
  2. Replacing straight double quotation marks (") with curved double quotation marks (”).
  3. Removing line breaks.
  4. Removing carriage returns.
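
The four steps listed above can be approximated in plain Python, which may help when predicting what the analyzer will actually receive. This is only a sketch: `normalize_like_rhoknp` is a hypothetical helper written for illustration, not a function in rhoknp, and the order of the replacements is an assumption.

```python
def normalize_like_rhoknp(text: str) -> str:
    """Hypothetical approximation of the four preprocessing steps listed above.

    This is NOT rhoknp's own code; it only mirrors the described behavior.
    """
    text = text.replace(" ", "\u3000")  # half-width space -> full-width space
    text = text.replace('"', "\u201d")  # straight double quote -> curved double quote
    text = text.replace("\n", "")       # remove line breaks
    text = text.replace("\r", "")       # remove carriage returns
    return text


if __name__ == "__main__":
    # repr() makes the otherwise invisible characters easy to inspect.
    print(repr(normalize_like_rhoknp(" これは半角スペースです。")))
    print(repr(normalize_like_rhoknp("\nこれは改行です。")))
```

Comparing the output of such a helper with the original string is one way to see which characters will differ between the input text and the returned morphemes.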

@hkiyomaru
Member

Note that sentences beginning with # are treated as comments and are not parsed. The Juman++ developer has proposed a workaround, which can be found in this GitHub issue. rhoknp does not apply this workaround as a preprocessing step, so if you need it, you will have to implement it yourself.
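
Since rhoknp does not handle this case, one option is to detect affected sentences before analysis. The sketch below does not implement the workaround from the linked issue; it only separates out sentences that begin with a half-width #, so the caller can decide what to do with them. Both function names are hypothetical.

```python
from typing import Iterable, List, Tuple


def is_treated_as_comment(sentence: str) -> bool:
    """Return True if the sentence begins with '#'.

    Assumption: only a half-width '#' at position 0 triggers comment handling.
    """
    return sentence.startswith("#")


def split_parsable(sentences: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split sentences into (parsable, skipped) without modifying any of them."""
    parsable: List[str] = []
    skipped: List[str] = []
    for sentence in sentences:
        if is_treated_as_comment(sentence):
            skipped.append(sentence)
        else:
            parsable.append(sentence)
    return parsable, skipped


if __name__ == "__main__":
    parsable, skipped = split_parsable(["#ハッシュタグです。", "これは文です。"])
    print(parsable)
    print(skipped)
```

Keeping the skipped sentences in a separate list, rather than silently dropping them, makes it possible to apply the developer's workaround to them later.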

@hkiyomaru
Member

Let's carry on this discussion on ku-nlp/jumanpp#154.

@hkiyomaru
Member

#123 will fix the handling of half-width spaces and straight double quotation marks.
