About Preprocessing in `Juman.apply_to_sentence` #121

tealgreen0503 · 2023-05-12T09:15:36Z

Some preprocessing appears to take place when performing morphological analysis with `Jumanpp.apply_to_sentence`. For example, half-width spaces are replaced with full-width spaces, and line breaks are removed.

import rhoknp
juman = rhoknp.Jumanpp()
text = " これは半角スペースです。"
print([morpheme.surf for morpheme in juman.apply_to_sentence(text).morphemes])
# ['\u3000', 'これ', 'は', '半角', 'スペース', 'です', '。']
text = "\nこれは改行です。"
print([morpheme.surf for morpheme in juman.apply_to_sentence(text).morphemes])
# ['これ', 'は', '改行', 'です', '。']

Are there any other preprocessing steps like these?

hkiyomaru · 2023-05-13T05:05:15Z

The preprocessing steps performed by Jumanpp.apply_to_sentence include:

Replacing half-width spaces with full-width spaces.
Replacing straight double quotation marks (") with curved double quotation marks (”).
Removing line breaks.
Removing carriage returns.
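To predict what the analyzer will actually see, the four steps above can be approximated with plain string replacements. This is only a sketch: the function name `approximate_jumanpp_preprocessing` is hypothetical and not part of rhoknp, and the order of the replacements is an assumption (it does not affect the result here, since the four targets do not overlap).

```python
def approximate_jumanpp_preprocessing(text: str) -> str:
    """Hypothetical helper that mirrors the four steps listed above.

    This is not rhoknp's own code; it only reproduces the described behavior.
    """
    text = text.replace(" ", "\u3000")  # half-width space -> full-width space
    text = text.replace('"', "\u201d")  # straight double quote -> curved double quote
    text = text.replace("\n", "")       # remove line breaks
    text = text.replace("\r", "")       # remove carriage returns
    return text


if __name__ == "__main__":
    # repr() makes the full-width space visible as '\u3000'
    print(repr(approximate_jumanpp_preprocessing(" これは半角スペースです。")))
    print(repr(approximate_jumanpp_preprocessing("\nこれは改行です。")))
```

Comparing the output of such a helper with the concatenated morpheme surfaces may help when you need to align the analysis result back to the original text.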

hkiyomaru · 2023-05-13T05:17:26Z

Note that sentences beginning with # are treated as comments and are not parsed. The Juman++ developer has proposed a workaround, described in this GitHub issue. rhoknp does not apply this workaround during preprocessing, so if you need it, you will have to implement it yourself.
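The workaround from the linked issue is not reproduced here. As one possible guard on the caller's side, a leading half-width # could be swapped for its full-width counterpart before the text is passed on. This is a sketch under two assumptions: that only the half-width # triggers comment handling, and that a full-width ＃ in the resulting surface forms is acceptable for your application. The name `escape_leading_hash` is hypothetical, and this is not necessarily what the Juman++ developer proposed.

```python
def escape_leading_hash(text: str) -> str:
    """Hypothetical helper: replace a leading half-width '#' with full-width '＃'.

    Assumption: only a half-width '#' at the very start of the input
    causes the sentence to be treated as a comment.
    """
    if text.startswith("#"):
        return "\uff03" + text[1:]  # U+FF03 FULLWIDTH NUMBER SIGN
    return text


if __name__ == "__main__":
    print(escape_leading_hash("#タグから始まる文です。"))
    print(escape_leading_hash("文中の#は変更しません。"))
```

Only the first character is touched, so # characters elsewhere in the sentence are left as they are.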

hkiyomaru · 2023-05-18T04:04:53Z

Let's carry on this discussion on ku-nlp/jumanpp#154.

hkiyomaru · 2023-05-19T05:40:15Z

#123 will fix the handling of half-width spaces and straight double quotation marks.

tealgreen0503 changed the title ~~Translation: About Preprocessing in Juman.apply_to_sentence~~ About Preprocessing in Juman.apply_to_sentence May 12, 2023

hkiyomaru closed this as completed May 18, 2023
