Commit d9a572e

Documentation

Thomas Proisl committed Nov 9, 2023
1 parent 831c4f2 commit d9a572e
Showing 2 changed files with 20 additions and 20 deletions.
34 changes: 17 additions & 17 deletions README.md
@@ -196,7 +196,7 @@ Here are some common use cases:
<details><summary>Show example</summary>

```
echo "der beste Betreuer? - >ProfSmith! : )" | somajo-tokenizer -c -
echo 'der beste Betreuer? - >ProfSmith! : )' | somajo-tokenizer -c -
der
beste
Betreuer
@@ -218,7 +218,7 @@ Here are some common use cases:
<details><summary>Show example</summary>

```
echo "der beste Betreuer? - >ProfSmith! : )" | somajo-tokenizer -
echo 'der beste Betreuer? - >ProfSmith! : )' | somajo-tokenizer -
der
beste
Betreuer
@@ -246,7 +246,7 @@ Here are some common use cases:
<details><summary>Show example</summary>

```
echo "Palim, Palim! Ich hätte gerne eine Flasche Pommes Frites." | somajo-tokenizer --split-sentences -
echo 'Palim, Palim! Ich hätte gerne eine Flasche Pommes Frites.' | somajo-tokenizer --split-sentences -
Palim
,
Palim
@@ -273,7 +273,7 @@ Here are some common use cases:
<details><summary>Show example</summary>

```
echo "Dont you wanna come?" | somajo-tokenizer -l en_PTB -
echo 'Dont you wanna come?' | somajo-tokenizer -l en_PTB -
Do
nt
you
@@ -329,7 +329,7 @@ Here are some common use cases:
<details><summary>Show example</summary>

```
echo "der beste Betreuer? - >ProfSmith! : )" | somajo-tokenizer -c -e -t -
echo 'der beste Betreuer? - >ProfSmith! : )' | somajo-tokenizer -c -e -t -
der regular
beste regular
Betreuer regular SpaceAfter=No
@@ -351,19 +351,18 @@ Here are some common use cases:

### Using the module

-You can easily incorporate SoMaJo into your own Python projects. All
-you need to do is importing `somajo.SoMaJo`, creating a `SoMaJo`
-object and calling one of its tokenizer functions: `tokenize_text`,
-`tokenize_text_file`, `tokenize_xml` or `tokenize_xml_file`. These
-functions return a generator that yields tokenized chunks of text. By
-default, these chunks of text are sentences. If you set
-`split_sentences=False`, then the chunks of text are either paragraphs
-or chunks of XML. Every tokenized chunk of text is a list of `Token`
-objects.

-For more details, take a look at the [API
+Take a look at the [API
documentation](https://github.com/tsproisl/SoMaJo/blob/master/doc/build/markdown/somajo.md).

+You can incorporate SoMaJo into your own Python projects. All you need
+to do is import `somajo`, create a `SoMaJo` object and call one of its
+tokenizer functions: `tokenize_text`, `tokenize_text_file`,
+`tokenize_xml` or `tokenize_xml_file`. These functions return a
+generator that yields tokenized chunks of text. By default, these
+chunks of text are sentences. If you set `split_sentences=False`, then
+the chunks of text are either paragraphs or chunks of XML. Every
+tokenized chunk of text is a list of `Token` objects.

Here is an example for tokenizing and sentence splitting two
paragraphs:

@@ -379,7 +378,7 @@ paragraphs = ["der beste Betreuer?\n-- ProfSmith! : )",
sentences = tokenizer.tokenize_text(paragraphs)
for sentence in sentences:
    for token in sentence:
-        print("{}\t{}\t{}".format(token.text, token.token_class, token.extra_info))
+        print(f"{token.text}\t{token.token_class}\t{token.extra_info}")
    print()
```
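The hunk above shows only part of this example. A self-contained version looks roughly like the sketch below; the constructor call is taken from the surrounding README rather than from this diff, and the second paragraph string is purely illustrative.

```
from somajo import SoMaJo

# German CMC tokenizer; camel-case splitting is optional.
tokenizer = SoMaJo("de_CMC", split_camel_case=True)

# Paragraphs are not necessarily single sentences.
# The second paragraph is an illustrative placeholder.
paragraphs = ["der beste Betreuer?\n-- ProfSmith! : )",
              "Was machst du morgen Abend?! Lust auf Film?;-)"]

# tokenize_text yields one list of Token objects per sentence.
sentences = tokenizer.tokenize_text(paragraphs)
for sentence in sentences:
    for token in sentence:
        print(f"{token.text}\t{token.token_class}\t{token.extra_info}")
    print()
```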

@@ -414,6 +413,7 @@ for sentence in sentences:
print()
```
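Only the tail of the README's second example is visible in this hunk. For reference, a minimal self-contained sketch of XML tokenization is given below; the `eos_tags` argument (tags that delimit units of text) and the sample document are assumptions for illustration and are not shown in this diff.

```
from somajo import SoMaJo

tokenizer = SoMaJo("de_CMC")

# Tags that end a unit of text (assumed argument; see the API documentation).
eos_tags = ["title", "h1", "p"]
xml = "<html><body><p>Heute ist Montag.</p><p>Gestern war Sonntag.</p></body></html>"

# Each sentence is a list of Token objects; markup may appear as its own tokens.
sentences = tokenizer.tokenize_xml(xml, eos_tags)
for sentence in sentences:
    print(" ".join(token.text for token in sentence))
```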


## Evaluation

SoMaJo was the system with the highest average F₁ score in the
6 changes: 3 additions & 3 deletions src/somajo/somajo.py
@@ -33,7 +33,7 @@ class SoMaJo:
guarantee well-formed output (tags might need to be closed and
re-opened at sentence boundaries).
character_offsets : bool, (default=False)
-Compute for each token the character offsets in the input.
+Compute the character offsets in the input for each token.
This allows for stand-off tokenization.
"""
@@ -159,7 +159,7 @@ def tokenize_text_file(self, text_file, paragraph_separator, *, parallel=1):
>>> sentences = tokenizer.tokenize_text_file("example_empty_lines.txt", paragraph_separator="single_newlines")
>>> for sentence in sentences:
... for token in sentence:
... print("{}\t{}\t{}".format(token.text, token.token_class, token.extra_info))
... print("{token.text}\t{token.token_class}\t{token.extra_info}")
... print()
...
Heyi regular SpaceAfter=No
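The doctest above uses paragraph_separator="single_newlines". A complementary sketch using empty lines as paragraph separators is shown below; the file name is made up and the "empty_lines" separator value is an assumption, so verify it against the API documentation.

```
from somajo import SoMaJo

tokenizer = SoMaJo("de_CMC")

# Paragraphs in the input file are separated by empty lines (assumed separator value).
sentences = tokenizer.tokenize_text_file("example.txt", paragraph_separator="empty_lines")
for sentence in sentences:
    print(" ".join(token.text for token in sentence))
```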
@@ -383,7 +383,7 @@ def tokenize_text(self, paragraphs, *, parallel=1):
>>> sentences = tokenizer.tokenize_text(paragraphs)
>>> for sentence in sentences:
... for token in sentence:
... print("{}\t{}\t{}".format(token.text, token.token_class, token.extra_info))
... print("{token.text}\t{token.token_class}\t{token.extra_info}")
... print()
...
Heyi regular SpaceAfter=No
