Added 77377 lines. #2

ufukart · 2021-01-11T21:31:12Z

I imported the 77377 rows of dictionary data I already had on hand. I did not check for duplicates against the existing JSON.

kahramanumut · 2021-01-12T18:01:31Z

Thank you very much for the contribution, that is quite a lot of words. Because the file is so large, it could cause problems during deserialization and afterwards, so I can't merge it directly right now, but I have taken the file.

huseyinsimsekk · 2021-01-13T10:09:05Z

Hi, if this gets merged I can do the duplicate check. The JSON file can be edited so that it contains only unique entries. Of course, the size may still be too large.
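To make the "unique entries only" idea concrete, here is a minimal sketch. It assumes each record is an object with `english` and `turkish` keys (as in the sample shown later in this thread) and treats two records as duplicates only when both fields match exactly. The function name `dedupe_records` and the sample data are hypothetical, not part of the repository.

```python
import json


def dedupe_records(records):
    """Drop exact (english, turkish) repeats, keeping the first occurrence of each pair."""
    seen = set()
    unique = []
    for record in records:
        key = (record["english"], record["turkish"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


if __name__ == "__main__":
    # Hypothetical sample data, only for illustration
    sample = [
        {"english": "book", "turkish": "kitap"},
        {"english": "book", "turkish": "kitap"},
        {"english": "pen", "turkish": "kalem"},
    ]
    print(json.dumps(dedupe_records(sample), ensure_ascii=False))
```

Keeping the first occurrence preserves the original order of the file, which should keep the resulting diff easier to review.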

berkakkaya · 2021-01-19T13:12:32Z

Hi, I wrote a Python script to check the JSON file in question for duplicates.
I added my script's output here.

The script I used:

from json import load, dump
from os import mkdir
from os.path import isdir


print("File is loading...")

# Load the file (explicit UTF-8 so Turkish characters decode the same on every platform)
with open("frequentlyWords.json", "r", encoding="utf8") as f:
    file = load(f)

words = dict()

print("Scanning...")

for record in file:
    # Word string: [English equivalent]_[Turkish equivalent]
    word_string = f"{record['english']}_{record['turkish']}"
    
    if word_string not in words:
        words[word_string] = 1
    else:
        words[word_string] += 1

duplicate_words = dict()

for word in words:
    if words[word] > 1:
        duplicate_words[word] = words[word]

# We don't need words variable anymore
del words

len_duplicate_words = len(duplicate_words)

if len_duplicate_words == 0:
    print("No duplicate words have been found.")
else:
    print(f"{len_duplicate_words} duplicate word(s) have been found.")

data = dict(
    note="Words are shown as word strings. Word strings are formatted as [English equivalent]_[Turkish equivalent]. Also key values in duplicate words are indicating how many duplicates of that word are in the target file.",
    duplicate_words_count=len_duplicate_words,
    duplicate_words=duplicate_words
)

if not isdir("out"):
    mkdir("out")

with open("out/result.json", "w", encoding="utf8") as f:
    dump(data, f, ensure_ascii=False, indent=2)

print("More detailed results are in out/result.json")

Also, while checking the file, I noticed that the Turkish equivalents of some words look like this (at line 276898):

{
  "english": "piercing",
  "turkish": "n.delme:v.del:prep.delerek"
}

I think there are many words like this. They may need to be fixed.
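One possible way to clean up such entries is to split them into separate senses. The sketch below assumes that `:` separates senses and that each sense may start with a short alphabetic part-of-speech tag followed by a dot (`n.`, `v.`, `prep.`). That format is inferred from the single example above and may not hold for every record. The name `split_senses` is hypothetical.

```python
def split_senses(turkish):
    """Split 'n.delme:v.del' into [('n', 'delme'), ('v', 'del')].

    Parts without a recognizable tag are returned with tag None.
    """
    senses = []
    for part in turkish.split(":"):
        tag, sep, meaning = part.partition(".")
        # Assumption: a tag is at most 4 letters and is followed by a non-empty meaning
        if sep and tag.isalpha() and len(tag) <= 4 and meaning:
            senses.append((tag, meaning))
        else:
            senses.append((None, part))
    return senses
```

Entries that already hold a plain translation pass through unchanged with a tag of `None`, so the function could be run over the whole file before deciding how to store the tags.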

kahramanumut · 2021-01-19T17:36:34Z

Great work @berkakkaya, well done

berkakkaya · 2021-01-19T17:43:19Z

Great work @berkakkaya, well done

Thanks 😄

Also, you can add the script I shared to GitHub Actions if you like; it would make future pull requests easier for you.
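For use in CI, the check would need to signal failure through its exit code rather than only writing a report. Here is a rough sketch of that variant, assuming the same `english`/`turkish` record layout. The names `find_duplicates` and `check_file` are hypothetical, and the workflow configuration itself is not shown.

```python
import json
from collections import Counter


def find_duplicates(records):
    """Map each repeated (english, turkish) pair to its number of occurrences."""
    counts = Counter((r["english"], r["turkish"]) for r in records)
    return {pair: n for pair, n in counts.items() if n > 1}


def check_file(path):
    """Print duplicates found in the JSON file at path; return 1 if any exist, else 0."""
    with open(path, "r", encoding="utf8") as f:
        records = json.load(f)
    duplicates = find_duplicates(records)
    for (english, turkish), n in duplicates.items():
        print(f"{english}_{turkish}: {n} occurrences")
    return 1 if duplicates else 0
```

A workflow step could then end the script with `sys.exit(check_file("frequentlyWords.json"))` (after `import sys`), so that a pull request introducing duplicates fails the check.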

Commit b233e2c: Added 77377 lines.
