Update files to reflect unpickling error fix request #351

Open
mahasch opened this issue Jul 31, 2022 · 0 comments
mahasch commented Jul 31, 2022

Hi,
I ran into the very common `UnpicklingError: the STRING opcode argument must be quoted` when loading word_data.pkl. The likely cause is line endings: the file is a protocol-0 (plain-text) pickle, and converting its LF line endings to CRLF (e.g. on a Windows checkout) corrupts the quoted STRING opcode arguments.

I am using Python 3.10.4.

I fixed it by adding a dos2unix.py file in tools (code from Stack Overflow) to normalise the line endings, and by changing the email_preprocesses.py script to use pickle instead of joblib; after that there are no more errors. Credit also goes to hat20 and vkaushik189, who identified this solution as well. Note that pickle is part of the Python standard library, so nothing extra needs to be installed. Can this change please be integrated into the files?
Thank you

This is the email_preprocesses.py file
```python
#!/usr/bin/python

import pickle

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif 

def preprocess(words_file = "../tools/word_data_unix.pkl", authors_file="../tools/email_authors.pkl"):
    """ 
        this function takes a pre-made list of email texts (by default word_data.pkl)
        and the corresponding authors (by default email_authors.pkl) and performs
        a number of preprocessing steps:
            -- splits into training/testing sets (10% testing)
            -- vectorizes into tfidf matrix
            -- selects/keeps most helpful features

        after this, the feaures and labels are put into numpy arrays, which play nice with sklearn functions

        4 objects are returned:
            -- training/testing features
            -- training/testing labels

    """

    ### the words (features) and authors (labels), already largely preprocessed
    ### this preprocessing will be repeated in the text learning mini-project
    with open(authors_file, "rb") as authors_file_handler:
        authors = pickle.load(authors_file_handler)

    with open(words_file, "rb") as words_file_handler:
        word_data = pickle.load(words_file_handler)

    ### test_size is the percentage of events assigned to the test set
    ### (remainder go into training)
    features_train, features_test, labels_train, labels_test = train_test_split(word_data, authors, test_size=0.1, random_state=42)



    ### text vectorization--go from strings to lists of numbers
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                                 stop_words='english')
    features_train_transformed = vectorizer.fit_transform(features_train)
    features_test_transformed  = vectorizer.transform(features_test)



    ### feature selection, because text is super high dimensional and 
    ### can be really computationally chewy as a result
    selector = SelectPercentile(f_classif, percentile=10)
    selector.fit(features_train_transformed, labels_train)
    features_train_transformed = selector.transform(features_train_transformed).toarray()
    features_test_transformed  = selector.transform(features_test_transformed).toarray()

    ### info on the data
    print("no. of Chris training emails:", sum(labels_train))
    print("no. of Sara training emails:", len(labels_train)-sum(labels_train))

    return features_train_transformed, features_test_transformed, labels_train, labels_test

```
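
For reference, here is a minimal usage sketch (assuming the module above is saved as tools/email_preprocesses.py and that word_data_unix.pkl has already been produced by the conversion script below):

```python
# Hypothetical usage -- run from a mini-project directory such as naive_bayes/,
# after word_data_unix.pkl has been generated in tools/ (see dos2unix.py below).
import sys
sys.path.append("../tools/")
from email_preprocesses import preprocess

features_train, features_test, labels_train, labels_test = preprocess()
print(features_train.shape, features_test.shape)
```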

This is the dos2unix.py file
```python
#!/usr/bin/env python
"""
convert dos linefeeds (crlf) to unix (lf)
usage: python dos2unix.py
"""

original = 'word_data.pkl'
destination = 'word_data_unix.pkl'

# read the whole file as raw bytes so the pickle payload is untouched
with open(original, 'rb') as infile:
    content = infile.read()

# rewrite every line with a bare \n, dropping any \r from CRLF endings
outsize = 0
with open(destination, 'wb') as output:
    for line in content.splitlines():
        outsize += len(line) + 1
        output.write(line + b'\n')

print("Done. Saved %s bytes." % (len(content) - outsize))

```
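
Run the conversion once from the tools/ directory before any of the mini-project scripts load the data. As a quick sanity check (a minimal sketch, run from tools/ after the conversion), the new file should unpickle cleanly:

```python
# Sanity check (hypothetical): loading the converted file should no longer
# raise "UnpicklingError: the STRING opcode argument must be quoted".
import pickle

with open("word_data_unix.pkl", "rb") as f:
    word_data = pickle.load(f)

print("loaded %d documents" % len(word_data))
```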
