Update files to reflect unpickling error fix request #351

Open
mahasch opened this issue Jul 31, 2022 · 0 comments
mahasch commented Jul 31, 2022

Hi,
I ran into the very common `UnpicklingError: the STRING opcode argument must be quoted` when loading word_data.pkl. The likely cause is line endings: the file is a protocol-0 (plain-text) pickle, and converting its LF line endings to CRLF (e.g. on a Windows checkout) corrupts the quoted STRING opcode arguments.

I am using Python 3.10.4.

I fixed it by adding a dos2unix.py file in tools (code from Stack Overflow) to normalise the line endings, and by changing the email_preprocesses.py script to use pickle instead of joblib; after that there are no more errors. Credit also goes to hat20 and vkaushik189, who identified this solution as well. Note that pickle is part of the Python standard library, so nothing extra needs to be installed. Can this change please be integrated into the files?
Thank you

This is the email_preprocesses.py file
```python
#!/usr/bin/python

import pickle

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif 

def preprocess(words_file = "../tools/word_data_unix.pkl", authors_file="../tools/email_authors.pkl"):
    """ 
        this function takes a pre-made list of email texts (by default word_data.pkl)
        and the corresponding authors (by default email_authors.pkl) and performs
        a number of preprocessing steps:
            -- splits into training/testing sets (10% testing)
            -- vectorizes into tfidf matrix
            -- selects/keeps most helpful features

        after this, the feaures and labels are put into numpy arrays, which play nice with sklearn functions

        4 objects are returned:
            -- training/testing features
            -- training/testing labels

    """

    ### the words (features) and authors (labels), already largely preprocessed
    ### this preprocessing will be repeated in the text learning mini-project
    with open(authors_file, "rb") as authors_file_handler:
        authors = pickle.load(authors_file_handler)

    with open(words_file, "rb") as words_file_handler:
        word_data = pickle.load(words_file_handler)

    ### test_size is the percentage of events assigned to the test set
    ### (remainder go into training)
    features_train, features_test, labels_train, labels_test = train_test_split(word_data, authors, test_size=0.1, random_state=42)



    ### text vectorization--go from strings to lists of numbers
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                                 stop_words='english')
    features_train_transformed = vectorizer.fit_transform(features_train)
    features_test_transformed  = vectorizer.transform(features_test)



    ### feature selection, because text is super high dimensional and 
    ### can be really computationally chewy as a result
    selector = SelectPercentile(f_classif, percentile=10)
    selector.fit(features_train_transformed, labels_train)
    features_train_transformed = selector.transform(features_train_transformed).toarray()
    features_test_transformed  = selector.transform(features_test_transformed).toarray()

    ### info on the data
    print("no. of Chris training emails:", sum(labels_train))
    print("no. of Sara training emails:", len(labels_train)-sum(labels_train))

    return features_train_transformed, features_test_transformed, labels_train, labels_test

```
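
For reference, here is a minimal usage sketch (assuming the module above is saved as tools/email_preprocesses.py and that word_data_unix.pkl has already been produced by the conversion script below):

```python
# Hypothetical usage -- run from a mini-project directory such as naive_bayes/,
# after word_data_unix.pkl has been generated in tools/ (see dos2unix.py below).
import sys
sys.path.append("../tools/")
from email_preprocesses import preprocess

features_train, features_test, labels_train, labels_test = preprocess()
print(features_train.shape, features_test.shape)
```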

This is the dos2unix.py file
```python
#!/usr/bin/env python
"""
convert dos linefeeds (crlf) to unix (lf)
usage: python dos2unix.py
"""

original = 'word_data.pkl'
destination = 'word_data_unix.pkl'

# read the whole file as raw bytes so the pickle payload is untouched
with open(original, 'rb') as infile:
    content = infile.read()

# rewrite every line with a bare \n, dropping any \r from CRLF endings
outsize = 0
with open(destination, 'wb') as output:
    for line in content.splitlines():
        outsize += len(line) + 1
        output.write(line + b'\n')

print("Done. Saved %s bytes." % (len(content) - outsize))

```
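
Run the conversion once from the tools/ directory before any of the mini-project scripts load the data. As a quick sanity check (a minimal sketch, run from tools/ after the conversion), the new file should unpickle cleanly:

```python
# Sanity check (hypothetical): loading the converted file should no longer
# raise "UnpicklingError: the STRING opcode argument must be quoted".
import pickle

with open("word_data_unix.pkl", "rb") as f:
    word_data = pickle.load(f)

print("loaded %d documents" % len(word_data))
```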
