DOC: added string processing comparison with SAS #16497

Merged
Changes from 3 commits
136 changes: 136 additions & 0 deletions doc/source/comparison_with_sas.rst
@@ -357,6 +357,142 @@ takes a list of columns to sort by.
   tips = tips.sort_values(['sex', 'total_bill'])
   tips.head()

Contributor

can you add a directive here (and other sub-sections if you can), for references e.g. _compare_with_sas.string


String Processing
-----------------

Length
~~~~~~

SAS determines the length of a character string with the ``LENGTHN``
and ``LENGTHC`` functions. ``LENGTHN`` excludes trailing blanks and
``LENGTHC`` includes trailing blanks.

.. code-block:: none

   data _null_;
   set tips;
   put(LENGTHN(time));
   put(LENGTHC(time));
   run;

Python determines the length of a character string with the ``len`` function,
which pandas exposes for string columns as ``str.len``. ``len`` includes
trailing blanks. To exclude them, call ``rstrip`` before ``len``.

.. ipython:: python

   tips['time'].str.len().head()
   tips['time'].str.rstrip().str.len().head()
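
The difference between the two calls only shows up when the data actually
contains trailing blanks. The following is a minimal sketch on a small,
hypothetical ``Series`` (not the ``tips`` dataset) that makes the effect visible:

```python
import pandas as pd

# Hypothetical sample data; the trailing blanks in the first value are deliberate.
s = pd.Series(["Dinner  ", "Lunch"])

# Counts trailing blanks, comparable to LENGTHC.
with_blanks = s.str.len().tolist()

# Strips trailing blanks first, comparable to LENGTHN.
without_blanks = s.str.rstrip().str.len().tolist()

print(with_blanks)     # [8, 5]
print(without_blanks)  # [6, 5]
```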


Find
~~~~

SAS determines the position of a substring in a string with the
``FIND`` function. ``FIND`` takes the string defined by
the first argument and searches for the first position of the substring
you supply as the second argument. (The related ``FINDW`` function
matches only whole words bounded by delimiters, so it would not find
``'ale'`` inside ``Male``.)

.. code-block:: none

   data _null_;
   set tips;
   put(FIND(sex,'ale'));
   run;

Python determines the position of a substring in a string with the
``find`` function. ``find`` searches for the first occurrence of the
substring and, if it is found, returns its position. Keep in mind that
Python indexes are zero-based, so the result is one less than the
corresponding SAS position, and that the function returns -1 if it
fails to find the substring.

.. ipython:: python

   tips['sex'].str.find("ale").head()
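
To make the indexing difference concrete, here is a small sketch on
hypothetical sample values (not the ``tips`` dataset). Adding 1 to the
result of ``find`` yields one-based positions in which 0 means "not found",
which is the convention of SAS functions such as ``FIND`` and ``INDEX``:

```python
import pandas as pd

# Hypothetical sample data, not the tips dataset.
sex = pd.Series(["Male", "Female", "Other"])

# Zero-based position of the first match, or -1 when there is none.
zero_based = sex.str.find("ale")

# Adding 1 gives a one-based position in which 0 means "not found".
one_based = zero_based + 1

print(zero_based.tolist())  # [1, 3, -1]
print(one_based.tolist())   # [2, 4, 0]
```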


Substring
~~~~~~~~~

SAS extracts a substring from a string based on its position
with the ``SUBSTR`` function.

.. code-block:: none

   data _null_;
   set tips;
   put(substr(sex,1,1));
   run;

In Python, you can use ``[]`` slice notation to extract a substring
from a string by position. Keep in mind that Python indexes are
zero-based and that the end of a slice is exclusive, so
``substr(sex,1,1)`` corresponds to ``str[0:1]``.

.. ipython:: python

   tips['sex'].str[0:1].head()
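
The translation from a one-based position and a length to a zero-based slice
can be wrapped in a small function. ``sas_substr`` below is a hypothetical
helper written for illustration, not part of pandas, and the sample values
are made up:

```python
import pandas as pd


def sas_substr(series, position, length):
    """Hypothetical helper: mimic SUBSTR(string, position, length) with a slice."""
    start = position - 1  # shift the one-based position to zero-based
    return series.str[start:start + length]


sex = pd.Series(["Female", "Male"])  # hypothetical sample data

print(sas_substr(sex, 1, 1).tolist())  # ['F', 'M']
print(sas_substr(sex, 2, 3).tolist())  # ['ema', 'ale']
```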


Scan
~~~~

The SAS ``SCAN`` function returns the nth word from a string.
The first argument is the string you want to parse and the
second argument specifies which word you want to extract.

.. code-block:: none

   data firstlast;
   input String $60.;
   First_Name = scan(string, 1);
   Last_Name = scan(string, -1);
   datalines4;
   John Smith;
   Jane Cook;
   ;;;;
   run;

In pandas, a simple way to extract a word by position is to split
the string on a delimiter with ``str.split`` or ``str.rsplit`` and
select the piece you want. There are much more powerful approaches,
such as regular expressions, but this just shows a simple one.

.. ipython:: python

   firstlast = pd.DataFrame({'String': ['John Smith', 'Jane Cook']})
   firstlast['First_Name'] = firstlast['String'].str.split(" ", expand=True)[0]
   # rsplit splits from the right; with n=1, column 1 holds the last word
   firstlast['Last_Name'] = firstlast['String'].str.rsplit(" ", n=1, expand=True)[1]
   firstlast
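
The same idea extends to an arbitrary word position. The sketch below defines
``scan_word``, a hypothetical helper (not a pandas function), on made-up
sample data. It splits on whitespace only, which is an assumption of this
sketch; ``SCAN`` uses a larger default set of delimiters:

```python
import pandas as pd


def scan_word(series, n):
    """Hypothetical helper: return the nth whitespace-delimited word.

    Positive n counts from the left (1 is the first word) and negative n
    counts from the right (-1 is the last word).
    """
    if n == 0:
        raise ValueError("n must be nonzero")
    index = n - 1 if n > 0 else n  # one-based to zero-based for positive n
    return series.str.split().str[index]


names = pd.Series(["John Smith", "Jane Q Cook"])  # hypothetical sample data

print(scan_word(names, 1).tolist())   # ['John', 'Jane']
print(scan_word(names, 2).tolist())   # ['Smith', 'Q']
print(scan_word(names, -1).tolist())  # ['Smith', 'Cook']
```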


Upcase, Lowcase, and Propcase
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The SAS ``UPCASE``, ``LOWCASE``, and ``PROPCASE`` functions change
the case of the argument.

.. code-block:: none

   data firstlast;
   input String $60.;
   string_up = UPCASE(string);
   string_low = LOWCASE(string);
   string_prop = PROPCASE(string);
   datalines4;
   John Smith;
   Jane Cook;
   ;;;;
   run;

The equivalent Python functions are ``upper``, ``lower``, and ``title``.

.. ipython:: python

   firstlast = pd.DataFrame({'String': ['John Smith', 'Jane Cook']})
   firstlast['string_up'] = firstlast['String'].str.upper()
   firstlast['string_low'] = firstlast['String'].str.lower()
   firstlast['string_prop'] = firstlast['String'].str.title()


Merging
-------
