Pandas

40+ basic pandas questions/answers

Go through 10 questions per day, and in less than a week you will have pretty good grasp of Pandas module.

pandas stands for “Panel Data”
http://pandas.pydata.org/
It is a python module which is great for analytical calculations.

QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
a) How to test the type of the variable 
   (is it an int? float? str?, list? dict? etc.) 
b) How to test type of a column in a DataFrame 
c) How to list types of all columns in a dataFrame

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

a) 
def print_type(obj):
    if type(obj) == int:
        print "int"
    elif type(obj) == float:
        print "float"
    elif type(obj) == str:
        print "str"
    else:
        print "unknown type"

(b)
aa = ddd()
aa.f1.dtype

c)
aa.dtypes

QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
How to take a value from a particular cell of a dataframe
    - using df[][]
    - using df.ix

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

aa['i2'][1]
aa.ix[1,'i2']

QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
Extract a row from a DataFrame into a regular python list

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

aa.ix[1].tolist()
aa.ix[len(aa)-1].tolist()
OR
aa.ix[1,:].values.tolist()
list(aa.ix[1,:])

QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
Extract a column or a row from a DataFrame into a regular python list
Hint - use .ix to convert col/row into a Series

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

list(aa.ix[:,'i2'])
aa.ix[2].tolist()

# for col you can also do:
aa['i2'].tolist() 
aa['i2'].values.tolist()
list(aa['i2'])

QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
a) How to find rows which has the same value in a particular column ?
b) How to use value_counts()  ?
c) How to count number of times each unique value appears in group, and for multiple columns? 
d) what is np.unique - and how to use it?
e) Procedure to extract duplicate rows (by one or more columns)

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

a) 
aa.duplicated(['f3'])  # creates true/false mask

b)
aa['f3'].value_counts()
Returns Series containing counts of unique values (excluding NaN)
in descending order (the first element is the most frequently occuring).

c)
source = DataFrame([
  ['amazon.com',  'correct',   'correct'  ], 
  ['amazon.com',  'incorrect', 'correct'  ], 
  ['walmart.com', 'incorrect', 'correct'  ], 
  ['walmart.com', 'incorrect', 'incorrect']
], columns=['domain', 'price', 'product'])
source.groupby('domain').apply(lambda x: x[['price','product']].apply(lambda y: y.value_counts())).fillna(0)

d)
np.unique is a Numpy function which shows unique values in a column,
    and where they were found first time.
  
e)
def show_duplicates(df, cols=[], include_nulls=True):
    """
    # accepts a dataframe df and a column (or list of columns)
    # if list of columns is not provided - uses all df columns
    # returns a dataframe consisting of rows of df
    # which have duplicate values in "cols"
    # sorted by "cols" so that duplciates are next to each other
    # Note - doesn't change index values of rows
    """
    # ---------------------------------
    aa = df.copy()
    mycols = cols
    # ---------------------------------
    if len(mycols) <= 0:
        mycols = aa.columns.tolist()
    elif type(mycols) != list:
        mycols = list(mycols)
    # ---------------------------------
    if not include_nulls:
        mask = False
        for mycol in mycols:
            mask = mask | (aa[mycol] != aa[mycol])  # test for null values
        aa = aa[~mask]                              # remove rows with nulls in mycols
    if len(aa) <= 0:
        return aa[:0]
    # ---------------------------------
    # duplicated() method returns Boolean Series denoting duplicate rows
    mask = aa.duplicated(cols=mycols, take_last=False).values \
         | aa.duplicated(cols=mycols, take_last=True).values
    aa = aa[mask]
    if len(aa) <= 0:
        return aa[:0]
    # ---------------------------------
    # sorting to keep duplicates together
    # Attention - can not sort by nulls
    # bb contains mycols except for cols which are completely nulls
    bb = aa[mycols]
    bb = bb.dropna(how='all',axis=1)
    # sort aa by columns in bb (thus avoiding nulls)
    aa = aa.sort_index(by=bb.columns.tolist())
    # ---------------------------------
    # sorting skips nulls thus messing up the order. 
    # Let's put nulls at the end
    mask = False
    for mycol in mycols:
        mask = mask | (aa[mycol] != aa[mycol])  # test for null values
    aa1 = aa[~mask]
    aa2 = aa[mask]
    aa = aa1.append(aa2)

    return aa


QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
How to append a list of data to a dataframe
How to append a series as a row to a database

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
aa = ddd()               # create test DataFrame
bb = aa.ix[1].tolist()   # take 2nd row as a list

# append list
aa = aa.append(DataFrame([bb], columns=aa.columns))

# append Series by converting Series to a list
aa = aa.append(DataFrame([ss.tolist()], columns=aa.columns))

# alternatively you can make 1-column dataframe - and transpose it
bb=DataFrame(ss)
bb.index = aa.columns
aa.append(bb.T)

QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
How to remove duplicate rows 
(duplicate is defined as having same value(s) in a list of columns)

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

aa = aa.drop_duplicates(['i2']) 
or
aa = aa.drop_duplicates(['i2'], take_last=True) 

QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
How to use a mask using   &, |, ~, .isin(), .isnull()

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

mask = aa.f2.isnull()
aa[mask]

mask = aa.i1.isin([1,3,5])
aa[mask]
aa[~mask] # shows the records where mask is False

mask = (aa.id==1) & (aa.i2 == 4)
mask = (aa.id==1) | (aa.i2 == 4)

QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
Give an example of using a map() function
on a pandas DataFrame column

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

aa['yy'] = aa.yy.map(int)

QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
Give example using map with lambda for dataframe operations

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

aa['s2'] = aa.ss + '__' + aa.i1.map(lambda x: str(x))

# make a list of values in column 'yy' rounded to 2 digits after dot
aa['yy'].map(lambda x: round(x,2)).tolist()

QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
Give example using  groupby().sum()

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

cc = aa.groupby(['i2'], as_index=False).sum()

# note - groupby().sum() will usually remove all 
# string columns from the result. To avoid it, you can
# use agg():

cc = aa.groupby('i2', as_index=False).agg({'i1':np.sum,'ss':np.max})

QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
Give example using groupby().aggregate()

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

bb = aa.groupby('i2', as_index=True).aggregate({'yy':np.sum, 'xx':np.max}) 

QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
How to sort a dataframe by a list of columns

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

cc = aa.sort_index(by=['i2','yy'])

QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
How to delete some rows from dataframe - and reindex.

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

aa = aa.drop([3,4])

mask = aa['id'].map(lambda x: x > 3)
aa = aa[~mask]

aa.reindex() # doesn't change index
             # unless you provide it 
aa.index = range(len(aa))

mask = aa['id'].map(lambda x: x in (0,1,4))
aa = aa[~mask]

QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
How to add rows to a dataframe (add 2 dataframes together vertically)

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

aa = aa.append(bb,ignore_index=True)

QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
How to write dataframe to csv file, and how to read it back

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

aa.to_csv('data.csv',sep='|',header=True, index=False)
bb = read_csv('data.csv', sep='|')

QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
  - how to add columns to pandas DataFrame
  - how to calculate column values from numeric/string values in other columns.
  - how to delete one or more columns

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

df['c4']=None         # populate with same value
col2=[1,2,3,4,5,6,7]
df['c4']=col2         # list becomes a column
df['c4'] = "-"
df['c2'] = 0

# adding a column - and populating it using vectorized operation on columns
df['c5']= 2*df['c1'] + 3*df['c2'] + 5

# calculating column values from other columns:
df['c4']= 2*df['c1'] + 3*df['c2'] + 5
aa['s2'] = aa.ss + '__' + aa.i1.map(lambda x: str(x))

# Deleting one column
del ff['s5']

# Deleting many columns
ff = ff.drop(['c1','c2',c3'], axis=1)  

QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
How to calculate a pandas DataFrame column 
as a linear combination of some other columns

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

ff['c4']= 2*ff['i1'] + 3*ff['i2'] + 5 

QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
How to calculate a DataFrame column 
from several other columns while using str() and int().
Hint - use map(lambda ..)

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

ff['s3'] = ff['yy'].map(lambda x: int(x)) 
ff['s4']= '>>>' + ff.s3.map(lambda x: str(x)) + '<<<'

QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
String operations on columns
How to define a mask using a regular expression

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

mask = aa.ss.map(lambda x: True if re.search(r's[1,3]',str(x)) else False)

QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
How to define a mask using regex on one column, and numeric comparison on the other column

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

mask = (aa.i2.map(lambda x: True if re.search(r'4',str(x)) else False)) & (aa.xx > 2)
mask = ( df.a == 1) & (df.b == 2)
mask = ( df.a == 1) | (df.b == 2)
mask = ( df.a == 1) | df.b.isin([1,2,3])
mask = ( df.a == 1) | df.b.map(lambda x: ......)
mask = ( df.a == 1) | df.b.map(lambda x: ......)  |  df.c.map(lambda x: ......) 

QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
How to change the order of columns in a dataframe

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

aa = DataFrame({'a':range(3),'b':range(3),'c':range(3)})
col_list_ordered = ['a','b','c']
aa = aa[col_list_ordered]
# or (notice double-brackets)
aa = aa[['a','b','c']]

QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
How to check if a dataframe has a column with a particular name

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

    if 'i2' in aa.columns:
        print "true"

QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
select rows of a pandas DataFrame which have null values (in any column)

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

def rows_with_nulls(df):
    mask=False
    for col in df.columns: mask = mask | df[col].isnull()
    return df[mask]

QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
a) how to substitute null values in a column ? 
b) in the whole dataframe ?

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

    aa.ss.fillna(0.0)
    aa.ss.fillna(0)
    aa.ss.fillna('-')

    aa.fillna(0)

QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
Can an integer column in pandas DataFrame have a NaN value?

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

No. Float column can.

QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
How to convert value type of a column to int64 or float64 ?

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

    aa.mycol = aa.mycol.astype(np.float64)
    aa.mycol = aa.mycol.astype(np.int64)

QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
Take first several rows for a dataframe (regardless of index)
Take last several rows of a dataframe (regardless of index)
Take group of rows in the middle (regardless of index)

Take one row as a list (first / last / middle)

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

# first rows as dataframe:
aa.head()
aa.head(1)
aa[:5]
aa[:1]

# last rows rows as dataframe:
aa.tail()
aa.tail(1)
aa[-5:]
aa[-1:]

# rows in the middle as dataframe
aa[2:4]

# Take one row as a list (first / last / middle)
# for this we use DF.ix[] construct, because for 1 row it returns a Series
aa.ix[aa.index[0]].tolist()
aa.ix[aa.index[-1]].tolist()
aa.ix[aa.index[3]].tolist()

QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
Take first row of a DataFrame as a list or dict

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

aa.ix[aa.index[0]].tolist()
aa.ix[aa.index[0]].to_dict()

QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
How to combine data of two pandas DataFrames ?

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

merge(df1,df2, on=[..], how='inner')  # like joining 2 tables in SQL, inner,outer,left
aa.append(bb,ignore_index=True)
concat([s1,s2,s3])      - stacks together objects along an axis (vertically)
concat([aa,bb],axis=1) - stacking horizontally
df.combine_first() - splices together overlapping data to fill missing values

QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
pandas stack()/unstack() functions
grouping by mask

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

xx = pandas.DataFrame([["Jan","name1",1,2,3],   ["Jan","name2",4,5,6],
                       ["Mar","name1",11,12,13],["Mar","name2",14,15,16]],
                       columns=["Month","name","c1","c2","c3"])

wide = xx.set_index(["Month","name"]).stack(1).unstack('Month')

Month     Jan  Mar
name              
name1 c1    1   11
      c2    2   12
      c3    3   13
name2 c1    4   14
      c2    5   15
      c3    6   16

mask = wide.Jan > 3
wide.groupby(by=mask).sum()

Month  Jan  Mar
Jan            
False    6   36
True    15   45

QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
pandas DataFrame - pivot()

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

aa = DataFrame({
   'foo':3*['one'] + 3*['two'],
   'bar':2*['A','B','C'],
   'baz':[1,2,3,4,5,6] })
aa = aa[['foo','bar','baz']]

#       foo bar  baz
#    0  one   A    1
#    1  one   B    2
#    2  one   C    3
#    3  two   A    4
#    4  two   B    5
#    5  two   C    6

xx.pivot('foo','bar','baz')
# or
xx.pivot('foo', 'bar')['baz']

#    bar  A  B  C
#    foo         
#    one  1  2  3
#    two  4  5  6

QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
pandas DataFrame - create and populate with data
    from dict of columns,
    from list
    from numpy array, 
    from list of serieses
    from list of list

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

df = DataFrame({'x'  : 3 * ['a'] + 2 * ['b'],
                'nn' : np.arange(5, dtype=np.float64), 
                'y'  : np.random.normal(size=5),
                'z'  : range(5)})

df = DataFrame([[1,2,3]], columns=['A','B','C'])

df = DataFrame(np.arange(12).reshape((3,4)),
               index = ['A','B','C'],
               columns = ['AA','BB','CC','DD'])

nrows  = 10
ncols  =  5
mydata = np.random.rand(nrows, ncols)
#mydata = np.random.randn(nrows, ncols)
aa = DataFrame(data=mydata)
aa = DataFrame(data=mydata, 
               index=range(nrows), 
               columns=[chr(65+x)*2 for x in range(ncols)])

aa = DataFrame( np.random.normal(size=12).reshape((3,4)), 
                index = ['A','B','C'], 
                columns = ['AA','BB','CC','DD'])

s1 = Series({'x':1,'y':2})
s2 = Series({'x':3,'y':4})
aa = DataFrame([s1,s2])   # s1 and s2 - rows

mydata = [[1,2],[3,4],[5,6]]
aa = DataFrame(mydata, columns=['AA','BB'])

QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
How to generate random numbers to populate DataFrame

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

np.random.randn(rows, cols)
np.random.rand(rows, cols)
np.random.normal(size=25).reshape(5,5)
np.random.normal(loc=0.0, scale=1.0, size=None)
np.random.<TAB>
np.random.seed(int)

QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
pandas DataFrame - transform a row using flexible 
    custom function operating on all columns.

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

aa = DataFrame({'aa':range(5), 'bb':range(5)})

def fn(df):
    dd = df.ix[df.index[0]].to_dict()
    df['lev'] = str(dd['aa']**2) + '_' + str(dd['bb']**2)
    return df

bb = aa.groupby(aa.index,as_index=False).apply(fn)
print bb
   aa  bb    lev
0   0   0    0_0
1   1   1    1_1
2   2   2    4_4
3   3   3    9_9
4   4   4  16_16

QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
how to loop through rows of pandas DataFrame?

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

aa=ddd()
for rr in aa.itertuples(): print rr
for rr in aa.itertuples(index=False): print rr

DataFrame.iterrows()  - a generator
    yields tuple (index, row_as_Series)
    Slow, because needs to create a Series from each row.
  
for row in df.iterrows():
    ind = row[0]
    ser = row[1]
    cols = list(ser.index)
    vals = list(ser)
    print "ind = ",ind,", vals = ",vals

for row in df.iterrows():   print row[1].values
for row in df.iterrows():   print list(row[1].values)

# Another way:
for ii in df.index: do_something(df.ix[ii])

QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
String operations on columns

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
aa = DataFrame({
    'ss':['aa1','bb2','cc3',np.nan],
    'ff':['  11.11  ','  22.22  ','  33.33  ','  44.44  '],
    'ii':['  11     ','  22     ','  33     ','  44     ']})

# use str method on Series/Column:
aa.ss.str.<TAB>
aa.ss.str.contains('bb')
aa.ss.str[:2]   # first 2 characters
aa.ss.str.upper()
aa.ss.str.len()
# etc.

# using regular expression
mask = aa.ss.map(lambda x: True if re.search(r's[1,3]',str(x)) else False)

# convert from string to number and back
aa['zz'] = aa.ff.astype(np.int64) # error
aa['zz'] = aa.ff.map(lambda x: int(float(x)) if x==x else np.nan).astype(np.int64)
aa['zz'] = aa.ii.astype(np.int64) # works
print aa.zz.dtype
aa['zz'] = aa.ff.astype(np.float64) # works
aa['zz'] = aa.ii.astype(np.float64) # works
print aa.zz.dtype
aa['zz'] = aa.ii.str.strip().astype(float)
print aa.zz.dtype
aa['zz'] = aa.ii.str.strip().str[0].astype(int)
print aa.zz.dtype
aa['zz'] = aa.ff.astype(object)  # use this to work with a string
print aa.zz.dtype
aa['zz'] = aa.ff.astype(str)     # silent error
print aa.zz.dtype

# mask = aa.ss.str.match(r'1|3') - doesn't work yet
# mask = aa.ss.get(label) # ?? 

QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
Making histogram (cutting data into bins)

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

aa=np.random.normal(size=100)
aa
bins=[-3, -2.5, -2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2, 2.5, 3]
vals = pandas.cut(aa,bins)
pandas.value_counts(vals)

# also look at pandas.qcut

QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
How to get a short summary of data in the numeric DataFrame?
How to remove outliers (values which are too big or small)

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

aa = ddd()
print aa.describe()

np.random.seed(12345)
data = DataFrame(np.random.randn(1000, 4), columns=['A','B','C','D'])
data
data.describe()
data.describe().index  # [count, mean, std, min, 25%, 50%, 75%, max]

# look at outliers (numbers more than 3)
print data[(np.abs(data) > 3).any(1)]

# remove outliers
data[(np.abs(data) > 3)] = np.sign(data) * 3

QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
How to search and replace values in a column in a DataFrame

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

# most common way is in 2 steps:
#   step1 - create a mask to identify rows in hwich we need to do change
#   step2 - do the change to the column using the mask

# alternative may be to use replace based on value(s)
aa.ss.replace(list_of_values, replacing_value)
#or
aa.ss.replace({val1:repl1, val2:repl2, etc.})

QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
How to combine a list of sets into one sorted list

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

gg = [{1,2,3},{2,3,4},{3,4,5}]
print sorted(set.union(*gg))

##=======================================
##df.groupby(...).size()
##=======================================
##for name, group in df.groupby(..): ...
##=======================================
##dict(list(df.groupby(...)
##=======================================
##what's the difference between quantiles and buckets?
##median as a quantile
##=======================================
## difference between agg and apply
## df.groupby(...).agg(fn) - fn to work on an array
## df.groupby(...).apply(fn) - fn to work on a DataFrame
## you can provide args to fn in apply:
##  .apply(fn, args)
## ser.apply(fn) - apply can be use on a Series - but I prefer map for this 
##=======================================
##OLS = Ordinary Least Squares
##=======================================
##pivot vs crosstab