Python pandas quick guide Shiu-Tang Li May 12, 2016 Contents 1 Dataframe initialization / outputs 1.1 Load csv files into dataframe. . . . 1.2 Initialize a dataframe . . . . . . . . 1.3 Create a new column . . . . . . . . 1.4 Output a dataframe to csv . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 3 3 3 3 2 Take a quick glance of a dataframe 2.1 Print the data frame . . . . . . . . . . . . . . . . . . . 2.2 Get the description of numerical columns . . . . . . . 2.3 Get the dimension . . . . . . . . . . . . . . . . . . . . 2.4 Get the data type / get the filtered data by data type 2.5 Get the unique elements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 4 4 4 4 4 3 Select data from a dataframe 3.1 Get column names . . . . . . . . . . . . . . . 3.2 Select a specific column . . . . . . . . . . . . 3.3 Select the sub-dataframe of a few columns . . 3.4 Select rows with restrictions on columns . . . 3.5 Select rows with row index . . . . . . . . . . 3.6 Select row index with max values in a specific 3.7 Select given entry . . . . . . . . . . . . . . . . 3.8 Iterate rows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . column . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 5 5 5 5 5 5 5 6 4 Revise data in a dataframe 4.1 Revise data in a particular entry 4.2 Reindex rows . . . . . . . . . . . 4.3 Reindex one row . . . . . . . . . 4.4 Rename columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 7 7 7 7 . . . . . . . . . . . . 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.5 4.6 4.7 4.8 4.9 Drop columns / rows . . . . . . Find / drop / fill missing values Data frame transpose . . . . . Change types of a column . . . Merge data frames . . . . . . . . . . . . . . . . . . . . . . 7 8 8 8 8 5 Search key words in a dataframe 5.1 Exact match in target column . . . . . . . . . . . . . . . . . . . . . . . . . . 9 9 6 Perform operations on a dataframe 6.1 Sort dataframe . . . . . . . . . . . 6.2 Rearrange dataframe - pivot table 6.3 Grouping . . . . . . . . . . . . . . 6.4 ‘Apply’ function . . . . . . . . . . . . . . . . . . . 7 Others . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 10 10 10 10 11 2 1 Dataframe initialization / outputs 1.1 1 2 3 4 import pandas d a t a f r a m e = pandas . r e a d c s v ( ”C : / U s e r s / Shiu−Tang L i / . . . c s v ” , encoding = ” ISO−8859−1” ) # encoding : t o d e a l with u n i c o d e s 1.2 1 2 3 4 5 1 2 3 4 2 import pandas as pd l i s t o f d i c t s = [{ ’ column1 ’ : 3 , ’ column2 ’ : 4 } , { ’ column1 ’ : 7 , ’ column2 ’ : 2 } , { ’ column1 ’ : 6 , ’ column2 ’ : 8 } ] d a t a f r a m e = pd . DataFrame ( l i s t o f d i c t s , index=[ ’ index1 ’ , ’ index2 ’ ] ) # c o n s t r u c t data frame from l i s t o f d i c t i o n a r i e s 2 Create a new column d a t a f r a m e [ ’ new column ’ ] = L i s t OR S e r i e s # w i l l g e t warning message 1.4 1 Initialize a dataframe import pandas as pd df1 = pd . DataFrame = ({ ’ c1 ’ : [ ’ 1 ’ , ’ 2 ’ , ’ 3 ’ , ’ 4 ’ ] , ’ c2 ’ : [ ’ 5 ’ , ’ 6 ’ , ’ 7 ’ , ’ 8 ’ ] } ) df2 = pd . DataFrame ({ ’ c1 ’ : 2 , ’ c2 ’ : np . a r r a y ( [ 0 ] ∗ 100 , dtype= ’ i n t 3 2 ’ ) , ’ c3 ’ : ’ h e l l o ’ }) 1.3 1 Load csv files into dataframe. Output a dataframe to csv import pandas d a t a f r a m e . t o c s v ( ”C : / U s e r s / Shiu−Tang L i / . . . c s v ” ) Remark. May load .csv as list of lists instead of data frames. 3 2 2.1 1 2 3 4 2 3 2 3 4 5 6 2 3 4 Get the data type / get the filtered data by data type types = data frame . dtypes # t y p e s i s a S e r i e s l a b e l e d by column names , showing t h e data t y p e f o r each column . i n t e g e r i n d e x = t y p e s [ t y p e s == i n t 6 4 ] . index # t y p e s [ t y p e s == i n t 6 4 ] i s a f i l t e r e d S e r i e s with i n t e g e r v a l u e s . p r i n t ( data frame [ i n t e g e r i n d e x ]) # t h i s g i v e s us t h e f i l t e r e d data frame with o n l y i n t 6 4 v a l u e s . 2.5 1 Get the dimension dim = d a t a f r a m e . shape number of rows = dim [0 ] number of columns = dim [ 1] 2.4 1 Get the description of numerical columns p r i n t ( data frame . describe () ) 2.3 1 Print the data frame p r i n t ( d a t a f r a m e . head ( 5 ) ) # p r i n t f i r s t 5 rows p r i n t ( data frame ) # p r i n t t h e data frame , dimension i n f o r m a t i o n i s a l s o a t t a c h e d 2.2 1 Take a quick glance of a dataframe Get the unique elements p r i n t ( d a t a f r a m e [ ’ column name ’ ] . unique ( ) ) # w i l l r e t u r n a l i s t showing d i s t i n c t elements i n t h e column p r i n t ( d a t a f r a m e [ ’ column name ’ ] . v a l u e c o u n t s ( ) ) # w i l l r e t u r n a t a b l e showing t h e c o u n t s i n t h e column 4 3 3.1 1 2 3 4 2 3 4 2 3 4 5 2 3 4 Select rows with restrictions on columns d a t a f r a m e [ d a t a f r a m e [ ’ column name ’ ] == s o m e v a l u e s ] 3.5 1 Select the sub-dataframe of a few columns d a t a f r a m e 2 = d a t a f r a m e [ [ ’ column name1 ’ , ’ column name2 ’ ] ] d a t a f r a m e 2 = d a t a f r a m e [ d a t a f r a m e . column [ 0 : 2 ] ] # s e l e c t t h e f i r s t two columns i n two d i f f e r e n t ways d a t a f r a m e 2 = d a t a f r a m e [ ’ column name x ’ : ’ column name y ’ ] # s e l e c t t h e columns between t h e two columns 3.4 1 Select a specific column column = d a t a f r a m e [ ’ column name ’ ] # column i s a [ S e r i e s ] o b j e c t , c o n t a i n s row index + v a l u e s , both a r e l i s t s c o l u m n v a l u e s = column . v a l u e s column index = column . index 3.3 1 Get column names p r i n t ( d a t a f r a m e . columns ) # d a t a f r a m e . columns i s a l i s t o f s t r i n g s f i r s t c o l u m n = d a t a f r a m e . columns [ 0] # p r i n t t h e f i s t column , which i s a s t r i n g 3.2 1 Select data from a dataframe Select rows with row index data frame . i l o c [ i ] #i : row index data frame . i l o c [0:3] # s e l e c t t h e rows with i n d i c e s 0 ,1 ,2 Remark. The difference between loc and iloc: If the index of the dataframe is 3, 7, 0, 2, . . ., iloc[0] will select the third row (true integer index), loc will select the 1st row (index by locations). 3.6 1 2 Select row index with max values in a specific column d a t a f r a m e [ ’ column name ’ ] . idxmax ( ) # r e t u r n s t h e 1 s t row index t h a t has max 3.7 Select given entry 5 1 2 3 4 5 6 # Approach 1 : i : t r u e row index d a t a f r a m e . i l o c [ i ] [ ’ column name ’ ] # Approach 2 : i : l o c a t i o n d a t a f r a m e [ ’ column name ’ ] . v a l u e s [ i ] # Approach 3 : i : t r u e row index d a t a f r a m e . i x [ i , ’ column name ’ ] 3.8 1 2 Iterate rows f o r i , row i n d a t a f r a m e . i t e r r o w s ( ) : # i : row i n d i c e s ; row : each row 6 4 4.1 1 2 3 4 5 6 7 8 9 2 3 4 5 6 2 3 4 5 6 2 3 4 2 3 4 5 6 Rename columns d a t a f r a m e . rename ( columns={ ’ column name1 ’ : ’ new column name1 ’ , ’ column name2 ’ : ’ new column name2 ’ } , i n p l a c e=True ) # rename a few columns : w i l l g e t warning message d a t a f r a m e . columns = [ ’ column name1 ’ , ’ column name2 ’ , . . . ] # rename a l l columns 4.5 1 Reindex one row o l d i n d i c e s = d a t a f r a m e . index new indices = o l d i n d i c e s . values f o r i , item i n enumerate ( n e w i n d i c e s ) : i f item == ’ o l d i n d e x t o b e c h a n g e d ’ : n e w i n d i c e s [ i ] = ’ new index ’ d a t a f r a m e . index = n e w i n d i c e s 4.4 1 Reindex rows d a t a f r a m e . index = [ index1 , index2 , . . . ] # r e p l a c e i n d i c e s with a new l i s t d a t a f r a m e = d a t a f r a m e . s e t i n d e x ( [ ’ column name ’ ] ) # indexed by a p a r t i c u l a r column d a t a f r a m e = d a t a f r a m e . r e s e t i n d e x ( drop=True ) # r e i n d e x e d from 0 . drop=F a l s e : make a dataframe column with t h e o l d index values . 4.3 1 Revise data in a particular entry # i : t r u e row index # Approach 1 ( w i l l g e t warning message ) : d a t a f r a m e . i x [ i , ’ column name ’ ] = new value # Approach 2 ( w i l l g e t warning message ) : d a t a f r a m e [ ’ column name ’ ] [ i ] = new value # Approach 3 : d a t a f r a m e . s e t v a l u e ( i , ’ column name ’ , new value ) # Approach 4 : d a t a f r a m e . a t [ i , ’ column name ’ ] = new value 4.2 1 Revise data in a dataframe Drop columns / rows d a t a f r a m e . drop ( ’ column name ’ , a x i s =1, i n p l a c e=True ) # drop a column d a t a f r a m e . drop ( [ ’ column name1 ’ , ’ column name2 ’ , . . . ] , a x i s =1, i n p l a c e=True ) # drop a few columns d a t a f r a m e . drop ( ’ row index1 ’ , a x i s =0, i n p l a c e=True ) # drop a row 7 7 8 d a t a f r a m e . drop ( [ ’ row index1 ’ , ’ row index2 ’ , . . . ] , a x i s =0, i n p l a c e=True ) # drop a few rows 4.6 1 2 3 4 5 6 7 8 9 10 11 12 13 import pandas as pd i s n u l l = pd . i s n u l l ( d a t a f r a m e [ ” column name ” ] ) # w i l l r e t u r n a t r u e / f a l s e S e r i e s . N u l l v a l u e = NaN . new data frame = d a t a f r a m e . dropna ( ) # drop a l l rows with m i s s i n g v a l u e s new data frame = d a t a f r a m e . dropna ( a x i s = 1) # drop a l l columns with m i s s i n g v a l u e s new data frame = d a t a f r a m e . dropna ( s u b s e t =[” column1 ” , ” column2 ” ] ) # drop a l l rows with m i s s i n g v a l u e s i n t h e two columns d a t a f r a m e [ ” column ” ] = d a t a f r a m e [ ” column ” ] . f i l l n a ( d a t a f r a m e [ ” column ” ] . median ( ) ) d a t a f r a m e [ ” column ” ] = d a t a f r a m e [ ” column ” ] . f i l l n a ( d a t a f r a m e [ ” column ” ] . mean ( ) ) d a t a f r a m e [ ” column ” ] = d a t a f r a m e [ ” column ” ] . f i l l n a ( something ) # f i l l NaN v a l u e s with column median / mean / or s t h e l s e 4.7 1 2 2 3 4 5 6 7 1 2 3 4 5 Change types of a column d a t a f r a m e [ ’ column name ’ ] = d a t a f r a m e [ ’ column name ’ ] . a s t y p e ( f l o a t ) # changing t y p e s t o f l o a t 4.9 1 Data frame transpose data frame . T 4.8 1 Find / drop / fill missing values Merge data frames import pandas as pd l i s t 1 = [{ ’ c1 ’ : 4 , ’ c2 ’ : 3 } , { ’ c1 ’ : 2 , ’ c2 ’ : 3}] l i s t 2 = [{ ’ c1 ’ : 6 , ’ c2 ’ : 9 } , { ’ c1 ’ : 8 , ’ c2 ’ : 1 0 } ] df1 = pd . DataFrame ( l i s t 1 , index =[0 ,1]) df2 = pd . DataFrame ( l i s t 2 , index=[ ’ two ’ , ’ t h r e e ’ ] ) d f = pd . c o n c a t ( [ df1 , df2 ] ) # combine two data frames with common columns , i n c r e a s e rows . index could c o n t a i n keys o f d i f f e r e n t data t y p e s import pandas as pd df1 = pd . DataFrame ({ ’ key ’ : [ ’ a ’ , ’ b ’ ] , ’ c1 ’ : [ 1 , 2 ] , ’ c2 ’ : [ 8 , 2]}) df2 = pd . DataFrame ({ ’ key ’ : [ ’ a ’ , ’ b ’ ] , ’ c3 ’ : [ 3 , 5]}) merged df = pd . merge ( df1 , df2 , on= ’ key ’ ) # combine two data frames with t h e same data i n t h e column ’ key ’ . Merged data frame w i l l c o n t a i n 4 columns . 8 5 5.1 1 2 3 Search key words in a dataframe Exact match in target column f o r i , row i n movies . i t e r r o w s ( ) : i f r e . s e a r c h ( ” Keywords ” , row [ ” column ” ] ) != None : print ( i ) 9 6 6.1 1 2 3 4 2 3 4 5 Sort dataframe d a t a f r a m e . s o r t ( ” column name ” , i n p l a c e=True , a s c e n d i n g=F a l s e ) # s o r t by column , i n d e c r e a s i n g o r d e r d a t a f r a m e . s o r t i n d e x ( i n p l a c e=True , a s c e n d i n g=F a l s e ) # s o r t by index , i n d e c r e a s i n g o r d e r 6.2 1 Perform operations on a dataframe Rearrange dataframe - pivot table import numpy as np new data frame = d a t a f r a m e . p i v o t t a b l e ( index=” column1 ” , v a l u e s=” column2 ” , aggfunc=np . mean) # For d i f f e r e n t column1 v a l u e s , f i n d t h e r e s u l t o f a g g f u n c t i o n a p p l i e d t o column2 . new data frame = d a t a f r a m e . p i v o t t a b l e ( index=” column1 ” , v a l u e s=” any column ” , aggfunc= ’ count ’ ) # count t h e f r e q u e n c y f o r column1 . Remark. Other choices of aggfunc: ’max’, ’min’, np.std, np.sum, np.median. 6.3 1 2 3 Grouping groups = d a t a f r a m e . groupby ( ” column1 ” ) t a b l e = groups . a g g r e g a t e ( np . mean) [ ” column2 ” ] # c l a s s i f y rows i n t o d i f f e r e n t groups based on ’ column1 ’ , f o r each group a p p l y np . mean ( ) on a l l v a l u e s i n ” column2 ” . 6.4 ‘Apply’ function r1 r2 r3 apply1 1 2 3 4 5 6 c1 ap1 c2 ap1 c3 ap1 apply2 ap2 ap2 ap2 apply1 = d a t a f r a m e . a p p l y ( s o m e f u n c t i o n ) OR apply1 = d a t a f r a m e . a p p l y ( lambda x : s o m e f u n c t i o n ( x ) ) # a p p l y f u n c t i o n t o columns , t o g e t a S e r e i s o b j e c t l a b e l e d column names apply2 = d a t a f r a m e . a p p l y ( s o m e f u n c t i o n , a x i s =1) OR apply2 = d a t a f r a m e . a p p l y ( lambda x : s o m e f u n c t i o n ( x ) , a x i s =1) # a p p l y f u n c t i o n t o rows , t o g e t a S e r e i s o b j e c t l a b e l e d row i n d i c e s 10 7 Others Series.tolist() converts Series to lists. numpy.nan: missing value 11