Subido por oriolandreua

python pandas

Anuncio
Python pandas quick guide
Shiu-Tang Li
May 12, 2016
Contents
1 Dataframe initialization / outputs
1.1 Load csv files into dataframe. . . .
1.2 Initialize a dataframe . . . . . . . .
1.3 Create a new column . . . . . . . .
1.4 Output a dataframe to csv . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
3
3
3
3
3
2 Take a quick glance of a dataframe
2.1 Print the data frame . . . . . . . . . . . . . . . . . . .
2.2 Get the description of numerical columns . . . . . . .
2.3 Get the dimension . . . . . . . . . . . . . . . . . . . .
2.4 Get the data type / get the filtered data by data type
2.5 Get the unique elements . . . . . . . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
4
4
4
4
4
4
3 Select data from a dataframe
3.1 Get column names . . . . . . . . . . . . . . .
3.2 Select a specific column . . . . . . . . . . . .
3.3 Select the sub-dataframe of a few columns . .
3.4 Select rows with restrictions on columns . . .
3.5 Select rows with row index . . . . . . . . . .
3.6 Select row index with max values in a specific
3.7 Select given entry . . . . . . . . . . . . . . . .
3.8 Iterate rows . . . . . . . . . . . . . . . . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
column
. . . . .
. . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
5
5
5
5
5
5
5
5
6
4 Revise data in a dataframe
4.1 Revise data in a particular entry
4.2 Reindex rows . . . . . . . . . . .
4.3 Reindex one row . . . . . . . . .
4.4 Rename columns . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
7
7
7
7
7
.
.
.
.
.
.
.
.
.
.
.
.
1
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
4.5
4.6
4.7
4.8
4.9
Drop columns / rows . . . . . .
Find / drop / fill missing values
Data frame transpose . . . . .
Change types of a column . . .
Merge data frames . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
7
8
8
8
8
5 Search key words in a dataframe
5.1 Exact match in target column . . . . . . . . . . . . . . . . . . . . . . . . . .
9
9
6 Perform operations on a dataframe
6.1 Sort dataframe . . . . . . . . . . .
6.2 Rearrange dataframe - pivot table
6.3 Grouping . . . . . . . . . . . . . .
6.4 ‘Apply’ function . . . . . . . . . .
.
.
.
.
.
.
.
.
.
7 Others
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
10
10
10
10
10
11
2
1
Dataframe initialization / outputs
1.1
1
2
3
4
import pandas
d a t a f r a m e = pandas . r e a d c s v ( ”C : / U s e r s / Shiu−Tang L i / . . . c s v ” ,
encoding = ” ISO−8859−1” )
# encoding : t o d e a l with u n i c o d e s
1.2
1
2
3
4
5
1
2
3
4
2
import pandas as pd
l i s t o f d i c t s = [{ ’ column1 ’ : 3 , ’ column2 ’ : 4 } , { ’ column1 ’ : 7 , ’ column2 ’ : 2 } , { ’ column1
’ : 6 , ’ column2 ’ : 8 } ]
d a t a f r a m e = pd . DataFrame ( l i s t o f d i c t s , index=[ ’ index1 ’ , ’ index2 ’ ] )
# c o n s t r u c t data frame from l i s t o f d i c t i o n a r i e s
2
Create a new column
d a t a f r a m e [ ’ new column ’ ] = L i s t OR S e r i e s
# w i l l g e t warning message
1.4
1
Initialize a dataframe
import pandas as pd
df1 = pd . DataFrame = ({ ’ c1 ’ : [ ’ 1 ’ , ’ 2 ’ , ’ 3 ’ , ’ 4 ’ ] , ’ c2 ’ : [ ’ 5 ’ , ’ 6 ’ , ’ 7 ’ , ’ 8 ’ ] } )
df2 = pd . DataFrame ({ ’ c1 ’ : 2 ,
’ c2 ’ : np . a r r a y ( [ 0 ] ∗ 100 , dtype= ’ i n t 3 2 ’ ) ,
’ c3 ’ : ’ h e l l o ’ })
1.3
1
Load csv files into dataframe.
Output a dataframe to csv
import pandas
d a t a f r a m e . t o c s v ( ”C : / U s e r s / Shiu−Tang L i / . . . c s v ” )
Remark. May load .csv as list of lists instead of data frames.
3
2
2.1
1
2
3
4
2
3
2
3
4
5
6
2
3
4
Get the data type / get the filtered data by data type
types = data frame . dtypes
# t y p e s i s a S e r i e s l a b e l e d by column names , showing t h e data t y p e f o r each
column .
i n t e g e r i n d e x = t y p e s [ t y p e s == i n t 6 4 ] . index
# t y p e s [ t y p e s == i n t 6 4 ] i s a f i l t e r e d S e r i e s with i n t e g e r v a l u e s .
p r i n t ( data frame [ i n t e g e r i n d e x ])
# t h i s g i v e s us t h e f i l t e r e d data frame with o n l y i n t 6 4 v a l u e s .
2.5
1
Get the dimension
dim = d a t a f r a m e . shape
number of rows = dim [0 ]
number of columns = dim [ 1]
2.4
1
Get the description of numerical columns
p r i n t ( data frame . describe () )
2.3
1
Print the data frame
p r i n t ( d a t a f r a m e . head ( 5 ) )
# p r i n t f i r s t 5 rows
p r i n t ( data frame )
# p r i n t t h e data frame , dimension i n f o r m a t i o n i s a l s o a t t a c h e d
2.2
1
Take a quick glance of a dataframe
Get the unique elements
p r i n t ( d a t a f r a m e [ ’ column name ’ ] . unique ( ) )
# w i l l r e t u r n a l i s t showing d i s t i n c t elements i n t h e column
p r i n t ( d a t a f r a m e [ ’ column name ’ ] . v a l u e c o u n t s ( ) )
# w i l l r e t u r n a t a b l e showing t h e c o u n t s i n t h e column
4
3
3.1
1
2
3
4
2
3
4
2
3
4
5
2
3
4
Select rows with restrictions on columns
d a t a f r a m e [ d a t a f r a m e [ ’ column name ’ ] == s o m e v a l u e s ]
3.5
1
Select the sub-dataframe of a few columns
d a t a f r a m e 2 = d a t a f r a m e [ [ ’ column name1 ’ , ’ column name2 ’ ] ]
d a t a f r a m e 2 = d a t a f r a m e [ d a t a f r a m e . column [ 0 : 2 ] ]
# s e l e c t t h e f i r s t two columns i n two d i f f e r e n t ways
d a t a f r a m e 2 = d a t a f r a m e [ ’ column name x ’ : ’ column name y ’ ]
# s e l e c t t h e columns between t h e two columns
3.4
1
Select a specific column
column = d a t a f r a m e [ ’ column name ’ ]
# column i s a [ S e r i e s ] o b j e c t , c o n t a i n s row index + v a l u e s , both a r e l i s t s
c o l u m n v a l u e s = column . v a l u e s
column index = column . index
3.3
1
Get column names
p r i n t ( d a t a f r a m e . columns )
# d a t a f r a m e . columns i s a l i s t o f s t r i n g s
f i r s t c o l u m n = d a t a f r a m e . columns [ 0]
# p r i n t t h e f i s t column , which i s a s t r i n g
3.2
1
Select data from a dataframe
Select rows with row index
data frame . i l o c [ i ]
#i : row index
data frame . i l o c [0:3]
# s e l e c t t h e rows with i n d i c e s 0 ,1 ,2
Remark. The difference between loc and iloc: If the index of the dataframe is 3, 7, 0, 2,
. . ., iloc[0] will select the third row (true integer index), loc will select the 1st row (index
by locations).
3.6
1
2
Select row index with max values in a specific column
d a t a f r a m e [ ’ column name ’ ] . idxmax ( )
# r e t u r n s t h e 1 s t row index t h a t has max
3.7
Select given entry
5
1
2
3
4
5
6
# Approach 1 : i : t r u e row index
d a t a f r a m e . i l o c [ i ] [ ’ column name ’ ]
# Approach 2 : i : l o c a t i o n
d a t a f r a m e [ ’ column name ’ ] . v a l u e s [ i ]
# Approach 3 : i : t r u e row index
d a t a f r a m e . i x [ i , ’ column name ’ ]
3.8
1
2
Iterate rows
f o r i , row i n d a t a f r a m e . i t e r r o w s ( ) :
# i : row i n d i c e s ; row : each row
6
4
4.1
1
2
3
4
5
6
7
8
9
2
3
4
5
6
2
3
4
5
6
2
3
4
2
3
4
5
6
Rename columns
d a t a f r a m e . rename ( columns={ ’ column name1 ’ : ’ new column name1 ’ , ’ column name2 ’ : ’
new column name2 ’ } , i n p l a c e=True )
# rename a few columns : w i l l g e t warning message
d a t a f r a m e . columns = [ ’ column name1 ’ , ’ column name2 ’ , . . . ]
# rename a l l columns
4.5
1
Reindex one row
o l d i n d i c e s = d a t a f r a m e . index
new indices = o l d i n d i c e s . values
f o r i , item i n enumerate ( n e w i n d i c e s ) :
i f item == ’ o l d i n d e x t o b e c h a n g e d ’ :
n e w i n d i c e s [ i ] = ’ new index ’
d a t a f r a m e . index = n e w i n d i c e s
4.4
1
Reindex rows
d a t a f r a m e . index = [ index1 , index2 , . . . ]
# r e p l a c e i n d i c e s with a new l i s t
d a t a f r a m e = d a t a f r a m e . s e t i n d e x ( [ ’ column name ’ ] )
# indexed by a p a r t i c u l a r column
d a t a f r a m e = d a t a f r a m e . r e s e t i n d e x ( drop=True )
# r e i n d e x e d from 0 . drop=F a l s e : make a dataframe column with t h e o l d index
values .
4.3
1
Revise data in a particular entry
# i : t r u e row index
# Approach 1 ( w i l l g e t warning message ) :
d a t a f r a m e . i x [ i , ’ column name ’ ] = new value
# Approach 2 ( w i l l g e t warning message ) :
d a t a f r a m e [ ’ column name ’ ] [ i ] = new value
# Approach 3 :
d a t a f r a m e . s e t v a l u e ( i , ’ column name ’ , new value )
# Approach 4 :
d a t a f r a m e . a t [ i , ’ column name ’ ] = new value
4.2
1
Revise data in a dataframe
Drop columns / rows
d a t a f r a m e . drop ( ’ column name ’ , a x i s =1, i n p l a c e=True )
# drop a column
d a t a f r a m e . drop ( [ ’ column name1 ’ , ’ column name2 ’ , . . . ] , a x i s =1, i n p l a c e=True )
# drop a few columns
d a t a f r a m e . drop ( ’ row index1 ’ , a x i s =0, i n p l a c e=True )
# drop a row
7
7
8
d a t a f r a m e . drop ( [ ’ row index1 ’ , ’ row index2 ’ , . . . ] , a x i s =0, i n p l a c e=True )
# drop a few rows
4.6
1
2
3
4
5
6
7
8
9
10
11
12
13
import pandas as pd
i s n u l l = pd . i s n u l l ( d a t a f r a m e [ ” column name ” ] )
# w i l l r e t u r n a t r u e / f a l s e S e r i e s . N u l l v a l u e = NaN .
new data frame = d a t a f r a m e . dropna ( )
# drop a l l rows with m i s s i n g v a l u e s
new data frame = d a t a f r a m e . dropna ( a x i s = 1)
# drop a l l columns with m i s s i n g v a l u e s
new data frame = d a t a f r a m e . dropna ( s u b s e t =[” column1 ” , ” column2 ” ] )
# drop a l l rows with m i s s i n g v a l u e s i n t h e two columns
d a t a f r a m e [ ” column ” ] = d a t a f r a m e [ ” column ” ] . f i l l n a ( d a t a f r a m e [ ” column ” ] . median ( )
)
d a t a f r a m e [ ” column ” ] = d a t a f r a m e [ ” column ” ] . f i l l n a ( d a t a f r a m e [ ” column ” ] . mean ( ) )
d a t a f r a m e [ ” column ” ] = d a t a f r a m e [ ” column ” ] . f i l l n a ( something )
# f i l l NaN v a l u e s with column median / mean / or s t h e l s e
4.7
1
2
2
3
4
5
6
7
1
2
3
4
5
Change types of a column
d a t a f r a m e [ ’ column name ’ ] = d a t a f r a m e [ ’ column name ’ ] . a s t y p e ( f l o a t )
# changing t y p e s t o f l o a t
4.9
1
Data frame transpose
data frame . T
4.8
1
Find / drop / fill missing values
Merge data frames
import pandas as pd
l i s t 1 = [{ ’ c1 ’ : 4 , ’ c2 ’ : 3 } , { ’ c1 ’ : 2 , ’ c2 ’ : 3}]
l i s t 2 = [{ ’ c1 ’ : 6 , ’ c2 ’ : 9 } , { ’ c1 ’ : 8 , ’ c2 ’ : 1 0 } ]
df1 = pd . DataFrame ( l i s t 1 , index =[0 ,1])
df2 = pd . DataFrame ( l i s t 2 , index=[ ’ two ’ , ’ t h r e e ’ ] )
d f = pd . c o n c a t ( [ df1 , df2 ] )
# combine two data frames with common columns , i n c r e a s e rows . index could
c o n t a i n keys o f d i f f e r e n t data t y p e s
import pandas as pd
df1 = pd . DataFrame ({ ’ key ’ : [ ’ a ’ , ’ b ’ ] , ’ c1 ’ : [ 1 , 2 ] , ’ c2 ’ : [ 8 , 2]})
df2 = pd . DataFrame ({ ’ key ’ : [ ’ a ’ , ’ b ’ ] , ’ c3 ’ : [ 3 , 5]})
merged df = pd . merge ( df1 , df2 , on= ’ key ’ )
# combine two data frames with t h e same data i n t h e column ’ key ’ . Merged data
frame w i l l c o n t a i n 4 columns .
8
5
5.1
1
2
3
Search key words in a dataframe
Exact match in target column
f o r i , row i n movies . i t e r r o w s ( ) :
i f r e . s e a r c h ( ” Keywords ” , row [ ” column ” ] ) != None :
print ( i )
9
6
6.1
1
2
3
4
2
3
4
5
Sort dataframe
d a t a f r a m e . s o r t ( ” column name ” , i n p l a c e=True , a s c e n d i n g=F a l s e )
# s o r t by column , i n d e c r e a s i n g o r d e r
d a t a f r a m e . s o r t i n d e x ( i n p l a c e=True , a s c e n d i n g=F a l s e )
# s o r t by index , i n d e c r e a s i n g o r d e r
6.2
1
Perform operations on a dataframe
Rearrange dataframe - pivot table
import numpy as np
new data frame = d a t a f r a m e . p i v o t t a b l e ( index=” column1 ” , v a l u e s=” column2 ” ,
aggfunc=np . mean)
# For d i f f e r e n t column1 v a l u e s , f i n d t h e r e s u l t o f a g g f u n c t i o n a p p l i e d t o
column2 .
new data frame = d a t a f r a m e . p i v o t t a b l e ( index=” column1 ” , v a l u e s=” any column ” ,
aggfunc= ’ count ’ )
# count t h e f r e q u e n c y f o r column1 .
Remark. Other choices of aggfunc: ’max’, ’min’, np.std, np.sum, np.median.
6.3
1
2
3
Grouping
groups = d a t a f r a m e . groupby ( ” column1 ” )
t a b l e = groups . a g g r e g a t e ( np . mean) [ ” column2 ” ]
# c l a s s i f y rows i n t o d i f f e r e n t groups based on ’ column1 ’ , f o r each group a p p l y
np . mean ( ) on a l l v a l u e s i n ” column2 ” .
6.4
‘Apply’ function
r1
r2
r3
apply1
1
2
3
4
5
6
c1
ap1
c2
ap1
c3
ap1
apply2
ap2
ap2
ap2
apply1 = d a t a f r a m e . a p p l y ( s o m e f u n c t i o n ) OR
apply1 = d a t a f r a m e . a p p l y ( lambda x : s o m e f u n c t i o n ( x ) )
# a p p l y f u n c t i o n t o columns , t o g e t a S e r e i s o b j e c t l a b e l e d column names
apply2 = d a t a f r a m e . a p p l y ( s o m e f u n c t i o n , a x i s =1) OR
apply2 = d a t a f r a m e . a p p l y ( lambda x : s o m e f u n c t i o n ( x ) , a x i s =1)
# a p p l y f u n c t i o n t o rows , t o g e t a S e r e i s o b j e c t l a b e l e d row i n d i c e s
10
7
Others
Series.tolist() converts Series to lists.
numpy.nan: missing value
11
Descargar