Working with dataset in R and using subset to work on dataset

Let’s load built in data set

>Library(datasets)

Let’s access one of the dataset name mtcars:

 >mtcars

 

                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb

Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4

Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4

Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1

Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1

Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2

Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1

Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4

Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2

Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2

Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4

Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4

Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3

Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3

Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3

Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4

Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4

Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4

Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1

Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2

Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1

Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1

Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2

AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2

Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4

Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2

Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1

Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2

Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2

Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4

Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6

Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8

Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

So what is this data set? Running str(_dataset_name) will tell you more about it:

> str(mtcars)
'data.frame':  32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Let’s see the top of the data frame using head

> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

You can also see the tail as well:

> tail(mtcars)
                mpg cyl  disp  hp drat    wt qsec vs am gear carb
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2

So how many rows are in this dataset?

> nrow(mtcars)
[1] 32

So how many columns are in this dataset?

> ncol(mtcars)
[1] 11

Ok can we get the name of column headers? Sure

> names(mtcars)
 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear" "carb"

So let’s see the summary for our dataset:

> summary(mtcars)

      mpg             cyl             disp             hp             drat     
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0   Min.   :2.760 
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:3.080 
 Median :19.20   Median :6.000   Median :196.3   Median :123.0   Median :3.695 
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7   Mean   :3.597 
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.920 
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0   Max.   :4.930 
       wt             qsec             vs               am       
 Min.   :1.513   Min.   :14.50   Min.   :0.0000   Min.   :0.0000 
 1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000   1st Qu.:0.0000 
 Median :3.325   Median :17.71   Median :0.0000   Median :0.0000 
 Mean   :3.217   Mean   :17.85   Mean   :0.4375   Mean   :0.4062 
 3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000   3rd Qu.:1.0000 
 Max.   :5.424   Max.   :22.90   Max.   :1.0000   Max.   :1.0000 
      gear            carb     
 Min.   :3.000   Min.   :1.000 
 1st Qu.:3.000   1st Qu.:2.000 
 Median :4.000   Median :2.000 
 Mean   :3.688   Mean   :2.812 
 3rd Qu.:4.000   3rd Qu.:4.000 
 Max.   :5.000   Max.   :8.000

As you can see above the summary give you the mean, media, max and other values for each column.

Using Subset with Dataset in R

How to list data frame with a few columns? Sure we will be using subset() for this:

Subset(dataframe, condition, format)

For example the following subset will show all the rows in mtcrs but only mpg column:

> subset(mtcars, select = c(mpg))
                     mpg
Mazda RX4           21.0
Mazda RX4 Wag       21.0
Datsun 710          22.8
Hornet 4 Drive      21.4
Hornet Sportabout   18.7
Valiant             18.1
Duster 360          14.3
Merc 240D           24.4
Merc 230            22.8
Merc 280            19.2
Merc 280C           17.8
Merc 450SE          16.4
Merc 450SL          17.3
Merc 450SLC         15.2
Cadillac Fleetwood  10.4
Lincoln Continental 10.4
Chrysler Imperial   14.7
Fiat 128            32.4
Honda Civic         30.4
Toyota Corolla      33.9
Toyota Corona       21.5
Dodge Challenger    15.5
AMC Javelin         15.2
Camaro Z28          13.3
Pontiac Firebird    19.2
Fiat X1-9           27.3
Porsche 914-2       26.0
Lotus Europa        30.4
Ford Pantera L      15.8
Ferrari Dino        19.7
Maserati Bora       15.0
Volvo 142E          21.4

Now what if we want to see the list of cars which have mpg above 20  along with mpg and hp column:

> subset(mtcars, mpg > 20, select = c(mpg, hp))
                mpg  hp
Mazda RX4      21.0 110
Mazda RX4 Wag  21.0 110
Datsun 710     22.8  93
Hornet 4 Drive 21.4 110
Merc 240D      24.4  62
Merc 230       22.8  95
Fiat 128       32.4  66
Honda Civic    30.4  52
Toyota Corolla 33.9  65
Toyota Corona  21.5  97
Fiat X1-9      27.3  66
Porsche 914-2  26.0  91
Lotus Europa   30.4 113
Volvo 142E     21.4 109

We can also call subset within subset.

For example we need to get a list of cars with HP above 100 and mpg above 20.

> subset(subset(mtcars, mpg > 20, select = c(mpg, hp)),hp >= 100)
                mpg  hp
Mazda RX4      21.0 110
Mazda RX4 Wag  21.0 110
Hornet 4 Drive 21.4 110
Lotus Europa   30.4 113
Volvo 142E     21.4 109

Note: following will not work as condition will not match and you will see a full list of cars with mpg and hp column
> subset(mtcars, mpg > 20 && hp >= 100, select = c(mpg, hp))

You can also use subset to show specific columns
> subset(mtcars[1:3])
                     mpg cyl  disp
Mazda RX4           21.0   6 160.0
Mazda RX4 Wag       21.0   6 160.0
Datsun 710          22.8   4 108.0
Hornet 4 Drive      21.4   6 258.0
Hornet Sportabout   18.7   8 360.0

…..

Or you can defined the columns name as below:

> subset(mtcars[c("mpg", “cyl”, "hp")])



Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s