r upload csv with fread and NOT add a new column

Importing Data into R

A tutorial about data analysis using R

Dr Jon Yearsley (School of Biology and Environmental Science, UCD)

  • Objectives
  • Organise yourself!
  • Data Workflow
  • Format your data (tidy data)
  • Data frames
  • Importing spreadsheet data
  • Summary of the topics covered
  • Further Reading

How to Read this Tutorial

This tutorial is a mixture of R code chunks and explanations of the code. The R code chunks will appear in boxes.

Below is an example of a chunk of R code:

          # This is a chunk of R code. All text after a # symbol is a comment

          # Set working directory using setwd() function
          setwd('Enter the path to my working directory')

          # Clear all variables in R's memory
          rm(list=ls())                # Standard code to clear R's memory

Sometimes the output from running this R code will be displayed after the chunk of code. R output will be preceded by ##.

Here is a chunk of code followed by the R output

          2 + 4                # Use R to add two numbers
          ## [1] 6

Objectives

The objectives of this tutorial are:

  1. Demonstrate good practice in data organisation
  2. Introduce plain text file formats for data
  3. Explain data import into R

Organise yourself!

Before you start importing data into R you should take time to organise your workspace on your computer:

  • Create a folder on your computer to contain all your work for this particular project (e.g. a folder called DataModule)
  • Inside this project folder create another folder called data. This will hold all the raw data files. These raw data files should not be changed.
  • Within this project folder create a text file called MyFirstScript.R. You can use RStudio for this (use the File->New File->R Script menu option) or any basic text editor (e.g. Notepad, TextEdit, gedit, emacs). This file will be your R script that will contain all the commands for R. The .r or .R suffix is the standard suffix for an R script.
  • If you are starting a big project consider creating separate folders for: R scripts, figures, output from the R script
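
The folder layout described above can also be created from within R. The following is a minimal sketch using base R functions. The project location under tempdir() is an assumption, chosen only so that the example is safe to run; you would replace it with your own path.

```r
# Assumed location: a temporary directory, so nothing on your disk is overwritten.
# Replace tempdir() with the folder where you keep your work.
project_dir <- file.path(tempdir(), "DataModule")

# Create the project folder and the 'data' folder inside it
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)

# Create an empty R script in the project folder (only if it does not already exist)
script_path <- file.path(project_dir, "MyFirstScript.R")
if (!file.exists(script_path)) {
  file.create(script_path)
}

# List the contents of the project folder
list.files(project_dir)
```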

Your first R script

Now you have created the file MyFirstScript.R you should put some header text at the start of the file to explain what the R script will do. This was described in tutorial 1.

Video Tutorial: Creating a new R script with RStudio (1 min)

The text should have a short explanation of the R script followed by your name and the date you wrote the R script. Each line should start with a # so that the text is not interpreted by R (this text is for humans so they understand what the file is intended to do). Here is an example,

          # ********** Start of header **************
          # Title: <The title of your R script>
          #
          # Add a short description of the R script here.
          #
          # Author: <your name>  (email address)
          # Date: <today's date>
          #
          # *********** End of header ****************

          # Two common commands at the start of an R script are:
          rm(list=ls())         # Clear R's memory
          setwd('~/DataModule') # Set the working directory
          # Replace '~/DataModule' with the name of your own directory

          # ******************************************
          # Write your commands below.
          # Remember to use comments to explain your commands

Writing clear R scripts

An R script isn't just telling the computer how to perform calculations on your data. It is also explaining your working to other human beings.

"Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do." – Donald E. Knuth

To make your R scripts usable by humans they must be clearly commented (using the # symbol to start a comment) and clearly organised.

As you write an R script consider these questions:

  • Does your R script look well organised (e.g. is it well spaced, are lines indented logically)?
  • Could someone else read the R script and understand the basic idea?
  • Could someone else modify your R script relatively easily?
  • In a couple of months' time could you quickly read and edit your own R script?

Professional data analysts take clarity very seriously. Here are some links to R coding style guides:

  1. Google's style guide, https://google.github.io/styleguide/Rguide.xml
  2. Hadley Wickham's style guide, http://adv-r.had.co.nz/Style.html
  3. http://www.stat.ubc.ca/~jenny/STAT545A/block19_codeFormattingOrganization.html
  4. http://nicercode.github.io/blog/2013-04-05-why-nice-code/

Data Workflow

Below is a schematic of the workflow for handling data.

Figure: The workflow to follow when handling data.

In this tutorial we will consider formatting data, in the next tutorial we'll discuss importing data, and then we'll start to consider exploring the data using graphics and numerical summaries.

Format your data (tidy data)

The workflow starts long before you analyse your data. It starts even before you have your data in some computer software.

Organising your data should follow tidy data guidelines (see below) and be planned before you collect your data. The format of the data should be finalised before importing the data into R. It is often easiest to tidy your data using a spreadsheet program before you import the data into R.

Well organised data from the start will make your life a lot easier and your data import as painless as possible.

Six guidelines for tidy data

When tidying your data you should ensure that:

  1. each variable has its own column
  2. each row is an observation
  3. the top of each column contains the name of the variable
  4. there are no blank columns or blank rows between data
  5. all data in a column has the same type (e.g. it is all numerical data, or it is all text data)
  6. data are consistent (e.g. if a binary variable can take values 'Yes' or 'No' then only these two values are allowed, with no alternatives such as 'Y' and 'N')
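
Some of these guidelines can be checked in R once data are in a data frame. The sketch below uses a small made-up data frame (the names fish, fillet and lead_ppm are hypothetical and are not one of the tutorial's data sets) to look for inconsistent values, mixed data types and completely blank rows.

```r
# Made-up example data (illustration only, not from the tutorial's data sets)
fish <- data.frame(
  species  = c("trout", "sucker", "whitefish", "trout"),
  fillet   = c("Yes", "No", "Y", "Yes"),
  lead_ppm = c(0.12, 0.30, 0.25, 0.18),
  stringsAsFactors = FALSE
)

# Guideline 5: each column should hold a single type of data
sapply(fish, class)

# Guideline 6: find rows where a binary variable has a value that is not allowed
allowed  <- c("Yes", "No")
bad_rows <- which(!fish$fillet %in% allowed)
bad_rows                # Row numbers with inconsistent values
fish$fillet[bad_rows]   # The offending values

# Guideline 4: find rows where every value is missing (blank rows)
blank_rows <- which(rowSums(!is.na(fish)) == 0)
blank_rows
```

Here the third row would be flagged because it uses 'Y' instead of 'Yes'. Fixing such values in the raw spreadsheet before import is usually simpler than repairing them afterwards.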

PDF Summary: This PDF document reiterates the concept of tidy data

The link to the PDF is: http://www.ucd.ie/ecomodel/pdf/TidyData.pdf

Poorly vs well formatted data

The data set shown in the figure below is an example of poorly formatted data. The data set contains data on the lead concentrations (ppm) from three species of fish (whitefish, sucker and trout). Two types of sample were collected: samples from fillets of fish and from whole fish. The data set has three variables: lead concentration, species of fish and type of fish sample.

Figure: A poorly formatted data set. This file would be hard to import and analyse in this format.

How would you improve the format of the poorly formatted data shown in the figure? (Hint: use the six guidelines above)

The second figure shows some well formatted data that follows the tidy data guidelines: each column represents a single variable and each row an observation.

Figure: A well formatted data set. This file would be easy to import and analyse in this format. One column contains the data for one variable. These data are the worldwide occurrences of Covid-19, downloaded from the European Centre for Disease Prevention and Control, https://www.ecdc.europa.eu/en

Data frames

A data frame is R's name for spreadsheet data (e.g. data organised in a grid, like Excel). R stores the vast majority of data as a data frame and uses data frames when analysing data.

A data frame forces the data to be well organised.

  • Each column is a variable. The name of this variable becomes the name of the column.
  • Each row corresponds to an observation. This means that values in the same row are data collected about the same object. Rows can also have names.
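
To make the column and row structure concrete, here is a minimal sketch that builds a small data frame by hand. The data and the names (plants, height_m, plot1 and so on) are made up for illustration.

```r
# Made-up measurements for three plants (illustration only)
plants <- data.frame(
  species   = c("oak", "ash", "birch"),
  height_m  = c(12.5, 9.8, 7.2),
  flowering = c(TRUE, FALSE, TRUE)
)

# Optional: give the rows names
rownames(plants) <- c("plot1", "plot2", "plot3")

names(plants)       # Column (variable) names
nrow(plants)        # Number of observations (rows)
plants$height_m     # One column, extracted by name
plants["plot2", ]   # One row, extracted by row name
```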

Below is an example of a data frame (called airquality) that contains data on the air quality in New York from May - September 1973 (this is a data set that is built in to R).

          # The airquality data is a built-in dataset

          # First 10 rows of the airquality data frame
          head(airquality, n=10)

          ##    Ozone Solar.R Wind Temp Month Day
          ## 1     41     190  7.4   67     5   1
          ## 2     36     118  8.0   72     5   2
          ## 3     12     149 12.6   74     5   3
          ## 4     18     313 11.5   62     5   4
          ## 5     NA      NA 14.3   56     5   5
          ## 6     28      NA 14.9   66     5   6
          ## 7     23     299  8.6   65     5   7
          ## 8     19      99 13.8   59     5   8
          ## 9      8      19 20.1   61     5   9
          ## 10    NA     194  8.6   69     5  10

You can type ?airquality to display the help file for this data set. The data frame has 153 rows (observations) and 6 columns (variables measured). The six columns contain data on: ozone concentrations (parts per billion), solar radiation, wind speed, air temperature, month and day of observation. You can see that each column has a name corresponding to the data for that column.

The structure of the data frame can be viewed using the str() function

          # Display the structure of the airquality data frame
          str(airquality)

          ## 'data.frame':    153 obs. of  6 variables:
          ##  $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
          ##  $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
          ##  $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
          ##  $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
          ##  $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
          ##  $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...

The str() function shows that this is a data frame with 153 observations (rows) and six variables (columns). It also shows the data types of the variables: Wind is a numerical variable (i.e. continuous) and the other variables are all integers (i.e. whole numbers).
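
The same information can be obtained piece by piece. The short sketch below uses other base R functions (dim(), sapply() and colSums(), which are not used elsewhere in this tutorial) on the built-in airquality data.

```r
dim(airquality)               # Number of rows and number of columns
sapply(airquality, class)     # Data type of each column
colSums(is.na(airquality))    # Number of missing values (NA) in each column
```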

Tidy data in R is described in more detail on this web page: https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html

Tibbles

A recent development (circa 2016) is an improved data frame called a tibble. We will not discuss these new data frame objects here, but you can read about them at https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html.

Don't Panic! Tibbles are very similar to data frames.

The important point to know is that if you use RStudio's GUI to import data then your data will be stored in a tibble, not a data frame.
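
If you end up with a tibble and would rather work with a plain data frame, it can be converted. This is a minimal sketch that assumes the tibble package is installed (the example is skipped if it is not); the object names air_tbl and air_df are made up.

```r
# Requires the tibble package; the check below skips the example if it is not installed
if (requireNamespace("tibble", quietly = TRUE)) {
  air_tbl <- tibble::as_tibble(airquality)   # Convert a data frame to a tibble
  print(class(air_tbl))                      # A tibble is still a kind of data frame

  air_df <- as.data.frame(air_tbl)           # Convert back to a plain data frame
  print(class(air_df))
}
```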

Importing spreadsheet data

To start working with data in R you need to import your data into R. You are aiming to have a data frame that contains your data.

The simplest way to import data into R is from a text file (https://en.wikipedia.org/wiki/Text_file). Text files (sometimes called flat files) can be read by any computer operating system and by many different statistical programs. Saving data as a simple text file makes your data highly transportable.

Importing data from software specific formats (e.g. Excel's .XLSX format, Minitab's .MTW format, SPSS's .SAV format or SAS's .SAS format) is possible (e.g. using RStudio's Import Dataset GUI). If you want your data to be easily shared with other people then use a text file to store your data.

We advise you to:

  • save your data as a text file (software, such as Excel, often has an option to save data as plain text)
  • organise data with columns corresponding to different variables before exporting to the text file
  • use a visible text character to delimit each column (commonly a comma or semi-colon). Using an invisible character (e.g. a space or a TAB) is not recommended because these characters all look the same at first glance.
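
The same advice applies when R itself is the software writing the file. As a hedged illustration (this step is not part of the original tutorial), the sketch below saves the built-in airquality data as a comma delimited text file in a temporary folder and reads it back. The argument row.names = FALSE stops R from writing the row names as an extra first column.

```r
# Write to a temporary folder so that no existing file is overwritten
csv_file <- file.path(tempdir(), "airquality.csv")

# Save the data frame as a comma delimited text file, without a row-name column
write.csv(airquality, file = csv_file, row.names = FALSE)

# Look at the first three lines of the text file
readLines(csv_file, n = 3)

# Read the file back in and check that its size is unchanged
air2 <- read.table(csv_file, header = TRUE, sep = ",")
dim(air2)
```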

General advice on importing data into R can be found at https://cran.r-project.org/doc/manuals/r-release/R-data.html

Converting data to a CSV text file

A comma separated values file (CSV file) is the most common format for a text file that contains data.

Here are a few video tutorials on converting information into a CSV text file so that it is suitable for import into R.

Video Tutorial: Converting data from EXCEL to a CSV format (3 mins)

Video Tutorial: Converting data from Googlesheets to a CSV format (1 min)

Viewing text files

Before importing a text file into any software package it is a huge help if you can look at it in a text editor. Text files can contain characters that are normally invisible (e.g. spaces, tabs and end of line markers). If a text editor is going to be of use it must be able to display all the characters in a file.

Three text editors that can do this are:

notepad++ is a free programme for Windows operating systems

BBEdit is a free program for Mac OSX operating systems

emacs is a GNU open-source programme primarily for Linux operating systems.

On Linux systems the cat -A command from the terminal is also useful.
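
R can also be used to peek at the raw contents of a text file before importing it. The sketch below is an illustration only: it writes a small made-up TAB delimited file to a temporary location, then uses the base R functions readLines() and count.fields() to inspect it.

```r
# Create a small made-up TAB delimited file to inspect
txt_file <- tempfile(fileext = ".txt")
writeLines(c("site\tcount", "A\t12", "B\t7"), txt_file)

# Read the first few lines as plain text
first_lines <- readLines(txt_file, n = 3)
print(first_lines)    # print() displays each TAB as \t, making it visible

# Count the number of fields on each line for a given delimiter.
# The same count on every line suggests the delimiter is correct.
count.fields(txt_file, sep = "\t")
```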

Here are two video tutorials on this topic

Video Tutorial: Viewing data in a text file before importing into R (4 mins)

Video Tutorial: An overview of the common data text file formats (3 mins)

Data import examples

The data we'll be importing are described at http://www.ucd.ie/ecomodel/Resources/datasets_WebVersion.html

The files are:

  • WOLF.CSV: This file is a text file of comma separated values.
  • HEIGHT.CSV: This file is a text file of comma separated values.
  • INSECT.TXT: This file is a text file of TAB delimited values.
  • BEEKEEPER.TXT: This file is a text file with blank space delimiting the values.
  • MALIN_HEAD.TXT: This file is a text file with TAB delimited values.

All these data files are simple text files that differ in the character used to distinguish columns of data.

Comma delimited files (CSV files)

CSV stands for comma separated values (note sometimes semi-colons are used in place of commas because some countries use the comma in place of the decimal point).
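
For files that use semi-colons and decimal commas, read.table() has the arguments sep and dec. The following is a minimal sketch on a made-up file (the file contents and the names dat and dat2 are hypothetical).

```r
# Create a small made-up file with semi-colon delimiters and decimal commas
csv_file <- tempfile(fileext = ".csv")
writeLines(c("id;weight", "1;2,5", "2;3,75"), csv_file)

# sep=';' sets the column delimiter, dec=',' sets the decimal mark
dat <- read.table(csv_file, header = TRUE, sep = ";", dec = ",")
str(dat)    # weight should be imported as a numerical variable

# read.csv2() is a shortcut that uses these settings by default
dat2 <- read.csv2(csv_file)
```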

The read.table() function is a flexible function for importing text data

Video Tutorial: Importing a CSV file into R using read.table() (5 mins)

          # Import WOLF.CSV file using read.table function
          wolf = read.table('WOLF.CSV', header=TRUE, sep=',')

The wolf variable contains the imported data. It is called a data frame.

The ideal organisation of a data frame is for each row to be an observation of some object and each column a variable that measures some property of the object. For example, each row of wolf is an observation of one individual wolf and each column of wolf gives information about where the wolf was observed and the data collected from its hair sample.

The HEIGHT.CSV file also contains comma separated values. Here is the read.table() command to read in this file

          # Import HEIGHT.CSV file using read.table function
          human = read.table('HEIGHT.CSV', header=TRUE, sep=',')

Note: The function read.csv() is a special case of the read.table() function.

Use the R help pages to learn more about these functions

          ?read.table                # Display help page on read.table function
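
To illustrate the note on read.csv(), the sketch below imports the same small made-up file with both functions and compares the results. read.csv() is read.table() with different defaults (including header=TRUE and sep=','), so for a simple file like this one the two imports are expected to match.

```r
# Create a small made-up CSV file
csv_file <- tempfile(fileext = ".csv")
writeLines(c("name,height", "Ann,1.62", "Bob,1.80"), csv_file)

# Import with read.table(), stating the header and delimiter explicitly
a <- read.table(csv_file, header = TRUE, sep = ",")

# Import with read.csv(), relying on its defaults
b <- read.csv(csv_file)

# Compare the two data frames
all.equal(a, b)
```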

TAB delimited files (TXT files)

The INSECT.TXT data set is a text file where variables are delimited by a TAB. In addition the first 3 lines contain a data description that we do not want to import.

The read.table() function can be used to import this file. The argument skip=3 is used to ignore the first three lines. The argument sep='\t' specifies a TAB as the variable delimiter

          # Import INSECT.TXT file using read.table function (TAB delimited)
          # skipping the first 3 lines (skip=3)
          insect = read.table('INSECT.TXT', header=T, skip=3, sep='\t')

The MALIN_HEAD.TXT file also contains TAB delimited data. Here is the read.table() command to read in this file

          # Import MALIN_HEAD.TXT file using read.table function (TAB delimited)
          rainfall = read.table('MALIN_HEAD.TXT', header=T, sep='\t')
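
If you do not have the tutorial's files to hand, the effect of skip and sep='\t' can be tried on a made-up file. The sketch below writes a small TAB delimited file with three description lines (the contents are invented and are not the INSECT.TXT data) and imports it.

```r
# Create a made-up TAB delimited file with three description lines at the top
txt_file <- tempfile(fileext = ".txt")
writeLines(c(
  "Description line 1",
  "Description line 2",
  "Description line 3",
  "trap\tspecies\tcount",
  "1\tbeetle\t14",
  "2\tmoth\t3"
), txt_file)

# Skip the three description lines, then read the header and the data
insects_demo <- read.table(txt_file, header = TRUE, skip = 3, sep = "\t")
str(insects_demo)
```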

Blank space delimited files

The BEEKEEPER.TXT data set uses white space to delimit the variables. The first six lines of the file contain a description of the data

Using read.table() with the argument sep='' will interpret any white space as a variable delimiter.

          # Import BEEKEEPER.TXT file using read.table function (white space delimited)
          # skipping the first six lines (skip=6)
          bees = read.table('BEEKEEPER.TXT', header=T, skip=6, sep='')
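
With sep='' the delimiter is one or more spaces or TABs, so columns do not need to be separated by exactly one space. A minimal sketch on a made-up file with irregular spacing (not the BEEKEEPER.TXT data):

```r
# Create a made-up file where columns are separated by a mix of spaces and a TAB
txt_file <- tempfile(fileext = ".txt")
writeLines(c(
  "hive   honey_kg year",
  "A 12.5    2019",
  "B\t9.0 2019"
), txt_file)

# sep='' treats any run of white space as a single delimiter
bees_demo <- read.table(txt_file, header = TRUE, sep = "")
bees_demo
```

One consequence of this behaviour is that values which themselves contain spaces (e.g. a two-word name) would be split across columns unless they are quoted, which is another reason to prefer a visible delimiter.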

Summary of import commands

Type of text file         R Command
Comma delimited (.CSV)    read.table(<filename>, header=T, sep=',')
TAB delimited (.TXT)      read.table(<filename>, header=T, sep='\t')
Blank space (.TXT)        read.table(<filename>, header=T, sep='')

          # Comma separated values
          wolf = read.table('WOLF.CSV', header=TRUE, sep=',')
          human = read.table('HEIGHT.CSV', header=TRUE, sep=',')

          # TAB delimited values
          insect = read.table('INSECT.TXT', header=T, skip=3, sep='\t')
          rainfall = read.table('MALIN_HEAD.TXT', header=T, sep='\t')

          # White space delimited values
          bees = read.table('BEEKEEPER.TXT', header=T, skip=6, sep='')
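
A common sign that the wrong delimiter was given is a data frame with only one column. The sketch below demonstrates this on a made-up CSV file; the check itself is a suggestion and is not part of the original tutorial.

```r
# Create a small made-up CSV file
csv_file <- tempfile(fileext = ".csv")
writeLines(c("id,mass", "1,10.2", "2,11.7"), csv_file)

# Wrong delimiter: each line is read as a single value
wrong <- read.table(csv_file, header = TRUE, sep = "\t")
ncol(wrong)     # Only one column, a warning sign

# Correct delimiter: the two variables are separated
right <- read.table(csv_file, header = TRUE, sep = ",")
ncol(right)
str(right)      # Always check the structure after importing
```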

Importing data using RStudio

RStudio has its own data import functionality. To use this you will need to install the R package readr. For more information about this see RStudio's guide: https://support.rstudio.com/hc/en-us/articles/218611977-Importing-Data-with-RStudio

Video Tutorial: Importing a CSV file into R using RStudio's GUI (3 mins 13 secs)

Importing data using RStudio will save the data as a modified data frame, called a tibble (tibbles are briefly discussed above).

Importing using fread()

fread() is a powerful data import function that is similar to read.table() but faster. It is part of the data.table package, which you will need to install.

You should only have to give fread() the name of the file you want to import, and fread() will try to work out the appropriate way to import the data. Try some examples and compare them with the examples above

          # ******************************************
          # Other packages for importing data --------
          # The data.table package

          library(data.table)                # Load the data.table package

          # Import a CSV file
          wolf2 = fread('WOLF.CSV')
          human2 = fread('HEIGHT.CSV')

          # Import TAB delimited file
          insect2 = fread('INSECT.TXT')
          rainfall2 = fread('MALIN_HEAD.TXT')

          # Import white space delimited file
          bees2 = fread('BEEKEEPER.TXT')

The fread() command is simpler to use because it tries to guess the format of the data in the file.
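
By default fread() returns a data.table rather than a plain data frame, and its guesses can be overridden when needed. The sketch below is a hedged illustration on a made-up file; it assumes the data.table package is installed (the example is skipped if it is not) and uses the fread() arguments sep, header and data.table.

```r
# Requires the data.table package; the check below skips the example if it is not installed
if (requireNamespace("data.table", quietly = TRUE)) {
  # Create a small made-up CSV file
  csv_file <- tempfile(fileext = ".csv")
  writeLines(c("id,mass", "1,10.2", "2,11.7"), csv_file)

  # Default: the result is a data.table (which is also a data frame)
  dt <- data.table::fread(csv_file)
  print(class(dt))

  # data.table = FALSE returns a plain data frame instead
  df <- data.table::fread(csv_file, data.table = FALSE)
  print(class(df))

  # The guesses can be overridden by stating the delimiter and header explicitly
  df2 <- data.table::fread(csv_file, sep = ",", header = TRUE, data.table = FALSE)
  print(names(df2))    # The columns are those named in the file's header line
}
```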

Summary of the topics covered

  • Organising your files on your computer
  • Best practice for formatting data
  • Reading in spreadsheet data
  • Data frames

Further Reading

All these books can be found in UCD's library

  • Andrew P. Beckerman and Owen L. Petchey, 2012 Getting Started with R: An introduction for biologists (Oxford University Press, Oxford) [Chapters 2, 3]
  • Mark Gardner, 2012 Statistics for Ecologists Using R and Excel (Pelagic, Exeter)
  • Michael J. Crawley, 2015 Statistics: an introduction using R (John Wiley & Sons, Chichester) [Chapter 2]
  • Tenko Raykov and George A Marcoulides, 2013 Basic statistics: an introduction with R (Rowman and Littlefield, Plymouth)


Source: https://www.ucd.ie/ecomodel/Resources/Sheet2a_data_import_WebVersion.html
