Git & GitHub Refresher (2023)

r-knowledge git

A refresher on the typical Git & GitHub workflow at NN.

Bryan Blanc https://github.com/bpb824
2023-01-26

This content was presented to Nelson\Nygaard staff at a Lunch and Learn webinar on Thursday, January 26, 2023. A recording of the session is available.

Workflow

Typical R or Python Workflow Locally

These are the suggested options for the R/Python Extract, Transform, Load (ETL) workflow locally:

  1. Inputs: Our data typically come from clients or the web and are transferred to a folder for storage and archiving before we perform ETL steps on them. Suggested storage locations, with notes on each, are listed below. One critical point: we do not recommend storing raw data in GitHub. GitHub blocks individual files larger than 100 MB and is designed for storing code, not data.

    1. Sharepoint

      1. Use the get_sharepoint_dir() function from {nntools} for easy access to your Sharepoint parent folder in a way that works across machines when someone else runs your code.

      2. Use the “sync” feature of Sharepoint to synchronize the necessary Sharepoint folders to your local machine (via OneDrive).

      3. It is important that everyone syncs the same folder so that file paths are the same across machines. The recommendation is to sync at the highest folder level in Sharepoint.

    2. P Drive: Data can often be stored by PMs in the Background or analysis folders. There are no technical issues with this, but loading large data files from NN’s file server can be slow.

    3. G Drive: Data can often be stored here for use in combination with ArcGIS workflows. There are no technical issues with this, but loading large data files from NN’s file server can be slow.

  2. ETL processes (i.e. the code): This is the step that should be version controlled with Git, and stored on GitHub.

    1. The repository should be cloned locally onto your machine.

    2. If you do not want to use GitHub for some reason, the recommended alternative is Sharepoint.

  3. Outputs: We typically send results either a) back to NN file storage locations or b) directly to a cloud-hosted destination, such as a Shiny application, an AWS bucket, a SQL database, a GitHub Pages website, or some combination thereof.

    1. NN file storage locations

      1. Sharepoint: Outputs can be sent here and then shared via a weblink once they have synchronized to the cloud.

      2. P Drive: Outputs might be sent to the Analysis or Graphics folders.

      3. G Drive: Outputs might be sent to a number of different subfolders within a G drive folder, depending on whether the output is tabular or a spatial file for mapping.

    2. Cloud destinations

      1. Shinyapps.io: Shiny apps can be deployed here.

      2. AWS S3 bucket: Files of miscellaneous structure can be stored here, typically for use by Shiny applications.

      3. PostgreSQL database: Data that can be formatted into a table (spatial or not) can be stored via SQL and queried by further R/Python scripts or Shiny applications.

      4. GitHub Pages website: Rendered HTML and associated files can be sent to a docs folder within a GitHub repository to be hosted as a website via GitHub Pages.
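
Two of the input-side recommendations above (file paths that resolve the same way on every machine, and keeping large raw data out of the repository) can be sketched in Python. This is a minimal illustration, not the {nntools} implementation: synced_dir and files_over_limit are hypothetical helper names, and the sketch assumes the synced folder sits directly under the user's home directory, which may not match your OneDrive setup.

```python
from pathlib import Path

# GitHub blocks any single file above roughly 100 MB (100 MiB is used here).
GITHUB_FILE_LIMIT_BYTES = 100 * 1024 * 1024


def synced_dir(folder_name, home=None):
    """Return the path to a synced folder under the user's home directory.

    Hypothetical helper (not part of {nntools}). It assumes the folder is
    synced directly beneath the home directory, so the same code resolves
    to the right place on each person's machine.
    """
    base = Path(home) if home is not None else Path.home()
    return base / folder_name


def files_over_limit(repo_dir, limit_bytes=GITHUB_FILE_LIMIT_BYTES):
    """List files under repo_dir that are larger than limit_bytes.

    Anything inside the .git folder is skipped, since those files are
    Git's own storage rather than files you would commit.
    """
    repo = Path(repo_dir)
    oversized = []
    for path in sorted(repo.rglob("*")):
        if ".git" in path.relative_to(repo).parts:
            continue
        if path.is_file() and path.stat().st_size > limit_bytes:
            oversized.append(path)
    return oversized
```

Calling files_over_limit(".") from the root of a cloned repository before committing would flag any data file that belongs in Sharepoint or on the P or G drive instead. Building every input path from synced_dir(...) rather than typing an absolute path keeps a script runnable when a colleague clones it.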

Git Usage Options

Learning Resources