Dataset Generation Tutorial

For a formal description of the generate_report() function, see its reference page.

Example 1: generate a dataset from a database

Suppose you have a GCAM v7.1 database named myDb in a folder named dbFolder, and you want to generate a standardized dataset for several scenarios: scen1, scen2, and scen3. The generate_report function will generate the dataset and automatically save it in the folder where myDb is located.

Follow the installation guide either with R or Docker.
Load the gcamreport library. If you are using RStudio or Docker, run

devtools::load_all(".", reset = TRUE)

and if you are using R, run

library(gcamreport)

Store the database path and name, the desired project name, the GCAM version, and the desired scenarios and reporting variables in variables. If you do not specify the scenarios, all the scenarios in the database will be considered for reporting; if you do not specify the variables, all the available reporting variables will be used. For more details about specifying scenarios and variables, see the regions’ tutorial or the variables’ tutorial. To specify the GWP version, see the GWP tutorial; to specify the query files, see the query files tutorial.

dbpath <- "/path/to/database"
dbname <- "gcamdb_name"
prjname <- "awesomeProj.dat"
scen <- c("scen1", "scen2", "scen3")
GCAMv <- "v7.1"

Notice that the extension is included in the project name. Accepted extensions are .dat and .proj.

Note: If you followed the Docker installation, you should place your database inside the gcamreport folder, which is now considered the root of the R session. Inside the R session it is referred to as /app. Thus, your database path will be something like /app/path/to/database.

Generate the standardized dataset up to the desired final year, 2050 in this example. The default is 2100, and the final year should be at least 2025.

generate_report(db_path = dbpath, db_name = dbname, prj_name = prjname, 
                GCAM_version = GCAMv, scenarios = scen, final_year = 2050,
                launch_ui = FALSE)

Note: The project generation might take some time, depending on the number of scenarios, regions, and variables you want to standardize.
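
Because the run can be long, it may help to validate the inputs first. The following is a minimal sketch in base R. The helper check_report_inputs is hypothetical (it is not part of gcamreport) and only encodes the two rules stated in this tutorial: the accepted project extensions and the minimum final year.

```r
# Hypothetical pre-flight helper (not part of gcamreport).
# It checks the two constraints stated in this tutorial.
check_report_inputs <- function(prj_name, final_year = 2100) {
  ext <- tools::file_ext(prj_name)
  if (!ext %in% c("dat", "proj")) {
    stop("prj_name must end in .dat or .proj, got: ", prj_name)
  }
  if (!is.numeric(final_year) || final_year < 2025) {
    stop("final_year must be a number >= 2025, got: ", final_year)
  }
  invisible(TRUE)
}

# Passes silently for the values used in this example.
check_report_inputs("awesomeProj.dat", final_year = 2050)
```

Calling the helper before generate_report() turns a late failure into an immediate, readable error message.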

Notice that the dataset will automatically be saved in .RData, .csv and .xlsx at /path/to/database/awesomeProj_standardized.RData, /path/to/database/awesomeProj_standardized.csv, and /path/to/database/awesomeProj_standardized.xlsx.
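
To confirm that the run produced its files, the expected paths can be rebuilt from the inputs. This is a sketch under the assumption that the naming rule is exactly the one shown in the paths above (project name without extension, plus _standardized); expected_outputs is a hypothetical helper, not a gcamreport function.

```r
# Hypothetical helper: rebuilds the output paths quoted in this tutorial.
# Assumes the "<project>_standardized.<ext>" naming shown above.
expected_outputs <- function(out_dir, prj_name,
                             extensions = c("RData", "csv", "xlsx")) {
  stem <- tools::file_path_sans_ext(basename(prj_name))
  file.path(out_dir, paste0(stem, "_standardized.", extensions))
}

outs <- expected_outputs("/path/to/database", "awesomeProj.dat")
print(outs)

# After the run, file.exists(outs) shows which files were actually written.
```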

This procedure will also generate a project file at /path/to/database/dbname_prjname.dat with all the loaded queries. You can directly use it as indicated in Example 2.

The terminal will output the performed vetting verifications and their final status.

Example 2: generate a dataset from a project

Suppose you have a project named myProj.dat obtained through GCAM 7.1 and you want to generate a standardized dataset from it. The generate_report function will generate the dataset and automatically save it in the same folder as myProj.dat. Note that myProj.dat should contain all the queries needed to generate the standardized dataset. If you are not sure it has all of them, or if you need to generate the project, see Example 1.

Follow the installation guide either with R or Docker.
Load the gcamreport library. If you are using RStudio or Docker, run

devtools::load_all(".", reset = TRUE)

and if you are using R, run

library(gcamreport)

Store the project path, the GCAM version, the desired scenarios, and the reporting variables in variables. If you do not specify the scenarios, all the scenarios in the project will be considered for reporting; if you do not specify the variables, all the available reporting variables will be used. For more details about specifying scenarios and variables, see the regions’ tutorial or the variables’ tutorial. Note that only regions and variables already present in the rgcam project can be considered for reporting. If you wish to include new items in your project, consider generating the project again as detailed in Example 1.

mypath <- "/path/to/project/myProj.dat"
scen <- c('scen1', 'scen2', 'scen3')
GCAMv <- "v7.1"

Notice that the extension is included. Accepted extensions are .dat and .proj.

Note: If you followed the Docker installation, you should place your project file inside the gcamreport folder, which is now considered the root of the R session. Inside the R session it is referred to as /app. Thus, your project path will be something like /app/path/to/project/myProj.dat.

Generate the standardized dataset up to the desired final year, 2050 in this example. The default is 2100, and the final year should be at least 2025.

generate_report(prj_name = mypath, scenarios = scen, final_year = 2050, 
                GCAM_version = GCAMv, launch_ui = FALSE)

Notice that the dataset will automatically be saved in .RData, .csv and .xlsx at /path/to/project/myProj_standardized.RData, /path/to/project/myProj_standardized.csv, and /path/to/project/myProj_standardized.xlsx.

The terminal will output the performed verifications and their final status.

Example 3: save or not the output and specify the file format or the directory

Suppose you are in the situation of one of the previous examples, but you want to control whether the standardized output is saved, whether it is written as .csv, .xlsx, or both, and where it is saved.

Follow the installation guide either with R or Docker.
Load the gcamreport library. If you are using RStudio or Docker, run

devtools::load_all(".", reset = TRUE)

and if you are using R, run

library(gcamreport)

Use the database description from Example 1 or the project description from Example 2 and add any extra parameters you need in the generate_report function (e.g., final year, desired scenarios…). Specify the output saving options through the save_output parameter:

## -- save the dataset in CSV and XLSX format
generate_report(..., save_output = TRUE)    # this is the default option

## -- save the dataset only in CSV format
generate_report(..., save_output = 'CSV')

## -- save the dataset only in XLSX format
generate_report(..., save_output = 'XLSX')

## -- do not save the dataset
generate_report(..., save_output = FALSE)

Use the database description from Example 1 or the project description from Example 2 and add any extra parameters you need in the generate_report function. Specify the output directory and output file name through the output_file parameter. This saves the output at the indicated path as .csv and .xlsx. To change the file format, use the save_output parameter described in the previous step.

## -- save the dataset in '/desired/directory' and in a file called 'awesomeOutput'
generate_report(..., output_file = '/desired/directory/awesomeOutput')
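
The interaction between save_output and output_file can be summarized in a few lines of base R. This is only an illustration of the options listed in this example, not gcamreport's internal code; planned_files is a hypothetical helper.

```r
# Hypothetical helper (not part of gcamreport): lists the files that each
# save_output / output_file combination described above would produce.
planned_files <- function(output_file, save_output = TRUE) {
  if (isTRUE(save_output)) {
    fmt <- c("csv", "xlsx")          # default: both formats
  } else if (isFALSE(save_output)) {
    fmt <- character(0)              # nothing is saved
  } else {
    fmt <- tolower(save_output)      # 'CSV' or 'XLSX'
  }
  if (length(fmt) == 0) return(character(0))
  paste0(output_file, ".", fmt)
}

planned_files("/desired/directory/awesomeOutput")
planned_files("/desired/directory/awesomeOutput", save_output = "CSV")
planned_files("/desired/directory/awesomeOutput", save_output = FALSE)
```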

Example 4: specify the regions or regions’ group/s

Suppose you are in one of the previous situations, but you want to consider a standardized dataset with only some regions. You have two ways to select them: you can directly specify the desired regions to be considered, or you can specify the group(s) of regions to be considered. In either case, the desired regions will form World. Then, for example, the total arable land of the world will be the sum of the arable land of only the selected regions.

Follow the installation guide either with R or Docker.
Load the gcamreport library. If you are using RStudio or Docker, run

devtools::load_all(".", reset = TRUE)

and if you are using R, run

library(gcamreport)

Check which are the available regions or regions’ groups for reporting. The following commands will print a list with all the possibilities.

available_regions()
available_continents()

In case you want to store them in a vector, you can simply assign the output. You can also skip the console printing by setting print = FALSE.

avail_reg <- available_regions(print = FALSE)
avail_cont <- available_continents()

Use the database description from Example 1 or the project description from Example 2 and add any extra parameters you need in the generate_report function (e.g., final year, desired scenarios…). Specify the regions through the desired_regions parameter or the desired_continents parameter. Note that the two cannot be specified at the same time.

## -- specify the desired regions
generate_report(..., desired_regions = c('EU-15','EU-12'))

## -- specify the desired regions' group/s
generate_report(..., desired_continents = c('ASIA','REF'))
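
The effect of a region selection on the World total can be illustrated with a toy table. The numbers below are invented for illustration and are not GCAM output; the sketch only mirrors the rule stated above, namely that World becomes the sum over the selected regions.

```r
# Toy values, invented for illustration only (not GCAM output).
land <- data.frame(
  region = c("EU-15", "EU-12", "USA"),
  value  = c(10, 5, 20),
  stringsAsFactors = FALSE
)
desired_regions <- c("EU-15", "EU-12")

# Keep only the selected regions, then rebuild World from them.
selected <- land[land$region %in% desired_regions, ]
world <- data.frame(region = "World", value = sum(selected$value),
                    stringsAsFactors = FALSE)

print(rbind(selected, world))  # World is 15 here, not 35
```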

Example 5: specify the variables

Suppose you are in the situation of one of the previous examples, but you want to consider only some variables in the standardized dataset.

Follow the installation guide either with R or Docker.
Load the gcamreport library. If you are using RStudio or Docker, run

devtools::load_all(".", reset = TRUE)

and if you are using R, run

library(gcamreport)

Check which are the available variables for reporting. The following command will print a list with all the possibilities.

available_variables()

In case you want to save them in a vector, you can simply assign the output. You can also skip the console printing by setting print = FALSE.

avail_var <- available_variables(print = FALSE)

Use the database description from Example 1 or the project description from Example 2 and add any extra parameters you need in the generate_report function (e.g., final year, desired scenarios…). Specify the variables through the desired_variables parameter. You can give a vector with the full names of the desired variables, or append * to a name to select every variable that starts with it. This last feature lets you easily select all variables within a group, such as Emissions, Emissions|CO2, or Agricultural Demand.

## -- specify the desired variables
generate_report(..., 
              desired_variables = c('Agricultural Demand|Crops|Energy',
                                    'Agricultural Demand|Crops|Feed',
                                    'Capacity Additions|Electricity|Wind|Onshore', 
                                    'Emissions|BC|Energy*')) # This will select,
                                                             # Emissions|BC|Energy,
                                                             # Emissions|BC|Energy|Demand|Industry, 
                                                             # Emissions|BC|Energy|Demand|Residential and Commercial,
                                                             # Emissions|BC|Energy|Demand|Transportation,
                                                             # Emissions|BC|Energy|Supply

In case you specify only some variables within a group, they will make up the total value. For example, if we select Final Energy|Electricity and Final Energy|Gases, then Final Energy will be the sum of these two sectors, and will not consider Final Energy|Industry or Final Energy|Heat.
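
The prefix selection can be reproduced in base R to preview what a pattern will match. This is a sketch of the behaviour described above, not gcamreport's own implementation; expand_variables is a hypothetical helper, and the variable names are the ones quoted in this example.

```r
# Variable names copied from the example above.
available <- c(
  "Emissions|BC|Energy",
  "Emissions|BC|Energy|Demand|Industry",
  "Emissions|BC|Energy|Demand|Residential and Commercial",
  "Emissions|BC|Energy|Demand|Transportation",
  "Emissions|BC|Energy|Supply",
  "Agricultural Demand|Crops|Energy",
  "Agricultural Demand|Crops|Feed"
)

# Hypothetical helper: a trailing "*" selects every name starting with the
# prefix; any other entry must match a name exactly.
expand_variables <- function(patterns, available) {
  hits <- lapply(patterns, function(p) {
    if (endsWith(p, "*")) {
      prefix <- substr(p, 1, nchar(p) - 1)
      available[startsWith(available, prefix)]
    } else {
      available[available == p]
    }
  })
  unique(unlist(hits))
}

picked <- expand_variables(
  c("Agricultural Demand|Crops|Feed", "Emissions|BC|Energy*"),
  available
)
print(picked)
```

startsWith() is used instead of a regular expression because the | character in variable names would otherwise be read as a regex alternation.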

Example 6: specify the GWP or the GCAM version

Suppose you are in the situation of one of the previous examples, but you want to consider some specific GWP values to standardize the dataset, which is from a certain GCAM version.

Follow the installation guide either with R or Docker.
Load the gcamreport library. If you are using RStudio or Docker, run

devtools::load_all(".", reset = TRUE)

and if you are using R, run

library(gcamreport)

Check which GCAM and GWP versions are available for reporting. The following commands will print a list with all the possibilities.

available_GCAM_versions()
available_GWP_versions()

Use the database description from Example 1 or the project description from Example 2 and add any extra parameters you need in the generate_report function (e.g., final year, desired scenarios…). Specify the GWP version through the GWP_version parameter and the GCAM version through the GCAM_version parameter. Note that the GCAM version should match the version used to produce the data. By default, the reporting process uses GCAM 7.0 and AR5.

## -- specify the GCAM and GWP versions
generate_report(..., GCAM_version = "v6.0", GWP_version = "AR4")
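
To see why the GWP version matters, consider how a set of GWP factors turns non-CO2 gases into CO2-equivalent. The sketch below is a rough illustration outside gcamreport: it uses the 100-year GWP values for CH4 and N2O from the IPCC AR4 and AR5 reports (AR5 values without climate-carbon feedbacks), the emission quantities are invented, and the factors gcamreport applies internally may differ.

```r
# 100-year GWP factors from IPCC AR4 and AR5 (AR5 without climate-carbon
# feedbacks). The factors used inside gcamreport may differ.
gwp100 <- list(
  AR4 = c(CO2 = 1, CH4 = 25, N2O = 298),
  AR5 = c(CO2 = 1, CH4 = 28, N2O = 265)
)

# Invented emission quantities, in Mt of each gas.
emissions_mt <- c(CO2 = 100, CH4 = 2, N2O = 0.1)

# Hypothetical helper: weight each gas by its factor and sum.
to_co2e <- function(emissions, gwp_version) {
  factors <- gwp100[[gwp_version]]
  sum(emissions * factors[names(emissions)])
}

to_co2e(emissions_mt, "AR4")  # 100 + 50 + 29.8 = 179.8
to_co2e(emissions_mt, "AR5")  # 100 + 56 + 26.5 = 182.5
```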

Example 7: specify the query files

The gcamreport standardization procedure requires two query files. The gcamreport::queries_general file contains the queries needed to standardize any variable. You can see the xml version of this file here. In contrast, gcamreport::queries_nonCO2 contains only nonCO2 queries, namely nonCO2 emissions by subsector (excluding resource production) and nonCO2 emissions by region. These queries are particularly heavy, so they are loaded in parts to avoid crashing the R session. You can see the xml version of the file here.

It is highly recommended not to modify these files. Although they specify a large set of queries to be loaded, not all of them will be included in the rgcam project. The gcamreport package generates the rgcam project with the minimum queries necessary to standardize the desired variables, thus avoiding loading extra queries. It is only possible to specify the query files when generating the rgcam project.

Let’s start with the example. Suppose you have a GCAM 7.1 database named myDb in a folder named dbFolder, and you want to generate a standardized dataset for several scenarios (scen1, scen2, and scen3) using a new general query file. The generate_report function will generate the dataset and automatically save it in the folder where myDb is located.

Follow the installation guide either with R or Docker.
Load the gcamreport library. If you are using RStudio or Docker, run

devtools::load_all(".", reset = TRUE)

and if you are using R, run

library(gcamreport)

Store the database path and name, the general query path, the desired project name, the GCAM version, and the desired scenarios and reporting variables in variables. In case you do not specify the scenarios, all the scenarios in the database will be considered for reporting; and if the variables are not specified, all the available reporting variables will be used. For more details about scenarios and variables specification, look at the regions’ tutorial or at the variables’ tutorial.

dbpath <- "/path/to/database"
dbname <- "gcamdb_name"
prjname <- "awesomeProj.dat"
scen <- c("scen1", "scen2", "scen3")
GCAMv <- "v7.1"
new_queries_general_file <- "path/to/your/new_queries_file.xml"

Notice that the extension is included in the general query file (.xml) and in the project name (.dat or .proj).

Note: If you followed the Docker installation, you should place your database and the new query file inside the gcamreport folder, which is now considered the root of the R session. Inside the R session it is referred to as /app. Thus, your database path will be something like /app/path/to/database and your query file path will be something like /app/path/to/new_queries_file.xml.

Generate the standardized dataset up to the desired final year, 2050 in this example. The default is 2100, and the final year should be at least 2025.

generate_report(db_path = dbpath, db_name = dbname, 
                prj_name = prjname, scenarios = scen, final_year = 2050, 
                GCAM_version = GCAMv, launch_ui = FALSE, 
                queries_general_file = new_queries_general_file)

Note: The project generation might take some time, depending on the number of scenarios, regions, and variables you want to standardize.

This procedure will also generate a project file at /path/to/database/dbname_prjname.dat with all the loaded queries. You can directly use it as indicated in Example 2.

The terminal will output the performed vetting verifications and their final status.

To specify the nonCO2 query file, you can proceed analogously. However, check carefully its default structure and the function where it is used: data_query.
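
Since project generation can take a while, it may be worth confirming that the custom query file is where you think it is before starting. The following is a minimal sketch in base R; check_query_file is a hypothetical helper, not part of gcamreport, and it checks only the extension and the file's existence, not the XML content.

```r
# Hypothetical pre-check (not part of gcamreport): confirm the custom query
# file has an .xml extension and exists, then return its absolute path.
check_query_file <- function(path) {
  if (tools::file_ext(path) != "xml") {
    stop("Query file must be an .xml file: ", path)
  }
  if (!file.exists(path)) {
    stop("Query file not found: ", path, " (working directory: ", getwd(), ")")
  }
  normalizePath(path)
}

# Demonstration with a temporary stand-in file.
tmp_query <- tempfile(fileext = ".xml")
writeLines("<queries/>", tmp_query)
resolved <- check_query_file(tmp_query)
print(resolved)
```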

Troubleshooting for the `generate_report()` function

A) Error on `generate_report` considering a database

When running generate_report(db_path = "path/to/your/data/myData.dat"), you might see this error in your R console:

> generate_report("path/to/your/data/myData.dat")
[1] "Creating project..."
/home/user/basex/.basex: writing new configuration file.
Error in localDBConn(db_path, db_name, migabble = FALSE) : 
  Database does not exist or is invalid: examples/database_basexdb_ref
In addition: Warning messages:
1: In normalizePath(dbPath) :
  path[1]="examples": No such file or directory
2: The following named parsers don't match the column names: name, date, version

This problem might be due to an incorrect package installation or an incorrect database placement.

Possible solution 1: ensure that you cloned the repo. Check the instructions here.

Possible solution 2: ensure that the database is in the folder you are specifying. If you extracted the database from a zip file, an extra intermediate folder may have been created. In addition:

In case you are using the gcamreport package following the R installation, try to copy the whole path to your data, for instance db_path = C:\Users\username\Documents\path\to\your\database if you are using a Windows distribution.
In case you are using the gcamreport package following the Docker installation:
1. ensure that your database is inside the gcamreport folder.
2. ensure that you typed correctly the path to your gcamreport folder when generating the docker image (5th step in the Docker section)
3. ensure that you are pointing correctly to your database. For example, if in the gcamreport folder you have a folder called some_databases with your database amazingDatabase, you should refer to it as

# option 1: full path
generate_report(db_path = "/app/some_databases", db_name = "amazingDatabase")

# option 2: partial path
generate_report(db_path = "some_databases", db_name = "amazingDatabase")

Possible solution 3: ensure that you did not place the database in the main gcamreport folder. The database should be placed in any subfolder within gcamreport or in any folder outside gcamreport. Due to a known issue with the rgcam package, placing the database in the main folder is not supported.
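
The placement checks above can be run from the R console before calling generate_report(). This is a sketch in base R under the assumption that the database is the folder db_path/db_name, as the error message suggests; check_db_location is a hypothetical helper, not part of gcamreport.

```r
# Hypothetical diagnostic (not part of gcamreport). It assumes the database
# is the folder file.path(db_path, db_name).
check_db_location <- function(db_path, db_name) {
  if (!dir.exists(db_path)) {
    stop("db_path does not exist: ", db_path,
         " (working directory: ", getwd(), ")")
  }
  full <- file.path(db_path, db_name)
  if (!dir.exists(full)) {
    # Listing the folders helps spot an intermediate folder left by unzipping.
    found <- list.dirs(db_path, recursive = FALSE, full.names = FALSE)
    stop("No folder named '", db_name, "' in ", db_path,
         ". Folders found: ", paste(found, collapse = ", "))
  }
  normalizePath(full)
}

# Demonstration with a temporary folder standing in for a real database.
root <- file.path(tempdir(), "some_databases")
dir.create(file.path(root, "amazingDatabase"), recursive = TRUE,
           showWarnings = FALSE)
db_full <- check_db_location(root, "amazingDatabase")
print(db_full)
```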

B) Error on `generate_report` with `left_join_strict`

When running generate_report(), you might see this error in your R console:

  > generate_report(...)
  Loading project...
  Loading data, performing checks, and saving output...
  [1] "ag_demand_clean"

  Error in left_join_strict(., filter_variables(get(paste("ag_demand_map", :
    Error: Some rows in the left dataset do not have matching keys in the right dataset.

This problem is due to a mismatch in the ag_demand_map map.

Possible solution 1: ensure that you specified correctly the GCAM_version parameter in the generate_report function.

Possible solution 2: have a look at this tutorial to know more about how to update the mappings.

C) Error on `generate_report` considering a project

When running generate_report("path/to/your/data/myData.dat"), you might see this error in your R console:

> generate_report("path/to/your/data/myData.dat")
[1] "Loading project..."
[1] "Loading data, performing checks, and saving output..."
[1] "ag_demand_clean"
Error in rgcam::getQuery(prj, "demand balances by crop commodity") :
  getQuery: Query demand balances by crop commodity is not in any scenarios in the data set.

This problem is due to a wrong path specification.

Possible solution: ensure that you specified correctly the path. In addition:

In case you are using the gcamreport package following the R installation, try to copy the whole path to your data, for instance C:\Users\username\Documents\path\to\your\data\myData.dat if you are using a Windows distribution.
In case you are using the gcamreport package following the Docker installation:
1. ensure that your data is inside the gcamreport folder.
2. ensure that you typed correctly the path to your gcamreport folder when generating the docker image (5th step in the Docker section)
3. ensure that you are pointing correctly to your data. For example, if in the gcamreport folder you have a folder called amazingData with your dataset myData.dat, you should refer to it as

# option 1: full path
generate_report("/app/amazingData/myData.dat")

# option 2: partial path
generate_report("amazingData/myData.dat")

D) Error related to system when using the Docker installation

Once the R console is open, you might see this message after entering any command:

System has not been booted with systemd as init system (PID 1). Can't operate.
Failed to connect to bus: Host is down
Warning message:
In system("timedatectl", intern = TRUE) :
   running command 'timedatectl' had status 1

Possible solution: simply type Ctrl+C and run your command again.
