Dataset Generation Tutorial
Source: vignettes/Dataset_Generation_Tutorial.Rmd
For a formal description of the generate_report() function, see its reference documentation.
Example 1: generate a dataset from a database
Suppose you have a database from GCAM v7.1 named myDb in a folder named dbFolder, and you want to generate a standardized dataset for several scenarios: scen1, scen2, and scen3. The generate_report function will generate it and automatically save it in the same folder where myDb is located.
- Follow the installation guide, either with R or with Docker.
- Load the gcamreport library. If you are using RStudio or Docker, run devtools::load_all(); if you are using R, run library(gcamreport).
- Store the database path and name, the desired project name, the GCAM-core compatible version, and the desired scenarios and reporting variables in variables. If you do not specify the scenarios, all the scenarios in the database will be considered for reporting; if you do not specify the variables, all the available reporting variables will be used. For more details about specifying scenarios and variables, see the regions’ tutorial or the variables’ tutorial. To specify the GWP version, see the GWP tutorial; to specify the query files, see the query files tutorial.
dbpath <- "/path/to/database"
dbname <- "gcamdb_name"
prjname <- "awesomeProj.dat"
scen <- c("scen1", "scen2", "scen3")
GCAMv <- "v7.1"
Notice that the extension is included in the project name. Accepted extensions are .dat and .proj.
Note: If you followed the Docker installation, you should place your database inside the gcamreport folder, which is now considered the root of the R session. Inside the R session it is referred to as /app. Thus, your database path will be something like /app/path/to/database.
- Generate the standardized dataset up to the desired final year, 2050 in this example. The default is 2100, and the value should be at least 2025.
generate_report(db_path = dbpath, db_name = dbname, prj_name = prjname,
GCAM_version = GCAMv, scenarios = scen, final_year = 2050,
launch_ui = FALSE)
Note: The project generation might take some time, depending on the number of scenarios, regions, and variables you want to standardize.
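Because project generation can be slow, it may help to validate the inputs before making the call above. The following is a minimal base-R sketch; check_report_inputs is a hypothetical helper that is not part of gcamreport, and it only encodes the two constraints stated in this tutorial (a .dat or .proj project name and a final year of at least 2025).

```r
# Hypothetical pre-check helper (NOT part of gcamreport). It only encodes
# the constraints stated in this tutorial.
check_report_inputs <- function(prj_name, final_year = 2100) {
  # The project name must carry one of the accepted extensions
  ext <- tools::file_ext(prj_name)
  if (!ext %in% c("dat", "proj")) {
    stop("prj_name must end in .dat or .proj, got: ", prj_name)
  }
  # The final year must be a single number of at least 2025
  if (!is.numeric(final_year) || length(final_year) != 1 || final_year < 2025) {
    stop("final_year must be a single number of at least 2025")
  }
  invisible(TRUE)
}

# Passes silently for the values used in this example
check_report_inputs("awesomeProj.dat", final_year = 2050)
```

A failing check stops with a message before any database query is loaded.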
Notice that the dataset will automatically be saved in .RData, .csv, and .xlsx formats at /path/to/database/awesomeProj_standardized.RData, /path/to/database/awesomeProj_standardized.csv, and /path/to/database/awesomeProj_standardized.xlsx.
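The naming pattern above can be reproduced in base R, for example to check afterwards that the files were written. This is only a sketch: expected_outputs is a hypothetical helper, not a gcamreport function, and it assumes the pattern is exactly the one shown in this example.

```r
# Hypothetical helper (NOT part of gcamreport): rebuilds the output paths
# described above from the database folder and the project name.
expected_outputs <- function(db_path, prj_name) {
  # Drop the .dat/.proj extension from the project name
  base <- tools::file_path_sans_ext(basename(prj_name))
  file.path(db_path, paste0(base, "_standardized", c(".RData", ".csv", ".xlsx")))
}

paths <- expected_outputs("/path/to/database", "awesomeProj.dat")
paths

# After generate_report() has finished, one could check:
# file.exists(paths)
```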
This procedure will also generate a project file at /path/to/database/dbname_prjname.dat with all the loaded queries. You can use it directly as indicated in Example 2.
The terminal will print the vetting checks performed and their final status.
Example 2: generate a dataset from a project
Suppose you have a project named myProj.dat obtained through GCAM 7.1 and you want to generate a standardized dataset from it. The generate_report function will generate it and automatically save it in the same folder as myProj.dat.
Note that myProj.dat should have all the queries needed to generate the standardized dataset. If you are not sure you have all of them, or if you need to generate the project, see Example 1.
- Follow the installation guide, either with R or with Docker.
- Load the gcamreport library. If you are using RStudio or Docker, run devtools::load_all(); if you are using R, run library(gcamreport).
- Store the project path, the GCAM version, the desired scenarios, and the reporting variables in variables. If you do not specify the scenarios, all the scenarios in the project will be considered for reporting; if you do not specify the variables, all the available reporting variables will be used. For more details about specifying scenarios and variables, see the regions’ tutorial or the variables’ tutorial. Notice that only regions and variables already present in the rgcam project can be considered for reporting. If you wish to include new items in your project, consider generating the project again as detailed in Example 1.
mypath <- "/path/to/project/myProj.dat"
scen <- c('scen1', 'scen2', 'scen3')
GCAMv <- "v7.1"
Notice that the extension is included. Accepted extensions are .dat and .proj.
Note: If you followed the Docker installation, you should place your project file inside the gcamreport folder, which is now considered the root of the R session. Inside the R session it is referred to as /app. Thus, your project path will be something like /app/path/to/project/myProj.dat.
- Generate the standardized dataset up to the desired final year, 2050 in this example. The default is 2100, and the value should be at least 2025.
generate_report(prj_name = mypath, scenarios = scen, final_year = 2050,
GCAM_version = GCAMv, launch_ui = FALSE)
Notice that the dataset will automatically be saved in .RData, .csv, and .xlsx formats at /path/to/project/myProj_standardized.RData, /path/to/project/myProj_standardized.csv, and /path/to/project/myProj_standardized.xlsx.
The terminal will print the checks performed and their final status.
Example 3: choose whether to save the output, and specify the file format or the directory
Suppose you are in the situation of one of the previous examples, but you want to skip saving the standardized output, or to save it as .csv, as .xlsx, or in both formats.
- Follow the installation guide, either with R or with Docker.
- Load the gcamreport library. If you are using RStudio or Docker, run devtools::load_all(); if you are using R, run library(gcamreport).
- Use the Example 1 database or the Example 2 project description and add any extra parameters you would like to pass to the generate_report function (e.g., final year, desired scenarios…). Specify the output saving options through the save_output parameter:
## -- save the dataset in CSV and XLSX format
generate_report(..., save_output = TRUE) # this is the default option
## -- save the dataset only in CSV format
generate_report(..., save_output = 'CSV')
## -- save the dataset only in XLSX format
generate_report(..., save_output = 'XLSX')
## -- do not save the dataset
generate_report(..., save_output = FALSE)
- Use the Example 1 database or the Example 2 project description and add any extra parameters you would like to pass to the generate_report function. Specify the output directory and output file name through the output_file parameter. This will save the output at the indicated path as .csv and .xlsx. To change the file format, use the save_output parameter described in the previous step.
## -- save the dataset in '/desired/directory' and in a file called 'awesomeOutput'
generate_report(..., output_file = '/desired/directory/awesomeOutput')
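As a rough illustration of how output_file is described above (a path prefix without extension), the sketch below builds such a prefix and lists the files one would expect next to it. The temporary directory and the variable names are placeholders, and whether gcamreport creates a missing directory itself is not stated in this tutorial, so the sketch creates it defensively.

```r
# Placeholder value for the output_file argument (directory + file name,
# without extension). tempdir() stands in for '/desired/directory'.
output_file <- file.path(tempdir(), "reports", "awesomeOutput")

# Assumption: the target directory should exist before the report is
# written. Create it if it is missing.
if (!dir.exists(dirname(output_file))) {
  dir.create(dirname(output_file), recursive = TRUE)
}

# Per the description above, the dataset is saved as .csv and .xlsx
expected_files <- paste0(output_file, c(".csv", ".xlsx"))
expected_files

# The call itself would then look like:
# generate_report(..., output_file = output_file)
```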
Example 4: specify the regions or region groups
Suppose you are in one of the previous situations, but you want the standardized dataset to include only some regions. There are two ways to select them: specify the desired regions directly, or specify the group(s) of regions. In either case, the selected regions will form World. For example, the total arable land of World will be the sum of the arable land of the selected regions only.
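The effect on the World aggregate can be illustrated with a toy calculation in base R. The numbers and the data frame below are invented for illustration only; they are not GCAM output and this is not how gcamreport computes the aggregate internally.

```r
# Toy data, invented for illustration only (not GCAM output)
land <- data.frame(
  region = c("EU-15", "EU-12", "USA", "China"),
  arable_land = c(10, 5, 20, 15)
)

# Regions kept for reporting
selected <- c("EU-15", "EU-12")

# With a region selection, "World" is the sum over the selected regions only
world_selected <- sum(land$arable_land[land$region %in% selected])

# Without a selection, "World" would cover every region
world_all <- sum(land$arable_land)

world_selected  # 15
world_all       # 50
```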
- Follow the installation guide, either with R or with Docker.
- Load the gcamreport library. If you are using RStudio or Docker, run devtools::load_all(); if you are using R, run library(gcamreport).
- Check which regions or region groups are available for reporting. The following commands print a list of all the possibilities. To store them in a vector, simply assign the output. You can also skip the console printing by setting print = FALSE.
avail_reg <- available_regions(print = FALSE)
avail_cont <- available_continents()
- Use the Example 1 database or the Example 2 project description and add any extra parameters you would like to pass to the generate_report function (e.g., final year, desired scenarios…). Specify the regions through the desired_regions parameter or the desired_continents parameter. Note that the two parameters cannot be specified at the same time.
## -- specify the desired regions
generate_report(..., desired_regions = c('EU-15','EU-12'))
## -- specify the desired regions' group/s
generate_report(..., desired_continents = c('ASIA','REF'))
Example 5: specify the variables
Suppose you are in the situation of one of the previous examples, but you want to consider only some variables in the standardized dataset.
- Follow the installation guide, either with R or with Docker.
- Load the gcamreport library. If you are using RStudio or Docker, run devtools::load_all(); if you are using R, run library(gcamreport).
- Check which variables are available for reporting. The following command prints a list of all the possibilities. To save them in a vector, simply assign the output. You can also skip the console printing by setting print = FALSE.
avail_var <- available_variables(print = FALSE)
- Use the Example 1 database or the Example 2 project description and add any extra parameters you would like to pass to the generate_report function (e.g., final year, desired scenarios…). Specify the variables through the desired_variables parameter. You can give a vector with the full names of all the desired variables, or select all variables that start with the same name. This last feature allows you to easily select all variables within a group, such as Emissions, Emissions|CO2, or Agricultural Demand.
## -- specify the desired variables
generate_report(...,
desired_variables = c('Agricultural Demand|Crops|Energy',
'Agricultural Demand|Crops|Feed',
'Capacity Additions|Electricity|Wind|Onshore',
'Emissions|BC|Energy*')) # This will select,
# Emissions|BC|Energy,
# Emissions|BC|Energy|Demand|Industry,
# Emissions|BC|Energy|Demand|Residential and Commercial,
# Emissions|BC|Energy|Demand|Transportation,
# Emissions|BC|Energy|Supply
In case you specify only some variables within a group, they will make up the total value. For example, if we select Final Energy|Electricity and Final Energy|Gases, then Final Energy will be the sum of these two sectors, and will not consider Final Energy|Industry or Final Energy|Heat.
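Both behaviours (prefix selection with a trailing * and totals built only from the selected variables) can be mimicked on toy data. The sketch below is an illustration in base R under stated assumptions: select_variables is a hypothetical helper, the values are invented, and gcamreport's internal matching may differ.

```r
# Toy variable names and values, invented for illustration only
values <- c(
  "Final Energy|Electricity" = 40,
  "Final Energy|Gases" = 25,
  "Final Energy|Heat" = 10,
  "Final Energy|Industry" = 30,
  "Emissions|CO2" = 5
)

# Hypothetical helper (NOT part of gcamreport): a trailing "*" means
# "every variable that starts with this name"; anything else must match
# the full name exactly.
select_variables <- function(available, desired) {
  keep <- vapply(available, function(v) {
    any(vapply(desired, function(d) {
      if (endsWith(d, "*")) startsWith(v, sub("\\*$", "", d)) else v == d
    }, logical(1)))
  }, logical(1))
  available[keep]
}

# Selecting two sectors: the group total only counts those two
picked <- select_variables(names(values),
                           c("Final Energy|Electricity", "Final Energy|Gases"))
sum(values[picked])  # 65: Heat and Industry are left out of the total

# Selecting a whole group by prefix
select_variables(names(values), "Final Energy*")  # the four Final Energy entries
```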
Example 6: specify the GWP or the GCAM version
Suppose you are in the situation of one of the previous examples, but you want to consider some specific GWP values to standardize the dataset, which is from a certain GCAM version.
- Follow the installation guide, either with R or with Docker.
- Load the gcamreport library. If you are using RStudio or Docker, run devtools::load_all(); if you are using R, run library(gcamreport).
- Check which GWP and GCAM versions are available for reporting. The following command will print a list with all the possibilities.
- Use the Example 1 database or the Example 2 project description and add any extra parameters you would like to pass to the generate_report function (e.g., final year, desired scenarios…). Specify the GWP version through the GWP_version parameter and the GCAM version through the GCAM_version parameter. Notice that the GCAM version should match the GCAM version used to produce the data. By default, the reporting process uses GCAM 7.0 and AR5.
## -- specify the GCAM and GWP versions
generate_report(..., GCAM_version = "v6.0", GWP_version = "AR4")
Example 7: specify the query files
The gcamreport standardization procedure requires two query files. gcamreport::queries_general is a query file that contains the queries necessary to standardize any variable. You can see the XML version of this file here. In contrast, gcamreport::queries_nonCO2 contains only nonCO2 queries, in particular the queries nonCO2 emissions by sector (excluding resource production) and nonCO2 emissions by region. These queries are particularly heavy and, to avoid crashing the R session, they are loaded in parts. You can see the XML version of the file here.
It is highly recommended not to modify these files. Although they specify a large set of queries to be loaded, not all of them will be included in the rgcam project. The gcamreport package generates the rgcam project with the minimum set of queries necessary to standardize the desired variables, thus avoiding loading extra queries. The query files can only be specified when generating the rgcam project.
Let’s start with the example: Suppose you have a database named myDb of GCAM 7.1 in a folder named dbFolder and you want to generate a standardized dataset for several scenarios (scen1, scen2, and scen3) using a new_general_queries_file. The generate_report function will generate it and automatically save it in the same folder where myDb is located.
- Follow the installation guide, either with R or with Docker.
- Load the gcamreport library. If you are using RStudio or Docker, run devtools::load_all(); if you are using R, run library(gcamreport).
- Store the database path and name, the general query file path, the desired project name, the GCAM version, and the desired scenarios and reporting variables in variables. If you do not specify the scenarios, all the scenarios in the database will be considered for reporting; if you do not specify the variables, all the available reporting variables will be used. For more details about specifying scenarios and variables, see the regions’ tutorial or the variables’ tutorial.
dbpath <- "/path/to/database"
dbname <- "gcamdb_name"
prjname <- "awesomeProj.dat"
scen <- c("scen1", "scen2", "scen3")
GCAMv <- "v7.1"
new_queries_general_file <- "path/to/your/new_queries_file.xml"
Notice that the extension is included in the general query file (.xml) and in the project name (.dat or .proj).
Note: If you followed the Docker installation, you should place your database and the new query file inside the gcamreport folder, which is now considered the root of the R session. Inside the R session it is referred to as /app. Thus, your database path will be something like /app/path/to/database and your query file path will be something like /app/path/to/new_queries_file.xml.
- Generate the standardized dataset up to the desired final year, 2050 in this example. The default is 2100, and the value should be at least 2025.
generate_report(db_path = dbpath, db_name = dbname, prj_name = prjname,
                scenarios = scen, final_year = 2050, GCAM_version = GCAMv,
                launch_ui = FALSE,
                queries_general_file = new_queries_general_file)
Note: The project generation might take some time, depending on the number of scenarios, regions, and variables you want to standardize.
Notice that the dataset will automatically be saved in .RData, .csv, and .xlsx formats at /path/to/database/awesomeProj_standardized.RData, /path/to/database/awesomeProj_standardized.csv, and /path/to/database/awesomeProj_standardized.xlsx.
This procedure will also generate a project file at /path/to/database/dbname_prjname.dat with all the loaded queries. You can use it directly as indicated in Example 2.
The terminal will print the vetting checks performed and their final status.
To specify the nonCO2 query file, proceed analogously. However, check carefully its default structure and the function where it is used: data_query.
Troubleshooting for the generate_report() function
A) Error on generate_report considering a database
When running generate_report(db_path = "path/to/your/data/myData.dat"), you might see this error in your R console:
> generate_report("path/to/your/data/myData.dat")
[1] "Creating project..."
/home/user/basex/.basex: writing new configuration file.
Error in localDBConn(db_path, db_name, migabble = FALSE) :
Database does not exist or is invalid: examples/database_basexdb_ref
In addition: Warning messages:
1: In normalizePath(dbPath) :
path[1]="examples": No such file or directory
2: The following named parsers don't match the column names: name, date, version
This problem might be due to an incorrect package installation or an incorrect database placement.
Possible solution 1: ensure that you cloned the repo. Check the instructions here.
Possible solution 2: ensure that you placed the database in the folder you are specifying. If you extracted the database from a zip file, an intermediate folder may have been created. In addition:
- If you are using the gcamreport package following the R installation, try copying the whole path to your data, for instance db_path = C:\Users\username\Documents\path\to\your\database if you are using a Windows distribution.
- If you are using the gcamreport package following the Docker installation:
  - ensure that your database is inside the gcamreport folder.
  - ensure that you typed the path to your gcamreport folder correctly when generating the Docker image (5th step in the Docker section).
  - ensure that you are pointing correctly to your database. For example, if in the gcamreport folder you have a folder called some_databases with your database amazingDatabase, you should refer to it as
# option 1: full path
generate_report(db_path = "/app/some_databases", db_name = "amazingDatabase")
# option 2: partial path
generate_report(db_path = "some_databases", db_name = "amazingDatabase")
Possible solution 3: ensure that you did not place the database in the main gcamreport folder. The database should be placed in any subfolder within gcamreport or in any folder outside gcamreport. Due to a known issue with the rgcam package, placing the database in the main folder is not supported.
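A quick way to rule out a misplaced database is to check the folder layout before calling generate_report(). The sketch below is a hypothetical base-R pre-check, not a gcamreport function. It assumes that the database is a folder named db_name located directly inside db_path, which is consistent with the error message above, and it uses a temporary folder as a stand-in for a real database.

```r
# Hypothetical pre-check (NOT part of gcamreport). Assumption: the database
# is a folder called db_name located directly inside db_path.
check_database <- function(db_path, db_name) {
  if (!dir.exists(db_path)) {
    stop("db_path does not exist: ", normalizePath(db_path, mustWork = FALSE))
  }
  full <- file.path(db_path, db_name)
  if (!dir.exists(full)) {
    # Listing the folders that do exist helps to spot an intermediate
    # folder created when unzipping
    found <- list.dirs(db_path, full.names = FALSE, recursive = FALSE)
    stop("No folder named '", db_name, "' in ", db_path,
         ". Folders found: ", paste(found, collapse = ", "))
  }
  normalizePath(full)
}

# Demonstration with a temporary stand-in for a real database folder
demo_root <- file.path(tempdir(), "some_databases")
dir.create(file.path(demo_root, "amazingDatabase"),
           recursive = TRUE, showWarnings = FALSE)
check_database(demo_root, "amazingDatabase")
```

If the check passes, the same db_path and db_name values can be passed to generate_report().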
B) Error on generate_report with left_join_strict
When running generate_report(), you might see this error in your R console:
> generate_report(...)
Loading project...
Loading data, performing checks, and saving output...
[1] "ag_demand_clean"
Error in left_join_strict(., filter_variables(get(paste("ag_demand_map", :
Error: Some rows in the left dataset do not have matching keys in the right dataset.
This problem is due to a mismatch in the ag_demand_map mapping.
Possible solution 1: ensure that you specified the GCAM_version parameter correctly in the generate_report function.
Possible solution 2: have a look at this tutorial to learn more about how to update the mappings.
C) Error on generate_report considering a project
When running generate_report("path/to/your/data/myData.dat"), you might see this error in your R console:
> generate_report("path/to/your/data/myData.dat")
[1] "Loading project..."
[1] "Loading data, performing checks, and saving output..."
[1] "ag_demand_clean"
Error in rgcam::getQuery(prj, "demand balances by crop commodity") :
getQuery: Query demand balances by crop commodity is not in any scenarios in the data set.
This problem is due to an incorrect path specification.
Possible solution: ensure that you specified the path correctly. In addition:
- If you are using the gcamreport package following the R installation, try copying the whole path to your data, for instance C:\Users\username\Documents\path\to\your\data\myData.dat if you are using a Windows distribution.
- If you are using the gcamreport package following the Docker installation:
  - ensure that your data is inside the gcamreport folder.
  - ensure that you typed the path to your gcamreport folder correctly when generating the Docker image (5th step in the Docker section).
  - ensure that you are pointing correctly to your data. For example, if in the gcamreport folder you have a folder called amazingData with your dataset myData.dat, you should refer to it as
# option 1: full path
generate_report("/app/amazingData/myData.dat")
# option 2: partial path
generate_report("amazingData/myData.dat")
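Since this error comes from a path that does not resolve to the intended project file, a small check before the call can make the cause explicit. The following base-R sketch uses a hypothetical helper, check_project_file, which is not part of gcamreport; the demonstration creates an empty placeholder file rather than a real rgcam project.

```r
# Hypothetical pre-check (NOT part of gcamreport): verifies that the
# project file exists and has one of the accepted extensions.
check_project_file <- function(prj_path) {
  if (!file.exists(prj_path)) {
    stop("Project file not found: ",
         normalizePath(prj_path, mustWork = FALSE),
         " (working directory: ", getwd(), ")")
  }
  if (!tools::file_ext(prj_path) %in% c("dat", "proj")) {
    stop("Expected a .dat or .proj file, got: ", basename(prj_path))
  }
  normalizePath(prj_path)
}

# Demonstration with an empty placeholder file (not a real project)
demo_prj <- file.path(tempdir(), "myData.dat")
file.create(demo_prj)
check_project_file(demo_prj)
```

The error message includes the working directory, which helps when a partial path is resolved from an unexpected location.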
D) Error related to system when using the Docker installation
Once the R console is open, you might see this message after entering any command:
System has not been booted with systemd as init system (PID 1). Can't operate.
Failed to connect to bus: Host is down
Warning message:
In system("timedatectl", intern = TRUE) :
running command 'timedatectl' had status 1
Possible solution: simply press Ctrl+C and run your command again.