# グループ化されていない観測結果に関する記述統計 -- r フィールド と dplyr フィールド 関連 問題

## Descriptive statistics on grouped and ungrouped observations

0

### 問題

これは私のデータです：

<事前> <コード> dput(DF) structure(list(Product_Name = c("iPhone", "iPhone", "iPhone", "iPhone", "iPhone", "iPhone", "Nexus 6P", "Nexus 6P", "Nexus 6P", "Nexus 6P", "Nexus 6P", "Nexus 6P"), Product_Type = c("New", "New", "Refurbished", "New", "New", "Refurbished", "Refurbished", "Refurbished", "Refurbished", "Refurbished", "Refurbished", "Refurbished" ), Year = c(2006, 2011, 2009, 2008, 2011, 2009, 2012, 2007, 2013, 2015, 2009, 2010), Units = c(100, 200, 300, 400, 500, 600, 700, 200, 120, 125, 345, 340)), .Names = c("Product_Name", "Product_Type", "Year", "Units"), row.names = c(NA, 12L), class = "data.frame")

これに関する私のコードです：

<事前> <コード> DF[DF\$Year<2010,"Time"]<-"1" DF[DF\$Year>=2010,"Time"]<-"2"

<事前> <コード> DF %>% group_by(Product_Name, Product_Type,Time) %>% dplyr::summarise(Count = n(), Sum_Units = sum(Units,na.rm=TRUE), Avg_Units = mean(Units,na.rm = TRUE), Max_Units=max(Units,na.rm = TRUE))

<事前> <コード> dput(DFOut) structure(list(Product_Name = c("iPhone", "Nexus 6P"), New_Units_Sum_Time1 = c(500, NA), Refurbished_Units_Sum_Time_1 = c(900, 545), Sum_Units_Time1 = c(1400, 545), Sum_Units_Time2 = c(700, 1285), Sum_Units_Time_1_And_2 = c(2100, 1830), Avg_Units_Time1 = c(350, 272.5), Avg_Units_Time2 = c(350, 321.25), Avg_Units_Time_1_And_2 = c(350, 305), Max_Units_Time1 = c(600, 345), Max_Units_Time2 = c(500, 700), Max_Units_Time_1_And_2 = c(600, 700)), .Names = c("Product_Name", "New_Units_Sum_Time1", "Refurbished_Units_Sum_Time_1", "Sum_Units_Time1", "Sum_Units_Time2", "Sum_Units_Time_1_And_2", "Avg_Units_Time1", "Avg_Units_Time2", "Avg_Units_Time_1_And_2", "Max_Units_Time1", "Max_Units_Time2", "Max_Units_Time_1_And_2" ), row.names = 1:2, class = "data.frame")

a）製品の種類とそれが販売された期間に基づいて（例えば、` DF[DF\$Year<2010,"Time"]<-"1" DF[DF\$Year>=2010,"Time"]<-"2" 0 ` および` DF[DF\$Year<2010,"Time"]<-"1" DF[DF\$Year>=2010,"Time"]<-"2" `）。出力では、` DF[DF\$Year<2010,"Time"]<-"1" DF[DF\$Year>=2010,"Time"]<-"2" 2 `と<コード> DF[DF\$Year<2010,"Time"]<-"1" DF[DF\$Year>=2010,"Time"]<-"2" 3 の組み合わせを示しています。 ` Product_Type14 ` ` 998876615 `の他の組み合わせの説明的統計の作成方法を教えてください。それは素晴らしいでしょう。

b）商品の種類を無視することに基づくが、販売期間を無視していない（例えば、<コード> DF[DF\$Year<2010,"Time"]<-"1" DF[DF\$Year>=2010,"Time"]<-"2" 7 ）

c）両方の種類の製品と期間を無視して販売されました（例えば、<コード> DF[DF\$Year<2010,"Time"]<-"1" DF[DF\$Year>=2010,"Time"]<-"2" 8 ）。

Avgと平均のためのDitto。

どうすればいいですか？私は助けに感謝します。私は本当にこれに苦労しています。

Excelを使用してDFOUTを手動で作成したことに注意してください。私はそれをチェックしたが、いくつかの手動エラーがあるかもしれない - 私は質問がある場合にそれらを明確にすることを嬉しく思います。あなたの時間をありがとうございました。

sessionInfo（）

<事前> <コード> DF[DF\$Year<2010,"Time"]<-"1" DF[DF\$Year>=2010,"Time"]<-"2" 9

I have recently migrated to R from Stata. I am unsure how to perform compute descriptive statistics on grouped and ungrouped observations.

Here's my data:

``dput(DF) structure(list(Product_Name = c("iPhone", "iPhone", "iPhone",  "iPhone", "iPhone", "iPhone", "Nexus 6P", "Nexus 6P", "Nexus 6P",  "Nexus 6P", "Nexus 6P", "Nexus 6P"), Product_Type = c("New",  "New", "Refurbished", "New", "New", "Refurbished", "Refurbished",  "Refurbished", "Refurbished", "Refurbished", "Refurbished", "Refurbished" ), Year = c(2006, 2011, 2009, 2008, 2011, 2009, 2012, 2007, 2013,  2015, 2009, 2010), Units = c(100, 200, 300, 400, 500, 600, 700,  200, 120, 125, 345, 340)), .Names = c("Product_Name", "Product_Type",  "Year", "Units"), row.names = c(NA, 12L), class = "data.frame") ``

My data has products sold by year and type. Each product could a refurbished product or a new product. Further, if it was sold before 2010, I would mark it as sold in "Time 1" otherwise I would mark it as sold in "Time 2".

Here's my code for this:

``DF[DF\$Year<2010,"Time"]<-"1" DF[DF\$Year>=2010,"Time"]<-"2" ``

Now, I want to generate descriptive statistics for these time periods.

``DF %>%    group_by(Product_Name, Product_Type,Time) %>%   dplyr::summarise(Count = n(),                     Sum_Units = sum(Units,na.rm=TRUE),                     Avg_Units = mean(Units,na.rm = TRUE),                     Max_Units=max(Units,na.rm = TRUE)) ``

If we run above code, we would get descriptive statistics by `Product_Name`, `Product_Type`, and `Time` (i.e. grouped descriptive statistics). However, this is not what I want. I want descriptive statistics with and without considering groupings with `Product_Type` and `Time`. Meaning, I would want to compute descriptive statistics assuming that products were sold in Time 1 OR Time 2 (i.e. all years) and irrespective of the type of product sold, while retaining some of the grouped information above.

Expected output:

``dput(DFOut) structure(list(Product_Name = c("iPhone", "Nexus 6P"), New_Units_Sum_Time1 = c(500,  NA), Refurbished_Units_Sum_Time_1 = c(900, 545), Sum_Units_Time1 = c(1400,  545), Sum_Units_Time2 = c(700, 1285), Sum_Units_Time_1_And_2 = c(2100,  1830), Avg_Units_Time1 = c(350, 272.5), Avg_Units_Time2 = c(350,  321.25), Avg_Units_Time_1_And_2 = c(350, 305), Max_Units_Time1 = c(600,  345), Max_Units_Time2 = c(500, 700), Max_Units_Time_1_And_2 = c(600,  700)), .Names = c("Product_Name", "New_Units_Sum_Time1", "Refurbished_Units_Sum_Time_1",  "Sum_Units_Time1", "Sum_Units_Time2", "Sum_Units_Time_1_And_2",  "Avg_Units_Time1", "Avg_Units_Time2", "Avg_Units_Time_1_And_2",  "Max_Units_Time1", "Max_Units_Time2", "Max_Units_Time_1_And_2" ), row.names = 1:2, class = "data.frame") ``

In the output, you would see that I have some descriptive statistics:

a) based on the type of product and the period it was sold (e.g. `New_Units_Sum_Time1` i.e. `New` and `Time1`). Please note that in the output, I have only shown the combination of `New` and `Time1`. If you can guide me how to produce descriptive statistics for other combinations of `Refurbished` and `Time`, it would be awesome.

b) based on ignoring type of product but not ignoring the period sold (e.g. `Sum_Units_Time1` i.e. `Time1`)

c) based on ignoring both type of product and period it was sold (e.g. `Sum_Units_Time_1_And_2`).

Ditto for Avg and Mean.

How can I do this? I'd appreciate any help. I am really struggling on this.

Please note that I manually created DFOut using Excel. Although I triple checked it, there could be some manual errors--I will be more than happy to clarify them if there are questions. Thank you for your time.

sessionInfo()

``R version 3.3.2 (2016-10-31) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows >= 8 x64 (build 9200)  locale: [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                           [5] LC_TIME=English_United States.1252      attached base packages: [1] grDevices datasets  stats     graphics  grid      tcltk     utils     methods   base       other attached packages:  [1] tables_0.8              Hmisc_4.0-2             Formula_1.2-1           survival_2.40-1          [5] ResourceSelection_0.3-0 magrittr_1.5            stringr_1.1.0           bit64_0.9-5              [9] bit_1.1-12              tufterhandout_1.2.1     knitr_1.15.1            rmarkdown_1.3           [13] tufte_0.2               corrplot_0.77           purrr_0.2.2             readr_1.0.0             [17] tibble_1.2              tidyverse_1.1.1         cowplot_0.7.0           plotly_4.5.6            [21] ggplot2_2.2.1           maps_3.1.1              directlabels_2015.12.16 tidyr_0.6.1             [25] ggthemes_3.3.0          R2HTML_2.3.2            lubridate_1.6.0         xts_0.9-7               [29] zoo_1.7-14              lattice_0.20-34         corrgram_1.10           hexbin_1.27.1           [33] sm_2.2-5.4              compare_0.2-6           installr_0.18.0         psych_1.6.12            [37] reshape2_1.4.2          readstata13_0.8.5       pastecs_1.3-18          boot_1.3-18             [41] vcd_1.4-3               car_2.1-4               xlsxjars_0.6.1          rJava_0.9-8             [45] debug_1.3.1             dplyr_0.5.0             foreign_0.8-67          gmodels_2.16.2          [49] openxlsx_4.0.0          plyr_1.8.4               loaded via a namespace (and not attached):  [1] minqa_1.2.4         colorspace_1.3-2    class_7.3-14        modeltools_0.2-21   mclust_5.2.2         [6] rprojroot_1.2       htmlTable_1.9       base64enc_0.1-3     MatrixModels_0.4-1  flexmix_2.3-13      [11] mvtnorm_1.0-5       xml2_1.1.1          codetools_0.2-15    splines_3.3.2       mnormt_1.5-5        [16] robustbase_0.92-7   jsonlite_1.2        nloptr_1.0.4        pbkrtest_0.4-6      broom_0.4.1         [21] cluster_2.0.5       kernlab_0.9-25      httr_1.2.1          backports_1.0.5     assertthat_0.1      [26] Matrix_1.2-7.1      lazyeval_0.2.0      acepack_1.4.1       htmltools_0.3.5     quantreg_5.29       [31] tools_3.3.2         gtable_0.2.0        Rcpp_0.12.9         trimcluster_0.1-2   gdata_2.17.0        [36] nlme_3.1-128        iterators_1.0.8     fpc_2.1-10          lmtest_0.9-34       lme4_1.1-12         [41] rvest_0.3.2         gtools_3.5.0        dendextend_1.4.0    DEoptimR_1.0-8      MASS_7.3-45         [46] scales_0.4.1        TSP_1.1-4           hms_0.3             parallel_3.3.2      SparseM_1.74        [51] RColorBrewer_1.1-2  gridExtra_2.2.1     rpart_4.1-10        latticeExtra_0.6-28 stringi_1.1.2       [56] gclus_1.3.1         mvbutils_2.7.4.1    foreach_1.4.3       checkmate_1.8.2     seriation_1.2-1     [61] caTools_1.17.1      prabclus_2.2-6      bitops_1.0-6        evaluate_0.10       htmlwidgets_0.8     [66] R6_2.2.0            gplots_3.0.1        DBI_0.5-1           haven_1.0.0         whisker_0.3-2       [71] mgcv_1.8-16         nnet_7.3-12         modelr_0.1.0        KernSmooth_2.23-15  viridis_0.3.4       [76] readxl_0.1.1        data.table_1.10.0   forcats_0.2.0       digest_0.6.12       diptest_0.75-7      [81] stats4_3.3.2        munsell_0.4.3       registry_0.3        viridisLite_0.1.3   quadprog_1.5-5     ``
</div

## 回答リスト

2

これを自動化する1つの方法は、まずグループ化変数のすべての可能な組み合わせを持つベクトル（` ind `）を作成することです。その後、それらの組み合わせを取り、それらを式に変換して` Units `。各式がリストに保存されるにつれて（<コード> l1 ）、そのリストを繰り返し、集計します。

<事前> <コード> ind <- unlist(sapply(c(2,3), function(i) combn(c('Product_Name', 'Product_Type', 'Time'), i, paste, collapse = '+'))) l1 <- sapply(ind, function(i) as.formula(paste('Units ~ ', i))) lapply(l1, function(i) aggregate(i, df, FUN = function(j) c(sum1 = sum(j), avg = mean(j), max_units = max(j)))) #which gives #\$`Product_Name+Product_Type` # Product_Name Product_Type Units.sum1 Units.avg Units.max_units #1 iPhone New 1200 300 500 #2 iPhone Refurbished 900 450 600 #3 Nexus 6P Refurbished 1830 305 700 #\$`Product_Name+Time` # Product_Name Time Units.sum1 Units.avg Units.max_units #1 iPhone 1 1400.00 350.00 600.00 #2 Nexus 6P 1 545.00 272.50 345.00 #3 iPhone 2 700.00 350.00 500.00 #4 Nexus 6P 2 1285.00 321.25 700.00 #\$`Product_Type+Time` # Product_Type Time Units.sum1 Units.avg Units.max_units #1 New 1 500.00 250.00 400.00 #2 Refurbished 1 1445.00 361.25 600.00 #3 New 2 700.00 350.00 500.00 #4 Refurbished 2 1285.00 321.25 700.00 #\$`Product_Name+Product_Type+Time` # Product_Name Product_Type Time Units.sum1 Units.avg Units.max_units #1 iPhone New 1 500.00 250.00 400.00 #2 iPhone Refurbished 1 900.00 450.00 600.00 #3 Nexus 6P Refurbished 1 545.00 272.50 345.00 #4 iPhone New 2 700.00 350.00 500.00 #5 Nexus 6P Refurbished 2 1285.00 321.25 700.00

One way to automate this is to first create a vector (`ind`) with all possible combinations of your grouping variables. We then take those combinations and convert them to formula with `Units`. As each formula is saved in a list (`l1`), we iterate over that list and aggregate.

``ind <- unlist(sapply(c(2,3), function(i) combn(c('Product_Name', 'Product_Type', 'Time'),                                                               i, paste, collapse = '+')))  l1 <- sapply(ind, function(i) as.formula(paste('Units ~ ', i)))  lapply(l1, function(i) aggregate(i, df, FUN = function(j) c(sum1 = sum(j),                                                              avg = mean(j),                                                              max_units = max(j))))  #which gives  #\$`Product_Name+Product_Type` #  Product_Name Product_Type Units.sum1 Units.avg Units.max_units #1       iPhone          New       1200       300             500 #2       iPhone  Refurbished        900       450             600 #3     Nexus 6P  Refurbished       1830       305             700  #\$`Product_Name+Time` #  Product_Name Time Units.sum1 Units.avg Units.max_units #1       iPhone    1    1400.00    350.00          600.00 #2     Nexus 6P    1     545.00    272.50          345.00 #3       iPhone    2     700.00    350.00          500.00 #4     Nexus 6P    2    1285.00    321.25          700.00  #\$`Product_Type+Time` #  Product_Type Time Units.sum1 Units.avg Units.max_units #1          New    1     500.00    250.00          400.00 #2  Refurbished    1    1445.00    361.25          600.00 #3          New    2     700.00    350.00          500.00 #4  Refurbished    2    1285.00    321.25          700.00  #\$`Product_Name+Product_Type+Time` #  Product_Name Product_Type Time Units.sum1 Units.avg Units.max_units #1       iPhone          New    1     500.00    250.00          400.00 #2       iPhone  Refurbished    1     900.00    450.00          600.00 #3     Nexus 6P  Refurbished    1     545.00    272.50          345.00 #4       iPhone          New    2     700.00    350.00          500.00 #5     Nexus 6P  Refurbished    2    1285.00    321.25          700.00 ``
</div

## 関連する質問

1  データフレームの各列をフィルタリングすると、比類のない値の場合はNA  ( Filtering each column of a data frame an put na for unmatched values )

0  Rの照合ケースに基づいて列を変更する  ( Change column based on matched cases in r )

1  RとPURRRを使用して、PMAPを持つリストのリストを使用して複数のデータフレームに参加する  ( Using r and purrr to join multiple dataframes using a list of lists with pmap )
PMAPを使用してリストに埋め込まれているデータフレームをまとめようとしています。 <事前> <コード> library(purrr) library(plyr) # Create a list of 5 data frames create_df <- f...

3  Shiny：dplyrはエラーメッセージを返します  ( Shiny dplyr returns error messages )
<コード> diamonds データセットと下のコードを使用する <事前> <コード> library(dplyr) library(ggplot2) diam <- diamonds %>% dplyr::select(cut, co...

0  このステートメントを最適化する方法はありますか：ベクトルからの値の選択  ( Is there a way to optimize this statement selection of values from a vector )

0  DPLYRを使用して、定義されたグローバル変数にフィルタをフィルタしますか？  ( How do i use dplyr to filter to a pre defined global variable )

0  R：選択した数の列のみで周波数をパーセントに変換する  ( R convert frequency to percentage with only a selected number of columns )

4  巻き巻き毛を用いて変異を変異させる[二重]  ( Mutate a variable with curly curly )
この質問はすでにここで回答を持っています `dplyr` で動的変数名を使用する （9回答） 閉じた 5月前...

0  左側のデータフレームを左join（）に参加する[重複]  ( Joining two data frames with left join )
この質問はすでにここで回答を持っています データフレーム（内側、外側、左、左、左）右） （13回答） 閉じた ...

0  展開DataFrameは `x`を示していますが、通常のシーケンスエラーではありません  ( Expand dataframe shows x is not a regular sequence error )