Organizing a dataframe - splitting one column into three
I have a dataset that looks like this:
Ord_ID  Supplier  Trans_Type  Date
1       A         PO          2/3/18
1       A         Receipt     2/15/18
2       B         PO          2/4/18
2       B         Receipt     3/13/18
3       C         PO          2/7/18
3       C         Receipt     3/1/18
3       C         Receipt     3/5/18
3       C         Receipt     3/29/18
4       B         PO          2/9/18
4       B         Receipt     2/20/18
4       B         Receipt     2/27/18
5       D         PO          2/18/18
5       D         Receipt     4/2/18
Basically, I need to separate the Date column into 3 different columns. I need a PO_Date column, a column that lists the earliest receipt date for each order, and the last receipt date for each order. Because some orders only have one receipt date, the 2nd and 3rd columns should be the same. I've tried using spread(), but I guess because there are varying numbers of Receipt dates for each order it didn't work. How can I make this happen?
Desired result:
Ord_ID  Supplier  PO_Date  First_Receipt_Date  Last_Receipt_Date
1       A         2/3/18   2/15/18             2/15/18
2       B         2/4/18   3/13/18             3/13/18
3       C         2/7/18   3/1/18              3/29/18
4       B         2/9/18   2/20/18             2/27/18
5       D         2/18/18  4/2/18              4/2/18
r dataframe spread
asked Mar 21 at 21:37
Millie
5 Answers
Using dplyr. First, make sure column Date is in date format. Assume dataframe is named mydata:
library(dplyr)
mydata <- mydata %>%
  mutate(Date = as.Date(Date, "%m/%d/%y"))
Now you can filter for Receipt, calculate max/min dates, then filter the original data for PO and join them together:
mydata %>%
  filter(Trans_Type == "Receipt") %>%
  group_by(Ord_ID, Supplier) %>%
  summarise(First_Receipt_Date = min(Date),
            Last_Receipt_Date = max(Date)) %>%
  ungroup() %>%
  # name the join keys explicitly instead of relying on shared column names
  left_join(filter(mydata, Trans_Type == "PO"), by = c("Ord_ID", "Supplier")) %>%
  select(Ord_ID, Supplier, PO_Date = Date, First_Receipt_Date, Last_Receipt_Date)
Result:
Ord_ID Supplier PO_Date First_Receipt_Date Last_Receipt_Date
<int> <chr> <date> <date> <date>
1 1 A 2018-02-03 2018-02-15 2018-02-15
2 2 B 2018-02-04 2018-03-13 2018-03-13
3 3 C 2018-02-07 2018-03-01 2018-03-29
4 4 B 2018-02-09 2018-02-20 2018-02-27
5 5 D 2018-02-18 2018-04-02 2018-04-02
When I run this I get this error: "Error: by required, because the data sources have no common variables"
– Millie
Mar 22 at 18:36
Works for me using the example data in the question: the join is on Ord_ID and Supplier.
– neilfws
Mar 23 at 3:06
Got it to work this time. Not sure what was wrong last week; I updated packages earlier which might have done the trick.
– Millie
Mar 25 at 22:05
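The error quoted in the comments is the message left_join gives when it finds no shared column names to join on. The thread never establishes what caused it in Millie's session, so the following is only a minimal sketch of a defensive variant: it names the join keys explicitly, as neilfws describes ("the join is on Ord_ID and Supplier"). The intermediate names receipts, orders and result are illustrative and do not appear in the answer.

```r
library(dplyr)

# Sample data from the question, with Date parsed from month/day/year text
mydata <- data.frame(
  Ord_ID = c(1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5),
  Supplier = c("A", "A", "B", "B", "C", "C", "C", "C", "B", "B", "B", "D", "D"),
  Trans_Type = c("PO", "Receipt", "PO", "Receipt", "PO", "Receipt", "Receipt",
                 "Receipt", "PO", "Receipt", "Receipt", "PO", "Receipt"),
  Date = as.Date(c("2/3/18", "2/15/18", "2/4/18", "3/13/18", "2/7/18",
                   "3/1/18", "3/5/18", "3/29/18", "2/9/18", "2/20/18",
                   "2/27/18", "2/18/18", "4/2/18"), format = "%m/%d/%y"),
  stringsAsFactors = FALSE
)

# First and last receipt date per order
receipts <- mydata %>%
  filter(Trans_Type == "Receipt") %>%
  group_by(Ord_ID, Supplier) %>%
  summarise(First_Receipt_Date = min(Date),
            Last_Receipt_Date = max(Date)) %>%
  ungroup()

# One PO row per order, with Date renamed before the join
orders <- mydata %>%
  filter(Trans_Type == "PO") %>%
  select(Ord_ID, Supplier, PO_Date = Date)

# Explicit keys: the join no longer depends on which columns happen to match
result <- left_join(receipts, orders, by = c("Ord_ID", "Supplier")) %>%
  select(Ord_ID, Supplier, PO_Date, First_Receipt_Date, Last_Receipt_Date)

print(result)
```

Spelling out by also silences the "Joining, by = ..." message and makes the code fail loudly if a key column is ever renamed.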
With tidyverse, borrowing @divibisan's sample data:
library(tidyverse)
df %>%
  group_by(Ord_ID, Supplier) %>%
  slice(c(1:2, n())) %>%
  mutate(Trans_Type = c("PO_Date", "First_Receipt_Date", "Last_Receipt_Date")) %>%
  spread(Trans_Type, Date) %>%
  ungroup()
# # A tibble: 5 x 5
# Ord_ID Supplier First_Receipt_Date Last_Receipt_Date PO_Date
# <int> <fct> <date> <date> <date>
# 1 1 A 2018-02-15 2018-02-15 2018-02-03
# 2 2 B 2018-03-13 2018-03-13 2018-02-04
# 3 3 C 2018-03-01 2018-03-29 2018-02-07
# 4 4 B 2018-02-20 2018-02-27 2018-02-09
# 5 5 D 2018-04-02 2018-04-02 2018-02-18
If the data is not sorted as in the sample data, add %>% arrange(Trans_Type, Date) as a first step.
I would recommend not to use slice, it is not reproducible as you do not know the order of data. At least, use arrange before. A better way would be a combination of filter with first and last I guess.
– Sébastien Rochette
Mar 22 at 14:32
If data is not sorted, you can add %>% arrange(Trans_Type, Date) as a first step. Given the shape of the sample data I assumed it was fair to assume it is sorted. Other assumptions are that there is always a "PO" value, that there are no other values than "PO" and "Receipt", that order of the output columns wasn't important etc...
– Moody_Mudskipper
Mar 22 at 17:15
I don't understand the point on filter with first and last, I use slice in its precise intended use case IMO.
– Moody_Mudskipper
Mar 22 at 17:18
I know this is correct with your assumptions in this specific case. But because you never know how a dataset is built, nor can you know how it will be later updated, I do not recommend the use of indices for the selection/filtering of datasets. There must be a better explanation, included in the data, as why you chose these specific lines. Here, these are the smallest and the biggest values. I try to always think about the future use of my scripts. This is a personal recommendation.
– Sébastien Rochette
Mar 22 at 17:22
That's a legitimate point, but there's also value in concise code, and on SO you're rarely 100% explicit about assumptions anyway so it's a gray area. Actually my first answer had the arrange part but I edited it out (not showing in edit history as I edited right away). I added a note at the end of my post as a compromise :).
– Moody_Mudskipper
Mar 22 at 17:29
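One way to act on the concern raised in these comments is to pick rows by value rather than by position. The sketch below is not taken from any of the answers: it assumes each order has exactly one "PO" row and at least one "Receipt" row, and it reverses the row order on purpose to show that the result does not depend on how the input is sorted.

```r
library(dplyr)

# Sample data from the question
df <- data.frame(
  Ord_ID = c(1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5),
  Supplier = c("A", "A", "B", "B", "C", "C", "C", "C", "B", "B", "B", "D", "D"),
  Trans_Type = c("PO", "Receipt", "PO", "Receipt", "PO", "Receipt", "Receipt",
                 "Receipt", "PO", "Receipt", "Receipt", "PO", "Receipt"),
  Date = as.Date(c("2/3/18", "2/15/18", "2/4/18", "3/13/18", "2/7/18",
                   "3/1/18", "3/5/18", "3/29/18", "2/9/18", "2/20/18",
                   "2/27/18", "2/18/18", "4/2/18"), format = "%m/%d/%y"),
  stringsAsFactors = FALSE
)

# Reverse the rows so the input is deliberately NOT in the sample order
df <- df[nrow(df):1, ]

# Select by value: the condition on Trans_Type documents why each date is chosen
result <- df %>%
  group_by(Ord_ID, Supplier) %>%
  summarise(PO_Date = min(Date[Trans_Type == "PO"]),
            First_Receipt_Date = min(Date[Trans_Type == "Receipt"]),
            Last_Receipt_Date = max(Date[Trans_Type == "Receipt"])) %>%
  ungroup()

print(result)
```

The trade-off discussed in the comments still applies: this is longer than the slice version, but it states its selection rule in the code instead of relying on row order.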
I would start with something like this:
data %>%
  # include Ord_ID so two orders from the same supplier are not merged
  group_by(Ord_ID, Supplier, Trans_Type) %>%
  summarise(min_date = min(Date),
            max_date = max(Date)) %>%
  ungroup()
Then, you can play with gather and spread to retrieve the columns you need.
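The reshaping step that this answer leaves open could look roughly like the sketch below. It swaps in pivot_wider (tidyr 1.0.0 or later) for the gather and spread pair the answer mentions, because pivot_wider can widen two value columns in one call. It also groups by Ord_ID in addition to Supplier so that supplier B's two orders stay separate. The data frame name data follows the answer; result is an illustrative name.

```r
library(dplyr)
library(tidyr)

# Sample data from the question
data <- data.frame(
  Ord_ID = c(1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5),
  Supplier = c("A", "A", "B", "B", "C", "C", "C", "C", "B", "B", "B", "D", "D"),
  Trans_Type = c("PO", "Receipt", "PO", "Receipt", "PO", "Receipt", "Receipt",
                 "Receipt", "PO", "Receipt", "Receipt", "PO", "Receipt"),
  Date = as.Date(c("2/3/18", "2/15/18", "2/4/18", "3/13/18", "2/7/18",
                   "3/1/18", "3/5/18", "3/29/18", "2/9/18", "2/20/18",
                   "2/27/18", "2/18/18", "4/2/18"), format = "%m/%d/%y"),
  stringsAsFactors = FALSE
)

result <- data %>%
  group_by(Ord_ID, Supplier, Trans_Type) %>%
  summarise(min_date = min(Date),
            max_date = max(Date)) %>%
  ungroup() %>%
  # one row per order; columns become min_date_PO, min_date_Receipt,
  # max_date_PO and max_date_Receipt
  pivot_wider(names_from = Trans_Type, values_from = c(min_date, max_date)) %>%
  # keep and rename only the three dates the question asks for
  select(Ord_ID, Supplier,
         PO_Date = min_date_PO,
         First_Receipt_Date = min_date_Receipt,
         Last_Receipt_Date = max_date_Receipt)

print(result)
```

Because a PO row carries a single date, min_date_PO and max_date_PO are identical, so either one can serve as PO_Date.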
Here's another tidyverse based solution that avoids the left_join. I have no idea which approach would be faster on a large dataset, but it's always good to have more options:
df <- structure(list(Ord_ID = c(1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 4L,
4L, 4L, 5L, 5L), Supplier = structure(c(1L, 1L, 2L, 2L, 3L, 3L,
3L, 3L, 2L, 2L, 2L, 4L, 4L), .Label = c("A", "B", "C", "D"), class = "factor"),
Trans_Type = c("PO", "Receipt", "PO", "Receipt", "PO", "Receipt",
"Receipt", "Receipt", "PO", "Receipt", "Receipt", "PO", "Receipt"
), Date = structure(c(17565, 17577, 17566, 17603, 17569,
17591, 17595, 17619, 17571, 17582, 17589, 17580, 17623), class = "Date")), row.names = c(NA,
-13L), class = "data.frame")
df %>%
  group_by(Ord_ID, Supplier, Trans_Type) %>%
  # Keep only min and max date values
  filter(Date == min(Date) | Date == max(Date) | Trans_Type != 'Receipt') %>%
  # Rename 2nd Receipt value Receipt_2 so there are no duplicated values
  mutate(Trans_Type2 = if_else(Trans_Type == 'Receipt' & row_number() == 2,
                               'Receipt_2', Trans_Type)) %>%
  # Drop Trans_Type variable (we can't replace in mutate since it's a grouping var)
  ungroup(Trans_Type) %>%
  select(-Trans_Type) %>%
  # Spread the now unduplicated Trans_Type values
  spread(Trans_Type2, Date) %>%
  # Fill in Receipt_2 values where they're missing
  mutate(Receipt_2 = if_else(is.na(Receipt_2), Receipt, Receipt_2))
# A tibble: 5 x 5
Ord_ID Supplier PO Receipt Receipt_2
<int> <fct> <date> <date> <date>
1 1 A 2018-02-03 2018-02-15 2018-02-15
2 2 B 2018-02-04 2018-03-13 2018-03-13
3 3 C 2018-02-07 2018-03-01 2018-03-29
4 4 B 2018-02-09 2018-02-20 2018-02-27
5 5 D 2018-02-18 2018-04-02 2018-04-02
You can just use dplyr to mutate new columns for PO date, and first and last receipt dates:
library(dplyr)
library(lubridate)  # provides mdy()

test1 <- test %>%
  mutate(Date = mdy(Date)) %>%
  group_by(Ord_ID) %>%
  mutate(PO_Date = ifelse(Trans_Type == "PO", Date, NA),
         Receipt_Date_First = min(Date[Trans_Type == "Receipt"]),
         Receipt_Date_Last = max(Date[Trans_Type == "Receipt"])) %>%
  filter(!is.na(PO_Date)) %>%
  mutate(PO_Date = as.Date(PO_Date, origin = "1970-01-01"))
A breakdown:
test1 <- test %>%
  # convert the "Date" column to Date class so min and max dates can be identified
  mutate(Date = mdy(Date)) %>%
  # group by the Order ID
  group_by(Ord_ID) %>%
  # PO_Date is the date where "Trans_Type" is "PO"; ifelse() drops the Date
  # class and returns plain numbers, which is fixed in the last step
  mutate(PO_Date = ifelse(Trans_Type == "PO", Date, NA),
         # first receipt date is the minimum date of a receipt transaction
         Receipt_Date_First = min(Date[Trans_Type == "Receipt"]),
         # last receipt date is the maximum date of a receipt transaction
         Receipt_Date_Last = max(Date[Trans_Type == "Receipt"])) %>%
  # keep only the PO row of each order, which removes the duplicates
  filter(!is.na(PO_Date)) %>%
  # convert "PO_Date" back from numeric to Date; the explicit origin is
  # required by as.Date() in R versions before 4.3
  mutate(PO_Date = as.Date(PO_Date, origin = "1970-01-01"))
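The round trip through numeric values can be avoided with dplyr's if_else, which keeps the Date class as long as both branches are dates. This is a variation on the answer rather than the author's code: it parses the dates with base as.Date instead of lubridate's mdy so that dplyr is the only dependency, and it drops the leftover Trans_Type and Date columns at the end, which the answer does not do.

```r
library(dplyr)

# Sample data from the question, with Date still stored as text
test <- data.frame(
  Ord_ID = c(1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5),
  Supplier = c("A", "A", "B", "B", "C", "C", "C", "C", "B", "B", "B", "D", "D"),
  Trans_Type = c("PO", "Receipt", "PO", "Receipt", "PO", "Receipt", "Receipt",
                 "Receipt", "PO", "Receipt", "Receipt", "PO", "Receipt"),
  Date = c("2/3/18", "2/15/18", "2/4/18", "3/13/18", "2/7/18", "3/1/18",
           "3/5/18", "3/29/18", "2/9/18", "2/20/18", "2/27/18", "2/18/18",
           "4/2/18"),
  stringsAsFactors = FALSE
)

test1 <- test %>%
  mutate(Date = as.Date(Date, format = "%m/%d/%y")) %>%
  group_by(Ord_ID) %>%
  # as.Date(NA) is a missing value of class Date, so PO_Date stays a Date
  mutate(PO_Date = if_else(Trans_Type == "PO", Date, as.Date(NA)),
         Receipt_Date_First = min(Date[Trans_Type == "Receipt"]),
         Receipt_Date_Last = max(Date[Trans_Type == "Receipt"])) %>%
  ungroup() %>%
  # keep only the PO row of each order
  filter(!is.na(PO_Date)) %>%
  # drop the columns that the desired result does not contain
  select(-Trans_Type, -Date)

print(test1)
```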
First answer (dplyr with left_join): answered Mar 21 at 22:06 by neilfws
Second answer (tidyverse with slice and spread): answered Mar 22 at 9:58 by Moody_Mudskipper, edited Mar 22 at 17:25
I would start with something like this:
data %>%
  group_by(Supplier, Trans_Type) %>%
  summarise(min_date = min(Date),
            max_date = max(Date)) %>%
  ungroup()
Then, you can play with gather and spread to retrieve the columns you need.
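One possible way to finish this idea is sketched below. It makes a few assumptions that are not in the answer above: the sample data is retyped from the `dput()` output shown elsewhere on this page, `Ord_ID` is added to the grouping (supplier B has two separate orders), the output column names are illustrative, and `pivot_wider()` is used for the reshaping step because `gather()` and `spread()` are superseded in current tidyr.

```r
library(dplyr)
library(tidyr)

# Sample data retyped from the dput() output in another answer on this page
df <- data.frame(
  Ord_ID     = c(1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 5L, 5L),
  Supplier   = c("A", "A", "B", "B", "C", "C", "C", "C", "B", "B", "B", "D", "D"),
  Trans_Type = c("PO", "Receipt", "PO", "Receipt", "PO", "Receipt", "Receipt",
                 "Receipt", "PO", "Receipt", "Receipt", "PO", "Receipt"),
  Date       = as.Date(c(17565, 17577, 17566, 17603, 17569, 17591, 17595,
                         17619, 17571, 17582, 17589, 17580, 17623),
                       origin = "1970-01-01"),
  stringsAsFactors = FALSE
)

result <- df %>%
  # Ord_ID is added to the grouping (an assumption) to keep orders separate
  group_by(Ord_ID, Supplier, Trans_Type) %>%
  summarise(min_date = min(Date),
            max_date = max(Date)) %>%
  ungroup() %>%
  # One row per order; columns become min_date_PO, max_date_Receipt, ...
  pivot_wider(names_from = Trans_Type,
              values_from = c(min_date, max_date)) %>%
  # Keep and rename the three dates of interest (names are illustrative)
  select(Ord_ID, Supplier,
         PO_Date            = min_date_PO,
         Receipt_Date_First = min_date_Receipt,
         Receipt_Date_Last  = max_date_Receipt)
```

Each order ends up on a single row with its PO date and its first and last receipt dates. For orders with a single receipt, the two receipt columns hold the same date.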
answered Mar 21 at 21:55
Sébastien Rochette
Here's another tidyverse based solution that avoids the left_join. I have no idea which approach would be faster on a large dataset, but it's always good to have more options:
df <- structure(list(Ord_ID = c(1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 4L,
4L, 4L, 5L, 5L), Supplier = structure(c(1L, 1L, 2L, 2L, 3L, 3L,
3L, 3L, 2L, 2L, 2L, 4L, 4L), .Label = c("A", "B", "C", "D"), class = "factor"),
Trans_Type = c("PO", "Receipt", "PO", "Receipt", "PO", "Receipt",
"Receipt", "Receipt", "PO", "Receipt", "Receipt", "PO", "Receipt"
), Date = structure(c(17565, 17577, 17566, 17603, 17569,
17591, 17595, 17619, 17571, 17582, 17589, 17580, 17623), class = "Date")), row.names = c(NA,
-13L), class = "data.frame")
library(dplyr)
library(tidyr)
df %>%
  group_by(Ord_ID, Supplier, Trans_Type) %>%
  # Keep only min and max date values
  filter(Date == min(Date) | Date == max(Date) | Trans_Type != 'Receipt') %>%
  # Rename 2nd Receipt value Receipt_2 so there are no duplicated values
  mutate(Trans_Type2 = if_else(Trans_Type == 'Receipt' & row_number() == 2,
                               'Receipt_2', Trans_Type)) %>%
  # Drop Trans_Type variable (we can't replace in mutate since it's a grouping var)
  ungroup(Trans_Type) %>%
  select(-Trans_Type) %>%
  # Spread the now unduplicated Trans_Type values
  spread(Trans_Type2, Date) %>%
  # Fill in Receipt_2 values where they're missing
  mutate(Receipt_2 = if_else(is.na(Receipt_2), Receipt, Receipt_2))
# A tibble: 5 x 5
  Ord_ID Supplier PO         Receipt    Receipt_2
   <int> <fct>    <date>     <date>     <date>
1      1 A        2018-02-03 2018-02-15 2018-02-15
2      2 B        2018-02-04 2018-03-13 2018-03-13
3      3 C        2018-02-07 2018-03-01 2018-03-29
4      4 B        2018-02-09 2018-02-20 2018-02-27
5      5 D        2018-02-18 2018-04-02 2018-04-02
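As a side note on the last step, the "fill in where missing" pattern can also be written with `dplyr::coalesce()`, which returns the first non-missing value at each position. The small `wide` data frame below is a hypothetical stand-in for the spread result, not the full output above.

```r
library(dplyr)

# Hypothetical stand-in for the spread result: the first order has one receipt only
wide <- data.frame(
  Ord_ID    = c(1L, 3L),
  Receipt   = as.Date(c("2018-02-15", "2018-03-01")),
  Receipt_2 = as.Date(c(NA, "2018-03-29"))
)

filled <- wide %>%
  # Take Receipt_2 where present, otherwise fall back to Receipt
  mutate(Receipt_2 = coalesce(Receipt_2, Receipt))
```

This is only a stylistic alternative; it should behave the same as the `if_else(is.na(...))` form used in the answer.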
answered Mar 21 at 22:24
divibisan
You can just use dplyr to mutate new columns for PO date, and first and last receipt dates:
library(dplyr)
library(lubridate)  # provides mdy()
test1 <- test %>%
  mutate(Date = mdy(Date)) %>%
  group_by(Ord_ID) %>%
  mutate(PO_Date = ifelse(Trans_Type == "PO", Date, NA),
         Receipt_Date_First = min(Date[Trans_Type == "Receipt"]),
         Receipt_Date_Last = max(Date[Trans_Type == "Receipt"])) %>%
  filter(!is.na(PO_Date)) %>%
  mutate(PO_Date = as.Date(as.numeric(PO_Date), origin = "1970-01-01"))
A breakdown:
test1 <- test %>%
  # convert the "Date" column from text to Date so min and max dates can be identified
  mutate(Date = mdy(Date)) %>%
  # group by the Order ID
  group_by(Ord_ID) %>%
  # PO_Date will be where the "Trans_Type" is "PO" --> ifelse() drops the Date class
  # and returns a plain number, but this can be fixed later
  mutate(PO_Date = ifelse(Trans_Type == "PO", Date, NA),
         # first receipt date is the minimum date of a receipt transaction
         Receipt_Date_First = min(Date[Trans_Type == "Receipt"]),
         # last receipt date is the maximum date of a receipt transaction
         Receipt_Date_Last = max(Date[Trans_Type == "Receipt"])) %>%
  # keep only the PO row of each order, to remove duplicates
  filter(!is.na(PO_Date)) %>%
  # convert "PO_Date" back from numeric to Date (an explicit origin is required before R 4.3.0)
  mutate(PO_Date = as.Date(as.numeric(PO_Date), origin = "1970-01-01"))
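A variant of the same idea, sketched under some assumptions: using `summarise()` instead of `mutate()` plus `filter()` gives one row per order directly and keeps `PO_Date` as a Date throughout, so no numeric round trip is needed. The `test` data frame below is hypothetical (the original `test` object is not shown); it assumes dates stored as month/day/year text, which is what `mdy()` expects, and exactly one "PO" row per order.

```r
library(dplyr)
library(lubridate)

# Hypothetical input: dates stored as month/day/year text, as mdy() expects
test <- data.frame(
  Ord_ID     = c(3L, 3L, 3L, 3L, 5L, 5L),
  Supplier   = c("C", "C", "C", "C", "D", "D"),
  Trans_Type = c("PO", "Receipt", "Receipt", "Receipt", "PO", "Receipt"),
  Date       = c("2/7/2018", "3/1/2018", "3/5/2018", "3/29/2018",
                 "2/18/2018", "4/2/2018"),
  stringsAsFactors = FALSE
)

test2 <- test %>%
  mutate(Date = mdy(Date)) %>%
  group_by(Ord_ID, Supplier) %>%
  summarise(
    # Subsetting a Date vector keeps its class, unlike ifelse()
    PO_Date            = Date[Trans_Type == "PO"][1],
    Receipt_Date_First = min(Date[Trans_Type == "Receipt"]),
    Receipt_Date_Last  = max(Date[Trans_Type == "Receipt"])
  ) %>%
  ungroup()
```

Note that an order without any receipt would make `min()` and `max()` run on an empty vector, which produces a warning and an infinite value, so such orders would need separate handling.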
answered Mar 21 at 23:21
S. Ash