Merge DF's on str ID and take averages
I have a bunch of csvs in a folder in this format:
chunk timecodes chunk_completed chunk_id diffs_avg sd
0 [53] [[45930]] [45930] 53
1 [53, 50] [[45930], [46480]] [46480] 53-50 550.0
2 [53, 50, 63] [[45930], [46480], [47980]] [47980] 53-50-63 1025.0 671.7514421272201
3 [53, 50, 63, 60] [[45930], [46480], [47980], [49360]] [49360] 53-50-63-60 1143.3333333333333 517.3329037798903
4 [53, 50, 63, 60, 73] [[45930], [46480], [47980], [49360], [50040]] [50040] 53-50-63-60-73 1027.5 481.75893003313035
5 [53, 50, 63, 60, 73, 70] [[45930], [46480], [47980], [49360], [50040], [50310]] [50310] 53-50-63-60-73-70 876.0 537.4290650867331
6 [50] [[46480]] [46480] 50
7 [50, 63] [[46480], [47980]] [47980] 50-63 1500.0
8 [50, 63, 60] [[46480], [47980], [49360]] [49360] 50-63-60 1440.0 84.8528137423857
9 [50, 63, 60, 73] [[46480], [47980], [49360], [50040]] [50040] 50-63-60-73 1186.6666666666667 442.86943147313
I read them in as DFs and make a list of the DFs:
import glob
import pandas as pd

csvs = []
list_of_files = glob.glob('*.{}'.format('csv'))  # i.e. '*.csv'
for file in list_of_files:
    f = pd.read_csv(file)
    csvs.append(f)
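The same read-in can also be written as a single comprehension (assuming the CSVs sit in the current working directory):
# Equivalent one-liner, using the glob and pandas imports above.
csvs = [pd.read_csv(f) for f in glob.glob('*.csv')]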
What I am trying to do is reduce them down to one dataframe with no repetitions of "chunk_id". Instead, I would like to merge on this ID.
I tried this:
from functools import reduce
red = reduce(pd.merge, csvs)
which just gives me a really wide dataframe with no entries.
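Presumably that is because pd.merge with no on= argument joins on every column the frames share. Pinning the key does merge on the ID, but (taking just two of the files, say) it only puts the columns side by side with _x/_y suffixes instead of averaging them:
# Merge two of the frames on the ID only; still wide, every shared column appears twice.
wide = pd.merge(csvs[0], csvs[1], on='chunk_id', how='outer')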
I haven't attempted the averaging yet, but I would like to end up with a dataframe that has exactly the same columns as the example above, where all rows across the dataframes that share a "chunk_id" are merged into one and their "diffs_avg", "timecodes", "chunk_completed" and "sd" values are averaged.
So, if I had read in the following dfs:
DF1
chunk timecodes chunk_completed chunk_id diffs_avg sd
[60 62] [100, 200] 500 60-62 2 1
[58 53] [800, 900] 1000 58-53 4 6
DF2
chunk timecodes chunk_completed chunk_id diffs_avg sd
[60 62] [200, 400] 1000 60-62 4 2
[30 33] [200, 700] 800 30-33 6 7
RESULT:
*[60 62] [150, 300] 750 60-62 3 1.5*
[58 53] [800, 900] 1000 58-53 4 6
[30 33] [200, 700] 800 30-33 6 7
Reproducible DF:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'chunk': {0: '[53]',
              1: '[53, 50]',
              2: '[53, 50, 63]',
              3: '[53, 50, 63, 60]',
              4: '[53, 50, 63, 60, 73]',
              5: '[53, 50, 63, 60, 73, 70]',
              6: '[50]',
              7: '[50, 63]',
              8: '[50, 63, 60]',
              9: '[50, 63, 60, 73]'},
    'chunk_completed': {0: '[45930]',
                        1: '[46480]',
                        2: '[47980]',
                        3: '[49360]',
                        4: '[50040]',
                        5: '[50310]',
                        6: '[46480]',
                        7: '[47980]',
                        8: '[49360]',
                        9: '[50040]'},
    'chunk_id': {0: '53',
                 1: '53-50',
                 2: '53-50-63',
                 3: '53-50-63-60',
                 4: '53-50-63-60-73',
                 5: '53-50-63-60-73-70',
                 6: '50',
                 7: '50-63',
                 8: '50-63-60',
                 9: '50-63-60-73'},
    'diffs_avg': {0: np.nan,
                  1: 550.0,
                  2: 1025.0,
                  3: 1143.3333333333333,
                  4: 1027.5,
                  5: 876.0,
                  6: np.nan,
                  7: 1500.0,
                  8: 1440.0,
                  9: 1186.6666666666667},
    'sd': {0: np.nan,
           1: np.nan,
           2: 671.7514421272201,
           3: 517.3329037798903,
           4: 481.75893003313035,
           5: 537.4290650867331,
           6: np.nan,
           7: np.nan,
           8: 84.8528137423857,
           9: 442.86943147313},
    'timecodes': {0: '[[45930]]',
                  1: '[[45930], [46480]]',
                  2: '[[45930], [46480], [47980]]',
                  3: '[[45930], [46480], [47980], [49360]]',
                  4: '[[45930], [46480], [47980], [49360], [50040]]',
                  5: '[[45930], [46480], [47980], [49360], [50040], [50310]]',
                  6: '[[46480]]',
                  7: '[[46480], [47980]]',
                  8: '[[46480], [47980], [49360]]',
                  9: '[[46480], [47980], [49360], [50040]]'}
})
python pandas dataframe merge
What type is the timecodes column? Are these lists or strings?
– Erfan
Mar 26 at 10:31
Please add the dataframe code, so that people working on this can copy-paste it directly.
– min2bro
Mar 26 at 10:33
They are strings, but they should be a list of integers. That's a mistake which I'll have to sort.
– syntheso
Mar 26 at 10:45
Added reproducible DF.
– syntheso
Mar 26 at 10:51
This is not the appended df, right? Because I'm not seeing any duplicate chunk_id,
so it's not the same as your two small example dataframes. This changes your question, to be honest.
– Erfan
Mar 26 at 10:59
1 Answer
Without knowing the type of your timecodes column, you can average the rest over chunk_id by using pandas.DataFrame.groupby and .agg:
# First of all you should concat your csv's into one big dataframe:
df3 = pd.concat(csvs, axis=0, ignore_index=True)
# First we concat df1 & df2 which is the appending of the CSV's
# Note this is a simulation of your csv's
df3 = pd.concat([df1,df2], ignore_index=True)
print(df3)
chunk timecodes chunk_completed chunk_id diffs_avg sd
0 [60 62] [100, 200] 500 60-62 2 1
1 [58 53] [800, 900] 1000 58-53 4 6
2 [60 62] [200, 400] 1000 60-62 4 2
3 [30 33] [200, 700] 800 30-33 6 7
Now we can aggregate with groupby
df_grouped = df3.groupby('chunk_id').agg({'chunk_completed': 'mean',
                                          'diffs_avg': 'mean',
                                          'sd': 'mean'}).reset_index()
print(df_grouped)
chunk_id chunk_completed diffs_avg sd
0 30-33 800 6 7.0
1 58-53 1000 4 6.0
2 60-62 750 3 1.5
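As a side note, if the remaining columns such as chunk should be carried along rather than dropped, .agg also accepts 'first'; a minimal sketch along the same lines:
# Keep one representative 'chunk' per group while averaging the numeric columns.
df_grouped = df3.groupby('chunk_id').agg({'chunk': 'first',
                                          'chunk_completed': 'mean',
                                          'diffs_avg': 'mean',
                                          'sd': 'mean'}).reset_index()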
This isn't a complete solution, since a main issue is retrieving multiple files from a directory, which then become many dataframes. How could this work with concat?
– syntheso
Mar 26 at 10:48
It's to simulate your appending of the CSVs, which I understood was already done by you and not your problem. I cannot "append" your csv's; I don't have them. It's just to show how I append
the two dataframes. @SebSilas
– Erfan
Mar 26 at 10:50
See my edit, this should clear things up @SebSilas
– Erfan
Mar 26 at 10:53
Thank you Erfan!
– syntheso
Mar 26 at 11:07
Additionally, how might we get an average for each item in the list in the timecodes column? Assuming it is a list of integers.
– syntheso
Mar 26 at 11:08
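One possible approach to that follow-up, assuming timecodes holds string representations of lists (as in the sample CSVs) and that the lists sharing a chunk_id have equal lengths, is to parse them with ast.literal_eval and average them element-wise; a minimal sketch:
import ast
import numpy as np

# Parse the stringified lists into real Python lists (assumes the strings are valid list literals).
df3['timecodes'] = df3['timecodes'].apply(ast.literal_eval)

# Element-wise mean per chunk_id; np.array requires equal-length lists within each group.
timecode_means = (df3.groupby('chunk_id')['timecodes']
                     .apply(lambda s: np.mean(np.array(list(s), dtype=float), axis=0).tolist())
                     .reset_index())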