How to manage a million airflow tasks with different start dates?
I have about one million Airflow tasks that use the same Python function. Each needs to run with a different start date and different parameters.
Earlier I asked a question about how to run two such tasks under one DAG. However, once the number of tasks grows large, the answers there do not scale (see link and notes).
Question
How can I run a million (or any large number of) tasks in a scalable fashion on Airflow, where each task stems from the same Python function but has a different start date and different arguments?
Notes
The tasks don't need to run on a PythonOperator (as they stem from a Python function). In reality, they would most likely run in a distributed fashion on a Kubernetes cluster (so with a KubernetesExecutor or KubernetesPodOperator). Either way, the architectural problem of constructing the DAG(s) remains.
Solution ideas
One solution I was considering: under one DAG, dynamically construct all tasks and pass each task's start date into the Python function that gets executed. On the outside, Airflow would execute every task every day; inside the function, if the execution_date is earlier than that task's start_date, the function simply returns 0.
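A rough sketch of that idea follows. The per-user configs, names, and download stub are all made up for illustration, and the shared callable is shown standalone rather than wired into PythonOperators:

```python
from datetime import date

# Hypothetical per-task configuration: one entry per user, each with
# its own start date (in reality this would hold ~1M entries).
TASK_CONFIGS = {
    "alice": date(2013, 1, 2),
    "bob": date(2013, 3, 1),
}

def download_activity(user, start_date, execution_date):
    """Shared callable for every dynamically generated task.

    Airflow would invoke this daily for every task; the guard below
    makes runs before the task's own start date a cheap no-op.
    """
    if execution_date < start_date:
        return 0  # nothing to do yet for this user
    # ... real per-user download logic would go here ...
    return f"downloaded {user} for {execution_date.isoformat()}"

# Inside a DAG file one would loop over TASK_CONFIGS, creating one
# PythonOperator per user with op_kwargs={"user": ..., "start_date": ...}.
```

Note the scalability concern raised in the question still applies: the guard avoids wasted work per run, but the scheduler must still track every task.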
Can you provide a bit more detail? This sounds like a lot of work; are you prepared to throw an army of machines at it so that it finishes before the heat death of the universe?
– Robert Harvey♦
Mar 25 at 14:10
Sure, let me know what kind of information would be useful to add. I have limited it to one clear question now.
– Newskooler
Mar 25 at 14:15
I'm a bit confused. I once worked for a company that had a similar workflow arrangement. I can't imagine the number of tasks being in the thousands, let alone the millions, so I think I'm missing something here (by at least three orders of magnitude).
– Robert Harvey♦
Mar 25 at 14:30
Here is an example: say I have a million users. Each of them joined my network on a different date (so each has a different start date). Each user's activity is saved in daily .json files. If I want to download this data to work with, I need a task for each user. They would all have different start dates, and the function I use to download would take a different argument (e.g. the user name). Your comment suggests that I may be thinking about the issue in the wrong way, I guess.
– Newskooler
Mar 25 at 14:54
You are thinking about it the wrong way. Airflow can be used for millions of dynamic tasks, but it should not be. Airflow DAGs are supposed to be fairly constant. I suggest you use other tools for this problem. You can still use Airflow, for example, to process the whole batch of users and use that info in your ETL process later.
– vurmux
Mar 26 at 9:53
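The batching idea in that comment can be sketched as a single task whose callable iterates over every user itself, instead of one Airflow task per user (the user table and download stub below are hypothetical):

```python
from datetime import date

def download_activity(user, day):
    # Hypothetical per-user download, stubbed for illustration.
    return f"{user}:{day.isoformat()}"

def process_all_users(users, execution_date):
    """One Airflow task handles the whole batch: iterate over every
    user and skip those who joined after this run's date."""
    results = {}
    for user, joined in users.items():
        if execution_date >= joined:
            results[user] = download_activity(user, execution_date)
    return results
```

This keeps the DAG itself constant (one task per run) while the fan-out over a million users happens inside the task.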
edited Mar 25 at 14:23 by Newskooler
asked Mar 25 at 14:08 by Newskooler
1 Answer
After our conversation in the comments, I think I can give an answer:
Airflow can be used for millions of dynamic tasks, but it should not be. Airflow DAGs are supposed to be fairly constant. You can still use Airflow, for example, to process the whole batch of users (given from somewhere) and use that info in your ETL process later.
I recommend building your task system on top of the Celery library (don't confuse this with the CeleryExecutor in Airflow; Airflow itself can run on top of Celery). Celery is a task queue focused on millions of real-time tasks:
Celery is used in production systems to process millions of tasks a day.
Celery is written in Python, is production-ready, stable, and incredibly scalable. I think it is the best tool for solving your problem.
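A minimal sketch of that suggestion: a standalone Celery task invoked once per user, outside Airflow. The app name, broker URL, and task body are illustrative assumptions, and the try/except keeps the sketch importable even where Celery is not installed:

```python
try:
    from celery import Celery
    # Broker URL is an assumption; any broker Celery supports would do.
    app = Celery("user_etl", broker="redis://localhost:6379/0")
    task = app.task
except ImportError:
    # Fallback so the sketch still runs without Celery: a plain function.
    def task(fn):
        return fn

@task
def download_user_activity(user, day):
    """One invocation per user per day; the real body would fetch
    the user's daily .json activity file."""
    return f"{user}:{day}"

# With a broker and worker running, one would dispatch asynchronously:
#   download_user_activity.delay("alice", "2013-01-02")
```

The results of these Celery tasks could then be handed to a small, constant Airflow DAG for the downstream ETL steps, as the answer suggests.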
I am already running Airflow with a CeleryExecutor. However, when I construct a DAG for, say, 20130102, it will have 120k tasks; the next day it will have 150k tasks, and a week later 100k tasks. How does the fact that I am using Celery help here? I thought it was good to keep the number of tasks in a DAG constant?
– Newskooler
Mar 26 at 13:47
Apache Airflow can work on top of Celery (that is why Airflow is so scalable). I recommend using Celery itself, not inside Airflow. You can write a script that runs 100k tasks with Celery, gets their results, and sends them to some Airflow task that will work with them.
– vurmux
Mar 26 at 13:53
So when you say working with Celery, you don't mean the CeleryExecutor, but Celery standalone, integrated into my code?
– Newskooler
Mar 26 at 14:41
Yes, exactly. I meant that you can build your code on top of the Celery library.
– vurmux
Mar 26 at 15:10
Perfect, thanks. Maybe add this to your reply, so that there is no confusion over the Celery Executor.
– Newskooler
Mar 26 at 15:12
edited Mar 26 at 15:14 by vurmux
answered Mar 26 at 13:42 by vurmux