High Throughput Computing in the Service of Scientific Discovery

Thank you. It's a short but so far pleasurable visit here. I must admit that I was not aware until this morning that I am scheduled to give a talk here; I thought it would be just a visit. So I am recycling the talk that I gave earlier this week. But I strongly encourage you to guide me through the process with questions, suggestions, disagreements, complaints, whatever, especially if you want to know more of the deeper technical stuff behind what I am presenting, because the talk is pretty shallow on technical things and is more about the big picture. I hope you will find it useful, but I'm open to any adjustment and suggestion on how to do it.

So I'll start with a little bit of background about UW-Madison, besides the fact that it always looks like this, right? The really important thing is the numbers: 43,000 students, 22,000 faculty and staff, three billion dollars of total budget. But what is important for the discussion here is that there is also 1.2 billion dollars of external support for research that goes into the campus. That's a very significant research engine, which obviously is strongly related to what I'm going to talk about. That's the reason I wanted to make sure that this is on the table as we start.

Before I dive into the more science part, let me talk business. It's always a challenge, when you are in the business world, to decide where you are on the vertical versus the horizontal. Do you focus on a market that is vertical, very few users with very specific requirements, or do you go horizontal and do something that applies to everyone, and what
have you. Where you position yourself, the way you put your effort, and all that kind of thing is a challenge for any business.

So, with this as a background: when you work with domain scientists, they are interested in verticals. They are interested in solving the problem, and they don't care about what delivers to them the solution for the computational problem, how elegant it is, whether it uses blockchains or machine learning or whatever it is. It has to be effective, because if it's effective they will adopt it, and if not, who cares. And it's not efficient, it's effective, because it has to do what they need to do with the resources that they have, in the amount of time that they have.

So this is sort of the vertical that researchers, the domain scientists, care about. Computer scientists are horizontal. They want to come up with something that everyone will use, and that is what they focus on in terms of their science. I am a computer scientist, so I'm supposed to belong to this side. But I would say, and we'll see later, that I strongly believe that if you want to be horizontal, you must evaluate what you are doing in real verticals. So even if you want to be horizontal, you cannot run away from the verticals; you have to work with them.

I think the same realization is shared by MIT, which a year ago came and said: we are going to invest a billion dollars in creating a new college that will reshape MIT. You know, when I see MIT saying they will reshape themselves, that's huge. Institutions like this don't reshape. Commercial entities restructure every year or every six months, but organizations like MIT don't reshape themselves. The main thing is to reorient MIT to bring computing and AI to all fields of study, and to use all fields of study to impact research and development in computing and AI. So this is recognizing the fact that there is a strong synergy between the two areas, which requires a different approach to how an established and successful institution like MIT is going to move forward.

So, in
rephrasing the announcement of MIT to the context of the Center for High Throughput Computing (CHTC), I claim that we established the center in 2006 to do exactly that, on a much narrower scope of computing technology: to do it on distributed high throughput computing, to have impact on all fields of study on the campus, and to leverage this 1.2 billion dollar enterprise to do what we do better.

OK. So one way to summarize what we are doing in CHTC is to harmonize the vertical and the horizontal. Now, that may sound as if we are trying to square the circle, because usually you say that two things that are completely independent are orthogonal. So if you think about the horizontal and the vertical as being orthogonal, you would say, how can you harmonize them? But I believe we can square the circle.

So this is the kind of love letter we like to get from our verticals. This is a guy who is studying mice on a remote island in the South Atlantic, with an evolutionary phenomenon that is not understood: why are the mice there twice the size of the mice in Europe, even though they arrived on this island not by swimming from Europe but on cargo ships that went to the island two hundred years ago? Suddenly they're twice as big as their ancestors in Europe. By the way, I learned as a result of talking to these guys that there are two types of field mice, the German and the French. I don't know which one made it, or whether both of them made it. But the mystery, what they don't understand, is how you can have an evolutionary process in two hundred years. Because they are twice as big, and they're actually causing problems there. The island is basically a rock in the middle of the ocean. Unless they swam over there, but that's a little bit...

So what we like to see is "CHTC is essential to my project." This is an adoption of a vertical: "I couldn't do what I did without this." This is the effectiveness, if you want.

So in order to do this kind of harmonization, you have to do (I created this acronym as I put together the slide) DDDOE. Namely, you have to be able to design, to develop, to deploy, to operate, and to evaluate. The researchers are coming with their upper layer. You have to provide the rest, which is the lower layer and the stuff in the middle, and you want to do the best you can on this part. You also have to make sure that you partner with them, which is more of a personal commitment of time and effort. And what you should aim at, if you really believe in it, is how you can influence their science with the capabilities that you offer. Because the
ultimate success is if the vertical that you created not only was adopted, but changed the way they do science. If you can do more, you can do better, you can do different. They may change the scientific method to do things differently, so their science is changing. A simple way to think about it: if in your science you formulate the problem by assuming you can only calculate two points, then you know that all you can deal with is straight lines, so you will not ask questions about what happens between the points.
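To make the two-point picture concrete, here is a minimal sketch that is not from the talk itself. The function `f` is a hypothetical stand-in for any expensive model, and the names `linear_estimate` and `max_gap` are made up for illustration. With only the two endpoints, the straight line and the model agree everywhere you looked; with many samples, the curvature between the endpoints becomes visible.

```python
def f(x):
    # Hypothetical "expensive" model; stands in for any simulation run.
    return x * x


def linear_estimate(x, x0, x1):
    """Straight-line estimate of f at x, using only the two endpoints."""
    y0, y1 = f(x0), f(x1)
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def max_gap(n_points, x0=0.0, x1=1.0):
    """Largest gap between the straight line and f over n_points samples."""
    xs = [x0 + (x1 - x0) * i / (n_points - 1) for i in range(n_points)]
    return max(abs(f(x) - linear_estimate(x, x0, x1)) for x in xs)


# Two samples: only the endpoints are evaluated, so no gap is ever observed.
two_point_gap = max_gap(2)

# Many samples: interior points show how far the curve departs from the line.
many_point_gap = max_gap(101)
```

For this particular `f` the largest gap sits at the midpoint, which the two-point formulation never evaluates.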

But if you can do many points, you can start asking questions about the shape of the curve as it goes between the two points. That's a different way to do your science, because you realize you can do more.

It's a frequent question that I ask researchers: what will you do if I give you two million core hours? In many cases I get a blank face. Part of it is because researchers don't even dream of what they could do if they had more computing power available to them. I would argue that this is also true of researchers in general: when you come and say, I give you two million dollars to do your work, what would you do, they don't have these dreams of what they could do. They always come and say, oh, I don't believe I can get that much money, and I say, let's assume you have the money.

But at the same time, we want to take care of our horizontals. We have been developing quite a bit of new stuff in the area of distributed, what we call, high throughput computing, and this is materialized in the HTCondor software, which is a whole family of technologies. Obviously I don't have the time even to start explaining all the different pieces.

What is also nice is that we were considered significant contributors to two Nobel Prizes in physics. The first one was the identification of the Higgs boson at the LHC, and the second one was the detection of gravitational waves, which has proven the prediction, or the model, that Einstein envisioned a hundred years ago to be correct. By the way, he himself never believed that it could be measured, because it's such a tiny thing, measuring it and detecting it and all this. I will get back to the gravitational waves later.

The other part of our work has been to pioneer the
concept of research computing facilitation, which is different from user support, because it really recognizes the role of these individuals in bringing computing and science together, not in helping somebody to implement a piece of software to run on something. We use non-technical people to do research computing facilitation, because we learned over the years that technical people don't know how to listen. Technical people have a solution before you have presented the problem, because the solution is what they know, so they will map whatever your problem is to their solution. Facilitation is about finding the right solution for you, and therefore you should learn how to listen and to understand what the problem is, rather than focus on how to map the problem to the solution that you know.

To give you a little bit of a feeling of what we do on the ground: last year we delivered 400 million core hours to researchers on the campus, from 250 different projects, and about a thousand research computing facilitation hours.

This is a snapshot, taken before my talk at TIFR last week, of the top 10 users of CHTC in 24 hours. We delivered 1.1 million core hours. The first one, CMS, is one of the LHC experiments, representing an international group of more than three and a half thousand researchers. The second one is IceCube, which I will talk about later. It is also an international collaboration, anchored at Wisconsin. They did about 50% of these cycles. Then you have individual groups from physics, this is the Biological Magnetic Resonance Data Bank, math, and so on. And even here, number 10, Nutritional Sciences, did almost a thousand cores solid for 24 hours to do something.

OK. So that's
the way it looks on the ground, and what we are doing in CHTC, and this is enabled through research computing facilitation.
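As a rough consistency check on the throughput figures quoted above, core hours can be converted into an average number of cores kept busy around the clock. This is only a back-of-the-envelope sketch: the helper name `average_busy_cores` is made up for illustration, the inputs are the rounded numbers from the talk, and a 365-day year is assumed.

```python
HOURS_PER_DAY = 24
HOURS_PER_YEAR = 365 * HOURS_PER_DAY  # assumption: a non-leap year


def average_busy_cores(core_hours, wall_clock_hours):
    """Average number of cores busy over the whole wall-clock period."""
    return core_hours / wall_clock_hours


# 1.1 million core hours delivered in the 24-hour snapshot.
snapshot_cores = average_busy_cores(1.1e6, HOURS_PER_DAY)

# 400 million core hours delivered over the previous year.
yearly_cores = average_busy_cores(400e6, HOURS_PER_YEAR)

# "Almost a thousand cores solid for 24 hours" is roughly this many core hours.
tenth_user_core_hours = 1000 * HOURS_PER_DAY

print(f"snapshot: about {snapshot_cores:,.0f} cores busy on average")
print(f"year:     about {yearly_cores:,.0f} cores busy on average")
print(f"10th user: about {tenth_user_core_hours:,} core hours in the day")
```

Both the one-day snapshot and the yearly total work out to roughly 46,000 cores busy on average, so the two quoted figures are consistent with each other.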

This is Lauren Michael, who is leading the facilitation effort. An important part of it is to teach how to fish rather than to fish for you. User support usually does the fishing and says, oh, let me write the code for you, let me rewrite this thing, let me do this kind of thing. We don't do any of this. We guide you to do what needs to be done, or we can connect you with somebody that will help you do things, but this is not what the facilitators do. And we apply the same concept of scaling out not only to our infrastructure but also to the facilitation process. There are now national activities that are related to it, and there were several NSF projects that were related to it, so it's becoming more and more understood.

So IceCube is a detector of neutrinos at the South Pole. They drilled one-and-a-half-kilometer probes into the ice, and through the ice they are detecting these particles, these elusive particles. The important thing is that all the infrastructure is based on HTCondor; it's a vertical. And you may find it interesting that GPUs are critical to their computing capacity, but they don't do machine learning. They're using GPUs to do a simulation of the ice, which is critical to their ability to turn the information that they get into the science that they need.

This is the distribution of GPU hours that they did over the last year and a half, where every color is a different pool of resources, or a different site, or a different organization (you will see it later) that provided them with the GPUs. The upper value here is 300 thousand hours, and these columns are monthly columns. The first one is the University of Wisconsin-Madison; they did one and a half million total GPU hours in a year and a half.

But GPUs
are not the only thing that they need. They also need CPUs, which are easier for them to get, obviously. And again, they can go out and use CPU hours from all these different institutions, where the top institution here is at 35 million total CPU hours over the year and a half. So the important thing is that having many colors is good: it means the contribution is coming from many different sources.

Speaking about sources: this is a picture of a blazar that is emitting particles towards Earth. The distance between this blazar and the Earth is exactly 4 billion light years. They recently detected a particle, which they called a ghost particle, that was emitted from this blazar. By knowing that it came, and its properties and all this, others in what's called multi-messenger astronomy were able to focus on this blazar that is related to the emission, because the emission was triggered by an event related to the black hole in this galaxy. So that's what we enable them to do in terms of the computational part, once they get the signals out of the ice from these different detectors. If anyone is interested, I encourage you to search for it on the web; there's a lot of interesting stuff there.

But speaking about four billion light years, and trying to give a meaning to the term multi-scale: this is another love letter that we got earlier this year, in July, saying, look at this protein image that we just created using CHTC, and this is 12 angstrom. Four billion light years; 12 angstrom.

Now, what we are seeing more and more in science is that the instruments, in this case a cryo-EM microscope, are generating a lot of signals that require significant computing power to turn the signal into data, before you can do advanced machine learning on it, or whatever you are doing with the data in order to get your science. So the computing barrier is more in how do I get the data out of it, rather than how do I turn the data into the science. And this is an example. In one case we are getting signals of particles that were emitted four billion years ago, and in this one it is something that was emitted from an electron microscope, which is now being deployed in more and more universities around the world to study structures, and they can see things at the atomic level.

I mentioned earlier the gravitational waves. This is a report that came from the group in Germany, and again: we couldn't do it without the collaboration and the technology that we have been doing together. This is a partnership that has been going on for over 15 years now. And this is common: to build these things takes a long relationship across technologies, across science, software, and all that kind of thing, to get to where people are. Because they start with one instrument and they upgrade the instrument. The first instrument couldn't detect things, but they had to go through the first instrument to get the second instrument, and then they made the detection. And these instruments are a big investment.

So the first installation of Condor was at Wisconsin in 1995. So even in our small world we have been working on something for quite a number of years. From the beginning, we made sure that we create verticals, and at the beginning the verticals were just in our computer science department.

Now, the work that went into Condor (we had to change the name because we had a lawsuit about calling a software Condor, and blah blah; we managed to get out of it, after spending a lot of money,
by putting an HT in front of it), this work is actually a continuation of my PhD work. That's the reason why I am sympathetic with yours. And my PhD work was strongly influenced by the distributed computing era of the mid-70s.

Here's a paper that Enslow published in Computer magazine. Yes, we had magazines about computers in '75, for those of you who don't believe there were publications before the web. Actually, I have another publication, from the Communications of the ACM, that is even earlier; I'll get to it in a few minutes. But the important thing here is that Enslow listed the benefits of a distributed system, and as you can tell, the new generation of buzzwords that we have were already listed by Enslow in the mid-seventies.

And he said, if you have a true distributed system, that will be there. Then you ask yourself the obvious question: how come, if we knew it in the 70s, you cannot order a distributed system in a box through Amazon, have it delivered, deploy it, and be done? And why do you have to call these things clouds now, as if it were something new? The reason for that is actually in a technical report that Enslow and a student published in '81. They said: OK, you want all these good things; these are the properties that the system must have in order to be a true distributed system, and if it's not a true distributed system, it will not give you the benefits. I want to focus on two elements of this, unity of control and component autonomy, because I think the stress between these two is key.

Now, you know, we are trying to do system transparency. We now call it a virtual machine or a container or whatever the buzzword of the day is, but everyone wants to make it transparent.

So unity of control says that all the elements of the system have to be unified to achieve a common goal. I know it doesn't sound like computer science, right? You never talk about the components of a database being unified to achieve a common goal. This sounds like social science. But this is critical when you bring together the pieces of a distributed system. You have to assume that there is some driving goal that they all share and want to achieve. At the same time, you have to make sure that all the components are autonomous, because if they are not fully autonomous, you will not get all these beautiful benefits from before. Now, how do you achieve this commonality of goal with full local autonomy? That is the hard problem. And therefore, if you are ever presented with a distributed system and somebody
says, this is it, check the unity of control and check the autonomy.

We as humans don't like to build systems with local autonomy. We like to be in control. We like to know about everything and make decisions for everything, rather than letting things sort of exist. You can take it to a question of short forces and long forces in physical systems: you achieve stability through the short forces, so you're creating some kind of local autonomy there.

So that is one of the things that has been driving us from the beginning. It's very important, when you do something in this space, whether it's your PhD or a continuation of it, to try to put it on principles rather than on the latest buzzword of the day.

At the same time, if you build systems, then you have to go to the masters and see what they taught us. I assume I don't have to introduce Dijkstra to you. He published in '68; it was presented at (I think it's actually happening this week) the ACM operating systems conference, in '67. There he said: if you want to build a system, understand the sequential processes, put them in a hierarchy, and build them on a solid design where you understand what the pieces are and what the relationship between the pieces is. Namely, don't sit down at your terminal with the latest scripting language and start writing the system. Yes, he was a logician; he wanted to prove properties.

But another important part of this paper (it's a short read, I encourage you to read it) was that Dijkstra listed the mistakes that they made. How many papers did you read recently where people listed the mistakes that they made in doing anything? All
papers are just saying, look how great I am. And here he said, we made mistakes. The first mistake was that we tried to build a perfect system, which is way too complicated, which is way too involved with the boundaries and the corners and all that kind of thing, which makes it very complex, and then it is also continuously following assumptions that are changing all the time. You have to simplify, and you have to not build something that can handle every corner case in the optimal way, because you end up with nothing.

The second mistake that they made was that they didn't include debugging from the beginning. This is something that I'm sure you see all the time: I'll write it, I'll make it run fast, scale, all that kind of thing, and debugging we will deal with later. Because debugging is not functionality; debugging is not publishable, how easy it is to debug your latest, greatest algorithm.

And I can tell you that in my group, when somebody comes to me and says, how about if we do this as an algorithm for doing something, I say: are you willing to be responsible for debugging and supporting it in the field on that many installations? Yes, we don't have the number of installations that you guys are talking about, but still we have enough for the size of the team that we have. And typically the answer is, no, I don't want to deal with that. So then we are not going to follow through and implement it, even though it's an amazing algorithm, because it's so complex that we will never be able to debug it. And proving correctness of an implementation, you know how difficult that is.

So taking all these principles and all these concepts together, it was unavoidable for us to come up with a vision that says we can do global computing via a flock of Condors, and we presented it at CERN in '92. We said, let's connect all these computers and create one worldwide... I know it sounds like a cloud, right? And we'll run all this. And we even already have a system that can do it; we have a demonstration of a system that can do it. So each of these is a Condor pool that
spanned from Dubna, Russia, all the way through Europe to Wisconsin. Yes, at that time, in the early 90s, we had only 200 workstations in the pool. We were able to submit jobs that traveled through these interconnects from Dubna to Wisconsin.

Actually, if you think about execution of jobs and message passing, it's a little bit similar: you enter at one place, you send your job somewhere, and you get the results back, which are like an acknowledgment. So the principles of routers and network transfer can be applied, and here we actually used routers and gateways, conceptually, to move the jobs across. So if you went from Berlin to here, you may have gone this way to land in Wisconsin.

But the fact is that even an advanced organization like CERN didn't get it. We presented it, but they didn't. Now, 25 years later, they got it; they installed it. The transition to HTCondor on all their batch processing, which includes about 15,000 servers with 230,000 cores, is completed and it's running. If I have time, we'll get to Open Science Grid later, where we do this whole thing worldwide.

But there are two lessons here. If you build on the right primitives, you reach things which are unavoidable. There are not that many solutions to the same problem, even though it seems that you can choose here and choose there, so anchor them on the right principles. And you have to wait. If people adopt what you are doing quickly, I don't consider this a good sign.

So in '96 we, or rather I, it was more me than we, decided: OK, we have to articulate that we are different. Everyone around us is HPC, and the only way that they referred to the work that we are doing was, ah, this is embarrassingly parallel work. It hurts when people say that what you do is embarrassing. So part of what I have been doing since is, any time somebody talks about embarrassingly parallel, I say: this is pleasantly parallel, this is naturally parallel; there's nothing embarrassing about doing it. I also coined the term high throughput computing, and a year later I was interviewed by HPCwire about high throughput computing, which is also funny.

But here's the way that I tried to contrast high throughput versus high performance. High performance measures everything in floating-point operations per second. If you look at the Top500 list of machines, they are based on Linpack and they're giving you their FLOPS: how many floating-point operations you can do per second. And by the way, the way they build these machines is really focused on putting as much
silicon as possible into the floating point. Something doesn't give me... my battery is... I don't have... OK, let's assume that that's fine. No. OK. You see, fault tolerance: you connect it, you plug it in, and you still don't have power in there.

So they focus on FLOPS. To be at the top of the Top500 list, which is the biggest machines in the world, you have to run Linpack once and be faster than anyone else. The scientists that we are working with care about FLOPY, which is floating-point operations per year. Now, you know very well that you cannot take what you can do in a second, multiply it by the number of seconds in a year, and get what you can do in a year. If you can run a kilometer in three minutes, it doesn't mean that you can run that many kilometers in a year, right? It's a different problem, and it requires robustness.

But it also requires automation. Why automation? Because there is a huge difference between running one job of a hundred thousand hours and running a hundred thousand jobs of one hour. The big machines are designed for this; high throughput is designed for that. I will not have time to talk about what we are doing for automation. We are using DAGs, directed acyclic graphs, for capturing these things, and you can find information about it. But that's the core of high throughput computing.

Again, you have to be patient. I introduced it in '96. In 2017 a National Academy of Sciences report stated that many fields today rely on high throughput computing for discovery, and they even said many fields increasingly will rely on high throughput computing for discovery. So there is now a recognition that something that, in a way, didn't exist as a concept 20 years ago is critical for scientific discovery. So the good news is that high throughput is important.

We tried to do similar things with the grid, you know, the grid world. We had a chapter in this book that was referenced I don't know how many tens of thousands of times, because everyone who wrote any paper about the grid referenced the book, I think without opening it. But it is the grid book. We tried to convince the community that the key for the grid is integrated mechanisms that are robust, scalable, and portable. The community didn't follow that, and then the whole grid movement disappeared, because about everything that happened there people said, but it doesn't work. And we said, sure it doesn't work, because nobody focused on mechanisms that are dependable.

We have a problem in our field that people like to work on policies. People like to publish policies. People claim that they know how to evaluate policies. Doing the same for mechanisms is much more difficult. But I would argue that if you give me a good set of mechanisms, I can get decent behavior or performance from any system with a policy that will take me 15 minutes to invent. But if you give me an amazing policy and no mechanisms, good luck. It will take you more than 15 minutes to come up with the mechanisms. So we need mechanisms. And at the end of the day, even when you come up with an optimizer, as we talked about earlier, you still need something that will run the plan, right? And if this sucks, and if this breaks halfway because of whatever, the optimizer is not really the solution.

So, well, I'm checking on this because there's no clock here anywhere. OK. I can open it up: are there any questions, any suggestions, anything? We are about 10 minutes from the hour. I have more stuff, I can continue, but I will give you an opportunity to... Yes?

We consider
all of these things as normal, and just try again. Now, if you have a checkpoint, we'll start from where you are. If two nodes disconnect, we try to reconnect. The question is how long the two were disconnected.

OK, so in Condor there is where you submitted the job and there is where the job is running, and they have a relationship, with keep-alives and all that kind of stuff. Now, if this one goes away, then how long this one will stay committed to the relationship, and how long that one will stay, is an autonomous decision of both players.

We use this for doing live updates, for example, on the submit side. This submit machine may have 10,000 jobs that are running, and we want to upgrade it. So we actually send an update to all the workers and say: would you please wait longer, because I'm going down and it may take me longer to come back, so just keep doing what you are doing and then reconnect. Because we don't want you to stay waiting for the reconnect all the time: if I really crashed, and the machine was burnt and all that kind of thing, I want you to let go, because there will be nobody to take the results.

So the advantage that we had, or the luck, is that we started with workstations, where anything can happen. And it's not only that: in any engagement that happens, the data or the rules that govern the engagement can change between the time that they were advertised and the time that the action took place. So when the action comes to take place, we have to check whether the conditions for this, what we call a match, are still valid. And if any partner (this is the autonomy) decides to get out of this relationship, everyone is OK: OK, we tried.

So the mentality is: if it worked, it's a miracle, or it's a coincidence. Not: if it didn't work, it's a problem. It's normal. And typically, when people come to give a presentation to my group, they sometimes almost run away, because they come and say, look, I did this and this, and it can run fast under these conditions and all this, and we stay numb. And then when they are done and they say, OK, questions?, we say: OK, what happens if this dies? What happens if this goes away? What happens if this fails to communicate? What happens if this buffer is full?

Picking on databases, because my host is a database guy: the worst thing that can happen to you is a database that runs out of disk space. All the optimizers in the world will not save you, and it will take you a long time to recover your database, and it just ran out of disk. It's not that it burned down; it just stopped journaling.

So in Condor, if you cannot log, you shut down. And we are working hard to put into the log a message that we shut down because we couldn't log anymore. We keep space for it all the time. It doesn't always work, because it's hard to keep space on disk, but we will, because we have to be prepared for the thing to fall apart. So that's
A long long, answer, to it, it, has to be in the DNA. Not. Not as an afterthought. Anything. Else. So. Let me give. You another principle. Before we have to call it quit. So. In. 92. We. Published. The paper about, here's. What we did was we were very proud about. 250,000. Jobs you. Know we do it on on at. Wisconsin, poverty in in one hour these days and. In. A, worldwide maybe. Every second, I don't know something. Like that but, the important thing is, we. Said look if we listen to the user. Back. To the verticals. And. We ask them what what, do, you care about and, they said we, want to get access to as much capacity, from. A single point and. We. Want this point of access to preserve our local, environment. So. I. The. You. Know and by, the way the the. Overhead, here is not networking. Overhead, is there the effort to make it all happen, it's. In the effectiveness, so. We. Turn this into a principle, that we use throughout. Our. System, and other systems that are built in this context, including. Open science grid and alike. And we said. Submit, locally, and one globally. Submit. To a local environment. Where. You manage everything locally. And. It's. Not only that you manage things locally, you use a local namespace you look you use a local identity, space and all the kind of thing and try to reach as far as you can, so. That's by the way the reason why, we don't, assume. Shared file system. Because. You. Cannot run globally, if you assume a shared file system you. Have to be prepared I know it sounds like containers, pack. Everything with you and go. Can. You do it for everything can you do Bobby. But. My understanding, from talking to to, a ghoul that, one of the challenges, that you guys are facing and, I understand he's coming to visit here, shortly. Yeah. I saw, him a. Couple. Of weeks ago in Madison and. He. Said yes we are now facing what you have been complaining about for, such, a long time is in the cloud environment where, you need elasticity. In. 
Your, in. Your database management, system, then how. How, do you grow and shrink. And. Heat, this is this is the same kind of problem how do I give. You the data that you need where you are going to execute and how do I bring the data back to -, right it's not identical but it's there it's the same a, fundamental.
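As a rough illustration of the "pack everything with you and go" idea above, the following sketch stages a job's inputs into a private scratch directory, runs the job there, and copies only the declared outputs back to the submit side. It is a single-machine stand-in under simplifying assumptions, not how Condor implements file transfer; the helper name `run_in_sandbox` and the file names are made up for the example.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


def run_in_sandbox(command, input_files, output_names, submit_dir):
    """Run `command` in a fresh scratch directory with no shared file system.

    Hypothetical helper: inputs are copied in, declared outputs are copied
    back to `submit_dir`, and the scratch directory is then discarded.
    """
    submit_dir = Path(submit_dir)
    with tempfile.TemporaryDirectory() as scratch:
        scratch = Path(scratch)
        # Pack: the job only ever sees files that travelled with it.
        for src in input_files:
            shutil.copy(src, scratch / Path(src).name)
        # Go: execute with the scratch directory as the working directory.
        result = subprocess.run(command, cwd=scratch)
        # Bring the data back: only declared outputs return to the origin.
        returned = []
        for name in output_names:
            produced = scratch / name
            if produced.exists():
                shutil.copy(produced, submit_dir / name)
                returned.append(name)
    return result.returncode, returned


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as home:
        data = Path(home) / "numbers.txt"
        data.write_text("1\n2\n3\n")
        # The job refers to its files by bare name, never by a shared path.
        job = (
            "from pathlib import Path\n"
            "values = Path('numbers.txt').read_text().split()\n"
            "Path('total.txt').write_text(str(sum(int(v) for v in values)))\n"
        )
        code, outputs = run_in_sandbox(
            [sys.executable, "-c", job], [data], ["total.txt"], home
        )
        print(code, outputs, (Path(home) / "total.txt").read_text())
```

The job never names a path outside its own working directory, which is what lets the same job run on any machine that can receive the sandbox.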

Yeah. And let me make a quick comment here: there is this whole business of resource acquisition that is related today to the cloud, and maybe I'll make a comment on it later, quickly, when I show you one of the pictures of CHTC. So that's the thing. For many years I have been using this picture of the desktop that goes to the floor, that goes to the building, that goes to the campus, goes to the region, goes to the world. You want to sit here and run here, but you want to use everything in between.

It's not that you go to the cloud. My understanding, from what I know about Azure, is that it has been the Microsoft philosophy to use it as an extension of your local environment rather than taking everything that you have and putting it there. So that's "submit locally and run globally."

But you go to the young generation, and they don't have desktops anymore, and they don't have workstations anymore. They have Jupyter notebooks. Everything starts and ends in a Jupyter notebook. So the question, and that's what we are working on, just to give you a flavor of why we are not done, is how do you bring Python to Condor? There are all the bindings: what we created is an API, an interface from Python, to expose what Condor does, jobs and stuff like that. But we also want to go in the opposite direction and ask, how can you bring the concept of high throughput computing into the Python life? Here what we are doing is this: Python has map as an important construct for capturing doing multiple things. So we created a module that we call HTMap that implements map as a high throughput computing thing, through Condor in the back. The functions that are invoked by the map are basically jobs that are running, with all the issues of asynchronicity and all that kind of thing. But at the end of it you have an object, which is the map object, that got populated by Condor jobs. But then you have to make the object richer, because there are all the questions about the asynchronous execution of the map.

So I think this will be the last picture I will share with you, unless there is interest in more later. This is the world that we present to our CHTC users. You have the researcher with a problem to solve. This is a workflow described as a DAG, a piece of it. And by the way, we have chairs in Wisconsin, so not everyone is sitting on the ground. The researcher is interfacing with CHTC, which basically presents all the resources of the campus as a single accessible resource from here.

So anyhow, the 1.1 million hours that are delivered annually, daily, sorry, are coming from a whole bunch of Condor pools on the campus, over a dozen. Some of them are owned by CHTC, and some of them are owned and operated by others. But CHTC can also bring in resources from HPC systems, a growing source of computing power, I would say, in the US and Europe now, because there is significant investment in them and they want them to be used for science, including HTC. You saw the previous report that I showed you. Go to the cloud: the guys at Fermilab, the CMS people, expand their computing capacity using Condor into Azure. They can get 10,000 cores to show up in 10 minutes, connect them to the Condor pool at Fermilab, and go. And then there is the Open Science Grid.

So this is basically the world view. The thing is that really this researcher has to be shown here with a pile of money, an allocation, and a priority that using these resources has to be paid with.

And that actually brings up what I call capacity planning; you can call it optimization. It is moving from an era where you did capacity planning at very low frequency, once a year or once every five years, to high-frequency capacity planning, because you have to decide when to use your money to buy cloud resources, when to use the allocation to use HPC resources, and when to use your priority to use Open Science Grid resources.

So that's as far as I can go in the time that we have. Any further questions, or something which is not related but you are still interested to know? I can try to answer. Yes?

Yes, with the checkpointing in the black and white. I think only a year ago we decided to stop supporting it, because it became non-sustainable. Actually, in this '92 [unclear] workshop, we sent out a list of requests to the operating-systems-for-workstations community and said, "Give us checkpointing. Give us checkpointing." None of the solutions that were offered to us since, and there were promises of checkpointing VMs and checkpointing containers and all that kind of thing, work in our environment. And also, if we are not convinced that the checkpointing is reliable, it becomes unsustainable, right? If somebody is running a job in your system that was checkpointed 20 times and crashes, whose fault is it? The job's fault or your checkpointing's? And this can consume way too many resources for a group like ours. So we couldn't find anyone to do checkpointing in a different way than what we were doing, with, you know, writing out the core. And more and more of the jobs are complex scripts and other crazy things. So yes, it would be amazing if we had it. What we are doing our best at today is to support applications in doing their own logical checkpointing.

So these days we need to understand that we need to move these things reliably back to the origin, wherever it is. It's non-trivial in this case: what is the signal, how do they tell you that they are done? Usually they want to shut down and they want to restart, because only when you restart do you verify that the checkpoint that you created is valid. And how long do you keep the checkpoint? But yeah, in the black and white, we did it, but we had to give up.

We had users who used checkpointing also to address the local file system. They would start the job, read everything that they need, checkpoint, and then move. There are many science applications that really build the entire state by reading in the first couple of minutes. So rather than knowing which data you need, and looking up the database, and so on, rather than doing it remotely, they did it locally and then checkpointed.
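A minimal sketch of what application-level ("logical") checkpointing can look like, assuming a job whose whole state fits in a small dictionary. The file name `state.json` and the helper names are invented for the example, and nothing here is part of Condor. The checkpoint is written to a temporary file and renamed into place so that a crash in the middle of a write cannot leave a half-written checkpoint behind, and on restart a checkpoint that cannot be parsed is ignored rather than trusted.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(path, state):
    """Write `state` atomically: temp file in the same directory, then rename."""
    path = Path(path)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump(state, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())  # ask the OS to push the bytes to disk
    os.replace(tmp_name, path)  # atomic swap of old and new checkpoint


def load_checkpoint(path, default):
    """Return the saved state, or `default` if no usable checkpoint exists."""
    try:
        return json.loads(Path(path).read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return default


def run(path, total_steps, stop_after=None):
    """Sum 0..total_steps-1, checkpointing after every step.

    `stop_after` simulates an eviction after that many steps in this run.
    """
    state = load_checkpoint(path, {"next_step": 0, "partial_sum": 0})
    done_this_run = 0
    while state["next_step"] < total_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return None  # evicted; progress so far is in the checkpoint
        state["partial_sum"] += state["next_step"]
        state["next_step"] += 1
        save_checkpoint(path, state)
        done_this_run += 1
    return state["partial_sum"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        ckpt = Path(workdir) / "state.json"
        print(run(ckpt, 10, stop_after=4))  # first attempt is interrupted
        print(run(ckpt, 10))                # restart resumes from the checkpoint
```

Because the application decides what goes into the checkpoint, the file stays small and portable, which is the property that makes it practical to move back to the origin.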

Maybe for these cases we can do something better with containers and things like that, but the jury is still out. And there is also a question with all these technologies moving so fast: if we invest in something and get it to work there, and then we wake up... People started with Docker in the science community; now everyone is doing Singularity. Singularity has its own path, and then Singularity will move to another version. The DevOps world is pretty, pretty demanding, because we have to build on stable and dependable services.

By the way, a footnote that I should have mentioned: we are one of the few, I don't want to say the only, but one of the few pieces of science middleware, if you want to call it that, that have a pretty wide Windows deployment. Condor works very well on Windows, and we have quite a few users, including commercial users, that are running Condor on Windows. We made a decision early on that took us quite a bit of effort, but we did a deep implementation on Windows, where we raised our abstraction level to a point where we can go to Linux, UNIX, and Windows from the same abstraction level, because earlier we were tied. Rather than trying to go through some kind of an interpreter and all that kind of thing, we said we need a native Windows implementation. And we have quite a few users, I would say more heavily in the commercial world.

Since I mentioned the commercial world, I think this will be sort of a fun piece. One of our users who is using Condor is DreamWorks. The entire rendering farm of DreamWorks, which the last time they reported at one of our meetings was 45,000 servers and 15,000 desktop machines, is using it to do all the production rendering. The latest movie that they released took 300 million core hours to render. So if you watched any of their movies released since 2011, you were watching Condor.

Any other questions? Thank you.

2019-11-22 10:04
