PyHEP 2021: Python for XENON1T

Hi, my name is Joran, speaking on behalf of the XENONnT experiment here at PyHEP. We are not really doing high-energy physics but rather low-energy physics, because we are doing a rare-event search. Nonetheless, let me explain a little how we use Python in our experiment.

Our experiment consists of many institutes all around the world, and our aim is to detect dark matter. To that end we have a time projection chamber (TPC) filled with liquid xenon, in which we hope to one day find a dark matter interaction. It is located in Italy, which you see on the right side of the figure. Let me explain a little how this experiment works. Within this vessel we have liquid xenon, and if a particle interacts we get a first scintillation signal, which we call the S1. The electrons freed in that interaction are drifted upwards by an electric field, creating an S2. With those S1 and S2 signals we have all the information we need to do the physics which may one day allow us to find dark matter.

Of course we need to reconstruct this properly, and to that end we have about 500 light sensors, PMTs, at both the top and the bottom of the detector. Each of these PMTs is continuously read out, so they measure pulses over time. This readout is the only place in our software stack where C++ comes into play: we do it with redax, at a rate of about 50 megabytes per second. Let me stress again that each of these channels is read out in a self-triggered way, meaning that no global trigger is applied to this data. Nevertheless, we are able to store all of it because we are a rare-event search, so all of this data makes it to disk, and from then on it is Python only.

In order to reconstruct what went on in the TPC, our detector, we first need to group all these pulses together to make the S1 and S2 peaks. Once we have those peaks, we again have to group them together to get the real event, what happened in the TPC, and with that we can do the physics we want.

Now let me showcase how our software works, and rather than talking about it a lot I would rather start with a demo. I must apologize in advance that I didn't upload the notebook: because of the amount of data that is created, it would just take a long time. In the following demo we are going to look at how our software enables us to look at the detector live while we are, for instance, calibrating it with a radioactive krypton source, and how we can use this software to look at the data. After that I will focus on how, once we have the data, we can do multiple analyses at the same time. But let me start with the demo.

Imagine I am on my own computer, starting up a fake DAQ which is processing the data live. While that is running we can already look at the data and see what is going on. In this graph we see how the event rate is building up for several energy ranges; the different lines indicate different energy ranges. We can also see the different types of peaks: the S1s, which are very sharp, and the S2s, which are very broad. This is all happening live from my own machine. You can also look at how much each of the channels is seeing, and this can go on while we are doing a krypton calibration, so you can actually watch the activity of a radioactive source building up live.
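Conceptually, pulling up this kind of peak-level view boils down to asking a straxen context for the data. The sketch below assumes a straxen-like API; the specific context, the run number, and the column names (`area`, `range_50p_area`) are illustrative assumptions rather than a copy of the demo notebook:

```python
# Hypothetical sketch: load peak-level data with a straxen-style context.
# The run id is made up and the exact column names are assumptions.
import straxen

st = straxen.contexts.xenonnt_online()   # a ready-made XENONnT context
run_id = '008000'                        # placeholder run number

# Ask for the peak-level summary; whatever chunks exist so far get processed.
peaks = st.get_df(run_id, 'peak_basics')

# Narrow S1s and broad S2s can be separated roughly by their width, assuming
# an area (PE) column and a 50%-area width (ns) column are available.
s1_like = peaks[peaks['range_50p_area'] < 100]
s2_like = peaks[peaks['range_50p_area'] >= 100]
print(f'{len(s1_like)} S1-like and {len(s2_like)} S2-like peaks so far')
```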
Now imagine you are an analyser who wants to look at this data. Krypton has three characteristic peaks, which you see here. However, the analyser notices that there is also a lot of stuff at low energy and comes up with the idea of getting rid of it by changing a configuration option: the number of PMTs needed in order to define a peak. The analyser changes the relevant configuration so that each peak has to be seen by at least 50 PMTs, and at the same time asks the software to simply return the data as if it had been processed with this setting. Within just 18 seconds we have reprocessed the data, and as intended, it is the low-energy part of the data that is affected.

There is also something interesting going on if we look at the amount of data we have processed. I don't want to go into the details of what all this data is, but if you now look in our processed-data folder you see that there are two copies of peak_basics, while there is only one copy of records and pulse_counts. I will come back to that in a moment, in the rest of the presentation.

We actually also do this online processing live on the entire XENONnT data stream. That means that at the DAQ level we are already doing a full reconstruction of our entire data chain, going from the raw signals all the way to events. This comprises the baselining, finding the hits, and the peak finding; you might call this peak finding a trigger, as before that point we have no trigger other than the self-trigger of the digitizers. We also do position reconstruction using neural nets with TensorFlow, and we apply corrections; everything happens as fast as the data is actually coming in. To be able to look at this data, we also ship a part of it into a database, which allows people to look at the data online as it is being collected. So within 30 seconds you can, anywhere in the world, see the fully reconstructed data as it was detected in the DAQ. We even made it easier for people: you don't have to open a terminal, you can just ask via Slack for the latest status report.

So how do we do all this? As I said, it is all done in Python, so how were we able to do all this reconstruction as fast as the data comes in? What you need, of course, is performance, and when we were upgrading from XENON1T to XENONnT we also upgraded our software. With the old software we could only process a couple of hundred kilobytes per second per core; with the new software, which we call strax, we can go much faster.

Let's look a little deeper into how we actually get there. The answer is strax, the streaming analysis for xenon. Our software is divided into the generic strax framework and the XENONnT-specific straxen package. In strax we have all the generalized and optimized code; as I will explain a little later, we rely heavily on numpy and numba to make this fast in Python. The detector-specific software, like the choice of how many PMTs you need in order to define a peak, is all stored in straxen. Both rely heavily on continuous integration, and the core code is public, whereas analyses and detector-specific conditions are of course private.
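Continuing from the earlier sketch, the "change one option, reprocess only what depends on it" step from the demo looks roughly like this. It is a hedged sketch: `new_context` exists in strax, but the option name `peak_min_pmts` and its effect are assumptions about the straxen configuration, not a guaranteed reproduction of the demo:

```python
# Sketch of partial reprocessing after changing one configuration option.
# The option name below is an assumption; the real key may differ.
st_tight = st.new_context(config=dict(peak_min_pmts=50))

# Only peak-level data and everything built on top of it gets rebuilt; the
# underlying records are reused. This is why the storage folder ends up with
# two copies of peak_basics but a single copy of records and pulse_counts.
peaks_tight = st_tight.get_df(run_id, 'peak_basics')
```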
Let's look a little deeper at the goal of this software. The goal is to get from the raw PMT traces to events, and along the way we of course also need to cluster those PMT traces correctly in order to build the peaks. We have chosen a tiered data structure and a modular processing framework.

How does that look? At the lowest level we need to read out what the DAQ spits out. As I said, we have reader software in C++ which simply dumps the data as it comes out of the digitizers to disk. When we then start reading it with Python, we have a DAQ-reader plugin which understands that format and converts it to the Python format we use later on. We then process these pulses to get the peaks and events. As you can imagine, at the lowest level, with all these raw PMT traces, there is a lot of data; the numbers shown are typical for a one-hour run. However, as you go further up the chain, doing the triggering, the triage, and making selections, the data becomes much smaller very quickly, and by the time we are talking about events we are orders of magnitude smaller in size. I am illustrating the processing steps along the chain, which we call plugins, as lego blocks, because we went for a very modular approach that makes it very easy to interchange one of these plugins. For instance, in the demo I just showed there wasn't actually a DAQ reader; instead there was a simulator plugin which simulated the data.

Now, how do we make this fast? For that we need to look a little further at how the data is shared between those plugins. The answer is that, for instance when we are doing pulse processing, we work on a given chunk of data: we organize all of our data into fixed time intervals, which we call chunks. A chunk of data is processed by the pulse processing, and once that is done it is sent up to the next plugin to be processed there. We can work on multiple chunks at the same time, illustrated here with the conveyor belts, because we are able to multiprocess all of this.

Each chunk is organized in a tabular format, meaning the arrays are of fixed length; we use numpy structured arrays for this. This may not sound like an intuitive choice: S1s are very short signals, so an S1 easily fits into one of these fixed-length buffers, whereas S2 signals can be very wide, up to microseconds long. That means we sometimes need to split these raw traces into multiple buffers, which introduces some extra bookkeeping, and with that bookkeeping we are later able to reconstruct an S1 followed by an S2. It may not sound like a logical choice, given how different the S1 and S2 data are, but it is exactly what makes our software fast. Because of this, and because of the great features of numpy structured arrays and numba, we are able to do all this processing live in Python: using just-in-time compilation, especially the low-level software relies heavily on numba, and because of this tabular format the loops over the data compile and vectorize efficiently, which makes it really fast to go through all of it.
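To make the tabular, chunked layout concrete, here is a self-contained toy version of it: a numpy structured array of fixed-length waveform buffers and a numba-compiled loop over it. The field layout is heavily simplified compared to the real strax record dtype and the numbers are made up:

```python
import numpy as np
import numba

# Simplified stand-in for a strax-style chunk: a structured array in which
# every row is a fixed-length waveform buffer. The real record dtype has
# more fields; this layout is only for illustration.
SAMPLES = 110
record_dtype = np.dtype([
    ('time', np.int64),            # start time of the buffer (ns)
    ('channel', np.int16),         # PMT number
    ('length', np.int32),          # how many samples are actually filled
    ('data', np.int16, SAMPLES),   # fixed-length waveform buffer
])

chunk = np.zeros(1000, dtype=record_dtype)   # one (empty) chunk of records
chunk['length'] = SAMPLES

@numba.njit
def integrate(records):
    # Because the data is a flat, fixed-width table, numba compiles this loop
    # to machine code and the memory access pattern stays contiguous.
    areas = np.zeros(len(records), dtype=np.float64)
    for i in range(len(records)):
        r = records[i]
        for j in range(r['length']):
            areas[i] += r['data'][j]
    return areas

areas = integrate(chunk)
```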
There was also a second point, which I already briefly touched upon in the demo. Now that we have a nice, fast framework, we of course want to use it for the many different types of analyses we can do in XENONnT. So let me zoom in a little more on these plugins. I have of course been making this much simpler than it actually is, because each of the lego blocks shown here can consist of multiple sub-algorithms needed for the processing. Each plugin has a fixed set of attributes: it has some options, it has a version, it knows what kind of lego block it was built upon, and it knows what kind of data it provides. It then uses the highly optimized code we have in strax to do all of this fast.

Thinking about the experiment, we might want to change one of these options, as I showed in the demo, and if that happens we still want to know which data belongs to which plugin and which options were used. For that we use these incomprehensible hashes which you may have already seen in the demo. All of our data is identified by a run number, the kind of data it is, and the hash behind it, and that hash is an encoding, using hashlib, of the entire ancestry the data was built on. For instance, peaklets are produced by a plugin we call Peaklets, which has a given version and some options and is built on top of records, which has a similar structure again. This entire dictionary of how things were built up is encoded in the hash. If we now change something, for instance we want to use a different gain model, and you ask for the corresponding data, the ancestry changes and therefore we use a different hash for that data. However, the data below it, the records where we actually do the pulse processing, does not change when we change something associated with peaklets. That is why, in the example, the records were only produced once, because we only asked for one configuration of those, while the higher-level data where we changed the configuration was produced twice. This is how we allow for partial reprocessing. If you think about how our data stack is built up, you mostly want to do this for the high-level data, which is where most analysers are working anyway; if you change something very low in the processing chain, you do of course have to reprocess a lot more data.

This may sound complicated, but we try to make it easy for the average user of our framework, because there are always many different needs but only a limited number of developers. To keep things simple and robust, we set up everything, these options and these plugins, for a given search. For instance, if you are working with XENON1T data you might want a specific set of options, and when you are working on detector simulation you may want some different options, but everybody who is looking at XENONnT data, whether for position reconstruction, dark matter, or anything else concerning XENONnT, wants to use the same options. We bundle those options in the context. The context is a centralized object within our framework, and even when you change only one of these parameters you can still use the XENONnT context, because maybe you only changed something very high up at the event level.
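The hashes that do this bookkeeping can be illustrated with a few lines of plain hashlib. The plugin names, versions, and option values below are made up for illustration, and this shows the idea rather than strax's exact implementation:

```python
import hashlib
import json

# Made-up lineages: plugin name, version, options, and dependencies.
records_lineage = {
    'plugin': 'PulseProcessing', 'version': '0.1.0',
    'options': {'baseline_samples': 40}, 'depends_on': {},
}
peaklets_lineage = {
    'plugin': 'Peaklets', 'version': '0.2.1',
    'options': {'peak_min_pmts': 2, 'gain_model': 'to_pe_v1'},
    'depends_on': {'records': records_lineage},
}

def lineage_hash(lineage):
    # Serialize deterministically, then hash; any change anywhere in the
    # ancestry yields a different key for the stored data.
    blob = json.dumps(lineage, sort_keys=True).encode()
    return hashlib.sha1(blob).hexdigest()[:10]

print('peaklets key:', lineage_hash(peaklets_lineage))
print('records key: ', lineage_hash(records_lineage))

# Changing an option that only affects peaklets changes its hash, while the
# records lineage (and thus the stored records) stays exactly the same.
peaklets_lineage['options']['gain_model'] = 'to_pe_v2'
print('after new gains:', lineage_hash(peaklets_lineage))
print('records unchanged:', lineage_hash(records_lineage))
```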
This context talks to all the parts we have been discussing so far. When a user asks the context for some given data, the context first works out how the requested data fits together with all these plugins and options, and whether that data is already stored. If it is, the context returns it; if not, it goes down the data chain until it finds the lowest-level stored data that is consistent with what the user asked for, processes from there, returns the result to the user, and also stores it to disk, so that the next time this user or someone else asks for it, it is still there.
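A toy version of that lookup logic, reusing the lineage_hash helper and the peaklets_lineage dictionary from the previous sketch; the process stand-in and the plain dictionary used as storage are purely illustrative simplifications of what the real strax context does:

```python
def process(target, inputs):
    # Stand-in for running a plugin's compute step on its dependencies.
    return f'{target} built from {sorted(inputs)}'

def get_data(storage, run_id, target, lineage):
    # Data is identified by run, data type, and the hash of its full ancestry.
    key = (run_id, target, lineage_hash(lineage))
    if key in storage:
        return storage[key]              # already processed with this config
    # Otherwise build the dependencies first (recursing down the chain),
    # process this level, and store it for the next person who asks.
    inputs = {dep: get_data(storage, run_id, dep, dep_lineage)
              for dep, dep_lineage in lineage['depends_on'].items()}
    storage[key] = process(target, inputs)
    return storage[key]

storage = {}
get_data(storage, '008000', 'peaklets', peaklets_lineage)  # processed and stored
get_data(storage, '008000', 'peaklets', peaklets_lineage)  # served from storage
```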
So, in XENONnT we organize our data in chunks, fixed time intervals, and in a tabular way in order to make processing fast. We use a modular design for our algorithms and a tiered data-storage approach in which the high-level data is much smaller than the low-level data. Because we keep track of the entire ancestry of the data, the entire lineage, partial reprocessing becomes very easy, and given the speed of our framework people actually do this quite a lot. With this we get from PMT signals to events as fast as we can take the data. And all of this is not limited to the time projection chamber in which we look for dark matter: our veto systems use the exact same framework to talk to the PMTs of, for instance, the neutron veto and the muon veto, which we use for tagging backgrounds from neutrons and muons. With that I would like to conclude. Thanks so much for your attention.

Thank you very much. You actually have a handful of questions on Slido; I've dropped a link in the chat if you want to go through them. The first question, you can see my screen, right?

So, what I was just showing wasn't real data; it was based on simulated data. But real data looks really similar, and the workings are exactly the same.

The next one: is any of the general software used outside the XENONnT community? Good question. At some point there were some investigations by nEXO to see whether they could use it for their MC chain; I must admit I'm not fully up to date on whether they are actually using it. But small-scale experiments, like the one we have here at our institute's lab, use the same general software. That setup only has two PMTs, so the very detector-specific things are changed, but the general software is also used for these small experiments.

There are various dark matter experiments out there; are these collaborations, as we do in high-energy physics, trying to share as much of a common dark-matter ecosystem as possible, and if not, why? That's a very good question. Our main colleagues are from the US and from China, and we don't use the same kind of analysis software; at this moment we are in entirely different paradigms. Whereas we use only Python for all the processing, our US colleagues have gone for C++ all the way, with only the very last bits in Python, so already there it's completely different. It is also dictated a little by the design choices made in the DAQ: whereas we chose to do everything in software, our colleagues from the US do a lot of FPGA triage already at a very low level in the DAQ. Maybe for the next-generation experiments we will try to come together much more closely, but at this point we're quite diverse in these ecosystems. I hope that answers your question.

So, it wasn't real data, it was simulated data. Do you share the cached lower-level output between collaborators or between institutes, and how is this managed? Good question; maybe I have a slide on this, let me see. Unfortunately I don't, but it's a good question. All of our data is shared within our own collaboration. The absolutely raw data we get from the digitizers themselves is so massive that it is very hard to store in one place, so we send it out to data-storage facilities; it is easy to download, but only on request. At the intermediate level, when we're talking about peaks, the data is shared between everyone and can be opened directly, and when we do a reprocessing we mostly focus on this part, because that step is so fast and so easy that analysers can do it themselves. We have one centralized analysis facility where all of this is stored and where everybody can do their analysis.

Do you exclusively use Python for the high-level analysis, or is it the usual mix of C/C++ frameworks inside the experiment? We only use Python; everybody uses Python for the high-level analysis, so there is no mix of C++ and C, and no real war going on there. It is only for our simulation software that, for instance, Geant4 comes into play, but the high-level analysis is Python only.

How does the performance of the US C++ framework compare? I must admit I'm not familiar with... ah, okay, you're talking about our main colleagues. A very good question. I have actually never tried using their software, also because it isn't publicly available; otherwise I would be very interested in trying it out. I wouldn't be surprised if our framework does quite well in comparison, but I must admit I have never used their framework, and I also cannot, because it is not open source.

Are there any open-science initiatives in XENONnT? If you mean the sharing of data in XENONnT: unfortunately, as far as I know, there aren't any initiatives to make the data itself publicly available. All of our software is public, though, and we highly encourage people to run these tools; for instance, you can very easily run our simulation software yourself. But as far as I know there are no initiatives to share the real low-level data; that is quite hard because it is so much data. Oh, the questions keep coming, which is good. Yes, sharing data and tools, yes, I agree with that.

How do people who have to deal with Geant4 and C++ find it, if they are now more used to working in Python for the rest of the experiment software? That's a good question. Many people are completely fine with having this mix, but there is also a bit of a distribution of tasks: the people who do most of the heavy Geant4 simulations usually aren't the same people doing the very high-level analysis. Sometimes they are, but most of the time they are different people, and we just work together to bring everything together.
So far there is no real struggle there. It is also because we have been using Python so much that it is kind of a given; it is mostly new people entering our collaboration who sometimes have to get accustomed to using Python in this way.
