Telemetry Transformed: Terraforming Grafana for Next-Gen Dashboards
[Music] Hi everybody. Who's ready to talk about Terraform? OK, we've got some fans in the crowd. Great, wonderful. So, my talk: this is one of my favorite topics when talking about Terraform, and that is any time we can mash up providers together. We're going to be mashing up the Azure provider and the Grafana provider, so
that's what this talk is all about. This is me: my name is Mark Tinderholt. I work in engineering at Microsoft, on the Microsoft Azure team, and my team is focused on solving cross-cutting issues within the Azure platform. That's what I do at Microsoft. I've been doing Terraform for, gosh, about 10 years now. I started prior to my time at Microsoft, so I've worked on AWS, Azure, and Google Cloud
and just really fell in love with the tool. Even after joining Microsoft, I still feel like I want to share about Terraform and still find ways to use it at Microsoft. The genesis of this talk actually comes from an internal solution that my team worked on, where we were basically running a workload and capturing telemetry
from a system while we ran some chaos experiments against the platform. We had the typical infrastructure and application life cycle management problems. We were collecting telemetry; ultimately that was what we were really interested in, because we wanted to see what was happening during these chaos experiments. And of course, as part of the solution, we had a Grafana dashboard that helped us observe the workload while this chaos was
happening, and after the fact as well: we'd go back and do research and studies and things like that. So that's what our solution entailed. From the infrastructure standpoint, we had the usual suspects: virtual machines, databases, logs, Azure Monitor stuff. And to get this automated, we had that covered: we used Terraform, and
across the board we were able to automate our entire solution. So it was easy; we've got this, everything is fine, nothing to worry about there. However, another big part of our solution was not in Azure: it was these Grafana components, which were actually a pretty important part of our solution, because really our solution was about deploying this thing, running experiments on it, and then seeing what happened to it. So really the
zeitgeist of the solution was being able to observe this workload and see what was going on. We had a lot of dashboards and panels and queries that we were running against the telemetry we were collecting from this workload. And we had these Grafana wizards who were working within Grafana to create the dashboards that we used to look inside this workload and see what was going on,
and I wish I could say our dashboards looked like that. God knows those Grafana wizards do some magic, but in real life Grafana kind of looks like this. I'm taking some liberties here. From a Grafana standpoint, the anatomy is: you have dashboards; within dashboards you have one to many panels; and those panels reference data sources and have queries embedded within them. That's kind of the
structure of what Grafana looks like. The first time we rolled this out it was pretty easy. We automated all the things, which included our workload and a managed Grafana instance. We set the Grafana wizards off to work; they ClickOps'd things to make them beautiful, and it was all wonderful. We had these beautiful dashboards and we were able to observe what was going on when we ran our chaos experiments. But we started noticing a
pattern when we started trying to stamp this out to different environments. The first time it was fine, but as we did this more and more, some patterns emerged, mainly this one: for the Grafana dashboards, our strategy was basically copy-pasta. Our Grafana wizards ended up doing a lot of copy-pasta, which was quite unfortunate. So even though we had this wonderful automation
pipeline that provisioned our infrastructure and application and tied everything together, we had a lot of copy-pasta for all of these Grafana components. This was not ideal, and we needed a better solution. Interestingly, the solution was kind of hiding in plain sight, and that was Terraform. Terraform can automate a lot of stuff, as I'm sure you
guys are well aware. We have, of course, the hyperscalers out there, and I'm sure many of you use Terraform to automate one or more of those, but Terraform can do more than just your favorite cloud provider. Enter Grafana. Once we discovered this, it was like: oh, well, we already have this automation motion where we're provisioning our infrastructure, deploying our application, and stitching this all together;
we could use the Grafana provider to do that too, and so that's what we set out to do. So what do we have to automate? It was those dashboards, panels, and queries, but also data sources, and this is where the Grafana control plane touches the infrastructure, the Azure control plane. All of those queries and whiz-bang dashboards were ultimately querying data sources that we were already provisioning with Terraform.
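As a minimal sketch of that pairing, a single Terraform configuration can declare both providers side by side. The registry addresses below are the public ones for the two providers; version constraints are left out here and would normally be pinned.

```hcl
terraform {
  required_providers {
    # Talks to the Azure control plane: the workload infrastructure,
    # the telemetry stores, and the managed Grafana instance itself.
    azurerm = {
      source = "hashicorp/azurerm"
    }

    # Talks to the Grafana control plane: data sources and dashboards
    # that live inside the Grafana instance.
    grafana = {
      source = "grafana/grafana"
    }
  }
}
```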
So how do we fit all this together? How do we automate these things? It started with the managed Grafana instance, which we provision through the Azure control plane, through the Azure provider. On top of that we have the data sources, the queries, the panels, and all these things would be provisioned through the Grafana provider. That's what our solution is going to look like, but
how do we mash this new approach with the way that we're currently working? We have the folks focused on the infrastructure and the application, this workload, but we also have these Grafana wizards who are working in the Grafana tool to author these dashboards. We needed to find a way that didn't disrupt what they were doing. They needed to keep doing what they're doing, that is, this ClickOps motion of
dropping into Grafana, pulling in data sources, and then creating the queries and the beautiful dashboards with the information that we wanted. But once that stabilized, we would introduce a developer persona who would export that from Grafana, codify it, and then embed it within our repeatable release process using Terraform.
We also noticed that if we structured things a certain way, unless there was really heavy surgery, a lot of the updates could be handled with just query updates. We didn't really need to change the dashboard and panel structure too much; we might just change some logic within the queries themselves. So now let's look at the code. This should look pretty familiar: this is an Azure resource. This is where we started: we provisioned
that managed Grafana instance. Once that has been provisioned, we configure the Grafana provider itself, and that takes inputs from the Azure resource, the Azure Managed Grafana instance that we provisioned: the endpoint, but also some authentication details that we provision using the CLI to set up a service
account within Grafana and then a token that we can use to auth against Grafana. Once we do that, we can set up our data sources. The data sources we used were Azure Monitor and Azure Data Explorer, which is like an append-only database that's used for telemetry on Azure, but Grafana has a tremendous number of these different data source types, each with its own schema. So with this resource, the Grafana data source,
once you find out which type you want to use as a data source within your dashboards, you just need to figure out what that schema looks like and you'll be able to connect up to it. Chances are you're already provisioning that thing with Terraform on whatever cloud you're targeting. Now this is where it starts to get a little bit different, a little bit weird. If
you've worked with Terraform automating cloud infrastructure, the approaches that we're going to start talking about in the next couple of slides are going to feel a little unconventional. That's because we're not automating infrastructure anymore; we're automating the Grafana control plane, and we're automating those Grafana dashboards. So it's going to be a little bit of a learning curve.
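The three steps just described (managed instance, provider configuration, data source) could be sketched roughly as follows. This is an illustration under assumptions, not the team's actual code: the resource names, region, and variables are hypothetical, the `grafana_major_version` value should be checked against what the Azure provider currently supports, and the keys inside `json_data_encoded` are an assumed Azure Monitor schema that depends on the data source plugin.

```hcl
# Hypothetical inputs; the token is assumed to be created out of band
# (for example with the Azure CLI, as the talk describes).
variable "subscription_id" {
  type = string
}

variable "grafana_token" {
  type      = string
  sensitive = true
}

provider "azurerm" {
  features {}
  subscription_id = var.subscription_id
}

resource "azurerm_resource_group" "obs" {
  name     = "rg-observability" # illustrative name
  location = "australiaeast"    # illustrative region
}

# Step 1: the managed Grafana instance, provisioned through the Azure provider.
resource "azurerm_dashboard_grafana" "main" {
  name                  = "graf-chaos-demo" # illustrative name
  resource_group_name   = azurerm_resource_group.obs.name
  location              = azurerm_resource_group.obs.location
  grafana_major_version = 11 # assumption: verify the supported values

  identity {
    type = "SystemAssigned"
  }
}

# Step 2: the Grafana provider takes its endpoint from the Azure resource
# and authenticates with the service account token.
provider "grafana" {
  url  = azurerm_dashboard_grafana.main.endpoint
  auth = var.grafana_token
}

# Step 3: a data source inside Grafana. Each data source type has its own
# settings schema; the keys below are assumed for Azure Monitor.
resource "grafana_data_source" "azure_monitor" {
  type = "grafana-azure-monitor-datasource"
  name = "Azure Monitor"

  json_data_encoded = jsonencode({
    azureAuthType  = "msi"
    cloudName      = "azuremonitor"
    subscriptionId = var.subscription_id
  })
}
```

One design caveat: because the Grafana provider's `url` comes from a resource in the same configuration, the instance may need to exist before Terraform can plan the Grafana resources, so splitting this into two configurations or applying in stages is a common workaround.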
Let's go back to what we have to automate. We have these data sources, and we have queries, panels, and dashboards. If you think about it, the data sources are kind of this base layer that's the closest to the infrastructure. The queries hit that, the queries are referenced by panels, the panels go into dashboards, and you have this kind of layering effect, like matryoshka nesting dolls: a smaller thing
encapsulated within a bigger thing, within a bigger thing, within a bigger thing, until you work up to that epic Grafana dashboard where you have all that whiz-bang happening. Thinking about it from a layering perspective, with these smaller bits, helps you understand the anatomy of what's going on from an automation standpoint. So we're going to go on a journey where we open these dolls, going down until we hit bottom, and then we'll
build them back up until the culmination, when we actually automate this whole thing. Starting at the top, the big grandma babushka doll is the dashboard, and essentially this is just a big JSON file. This is where our developers would go in, once the Grafana wizards are done and things have stabilized, and export this, and it would be huge. You
can see here there's a spot where the panels expand. If you double-click into that, we open the next doll up. Now we're down inside a panel, and a panel is just a JSON object that goes into that collection within the dashboard's document itself. Inside here you see references to our data source, and then, I don't know if you can see it very well, but you also see the query down in there,
and that query is, let's just say, not optimal. I'm going to try to be polite here, but let's face it, it's pretty nasty. That whole query is on a single line and it's got escape characters. Not ideal. So it would help if there was a way to make that a little bit more maintainable. Of course, as I mentioned, this is actually the area
where most of our change happened from a day-two ops standpoint. Going in there and trying to make changes to that inline query is not going to be a great experience. So now we're pretty much at the smallest of dolls, the base layer. Now let's start thinking about how we can build back up from there. We looked at that query; that query is ultimately
the bottom, and if we can extract it into a file, then we don't have to embed it within that nasty JSON. We can have a nice dev experience where it's very readable and maintainable. That's our first doll that we're going to build, the first component that we're going to automate. The next layer as we move up is the panel, and we'll use the file function in
Terraform to pull the content of that query and embed it within the JSON object for the panel. Then we roll up to the dashboard. The dashboard is also a JSON document, and we'll use templatefile to parameterize it and plug in the panels, with the queries and the data sources, and now we have
our next doll. The final step is this Terraform resource, where we take this fully aggregated, structured JSON document and provision it using a Terraform resource. So it's a combination of templatefile and jsonencode that we're using, where we load the query from a text
file, insert that into the JSON for the panel, which creates one unit, then insert that into the dashboard, and then provision that with the Terraform resource. That's ultimately what we're going to provision. Looking at that, the query ends up looking much nicer: no escape characters,
multi-line. This is a lot more comfortable for somebody who wants to actually see what this thing is doing and detect whether there's a syntax error or something like that; that's almost impossible if it's nested in JSON. The next step is the panel, and that is the JSON object, which we store in a JSON file. Because we're going to be using the
templatefile function, we're going to parameterize the query and the data source that this panel is going to use, and we'll be able to compose those into it before we roll it up to the next level. So again, the query is in its own file and the panel is in its own file. The next step is where we stitch this all together and roll it up, and we'll be using jsonencode
and templatefile. Starting at the very bottom, we're going to load the contents of that query file, jsonencode it, and pass it in as the query parameter for the panel. We'll also be able to reference the actual Grafana data source that we want to use within this panel, which is also
provisioned by Terraform, probably in another .tf file. Then we'll reference the panel's JSON file in order to compose this into one JSON block, and that rolls up to the JSON for the panel, stacking these dolls. And then finally we're at the place where we can provision the dashboard.
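The roll-up from query to panel to dashboard might look roughly like this, including the final dashboard step that the talk turns to next. It is a sketch under assumptions: the file names, template placeholders, and the `datasource_uid` variable are hypothetical, and the assumed template contents are shown as comments rather than as the real exported JSON.

```hcl
# Assumed file layout (hypothetical names):
#   queries/cpu.kql                 plain multi-line query text, no escaping
#   panels/cpu.json.tftpl           panel JSON that contains, among other keys:
#                                     "datasource": { "uid": "${datasource_uid}" },
#                                     "targets": [ { "query": ${query} } ]
#   dashboards/workload.json.tftpl  dashboard JSON that contains:
#                                     "panels": [ ${panels} ]

# Hypothetical input; in practice this would typically be the uid attribute
# of a grafana_data_source resource managed elsewhere in the configuration.
variable "datasource_uid" {
  type = string
}

locals {
  # Smallest doll: the query, kept as readable text in its own file.
  cpu_query = file("${path.module}/queries/cpu.kql")

  # Next doll: the panel. jsonencode turns the query into a quoted, escaped
  # JSON string, so the template uses ${query} without surrounding quotes.
  cpu_panel = templatefile("${path.module}/panels/cpu.json.tftpl", {
    query          = jsonencode(local.cpu_query)
    datasource_uid = var.datasource_uid
  })

  # Biggest doll: the dashboard, with the rendered panels joined into
  # the comma-separated body of its "panels" array.
  dashboard_json = templatefile("${path.module}/dashboards/workload.json.tftpl", {
    panels = join(",", [local.cpu_panel])
  })
}

# Final step: hand the fully assembled JSON document to Grafana.
resource "grafana_dashboard" "workload" {
  config_json = local.dashboard_json
}
```

Adding another panel in this layout means adding one more query file, one more `templatefile` call, and one more entry in the `join` list; the dashboard template itself does not change.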
We need to pull in that panel JSON as a parameter to the dashboard JSON file. Then we just provision that to the managed Grafana instance that we deployed. So it's kind of a four-step process as you roll it up into this resource that will
carry it into the Grafana instance. So just to recap: we're loading the query into the panels, and the panels into the dashboard, and we've isolated each of those artifacts into its own file, so that if we want to make updates to any one of those components, we don't have to go searching around in some giant JSON file in
order to do that. We can drop into the query and edit the query. We can tweak the panel if we want to tweak the panel. In the dashboard, if we want to add additional panels, we can just insert them from there. This creates a much more maintainable code library in which we can maintain all these components. Now,
the Grafana wizards are still working within the authoring tool of Grafana itself, so only when a new version is ready to be extracted would we go through this process of having a developer pull those updates, whether it's the query, the panel, or the dashboard, and embed them into the code in the appropriate
location. I think there are three things that I'd like you to take away from this talk. One is: be creative. Look at the providers that are available out there in the provider registry. Think about components of your architecture beyond the cloud platform that you happen to be targeting, whether it's Datadog or whether it's
a particular database or something like that. There are a tremendous number of providers out there that you can stitch together into comprehensive solutions to streamline the automation and release process for what you're trying to achieve. Grafana is just another control plane that layers on top of Azure. I think Kubernetes is another great example, where there's another control plane that's
layered on top of a cloud platform control plane. I think it's something that we often don't think about. The next thing, number two, is to recognize that there are different ways of working. Typically, when we're automating infrastructure, we have a platform team and we have an app dev team, and we already recognize
the different ways of working of those two teams, those two personas, and what their priorities are and what they're focused on. So we already have that comfort level. But when working with a new component like Grafana or Datadog, or even policy, where you might manage that as code, you have to think about the ways of working of the people who are responsible for authoring that
aspect of your solution. With Grafana, there's no way we're going to have folks who are used to authoring in a ClickOps way with Grafana drop into VS Code and start submitting pull requests. That's just not going to happen; it's not even close to being a productive use of their time. So by finding ways to integrate the authoring workflows that those different
personas are already using with your release process, you can achieve the best of both worlds, where the Grafana ClickOps folks can work and do their thing, but you can have a repeatable way of deploying those Grafana dashboards in a consistent fashion, without those additional manual steps of friction that I described at the beginning. So
that was a key observation that we saw. Oh, my bad. And then the third thing: always be thinking about day-two operations. We could have very easily just extracted the flat JSON file and dumped that into the source code repo, and that would be it. But
that wouldn't have helped us maintain it long term, because, as we saw with the query embedded within JSON, it's a very tedious and cumbersome process to enact change and to know that the change you're introducing into the system isn't going to cause defects or problems later on. So the way that we decomposed
the components of Grafana into those individual artifacts helped us isolate the change to the specific files that we were going to modify, so that we knew: OK, if I only need to change the query, I'm just updating the query. That made the release motion that much more predictable for us. Day-two ops is something that often gets neglected. I think of it like when you go shopping for a sofa:
are you spending more time thinking about what your life is going to be like in your living room with that sofa, or is the first thing that enters your mind, how do I get this thing through the door? I think we spend too much time thinking about how we can get the infrastructure and things through the door, and not what it's going to be like to live with it once it's there. So absolutely focus on day-two ops and on optimizing for those
day-two ops. That's it. I kind of jumped to this slide a little bit early, but I hope you enjoyed this little delve into a provider mashup with the AzureRM provider and the Grafana provider. I love finding new providers and intertwining them with the AzureRM provider. That's my favorite provider, naturally. It's a lot of fun, and I think
that's one of the superpowers that Terraform has, so it's always fun when we can find providers to mash up. Anyway, this is me. Thank you so much for having me here at HashiDays in Sydney. You can find me on the socials. I have a YouTube channel called The Azure Terraformer, where I post somewhat educational content sometimes, amongst other things, and
you can hit me up on LinkedIn or X. Thank you so much. [Applause] [Music]