Automated NSX DFW validation with PowerNSX, Traceflow & NSX API

For a project I'm currently working on, we need to provide a full end-to-end validation of the NSX routing, distributed firewall rules and VXLAN functionality. As the number and complexity of the firewall rules are quite significant, I've written a script that lets you run NSX Traceflow and retrieve its data in an automated fashion. It can be found at https://bitbucket.org/srobroek/pnsx-traceflow/. (Note: this is still very preliminary, needs some feature updates and will hopefully be merged into PowerNSX at some point.)

In addition, I'd like to warn that this is not the lightest reading material. It presumes the reader has sufficient knowledge of NSX and the distributed firewall, plus some PowerShell and API experience. It doesn't contain any pictures, but it does contain some amazing code ;)

The functionality of the script is relatively simple: you call the Start-NSXTraceflow command with a variety of options. All of the options can be found in the source code or by running the cmdlet. In this example we're using the following ones:

  • SourcevNic - This is mandatory as the NSX API requires a vNic connected to a VXLAN as a source. Currently no check is performed if the vNic is connected to a VXLAN, but this should be relatively simple and can be considered a future improvement.
  • Protocol - This sets the protocol and determines which dynamic parameters, specific to that protocol, become available.
  • TrafficType - This selects layer 2 or layer 3 combined with unicast, multicast or broadcast (for example l2-unicast). Currently only unicast is guaranteed to work; multicast should work and broadcast still needs to be implemented.
  • Destination - This can either be a DestinationvNic, DestinationIP or DestinationMac, depending on the TrafficType.
  • SourcePort - The source port for tcp or udp protocols.
  • DestinationPort - The destination port for tcp or udp protocols.

As an example, the way I've used this in my lab is as follows:

Start-NSXTraceflow -SourcevNic $nic1 -Protocol tcp -TrafficType l2-unicast -destinationvNic $nic2 -SourcePort 12345 -destinationPort 443

When run, this returns an object in the following format:

TraceFlowId  
-----------
00000000-0000-0000-0000-00000e3ebc30  

This ID can in turn be passed to either Get-NSXTraceFlowResult, which returns the following object:

vnicId                : 502147c0-ad14-6de3-7abc-ef2250633df8.000  
id                    : 00000000-0000-0000-0000-00000e3ebc30  
receivedCount         : 1  
forwardedCount        : 0  
deliveredCount        : 1  
logicalReceivedCount  : 2  
logicalDroppedCount   : 0  
logicalForwardedCount : 2  
timeout               : 10000  
completeAvailable     : true  
result                : SUCCESS  
resultSummary         : Traceflow delivered observation(s) reported  
srcIp                 : 172.20.205.10  
srcMac                : 00:50:56:a1:49:79  
dstMac                : 172.20.205.11  

Or it can be passed to Get-NSXTraceFlowObservations, which returns a much more extensive object. A sample of the data is shown below:

The main object

pagingInfo                           : pagingInfo  
traceflowObservationReceived         : traceflowObservationReceived  
traceflowObservationLogicalReceived  : {traceflowObservationLogicalReceived, traceflowObservationLogicalReceived}  
traceflowObservationLogicalForwarded : {traceflowObservationLogicalForwarded, traceflowObservationLogicalForwarded}  
traceflowObservationDelivered        : traceflowObservationDelivered  

The logical forwarding overview of the traceflow

roundId         : 00000000-0000-0000-0000-00000e3ebc30  
transportNodeId : 40e3459c-d04e-4453-89ff-7ae299542555  
hostName        : esxi.int.vxsan.com  
hostId          : host-29  
component       : FW  
compDisplayName : Firewall  
hopCount        : 2  
ruleId          : 1006

roundId         : 00000000-0000-0000-0000-00000e3ebc30  
transportNodeId : 40e3459c-d04e-4453-89ff-7ae299542555  
hostName        : esxi.int.vxsan.com  
hostId          : host-29  
component       : FW  
compDisplayName : Firewall  
hopCount        : 4  
ruleId          : 1006  

As you can see above, this is on a single host in my lab, without any kind of routing. The traceflow hits the default firewall rule as it exits the source VM, and hits the default firewall rule again as it reaches the destination VM, so it's not very exciting.

Let's show the same result for two VMs on different VXLANs, with two different firewall rules scoped to the individual VMs: one deny rule and one allow rule.

First off, we start with a connection on port 443, which is allowed:

Start-NSXTraceflow -SourcevNic $nic1 -Protocol tcp -TrafficType l3-unicast -destinationvNic $nic2 -SourcePort 12345 -destinationPort 443  

Now, when we look at the traceflowObservationLogicalForwarded object in our results, we see the following:

roundId         : 00000000-0000-0000-0000-00007ef51715  
transportNodeId : 40e3459c-d04e-4453-89ff-7ae299542555  
hostName        : esxi.int.vxsan.com  
hostId          : host-29  
component       : FW  
compDisplayName : Firewall  
hopCount        : 2  
ruleId          : 1021

roundId         : 00000000-0000-0000-0000-00007ef51715  
transportNodeId : 40e3459c-d04e-4453-89ff-7ae299542555  
hostName        : esxi.int.vxsan.com  
hostId          : host-29  
component       : LS  
compDisplayName : LB  
hopCount        : 3  
vni             : 10006  
logicalCompId   : universalwire-22  
logicalCompName : LB

roundId              : 00000000-0000-0000-0000-00007ef51715  
transportNodeId      : 40e3459c-d04e-4453-89ff-7ae299542555  
hostName             : esxi.int.vxsan.com  
hostId               : host-29  
component            : LR  
compDisplayName      : udlr1  
hopCount             : 5  
vni                  : 10004  
lifName              : 27100000000c  
compId               : 10000  
compName             : default+edge-0503a72f-955c-4a96-85f6-20b1306c24fc  
srcNsxManager        : 422185e4-b4ce-aae7-c07e-2fb72e0a19bd  
srcGlobal            : true  
logicalCompId        : edge-0503a72f-955c-4a96-85f6-20b1306c24fc  
logicalCompName      : udlr1  
otherLogicalCompId   : universalwire-18  
otherLogicalCompName : NSXRouted-1-a1190f55-ba2f-4554-a474-57cc2af19d7a

roundId         : 00000000-0000-0000-0000-00007ef51715  
transportNodeId : 40e3459c-d04e-4453-89ff-7ae299542555  
hostName        : esxi.int.vxsan.com  
hostId          : host-29  
component       : FW  
compDisplayName : Firewall  
hopCount        : 8  
ruleId          : 1021  

As you can see, the results contain the exact steps taken through the NSX network. It shows the firewall rule being hit (ID 1021), the traffic being routed through the UDLR called udlr1, and on the receiving side the traffic hitting the same DFW rule again.
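Reading the hops in order gets harder once the received, forwarded and dropped observations all need to be combined. A minimal sketch of flattening them into one hop-ordered list follows. Get-TraceflowPath is a hypothetical helper, not part of the script, and the sample object only mimics the property names shown above (values copied from the forwarded observations, deliberately out of order); it assumes the object returned by Get-NSXTraceFlowObservations exposes the same properties.

```powershell
# Hypothetical helper: flatten the logical observation groups into one list
# ordered by hopCount.
function Get-TraceflowPath {
    param([Parameter(Mandatory)] $Observations)

    $groups = 'traceflowObservationLogicalReceived',
              'traceflowObservationLogicalForwarded',
              'traceflowObservationLogicalDropped'

    $hops = foreach ($group in $groups) {
        # A group may be absent, a single object or an array; @() covers all three.
        foreach ($obs in @($Observations.$group)) {
            if ($null -eq $obs) { continue }
            [pscustomobject]@{
                Hop       = [int]$obs.hopCount   # assumed to arrive as a string
                Type      = $group -replace '^traceflowObservation', ''
                Component = $obs.component
                Name      = $obs.compDisplayName
                RuleId    = $obs.ruleId
            }
        }
    }
    $hops | Sort-Object Hop
}

# Sample data mimicking the forwarded observations above, out of order.
$observations = [pscustomobject]@{
    traceflowObservationLogicalForwarded = @(
        [pscustomobject]@{ component = 'LR'; compDisplayName = 'udlr1';    hopCount = '5' }
        [pscustomobject]@{ component = 'FW'; compDisplayName = 'Firewall'; hopCount = '8'; ruleId = '1021' }
        [pscustomobject]@{ component = 'FW'; compDisplayName = 'Firewall'; hopCount = '2'; ruleId = '1021' }
        [pscustomobject]@{ component = 'LS'; compDisplayName = 'LB';       hopCount = '3' }
    )
}

$path = @(Get-TraceflowPath -Observations $observations)
$path | Format-Table -AutoSize
```

Casting hopCount to an integer matters here: sorting the raw strings would put hop 10 before hop 2 on longer paths.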

To validate that this is indeed the rule, we can use Get-NsxFirewallRule to retrieve the rule and inspect its properties. In this case we get the name:

(Get-NsxFirewallRule |? {$_.id -eq 1021}).Name
allow https for traceflow  
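When many hops need to be resolved, filtering the full rule list once per hop gets slow. A small sketch of building the lookup once is shown below. The $rules array is a stand-in for the output of Get-NsxFirewallRule and only uses the id and name properties seen above; the name given to rule 1022 is a made-up placeholder, and Resolve-RuleName is a hypothetical helper.

```powershell
# Stand-in for the output of Get-NsxFirewallRule; only id and name are used.
$rules = @(
    [pscustomobject]@{ id = '1021'; name = 'allow https for traceflow' }
    [pscustomobject]@{ id = '1022'; name = 'placeholder deny rule' }   # made-up name
)

# Build the lookup once: rule id (as a string) -> rule name.
$ruleNames = @{}
foreach ($rule in $rules) {
    $ruleNames["$($rule.id)"] = $rule.name
}

# Hypothetical helper: resolve a ruleId from an observation to a rule name.
function Resolve-RuleName {
    param($RuleId, [hashtable] $Lookup)

    $key = "$RuleId"   # observations without a ruleId yield an empty string
    if ($key -and $Lookup.ContainsKey($key)) {
        return $Lookup[$key]
    }
    return '(no rule)'
}

Resolve-RuleName 1021 $ruleNames    # allow https for traceflow
Resolve-RuleName $null $ruleNames   # (no rule)
```

Keys are stored as strings so that an id read from the observations and a number typed on the command line both resolve to the same entry.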

Now we run the same with port 80 tcp as the destination port:

Start-NSXTraceflow -SourcevNic $nic1 -Protocol tcp -TrafficType l3-unicast -destinationvNic $nic2 -SourcePort 12345 -destinationPort 80  

This time, the traceflow result shows that the traffic was not delivered:

Get-NSXTraceflowResult 00000000-0000-0000-0000-000032bc3c80


operState             : COMPLETE  
vnicId                : 5021ddf8-a7f9-7da3-66f6-f17319970ccd.000  
id                    : 00000000-0000-0000-0000-000032bc3c80  
receivedCount         : 1  
forwardedCount        : 0  
deliveredCount        : 0  
logicalReceivedCount  : 1  
logicalDroppedCount   : 1  
logicalForwardedCount : 0  
timeout               : 10000  
completeAvailable     : true  
result                : FAILURE  
resultSummary         : Traceflow dropped observation(s) reported  
srcIp                 : 172.20.205.11  
srcMac                : 00:50:56:a1:99:b2  
dstMac                : 192.168.10.50  
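Both result objects shown so far report completeAvailable as true. Assuming a result can also be requested before the trace has finished, an automated run would need to wait for that flag before trusting the counters. The sketch below shows one way to do that. Wait-TraceflowComplete is a hypothetical helper, not part of the script; the fetch logic is passed in as a script block so it can wrap Get-NSXTraceFlowResult in a real run and be stubbed here.

```powershell
# Hypothetical helper: poll until the result reports completeAvailable = true,
# or give up after a number of attempts.
function Wait-TraceflowComplete {
    param(
        # Script block returning the current result object, for example
        # { Get-NSXTraceFlowResult $id } in a real run.
        [Parameter(Mandatory)] [scriptblock] $GetResult,
        [int] $MaxAttempts = 10,
        [int] $DelayMilliseconds = 1000
    )

    for ($attempt = 1; $attempt -le $MaxAttempts; $attempt++) {
        $result = & $GetResult
        # The flag is assumed to arrive as text, so compare it as a string.
        if ("$($result.completeAvailable)" -eq 'true') {
            return $result
        }
        Start-Sleep -Milliseconds $DelayMilliseconds
    }
    throw "Traceflow result not complete after $MaxAttempts attempts."
}

# Stubbed usage: the fake fetcher reports completion on the third call.
$script:calls = 0
$stub = {
    $script:calls++
    [pscustomobject]@{
        completeAvailable = if ($script:calls -ge 3) { 'true' } else { 'false' }
        result            = 'FAILURE'
    }
}

$final = Wait-TraceflowComplete -GetResult $stub -DelayMilliseconds 10
$final.result
```

Throwing after the last attempt keeps an unfinished trace from being silently counted as a pass or a fail.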

Now, when we look at the traceflow observations we see that the object contains a new object called traceflowObservationLogicalDropped containing the following:

roundId         : 00000000-0000-0000-0000-000032bc3c80  
transportNodeId : 40e3459c-d04e-4453-89ff-7ae299542555  
hostName        : esxi.int.vxsan.com  
hostId          : host-29  
component       : FW  
compDisplayName : Firewall  
hopCount        : 2  
ruleId          : 1022  
dropReason      : FW_RULE  

This shows us why it was dropped, at what phase it was dropped (the hopcount can be used for this combined with the other Traceflow results), and the ruleId that dropped it.
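For reporting, those three pieces of information can be condensed into a single line per drop. The following is a minimal sketch: Get-TraceflowDropSummary is a hypothetical helper, and the sample object only mimics the traceflowObservationLogicalDropped output shown above.

```powershell
# Hypothetical helper: one summary line per dropped observation.
function Get-TraceflowDropSummary {
    param([Parameter(Mandatory)] $Observations)

    # @() handles a missing group, a single drop or several drops alike.
    foreach ($drop in @($Observations.traceflowObservationLogicalDropped)) {
        if ($null -eq $drop) { continue }
        'Dropped at hop {0} on {1} by rule {2} ({3})' -f `
            $drop.hopCount, $drop.hostName, $drop.ruleId, $drop.dropReason
    }
}

# Sample data copied from the dropped observation above.
$observations = [pscustomobject]@{
    traceflowObservationLogicalDropped = [pscustomobject]@{
        hostName   = 'esxi.int.vxsan.com'
        component  = 'FW'
        hopCount   = '2'
        ruleId     = '1022'
        dropReason = 'FW_RULE'
    }
}

$summary = Get-TraceflowDropSummary -Observations $observations
$summary
```

An observation set without any drops simply produces no output, so the helper can be run against every trace.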

When we look at traceflowObservationLogicalReceived, we can see that the last (and in this case, first) step in the flow was the firewall rule shown above. So now we know at which point in the process the traffic was dropped. For comparison, if we were to apply the rule only to the destination VM, we'd see the following traceflowObservationLogicalReceived:

roundId         : 00000000-0000-0000-0000-00004dc95248  
transportNodeId : 40e3459c-d04e-4453-89ff-7ae299542555  
hostName        : esxi.int.vxsan.com  
hostId          : host-29  
component       : FW  
compDisplayName : Firewall  
hopCount        : 1

roundId              : 00000000-0000-0000-0000-00004dc95248  
transportNodeId      : 40e3459c-d04e-4453-89ff-7ae299542555  
hostName             : esxi.int.vxsan.com  
hostId               : host-29  
component            : LR  
compDisplayName      : udlr1  
hopCount             : 4  
vni                  : 10006  
lifName              : 27100000000b  
compId               : 10000  
srcNsxManager        : 422185e4-b4ce-aae7-c07e-2fb72e0a19bd  
srcGlobal            : true  
compName             : default+edge-0503a72f-955c-4a96-85f6-20b1306c24fc  
logicalCompId        : edge-0503a72f-955c-4a96-85f6-20b1306c24fc  
logicalCompName      : udlr1  
otherLogicalCompId   : universalwire-22  
otherLogicalCompName : LB

roundId         : 00000000-0000-0000-0000-00004dc95248  
transportNodeId : 40e3459c-d04e-4453-89ff-7ae299542555  
hostName        : esxi.int.vxsan.com  
hostId          : host-29  
component       : LS  
compDisplayName : NSXRouted-1-a1190f55-ba2f-4554-a474-57cc2af19d7a  
hopCount        : 6  
vni             : 10004  
logicalCompId   : universalwire-18  
logicalCompName : NSXRouted-1-a1190f55-ba2f-4554-a474-57cc2af19d7a

roundId         : 00000000-0000-0000-0000-00004dc95248  
transportNodeId : 40e3459c-d04e-4453-89ff-7ae299542555  
hostName        : esxi.int.vxsan.com  
hostId          : host-29  
component       : FW  
compDisplayName : Firewall  
hopCount        : 7  

This shows that the traffic is actually allowed out of the source VM and routed through the DLR, but dropped by the same rule when it arrives at the destination VM.
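The difference between the two cases can also be detected in code. The sketch below uses a simple heuristic that only makes sense for routed (l3) traffic: if the received observations include a logical router (component LR), the packet left the source side before it was dropped. Get-DropSide is a hypothetical helper and the sample objects are trimmed-down copies of the outputs above.

```powershell
# Hypothetical helper: guess on which side of the router a packet was dropped.
function Get-DropSide {
    param([Parameter(Mandatory)] $Observations)

    # Normalise to an array and discard a missing group.
    $received = @($Observations.traceflowObservationLogicalReceived) |
        Where-Object { $_ }

    $routed = @($received | Where-Object { $_.component -eq 'LR' }).Count -gt 0
    if ($routed) { 'destination side (after routing)' }
    else         { 'source side (before routing)' }
}

# Rule applied to both VMs: only the firewall at the source is seen.
$droppedAtSource = [pscustomobject]@{
    traceflowObservationLogicalReceived = [pscustomobject]@{ component = 'FW' }
}

# Rule applied only to the destination VM: firewall, router, switch, firewall.
$droppedAtDestination = [pscustomobject]@{
    traceflowObservationLogicalReceived = @(
        [pscustomobject]@{ component = 'FW'; hopCount = '1' }
        [pscustomobject]@{ component = 'LR'; hopCount = '4' }
        [pscustomobject]@{ component = 'LS'; hopCount = '6' }
        [pscustomobject]@{ component = 'FW'; hopCount = '7' }
    )
}

Get-DropSide -Observations $droppedAtSource
Get-DropSide -Observations $droppedAtDestination
```

For unrouted traffic on a single logical switch no LR observation exists, so a different marker (such as the hop count of the drop) would be needed.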

Hopefully I've shown you the power of the NSX API and Traceflow, and of automating them. While this is specifically written in PowerShell, the API calls are language-independent and could be reused in many other places. Think vRealize Orchestrator, VMware's Houdini product for change validation, testing on-demand firewall rules as a day-2 operation from vRealize Automation, integration with your CI/CD system, or even automated security auditing and policy validation.

The next part of this blog post will cover how we actually used the NSX Traceflow API to validate a complex set of Service Composer based policies, to prove to the customer that the security policies created in NSX do what they are expected to do. Since the environment will go live straight from handover, there can be no mistakes in the firewall configuration, as any change may take weeks after the go-live date. As such, we're automating all steps to provide a report proving that our firewall rules behave as intended.
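To give an idea of what such a report could look like, here is a rough sketch of a table-driven check that compares the expected outcome of each flow with the actual one. Everything in it is an assumption for illustration: the test case layout, the $runTrace stand-in (a real run would call Start-NSXTraceflow and Get-NSXTraceFlowResult instead) and the way the verdict is derived from the result property.

```powershell
# Each test case pairs a flow with the outcome the policy should produce.
$testCases = @(
    [pscustomobject]@{ Name = 'https allowed'; DestinationPort = 443; Expected = 'Delivered' }
    [pscustomobject]@{ Name = 'http denied';   DestinationPort = 80;  Expected = 'Dropped' }
)

# Stand-in for running a traceflow and fetching its result. It returns
# objects shaped like the two results shown earlier in this post.
$runTrace = {
    param($case)
    if ($case.DestinationPort -eq 443) {
        [pscustomobject]@{ result = 'SUCCESS'; deliveredCount = '1'; logicalDroppedCount = '0' }
    }
    else {
        [pscustomobject]@{ result = 'FAILURE'; deliveredCount = '0'; logicalDroppedCount = '1' }
    }
}

# Run every case and record whether the actual outcome matches the expectation.
$report = foreach ($case in $testCases) {
    $result = & $runTrace $case
    $actual = if ($result.result -eq 'SUCCESS') { 'Delivered' } else { 'Dropped' }

    [pscustomobject]@{
        Test     = $case.Name
        Expected = $case.Expected
        Actual   = $actual
        Passed   = ($actual -eq $case.Expected)
    }
}

$report | Format-Table -AutoSize
```

Note that a deny rule that drops traffic counts as a pass here: the report verifies that the policy matches the design, not that traffic gets through.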

I hope you'll enjoy the script I've provided and that you might be able to use it in your own environment. As soon as I find the time, I'll provide part two of this series. Until then, happy PowerShelling!