Automated NSX DFW validation with PowerNSX

For a project I'm currently working on, we need to provide full end-to-end validation of NSX routing, distributed firewall rules and VXLAN functionality. As the number and complexity of the firewall rules are quite significant, I've written a script that lets you run NSX Traceflow and retrieve its data in an automated fashion. It can be found at https://bitbucket.org/srobroek/pnsx-traceflow/. (Note: this is still very preliminary, needs some feature updates and will hopefully be merged into PowerNSX at some point.)

In addition, I'd like to warn you that this is not the lightest reading material. It presumes the reader has sufficient knowledge of NSX, the distributed firewall, some PowerShell and API usage. It doesn't contain any pictures, but it does contain some amazing code ;)

The functionality of the script is relatively simple: you call the Start-NSXTraceflow command with a variety of options. All of the options can be found in the source code or by running the cmdlet. In this example we're using the following ones:

  • SourcevNic - This is mandatory, as the NSX API requires a vNic connected to a VXLAN as a source. Currently no check is performed on whether the vNic is connected to a VXLAN, but this should be relatively simple to add and can be considered a future improvement.
  • Protocol - This sets the protocol and determines which protocol-specific dynamic parameters are available.
  • TrafficType - This selects l2 or l3 and unicast, multicast or broadcast (for example l2-unicast). Currently only unicast is guaranteed to work; multicast should work and broadcast still needs to be implemented.
  • Destination - This can be a DestinationvNic, DestinationIP or DestinationMac, depending on the TrafficType.
  • SourcePort - The source port for the TCP or UDP protocols.
  • DestinationPort - The destination port for the TCP or UDP protocols.

As an example, the way I've used this in my lab is as follows:

Start-NSXTraceflow -SourcevNic $nic1 -Protocol tcp -TrafficType l2-unicast -destinationvNic $nic2 -SourcePort 12345 -destinationPort 443

When run, this returns an object in the following format:

TraceFlowId
-----------
00000000-0000-0000-0000-00000e3ebc30

This ID can in turn be passed to either Get-NSXTraceFlowResult, which returns the following object:

vnicId                : 502147c0-ad14-6de3-7abc-ef2250633df8.000
id                    : 00000000-0000-0000-0000-00000e3ebc30
receivedCount         : 1
forwardedCount        : 0
deliveredCount        : 1
logicalReceivedCount  : 2
logicalDroppedCount   : 0
logicalForwardedCount : 2
timeout               : 10000
completeAvailable     : true
result                : SUCCESS
resultSummary         : Traceflow delivered observation(s) reported
srcIp                 : 172.20.205.10
srcMac                : 00:50:56:a1:49:79
dstMac                : 172.20.205.11

Or Get-NSXTraceFlowObservations, which provides a much more extensive object; I'm showing a sample of the data below:

The main object

pagingInfo                           : pagingInfo
traceflowObservationReceived         : traceflowObservationReceived
traceflowObservationLogicalReceived  : {traceflowObservationLogicalReceived, traceflowObservationLogicalReceived}
traceflowObservationLogicalForwarded : {traceflowObservationLogicalForwarded, traceflowObservationLogicalForwarded}
traceflowObservationDelivered        : traceflowObservationDelivered

The logical forwarding overview of the traceflow

roundId         : 00000000-0000-0000-0000-00000e3ebc30
transportNodeId : 40e3459c-d04e-4453-89ff-7ae299542555
hostName        : esxi.int.vxsan.com
hostId          : host-29
component       : FW
compDisplayName : Firewall
hopCount        : 2
ruleId          : 1006

roundId         : 00000000-0000-0000-0000-00000e3ebc30
transportNodeId : 40e3459c-d04e-4453-89ff-7ae299542555
hostName        : esxi.int.vxsan.com
hostId          : host-29
component       : FW
compDisplayName : Firewall
hopCount        : 4
ruleId          : 1006

As you can see above, this is on a single host in my lab, without any kind of routing. The traceflow hits the default firewall rule as it exits the source VM, and hits the default firewall rule again as it reaches the destination VM, so it's not very exciting.
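When validating many flows, you usually only care about which firewall rules were hit. A minimal sketch of pulling those out, assuming the property names shown above; the sample object is hand-built from the output in this post and stands in for what Get-NSXTraceFlowObservations returns:

```powershell
# Sample data mirroring the two logical-forwarded observations shown above.
$observations = [pscustomobject]@{
    traceflowObservationLogicalForwarded = @(
        [pscustomobject]@{ component = 'FW'; compDisplayName = 'Firewall'; hopCount = '2'; ruleId = '1006' }
        [pscustomobject]@{ component = 'FW'; compDisplayName = 'Firewall'; hopCount = '4'; ruleId = '1006' }
    )
}

# Collect the distinct firewall rule IDs that forwarded the packet.
$ruleIds = $observations.traceflowObservationLogicalForwarded |
    Where-Object { $_.component -eq 'FW' } |
    ForEach-Object { $_.ruleId } |
    Select-Object -Unique

$ruleIds
```

Both observations hit the same rule here, so a single ID comes back.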

Let's show the same result for two VMs on different VXLANs, with two different types of firewall rules scoped to both individual VMs: one deny rule and one allow rule.

First off, we start with a connection on port 443, which is allowed:

Start-NSXTraceflow -SourcevNic $nic1 -Protocol tcp -TrafficType l3-unicast -destinationvNic $nic2 -SourcePort 12345 -destinationPort 443

Now, when we look at the traceflowObservationLogicalForwarded object in our results, we see the following:

roundId         : 00000000-0000-0000-0000-00007ef51715
transportNodeId : 40e3459c-d04e-4453-89ff-7ae299542555
hostName        : esxi.int.vxsan.com
hostId          : host-29
component       : FW
compDisplayName : Firewall
hopCount        : 2
ruleId          : 1021

roundId         : 00000000-0000-0000-0000-00007ef51715
transportNodeId : 40e3459c-d04e-4453-89ff-7ae299542555
hostName        : esxi.int.vxsan.com
hostId          : host-29
component       : LS
compDisplayName : LB
hopCount        : 3
vni             : 10006
logicalCompId   : universalwire-22
logicalCompName : LB

roundId              : 00000000-0000-0000-0000-00007ef51715
transportNodeId      : 40e3459c-d04e-4453-89ff-7ae299542555
hostName             : esxi.int.vxsan.com
hostId               : host-29
component            : LR
compDisplayName      : udlr1
hopCount             : 5
vni                  : 10004
lifName              : 27100000000c
compId               : 10000
compName             : default+edge-0503a72f-955c-4a96-85f6-20b1306c24fc
srcNsxManager        : 422185e4-b4ce-aae7-c07e-2fb72e0a19bd
srcGlobal            : true
logicalCompId        : edge-0503a72f-955c-4a96-85f6-20b1306c24fc
logicalCompName      : udlr1
otherLogicalCompId   : universalwire-18
otherLogicalCompName : NSXRouted-1-a1190f55-ba2f-4554-a474-57cc2af19d7a

roundId         : 00000000-0000-0000-0000-00007ef51715
transportNodeId : 40e3459c-d04e-4453-89ff-7ae299542555
hostName        : esxi.int.vxsan.com
hostId          : host-29
component       : FW
compDisplayName : Firewall
hopCount        : 8
ruleId          : 1021

As you can see, the results contain the exact steps taken through the NSX network. It shows the firewall rule being hit (ID 1021), the traffic being routed through the UDLR called udlr1, and on the receiving side the traffic hitting the same DFW rule again.
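To turn these observations into a readable path, you can sort them on hopCount and join the component names. This is a rough sketch using hand-built sample objects that copy the values above. It casts hopCount to an integer on the assumption that the values may come back as strings, in which case a plain sort would put hop 10 before hop 2:

```powershell
# Sample data copied from the four logical-forwarded observations above.
$forwarded = @(
    [pscustomobject]@{ component = 'FW'; compDisplayName = 'Firewall'; hopCount = '2'; ruleId = '1021' }
    [pscustomobject]@{ component = 'LS'; compDisplayName = 'LB';       hopCount = '3' }
    [pscustomobject]@{ component = 'LR'; compDisplayName = 'udlr1';    hopCount = '5' }
    [pscustomobject]@{ component = 'FW'; compDisplayName = 'Firewall'; hopCount = '8'; ruleId = '1021' }
)

# Sort numerically on hopCount and label firewall hops with their rule ID.
$path = $forwarded |
    Sort-Object { [int]$_.hopCount } |
    ForEach-Object {
        if ($_.component -eq 'FW') { '{0} (rule {1})' -f $_.compDisplayName, $_.ruleId }
        else { $_.compDisplayName }
    }

$pathText = $path -join ' -> '
$pathText
```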

To validate that this is indeed the rule, we can use Get-NsxFirewallRule to retrieve the rule and read its various properties. In this case we get the name:

(Get-NsxFirewallRule |? {$_.id -eq 1021}).Name
allow https for traceflow
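If you need to resolve many rule IDs, filtering the full rule list for every ID gets slow. One option is to index the rules by ID once. The sketch below uses a stand-in array instead of real Get-NsxFirewallRule output and only assumes the id and name properties used above; Get-RuleName is a hypothetical helper, not part of PowerNSX or the script:

```powershell
# Stand-in for the output of Get-NsxFirewallRule; only id and name are used.
$rules = @(
    [pscustomobject]@{ id = '1021'; name = 'allow https for traceflow' }
)

# Index the rules by ID once, so each lookup is a single hashtable access.
$ruleIndex = @{}
foreach ($rule in $rules) { $ruleIndex[[string]$rule.id] = $rule.name }

# Hypothetical helper: resolve a rule ID seen in a traceflow observation.
function Get-RuleName {
    param([string] $RuleId, [hashtable] $Index)
    if ($Index.ContainsKey($RuleId)) { $Index[$RuleId] } else { "(unknown rule $RuleId)" }
}

Get-RuleName -RuleId 1021 -Index $ruleIndex
```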

Now we run the same with TCP port 80 as the destination port:

Start-NSXTraceflow -SourcevNic $nic1 -Protocol tcp -TrafficType l3-unicast -destinationvNic $nic2 -SourcePort 12345 -destinationPort 80

This time, the traceflow results show that the traffic was not delivered:

Get-NSXTraceflowResult 00000000-0000-0000-0000-000032bc3c80


operState             : COMPLETE
vnicId                : 5021ddf8-a7f9-7da3-66f6-f17319970ccd.000
id                    : 00000000-0000-0000-0000-000032bc3c80
receivedCount         : 1
forwardedCount        : 0
deliveredCount        : 0
logicalReceivedCount  : 1
logicalDroppedCount   : 1
logicalForwardedCount : 0
timeout               : 10000
completeAvailable     : true
result                : FAILURE
resultSummary         : Traceflow dropped observation(s) reported
srcIp                 : 172.20.205.11
srcMac                : 00:50:56:a1:99:b2
dstMac                : 192.168.10.50
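For automated validation, the counters in this summary are usually enough to decide whether a flow behaved as expected. Here is a minimal sketch, assuming the property names shown above. Test-TraceflowOutcome is a hypothetical helper (not part of the script or PowerNSX), and the two sample objects are hand-built from the outputs in this post:

```powershell
# Hypothetical helper: compares a traceflow summary against the expected outcome.
function Test-TraceflowOutcome {
    param(
        [Parameter(Mandatory)] $Result,
        [Parameter(Mandatory)] [ValidateSet('Delivered', 'Dropped')] [string] $Expected
    )
    # The counters may come back as strings, so cast before comparing.
    $delivered = [int]$Result.deliveredCount -gt 0
    $dropped   = [int]$Result.logicalDroppedCount -gt 0
    if ($Expected -eq 'Delivered') { return ($delivered -and -not $dropped) }
    return ($dropped -and -not $delivered)
}

# Sample objects mirroring the port 443 and port 80 summaries in this post.
$allowed = [pscustomobject]@{ deliveredCount = '1'; logicalDroppedCount = '0'; result = 'SUCCESS' }
$denied  = [pscustomobject]@{ deliveredCount = '0'; logicalDroppedCount = '1'; result = 'FAILURE' }

Test-TraceflowOutcome -Result $allowed -Expected Delivered
Test-TraceflowOutcome -Result $denied  -Expected Dropped
```

Checking the counters rather than only the result string means a flow that is both delivered and dropped somewhere would fail either expectation, which is probably what you want in a validation run.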

Now, when we look at the traceflow observations we see that the object contains a new object called traceflowObservationLogicalDropped containing the following:

roundId         : 00000000-0000-0000-0000-000032bc3c80
transportNodeId : 40e3459c-d04e-4453-89ff-7ae299542555
hostName        : esxi.int.vxsan.com
hostId          : host-29
component       : FW
compDisplayName : Firewall
hopCount        : 2
ruleId          : 1022
dropReason      : FW_RULE

This shows us why it was dropped, at which stage it was dropped (the hopCount can be used for this, combined with the other traceflow results), and the ruleId that dropped it.
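Since traceflowObservationLogicalDropped only shows up when something was dropped, a report needs to handle both cases. The following is a sketch under that assumption; Get-DropSummary is a hypothetical helper and the sample objects are hand-built from the output above:

```powershell
# Hypothetical helper: returns one line per drop, or nothing if no drop was reported.
function Get-DropSummary {
    param([Parameter(Mandatory)] $Observations)
    $prop = $Observations.PSObject.Properties['traceflowObservationLogicalDropped']
    if (-not $prop) { return }
    foreach ($drop in @($prop.Value)) {
        'Dropped at hop {0} on {1} by rule {2} ({3})' -f $drop.hopCount, $drop.hostName, $drop.ruleId, $drop.dropReason
    }
}

# Sample mirroring the dropped observation shown above.
$withDrop = [pscustomobject]@{
    traceflowObservationLogicalDropped = [pscustomobject]@{
        hostName = 'esxi.int.vxsan.com'; component = 'FW'; hopCount = '2'; ruleId = '1022'; dropReason = 'FW_RULE'
    }
}
# Sample of a delivered flow, where the dropped property is absent.
$withoutDrop = [pscustomobject]@{ traceflowObservationDelivered = 'traceflowObservationDelivered' }

Get-DropSummary -Observations $withDrop
Get-DropSummary -Observations $withoutDrop
```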

When we look at traceflowObservationLogicalReceived, we can see that the last (and in this case, first) step in the flow was the firewall rule shown above. So now we know when in the process it was dropped. For comparison, if we were to apply the rule only to the destination VM, we'd see the following traceflowObservationLogicalReceived:

roundId         : 00000000-0000-0000-0000-00004dc95248
transportNodeId : 40e3459c-d04e-4453-89ff-7ae299542555
hostName        : esxi.int.vxsan.com
hostId          : host-29
component       : FW
compDisplayName : Firewall
hopCount        : 1

roundId              : 00000000-0000-0000-0000-00004dc95248
transportNodeId      : 40e3459c-d04e-4453-89ff-7ae299542555
hostName             : esxi.int.vxsan.com
hostId               : host-29
component            : LR
compDisplayName      : udlr1
hopCount             : 4
vni                  : 10006
lifName              : 27100000000b
compId               : 10000
srcNsxManager        : 422185e4-b4ce-aae7-c07e-2fb72e0a19bd
srcGlobal            : true
compName             : default+edge-0503a72f-955c-4a96-85f6-20b1306c24fc
logicalCompId        : edge-0503a72f-955c-4a96-85f6-20b1306c24fc
logicalCompName      : udlr1
otherLogicalCompId   : universalwire-22
otherLogicalCompName : LB

roundId         : 00000000-0000-0000-0000-00004dc95248
transportNodeId : 40e3459c-d04e-4453-89ff-7ae299542555
hostName        : esxi.int.vxsan.com
hostId          : host-29
component       : LS
compDisplayName : NSXRouted-1-a1190f55-ba2f-4554-a474-57cc2af19d7a
hopCount        : 6
vni             : 10004
logicalCompId   : universalwire-18
logicalCompName : NSXRouted-1-a1190f55-ba2f-4554-a474-57cc2af19d7a

roundId         : 00000000-0000-0000-0000-00004dc95248
transportNodeId : 40e3459c-d04e-4453-89ff-7ae299542555
hostName        : esxi.int.vxsan.com
hostId          : host-29
component       : FW
compDisplayName : Firewall
hopCount        : 7

This shows that the traffic is actually allowed out of the source VM and routed through the DLR, but dropped by the same rule when it arrives at the destination VM.

Hopefully I've shown you the power of the NSX API, traceflow and automating this. While this is specifically written in PowerShell, the API calls are language-independent and could be reused for various purposes. Think vRealize Orchestrator, VMware's Houdini product for change validation, testing on-demand firewall rules as a day-2 operation from vRealize Automation, integration with your CI/CD system, or even automated security auditing and policy validation.

The next part of this blog post will be about the actual process of how we used the NSX traceflow API to validate a complex set of Service Composer based policies, to prove to the customer that the security policies created in NSX do what they are expected to do. Since the environment will go live straight from handover, there can be no mistakes in the firewall configuration, as any change may take weeks after the go-live date. As such, we're automating all steps to provide a report proving that our firewall rules behave as intended.
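To give a rough idea of what such a report could look like, here is a table-driven sketch. It is not the process we actually used. The $runTraceflow script block is a stand-in that fakes the lab outcomes from this post so the example runs without NSX; in a live session it would call Start-NSXTraceflow and Get-NSXTraceFlowResult and return the result property:

```powershell
# Test cases taken from the lab example: 443 should pass, 80 should be dropped.
$testCases = @(
    [pscustomobject]@{ DestinationPort = 443; Expected = 'SUCCESS' }
    [pscustomobject]@{ DestinationPort = 80;  Expected = 'FAILURE' }
)

# Stand-in for a real traceflow run (assumption: replace with real cmdlet calls).
$runTraceflow = {
    param($Port)
    if ($Port -eq 443) { 'SUCCESS' } else { 'FAILURE' }
}

# Run every case and record expected versus actual outcome.
$report = foreach ($case in $testCases) {
    $actual = & $runTraceflow $case.DestinationPort
    [pscustomobject]@{
        DestinationPort = $case.DestinationPort
        Expected        = $case.Expected
        Actual          = $actual
        Passed          = ($actual -eq $case.Expected)
    }
}

$report | Format-Table -AutoSize
```

Note that a deny rule doing its job shows up as Passed here, because the expected result for that case is FAILURE.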

I hope you'll enjoy the script I've provided, and that you might be able to use it in your own environment. As soon as I find the time I'll provide part two of this series. Until then, happy PowerShelling!